<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#pandas" data-toc-modified-id="pandas-1">pandas</a></span><ul class="toc-item"><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-1.1">Loading data</a></span><ul class="toc-item"><li><span><a href="#Text" data-toc-modified-id="Text-1.1.1">Text</a></span></li><li><span><a href="#Excel" data-toc-modified-id="Excel-1.1.2">Excel</a></span></li><li><span><a href="#The-internet" data-toc-modified-id="The-internet-1.1.3">The internet</a></span></li></ul></li><li><span><a href="#DataFrames" data-toc-modified-id="DataFrames-1.2">DataFrames</a></span></li></ul></li></ul></div>

In [1]:
import os

def cleanup():
    for filename in []:
        try:
            os.remove(filename)
        except FileNotFoundError:
            pass

os.chdir('examples')
cleanup()

# Data analysis

Our next special topic is how to analyze simple data sets in Python.

## pandas

As you may have noticed by now, the first step for almost any specialist task in Python is to [import](extras/glossary.ipynb#import) an additional [package](extras/glossary.ipynb#package) that provides some extra functionality. The most popular package for loading and working with tables of data is called `pandas`. We met it briefly already in the lesson on files, when we learned about [delimited text data files](files.ipynb#delimited-text).

### Loading data

#### Text

Recall that a delimited text file stores a table of data as plain text, with one particular character (most often the comma) reserved as a [separator](extras/glossary.ipynb#separator) to mark the boundaries between the columns of the table. Here is an example data file in which the separator is a comma (a *csv* file):

In [2]:
import os

filepath = os.path.join('data', 'penguins.csv')

# Print the header line and the first five rows of data
with open(filepath) as f:
    for _ in range(6):
        print(f.readline(), end='')

Bird,HeartRate,Depth,Duration
EP19,88.8,5,1.05
EP19,103.4,9,1.1833333
EP19,97.4,22,1.9166667
EP19,85.3,25.5,3.4666667
EP19,60.6,30.5,7.0833333


This file stores a table of data on the [heart rates of penguins during fishing dives](https://doi.org/10.1242/jeb.013235). Each row represents one dive, with the following columns:

* *Bird*: An ID code for the penguin making the dive.
* *HeartRate*: The penguin's heart rate during the dive, in beats per minute (bpm).
* *Depth*: The depth of the dive (in meters).
* *Duration*: The duration of the dive (in minutes).

Let's import `pandas` and use its `read_csv()` function to read this file:

In [3]:
import pandas

penguins = pandas.read_csv(filepath)

print(penguins)

       Bird  HeartRate  Depth   Duration
0      EP19       88.8    5.0   1.050000
1      EP19      103.4    9.0   1.183333
2      EP19       97.4   22.0   1.916667
3      EP19       85.3   25.5   3.466667
4      EP19       60.6   30.5   7.083333
..      ...        ...    ...        ...
120  EP3901       48.4  170.0  11.533333
121  EP3901       50.8   37.0   8.216667
122  EP3901       49.6  160.0  11.300000
123  EP3901       56.4  180.0  10.283333
124  EP3901       55.2  170.0  10.366667

[125 rows x 4 columns]
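
Not every delimited text file uses a comma. As a minimal sketch, assuming a file whose separator is a semicolon, the `sep` argument of `read_csv()` tells `pandas` which character marks the column boundaries. The text below is made up for illustration by rewriting the first two dives with semicolons, and it is read from a string in memory (via `io.StringIO`) rather than from a real file:

```python
import io

import pandas

# The first two dives from the penguin file, rewritten with semicolons
# as the separator (illustrative text, not a real file).
semicolon_text = """Bird;HeartRate;Depth;Duration
EP19;88.8;5;1.05
EP19;103.4;9;1.1833333
"""

# sep=';' tells read_csv() which character separates the columns.
dives = pandas.read_csv(io.StringIO(semicolon_text), sep=';')

print(dives)
```

For a real file you would pass the file path as the first argument, just as in the example above.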


#### Excel

Besides `read_csv()`, `pandas` provides functions for reading data from many other sources. For example, you may sometimes need to read from a Microsoft Excel spreadsheet file. The `read_excel()` function does this:

In [4]:
penguins_excel = pandas.read_excel(os.path.join('data', 'penguins.xlsx'))

`read_excel()` can take additional [arguments](extras/glossary.ipynb#argument) that specify which sheet of the Excel file to read, which rows or columns to skip, and so on. These are useful if your spreadsheet is a little messy. As always with new functions, it is a good idea to seek out and read the [online documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) to see what the possibilities are and whether there are any important defaults.
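
Here is a rough sketch of what this might look like. It makes several assumptions: the 'messy' spreadsheet is invented for illustration (two lines of notes above the real header row), the sheet name `'dives'` is hypothetical, and the file is written to a temporary folder first so that the example is self-contained. Reading and writing *.xlsx* files also assumes that the optional `openpyxl` package is installed alongside `pandas`.

```python
import os
import tempfile

import pandas

# A made-up "messy" sheet: two lines of notes sit above the real header row.
messy_rows = [
    ['Penguin dive data', None, None],
    ['Heart rate in bpm, depth in meters', None, None],
    ['Bird', 'HeartRate', 'Depth'],
    ['EP19', 88.8, 5.0],
    ['EP19', 103.4, 9.0],
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'messy_penguins.xlsx')

    # Write the rows to a sheet named 'dives' (a hypothetical name).
    pandas.DataFrame(messy_rows).to_excel(
        path, sheet_name='dives', header=False, index=False
    )

    # Read it back: pick the sheet by name, skip the two note rows,
    # and keep only Excel columns A and B.
    tidy = pandas.read_excel(path, sheet_name='dives', skiprows=2, usecols='A:B')

print(tidy)
```

The `sheet_name`, `skiprows` and `usecols` arguments used here are only a few of those available. The documentation linked above lists the rest.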

#### The internet

Some `pandas` [IO](extras/glossary.ipynb#io) functions can even read from a file stored online. In this case, the first [argument](extras/glossary.ipynb#argument) is the [URL](extras/glossary.ipynb#url) of the file instead of a path on your computer:

In [5]:
url = 'https://raw.githubusercontent.com/luketudge/introduction-to-programming/master/content/examples/data/penguins.csv'

penguins_from_the_internet = pandas.read_csv(url)

### DataFrames

What [type](extras/glossary.ipynb#type) of object do the `pandas` data-reading functions give us?

In [6]:
type(penguins)

pandas.core.frame.DataFrame

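They give us a `DataFrame`, or 'data frame'. This is the `pandas` type for a table of data, with named columns and labeled rows.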
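
As a small sketch of what a data frame offers, the example below builds one by hand from the first five dives shown earlier (so that it does not depend on the data file) and inspects it using a few standard `DataFrame` attributes and methods. The variable name `sample` is just an illustrative choice:

```python
import pandas

# The first five dives shown earlier, typed in by hand
# so that this sketch is self-contained.
sample = pandas.DataFrame({
    'Bird': ['EP19'] * 5,
    'HeartRate': [88.8, 103.4, 97.4, 85.3, 60.6],
    'Depth': [5.0, 9.0, 22.0, 25.5, 30.5],
    'Duration': [1.05, 1.1833333, 1.9166667, 3.4666667, 7.0833333],
})

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the column names
print(sample.dtypes)         # the type of the values in each column
print(sample.head(3))        # the first three rows
```

The same attributes and methods work on the `penguins` data frame that we loaded from the file.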

In [7]:
cleanup()