
In [1]:
import os

def cleanup():
    for filename in []:
        try:
            os.remove(filename)
        except FileNotFoundError:
            pass

os.chdir('examples')
cleanup()

# Data analysis

Our next special topic is how to analyze simple data sets in Python.

## pandas

As you may have noticed by now, the first step for almost any specialist task in Python is to [import](extras/glossary.ipynb#import) an additional [package](extras/glossary.ipynb#package) that provides some extra functionality. The most popular package for loading and working with tables of data is called `pandas`. We met it briefly already in the lesson on files, when we learned about [delimited text data files](files.ipynb#delimited-text).

### Loading data

#### Text

Recall that a delimited text file stores a table of data as plain text, with one particular character (most often the comma) reserved as a [separator](extras/glossary.ipynb#separator) to mark the boundaries between the columns of the table. Here is an example data file in which the separator is a comma (a *csv* file):

In [2]:
import os

filepath = os.path.join('data', 'penguins.csv')

with open(filepath) as f:
    for linenum in range(6):
        print(f.readline(), end='')

Bird,HeartRate,Depth,Duration
EP19,88.8,5,1.05
EP19,103.4,9,1.1833333
EP19,97.4,22,1.9166667
EP19,85.3,25.5,3.4666667
EP19,60.6,30.5,7.0833333


This file stores a table of data on the [heart rates of penguins during fishing dives](https://doi.org/10.1242/jeb.013235). Each row represents one dive, with the following columns:

* *Bird*: An ID code for the penguin making the dive.
* *HeartRate*: The penguin's heart rate during the dive, in beats per minute (bpm).
* *Depth*: The depth of the dive (in meters).
* *Duration*: The duration of the dive (in minutes).

Let's import `pandas` and use its `read_csv()` function to read this file:

In [3]:
import pandas

penguins = pandas.read_csv(filepath)

print(penguins)

       Bird  HeartRate  Depth   Duration
0      EP19       88.8    5.0   1.050000
1      EP19      103.4    9.0   1.183333
2      EP19       97.4   22.0   1.916667
3      EP19       85.3   25.5   3.466667
4      EP19       60.6   30.5   7.083333
..      ...        ...    ...        ...
120  EP3901       48.4  170.0  11.533333
121  EP3901       50.8   37.0   8.216667
122  EP3901       49.6  160.0  11.300000
123  EP3901       56.4  180.0  10.283333
124  EP3901       55.2  170.0  10.366667

[125 rows x 4 columns]
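
Not every delimited text file uses a comma. As a minimal sketch, assuming a file that uses a semicolon instead, we can tell `read_csv()` about the separator with its `sep` argument. The data here are invented for the example, and `io.StringIO` stands in for a real file so that the example is self-contained:

```python
import io

import pandas

# Hypothetical semicolon-separated data, made up for this example.
text = 'Bird;HeartRate;Depth\nEP19;88.8;5\nEP19;103.4;9\n'

# io.StringIO lets us treat a string as if it were an open file.
dives = pandas.read_csv(io.StringIO(text), sep=';')

print(dives)
print(dives.shape)
```

Without `sep=';'`, `pandas` would look for commas, find none, and read each line as a single column.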


#### Excel

As well as `read_csv()`, `pandas` provides many functions for reading data from different sources. You may sometimes need to read from a Microsoft Excel spreadsheet file. There is a function for this:

In [4]:
penguins_excel = pandas.read_excel(os.path.join('data', 'penguins.xlsx'))

`read_excel()` can take additional [arguments](extras/glossary.ipynb#argument) for specifying which sheet of the Excel file to read, which rows or columns to skip, etc. in case your spreadsheet is a little messy. As always with new functions, it is a good idea to seek out and read the [online documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) to see what the possibilities are and whether there are any important defaults.

#### The internet

Some `pandas` [IO](extras/glossary.ipynb#io) functions can even read from a file stored online. In this case the first [argument](extras/glossary.ipynb#argument) is the [URL](extras/glossary.ipynb#url) of the file instead of a local file path:

In [5]:
url = 'https://raw.githubusercontent.com/luketudge/introduction-to-programming/master/content/examples/data/penguins.csv'

penguins_from_the_internet = pandas.read_csv(url)

### DataFrames

What [type](extras/glossary.ipynb#type) of object do `pandas`' read functions give us?

In [6]:
type(penguins)

pandas.core.frame.DataFrame

They give us a 'data frame'. A [data frame](extras/glossary.ipynb#dataframe) is very similar to a [matrix](extras/glossary.ipynb#matrix). Like a matrix, a data frame stores values in a 'grid' of rows and columns. But whereas the structure of a matrix may represent all sorts of things, the roles of rows and columns are slightly more specific in a data frame.

Each *row* of a data frame represents one 'observation'. An observation is some coherent entity or event for which information was recorded. Each *column* of a data frame represents one measurement or piece of information that was recorded for each of the observations. For example, the observations in a data frame might be people, for whom some physical or demographic measurements were made, or they might be events like purchases in a store, for each of which a customer ID, product, price, etc. were recorded. In the penguins data frame, the observations are dives. (Note that the observations are not penguins, as it might be tempting to think; depth and duration are clearly attributes of individual dives, not of penguins).

Each row of a data frame may contain values of heterogeneous [types](extras/glossary.ipynb#type), since not every piece of information recorded about each observation is necessarily of the same kind. For example, the penguins data frame records a [string](extras/glossary.ipynb#string) ID for the bird making the dive, but the other values are [floats](extras/glossary.ipynb#float). The values in a single column *are* necessarily of the same type, since we usually want to compare them with each other. For example, we would not be able to compare the dive depths of different penguins if some depths were recorded as floats, and others as strings such as `'quite deep'` or `'really deep'`.
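
We can check the type of each column with the `dtypes` attribute of a data frame. The sketch below uses two small invented data frames rather than the penguins data. It also shows what tends to happen when a column mixes numbers and strings: `pandas` can no longer store the column as floats and falls back to the generic `object` type, which makes numerical comparisons unreliable.

```python
import pandas

# A small made-up table: each row mixes a string with a number,
# but each column holds values of one kind.
dives = pandas.DataFrame({
    'Bird': ['EP19', 'EP19', 'EP3901'],
    'Depth': [5.0, 9.0, 170.0],
})
print(dives.dtypes)

# If even one depth is recorded as text, the column is stored
# with the generic 'object' dtype instead of as floats.
messy = pandas.DataFrame({'Depth': [5.0, 'quite deep', 170.0]})
print(messy.dtypes)
```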

#### Indexing

##### Columns

We can get individual columns from a `pandas.DataFrame` using the same square bracket [indexing](extras/glossary.ipynb#index) as for other Python types. To get a single column, the index is just the name of the column, as a [string](extras/glossary.ipynb#string):

In [12]:
print(penguins['Depth'])

0        5.0
1        9.0
2       22.0
3       25.5
4       30.5
       ...  
120    170.0
121     37.0
122    160.0
123    180.0
124    170.0
Name: Depth, Length: 125, dtype: float64


As we can see in the output above, individual columns have a `dtype` [attribute](extras/glossary.ipynb#attribute) just as `numpy` [arrays](arrays.ipynb#data-types) do. We can fetch it in the same way:

In [9]:
penguins['Depth'].dtype

dtype('float64')

We can get multiple columns using a [list](extras/glossary.ipynb#list) of column names. Note the double square brackets. The outer brackets are for indexing, and the inner brackets indicate that the index is a list:

In [11]:
print(penguins[['Depth', 'Duration']])

     Depth   Duration
0      5.0   1.050000
1      9.0   1.183333
2     22.0   1.916667
3     25.5   3.466667
4     30.5   7.083333
..     ...        ...
120  170.0  11.533333
121   37.0   8.216667
122  160.0  11.300000
123  180.0  10.283333
124  170.0  10.366667

[125 rows x 2 columns]
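
The two kinds of column index also give back different types of object. As a small sketch with an invented data frame: a string index gives a single column (a `pandas.Series`), whereas a list index gives a data frame, even if the list contains only one name.

```python
import pandas

# A small made-up table for illustration.
dives = pandas.DataFrame({
    'Depth': [5.0, 9.0, 22.0],
    'Duration': [1.05, 1.18, 1.92],
})

one_column = dives['Depth']          # string index
also_one_column = dives[['Depth']]   # list index containing one name

print(type(one_column))
print(type(also_one_column))
```

This matters if later code expects one type or the other, so it is worth checking with `type()` when in doubt.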


If we ever need to check what columns a `pandas.DataFrame` contains, the `columns` attribute can tell us:

In [13]:
penguins.columns

Index(['Bird', 'HeartRate', 'Depth', 'Duration'], dtype='object')
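
The `columns` attribute can also be used to test whether a particular column is present before we try to use it. A minimal sketch, with an invented one-row data frame and a hypothetical `'Weight'` column that does not exist:

```python
import pandas

# A small made-up table for illustration.
dives = pandas.DataFrame({'Bird': ['EP19'], 'Depth': [5.0]})

# The 'in' operator checks the column names.
print('Depth' in dives.columns)
print('Weight' in dives.columns)

# Convert to an ordinary list if we just want the names.
column_names = list(dives.columns)
print(column_names)
```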

##### Rows

Whereas indexing columns from a `pandas.DataFrame` is relatively straightforward, indexing rows presents some frustrating subtleties that can often lead to serious mistakes. Let's take a careful look.

First of all, we can use a [slice](extras/glossary.ipynb#slice) index to get a range of rows, just as we would to get a range of entries from a [list](extras/glossary.ipynb#list):

In [22]:
print(penguins[5:10])

   Bird  HeartRate  Depth   Duration
5  EP19       77.6   32.5   4.766667
6  EP19       44.3   38.0   9.133333
7  EP19       32.8   32.0  11.000000
8  EP19       94.2    6.0   1.316667
9  EP19       99.8   10.5   1.483333


However, not all is as it seems here. The first frustration is that we *must* use a slice to index rows like this. We cannot get a single row using a single index (i.e. with no colon character `:`), as we would for a list. And to add insult to injury, the error message that we get if we try to do this is a horrible mess:

In [24]:
penguins[5]

KeyError: 5

We can of course patch up this limitation by using a slice that contains only one row. For example:

In [25]:
print(penguins[5:6])

   Bird  HeartRate  Depth  Duration
5  EP19       77.6   32.5  4.766667


But this isn't ideal.

##### pandas madness

The astute question to ask here is of course: Why? Why can't we get a single row of a `pandas.DataFrame` just as easily as getting a single entry from a list?

The reason (though whether you consider it a *good* reason is for you to decide) is that `pandas` tries to second-guess what the user wants when they index a data frame. If the index is a [slice](extras/glossary.ipynb#slice), then `pandas` assumes we want a range of rows, but if the index is anything else, such as a single value or a list of names, then `pandas` assumes we want a column or subset of columns.

So when we asked for `penguins[5]` above, `pandas` actually tried to get a column whose name is the [integer](extras/glossary.ipynb#integer) `5`. And of course there is no such column in our data frame.
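
The following sketch puts the three cases side by side, using a small invented data frame in place of the penguins data. The `try` and `except` are only there so that the example runs to the end and shows the error instead of stopping.

```python
import pandas

# A small made-up table for illustration.
dives = pandas.DataFrame({
    'Depth': [5.0, 9.0, 22.0, 25.5],
    'Duration': [1.05, 1.18, 1.92, 3.47],
})

# A slice is interpreted as a range of rows.
print(dives[1:3])

# A string is interpreted as a column name.
print(dives['Depth'])

# An integer is also looked up as a column name, and there is
# no column called 1, so we get a KeyError.
failed = False
try:
    dives[1]
except KeyError as error:
    failed = True
    print('KeyError:', error)
```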

###### iloc

If we want to get rows, or indeed columns, by position rather than by label, it is best not to let `pandas` second-guess what we are trying to do. The `iloc` [attribute](extras/glossary.ipynb#attribute) of a `pandas.DataFrame` can be indexed, and the indices will always be interpreted as positions, just as for a list.

So the safer way to get row number `5` is:

In [39]:
penguins.iloc[5]

Bird            EP19
HeartRate       77.6
Depth           32.5
Duration     4.76667
Name: 5, dtype: object

The `iloc` attribute also allows *row, column* indexing, just as for a `numpy` [array](extras/glossary.ipynb#array). The first index is the row number, and the second index is the column number. So for example to get the value in the final column of row `5`:

In [40]:
penguins.iloc[5, -1]

4.7666667
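
Since `iloc` indices behave like list indices, negative indices and slices work too. Here is a short sketch with an invented data frame; the variable names are just for illustration:

```python
import pandas

# A small made-up table for illustration.
dives = pandas.DataFrame({
    'Bird': ['EP19', 'EP19', 'EP3901', 'EP3901'],
    'Depth': [5.0, 9.0, 170.0, 37.0],
})

last_row = dives.iloc[-1]           # the final row
first_two = dives.iloc[0:2]         # rows 0 and 1 (the end is excluded)
two_depths = dives.iloc[0:2, 1]     # column number 1 of rows 0 and 1

print(last_row)
print(first_two)
print(two_depths)
```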

But it isn't usually useful to get columns by position. Columns have meaningful names, so we should use those to be clear.

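###### loc

The `loc` attribute works like `iloc`, except that its indices are interpreted as labels rather than positions. Our data frame has the default row labels, which are simply the row numbers, so we can combine a row number with a column name: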

In [42]:
penguins.loc[5, 'Duration']

4.7666667
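
The difference between labels and positions becomes visible once the row labels no longer match the row positions, for example after taking a subset of rows. The sketch below uses a small invented data frame. It also shows one more quirk that is easy to trip over: a slice with `loc` includes its end point, whereas a slice with `iloc` does not.

```python
import pandas

# A small made-up table for illustration.
dives = pandas.DataFrame({
    'Bird': ['EP19', 'EP19', 'EP3901', 'EP3901'],
    'Depth': [5.0, 9.0, 170.0, 37.0],
})

# This subset keeps the original row labels, 2 and 3.
deep = dives[2:4]
print(deep)

by_label = deep.loc[2, 'Depth']     # the row labelled 2
by_position = deep.iloc[0, 1]       # the first row, second column

print(by_label, by_position)

# Label slices include the end point; position slices do not.
print(len(dives.loc[0:2]))
print(len(dives.iloc[0:2]))
```

Here `deep.loc[2, 'Depth']` and `deep.iloc[0, 1]` refer to the same value, because the row labelled `2` is the first row of the subset.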

If you enter a question or complaint about `pandas` indexing into an internet search engine, you will discover a rich literature of lamentations. Row indexing is among a handful of irritating but not fatal quirks in the design of the `pandas` package. It probably could have been thought through a little more carefully, but `pandas` is otherwise so useful and has become so widely used that it would be difficult for the developers to change it now and risk breaking all the great data analysis programs that have been written with `pandas` so far. So we are stuck with it.

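#### Boolean indexing

We can also use a column of boolean values as an index. `pandas` then gives us only those rows for which the value is `True`. For example, to get only the dives that were deeper than 150 meters: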

In [36]:
print(penguins[penguins['Depth']>150])

         Bird  HeartRate  Depth   Duration
32   EP432001       47.8  170.0  10.033333
33   EP432001       44.9  160.0   9.983333
46   EP432001       48.6  160.0   7.466667
47   EP432001       43.8  160.0   8.000000
113    EP3901       77.5  225.0   7.466667
114    EP3901       71.6  225.0   8.616667
117    EP3901       50.6  175.0  10.783333
119    EP3901       42.1  165.0  13.533333
120    EP3901       48.4  170.0  11.533333
122    EP3901       49.6  160.0  11.300000
123    EP3901       56.4  180.0  10.283333
124    EP3901       55.2  170.0  10.366667
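
To see what is going on in an index like this, it can help to build the boolean column separately first. The sketch below does so with a small invented data frame, and then combines two conditions with the `&` operator. Each condition needs its own parentheses, because `&` otherwise takes precedence over the comparisons.

```python
import pandas

# A small made-up table for illustration.
dives = pandas.DataFrame({
    'Bird': ['EP19', 'EP19', 'EP3901', 'EP3901'],
    'Depth': [5.0, 9.0, 170.0, 37.0],
})

# The comparison gives one boolean value per row.
is_deep = dives['Depth'] > 150
print(is_deep)

# Indexing with it keeps only the rows that are True.
deep_dives = dives[is_deep]
print(deep_dives)

# Two conditions combined with & (both must be True).
shallow_ep19 = dives[(dives['Bird'] == 'EP19') & (dives['Depth'] < 8)]
print(shallow_ep19)
```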


In [8]:
cleanup()