# DAML 03 - Pandas Introduction

Michal Grochmal <michal.grochmal@city.ac.uk>

`pandas` is a wrapper on top of `NumPy` (and, to some extent, `Matplotlib`) that makes up for the shortcomings
of those two libraries in the context of real-world data.  Instead of focusing on
numerical computing, it attempts to make working with messy data less annoying.

First some accumulated boilerplate from previous lectures (plus one new import, `pandas`):

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
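# Matplotlib 3.6 renamed 'seaborn-whitegrid' to 'seaborn-v0_8-whitegrid'; use whichever exists
style = 'seaborn-whitegrid' if 'seaborn-whitegrid' in plt.style.available else 'seaborn-v0_8-whitegrid'
plt.style.use(style)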
mpl.rcParams['figure.figsize'] = (12.5, 6.0)
import pandas as pd
pd.options.display.max_rows = 12

## General ideas behind pandas

Originally built as an enhanced version of R's `data.frame`,
`pandas` incorporates several well-known APIs into a single structure.
The `DataFrame` exposes APIs that make it easy to use from several perspectives:

* R `data.frame` like structure, extended by multi-indexes
* SQL-like joins, without need for external libraries (e.g. `sqldf`)
* Looks like a spreadsheet (yes, that is intentional)
* One can move between two and multidimensional representations (`stack`, `unstack`)
* Aggregation across dimensions with `groupby` (similar to SQL)
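
As a rough illustration of the perspectives listed above, here is a minimal sketch on toy data. The `cities` and `areas` frames (including the extra city, Glasgow) are made up for this example and are not part of the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy frames, used only for this illustration
cities = pd.DataFrame({'city': ['Belfast', 'Edinburgh', 'Glasgow', 'Cardiff'],
                       'country': ['Northern Ireland', 'Scotland', 'Scotland', 'Wales']})
areas = pd.DataFrame({'country': ['Northern Ireland', 'Scotland', 'Wales'],
                      'area': [14130, 77933, 20779]})

# SQL-like inner join on the shared 'country' column
joined = cities.merge(areas, on='country', how='inner')

# SQL-like GROUP BY: number of cities listed per country
cities_per_country = joined.groupby('country')['city'].count()

# stack moves the column labels into an inner index level (wide to long),
# unstack reverses the operation (long to wide)
stacked = areas.set_index('country').stack()
wide_again = stacked.unstack()
```

Here `stacked` is a `Series` with a two-level index (country, column name), and `wide_again` is a data frame with the original `area` column back in place.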

You will use `pandas` (rather than `NumPy`) for tasks around messy data.
`pandas` is built atop `NumPy` and uses the contiguous memory and broadcast operations
of `NumPy` arrays, which we saw before, to boost its performance.  `pandas` excels at:

* Importing data (very resilient compared to `numpy.load`)
* Cleaning up messy data (`dropna` or `fillna`)
* Gaining insight into data (`describe`)
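
The three strengths above can be sketched on a tiny in-memory CSV. The file content, the column names and the `'unknown'` marker are assumptions made up for this example; only `read_csv`, `dropna`, `fillna` and `describe` are actual `pandas` calls.

```python
import io
import pandas as pd

# Hypothetical CSV content: one empty field and one textual marker for missing data
raw = io.StringIO(
    'name,height,weight\n'
    'a,1.70,65\n'
    'b,,80\n'
    'c,1.82,unknown\n'
)

# Importing: empty fields become NaN, and we ask for 'unknown' to be treated the same way
frame = pd.read_csv(raw, na_values=['unknown'])

# Cleaning: either drop the incomplete rows or fill the gaps with a value
dropped = frame.dropna()
filled = frame.fillna(0)

# Insight: count, mean, quartiles and so on for every numeric column
summary = frame.describe()
```

Only the first row survives `dropna`, since each of the other two rows has one missing value.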

Let's use some data about the British Isles and the United Kingdom to demonstrate
some of the features:

In [None]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas']
area = [14130, 77933, 20779, 130279, 572]
population2017 = [1876695, 5404700, np.nan, 55268100, np.nan]
population2011 = [1810863, 5313600, 3063456, 53012456, 83314]

## Series

The main feature of `pandas` is its `DataFrame`, but a `DataFrame` is a collection of `Series` data structures.
A `Series` is similar to a `NumPy` array: it holds several values of the same data type.
The difference is that a `Series` adds labels (an index) to the data.

In [None]:
series = pd.Series(area)
series

In [None]:
uk_area = pd.Series(area, index=country)
uk_area

### Selection from the index

Selecting from a `Series` works both like a list (by position) and like a dictionary (by label).
You can say that `Series.index` maps keys onto `Series.values`.

In [None]:
uk_area.values, uk_area.values.dtype, uk_area.index

In [None]:
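# positional access is spelled with `iloc`; plain `uk_area[2]` is deprecated for label indexes in pandas 2.x
uk_area['Wales'], uk_area.iloc[2], uk_area.values[2]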

Slicing works too, and so does fancy indexing.

In [None]:
uk_area[0:3]

In [None]:
uk_area[['Wales', 'Scotland']]

### Sorted and unsorted indexes

Slicing also works on index labels,
but it is only likely to produce meaningful results if the index is sorted.

Note: in older versions of `pandas`, slicing over an unsorted index produced an error;
this still happens over a multi-index (outlined in a later section).

In [None]:
uk_area['England':'Scotland']  # oops

In [None]:
uk_area.sort_index(inplace=True)
uk_area['England':'Scotland']

### Implicit indexes

If you do not define an index, `pandas` will create an implicit one.
This is what happened with our `series` variable above.
Slicing can be counterintuitive: a slice over labels includes its end point,
whereas a slice over positions excludes it.

In [None]:
uk_area

In [None]:
uk_area['England':'Scotland']  # Inclusive!

In [None]:
uk_area[0:3]  # Exclusive!

This can give us a headache with numerical indexes,
so `pandas` lets us choose which index to select from:

* `loc` always refers to the explicit index (labels)
* `iloc` always refers to the implicit index (positions)
* `ix` was an older hybrid of the two; it was deprecated and then removed in `pandas` 1.0, so use `loc` or `iloc` instead

In [None]:
series.index = [1, 2, 3, 4, 5]
series

In [None]:
series[1], series.loc[1], series.iloc[1]

In [None]:
list(series[1:3]), list(series.loc[1:3]), list(series.iloc[1:3])

### Like an array

The `NumPy` vectorized operations, selection and broadcasting work as if we were working on an array.

In [None]:
uk_area[uk_area > 20000]

In [None]:
uk_area * 0.386  # convert to square miles (1/1.61**2)

In [None]:
uk_area.sum()  # Isle of Man is not part of the UK!  We'll fix that later.
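
A small self-contained sketch of the same idea, using three of the areas from above: `NumPy` universal functions can be applied to a `Series` directly and the result keeps the index. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

areas = pd.Series([14130, 77933, 20779],
                  index=['Northern Ireland', 'Scotland', 'Wales'])

log_areas = np.log10(areas)   # ufunc applied element-wise, labels are preserved
share = areas / areas.sum()   # the scalar total is broadcast against every element
large = areas[areas > 20000]  # boolean mask selection, as with an array
```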

### More than an array

A `Series` aligns on the index when performing operations,
and labels present in only one of the operands produce NaN in the result.
For example, what if we would like to know the population sum and population growth
between 2011 and 2017?

In [None]:
p11 = pd.Series(population2011, index=country)
p17 = pd.Series(population2017, index=country).dropna()  # disregard nulls, we will see more later

In [None]:
p11

In [None]:
p17

In [None]:
p17 - p11
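
The alignment can be seen more clearly on a tiny example. This is a sketch with made-up labels and values; `Series.add` with `fill_value` is one way to avoid the NaN when a label has no partner.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'x'])  # different order, and 'z' is missing

plain = a + b                    # values are matched by label, not by position
filled = a.add(b, fill_value=0)  # a missing partner is treated as 0 instead of NaN
```

In `plain` the entry for `'z'` is NaN, whereas in `filled` it keeps the value from `a`.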

## Data Frames

The `DataFrame` is just a collection of `Series` with a common index.
It can be understood as a two-dimensional representation of data,
similar to a spreadsheet.  One important thing to note is that,
contrary to two-dimensional `NumPy` arrays, **indexing a data frame
produces the column**, not the row.  Yet indexing through `loc` or `iloc` with two values
selects by row and column, just like in a `NumPy` array.

In [None]:
array = np.array([area, capital, population2011, population2017]).T  # transpose
data = pd.DataFrame({'capital': capital,
                     'area': area,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country)

In [None]:
array

In [None]:
data

In [None]:
array[0]

In [None]:
data['area']  # get column

In [None]:
data.iloc[0]  # but `iloc` does the same as a `NumPy` array

In [None]:
data.area  # this works too

In [None]:
data.loc['England', 'area']  # still [row, column]

### Summarize

Data frames have several useful methods to give a feel for the data.
With a reasonable amount of data you would rather not print thousands of rows;
`head` and `tail` show only the first or last few.
Moreover, looking at the beginning or end of sorted values will show outliers.
The `describe` and `info` methods print two distinct kinds of summary about the data frame:
`describe` gives a statistical view of each numeric column,
`info` gives the data types, non-null counts and memory usage.
The data frame can also produce plots (through `Matplotlib`) directly.

Let's see some examples, but first let's sort the index of the data frame.

In [None]:
data.sort_index(inplace=True)

In [None]:
data.head(3)

In [None]:
data.sort_values('area').tail(3)

In [None]:
len(data)  # number of rows

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
from matplotlib.ticker import FuncFormatter
plot = data[['population 2011', 'population 2017']].plot(kind='bar')
# `yaxis.iter_ticks` was removed from Matplotlib; a formatter gives the same nice ticks in millions
plot.yaxis.set_major_formatter(FuncFormatter(lambda y, pos: '%.0f M' % (y / 1e6)));

In [None]:
plot = data.plot(kind='scatter', x='population 2011', y='area', loglog=True)
for k, v in data[['population 2011', 'area']].iterrows():
    plot.axes.annotate(k, v)

### String methods

Another feature that does not exist in `NumPy` arrays is a set of methods that work
on string content, just like Python string methods.  The `str` accessor of a `Series`
(or of a column of a data frame) is used to call string methods; methods such as `startswith`
and `contains` produce a boolean `Series` that can then be used to retrieve rows from the data frame.

Several regular expression methods are supported as well.

In [None]:
data['capital'].str.startswith('Be')

In [None]:
data[data.capital.str.contains('[oa]')]  # regex

In [None]:
data[data.index.str.startswith('Eng')]
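
Beyond boolean tests, the `str` accessor also has methods that return transformed strings. A minimal sketch on the capitals from above follows; the patterns are just examples.

```python
import pandas as pd

capitals = pd.Series(['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas'])

upper = capitals.str.upper()                                     # plain string method
first_vowel = capitals.str.extract(r'([aeiou])', expand=False)   # first lowercase vowel
no_vowels = capitals.str.replace(r'[aeiou]', '', regex=True)     # regex substitution
```

`str.extract` returns the first match of the capture group for each element, and `regex=True` makes `str.replace` treat the pattern as a regular expression rather than a literal string.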

### Missing data

More often than not, real-world data is incomplete in some way.
In `NumPy`, and therefore in `pandas`, missing data is represented using NaNs (not a number).
NaNs are actually IEEE 754 float NaNs, therefore the data type of a `Series` (or `NumPy` array)
holding them must be either a float or a Python object.  `pandas` data frames have the `dropna` and `fillna`
methods that (unsurprisingly) drop or fill in values for NaNs.
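
The effect on data types can be checked directly. The sketch below uses made-up values and shows an integer `Series` next to one that contains a NaN.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])           # no missing data: integer dtype
with_nan = pd.Series([1, 2, np.nan])  # a single NaN forces the whole Series to float

nan_equals_itself = (np.nan == np.nan)  # False: IEEE 754 NaN never compares equal
missing_mask = with_nan.isna()          # hence the dedicated test for missing values
```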

Dropping can be done by row or column.  Filling can be performed in three different ways.
We can provide a value to `fillna` to substitute for the NaNs (e.g. `.fillna(0)`), or we can use
the `method=` argument to fill the NaNs from the data itself.  The `method=` argument
can be either `pad`/`ffill`, which fills each NaN with the previous (non-NaN) value seen, or
`backfill`/`bfill`, which fills a NaN from the next value.
Recent `pandas` releases deprecate the `method=` argument in favour of the equivalent `ffill` and `bfill` methods.
Filling can be performed column-wise or row-wise.
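
The three ways of filling can be compared side by side on a small `Series`. This is a sketch with made-up values; it uses the `ffill` and `bfill` methods, which do the same as the `method=` argument described above.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0)  # substitute a fixed value
forward = s.ffill()     # propagate the previous non-NaN value forward
backward = s.bfill()    # take the next non-NaN value instead
```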

In [None]:
data.dropna()  # We lost Wales and the Isle of Man!

In [None]:
data.dropna(axis='columns')  # that's better

In [None]:
# same as fillna(method='ffill', axis='columns'), which newer pandas deprecates
data_full = data.ffill(axis='columns')
data_full

In [None]:
data_full.dtypes

In [None]:
# convert the numeric columns back to integers; 'capital' stays as text
numeric_columns = ['area', 'population 2011', 'population 2017']
data_full = data_full.astype({column: np.int64 for column in numeric_columns})
data_full.dtypes

In [None]:
data_full