# DAML 04 - Pandas Introduction

Michal Grochmal <michal.grochmal@city.ac.uk>

`pandas` is a wrapper on top of `NumPy` (and, to some extent, `Matplotlib`) that makes up for the shortcomings
of those two libraries when working with real-world data.
Instead of working towards efficient
numerical computing it attempts to make working with messy data less annoying.
The name Pandas comes from the term *panel data*, which originates in econometrics.

Let's import it,
and also let's import `NumPy` to see how both libraries work with each other.
`pandas.options` holds several variables which are used when displaying the data.

In [None]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 12

## General ideas behind pandas

Originally built as an enhanced version of R's `data.frame`,
`pandas` incorporates several known APIs into a single structure.
The `DataFrame` includes APIs that make it easy to use from different perspectives.

* R `data.frame` like structure, extended by multi-indexes
* SQL-like joins, without need for external libraries (e.g. `sqldf` in R)
* Looks like a spreadsheet (yes, that is intentional)
* One can move between two and multidimensional representations (`stack`, `unstack`)
* Aggregation across dimensions with `groupby` (similar to SQL)
* Enhanced data types, with many operations beyond pure computation when compared with `NumPy`
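As a rough illustration of two of the items above, here is a minimal sketch of `groupby` aggregation and of moving between representations with `stack` and `unstack`. The frames `sales` and `wide` and their values are hypothetical toy data, made up only for this example.

```python
import pandas as pd

# hypothetical toy data, made up for this illustration
sales = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 10]})

# aggregate one column per group, similar to SQL's GROUP BY
by_group = sales.groupby('group')['value'].sum()
print(by_group)

# a small two-dimensional frame: rows x and y, one column per year
wide = pd.DataFrame({'2011': [1, 2], '2017': [3, 4]}, index=['x', 'y'])

# stack moves the column labels into a second index level (one long Series)
long = wide.stack()
print(long)

# unstack moves the inner index level back into columns
back = long.unstack()
print(back)
```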

You will use `pandas` (rather than `NumPy`) for tasks around messy data, or multiple streams of numeric and/or alpha-numeric data.
`pandas` (like most of the libraries in the Python Scientific Stack) is built on top of `NumPy`, and uses the contiguous memory and broadcast operations
of `NumPy` arrays to boost its performance. It also integrates out-of-the-box graphing functionality through the `matplotlib` API.

`pandas` excels at:

* Importing data (very resilient compared to `numpy.loadtxt`)
* Cleaning up messy data (`dropna` or `fillna`)
* Gaining insight into data (`describe`)
* Encapsulating different data types within the same structure (`DataFrame`)
* Using a plethora of bespoke functions that help with data analytics and summarization

Let's use some data about the British Isles and the United Kingdom to demonstrate
some of the features:

In [None]:
country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas']
area = np.array([14130, 77933, 20779, 130279, 572])
population2017 = np.array([1876695, 5404700, np.nan, 55268100, np.nan])
population2011 = np.array([1810863, 5313600, 3063456, 53012456, 83314])

## Series

The main feature of `pandas` is the `DataFrame`, but that is just a collection of `Series` data structures.
A `Series` is pretty similar to a `NumPy` array: it is a sequence of values that all share the same data type.
The difference is that the `Series` adds labels (an index) to the data.

In [None]:
series_area = pd.Series(area)
series_area

In [None]:
uk_area = pd.Series(area, index=country)
uk_area

### Selection from the index

Selecting from a `Series` works both like a list and like a dictionary.
You can say that `Series.index` maps keys onto `Series.values`.

In [None]:
uk_area.values, uk_area.values.dtype, uk_area.index
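The dictionary side of a `Series` can be sketched as follows. This is a minimal, self-contained example that rebuilds the area data from a dictionary; the label `'France'` is only used to show a missing key.

```python
import numpy as np
import pandas as pd

country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
area = np.array([14130, 77933, 20779, 130279, 572])

# building from a dictionary uses the keys as the index
uk_area = pd.Series(dict(zip(country, area)))

# membership is tested against the index, not against the values
print('Wales' in uk_area)

# keys() returns the index
print(list(uk_area.keys()))

# get() returns a default for a missing label instead of raising KeyError
print(uk_area.get('France', -1))

# items() iterates over (label, value) pairs, like dict.items()
for label, value in uk_area.items():
    print(label, value)
```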

All the following three forms of indexing produce the same record.

In [None]:
uk_area['Wales'], uk_area[2], uk_area.values[2]

Slicing works too, so does fancy indexing.

In [None]:
uk_area[0:3]

In [None]:
uk_area[['Wales', 'Scotland']]

### Sorted and unsorted indexes

Slicing works on indexes (the labels of the Series)
but it is only likely to produce meaningful results if the index is sorted.

Note: In older versions of `pandas` slicing over an unsorted index produced an error;
this still happens over a multi-index (outlined in a later section).
Since we did not care about the order when constructing the series our index is unsorted,
therefore slicing it will produce strange results.

In [None]:
uk_area['England':'Scotland']  # oops!

If we sort the index,
the alphabetical order (or actually ASCIIbetical order) of the labels can be used for slicing.

In [None]:
uk_area.sort_index(inplace=True)
uk_area['England':'Scotland']
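Before slicing by label it can help to check whether the index is sorted. The sketch below assumes the same area data and uses the `is_monotonic_increasing` property of the index; it also uses `sort_index()` without `inplace=True`, which returns a new series and leaves the original untouched.

```python
import numpy as np
import pandas as pd

country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
area = np.array([14130, 77933, 20779, 130279, 572])
uk_area = pd.Series(area, index=country)

# the labels are in insertion order, not in sorted order
print(uk_area.index.is_monotonic_increasing)
unsorted_slice = uk_area['England':'Scotland']

# sort_index() returns a sorted copy
sorted_area = uk_area.sort_index()
print(sorted_area.index.is_monotonic_increasing)
sorted_slice = sorted_area['England':'Scotland']

# compare how many items each slice picked up
print(len(unsorted_slice), len(sorted_slice))
```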

### Implicit indexes

If you do not define an index you can still select and slice series items.
This is because apart from the normal index an implicit, positional, index is created.
In other words, every `pandas` series has two indexes: the implicit and the explicit index.

In [None]:
series_area = pd.Series(area)
series_area[0:3]

Moreover, when the explicit index is non-numeric,
plain `[]` indexing with integers falls back to the implicit index.
Here is a series with a sorted index.

In [None]:
uk_area = pd.Series(area, index=country).sort_index()
uk_area

Most of the time both indexes work in the same fashion but slicing
is inconsistent between them:
slicing the explicit index includes the last slice element (unlike Python list slicing),
whilst slicing the implicit index works in the same way as list slicing.

In [None]:
uk_area['England':'Scotland']  # Inclusive!

In [None]:
uk_area[0:3]  # Exclusive!

This can give us a headache with numerical indexes,
therefore `pandas` allows us to choose which index to select from:

* `loc` always refers to the explicit index
* `iloc` always refers to the implicit index
* plain `[]` indexing picks between the two depending on the index type and on whether you select or slice (the old `ix` indexer, which behaved the same way, was removed in `pandas` 1.0)

In [None]:
series_area = pd.Series(area)
series_area.index = [1, 2, 3, 4, 5]
series_area

In [None]:
series_area[1], series_area.loc[1], series_area.iloc[1]

In [None]:
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])

Note that, with a numeric explicit index, plain `[]` selects a single item by the explicit index but *slices by the implicit index*.

But there's more!
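If one does not define an index at all, the explicit index simply mirrors the positions,
so `.loc` selects by the same numbers as the implicit index but still uses the explicit index rules of slicing (the end of the slice is included).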

In [None]:
series_area = pd.Series(area)
series_area

In [None]:
series_area[1], series_area.loc[1], series_area.iloc[1]

In [None]:
list(series_area[1:3]), list(series_area.loc[1:3]), list(series_area.iloc[1:3])

Always cross-check slicing operations and use `.loc` or `.iloc` explicitly.
The same rules apply to data frames (seen in a moment).
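The difference between the two slicing rules can be captured in a short sketch. It assumes the same area values with the integer labels 1 to 5 used earlier.

```python
import numpy as np
import pandas as pd

area = np.array([14130, 77933, 20779, 130279, 572])
s = pd.Series(area, index=[1, 2, 3, 4, 5])

# .loc slices by label: labels 1, 2 and 3, the end label is included
by_label = s.loc[1:3]

# .iloc slices by position: positions 1 and 2, the end position is excluded
by_position = s.iloc[1:3]

print(by_label.tolist())     # [14130, 77933, 20779]
print(by_position.tolist())  # [77933, 20779]

# single items differ as well: label 1 is the first item, position 1 the second
print(s.loc[1], s.iloc[1])   # 14130 77933
```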

### Like an array

The `NumPy` vectorized operations, selection and broadcasting work as if we were working on an array.

In [None]:
uk_area[uk_area > 20000]

Let's compute the area in square miles instead of square kilometers.

$$
0.386 \approx \frac{1}{1.61^2}
$$

In [None]:
uk_area * 0.386

And the total of the UK area in square miles.
(The Isle of Man is technically not part of the UK but it is negligible here.)

In [None]:
(uk_area * 0.386).sum()

### More than an array

The `Series` aligns the indexes when performing operations.
For example what if we would like to know the population growth between 2011 and 2017?

Note: Below, `.dropna()` removes the entries containing `NULL`s (`NaN`s) from the series.

In [None]:
p11 = pd.Series(population2011, index=country)
p17 = pd.Series(population2017, index=country).dropna()

In [None]:
p11

In [None]:
p17

When we perform the operation the indexes are matched.
Where a label is missing from one of the series (or one of the operands is a `NaN`), pandas automatically inserts a `NaN` (Not a Number) in the result.

In [None]:
p17 - p11
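Index alignment can be seen in isolation in the following sketch. It uses a subset of the population figures above, deliberately given in a different order in each series; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

p11 = pd.Series([1810863, 5313600, 3063456],
                index=['Northern Ireland', 'Scotland', 'Wales'])
# different order, and no entry for Wales
p17 = pd.Series([5404700, 1876695],
                index=['Scotland', 'Northern Ireland'])

# values are matched by label, not by position
growth = p17 - p11
print(growth)

# labels present in both series
common = p11.index.intersection(p17.index)
print(list(common))

# drop the labels for which the difference could not be computed
print(growth.dropna())
```

The result holds the union of both indexes, with `NaN` for Wales since it only appears in one of the series.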

## Data Frames

The `DataFrame` is just a collection of `Series` with a common index.
It can be understood as a two-dimensional representation of data,
similar to a spreadsheet.  One important thing to note is that,
contrary to a two-dimensional `NumPy` array, **indexing a data frame
produces the entire column** and not the row.  Yet, indexing it with two descriptors
produces the row and the column just like in a `NumPy` array.

Let's build a `NumPy` array and a `DataFrame` that look the same.
Then we can have a look at how similar operations work on both.
Constructing the data frame can be performed in several ways,
below is the most common way of using a dictionary of arrays.
Each dictionary key-value pair becomes a column (a `Series`).

The `NumPy` array, when constructed from a list of arrays,
understands each part of the list as a row, therefore we need to transpose it.

In [None]:
array = np.array([area, capital, population2011, population2017]).T

data = pd.DataFrame({'area': area,
                     'capital': capital,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country)

In [None]:
array

In [None]:
data.dtypes

The first thing to note is that the `NumPy` array can only hold one data type.
The array cast every value to a Unicode string.
In reality `NumPy` arrays support compound data types
but these are considerably more complicated to use than data frames.

Data in the data frame got converted too.
Each column can have a different data type, but somehow the numbers
in "`population 2011`" and in "`population 2017`" look different.
The reason is that `NumPy` integers cannot represent `NaN`s, so the column with missing values was stored as floating point numbers.
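As an aside, `pandas` also offers a nullable integer data type, spelled `'Int64'` with a capital I, which can hold missing values. The following is a minimal sketch of converting the 2017 population to it, not something the rest of this notebook relies on.

```python
import numpy as np
import pandas as pd

country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
population2017 = np.array([1876695, 5404700, np.nan, 55268100, np.nan])

# NaN forces NumPy (and therefore the Series) to use floats
as_float = pd.Series(population2017, index=country)
print(as_float.dtype)

# the nullable integer type keeps whole numbers and marks the gaps as <NA>
as_int = as_float.astype('Int64')
print(as_int.dtype)
print(as_int)
```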

We have the same data in `NumPy` and `pandas`, and we can index it.
In `NumPy` a plain index produces a *row*, in `pandas` it produces a *column*.

In [None]:
array[0]

In [None]:
data['area']

Yet, there is a twist.
Using the implicit index (`.iloc`) produces the same behavior as `NumPy`.

In [None]:
data.iloc[0]

Columns with simple names can be accessed as attributes.

In [None]:
data.area

Finally, indexing with two descriptors works in the same way as in `NumPy`:
one provides first the *row* and then the *column*.
And slicing works too.

In [None]:
data.loc['England', 'area':'capital']
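A few more forms of row-and-column selection, as a minimal sketch on a reduced copy of the frame (only the `area` and `capital` columns are used here):

```python
import numpy as np
import pandas as pd

country = ['Northern Ireland', 'Scotland', 'Wales', 'England', 'Isle of Man']
capital = ['Belfast', 'Edinburgh', 'Cardiff', 'London', 'Douglas']
area = np.array([14130, 77933, 20779, 130279, 572])
data = pd.DataFrame({'area': area, 'capital': capital}, index=country)

# label of the row, then label of the column: a single cell
print(data.loc['Wales', 'capital'])

# the same cell through positions: third row, second column
print(data.iloc[2, 1])

# lists of labels on both axes produce a smaller data frame
subset = data.loc[['Wales', 'England'], ['area']]
print(subset)
```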

### Summarize

Data frames have several useful methods to give a feel for the data.
With a reasonable amount of data you do not want thousands of rows to
be printed.  What you want are methods that quickly give you the data you are after.

For example, looking at the beginning or end of sorted values will show outliers.

In [None]:
data = pd.DataFrame({'area': area,
                     'capital': capital,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country).sort_index()

The index is sorted, therefore we get the countries in alphabetical order.

In [None]:
data.head(3)

Sorting by area and looking at the tail should give us the biggest countries.

In [None]:
data.sort_values('area').tail(3)

The length of a data frame is the number of rows it has.

In [None]:
len(data)

The `describe` and `info` methods print two distinct types of statistics about the data frame:
one gives the statistical view of each column, the other gives you a memory layout.

In [None]:
data.describe()

In [None]:
data.info()

The data frame can also display plots (using `Matplotlib`) directly.
That said, if we want to display the plots within the notebook or style them,
we need to perform the `matplotlib` setup ourselves.

In [None]:
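%matplotlib inline
import matplotlib.pyplot as plt
# matplotlib 3.6 renamed the seaborn styles, use whichever name is available
style = 'seaborn-v0_8-talk' if 'seaborn-v0_8-talk' in plt.style.available else 'seaborn-talk'
plt.style.use(style)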

We can see the population growth in a graph.

In [None]:
plot = data[['population 2011', 'population 2017']].plot(kind='bar', figsize=(14, 7))

And, on a logarithmic scale, we can see the relation between area and population.

Here we also use annotations, which are a `matplotlib` feature.
`annotate` places the string (first argument) at a point on the graph
(two coordinates, given as a tuple, list or series).

In [None]:
plot = data.plot(kind='scatter', x='population 2011', y='area', loglog=True, figsize=(16, 8))
for k, v in data[['population 2011', 'area']].iterrows():
    # v is a Series labelled by column name, so positional access needs .iloc
    plot.axes.annotate(k, xy=tuple(v), xytext=(v.iloc[0], v.iloc[1]*1.07), ha='center', size=12)

### String methods

Another extra feature that does not exist on `NumPy` arrays is a set of methods that work
on string content, just like Python string methods.  The `str` accessor of a `Series`
(or of a column of a data frame) is used to call string methods on each element, efficiently.
The result is either a boolean `Series` that can then be used to retrieve rows from the data frame,
or a new string `Series` modified by the operation. Something worth mentioning here is that
we should convert the `object` data type to `string` for all columns that hold string data.
Using string methods on `object` data types still works, but is discouraged!

In [None]:
data['capital'].str.startswith('Be')
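The conversion to the `string` data type mentioned above can be sketched like this. The series below is a hypothetical cut-down list of capitals with one missing entry added, to show how missing values propagate through the string methods.

```python
import pandas as pd

# hypothetical example series with one missing capital
capital = pd.Series(['Belfast', 'Edinburgh', None, 'London'])
print(capital.dtype)

# explicit conversion to the dedicated string data type
as_string = capital.astype('string')
print(as_string.dtype)

# string methods work the same way, the missing entry stays missing
mask = as_string.str.startswith('Be')
print(mask)

# treat a missing entry as "no match" before using the mask for selection
print(as_string[mask.fillna(False)])
```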

Several regular expression methods are supported as well.

In [None]:
data[data.capital.str.contains('[oa]')]

In [None]:
data[data.index.str.startswith('Eng')]

Most Python string methods are available.

In [None]:
data['capital'].str.upper()

To modify the data frame we can assign to a new column.
For example, we can store the first letter of the capital.

In [None]:
data['initial'] = data['capital'].str[0].str.upper()
data

Note above that `.str` has been used two times.

### Missing data

More often than not real world data is incomplete in some way.
In `NumPy`, and therefore in `pandas`, missing data is represented using NaNs (not a number).
NaNs used to be IEEE 754 float NaNs only, therefore the data type of a `Series` (or `NumPy` array)
holding them had to be either a float or a Python object. Since `pandas` version 1.0.0 missing values got a
big update (`pd.NA`) and now they can be used inside several data types of `Series`; you can read more
about this major upgrade here: https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data-na.
This is not the case for `NumPy`.

`Series` strings are just Python objects,
contrary to the strings in `NumPy` arrays; this means that a `Series` or a data frame can hold
NULLs (NaNs) for strings.

`pandas` data frames have the `dropna` and `fillna` methods that
(unsurprisingly) drop or fill in values for NaNs.
Dropping can be done by row or column.

In [None]:
data = pd.DataFrame({'area': area,
                     'capital': capital,
                     'population 2011': population2011,
                     'population 2017': population2017},
                    index=country).sort_index()

In [None]:
data.dropna()

We lost the rows for Wales and the Isle of Man, despite the fact that they have data for 2011.
Instead we can drop the incomplete columns.

In [None]:
data.dropna(axis='columns')

That's better.

Instead of `NumPy`'s `axis=0` and `axis=1`,
in `pandas` one can use `axis='index'` and `axis='columns'`.
That holds most of the time;
beware that some `pandas` functions accept `axis='row'` and `axis='col'`.
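The equivalence between the numeric and the named axes can be checked with a small sketch. The frame `df` is hypothetical toy data.

```python
import pandas as pd

# hypothetical toy frame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 and axis='index' both walk down the rows: one result per column
down_rows = df.sum(axis='index')
print(down_rows)
print(down_rows.equals(df.sum(axis=0)))

# axis=1 and axis='columns' both walk across the columns: one result per row
across_columns = df.sum(axis='columns')
print(across_columns)
print(across_columns.equals(df.sum(axis=1)))
```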

Filling NaNs can be performed in three different ways.
We can provide a value to `fillna` to substitute the NaNs with (e.g. `.fillna(0)`), or we can use
the `method=` argument to fill the NaNs from the data itself.  The `method=`
can be either `pad`/`ffill`, which fills each NaN with the previous (non-NaN) value seen, or it can be
`backfill`/`bfill`, which fills a NaN from the next value.
Recent versions of `pandas` deprecate the `method=` argument in favour of the equivalent `.ffill()` and `.bfill()` methods.
Filling can be performed column or row wise.
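The three ways of filling are easiest to compare on a tiny series. This sketch uses hypothetical values and the `.ffill()` and `.bfill()` methods, which fill forward and backward respectively.

```python
import numpy as np
import pandas as pd

# hypothetical values with a gap in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# a fixed value
print(s.fillna(0).tolist())   # [1.0, 0.0, 0.0, 4.0]

# forward fill: each NaN takes the previous non-NaN value
print(s.ffill().tolist())     # [1.0, 1.0, 1.0, 4.0]

# backward fill: each NaN takes the next non-NaN value
print(s.bfill().tolist())     # [1.0, 4.0, 4.0, 4.0]
```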

In [None]:
# equivalent to fillna(method='ffill', ...), an argument deprecated in recent pandas versions
data_full = data.ffill(axis='columns')
data_full

That seems to have worked but not quite.
The numbers look wrong.  We better check the data types.

In [None]:
data_full.dtypes

Everything got converted to Python objects!
That is the caveat of filling NaNs between columns, i.e. we lose data types.
We can easily fix this now.

In [None]:
data_fixed = data_full.convert_dtypes()
data_fixed.dtypes

Or, we can fix it with some data munging.
We will separate the columns that need filling from the rest,
perform the filling, fix the data types on the reduced data frame and join things back.

In [None]:
data_partial = data[['population 2011', 'population 2017']].ffill(axis='columns')
# the following is slightly overkill for numbers but useful if we have more columns
data_partial = data_partial.apply(pd.to_numeric, errors='ignore')
# np.integer is an abstract type, a concrete dtype such as np.int64 is needed for the cast
data_partial = data_partial.astype(np.int64, errors='ignore')
data_full = pd.concat([data[['area', 'capital']], data_partial], axis='columns')
data_full.dtypes

That appears to be right.
But we used a new concept here: we split and joined back two data frames.
`pandas` has `concat`, which allows for joins along an `axis`;
it accepts a `join` parameter for "inner" or "outer" joins.
The join will happen on the index by default.

The `pandas` function `merge` performs left and right joins (as well as inner and outer ones) using columns as join keys.
And finally the data frame itself has a `join` method, which can use either
`concat` or `merge` directly on the data frames.
All three methods are more-or-less interchangeable;
their difference is mostly in how parameters are passed in.
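To make the comparison concrete, here is a minimal sketch of the same inner join on the index written in the three ways. The frames `left` and `right` are small hypothetical cuts of the data used in this notebook.

```python
import pandas as pd

# hypothetical small frames sharing part of their index
left = pd.DataFrame({'area': [20779, 77933]},
                    index=['Wales', 'Scotland'])
right = pd.DataFrame({'capital': ['Edinburgh', 'Cardiff', 'London']},
                     index=['Scotland', 'Wales', 'England'])

# concat: join along the columns axis, keeping only shared index labels
via_concat = pd.concat([left, right], axis='columns', join='inner')

# merge: joins on columns by default, here we ask it to use both indexes
via_merge = pd.merge(left, right, left_index=True, right_index=True, how='inner')

# join: a data frame method that joins on the index by default
via_join = left.join(right, how='inner')

# sort the rows before comparing, so that row order does not matter
print(via_concat.sort_index().equals(via_merge.sort_index()))
print(via_merge.sort_index().equals(via_join.sort_index()))
print(via_join)
```

England has no row in `left`, so the inner join drops it in all three cases.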

But still, let's check whether the data in the joined frame is as correct
as the data types suggest.

In [None]:
data_full