# `pandas`

## `pandas`: The workhorse of PyData

The library most widely known in the PyData ecosystem is probably `pandas`. Developed initially by Wes McKinney, it has grown, building on top of `numpy` and many other Python libraries, to be *the* workhorse of almost any analysis you may wish to perform.

The API is therefore rather large, and instead of pretending to show it all off we'll present the main features and, when necessary, point to other resources.

> "You don't learn `pandas`, you discover `pandas`."
> 
> -- Paraphrasing someone on the internet

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd

## Pandas basics

## Series

We start off with `Series`. Series are the building blocks of DataFrames and are, in a handwavy sense, nothing more than `numpy` arrays with an index.

In [None]:
se = pd.Series([5, 4, -1, -2])
se

In [None]:
se.values, se.index

## Accessing data

Every Series has an index, whether it's automatically assigned by `pandas` or whether it's specified by us.

In [None]:
se_2 = pd.Series([5, 4, -1, -2], index=['a', 'b', 'c', 'd'])
se_2

You can access elements by their index:

In [None]:
se_2['b']

In [None]:
se_2[['b', 'c']]

Or by their position, as if the Series were a NumPy array:

In [None]:
se_2.iloc[0]  # plain se_2[0] is deprecated for positional access on a labeled index
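Plain square brackets can be ambiguous when the index itself contains integers. A minimal sketch of the explicit alternatives, `.loc` for labels and `.iloc` for positions (the Series `s` is made up for this illustration):

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([5, 4, -1, -2], index=[3, 2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label, by_position)
```

Being explicit costs a few characters but removes any doubt about which lookup is meant.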

We can work with Series like we would with NumPy arrays:

In [None]:
np.mean(se_2)

In [None]:
se_2 ** 2

In [None]:
se_2 > 0

In [None]:
se_2[se_2 > 0]

In [None]:
'a' in se_2
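Note that `in` looks at the index labels, not at the values, just like `in` on a dict looks at the keys. A small sketch of the difference (`se_labels` is a stand-in for `se_2`):

```python
import pandas as pd

se_labels = pd.Series([5, 4, -1, -2], index=['a', 'b', 'c', 'd'])

label_present = 'a' in se_labels        # checks the index labels
value_in_index = 5 in se_labels         # 5 is a value, not a label
value_present = 5 in se_labels.values   # checks the underlying array

print(label_present, value_in_index, value_present)
```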

## Creating a Series

Data from dicts can also be loaded:

In [None]:
literacy_dict = {"NH": 0.5, "ZH": 0.4, "FL": 0.1, "GR": 0.6}
literacy_se = pd.Series(literacy_dict)

In [None]:
literacy_se

An index can also be passed. Elements from the index with no corresponding value will be given the special value `nan`, while values with no corresponding key from the index will be ignored:

In [None]:
literacy_index = ["NH", "ZH", "FL", "UT"]
literacy_se_2 = pd.Series(literacy_dict, index=literacy_index)
literacy_se_2

In [None]:
literacy_se_2.isnull()

We can give names to individual Series and to their indices:

In [None]:
literacy_se_2.name = 'Literacy'
literacy_se_2.index.name = 'Province'
literacy_se_2

The index can be changed in place!

(Beware: modifying state in place is a common source of bugs, so work on a copy when in doubt.)

In [None]:
literacy_se_3 = literacy_se_2.copy()
literacy_se_3.index = ["OH", "MY,", "WTF", "HAPPENED?"]
literacy_se_3

As with dictionary keys, index values must be hashable!

## DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and a column index; it can be thought of as a dict of Series, all sharing the same index.



DataFrames can be created in different ways. One of them is from dicts with lists:

In [None]:
data = {"province": ["FL", "FL", "NH", "NH", "ZH"],
        "year": [2013, 2014, 2013, 2014, 2014],
        "literacy": [0.2, 0.1, 0.5, 0.3, 0.5]}
df = pd.DataFrame(data)
df

To have the columns in a certain order:

In [None]:
df = pd.DataFrame(data, columns=["province", "year" , "literacy"])
df

An `index` can be passed to the constructor (as with Series). Passing column names that do not exist in the data results in columns filled with missing values.

Assigning values to new columns is easy:

In [None]:
df['nonsense'] = df.year / df.literacy
df

In [None]:
df['serie_aligned'] = pd.Series(range(3), index=[1, 3, 4])
df
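When a Series is assigned to a column, `pandas` matches values on index labels rather than on position. A minimal sketch with a made-up DataFrame `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"x": [10, 20, 30, 40, 50]})  # default index 0..4

# Values are matched on index labels, not on position
frame["aligned"] = pd.Series([0, 1, 2], index=[1, 3, 4])

# Labels 0 and 2 have no partner in the Series, so they become NaN
print(frame)
```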

## Reindexing
Reindexing can often be important, especially with time series. Let's assume we have the following DataFrame (or Series):

In [None]:
se_3 = pd.Series(range(10), range(20)[::2])
se_3

The index might represent days of the month. Suppose we want to have an entry for every day of, say, December:

In [None]:
se_3.reindex(range(31))[:22]

The `NaN`s can be filled in using the `method` argument:

In [None]:
se_3.reindex(range(31), method='ffill')[:22]

Make lab_02

## Dropping data

Dropping from an axis is easy in `pandas`:

In [None]:
df.drop('nonsense', axis=1)

In [None]:
df.drop(4)

## Indexing and filtering
Selecting data can be done using names or conditions:

In [None]:
df['province']

In [None]:
df[['province', 'nonsense']]

Indexing a single column with a string returns a Series, while indexing with a list returns a DataFrame:

In [None]:
print(type(df['province']))
print(type(df[['province']]))

In [None]:
df[df.nonsense > 5000]

Fancier indexing is also available:

In [None]:
df.loc[[0, 3], ['nonsense', 'literacy']]
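`.loc` also accepts a boolean condition for the rows, and `.iloc` does the same job by position. A short sketch on a made-up DataFrame `scores`:

```python
import pandas as pd

scores = pd.DataFrame({"province": ["FL", "NH", "ZH"],
                       "year": [2013, 2014, 2014],
                       "literacy": [0.2, 0.3, 0.5]})

# Rows where literacy exceeds 0.25, restricted to two columns
subset = scores.loc[scores["literacy"] > 0.25, ["province", "literacy"]]

# The same rows and columns selected purely by position
by_position = scores.iloc[[1, 2], [0, 2]]

print(subset)
```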

## Functions
NumPy functions working on an array (element-wise) work just fine with DataFrames:

In [None]:
np.sqrt(df[["literacy", "nonsense"]])

Sometimes we want to apply custom functions on our DataFrame:

In [None]:
df

In [None]:
def millennium(g):
    try:
        return np.floor(g / 1000)
    except TypeError:  # non-numeric input
        return None

In [None]:
df['year'].apply(millennium) #what is a faster way to do this?
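One possible answer to the question above: arithmetic on a whole Series is vectorized, so the Python-level function call per element can be avoided. A sketch, assuming a numeric column (the Series `years` mirrors the `year` column):

```python
import numpy as np
import pandas as pd

years = pd.Series([2013, 2014, 2013, 2014, 2014], name="year")

# Element by element: the Python function is called once per value
slow = years.apply(lambda y: np.floor(y / 1000))

# Vectorized: a single floor division on the whole Series
fast = years // 1000

print(fast.tolist())
```

The two results hold the same numbers; the vectorized version keeps the integer dtype, while `np.floor` returns floats.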

Make lab_04

## Sorting
Sorting a dataset is a piece of cake:

In [None]:
df.sort_index()

In [None]:
df.sort_index(axis=1)

Sorting can also happen at the column level:

In [None]:
df.sort_values(by='literacy')

## Descriptive statistics
What would a chapter on `pandas` be worth if descriptive statistics were left out?

In [None]:
df.nonsense.sum()

In [None]:
df.sum()  # Province is always the cutest!

In [None]:
df.sum(axis=1, numeric_only=True)  # psst: with numeric_only=True, non-numeric values are ignored!

A quick way to compute descriptive statistics on a dataframe is `describe`:

In [None]:
df.describe()

In [None]:
df.describe(include='all')

## Value counts, uniqueness, membership
Often, merely counting the values in a Series is invaluable:

In [None]:
df.province.value_counts()

In [None]:
df.province.value_counts().index.values

In [None]:
df.province.unique()

Vectorized membership is also pretty important and handy!

In [None]:
df[df.province.isin(['ZH', 'FL'])]

## Handling missing data

If every dataset you ever got your hands on never had missing data, you're probably in the wrong room. 

Assuming from now on that you're in the right room, let's talk about `pandas`' handling of missing data.

When a value is not present (or `None`), `pandas` replaces it with `np.nan`:

In [None]:
np.nan

There is a series of methods that ease working with missing data. Besides `isnull` and `notnull` there are `dropna` and `fillna`:

In [None]:
df.dropna()  # here every row with missing data in **some** column is removed

In [None]:
df.fillna(0)

In [None]:
df.fillna(np.e)

`dropna` can be a bit brutal sometimes, which is why it has options to drop only rows where *every* value is missing (`how='all'`) or to keep only rows with at least `thresh` non-missing values:

In [None]:
df.dropna(how='all')  # nothing gets dropped here!

In [None]:
df.dropna?

In [None]:
# keep rows with at least 5 non NA values
df.dropna(thresh=5)
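To see the options side by side, here is a small sketch on a made-up DataFrame `gaps` with zero, two and three missing values per row:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                     "b": [2.0, 5.0, np.nan],
                     "c": [3.0, np.nan, np.nan]})

any_missing = gaps.dropna()            # drops rows 1 and 2
all_missing = gaps.dropna(how="all")   # drops only row 2
at_least_two = gaps.dropna(thresh=2)   # keeps rows with >= 2 non-NA values

print(any_missing.index.tolist(),
      all_missing.index.tolist(),
      at_least_two.index.tolist())
```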

## Exercise

* Write a few helper functions so that you can easily drop lines with more than `n` missing values
* A nice feature of `fillna` is that it can fill different columns with different values if you pass it a dict. Create a dataframe and test this feature
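* As above, but fill forward and backward (the `method='ffill'` / `method='bfill'` kwarg of `fillna`, or the dedicated `ffill()` and `bfill()` methods in recent `pandas` versions)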

## Hierarchical indexing
Hierarchical indexing is a rather important feature of `pandas`. It basically allows one to work with multidimensional data in a two-dimensional form. Let's take a look:

In [None]:
data = pd.Series(np.random.randn(10),
                index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                       [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

This Series has a multi-index. The gaps that we see between one letter and the other are interpreted as: "use the value above".

In [None]:
data.index

In [None]:
data['b']  # a partial index is possible

In [None]:
data['b'].index  # feels like magic!

Data can also be selected using the inner labels:

In [None]:
data

In [None]:
data[:, 2]  # the ':' is necessary 

Later on we will introduce `stack()` and `unstack()`, which help tremendously once you want to have tidy data:

In [None]:
data.unstack() #what kind of operation is unstack?

Not shown here but possible: use hierarchical indexing for columns!
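A minimal sketch of hierarchical columns, using a made-up DataFrame `wide` with placeholder numbers; `xs` is one way to select on the inner level:

```python
import numpy as np
import pandas as pd

# Two column levels: province (outer) and year (inner)
columns = pd.MultiIndex.from_product([["FL", "NH"], [2013, 2014]],
                                     names=["province", "year"])
wide = pd.DataFrame(np.arange(8).reshape(2, 4),
                    index=["r0", "r1"], columns=columns)

fl_only = wide["FL"]                              # partial selection on the outer level
year_2014 = wide.xs(2014, axis=1, level="year")   # selection on the inner level

print(fl_only)
print(year_2014)
```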

## Exercises

* Create a dataframe with hierarchical indexing on both rows and columns
* Give names to all the indexes!
* Use `swaplevel()` to swap two levels
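* Use `sort_index(level=...)` after swapping to sort the results (older `pandas` versions called this `sortlevel()`)
* Compute statistics by level (hint: `df.groupby(level='a').sum()`; older versions also accepted `df.sum(level='a')`)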
* Create a new dataframe without assigning the index. Use then `df.set_index([column_name_1, column_name_2])`
* Remove the index (`df.reset_index()`)

Make lab_05

## Data wrangling
Getting the data into the shape that we want is the single most time-consuming task in the life of the data scientist. Sometimes it can be the most frustrating.

So… do yourself a favor and pay attention to this chapter.

## Merge operations
By merging we mean combining different data sets by linking rows with one or more keys. The basic syntax is pretty simple:

In [None]:
df

In [None]:
df2 = pd.DataFrame({"province": ["FL", "NH", "ZH"], "population": ["100000", "200000", "300000"]})
df2

Let's say we want a dataset with year, literacy, province and population. We can create it from `df` and `df2`.

In [None]:
display(df, df2)

In [None]:
df.merge(df2)  # merge is smart! If there are overlapping names, it uses those for the merge

If the column names are different, you'll need to specify them explicitly:

In [None]:
df3 = pd.DataFrame({"PROVINCE": ["FL", "NH"],
                    "Population": ["100000", "200000"]})
df.merge(df3, left_on='province', right_on='PROVINCE')

What happened? Zuid Holland is gone!

By default `merge` does inner joins. If you want a different type of join, you can specify it:

In [None]:
df4 = pd.DataFrame({"province": ["FL", "NH", "UT"], 
                    "population": ["100000", "200000", "50000"]})
df.merge(df4, how='outer')

Check this out:

In [None]:
df5 = pd.DataFrame({"province": ["FL", "NH", "FL"], "population": 
                    ["100000", "200000", "50000"]})
df.merge(df5, how='outer')

This was a many-to-many merge: every matching row on the left is paired with every matching row on the right. The behavior is logical once you think about it, but it can still come as a surprise!
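If duplicated keys would be a mistake in your data, the `validate` argument of `merge` can turn the surprise into an error. A sketch with made-up frames `left` and `right`:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"province": ["FL", "FL", "NH"],
                     "year": [2013, 2014, 2013]})
right = pd.DataFrame({"province": ["FL", "NH", "FL"],
                      "population": ["100000", "200000", "50000"]})

merged = left.merge(right, how="outer")
# Each of the 2 "FL" rows on the left pairs with both "FL" rows on the right
n_fl_rows = (merged["province"] == "FL").sum()

try:
    # "many_to_one" requires the keys to be unique in the right DataFrame
    left.merge(right, how="outer", validate="many_to_one")
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

print(n_fl_rows, duplicates_detected)
```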

## Exercises

* Create two DataFrames that can be merged on multiple keys
* Merge them
* Create two DataFrames that can be merged on a key
* Make that key the index
* Merge them! (You might want to use: `left_index=True` and/or `right_index=True`)
* Pick the two DataFrames created in the first exercise. Convert the two key columns of one DataFrame into a multi-index
* Merge them!

`pandas` also provides the `join` method for merging via the index. Experiment with it a bit.

## Combining data with overlap
Sometimes some data is missing but can be "patched" with another dataset. Let's take a look:

In [None]:
serie_a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])
serie_b = pd.Series(np.arange(len(serie_a), dtype=np.float64),
                 index=['f', 'e', 'd', 'c', 'b', 'a'])

In [None]:
serie_a

In [None]:
serie_b

Let's say we want to update `serie_a` with the values from `serie_b`. The non-Pythonic way to do that is:

In [None]:
pd.Series(np.where(pd.isnull(serie_a), serie_b, serie_a), index=serie_a.index)

That's a bit verbose for something so simple. What about this:

In [None]:
serie_a.combine_first(serie_b)

## Exercise

* Create two DataFrames with overlapping indices, one without `NaN`'s and one with some `NaN`'s, and combine them.

## Stacking and unstacking
Sometimes the data is not in a format that suits further analysis (see the Tidy Data paper by Hadley Wickham). To help there, `pandas` has some neat features, among other things `stack` and `unstack`. Let's explore them with an example comparing the effect of two treatments on patients:

In [3]:
patients = pd.DataFrame([[None, 3], [16, 11], [1, 3]], 
                        columns=["trt_a", "trt_b"], 
                        index=["John Smith", "Jane Doe", "Mary Johnson"])
patients

|              | trt_a | trt_b |
|--------------|-------|-------|
| John Smith   | NaN   | 3     |
| Jane Doe     | 16.0  | 11    |
| Mary Johnson | 1.0   | 3     |


This structure makes it difficult to reason about the data and to add new variables. Let's say we want to add the weather as a variable. 

What would you do? Add 2 * (#different weather combinations) columns?

A better structure is the following:

In [4]:
patients.reset_index()

|   | index        | trt_a | trt_b |
|---|--------------|-------|-------|
| 0 | John Smith   | NaN   | 3     |
| 1 | Jane Doe     | 16.0  | 11    |
| 2 | Mary Johnson | 1.0   | 3     |


In [5]:
long_patients = pd.melt(patients.reset_index(), id_vars='index', var_name='treatment').dropna()
long_patients

|   | index        | treatment | value |
|---|--------------|-----------|-------|
| 1 | Jane Doe     | trt_a     | 16.0  |
| 2 | Mary Johnson | trt_a     | 1.0   |
| 3 | John Smith   | trt_b     | 3.0   |
| 4 | Jane Doe     | trt_b     | 11.0  |
| 5 | Mary Johnson | trt_b     | 3.0   |


We immediately see that adding the weather is easy:

In [None]:
long_patients['weather'] = [3, 10, 9, 1, 2]
long_patients

We can go back to where we were (but we must give up the weather) with `pivot`:

In [None]:
long_patients

In [None]:
long_patients.pivot(index="index", columns="treatment", values="value")
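The same round trip can be sketched with `stack` and `unstack`, which move labels between the columns and an inner index level. The frame `patients_wide` below is a copy of the patients example; `dropna()` is called explicitly because the default handling of missing values in `stack` differs between `pandas` versions:

```python
import numpy as np
import pandas as pd

patients_wide = pd.DataFrame([[np.nan, 3], [16, 11], [1, 3]],
                             columns=["trt_a", "trt_b"],
                             index=["John Smith", "Jane Doe", "Mary Johnson"])

# stack moves the column labels into an inner index level (wide -> long)
stacked = patients_wide.stack().dropna()

# unstack moves that inner level back into the columns (long -> wide)
round_trip = stacked.unstack()

print(stacked)
print(round_trip)
```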

## Removing duplicates
Removing duplicates is another extremely common operation in data analysis (the SQL equivalent is `SELECT DISTINCT ... FROM ...`):

In [None]:
df6 = pd.DataFrame({'literacy': {0: 0.2,
  1: 0.2,
  2: 0.5,
  3: 0.5,
  4: 0.5},
 'province': {0: 'FL', 1: 'FL', 2: 'NH', 3: 'NH', 4: 'ZH'},
 'year': {0: 2011, 1: 2011, 2: 2011, 3: 2012, 4: 2014}})
df6

In [None]:
df6.drop_duplicates()

Just as when working with SQL, duplicates might not be present when working with the full dataset but might arise when restricting our analysis to a subset of columns:

In [None]:
df6[["literacy", "province"]]

In [None]:
df6[["literacy", "province"]].drop_duplicates()

In [None]:
df6.drop_duplicates(subset=["literacy", "province"])

## Applying a function to a DataFrame
These DataFrames are all nice and dandy, but sometimes we need to transform the values. `apply` is a method that applies a function along an axis of a DataFrame. The objects passed to the function are Series:

In [None]:
df6.apply(len)

In [None]:
df6.apply(len, axis=1)

Besides functions that reduce a Series to a single value, we can also pass functions written with single elements in mind; they often work because arithmetic on a Series is applied element-wise:

In [None]:
def if_float_add_two(el):
    try:
        return el + 2
    except TypeError:  # e.g. a column of strings is returned unchanged
        return el
    
df6.apply(if_float_add_two)

In [None]:
def try_me(it, names):
    try:
        if it.name in names:
            return it + 2
        else:
            return np.repeat(it.iloc[0], len(it))
    except:
        return it
display(df)
df.apply(lambda x: try_me(x, ["year", "serie_aligned"]))

## Exercises

* Write a function that replaces the province's two-letter code with the province's name
* Create a new column in `df6` whose values are the resulting mappings (the province's names)


## Replacing values
Sometimes we wish to replace values from a DataFrame (or columns in the DataFrame): 

In [None]:
test_df = (
    pd.DataFrame(np.random.rand(10, 5))
    .mask(np.random.rand(10, 5) < 0.2, 999)  # replace roughly 20% of the entries with the sentinel 999
)
test_df

In the above example we immediately see that the `999`'s do not really belong to the dataset. `replace` to the rescue!

In [None]:
test_df.replace(999, {col: test_df[col].median() for col in test_df.columns})

## Exercises [bonus]

* The above method is suboptimal for what we want to do, because the `999`s still influence the median. Write a function that computes the mean, ignoring values larger than `x`
* Use the function to perform a more efficient replacement of `test_df`

In [None]:
def mean_ignore(col, max_val=100):
    return col[col <= max_val].mean()

test_df.mask(test_df == 999, other=mean_ignore, axis=1)

Make lab_06

## Plotting in pandas

`pandas` doesn't have a kitchen sink, but it can, among the myriad other things, plot. It uses `matplotlib` under the hood to do that, so the usual import holds:

In [None]:
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

The `plot` method of a DataFrame is essentially a wrapper around `plt.plot`. It makes basic plotting pretty easy:

In [None]:
sector = np.repeat(["automotive", "energy", "transportation", 
                    "infrastructure"], 5)
company = np.random.randint(0, 1000, size=len(sector))
revenue = np.random.random(size=len(sector)) * 10 ** 9
size = revenue * 10 **(-5) + np.random.random(size=len(sector)) *  10 ** 3.5
clients = pd.DataFrame({"sector": sector, "company": company, "revenue": revenue, "size": size})
clients.plot(x="size", y="revenue", kind='scatter');

Plotting also works with Series:

In [None]:
pd.Series(np.random.randn(10).cumsum(), 
          index=np.arange(0, 100, 10)).plot();

If it finds multiple columns, it tries to plot them all in the same plot (with the index as the `x` coordinate):

In [None]:
ts = pd.Series(np.random.randn(1000), 
               index=pd.date_range('1/1/2000', 
                                   periods=1000)).cumsum()
pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
             columns=list('ABCD')).cumsum().plot();

Barplots are also possible:

In [None]:
pd.Series(np.random.randint(10, 20, 10000)).value_counts().plot(kind='bar');

In [None]:
pd.Series(np.random.randint(10, 20, 10000)).value_counts().plot(kind='barh');

Stacked plots are also possible (and very useful when working with non-random data 😁):

In [None]:
(
    pd.DataFrame(np.random.randint(1, 5, 40).reshape(10, 4), columns=list("ABCD"))
    .plot(kind='bar', stacked=True)
);

Histograms and density plots are also a breeze to produce:

In [None]:
pd.Series(np.random.randint(0, 100, 10000)).hist(bins=30);

## Data aggregation and grouping
Grouping data is one of the most-used features in data analysis. SQL, for instance, has a very expressive and simple syntax to allow it:

```sql
SELECT 
  category, avg(price)
FROM
  inventory
GROUP BY
  category
```

While the syntax is simple and expressive in SQL, the variety of aggregate functions (like `sum, avg, std`) is somewhat limited. Writing new aggregate functions can be cumbersome (even though you can write such functions in Python for some databases such as Postgres). Here `pandas` really shines. But let's first see how grouping in `pandas` works:

In [None]:
clients.head()

In [None]:
clients.groupby("sector")['revenue'].agg(np.median)

The syntax is in principle pretty simple:

```python
(
    df
    .groupby(column or [columns])
    <select columns where you want to perform the aggregate>
    .agg(aggregate function)
)
```
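Filled in with concrete names, the template could look as follows. This is a sketch of the SQL query shown earlier, run against a made-up `inventory` table:

```python
import pandas as pd

# Hypothetical inventory table matching the SQL query above
inventory = pd.DataFrame({"category": ["toys", "toys", "books", "books"],
                          "price": [10.0, 20.0, 5.0, 7.0]})

avg_price = (
    inventory
    .groupby("category")   # GROUP BY category
    ["price"]              # the column to aggregate
    .agg("mean")           # avg(price)
)

print(avg_price)
```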

Some functions are readily available and there's no need to use `agg`:

In [None]:
(
    clients
    .groupby("sector")
    [['revenue', 'size']]
    .mean()
)

In [None]:
(
    clients
    .groupby("sector")
    [['revenue', 'size']]
    .mean()
    .stack()
    .to_frame()
)

Other useful methods are `count, min, max, median`, etc.

Another feature is that you don't need to specify an aggregation function but can get the groups as they are:

In [None]:
group_obj = clients.groupby("sector")[['revenue', 'size']]
for name, group in group_obj:
    print(name)
    print(group)
    break

We can also apply dictionary mapping when grouping. Let's say we have the following data set and we want to group by odd and even rows:

In [None]:
to_group = pd.DataFrame(np.random.randint(1, 5, 80).reshape(10, 8), columns=list("ABCDEFGH"))
to_group

This is easy to do with pandas if we have a mapping dictionary:

In [None]:
grouping_dict = dict([(el, "even") if el % 2 == 0 else (el, "odd") for el in range(10)])
grouping_dict

In [None]:
to_group.groupby(grouping_dict).mean()

The grouping criterion can also be determined using a function, which is called on each value of the object's index. The output of the function then becomes the grouping key:

In [None]:
to_group.groupby(lambda x: x % 2).mean()

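Note that `pandas` applied `groupby` to the rows. To group the columns instead, older `pandas` versions accept `axis=1`; that option is deprecated in recent versions, where transposing the DataFrame first achieves the same result.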

Grouping by index level is also a possibility. Let's say we have the following DataFrame:

In [None]:
group_index = pd.MultiIndex.from_arrays([np.repeat(["FL", "ZH", "NH"], 3), range(9)], names=["province", "day"])
to_group = pd.DataFrame(np.random.randint(1, 10, 90).reshape(10, 9), columns=group_index)
to_group

To group the data by province we use:

In [None]:
to_group.T.groupby(level='province').mean().T  # transpose, group the rows, transpose back

##  Custom aggregate functions

Besides using all NumPy and `pandas` functions that produce aggregate results, you can define your own functions:

In [None]:
def extremes_distance(ar):
    return ar.max() - ar.min()

clients.groupby("sector")[['revenue', 'size']].agg(extremes_distance)

Some other methods, unexpectedly, also work, even though they're not *per se* aggregate methods:

In [None]:
clients.groupby("sector")['revenue'].describe()

When needed (and desired), multiple aggregate functions can be passed:

In [None]:
clients.groupby("sector")[['revenue', 'size']].agg(["mean", "median", "std"])  # string names avoid warnings in recent pandas

Or we can pass specific functions for specific columns:

In [None]:
clients.groupby("sector").agg({"revenue": np.mean, "size": np.median})

Another useful bit: if you don't want the grouped-by keys to be the index of the returned DataFrame, use `as_index=False`:

In [None]:
clients.groupby("sector", as_index=False).agg({"revenue": np.mean, "size": np.median})
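The dictionary form reuses the column names for the output. When the result columns should get names of their own, keyword arguments of the form `new_name=(column, function)` are an alternative. A sketch on a made-up DataFrame `firms`:

```python
import pandas as pd

firms = pd.DataFrame({"sector": ["energy", "energy", "automotive", "automotive"],
                      "revenue": [10.0, 30.0, 5.0, 15.0],
                      "size": [1.0, 3.0, 2.0, 8.0]})

# Each keyword names an output column: (input column, aggregate function)
summary = firms.groupby("sector", as_index=False).agg(
    mean_revenue=("revenue", "mean"),
    median_size=("size", "median"),
)

print(summary)
```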

Besides `agg`, there is a `transform` method that applies a transformation to each group and then places the results in a Series or DataFrame with the same index as the original data. An example is worth a thousand words here:

In [None]:
clients['sector_avg_size'] = clients.groupby('sector')['size'].transform(np.mean)
clients

This is, at the end of the day, similar to a SQL window function!
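The difference with `agg` is easiest to see in the shape of the result. A small sketch with a made-up DataFrame `firms`:

```python
import pandas as pd

firms = pd.DataFrame({"sector": ["energy", "energy", "automotive"],
                      "size": [1.0, 3.0, 8.0]})

# agg collapses each group into a single row
per_group = firms.groupby("sector")["size"].agg("mean")

# transform broadcasts the group result back onto the original rows
per_row = firms.groupby("sector")["size"].transform("mean")

print(per_group)
print(per_row)
```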

`transform` allows for nice operations on datasets, like de-meaning:

In [None]:
demean = lambda ar: ar - ar.mean() 
clients['sector_demean_size'] = clients.groupby('sector')['size'].transform(demean)
clients

Another method which is extremely useful is `apply`. It goes a step further than `transform` because it basically accepts a DataFrame as input and tries to glue together the results:

In [None]:
def top_n(df, n, column):
    return df.sort_values(column, ascending=False)[:n]

clients.groupby("sector").apply(top_n, 2, "size")

Enjoy a brief moment of solitude and silence and try to understand what just happened.

What has happened here? The top_n function is called on each piece of the DataFrame, then the results are glued together using `pandas.concat`, labeling the pieces with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.

The group name can be disabled by using `group_keys=False`:

In [None]:
clients.groupby("sector", group_keys=False).apply(top_n, 2, "size")

## Quantile bucket analysis with cut and qcut
`cut` and `qcut` are two handy tools for performing bucket analysis. `cut` gives buckets of equal width (each spans the same range of values), while `qcut` gives buckets of equal size (each holds roughly the same number of observations):

In [None]:
distribution = np.random.randn(1000)

In [None]:
pd.cut(distribution, 4)[:10]

In [None]:
pd.cut(distribution, 4).value_counts()

`qcut` in action:

In [None]:
pd.qcut(distribution, 4)[:10]

In [None]:
pd.qcut(distribution, 4).value_counts()
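With random data the counts change on every run, so here is a deterministic sketch of the same contrast on a made-up, strongly skewed Series:

```python
import numpy as np
import pandas as pd

skewed = pd.Series(np.arange(100) ** 2)   # 0, 1, 4, ..., 9801

# Equal-width bins: most of the small values land in the first bin
equal_width = pd.cut(skewed, 4).value_counts().sort_index()

# Quantile bins: the edges adapt so that every bin holds 25 values
equal_count = pd.qcut(skewed, 4).value_counts().sort_index()

print(equal_width.tolist())
print(equal_count.tolist())
```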

These two tools combined with `groupby` and `apply` allow for some powerful quantile bucket analysis:

In [None]:
quantile_df = pd.DataFrame({'data1': np.random.randn(1000),
                            'data2': np.random.randn(1000)})
quantile_df['factor'] = pd.cut(quantile_df.data1, 4)
quantile_df.head(5)

We can then `groupby` the `factor` and use the function `get_stats` to perform our quantile bucket analysis:

In [None]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

pd.DataFrame(quantile_df.groupby('factor')['data2'].apply(get_stats)).unstack() 