# Pandas III + Intro to Plotting

### Sneak peek:

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10
sns.set(style='ticks', context='talk')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
df = pd.read_csv('data/beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = [c for c in df.columns if c.startswith('review')]
df.head()

In [None]:
fig, ax = plt.subplots(figsize=(5, 10))
sns.countplot(hue='kind', y='stars', data=(df[review_cols]
                                           .stack()
                                           .rename_axis(['record', 'kind'])
                                           .rename('stars')
                                           .reset_index()),
              ax=ax, order=np.arange(0, 5.5, .5))
sns.despine()

### Groupby

The components of a groupby operation are to

1. Split a table into groups
2. Apply a function to each group
3. Combine the results

In pandas the first step looks like

```python
df.groupby( grouper )
```

`grouper` can be many things

- Series (or string indicating a column in `df`)
- function (to be applied on the index)
- dict : groups by *values*
- `level=[ names of levels in a MultiIndex ]`
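
As a rough sketch of these options, the toy frame below (hypothetical data, not the beer dataset) is grouped four different ways. Each call produces the same two groups, just under different keys.

```python
import pandas as pd

# Hypothetical toy data, not the beer dataset used in this notebook
toy = pd.DataFrame(
    {"style": ["ipa", "stout", "ipa", "stout"], "abv": [6.0, 8.0, 7.0, 9.0]},
    index=["a1", "b1", "a2", "b2"],
)

# String naming a column
by_name = toy.groupby("style")["abv"].mean()

# Series aligned with the frame's index
by_series = toy.groupby(toy["style"].str.upper())["abv"].mean()

# Function applied to each index label
by_func = toy.groupby(lambda label: label[0])["abv"].mean()

# Dict mapping index labels to group keys (rows are grouped by the dict's values)
by_dict = toy.groupby({"a1": "x", "a2": "x", "b1": "y", "b2": "y"})["abv"].mean()

print(by_name, by_series, by_func, by_dict, sep="\n\n")
```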

In [None]:
gr = df.groupby('beer_style')
gr

Nothing has been computed yet. pandas has only done some book-keeping to figure out which **keys** go with which **rows**. Keys are the things we've grouped by (each `beer_style` in this case).

In [None]:
# gr.<TAB>  (type "gr." and press Tab to list the available methods)

In [None]:
gr.groups

In [None]:
gr.groups.keys()

In [None]:
gr.get_group('Rye Beer')

There's a generic aggregation function:

In [None]:
gr.agg?

Which accepts some common operations as strings:

In [None]:
gr.agg('mean')

Or functions that can operate on Pandas or Numpy objects:

In [None]:
gr.agg(np.mean)

And for many common operations, there are also convenience functions:

In [None]:
gr.mean()

By default the aggregation functions get applied to all columns, but we can subset:

In [None]:
gr[review_cols].agg('mean')

`.` attribute lookup works as well.

In [None]:
gr.abv.agg('mean')

In [None]:
gr.abv.mean()
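
A minimal sketch, on hypothetical toy data, suggesting that the three spellings above are interchangeable for a simple reduction. A `lambda` stands in here for "any function that reduces a Series to a scalar".

```python
import pandas as pd

# Hypothetical toy data, not the beer dataset
toy = pd.DataFrame({"style": ["ipa", "ipa", "stout"], "abv": [6.0, 7.0, 9.0]})
grouped = toy.groupby("style")["abv"]

as_string = grouped.agg("mean")                # operation named by a string
as_function = grouped.agg(lambda s: s.mean())  # any callable that reduces a Series
as_method = grouped.mean()                     # convenience method

print(pd.concat([as_string, as_function, as_method], axis=1,
                keys=["string", "function", "method"]))
```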

### Example

Find the `beer_style`s with the greatest variation in `abv` (measured here by the standard deviation):

In [None]:
df.groupby('beer_style').abv

In [None]:
df.groupby('beer_style').abv.std()

In [None]:
df.groupby('beer_style').abv.std().sort_values(ascending=False)

### Some more complex examples

Multiple aggregations on one column

In [None]:
gr['review_aroma'].agg(['mean', np.std, 'count']).head()

Single aggregation on multiple columns

In [None]:
gr[review_cols].mean()

Multiple aggregations on multiple columns

In [None]:
gr[review_cols].agg(['mean', 'count', 'std'])

Hierarchical Indexes in the columns can be awkward to work with, so you can move a level to the Index with `.stack`:

In [None]:
multi = gr[review_cols].agg(['mean', 'count', 'std']).stack(level=0)
multi.head(10)
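
To make the reshaping concrete, here is a small sketch on hypothetical data. The column names mimic the review columns, but the values are made up.

```python
import pandas as pd

# Hypothetical toy data with two review-like columns
toy = pd.DataFrame({
    "style": ["ipa", "ipa", "stout", "stout"],
    "review_aroma": [4.0, 3.0, 5.0, 4.0],
    "review_taste": [3.5, 4.5, 4.0, 5.0],
})

# Columns form a two-level MultiIndex: (review column, statistic)
wide = toy.groupby("style").agg(["mean", "count"])

# Move the outer column level (the review column) into the row index
long = wide.stack(level=0)

print(wide.shape, long.shape)
print(long)
```

After stacking, each row is a (style, review column) pair and the remaining columns are the statistics.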

You can group by **levels** of a MultiIndex:

In [None]:
multi.groupby(level='beer_style')['mean'].agg(['min', 'max'])

Group by **multiple** columns

In [None]:
df.groupby(['brewer_id', 'beer_style']).review_overall.mean()

In [None]:
df.groupby(['brewer_id', 'beer_style'])[review_cols].mean()

### Example
Find the relationship between review length (the `text` column) and average `review_overall`.

In [None]:
df.text.str.len()

In [None]:
df.groupby(df.text.str.len())

In [None]:
df.groupby(df.text.str.len()).review_overall.mean()

In [None]:
# Series.corr() needs a second Series to correlate against, so this call raises a TypeError
df.groupby(df.text.str.len()).review_overall.mean().corr()

In [None]:
df.groupby(df.text.str.len()).review_overall.mean().reset_index()

In [None]:
df.groupby(df.text.str.len()).review_overall.mean().reset_index().corr()

In [None]:
df.groupby(df.text.str.len()).review_overall.mean().plot(style='.k', figsize=(10, 5))

<div class="alert alert-info">
  <b>Bonus exercise</b>
</div>

- Try grouping by the number of words
- Try grouping by the number of sentences

_Hint_: `str.count` accepts a regular expression...
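
One possible way to use the hint, sketched on hypothetical review texts. The two regular expressions are rough assumptions about what counts as a word or a sentence, not a definitive tokenizer.

```python
import pandas as pd

# Hypothetical review texts standing in for the `text` column
text = pd.Series([
    "Great beer. Would buy again!",
    "Too sweet",
    "Hoppy, bitter, dry. Nice?",
])

# Rough word count: runs of non-whitespace characters
n_words = text.str.count(r"\S+")

# Rough sentence count: runs of terminal punctuation
n_sentences = text.str.count(r"[.!?]+")

print(pd.DataFrame({"text": text, "words": n_words, "sentences": n_sentences}))
```

Either Series could then be passed to `df.groupby(...)` in the same way as `df.text.str.len()`.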

### Example

Which **brewer** (`brewer_id`) has the largest gap between the min and max `review_overall` for two of their beers?

_Hint_: You'll need to do this in two steps:

1. Find the average `review_overall` by `brewer_id` and `beer_name`.
2. Find the difference between the max and min by brewer (remember `.groupby(level=)`).

In [None]:
avg = (df.groupby(['brewer_id', 'beer_name'])
       .review_overall
       .mean())
avg

In [None]:
extrema = avg.groupby(level='brewer_id').agg(['min', 'max'])
extrema

In [None]:
difference = extrema['max'] - extrema['min']
difference.sort_values(ascending=False)

### Example

Create a more aggregated "kind" of beer, less detailed than `beer_style`

In [None]:
style = df.beer_style.str.lower()
style.head()

In [None]:
kinds = ['ipa', 'apa', 'amber ale', 'rye', 'scotch', 'stout', 'barleywine', 'porter', 'brown ale', 'lager', 'pilsner',
         'tripel', 'bitter', 'farmhouse', 'malt liquor', 'rice']

In [None]:
expr = '|'.join(['(?P<{name}>{pat})'.format(pat=kind, name=kind.replace(' ', '_')) for kind in kinds])
expr

In [None]:
beer_kind = (style.replace({'india pale ale': 'ipa',
                            'american pale ale': 'apa'})
            .str.extract(expr, expand=False).fillna('').sum(1)
            .str.lower().replace('', 'other'))
beer_kind.head()

In [None]:
df.groupby(beer_kind).review_overall.mean().sort_values(ascending=False).head()

In [None]:
df.groupby(['brewer_id', beer_kind]).review_overall.mean()

Finding the number of beers of each kind by brewer:

In [None]:
df.groupby(['brewer_id', beer_kind]).beer_id.nunique().unstack(1).fillna(0).head()

We've seen a lot of permutations among number of groupers, number of columns to aggregate, and number of aggregators.


In fact, `.agg`, which returns one row per group, is just one way to combine the results. The three ways are

- `agg`: one row per group
- `transform`: identically shaped output as input
- `apply`: anything goes
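
A small sketch of how the three differ in output shape, using hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: three reviews across two styles
toy = pd.DataFrame({"style": ["ipa", "ipa", "stout"], "score": [3.0, 5.0, 4.0]})
grouped = toy.groupby("style")["score"]

reduced = grouped.agg("mean")                  # one value per group
broadcast = grouped.transform("mean")          # one value per original row
top = grouped.apply(lambda s: s.nlargest(1))   # shape decided by the function

print(reduced, broadcast, top, sep="\n\n")
```

`reduced` is indexed by the group keys, `broadcast` keeps the original row index, and `top` ends up with a two-level index of (group key, original row label).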

### Transform

The combined `Series`/`DataFrame` has the same shape as the input.

For example, say you want to center the reviews by subtracting the mean.

In [None]:
df.head()

In [None]:
def de_mean(reviews):
    s = reviews - reviews.mean()
    return s

In [None]:
de_mean(df.review_overall)

We can do this at the *reviewer* level (one group per `profile_name`) with `groupby` and `transform`.

In [None]:
df['review_overall_demeaned'] = df.groupby('profile_name').review_overall.transform(de_mean)

This uses the *group* means instead of the overall means

In [None]:
df[['profile_name','review_overall','review_overall_demeaned']].sort_values('profile_name').head(10)
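
As a sanity check on what the transform does, the sketch below uses two hypothetical reviewers and confirms that, after de-meaning, each reviewer's scores average to zero while the row index is unchanged.

```python
import pandas as pd

# Hypothetical reviewers and scores, not the beer dataset
toy = pd.DataFrame({
    "profile_name": ["ann", "ann", "bob", "bob"],
    "review_overall": [3.0, 5.0, 2.0, 3.0],
})

# Subtract each reviewer's own mean from their scores
demeaned = (toy.groupby("profile_name")["review_overall"]
            .transform(lambda s: s - s.mean()))

# Re-grouping the result shows every group mean is now zero
group_means = demeaned.groupby(toy["profile_name"]).mean()

print(demeaned)
print(group_means)
```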

### Apply

- `.apply()` can return all sorts of things, doesn't have to be the same shape...
- Lots of uses, too many to go into...

In [None]:
def something(x):
    return x['review_appearance'].max() - x['review_aroma'].min()

In [None]:
df.groupby('beer_style').apply(something)

Or more succinctly as a `lambda` function:

In [None]:
df.groupby('beer_style').apply(lambda x: x['review_appearance'].max() - x['review_aroma'].min())

## Matplotlib

- Tons of features
- Just scratching the surface

Check out [the tutorials](http://matplotlib.org/users/beginner.html)

In [None]:
from IPython import display
display.HTML('<iframe src="http://matplotlib.org/users/beginner.html" height=500 width=1024></iframe>')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()

A single series is interpreted as y values, so x is just the index...

In [None]:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot. 

In [None]:
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
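
The format string is shorthand for keyword arguments. As a sketch, the two calls below should style their lines the same way; the second data series is shifted by 2 only so that the points do not overlap.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# 'r' = red, 'o' = circle markers; no line style is given, so no line is drawn
(short_form,) = ax.plot(x, y, "ro")

# The same styling spelled out with keyword arguments
(long_form,) = ax.plot(x, [v + 2 for v in y],
                       color="red", marker="o", linestyle="None")
```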

In [None]:
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])

You can enter multiple series at once...

In [None]:
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')

Lots of `keyword` properties...

In [None]:
np.random.seed(5)
plt.plot(np.arange(10), np.random.rand(10), linewidth=5, alpha=.3)

#### Overlaying plots

In [None]:
np.random.seed(5)
plt.plot(np.arange(10), np.random.rand(10))
plt.plot(np.arange(10), np.random.rand(10))

#### Multiple plots

In [None]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
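
The same figure can also be built with matplotlib's object-oriented interface, which some people find easier to follow once there are several subplots. This is an alternative sketch, not a replacement for the cell above.

```python
import matplotlib.pyplot as plt
import numpy as np

def f(t):
    # Damped cosine, as in the cell above
    return np.exp(-t) * np.cos(2 * np.pi * t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

# Two rows, one column; `axes` is an array holding one Axes per subplot
fig, axes = plt.subplots(nrows=2, ncols=1)
axes[0].plot(t1, f(t1), "bo", t2, f(t2), "k")
axes[1].plot(t2, np.cos(2 * np.pi * t2), "r--")
```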

#### Types of axes

In [None]:
# make up some data in the interval ]0, 1[
y = np.random.normal(loc=0.5, scale=0.4, size=1000)
y = y[(y > 0) & (y < 1)]
y.sort()
x = np.arange(len(y))

# plot with various axes scales
plt.figure(1, figsize=(10,5))

# linear
plt.subplot(121)
plt.plot(x, y)
plt.yscale('linear')
plt.title('linear')
plt.grid(True)


# log
plt.subplot(122)
plt.plot(x, y)
plt.yscale('log')
plt.title('log')
plt.grid(True)

The best way to learn is [the gallery](http://matplotlib.org/gallery.html)

In [None]:
display.HTML('<iframe src="http://matplotlib.org/gallery.html" height=500 width=1024></iframe>')

# Plotting with Pandas

matplotlib is a relatively *low-level* plotting package compared to others. By design, it makes very few assumptions about what constitutes good layout, but it has a lot of flexibility to allow the user to completely customize the look of the output.

On the other hand, Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.

In [None]:
normals = pd.Series(np.random.normal(size=10))
normals.plot()

In [None]:
normals.cumsum().plot()

Similarly, for a DataFrame:

In [None]:
variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                          'gamma': np.random.gamma(1, size=100), 
                          'poisson': np.random.poisson(size=100)})
variables.cumsum(0).plot()

As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for `plot`:

In [None]:
variables.cumsum(0).plot(subplots=True)

Or, we may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space:

In [None]:
variables.cumsum(0).plot(secondary_y='normal', grid=False)

(Note that ["friends don't let friends use two y-axes"](https://kieranhealy.org/blog/archives/2016/01/16/two-y-axes/), but we're just showing some examples here...)

If we would like a little more control, we can use matplotlib's `subplots` function directly, and manually assign plots to its axes:

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i,var in enumerate(['normal','gamma','poisson']):
    variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum')
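
The two levels also mix in the other direction: the pandas `.plot` method returns the matplotlib `Axes` it drew on, so the usual matplotlib methods can be used to tweak the result. A minimal sketch with made-up random data:

```python
import numpy as np
import pandas as pd

# Hypothetical random walk, seeded so the example is repeatable
rng = np.random.default_rng(0)
walk = pd.Series(rng.normal(size=50)).cumsum()

ax = walk.plot(title="random walk")        # returns a matplotlib Axes
ax.set_xlabel("step")
ax.set_ylabel("cumulative sum")
ax.axhline(0, color="gray", linewidth=1)   # plain matplotlib call on the same Axes
```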

### Bar plots

Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the `plot` method with a `kind='bar'` argument.

For this series of examples, let's load up the Titanic dataset:

In [None]:
titanic = pd.read_excel("data/titanic.xls", "titanic")
titanic.head()

In [None]:
titanic.groupby('pclass').survived.sum().plot(kind='bar')

In [None]:
titanic.groupby(['sex','pclass']).survived.sum().plot(kind='barh')

In [None]:
death_counts = pd.crosstab([titanic.pclass, titanic.sex], titanic.survived.astype(bool))
death_counts.plot(kind='bar', stacked=True, color=['black','gold'], grid=False)

Or if we wanted to see survival _rate_ instead:

In [None]:
death_counts.div(death_counts.sum(1).astype(float), axis=0).plot(kind='barh', stacked=True, color=['black','gold'])
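
The row-wise division above turns counts into proportions within each group. As a sketch on hypothetical passengers (not the Titanic data), the same result should also be available from `pd.crosstab` itself through its `normalize='index'` option:

```python
import pandas as pd

# Hypothetical passengers, not the Titanic data loaded above
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3, 3],
    "survived": [1, 1, 0, 0, 0, 0, 1],
})

counts = pd.crosstab(toy["pclass"], toy["survived"].astype(bool))

# Divide each row by its total so that every row sums to 1
rates = counts.div(counts.sum(axis=1), axis=0)

# Same proportions, computed by crosstab directly
rates_direct = pd.crosstab(toy["pclass"], toy["survived"].astype(bool),
                           normalize="index")

print(rates)
```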

### Histograms

Frequently it is useful to look at the *distribution* of data before you analyze it. Histograms are a sort of bar graph that displays relative frequencies of data values; hence, the y-axis is always some measure of frequency. This can either be raw counts of values or scaled proportions.

For instance, fare distributions aboard the Titanic:

In [None]:
titanic.fare.hist()

In [None]:
titanic.fare.hist(grid=False)

In [None]:
titanic.fare.hist(grid=False, bins=30)

In [None]:
titanic.fare.dropna().plot(kind='kde', xlim=(0,600))

In [None]:
titanic.fare.hist(bins=30, density=True, color='steelblue')
titanic.fare.dropna().plot(kind='kde', xlim=(0,600), style='r--')
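
To illustrate the difference between raw counts and scaled proportions, the sketch below bins some hypothetical skewed data with NumPy. The exponential distribution is only an assumption chosen to look loosely fare-like.

```python
import numpy as np

# Hypothetical skewed data, seeded so the example is repeatable
rng = np.random.default_rng(0)
values = rng.exponential(scale=30.0, size=1000)

# Raw counts per bin
counts, edges = np.histogram(values, bins=30)

# Density scaling: bar heights are chosen so that the total bar area is 1
density, _ = np.histogram(values, bins=30, density=True)
widths = np.diff(edges)

print(counts.sum())               # number of observations
print((density * widths).sum())   # total area under the histogram
```

Scaling to unit area is what puts a histogram on the same vertical scale as a kernel density estimate.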

### Boxplots

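A different way of visualizing the distribution of data is the boxplot, which is a display of common quantiles. The box spans the quartiles; the whiskers mark either fixed percentiles (such as the lower and upper 5 percent values) or, as in matplotlib's default, the most extreme points within 1.5 times the interquartile range of the box.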

In [None]:
titanic.boxplot(column='fare', by='pclass', grid=False)
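
The quantities a boxplot draws can be computed by hand. This sketch uses a small hypothetical sample and the 1.5 times interquartile range whisker rule, which is matplotlib's default; other tools may place the whiskers differently.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
whisker_low = values[values >= low_fence].min()
whisker_high = values[values <= high_fence].max()

# Anything beyond the fences is drawn as an individual point
outliers = values[(values < low_fence) | (values > high_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```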

One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.

In [None]:
bp = titanic.boxplot(column='age', by='pclass', grid=False)
for i in [1,2,3]:
    y = titanic.age[titanic.pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y.values, 'r.', alpha=0.2)

### Scatter plots

In [None]:
df.head()

In [None]:
plt.scatter(df.abv, df.review_overall)
plt.xlabel('ABV')
plt.ylabel('Score')

In [None]:
plt.scatter(df.abv, df.review_overall, s=np.sqrt(df.review_palate*150), alpha=0.5)
plt.xlabel('ABV')
plt.ylabel('Score')

In [None]:
plt.scatter(df.abv, df.review_overall, alpha=0.5, c=df.review_palate, cmap='hot')
plt.xlabel('ABV')
plt.ylabel('Score')

In [None]:
jittered_df = df[review_cols] + (np.random.rand(*df[review_cols].shape) - 0.5)
pd.plotting.scatter_matrix(jittered_df, figsize=(12,8), diagonal='kde')

### Lots more info on Pandas plotting in [the docs](http://pandas.pydata.org/pandas-docs/stable/visualization.html)

## So many plotting libraries!

In [None]:
display.HTML('<iframe src="https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/" width=1024 height=500></iframe>')

## Exercise 5 - "Choose your own adventure" workshop

1. Grab the data of your choice
    - Can't think of anything? [GHDx](http://ghdx.healthdata.org/)
2. Load it into a Pandas `DataFrame`
3. Compute some summary statistics, taking advantage of e.g. `.groupby()`
4. Create some cool plots

## References

Slide materials inspired by and adapted from [Chris Fonnesbeck](https://github.com/fonnesbeck/statistical-analysis-python-tutorial) and [Tom Augspurger](https://github.com/TomAugspurger/pydata-chi-h2t)