<p><font size="6"><b>Plotnine: Introduction </b></font></p>


> *DS Data manipulation, analysis and visualisation in Python*  
> *December, 2017*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---


In [None]:
%matplotlib inline

import pandas as pd

# Plotnine

http://plotnine.readthedocs.io/en/stable/

* Built on top of Matplotlib, but providing
    1. High level functions
    2. Implementation of the [Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448), which became famous due to the `ggplot2` R package 
    3. The syntax is highly similar to the `ggplot2` R package
* Works well with Pandas

In [None]:
import plotnine as pn

## Introduction

We will use the Titanic example data set:

In [None]:
titanic = pd.read_csv('../data/titanic.csv')

In [None]:
titanic.head()

Let's consider the following question:
>*For each class on the Titanic, how many people survived and how many died?*

Hence, we need the *size* of the zeros (died) and ones (survived) groups of column `Survived`, further split by `Pclass`. In Pandas terminology:

In [None]:
survived_stat = titanic.groupby(["Pclass", "Survived"]).size().rename('count').reset_index()
survived_stat
# Remark: `rename` gives the resulting count column a name
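
To see what each step in this chain contributes, here is a minimal sketch on a small, made-up data frame. The `toy` frame and its values are hypothetical; only the column names mirror the Titanic data.

```python
import pandas as pd

# Hypothetical mini data set; only the column names mirror the Titanic data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

toy_stat = (
    toy.groupby(["Pclass", "Survived"])  # one group per (class, outcome) pair
       .size()                           # number of rows in each group
       .rename("count")                  # name the resulting Series
       .reset_index()                    # turn the group keys back into columns
)
print(toy_stat)
```

The result has one row per combination of `Pclass` and `Survived`, which is the long layout the plotting code further on relies on.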

Plotting this data as a bar chart with pure Pandas is only partly supported (e.g. through the `by` option of `plot`):

In [None]:
survived_stat.plot(x='Survived', y='count', kind='bar', by='Pclass')
## A possible other way of plotting this could be using groupby again:   
# survived_stat.groupby('Pclass').plot(x='Survived', y='count', kind='bar') # (try yourself by uncommenting)

but with mixed results...

Plotting libraries focussing on the **grammar of graphics** are really targeting these *grouped* plots. For example, the plotting of the resulting counts can be expressed in the grammar of graphics:

In [None]:
(pn.ggplot(survived_stat, 
           pn.aes(x='Survived', y='count', fill='factor(Survived)'))
    + pn.geom_bar(stat='identity', position='dodge')
    + pn.facet_wrap(facets='Pclass'))

Moreover, these `count` operations are embedded in the typical Grammar of Graphics packages and we can do these operations directly on the original `titanic` data set in a single coding step:

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='Survived', fill='factor(Survived)'))
    + pn.geom_bar(stat='count', position='dodge')
    + pn.facet_wrap(facets='Pclass'))

<div class="alert alert-info">

 <b>Remember</b>: 

 <ul>
    <li>The <b>Grammar of Graphics</b> is especially suitable for these so-called [`tidy`](http://vita.had.co.nz/papers/tidy-data.pdf) dataframe representations (check [here](#this_is_tidy) for more about `tidy` data)</li>
  <li>`plotnine` is a library that supports the Grammar of Graphics</li>
</ul>
<br>

</div>

## Building a plotnine graph

Building plots with plotnine is typically an iterative process. As illustrated in the introduction, a graph is set up by layering different elements on top of each other using the `+` operator. Wrapping the whole expression in parentheses `()` lets it span multiple lines in valid Python syntax.

#### data

* Bind the plot to a specific data frame using the `data` argument:

In [None]:
(pn.ggplot(data=titanic))

We haven't defined anything else, so only an empty *figure* is available.

#### aesthetics

 
* Define aesthetics (**aes**) by **selecting variables** used in the plot and linking them to presentation properties such as size, shape, color, etc. You can interpret this as: **how** the variable will influence the plotted objects/geometries.

The most important `aes` are: `x`, `y`, `alpha`, `color`, `colour`, `fill`, `linetype`, `shape`, `size` and `stroke`.

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare')))

#### geometry

* Still nothing plotted yet, as we have to define what kind of [**geometry**](http://plotnine.readthedocs.io/en/stable/api.html#geoms) will be used for the plot. The easiest is probably using points:

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point()
) 

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Starting from the code of the last figure, adapt the code in such a way that the `Sex` variable defines the **color** of the points in the graph. </li>
  <li>As both sex categories overlap, use an alternative geometry, the so-called `geom_jitter` </li>
</ul>
</div>

In [None]:
# %load _solutions/visualization_02_plotnine12.py

These are the basic elements to have a graph, but other elements can be added to the graph:

#### labels

* Change the [**labels**](http://plotnine.readthedocs.io/en/stable/api.html#Labels):

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point()
     + pn.xlab("Cabin class")
) 

#### facets

* Use the power of `groupby` and define [**facets**](http://plotnine.readthedocs.io/en/stable/api.html#facets) to group the plot by a grouping variable:

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point()
     + pn.xlab("Cabin class")
     + pn.facet_wrap('Sex')#, dir='v')
) 

#### scales

* Defining [**scale**](http://plotnine.readthedocs.io/en/stable/api.html#scales) for colors, axes,...

For example, a log-version of the y-axis could support the interpretation of the lower numbers:

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point() 
     + pn.xlab("Cabin class")
     + pn.facet_wrap('Sex')
     + pn.scale_y_log10()
) 

#### theme

* Changing the [**theme**](http://plotnine.readthedocs.io/en/stable/api.html#themes):

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point() 
     + pn.xlab("Cabin class")
     + pn.facet_wrap('Sex')
     + pn.scale_y_log10()
     + pn.theme_bw()
) 

or changing specific [theming elements](http://plotnine.readthedocs.io/en/stable/api.html#Themeables), e.g. text size:

In [None]:
(pn.ggplot(titanic,
           pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point() 
     + pn.xlab("Cabin class")
     + pn.facet_wrap('Sex')
     + pn.scale_y_log10()
     + pn.theme_bw()
     + pn.theme(text=pn.element_text(size=14))
) 

#### more...

* adding [**statistical derivatives**](http://plotnine.readthedocs.io/en/stable/api.html#stats)
* changing the [**plot coordinate**](http://plotnine.readthedocs.io/en/stable/api.html#coordinates) system

<div class="alert alert-info">

 <b>Remember</b>: 

 <ul>
    <li>Start with defining your `data`, `aes` variables and a `geometry`</li>
  <li>Further extend your plot with `scale_*`, `theme_*`, `xlab/ylab`, `facet_*`</li>
</ul>
<br>

</div>

## plotnine is built on top of Matplotlib

As plotnine is built on top of Matplotlib, we can still retrieve the Matplotlib `Figure` object from plotnine for further customization:

In [None]:
myplot = (pn.ggplot(titanic, 
                    pn.aes(x='factor(Pclass)', y='Fare'))
     + pn.geom_point()
) 

The trick is to use the `draw()` method of the plotnine plot object:

In [None]:
my_plt_version = myplot.draw()

In [None]:
my_plt_version.axes[0].set_title("Titanic fare price per cabin class")
ax2 = my_plt_version.add_axes([0.5, 0.5, 0.3, 0.3], label="ax2")
my_plt_version

<div class="alert alert-info" style="font-size:18px">

 <b>Remember</b>: 

Similar to the Pandas plotting above, a plotnine plot is backed by a Matplotlib `Figure`. Calling `draw()` returns that `Figure`.

</div>

## (OPTIONAL SECTION) Some more plotnine functionalities to remember...

**Histogram**: Getting the univariate distribution of `Age`

In [None]:
(pn.ggplot(titanic.dropna(subset=['Age']), pn.aes(x='Age'))
     + pn.geom_histogram(bins=30))

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Make a histogram of the age, grouped by the `Sex` of the passengers</li>
  <li>Make sure both graphs are underneath each other instead of next to each other to enhance comparison</li>

</ul>
</div>

In [None]:
# %load _solutions/visualization_02_plotnine22.py

**boxplot/violin plot**: Getting the univariate distribution of `Age` per `Sex`

In [None]:
(pn.ggplot(titanic.dropna(subset=['Age']), pn.aes(x='Sex', y='Age'))
     + pn.geom_boxplot())

Actually, a *violin plot* provides more insight into the distribution:

In [None]:
(pn.ggplot(titanic.dropna(subset=['Age']), pn.aes(x='Sex', y='Age'))
     + pn.geom_violin()
)

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Make a violin plot of the Age for each `Sex`</li>
  <li>Add `jitter` to the plot to see the actual data points</li>
  <li>Adjust the transparency of the jitter dots to improve readability</li>

</ul>
</div>

In [None]:
# %load _solutions/visualization_02_plotnine25.py

**regressions**

plotnine supports a number of statistical smoothing methods through the [`geom_smooth` function](http://plotnine.readthedocs.io/en/stable/generated/plotnine.stats.stat_smooth.html#plotnine.stats.stat_smooth).

The available methods are:
```
* 'auto'       # Use loess if (n<1000), glm otherwise
* 'lm', 'ols'  # Linear Model
* 'wls'        # Weighted Linear Model
* 'rlm'        # Robust Linear Model
* 'glm'        # Generalized linear Model
* 'gls'        # Generalized Least Squares
* 'lowess'     # Locally Weighted Regression (simple)
* 'loess'      # Locally Weighted Regression
* 'mavg'       # Moving Average
* 'gpr'        # Gaussian Process Regressor
```

Each of these methods is provided by an existing Python library and integrated in plotnine, so make sure to have these dependencies installed (read the error message!).
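
As a rough illustration of what `method='lm'` computes, the sketch below fits a straight line by ordinary least squares with `numpy.polyfit`. This is a stand-in chosen for brevity and not the routine plotnine calls internally; the data values are made up.

```python
import numpy as np

# Hypothetical, roughly linear data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# Least-squares fit of a first-degree polynomial: y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Fitted values along the line, comparable to the line a linear smoother draws
y_hat = slope * x + intercept
print(slope, intercept)
```

The other methods in the list replace this straight line with a different model, but the idea of drawing fitted values on top of the raw points stays the same.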

In [None]:
(pn.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']), pn.aes(x='Fare', y='Age', color="Sex"))
     + pn.geom_point()
     + pn.geom_rug(alpha=0.2)
     + pn.geom_smooth(method='lm')
)

In [None]:
(pn.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']), pn.aes(x='Fare', y='Age', color="Sex"))
     + pn.geom_point()
     + pn.geom_rug(alpha=0.2)
     + pn.geom_smooth(method='lm')
     + pn.facet_wrap("Survived")
     + pn.scale_color_brewer(type="qual")
)

# Need more plotnine inspiration? 

<div class="alert alert-info" style="font-size:18px">

 <b>Remember</b>(!)

<ul>
  <li>[plotnine gallery](http://plotnine.readthedocs.io/en/stable/gallery.html) and [great documentation](http://plotnine.readthedocs.io/en/stable/api.html)</li>
</ul>
<br>
Important resources to start from!

</div>

<a id='this_is_tidy'></a>

# What the f* is `tidy`?

If you're wondering what *tidy* data representations are, you can read the scientific paper by Hadley Wickham, http://vita.had.co.nz/papers/tidy-data.pdf. 

Here, we just introduce the main principle very briefly:

Compare:

#### un-tidy
        
| WWTP | Treatment A | Treatment B |
|:------|-------------|-------------|
| Destelbergen | 8.0  | 6.3 |
| Landegem | 7.5  | 5.2 |
| Dendermonde | 8.3  | 6.2 |
| Eeklo | 6.5  | 7.2 |

*versus*

#### tidy

| WWTP | Treatment | pH |
|:------|:-------------:|:-------------:|
| Destelbergen | A  | 8.0 |
| Landegem | A  | 7.5 |
| Dendermonde | A  | 8.3 |
| Eeklo | A  | 6.5 |
| Destelbergen | B  | 6.3 |
| Landegem | B  | 5.2 |
| Dendermonde | B  | 6.2 |
| Eeklo | B  | 7.2 |

This is sometimes also referred to as the *wide* versus *long* format for a specific variable. Plotnine (and other grammar of graphics libraries) works better on `tidy` data, as it better supports `groupby`-like operations!
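
Converting between the two layouts is a one-liner in Pandas. The sketch below rebuilds the un-tidy table from above and reshapes it with `melt`; the variable names `untidy` and `tidy` are chosen here for illustration.

```python
import pandas as pd

# The un-tidy table from above: one column per treatment
untidy = pd.DataFrame({
    "WWTP": ["Destelbergen", "Landegem", "Dendermonde", "Eeklo"],
    "Treatment A": [8.0, 7.5, 8.3, 6.5],
    "Treatment B": [6.3, 5.2, 6.2, 7.2],
})

# Stack the treatment columns into (Treatment, pH) pairs, one row per measurement
tidy = untidy.melt(id_vars="WWTP", var_name="Treatment", value_name="pH")

# Keep only the treatment letter ("A" or "B") as the category label
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "", regex=False)
print(tidy)
```

With the data in this shape, `Treatment` can be used directly as an aesthetic or a facet variable.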

<div class="alert alert-info" style="font-size:16px">

 <b>Remember:</b>
 
 <br><br>
 
 A tidy data set is set up as follows:
 
    <ul>
      <li>Each <code>variable</code> forms a <b>column</b> and contains <code>values</code></li>
      <li>Each <code>observation</code> forms a <b>row</b></li>
        <li>Each type of <code>observational unit</code> forms a <b>table</b>.</li>
    </ul>
</div>

