# Tutorial: Plots

__The goal of this assignment is to create 5 plots based on the Titanic dataset.__

Plots are a very useful tool to explore datasets and to present information to others.

You will use [Plotnine](https://plotnine.readthedocs.io/en/latest/index.html), a Python port of R's `ggplot2` for 2D visualizations.

`ggplot2` is based on the [Grammar of Graphics](http://vita.had.co.nz/papers/layered-grammar.html), an algebra for plot components.

__Grade scale__: 10 points
- correct plot: 2 points
- incorrect plot: 0 points

__Further documentation__:
* https://ggplot2.tidyverse.org/reference/
* http://pbpython.com/python-vis-flowchart.html
* https://plotnine.readthedocs.io/en/latest/api.html
* https://www.kaggle.com/residentmario/grammar-of-graphics-with-plotnine-optional/#

![](https://i.imgur.com/UoIbtqI.png)

# Core

__VARIABLE DESCRIPTIONS__:

- __survival__        Survival (0 = No; 1 = Yes)
- __pclass__          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- __name__            Name
- __sex__             Sex
- __age__             Age
- __sibsp__           Number of Siblings/Spouses Aboard
- __parch__           Number of Parents/Children Aboard
- __ticket__          Ticket Number
- __fare__            Passenger Fare
- __cabin__           Cabin
- __embarked__        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
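Several of these variables are stored as codes rather than readable labels. As a rough sketch, the codes listed above could be translated with pandas as follows. The three rows of `toy` are invented for illustration and are not taken from the real dataset; the new column names `embarked_name` and `survived` are also hypothetical.

```python
import pandas as pd

# hypothetical rows that reuse the column names and codes listed above
toy = pd.DataFrame({
    "survival": [0, 1, 1],
    "pclass": [3, 1, 2],
    "embarked": ["S", "C", "Q"],
})

# translate the coded values into readable labels
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
toy["embarked_name"] = toy["embarked"].map(port_names)
toy["survived"] = toy["survival"].map({0: "No", 1: "Yes"})
```

Note that `map` leaves a missing value wherever a code has no entry in the dictionary, so unexpected codes are easy to spot afterwards.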

In [None]:
# import plotnine objects
from plotnine import *

# import pandas library
import pandas as pd

In [None]:
# load the dataset with pandas
df = pd.read_csv("titanic.csv.gz")

df.head()

# Examples

In [None]:
# ggplot needs the data source (df) and the aesthetic mappings (aes)
plot = ggplot(df, aes(x="age", y="fare", color="factor(survival)"))
plot += ggtitle("Relation between age, fare, survival")  # title
plot += xlab("Passenger age")  # x label
plot += ylab("Ticket fare")  # y label
plot += geom_point()  # as a scatter plot
plot += stat_smooth(method='lm')  # regression line
plot += scale_y_log10()  # scale y-axis with log scale
plot

# Questions

__IMPORTANT__: Your result should match the image included in the question!

__1. Create a bar plot that represents the distribution of the `embarked` variable.__

![](P1.png)

In [None]:
def Q1(df):
    plot = None
    ### BEGIN SOLUTION
    ### END SOLUTION

Q1(df)

__2. Create a jitter plot with a point size of 0.5 that represents the relation between the `sibsp`, `parch` and `survival` variables__
- __note__: be careful about the y-axis tick breaks!

![](P2.png)

In [None]:
def Q2(df):
    plot = None
    ### BEGIN SOLUTION
    ### END SOLUTION
    
Q2(df)

__3. Create a box plot with 'red' outlier color and variable width to show the relation between `pclass` (x-axis) and `age` (y-axis) variables__
- __note__: the outlier color must be 'red', not '#FF0000' or other alternatives
- __hint__: you can use the factor() function in the aes to group values by pclass

![](P3.png)

In [None]:
def Q3(df):
    plot = None
    ### BEGIN SOLUTION
    ### END SOLUTION
    
Q3(df)

__4. Create a stacked histogram with 20 bins that shows the distribution of passenger `age` (x) according to `survival` (fill)__

![](P4.png)

In [None]:
def Q4(df):
    plot = None
    ### BEGIN SOLUTION
    ### END SOLUTION
    
Q4(df)

__5. Create a heatmap plot that shows the number of passengers per `sex` (x-axis) and `pclass` (y-axis)__
- __hint__: you might have to create a new dataframe to aggregate the number of passengers
- __note__: name the aggregated variable 'count' and use geom_tile
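One possible way to build such an aggregated dataframe with pandas is sketched below on made-up data. The columns `color` and `shape` are hypothetical stand-ins, and other approaches (for example `value_counts`) would work as well.

```python
import pandas as pd

# hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "color": ["red", "red", "red", "blue", "blue"],
    "shape": ["round", "round", "square", "round", "square"],
})

# count the rows in each (color, shape) combination and
# store the result in a regular column named "count"
counts = toy.groupby(["color", "shape"]).size().reset_index(name="count")
```

The result has one row per combination of the two categories, which is the long format a tile-based geometry typically expects.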

![](P5.png)

In [None]:
def Q5(df):
    plot = None
    ### BEGIN SOLUTION
    ### END SOLUTION
    
Q5(df)