<img src="mmu_logo.png" style="height: 80px;" align=left>  

# Learning Objectives

By the end of this lesson, you should be able to:
- create plots in Python
- use the Seaborn package




In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import pandas as pd

# At the time of creating this material, there was a versioning issue 
# between seaborn and numpy that results in a FutureWarning. This does 
# not affect the results and will presumably be fixed in some update cycle 
# but creates an annoying warning message we don't want to see every time.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Seaborn
The base library for visualization in Python is `matplotlib`.  
`matplotlib` is designed to visualize *anything*, not just data. Because we're most interested in examining and presenting relationships between data, however, we will use a different library, `seaborn`. This library is specifically designed for statistical data visualization and provides a consistent and easy-to-use API.

## Relationships Between Continuous Variables
Visualizing the relationship between continuous variables is as simple as plotting the values of both variables for each data entry on the x- and y-axes of a plot.

### Scatter plots

In [None]:
tips = pd.read_csv("tips.csv")
tips.head()

Study the relationship between total bill and tip

In [None]:
sns.relplot(x="total_bill", y="tip", data=tips)

We may, of course, be interested in more than just the x- and y-values. We can use additional arguments to `relplot(...)` to distinguish data points.

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips)

Points are now colored differently depending on whether the entry in the dataset corresponds to a smoker or not. We can do the same for the size and style aesthetics as well.

In [None]:
sns.relplot(x="total_bill", y="tip", size="smoker", hue='smoker', data=tips)

In [None]:
sns.relplot(x="total_bill", y="tip", style="day", hue='day', data=tips)

The aesthetic mappings can be combined as desired to visualize up to 5 dimensions in our datasets via the `x`, `y`, `hue`, `style`, and `size` arguments.

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", size="day", style="time", data=tips)

Be warned that this will make plots extremely difficult to parse visually.

The `hue` and `size` aesthetics have been categorical so far, meaning that distinct colors and sizes were chosen for each possible, discrete value of the dataframe columns they were applied to. They can also be applied to continuous, numerical variables. In this case, the color palette will automatically be set to a gradient. We will see further on how to customize colors.

In [None]:
# your answer here...



In [None]:
# by default kind='scatter'

# your answer here...



### Line plots
By default, `seaborn` will create a scatterplot. In the case of time series, we may be interested in creating a line plot to better visualize trends. We can do this by simply adding a `kind="line"` argument (by default, this argument is `kind="scatter"`).

In [None]:
# A brief note on numpy's cumsum (cumulative sum)
import numpy as np 
  
in_arr = np.array([1,2,3,4,5]) 
    
out_sum = np.cumsum(in_arr)
out_sum

In [None]:
df = pd.DataFrame({
    "time": np.arange(500),
    "value": np.random.randn(500).cumsum()})

In [None]:
sns.relplot(x="time", y="value", kind="line", data=df)

By default, the dataframe will be sorted so that the x-values are in ascending order. This ensures that the line plot looks like a timeseries plot. This can, however, be disabled by setting `sort=False`. This could be useful, for example, if we are following the movement of an object or tracking how two variables change simultaneously through time.

In [None]:
import pandas as pd
df = pd.DataFrame(np.random.randn(500, 2).cumsum(axis=0), columns=["x", "y"])
df.head(10)

In [None]:
# With sort=False, points are connected in the order they appear in the dataframe

sns.relplot(x="x", y="y", sort=False, kind="line", data=df)

Line plots have the same aesthetic mapping possibilities as scatter plots (`hue`, `size`, and `style`), and they can also be combined in the same way. Notice how multiple lines are created and only points with identical mapped aesthetics are connected. That means, if we create a line plot that maps one variable to `hue` and another to `style`, we will end up with an individual line for each existing combination of those variables in our data.

In [None]:
# One random-walk series per (region, division) pair.
# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat.
groups = [("North", "A"), ("North", "B"), ("North", "C"),
          ("South", "A"), ("South", "B")]
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(500),
        "value": np.random.randn(500).cumsum(),
        "region": region, "division": division})
    for region, division in groups], ignore_index=True)

display(df)

sns.relplot(
    x="time", y="value", kind="line", hue="region", 
    style="division", data=df)

In [None]:
df.head()

In [None]:
# Using size instead of style
sns.relplot(x="time", y="value", kind="line", hue="region", size="division", data=df)
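
Which lines will be drawn can be anticipated by listing the combinations of the mapped variables that actually occur in the data. A small sketch with pandas, using a made-up frame (`toy` and its values are illustrative assumptions, not the data above):

```python
import pandas as pd

# Hypothetical long-format data: division "C" only exists in the North
toy = pd.DataFrame({
    "region":   ["North", "North", "North", "South", "South"],
    "division": ["A", "B", "C", "A", "B"],
    "value":    [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each distinct (region, division) pair present in the data gets its own line
# when one variable is mapped to hue and the other to style or size
combos = toy[["region", "division"]].drop_duplicates()
n_lines = len(combos)

print(combos)
print(n_lines)
```

Two regions and three divisions allow six combinations in principle, but only the five that exist in the data produce a line.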

If using the `style` parameter, we can also decide whether we want dashes, markers, or both.

In [None]:
# Two short random-walk series, combined with pd.concat
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(20),
        "value": np.random.randn(20).cumsum(),
        "region": region})
    for region in ["North", "South"]], ignore_index=True)
sns.relplot(x="time", y="value", kind="line", hue ='region', 
            style="region", markers=True, data=df)

In [None]:
sns.relplot(x="time", y="value", kind="line", style="region", 
            dashes=False, markers=True, data=df)

#### Aggregating Data
Often, we may have data with multiple measurements for the same data point, i.e. x-value. For example, we might have several temperature sensors in a device as a failsafe. `seaborn` can automatically aggregate y-values for identical x-values. By default, it plots the **mean** and the 95% confidence interval around this mean in either direction.

In [None]:
# fmri = sns.load_dataset("fmri")
fmri = pd.read_csv("data/fmri.csv")
fmri.head()

In [None]:
fmri.loc[(fmri["timepoint"] == 18)].head()

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri)

In [None]:
# Using sort=False. How does this compare with the plot above?

sns.relplot(x="timepoint", y="signal", kind="line", data=fmri, sort=False)

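Because `seaborn` uses bootstrapping to compute the confidence intervals and this is a time-consuming process, it may be better to either switch to the standard deviation (`ci="sd"`) or turn this off entirely and only plot the mean (`ci=None`). In seaborn 0.12 and later, the `ci` parameter is deprecated in favor of `errorbar`, so the equivalent settings there are `errorbar="sd"` and `errorbar=None`.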

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=fmri)

In [None]:
# by default estimator=np.mean

sns.relplot(x="timepoint", y="signal", kind="line", ci=None, data=fmri)
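
To make the cost of bootstrapping concrete, here is a rough sketch of a percentile bootstrap for the mean at a single x-value. It illustrates the general technique and is not seaborn's actual implementation; the function name `bootstrap_ci`, the sample values, and the number of resamples are assumptions:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of `values` (illustrative)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record each resample's mean
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # The interval is read off the percentiles of the bootstrap distribution
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Hypothetical repeated measurements taken at one timepoint
signal = np.array([0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.13, 0.10])
low, high = bootstrap_ci(signal)
print(signal.mean(), low, high)
```

A procedure like this has to be repeated for every x-value, which is why switching to the standard deviation or disabling the interval speeds up plotting on large datasets.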

We can also change our `estimator` to any aggregation function, such as `np.median(...)`, `np.sum(...)`, or even `np.max(...)`. If we want to turn off aggregation, we just set `estimator=None`. Note that this plots every measurement, so with several y-values per x-value the line jumps back and forth between them.

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", 
            estimator=np.mean, data=fmri)

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", 
            estimator=None, data=fmri)
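
Conceptually, the estimator is applied to the group of y-values found at each x-value. The pandas sketch below mimics that step on made-up data (the frame `toy` and its values are assumptions, and the `groupby` call stands in for what seaborn does internally):

```python
import pandas as pd

# Hypothetical data with three measurements per timepoint
toy = pd.DataFrame({
    "timepoint": [0, 0, 0, 1, 1, 1],
    "signal":    [1.0, 2.0, 6.0, 2.0, 4.0, 12.0]})

# An estimator collapses all y-values that share an x-value into one number
grouped = toy.groupby("timepoint")["signal"]
by_mean = grouped.mean()
by_median = grouped.median()
by_max = grouped.max()

print(pd.DataFrame({"mean": by_mean, "median": by_median, "max": by_max}))
```

The choice of estimator matters most when the measurements are skewed: here the single large value per timepoint pulls the mean well above the median.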

#### Plotting Dates
Because they're so ubiquitous, `seaborn` natively supports the date format and will automatically format plots accordingly.

In [None]:
pd.date_range("2017-1-1", periods=5)

In [None]:
# ISO-style dates (year-month-day) avoid day/month ambiguity
pd.date_range("2017-1-1", "2017-3-22")

In [None]:
df = pd.DataFrame({
    "time": pd.date_range("2017-1-1", periods=500),
    "value": np.random.randn(500).cumsum()})
df.head()

In [None]:
g = sns.relplot(x="time", y="value", kind="line", data=df)
g.fig.autofmt_xdate() # automatically format the date labels on the x-axis

### Showing multiple relationships with facets
We've emphasized in this tutorial that, while these functions can show several semantic variables at once, it's not always effective to do so. But what about when you do want to understand how a relationship between two variables depends on more than one other variable?

The best approach may be to make more than one plot. Because `relplot()` is based on the `FacetGrid`, this is easy to do. To show the influence of an additional variable, instead of assigning it to one of the semantic roles in the plot, use it to "facet" the visualization. This means that you make multiple axes and plot subsets of the data on each of them:

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)

You can also show the influence of two variables this way: one by faceting on the columns and one by faceting on the rows. As you start adding more variables to the grid, you may want to decrease the figure size. Remember that the size of a `FacetGrid` is parameterized by the height and aspect ratio of each facet.

In [None]:
sns.relplot(x="timepoint", y="signal", hue="subject", col="region",
           row="event", height=3, kind="line", estimator=None, data=fmri)
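
The layout of the resulting grid follows directly from the number of levels of the faceting variables. A minimal sketch on synthetic data (the frame `toy` and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data covering every combination of two categorical variables
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
    "group": np.tile(["a", "b", "c"], 20),           # 3 levels -> 3 columns
    "condition": np.repeat(["ctrl", "test"], 30)})   # 2 levels -> 2 rows

# height and aspect apply to each facet, not to the figure as a whole
g = sns.relplot(x="x", y="y", col="group", row="condition",
                height=2, aspect=1.2, data=toy)

# g.axes is a 2D array of matplotlib axes, indexed as [row, column]
print(g.axes.shape)
```

Because `height` is per facet, the overall figure grows with every added row or column, which is why smaller values are usually needed for larger grids.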

When you want to examine effects across many levels of a variable, it can be a good idea to facet that variable on the columns and then "wrap" the facets into the rows:

In [None]:
fmri_temp = fmri.copy()
fmri_temp = fmri_temp[fmri_temp["region"] == "frontal"]

In [None]:
# your answer here...


These visualizations, which are often called "lattice" plots or "small-multiples", are very effective because they present the data in a format that makes it easy for the eye to detect both overall patterns and deviations from those patterns. While you should make use of the flexibility afforded by `scatterplot()` and `relplot()`, always try to keep in mind that several simple plots are usually more effective than one complex plot.

## Relationships to Categorical Variables
We've already seen how we can show dependence on categorical variables with the various aesthetics in the previous section (`hue`, `size`, and `style`). Often, we may not have two continuous variables to relate to each other, though. For this, we use the `seaborn` function `catplot(...)` which can create multiple kinds of categorical plots.

### Categorical Scatter Plots
The simplest way to represent the relationship between continuous and categorical data is with a categorical scatter plot that represents the distribution of (continuous) values for each category. For this, we can make use of the default value `kind="strip"`.

In [None]:
# tips = sns.load_dataset("tips")
tips = pd.read_csv("data/tips.csv")
tips.head()

In [None]:
sns.catplot(x="day", y="total_bill", data=tips)

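By default, a strip plot adds a small amount of random "jitter" to the positions of the points along the categorical axis so that they overlap less. The `jitter` parameter controls the magnitude of jitter or disables it altogether: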

In [None]:
sns.catplot(x="day", y="total_bill", jitter=False, data=tips)

The second approach adjusts the points along the categorical axis using an algorithm that prevents them from overlapping. It can give a better representation of the distribution of observations, although it only works well for relatively small datasets. This kind of plot is sometimes called a "beeswarm" and is drawn in seaborn by `swarmplot()`, which is activated by setting `kind="swarm"` in `catplot()`:

In [None]:
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)

### Distribution Plots
Swarm plots are good for approximating distributions, but we often want to have an exact description of the data distribution. For this, we can use box plots and variants thereof.

In [None]:
sns.catplot(x="day", y="total_bill", kind="box", data=tips)

Boxplots encode valuable information about our distribution. For each subset of the data, i.e. each box, the following pieces of information are shown:
- The central line of each box represents the median value.
- The top and bottom of the boxes are the $3^{rd}$ and $1^{st}$ quartile, respectively.
    - This means that 25% of all values are below the bottom line and 25% are above the top line, i.e. 50% of all values are within the colored region.
- The whiskers denote the outlier limits. By default, they extend to the most extreme values that lie within 1.5 times the interquartile range (the height of the box) from the box edges. Any value between the whiskers is considered "normal".
- The points outside of the whiskers are outliers that may require special attention.
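
The quantities listed above can be computed by hand with NumPy. This is a small sketch on made-up numbers, assuming the common convention that whiskers reach the most extreme points within 1.5 times the interquartile range of the box:

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers: most extreme observations still within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the limits would be drawn as an individual outlier point
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [50]
```

Note that the whiskers end at actual data points (1 and 9 here), not at the computed limits themselves.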

The `hue` argument can be used to show additional, nested relationships.

In [None]:
sns.catplot(x="day", y="total_bill", kind="box", hue="sex", data=tips)

Note that `hue` assumes a categorical variable when used on `catplot(...)` and `seaborn` will therefore automatically convert numerical variables into categorical ones.

In [None]:
sns.catplot(x="day", y="total_bill", kind="box", hue="size", data=tips)

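When quantiles aren't enough, `seaborn` can also display a violin plot. This kind of plot computes a kernel density estimate of the values in each category and draws it as a mirrored curve, so the width of the "violin" shows where values are concentrated.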
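
A violin plot is requested with `kind="violin"`. The sketch below uses synthetic data so that it runs on its own; the frame `toy` and its values are assumptions that only borrow column names from the tips data:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical bills for two days, drawn from two different normal distributions
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "day": np.repeat(["Sat", "Sun"], 50),
    "total_bill": np.concatenate([rng.normal(20, 5, 50),
                                  rng.normal(25, 8, 50)])})

# One violin per category on the x-axis
g = sns.catplot(x="day", y="total_bill", kind="violin", data=toy)
```

The wider spread of the second group shows up as a taller, flatter violin.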

Like with line plots, we may be interested in summary statistics over our data. For this, we can use a bar plot. `seaborn` will compute a summary statistic, such as the mean, as well as confidence intervals for each individual category (denoted by the x-axis).

In [None]:
# titanic = sns.load_dataset("titanic")
titanic = pd.read_csv("data/titanic.csv")
titanic.head()
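
As a minimal sketch of such a bar plot, the example below uses a tiny, made-up set of passenger records (the frame `toy` is hypothetical and only mimics two columns of the Titanic data). With the default estimator, each bar's height is the mean of `survived` within its category, i.e. the survival rate:

```python
import pandas as pd
import seaborn as sns

# Hypothetical passenger records
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "survived": [0, 0, 0, 1, 1, 1, 1, 0]})

# Bar height = mean of `survived` per category; error bars show the confidence interval
g = sns.catplot(x="sex", y="survived", kind="bar", data=toy)

# The same summary statistic computed directly with pandas
rates = toy.groupby("sex")["survived"].mean()
print(rates)
```

Comparing the plot with the `groupby` result is a quick way to confirm what a bar actually represents.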

If we're just interested in counting the number of occurrences of a single variable, we can use `kind="count"`.

In [None]:
# Count the number of passengers by sex and class
# your answer here...


An alternative to a barplot is a "point plot", which connects groups. This can be used to track pseudo-timeseries data that may only have a few categorical time points, e.g. sales data for 5 years. Notice how it connects data subgroups with the same value of the variable mapped to the `hue` aesthetic (`sex`).
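
A point plot is requested with `kind="point"`. The following sketch uses a small, made-up frame (`toy` is hypothetical and only borrows column names from the Titanic data):

```python
import pandas as pd
import seaborn as sns

# Hypothetical records: two passengers per (class, sex) combination
toy = pd.DataFrame({
    "class": ["First", "First", "Second", "Second", "Third", "Third"] * 2,
    "sex": ["male"] * 6 + ["female"] * 6,
    "survived": [0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]})

# One connected series of points per level of the hue variable
g = sns.catplot(x="class", y="survived", hue="sex", kind="point", data=toy)
```

Each point is the mean of `survived` for one class, and the connecting lines make it easy to compare how the two groups change across classes.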

### Showing multiple relationships with facets
Just like `relplot()`, the fact that `catplot()` is built on a `FacetGrid` means that it is easy to add faceting variables to visualize higher-dimensional relationships:

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker", col="time",
           aspect=.6, kind="swarm", data=tips)

In [None]:
titanic_temp = titanic.copy()
titanic_temp = titanic_temp[titanic_temp["fare"] > 0]

In [None]:
g = sns.catplot(x="fare", y="survived", row="class", kind="box",
               orient="h", height=1.5, aspect=4, data=titanic_temp)
g.set(xscale="log")

## Visualizing the distribution of a dataset
When dealing with a set of data, often the first thing you'll want to do is get a sense for how the variables are distributed.

### Plotting univariate distributions
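The most convenient way to take a quick look at a univariate distribution in seaborn is the `distplot()` function. By default, this will draw a histogram and fit a kernel density estimate (KDE). Note that `distplot()` has been deprecated since seaborn 0.11 in favor of `histplot()` and `displot()`, so newer versions emit a deprecation warning when it is called.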

In [None]:
# your answer here...


Histograms are likely familiar, and a `hist` function already exists in matplotlib. A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.

To illustrate this, let's remove the density curve and add a rug plot, which draws a small vertical tick at each observation. You can make the rug plot itself with the `rugplot()` function, but it is also available in `distplot()`:

In [None]:
# Assumes `diamonds` is already loaded, e.g. via sns.load_dataset("diamonds")
sns.distplot(diamonds.price, kde=False, rug=True)
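
The binning that a histogram performs can be reproduced with NumPy. A small sketch on made-up values (the array and the choice of four bins are assumptions for illustration):

```python
import numpy as np

# Hypothetical observations
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the data range into 4 equal-width bins and count observations per bin
counts, edges = np.histogram(values, bins=4)

print(edges)   # [1. 3. 5. 7. 9.]
print(counts)  # [3 5 1 1]
```

Each bin includes its left edge and excludes its right edge, except for the last bin, which also includes the maximum value.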

### Plotting bivariate distributions
It can also be useful to visualize the bivariate distribution of two variables. The easiest way to do this in seaborn is to use the `jointplot()` function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables and the univariate (or marginal) distribution of each on separate axes.

In [None]:
sns.jointplot(x="depth", y="price", data=diamonds)

### Visualizing pairwise relationships in a dataset
To plot multiple pairwise bivariate distributions in a dataset, you can use the `pairplot()` function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:

In [None]:
diamonds.head(10)

In [None]:
# your answer here...
