# Matplotlib & Seaborn

## Overview

Matplotlib is a Python 2D plotting library built on top of Python and NumPy.
More about Matplotlib can be found in its [documentation](http://matplotlib.org/contents.html).

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Here is the [documentation](http://seaborn.pydata.org/index.html).

To display plots inline in an IPython notebook, we need to run the following line of code first.

In [None]:
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

For today's lecture, we will use the IMDB dataset scraped by Sun Chuan, who is one of our graduates from Bootcamp 6.

He also uploaded the dataset to Kaggle so you can check it out [here](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset). Let's load the data first.

In [None]:
df = pd.read_csv('movie_metadata.csv')

In [None]:
df.head()

You might see `...` in the output because there are too many columns to display. To get an overview of the whole dataset, check the output of `df.columns`, which lists every column name.

In [None]:
df.columns

We can also tell pandas to display a larger number of columns without truncating them.

In [None]:
pd.set_option('display.max_columns', 50)
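
If you would rather not change the setting for the whole session, pandas also offers `pd.option_context`, which applies an option only inside a `with` block. A minimal sketch, using a small made-up frame (`wide` is a hypothetical name) rather than the movie data:

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame with 30 columns, standing in for the movie data.
wide = pd.DataFrame(np.zeros((3, 30)), columns=[f"col_{i}" for i in range(30)])

before = pd.get_option("display.max_columns")

# The option is changed only inside the with-block ...
with pd.option_context("display.max_columns", 50):
    inside = pd.get_option("display.max_columns")
    print(wide.head())

# ... and restored to its previous value afterwards.
after = pd.get_option("display.max_columns")
print(before, inside, after)
```

This is handy when only one cell needs the wider display.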

In [None]:
df.head()

In [None]:
df.shape

`describe()` summarizes the numeric columns and excludes missing values by default.

In [None]:
df.describe()
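
To see what excluding missing values means in practice, here is a small sketch on a made-up column (the frame `toy` is hypothetical, not part of the movie data). The `count` row reports only the non-missing entries, and the mean is computed over those entries alone:

```python
import numpy as np
import pandas as pd

# Four rows, one of which is missing.
toy = pd.DataFrame({"score": [6.0, 7.0, np.nan, 8.0]})

summary = toy.describe()
count = summary.loc["count", "score"]   # number of non-missing values
mean = summary.loc["mean", "score"]     # mean of 6, 7 and 8 only
n_missing = toy["score"].isna().sum()   # how many values were skipped

print(count, mean, n_missing)
```

Comparing `count` with `df.shape[0]` is a quick way to spot columns with many missing values.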

In [None]:
df['language'].value_counts()

# Histogram

In [None]:
plt.hist(df['imdb_score'])

To explore the available parameters, place your cursor inside the function's parentheses and press **Shift+Tab**.

For example, we can change the color and the number of bins. For the color, you can type a name such as `"blue"`, or `"b"` for short. Hex color codes are also accepted: pick a color from this [website](http://www.color-hex.com/) and paste its hexadecimal code.

In [None]:
plt.hist(df['imdb_score'], bins=20, color="#666699", orientation="horizontal")
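
Besides drawing the bars, `plt.hist` also returns the bin counts, the bin edges, and the drawn patches, which helps when checking what the `bins` parameter did. The sketch below uses randomly generated scores as a stand-in for `imdb_score` (an assumption made so that it runs without the CSV file) and selects the non-interactive `Agg` backend so it also works outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
scores = rng.normal(loc=6.5, scale=1.0, size=500)  # made-up stand-in for imdb_score

counts, edges, patches = plt.hist(scores, bins=20, color="#666699")

print(len(counts), len(edges))  # 20 bins are delimited by 21 edges
print(int(counts.sum()))        # every observation falls into exactly one bin
plt.close("all")
```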

### Exercise 1

- Create a histogram of the budget column. Did you encounter an error? See whether you can fix it by following this [Stack Overflow link](http://stackoverflow.com/q/20656663).

- How does the graph look? Is there any way to improve it? What does the x-axis stand for?
- All the code in the same cell contributes to the same plot, similar to adding layers in ggplot2.
- Type `plt.` and press Tab to see the available functions.

In [None]:
#### Your code here
plt.figure(figsize=(12,6)) # This line changes the size of the plot. The width and height are in inches.


- A pandas dataframe also provides plotting functions. They call the matplotlib library behind the scenes.
- You can check all the available plots in the documentation [here](http://pandas.pydata.org/pandas-docs/version/0.19.2/visualization.html#visualization).
- To make the plot look nicer, we can import the seaborn package here. In seaborn 0.8 and later, importing alone no longer restyles plots; call `sns.set()` to apply the seaborn defaults.

In [None]:
import seaborn as sns
np.log(df['budget']).plot.hist()
plt.xlabel('log of budget')
plt.ylabel('count')
plt.title('Histogram of budget', fontsize=20)
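
One caveat with the log transform used above: the logarithm is only defined for positive numbers, so `np.log(0)` evaluates to `-inf`. Whether the real budget column contains zeros has not been checked here, so the sketch below uses made-up values. It also swaps in a base-10 logarithm instead of the natural logarithm used above, because the axis is then easier to read:

```python
import numpy as np
import pandas as pd

# Made-up budgets, including a zero and a missing value.
budget = pd.Series([0.0, 1e4, 1e6, 1e8, np.nan])

# Keep strictly positive values; NaN > 0 is False, so NaN is dropped as well.
positive = budget[budget > 0]

# Base-10 logarithm: 4 means 10^4, 8 means 10^8.
log_budget = np.log10(positive)
print(log_budget.tolist())
```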

# Scatterplot
Scatterplots are useful for bivariate analysis. We can check the relationship between two columns.
Suppose we want to figure out whether there is a relationship between the gross income and the budget.

In [None]:
plt.scatter(df['budget'], df['gross'])
plt.xlabel('Budget')
plt.ylabel('Gross Income')

Here is a way to plot it using the plotting function from pandas.

In [None]:
df.plot.scatter(x='budget', y='gross')
plt.xlabel('Budget')
plt.ylabel('Gross Income')

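We can see that a few outliers squeeze the rest of the points into one corner, which makes the graph less useful. Let's remove them from our dataframe using the `apply` function: we keep only the rows in which both columns lie within 3 standard deviations of their column mean.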

In [None]:
scatter_df = df[['gross', 'budget']]
scatter_df = scatter_df[scatter_df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

In [None]:
plt.scatter(scatter_df['budget'], scatter_df['gross'])
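
The filter above can be hard to read in one line. The following sketch spells out the same 3-standard-deviation rule step by step on a small made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# 20 ordinary rows plus one row with an extreme budget.
toy = pd.DataFrame({
    "gross": np.arange(1.0, 22.0),
    "budget": [10.0] * 20 + [1000.0],
})

# z-score of each value within its own column
z = (toy - toy.mean()) / toy.std()

# Keep a row only if every column is within 3 standard deviations.
keep = (z.abs() < 3).all(axis=1)
filtered = toy[keep]

print(len(toy), len(filtered))
```

Note that a comparison involving `NaN` is `False`, so rows with missing values are dropped by this filter as well.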

Sometimes it is interesting to look at the outliers themselves. In the previous example we removed them; to keep only the outliers instead, we negate the boolean mask with the **~** operator. We can then merge the outlier dataframe with the original one to recover the other features, and sort by budget in descending order.

In [None]:
outliers = df[['gross', 'budget']].dropna()
outliers = outliers[~outliers.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
outliers.merge(df)[['gross', 'budget', 'movie_title']].sort_values(by='budget', ascending=False)
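
As an alternative to merging, the boolean mask can be computed once and applied directly to the full dataframe, since the mask and the dataframe share the same index. This is only a sketch on made-up data (`toy`, `within`, and the movie titles are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "movie_title": [f"movie_{i}" for i in range(21)],
    "gross": np.arange(1.0, 22.0),
    "budget": [10.0] * 20 + [1000.0],
})

# Boolean mask: True where both numeric columns are within 3 standard deviations.
num = toy[["gross", "budget"]]
within = (((num - num.mean()) / num.std()).abs() < 3).all(axis=1)

inliers = toy[within]
outliers = toy[~within]  # ~ flips True and False

print(outliers.sort_values(by="budget", ascending=False))
```

With missing values in the data, `~within` would also select the rows containing `NaN`, which is why the cell above calls `dropna()` first.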

Next, let's check out whether there is a relationship between imdb_score and gross income. 

In [None]:
score_df = df[['gross', 'imdb_score']].dropna()
# score_df = score_df[score_df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

In [None]:
score_df.plot.scatter('gross', 'imdb_score')

### Exercise 2
- How is gross income related to director Facebook likes? 
- How is the imdb_score related to num_critic_for_reviews?

In [None]:
#### Your code here


# Barplot

A bar plot is often used to visualize a value, such as a count or an average, for each class of a categorical feature. It makes the differences between the classes easy to compare.

In [None]:
plt.figure(figsize=(12,6))
df.groupby('country')['imdb_score'].median().sort_values(ascending=False).plot.bar()

In [None]:
plt.figure(figsize=(12,6))
df.groupby('country')['imdb_score'].median().sort_values(ascending=False).head(10).plot.bar()

In [None]:
df_clean = df.dropna()
bar_df = df_clean.groupby('title_year')[['budget']].mean().tail(10)
bar_df.head()

In [None]:
bar_df.index = bar_df.index.astype(np.int16)
bar_df.plot.bar()

### Genre
- We also want to see how the features change across genres, but this is a little tricky. The genres in the `genres` column are separated by the `|` symbol, and a movie can have more than one genre, so we need to duplicate each row once per genre.
- Suppose we want to compare the distribution of the IMDb score across all genres.

In [None]:
df_clean = df[['genres', 'budget', 'gross', 'title_year', 'imdb_score']].dropna()
df_genre = pd.DataFrame(columns = ['genre', 'budget', 'gross', 'year', 'imdb_score'])

def genreRemap(row):
    global df_genre
    d = {}
    genres = row['genres'].split('|')
    n = len(genres)
    d['genre'] = genres
    d['budget'] = [row['budget']] * n
    d['gross'] = [row['gross']] * n
    d['year'] = [row['title_year']] * n
    d['imdb_score'] = [row['imdb_score']] * n

    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    df_genre = pd.concat([df_genre, pd.DataFrame(d)], ignore_index=True)

df_clean.apply(genreRemap, axis = 1)
df_genre['year'] = df_genre['year'].astype(np.int16)

In [None]:
df_genre.head()
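
In newer pandas versions (0.25 and later), the same reshaping can be sketched without a row-by-row function, by splitting the string into a list and calling `DataFrame.explode`. This is an alternative technique, not the approach used in the cell above, and the frame `toy` is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Comedy|Drama|Romance"],
    "imdb_score": [7.9, 6.8, 7.1],
})

genre_rows = (
    toy.assign(genre=toy["genres"].str.split("|"))  # list of genres per movie
       .explode("genre")                            # one row per (movie, genre) pair
       .drop(columns="genres")
       .reset_index(drop=True)
)
print(genre_rows)
```

Each movie now appears once per genre, with its other columns repeated, which is the same shape as `df_genre`.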

All right, we got exactly what we wanted. Next we can group by the genre column and perform different analyses.
First, let's check out the IMDb score across the different genres.

In [None]:
df_genre.groupby('genre')['imdb_score'].mean().plot.bar()
plt.ylabel('Average IMDb Score')

### Exercise 3 
- Which genre has the highest mean budget?
- When the number of bars becomes large, it is a good idea to plot them horizontally. See whether you can find the function in the documentation, or press Tab to see the available functions.

In [None]:
#### Your code here


# Boxplot
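- A boxplot is another way to visualize the distribution of a numeric feature. Let Q1, Q2 and Q3 represent the 25%, 50% and 75% quantiles, respectively.
- A boxplot is built from five values: Q1−1.5(Q3−Q1), Q1, Q2, Q3, and Q3+1.5(Q3−Q1). The box spans Q1 to Q3 with a line at Q2, and the whiskers extend to the most extreme observations that still lie within the two outer limits. It can be drawn with the function `boxplot`.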
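
The quantities described above can be computed by hand with NumPy. The sketch below uses a made-up array with one extreme value and NumPy's default (linear interpolation) quantile definition:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                   # interquartile range, Q3 - Q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are drawn individually as outliers.
fliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q2, q3)
print(lower_fence, upper_fence)
print(whisker_low, whisker_high, fliers)
```

Here the value 100 lies above the upper limit, so it would be drawn as a separate point rather than stretching the whisker.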

In [None]:
df_score = df[['color', 'imdb_score']].dropna()
df_score.boxplot(by='color', column='imdb_score')
plt.ylabel('Imdb Score')

### Exercise 4
What is the distribution of duration for each value of the `color` column?

In [None]:
#### Your code here


# Seaborn

In [None]:
import seaborn as sns

We can of course visualize the distribution of the IMDb score with a histogram. However, seaborn provides a nice function that smooths out the histogram to estimate the distribution.

In [None]:
sns.kdeplot(df['imdb_score'], shade=True, label='Estimated PDF of imdb score')

It is possible to combine histogram and the distribution estimate plot:

In [None]:
sns.distplot(df['imdb_score'])

The `jointplot()` function combines histograms and a scatter plot.

In [None]:
# keyword arguments work in both old and new seaborn versions
sns.jointplot(x=df['num_critic_for_reviews'], y=df['imdb_score'])

In seaborn versions that annotate the plot, we see "pearsonr=0.31", which is the Pearson correlation of these two variables, together with a p-value. The p-value is very small, which indicates that the linear relationship is statistically significant, even though a correlation of 0.31 is only moderate.
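
The Pearson correlation itself can be computed directly with pandas. The sketch below uses randomly generated data rather than the movie columns, to show the two extremes: an exact linear relationship and a linear trend buried in noise. For the p-value, one option is `scipy.stats.pearsonr`, which is not used here to keep the sketch free of extra dependencies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
noise = pd.Series(rng.normal(size=200))

exact = 2 * x + 1      # perfectly linear in x
noisy = x + 3 * noise  # linear trend buried in noise

r_exact = x.corr(exact)  # Pearson correlation is the default method
r_noisy = x.corr(noisy)

print(r_exact)  # 1 (up to rounding) for an exact increasing line
print(r_noisy)  # between -1 and 1, weaker because of the noise
```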

We may also visualize the distribution of multiple features by using boxplot:

In [None]:
sns.boxplot(x='color', y='imdb_score', data=df)

### Advanced plots
Seaborn also ships with several well-known sample datasets, such as `tips`:

In [None]:
tips = sns.load_dataset("tips")
tips.head()

This well-known dataset records, for each restaurant party, the sex of the person paying, whether they were a smoker, the day and the meal of the visit, the size of the party, the total bill, and the tip.

To visualize the relationship between total_bill and tip, we can of course use a scatter plot. However, seaborn can also fit a linear model and draw it on top.

In [None]:
# keyword arguments work in both old and new seaborn versions
sns.lmplot(x="total_bill", y="tip", data=tips)

We can further split the data into "Male" and "Female" parts and visualize them.

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips, hue="sex", palette="Set2")

You can also pass a dictionary to the palette argument to specify the color of each level of the hue variable.

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips, hue="sex", palette={"Male": "b", "Female": "r"})

Here we used:
- `hue`: the column by which the data is grouped.
- `palette`: the colors to use for the groups.
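
Conceptually, grouping by `hue` means that a separate regression line is fitted for each level. The sketch below illustrates that idea with `groupby` and `np.polyfit` on a tiny made-up frame. It is an illustration of the concept, not seaborn's actual implementation:

```python
import numpy as np
import pandas as pd

# Made-up data: both groups tip 10% of the bill, one group adds a flat 1.0.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    "tip":        [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    "sex":        ["Male"] * 3 + ["Female"] * 3,
})

fits = {}
for level, group in toy.groupby("sex"):
    # Least-squares straight line: tip = slope * total_bill + intercept
    slope, intercept = np.polyfit(group["total_bill"], group["tip"], deg=1)
    fits[level] = (slope, intercept)
    print(level, round(slope, 3), round(intercept, 3))
```

Both groups get the same slope but different intercepts, which in a plot would show up as two parallel lines in different colors.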

We often need to compare the same kind of plot across different subsets of the data. This is where the faceting functions come in handy.

In [None]:
# `size` was renamed to `height` in seaborn 0.9
SexGrid = sns.FacetGrid(tips, col='sex', hue="sex", palette="Set1", height=4)
SexGrid.map(sns.distplot, "tip")

Within `FacetGrid()`, `col` places the subset for each sex in its own column of the grid, `hue` colors each subset by sex, and `palette` specifies the colors. We then use the `.map()` method to specify the type of plot and the feature we want to visualize.
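
To make the idea of faceting concrete, the sketch below builds a comparable one-row grid by hand with `groupby` and `plt.subplots`. It uses a small made-up frame and plain histograms, and it is a rough equivalent for illustration rather than what seaborn does internally:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
from matplotlib import pyplot as plt

toy = pd.DataFrame({
    "tip": [1.0, 2.0, 2.5, 3.0, 1.5, 2.0, 3.5, 4.0],
    "sex": ["Male"] * 4 + ["Female"] * 4,
})

# One subplot per level of the faceting column, placed side by side.
groups = list(toy.groupby("sex"))
fig, axes = plt.subplots(1, len(groups), figsize=(4 * len(groups), 4),
                         sharey=True, squeeze=False)

for ax, (level, group) in zip(axes[0], groups):
    ax.hist(group["tip"], bins=4)
    ax.set_title(f"sex = {level}")
    ax.set_xlabel("tip")

titles = [ax.get_title() for ax in axes[0]]
print(titles)
plt.close(fig)
```

`FacetGrid` saves this bookkeeping and keeps axes, titles, and colors consistent across the panels.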

Faceting can also split the data by several factors at once.

In [None]:
tipsGrid = sns.FacetGrid(tips, row='sex', col='smoker',\
                               hue='time', palette="Set2")
tipsGrid.map(sns.regplot, 'total_bill', 'tip')
tipsGrid.add_legend()

### Exercise 5
Use FacetGrid to compare the distribution of the IMDb score for each value of the `color` column.

In [None]:
#### Your code here


The function FacetGrid helps you explore specific variables conditioned on different levels of other variables. The function PairGrid is useful for exploring the relationships between pairs of variables.

In [None]:
tipGrid = sns.PairGrid(tips)
tipGrid.map(plt.scatter)
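
For a quick look without seaborn, pandas offers a comparable helper, `pandas.plotting.scatter_matrix`, which draws pairwise scatter plots with histograms on the diagonal. The sketch below runs it on a made-up numeric frame (the column names only mimic the tips data) and is offered as an alternative, not as what `PairGrid` does internally:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=30),
    "size": rng.integers(1, 6, size=30),
})
toy["tip"] = 0.15 * toy["total_bill"] + rng.normal(scale=0.5, size=30)

# One row and one column of subplots per numeric variable.
axes = scatter_matrix(toy, figsize=(6, 6), diagonal="hist")
print(axes.shape)
plt.close("all")
```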

In [None]:
iris = sns.load_dataset("iris")
iris.head()

In [None]:
g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)

Here is an example with different plot types and colors.

In [None]:
g = sns.PairGrid(iris, hue = 'species', palette='Set2',\
           hue_kws={'cmap':['Greens','Oranges','Blues']})

g.map_diag(plt.hist)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
g.add_legend()

# Solutions

**Exercise 1**

In [None]:
plt.hist(np.log(df['budget'].dropna()), color="#666699")
plt.xlabel('log of budget')
plt.ylabel('count')
plt.title('Histogram of budget', fontsize=20)

**Exercise 2**

In [None]:
gross_df = df[['director_facebook_likes', 'gross']].dropna()
gross_df.plot.scatter('director_facebook_likes', 'gross')

In [None]:
critic_df = df[['num_critic_for_reviews', 'imdb_score']].dropna()
critic_df.plot.scatter('num_critic_for_reviews', 'imdb_score')

**Exercise 3**

In [None]:
df_genre.groupby('genre')['budget'].mean().sort_values().plot.barh()

**Exercise 4**

In [None]:
df_dur = df[['color', 'duration']]
df_dur.boxplot(by='color', column='duration')
plt.ylabel('Length of Duration')

**Exercise 5**

In [None]:
# `size` was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(data=df, col='color', hue='color', palette='Set1', height=4)
g.map(sns.distplot, 'imdb_score')