# Session 5: Pandas & Visualization Exercise
_Author: B Rhodes (DC)_


Complete the following exercises in class. You will need to use the available documentation for matplotlib and seaborn. For each task below, once you have determined the correct method to use, write it in a cell and use the `shift-tab` trick to see which parameters to set.

Documentation can be found at the following sites:

- [Matplotlib Documentation](https://matplotlib.org/)
    - The [examples gallery](https://matplotlib.org/gallery/index.html) is very helpful. 
- [Seaborn Documentation](https://seaborn.pydata.org/)
    - The [examples gallery](https://seaborn.pydata.org/examples/index.html) is very helpful.

Note that most of the matplotlib tasks can be done by referring to the other notebooks in this lesson. Not all of the seaborn tasks are covered in this lesson, so the documentation may be necessary. In most cases, however, you are asked to use a specific seaborn method, so you should be able to work out what needs to be done without referring to the documentation.

**```shift-tab``` is your friend.**

Also, don't forget to use `;` to suppress extraneous output when executing various plot commands.



##### An example problem
Use `.random()` from the `random` module to generate a random number. 

If you didn't want to look at the documentation first, you could simply write the Python and use `shift-tab` to inspect the docstring.

In [None]:
#
import random

# use shift-tab to determine what arguments and parameters you need.
random.random()

###### 1. Import pandas and matplotlib

Read all the questions and import all the necessary libraries here.
Remember that we use the magic ```%matplotlib inline```.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

###### 2. Create a dataframe from the following data set & verify the result. 

[Auto MPG Dataset](https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Auto%20MPG/auto-mpg.data)

Use this path: ```'./datasets/autoMPG.csv'```

Here is the [data dictionary](https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Auto%20MPG).

You will need to do some data cleaning to get this data ready to plot. Use the simplest solution you can think of to clean the data. In the end all your data needs to be numerical.


In [None]:
# 2. Answer
path = '../datasets/autoMPG.csv'

df = pd.read_csv(path)

# verify the result
df.head()

###### 3. Explore the data
Use standard methods.



In [None]:
# 3. Answer
df.info()

Look at all the values for any non-numeric columns. Use the pandas attribute `.values` to look at all the values in a series.

In [None]:
## We can see all the values in a given column
# we want to look at horsepower, since it has data type object, which means it contains string data.
df['horsepower'].values
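Scanning the full array by eye works for a small data set. As a rough sketch of another way to spot the offending entries, we can let `pd.to_numeric` try to parse every value and then list only the ones that fail. The series below is a hypothetical stand-in for `df['horsepower']`, not the real data.

```python
import pandas as pd

# Hypothetical toy column standing in for df['horsepower']
horsepower = pd.Series(['130.0', '165.0', '?', '150.0', '?'], name='horsepower')

# With errors='coerce', values that cannot be parsed as numbers become NaN
parsed = pd.to_numeric(horsepower, errors='coerce')

# Keep only the original strings that failed to parse
bad_values = horsepower[parsed.isna()].unique()
print(bad_values)
```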

We need to convert the horsepower column to floats, but the `?` entries won't convert because they are non-numeric. The simplest fix is to drop those rows: we have no knowledge of what the values should be, and it doesn't make sense to impute them.


In [None]:
# create a boolean mask to flag the rows containing question marks
is_question = df['horsepower'] == '?'
df[is_question]

# create a new dataframe excluding rows containing a question mark.
auto_df = df[~is_question].copy(deep=True)


In [None]:
# Check the results
auto_df.info()

# the result still has horsepower as data type 'object', which means string. 
# But we can be sure we've removed non-numerical values.

In [None]:
# Convert horsepower to floats
auto_df['horsepower'] = auto_df['horsepower'].astype(float)
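The cleaning steps above can also be folded into the load step. The sketch below is one possible alternative, not the required answer: it assumes the placeholder is always exactly `?` and uses a hypothetical three-row sample in place of the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has more columns and rows
csv_text = "mpg,horsepower\n18.0,130.0\n25.0,?\n15.0,165.0\n"

# na_values='?' tells read_csv to treat question marks as missing values
sample_df = pd.read_csv(io.StringIO(csv_text), na_values='?')

# Drop rows with a missing horsepower; the column is already parsed as float
clean_df = sample_df.dropna(subset=['horsepower'])
print(clean_df.dtypes)
```

With this approach no separate `.astype(float)` call is needed, because the column never contains strings once the `?` entries are read as missing.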

###### 4. Generate summary statistics on the data set


In [None]:
# Generate summary stats

auto_df.describe()


###### 5. Generate a scatter plot of displacement vs horsepower


In [None]:
# 5. Answer
ax = auto_df.plot(kind='scatter', x='displacement', y='horsepower');
ax.set_title('Horsepower vs Displacement');

###### 6. Generate a scatter plot of displacement vs horsepower and vary the color of each point by mpg
###### Bonus points: use a reverse color gradient


In [None]:
# 6. answer
ax = auto_df.plot(kind='scatter', x='displacement', y='horsepower', c='mpg', colormap='Blues_r');
ax.set_title('Horsepower vs Displacement, mpg by color');
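The same idea can be sketched directly in matplotlib. In matplotlib, appending `_r` to a built-in colormap name selects its reversed version, which is what the bonus asks for. The arrays below are made-up stand-ins for the displacement, horsepower, and mpg columns.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for the real columns
rng = np.random.default_rng(0)
displacement = rng.uniform(70, 450, size=50)
horsepower = 0.3 * displacement + rng.normal(0, 10, size=50)
mpg = 45 - 0.07 * displacement

fig, ax = plt.subplots()
# c= supplies the values to map to color; 'Blues_r' is 'Blues' reversed
points = ax.scatter(displacement, horsepower, c=mpg, cmap='Blues_r')
fig.colorbar(points, ax=ax, label='mpg')
ax.set_title('Horsepower vs Displacement, mpg by color');
```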



###### 7. Use seaborn's lmplot to create a scatterplot of displacement vs horsepower with a trend line. What does this show?


In [None]:
# 7. answer
# lmplot returns a FacetGrid rather than a matplotlib Axes
g = sns.lmplot(x="displacement", y="horsepower", data=auto_df, fit_reg=True);
plt.title('Horsepower vs Displacement with Trend Line');
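The trend line drawn by `lmplot` is a linear regression fit. To see what such a fit computes, here is a minimal sketch using `np.polyfit` on made-up points that lie exactly on a line; the numbers are illustrative and not taken from the data set.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

A positive slope means the y variable tends to rise as the x variable rises, which is the relationship to look for in the plot above.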

###### 8. Use seaborn to create a lineplot of mpg vs displacement.

Use seaborn `.lineplot()` and be sure to give it a title.

What does this show? What can you conclude?

In [None]:
# 8. answer
sns.lineplot(x= "displacement", y="mpg", data=auto_df);
plt.title("MPG vs Displacement");

###### 9. Create a barplot of mpg vs model year - use seaborn

Use seaborn `.barplot()` and be sure to give it a title.

In [None]:
sns.barplot(x="model year", y="mpg", data=auto_df);
plt.title("MPG vs model year");
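By default, `sns.barplot()` draws the mean of `y` for each category of `x`, with an error bar on top. The bar heights can be reproduced with a pandas `groupby`, sketched here on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical toy data using the same column names as auto_df
toy = pd.DataFrame({
    'model year': [70, 70, 71, 71],
    'mpg': [18.0, 16.0, 25.0, 27.0],
})

# Mean mpg per model year: the value each bar represents
mean_mpg = toy.groupby('model year')['mpg'].mean()
print(mean_mpg)
```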

###### 10. Create a scatter plot of mpg vs displacement and use cylinders to set the hue.

In [None]:
sns.scatterplot(x='displacement', y='mpg', data=auto_df, hue='cylinders');


###### 11. Use seaborn `.swarmplot()` to plot mpg vs model year

In [None]:
sns.swarmplot(x='model year', y='mpg', data=auto_df);


###### 12. Repeat the last exercise, but add hue coded by cylinders.

What does this tell you?

In [None]:
sns.swarmplot(x='model year', y='mpg', data=auto_df,  hue='cylinders');


###### 13. Create a seaborn countplot of model year and cylinders.

Plot model year using `.countplot()` and set hue=cylinders. What does this tell you about the number of cylinders vs model year?

In [None]:
sns.countplot(x='model year', hue='cylinders', data=auto_df)
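The numbers behind a count plot with a hue are just a two-way frequency table. As a sketch, `pd.crosstab` builds that table; the frame below is a hypothetical five-row sample, not the real data.

```python
import pandas as pd

# Hypothetical toy data using the same column names as auto_df
toy = pd.DataFrame({
    'model year': [70, 70, 70, 71, 71],
    'cylinders': [8, 8, 4, 4, 4],
})

# Rows are model years, columns are cylinder counts, cells are row counts
counts = pd.crosstab(toy['model year'], toy['cylinders'])
print(counts)
```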

###### 14. Create a histogram using seaborn's `.distplot()`

Create a histogram of acceleration using seaborn's `.distplot()`. Use `bins = 10`.


In [None]:
sns.distplot(a=auto_df['acceleration'], kde=False, bins=10)
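To see what `bins=10` means numerically, the sketch below uses `np.histogram` on made-up values: the range is split into 10 equal-width intervals, and each bar height is the number of values that fall in one interval.

```python
import numpy as np

# Made-up values standing in for the acceleration column
rng = np.random.default_rng(1)
values = rng.normal(loc=15.5, scale=2.7, size=200)

# 10 bins are defined by 11 edges; counts holds one bar height per bin
counts, edges = np.histogram(values, bins=10)
print(counts)
print(edges)
```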


###### 15. Repeat 14, but add a density plot

Create a histogram of acceleration using seaborn's `.distplot()` and add a density plot.


In [None]:
sns.distplot(a=auto_df['acceleration'], bins=10, kde=True)


###### 16. Add mpg to the above plot.

You can plot two histograms in the same plot.

In [None]:
sns.distplot(a=auto_df['acceleration'], bins=10, kde=True)
sns.distplot(a=auto_df['mpg'], bins=10)

###### 17. Create a density plot using seaborn's `.kdeplot()`

Create density plots of mpg and acceleration on the same graph. Use the `shade` parameter to fill in under the curves.

In [None]:
sns.kdeplot(data=auto_df['mpg'], shade=True)
sns.kdeplot(data=auto_df['acceleration'], shade=True)


###### 18. Create a seaborn `jointplot` of displacement vs mpg

In [None]:
sns.jointplot(x="displacement", y="mpg", data=auto_df);


###### 19. Repeat the above, but change the kind of plot to a density plot.
Use seaborn to create a jointplot, but change the kind to a density plot. The command should be the same, but you need to set a new parameter. Use `shift-tab` to figure out which parameter that is.

In [None]:
sns.jointplot(x="displacement", y="mpg", data=auto_df, kind='kde');


###### 20. Use seaborn to create a boxplot of mpg vs model year 

In [None]:
sns.boxplot(x="model year", y="mpg", data=auto_df);

###### 21. Use seaborn to create a correlation heatmap
Include the following:

1. Use a diverging color map.
2. Set the center to zero.
3. Set linewidths to 0.5.
4. Set square to True.
5. Add numerical values in the squares.
6. Set the title to 'Correlation Heatmap w/ annotation'

In [None]:
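# Correlate numeric columns only; text columns (if any) cannot be correlated
corr = auto_df.select_dtypes(include='number').corr()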
cmap = sns.diverging_palette(230, 20, as_cmap=True)
hm = sns.heatmap(corr, cmap=cmap, center=0,
            square=True, linewidths=.5, annot=True);

hm.set_title('Correlation Heatmap w/ annotation');

###### 22. BONUS: Create a diagonal heatmap.
Repeat the above, but make it a diagonal display.
Refer to the seaborn documentation for an example.

In [None]:
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
hm = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

hm.set_title('Diagonal Correlation Heatmap w/annotation');
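To see what the mask does, here is a small sketch on a hypothetical 3x3 correlation matrix. `np.triu` marks the diagonal and everything above it as `True`, and the heatmap hides cells where the mask is `True`, so only the lower triangle is left. The sketch uses `corr.shape` to build the mask, which is an illustrative variation on the `np.ones_like` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are placeholders
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 9], 'c': [4, 3, 2, 1]})
corr = toy.corr()

# True on and above the diagonal, False strictly below it
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)

# Blank out the masked cells to preview what the heatmap would still show
lower = corr.mask(mask)
print(lower)
```

Because a correlation matrix is symmetric and its diagonal is always 1, the hidden cells carry no extra information.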