# Matplotlib and Seaborn plotting

Datasets:

- `salary.csv` -- a dataset comparing salary data across gender and tenure lines for academics 
- `wine_quality.csv` -- a dataset comparing chemical qualities of red and white wine and user-rated quality scores (on a 10 point scale)


##### Workflow Tips

1. Open the data and do a quick EDA:
  - How many rows and columns?
  - Is there missing data?
  - What does each column mean?
2. Begin plotting:
  - If a variable of interest is encoded as a string, do some feature extraction / transformation to turn it into numeric values
  - Use something like seaborn's pairplot to visualize overall relationships
  - Start digging into a bivariate relationship
3. Refine plots:
  - Try different plotting types / plotting options
  - Remember titles, axes labels, etc.
  - Does the plot have a story? What should a reader take away from the plot?
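
The first two workflow steps can be sketched on a toy table. This is only an illustration: the column names echo `salary.csv`, but the values and the `is_doctorate` column name are made up for the example.

```python
import pandas as pd

# Hypothetical toy data; column names echo salary.csv, values are invented.
toy = pd.DataFrame({
    "sx": ["male", "female", "female", "male"],
    "dg": ["doctorate", "masters", "doctorate", None],
    "sl": [36350, 24800, 31850, 28200],
})

n_rows, n_cols = toy.shape      # how many rows and columns?
missing = toy.isna().sum()      # missing values per column
dtypes = toy.dtypes             # which columns are strings, which are numbers?

# Turn a string-encoded variable into numeric values with an explicit mapping.
# Values absent from the mapping (here the missing degree) become NaN.
toy["is_doctorate"] = toy["dg"].map({"doctorate": 1, "masters": 0})

print(n_rows, n_cols)
print(missing)
print(toy)
```

An explicit dictionary keeps the encoding visible and makes unexpected categories show up as NaN instead of being silently coded.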

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Salary

In [None]:
salary = pd.read_csv('datasets/salary.csv')
salary.head()

In [None]:
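# Map the degree strings to booleans (True for a doctorate)
d = {'doctorate': True, 'masters': False}
has_doctorate = salary['dg'].map(d)

# Does anyone in the data hold a doctorate?
has_doctorate.any()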

In [None]:
salary['dg']

In [None]:
salary.info()

In [None]:
salary['yr'].max()

In [None]:
# distplot is deprecated since seaborn 0.11; histplot draws the same count histogram
sns.histplot(salary['sl'])

In [None]:
sns.barplot(x='sx', y='sl', data=salary)

In [None]:
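# Correlations are only defined for numeric columns, so drop any string columns first
corr_salary = salary[['sx', 'rk', 'yr', 'dg', 'yd', 'sl']].select_dtypes('number').corr()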
corr_salary

In [None]:
corr_salary = salary[['yr', 'yd']].corr()
corr_salary

In [None]:
sns.heatmap(corr_salary, annot=True)

In [None]:
sns.violinplot(x='yr', y='yd', hue='sx', data=salary)

In [None]:
sns.pairplot(salary[['sx', 'rk', 'yr', 'dg', 'yd', 'sl']])

In [None]:
sns.jointplot(x='yr', y='sl', data=salary, kind='kde')

# Wine Quality

In [None]:
wine = pd.read_csv('datasets/wine_quality.csv')
wine.head()

In [None]:
wine.info()

In [None]:
wine.columns

In [None]:
corr_wine = wine[['quality','fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'type']].select_dtypes('number').corr()  # drops string columns, if any
corr_wine

In [None]:
sns.heatmap(corr_wine, annot=True)

In [None]:
sns.pairplot(wine[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'type']])

In [None]:
sns.jointplot(x='quality', y='alcohol', data=wine)

In [None]:
sns.violinplot(x='quality', y='alcohol', data=wine)

In [None]:
fig = plt.figure() # Instantiates a figure, which will be empty
ax = fig.add_subplot(1, 1, 1) # Adds a subplot (read, axes) to the figure

# Set different aspects of the graph by calling methods on the axes
ax.set_xlim([0.5, 4.5]) # Set the minimum and maximum of x-axis to be 0.5 and 4.5
ax.set_ylim([5, 32]) # Set the min and max of the y-axis. Minimum value comes first, followed by the maximum
ax.set_title('An Example Graph') # Set the title of the graph.
ax.set_ylabel('y-axis') # Set the label on the y-axis. 
ax.set_xlabel('x-axis') # and on the x-axis

In [None]:
wine.hist(bins=10)

In [None]:
sns.barplot(x='alcohol', y='quality', data=wine)

In [None]:
sns.violinplot(x='alcohol', y='quality', data=wine)

In [None]:
wine.plot('quality', 'alcohol', kind='scatter')
#ax.scatter([4.2, 3.8, 2.5, 3.5], [11, 25, 9, 26], color='darkgreen', marker='^') # Plot scatter points on Axes

# Matplotlib

Datasets

- [datasets/data_atlanta.csv](./datasets/data_atlanta.csv)
- [datasets/data_austin.csv](./datasets/data_austin.csv)
- [datasets/data_boston.csv](./datasets/data_boston.csv)
- [datasets/data_chicago.csv](./datasets/data_chicago.csv)
- [datasets/data_nyc.csv](./datasets/data_nyc.csv)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
x = [1,2,3,4]                       # define data
y = [20, 21, 20.5, 20.8]
plt.plot(x, y)                      # plot a line graph
plt.savefig('graph1.png')           # save figure to file

In [None]:
# In matplotlib, feed in two lists of numbers. 
# The first list is the x-coordinates of each point
# The second list is the y-coordinates of each point
# In the below case, the first bar (0th index) sits at:
# (0, 15) -> 0 on the x-axis, height of 15.

fig, ax = plt.subplots() # Create a fresh figure and Axes to draw on
ax.bar([0, 1, 2, 3], [15, 30, 6, 8], color='purple') # Plot 4 bars on Axes. 

# We can ask Python to show us the graph, but if we're in a script, it's usually easier to have it save to file and look at it that way

# plt.savefig('bar_chart.png') # Save current figure to current working directory
# We'll plot a line graph next -- just like with bar graphs, this is in [X-coordinates], [Y-coordinates] format
# the first point is (1, 10) or 1 on the x-axis and 10 on the y-axis

ax.plot([1, 2, 3, 4], [10, 20, 25, 30], color='lightblue', linewidth=3) 

# plt.savefig('line_graph_too.png')

# Until we clear out the current graph from memory, we'll keep adding stuff on top of it.
# Finally, we'll add a scatter plot

ax.scatter([4.2, 3.8, 2.5, 3.5], [11, 25, 9, 26], color='darkgreen', marker='^') # Plot scatter points on Axes
# plt.savefig('final_graph.png')

# How do we clear out the current graph?

# plt.clf()

In [None]:
# Clear the current graph
# plt.clf()
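
As a rough illustration of the clearing step, the sketch below draws onto one Axes, then calls `plt.clf()` and checks how many Axes the figure still holds. The variable names `axes_before` and `axes_after` are just for this example.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar([0, 1, 2, 3], [15, 30, 6, 8], color='purple')
ax.plot([1, 2, 3, 4], [10, 20, 25, 30], color='lightblue', linewidth=3)

axes_before = len(fig.axes)   # one Axes holding both the bars and the line
plt.clf()                     # clear the current figure, removing its Axes
axes_after = len(fig.axes)    # nothing left to draw on

print(axes_before, axes_after)
plt.close(fig)                # release the figure entirely
```

After `plt.clf()` the old `ax` is no longer attached to the figure, so a new Axes has to be created (for example with `fig.add_subplot(1, 1, 1)`) before plotting again.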

## Full Matplotlib

In this section, we'll build a 3x3 grid of plots, one chart type per cell, to explore different parts of matplotlib's syntax.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

np.random.seed(2017) # This ensures that we all get the same "random" numbers

Boxplot, Violin Plot, Histogram

Pie Chart, Stacked Bar Chart, Line Chart

Scatter Plot, Stem Plot, 3D Scatter
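
To see how the layout above maps onto subplot numbers, here is a small sketch. It uses `plt.subplots(3, 3)` as a stand-in for the individual `plt.subplot(3, 3, i)` calls and only titles each cell; it does not draw the charts, and it does not set up the 3D projection that the last cell needs.

```python
import matplotlib.pyplot as plt

names = ['Boxplot', 'Violin Plot', 'Histogram',
         'Pie Chart', 'Stacked Bar Chart', 'Line Chart',
         'Scatter Plot', 'Stem Plot', '3D Scatter']

# axes is a 3x3 NumPy array of Axes; axes.flat walks it row by row,
# which is the same order as the subplot index 1..9.
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for index, (ax, name) in enumerate(zip(axes.flat, names), start=1):
    row, col = divmod(index - 1, 3)   # zero-based grid position
    ax.set_title(f'{index}: {name} (row {row}, col {col})')

fig.tight_layout()
```

Subplot indices count left to right, then top to bottom, so index 4 is the first cell of the second row.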

In [None]:
fig = plt.figure(figsize=(20,16), dpi=300)

all_data = [np.random.normal(0, std, 100) for std in range(1, 7)]

ax1 = plt.subplot(3, 3, 1)
ax1.set_title('Plot 1: Box Plot')
ax1.set_ylabel('Y-axis')
ax1.set_xlabel('X-axis')

## Set up boxplot. 
boxplot = ax1.boxplot(all_data, patch_artist=True)
colors = ['lightblue', 'pink', 'lightgreen', 
        'lightyellow', 'lightgray', 'orange']
for patch, color in zip(boxplot['boxes'], colors):
    patch.set_facecolor(color)

## Violin Plots (http://matplotlib.org/examples/statistics/customized_violin_demo.html)

ax2 = plt.subplot(3, 3, 2)
ax2.set_title('Plot 2: Violin Plot')
ax2.set_ylabel('Y-axis')
ax2.set_xlabel('X-axis')

violinplot = ax2.violinplot(all_data)

## Histograms (http://matplotlib.org/examples/statistics/histogram_demo_features.html)

ax3 = plt.subplot(3, 3, 3)
ax3.set_title('Plot 3: Histogram')
ax3.set_ylabel('Y-axis')
ax3.set_xlabel('X-axis')

histogram = ax3.hist(all_data[0], bins=10, color='pink', edgecolor='black')

## Pie Charts (DO NOT MAKE THESE!) (http://matplotlib.org/examples/pie_and_polar_charts/pie_demo_features.html)

ax4 = plt.subplot(3, 3, 4)
ax4.set_title('Plot 4: Pie Charts - DO NOT MAKE THESE')
ax4.axis('equal') # Equal aspect ratio keeps the pie circular

pie_labels = ['Inappropriate', 'Arguable', 'Appropriate']
sizes = [70, 20, 10]
explode_ratio = (0, 0, 0.25)

piechart = ax4.pie(sizes, explode=explode_ratio, labels=pie_labels, autopct='%1.1f%%')

## Stacked Bar Chart

ax5 = plt.subplot(3, 3, 5)
ax5.set_title('Plot 5: Stacked Bar Chart')
ax5.set_ylabel('Y-axis')
ax5.set_xlabel('X-axis')

xs = [x for x in range(1, 6)]
y1 = [2, 4, 6, 2, 10]
y2 = [4, 0, 6, 15, 2]

bottom_bar = ax5.bar(xs, y1, color='red')
top_bar = ax5.bar(xs, y2, bottom=y1, color='blue')

## Line Chart

ax6 = plt.subplot(3, 3, 6)
ax6.set_title('Plot 6: Line Charts')
ax6.set_ylabel('Y-axis')
ax6.set_xlabel('X-axis')

xs = [x for x in range(1, 101)]

first_line = ax6.plot(xs, all_data[0], color='red')
second_line = ax6.plot(xs, all_data[1], color='blue')
third_line = ax6.plot(xs, all_data[4], color='green')
fourth_line = ax6.plot(xs, all_data[5], color='black')

## Scatter Plot

ax7 = plt.subplot(3, 3, 7)
ax7.set_title('Plot 7: Scatterplots')
ax7.set_ylabel('Y-axis')
ax7.set_xlabel('X-axis')

scatter = ax7.scatter(all_data[0], all_data[5])

## Stem Plot (http://matplotlib.org/examples/pylab_examples/stem_plot.html)

ax8 = plt.subplot(3, 3, 8)
ax8.set_title('Plot 8: Stem Plot')
ax8.set_ylabel('Y-axis')
ax8.set_xlabel('X-axis')

x_data = [x for x in range(1, 101)]

stemplot = ax8.stem(x_data, np.sort(all_data[0]))

## 3D scatter plot (http://matplotlib.org/examples/mplot3d/scatter3d_demo.html)

ax9 = plt.subplot(3, 3, 9, projection='3d')
ax9.set_title('Plot 9: 3D Scatter Plot')
ax9.set_ylabel('Y-axis')
ax9.set_xlabel('X-axis')
ax9.set_zlabel('Z-axis')

scatter_3d1 = ax9.scatter(all_data[0], all_data[1], all_data[2], 
    color='red', marker='*')
scatter_3d2 = ax9.scatter(all_data[3], all_data[4], all_data[5],
    color='blue', marker='^')

plt.show()

Cleaning up

In [None]:
fig = plt.figure(figsize=(20,16), dpi=300)

all_data = [np.random.normal(0, std, 100) for std in range(1, 7)]

ax1 = plt.subplot(3, 3, 1)
ax1.set_title('Plot 1: Box Plot')
ax1.set_ylabel('Y-axis')
ax1.set_xlabel('X-axis')

## Set up boxplot.
boxplot = ax1.boxplot(all_data, patch_artist=True)
colors = ['lightblue', 'pink', 'lightgreen', 
        'lightyellow', 'lightgray', 'orange']
for patch, color in zip(boxplot['boxes'], colors):
    patch.set_facecolor(color)

## Violin Plots (http://matplotlib.org/examples/statistics/customized_violin_demo.html)

ax2 = plt.subplot(3, 3, 2)
ax2.set_title('Plot 2: Violin Plot')
ax2.set_ylabel('Y-axis')
ax2.set_xlabel('X-axis')

violinplot = ax2.violinplot(all_data)

## Histograms (http://matplotlib.org/examples/statistics/histogram_demo_features.html)

ax3 = plt.subplot(3, 3, 3)
ax3.set_title('Plot 3: Histogram')
ax3.set_ylabel('Y-axis')
ax3.set_xlabel('X-axis')

histogram = ax3.hist(all_data[0], bins=10, color='pink', edgecolor='black')

## Pie Charts (DO NOT MAKE THESE!) (http://matplotlib.org/examples/pie_and_polar_charts/pie_demo_features.html)

ax4 = plt.subplot(3, 3, 4)
ax4.set_title('Plot 4: Pie Charts - DO NOT MAKE THESE')
ax4.axis('equal') # Equal aspect ratio keeps the pie circular

pie_labels = ['Inappropriate', 'Arguable', 'Appropriate']
sizes = [70, 20, 10]
explode_ratio = (0, 0, 0.25)

piechart = ax4.pie(sizes, explode=explode_ratio, labels=pie_labels, autopct='%1.1f%%')

## Stacked Bar Chart

ax5 = plt.subplot(3, 3, 5)
ax5.set_title('Plot 5: Stacked Bar Chart')
ax5.set_ylabel('Y-axis')
ax5.set_xlabel('X-axis')

xs = [x for x in range(1, 6)]
y1 = [2, 4, 6, 2, 10]
y2 = [4, 0, 6, 15, 2]

bottom_bar = ax5.bar(xs, y1, color='red')
top_bar = ax5.bar(xs, y2, bottom=y1, color='blue')

## Line Chart

ax6 = plt.subplot(3, 3, 6)
ax6.set_title('Plot 6: Line Charts')
ax6.set_ylabel('Y-axis')
ax6.set_xlabel('X-axis')

xs = [x for x in range(1, 101)]

first_line = ax6.plot(xs, all_data[0], color='red')
second_line = ax6.plot(xs, all_data[1], color='blue')
third_line = ax6.plot(xs, all_data[4], color='green')
fourth_line = ax6.plot(xs, all_data[5], color='black')

## Scatter Plot

ax7 = plt.subplot(3, 3, 7)
ax7.set_title('Plot 7: Scatterplots')
ax7.set_ylabel('Y-axis')
ax7.set_xlabel('X-axis')

scatter = ax7.scatter(all_data[0], all_data[5])

## Stem Plot (http://matplotlib.org/examples/pylab_examples/stem_plot.html)

ax8 = plt.subplot(3, 3, 8)
ax8.set_title('Plot 8: Stem Plot')
ax8.set_ylabel('Y-axis')
ax8.set_xlabel('X-axis')

x_data = [x for x in range(1, 101)]

stemplot = ax8.stem(x_data, np.sort(all_data[0]))

## 3D scatter plot (http://matplotlib.org/examples/mplot3d/scatter3d_demo.html)

ax9 = plt.subplot(3, 3, 9, projection='3d')
ax9.set_title('Plot 9: 3D Scatter Plot')
ax9.set_ylabel('Y-axis')
ax9.set_xlabel('X-axis')
ax9.set_zlabel('Z-axis')

scatter_3d1 = ax9.scatter(all_data[0], all_data[1], all_data[2], 
    color='red', marker='*')
scatter_3d2 = ax9.scatter(all_data[3], all_data[4], all_data[5],
    color='blue', marker='^')

plt.tight_layout() # To make things neater

size = fig.get_size_inches()
print(size)
print(size*fig.dpi)

plt.savefig('chart_types.png') # to save the plot

plt.show()

## Seaborn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import the data

This data is drawn from the 1994 US census and is available [here](http://archive.ics.uci.edu/ml/datasets/Adult) as the UCI Adult dataset.

In [None]:
df = pd.read_csv('census_data.csv')
df.head()

# Univariate Plotting in Seaborn

Using Seaborn's [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) (the replacement for the deprecated `distplot`)

In [None]:
# Histogram normalized to a density, with a KDE curve drawn over it
sns.histplot(df['age'], kde=True, stat='density')

Sometimes we want the plain histogram of counts rather than the [Kernel Density Estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation)

In [None]:
sns.histplot(df['age'])

Notice that the y-axis now shows counts of occurrences rather than a density
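
The change in scale can be checked numerically. The sketch below uses NumPy's `histogram` on made-up ages (a hypothetical stand-in for `df['age']`): count bars add up to the number of observations, while density bars are scaled so that their total area is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(17, 91, size=1000)   # hypothetical stand-in for df['age']

counts, edges = np.histogram(ages, bins=10)              # raw counts per bin
density, _ = np.histogram(ages, bins=10, density=True)   # normalized heights

total_count = counts.sum()                      # equals the number of observations
total_area = (density * np.diff(edges)).sum()   # bar height times bin width, summed

print(total_count)
print(total_area)
```

A density scale is what lets a KDE curve share the axis with the bars; a count scale is easier to read when the question is simply how many observations fall in each bin.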

Using Seaborn's [barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html)

In [None]:
sns.barplot(x='race', y='capital_gain', data=df)

_Using two categories_

In [None]:
sns.barplot(x='race', y='capital_gain', hue='sex', data=df)

_Compare this to `matplotlib`_

In [None]:
races = df['race'].unique()
averages = [df.loc[(df['race'] == race), 'capital_gain'].mean() for race in races]

In [None]:
ax = plt.subplot()
ax.bar(range(len(averages)), averages)
ax.set_xticks(range(len(averages)))
ax.set_xticklabels(races)
plt.show()
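
For comparison, the per-group averages can also be computed with a pandas `groupby`. This is a sketch on a made-up table (the group labels and values are hypothetical); it checks that the list comprehension and the `groupby` agree.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the census frame.
toy = pd.DataFrame({
    'race': ['A', 'B', 'A', 'B', 'C'],
    'capital_gain': [0, 100, 200, 300, 50],
})

# The approach from the cell above: one filtered mean per category.
races = toy['race'].unique()
averages = [toy.loc[toy['race'] == race, 'capital_gain'].mean() for race in races]

# The same numbers in one call; sort=False keeps first-appearance order,
# matching the order returned by unique().
grouped = toy.groupby('race', sort=False)['capital_gain'].mean()

print(averages)
print(grouped)
```

Either result can be passed to `ax.bar`. By default, seaborn's `barplot` also draws the mean of each group and adds error bars on top.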

# Bivariate Plotting in Seaborn

Using Seaborn's [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)

This style of plotting is best used to help visualize a dense table of information, such as a correlation table

In [None]:
df.head()

In [None]:
df['over50k'] = df['income_dummy'].apply(lambda x: 1 if x == ' >50K' else 0)
corr_df = df[['age', 'ednum', 'capital_gain', 'capital_loss', 'hours_per_week', 'over50k']].corr()
corr_df

In [None]:
sns.heatmap(corr_df)

In [None]:
sns.heatmap(corr_df, annot=True)

Plot half of the correlation heat map

In [None]:
mask = np.zeros_like(corr_df, dtype=bool)  # np.bool was removed from NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_df, annot=True, mask=mask)
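
To see what the mask does, here is a small sketch on a made-up three-column table (the data and column names are hypothetical). Cells where the mask is `True` are hidden by `sns.heatmap`, so only the strictly lower triangle remains visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, just to get a 3x3 correlation table.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 1, 4, 3], 'c': [4, 3, 2, 1]})
corr = toy.corr()

# Start with nothing masked, then mask the upper triangle including the diagonal.
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True

visible_cells = int((~mask).sum())   # cells the heatmap would still draw

print(mask)
print(visible_cells)
```

A correlation matrix is symmetric with ones on the diagonal, so hiding the upper triangle and the diagonal removes only redundant cells.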

Using Seaborn's [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html)

_Univariate Distribution_

In [None]:
sns.boxplot(x=df['ednum'])  # pass the variable by keyword; newer seaborn treats the first positional argument as data

_Grouped by a categorical variable_

In [None]:
sns.boxplot(x='income_dummy', y='ednum', data=df)

_Grouped by *two* categorical variables_

In [None]:
sns.boxplot(x='income_dummy', y='ednum', hue='sex', data=df)

Using Seaborn's [violin plots](https://seaborn.pydata.org/generated/seaborn.violinplot.html) to do the same thing

In [None]:
sns.violinplot(x=df['ednum'])  # pass the variable by keyword; newer seaborn treats the first positional argument as data

In [None]:
sns.violinplot(x='income_dummy', y='ednum', data=df)

In [None]:
sns.violinplot(x='income_dummy', y='ednum', hue='sex', data=df)

Using Seaborn's [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

In [None]:
sns.pairplot(df[['age', 'ednum', 'capital_gain', 'capital_loss', 'hours_per_week', 'over50k']])

Using Seaborn's [jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html)

In [None]:
sns.jointplot(x='age', y='ednum', data=df)

Using `kind='kde'` or `kind='hex'` can help with densely packed data

In [None]:
sns.jointplot(x='age', y='ednum', data=df, kind='kde')

In [None]:
sns.jointplot(x='age', y='ednum', data=df, kind='hex')