# Intro to Data Visualization with Matplotlib


## Why is data visualization important?
> "A picture is worth a thousand words"


> Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner – and you can experiment with different scenarios by making slight adjustments.



## What is Matplotlib?
The **matplotlib** Python library, developed by **John Hunter** and many other contributors, is used to create high-quality graphs, charts, and figures. The library is extensive and capable of changing very minute details of a figure. 

__Some basic concepts and functions provided in matplotlib are:__

### Figure and axes 
The entire illustration is called a figure, and each plot on it is an axes (do not confuse Axes with Axis). The figure can be thought of as a canvas on which several plots can be drawn. We obtain the figure and the axes using the **subplots()** function.
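
As a minimal sketch of the idea, the cell below asks subplots() for one figure holding two axes side by side. The data values and titles are made up for illustration only.

```python
import matplotlib.pyplot as plt

# One figure (the canvas) holding a 1x2 grid of axes (the individual plots).
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Illustrative values only; each Axes object is drawn on independently.
axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title("Left axes")
axes[1].plot([1, 2, 3], [9, 4, 1])
axes[1].set_title("Right axes")

# The figure keeps track of every axes attached to it.
print(len(fig.axes))
plt.show()
```

Each Axes object in turn owns an x-axis and a y-axis, which is why the two words should not be confused.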

### Plotting
The first thing needed to plot a graph is data. For example, a dictionary can hold the data, with its keys as the x values and its values as the y values. Plotting functions such as scatter(), bar(), and pie(), among many others, then create the plot.
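
A small sketch of that workflow, assuming a hypothetical dictionary of fruit counts (the names and numbers are invented for this example):

```python
import matplotlib.pyplot as plt

# Hypothetical data: keys become the x values, values become the y values.
fruit_counts = {"apples": 12, "bananas": 7, "cherries": 19}
labels = list(fruit_counts.keys())
counts = list(fruit_counts.values())

fig, ax = plt.subplots()
bars = ax.bar(labels, counts)  # one bar per dictionary entry
ax.set_ylabel("Count")
plt.show()
```

Swapping ax.bar() for ax.scatter() or ax.pie(counts, labels=labels) reuses the same two lists with a different chart type.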

### Axis
The figure and axes obtained from subplots() can then be modified. Properties of the x-axis and y-axis (labels, minimum and maximum values, etc.) can be changed using Axes.set().
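
For instance, Axes.set() accepts several properties at once as keyword arguments. The sketch below uses made-up values and labels:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])  # illustrative values

# Labels, title, and axis limits are all set in a single call.
ax.set(
    xlabel="x",
    ylabel="x squared",
    title="Setting axis properties",
    xlim=(0, 3),
    ylim=(0, 10),
)
plt.show()
```

The same properties can also be set one at a time with ax.set_xlabel(), ax.set_xlim(), and so on.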

## Anatomy of a Figure 
A matplotlib visualization is a figure to which one or more axes are attached. Each axes has a horizontal (x) axis and a vertical (y) axis, and the data is encoded using color and glyphs such as markers (for example, circles), lines, or polygons (called patches). The figure below annotates these parts of a visualization and was created by Nicolas P. Rougier using matplotlib. The source code can be found in the [matplotlib documentation](https://matplotlib.org/gallery/showcase/anatomy.html#sphx-glr-gallery-showcase-anatomy-py).

![](../img/mpl_anatomy.png)

## Basic Charts
- Line Plot 
- Histograms 
- Scatter Plot 
- Box plots
- Bar Plots 
- Pie Chart 

## Take a look at data types 
![](../img/type_of_data.png)

## Import Library

In [None]:
import pandas as pd # for data manipulation 
import numpy as np # for linear algebra 
import matplotlib.pyplot as plt # plt is the conventional alias 

## Import Dataset

In [None]:
df = pd.read_csv('../data/gapminder.csv')

## Exploring Data

In [None]:
# examine first few rows 
df.head() 

In [None]:
# examine last few rows 
df.tail() 

In [None]:
# check shape of the data 
df.shape

In [None]:
# check data types 
df.dtypes

In [None]:
# check columns 
df.columns

In [None]:
# info about the data 
df.info() 

In [None]:
# summary stats 
df.describe()

## Describing Distributions

### Histograms 
A histogram helps you understand the distribution of a numeric variable in a way that the mean or median alone cannot.

In [None]:
plt.hist(df.gdpPercap)
plt.show() 

In [None]:
plt.hist(df['gdpPercap'])
plt.show() 

In [None]:
# title 
plt.hist(df['gdpPercap'])
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# add xlabel and ylabel 
plt.hist(df['gdpPercap'])
plt.xlabel("GDP Per Capita")
plt.ylabel("Counts")
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# figsize 
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'])
plt.xlabel("GDP Per Capita")
plt.ylabel("Counts")
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# edgecolor  
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black')
plt.xlabel("GDP Per Capita")
plt.ylabel("Counts")
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# bins   
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20)
plt.xlabel("GDP Per Capita")
plt.ylabel("Counts")
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# range (lower and upper limits of the bins)
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, range = (0, 50000))
plt.xlabel("GDP Per Capita")
plt.ylabel("Counts")
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# color    
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita")
plt.ylabel("Counts")
plt.title("Distribution of GDP Per Capita")
plt.show() 

In [None]:
# xlabel, ylabel, title fontsize     
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita", fontsize=16)
plt.ylabel("Counts",  fontsize=16)
plt.title("Distribution of GDP Per Capita",  fontsize=16)
plt.show() 

In [None]:
# xticks, yticks fontsize    
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita", fontsize=16)
plt.ylabel("Counts",  fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.title("Distribution of GDP Per Capita",  fontsize=16)
plt.show() 

In [None]:
# style 
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita", fontsize=16)
plt.ylabel("Counts",  fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.title("Distribution of GDP Per Capita",  fontsize=16)
plt.show() 

In [None]:
# tight layout 
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita", fontsize=16)
plt.ylabel("Counts",  fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.title("Distribution of GDP Per Capita",  fontsize=16)
plt.tight_layout() 
plt.show() 

In [None]:
print(plt.style.available)

In [None]:
# style 
plt.style.use('ggplot')
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita", fontsize=16)
plt.ylabel("Counts",  fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.title("Distribution of GDP Per Capita",  fontsize=16)
plt.tight_layout() 
plt.show() 

In [None]:
# export figure: Final Step!! 
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
plt.figure(figsize=(10,6))
plt.hist(df['gdpPercap'], edgecolor = 'black', bins=20, color="green")
plt.xlabel("GDP Per Capita", fontsize=16)
plt.ylabel("Counts",  fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.title("Distribution of GDP Per Capita",  fontsize=16)
plt.tight_layout()
# figure_name.file_format, dpi 
plt.savefig("gdp.tiff", dpi = 300)
plt.show() 

### Comparing Asia and Europe's GDP Per Capita

In [None]:
df.head() 

In [None]:
# unique continent: unique()
df.continent.unique() 

In [None]:
# unique continent: set()
set(df.continent)

In [None]:
# number of unique continents: nunique()
df.continent.nunique() 

In [None]:
# unique continent: len(set())
len(set(df.continent)) 

In [None]:
df_2007 = df[df.year == 2007]

In [None]:
df_2007.head() 

In [None]:
asia_2007 = df_2007[df_2007.continent == "Asia"]
europe_2007 = df_2007[df_2007.continent == "Europe"]

In [None]:
asia_2007.head() 

In [None]:
europe_2007.head()

In [None]:
len(set(asia_2007.country))

In [None]:
len(set(europe_2007.country))

In [None]:
# mean gdp percap 
asia_2007.gdpPercap.mean()

In [None]:
europe_2007.gdpPercap.mean()

In [None]:
# median gdp percap 
asia_2007.gdpPercap.median()

In [None]:
europe_2007.gdpPercap.median()

In [None]:
plt.hist(asia_2007.gdpPercap)
plt.show() 

In [None]:
plt.hist(asia_2007.gdpPercap, edgecolor = 'black', bins = 20)
plt.show() 

In [None]:
plt.subplot(211)
plt.title('Distribution of GDP Per Capita')
plt.hist(asia_2007.gdpPercap, edgecolor = 'black', bins = 20)
plt.ylabel('Asia')

plt.subplot(212)
plt.hist(europe_2007.gdpPercap, edgecolor = 'black', bins = 20)
plt.ylabel('Europe')
plt.show() 

In [None]:
plt.subplot(211)
plt.title('Distribution of GDP Per Capita')
plt.hist(asia_2007.gdpPercap, edgecolor = 'black', bins = 20, range = (0, 50000))
plt.ylabel('Asia')

plt.subplot(212)
plt.hist(europe_2007.gdpPercap, edgecolor = 'black', bins = 20, range = (0, 50000))
plt.ylabel('Europe')
plt.show() 

### Compare Life Expectancy in Europe and the Americas in 1997

In [None]:
df_1997 = df[df.year == 1997]
df_1997.head() 

In [None]:
americas_1997 = df_1997[df_1997.continent == 'Americas']
europe_1997 = df_1997[df_1997.continent == 'Europe']

In [None]:
americas_1997.head() 

In [None]:
europe_1997.head() 

In [None]:
americas_1997.country.nunique() 

In [None]:
europe_1997.country.nunique() 

In [None]:
# mean 
americas_1997.lifeExp.mean() 

In [None]:
americas_1997.lifeExp.median() 

In [None]:
europe_1997.lifeExp.median() 

In [None]:
europe_1997.lifeExp.mean() 

In [None]:
plt.subplot(211)
plt.title('Distribution of Life Expectancy')
plt.hist(americas_1997.lifeExp)
plt.ylabel('Americas')

plt.subplot(212)
plt.hist(europe_1997.lifeExp)
plt.ylabel('Europe')
plt.show() 

In [None]:
plt.subplot(211)
plt.title('Distribution of Life Expectancy')
plt.hist(americas_1997.lifeExp, edgecolor = 'black')
plt.ylabel('Americas')

plt.subplot(212)
plt.hist(europe_1997.lifeExp, edgecolor = 'black')
plt.ylabel('Europe')
plt.show() 

In [None]:
bins = 20
plt.subplot(211)
plt.title('Distribution of Life Expectancy')
plt.hist(americas_1997.lifeExp, edgecolor = 'black', bins = bins)
plt.ylabel('Americas')

plt.subplot(212)
plt.hist(europe_1997.lifeExp, edgecolor = 'black', bins = bins) 
plt.ylabel('Europe')
plt.show() 

In [None]:
plt.subplot(211)
plt.title('Distribution of Life Expectancy')
plt.hist(americas_1997.lifeExp, edgecolor = 'black', bins = 20, range = (55, 85))
plt.ylabel('Americas')

plt.subplot(212)
plt.hist(europe_1997.lifeExp, edgecolor = 'black', bins = 20, range = (55, 85))
plt.ylabel('Europe')
plt.show() 

In [None]:
americas_1997[americas_1997.lifeExp < 65]

## Creating Time Series with Line Charts 

### Time Series 
A time series chart is any chart that shows a trend over time.

### Compare GDP Per Capita Growth in the US and China 

In [None]:
us = df[df.country == "United States"]

In [None]:
us.head() 

In [None]:
plt.plot(us.year, us.gdpPercap)
plt.title('GDP Per Capita in the US')
plt.xlabel('Year')
plt.ylabel('GDP Per Capita')
plt.show()

In [None]:
china = df[df.country == 'China']

In [None]:
plt.plot(us.year, us.gdpPercap)
plt.plot(china.year, china.gdpPercap)
plt.show() 

In [None]:
plt.plot(us.year, us.gdpPercap)
plt.plot(china.year, china.gdpPercap)
plt.title('GDP Per Capita in the US and China')
plt.xlabel('Year')
plt.ylabel('GDP Per Capita')
plt.legend(["United States", "China"]) 
plt.show() 

## Customization
__SEE MORE:__
https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html
- Color 
- Marker 
- Marker Size 
- Line Style 
- Line width 
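
The options in this list map directly to keyword arguments of plot(). The sketch below, which uses made-up values rather than the dataset, also shows linestyle:

```python
import matplotlib.pyplot as plt

years = [2000, 2005, 2010, 2015]  # illustrative values, not from the dataset

fig, ax = plt.subplots()
# color, linestyle, linewidth, marker, and markersize are all plot() keyword arguments.
ax.plot(years, [1, 2, 3, 4], color="tab:blue", linestyle="-",
        linewidth=1, marker="o", markersize=4, label="solid")
ax.plot(years, [2, 3, 4, 5], color="tab:orange", linestyle="--",
        linewidth=2, marker="s", markersize=7, label="dashed")
ax.plot(years, [3, 4, 5, 6], color="#444444", linestyle=":",
        linewidth=3, marker="x", markersize=10, label="dotted")
ax.legend()
plt.show()
```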

In [None]:
plt.plot(china.year, china.gdpPercap, label="China",  color="b", linewidth=2, marker='o')
plt.plot(us.year, us.gdpPercap, label= "United States", color="red", linewidth=3, marker='s')
plt.xlabel('Year')
plt.ylabel('GDP Per Capita')
plt.legend() 
plt.tight_layout()
plt.show() 

In [None]:
plt.plot(china.year, china.gdpPercap, label="China",  color='#8D3DAF', linewidth=2, marker='o')
plt.plot(us.year, us.gdpPercap, label= "United States", color="red", linewidth=3, marker='x')
plt.xlabel('Year')
plt.ylabel('GDP Per Capita')
plt.legend() 
plt.tight_layout()
plt.show() 

In [None]:
plt.plot(china.year, china.gdpPercap, label="China",  color='#444444', linewidth=2, marker='o')
plt.plot(us.year, us.gdpPercap, label= "United States", color="red", linewidth=3, marker='x')
plt.xlabel('Year', fontsize=16)
plt.ylabel('GDP Per Capita', fontsize=16)
plt.legend(fontsize=16) 
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.tight_layout()
plt.show() 

In [None]:
plt.figure(figsize=(10,6))
plt.plot(china.year, china.gdpPercap, label="China",  color='#444444', linewidth=2, marker='o')
plt.plot(us.year, us.gdpPercap, label= "United States", color="red", linewidth=3, marker='x')
plt.xlabel('Year', fontsize=16)
plt.ylabel('GDP Per Capita', fontsize=16)
plt.legend(loc = "best") 
plt.tight_layout()
plt.show() 

## Subplots
In matplotlib, subplots are multiple axes arranged in a grid within a single figure. Each subplot is an independent plot with its own data, labels, and limits, but subplots can share an x-axis or a y-axis so that they are easy to compare. The subplots() function creates the figure and the grid of axes in one call.
![](../img/sphx_glr_subplots_demo_005.png)

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2)
print(ax)

In [None]:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=1)
print(ax)

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)
ax1.plot(china.year, china.gdpPercap, label="China",  color='#444444', linewidth=2, marker='o')
ax2.plot(us.year, us.gdpPercap, label= "United States", color="red", linewidth=3, marker='x')

ax1.set_xlabel('Year', fontsize=16)
ax1.set_ylabel('China', fontsize=16)

ax2.set_xlabel('Year', fontsize=16)
ax2.set_ylabel('US', fontsize=16)

plt.tight_layout()
plt.savefig("GDP.pdf") # .pdf, .png, .jpg, .jpeg, .tiff

plt.show() 

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, sharex=True)
ax1.plot(china.year, china.gdpPercap, label="China",  color='#444444', linewidth=2, marker='o')
ax2.plot(us.year, us.gdpPercap, label= "United States", color="red", linewidth=3, marker='x')
ax1.set_ylabel('China', fontsize=16)

ax2.set_xlabel('Year', fontsize=16)
ax2.set_ylabel('US', fontsize=16)

plt.tight_layout()
plt.savefig("GDP.pdf") # .pdf, .png, .jpg, .jpeg, .tiff

plt.show() 

In [None]:
df = pd.read_excel('../data/LungCapData.xls')

In [None]:
df.head() 

## Boxplots

In [None]:
plt.boxplot(df['Age'])
plt.show() 

## Bar Charts 

In [None]:
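# plot the mean lung capacity per gender; passing every row would draw overlapping bars
mean_lungcap = df.groupby('Gender')['LungCap'].mean()
plt.bar(mean_lungcap.index, mean_lungcap.values)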
plt.show() 

## Scatter Plot 

In [None]:
plt.scatter(df['Age'], df['LungCap'])
plt.show() 

In [None]:
df.corr(numeric_only=True)  # skip non-numeric columns such as Gender (pandas 1.5+)

## Matrix Plot 

In [None]:
plt.imshow(df.corr(numeric_only=True), cmap="viridis")
plt.colorbar()
plt.show()

## Exploring Data with Pandas and Matplotlib


In [None]:
import pandas as pd 
from pandas.plotting import scatter_matrix
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

# set options 
sns.set_style('whitegrid')
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)
%matplotlib inline 

In [None]:
df = pd.read_csv('../data/diabetes.csv')

In [None]:
# examine first few rows 
df.head() 


## Univariate Plots
- Histograms
- Density Plot
- Box and Whisker Plots

### Histograms
Four main aspects:

- Shape: Overall appearance of histogram. Can be symmetric, bell-shaped, left skewed, right skewed, etc.

- Center: Mean or Median

- Spread: How far the data spreads. Measured by the range, interquartile range (IQR), standard deviation, or variance.

- Outliers: Data points that fall far from the bulk of the data
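
The four aspects can also be put into numbers. The sketch below uses synthetic right-skewed data (not the diabetes dataset) and the common 1.5 × IQR rule for flagging outliers. The rule is a convention assumed here, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample, used only for illustration.
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10.0, size=1000))

# Shape: positive skewness means a longer right tail.
skewness = values.skew()

# Center: mean and median.
mean = values.mean()
median = values.median()

# Spread: standard deviation and interquartile range.
std = values.std()
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Outliers: points beyond 1.5 * IQR from the quartiles (a common convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"skew={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
print(f"std={std:.2f}, IQR={iqr:.2f}, outliers={len(outliers)}")
```

For right-skewed data like this, the mean is pulled above the median, which is one reason to look at the histogram rather than a single summary number.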

In [None]:
df.hist(figsize=(20,10), edgecolor='black')
plt.show() 

## Density Plots
- Density plots are another way of getting a quick idea of the distribution of each attribute.
- The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much as your eye tries to do with a histogram.
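
To see where the smooth curve comes from, here is a hand-rolled Gaussian kernel density estimate on synthetic data. This is only a sketch: pandas relies on SciPy's gaussian_kde for its density plots, while the bandwidth below is a simple rule of thumb (1.06 · σ · n^(-1/5)) chosen for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample, used only for illustration.
rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=10.0, size=500)

# Rule-of-thumb bandwidth (an assumption of this sketch).
n = sample.size
bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)

# Place a small Gaussian bump on every data point and average them.
grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density curve encloses an area of about 1 (trapezoid rule).
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

fig, ax = plt.subplots()
ax.hist(sample, bins=20, density=True, alpha=0.4, edgecolor="black")
ax.plot(grid, density, linewidth=2)
ax.set(xlabel="value", ylabel="density")
plt.show()
```

With density=True the histogram is scaled to unit area, so the bars and the curve share the same y-axis.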

In [None]:
df.plot(kind='density', subplots=True, sharex=False, figsize=(20,10), layout=(3, 3))
plt.show()

## Boxplots
- Boxplots provide a graphical picture of the five-number summary: they show the center (median) and spread (IQR and range), and they identify potential outliers.

- Boxplots can hide some aspects of shape (histograms do a better job of displaying shape).

- Side-by-side boxplots are useful for comparing two or more sets of observations.
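
The five-number summary behind a boxplot can be computed directly. The sketch below uses a small made-up array with one planted outlier and the 1.5 × IQR rule, which matches matplotlib's default whisker setting (whis=1.5).

```python
import numpy as np

# Hypothetical data; the value 30 is planted as an outlier.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# Five-number summary: minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```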

In [None]:
df.plot(kind='box', subplots=True, sharex=False, figsize=(20,10), layout=(3,3))
plt.show() 

## Multivariate Plots
- Correlation Matrix Plot
- Scatter Plot

## Correlation Matrix Plot
Correlation gives an indication of how related the changes are between two variables.

- If two variables change in the same direction they are positively correlated.

- If they change in opposite directions together (one goes up, one goes down), then they are negatively correlated.

- This is useful to know, because some machine learning algorithms like linear and logistic regression can have poor performance if there are highly correlated input variables in your data.
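
A short sketch of these cases, using synthetic columns (the column names are invented for this example) and the 0.9 threshold as an arbitrary cutoff for "highly correlated":

```python
import numpy as np
import pandas as pd

# Synthetic columns built to move with, against, or independently of x.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "same_direction": 2 * x + rng.normal(scale=0.5, size=200),
    "opposite_direction": -3 * x + rng.normal(scale=0.5, size=200),
    "unrelated": rng.normal(size=200),
})

# Pearson correlation between every pair of columns.
corr = demo.corr()
print(corr.round(2))

# List the pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.9
cols = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping or combining before fitting a model that is sensitive to correlated inputs.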

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True)
plt.show()

### Scatter Plot
A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute.



In [None]:
df.plot(kind='scatter', x="BMI", y="BloodPressure", alpha=0.3)
plt.show() 