# Data Programming in Python | BAIS:6040
# Data Visualization

Instructor: Jeff Hendricks

Topics to be covered:
- Visualization Using Matplotlib and Pandas
- Visualization of Time Series Data
- Visualization Using Seaborn
- Interactive Visualization Using Plotly & Cufflinks
- Exercises

References: 
- Seaborn (https://seaborn.pydata.org/)
- Plotly (https://plotly.com/python/)
- Cufflinks (https://plotly.com/python/pandas-backend/)
- Major League Baseball data from SeanLahman.com (http://www.seanlahman.com/baseball-archive/statistics/)
- Pandas Visualization (https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) 
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)

## Importing Modules

In [None]:
from IPython.display import Image                   # image display
import matplotlib.pyplot as plt                     # visualization
import numpy as np                                  # random number generation
import pandas as pd                                 # dataframes

# inline display of plots
%matplotlib inline

## Basic Matplotlib

Matplotlib can create static figures and save them as bitmap images (PNG, JPG) or as vector files such as PDF.

- plt.figure() figure properties
- plt.axis() axis properties
- plt.grid() grid properties
- plt.plot() plot function with x and y arguments
- plt.xlabel(), plt.ylabel() labels for respective axes
- plt.title() title for the overall plot

https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
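
As a minimal sketch of how the functions listed above fit together, the cell below draws one curve and sets the figure, axis, grid, label, and title properties. The sine data and the names `x` and `y_sin` are illustrative assumptions, not part of the course data.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)      # 50 evenly spaced x values (illustrative data)
y_sin = np.sin(x)

fig = plt.figure(figsize=(8, 4))       # figure properties
plt.plot(x, y_sin)                     # plot function with x and y arguments
plt.axis([0, 2 * np.pi, -1.5, 1.5])    # axis limits: [xmin, xmax, ymin, ymax]
plt.grid(True)                         # turn the grid on
plt.xlabel('x')                        # label for the x axis
plt.ylabel('sin(x)')                   # label for the y axis
plt.title('Sine Curve')                # title for the overall plot
plt.show()
```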

### Using Style Sheets

The style package adds support for easy-to-switch plotting "styles"

https://matplotlib.org/tutorials/introductory/customizing.html

https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html
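
As a small sketch of how style switching might be used, the cell below lists the installed styles and applies one temporarily with a context manager, so the change does not leak into later cells. The 'ggplot' style name and the sample values are illustrative choices.

```python
import matplotlib.pyplot as plt

# names of all styles available in this Matplotlib installation
styles = plt.style.available

# apply a style only inside the "with" block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3], [2, 4, 3])
    ax.set_title('Temporary ggplot style')
```

The set of available style names depends on the Matplotlib version, so printing `styles` first is a quick way to check what can be passed to `plt.style.use()`.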

In [None]:
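# set the plotting style to seaborn
# (Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'; fall back to the old name on older versions)
style = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style)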

### Line Properties

- Color Abbreviations
   - r = red
   - g = green
   - b = blue
   - k = black
- Character Symbols
   - o = circle marker
   - v = triangle_down marker
- Line width

https://matplotlib.org/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D
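
The color abbreviations, marker symbols, and line width can be combined in a single plot call. The sketch below uses made-up data (`values`) and a format string such as 'kv--', which reads as black, triangle_down markers, dashed line.

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.arange(10)   # illustrative data

fig, ax = plt.subplots(figsize=(6, 4))
line_red, = ax.plot(values, values, 'r-', lw=2.5)        # red solid line, width 2.5
line_green, = ax.plot(values, values + 2, 'go')          # green circle markers, no connecting line
line_black, = ax.plot(values, values + 4, 'kv--', lw=1)  # black triangle_down markers, dashed line
plt.show()
```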

In [None]:
# create a 2-dimensional array with 3 columns of data
# each column is a cumulative sum along the rows (axis 0)
y = np.random.standard_normal((20,3)).cumsum(axis=0)

In [None]:
y.shape

In [None]:
# pyplot automatically considers each column (axis 1 dimension) as a separate data series
plt.figure(figsize=(10,10))
plt.plot(y,lw=1.5)
plt.plot(y,'ro')  # adds a red circle symbol
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Basic Plot')
plt.show()

In [None]:
# explicitly assign a data subset to a line and label them
# you can add a legend and set the location
plt.figure(figsize=(10,10))
plt.plot(y[:,0],lw=1.5, label='Line 1')
plt.plot(y[:,1],lw=1.5, label='Line 2')
plt.plot(y[:,2],lw=1.5, label='Line 3')
plt.plot(y,'ro')  # adds a red circle symbol
plt.legend(loc=0)  # sets the location of the legend to the best possible
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Basic Plot')
plt.show()

## Basics of Pandas and Matplotlib Visualization

The <b>matplotlib.pyplot.plot</b> is a function of Matplotlib, while the <b>pandas.Series.plot</b> and <b>pandas.DataFrame.plot</b> are methods of Pandas. 

The <b>matplotlib.pyplot.plot</b> function is used to plot data. The <b>plot</b> method on series and dataframes is a thin wrapper around Matplotlib's plotting functions, so you can call <b>plot</b> on a series or a dataframe without explicitly calling <b>matplotlib.pyplot.plot</b>. Matplotlib must still be installed, and you still need to import <b>matplotlib.pyplot</b> for calls such as <b>plt.show()</b>.
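
One way to see the wrapper relationship is to look at what the <b>plot</b> method returns. In the sketch below (with made-up values), the result is a Matplotlib Axes object, so the usual Matplotlib methods can be applied to it afterwards.

```python
import matplotlib.axes
import pandas as pd

# Series.plot returns the Matplotlib Axes it drew on
ax = pd.Series([1, 3, 2]).plot(kind="line")

# the returned object accepts ordinary Matplotlib calls
ax.set_title("Returned Axes")
is_axes = isinstance(ax, matplotlib.axes.Axes)
print(is_axes)
```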

In [None]:
#import matplotlib.pyplot as plt                     
import numpy as np                                  
import pandas as pd

series = pd.Series(np.random.randint(1, 101, 10))   # from an array with 10 random integers between 1 and 100 
series

In [None]:
series.plot(kind="line", grid=True)
plt.show()

pandas.Series.plot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html

When drawing a plot on a Pandas series using the <b>plot</b> method, the x axis is the index of the series, while the y axis is its values.

`kind`: str
- line: line plot (default)
- bar: vertical bar plot
- barh: horizontal bar plot
- hist: histogram
- box: boxplot
- kde: Kernel Density Estimation plot
- density: same as ‘kde’
- area: area plot
- pie: pie plot
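
To illustrate how the index and the values map onto the axes, the sketch below plots a small series with a string index as a bar chart. The series `scores` and its values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# the index supplies the x-axis labels, the values supply the bar heights
scores = pd.Series([3, 7, 5], index=["x", "y", "z"])
ax = scores.plot(kind="bar", grid=True)
plt.show()
```

Changing `kind` to any of the values listed above (for example "barh" or "pie") draws the same data in a different form.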

In [None]:
df = pd.DataFrame(np.random.randint(1, 101, (10,3)),  # from a 10 x 3 array with random integers between 1 and 100
                  columns=["a", "b", "c"])
df

In [None]:
df.plot(kind="bar", grid=True)
plt.show()

pandas.DataFrame.plot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

When drawing a plot on a Pandas dataframe using the <b>plot</b> method, the x axis is the index of the dataframe, while the y axis is the values of the columns in the dataframe. In this example, each index position has three bars that correspond to the three columns.

In [None]:
df[["b", "c"]].plot(kind="line", grid=True)

You can select some of the columns you are interested in. 

In [None]:
df[["b", "c"]].plot(kind="line", grid=True)
plt.show()                                        # Suppresses the text representation of the Axes object printed above the plot.

## Visualization of Baseball Data

In [None]:
import matplotlib.pyplot as plt 
import pandas as pd

In [None]:
dfb = pd.read_csv("data/MLB_Batting.csv")

In [None]:
dfb.info()

Each row, or record, is a batter. 

In [None]:
dfb.head()

In [None]:
dfb.yearID.value_counts(ascending=False)     # Count the number of rows, or records, by year.

pandas.Series.value_counts: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html

In [None]:
dfb.lgID.value_counts(ascending=False)

In [None]:
dfb18 = dfb[(dfb.yearID == 2018) & ((dfb.lgID == "NL") | (dfb.lgID == "AL"))]

We would like to select the rows in which the <i>yearID</i> is 2018 and the <i>lgID</i>  is either 'NL' or 'AL'.
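
The same filter can be written with <b>isin()</b>, which may be easier to read when a column is compared against several values. The sketch below uses a small made-up dataframe (`demo`) rather than the real batting data.

```python
import pandas as pd

demo = pd.DataFrame({
    "yearID": [2017, 2018, 2018, 2018],
    "lgID":   ["NL", "AL", "NL", "AA"],
    "H":      [10, 20, 30, 40],
})

# the filter as written above: year is 2018 AND (league is NL OR league is AL)
mask_or = (demo.yearID == 2018) & ((demo.lgID == "NL") | (demo.lgID == "AL"))

# equivalent filter using isin() for the league test
mask_isin = (demo.yearID == 2018) & demo.lgID.isin(["NL", "AL"])

selected = demo[mask_isin]
print(selected)
```

The parentheses around each comparison are required because `&` and `|` bind more tightly than `==` in Python.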

In [None]:
dfb18.shape

In [None]:
dfb18.info()

In [None]:
dfb18.head()

In [None]:
dfb18.tail()

In [None]:
dfb18.H.plot(kind="line", figsize=(15,7), legend=True, grid=True)
plt.show()

In [None]:
dfb18['H'].plot(kind="line", figsize=(15,7), legend=True, grid=True)
plt.show()

The x axis is the index of the series, which is the index of the dataframe, while the y axis is the values of the series. 

In [None]:
dfb18.H.plot(kind="hist", bins=30, grid=True, figsize=(15,7))
plt.show()

A histogram represents the distribution of data. The function groups the values of a series into bins, counts the values in each bin, and then draws a histogram with the bins on the x axis and their counts on the y axis.

Most of the values fall in the first bin, which covers roughly 0 to 5 hits, so many batters recorded at most about 5 hits in the 2018 season.

In [None]:
dfb18.H.plot(kind="hist", bins=30, cumulative=True, grid=True, figsize=(15,7))
plt.show()

A cumulative histogram is a cumulative representation of the distribution of data.
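
The binning and counting behind both kinds of histogram can be sketched with NumPy. The `hits` array below is made up for illustration; `np.histogram` returns the count per bin and the bin edges, and a running sum of the counts gives the heights of the cumulative histogram.

```python
import numpy as np

hits = np.array([0, 1, 2, 3, 4, 12, 18, 25, 31, 39])   # made-up hit totals

# split the range 0..39 into 4 equal-width bins and count values per bin
counts, edges = np.histogram(hits, bins=4)

# cumulative counts: each bin includes everything in the bins before it
cumulative = np.cumsum(counts)

print(edges)        # bin boundaries
print(counts)       # values per bin
print(cumulative)   # running total; the last entry equals len(hits)
```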

In [None]:
dfb18[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", grid=True, figsize=(15,7))
plt.show()

A box plot graphically depicts groups of numerical data through their quartiles (Q1, Q2, and Q3). The box extends from Q1 to Q3, with a line at the median (Q2). The whiskers extend from the edges of the box to the most extreme data points that lie within 1.5 * IQR (IQR = Q3 - Q1) of the box, which is the default setting. Points beyond the ends of the whiskers are drawn as outliers.
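
The quantities a box plot draws can be computed by hand. The sketch below uses a made-up array (`values`) and NumPy's default linear-interpolation percentiles; it is meant to illustrate the definitions, not to reproduce Matplotlib's internals exactly.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # made-up data with one extreme value

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# limits beyond which points count as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# whiskers end at the most extreme data points inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, q2, q3, iqr, whisker_low, whisker_high, outliers)
```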

In [None]:
Image(url="http://www.datavizcatalogue.com/methods/images/anatomy/box_plot.png")

In [None]:
dfb18[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", vert=False, grid=True, figsize=(15,7))
plt.show()

In [None]:
dfb18.plot(kind='scatter', x='H', y='HR', figsize=(10,10))
plt.show()

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables - one plotted along the x axis and the other plotted along the y axis. This kind of plot is useful to see complex correlations between two variables.

In [None]:
dfb18.plot(kind='scatter', x='HR', y='SO', figsize=(10,10))
plt.show()

In [None]:
pd.plotting.scatter_matrix(dfb18[["AB", "H", "2B", "3B", "HR", "BB", "SO"]], figsize=(10,10), diagonal="hist")
plt.show()

pandas.plotting.scatter_matrix: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html

A scatter matrix is a grid of pairwise scatter plots of several variables. It can be used to see whether any two variables are correlated and whether the correlation is positive or negative. 
- Positive correlation: as one variable increases so does the other
- Negative correlation: as one variable increases, the other decreases
- No correlation: there is no apparent relationship between the two variables

In [None]:
dfb18[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].corr()

pandas.DataFrame.corr: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

The <b>corr()</b> method computes the pairwise correlation of columns. The closer the correlation coefficient is to 1, the stronger the positive correlation is. Likewise, the closer it is to -1, the stronger the negative correlation is. 

`method`: {'pearson', 'kendall', 'spearman'}
- pearson: standard correlation coefficient (default)
- kendall: Kendall Tau correlation coefficient
- spearman: Spearman rank correlation
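
The difference between the methods shows up when a relationship is monotonic but not linear. In the sketch below, the dataframe `demo` and its columns are made up: `up` grows with `x` as a cube, and `down` falls linearly.

```python
import pandas as pd

demo = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
demo["up"] = demo.x ** 3           # increases with x, but not linearly
demo["down"] = -2 * demo.x + 10    # decreases linearly with x

pearson = demo.corr(method="pearson")     # measures linear association
spearman = demo.corr(method="spearman")   # measures association between ranks

print(pearson.round(3))
print(spearman.round(3))
```

Pearson reports a perfect negative correlation for the linear column but falls short of 1 for the cubic one, while Spearman treats both as perfectly monotonic.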

In [None]:
series = dfb18.groupby("teamID").HR.sum()
series

In [None]:
series = series.sort_values(ascending=False)
series

In [None]:
series.plot(kind="bar", title="Home Runs by Team", figsize=(15,5))

In [None]:
series.plot(kind="barh", title="Home Runs by Team", figsize=(10,10))

In [None]:
series = dfb18.groupby("lgID").H.sum()
series

In [None]:
series.plot(kind="pie", title="Hits: AL vs. NL", figsize=(5,5), autopct='%.1f', fontsize=13)

## Visualization of Time Series Data 

In [None]:
import pandas as pd

# squeeze("columns") converts the one-column dataframe to a Series
# (the squeeze=True parameter of read_csv was deprecated in pandas 1.4 and removed in 2.0)
series = pd.read_csv('data/daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze("columns")

series.head()

In [None]:
series.plot(kind="line", title="Minimum Temp over Time", figsize=(15,7))
plt.show()

In [None]:
series.loc['1990-1-1 00:00:00':]

In [None]:
series.loc['1990-1-1':]

In [None]:
series.loc['1990-1-1':'1990-6-30']

In [None]:
series.loc['1990-1-1':'1990-6-30'].plot(kind="line", title="Minimum Temp over Time", figsize=(15,7))
plt.show()
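
The slices above rely on label-based selection over a DatetimeIndex. A minimal sketch on a tiny made-up series (`temps`) shows that both ends of a date slice are included and that a year string selects every row in that year.

```python
import pandas as pd

idx = pd.date_range("1989-12-30", periods=6, freq="D")   # 1989-12-30 through 1990-01-04
temps = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=idx)

from_1990 = temps.loc["1990-1-1":]             # open-ended slice: everything from this date on
two_days = temps.loc["1990-1-1":"1990-1-2"]    # both endpoints are included
only_1990 = temps.loc["1990"]                  # partial string: all rows in 1990

print(from_1990)
print(two_days)
print(only_1990)
```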

## Plotting with Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

- Import of matplotlib required
- Specializes in statistical visualizations
- Richer visuals than matplotlib

https://seaborn.pydata.org/tutorial.html

### Seaborn figure styles
There are five preset seaborn themes: darkgrid, whitegrid, dark, white, and ticks. They are each suited to different applications and personal preferences. The default theme is darkgrid. 

https://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles

### Building color palettes

The most important function for working with discrete color palettes is color_palette(). This function provides an interface to many (though not all) of the possible ways you can generate colors in seaborn, and it’s used internally by any function that has a palette argument (and in some cases for a color argument when multiple colors are needed).

https://seaborn.pydata.org/tutorial/color_palettes.html
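
As a brief sketch of the two ideas above, the cell below selects one of the preset styles and builds a three-color palette. The choice of "whitegrid" and "Set2" is illustrative.

```python
import seaborn as sns

# choose one of the five preset styles
sns.set_style("whitegrid")

# request the first three colors of the Set2 palette as RGB tuples
palette = sns.color_palette("Set2", 3)

# the same colors as hex strings, handy for passing to other libraries
hex_codes = palette.as_hex()
print(hex_codes)
```

The resulting `palette` can be passed to the `palette` argument of seaborn plotting functions.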

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sns.set(style='ticks', palette='Set2')

In [None]:
df = sns.load_dataset('titanic')[["survived", "pclass", "sex", "age", "fare"]].dropna()
df['survive'] = df.survived.astype(bool)
# Drop outliers. This is to help the visualization in the next examples.
df = df[df.fare < 400]
df.head()

### Seaborn Boxplot

https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="survive", y="fare", hue='sex', width=0.6, data=df)
plt.show()

In [None]:
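def createBoxPlot(df, x, y):
    sns.set(style="whitegrid")
    # median of y for each group of x (groupby sorts the group labels)
    medians = df.groupby(x)[y].median()
    # pass the same group order to boxplot so each box lines up with its median label
    p = sns.boxplot(x=x, y=y, data=df, order=list(medians.index))
    for tick, m in enumerate(medians.values):
        # write the rounded median next to the median line of each box
        p.text(tick-.2, m, str(np.round(m, 2)), horizontalalignment='center', color='w', weight='semibold')
    plt.show()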

In [None]:
createBoxPlot(df,'sex','fare')

### Seaborn Distribution Plot

The distplot() function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions.

- bins : specify the number of bins
- rug : create the rugplot or not
- kde : whether to plot a gaussian kernel density estimate

https://seaborn.pydata.org/generated/seaborn.distplot.html#seaborn.distplot

In [None]:
# note: distplot is deprecated since seaborn 0.11 and emits a warning on newer versions
sns.distplot(df.fare, rug=True, kde=True)
plt.show()

### Correlation Plot using Seaborn Heatmap

In [None]:
def createCorrelationPlot(df):
    sns.set(style="white")
    # Compute the correlation matrix once and reuse it
    corr = df.corr()

    # Generate a mask for the upper triangle
    # (np.bool was removed in NumPy 1.24; the built-in bool works on all versions)
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

In [None]:
createCorrelationPlot(df[["survive", "pclass", "age", "fare"]])
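
To see what the `mask` argument receives, here is the upper-triangle mask for a 3 x 3 matrix. Cells where the mask is True are hidden by heatmap(), so only the entries below the diagonal are drawn; the diagonal itself is masked as well.

```python
import numpy as np

# True on and above the diagonal, False below it
mask = np.triu(np.ones((3, 3), dtype=bool))
print(mask)

# number of hidden cells: 3 on the diagonal + 3 above it
hidden = int(mask.sum())
print(hidden)
```

Because a correlation matrix is symmetric, hiding the upper triangle removes only duplicated information.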

## Interactive Visualization with Plotly Express

Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures.

https://plotly.com/python/plotly-express/

Plotly is a library for creating interactive plots; its rendering engine, plotly.js, is built on D3.js. 

- Allows for zooming in and out
- Dedicated to visualization for data science and tightly integrated with the Python ecosystem

https://plotly.com/python/

### Plotly Express

In [None]:
#!pip install plotly   # Plotly Express ships as part of the plotly package

In [None]:
data=np.random.standard_normal((360,3)).cumsum(axis=0)

index = pd.date_range('2018-1-1', freq='D', periods=len(data))

df = pd.DataFrame(data = 10+5*data
             ,index=index
             ,columns=['A','B','C'])
df.head()

In [None]:
import plotly.express as px

fig = px.histogram(df, x='A', nbins=10,
                   marginal="box", # or violin, rug
                   hover_data=df.columns)
fig.show()

In [None]:
import plotly.express as px

fig = px.line(df, x=df.index, y=['A','B'])
fig.show()

In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x='A', y='B', z='C')
fig.show()

In [None]:
import seaborn as sns

df_titanic = sns.load_dataset('titanic')[["survived", "pclass", "sex", "age", "fare"]].dropna()
df_titanic['survive'] = df_titanic.survived.astype(bool)
# Drop outliers. This is to help the visualization in the next examples.
df_titanic = df_titanic[df_titanic.fare < 400]
df_titanic.head(2)

In [None]:
import plotly.express as px

fig = px.histogram(df_titanic, x='fare', nbins=10, color='pclass',
                   marginal="violin", # or box, rug
                   hover_data=df_titanic.columns)
fig.show()

## Exercises for Visualization (6 questions)

Let's continue to use <i>dfb</i> for our dataframe.

1\. Draw a histogram that plots the distribution of <i>Walks in 2017 for the National League</i> with 10 bins. Set `figsize` to 15 x 5.

In [None]:
# Your answer here


2\. Create the same histogram as #1 above using Seaborn and include a rugplot and kernel density estimate.

In [None]:
# Your answer here


3\. Draw a scatter plot where the x axis is <i>Walks in 2017</i> and the y axis is <i>Strikeouts in 2017</i>. Set the `figsize` to 10 x 10. 

In [None]:
# Your answer here


4\. Draw a box plot that displays the distributions of both <i>Hits in 2017</i> and <i>Home Runs in 2017</i>. Set the `title` to 'Hitting Distribution' and the `figsize` to 10 x 10. 

In [None]:
# Your answer here


5\. Create a Seaborn boxplot for the distribution of home runs in 2017 and 2018 in the American and National Leagues. The hue should be based on the league and the x axis represents the year.

In [None]:
# Your answer here


6\. Using plotly express, draw an interactive line plot of the minimum temperature data above for the year 1990.

In [None]:
# Your answer here
