# Data Programming in Python | BAIS:6040
# Module 8. Data Visualization with Matplotlib and Ipywidgets

Written by Kang-Pyo Lee 

Topics to be covered:
- Data Visualization Using Pandas
- Plotting Basic Plots
- Interactive Visualization Using Ipywidgets
- Exercises

In [None]:
# ! pip install --user --upgrade  ipywidgets matplotlib
# ! jupyter nbextension enable --py widgetsnbextension

## Data Visualization Using Pandas

The <b>pandas.Series.plot</b> and <b>pandas.DataFrame.plot</b> methods provide basic but convenient visualization functionality. They are, in fact, simple wrappers around <b>matplotlib.pyplot.plot</b> in the matplotlib package.

In [None]:
import numpy as np
import pandas as pd 

np.random.seed(0)
series = pd.Series(np.random.randint(1, 101, 10))   # from an array with 10 random integers between 1 and 100 
series

In [None]:
series.plot(kind="line")

pandas.Series.plot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html

When you call the <b>plot</b> method on a pandas series, the x axis shows the index of the series and the y axis shows its values.

`kind`: str
- line: line plot (default)
- bar: vertical bar plot
- barh: horizontal bar plot
- hist: histogram
- scatter: scatter plot (DataFrame only)
- box: boxplot
- kde: Kernel Density Estimation plot
- density: same as ‘kde’
- area: area plot
- pie: pie plot
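
As a minimal sketch of how the index ends up on the x axis, the following uses a small made-up series with a custom integer index and inspects the line that `plot` draws. The data, the variable names, and the `Agg` backend (chosen so the code also runs outside a notebook) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; not needed inside a notebook
import pandas as pd

# A made-up series whose index is 10, 20, 30 rather than the default 0, 1, 2.
s = pd.Series([5, 7, 6], index=[10, 20, 30])

ax = s.plot(kind="line")           # returns a matplotlib Axes object
line = ax.get_lines()[0]           # the single line drawn for the series

x_values = [int(x) for x in line.get_xdata()]   # taken from the index
y_values = [int(y) for y in line.get_ydata()]   # taken from the values
print(x_values, y_values)
```

Changing `kind` to another value from the list above switches the plot type without changing how the index and values are mapped.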

In [None]:
series.plot(kind="line", title="Line Chart", grid=True, figsize=(10,5))

In [None]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 101, (10,3)),  # from a 10 x 3 array with random integers between 1 and 100
                  columns=["a", "b", "c"])
df

In [None]:
df.plot(kind="line")

pandas.DataFrame.plot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

When you call the <b>plot</b> method on a pandas dataframe, the x axis shows the index of the dataframe and the y axis shows the values of its columns. In this example, there are 3 lines, one for each of the 3 columns.

In [None]:
df[["b", "c"]].plot(kind="line")

You can select some of the columns you are interested in. 
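
The following sketch checks, on a small made-up dataframe, that one line is drawn per column and that selecting columns first reduces the number of lines. The name `df_demo` and the `Agg` backend are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; not needed inside a notebook
import pandas as pd

# A made-up dataframe with three numeric columns.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1], "c": [2, 2, 2]})

ax_all = df_demo.plot(kind="line")                # one line per column
ax_sub = df_demo[["b", "c"]].plot(kind="line")    # only the selected columns

# Each line is labeled with the name of the column it was drawn from.
labels_all = [line.get_label() for line in ax_all.get_lines()]
labels_sub = [line.get_label() for line in ax_sub.get_lines()]
print(labels_all, labels_sub)
```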

<hr>

## Plotting Basic Plots

Data source: SeanLahman.com (http://www.seanlahman.com/baseball-archive/statistics/)

In [None]:
dfb = pd.read_csv("classdata/MLB_Batting.csv")
dfb

Each row refers to a batter playing in MLB. 

In [None]:
dfb.info()

In [None]:
dfb.head()

In [None]:
dfb.tail()

In [None]:
dfb.yearID.value_counts()                    # Count the number of rows by year.

pandas.Series.value_counts: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html

In [None]:
dfb.yearID.value_counts().sort_index()

In [None]:
dfb.lgID.value_counts()                       # Count the number of rows by league.

In [None]:
dfb19 = dfb[(dfb.yearID == 2019) & ((dfb.lgID == "NL") | (dfb.lgID == "AL"))]
dfb19

Here we select the rows in which `yearID` is 2019 and `lgID` is either *NL* or *AL*.
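
A minimal sketch of the same kind of row selection on made-up data, with an alternative spelling that uses `isin`. The dataframe `toy` and its values are hypothetical; only the column names mirror the batting data.

```python
import pandas as pd

# Hypothetical rows that mimic the yearID and lgID columns of the batting data.
toy = pd.DataFrame({
    "yearID": [2018, 2019, 2019, 2019],
    "lgID":   ["NL", "AL", "NL", "XX"],
})

# Each comparison is wrapped in parentheses because & and | bind more
# tightly than == in Python.
mask_or = (toy.yearID == 2019) & ((toy.lgID == "NL") | (toy.lgID == "AL"))

# An equivalent mask using isin, which stays readable with many allowed values.
mask_isin = (toy.yearID == 2019) & toy.lgID.isin(["NL", "AL"])

selected = toy[mask_isin]
print(selected)
```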

In [None]:
dfb19.shape

In [None]:
dfb19.info()

In [None]:
dfb19.H.plot(kind="line", figsize=(15,7), grid=True, legend=True)

The x axis is the index of the series, which is inherited from the dataframe, while the y axis shows the values of the series.

In [None]:
dfb19.H.plot(kind="hist", bins=30, grid=True, figsize=(15,7))

A histogram is a representation of the distribution of data. The function groups the values of a series into bins, counts the values in each bin, and then plots a histogram with the bins on the x axis and their counts on the y axis.

Many of the values are in the first bin, which contains values from 0 to 5 or so. This means many batters had at most about 5 hits in the 2019 season.
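
The binning step can be sketched directly with `numpy.histogram`, which returns the counts and the bin edges without drawing anything. The hit totals below are made up for illustration and are not taken from the batting data.

```python
import numpy as np

# Made-up hit totals for ten batters.
hits = np.array([0, 1, 2, 3, 4, 10, 12, 25, 28, 30])

# Three equal-width bins spanning the minimum (0) to the maximum (30).
counts, edges = np.histogram(hits, bins=3)

print(counts)   # number of values falling into each bin
print(edges)    # the four boundaries that delimit the three bins
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.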

In [None]:
dfb19.H.plot(kind="hist", bins=30, cumulative=True, grid=True, figsize=(15,7))

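A cumulative histogram shows the running total of the distribution: the height of each bin is the number of values in that bin plus all the bins to its left, so the last bin reaches the total number of values.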
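
As a rough sketch, the cumulative counts are the running sum of the per-bin counts. The data below are made up for illustration.

```python
import numpy as np

# Made-up hit totals for ten batters.
hits = np.array([0, 1, 2, 3, 4, 10, 12, 25, 28, 30])

counts, edges = np.histogram(hits, bins=3)   # per-bin counts
cumulative_counts = np.cumsum(counts)        # running total across bins

print(counts)
print(cumulative_counts)
```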

In [None]:
from IPython.display import Image

Image(url="http://www.datavizcatalogue.com/methods/images/anatomy/box_plot.png")

A box plot is a method for graphically depicting groups of numerical data through their quartiles (Q1, Q2, and Q3). The box extends from Q1 to Q3, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data. By default, each whisker reaches the farthest data point that lies within 1.5 * IQR (IQR = Q3 - Q1) of the edge of the box. Points beyond the ends of the whiskers are drawn individually as outliers.
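
The quantities behind a box plot can be sketched with NumPy. The values below are made up, and the quartiles use NumPy's default linear interpolation, which is assumed here to match what the plot shows.

```python
import numpy as np

# Made-up values with one unusually large observation.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q2, q3 = np.percentile(values, [25, 50, 75])   # quartiles
iqr = q3 - q1                                      # interquartile range

# Fences: the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is an outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q2, q3, iqr)
print(whisker_low, whisker_high, outliers)
```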

In [None]:
dfb19[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", grid=True, figsize=(15,7))

In [None]:
dfb19[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", vert=False, grid=True, figsize=(15,7))

In [None]:
dfb19.plot(kind='scatter', x='H', y='HR', grid=True, figsize=(10,10))

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables - one plotted along the x axis and the other plotted along the y axis. This kind of plot is useful for visualizing correlations between two variables. 

- Positive correlation: as one variable increases so does the other (dots spreading from bottom left to top right)
- Negative correlation: as one variable increases, the other decreases (dots spreading from top left to bottom right)
- No correlation: there is no apparent relationship between the two variables (dots randomly spreading)

In [None]:
dfb19.plot(kind='scatter', x='HR', y='SO', grid=True, figsize=(10,10))

In [None]:
pd.plotting.scatter_matrix(dfb19[["AB", "H", "2B", "3B", "HR", "BB", "SO"]], figsize=(10,10), diagonal="hist")

pandas.plotting.scatter_matrix: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html

A scatter matrix is a pair-wise scatter plot of several variables presented in a matrix format. 

In [None]:
dfb19[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].corr()

pandas.DataFrame.corr: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

The <b>corr()</b> method computes the pairwise correlation of columns. The closer the correlation coefficient is to 1, the stronger the positive correlation is. Likewise, the closer it is to -1, the stronger the negative correlation is. A coefficient near 0 indicates little or no linear relationship.

`method`: {'pearson', 'kendall', 'spearman'}
- pearson: standard correlation coefficient (default)
- kendall: Kendall Tau correlation coefficient
- spearman: Spearman rank correlation
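
The difference between the Pearson and Spearman methods can be sketched on made-up data. The dataframe `toy` and its column names are hypothetical. The `cubed` column rises with `x` but not along a straight line, and `negated` is a perfect mirror image of `x`.

```python
import pandas as pd

x = [1, 2, 3, 4, 5]
toy = pd.DataFrame({
    "x": x,
    "cubed": [v ** 3 for v in x],    # increases with x, but not linearly
    "negated": [-v for v in x],      # decreases exactly as x increases
})

pearson = toy.corr(method="pearson")     # measures linear association
spearman = toy.corr(method="spearman")   # measures association between ranks

print(pearson.loc["x", "cubed"], spearman.loc["x", "cubed"])
print(pearson.loc["x", "negated"])
```

Because Spearman works on ranks, any strictly increasing relationship gets a coefficient of 1, while Pearson stays below 1 unless the relationship is a straight line.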

In [None]:
count = dfb19.groupby("teamID").HR.sum()
count

In [None]:
count = count.sort_values(ascending=False)
count

In [None]:
count.plot(kind="bar", title="Home Runs by Team", grid=True, figsize=(15,5))

In [None]:
count.plot(kind="barh", title="Home Runs by Team", grid=True, figsize=(10,10))

In [None]:
count = dfb19.groupby("lgID").H.sum()
count

In [None]:
count.plot(kind="pie", title="Hits: AL vs. NL", figsize=(5,5), autopct='%.1f', fontsize=13)

<hr>

## Interactive Visualization Using Ipywidgets

Interactive visualization allows users to interactively update the current plot by changing its parameter values. Ipywidgets are interactive HTML widgets for Jupyter notebooks and the IPython kernel.

In [None]:
from ipywidgets import widgets, interactive, Layout

### ToggleButtons Widget

In [None]:
w = widgets.ToggleButtons(
    description = 'Speed:',
    options = ['Slow', 'Regular', 'Fast'],
    value = 'Slow',
    style = {"description_width": '50px'},
    layout = Layout(width="70%")
)

display(w)

ToggleButtons Widget: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#ToggleButtons

Widgets are eventful Python objects that have a representation in the browser, often as a control like a slider, textbox, etc.

In [None]:
w.value

The <b>value</b> attribute of the widget holds the selected value. 
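
As a small sketch of what "eventful" means, a widget's `value` can also be set from code, and a callback registered with `observe` is notified of the change. The names `w_demo`, `changes`, and `on_value_change` are made up for this illustration.

```python
from ipywidgets import widgets

w_demo = widgets.ToggleButtons(
    options=['Slow', 'Regular', 'Fast'],
    value='Slow',
)

changes = []   # records (old value, new value) pairs

def on_value_change(change):
    # 'change' is a dictionary describing what happened to the trait.
    changes.append((change['old'], change['new']))

# Call on_value_change whenever the 'value' trait changes.
w_demo.observe(on_value_change, names='value')

w_demo.value = 'Fast'   # assigning in code updates the widget like a click would
print(w_demo.value, changes)
```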

In [None]:
w_league = widgets.ToggleButtons(
    description = 'League:',
    options = ['AL', 'NL', 'Both'],
    value = 'Both',
    style = {"description_width": '50px'},
    layout = Layout(width="70%")
)

display(w_league)

In [None]:
w_league.value

In [None]:
if w_league.value == "Both":                         # If the user has selected 'Both',
    df_tmp = dfb19                                   # select the entire dataframe.
else:                                                # If the user has selected either 'AL' or 'NL',
    df_tmp = dfb19[dfb19.lgID == w_league.value]     # select the rows with the selected league.

title = "Batting Stats for {}".format(w_league.value)
df_tmp[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", title=title, grid=True, figsize=(15,7))

### Make It Interactive!

We want to allow users to interactively select a league (AL, NL, or both), so they can compare the distributions of key batting metrics for that selection.

In [None]:
w_league = widgets.ToggleButtons(
    description = 'League:',
    options = ['Both', 'AL', 'NL'],
    value = 'Both',
    style = {"description_width": '50px'},
    layout = Layout(width="70%")
)

def view(league):
    if league == "Both":
        df_tmp = dfb19
    else:
        df_tmp = dfb19[dfb19.lgID == league]
    
    title = "Batting Stats of {}".format(league)
    df_tmp[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", title=title, grid=True, figsize=(15,7))

i = interactive(view, league=w_league)     # The value of the widget is passed to the view function as a parameter value.
display(i)
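
One possible design, sketched here under the assumption that you want to check the selection logic without a notebook front end, is to move the filtering step into its own function. The function name `select_league` and the dataframe `toy` are hypothetical.

```python
import pandas as pd

def select_league(df, league):
    """Return all rows for 'Both', otherwise only the rows of the given league."""
    if league == "Both":
        return df
    return df[df.lgID == league]

# Made-up data with the same lgID and H column names as the batting data.
toy = pd.DataFrame({"lgID": ["AL", "NL", "AL"], "H": [10, 20, 30]})

al_hits = select_league(toy, "AL").H.sum()      # hits in AL rows only
all_hits = select_league(toy, "Both").H.sum()   # hits in every row
print(al_hits, all_hits)
```

A `view` function could then call `select_league` and keep only the plotting code for itself.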

### Dropdown Widget

In [None]:
w = widgets.Dropdown(
    description = 'Speed:',
    options = ['Slow', 'Regular', 'Fast'],
    value = 'Slow',
    style = {"description_width": '50px'},
    layout = Layout(width="20%")
)

display(w)

Dropdown Widget: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#Dropdown

In [None]:
w.value

We want to allow users to interactively select one of the teams, so they can compare the distributions of key batter metrics for that team. 

In [None]:
sorted(set(dfb19.teamID))

In [None]:
w_team = widgets.Dropdown(
    description = 'Team:',
    options = ["All"] + sorted(set(dfb19.teamID)),    # a sorted list of unique teamIDs plus All
    value = "All",
    style = {"description_width": '50px'},
    layout = Layout(width="15%")
)

def view(team):
    if team == "All":
        df_tmp = dfb19
    else:
        df_tmp = dfb19[dfb19.teamID == team]

    title = "Batting Stats of {}".format(team)
    df_tmp[["AB", "H", "2B", "3B", "HR", "BB", "SO"]].plot(kind="box", title=title, grid=True, figsize=(15,7))

i = interactive(view, team=w_team)
display(i)

### Toggle Buttons & Dropdown Widgets

We want to allow users to interactively select the year and the team that they're interested in, so they can check the raw records for that year and team.

In [None]:
w_year = widgets.ToggleButtons(
    description = 'Year:',
    options = [2016, 2017, 2018, 2019, 2020],
    value = 2019,
    style = {"description_width": '50px'},
    layout = Layout(width="90%")
)

w_team = widgets.Dropdown(
    description = 'Team:',
    options = ["All"] + sorted(set(dfb19.teamID)),
    value = "All",
    style = {"description_width": '50px'},
    layout = Layout(width="15%")
)

def view(year, team):
    if team == "All":
        df_tmp = dfb[dfb.yearID == year]
    else:
        df_tmp = dfb[(dfb.yearID == year) & (dfb.teamID == team)]

    display(df_tmp)

i = interactive(view, year=w_year, team=w_team)
display(i)

### Text Widget

In [None]:
w = widgets.Text(
    description = 'String:',
    value = 'Hello World',
    style = {"description_width": '50px'},
    layout = Layout(width="30%")
)

display(w)

Text Widget: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#Text

In [None]:
w.value

We want to allow users to interactively type a search term, so they can search the `text` column in `dft` for the term.

In [None]:
dft = pd.read_csv("classdata/timeline_cnnbrk.csv", sep="\t")
pd.set_option('display.max_colwidth', 150)
dft

In [None]:
w_string = widgets.Text(
    description = 'String:',
    style = {"description_width": '50px'},
    layout = Layout(width="90%")
)

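def view(string):
    # Match the term literally (regex=False) and treat missing text as no match (na=False).
    mask = dft.text.str.contains(string, case=False, regex=False, na=False)
    display(dft[mask])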

i = interactive(view, string=w_string)
display(i)
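
By default, `Series.str.contains` interprets the search term as a regular expression and returns a missing value for missing text. The sketch below, on made-up data, shows a literal and case-insensitive search that tolerates both a half-typed term such as `(LIVE` and empty rows. The dataframe `toy` and the variable `term` are hypothetical.

```python
import pandas as pd

# Made-up text column that includes a missing value.
toy = pd.DataFrame({"text": ["Breaking (live) update", "Weather report", None]})

term = "(LIVE"   # an unbalanced parenthesis is not a valid regular expression

# regex=False matches the term literally; na=False counts missing text as no match.
mask = toy.text.str.contains(term, case=False, regex=False, na=False)
matches = toy[mask]
print(matches)
```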

Note that interactive visualization with Ipywidgets works only in a live notebook environment. In other words, the interactive functionality is lost once you convert the notebook to HTML or other formats.

## Exercises for Visualization