
# Fundamental Python - Visualization
This tutorial is based on Russ Poldrack's [PythonForRUsers](https://github.com/poldrack/PythonForRUsers) tutorials and was adapted into a Python-only tutorial by Shao-Fang Wang (2020).  

Many people have contributed to developing and revising the R tutorial material (which is what this Python tutorial is based on) over the years: 
Anna Khazenzon, Cayce Hook, Paul Thibodeau, Mike Frank, Benoit Monin, Ewart Thomas, Michael Waskom, Steph Gagnon, Dan Birman, Natalia Velez, Kara Weisman, Andrew Lampinen, Joshua Morris, Yochai Shavit, Jackie Schwartz, Arielle Keller, and Leili Mortazavi.   

## Why visualize data?
The greatest value of a picture is when it forces us to notice what we never expected to see. — John Tukey    

In [None]:
from IPython.display import Image
Image(filename='./figures/anscombe.png') 

Anscombe’s quartet in the figure above (left side) illustrates the importance of visualizing data. Even though the datasets I-IV have the same summary statistics (mean, standard deviation, correlation), they are importantly different from each other. On the right side, we have four data sets with the same summary statistics that are very similar to each other.
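
To make the point concrete, the following sketch computes the summary statistics of Anscombe's quartet directly. The values are typed in by hand from Anscombe's published data (they are not read from the figure file above), and the variable names are only illustrative.

```python
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "sd_y": y.std(ddof=1),          # sample standard deviation
        "r": np.corrcoef(x, y)[0, 1],   # Pearson correlation
    }
    s = summary[name]
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"sd_y={s['sd_y']:.2f} r={s['r']:.3f}")
```

The four rows print nearly identical numbers, even though the scatter plots of the four datasets look nothing alike.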

We will use Python's [`matplotlib`](https://matplotlib.org/) and [`seaborn`](https://seaborn.pydata.org/index.html) libraries for plotting. Seaborn is a library for making statistical graphics in Python. It is built on top of ```matplotlib``` and closely integrated with ```pandas``` data structures. Let's import the libraries!

In [None]:
import matplotlib.pyplot as plt
import matplotlib
#matplotlib.use("TkAgg")
import seaborn as sns
import pandas as pd
import numpy as np

# this is necessary to fix a bad interaction on Mac systems
# per: https://github.com/openai/spinningup/issues/16
#import os
#os.environ['KMP_DUPLICATE_LIB_OK']='True'

First let's load the UH2 data files and merge them.

In [None]:
data = pd.read_csv('./data/meaningful_variables_clean.csv', index_col=0)
demogdata = pd.read_csv('./data/demographics.csv', index_col=0)
mean_results = pd.read_csv('./data/arrest_ssrt_impulsivity.csv', index_col=0)
df = (
    data.join(demogdata, how='inner')
        .join(mean_results, how='inner')
        .dropna(subset=['ArrestedChargedLifeCount',
                        'TrafficTicketsLastYearCount'])
)
df.head()
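
The merge above matches rows by index, keeps only rows present in every table (`how='inner'`), and then drops rows with missing values in the listed columns. Here is a minimal sketch of the same two steps on made-up data; the subject IDs, the `Age` column, and all values are hypothetical and not taken from the UH2 files.

```python
import numpy as np
import pandas as pd

# Two toy tables indexed by a hypothetical subject ID.
tasks = pd.DataFrame({"mean_SSRT": [210.0, 250.0, 190.0]},
                     index=["s01", "s02", "s03"])
demo = pd.DataFrame({"Age": [25, 31, 40],
                     "ArrestedChargedLifeCount": [0, np.nan, 2]},
                    index=["s02", "s03", "s04"])

# Inner join: only subjects present in both tables survive (s02 and s03).
merged = tasks.join(demo, how="inner")

# dropna(subset=...) removes rows with a missing value in the named column (s03).
cleaned = merged.dropna(subset=["ArrestedChargedLifeCount"])

print(merged)
print(cleaned)
```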

## Scatter plots

First let's create a scatter plot showing mean SSRT versus mean impulsivity scores. 

We can use `sns.scatterplot()` in Seaborn. Its `x` and `y` arguments specify the variables to plot: we can either pass the data directly or give the names of columns in the DataFrame passed as `data`.

In [None]:
plt.figure(figsize=(8, 8))  # define figure size
ax = sns.scatterplot(x='mean_SSRT',
                         y='mean_impulsivity',
                         data=df)
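
The two ways of supplying `x` and `y` can be compared side by side. This is a small sketch on made-up data; the column names `rt` and `score` are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data with hypothetical column names.
toy = pd.DataFrame({"rt": [180, 200, 220, 240, 260],
                    "score": [1.2, 1.9, 2.1, 2.8, 3.5]})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Option 1: reference columns by name and pass the DataFrame as `data`.
sns.scatterplot(x="rt", y="score", data=toy, ax=ax1)

# Option 2: pass the data directly as Series (no `data` argument needed).
sns.scatterplot(x=toy["rt"], y=toy["score"], ax=ax2)
```

Both panels show the same points. In the second form the axis labels are taken from the Series names.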


Now let's say that you want to generate a version of this plot that colors the points by whether the person has ever been arrested and scales the point size by the number of times arrested. In Seaborn we would do this using the ```hue``` and ```size``` arguments:

In [None]:
plt.figure(figsize=(8,8))
ax = sns.scatterplot(x='mean_SSRT',
                         y='mean_impulsivity',
                         hue='EverArrested',
                         size='ArrestedChargedLifeCount',
                         data=df)


Finally, we may want to change the x- and y-axis labels and add a title to the plot. We will use ```.set()```:

In [None]:
plt.figure(figsize=(8,8))
ax = sns.scatterplot(x='mean_SSRT',
                         y='mean_impulsivity',
                         hue='EverArrested',
                         size='ArrestedChargedLifeCount',
                         data=df)
# x is mean_SSRT and y is mean_impulsivity, so label the axes accordingly
ax.set(xlabel='Mean SSRT', ylabel='Mean Impulsivity', title="Impulsivity and SSRT")

We can also change the aesthetic style of the plots by calling `sns.set_style()`. This affects things like the color of the axes, whether a grid is enabled by default, and other aesthetic elements.

In [None]:
plt.figure(figsize=(8,8))
sns.set_style('whitegrid')  # here we use the style 'whitegrid'
ax = sns.scatterplot(x='mean_SSRT',
                         y='mean_impulsivity',
                         hue='EverArrested',
                         size='ArrestedChargedLifeCount',
                         data=df)
ax.set(xlabel='Mean SSRT', ylabel='Mean Impulsivity', title="Impulsivity and SSRT")

In [None]:
plt.figure(figsize=(8,8))
sns.set_style('darkgrid')  # here we use the style 'darkgrid'
ax = sns.scatterplot(x='mean_SSRT',
                         y='mean_impulsivity',
                         hue='EverArrested',
                         size='ArrestedChargedLifeCount',
                         data=df)
ax.set(xlabel='Mean SSRT', ylabel='Mean Impulsivity', title="Impulsivity and SSRT")
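
To see what a style changes without drawing anything, we can inspect the parameters behind each built-in style. This is a small sketch using `sns.axes_style()`, which returns a style's settings as a dictionary; only the `axes.grid` entry is examined here.

```python
import seaborn as sns

# The five built-in seaborn style names.
style_names = ["white", "whitegrid", "dark", "darkgrid", "ticks"]

# axes_style(name) returns the parameter dictionary for a style
# without applying it globally.
grid_enabled = {name: sns.axes_style(name)["axes.grid"] for name in style_names}

for name, has_grid in grid_enabled.items():
    print(f"{name:10s} grid drawn by default: {has_grid}")
```

`sns.axes_style()` can also be used in a `with` statement to apply a style to a single plot, rather than changing it for the rest of the notebook as `sns.set_style()` does.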

To explore more themes: https://python-graph-gallery.com/104-seaborn-themes/

## Visualizing categorical variables
For quantitative data, we can use box plots to compare distributions between variables or across levels of a categorical variable. We can use `sns.boxplot()` to plot the distribution of mean impulsivity for each of the groups in motivation for participation:

In [None]:
plt.figure(figsize=(8,8))
sns.set_style("whitegrid")
sns.boxplot(x='MotivationForParticipation',
                y='mean_impulsivity',
                data = df)
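
The quantities a box plot draws can also be computed by hand. The sketch below uses made-up data with hypothetical `group` and `score` columns and assumes the common convention that whiskers extend at most 1.5 times the interquartile range beyond the box; points past that limit are drawn individually as outliers.

```python
import pandas as pd

# Made-up scores for two hypothetical groups.
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "score": [1, 2, 3, 4, 5, 2, 4, 6, 8, 20],
})

# Box edges (q1, q3) and the center line (median) for each group.
quartiles = toy.groupby("group")["score"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

# Upper whisker limit under the 1.5 * IQR convention.
quartiles["upper_fence"] = quartiles["q3"] + 1.5 * quartiles["iqr"]

# Points beyond their group's upper limit would be shown as outliers.
outliers = toy[toy["score"] > toy["group"].map(quartiles["upper_fence"])]

print(quartiles)
print(outliers)
```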

Another way to visualize quantitative data is to use a bar graph (```.barplot()```). Here, the height of each bar is the group mean and the error bar is a 95% confidence interval.

In [None]:
plt.figure(figsize=(8,8))
sns.set_style("whitegrid")
sns.barplot(x='MotivationForParticipation',
                y='mean_impulsivity',
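                n_boot=100,  # number of bootstrap resamples used for the interval
                ci = 95,  # 95% CI; seaborn >= 0.12 deprecates this in favor of errorbar=('ci', 95)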
                data = df)
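
The `n_boot` argument hints at how the error bar is obtained: by bootstrapping. The following is a rough sketch of a percentile bootstrap interval for a mean, written with NumPy on made-up scores. It illustrates the idea only and is not seaborn's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores for one hypothetical group.
scores = rng.normal(loc=50, scale=10, size=40)

n_boot = 1000  # number of bootstrap resamples

# Resample the scores with replacement and record the mean each time.
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrap means gives the interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean = {scores.mean():.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A larger `n_boot` gives a more stable interval at the cost of more computation.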

In [None]:
##Apply your knowledge
#Plot mean impulsivity for each group in the gambling problem using boxplot and barplot









## Visualizing linear relationships

Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, ```regplot()``` and ```lmplot()```, are closely related and share much of their core functionality. It is important to understand the ways they differ so that you can quickly choose the correct tool for a particular job.
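
One practical difference is where each function draws. The sketch below, on made-up data with hypothetical column names, shows that `regplot()` draws onto a single matplotlib Axes (which you can supply), while `lmplot()` builds its own figure and can split the data into facets.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: a noisy linear trend plus a two-level grouping column.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": np.arange(30, dtype=float)})
toy["y"] = 2 * toy["x"] + rng.normal(0, 3, size=30)
toy["group"] = np.where(toy["x"] < 15, "low", "high")

# regplot is axes-level: it draws on the Axes we pass and returns that Axes.
fig, ax = plt.subplots(figsize=(5, 5))
returned_ax = sns.regplot(x="x", y="y", data=toy, ax=ax)

# lmplot is figure-level: it creates a new figure and returns a FacetGrid.
# Here `col` gives one panel per group; `height` sets each panel's size in inches.
grid = sns.lmplot(x="x", y="y", col="group", data=toy, height=4)

print(type(returned_ax).__name__, type(grid).__name__)
```

Because `lmplot()` manages its own figure, its size is controlled with `height` and `aspect` rather than with `plt.figure(figsize=...)`.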



It appears that there is a relationship between impulsivity and arrests (and in fact our earlier analyses of these data also showed that). We can use ```lmplot()``` to draw a scatterplot of two variables, fit the regression model y ~ x, and plot the resulting regression line with a 95% confidence interval for that regression:

In [None]:
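sns.set_style('darkgrid')
# lmplot creates its own figure, so set the size with height rather than
# plt.figure(); it returns a FacetGrid, not an Axes
g = sns.lmplot(x='ArrestedChargedLifeCount',
               y='mean_impulsivity',
               height=8,
               data=df)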


Here we can see that when x takes discrete values, the data points overlap with one another. To visualize the data points better, one option is to add some random noise ("jitter") to the discrete values to make their distribution clearer. Note that jitter is applied only to the scatterplot data and does not influence the regression line fit itself:

In [None]:
sns.set_style('darkgrid')
# size the figure with height, since lmplot creates its own figure
g = sns.lmplot(x='ArrestedChargedLifeCount',
               y='mean_impulsivity',
               x_jitter=0.1,
               height=8,
               data=df)
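
The separation between fitting and drawing can be sketched with NumPy. This assumes jitter means adding uniform noise of the given size to the plotted x positions; the data are made up, and the code only illustrates the idea rather than reproducing seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: x takes only a handful of discrete values.
x = rng.integers(0, 5, size=200).astype(float)
y = 1.5 * x + rng.normal(0, 1, size=200)

# The regression line is fit on the original, unjittered x values.
slope, intercept = np.polyfit(x, y, deg=1)

# Jitter is added only to the positions used for drawing the points.
x_drawn = x + rng.uniform(-0.1, 0.1, size=x.size)

print(f"fit on raw x: slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"distinct raw x values: {np.unique(x).size}, "
      f"distinct drawn x values: {np.unique(x_drawn).size}")
```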


Sometimes, the relationship between variables is non-linear:

In [None]:
df_poly=pd.DataFrame()
df_poly['x']=np.random.randint(1,100,size = 300)
df_poly['y']=np.random.randint(10,40,size = 300)
df_poly['z']=df_poly['x']**2*df_poly['y']
df_poly.head()

In [None]:
sns.set_style('darkgrid')
# a straight line is a poor fit to this curved relationship
g = sns.lmplot(x='x',
               y='z',
               height=8,
               data=df_poly)


In [None]:
# we can use order to fit a polynomial regression (here, a quadratic)
sns.set_style('darkgrid')
g = sns.lmplot(x='x',
               y='z',
               order=2,
               line_kws={'color': 'black'},  # change the color of the line
               height=8,
               data=df_poly)
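
To put a number on the improvement, we can compare how much variance a straight line and a quadratic each explain. This sketch regenerates data in the same way as `df_poly` above (with a fixed seed, so the values differ from the notebook's) and fits both models with `np.polyfit`; the helper `r_squared` is defined here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same recipe as df_poly: z grows with the square of x, scaled by a random y.
x = rng.integers(1, 100, size=300).astype(float)
y = rng.integers(10, 40, size=300).astype(float)
z = x ** 2 * y


def r_squared(x, z, degree):
    """Fit a polynomial of the given degree and return its R^2."""
    coeffs = np.polyfit(x, z, deg=degree)
    residuals = z - np.polyval(coeffs, x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((z - z.mean()) ** 2)
    return 1 - ss_res / ss_tot


r2_linear = r_squared(x, z, degree=1)
r2_quadratic = r_squared(x, z, degree=2)

print(f"R^2 linear:    {r2_linear:.3f}")
print(f"R^2 quadratic: {r2_quadratic:.3f}")
```

Setting `order=2` in `lmplot()` corresponds to the degree-2 fit here.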


## Resources
Seaborn styles: http://seaborn.pydata.org/tutorial/aesthetics.html