# Tutorial 3: Creating plots in matplotlib

This tutorial introduces the matplotlib library and demonstrates how to create bar plots, scatter plots, and box plots.\
It will also cover some basic statistical analysis tools.

---

## 3.1 Importing data

Start by importing the libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

\
Then import the data you exported in the previous tutorial:

In [None]:
df = pd.read_table('YOUR_NAME_pokemon_data.csv', sep=',', index_col=0) #index_col=0 sets the first column as the index
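
As a side note, `pd.read_table()` with `sep=','` should behave the same as `pd.read_csv()`, which uses a comma separator by default. The sketch below checks this on a small in-memory file; the contents of `csv_text` are made up for illustration and are not the tutorial data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the exported file: the first column holds the row labels
csv_text = "Generation,Fire,Water\n1,8.0,18.5\n2,8.0,18.0\n"

via_read_table = pd.read_table(io.StringIO(csv_text), sep=',', index_col=0)
via_read_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)  # sep=',' is the default here

# True when both calls produce identical dataframes
same_result = via_read_table.equals(via_read_csv)
print(same_result)
```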

---

## 3.2 Global parameters

Plots can be individually configured, but we can also set **global parameters** at the start of the notebook. Once these are set they will apply to all plots unless otherwise specified. A few examples:

In [None]:
plt.style.use('seaborn-v0_8-whitegrid') #applies a preset style 
plt.rcParams['figure.figsize'] = [8,6] #figure size
plt.rcParams['font.size'] = 12 #global font size
plt.rcParams['figure.dpi'] = 75 #figure resolution (dots per inch)

Many options for style presets exist, which can be viewed with `plt.style.available`. There is even one called `ggplot` if you want to pretend that you made these figures in R...
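
If you only want a preset for a single figure, one option is the `plt.style.context()` context manager, which applies the style inside the `with` block and restores the global settings afterwards. This is a minimal sketch; the plotted numbers are placeholders.

```python
import matplotlib.pyplot as plt

available_styles = sorted(plt.style.available)  # names of the preset styles
facecolor_before = plt.rcParams['axes.facecolor']

# Apply a preset to one figure only; the global settings are restored afterwards
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8])

facecolor_after = plt.rcParams['axes.facecolor']  # same value as before the with-block
```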

---

## 3.3 Bar plots

Try running the following:

In [None]:
df.plot(kind='bar')

As you can see, the function generates a **grouped** bar plot by default. This is fine for most data, but recall that we scaled the Type values to a **percentage** of the total count in each Generation. Because the values in each Generation add up to 100, it is more appropriate to use a **stacked** bar plot:

In [None]:
df.plot(kind='bar', stacked=True)

Now we can see the scaled values as a proportion of the total (note that the Y axis is now scaled from 0 to 100). 
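
As a reminder of why each bar reaches 100, here is a small sketch of the scaling step. The counts are invented for illustration, and the exact code used in the previous tutorial may differ.

```python
import pandas as pd

# Hypothetical counts of two Types in two Generations
counts = pd.DataFrame({'Fire': [12, 8], 'Water': [28, 12]}, index=[1, 2])
counts.index.name = 'Generation'

# Divide each row by its total so every Generation sums to 100
percent = counts.div(counts.sum(axis=1), axis=0) * 100

# Each stacked bar now has a total height of 100
ax = percent.plot(kind='bar', stacked=True)
```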

Next, we want to change the color scheme, as the default colors are not very informative. I provided a custom color map as a separate file, which can be imported as a dataframe:

In [None]:
color_map = pd.read_table('pokemon_color_map.csv', sep=',', index_col=0)
color_map = color_map.sort_values(by='Type') #sorting alphabetically to match up with our data

color_map.head(5)

The `['Color']` column contains hex codes that correspond to specific colors. We can pass this column to `.plot()` through the `color` parameter:

In [None]:
df.plot(kind='bar', stacked=True, color=color_map['Color'])

Now that the colors are applied, let's format the legend. In the above plot, the legend lists the Types in reverse order from how they are actually shown in the bars (top-to-bottom vs. bottom-to-top). 

The `.legend()` function is separate from the `.plot()` function. Since `.plot()` returns the plot area (an axes object), we'll store that in a new object (`barplot`) and call `.legend()` on it. 

In [None]:
barplot = df.plot(kind='bar', stacked=True, color=color_map['Color'])

handles, labels = barplot.get_legend_handles_labels() #handles are the colored boxes, labels are the text

legend = barplot.legend(
    handles=reversed(handles),
    labels=reversed(labels), 
    bbox_to_anchor=(1, 1), 
    title='Type',
    frameon=True
)

#as an aside, whenever a function has a lot of parameters it's good practice to put them on separate lines

The `.get_legend_handles_labels()` function retrieves the legend data from `barplot`, and the `.legend()` function is where the legend formatting parameters are specified. 

The first two parameters are the legend handles and labels. Wrapping them in `reversed()` reverses their order.\
The `bbox_to_anchor` parameter sets the x and y coordinates of the legend relative to the plot area. `(0,0)` would be bottom left and `(1,1)` is top right. \
We also provided some extra parameters such as `title` and `frameon` here to give the legend a title and draw a box around it.

Other legend formatting options are listed in the documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html \
Feel free to try out some others, such as `title_fontsize`, `labelspacing`, and `borderpad`.
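
To check that the reversal does what we expect, the sketch below builds a tiny stacked bar plot from made-up values and reads the legend text back. It uses list slicing (`[::-1]`) in place of `reversed()`, which gives the same order, and adds `loc='upper left'` so that the legend's upper-left corner is the one placed at the `bbox_to_anchor` point. The `toy` dataframe is an assumption for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical percentages for three Types in two Generations
toy = pd.DataFrame(
    {'Fire': [40, 30], 'Grass': [35, 30], 'Water': [25, 40]},
    index=[1, 2],
)

ax = toy.plot(kind='bar', stacked=True)
handles, labels = ax.get_legend_handles_labels()  # column order: Fire, Grass, Water

legend = ax.legend(
    handles=handles[::-1],   # reversed so the legend reads top-to-bottom like the bars
    labels=labels[::-1],
    bbox_to_anchor=(1, 1),   # anchor point at the top right of the plot area
    loc='upper left',        # corner of the legend that sits on the anchor point
    title='Type',
)

legend_text = [t.get_text() for t in legend.get_texts()]
print(legend_text)
```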

\
Finally, let's add axis labels and a title. 

In [None]:
### PLOT ###
barplot = df.plot(
    kind='bar', 
    stacked=True, 
    width=(0.7),    #sets the width of bars 
    ylim=(0,100),   #sets the y-axis limits
    color=color_map['Color']
)

## LEGEND ##
handles, labels = barplot.get_legend_handles_labels()
legend = barplot.legend(
    handles=reversed(handles),
    labels=reversed(labels), 
    title='Type', 
    title_fontsize=(14),
    bbox_to_anchor=(1.02, 1), 
    frameon=True
)

## X AXIS LABELS ##
plt.xticks(rotation=0) 
plt.xlabel('Generation', size=14, labelpad=5)

## Y AXIS LABELS ##
plt.yticks(ticks=[0,10,20,30,40,50,60,70,80,90,100])
plt.ylabel('% of Total', size=14, labelpad=5)

## TITLE ##
plt.title('Pokemon Types by Generation', size=14, pad=15)

plt.show() #this function ensures that only the plot is displayed as the output

The finished bar plot enables easy visualization of numerical differences across categorical variable(s). For example, we can see that Gen 1 has a higher proportion of Poison and Water type pokemon compared to Gen 6, which has more Fairy and Ghost types. 

---

## 3.4 Scatter plots

Let's return to the original pokemon data set:

In [None]:
df2 = pd.read_table('pokemon_data.csv', sep=',', index_col=0)

\
A scatter plot can be created by setting any two numerical columns as the `x` and `y` values:

In [None]:
df2.plot(kind='scatter', x='Attack', y='Defense')

---

### Exercise #1

Individual points on a scatter plot can be colored by passing a list or series of colors into the `c=` parameter.

A) Using the `color_map` dataframe, modify the scatterplot above to color each point by the `Type 1` column. \
Hint: the `join()` function can be used here.

B) Once you have applied the color map, format the plot axes and labels using the barplot example as a guide.



In [None]:
### YOUR CODE HERE ###

Optionally, you can give the plot a legend. Retrieving the handles and labels is beyond the scope of this tutorial, so you can paste the following code for that part:

```python
import matplotlib.lines

custom_handles = [matplotlib.lines.Line2D(
    [],[], marker='o', color=c, linestyle="none") for c in color_map['Color']]
custom_labels = color_map.index.to_list()

```
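
The snippet above only builds the handles and labels; they still need to be passed to `.legend()`. A minimal sketch of that last step is shown below, using a two-row stand-in for `color_map` (the `toy_map` name, its hex codes, and the plotted points are assumptions for illustration).

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines

# Hypothetical two-row stand-in for color_map (Type as the index, hex codes in 'Color')
toy_map = pd.DataFrame({'Color': ['#EE8130', '#6390F0']}, index=['Fire', 'Water'])

custom_handles = [
    matplotlib.lines.Line2D([], [], marker='o', color=c, linestyle='none')
    for c in toy_map['Color']
]
custom_labels = toy_map.index.to_list()

fig, ax = plt.subplots()
ax.scatter([1, 2], [3, 4], c=toy_map['Color'].to_list())  # one color per point

# Pass the hand-made handles and labels to the legend
legend = ax.legend(
    handles=custom_handles,
    labels=custom_labels,
    title='Type',
    bbox_to_anchor=(1.02, 1),
    loc='upper left',
)
legend_text = [t.get_text() for t in legend.get_texts()]
```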


---

## 3.5 Linear regression

Statistical analyses can be used to help our interpretation of scatter plots. 

We will use the **scipy.stats** module to run a linear regression and obtain the correlation coefficient:

In [None]:
from scipy import stats

\
The `stats.linregress()` function returns an object that stores multiple values:

In [None]:
stats.linregress(df2['Attack'], df2['Defense'])

\
Individual values can be accessed with `.slope`, `.intercept`, etc. \
Based on the p-value, is the correlation between pokemon Attack and Defense significant?

In [None]:
stats.linregress(df2['Attack'], df2['Defense']).pvalue
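
To get a feel for the returned values, here is a small sketch on made-up numbers that lie close to the line y = 2x. The 0.05 threshold (`alpha`) is a common convention and an assumption here, not something fixed by SciPy.

```python
from scipy import stats

# Hypothetical data that roughly follow y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

result = stats.linregress(x, y)

print(f"slope={result.slope:.2f}, intercept={result.intercept:.2f}")
print(f"r={result.rvalue:.3f}, r^2={result.rvalue**2:.3f}, p={result.pvalue:.2g}")

alpha = 0.05  # assumed significance threshold
is_significant = result.pvalue < alpha
```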

\
We can also plot the line of best fit based on the slope and intercept values. Some additional code is written to facilitate plotting two sets of data on the same plot area:

In [None]:
#storing the regression object
reg = stats.linregress(df2['Attack'], df2['Defense'])

#getting line of best fit
df4 = df2.copy()
df4['x_vals'] = df4['Attack']
df4['y_vals'] = reg.slope*df2['Attack'] + reg.intercept

#defining subplots 
fig, ax = plt.subplots()

df4.plot(kind='scatter', x='Attack', y='Defense', label='data', ax=ax)
df4.plot(kind='line', x='x_vals', y='y_vals', color='r', label='fitted line', ax=ax)

plt.legend(bbox_to_anchor=(1.25, 1))
plt.show()

The `ax` parameter (often named `axs` or `axes` when there is more than one) specifies which plot area within the figure (`fig`) the data are drawn on. \
In this case, we want the scatter and line plots on the same plot area, so we set `ax=ax` for both of them. 
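
The same idea can be written with matplotlib's own plotting methods instead of `df.plot()`. The sketch below draws a scatter and a fitted line on one shared `ax`; the data are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data that roughly follow y = 2x
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
reg = stats.linregress(x, y)

fig, ax = plt.subplots()  # one figure containing one plot area
ax.scatter(x, y, label='data')
ax.plot(x, reg.slope * x + reg.intercept, color='r', label='fitted line')
ax.legend()
```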

---

## 3.6 Box plots

`df.plot(kind='box')` is not implemented very well, so we'll use `df.boxplot()` instead.\
The `boxplot()` function takes parameters similar to those of `groupby()`: `by` specifies the categorical data to group on, and `column` specifies the numerical data to plot:

In [None]:
df2.boxplot(by='Legendary', column='Attack')

Leaving out the `column` parameter will automatically plot all numerical columns:

In [None]:
df5=df2.drop('Generation', axis=1)

df5.boxplot(by='Legendary', layout=(2,3), grid=False, widths=0.7)

plt.show()

---

## 3.7 Comparing groups

Let's do a simple **t-test** with just the Attack values of legendary and non-legendary pokemon:

In [None]:
leg = df5.loc[df5['Legendary']==True]['Attack'] #attack values of legendary pokemon
not_leg = df5.loc[df5['Legendary']==False]['Attack'] #attack values of non-legendary pokemon

stats.ttest_ind(leg,not_leg) #ttest_ind() is for two independent samples

A p-value of $7.83 \times 10^{-24}$ indicates a significant difference between the two groups. 
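
By default, `ttest_ind()` assumes that both groups have equal variances. When that assumption is doubtful, Welch's t-test can be requested with `equal_var=False`. The sketch below compares the two on made-up numbers; whether Welch's test is the better choice for the pokemon data is not evaluated here.

```python
from scipy import stats

# Hypothetical measurements for two independent groups of different sizes
group_a = [10, 12, 11, 13, 12, 14]
group_b = [20, 22, 21, 19, 23, 25, 30, 18]

student = stats.ttest_ind(group_a, group_b)                 # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(f"Student: t={student.statistic:.2f}, p={student.pvalue:.2g}")
print(f"Welch:   t={welch.statistic:.2f}, p={welch.pvalue:.2g}")
```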

---

### Exercise #2

A) Make a boxplot comparing HP values for pokemon in different Generations. 

B) Use a **one-way ANOVA** to test for significant differences between pokemon HP across Generations. 

Documentation for `stats.f_oneway()`:\
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html



In [None]:
## YOUR CODE HERE ##

---

## 3.8 Exporting plots

You can add `plt.savefig('FIG_NAME.EXTENSION')` to any cell that makes a plot to export it. \
Note that `plt.savefig()` must come **before** `plt.show()`: once the figure has been shown it is closed, so calling `plt.savefig()` afterwards will typically save a blank image.

Extensions can be png, jpeg, pdf, svg, and many others. You can also specify resolution with `dpi=`.
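
A minimal sketch of the export step is shown below. The output path and file name are placeholders, and `bbox_inches='tight'` is an optional extra that trims whitespace and helps keep a legend placed outside the plot area from being cut off.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

# Hypothetical output location; in a notebook a plain file name such as 'my_plot.png' works too
out_path = os.path.join(tempfile.gettempdir(), 'my_plot.png')

fig.savefig(out_path, dpi=300, bbox_inches='tight')  # save first
plt.show()                                           # then display
```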

---

## 3.9 Extras

Again, **nothing required** here. Just showing some additional plotting/stats options. 

### Subplots

Earlier we used `plt.subplots()` to create a figure (`fig`) and plot 2 sets of data on the same axis (`ax`). Subplots are also great for creating multi-paneled figures: 

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(7, 7), constrained_layout=True)

g_df = df2.loc[(df2['Type 1']=='Grass')]
f_df = df2.loc[(df2['Type 1']=='Fire')]
w_df = df2.loc[(df2['Type 1']=='Water')]
n_df = df2.loc[(df2['Type 1']=='Normal')]

g_df.plot(kind='scatter', x='Attack', y='HP', 
          c='tab:green', title='grass', ax=axes[0,0])
f_df.plot(kind='scatter', x='Attack', y='HP', 
          c='tab:orange', title='fire', ax=axes[0,1])
w_df.plot(kind='scatter', x='Attack', y='HP', 
          c='tab:blue', title='water', ax=axes[1,0])
n_df.plot(kind='scatter', x='Attack', y='HP', 
          c='tab:gray', title='normal', ax=axes[1,1])

plt.show()

The first two `.subplots()` parameters are the number of rows and columns in the figure (2x2 grid, 4 axes total), and the individual axes are specified with `axes[row, column]`, counting from 0 at the top left of the figure.\
Once you start customizing plot labels, colors, etc. in large figures, it can get a bit tedious... using `for` loops is highly encouraged. 
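
Here is one way the four panels could be written as a loop. It is a sketch on a small made-up dataframe (`toy` stands in for `df2`), and the `panel_colors` dictionary is an assumption introduced to pair each Type with a color.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df2 with only the columns used here
toy = pd.DataFrame({
    'Type 1': ['Grass', 'Grass', 'Fire', 'Fire', 'Water', 'Water', 'Normal', 'Normal'],
    'Attack': [49, 62, 52, 64, 48, 63, 56, 81],
    'HP':     [45, 60, 39, 58, 44, 59, 30, 55],
})
panel_colors = {'Grass': 'tab:green', 'Fire': 'tab:orange',
                'Water': 'tab:blue', 'Normal': 'tab:gray'}

fig, axes = plt.subplots(2, 2, figsize=(7, 7), constrained_layout=True)

# axes.flat walks the 2x2 grid row by row, so each Type gets the next panel
for ax, (type_name, color) in zip(axes.flat, panel_colors.items()):
    subset = toy.loc[toy['Type 1'] == type_name]
    subset.plot(kind='scatter', x='Attack', y='HP',
                c=color, title=type_name.lower(), ax=ax)

panel_titles = [panel.get_title() for panel in axes.flat]
```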

### Correlation matrices

If you want to check pairwise correlations between multiple variables, `df.corr()` generates a matrix:

In [None]:
df5 = df2.select_dtypes(include='int64').drop('Generation',axis=1)

corr_df = df5.corr(method='pearson')
corr_df.style.background_gradient(cmap='Blues')

Using `method='spearman'` (a rank-based correlation) is recommended when the data are not normally distributed or the relationship is monotonic but not linear. \
You can write a custom function to extract the p-values as a separate matrix. But whenever you are making multiple comparisons, a correction for multiple testing needs to be applied before interpreting significance.
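
One possible shape for such a custom function is sketched below. The helper name `corr_pvalues`, the toy data, and the choice of the Bonferroni correction are all assumptions for illustration; other correction methods may suit your data better.

```python
import itertools
import pandas as pd
from scipy import stats

def corr_pvalues(frame):
    """Return a symmetric DataFrame of Pearson p-values for every pair of columns."""
    cols = frame.columns
    pvals = pd.DataFrame(float('nan'), index=cols, columns=cols)  # diagonal stays NaN
    for a, b in itertools.combinations(cols, 2):
        p = stats.pearsonr(frame[a], frame[b])[1]  # second element is the p-value
        pvals.loc[a, b] = p
        pvals.loc[b, a] = p
    return pvals

# Hypothetical numerical columns
toy = pd.DataFrame({
    'HP':     [45, 60, 80, 39, 58, 78],
    'Attack': [49, 62, 82, 52, 64, 84],
    'Speed':  [45, 60, 80, 65, 80, 100],
})

pvals = corr_pvalues(toy)

# Bonferroni correction: multiply by the number of comparisons, capped at 1
n_tests = len(toy.columns) * (len(toy.columns) - 1) // 2
bonferroni = (pvals * n_tests).clip(upper=1.0)
```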

### Multivariate statistics

To compare the data shown in the barplots, we can use distance matrix based statistics. These are available in the **scikit-bio** library:

In [None]:
!pip install scikit-bio 

It can take up to several minutes to install, but once done, you won't need to run it again in subsequent notebooks.\
Windows users may experience errors. If so, try installing or updating the Microsoft Visual C++ Build Tools (https://visualstudio.microsoft.com/visual-cpp-build-tools/) and make sure your Python version is up to date.

In [None]:
from skbio import DistanceMatrix
from skbio import stats as skbio_stats #aliased so it does not overwrite the scipy `stats` imported in section 3.5
from scipy.spatial.distance import pdist, squareform

In [None]:
dm = squareform(pdist(df)) #compute an n x n matrix of pairwise (Euclidean by default) distances between rows
dm = DistanceMatrix(dm, ids=df.index) #store distance matrix object with generation as labels

dm

The plot shows that Generations 1 and 6 are the least similar to each other, while 4 and 5 are the most similar.
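
To see what `squareform(pdist(...))` computes, here is a small sketch that uses only SciPy and pandas on made-up percentages. `pdist()` returns one distance per pair of rows, and `squareform()` rearranges them into a symmetric matrix.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical percentages for three Generations and two Types
toy = pd.DataFrame(
    {'Fire': [10.0, 13.0, 10.0], 'Water': [20.0, 24.0, 21.0]},
    index=[1, 2, 3],
)

condensed = pdist(toy)          # Euclidean by default; one value per pair of rows
square = squareform(condensed)  # symmetric n x n matrix with zeros on the diagonal

dist_df = pd.DataFrame(square, index=toy.index, columns=toy.index)
print(dist_df)
```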

From here, a variety of statistical analyses can be performed within scikit-bio, such as ANOSIM, PERMANOVA, compositional analysis, etc. 

https://scikit.bio/docs/latest/generated/skbio.stats.distance.html