# Introduction to Matplotlib and Seaborn

## Goals
* understand the central concepts for plotting beautiful figures using matplotlib
* get used to seaborn as a higher level plotting package


This week we will use the packages `matplotlib` and `seaborn` to visualise our breast cancer dataset from last week. It's important to be able both to visualise data (usually from dataframes) and to plot the results of any analysis you have carried out on the data. The plotting functions you use in these two scenarios overlap considerably, so we will focus on generating plots to visualise the original data, without doing too much processing of the data first.

### Loading the required modules and data

We will need a few more modules than in previous weeks, as you can see below. We also import the same data that we used last week (the METABRIC dataset).

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

A quick note for anyone who has seen The West Wing: by convention `seaborn` is imported as `sns`. This refers to Samuel Norman 'Sam' Seaborn, a character on the show. The creator of the seaborn package has also written other packages, such as `moss` (Donna Moss) and `lyman` (Josh Lyman).

In [None]:
metabric = pd.read_csv('https://raw.githubusercontent.com/AstraZeneca-Code-Club/intermediate_python/main/metabric_clinical_and_expression_data.csv')
metabric.head()

To recap from last time - the rows in a `DataFrame` are the **observations** (patients in the case of METABRIC) whereas columns are the observed **variables**.
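
As a quick illustration of this layout, here is a minimal sketch on a tiny made-up table. The name `toy` and all of its values are hypothetical and only stand in for the real METABRIC data.

```python
import pandas as pd

# Toy stand-in for the METABRIC table: names and values are made up for illustration only
toy = pd.DataFrame({
    "Patient_ID": ["P1", "P2", "P3"],
    "Age_at_diagnosis": [61.2, 48.9, 70.4],
    "ER_status": ["Positive", "Negative", "Positive"],
})

n_observations, n_variables = toy.shape  # shape is (rows, columns)
print(n_observations, "observations (rows)")
print(n_variables, "variables (columns):", list(toy.columns))

# One row holds every variable recorded for a single observation (patient)
first_patient = toy.iloc[0]

# One column holds a single variable across all observations
ages = toy["Age_at_diagnosis"]
```

The same `shape`, `iloc` and column-selection calls can be run on `metabric` itself.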

### Using matplotlib to visualise our data

Let's first plot a simple bar plot showing the distribution of tumour stage in the patients.
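
A bar plot of this kind needs one count per tumour stage. As a minimal sketch on made-up values (the `stages` series is hypothetical and stands in for `metabric['Tumour_stage']`), NumPy's `np.unique(..., return_counts=True)` and the pandas method `value_counts()` should produce the same tally:

```python
import numpy as np
import pandas as pd

# Made-up stage values, including one missing entry, for illustration only
stages = pd.Series([1, 2, 2, 3, np.nan, 2, 1, 4], name="Tumour_stage")

# NumPy route: drop missing values first, then count each unique label
labels, counts = np.unique(stages.dropna(), return_counts=True)

# pandas route: value_counts() skips NaN by default; sort_index() orders by stage
stage_counts = stages.value_counts().sort_index()

print(labels, counts)
print(stage_counts)
```

Either result can be passed to `plt.bar`, for example as `plt.bar(stage_counts.index, stage_counts.values)`.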

In [None]:
labels, counts = np.unique(metabric['Tumour_stage'].dropna(), return_counts = True)
plt.bar(labels, counts, align='center')
plt.show()

This plot gives us the basic information if we are looking at it as part of this whole notebook. However, you will often be generating plots to include in presentations or posters, and will therefore need to add more information to the plot.

If you want to customise a figure, rather than just calling a ready-made plotting function such as `plt.hist`, it is better to create the figure and axes explicitly, as below:

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)  # 111 = 1 row, 1 column, subplot number 1
ax.bar(labels, counts, align='center', color=['black', 'red', 'cyan', 'blue', 'green'])
ax.set(title = 'Distribution of cancer stage in metabric dataset',
       ylabel = 'No. of patients',
       xlabel = 'Cancer stage')
plt.show()

Now that we've adapted the plot to our requirements, we would like to export the image so that we can use it elsewhere, for example in a presentation. We do this by adding just one more line to our existing code.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111) 
ax.bar(labels, counts, align='center', color=['black', 'red', 'cyan', 'blue', 'green'])
ax.set(title = 'Distribution of cancer stage in metabric dataset',
       ylabel = 'No. of patients',
       xlabel = 'Cancer stage')
plt.savefig('cancer_stage.png')
plt.show()
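
For posters and slides you may want more control over the exported file. The sketch below uses two real `savefig` parameters, `dpi` and `bbox_inches`. The bar heights, the file name and the temporary output folder are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["1", "2", "3"], [5, 9, 2])  # made-up counts, for illustration only
ax.set(title="Example bar plot", xlabel="Cancer stage", ylabel="No. of patients")

# Hypothetical output location: a temporary folder, so nothing in your project is overwritten
out_path = Path(tempfile.mkdtemp()) / "cancer_stage_hires.png"

# dpi sets the resolution; bbox_inches='tight' trims surplus white space around the axes
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)
print("Saved to", out_path)
```

Note that `savefig` is called before `plt.show()` in the cell above. In many setups the figure is cleared once it has been shown, so saving afterwards can produce a blank image.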

#### Exercise 1

- Generate a scatter plot with `FOXA1` on the x-axis and `MLPH` on the y-axis
- Add figure and axes titles
- Change the points to be red '+' signs, instead of the standard blue dots
- Save the figure as a .png file

Extension: This figure may be too small for people to read properly. Change the size of the figure to be 8cm across by 6cm down.
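
A hint for the extension: matplotlib's `figsize` is given in inches, not centimetres. The sketch below shows one way to convert. The helper `cm_to_inches` is a hypothetical name, and the size used is deliberately different from the one in the exercise.

```python
import matplotlib.pyplot as plt

CM_PER_INCH = 2.54


def cm_to_inches(cm):
    """Convert centimetres to inches, the unit that figsize expects."""
    return cm / CM_PER_INCH


# Hypothetical target size, chosen only for illustration
width_cm, height_cm = 12, 9
fig = plt.figure(figsize=(cm_to_inches(width_cm), cm_to_inches(height_cm)))
ax = fig.add_subplot(111)

# get_size_inches() reports the width and height actually used, in inches
print(fig.get_size_inches())
```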

### Doing multiple plots at once:

`matplotlib` allows us to create subplots within a figure and to display different things in each of these subplots. We will use this to generate some more types of plots in `matplotlib`.

In [None]:
fig, axs = plt.subplots(2, 2)
axs[0, 0].scatter(metabric['FOXA1'], metabric['MLPH'], marker='+', color='red')
axs[0, 0].set_title('FOXA1 vs MLPH')
axs[0, 1].hist(metabric['Age_at_diagnosis'])
axs[0, 1].set_title('Histogram of age at diagnosis')
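# Correlations are only defined for numeric columns, so select those first
axs[1, 0].matshow(metabric.select_dtypes(include='number').corr())
axs[1, 0].set_title('Correlation matrix')
# Set the line colour once: a colour format string plus color= is redundant
axs[1, 1].plot(metabric['Survival_time'], metabric['PIK3CA'], color='green')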
axs[1, 1].set_title('Line plot survival vs PIK3CA')
fig.tight_layout() # without this the x-ticks would overlap with the bottom row titles
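
A correlation panel drawn with `matshow` has no variable names on its axes by default. The sketch below shows one possible way to label it, using randomly generated stand-in data: the `toy` DataFrame and its values are assumptions, not METABRIC measurements.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; the column names are borrowed from METABRIC, the values are not
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["ESR1", "PGR", "MLPH"])
toy["ER_status"] = "Positive"  # a non-numeric column, which must be excluded

# Keep only the numeric columns before computing pairwise correlations
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1)  # fix the colour scale to the full correlation range

# Label each row and column with its variable name
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax)
```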

#### Drawbacks of matplotlib:

- it offers only basic data manipulation
- it can't sort data based on a column value
- it is not particularly well suited to working with DataFrames
- it is not that easy to adapt aesthetics beyond colours
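
To make the DataFrame point concrete, here is a sketch of one common way to colour points by a categorical column using matplotlib alone. The `toy` DataFrame and its values are made up for illustration, and other approaches exist.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for three METABRIC columns
toy = pd.DataFrame({
    "FOXA1": [5.1, 6.3, 10.2, 11.0, 10.7, 5.8],
    "MLPH": [6.0, 6.8, 11.5, 12.1, 11.9, 6.4],
    "ER_status": ["Negative", "Negative", "Positive", "Positive", "Positive", "Negative"],
})

fig, ax = plt.subplots()

# Split the DataFrame by category and draw each group separately,
# so that every category gets its own colour and legend entry
for status, group in toy.groupby("ER_status"):
    ax.scatter(group["FOXA1"], group["MLPH"], marker="+", label=status)

ax.set(xlabel="FOXA1", ylabel="MLPH")
ax.legend(title="ER_status")
```

The grouping, looping and legend handling all have to be written by hand here.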

Let's see how `seaborn` can be used as a wrapper around matplotlib to extend the functionality and create better figures.

First of all, we will recreate a figure from `matplotlib` and see how we can format it differently using `seaborn`.

In [None]:
# Here's our original figure using matplotlib

fig = plt.figure()
ax = fig.add_subplot(111) 
ax.scatter(metabric['FOXA1'], metabric['MLPH'], marker='+', color='red')
ax.set(ylabel = 'MLPH value',
       xlabel = 'FOXA1 value')
plt.savefig('test_figure.png')
plt.show()

In [None]:
sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', color='red', marker='+')
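
Because `seaborn` is built on top of `matplotlib`, axes-level functions such as `sns.scatterplot` return the matplotlib `Axes` they drew on. The sketch below, on a made-up `toy` DataFrame (hypothetical values), shows that the familiar `ax.set` call still works on the result.

```python
import matplotlib.axes
import pandas as pd
import seaborn as sns

# Made-up values standing in for two METABRIC expression columns
toy = pd.DataFrame({
    "FOXA1": [5.1, 6.3, 10.2, 11.0, 10.7, 5.8],
    "MLPH": [6.0, 6.8, 11.5, 12.1, 11.9, 6.4],
})

# The seaborn call hands back a matplotlib Axes object
ax = sns.scatterplot(data=toy, x="FOXA1", y="MLPH", color="red", marker="+")
print(isinstance(ax, matplotlib.axes.Axes))

# The usual matplotlib methods therefore still apply
ax.set(title="FOXA1 vs MLPH", xlabel="FOXA1 value", ylabel="MLPH value")
fig = ax.figure  # the parent Figure, e.g. for fig.savefig(...)
```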

In [None]:
# We can easily change styles using seaborn:

figure = plt.figure(figsize=(6, 4))
gs = figure.add_gridspec(2, 2)

# Each style applies only to axes created inside its `with` block.
# Passing ax=ax makes explicit which subplot each scatter plot is drawn on.
with sns.axes_style("darkgrid"):
    ax = figure.add_subplot(gs[0, 0])
    sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', color='red', marker='+', ax=ax)

with sns.axes_style("white"):
    ax = figure.add_subplot(gs[0, 1])
    sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', color='red', marker='+', ax=ax)

with sns.axes_style("ticks"):
    ax = figure.add_subplot(gs[1, 0])
    sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', color='red', marker='+', ax=ax)

with sns.axes_style("whitegrid"):
    ax = figure.add_subplot(gs[1, 1])
    sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', color='red', marker='+', ax=ax)

figure.tight_layout()

It looks like the patients can be split into two groups. `seaborn` allows us to easily colour points according to the value of a third column in the DataFrame.

In [None]:
sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', hue='ER_status', marker='+')

# ER-positive: Breast cancers that have estrogen receptors are called ER-positive (or ER+) cancers. 
# PR-positive: Breast cancers with progesterone receptors are called PR-positive (or PR+) cancers.

# For our dataset you get the same plot for colouring according to ER-positive and PR-positive

In [None]:
sns.scatterplot(data=metabric, x='FOXA1', y='MLPH', hue='GATA3', marker='+')

In [None]:
sns.violinplot(data=metabric, x='ER_status', y='GATA3')

#### Exercise 2:

- Produce a violin plot which shows the distribution of GATA3 values within each cohort (you should end up with 5 'violins')
- Still in one plot, split each cohort into two separate 'violins' according to ER_status, by using ER_status as the variable to colour the plot by (you should end up with 10 'violins' on one plot)

Extension: explore what the `split` argument does to your violin plots, and try out violin plots with different colour schemes using the `palette` argument

In [None]:
# violin plot Cohort vs GATA3

In [None]:
# violin plot Cohort vs GATA3 coloured by ER_status

In [None]:
# violin plot Cohort vs GATA3 coloured by ER_status, with split and PuRd palette

Provided we know which columns we are interested in and can narrow them down to a relatively small number, the `pairplot` function can be useful: it draws a grid of plots showing every pairwise relationship between the chosen columns.

In [None]:
small_metabric = metabric[['Survival_time', 'Tumour_size', 'ER_status', 'ESR1', 'PGR', 'MLPH']]
small_metabric.head()

In [None]:
sns.pairplot(small_metabric, hue='ER_status')

### Homework exercises:

Generate the following plots using seaborn's built-in titanic dataset. Once you have imported seaborn as sns, you can load the data by running the line:
`titanic = sns.load_dataset("titanic")`

1. Look through the seaborn documentation to find a plot which shows both a scatter plot of the two chosen variables and a histogram of each variable along the top and right-hand side of the plot. Use this function to plot 'fare' against 'age'.
2. Use both a boxplot and a swarmplot to investigate the distribution of 'pclass'. Spend some time exploring what this plot looks like using different seaborn styles and colours.
3. Generate a heatmap showing the correlations between each pair of variables in the dataset. Save this image as a .png file.