# Matplotlib

Matplotlib is the major plotting library for Python and an essential tool for making graphs and figures. If you master it, you will be well on your way to making publication-quality figures.

In [None]:
# Use an IPython magic command to display
# figures inline with the notebook
%matplotlib inline

# You will mostly use the pyplot module
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from matplotlib.ticker import MaxNLocator
from collections import namedtuple

There are many ways to start plotting, but I think the subplots function is the easiest and most flexible. It returns a figure and one or more axes to draw on.

In [None]:
fig, ax = plt.subplots(1, figsize=(5, 5))

In [None]:
# Next, let's plot a line

data = range(0, 10)

fig, ax = plt.subplots(1, figsize=(5, 5))

# The plot function will plot a line
ax.plot(data, data)

In [None]:
# The scatter function will plot a scatter plot

fig, ax = plt.subplots(1, figsize=(5, 5))

ax.scatter(data, data)
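
Part of what makes subplots flexible is that it can return several axes at once. The sketch below reuses the same data as the cells above and draws the line and the scatter plot side by side; the names `ax_line` and `ax_points` are only illustrative.

```python
import matplotlib.pyplot as plt

data = range(0, 10)

# One row, two columns: subplots returns an array of two Axes objects,
# which are unpacked here into two separate names
fig, (ax_line, ax_points) = plt.subplots(1, 2, figsize=(10, 5))

# Left panel: connect the points with a line
ax_line.plot(data, data)
ax_line.set_title('plot')

# Right panel: draw each point as a marker
ax_points.scatter(data, data)
ax_points.set_title('scatter')

fig.tight_layout()
```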

## Exercises:

1. Now write a function that transforms a list of data and plots it using the plot function and the scatter function.

2. Write a second function that does a different transformation. Plot both functions in the same cell. 

Be creative with the transformation. You can use NumPy functions such as np.exp, np.sin, np.cos, and np.power.
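
As a hint, NumPy functions work element by element, so a transformation function can take the whole data set and return an array of the same length. The following is only a sketch with a hypothetical function name, `square`; the plotting is left for the exercise.

```python
import numpy as np

data = np.arange(0, 10)

def square(values):
    """Return each value raised to the power of 2 (hypothetical example)."""
    return np.power(values, 2)

# The result has one transformed value per input value
squared = square(data)
```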

In [None]:
# We will be using the Treehouse clinical data as well:

clinical = pd.read_csv('../data/TreehousePEDv5_clinical_metadata.2018-05-09.tsv',
                       sep='\t',
                       index_col=0)

Next, we will look at some commonly used matplotlib functions. The first plots a histogram, a valuable visualization of the distribution of your data.

In [None]:
# https://matplotlib.org/gallery/statistics/hist.html

N_points = 100000
n_bins = 20

# Draw samples from a standard normal distribution (mean 0, standard deviation 1)
x = np.random.randn(N_points)

fig, axs = plt.subplots()

_ = axs.hist(x, 
             bins=n_bins)
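
The cell above discards the return value of hist, but it can be useful for checking what was drawn. This sketch, which uses a smaller seeded sample purely so that it is repeatable, keeps the counts and bin edges that hist returns.

```python
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator and sample size are assumptions chosen for this sketch
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

fig, ax = plt.subplots()

# hist returns the count per bin, the bin edges, and the drawn bar patches
counts, bin_edges, patches = ax.hist(x, bins=20)

# 20 bins are delimited by 21 edges, and every sample falls into exactly one bin
n_bins_drawn = len(counts)
n_edges = len(bin_edges)
n_samples_counted = int(counts.sum())
```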

## Exercise:
1. Make a histogram of the ages in the Treehouse compendium
2. Is this what you expected for a pediatric gene expression compendium?

The next function draws a barplot, which is valuable for displaying categorical data. You will likely use a barplot to compare the means and standard deviations across groups in your analysis.

In [None]:
# https://matplotlib.org/gallery/statistics/barchart_demo.html

n_groups = 5

men = (20, 35, 30, 35, 27)
women = (25, 32, 34, 20, 25)

fig, ax = plt.subplots()

index = np.arange(n_groups)
bar_width = 0.35

opacity = 0.4

rects1 = ax.bar(index, 
                men, 
                bar_width,
                alpha=opacity, 
                color='b',
                label='Men')

rects2 = ax.bar(index + bar_width, 
                women, 
                bar_width,
                alpha=opacity, 
                color='r',
                label='Women')

ax.set_xlabel('Group')
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(('A', 'B', 'C', 'D', 'E'))
ax.legend()

fig.tight_layout()
plt.show()
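
To compare means and standard deviations across groups, the bar function accepts a yerr argument that draws error bars. The sketch below uses made-up replicate values (not real data) to compute a mean and sample standard deviation per group.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical measurements: three groups with four replicates each
groups = ['A', 'B', 'C']
values = np.array([[20, 22, 19, 23],
                   [35, 30, 33, 34],
                   [27, 29, 31, 25]])

means = values.mean(axis=1)
stds = values.std(axis=1, ddof=1)  # ddof=1 gives the sample standard deviation

fig, ax = plt.subplots()
positions = np.arange(len(groups))

# yerr draws an error bar of +/- one standard deviation on each bar
ax.bar(positions, means, yerr=stds, capsize=4, alpha=0.4, color='b')

ax.set_xticks(positions)
ax.set_xticklabels(groups)
ax.set_xlabel('Group')
ax.set_ylabel('Mean value')

fig.tight_layout()
```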

## Exercises:

1. Make a barplot for the number of men, women, and unknown samples in the top 5 most abundant diseases in the compendium.