## Seaborn

Matplotlib is great, but it takes a lot of work to modify the style of the graphs. Seaborn is a Python library built on top of matplotlib that has much of that grunt work already factored in, so effective figures take far less code. Seaborn has many valuable functions. We will discuss distplot, swarmplot, boxplot, violinplot, and barplot. We will use seaborn to explore pediatric gene expression data.

In [None]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
# We will be using the Treehouse clinical data as well:

clinical = pd.read_csv('../data/TreehousePEDv5_clinical_metadata.2018-05-09.tsv',
                       sep='\t',
                       index_col=0)

clinical.head()

In [None]:
# We will be focusing on a form of pediatric cancer known as neuroblastoma
samples = clinical[clinical['disease'] == 'neuroblastoma'].index.values

## Distplot

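The distplot function plots the distribution of a data set. Be aware that distplot draws a kernel density estimate (KDE) on top of the histogram by default. If this is not desired, set the `kde` keyword to False. Note that distplot is deprecated from seaborn 0.11 onward in favor of `histplot` and `displot`, so newer versions will emit a deprecation warning when it is called.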

In [None]:
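fig, ax = plt.subplots(1, figsize=(5, 5))

# Draw 1000 values from a normal distribution (mean 0, standard deviation 10)
data = np.random.normal(0, 10, 1000)

# Pass the Axes created above so the plot lands on this figure
sns.distplot(data, ax=ax)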

### Exercise 1:
Next, we are going to investigate the expression of the oncogene known as MYCN.

1. Load in the Treehouse expression data using the pandas read_csv function (see below).
2. Subset the expression data using the samples we isolated from the clinical data
3. Plot the expression distribution of the MYCN gene using the distplot function.
   (hint: df["MYCN"] )
4. Determine if this distribution looks like a normal distribution.

In [None]:
exp = pd.read_csv('../data/nbl-expression.tsv',
                  sep='\t', 
                  header=None,
                  index_col=0)

exp.index.name = 'Sample'
exp.columns = ['MYCN']
exp.head()
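
Subsetting a dataframe by a list of sample IDs (step 2 of the exercise) can be sketched on a toy table. All sample IDs and values here are made up, and `toy_exp` and `toy_samples` are hypothetical names. This is one possible approach, not the only one.

```python
import pandas as pd

# Toy expression table indexed by sample ID (IDs and values are made up)
toy_exp = pd.DataFrame(
    {"MYCN": [7.2, 1.3, 9.8, 2.1]},
    index=["sample_A", "sample_B", "sample_C", "sample_D"],
)
toy_exp.index.name = "Sample"

# IDs of interest; "sample_E" is deliberately missing from the table
toy_samples = ["sample_A", "sample_C", "sample_E"]

# Keep only IDs present in both places, because .loc raises a KeyError
# when asked for labels that are not in the index
shared = toy_exp.index.intersection(toy_samples)
subset = toy_exp.loc[shared]
print(subset)
```

Taking the intersection first is a defensive choice: the clinical table may list samples that have no expression data.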

### Exercise 2:
For this exercise we will focus on the TARGET samples because the TARGET samples have additional molecular data associated with them.

1. Load in the TARGET samples that are known to be MYCN amplified:
    `pth = "../data/MYCN-Amplified"`
2. Subset the expression matrix from above to only include the TARGET samples.
3. Create a plot where the MYCN amplified and MYCN non-amplified samples have different colors.
    - Assume that any TARGET sample that is not in the MYCN amplified list is not amplified.

Hint:
Create two lists of sample IDs. One with the MYCN-amplified samples and another with all other sample IDs.

## Swarmplot 

The next plot is called a swarmplot. This is another helpful plot for displaying categorical data. One of the benefits of a swarmplot is that it gives your reader an idea of how many samples were collected for each category. The data points are shifted sideways so that they do not overlap, which gives an idea of the distribution of the data set.

One of the challenges with the seaborn package is that it is somewhat picky about how you give it the data. I've found that the easiest way to use these functions is to create a dataframe in long form. A long-form dataframe has one observation per row. The goal for this next section of the notebook is to create a dataframe with the columns sample_id, MYCN_status, and MYCN_expression, and then fill it in with the appropriate data
(hint: there are many different ways to do this, but using iterrows may help you create a new dataframe in long form).
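
One way to build such a long-form dataframe with iterrows is sketched here on toy data. The sample IDs, values, and the `amplified` set are made up. The status labels "amplified" and "non-amplified" are also an assumption, since any consistent labels would work.

```python
import pandas as pd

# Toy expression table indexed by sample ID (IDs and values are made up)
toy_exp = pd.DataFrame(
    {"MYCN": [9.1, 2.4, 8.7, 3.0]},
    index=["toy_01", "toy_02", "toy_03", "toy_04"],
)

# Hypothetical set of MYCN-amplified sample IDs
amplified = {"toy_01", "toy_03"}

# Collect one dictionary per sample, i.e. one observation per row
rows = []
for sample_id, row in toy_exp.iterrows():
    status = "amplified" if sample_id in amplified else "non-amplified"
    rows.append({
        "sample_id": sample_id,
        "MYCN_status": status,
        "MYCN_expression": row["MYCN"],
    })

df = pd.DataFrame(rows, columns=["sample_id", "MYCN_status", "MYCN_expression"])
print(df)
```

Building a list of dictionaries and converting it once at the end is usually simpler than appending to a dataframe inside the loop.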

The function call for swarmplot is like so:

`sns.swarmplot(x='MYCN_status', y='MYCN_expression', data=df)`

If you get a strange error, try converting the MYCN expression column to a numeric type using the `pd.to_numeric` function:

`df["MYCN_expression"] = pd.to_numeric(df["MYCN_expression"])`
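
The numeric conversion followed by the swarmplot call can be sketched as follows. The toy dataframe is made up, and its expression values are deliberately stored as strings to mimic the situation described above.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy long-form dataframe; expression values are strings on purpose
df = pd.DataFrame({
    "sample_id": ["toy_01", "toy_02", "toy_03", "toy_04", "toy_05", "toy_06"],
    "MYCN_status": ["amplified"] * 3 + ["non-amplified"] * 3,
    "MYCN_expression": ["9.1", "8.7", "9.5", "2.4", "3.0", "2.2"],
})

# Convert the string column to numbers so seaborn can place the points
df["MYCN_expression"] = pd.to_numeric(df["MYCN_expression"])

fig, ax = plt.subplots(1, figsize=(5, 5))
sns.swarmplot(x="MYCN_status", y="MYCN_expression", data=df, ax=ax)
```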

### Exercise 3:
1. Create a long-form dataframe with the sample_id, MYCN_status, and MYCN_expression values
2. Plot the data using the swarmplot function `sns.swarmplot`

## Boxplot & Violinplot

One nice thing about seaborn is that once you have the dataframe in long-form, you can just plug it into other seaborn functions to see how it looks.

### Exercise 4:
1. Plot the MYCN data using the boxplot function `sns.boxplot`
2. Plot the MYCN data using the violinplot function `sns.violinplot`
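
The same long-form dataframe can be handed to both functions with identical arguments. This sketch uses simulated values rather than the real MYCN data, and the group sizes and means are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated long-form data (not the real MYCN data)
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "MYCN_status": ["amplified"] * 20 + ["non-amplified"] * 20,
    "MYCN_expression": np.concatenate([rng.normal(9, 1, 20),
                                       rng.normal(3, 1, 20)]),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Same x, y, and data arguments; only the function name changes
sns.boxplot(x="MYCN_status", y="MYCN_expression", data=df, ax=axes[0])
sns.violinplot(x="MYCN_status", y="MYCN_expression", data=df, ax=axes[1])
```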

## Multiple plots in one figure

I encouraged you to use the plt.subplots function because it makes it straightforward to make more complex figures. We will make a figure here that includes all of the plots you made. To make a 2 by 2 figure, use:

`fig, axes = plt.subplots(2, 2, figsize=(10, 10))`

Now the axes object is a 2 by 2 NumPy array of Axes, one per panel. To access each element, use standard array indexing. For example, the top left panel can be accessed using `axes[0, 0]`.

You can pass the Axes object to seaborn functions using the ax keyword. 

`sns.swarmplot(x="MYCN_status", y="MYCN_expression", data=df, ax=axes[0, 0])`
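
A minimal sketch of the `ax` keyword on a 2 by 2 grid, using simulated data. Only two of the four panels are filled in; the choice of panels and plot types is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated long-form data (not the real MYCN data)
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "MYCN_status": ["amplified"] * 15 + ["non-amplified"] * 15,
    "MYCN_expression": np.concatenate([rng.normal(9, 1, 15),
                                       rng.normal(3, 1, 15)]),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 10))

# Row 0, column 0 is the top left panel
sns.swarmplot(x="MYCN_status", y="MYCN_expression", data=df, ax=axes[0, 0])
axes[0, 0].set_title("axes[0, 0]")

# Row 1, column 1 is the bottom right panel
sns.boxplot(x="MYCN_status", y="MYCN_expression", data=df, ax=axes[1, 1])
axes[1, 1].set_title("axes[1, 1]")

# Reduce overlap between panel labels
fig.tight_layout()
```

The first index selects the row and the second selects the column, so `axes[0, 1]` is the top right panel and `axes[1, 0]` is the bottom left.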

### Exercise 5:
1. Make a four panel figure where the neuroblastoma MYCN expression distribution is plotted in the top left panel, the swarm plot is in the top right panel, the boxplot is in the bottom left panel, and the `sns.violinplot` is in the bottom right panel.

## Barplot

Seaborn also has a nice barplot implementation. Make the same barplot you used in the matplotlib notebook, but use the `sns.barplot` function.

In [None]:
#help(sns.barplot)
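
As a hedged illustration of the `sns.barplot` call, the sketch below uses a simulated long-form dataframe. It is not the barplot from the matplotlib notebook, and the data are made up. By default, `sns.barplot` draws the mean of each category with an error bar.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated long-form data (not the data from the matplotlib notebook)
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "MYCN_status": ["amplified"] * 12 + ["non-amplified"] * 12,
    "MYCN_expression": np.concatenate([rng.normal(9, 1, 12),
                                       rng.normal(3, 1, 12)]),
})

fig, ax = plt.subplots(1, figsize=(5, 5))

# One bar per category; bar height is the category mean
sns.barplot(x="MYCN_status", y="MYCN_expression", data=df, ax=ax)
ax.set_ylabel("Mean MYCN expression")
```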