# Big Data for Biologists: Decoding Genomic Function- Class 8

## How do you visualize similarities and differences of gene expression profiles across cell types? Part II

##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#MetaData>Prepare RNA-Seq samples and metadata for PCA analysis</a></li> 
 <li> <a href=#PCA1>Describe what principal component analysis is and how it can be used to analyze and visualize variation in large datasets</a></li>
 <li> <a href=#PCA2>Perform principal component analysis to identify clustering patterns in gene expression data </a></li>
 <li> <a href=#Scatter>Make a scatter plot of the output from principal component analysis</a></li> 
 </ol>

# Prepare RNA-Seq samples and metadata for PCA analysis (review)<a name='MetaData' />

In [None]:
import warnings
warnings.filterwarnings('ignore')

# load the pandas package and define an abbreviation (or alias) 
import pandas as pd   

We will focus our analysis on 4 of the anatomical structures and check for differential gene expression among them.  


In [None]:
metadata_filename='/data/datasets/RNAseq/rnaseq_metadata.txt'
rnaseq_filename='/data/datasets/RNAseq/rnaseq_normalized.tsv'

## BEGIN SOLUTION ##
## END SOLUTION ##

In [None]:
#Get the number of rows and columns in the rnaseq_data table 
## BEGIN SOLUTION ##
## END SOLUTION ## 
print("Number rows: "+str(num_rows))#prints number of rows -- this is the gene axis
print("Number columns: "+str(num_cols))#prints number of columns -- this is the sample axis

In [None]:
#Transpose the data frame 
#Now, our features (genes) are along the column axis, and sample names are along the row axis. This will make for easier
#downstream analysis. 

## BEGIN SOLUTION ##
## END SOLUTION ## 
print("Number rows: "+str(rnaseq_data_subset_transposed.shape[0]))#prints number of rows -- this is the sample axis 
print("Number columns: "+str(rnaseq_data_subset_transposed.shape[1]))#prints number of columns -- this is the gene axis

In [None]:
#Merge the rnaseq_subset dataframe with the metadata dataframe so we can more easily sub-select the organ systems 
#of interest.

## BEGIN SOLUTION ##
## Hint: Use the "merge" function from pandas 
## END SOLUTION ##
display(merged_df.head())
display(merged_df.shape)
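
As a minimal sketch of this kind of merge, the toy example below joins a small expression table to a metadata table on their shared sample index. The sample names, gene names, and values are made up for illustration and are not taken from the course dataset; the real tables may need a different join key.

```python
import pandas as pd

# Hypothetical expression table: rows are samples, columns are genes
toy_expression = pd.DataFrame(
    {"GeneA": [5.0, 0.0, 2.5], "GeneB": [1.0, 3.0, 0.0]},
    index=["sample1", "sample2", "sample3"],
)

# Hypothetical metadata table indexed by the same sample names
toy_metadata = pd.DataFrame(
    {"System": ["Blood", "Immune", "Digestive"]},
    index=["sample1", "sample2", "sample3"],
)

# Join on the row index so each sample keeps its expression values and its metadata
toy_merged = pd.merge(toy_expression, toy_metadata, left_index=True, right_index=True)
print(toy_merged)
```

Merging on the index avoids relying on row order, so each sample is paired with its own metadata even if the two tables are sorted differently.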

In [None]:
#Define the systems of interest
systems_subset=["Blood","Embryonic","Immune","Respiratory"]

In [None]:
#Pick out the samples (rows) in the merged dataframe that contain the samples from the systems of interest 
## BEGIN SOLUTION ##
## END SOLUTION ##

In [None]:
#Select the rows in the data matrix that contain the samples we wish to analyze (i.e. the samples
#from the blood, embryonic, immune, and respiratory systems)

## BEGIN SOLUTION ##
## END SOLUTION ##

In [None]:
display(merged_df_subset.head())
display(merged_df_subset.shape)

In [None]:
#Check row & column numbers in merged_df_subset 
print("Number rows: "+str(merged_df_subset.shape[0]))#prints number of rows -- this is the sample axis
print("Number columns: "+str(merged_df_subset.shape[1]))# prints the number of columns -- this is the gene axis 

In [None]:
#We want to exclude the genes that have expression <= 0 in all 4 organ systems of interest
nonzero_columns=merged_df_subset.iloc[:,0:-3].sum(axis=0)>0
nonzero_columns['System']=True
nonzero_columns['Organ']=True
nonzero_columns['CellType']=True

Note that `merged_df_subset.iloc[:,0:-3].sum(axis=0)>0` returns a value of "True" or "False" for each gene column in the `merged_df_subset` matrix: "True" if the gene's summed expression across the selected samples is greater than zero, and "False" otherwise. Selecting with such a True/False mask is referred to as Boolean indexing. We also set the metadata columns (System, Organ, CellType) to True so that they are kept in the matrix. 
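
The same pattern can be sketched on a toy table. The gene names and values below are hypothetical, and the table has a single metadata column rather than three, so the slice is `0:-1` instead of `0:-3`.

```python
import pandas as pd

# Hypothetical table: three gene columns followed by one metadata column
toy_df = pd.DataFrame(
    {
        "GeneA": [5.0, 0.0, 2.5],
        "GeneB": [0.0, 0.0, 0.0],  # never expressed, so it should be dropped
        "GeneC": [1.0, 3.0, 0.0],
        "System": ["Blood", "Immune", "Blood"],
    }
)

# Sum each gene column over all samples; True where the total is positive
keep_columns = toy_df.iloc[:, 0:-1].sum(axis=0) > 0

# Mark the metadata column as True so that it is kept as well
keep_columns["System"] = True

# Select the columns flagged True, by position
toy_nonzero = toy_df.iloc[:, keep_columns.tolist()]
print(toy_nonzero.columns.tolist())
```

Here `GeneB` is removed because its expression sums to zero, while the other two genes and the metadata column are retained.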

Next, we identify the columns with a value of "True", and select them from `merged_df_subset`. 
This can be done with the command: 

In [None]:
merged_df_subset_nonzero=merged_df_subset.iloc[:,nonzero_columns.tolist()]
print(merged_df_subset_nonzero.shape)

We have extracted RNA-seq expression data for our four organ systems of interest. We have also removed all genes that are not expressed in any of the four organ systems.

## What is principal component analysis (PCA)? <a name='PCA1' />

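Principal component analysis (PCA) is a statistical method to understand and visualize variation in large datasets. It re-expresses the data along a small number of new, uncorrelated axes (the principal components), ordered so that the first component captures the most variance, the second captures the most of the variance that remains, and so on.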

In [None]:
from IPython.display import IFrame
IFrame(src="https://docs.google.com/presentation/d/12AlDV7G7aEasvS9HTllyoMNlsBdmCO6XODC7tOl5f70/embed?",
       frameborder="0", width="960", height="749", allowfullscreen="true", mozallowfullscreen="true", webkitallowfullscreen="true")

We will use the [scikit-learn](http://scikit-learn.org/stable/) Python library to perform principal component analysis. The package is imported under the name `sklearn`. This library has a number of built-in tools for performing statistical analysis and machine learning. 

[This reference page](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) documents the scikit-learn `PCA` class and provides a guide to performing PCA with it.
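
As a rough sketch of how the `PCA` class is typically used, the example below runs it on a small random matrix. The data are randomly generated and the variable names are chosen for illustration only; they stand in for a samples-by-genes expression table.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 samples (rows) by 4 genes (columns)
rng = np.random.default_rng(0)
toy_expression = rng.normal(size=(6, 4))

# Ask for 2 principal components; PCA centers each column internally
toy_pca = PCA(n_components=2)
toy_scores = toy_pca.fit_transform(toy_expression)

print(toy_scores.shape)           # one row per sample, one column per component
print(toy_pca.components_.shape)  # one row per component, one column per gene
print(toy_pca.explained_variance_ratio_)
```

The rows of `components_` hold the weight of each gene in each component, so genes with large absolute weights contribute most to that component.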

Thought questions: 

* What do principal components represent?
* Which genes will have the greatest influence on the principal components?
* Which samples / cells will cluster together in the PCA plot?

In [None]:
# Perform principal component analysis on the data to check for clustering patterns
from sklearn.decomposition import PCA as sklearnPCA

#We decompose the data into 10 principal components 
sklearn_pca = sklearnPCA(n_components=10)
#We want to exclude the metadata columns from the PCA transformation -- they have served their purpose in helping 
#us filter the dataset to the organ systems of interest, and now we remove them. 
metadata_columns=['System','Organ','CellType']
metadata_subset=merged_df_subset_nonzero[metadata_columns]
merged_df_subset_nometadata = merged_df_subset_nonzero.drop(columns=metadata_columns)

pca_results = sklearn_pca.fit_transform(merged_df_subset_nometadata)

In [None]:
display(merged_df_subset_nonzero.shape)
display(merged_df_subset_nometadata.shape)
display(metadata_subset.shape)

We visualize the fraction of variance explained by each principal component in a graph called a "scree plot".

In [None]:
#We use the plotnine plotting library to generate a scree plot from the principal component analysis. 
#Import the plotting functions from plotnine
from plotnine import * 

In [None]:
print(sklearn_pca.explained_variance_ratio_)

In [None]:
#We use the plotnine plotting library to generate a scree plot (as a bar plot) of the variance explained by each component
y=sklearn_pca.explained_variance_ratio_
x=range(1,len(y)+1)
qplot(x=x,
      y=y,
      geom="bar",
      stat="identity",
      xlab="PC",
      ylab="Fraction of variance explained")

This indicates that the first principal component explains 33% of the variance in the data, while the second principal component explains 9% of the variance. 
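
It is often useful to also look at the running total of these fractions, which shows how much variance the first few components capture together. The sketch below uses illustrative numbers: the first two echo the fractions quoted above, and the remaining values are made up.

```python
import numpy as np

# Illustrative fractions of variance explained by PC1..PC5.
# The first two echo the values quoted above; the rest are made up.
toy_variance_ratio = np.array([0.33, 0.09, 0.06, 0.04, 0.03])

# Running total: fraction of variance captured by the first k components
toy_cumulative = np.cumsum(toy_variance_ratio)

for pc_number, (fraction, total) in enumerate(
    zip(toy_variance_ratio, toy_cumulative), start=1
):
    print(f"PC{pc_number}: {fraction:.2f} alone, {total:.2f} cumulative")
```

With real results, the same calculation can be applied to `sklearn_pca.explained_variance_ratio_`.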

Scree plots can be visualized as both bar graphs and line plots. Below, we write the code to visualize the scree plot as a line graph. 


In [None]:
qplot(x=x,
      y=y,
      geom="line",
      xlab="PC",
      ylab="Fraction of variance explained")

In [None]:
pca_results[0:10]

In [None]:
print(pca_results.shape)

In [None]:
print(type(pca_results))

## Make a scatter plot of the output from principal component analysis <a name='Scatter' />

In [None]:
#We make a scatterplot of PC1 vs PC2 
x=pca_results[:,0]
y=pca_results[:,1]
qplot(x=x,
      y=y,
      geom="point",
      xlab="PC1",
      ylab="PC2")

To investigate whether there is any clustering of samples by organ system, we can color-code by the 'System' column from the metadata table.

We use the "scale_color_discrete" function to assign discrete color names (selected from a pre-defined color palette) to each System. 
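
One way to keep each point's coordinates and its label together is to collect them in a single table before plotting. The sketch below does this with made-up PC scores and labels; the column names are assumptions chosen for illustration. A table in this shape could also be passed to plotnine's `ggplot` with `aes(x="PC1", y="PC2", color="System")`.

```python
import numpy as np
import pandas as pd

# Made-up PCA scores: 4 samples (rows) by 2 components (columns)
toy_scores = np.array([[1.2, 0.3], [1.0, 0.5], [-0.9, -0.4], [-1.1, -0.2]])

# Made-up organ system label for each sample, in the same row order
toy_systems = ["Blood", "Blood", "Immune", "Immune"]

# One row per sample: coordinates plus the label used for coloring
toy_plot_df = pd.DataFrame(toy_scores, columns=["PC1", "PC2"])
toy_plot_df["System"] = toy_systems

# Average position of each system along PC1 and PC2
toy_centroids = toy_plot_df.groupby("System")[["PC1", "PC2"]].mean()
print(toy_centroids)
```

If samples from the same system cluster together, their average positions will be well separated, as they are in this toy example.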


In [None]:
qplot(x=x,
      y=y,
      geom="point",
      xlab="PC1",
      ylab="PC2",
      color=list(metadata_subset['System']))+scale_color_discrete(name="System")

In [None]:
#Make a scatter plot of principal component 2 (PC2) vs principal component 3 (PC3)
#Make sure to change your axis labels too!

## BEGIN SOLUTION ##
## END SOLUTION ##

In [None]:
#Make a scatter plot of principal component 1 (PC1) vs principal component 3 (PC3)
#Make sure to change your axis labels too!

## BEGIN SOLUTION ##
## END SOLUTION ##