#### Project 1: Differential Expression Analysis (transcriptomics)


The goal for this project is to investigate differences in gene expression between samples of healthy lung tissues and samples of idiopathic pulmonary fibrosis (IPF) lung tissue.
 - [Data here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150910)
 - [Paper here](https://pmc.ncbi.nlm.nih.gov/articles/PMC7667907/)

**DISCLAIMER:**
The purpose of this analysis is to be educational and simple, not to follow current best practices for differential expression analysis. If you're doing this for real as part of a research project, you should probably use an R package like [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html), [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), or [Limma-Voom](https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/thursday/DE.html) and follow the standard analysis pipelines.

In [1]:
%%capture
%pip install scipy

In [2]:
import numpy as np
import csv
import random
import scipy
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp
import seaborn as sns

#### Step 1: Reading and processing data

**Note:** This data is not included in this GitHub repo (because it's not my dataset), but it can be downloaded for free from the [Gene Expression Omnibus website here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150910). The specific file we need is called `GSE150910_gene-level_count_file.csv`.

We'll store the gene names in a list called `genes`, and we'll store the expression measurements for each gene in a list of lists called `data`.

In [3]:
genes = []
data = []
with open("./data/GSE150910_gene-level_count_file.csv") as csvfile:
  reader = csv.reader(csvfile, delimiter=',')
  for row in reader:
    genes.append(row[0])
    data.append(row[1:])

The first row was the sample labels, so we'll save that as its own list called `samples`, and then remove it from `data` and `genes`.


In [4]:
samples = data[0]
data = data[1:]
genes = genes[1:]
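
To make the reading loop and the header split concrete, here is a minimal sketch on a tiny in-memory CSV. The gene names, sample names, counts, and the `symbol` header cell are made up for illustration, and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Hypothetical miniature of the count file: the first row is the header,
# and the first column holds the gene name.
toy_csv = io.StringIO(
    "symbol,control_1,ipf_1,chp_1\n"
    "GENE_A,10,20,30\n"
    "GENE_B,0,5,1\n"
)

toy_genes = []
toy_data = []
for row in csv.reader(toy_csv, delimiter=","):
    toy_genes.append(row[0])   # first column: gene name (or the header cell)
    toy_data.append(row[1:])   # remaining columns: one value per sample

# The first row holds the sample labels, so split it off.
toy_samples = toy_data[0]
toy_data = toy_data[1:]
toy_genes = toy_genes[1:]

print(toy_samples)
print(toy_genes)
print(toy_data)  # values are still strings at this point
```

Note that `csv.reader` returns every field as a string, which is why the counts still need to be converted to numbers afterwards.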

Next, we'll cast everything to NumPy arrays, since those are easier to work with than lists. We also need to cast the elements of `data` to floats, since they were read in as strings.

In [5]:
data = np.array(data).astype(float)
genes = np.array(genes)
samples = np.array(samples)

Quick check to make sure the shapes of these arrays make sense...

In [6]:
print(genes.shape)
print(samples.shape)
print(data.shape)


(18838,)
(288,)
(18838, 288)


Quick check to make sure the content of these arrays looks okay...

In [7]:
print("genes:")
print(genes[0:5])
print("samples:")
print(samples[0:5])
print("data:")
print(data[0:5])

genes:
['TSPAN6' 'TNMD' 'DPM1' 'SCYL3' 'C1orf112']
samples:
['chp_26' 'chp_31' 'chp_34' 'chp_38' 'chp_1']
data:
[[1361.  993.  351. ...  465.  639.  944.]
 [   5.   13.    0. ...    0.    0.    0.]
 [1929. 2775. 1894. ...  860. 1387. 2150.]
 [ 176.  216.  208. ...   21.  146.  204.]
 [  93.  143.   97. ...   72.   69.  158.]]


Okay, next we want to make an array with the experimental condition labels for each sample. We can do this with some pretty basic string manipulation.

In [8]:
labels = []

for i in range(len(samples)):
  tmp = samples[i].split("_")
  labels.append(tmp[0])
    
labels = np.array(labels)

Quick checks...

In [9]:
print(labels[0:5])
print(np.unique(labels))

['chp' 'chp' 'chp' 'chp' 'chp']
['chp' 'control' 'ipf']
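
The same label extraction can also be written as a list comprehension. This is only a sketch on made-up sample names, and it assumes (as in this dataset) that the condition is the part of the sample name before the first underscore.

```python
import numpy as np

# Made-up sample names in the same "<condition>_<number>" style
toy_samples = np.array(["chp_26", "control_3", "ipf_12"])

# Take the text before the first underscore as the condition label
toy_labels = np.array([name.split("_")[0] for name in toy_samples])

print(toy_labels)
```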


Next, we'll get rid of the chronic hypersensitivity pneumonitis (CHP) samples so that we're left with only the control and idiopathic pulmonary fibrosis (IPF) samples.

In [10]:
data = data[:, labels != "chp"]
samples = samples[labels != "chp"]
labels = labels[labels != "chp"]




Quick sanity check again...

In [11]:
print(data.shape)
print(samples.shape)
print(labels.shape)

(18838, 206)
(206,)
(206,)
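
One detail worth noticing in the filtering cell: `labels` is filtered last, because each mask is recomputed from `labels` itself. A small sketch on made-up data shows a variant that computes the mask once, so the order of the remaining lines no longer matters. The names `toy_data`, `toy_labels`, and `keep` are hypothetical.

```python
import numpy as np

toy_labels = np.array(["chp", "control", "ipf", "chp"])
toy_data = np.array([[1.0, 2.0, 3.0, 4.0],
                     [5.0, 6.0, 7.0, 8.0]])  # rows = genes, columns = samples

# Compute the boolean mask once, then reuse it for every array
keep = toy_labels != "chp"

toy_data = toy_data[:, keep]   # drop CHP columns
toy_labels = toy_labels[keep]  # drop CHP labels

print(toy_data.shape)
print(toy_labels)
```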


The total number of counts in a sample is called its **sequencing depth**. We want to control for differences in sequencing depth between samples, in case the experiment happened to be more sensitive in some samples than in others.
To do this, we will apply a counts per million (CPM) normalization: for each sample, we divide the count for each gene by the total number of counts in that sample, then multiply by one million.

In [12]:
# Scale each sample (column) so that its counts sum to one million
for j in range(data.shape[1]):
  columnSum = data[:, j].sum()
  data[:, j] = data[:, j] / columnSum * 1000000

In [13]:
print(data)

[[2.62026354e+01 3.98260770e+01 3.24569396e+01 ... 3.31761577e+01
  3.12351859e+01 3.07286955e+01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.82286581e-01
  0.00000000e+00 0.00000000e+00]
 [7.37006988e+01 8.12189622e+01 5.73764856e+01 ... 6.33142058e+01
  6.77984395e+01 6.99859061e+01]
 ...
 [1.23451757e-01 2.91499191e-01 1.92280448e-01 ... 3.03810968e-02
  1.46644065e-01 1.62757921e-01]
 [1.82091341e+00 2.76924232e+00 1.42287531e+00 ... 1.79248471e+00
  1.80861014e+00 2.05074981e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]
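
As a small worked example of CPM normalization, the sketch below uses a made-up 3-gene, 2-sample count matrix in which the second sample was sequenced ten times deeper than the first. It uses NumPy broadcasting instead of an explicit loop; the names `toy_counts`, `depth`, and `toy_cpm` are hypothetical.

```python
import numpy as np

# Made-up counts: rows = genes, columns = samples.
# Sample 2 has the same composition as sample 1 but 10x the depth.
toy_counts = np.array([[10.0, 100.0],
                       [30.0, 300.0],
                       [60.0, 600.0]])

# Sequencing depth = total counts per sample (one value per column)
depth = toy_counts.sum(axis=0)

# Broadcasting divides every column by its own depth
toy_cpm = toy_counts / depth * 1_000_000

print(depth)
print(toy_cpm)
```

After normalization the two columns are identical, and each column sums to one million, which is the sense in which CPM removes the effect of sequencing depth.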


This is one common way of normalizing gene expression data. Some other methods also control for gene length (which we don't have for this dataset) in addition to sequencing depth.
Next, we'll apply a filter to eliminate genes that had very low expression in both conditions: we keep a gene only if its mean CPM is at least 5 in the control samples or in the IPF samples.

In [14]:
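# Mean CPM of each gene within each condition
mean_CPM_Control = data[:, labels == "control"].mean(axis=1)
mean_CPM_ipf = data[:, labels == "ipf"].mean(axis=1)

# Keep genes with a mean CPM of at least 5 in either condition
toKeep = (mean_CPM_ipf >= 5) | (mean_CPM_Control >= 5)

data = data[toKeep, :]
genes = genes[toKeep]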

print(data.shape)
print(len(genes))

(11097, 206)
11097
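
To see how the "either condition" rule behaves, here is a sketch on a made-up CPM matrix with four genes: one low in both conditions, one expressed only in controls, one expressed only in IPF, and one expressed in both. All names and values are hypothetical; the threshold of 5 matches the cell above.

```python
import numpy as np

toy_genes = np.array(["LOW_BOTH", "CONTROL_ONLY", "IPF_ONLY", "HIGH_BOTH"])
toy_labels = np.array(["control", "control", "ipf", "ipf"])
toy_cpm = np.array([[ 1.0,  2.0,  3.0,  4.0],
                    [ 9.0, 11.0,  1.0,  2.0],
                    [ 0.0,  1.0,  6.0,  8.0],
                    [20.0, 30.0, 40.0, 50.0]])

# Per-gene mean CPM within each condition
mean_control = toy_cpm[:, toy_labels == "control"].mean(axis=1)
mean_ipf = toy_cpm[:, toy_labels == "ipf"].mean(axis=1)

# Keep a gene if it reaches the threshold in at least one condition
keep = (mean_control >= 5) | (mean_ipf >= 5)

print(toy_genes[keep])
```

Only the gene that is low in both conditions is dropped. Using "or" rather than "and" matters here, since genes that are switched on in just one condition are often the most interesting ones for a differential expression analysis.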


**NOTE:** In a real-life research project, you might want to apply additional quality checks and normalizations at this step: batch correction, correction based on demographic information about subjects (sex, race, age, health status, etc.), and identifying outliers and deciding whether to remove them. For the purposes of this class, we'll skip these steps.

#### Step 2: Differential expression analysis