
# Transcriptomics
## Log Transformations and DEG analysis

By: Caroline Labelle
<br>For: BIM6065-C

<br>
Date: July  9th 2023

<hr style="border:1px solid black"> </hr>


In [None]:
Name:

## Prepping the data

To do the DEG analysis, we will be using an R package called limma, together with its voom method. We first need to prep our data so that we have a single file to load into R.

You were initially handed 6 unstranded RNA-seq samples of MCF7 cells (breast cancer): three of the samples were treated with estradiol (E2). You used STAR to align the reads and do the gene quantification. You now have 6 files with the suffix <code>ReadsPerGene.out.tab</code>.

SRR1012918 -> **treatment**<br>
SRR1012920 -> **treatment**<br>
SRR1012922 -> **treatment**<br>

SRR1012936 -> **control**<br>
SRR1012939 -> **control**<br>
SRR1012942 -> **control**<br>

Publication for the data: https://pubmed.ncbi.nlm.nih.gov/24319002/
<br>STAR documentation: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

In [None]:
### Import pandas
import pandas as pd

In [None]:
### Import one DF
## Note I: the data are not comma-separated (.csv)
## Note II: there are no headers, but the column names can
##          be found in STAR's documentation (Section 7)
## Note III: adding column names will facilitate the next steps.

fn = ""
data = pd.read_csv(fn)
data.head()
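
As a minimal sketch of the import step: STAR's `ReadsPerGene.out.tab` files are tab-separated, have no header line, and hold four columns (gene ID, then counts for unstranded, first-strand and second-strand protocols). The column names and the toy file content below are assumptions made for illustration only; they are not part of the course data.

```python
import io
import pandas as pd

# Toy stand-in for a ReadsPerGene.out.tab file (hypothetical values).
toy_file = io.StringIO(
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t30\t400\t420\n"
    "N_ambiguous\t20\t5\t4\n"
    "ENSG00000000003\t12\t7\t5\n"
    "ENSG00000000005\t0\t0\t0\n"
)

# Column names are our own choice: STAR writes no header line.
star_columns = ["gene_id", "unstranded", "strand_1", "strand_2"]

# sep="\t" because the file is tab-separated; header=None because
# the first line is data, not column names.
toy_data = pd.read_csv(toy_file, sep="\t", header=None, names=star_columns)
print(toy_data.head())
```

With a real file, the `io.StringIO` object would simply be replaced by the path to the file.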

In [None]:
### From the imported dataset, we only want a subset of the data.
### Let's get rid of the summary rows: identify them.
### We are considering unstranded RNA-seq data: which count column should
### we keep?
### We also want to keep the gene identification column.

### The subset dataset should contain two (2) columns: gene ID and gene count.
dataSubset = data
dataSubset.head()

In [None]:
### How many genes are you considering?
print(, " genes")
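
One possible way to build the subset is sketched below on toy data. It assumes the summary rows are the ones whose identifier starts with `N_` (as in STAR's output) and that the columns were named `gene_id`, `unstranded`, `strand_1` and `strand_2` at import; those names are hypothetical.

```python
import pandas as pd

# Toy table mimicking an imported sample (hypothetical values).
toy_data = pd.DataFrame({
    "gene_id": ["N_unmapped", "N_multimapping", "N_noFeature",
                "N_ambiguous", "ENSG00000000003", "ENSG00000000005"],
    "unstranded": [100, 50, 30, 20, 12, 0],
    "strand_1": [100, 50, 400, 5, 7, 0],
    "strand_2": [100, 50, 420, 4, 5, 0],
})

# Summary rows are flagged by their "N_" prefix.
is_summary = toy_data["gene_id"].str.startswith("N_")

# Keep only gene rows, and only the ID and unstranded count columns.
toy_subset = toy_data.loc[~is_summary, ["gene_id", "unstranded"]]
toy_subset = toy_subset.reset_index(drop=True)

print(len(toy_subset), "genes")
```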

In [None]:
### For now, we've only considered one (1) sample.
### You were handed six (6)... you now need to import them all and combine
### them :) 

### You can do it however you want!
### If you need some guidance, the next cell contains a suggested approach,
### using a for loop and the merge() function from Pandas.




In [None]:
import os

### First, create a list with all the gene count filenames.
fn = 

### Second, create a list of sample labels. Make sure that the labels are
### in the same order as the filenames!
sampleLabels = 

### Third, define a variable for the number of files, and a DataFrame that
### will contain all of your samples' data.
N = 
geneCount = 

### Fourth, create a for loop that will import each gene count file 
### and merge its data to your global dataset.
for i in range() :
    ### Get sample filename
    fn_tmp = 
    
    ### Import sample data (similar to the single sample imported above).
    ### Make sure to change the label for each sample.
    data_tmp = pd.read_csv(fn_tmp)
    
    ### Select the relevant information (similar to the single sample imported above).
    subset_tmp = 
    
    ### Merge the newly imported sample to the geneCount df
    geneCount = 

In [None]:
geneCount.head()
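
The loop-and-merge idea can be sketched on two toy samples held in memory. The sample labels, gene names and counts are made up, and merging on a `gene_id` column with an inner join is one choice among several.

```python
import io
import pandas as pd

# Two toy samples in STAR's four-column layout (hypothetical values).
toy_files = {
    "Treatment1": "N_unmapped\t9\t9\t9\nGENE_A\t10\t6\t4\nGENE_B\t0\t0\t0\n",
    "Control1": "N_unmapped\t8\t8\t8\nGENE_A\t5\t3\t2\nGENE_B\t2\t1\t1\n",
}
star_columns = ["gene_id", "unstranded", "strand_1", "strand_2"]

toy_counts = None
for label, content in toy_files.items():
    # Import one sample (tab-separated, no header line).
    sample = pd.read_csv(io.StringIO(content), sep="\t",
                         header=None, names=star_columns)

    # Drop summary rows and keep the gene ID and unstranded counts.
    keep = ~sample["gene_id"].str.startswith("N_")
    sample = sample.loc[keep, ["gene_id", "unstranded"]]

    # Name the count column after the sample so columns stay distinct.
    sample = sample.rename(columns={"unstranded": label})

    # The first sample starts the table; later ones are merged on gene ID.
    if toy_counts is None:
        toy_counts = sample
    else:
        toy_counts = toy_counts.merge(sample, on="gene_id")

print(toy_counts)
```

Merging on the gene ID, rather than pasting columns side by side, keeps the counts aligned even if two files list genes in a different order.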

In [None]:
### You now need to export your geneCount DataFrame to a file so
### that you can use it to do your DEG analysis.

### You should export it as a tab-separated file. You can use the to_csv() function.
### You should not export index numbers. You can export the header.
geneCount.to_csv()

### Read data from exported file
geneCount = pd.read_csv()
geneCount.head()
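
A small round-trip sketch of the export step, using a temporary directory and a made-up filename so that nothing is written next to your real data:

```python
import os
import tempfile
import pandas as pd

# Toy merged table (hypothetical values).
toy_counts = pd.DataFrame({
    "gene_id": ["GENE_A", "GENE_B"],
    "Treatment1": [10, 0],
    "Control1": [5, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_gene_counts.tsv")

    # sep="\t" gives a tab-separated file; index=False drops the row numbers.
    toy_counts.to_csv(out_path, sep="\t", index=False)

    # Read the file back with the same separator to check the export.
    reloaded = pd.read_csv(out_path, sep="\t")

print(reloaded)
```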

## Gene counts exploration

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(context="notebook", style="white", rc={"figure.figsize": (4, 2)})

In [None]:
### Plot the gene count ratio for Control1 : Treatment1


In [None]:
### Calculate the average ratio of Control1 : Treatment1
### Is there a problem?


In [None]:
import numpy as np

### Calculate the log2 of the ratio Control1 : Treatment1.
### Plot its distribution and calculate the average.
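
To see why the log2 transformation matters, here is a toy sketch with two made-up genes: one is twice as high in the control, the other twice as high in the treatment. The raw ratios do not average to 1, whereas the log2 ratios average to 0. Genes with zero counts would need extra handling (for example a pseudocount), which is left out here.

```python
import numpy as np
import pandas as pd

# Two toy genes with opposite, equally strong changes (hypothetical values).
toy_counts = pd.DataFrame({
    "gene_id": ["GENE_A", "GENE_B"],
    "Control1": [20, 10],
    "Treatment1": [10, 20],
})

# Raw ratios are asymmetric: a doubling gives 2, a halving gives 0.5.
ratio = toy_counts["Control1"] / toy_counts["Treatment1"]

# On the log2 scale the same changes become +1 and -1.
log2_ratio = np.log2(ratio)

print("mean ratio:", ratio.mean())
print("mean log2 ratio:", log2_ratio.mean())
```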


In [None]:
### What can we tell from the above plot and the calculated average value?

In [None]:
### Which genes are overexpressed in the Treatment?


In [None]:
### Which genes are the most overexpressed in the Treatment?
### Let's select the Top10.


In [None]:
### Can we conclude that these genes are the Top10 overexpressed genes in
### the Treatment samples?

# Exercises [30 pts]
Once you've completed the next sections, export your notebook as HTML and submit it to StudiUM. Make sure that your results and answers to the questions are visible and clear.

## Calculate the log2 FC for each pair of Control-Treatment [5 pts]

In [None]:
import numpy as np

In [None]:
### Let's consider all of our samples.
### We want to pair each Treatment sample with a Control sample and calculate
### their log2 FC with respect to the Treatment.
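
A possible sketch of the pairwise calculation on toy data. Several choices here are assumptions rather than requirements of the exercise: samples are paired by their number (Treatment1 with Control1, and so on), the fold change is taken as Treatment over Control, and a pseudocount of 1 is added to avoid dividing by zero or taking the log of zero.

```python
import numpy as np
import pandas as pd

# Toy counts for two genes across six samples (hypothetical values).
toy_counts = pd.DataFrame({
    "gene_id": ["GENE_A", "GENE_B"],
    "Treatment1": [7, 0], "Treatment2": [15, 1], "Treatment3": [3, 3],
    "Control1": [3, 3], "Control2": [3, 7], "Control3": [3, 0],
})

# Assumed pairing: samples with the same number go together.
pairs = [("Treatment1", "Control1"),
         ("Treatment2", "Control2"),
         ("Treatment3", "Control3")]

# Assumed pseudocount, protecting against zero counts.
pseudocount = 1

log2_fc = pd.DataFrame({"gene_id": toy_counts["gene_id"]})
for treatment, control in pairs:
    fold_change = (toy_counts[treatment] + pseudocount) / (toy_counts[control] + pseudocount)
    log2_fc[f"{treatment}_vs_{control}"] = np.log2(fold_change)

print(log2_fc)
```

Other pairings (for example every Treatment against every Control) are just as defensible; the pairing scheme is part of what the exercise asks you to think about.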


## Calculate the average log2 FC of each gene [10 pts]

In [None]:
### We now want to calculate the average log2 FC for each gene across 
### the different pairings.

### Let's create a dataframe


In [None]:
### Let's calculate the average log2 FC for each gene.
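
Averaging across pairings is a row-wise mean. A minimal sketch, assuming the log2 FC values sit in one column per pairing (the column names `pair1` to `pair3` and the values are made up):

```python
import pandas as pd

# Toy log2 FC values, one row per gene and one column per pairing.
toy_log2_fc = pd.DataFrame(
    {"pair1": [1.0, -2.0], "pair2": [2.0, -2.0], "pair3": [0.0, 2.0]},
    index=["GENE_A", "GENE_B"],
)

# axis=1 averages across columns, giving one value per gene.
pair_columns = ["pair1", "pair2", "pair3"]
toy_log2_fc["mean_log2_fc"] = toy_log2_fc[pair_columns].mean(axis=1)

print(toy_log2_fc)
```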


In [None]:
### Plot the log2 FC distribution


## Identify the Top10 overexpressed genes in Treatment [10 pts]

In [None]:
### What are the Top10 overexpressed genes in the Treatment?
### Do we get the same Top10?
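
Ranking genes by average log2 FC can be sketched with pandas' `nlargest()`. The gene names and values below are invented, and only the top 2 are selected to keep the toy example small.

```python
import pandas as pd

# Toy average log2 FC per gene (hypothetical values).
toy_mean_fc = pd.Series(
    {"GENE_A": 1.0, "GENE_B": -0.7, "GENE_C": 3.2, "GENE_D": 0.1},
    name="mean_log2_fc",
)

# The largest positive values are the most overexpressed in the Treatment,
# assuming the fold change was computed as Treatment over Control.
top_n = 2
top_genes = toy_mean_fc.nlargest(top_n)

print(top_genes)
```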


## Analyse the methodology [5 pts]

In [None]:
### Do you find the same Top10?

In [None]:
### Put your critical spectacles on!
### Is that the best approach to do a DEG analysis? 
### What could be done differently? Are we missing something?