# Normalize and Visualize Count Tables

We want to manually normalize our count table using the TPM and RPKM methods.\
Since these methods must not be used to compare samples of different conditions, we will use only our TNF samples for the analysis.\
\
Afterward we will create a plot to examine our normalized count data and find out what we can use them for.


In [7]:
# import necessary Python packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Task1: Import count table
You find the TNF count table in `/vol/lehre/msmoabs2024/06_RNAseq/counts_TNF.tsv`. \
Import the file as a pandas DataFrame. Find out via Google how you can skip the first row of our file (the row starting with `#`). 
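One possible approach to Task1 is sketched below. It assumes the file has a single comment line starting with `#`, followed by a tab-separated header line. The two-gene table is an invented stand-in so that the snippet runs without the course file.

```python
import io
import pandas as pd

# Made-up stand-in for counts_TNF.tsv (layout is an assumption, values are invented).
toy_file = io.StringIO(
    "# comment line written by the counting tool\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tmapping/RNAseq_shScr_TNF1_srt.bam\n"
    "geneA\tchr1\t100\t1099\t+\t1000\t10\n"
    "geneB\tchr1\t5000\t6999\t-\t2000\t40\n"
)

# skiprows=1 ignores the first line, so the second line becomes the header.
counts = pd.read_csv(toy_file, sep="\t", skiprows=1)
print(counts.head())
```

For the real table, pass the file path instead of `toy_file`. Passing `comment="#"` instead of `skiprows=1` should work as well, as long as no value in the table contains a `#`.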

## Task2: Rename the sample column names
Our samples are currently named e.g. `mapping/RNAseq_shScr_TNF1_srt.bam`. This is a rather long string and we want to change that. \
Find out how to rename columns of a pandas DataFrame and think of a more convenient name for our samples.
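A minimal sketch of the renaming step, using `DataFrame.rename` with a dictionary that maps old names to new ones. The toy table and the second file name (`..._TNF2_srt.bam`) are assumptions made for illustration; the short names `TNF1` and `TNF2` are just one possible choice.

```python
import pandas as pd

# Toy table; the TNF2 column name is assumed to follow the same pattern as TNF1.
counts = pd.DataFrame({
    "Geneid": ["geneA", "geneB"],
    "mapping/RNAseq_shScr_TNF1_srt.bam": [10, 40],
    "mapping/RNAseq_shScr_TNF2_srt.bam": [20, 20],
})

# Map each long column name to a short one; unlisted columns stay unchanged.
rename_map = {
    "mapping/RNAseq_shScr_TNF1_srt.bam": "TNF1",
    "mapping/RNAseq_shScr_TNF2_srt.bam": "TNF2",
}
counts = counts.rename(columns=rename_map)
print(counts.columns.tolist())
```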

## Task3: Extract count value columns
For both TPM and RPKM we only normalize the count data of our samples. \
Extract those columns and save them in a separate pandas DataFrame. \
The result should be a DataFrame with three columns (TNF1-3).

## Task4: Extract `Length` column
For both TPM and RPKM we need the information about the gene length of all the genes. \
Extract the column named `Length` and save it in a separate pandas DataFrame.
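The column extraction in Task3 and Task4 could look roughly like the sketch below. The toy table and its values are invented, and the short sample names `TNF1` to `TNF3` assume that the columns were already renamed.

```python
import pandas as pd

# Invented toy table with already renamed sample columns.
counts = pd.DataFrame({
    "Geneid": ["geneA", "geneB"],
    "Length": [1000, 2000],
    "TNF1": [10, 40],
    "TNF2": [20, 20],
    "TNF3": [30, 10],
})

# A list of column names returns a DataFrame with exactly those columns.
sample_counts = counts[["TNF1", "TNF2", "TNF3"]]
# Double brackets keep the result a (one-column) DataFrame ...
gene_length = counts[["Length"]]
# ... while single brackets return a Series, which is handy for row-wise division.
length_series = counts["Length"]
print(type(gene_length).__name__, type(length_series).__name__)
```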

# Normalization
We can begin with the normalization of our samples.\
For this purpose we will use the DataFrame of our samples (Task3) as well as the DataFrame containing the information about the gene lengths (Task4).

## Task5: TPM normalization

To execute the TPM normalization, use the following formula, where the sum runs over all genes of one sample: \
<span style="color:red">**10^6 * (count value / gene length) / sum(count value / gene length)**</span> \
\
Save the intermediate results in their own DataFrames if necessary. \
At the end, combine the TPM normalized count values with the first six columns of our original count data (Columns `Geneid` up to `Length`).\
The result should be a DataFrame looking like the `counts_TNF.tsv`, only now with normalized counts.
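The TPM formula can be translated into pandas step by step, for example as in the following sketch. The toy table is invented and, for brevity, carries only two of the six annotation columns; the variable names are illustrative, not prescribed by the course.

```python
import pandas as pd

# Invented toy table; the real table has six annotation columns before the samples.
counts = pd.DataFrame({
    "Geneid": ["geneA", "geneB"],
    "Length": [1000, 2000],
    "TNF1": [10, 40],
    "TNF2": [20, 20],
})
sample_counts = counts[["TNF1", "TNF2"]]

# Step 1: count value / gene length, row by row (axis=0 aligns on the row index).
rate = sample_counts.div(counts["Length"], axis=0)
# Step 2: divide each column by its own column sum and scale by 10^6.
tpm = 1e6 * rate / rate.sum()
# Step 3: put the annotation columns back in front of the normalized values.
tpm_table = pd.concat([counts[["Geneid", "Length"]], tpm], axis=1)
print(tpm_table)
```

With the full table, the annotation columns could be selected by position instead, e.g. `counts.iloc[:, :6]`.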

## Task6: RPKM normalization

To execute the RPKM normalization, use the following formula, where the sum runs over all genes of one sample: \
<span style="color:red">**10^9 * count value / (sum(count value) * gene length) = 10^9 * count value / sum(count value) / gene length**</span> \
\
Save the intermediate results in their own DataFrames if necessary. \
At the end, combine the RPKM normalized count values with the first six columns of our original count data (Columns `Geneid` up to `Length`).\
The result should be a DataFrame looking like the `counts_TNF.tsv`, only now with normalized counts.
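A comparable sketch for RPKM follows. Here the counts are first scaled by the total count of each sample and only then divided by the gene length. The toy table is invented again and shortened to two annotation columns.

```python
import pandas as pd

# Invented toy table; the real table has six annotation columns before the samples.
counts = pd.DataFrame({
    "Geneid": ["geneA", "geneB"],
    "Length": [1000, 2000],
    "TNF1": [10, 40],
    "TNF2": [20, 20],
})
sample_counts = counts[["TNF1", "TNF2"]]

# Step 1: total count per sample (one value per column).
depth = sample_counts.sum()
# Step 2: 10^9 * count value / sum(count value), column by column.
scaled = 1e9 * sample_counts / depth
# Step 3: divide by the gene length, row by row.
rpkm = scaled.div(counts["Length"], axis=0)
# Step 4: put the annotation columns back in front of the normalized values.
rpkm_table = pd.concat([counts[["Geneid", "Length"]], rpkm], axis=1)
print(rpkm_table)
```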



## Task7: Export normalized count tables as a tsv file
Normally, we want to keep using our normalized count tables for further analyses. \
Therefore, save both your normalized DataFrames as a tsv file.
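Writing a DataFrame to a tab-separated file can be done with `DataFrame.to_csv`, as sketched here. The table content and the output file name `counts_TNF_tpm.tsv` are made up; the snippet writes into a temporary directory so that it does not touch your working directory.

```python
import os
import tempfile
import pandas as pd

# Invented stand-in for a normalized count table.
tpm_table = pd.DataFrame({
    "Geneid": ["geneA", "geneB"],
    "Length": [1000, 2000],
    "TNF1": [250000.0, 750000.0],
})

# Hypothetical output file name inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "counts_TNF_tpm.tsv")

# sep="\t" writes tabs instead of commas; index=False omits the row index column.
tpm_table.to_csv(out_path, sep="\t", index=False)

# Read the file back to check that it looks as expected.
reloaded = pd.read_csv(out_path, sep="\t")
print(reloaded)
```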

## Task8: Combine both dataframes for visualization purposes
To properly visualize (and compare) both normalization methods, we need to combine both DataFrames. \
<span style="color:red">Note that you have to rename the columns again (e.g. `TNF1` needs to be renamed to `TPM_TNF1` and `RPKM_TNF1`).</span> \
As a result you should have a DataFrame with six columns (`TPM_TNF1-3` and `RPKM_TNF1-3`).
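One way to rename and combine in a single step is `DataFrame.add_prefix` followed by `pd.concat`, as in this sketch. The two small input frames contain invented values and only stand in for your normalized sample columns.

```python
import pandas as pd

# Invented stand-ins for the normalized sample columns of both methods.
tpm = pd.DataFrame({"TNF1": [250000.0, 750000.0], "TNF2": [500000.0, 500000.0]})
rpkm = pd.DataFrame({"TNF1": [200000.0, 400000.0], "TNF2": [500000.0, 250000.0]})

# add_prefix renames every column; axis=1 places the frames side by side.
combined = pd.concat([tpm.add_prefix("TPM_"), rpkm.add_prefix("RPKM_")], axis=1)
print(combined.columns.tolist())
```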

## Task9: Inspect TPM vs. RPKM
Calculate the sum of all columns with the following function: \
`dataframe.apply(sum)` \
Instead of `dataframe` insert the name of your combined DataFrame from Task8. \
Explain what you see! 

# Visualization
We will now visualize our normalized data to see the difference between TPM and RPKM - and why we should prefer one over the other.\
To do this, we will use seaborn to create six histograms of our normalized samples.

<span style="color:red">**Note that you need to insert the name of your DataFrame in every seaborn call after the `data` parameter before execution!**</span> \
\
Look at the last three bars of each plot. Use your observations to explain why RPKM should not be used to compare genes of different samples. 

In [None]:
fig, axes = plt.subplots(6,1, sharex=True, sharey=True, figsize=(20,15))
sns.histplot(ax=axes[0], data=tnf_norm['TPM_TNF1'], binwidth=1000)
sns.histplot(ax=axes[1], data=tnf_norm['TPM_TNF2'], binwidth=1000)
sns.histplot(ax=axes[2], data=tnf_norm['TPM_TNF3'], binwidth=1000)
sns.histplot(ax=axes[3], data=tnf_norm['RPKM_TNF1'], binwidth=1000)
sns.histplot(ax=axes[4], data=tnf_norm['RPKM_TNF2'], binwidth=1000)
sns.histplot(ax=axes[5], data=tnf_norm['RPKM_TNF3'], binwidth=1000)
plt.yscale('log')

## DESeq2 and GenExVis 

Since we must not use RPKM or TPM to find differentially expressed genes between two (or more) conditions, we will use DESeq2 for the downstream analysis of our samples.

For the DESeq2 analysis, create a new conda environment like this:\
`conda deactivate` \
`mamba env create -f /vol/lehre/msmoabs2024/06_RNAseq/downstream_analysis/dge_analysis.yaml`\
\
Then activate it and use the provided R script:\
`R --vanilla --file=/vol/lehre/msmoabs2024/06_RNAseq/downstream_analysis/deseq2.R --args --count-table counts.tsv --conditions /vol/lehre/msmoabs2024/06_RNAseq/downstream_analysis/conditions_tnf.tsv --featcounts-log counts.tsv.summary --output ./`\
\
<span style="color:red">**Note that you may need to specify the directory where your count table is located. Also change the directory of the `--output` parameter if necessary.**</span>\
The R script might also abort with an error stating that the colData are not in the same order as the conditions. In that case, copy the `conditions_tnf.tsv` file and change the order of the Vehicle and TNF samples.\
\
Take a look at the results - can you explain them? \
\
Next, we will use the tool `GenExVis` together, which takes DESeq2 normalized tables and creates various visualizations. Let's see if we get similar results to the ones in the paper of Schmidt et al.