# Result analysis for differential gene expression in samples from the BeatAML study*

This is a continuation of the differential gene expression (DGE) analysis performed for samples from the study by Tyner et al. (2018).

This is a very basic functional analysis of the genes differentially expressed between the compared groups (samples from patients with or without a mutation in the NPM1 gene). Besides the DGE results from the previous step, a dataset linking gene IDs to GO annotations is required.

The GO annotation data is already saved in the input directory as _`mart_export_go.tsv.gz`_. It can be downloaded again as follows:
- go to https://www.ensembl.org/biomart/martview
- as `-CHOOSE DATABASE-` select `Ensembl Genes 109`
- as `-CHOOSE DATASET-` select `Human Genes (GRCh38.p13)`
- go to `Attributes` and in the section `GENES:` select:
    - Gene stable ID
- and in the section `EXTERNALS:` (subsection _Go_) select:
    - GO term accession
    - GO term name**
    - GO term evidence code
    - GO domain
- deselect any attributes other than those listed above, in all sections, if any were preselected
- click the `Results` button
- in `Export all results to` choose `Compressed file (.gz)`, `TSV`
- check the option `Unique results only`
- click the `Go` button
- save the file in a location of your choosing
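
After a fresh download it can be worth confirming that the export contains the attributes selected above. The following is a minimal sketch of such a check, not part of the original analysis: the helper `missing_go_columns` is a hypothetical name, and a single-row in-memory table stands in for the downloaded file.

```python
import io

import pandas as pd

# Column names as listed in the BioMart attribute selection above.
EXPECTED_COLUMNS = [
    "Gene stable ID",
    "GO term accession",
    "GO term name",
    "GO term evidence code",
    "GO domain",
]


def missing_go_columns(df):
    """Return the expected BioMart columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny in-memory stand-in for a downloaded export (illustration only).
sample_tsv = (
    "Gene stable ID\tGO term accession\tGO term name\t"
    "GO term evidence code\tGO domain\n"
    "ENSG00000210049\tGO:0006412\ttranslation\tIEA\tbiological_process\n"
)
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

missing = missing_go_columns(sample_df)
print(missing)  # an empty list means all expected columns are present
```

For the real file, `sample_df` would be replaced by the DataFrame read from the saved `.tsv.gz` export.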

The logic behind the analysis is concisely explained step by step.

---

<font size="1">
*Tyner JW, Tognon CE, Bottomly D, Wilmot B, Kurtz SE, Savage SL, Long N, Schultz AR, Traer E, Abel M, Agarwal A. Functional genomic landscape of acute myeloid leukaemia. Nature. 2018 Oct 25;562(7728):526-31.
    
**This is actually the only column that will be utilised; however, it is good to have the others at hand (just in case).
</font>

---

- Import the pandas library and load the DGE results into the `res_df` DataFrame. Display the whole DataFrame.

In [1]:
import pandas as pd

In [2]:
res_df = pd.read_csv('output/DGE_results.tsv', sep='\t')
res_df

Unnamed: 0,gene_id,baseMean,log2FoldChange,lfcSE,pvalue,padj
0,ENSG00000000003,21.614238,0.159750,0.211504,0.364023,0.518267
1,ENSG00000000005,0.117638,-0.107868,0.373021,0.724421,
2,ENSG00000000419,1122.078004,-0.100693,0.043725,0.020318,0.057731
3,ENSG00000000457,685.212410,0.181222,0.061381,0.002753,0.011163
4,ENSG00000000460,390.516247,-0.183616,0.077746,0.015718,0.047020
...,...,...,...,...,...,...
63672,ENSG00000273480,,,,,
63673,ENSG00000273482,,,,,
63674,ENSG00000273484,,,,,
63675,ENSG00000273490,,,,,


- Filter `res_df` to keep genes with an absolute `log2FoldChange` of at least `1.0` (equivalent to a fold change of at least `2.0` in either direction) and an adjusted p-value (`padj`, i.e. FDR) no greater than the threshold used in the DGE analysis (`0.01`). Keep the results, sorted by decreasing `log2FoldChange`, in the `fin_df` DataFrame. Display the whole resulting DataFrame.

In [3]:
fin_df = res_df[
    (res_df['log2FoldChange'].abs() >= 1.0) & (res_df['padj'] <= 0.01)
].sort_values('log2FoldChange', ascending=False).reset_index(drop=True)
fin_df

Unnamed: 0,gene_id,baseMean,log2FoldChange,lfcSE,pvalue,padj
0,ENSG00000224842,2.828884,5.220453,0.476137,2.086320e-28,6.814681e-26
1,ENSG00000107807,1.084612,4.639766,0.503022,6.618996e-06,6.072014e-05
2,ENSG00000133110,7.404256,4.627742,0.460003,5.028314e-10,1.164092e-08
3,ENSG00000018236,118.647381,4.566301,0.382213,2.329092e-33,1.321331e-30
4,ENSG00000136944,32.970138,4.407800,0.407030,1.858105e-44,3.338086e-41
...,...,...,...,...,...,...
2722,ENSG00000213931,78.940749,-6.978391,0.388769,6.072773e-11,1.716567e-09
2723,ENSG00000266976,1.882522,-7.004681,0.648726,3.675058e-07,4.624664e-06
2724,ENSG00000206579,1.685687,-7.073905,1.190329,2.160649e-05,1.729039e-04
2725,ENSG00000147255,13.753163,-7.236593,0.699507,7.802278e-21,1.042571e-18
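
The filter above relies on two details that are easy to miss: a log2 fold change of `x` corresponds to a fold change of `2**x`, and a row whose `padj` is missing (NaN) never satisfies `padj <= 0.01`, so it is dropped. The sketch below illustrates both on made-up values; `toy_df` and its contents are assumptions for illustration and are not taken from the DGE results.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; they are not taken from the DGE results.
toy_df = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4"],
    "log2FoldChange": [1.0, -2.0, 0.5, 3.0],
    "padj": [0.001, 0.005, 0.001, np.nan],
})

# A log2 fold change of x corresponds to a fold change of 2**x,
# so 1.0 maps to 2.0 (doubling) and -2.0 maps to 0.25 (a four-fold decrease).
toy_df["foldChange"] = 2.0 ** toy_df["log2FoldChange"]

# Same condition as in the filtering cell; comparing NaN with <= yields False,
# so g4 is excluded despite its large log2 fold change.
mask = (toy_df["log2FoldChange"].abs() >= 1.0) & (toy_df["padj"] <= 0.01)
kept = toy_df.loc[mask, "gene_id"].tolist()
print(kept)
```

Here `g3` is excluded by the fold-change threshold and `g4` by its missing `padj`, leaving `g1` and `g2`.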


- Load the dataset obtained from Ensembl BioMart, which links gene IDs to GO annotations, into the `meta_df` DataFrame. Display the whole DataFrame.

In [4]:
meta_df = pd.read_csv('input/mart_export_go.tsv.gz', sep='\t')
meta_df

Unnamed: 0,Gene stable ID,GO term accession,GO term name,GO term evidence code,GO domain
0,ENSG00000210049,GO:0030533,triplet codon-amino acid adaptor activity,IEA,molecular_function
1,ENSG00000210049,GO:0006412,translation,IEA,biological_process
2,ENSG00000211459,GO:0003735,structural constituent of ribosome,IEA,molecular_function
3,ENSG00000211459,GO:0005840,ribosome,IEA,cellular_component
4,ENSG00000210077,,,,
...,...,...,...,...,...
601467,ENSG00000122432,GO:0001669,acrosomal vesicle,IDA,cellular_component
601468,ENSG00000122432,GO:0031410,cytoplasmic vesicle,IDA,cellular_component
601469,ENSG00000284882,,,,
601470,ENSG00000289881,,,,


- Merge `fin_df` with `meta_df` on the `gene_id`/`Gene stable ID` columns, keeping only gene IDs from `fin_df`, to obtain `go_df`, which links the IDs of differentially expressed genes to GO annotations. Rows without a GO term name are then dropped. Display the whole resulting DataFrame.

In [5]:
go_df = fin_df[['gene_id']].merge(meta_df, how='left', left_on='gene_id', right_on='Gene stable ID')
go_df = go_df[go_df['GO term name'].notna()]
go_df

Unnamed: 0,gene_id,Gene stable ID,GO term accession,GO term name,GO term evidence code,GO domain
1,ENSG00000107807,ENSG00000107807,GO:0003677,DNA binding,IEA,molecular_function
2,ENSG00000107807,ENSG00000107807,GO:0006355,regulation of DNA-templated transcription,IEA,biological_process
3,ENSG00000107807,ENSG00000107807,GO:0003700,DNA-binding transcription factor activity,IEA,molecular_function
4,ENSG00000107807,ENSG00000107807,GO:0000981,"DNA-binding transcription factor activity, RNA...",IEA,molecular_function
5,ENSG00000107807,ENSG00000107807,GO:0005634,nucleus,IEA,cellular_component
...,...,...,...,...,...,...
50893,ENSG00000079102,ENSG00000079102,GO:0045892,negative regulation of DNA-templated transcrip...,IDA,biological_process
50894,ENSG00000079102,ENSG00000079102,GO:0045599,negative regulation of fat cell differentiation,ISS,biological_process
50895,ENSG00000079102,ENSG00000079102,GO:0003714,transcription corepressor activity,TAS,molecular_function
50896,ENSG00000079102,ENSG00000079102,GO:0003714,transcription corepressor activity,IDA,molecular_function
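
To make the effect of the left merge and the `notna()` filter concrete, here is a minimal sketch on toy data. The gene IDs and annotations are invented for illustration, and the bookkeeping variables (`n_annotated_genes`, `unannotated`) are hypothetical additions that the original cell does not compute.

```python
import pandas as pd

# Toy stand-ins for fin_df[['gene_id']] and meta_df (illustration only).
genes = pd.DataFrame({"gene_id": ["g1", "g2", "g3"]})
annot = pd.DataFrame({
    "Gene stable ID": ["g1", "g1", "g2", "g4"],
    "GO term name": ["nucleus", "DNA binding", None, "cytosol"],
})

# A left merge keeps every gene from `genes`: g1 appears twice (two terms),
# g2 once with an empty term, g3 once with no match at all; g4 is ignored.
merged = genes.merge(
    annot, how="left", left_on="gene_id", right_on="Gene stable ID"
)

# Dropping rows without a GO term name removes both g2 and g3.
annotated = merged[merged["GO term name"].notna()]

n_rows = len(merged)
n_annotated_genes = annotated["gene_id"].nunique()
unannotated = sorted(set(genes["gene_id"]) - set(annotated["gene_id"]))
print(n_rows, n_annotated_genes, unannotated)
```

Tracking the genes that drop out at this step may help when interpreting percentages that are later computed relative to all filtered genes.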


- Perform a very basic analysis by calculating the percentage of differentially expressed genes annotated with each GO term. Since one gene may be annotated with more than one GO term, the resulting values will not sum to 100%. Display the GO terms that at least 5% of the genes fall into.

In [6]:
counts_s = (go_df.value_counts('GO term name')/fin_df.shape[0]*100).round(2)
counts_s = counts_s[counts_s >= 5.0]
counts_s.name = '% of genes'
counts_df = counts_s.to_frame().reset_index()
counts_df

Unnamed: 0,GO term name,% of genes
0,plasma membrane,62.56
1,protein binding,50.79
2,membrane,42.46
3,cytoplasm,29.48
4,extracellular region,28.97
5,nucleus,25.93
6,extracellular space,21.2
7,cytosol,17.68
8,metal ion binding,12.1
9,signal transduction,10.63
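
One caveat: the BioMart export holds one row per gene, GO term and evidence code, so the same gene and term pair can occur more than once in `go_df` (for example, ENSG00000079102 is listed twice for GO:0003714, with evidence codes TAS and IDA). Counting rows may therefore count a gene more than once for a term. The sketch below shows one possible adjustment on toy data, dropping duplicate gene and term pairs before counting; it is an assumption about how this could be handled, not the method used above, and `toy_go` and `n_genes` are hypothetical.

```python
import pandas as pd

# Toy annotation table (illustration only): g1 carries "nucleus" twice,
# once per evidence code.
toy_go = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2", "g3"],
    "GO term name": ["nucleus", "nucleus", "nucleus", "cytosol"],
    "GO term evidence code": ["TAS", "IDA", "IEA", "IEA"],
})
n_genes = 4  # hypothetical number of differentially expressed genes

# Row-based counting: g1 contributes twice to "nucleus".
raw_pct = (toy_go["GO term name"].value_counts() / n_genes * 100).round(2)

# Gene-based counting: keep one row per (gene, term) pair first.
unique_pairs = toy_go.drop_duplicates(["gene_id", "GO term name"])
dedup_pct = (unique_pairs["GO term name"].value_counts() / n_genes * 100).round(2)

print(raw_pct["nucleus"], dedup_pct["nucleus"])
```

In this toy case the row-based figure for "nucleus" is 75% while the gene-based figure is 50%; whether the difference matters for the real data depends on how many duplicated pairs `go_df` contains.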
