# Differential gene expression in Python using DESeq2/limma/edgeR

I have already created a virtual environment, "diffexpr", on the HPC, in which all required Python and R packages are installed. Details are given in the next part of the notebook. <br>
This code works only on Linux, because the Bioconda-based Bioconductor packages used here are only compatible with Linux systems.<br>


# Steps for setting up rpy2 in Python

I have already created the virtual environment "diffexpr" (run hpcf_interactive, then create the environment). Activate the "diffexpr" environment and run the following commands for this analysis.<br>
Step 1) Go to the working directory:<br>
&emsp;&emsp; cd /home/nmishra/diffexpr-master/example<br>
Step 2) Start Jupyter with these commands:<br>
&emsp;&emsp; ip addr show | grep 220<br>
&emsp;&emsp; jupyter lab --ip 10.220.19.184 --port 9865<br>
Step 3) Run the cells below.
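
Before running the analysis cells, it can help to confirm that the activated environment really contains the required packages. The following is only a minimal sketch: the function names `check_python_packages` and `check_r_packages` are hypothetical helpers, and the package lists are assumptions based on what this notebook imports and calls. The R check relies on `isinstalled` from `rpy2.robjects.packages` and returns `None` if rpy2 or R cannot be loaded.

```python
import importlib.util


def check_python_packages(names=("rpy2", "pandas", "numpy", "diffexpr")):
    """Return {package_name: True/False} without importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def check_r_packages(names=("DESeq2", "apeglm")):
    """Return {R_package: True/False}, or None if rpy2/R cannot be loaded."""
    try:
        from rpy2.robjects.packages import isinstalled
    except Exception:  # rpy2 is missing, or rpy2 cannot find a working R
        return None
    return {name: bool(isinstalled(name)) for name in names}


print(check_python_packages())
print(check_r_packages())
```

Any `False` entry (or a `None` result for the R packages) suggests that the "diffexpr" environment is not the active one, or that a package still needs to be installed.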

This blog post, https://towardsdatascience.com/deseq2-and-edger-should-no-longer-be-the-default-choice-for-large-sample-differential-gene-8fdf008deae9 (based on https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4), suggests that the Wilcoxon rank-sum test performs better than DESeq2 and edgeR for large-sample DEG analysis. The authors report that the Wilcoxon rank-sum test controls the false discovery rate (FDR) better than DESeq2/edgeR.
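
For comparison, a per-gene Wilcoxon rank-sum test with Benjamini-Hochberg correction could look roughly like the sketch below. This is not the code from the blog post or the paper. The helpers `bh_adjust` and `wilcoxon_deg` are hypothetical, the expression matrix is synthetic (made up with a random number generator), and the input is assumed to be already normalized with no constant genes. It uses `scipy.stats.mannwhitneyu`, which implements the rank-sum test, and the `axis` argument requires a reasonably recent SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values for a 1-D array without NaNs."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Take running minima from the largest p-value down so that the
    # adjusted values stay monotone in the raw p-values.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out


def wilcoxon_deg(expr, group_a, group_b):
    """Per-gene two-sided rank-sum test; expr is a genes x samples DataFrame."""
    a = expr[group_a].to_numpy()
    b = expr[group_b].to_numpy()
    pvals = mannwhitneyu(b, a, axis=1, alternative="two-sided").pvalue
    return pd.DataFrame({"pvalue": pvals, "padj": bh_adjust(pvals)}, index=expr.index)


# Synthetic, made-up data: 50 genes x 40 samples (20 per condition).
rng = np.random.default_rng(0)
cols_a = [f"A_{i}" for i in range(1, 21)]
cols_b = [f"B_{i}" for i in range(1, 21)]
values = rng.lognormal(mean=5.0, sigma=0.5, size=(50, 40))
values[:5, 20:] *= 4.0  # the first 5 genes are simulated as up-regulated in B
toy = pd.DataFrame(values, columns=cols_a + cols_b,
                   index=[f"gene_{i}" for i in range(50)])

wilcox_res = wilcoxon_deg(toy, cols_a, cols_b)
print(wilcox_res.sort_values("padj").head())
```

Note that with only three replicates per group, as in the ERCC example used in this notebook, an exact two-sided rank-sum test cannot produce a p-value below 0.1, so this approach is only meaningful for large sample sizes.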

In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd 
import numpy as np

In [3]:
df = pd.read_table('../test/data/ercc.tsv')
df.head()

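| | id | A_1 | A_2 | A_3 | B_1 | B_2 | B_3 |
|---|---|---|---|---|---|---|---|
| 0 | ERCC-00002 | 111461 | 106261 | 107547 | 333944 | 199252 | 186947 |
| 1 | ERCC-00003 | 6735 | 5387 | 5265 | 13937 | 8584 | 8596 |
| 2 | ERCC-00004 | 17673 | 13983 | 15462 | 5065 | 3222 | 3353 |
| 3 | ERCC-00009 | 4669 | 4431 | 4211 | 6939 | 4155 | 3647 |
| 4 | ERCC-00012 | 0 | 2 | 0 | 0 | 0 | 0 |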


In [4]:
sample_df = pd.DataFrame({'samplename': df.columns}) \
        .query('samplename != "id"')\
        .assign(sample = lambda d: d.samplename.str.extract('([AB])_', expand=False)) \
        .assign(replicate = lambda d: d.samplename.str.extract('_([123])', expand=False)) 
sample_df.index = sample_df.samplename
sample_df

| samplename | samplename | sample | replicate |
|---|---|---|---|
| A_1 | A_1 | A | 1 |
| A_2 | A_2 | A | 2 |
| A_3 | A_3 | A | 3 |
| B_1 | B_1 | B | 1 |
| B_2 | B_2 | B | 2 |
| B_3 | B_3 | B | 3 |


In [5]:
from diffexpr.py_deseq import py_DESeq2

dds = py_DESeq2(count_matrix = df,
               design_matrix = sample_df,
               design_formula = '~ replicate + sample',
               gene_column = 'id') # <- telling DESeq2 this should be the gene ID column
    
dds.run_deseq() 
dds.get_deseq_result(contrast = ['sample','B','A'])
res = dds.deseq_result 
res.head()







INFO:DESeq2:Using contrast: ['sample', 'B', 'A']


| | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | id |
|---|---|---|---|---|---|---|---|
| ERCC-00002 | 167917.342729 | 0.808857 | 0.047606 | 16.990537 | 9.650176e-65 | 1.102877e-63 | ERCC-00002 |
| ERCC-00003 | 7902.634073 | 0.521731 | 0.058878 | 8.861252 | 7.912104e-19 | 4.868987e-18 | ERCC-00003 |
| ERCC-00004 | 10567.048228 | -2.330122 | 0.055754 | -41.792764 | 0.0 | 0.0 | ERCC-00004 |
| ERCC-00009 | 4672.573043 | -0.19566 | 0.0616 | -3.176286 | 0.001491736 | 0.003616329 | ERCC-00009 |
| ERCC-00012 | 0.384257 | -1.565491 | 4.047562 | -0.386774 | 0.6989237 | NaN | ERCC-00012 |
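
A common next step is to filter the result table for significant genes. The sketch below rebuilds the five rows shown in the `res.head()` output so that it runs on its own; in the notebook the same filter would be applied to `res` directly. The cutoffs `PADJ_CUTOFF` and `LFC_CUTOFF` are assumptions chosen for illustration, not values used by diffexpr or DESeq2.

```python
import numpy as np
import pandas as pd

# Values copied from the res.head() output in this notebook.
res_head = pd.DataFrame(
    {
        "log2FoldChange": [0.808857, 0.521731, -2.330122, -0.19566, -1.565491],
        "padj": [1.102877e-63, 4.868987e-18, 0.0, 0.003616329, np.nan],
    },
    index=["ERCC-00002", "ERCC-00003", "ERCC-00004", "ERCC-00009", "ERCC-00012"],
)

PADJ_CUTOFF = 0.05  # assumed threshold on the adjusted p-value
LFC_CUTOFF = 1.0    # assumed threshold: |log2FC| > 1, i.e. more than 2-fold

# Comparisons with NaN evaluate to False, so genes without padj are dropped.
significant = res_head[
    (res_head["padj"] < PADJ_CUTOFF)
    & (res_head["log2FoldChange"].abs() > LFC_CUTOFF)
]
print(significant.index.tolist())
```

With these cutoffs only ERCC-00004 remains. ERCC-00012 has a large fold change but no adjusted p-value, most likely because DESeq2's independent filtering sets `padj` to NA for genes with very low mean counts.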


In [6]:
dds.comparison # show coefficients for GLM

['Intercept', 'replicate_2_vs_1', 'replicate_3_vs_1', 'sample_B_vs_A']
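
Rather than counting positions by hand, the coefficient number can be looked up from this list. A minimal sketch, assuming the integer `coef` counts coefficients from 1 in the order shown above (as R does); the list is copied here so the snippet runs on its own, whereas in the notebook `dds.comparison` would be used.

```python
# Coefficient names as printed by dds.comparison above.
comparison = ['Intercept', 'replicate_2_vs_1', 'replicate_3_vs_1', 'sample_B_vs_A']

# list.index() is 0-based, while the coefficient number counts from 1.
coef = comparison.index('sample_B_vs_A') + 1
print(coef)
```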

In [7]:
# The previous cell shows the order of the coefficients,
# so we can now pass "coef" to lfcShrink.
# The comparison of interest is 'sample_B_vs_A', the 4th coefficient (counted from 1), so coef = 4.
lfc_res = dds.lfcShrink(coef=4, method='apeglm')
lfc_res.head()

    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895



| | id | baseMean | log2FoldChange | lfcSE | pvalue | padj |
|---|---|---|---|---|---|---|
| 0 | ERCC-00002 | 167917.342729 | 0.807316 | 0.047609 | 9.650176e-65 | 1.102877e-63 |
| 1 | ERCC-00003 | 7902.634073 | 0.519944 | 0.058823 | 7.912104e-19 | 4.868987e-18 |
| 2 | ERCC-00004 | 10567.048228 | -2.328037 | 0.055783 | 0.0 | 0.0 |
| 3 | ERCC-00009 | 4672.573043 | -0.194594 | 0.061466 | 0.001491736 | 0.003616329 |
| 4 | ERCC-00012 | 0.384257 | -0.052326 | 0.820696 | 0.6989237 | NaN |
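
To see what the shrinkage did, the log2 fold changes before and after `lfcShrink` can be put side by side. The sketch below uses only the five rows printed in the `res.head()` and `lfc_res.head()` outputs of this notebook; the column names `unshrunk`, `shrunk`, and `abs_change` are chosen here for illustration.

```python
import pandas as pd

genes = ["ERCC-00002", "ERCC-00003", "ERCC-00004", "ERCC-00009", "ERCC-00012"]
lfc = pd.DataFrame(
    {
        # Values copied from the two result tables in this notebook.
        "baseMean": [167917.342729, 7902.634073, 10567.048228, 4672.573043, 0.384257],
        "unshrunk": [0.808857, 0.521731, -2.330122, -0.19566, -1.565491],
        "shrunk": [0.807316, 0.519944, -2.328037, -0.194594, -0.052326],
    },
    index=genes,
)

# Absolute change in log2 fold change caused by the apeglm shrinkage.
lfc["abs_change"] = (lfc["shrunk"] - lfc["unshrunk"]).abs()
print(lfc.sort_values("abs_change", ascending=False))
```

In these rows the well-expressed genes barely move, while ERCC-00012, with a mean count below 1, is pulled from about -1.57 to about -0.05.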
