# Differential expression analysis

This notebook performs differential expression (DE) analysis on experiment PRJNA283002, which is associated with [this publication](https://pubmed.ncbi.nlm.nih.gov/26078448/). In that study, the authors characterized the Anr regulon by comparing WT and anr mutants.

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2

import os
import pandas as pd
from rpy2.robjects import pandas2ri
from core_acc_modules import paths_corr, utils, DE_helper

pandas2ri.activate()

In [2]:
# Load gene expression data
expression_df = pd.read_csv(paths_corr.PAO1_GE, sep="\t", index_col=0, header=0)

## Select expression data for experiment and replace gene ids

In [3]:
# Select expression data associated with PRJNA283002 experiment
sample_metadata = pd.read_csv(paths_corr.DE_METADATA, sep="\t", index_col=0, header=0)
select_sample_ids = list(sample_metadata.index)

select_expression_df = expression_df.loc[select_sample_ids]
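
Label-based selection with `.loc` fails (or, in older versions of pandas, silently returns NaN rows) when a sample id from the metadata is missing from the expression matrix. A minimal sketch of a pre-check is shown here. It uses made-up toy data in place of the real `expression_df` and `sample_metadata`, and the sample and gene names are placeholders rather than values from this experiment:

```python
import pandas as pd

# Toy stand-ins for expression_df and sample_metadata (values are made up)
expression_df = pd.DataFrame(
    {"gene_a": [1.0, 2.0, 3.0], "gene_b": [4.0, 5.0, 6.0]},
    index=["sample_1", "sample_2", "sample_3"],
)
sample_metadata = pd.DataFrame(
    {"group": ["WT", "mutant"]}, index=["sample_1", "sample_3"]
)

select_sample_ids = list(sample_metadata.index)

# Report any metadata sample that is absent from the expression matrix
missing_ids = sorted(set(select_sample_ids) - set(expression_df.index))
if missing_ids:
    raise KeyError(f"Samples missing from expression data: {missing_ids}")

# .copy() makes the selection independent of expression_df
select_expression_df = expression_df.loc[select_sample_ids].copy()
print(select_expression_df.shape)
```

Raising early keeps a missing sample from surfacing later as an unexplained NaN row.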

In [4]:
# Replace gene sequencing ids with PAO1 ids to help us interpret our findings
pao1_fasta_file = paths_corr.PAO1_REF

seq_id_to_gene_id_pao1 = utils.dict_gene_num_to_ids(pao1_fasta_file)

select_expression_df = select_expression_df.rename(columns=seq_id_to_gene_id_pao1)

select_expression_df.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5561,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570
SRR2007328,10.865677,11.764342,20.73619,8.85677,1.317186,6.548511,0.733002,1.60123,12.231381,3.319641,...,68.104505,12.492543,6.635824,5.080057,5.238808,0.465672,0.393262,7.182237,24.118288,1257.906472
SRR2007326,15.08123,10.066776,16.348646,6.274354,0.683103,4.513992,1.537332,1.762365,11.736246,2.515137,...,52.300043,12.500286,11.456201,3.962125,4.771902,0.996191,0.92141,5.332436,11.84152,557.591296
SRR2007327,15.271177,8.850664,17.336518,7.216923,0.95037,6.352296,1.340327,1.680644,9.814707,4.665593,...,53.233289,11.793949,12.78159,5.879812,4.678865,0.92397,0.817454,5.979728,15.167061,689.556724
SRR2007329,16.801237,14.818068,27.404638,15.091085,1.475981,7.66818,0.903507,2.960542,17.349106,3.708218,...,65.748217,14.405002,11.910343,9.196924,7.68291,0.0,1.627338,8.748755,23.392899,1452.729889


In [5]:
# Save selected expression data
select_expression_df.to_csv(paths_corr.SELECT_GE, sep="\t")

## DE analysis

In [6]:
# Process data for DESeq
DE_helper.process_samples_for_DESeq(
    paths_corr.SELECT_GE,
    paths_corr.DE_METADATA,
    paths_corr.SELECT_GE_PROCESSED,
)

sample ids are ordered correctly
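
The source of `DE_helper.process_samples_for_DESeq` is not shown in this notebook. Judging only from the integer counts in the processed file and the message above, it plausibly rounds the expression values and checks that the sample order matches the metadata. The following is a sketch under that assumption, not the helper's actual implementation. The function name `prepare_counts_for_deseq` and the toy data are hypothetical:

```python
import pandas as pd


def prepare_counts_for_deseq(expression, metadata):
    """Hypothetical helper: round to integer counts and align rows with metadata."""
    # DESeq2 expects non-negative integer counts
    counts = expression.round().astype(int)
    # Reorder the expression rows to match the metadata row order
    counts = counts.loc[metadata.index]
    assert list(counts.index) == list(metadata.index), "sample ids are out of order"
    return counts


# Toy data (made up); samples are deliberately listed in a different order
expression = pd.DataFrame(
    {"gene_a": [10.6, 0.4], "gene_b": [3.6, 7.2]},
    index=["sample_2", "sample_1"],
)
metadata = pd.DataFrame({"group": ["WT", "mutant"]}, index=["sample_1", "sample_2"])

counts = prepare_counts_for_deseq(expression, metadata)
print(counts)
```

Aligning on the metadata index matters because DESeq2 matches the count matrix to the sample table by position.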


In [7]:
# Create subdirectory: "<local_dir>/DE_stats/"
os.makedirs(paths_corr.DE_STATS_DIR, exist_ok=True)

In [8]:
# Convert python path objects for use by R in the next cell
metadata_filename = str(paths_corr.DE_METADATA)
processed_expression_filename = str(paths_corr.SELECT_GE_PROCESSED)
repo_dir = str(paths_corr.PROJECT_DIR)
out_filename = str(paths_corr.DE_STATS_OUTPUT)

In [9]:
%%R -i metadata_filename -i processed_expression_filename -i out_filename -i repo_dir

source(paste0(repo_dir, '/core_acc_modules/DE_analysis.R'))

# File created: "<local_dir>/DE_stats/DE_stats_template_data_<project_id>_real.txt"
get_DE_stats_DESeq(
    metadata_filename,
    processed_expression_filename,
    out_filename
)






Error: package or namespace load failed for ‘DESeq2’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/alexandra/anaconda3/envs/core_acc_env/lib/R/library/stringi/libs/stringi.so':
  libicui18n.so.64: cannot open shared object file: No such file or directory







## Compare results with publication

In [10]:
# Get top DEGs
# Compare against publication
DE_stats = pd.read_csv(paths_corr.DE_STATS_OUTPUT, sep="\t", header=0, index_col=0)

selected_DE_stats = DE_stats[(abs(DE_stats["log2FoldChange"]) > 1)]
print(selected_DE_stats.shape)
selected_DE_stats

(1268, 6)


Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PA0005,0.483176,2.318589,4.892035,0.473952,6.355343e-01,9.993777e-01
PA0007,0.524889,-2.588519,4.892051,-0.529128,5.967169e-01,9.993777e-01
PA0014,6.289083,-1.545647,1.814872,-0.851656,3.944050e-01,9.993777e-01
PA0024,10.744584,-2.536829,1.157392,-2.191849,2.839044e-02,6.145254e-01
PA0025,1.245413,-2.210361,3.142564,-0.703362,4.818300e-01,9.993777e-01
PA0027,1.223025,1.882183,3.185265,0.590903,5.545855e-01,9.993777e-01
PA0028,3.178119,1.016346,1.812849,0.560635,5.750465e-01,9.993777e-01
PA0034,0.195636,1.221175,4.983462,0.245045,8.064212e-01,9.993777e-01
PA0037,0.268216,-1.611060,4.993369,-0.322640,7.469679e-01,9.993777e-01
PA0038,1564.242981,-1.529307,0.218655,-6.994140,2.668918e-12,9.098786e-10
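
Many genes that pass the fold-change filter have an adjusted p-value close to 1 (PA0005, for example), so a large fold change alone does not indicate a reliable difference. A minimal sketch of combining the fold-change cutoff with a significance cutoff follows. The three rows are copied from the table above, and the `padj` threshold of 0.05 is an assumption rather than a value used in this notebook:

```python
import pandas as pd

# Three rows copied from the table above
DE_stats = pd.DataFrame(
    {
        "log2FoldChange": [2.318589, -2.536829, -1.529307],
        "padj": [9.993777e-01, 6.145254e-01, 9.098786e-10],
    },
    index=["PA0005", "PA0024", "PA0038"],
)

lfc_threshold = 1  # same effect-size cutoff as the cell above
padj_threshold = 0.05  # assumed significance cutoff, not from the notebook

is_large = DE_stats["log2FoldChange"].abs() > lfc_threshold
# A NaN padj compares as False, so genes without a padj are dropped here
is_significant = DE_stats["padj"] < padj_threshold

significant_DE_stats = DE_stats[is_large & is_significant]
print(significant_DE_stats)
```

Of these three rows, only PA0038 passes both cutoffs.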


In [11]:
published_DEGs = [
    "PA1557",
    "PA3928",
    "PA2119",
    "PA3847",
    "PA0515",
    "PA0513",
    "PA0512",
    "PA0510",
    "PA0521",
    "PA0522",
    "PA0525",
    "PA0526",
    "PA2126",
    "PA2127",
    "PA2133",
]

In [12]:
# Use .reindex() so that genes absent from selected_DE_stats appear as NaN rows
# instead of raising a KeyError in newer versions of pandas
selected_DE_stats.reindex(published_DEGs)


Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
PA1557,2.904203,-5.056069,2.761481,-1.830927,0.06711147,0.999378
PA3928,30.206238,1.247882,0.676836,1.8437,0.06522685,0.999378
PA2119,1.842882,-4.400123,3.228837,-1.362758,0.1729588,0.999378
PA3847,,,,,,
PA0515,7.367766,-4.857418,1.917313,-2.533451,0.01129456,0.292443
PA0513,7.318273,-3.853654,1.593231,-2.418766,0.01557326,0.379066
PA0512,5.528647,-5.985148,2.233165,-2.680119,0.007359591,0.217094
PA0510,1.563122,-4.16362,3.452414,-1.206003,0.2278165,0.999378
PA0521,10.20647,-6.870218,1.905468,-3.605528,0.0003115189,0.018742
PA0522,42.19435,-5.944762,1.059005,-5.613536,1.982336e-08,3e-06


In [13]:
input_DESeq_data = pd.read_csv(
    processed_expression_filename, sep="\t", index_col=0, header=0
)

input_DESeq_data[published_DEGs]

Unnamed: 0,PA1557,PA3928,PA2119,PA3847,PA0515,PA0513,PA0512,PA0510,PA0521,PA0522,PA0525,PA0526,PA2126,PA2127,PA2133
SRR2007328,0,42,0,2,1,1,0,0,0,0,0,0,0,0,0
SRR2007326,7,18,4,3,13,14,12,2,17,69,6,378,0,3,0
SRR2007327,4,16,3,3,14,12,9,4,22,90,7,355,0,2,0
SRR2007329,0,47,0,4,0,1,0,0,0,3,0,37,0,0,0


The differential expression results can be found in [Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4524035/) of the paper. A spot check suggests that the direction of the log fold change for these genes is consistent with the paper.

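Note:
* DESeq2 reports NaNs when the counts are 0 in all samples. PA3847 has nonzero counts (2 to 4 per sample), so we still need to check why it is NaN. One likely explanation is that `selected_DE_stats` only keeps genes with |log2FoldChange| > 1, so a gene removed by that filter shows up as an all-NaN row when it is looked up by label. Setting filtering to False doesn't get rid of the NaNs.
* See the DESeq2 vignette section on [why some p-values are set to NA](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na).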
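
To tell apart the possible reasons for a NaN row, each gene can be looked up in both the full and the filtered results. A minimal sketch follows, using made-up toy values in place of the real DESeq2 output. The gene names and the helper `explain_lookup` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy DESeq2-style results; values are made up for illustration
DE_stats = pd.DataFrame(
    {"baseMean": [40.0, 3.0, 0.0], "log2FoldChange": [-5.9, 0.3, np.nan]},
    index=["gene_1", "gene_2", "gene_3"],
)
selected_DE_stats = DE_stats[DE_stats["log2FoldChange"].abs() > 1]
genes_of_interest = ["gene_1", "gene_2", "gene_3", "gene_4"]


def explain_lookup(gene):
    """Return why a gene is present in, or missing from, the filtered results."""
    if gene not in DE_stats.index:
        return "not in DESeq2 output"
    if pd.isna(DE_stats.loc[gene, "log2FoldChange"]):
        return "NA in DESeq2 output"
    if gene not in selected_DE_stats.index:
        return "removed by |log2FoldChange| > 1 filter"
    return "present"


status = pd.Series({gene: explain_lookup(gene) for gene in genes_of_interest})
print(status)
```

Running the same check on the real `DE_stats` and `published_DEGs` would show whether a gene is NaN because of DESeq2 itself or because of the fold-change filter.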