# Including covariates for differential gene expression testing using `memento`

To install `memento` in the pre-release version (for Ye Lab members), install it directly from github by running:

```pip install git+https://github.com/yelabucsf/scrna-parameter-estimation.git@release-v0.0.8```

This requires that you have access to the Ye Lab organization. 

In [26]:
# This is only for development purposes

import sys
# sys.path.append('/home/ssm-user/Github/scrna-parameter-estimation/dist/memento-0.0.8-py3.8.egg')
import memento
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [19]:
import scanpy as sc
import memento

In [20]:
fig_path = '~/Github/scrna-parameter-estimation/figures/fig4/'
data_path = '/data_volume/memento/demux/'

In [21]:
import pickle as pkl

### Read IFN data and filter for monocytes

For `memento`, we need the raw count matrix. Preferrably, feed the one with all genes so that we can choose what genes to look at. 

One of the columns in `adata.obs` should be the discrete groups to compare mean, variability, and co-variability across. In this case, it's called `stim`. 

The column containing the covariate that you want p-values for should either:
- Be binary (aka the column only contains two unique values, such as 'A' and 'B'. Here, the values are either 'stim' or 'ctrl'.
- Be numeric (aka the column contains -1, 0, -1 for each genotype value). 

I recommend changing the labels to something numeric (here, i use 0 for `ctrl` and 1 for `stim`). Otherwise, the sign of the DE/EV/DC testing will be very hard to interpret.

In [166]:
adata = sc.read(data_path + 'interferon_filtered.h5ad')
adata = adata[adata.obs.cell == 'CD14+ Monocytes'].copy()

  df_sub[k].cat.remove_unused_categories(inplace=True)


In [167]:
adata.obs['stim'] = adata.obs['stim'].apply(lambda x: 0 if x == 'ctrl' else 1)

In [168]:
adata.obs[['ind', 'stim', 'cell']].sample(5)

Unnamed: 0_level_0,ind,stim,cell
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
GACGTATGTAACCG-1,1015,1,CD14+ Monocytes
GACATTCTGGGACA-1,1256,0,CD14+ Monocytes
TACCGAGATCCTGC-1,1488,1,CD14+ Monocytes
TGATACCTGCCTTC-1,1015,1,CD14+ Monocytes
CTATGTTGTAGACC-1,1016,0,CD14+ Monocytes


### Engineer the covariate to be used in memento. 

Currently, optimizations in `memento` only supports discrete covariates, with fewer covariates the better. Here, we are interested in whether the stimulation affects gene expression on chromosome 1, while also including the total chromosome 1 count as a covariate.

There are now 10 unique values for "chr_expr_avg" column in `adata.obs`. We will use this as the covariate.

In [169]:
# These are not actually chromosome 1 genes
# TODO: Remake this tutorial with actual chr1 labels and maybe with the aneuploidy dataset so that it makes sense
chr1_genes = list(np.random.choice(adata.var.index, 4000))

adata_chrom = adata.copy().copy()
adata_chrom.obs['chr_expr'] = adata_chrom[:, chr1_genes].X.sum(axis=1).astype(int)
adata_chrom.obs['chr_expr_bin'] = pd.qcut(adata_chrom.obs['chr_expr'], 10)
adata_chrom.obs = adata_chrom.obs.join(adata_chrom.obs.groupby('chr_expr_bin')['chr_expr'].median(), on='chr_expr_bin', rsuffix='_avg')

  df_sub[k].cat.remove_unused_categories(inplace=True)


In [None]:
adata

In [172]:
adata_chrom.obs.head(2)

Unnamed: 0_level_0,tsne1,tsne2,ind,stim,cluster,cell,multiplets,n_genes_by_counts,log1p_n_genes_by_counts,total_counts,...,total_counts_mt,log1p_total_counts_mt,pct_counts_mt,total_counts_hb,log1p_total_counts_hb,pct_counts_hb,cell_type,chr_expr,chr_expr_bin,chr_expr_avg
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AAACATACATTTCC-1,-27.640373,14.966629,1016,0,9,CD14+ Monocytes,singlet,878,6.778785,3018.0,...,0.0,0.0,0.0,0.0,0.0,0.0,CD14+ Monocytes - ctrl,365,"(325.0, 377.0]",352.0
AAACATACCAGAAA-1,-27.493646,28.924885,1256,0,9,CD14+ Monocytes,singlet,713,6.570883,2481.0,...,0.0,0.0,0.0,0.0,0.0,0.0,CD14+ Monocytes - ctrl,314,"(280.0, 325.0]",304.0


### Setup memento with the treatment and covariates

Only select the genes defined to be in chromosome 1 above for testing, via the `gene_list` parameter.

In [173]:
adata_chrom.obs['capture_rate'] = 0.07
memento.setup_memento(adata_chrom, q_column='capture_rate')
memento.create_groups(adata_chrom, label_columns=['stim', 'chr_expr_avg'])
memento.compute_1d_moments(adata_chrom, min_perc_group=.7, gene_list=chr1_genes)

  df_sub[k].cat.remove_unused_categories(inplace=True)
  df_sub[k].cat.remove_unused_categories(inplace=True)


In [176]:
adata_chrom.shape

(5341, 117)

### Perform 1D hypothesis testing

Sample metadata has all the columns we are interested in, both the treatment and the covariates. 

We will separate it out into the treatment and covariate DataFrames.

In [177]:
sample_meta = memento.get_groups(adata_chrom)

In [178]:
sample_meta

Unnamed: 0,stim,chr_expr_avg
sg^0^352.0,0,352.0
sg^0^304.0,0,304.0
sg^0^142.5,0,142.5
sg^0^207.0,0,207.0
sg^0^468.0,0,468.0
sg^0^564.0,0,564.0
sg^0^255.0,0,255.0
sg^0^404.0,0,404.0
sg^0^882.5,0,882.5
sg^0^679.5,0,679.5


In [179]:
# The covariate DataFrame - pick the covariate columns
cov_df = sample_meta[['chr_expr_avg']]

# The treatment DataFrame - pick the treatment column
treat_df = sample_meta[['stim']]

In [180]:
memento.ht_1d_moments(
    adata_chrom, 
    treatment=sample_meta,
    covariate=cov_df,
    resampling='bootstrap',
    num_boot=5000, 
    verbose=1,
    num_cpus=14)

[Parallel(n_jobs=14)]: Using backend LokyBackend with 14 concurrent workers.
[Parallel(n_jobs=14)]: Done  22 tasks      | elapsed:    1.8s
[Parallel(n_jobs=14)]: Done 117 out of 117 | elapsed:    9.4s finished


In [182]:

result_1d = memento.get_1d_ht_result(adata_chrom)

In [183]:
result_1d.query('de_coef > 0').sort_values('de_pval').head(10)

Unnamed: 0,gene,tx,de_coef,de_se,de_pval,dv_coef,dv_se,dv_pval
96,SAT1,stim,1.315939,0.030836,3.916619e-08,0.910213,0.144714,4.629241e-05
52,PLSCR1,stim,1.491328,0.036718,6.500286e-08,-1.052557,0.227541,0.001186501
228,APOBEC3A,stim,3.364282,0.063989,3.375917e-07,-2.178124,0.126437,1.232477e-06
200,RNF114,stim,1.34462,0.069991,6.405498e-07,0.740793,0.414568,0.07758448
186,CCL2,stim,0.953319,0.044157,9.928827e-07,-1.521958,0.060431,1.972572e-06
48,CD47,stim,0.838913,0.053488,1.030917e-06,0.135161,0.27952,0.6284743
8,GBP1,stim,1.806011,0.053052,2.032212e-06,-0.616521,0.157682,0.001528063
74,HLA-A,stim,0.266386,0.018729,2.067598e-06,-0.024669,0.140585,0.8576285
90,TMEM60,stim,1.122225,0.086455,4.376555e-06,0.404528,0.451777,0.3931214
122,IFITM3,stim,3.349291,0.053356,5.734026e-06,-3.07006,0.129696,2.348415e-08
