# Process recount2 data
This notebook does the following:

1. Selects a template experiment. This template experiment will be used in the next [notebook](2_identify_generic_genes_pathways.ipynb) to simulate experiments with the same experimental design but testing a different biological process.

2. Uses pre-downloaded data from [notebook]()

3. Normalizes the data using MRnorm (median-of-ratios normalization), then checks what the distribution looks like

4. Trains a VAE on the recount2 data

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2
import os
import pandas as pd
import seaborn as sns
from ponyo import utils, train_vae_modules
from generic_expression_patterns_modules import process

examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))
Using TensorFlow backend.







In [2]:
# Set seeds to get reproducible VAE trained models
process.set_all_seeds()
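
`process.set_all_seeds()` belongs to this project's module and its body is not shown here. As a rough sketch of what seeding for reproducibility typically involves, the hypothetical helper below (`set_seeds_sketch`, not part of the project) fixes the Python and NumPy random number generators. A real VAE pipeline would presumably also need to seed the deep learning backend, which is omitted here.

```python
import os
import random

import numpy as np


def set_seeds_sketch(seed=42):
    """Hypothetical helper: seed Python's and NumPy's global RNGs."""
    # PYTHONHASHSEED only affects interpreters started afterwards; hash
    # randomization of the current process is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same random draws
set_seeds_sketch(42)
first_draw = np.random.rand(3)
set_seeds_sketch(42)
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))
```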




### Set parameters for data processing

Most parameters are read from the config file at `config_filename`. We manually selected bioproject [SRP012656](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37764) as the template experiment. It contains primary non-small cell lung adenocarcinoma tumors and adjacent normal tissues from 6 never-smoker Korean female patients, with 2 replicates each.

In [3]:
base_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))

# Read in config variables
config_filename = os.path.abspath(
    os.path.join(base_dir, "configs", "config_human_general_MRnorm.tsv")
)

params = utils.read_config(config_filename)

local_dir = params["local_dir"]
dataset_name = params["dataset_name"]

# File that contains gene ranks identified by Crow et al.
DE_prior_filename = params["reference_gene_filename"]

# Template experiment ID
project_id = params["project_id"]

# Output files of recount2 template experiment data
# processed_template_filename = params["processed_template_filename"]

# Output files of recount2 compendium data
MRnormalized_compendium_filename = params["MRnormalized_compendium_filename"]
normalized_compendium_filename = params["normalized_compendium_filename"]


# Output file: pickled scaler (generated during compendium normalization)
scaler_filename = params["scaler_filename"]

# Output: size factor for MR normalization
sf_filename = "data/metadata/MR_norm_compendium_size_factor.tsv"
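
The cell above indexes `params` like a dictionary after `utils.read_config` has parsed the `.tsv` config file. The parsing logic lives in `ponyo` and is not reproduced here. As a sketch only, assuming one `name<TAB>value` pair per line, a hypothetical reader (`read_tsv_config_sketch`) could look like this. The keys are taken from the cell above; the `dataset_name` value is a placeholder, not the real config.

```python
import csv
import io


def read_tsv_config_sketch(handle):
    """Hypothetical reader: parse name<TAB>value lines into a dict."""
    params = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        params[row[0]] = row[1]
    return params


# Illustrative content only; "example_dataset" is a made-up value.
example_config = io.StringIO(
    "dataset_name\texample_dataset\n"
    "project_id\tSRP012656\n"
)
example_params = read_tsv_config_sketch(example_config)
print(example_params["project_id"])
```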

### Load template data file

In [4]:
mapped_template_filename = params["mapped_template_filename"]

In [5]:
mapped_template = pd.read_csv(mapped_template_filename, sep="\t", index_col=0, header=0)
mapped_template.head()

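| | A1BG | A1BG-AS1 | A1CF | A2M | A2M-AS1 | A2ML1 | A4GALT | A4GNT | AAAS | AACS | ... | ZWILCH | ZWINT | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR493937 | 244 | 396 | 6 | 144077 | 2089 | 7 | 946 | 7 | 941 | 752 | ... | 214 | 161 | 202 | 569 | 1618 | 9 | 768 | 12758 | 5060 | 773 |
| SRR493938 | 230 | 384 | 8 | 142807 | 2062 | 8 | 968 | 3 | 978 | 776 | ... | 219 | 171 | 215 | 576 | 1655 | 7 | 808 | 12925 | 5061 | 783 |
| SRR493939 | 308 | 396 | 11 | 77651 | 1064 | 4 | 321 | 2 | 1633 | 1518 | ... | 345 | 372 | 198 | 387 | 1315 | 30 | 842 | 4339 | 3304 | 743 |
| SRR493940 | 303 | 351 | 13 | 77739 | 1125 | 12 | 325 | 0 | 1637 | 1481 | ... | 318 | 317 | 155 | 448 | 1322 | 36 | 795 | 4400 | 3308 | 714 |
| SRR493941 | 203 | 327 | 0 | 235645 | 2438 | 6 | 945 | 15 | 1520 | 1027 | ... | 191 | 71 | 240 | 533 | 2006 | 4 | 1331 | 14318 | 4776 | 936 |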


### Load recount2

In [6]:
mapped_compendium_filename = params["mapped_compendium_filename"]

In [7]:
mapped_compendium = pd.read_csv(
    mapped_compendium_filename, sep="\t", index_col=0, header=0
)

In [8]:
# Add a pseudocount of 1 to every count; otherwise size factor estimation fails with:
# Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
# every gene contains at least one zero, cannot compute log geometric means

In [9]:
mapped_pseudo_compendium_filename = params["mapped_pseudo_compendium_filename"]
mapped_compendium_pseudo = mapped_compendium + 1
mapped_compendium_pseudo.to_csv(mapped_pseudo_compendium_filename, sep="\t")

In [10]:
del mapped_compendium_pseudo
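
To see why the pseudocount is needed: the size factor estimation used for MR normalization in the next section takes a per-gene geometric mean across samples, computed on the log scale. A gene with a zero count in any sample has a log geometric mean of minus infinity and cannot be used, and the error message quoted above says that this is the case for every gene in the compendium. The toy matrix below (made-up counts, samples as rows) sketches the problem with NumPy.

```python
import numpy as np

# Toy counts: 3 samples (rows) x 3 genes (columns); every gene has one zero.
toy_counts = np.array(
    [
        [0, 12, 7],
        [5, 0, 9],
        [8, 3, 0],
    ],
    dtype=float,
)

# log(0) is -inf, so the mean of the logs is -inf for every gene here
with np.errstate(divide="ignore"):
    log_geo_means = np.log(toy_counts).mean(axis=0)
usable_genes = np.isfinite(log_geo_means)
print(usable_genes.any())

# After adding a pseudocount of 1, every gene has a finite log geometric mean
log_geo_means_pseudo = np.log(toy_counts + 1).mean(axis=0)
print(np.isfinite(log_geo_means_pseudo).all())
```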

### MR normalize

A tutorial on MR (median-of-ratios) normalization is available [here](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html).

Normalized count = raw count / size factor
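
As a sketch of the division above: in the median-of-ratios method, a sample's factor is the median, over genes, of that sample's count divided by the gene's geometric mean across all samples. The NumPy function below is an illustrative reimplementation (the name `median_of_ratios_sketch` is ours), not the `MRnorm_expression` R function that this notebook actually calls. It assumes strictly positive counts, such as the pseudocounted compendium.

```python
import numpy as np


def median_of_ratios_sketch(counts):
    """Return (normalized_counts, size_factors) for a samples x genes array.

    Assumes all counts are strictly positive.
    """
    log_counts = np.log(counts)
    # Log geometric mean of each gene across samples (a pseudo-reference sample)
    log_geo_means = log_counts.mean(axis=0)
    # Per-sample median of the log ratios to the reference, back on count scale
    size_factors = np.exp(np.median(log_counts - log_geo_means, axis=1))
    normalized = counts / size_factors[:, np.newaxis]
    return normalized, size_factors


# Toy data: the second sample is the first one sequenced at twice the depth.
toy = np.array([[10.0, 20.0, 40.0], [20.0, 40.0, 80.0]])
normalized, size_factors = median_of_ratios_sketch(toy)
print(size_factors)
print(normalized)
```

After normalization the two toy samples have identical profiles, because the depth difference is absorbed by the per-sample factor.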

In [11]:
# Metadata table with one row per sample; each sample gets its own group ID
metadata = pd.DataFrame(
    data=list(range(mapped_compendium.shape[0])),
    index=mapped_compendium.index,
    columns=["group"],
)
del mapped_compendium
metadata.head()

| | group |
|---|---|
| DRR000897 | 0 |
| DRR001173 | 1 |
| DRR001174 | 2 |
| DRR001175 | 3 |
| DRR001176 | 4 |


In [12]:
metadata_filename = os.path.join(
    base_dir, dataset_name, "data", "metadata", "MRnorm_compendium_metadata.tsv"
)
metadata.to_csv(metadata_filename, sep="\t")

In [13]:
%%R -i base_dir -i mapped_pseudo_compendium_filename -i metadata_filename -i MRnormalized_compendium_filename -i sf_filename

source(paste0(base_dir, '/generic_expression_patterns_modules/normalize_counts.R'))

MRnorm_expression(mapped_pseudo_compendium_filename, metadata_filename, MRnormalized_compendium_filename, sf_filename)

[1] "Checking sample ordering..."
[1] TRUE


R[write to console]: converting counts to integer mode

R[write to console]:   the design formula contains one or more numeric variables with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

R[write to console]:   the design formula contains one or more numeric variables that have mean or
  standard deviation larger than 5 (an arbitrary threshold to trigger this message).
  it is generally a good idea to center and scale numeric variables in the design
  to improve GLM convergence.



[1] 17755 49651


### 0-1 normalize

In [14]:
process.normalize_compendium(
    MRnormalized_compendium_filename, normalized_compendium_filename, scaler_filename
)

input: dataset contains 49651 samples and 17755 genes
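
`process.normalize_compendium` takes the MR-normalized compendium, writes the 0-1 normalized compendium, and saves a pickled scaler to `scaler_filename`. Its implementation is not shown in this notebook. The sketch below assumes per-gene min-max scaling and uses two hypothetical helpers (`fit_min_max_sketch`, `apply_min_max_sketch`) so that the fitted parameters can be pickled and reapplied later.

```python
import pickle

import numpy as np


def fit_min_max_sketch(data):
    """Hypothetical helper: per-gene minimum and range (samples as rows)."""
    gene_min = data.min(axis=0)
    gene_range = data.max(axis=0) - gene_min
    # Avoid division by zero for genes with constant expression
    gene_range = np.where(gene_range == 0, 1.0, gene_range)
    return {"min": gene_min, "range": gene_range}


def apply_min_max_sketch(data, scaler_params):
    """Rescale each gene using previously fitted parameters."""
    return (data - scaler_params["min"]) / scaler_params["range"]


toy = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
scaler_params = fit_min_max_sketch(toy)
scaled = apply_min_max_sketch(toy, scaler_params)

# Round-trip through pickle, mirroring a scaler saved to and loaded from disk
restored_params = pickle.loads(pickle.dumps(scaler_params))
rescaled = apply_min_max_sketch(toy, restored_params)
print(np.allclose(rescaled, scaled))
```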
