# Bulk RNA-seq counts normalization

Quantile normalization of TPM counts, and TMM normalization of read counts.

## Overview

We currently implement two pipelines for RNA-seq data normalization, following the GTEx V8 workflow:

- A. Read counts -> TPM (within-sample normalization) -> TPM-level QC -> quantile normalization (between-sample normalization) -> inverse normal transformation
- B. Read counts -> TMM (via edgeR, between-sample normalization) -> inverse normal transformation
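
As a rough illustration of the within-sample and between-sample steps in pipeline A, the sketch below converts read counts to TPM and then quantile-normalizes the samples. The function names, the toy count matrix, and the gene lengths are hypothetical. This is not the code the pipeline runs, only a minimal version of the two ideas.

```python
import numpy as np

def counts_to_tpm(counts, gene_length_bp):
    """Within-sample normalization: genes x samples read counts -> TPM."""
    # Reads per kilobase of gene length
    rpk = counts / (gene_length_bp[:, None] / 1e3)
    # Rescale each sample (column) so that its values sum to one million
    return rpk / rpk.sum(axis=0, keepdims=True) * 1e6

def quantile_normalize(x):
    """Between-sample normalization: give every column the same distribution."""
    # Reference distribution: mean of the sorted columns, rank by rank
    reference = np.sort(x, axis=0).mean(axis=1)
    # 0-based rank of each value within its column
    # (assumption: ties are broken by order of appearance, not averaged)
    ranks = np.argsort(np.argsort(x, axis=0, kind="stable"), axis=0, kind="stable")
    return reference[ranks]

# Toy data (hypothetical): 4 genes x 3 samples
counts = np.array([[10.0, 20.0, 5.0],
                   [100.0, 80.0, 120.0],
                   [50.0, 60.0, 40.0],
                   [0.0, 3.0, 1.0]])
gene_length_bp = np.array([1000.0, 2000.0, 500.0, 1500.0])

tpm = counts_to_tpm(counts, gene_length_bp)
qn = quantile_normalize(tpm)
```

After quantile normalization every sample has the same set of values, only assigned to different genes according to each gene's rank within that sample.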

The GTEx protocol, described [here](https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods), consists of the following steps:

1. Genes were selected based on expression thresholds of >0.1 TPM in at least 20% of samples and ≥6 reads in at least 20% of samples.
2. Expression values were normalized between samples using TMM as implemented in edgeR (Robinson & Oshlack, Genome Biology, 2010).
3. For each gene, expression values were additionally normalized across samples using an inverse normal transform.

In other words, GTEx normalized the count data using TMM (pipeline B outlined above), while the TPM-level QC results were used to select samples and genes.
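
The gene selection and inverse normal transform steps of the GTEx protocol can be sketched as follows. This is a minimal illustration and not the pipeline's implementation: the function names and toy matrices are hypothetical, the default thresholds mirror the values quoted above, and the `rank / (n + 1)` offset used to map ranks to quantiles is an assumption, since the protocol text does not specify one.

```python
import numpy as np
from statistics import NormalDist

def select_genes(tpm, counts, tpm_threshold=0.1, count_threshold=6,
                 sample_frac_threshold=0.2):
    """Boolean mask of genes (rows) passing both expression thresholds."""
    min_samples = sample_frac_threshold * tpm.shape[1]
    pass_tpm = (tpm > tpm_threshold).sum(axis=1) >= min_samples
    pass_counts = (counts >= count_threshold).sum(axis=1) >= min_samples
    return pass_tpm & pass_counts

def average_ranks(v):
    """1-based ranks of a vector; tied values share the mean of their ranks."""
    _, inverse, n_tied = np.unique(v, return_inverse=True, return_counts=True)
    highest = np.cumsum(n_tied)  # highest rank within each group of ties
    return (highest - (n_tied - 1) / 2.0)[inverse]

def inverse_normal_transform(x):
    """Per gene (row), map ranks across samples to standard normal quantiles."""
    n = x.shape[1]
    inv_cdf = NormalDist().inv_cdf
    out = np.empty(x.shape, dtype=float)
    for i, row in enumerate(x):
        # Assumption: rank / (n + 1) keeps the quantiles strictly inside (0, 1)
        out[i] = [inv_cdf(r / (n + 1)) for r in average_ranks(row)]
    return out

# Toy data (hypothetical): 3 genes x 5 samples
tpm = np.array([[0.0, 0.05, 0.2, 0.0, 0.0],   # passes TPM, fails read count
                [5.0, 3.0, 8.0, 1.0, 2.0],    # passes both thresholds
                [0.0, 0.0, 0.0, 0.0, 0.0]])   # not expressed
counts = np.array([[0, 1, 5, 0, 0],
                   [60, 40, 90, 10, 30],
                   [0, 0, 0, 0, 0]])

keep = select_genes(tpm, counts)
# In the pipelines above this would be the TMM- or quantile-normalized matrix
expr = tpm[keep]
normalized = inverse_normal_transform(expr)
```

The transform keeps only the ordering of samples within each gene, so every retained gene ends up with the same approximately standard normal set of values.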

## Caveats

A couple of possible improvements over the existing pipeline:

1. Should we try [GeTMM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2246-7) instead? According to the paper, GeTMM improves intra-sample analysis and is very easy to implement (it adds one line to the TMM code, as shown in [this post](https://www.reneshbedre.com/blog/expression_units.html)). However, for inter-sample analyses such as differential expression (DEG) it performs the same as TMM, so it may not be necessary for eQTL studies.
2. Should we control for batch effects explicitly when the batches are known, so that we do not rely on hidden factor analysis? Two options:
    1. Read counts -> ComBat-seq -> inverse normal transformation
    2. Keep the existing pipeline and add a batch adjustment using ComBat on the normalized data
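
To make the GeTMM idea in item 1 concrete, here is a hedged Python sketch. The only GeTMM-specific step is dividing counts by gene length (in kb) before computing TMM factors. The `tmm_factors` function below is a simplified re-implementation written for illustration, not edgeR's `calcNormFactors`: the reference sample is fixed by the caller and trimming is quantile-based. All function names and the simulated data are hypothetical.

```python
import numpy as np

def tmm_factors(x, ref=0, m_trim=0.3, a_trim=0.05):
    """Simplified TMM scaling factors for a genes x samples matrix."""
    lib = x.sum(axis=0)
    factors = np.ones(x.shape[1])
    for k in range(x.shape[1]):
        ok = (x[:, k] > 0) & (x[:, ref] > 0)   # log ratios need positive values
        yk, yr = x[ok, k], x[ok, ref]
        pk, pr = yk / lib[k], yr / lib[ref]
        m = np.log2(pk / pr)                    # log fold change vs reference
        a = 0.5 * np.log2(pk * pr)              # mean log abundance
        # Keep genes inside the central quantiles of both M and A
        m_lo, m_hi = np.quantile(m, [m_trim, 1 - m_trim])
        a_lo, a_hi = np.quantile(a, [a_trim, 1 - a_trim])
        keep = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)
        # Inverse-variance weights (delta-method approximation)
        w = 1.0 / ((lib[k] - yk) / (lib[k] * yk) + (lib[ref] - yr) / (lib[ref] * yr))
        factors[k] = 2 ** np.average(m[keep], weights=w[keep])
    # Rescale so that the factors have a geometric mean of 1
    return factors / np.exp(np.mean(np.log(factors)))

def getmm(counts, gene_length_bp):
    """GeTMM-style values: length-corrected counts scaled by TMM library sizes."""
    rpk = counts / (gene_length_bp[:, None] / 1e3)   # the extra GeTMM step
    effective_lib = rpk.sum(axis=0) * tmm_factors(rpk)
    return rpk / effective_lib * 1e6

# Simulated data (hypothetical): sample 2 is sample 1 sequenced twice as deep
rng = np.random.default_rng(0)
base = rng.poisson(50, size=200) + 1
counts = np.column_stack([base, 2 * base, rng.poisson(50, size=200) + 1])
gene_length_bp = rng.integers(500, 5000, size=200).astype(float)

rpk = counts / (gene_length_bp[:, None] / 1e3)
expr = getmm(counts, gene_length_bp)
```

In this toy example the first two samples differ only in sequencing depth, so their normalized values coincide. Dropping the `rpk` line and passing `counts` directly gives plain TMM-scaled counts per million.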

## Input

1. TPM matrix and read count matrix in RNA-SeQC format
    - The first two rows should be comment lines with a `#` prefix.
    - The matrix should be tab-delimited.
    - The matrix file names should end with the `gct` suffix (the example below uses gzip-compressed `.gct.gz` files).
    - These requirements are satisfied if the inputs are outputs of the [`bulk_expression_QC` pipeline](https://cumc.github.io/xqtl-pipeline/code/molecular_phenotypes/QC/bulk_expression_QC.html).
2. GTF for collapsed gene model
    - The gene names must be consistent with the GCT matrices (e.g. `ENSG00000000003` vs. `ENSG00000000003.1` will not work).
    - Chromosome names must have the `chr` prefix (this could be made an option in the pipeline, but currently we assume the `chr` prefix convention).
3. Metadata to match sample names between the expression data and the genotype files
    - Required input
    - Tab-delimited with header
    - Exactly 2 columns: the first column is the sample name in the expression data, the second column is the sample name in the genotype data
    - **Must contain all sample names in the expression matrices, even if they do not exist in the genotype data**
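
The input requirements above can be checked before launching the workflow. The following is a minimal sketch of such a check and is not part of the pipeline: all function names are hypothetical, the GCT header is assumed to start with two annotation columns (`Name`, `Description`) before the sample columns, and the toy files are written to a temporary directory only to make the example self-contained.

```python
import csv
import gzip
import re
import tempfile
from pathlib import Path

def open_text(path):
    """Open a plain or gzip-compressed text file."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)

def read_gct_ids(path, n_annotation_cols=2):
    """Return (sample names, gene IDs); assumes Name/Description leading columns."""
    with open_text(path) as fh:
        for _ in range(2):
            if not fh.readline().startswith("#"):
                raise ValueError("the first two rows must start with '#'")
        samples = fh.readline().rstrip("\n").split("\t")[n_annotation_cols:]
        genes = [line.split("\t", 1)[0] for line in fh if line.strip()]
    return samples, genes

def read_gtf_genes(path):
    """Return {gene_id: chromosome} parsed from the GTF attribute column."""
    genes = {}
    with open_text(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            match = re.search(r'gene_id "([^"]+)"', fields[8])
            if match:
                genes[match.group(1)] = fields[0]
    return genes

def check_inputs(gct, gtf, lookup):
    """Collect human-readable descriptions of violated input requirements."""
    problems = []
    if not str(gct).endswith((".gct", ".gct.gz")):
        problems.append("GCT file name should end with the gct suffix")
    samples, gct_genes = read_gct_ids(gct)
    gtf_genes = read_gtf_genes(gtf)
    unknown = [g for g in gct_genes if g not in gtf_genes]
    if unknown:
        problems.append(f"{len(unknown)} GCT gene ID(s) not in the GTF, e.g. {unknown[0]}")
    if any(not chrom.startswith("chr") for chrom in gtf_genes.values()):
        problems.append("GTF chromosome names lack the chr prefix")
    with open_text(lookup) as fh:
        rows = list(csv.reader(fh, delimiter="\t"))[1:]   # skip the header row
    listed = {row[0] for row in rows if row}
    missing = [s for s in samples if s not in listed]
    if missing:
        problems.append(f"samples missing from the lookup table: {', '.join(missing)}")
    return problems

# Toy files (hypothetical) with a gene version mismatch and an unlisted sample
tmp = Path(tempfile.mkdtemp())
(tmp / "toy.gct").write_text(
    "#1.2\n#2\t2\nName\tDescription\tS1\tS2\n"
    "ENSG00000000003.1\tgeneA\t5\t7\nENSG00000000005\tgeneB\t0\t1\n")
(tmp / "toy.gtf").write_text(
    'chrX\tsrc\tgene\t1\t100\t.\t-\t.\tgene_id "ENSG00000000003";\n'
    'chrX\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "ENSG00000000005";\n')
(tmp / "lookup.txt").write_text("sample_id\tparticipant_id\nS1\tP1\n")

problems = check_inputs(tmp / "toy.gct", tmp / "toy.gtf", tmp / "lookup.txt")
for problem in problems:
    print(problem)
```

Here the check reports two problems: the versioned gene ID `ENSG00000000003.1` does not match the GTF, and sample `S2` is absent from the lookup table.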

## Output

Normalized expression file in `bed` format.

## Minimal Working Example

Expression matrices can be generated by the MWE of `bulk_expression_QC.ipynb`. The full set of MWE files can be found [on Google Drive](https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk).

In [None]:
sos run bulk_expression_normalization.ipynb normalize \
    --cwd output \
    --tpm-gct data/mwe.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct data/mwe.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf data/gene.gtf  \
    --sample-participant-lookup data/sampleSheetAfterQC.txt \
    --container containers/rna_quantification.sif \
    --count-threshold 1   # to make the MWE work

## Command interface

In [3]:
sos run bulk_expression_normalization.ipynb -h

usage: sos run bulk_expression_normalization.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  normalize

Global Workflow Options:
  --cwd VAL (as path, required)
                        Work directory & output directory
  --counts-gct VAL (as path, required)
                        gene count table
  --tpm-gct VAL (as path, required)
                        gene TPM table
  --annotation-gtf VAL (as path, required)
                        gene gtf annotation table
  --sample-participant-lookup VAL (as path, required)
                        A file to map sample ID from expression to genotype,must
                        contain two columns, sample_id and participant

In [2]:
[global]
# Work directory & output directory
parameter: cwd = path("output")
# Gene count table
parameter: counts_gct = path
# Gene TPM table
parameter: tpm_gct = path
# Gene GTF annotation table
parameter: annotation_gtf = path
# A file to map sample IDs from expression to genotype. It must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path
# Minimum TPM used to select expressed genes
parameter: tpm_threshold = 0.1
# Minimum read count used to select expressed genes
parameter: count_threshold = 6
# Minimum fraction of samples in which a gene must pass the TPM and read count thresholds
parameter: sample_frac_threshold = 0.2
# Normalization method: TMM (tmm) or quantile normalization (qn)
parameter: normalization_method = 'tmm'
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container used to run the step
parameter: container = ""

In [None]:
[normalize]
# Normalize expression using the TPM and read count GCT matrices, the gene GTF annotation, and the sample-to-participant lookup table.
input: tpm_gct, counts_gct, annotation_gtf, sample_participant_lookup
output: f'{cwd:a}/{_input[0]:bnnn}.{normalization_method}.expression.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
bash: expand = "${ }", stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout',container = container
    for i in {1..22} X Y MT; do echo chr$i; done > ${_output[0]:bnnn}.vcf_chr_list
    eqtl_prepare_expression.py ${_input[0]} ${_input[1]} ${_input[2]} \
        ${_input[3]} ${_output[0]:bnnn}.vcf_chr_list ${_output[0]:nnn} \
        --tpm_threshold ${tpm_threshold} \
        --count_threshold ${count_threshold} \
        --sample_frac_threshold ${sample_frac_threshold} \
        --normalization_method ${normalization_method} && \
    rm -f ${_output[0]:bnnn}.vcf_chr_list