# xQTL-protocol-analysis

This notebook records my analysis of the xQTL protocol, carried out as an orientation project in Gao Wang's group.

## Motivation

The goal of this project is to run a minimal toy dataset through the protocol.

### Preparation
Set up the environment.

Download the folders from Synapse, including the protocol data, test data, reference data, and containers.

In [None]:
synapse get -r syn37178491 \
syn36416601 \
syn36416587 \
syn36416610 
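
After the download finishes, it can help to confirm that the folders used by the later commands are present. The following is a minimal sketch: the folder names are taken from the paths used in the commands in this notebook, and `check_dirs` is a hypothetical helper, not part of the pipeline or the Synapse client.

```shell
# Hypothetical helper: print each directory that is missing.
# Returns 0 if every directory exists, 1 otherwise.
check_dirs() {
    check_dirs_missing=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing: $d"
            check_dirs_missing=1
        fi
    done
    return "$check_dirs_missing"
}

# Folder names assumed from the paths used later in this notebook.
check_dirs protocol_data reference_data containers/singularity \
    || echo "Some folders are missing; re-run the download."
```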

### Step 1 Reference data standardization
Because my computer does not meet the 40 GB memory requirement, I downloaded most of the reference data from Synapse.
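
Whether a machine meets the memory requirement can be checked before deciding between building and downloading the reference data. The sketch below is Linux-only and reads `MemTotal` from `/proc/meminfo`; `has_memory_gb` is a hypothetical helper, and treating 40 GB as 40 × 1024 × 1024 kB is an assumption.

```shell
# Hypothetical helper: succeed if MemTotal in a meminfo-style file
# is at least the requested number of GB (default file: /proc/meminfo).
has_memory_gb() {
    required_gb=$1
    meminfo=${2:-/proc/meminfo}
    # MemTotal is reported in kB.
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    [ "$total_kb" -ge $((required_gb * 1024 * 1024)) ]
}

if has_memory_gb 40; then
    echo "Enough memory to build the reference data locally."
else
    echo "Less than 40 GB of memory; download the prepared reference data instead."
fi
```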

Generate the RSEM index from the GTF annotation and the reference FASTA.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/singularity/rna_quantification.sif 

Generate the SUPPA annotation for psichomics to detect RNA alternative splicing events.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/reference_data.ipynb SUPPA_annotation \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/singularity/psichomics.sif

### Step 2 Quantification of gene expression

Summarize data quality with FastQC.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --container containers/singularity/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

I skipped the read alignment (STAR) and QC (Picard) steps.

The next step is to call gene-level RNA expression via rnaseqc.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/singularity/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list

Then call transcript-level RNA expression via RSEM. This step takes about 30 minutes to complete.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/RNA_calling.ipynb rsem_call \
    --cwd output/rnaseq \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --RSEM-index reference_data/RSEM_Index/ \
    --container containers/singularity/rna_quantification.sif \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list
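
To measure a run time such as the roughly 30 minutes noted for this step, a command can be wrapped in a small timer. This is a minimal sketch; `timed` is a hypothetical helper, not part of SoS or the pipeline.

```shell
# Hypothetical helper: run a command, report elapsed wall-clock seconds
# on stderr, and return the command's own exit status.
timed() {
    timed_start=$(date +%s)
    timed_status=0
    "$@" || timed_status=$?
    timed_end=$(date +%s)
    echo "elapsed: $((timed_end - timed_start)) s" >&2
    return "$timed_status"
}

# Example with a placeholder command; any of the sos run commands
# in this notebook could be prefixed with "timed" in the same way.
timed sleep 1
```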

Multi-sample RNA-seq QC.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
    --container containers/singularity/rna_quantification.sif 
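
To see how many genes and samples survive QC, the dimensions of the GCT files can be read from their headers. The sketch below assumes the files follow the GCT v1.2 layout, where line 1 is `#1.2` and line 2 holds the number of data rows and data columns; `gct_dims` is a hypothetical helper.

```shell
# Hypothetical helper: print "<rows> <columns>" from line 2 of a gzipped GCT file.
gct_dims() {
    gzip -dc "$1" | awk 'NR == 2 {print $1, $2}'
}

# Compare the rnaseqc output with the QC-filtered file, if both are present.
# File names are the ones used in the commands in this notebook.
for f in output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
         output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz; do
    if [ -f "$f" ]; then
        echo "$f: $(gct_dims "$f")"
    fi
done
```

A drop in the first number between the two files reflects genes removed by the low-expression filter, and a drop in the second reflects removed outlier samples.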

Multi-sample read count normalization.
First, download the sample_participant_lookup.rnaseq file from the reference_data folder on Synapse.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --container containers/singularity/rna_quantification.sif \
    --count-threshold 1 \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

Region list generation.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/gene_annotation.ipynb region_list_generation \
    --cwd output/rnaseq  \
    --phenoFile output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup reference_data/sample_participant_lookup.rnaseq \
    --container containers/singularity/bioinfo.sif \
    --phenotype-id-type gene_id

### Step 3 Quantification of alternative splicing events

#### LeafCutter workflow
Intron usage ratio quantification via LeafCutter.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --container containers/singularity/leafcutter.sif 

QC and normalization of LeafCutter outputs.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/leaf_cutter/ \
    --ratios output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz \
    --container containers/singularity/leafcutter.sif 

Post-process the LeafCutter outputs so that they are ready for TensorQTL.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd output/leaf_cutter/ \
    --intron_count output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind_numers.counts.gz \
    --phenoFile output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/singularity/bioinfo.sif \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

#### psichomics workflow