# Test and Improve xQTL protocol

This notebook documents my analysis of the xQTL protocol, carried out as an orientation project in Gao Wang's group.

## Motivation

The goal of this project is to test the protocol on a minimal data set: run it step by step and learn as much as possible in the process.

## To do

### Prepare
Set up the environment: WSL2 (Ubuntu), Singularity, Miniconda, and Script of Scripts (SoS).
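A quick way to confirm the setup is to check that each tool is on `PATH`. This is a minimal sketch: `check_tools` is a hypothetical helper, and the command names in the usage note are my assumption about how these tools are installed.

```shell
#!/usr/bin/env bash
# Report which of the given commands are available on PATH.
# Returns 0 when all are found, 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this project the call would be something like `check_tools singularity conda sos synapse` (assumed command names).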

Download the folders from Synapse, including protocol data, test data, reference data, and containers.

In [None]:
# `synapse get` takes one Synapse ID per call, so loop over the folders.
for id in syn37178491 syn36416601 syn36416587 syn36416610; do
    synapse get -r "$id"
done

### Step 1 Reference data standardization
Because my computer does not meet the 40 GB memory requirement, I downloaded most of the reference data from Synapse instead of generating it locally.
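To see whether a machine meets the memory requirement, total memory can be read from `/proc/meminfo` on Linux or WSL2. This is a minimal sketch; `mem_total_gb` and `has_enough_memory` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print total memory in whole GB, read from a meminfo-style file
# (default: /proc/meminfo, which exists on Linux and WSL2).
mem_total_gb() {
    local meminfo="${1:-/proc/meminfo}"
    # MemTotal is reported in kB; divide twice by 1024 to get GB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Return 0 if total memory is at least the required number of GB.
has_enough_memory() {
    local required_gb="$1"
    local meminfo="${2:-/proc/meminfo}"
    [ "$(mem_total_gb "$meminfo")" -ge "$required_gb" ]
}
```

For the requirement mentioned above, the check would be `has_enough_memory 40 || echo "less than 40 GB"`.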

Generate the RSEM index from the GTF annotation and the reference FASTA.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/singularity/rna_quantification.sif 

Generate the SUPPA annotation for psichomics to detect RNA alternative splicing events.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/reference_data.ipynb SUPPA_annotation \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/singularity/psichomics.sif

### Step 2 Quantification of gene expression

Summarize read quality with FastQC.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --container containers/singularity/rna_quantification.sif \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

Here is an excerpt of the FastQC output:

##FastQC	0.11.9
>>Basic Statistics	pass
#Measure	Value
Filename	Sample_1.subsetted.2.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	2434685
Sequences flagged as poor quality	0
Sequence length	101
%GC	57
>>END_MODULE
>>Per base sequence quality	pass
#Base	Mean	Median	Lower Quartile	Upper Quartile	10th Percentile	90th Percentile
1	30.185474506969076	31.0	31.0	34.0	26.0	34.0
2	30.391895050078347	33.0	31.0	34.0	26.0	34.0
3	30.40025136721999	34.0	31.0	34.0	26.0	34.0
4	33.722677060892885	37.0	35.0	37.0	28.0	37.0
5	33.68968634546153	37.0	35.0	37.0	28.0	37.0
6	33.76541893509838	37.0	35.0	37.0	30.0	37.0
7	33.55300295520776	37.0	35.0	37.0	28.0	37.0
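Each module in this report starts with a `>>` line that carries its status, so the per-module results can be pulled out with `awk`. This is a minimal sketch; `fastqc_module_status` is a hypothetical helper, and it assumes the tab-separated layout shown in the excerpt above.

```shell
#!/usr/bin/env bash
# Print "<status><TAB><module name>" for every module in a FastQC data file.
# Module header lines start with ">>"; ">>END_MODULE" lines are skipped.
fastqc_module_status() {
    awk -F'\t' '/^>>/ && $1 != ">>END_MODULE" { print $2 "\t" substr($1, 3) }' "$1"
}
```

Piping the result through `grep -v '^pass'` would leave only the modules that warn or fail.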


I skipped the read alignment (STAR) and alignment QC (Picard) steps.

The next step is to call gene-level RNA expression via RNA-SeQC.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist \
    --data-dir protocol_data/input_data/RNASeq/fastq \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/singularity/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list

Then call transcript-level RNA expression via RSEM.
This takes about 30 minutes to complete.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/RNA_calling.ipynb rsem_call  \
    --cwd output/rnaseq   \
    --samples protocol_data/input_data/RNASeq/fastq/xqtl_protocol_data.fastqlist  \
    --data-dir protocol_data/input_data/RNASeq/fastq   \
    --RSEM-index reference_data/RSEM_Index/   \
    --container containers/singularity/rna_quantification.sif   \
    --bam_list output/rnaseq/xqtl_protocol_data_bam_list   

Here is an excerpt of the RSEM output:

gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
ENSG00000000003	ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008	2061.80	1775.53	0.00	0.00	0.00
ENSG00000000005	ENST00000373031,ENST00000485971	873.50	588.52	0.00	0.00	0.00
ENSG00000000419	ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752	974.33	688.16	0.00	0.00	0.00
ENSG00000000457	ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238	3185.80	2899.51	0.00	0.00	0.00
ENSG00000000460	ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289	2431.11	2144.87	0.00	0.00	0.00
ENSG00000000938	ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472	1722.14	1435.95	0.00	0.00	0.00
ENSG00000000971	ENST00000359637,ENST00000367429,ENST00000466229,ENST00000470918,ENST00000496761,ENST00000630130	2560.33	2274.47	0.00	0.00	0.00
ENSG00000001036	ENST00000002165,ENST00000367585,ENST00000451668	1256.00	970.05	0.00	0.00	0.00
ENSG00000001084	ENST00000504353,ENST00000504525,ENST00000505197,ENST00000505294,ENST00000509541,ENST00000510837,ENST00000513939,ENST00000514004,ENST00000514373,ENST00000514933,ENST00000515580,ENST00000616923,ENST00000643939,ENST00000650454	1723.64	1439.58	0.00	0.00	0.00
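The first rows of this excerpt are all zeros, so it can help to count how many genes have a nonzero TPM. This is a minimal sketch; `count_expressed_genes` is a hypothetical helper, and it assumes a tab-separated table with one header row and TPM in column 6, as in the excerpt.

```shell
#!/usr/bin/env bash
# Count genes with TPM > 0 in an RSEM gene-level results table.
# Assumes one header row and TPM in the sixth tab-separated column.
count_expressed_genes() {
    awk -F'\t' 'NR > 1 && $6 > 0 { n++ } END { print n + 0 }' "$1"
}
```

The function takes the path to one results file and prints a single integer.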


Multi-sample RNA-seq QC.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.gene_readsCount.gct.gz \
    --container containers/singularity/rna_quantification.sif 

Multi-sample read count normalization.
First download the sample_participant_lookup.rnaseq file from the reference_data folder on Synapse.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --container containers/singularity/rna_quantification.sif \
    --count-threshold 1 \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

Region list generation.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/gene_annotation.ipynb region_list_generation \
    --cwd output/rnaseq  \
    --phenoFile output/rnaseq/xqtl_protocol_data.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup reference_data/sample_participant_lookup.rnaseq \
    --container containers/singularity/bioinfo.sif \
    --phenotype-id-type gene_id

### Step 3 Quantification of alternative splicing events

#### LeafCutter workflow
Intron usage ratio quantification via leafCutter.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/leaf_cutter/ \
    --samples output/rnaseq/xqtl_protocol_data_bam_list \
    --container containers/singularity/leafcutter.sif 

QC and normalization of leafCutter outputs.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/leaf_cutter/ \
    --ratios output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz \
    --container containers/singularity/leafcutter.sif 

One of the output files, `xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt`, looks like this:

#Chr	start	end	ID	Sample_1	Sample_2	Sample_3	Sample_4	Sample_5	Sample_6	Sample_7	Sample_8	Sample_9	Sample_10	Sample_11	Sample_12	Sample_13	Sample_14	Sample_15	Sample_16	Sample_17	Sample_18	Sample_19	Sample_20	Sample_21	Sample_22	Sample_23	Sample_24	Sample_25	Sample_26	Sample_27	Sample_28	Sample_29	Sample_30	Sample_31	Sample_32	Sample_33	Sample_34	Sample_35	Sample_36	Sample_37	Sample_38	Sample_39	Sample_40	Sample_41	Sample_42	Sample_43	Sample_44	Sample_45	Sample_46	Sample_47	Sample_48	Sample_49
chr21	5025049	5026280	chr21:5025049:5026280:clu_15_+	1.1801553949206993	0.5906565965145529	0.5960707801745235	-1.6434980544790627	1.2447084836099003	1.4488709651528058	0.024767507497203564	-1.9959345209493737	-0.5975504065484583	0.945681846770402	-0.9599791658266483	-0.9020422684809566	0.8897112621556369	-0.19606680320034311	0.007842322641267207	-0.36938699698914385	1.0488367675600292	-1.1556649946967998	0.5279082999276027	0.22815072998276123	-1.6712040702398938	0.981800939495538	-0.10669146239265827	3.7678444273707323	1.186806881808846	-1.089030823980715	-1.1147835411321065	-0.8799430354726265	-0.7368364708571798	0.01114446967948394	-0.7265869818199944	0.6941822484312622	0.7881748560936481	-2.0239323600402557	-0.09549007630309227	-0.24810816766140223	-1.1678226399714962	1.1444699743170754	0.43518480760880945	0.2036461815242063	0.9612886610800289	1.600439716297775	0.8582654105186588	0.2630341109278683	-1.2766877783993482	-2.060460691392621	-0.6446048822073186	0.23790528939977273	0.0718802793343171
chr21	5025049	5027935	chr21:5025049:5027935:clu_15_+	0.8720677573430018	1.0618024560241432	-0.1278873740051456	0.6109266400380272	-0.0049530153602243615	-0.9691805187253599	-1.258256442012576	1.7253064829667735	0.7926876590693943	-0.7853624895182748	-1.1564701919048186	-0.5029245931109543	-0.7148136668886392	-1.2174105560253587	1.5629533829758746	-1.2729668538962486	0.7808755744809718	1.0830755164912185	-1.2447084836099003	-1.447692932204825	-0.13746457371832715	-0.4861327551140027	1.394451537609808	1.659641162910836	0.1537335141921846	0.7915579475761285	0.820690080662648	-1.0610774583231255	-0.8860382715640819	-0.26944873155251287	-1.2261166435822197	-1.0290112362379562	-0.9170230189823326	-0.48985228076108367	0.23663169092701933	1.1524516725029181	0.9939030943664477	0.5127854159104542	-1.1629389497445062	-0.0227031777637733	0.06732901826999577	1.3561952616968314	0.8964764484025595	-1.0771583753124823	0.5193870484813956	-1.1768491476141065	-1.1334165803461267	-1.5382338897163885	-0.24555506726015877
chr21	5026630	5027935	chr21:5026630:5027935:clu_15_+	-0.4940448664298115	-1.904778837097044	-0.19690839238607402	0.5398074619320817	-2.092536152343439	-1.795233536977089	0.3434372320488766	0.8708609744147066	-0.01816198049378325	-1.3158925479033465	0.8120503860733814	0.4819563097022388	-1.1186334231117163	0.43518480760880945	-1.5222623580510508	0.45478037754771167	-0.5945924576257678	1.5546077403961758	-0.17674728255911915	0.0851291257010744	0.9945797071005923	-1.5355446811843847	1.0756849721637112	1.6612794476971389	-0.869053174130471	1.2070835558471817	1.2243679966548833	0.7943841259039116	0.8074672553294278	-0.04707082994085674	0.5024562552627317	-0.38178863963708987	-0.9547575622932908	1.4512330819760757	-0.4247721259289395	0.8823771956659383	1.4395023904881383	-1.8518037459880667	-0.10295624885536175	-0.3813447226103047	-0.8612519528909637	-2.008183982657377	-1.966556905378204	0.1742323553027833	0.4745521655931572	1.4524171830025492	0.8172268843648766	0.15289825575338362	-0.14747164844821173
chr21	5032217	5033408	chr21:5032217:5033408:clu_16_+	0.46119783184374563	2.0852211125934854	1.9929184553632342	0.6461298109265834	-0.1872383772254422	0.5960707801745235	0.6686717007744576	0.6461298109265834	-1.2358019454753641	0.3951416531555249	0.5970570525458161	0.29780913879856025	0.4424562745481825	0.611921804579183	0.6738421752532086	0.5599726304619902	0.41890707786658327	-3.767844427370683	0.6563347551640993	0.6779915578563864	0.23875456935738734	0.23663169092701933	0.32902428912662757	-1.7670069125843726	0.5350387159599282	0.41125904745563335	0.43427749786884207	-1.4643718468149274	-0.20997126059971338	0.6139139545526131	0.4759384501924495	0.08181560384001503	0.6578713831291685	0.5369447484547609	0.0553366739235069	1.6403203305924554	0.5412404778351537	0.6289338064736658	0.016097929354984927	-0.40632313765204336	0.419357721603267	1.8035671281815073	0.04004741252952372	0.6604358885496658	-1.8035671281815073	0.6588966652179624	0.09051560918668777	-0.5980439060366768	0.6154096671301943
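With this many columns it is hard to check by eye that every row lines up with the header. This is a minimal sketch of a shape check; `check_table_shape` is a hypothetical helper, and it assumes four leading annotation columns (`#Chr`, `start`, `end`, `ID`) followed by one column per sample, as in the excerpt.

```shell
#!/usr/bin/env bash
# Check that every row of a tab-separated table has as many fields as its header.
# Prints the number of sample columns (fields after the 4 annotation columns)
# and returns 1 if any row has a different field count.
check_table_shape() {
    awk -F'\t' '
        NR == 1 { ncol = NF; next }
        NF != ncol {
            printf "line %d has %d fields, expected %d\n", NR, NF, ncol
            bad = 1
        }
        END { print "samples: " (ncol - 4); exit bad + 0 }
    ' "$1"
}
```

A nonzero exit status means at least one row is ragged, and the offending line numbers are printed before the sample count.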


Post-process the leafCutter outputs to make them TensorQTL-ready.

In [None]:
sos run ../fork/xqtl-pipeline/pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd output/leaf_cutter/ \
    --intron_count output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind_numers.counts.gz \
    --phenoFile output/leaf_cutter/xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.txt \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --container containers/singularity/bioinfo.sif \
    --sample_participant_lookup reference_data/sample_participant_lookup.rnaseq

#### Psichomics workflow

I ran into some trouble in this part.

The command is as follows:

In [None]:
sos run splicing_calling.ipynb psichomics \
    --cwd psichomics_output/ \
    --samples sample_fastq_bam_list \
    --splicing_annotation hg38_suppa.rds \
    --container containers/psichomics.sif

One problem is that psichomics cannot locate the splicing annotation when it is given as a relative path. This can be solved by passing the full path of the `hg38_suppa.rds` file.
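The full path does not have to be typed by hand. This is a minimal sketch using `realpath` from GNU coreutils (available on Ubuntu); `abs_path` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Resolve a possibly relative path to an absolute one.
# The -e flag makes realpath fail if the file does not exist,
# so a wrong path is caught before the pipeline starts.
abs_path() {
    realpath -e "$1"
}
```

In the command above, this would mean passing `--splicing_annotation "$(abs_path hg38_suppa.rds)"`.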

However, it still does not work.

The key message is:

```
Warning message:
In loadLocalFiles("/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/psichomics") :
  No supported files were found in the given folder.
```

The only file in that folder contains nothing but an `x`.

So this is not an I/O error. I raised an issue on GitHub.

The problem is still unresolved.
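The check on the psichomics output folder described above (a single file holding only an `x`) can be repeated with a small script that lists suspiciously small files. This is a minimal sketch; `list_small_files` is a hypothetical helper and the 16-byte default threshold is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# List regular files under a directory whose size is at most max_bytes,
# printed as "<size in bytes><TAB><path>".
list_small_files() {
    local dir="$1"
    local max_bytes="${2:-16}"
    local f size
    find "$dir" -type f | while IFS= read -r f; do
        # Arithmetic expansion strips any padding that wc adds on some systems.
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -le "$max_bytes" ]; then
            printf '%d\t%s\n' "$size" "$f"
        fi
    done
}
```

Running it on the output folder shows at a glance whether the step wrote only placeholder-sized files.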