# TensorQTL QTL association testing

This notebook implements a workflow for using [tensorQTL](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1836-7) to perform QTL association testing.

## Input

- List of molecular phenotype files: a list of `bed.gz` files, each containing the table for a molecular phenotype. Each file should have a companion index file in `tbi` format.
- List of genotypes in PLINK binary format (`bed`/`bim`/`fam`) for each chromosome, previously processed through our genotype QC pipelines.
- Covariate file: a tab-delimited file whose header is `#id` followed by the sample names, with one covariate per row. It holds fixed and known covariates as well as hidden covariates recovered from factor analysis.
- Optionally, a list of traits (genes, regions of molecular features etc) to analyze.
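
As a minimal sketch of the covariate layout described above, the snippet below parses a small made-up table and transposes it to one record per sample. The sample names, covariate names and values are hypothetical; only the `#id` header and the covariate-per-row orientation come from this notebook.

```python
import csv
import io

# Hypothetical covariate table: the first column `#id` names the covariate,
# the remaining columns are sample IDs (all values here are made up).
covariate_tsv = (
    "#id\tsample_1\tsample_2\tsample_3\n"
    "sex\t0\t1\t1\n"
    "PC1\t0.12\t-0.40\t0.05\n"
)

rows = list(csv.reader(io.StringIO(covariate_tsv), delimiter="\t"))
header, body = rows[0], rows[1:]
samples = header[1:]

# Transpose to one record per sample (samples as rows, covariates as fields).
by_sample = {
    sample: {row[0]: float(row[i + 1]) for row in body}
    for i, sample in enumerate(samples)
}

# The number of samples is the number of header columns minus the `#id` column.
n_samples = len(header) - 1
print(n_samples)
print(by_sample["sample_2"])
```

The association steps perform the same transposition when they read the covariate file, so that samples index the rows.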

## Output

For each chromosome, several summary statistics files are generated, including nominal test statistics for each test as well as region (gene) level association evidence.

The columns of the nominal association results are as follows:

- phenotype_id: Molecular trait identifier (gene)
- variant_id: ID of the variant (rsid or chr:position:ref:alt)
- tss_distance: Distance of the SNP to the gene transcription start site (TSS)
- af: Allele frequency of the SNP
- ma_samples: Number of samples carrying the minor allele
- ma_count: Total number of minor alleles across individuals
- pval: Nominal P-value from linear regression
- beta: Slope of the linear regression
- se: Standard error of beta
- chr: Variant chromosome
- pos: Variant chromosomal position (base pairs)
- ref: Variant reference allele (A, C, T, or G)
- alt: Variant alternate allele
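
The last four columns are derived from `variant_id`. A minimal sketch of that split follows, assuming IDs of the form `chr:pos_ref_alt` (the form the cis step parses further down); `parse_variant_id` is a hypothetical helper, not part of tensorQTL, and the example ID is made up.

```python
def parse_variant_id(variant_id):
    """Split an ID of the assumed form 'chr:pos_ref_alt' into its fields."""
    locus, ref, alt = variant_id.split("_")
    chrom, pos = locus.split(":")
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


# Made-up variant ID for illustration.
parsed = parse_variant_id("chr21:1234567_A_G")
print(parsed)
```

IDs in another form (for example rsids) carry no coordinates and would need a lookup in the `bim` file instead.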


The columns of the region (gene) level association evidence are as follows:

- phenotype_id - Molecular trait identifier. (gene)
- num_var - Total number of variants tested in cis
- beta_shape1 - First parameter value of the fitted beta distribution
- beta_shape2 - Second parameter value of the fitted beta distribution
- true_df - Effective degrees of freedom of the beta distribution approximation
- pval_true_df - Empirical P-value for the beta distribution approximation
- variant_id - ID of the top variant (rsid or chr:position:ref:alt)
- tss_distance - Distance of the SNP to the gene transcription start site (TSS)
- ma_samples - Number of samples carrying the minor allele
- ma_count - Total number of minor alleles across individuals
- maf - Minor allele frequency in MiGA cohort
- ref_factor - Flag indicating if the alternative allele is the minor allele in the cohort (1 if AF <= 0.5, -1 if not)
- pval_nominal - Nominal P-value from linear regression
- slope - Slope of the linear regression
- slope_se - Standard error of the slope
- pval_perm - Permutation P-value obtained directly from the permutations (direct method)
- pval_beta - Permutation P-value obtained via the beta approximation. This is the one to use for downstream analysis

# Command interface 

In [1]:
sos run TensorQTL.ipynb -h

usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  cis
  trans

Global Workflow Options:
  --phenotype-list VAL (as path, required)
                        Path to the input molecular phenotype file, per chrom,
                        in bed.gz format.
  --covariate-file VAL (as path, required)
                        Covariate file
  --genotype-list VAL (as path, required)
                        Genotype file in PLINK binary format (bed/bam/fam)
                        format, per chrom
  --region-list . (as path)
                        An optional subset of regions of molecular features to
                        analyze
  --cwd output (as path)
       

## Minimal working example

An MWE is uploaded to [Google Drive](https://drive.google.com/drive/folders/1yjTwoO0DYGi-J9ouMsh9fHKfDmsXJ_4I?usp=sharing).
The Singularity image (sif) for running this MWE is uploaded to [Google Drive](https://drive.google.com/drive/folders/1mLOS3AVQM8yTaWtCbO8Q3xla98Nr5bZQ).


In [None]:
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --container containers/TensorQTL.sif --MAC 5
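
For the cis workflow, the phenotype and genotype arguments are list files that get joined on their `#id` column to pair up per-chromosome inputs. The sketch below illustrates that pairing under assumptions: the column names `pheno_path` and `geno_path` and all file names are hypothetical, and only the `#id` key is taken from this notebook.

```python
import csv
import io

# Hypothetical list files: `#id` is the chromosome, the second column a path.
phenotype_list = "#id\tpheno_path\nchr21\tMWE.chr21.bed.gz\nchr22\tMWE.chr22.bed.gz\n"
genotype_list = "#id\tgeno_path\nchr22\tMWE.chr22.bed\nchr21\tMWE.chr21.bed\n"


def read_list(text):
    """Read a tab-delimited list file into a dict keyed by its `#id` column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["#id"]: row for row in reader}


pheno = read_list(phenotype_list)
geno = read_list(genotype_list)

# Inner join on `#id`: keep only chromosomes present in both lists.
pairs = [
    (chrom, pheno[chrom]["pheno_path"], geno[chrom]["geno_path"])
    for chrom in pheno
    if chrom in geno
]
print(pairs)
```

Each resulting tuple corresponds to one cis job: a chromosome label, its phenotype file, and its genotype file.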

In [None]:
sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file MWE.bed \
    --phenotype-file MWE.log2cpm.mol_phe.bed.gz \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --container containers/TensorQTL.sif --MAC 5 --region-name  gene_name

In [None]:
nohup sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/genotype_qced/GRCh38_liftedover_sorted_all.add_chr.leftnorm.filtered.renamed.filtered.renamed.filtered.filtered.bed \
    --phenotype-file /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/output/data_preprocessing/ALL/phenotype_data/ALL.log2cpm.bed.gz \
    --covariate-file /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/output/data_preprocessing/ALL/covariates/ALL.log2cpm.ALL.covariate.pca.resid.PEER.cov.gz \
    --cwd ./output/trans_tensorQTL/ \
    --region-list /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/reference_data/AD_genes.region_list \
    --container containers/TensorQTL.sif --MAC 5 --region-name gene_name 

## Global parameter settings

In [5]:
[global]
# Covariate file
parameter: covariate_file = path
# For cis, genotype files in PLINK binary format (bed/bim/fam), per chrom; for trans, one whole-genome genotype file in PLINK binary format
parameter: genotype_file = path
# An optional subset of regions of molecular features to analyze
parameter: region_list = path()
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file: for cis, a list of phenotype files per chrom; for trans, one whole-genome phenotype file.
parameter: phenotype_file = path
# Prefix for the analysis output
parameter: name = f"{phenotype_file:bn}_{covariate_file:bn}"
# Minor allele count cutoff
parameter: MAC = 0
# Specify the number of jobs per run.
parameter: job_size = 2
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: region_name = "gene_id"
# The phenotype group file to group molecule_trait into molecule_trait_object.
parameter: phenotype_group = path() 

# Specify the cis window for the up and downstream radius to analyze around the region of interest, in units of bp
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1 # Use the header of covariate file for it being the intersect of geno/pheno/cov.
# Minor allele frequency cutoff. If set, it overrides the minor allele count (MAC) cutoff.
parameter: maf_threshold = MAC/(2.0*N)
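
The default MAF cutoff above is the MAC cutoff divided by the number of allele copies in the cohort (two per sample). A small sketch of that conversion, with a hypothetical helper name and made-up numbers:

```python
def maf_threshold_from_mac(mac, n_samples):
    """Convert a minor allele count cutoff into a minor allele frequency cutoff."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Each diploid sample contributes two allele copies.
    return mac / (2.0 * n_samples)


# Made-up example: MAC cutoff of 5 in a cohort of 100 samples.
threshold = maf_threshold_from_mac(5, 100)
print(threshold)
```

With the default `MAC = 0` the derived threshold is 0, so no variant is excluded on frequency unless `--MAC` or `--maf-threshold` is given.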

## cisQTL association testing

In [None]:
[cis_1]
# Path to the input molecular phenotype file, per chrom, in bed.gz format.

import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(phenotype_file,sep = "\t")
geno_chr_inv = pd.read_csv(genotype_file,sep = "\t")
input_inv = molecular_pheno_chr_inv.merge(geno_chr_inv, on = "#id")
input_inv = input_inv.values.tolist()
chr_inv = [x[0] for x in input_inv]
file_inv = [x[1:] for x in input_inv]

input: file_inv, group_by = len(file_inv[0]), group_with = "chr_inv" # This design is necessary to avoid using for_each, as SoS cannot take the chr number as an input.
output: parquet = f'{cwd:a}/{name}.{_chr_inv}.cis_qtl_pairs.{_chr_inv}.parquet', # This design is necessary to match the pattern of map_nominal output
        emprical = f'{cwd:a}/{name}.{_chr_inv}.emprical.cis_sumstats.txt',
        long_table = f'{cwd:a}/{name}.{_chr_inv}.norminal.cis_long_table.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'

python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout' , container = container
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, cis, trans
    import os, time 
    from multipy.fdr import qvalue
    from scipy.stats import chi2
        
    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]
    covariates_file = "$[covariate_file:a]"

    ## Load Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]", sep = "\t")
        keep_gene = region["$[region_name]"].to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_gene)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_gene)]


    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    if "chr" not in variant_df.chrom.iloc[0]:
        phenotype_pos_df.chr = [x.replace("chr","") for x in phenotype_pos_df.chr]

    ## Read phenotype group if available
    if $[1 if  phenotype_group.is_file() else 0]  > 0:
        group_s = pd.read_csv($[phenotype_group:r], sep='\t', header=None, index_col=0).squeeze("columns")

    ## cis-QTL mapping: nominal associations for all variant-phenotype pairs
    cis.map_nominal(genotype_df, variant_df,
                phenotype_df,
                phenotype_pos_df,
                "$[_output[0]:nnn]", covariates_df=covariates_df, window=$[window], maf_threshold = $[maf_threshold] $[", group_s = group_s" if  phenotype_group.is_file() else ""]  )

    ## Load the parquet and save it as txt
    pairs_df = pd.read_parquet("$[_output[0]]")
    ## Adds the group columns to pairs_df, if there is group_s use group_s, else use phenotype_id
    if $[1 if  phenotype_group.is_file() else 0]  > 0:
        pairs_df = pairs_df.merge(pd.DataFrame( {"molecular_trait_object_id": group_s}),left_on = "phenotype_id", right_index = True)
    else:
        pairs_df["molecular_trait_object_id"] = pairs_df.phenotype_id

    lambda_col = pairs_df.groupby("molecular_trait_object_id").apply( lambda x:  chi2.ppf(1. - np.median(x.pval_nominal), 1)/chi2.ppf(0.5,1))
    ## Rename by column name rather than by position, so the result does not depend on column order
    pairs_df = pairs_df.rename(columns = {"phenotype_id": "molecular_trait_id", "pval_nominal": "pvalue", "slope": "beta", "slope_se": "se"})
    pairs_df = pairs_df.assign(maf = lambda dataframe: dataframe['af'].map(lambda af:af if af < 0.5 else 1-af) ).drop("af",axis =  1)
    pairs_df["n"] = len(phenotype_df.columns.values)
    pairs_df = pairs_df.assign(
    alt = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split("_")[-1])).assign(
    ref = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split("_")[-2])).assign(
    pos = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split("_")[0].split(":")[1])).assign(
    chrom = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split(":")[0]))
    pairs_df.to_csv("$[_output[2]]", sep='\t',index = None)
    cis_df = cis.map_cis(genotype_df, variant_df, 
                     phenotype_df,
                     phenotype_pos_df,
                     covariates_df=covariates_df, seed=999, window=$[window], maf_threshold = $[maf_threshold] $[", group_s = group_s" if  phenotype_group.is_file() else ""] )
    
    cis_df.index.name = "molecular_trait_id"
    ## Add groups columns for eQTL analysis
    if "group_id" not in cis_df.columns:
        cis_df["group_id"] = cis_df.index
        cis_df["group_size"] = 1
    cis_df = cis_df.rename({"group_id":"molecular_trait_object_id","group_size":"n_traits","num_var" : "n_variants","variant_id":"variant","pval_perm":"p_perm", "pval_beta":"p_beta" },axis = 1)
    cis_df = cis_df.assign(inflation_factor = lambda dataframe : dataframe["molecular_trait_object_id"].map(lambda molecular_trait_object_id:lambda_col[molecular_trait_object_id]))
    ## Generate q-values
    significance,cis_df['qvalue'] = qvalue(cis_df.p_beta)
    cis_df.to_csv("$[_output[1]]", sep='\t')
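
The `inflation_factor` column computed in the step above is the usual genomic inflation factor: the chi-square (1 df) quantile at one minus the median nominal P-value, divided by the chi-square median. The sketch below reproduces that ratio with the standard library only, using the identity that a 1-df chi-square variable is a squared standard normal; the function names and the P-values are illustrative and not taken from the pipeline.

```python
from statistics import NormalDist, median


def chi2_1df_quantile(q):
    """Quantile of the chi-square distribution with 1 df, for 0 <= q < 1.

    If Z ~ N(0, 1) then Z**2 ~ chi2(1), so the q-quantile of Z**2 is the
    square of the (1 + q) / 2 quantile of Z.
    """
    z = NormalDist().inv_cdf((1.0 + q) / 2.0)
    return z * z


def inflation_factor(pvalues):
    """Ratio of the observed median test statistic to its expectation under the null."""
    return chi2_1df_quantile(1.0 - median(pvalues)) / chi2_1df_quantile(0.5)


# Made-up P-values: a median of 0.5 gives a factor of 1 (no inflation).
print(inflation_factor([0.1, 0.3, 0.5, 0.7, 0.9]))
# Made-up P-values skewed towards zero give a factor above 1.
print(inflation_factor([0.01, 0.05, 0.1]))
```

The sketch fails if the median P-value is exactly 0, because the corresponding normal quantile is infinite.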

## TransQTL association testing

With a 3.2 GB genotype file, this step needs at least 55 GB of memory, and 8 threads are not enough; raise `--mem` and `--numThreads` accordingly.

In [None]:
[trans_1]

# A subset of regions of molecular features to analyze; required for trans (running trans on all genes is not allowed)
parameter: region_list = path

input: phenotype_file,genotype_file
output: long_table = f'{cwd:a}/{_input[0]:bnn}.norminal.trans_long_table.txt'
parameter: batch_size = 10000
parameter: pval_threshold = 1e-5
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container 
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, cis, trans
    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]
    covariates_file = "$[covariate_file:a]"

    ## Loading Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)

    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]", sep = "\t")
        keep_gene = region["$[region_name]"].to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_gene)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_gene)]

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    ## Trans analysis
    trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df, batch_size=$[batch_size],
                           return_sparse=True, return_r2 = True, pval_threshold=$[pval_threshold], maf_threshold=$[maf_threshold])
    ## Filter out cis signal
    trans_df = trans.filter_cis(trans_df, phenotype_pos_df.T.to_dict(), variant_df, window=$[window])   
    ## Permutation
    perm_df = trans.map_permutations(genotype_df, covariates_df, batch_size=$[batch_size],
                             maf_threshold=$[maf_threshold])
    perm_output = trans.apply_permutations(perm_df,trans_df)
    
    ## Output
    trans_df.columns.values[1]  = "molecular_trait_id"
    trans_df.columns.values[2]  = "pvalue"
    trans_df.columns.values[3]  = "beta"
    trans_df.columns.values[4]  = "se"
    trans_df = trans_df.assign(maf = lambda dataframe: dataframe['af'].map(lambda af:af if af < 0.5 else 1-af) ).drop("af",axis =  1)
    trans_df = trans_df.assign(
    chrom = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split(":")[0])).assign(
    alt = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split("_")[2])).assign(
    ref = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split("_")[1])).assign(
    pos = lambda dataframe: dataframe['variant_id'].map(lambda variant_id:variant_id.split("_")[0].split(":")[1]))
    trans_df.to_csv("$[_output]", sep='\t',index = None)
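
The idea behind the "filter out cis signal" step can be sketched as follows, under the assumption that a pair counts as cis when the variant is on the same chromosome as the gene and within `window` bp of its TSS. `is_cis` and `drop_cis_pairs` are hypothetical helpers written for illustration; they are not tensorQTL functions and may differ from what `trans.filter_cis` does internally.

```python
def is_cis(variant_chrom, variant_pos, gene_chrom, gene_tss, window=1_000_000):
    """True when the variant lies within `window` bp of the TSS on the same chromosome."""
    return variant_chrom == gene_chrom and abs(variant_pos - gene_tss) <= window


def drop_cis_pairs(pairs, variant_pos, gene_tss, window=1_000_000):
    """Keep only (variant_id, gene_id) pairs that are not cis under `is_cis`."""
    kept = []
    for variant_id, gene_id in pairs:
        v_chrom, v_pos = variant_pos[variant_id]
        g_chrom, g_tss = gene_tss[gene_id]
        if not is_cis(v_chrom, v_pos, g_chrom, g_tss, window):
            kept.append((variant_id, gene_id))
    return kept


# Made-up coordinates for illustration only.
variant_pos = {
    "v1": ("chr1", 1_500_000),
    "v2": ("chr2", 1_500_000),
    "v3": ("chr1", 9_000_000),
}
gene_tss = {"geneA": ("chr1", 1_000_000)}
pairs = [("v1", "geneA"), ("v2", "geneA"), ("v3", "geneA")]

trans_pairs = drop_cis_pairs(pairs, variant_pos, gene_tss)
print(trans_pairs)
```

Here `v1` is dropped as cis, while `v2` (another chromosome) and `v3` (8 Mb away) are kept as trans candidates.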

## Association results processing

- For both cis and trans: generate the recipe for yml processing.
- For cis: also consolidate the empirical results across chromosomes and process them.

In [1]:
[cis_2]
input:  group_by = "all"
output: f'{cwd:a}/TensorQTL.cis._recipe.tsv',
        f'{cwd:a}/TensorQTL.cis._column_info.txt',
        f'{cwd:a}/{name}.emprical.cis_sumstats.txt',   
        f'{cwd:a}/{name}.emprical.cis_sumstats.summary'    
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    import csv
    import pandas as pd 
    import numpy as np
    import os, time 
    from multipy.fdr import qvalue
    def fdr(p_vals):
        from scipy.stats import rankdata
        ranked_p_values = rankdata(p_vals)
        fdr = p_vals * len(p_vals) / ranked_p_values
        fdr[fdr > 1] = 1
        return fdr
    data_temp = pd.DataFrame({
    "sumstat_dir" : [$[_input["long_table"]:r,]],
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "molecular_trait_id,chromosome,position,ref,alt",
      "chromosome": "chrom",
      "position": "pos",
      "variant": "variant_id",
      "ref": "ref",
      "alt": "alt",
      "beta": "beta",
      "se": "se",
      "pvalue": "pvalue",
      "TSS_D": "tss_distance",
      "maf": "maf",
      "n" : "n" ,
      "ma_samples": "ma_samples",
      "ac": "ma_count",
      "molecular_trait_id": "molecular_trait_id", "molecular_trait_object_id": "molecular_trait_object_id"}), columns = ["TensorQTL"] )

    data_temp["#chr"] = [x.split(".")[-4].replace("chr","") for x in  [$[_input["long_table"]:r,]]]
    data_temp = data_temp[['#chr', 'sumstat_dir', 'column_info']]
    data_temp.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )

R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container 
    library("purrr")
    library("tidyr")
    library("dplyr")
    library("readr")
    library("qvalue")
    emprical_pd = tibble(map(c($[_input["emprical"]:r,]), ~read_delim(.x,"\t")))%>%unnest()
    emprical_pd["q_beta"] = qvalue(emprical_pd$p_beta)$qvalue
    emprical_pd["q_perm"] = qvalue(emprical_pd$p_perm)$qvalue
    emprical_pd["fdr_beta"] = p.adjust(emprical_pd$p_beta,"fdr")    
    emprical_pd["fdr_perm"] = p.adjust(emprical_pd$p_perm,"fdr")    
    summary = tibble("fdr_perm_0.05" =  sum(emprical_pd["fdr_perm"] < 0.05) , 
                      "fdr_beta_0.05" = sum(emprical_pd["fdr_beta"] < 0.05),
                      "q_perm_0.05" = sum(emprical_pd["q_perm"] < 0.05) ,
                      "q_beta_0.05" = sum(emprical_pd["q_beta"] < 0.05) ,
                       "q_perm_0.01" = sum(emprical_pd["q_perm"] < 0.01) ,
                      "q_beta_0.01" = sum(emprical_pd["q_beta"] < 0.01)  )
    emprical_pd%>%write_delim("$[_output[2]]","\t")
    summary%>%write_delim("$[_output[3]]","\t")
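
The `fdr_beta` and `fdr_perm` columns above come from `p.adjust(..., "fdr")`, which is the Benjamini-Hochberg adjustment. For readers who want to see the arithmetic, here is a minimal pure-Python sketch of that adjustment; the function name and the P-values are illustrative, and the pipeline itself relies on the R implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up P-values for illustration.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(adjusted)
```

The running minimum keeps the adjusted values monotone in the raw P-values, so a smaller raw P-value never ends up with a larger adjusted value.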

In [None]:
[trans_2]
input:  group_by = "all"
output: f'{cwd:a}/TensorQTL.{"trans" if len(_input["long_table"]) == len(_input) else "cis"}._recipe.tsv',
        f'{cwd:a}/TensorQTL.{"trans" if len(_input["long_table"]) == len(_input) else "cis"}._column_info.txt'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    import csv
    import pandas as pd 
    data_temp = pd.DataFrame({
    "sumstat_dir" : [$[_input["long_table"]:r,]],
    "column_info" : $[_output[1]:r]
    })
    if "cis" in data_temp.sumstat_dir[0]:
        column_info_df = pd.DataFrame( pd.Series( {"ID": "molecular_trait_id,chromosome,position,ref,alt",
          "chromosome": "chrom",
          "position": "pos",
          "variant": "variant_id",
          "ref": "ref",
          "alt": "alt",
          "beta": "beta",
          "se": "se",
          "pvalue": "pvalue",
          "TSS_D": "tss_distance",
          "maf": "maf",
          "n" : "n" ,
          "ma_samples": "ma_samples",
          "ac": "ma_count",
          "molecular_trait_id": "molecular_trait_id", "molecular_trait_object_id": "molecular_trait_object_id"}), columns = ["TensorQTL"] )

        data_temp["#chr"] = [x.split(".")[-4].replace("chr","") for x in  [$[_input["long_table"]:r,]]]
        data_temp = data_temp[['#chr', 'sumstat_dir', 'column_info']]

    else:
        column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "chromosome": "chrom",
          "position": "pos",
          "ref": "ref",
          "alt": "alt",
          "variant": "variant_id",
          "beta": "beta",
          "se": "se",
          "pvalue": "pvalue",
          "maf": "maf",
          "molecular_trait_id": "molecular_trait_id"}), columns = ["TensorQTL"] )
    data_temp.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )