# LDSC-SEG on TWAS Meta-Analysis Results for OUD

**Author**: Jesse Marks<br>
**NIH Project**: [Harnessing Knowledge of Gene Function in Brain Tissue for Discovering Biology Underlying Heroin Addiction](https://reporter.nih.gov/search/RC99reuHhEW0n_3WuFPU6g/project-details/10116351) <br>
**Charge Code**: 0215889.001.001<br>
**GitHub Issue**:  [Opioid Use Disorder TWAS Meta-analysis (Uniform Processing) #183](https://github.com/RTIInternational/bioinformatics/issues/183)<br>


**Description**:<br>
This notebook outlines the steps taken to test whether the significantly differentially expressed genes from our TWAS meta-analysis show heritability enrichment across 39 phenotypes.

To do so, we perform LD score regression (LDSC) analyses. More specifically, we use the approach described in [Finucane et al. 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5896795/): LD score regression applied to specifically expressed genes (LDSC-SEG). LDSC-SEG uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. This approach leverages gene expression data to help interpret GWAS signal. The gene sets we use are from a TWAS meta-analysis of four OUD case/control datasets:

- [Corradin et al. (2022) Molecular Psychiatry](https://doi.org/10.1038/s41380-022-01477-y)
- [Mendez et al. (2021) Molecular Psychiatry](https://doi.org/10.1038/s41380-021-01259-y)
- [Seney et al. (2021) Biological Psychiatry](https://doi.org/10.1016/j.biopsych.2021.06.007)
- [Sosnowski et al. (2022) Drug and Alcohol Dependence Reports](https://doi.org/10.1016/j.dadr.2022.100040)

These meta-analysis results are posted in [this GitHub comment](https://github.com/RTIInternational/bioinformatics/issues/183#issuecomment-1198311459). We use the results with singletons (genes that appear in only one study) removed. The file, `meta_analysis_sumstats_no_singletons_20220727.tsv.gz`, is listed under the Results section.
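To make the singleton rule concrete, here is a minimal sketch of the filter. It assumes that the `num_datasets` column counts the studies in which a gene appears; the helper `drop_singletons` and the toy rows are hypothetical, and the posted file has already had this step applied.

```python
def drop_singletons(rows, count_column="num_datasets"):
    """Keep rows for genes observed in at least two studies.

    Assumption: `count_column` holds the number of studies with data for the gene.
    """
    return [row for row in rows if int(float(row[count_column])) >= 2]


# Toy rows (hypothetical gene IDs), shaped like records read from the TSV.
toy_rows = [
    {"gencode_id": "GENE_A", "num_datasets": "4"},
    {"gencode_id": "GENE_B", "num_datasets": "1"},
    {"gencode_id": "GENE_C", "num_datasets": "2"},
]
kept_ids = [row["gencode_id"] for row in drop_singletons(toy_rows)]
print(kept_ids)
```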

We filter those results down to two sets of significantly differentially expressed genes:
* Genes with Benjamini-Hochberg FDR < 0.05
* Genes with Benjamini-Hochberg FDR < 0.10

The column named `wfisher_adj_pvalue` is the one we use to create these two gene lists. It contains the Benjamini-Hochberg-adjusted [weighted Fisher's p-value](https://www.nature.com/articles/s41598-021-86465-y), to which we apply the two FDR thresholds.
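For reference, the adjustment behind a column like `wfisher_adj_pvalue` can be sketched in a few lines of Python. This is an illustration only, assuming the standard Benjamini-Hochberg step-up procedure; the meta-analysis file already contains adjusted values, and the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example_pvalues = [0.01, 0.04, 0.03, 0.20]  # made-up values
example_adjusted = bh_adjust(example_pvalues)

# The two gene lists correspond to thresholding the adjusted values.
passing_005 = [p for p, q in zip(example_pvalues, example_adjusted) if q < 0.05]
passing_010 = [p for p, q in zip(example_pvalues, example_adjusted) if q < 0.10]
```

Because the FDR < 0.05 list is a subset of the FDR < 0.10 list, the second gene set trades a higher expected false discovery rate for more genes.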

The GWAS results we use in these LDSC-SEG analyses are the same set used in the paper [Multi-trait genome-wide association study of opioid addiction: OPRM1 and Beyond](https://www.medrxiv.org/content/10.1101/2021.09.13.21263503v1). In particular, we test whether the significantly differentially expressed genes from the opioid use disorder meta-analysis show heritability enrichment in any of the following EUR-specific studies, which include the GENOA GWAS meta-analysis:

___

<br><br>

<details>
    <summary>phenotype list</summary>
    
* Age of Initiation  (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Alcohol Dependence (Walters et al., 2018 Nat Neurosci [30482948](https://pubmed.ncbi.nlm.nih.gov/30482948))
* Alcohol Drinks per Week (DPW) (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Alzheimer's Disease (Lambert et al., 2013 Nat Genet [24162737](https://pubmed.ncbi.nlm.nih.gov/24162737))
* Amyotrophic Lateral Sclerosis (Rheenen et al., 2016 Nat Genet [27455348](https://pubmed.ncbi.nlm.nih.gov/27455348))
* Anorexia Nervosa (Watson et al., 2019 Nat Genet [31308545](https://pubmed.ncbi.nlm.nih.gov/31308545))
* Attention Deficit Hyperactivity Disorder (Demontis et al., 2019 Nat Genet [30478444](https://pubmed.ncbi.nlm.nih.gov/30478444))
* Autism Spectrum Disorders (Grove et al., 2019 Nat Genet [30804558](https://pubmed.ncbi.nlm.nih.gov/30804558))
* Bipolar Disorder (Stahl et al., 2019 Nat Genet [31043756](https://pubmed.ncbi.nlm.nih.gov/31043756))
* Cannabis Use Disorder (CUD) (Demontis et al., 2019 Nat Neurosci [31209380](https://pubmed.ncbi.nlm.nih.gov/31209380))
* Childhood IQ (Benyamin et al., 2014 Mol Psychiatry [23358156](https://pubmed.ncbi.nlm.nih.gov/23358156))
* Cigarettes Per Day (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* College Completion (Rietveld et al., 2013 Science [23722424](https://pubmed.ncbi.nlm.nih.gov/23722424))
* Cotinine Levels (Ware et al., 2016 Sci Rep [26833182](https://pubmed.ncbi.nlm.nih.gov/26833182/))
* Fagerstrom Test for Nicotine Dependence (FTND) (Quach et al., 2020 Nat Commun [33144568](https://pubmed.ncbi.nlm.nih.gov/33144568/))
* Heaviness of Smoking Index (HSI) (Quach et al., 2020 Nat Commun [33144568](https://pubmed.ncbi.nlm.nih.gov/33144568/))
* Intelligence (Sniekers et al., 2017 Nat Genet [28530673](https://pubmed.ncbi.nlm.nih.gov/28530673))
* Lifetime Cannabis Use (Ever vs. Never) (Pasman et al., 2018 Nat Neurosci [30150663](https://pubmed.ncbi.nlm.nih.gov/30150663))
* Major Depressive Disorder (Howard et al., 2018 Nat Commun [29662059](https://pubmed.ncbi.nlm.nih.gov/29662059))
* Mean Accumbens Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Caudate Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Hippocampus Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Pallidum Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Putamen Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Thalamus Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Neo-conscientiousness (de Moor et al., 2012 Mol Psychiatry [21173776](https://pubmed.ncbi.nlm.nih.gov/21173776))
* Neo-openness to Experience (de Moor et al., 2012 Mol Psychiatry [21173776](https://pubmed.ncbi.nlm.nih.gov/21173776))
* Neuroticism (Okbay et al., 2016 Nat Genet [27089181](https://pubmed.ncbi.nlm.nih.gov/27089181))
* Opioid Addiction: GENOA GWAS meta-analysis
* Opioid Addiction: gSEM OA GWAS meta-analysis (i.e., GENOA, MVP-SAGE-YP, PGC-SUD, and Partners Health)
* Parkinson's Disease (Sanchez et al., 2009 Nat Genet [19915575](https://pubmed.ncbi.nlm.nih.gov/19915575))
* Post-traumatic Stress Disorder (Nievergelt et al., 2019 Nat Commun [31594949](https://pubmed.ncbi.nlm.nih.gov/31594949))
* Psychiatric Genetics Consortium Cross-disorder GWAS (Schizophrenia, Bipolar Disorder, MDD, ASD and ADHD) (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013 Lancet [23453885](https://pubmed.ncbi.nlm.nih.gov/23453885))
* Schizophrenia (Ripke et al., 2014 Nature [25056061](https://pubmed.ncbi.nlm.nih.gov/25056061))
* Smoking Cessation (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Smoking Initiation (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Subjective Well Being (Okbay et al., 2016 Nat Genet [27089181](https://pubmed.ncbi.nlm.nih.gov/27089181))
* Total Intracranial Volume (ICV) (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Years of Schooling (Okbay et al., 2016 Nature [27225129](https://pubmed.ncbi.nlm.nih.gov/27225129))
</details><br><br>
    
    

# Genomic start-end locations for GENCODE v30 IDs
Retrieve the genomic start and end locations for the GENCODE v30 IDs of the genes that are significant in the meta-analysis at FDR < 0.05 with singletons excluded. The meta-analysis summary statistics without singletons are posted in this [comment](https://github.com/RTIInternational/bioinformatics/issues/183#issuecomment-1198311459) as the file `meta_analysis_sumstats_no_singletons_20220727.tsv.gz`.

In [17]:
%%bash

gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | head -2

gencode_id	gene_name	base_mean_expression_corradin	base_mean_log2cpm_corradin	fold_change_corradin	log2_fold_change_corradin	log2_fold_change_se_corradin	test_statistic_corradin	b_statistic_corradin	pvalue_corradin	adjusted_pvalue_corradin	base_mean_expression_mendez	base_mean_log2cpm_mendez	fold_change_mendez	log2_fold_change_mendez	log2_fold_change_se_mendez	test_statistic_mendez	b_statistic_mendez	pvalue_mendez	adjusted_pvalue_mendez	base_mean_expression_seney	base_mean_log2cpm_seney	fold_change_seney	log2_fold_change_seney	log2_fold_change_se_seney	test_statistic_seney	b_statistic_seney	pvalue_seney	adjusted_pvalue_seney	base_mean_expression_sosnowski	base_mean_log2cpm_sosnowski	fold_change_sosnowski	log2_fold_change_sosnowski	log2_fold_change_se_sosnowski	test_statistic_sosnowski	b_statistic_sosnowski	pvalue_sosnowski	adjusted_pvalue_sosnowski	num_datasets	fc_sign_corradin	fc_sign_mendez	fc_sign_seney	fc_sign_sosnowski	wfisher_fc_sign	wfisher_pvalue	wfisher_adj_pvalue
ENSG00000000

In [24]:
%%bash

# verify the last column is the wfisher_adj_pvalue column (the Benjamini-Hochberg pvalue)
gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | head -2 | awk \
'{print $NF}'

# create the FDR-filtered files (0.05 and 0.10):
# keep the header row plus every row whose last column is below the threshold
for fdr in 0.05 0.10; do
  gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | \
    awk -v fdr="$fdr" 'NR == 1 || $NF < fdr' \
    > meta_analysis_sumstats_no_singletons_20220727_fdr${fdr}.tsv
done

wfisher_adj_pvalue
0.32213328944865
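
The same filter can be written in Python, which selects the column by name instead of relying on it being the last one. This is a minimal sketch: the function `filter_by_fdr` and the toy rows are hypothetical, and only the column name `wfisher_adj_pvalue` comes from the file above.

```python
import csv
import io


def filter_by_fdr(handle, threshold, column="wfisher_adj_pvalue"):
    """Return (header, rows) for rows whose adjusted p-value is below threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            q = float(row[column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value (e.g. "NA")
        if q < threshold:
            kept.append(row)
    return reader.fieldnames, kept


# Toy input with made-up values, standing in for the decompressed TSV.
toy = io.StringIO(
    "gencode_id\twfisher_pvalue\twfisher_adj_pvalue\n"
    "GENE_A\t0.0001\t0.004\n"
    "GENE_B\t0.01\t0.08\n"
    "GENE_C\t0.09\tNA\n"
)
header, hits_005 = filter_by_fdr(toy, 0.05)
print([row["gencode_id"] for row in hits_005])
```

For the real file, the handle could come from `gzip.open(path, "rt", newline="")`.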


In [None]:
%%bash

# download gencode v30 annotation file
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh37_mapping/gencode.v30lift37.annotation.gtf.gz

gunzip --to-stdout gencode.v30lift37.annotation.gtf.gz | head

#description: evidence-based annotation of the human genome, version 30 (Ensembl 96), mapped to GRCh37 with gencode-backmap
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gtf
#date: 2019-04-02
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972.5_2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2_2"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.5_2"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; tag "basic"; havana_gene "OTTHUMG00000000961.2_2"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.5_2"; tr

<br><br>

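Get the start and end positions for each gene. According to the [GENCODE data format description](https://www.gencodegenes.org/pages/data_format.html), these are stored in columns 4 and 5 of the GTF file: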

| column-number   | content                | values/format           |
|-----------------|------------------------|-------------------------|
| 4               | genomic start location | integer-value (1-based) |
| 5               | genomic end location   | integer-value           |
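
As an illustration of how those columns are used, the sketch below parses a single GTF `gene` line into chromosome, start, end, and gene ID. The function name `parse_gtf_gene_line` is hypothetical; the example line is a shortened copy of the first record shown in the annotation preview above.

```python
def parse_gtf_gene_line(line):
    """Return (chrom, start, end, gene_id) for a 'gene' feature, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None  # header comment, malformed line, or non-gene feature
    attributes = {}
    # Column 9 holds 'key value;' pairs; string values are double-quoted.
    for item in fields[8].strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes.setdefault(key, value.strip('"'))
    # Columns 4 and 5 (indices 3 and 4) are the 1-based start and end.
    return fields[0], int(fields[3]), int(fields[4]), attributes["gene_id"]


example = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5_2"; gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2;'
)
parsed = parse_gtf_gene_line(example)
print(parsed)
```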

In [30]:
import pandas as pd

df = pd.read_csv("meta_analysis_sumstats_no_singletons_20220727.tsv.gz", compression="gzip", sep="\t")
# the first 5 lines of the GTF are "#" header comments, so skip them
ann = pd.read_csv("gencode.v30lift37.annotation.gtf.gz", sep="\t", compression="gzip", skiprows=5, header=None)
ann.columns = ["chromosome_name", "annotation_source", "feature_type","genomic_start_location","genomic_end_location","score(not_used)","genomic_strand","genomic_phase", "additional_information"]

df.head()
ann.head()

Unnamed: 0,gencode_id,gene_name,base_mean_expression_corradin,base_mean_log2cpm_corradin,fold_change_corradin,log2_fold_change_corradin,log2_fold_change_se_corradin,test_statistic_corradin,b_statistic_corradin,pvalue_corradin,...,pvalue_sosnowski,adjusted_pvalue_sosnowski,num_datasets,fc_sign_corradin,fc_sign_mendez,fc_sign_seney,fc_sign_sosnowski,wfisher_fc_sign,wfisher_pvalue,wfisher_adj_pvalue
0,ENSG00000000003.14,TSPAN6,280.590909,2.15464,0.972325,-0.04049,0.115515,-0.350514,-5.561858,0.728294,...,0.473151,0.887645,4,-1.0,1.0,1.0,1.0,1,0.091392,0.322133
1,ENSG00000000419.12,DPM1,393.045477,2.667048,0.888755,-0.170143,0.092914,-1.831176,-4.277967,0.076581,...,0.762672,0.9596,4,-1.0,-1.0,1.0,-1.0,-1,0.363189,0.609168
2,ENSG00000000457.14,SCYL3,543.378159,3.128846,0.934389,-0.097905,0.064771,-1.511541,-4.832444,0.140654,...,0.439635,0.876792,4,-1.0,-1.0,1.0,1.0,-1,0.403248,0.637638
3,ENSG00000000460.17,C1orf112,155.69175,1.32931,0.975023,-0.036492,0.12713,-0.287042,-5.374356,0.775967,...,0.401802,0.865274,4,-1.0,1.0,-1.0,-1.0,-1,0.832877,0.935349
4,ENSG00000000938.13,FGR,44.590909,-0.829102,0.824932,-0.277653,0.283458,-0.979521,-4.768122,0.334812,...,0.706047,0.949224,4,-1.0,1.0,1.0,1.0,1,0.266319,0.525174


Unnamed: 0,chromosome_name,annotation_source,feature_type,genomic_start_location,genomic_end_location,score(not_used),genomic_strand,genomic_phase,additional_information
0,chr1,HAVANA,gene,11869,14409,.,+,.,"gene_id ""ENSG00000223972.5_2""; gene_type ""tran..."
1,chr1,HAVANA,transcript,11869,14409,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."
2,chr1,HAVANA,exon,11869,12227,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."
3,chr1,HAVANA,exon,12613,12721,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."
4,chr1,HAVANA,exon,13221,14409,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."


In [88]:
import gzip

fdr = "0.05"
in1 = "meta_analysis_sumstats_no_singletons_20220727.tsv.gz"
out1 = f"meta_analysis_sumstats_no_singletons_20220727_fdr{fdr}_startend_pos.tsv"
annfile = "gencode.v30lift37.annotation.gtf.gz"

# map each GENCODE gene ID to [chr, genomic start, genomic end]
gencode_dic = {}
with gzip.open(annfile, 'rt') as annF:
    for line in annF:
        if line.startswith("#"):  # skip the GTF header comments
            continue
        sl = line.rstrip("\n").split("\t")
        if sl[2] == "gene":
            gencode = sl[8].split(";")[0] # remove all additional_info except "gene_id <ENSG...>"
            gencode = gencode.split(" ")[1].strip('"') # remove "gene_id" portion, and double quotes
            gencode = gencode.split("_")[0] # remove trailing underscore+number from <ENSG...>
            gencode_dic[gencode] = [sl[0], sl[3], sl[4]] # chr, start-, and end-genomic position

print(dict(list(gencode_dic.items())[0:2]))

with gzip.open(in1, 'rt') as inF, open(out1, 'w') as outF:
    outF.write("chr\tgencode_id\tgenomic_start_location\tgenomic_end_location\n")
    next(inF)  # skip the header row
    for line in inF:
        sl = line.split()
        gencode = sl[0]
        wfisher_adj_p = sl[-1]  # last column is wfisher_adj_pvalue
        if float(wfisher_adj_p) < float(fdr):
            if gencode in gencode_dic:
                chrom, start, end = gencode_dic[gencode]
                outF.write(f"{chrom}\t{gencode}\t{start}\t{end}\n")
            else:
                print(gencode)  # significant gene with no match in the annotation

{'ENSG00000223972.5': ['chr1', '11869', '14409'], 'ENSG00000227232.5': ['chr1', '14404', '29570']}
ENSG00000198521.11
ENSG00000277209.1
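
The two IDs printed above have no exact match in the annotation. The notebook does not establish why; one possibility, not verified here, is that the gene version in the meta-analysis file differs from the one in the lifted annotation. Under that assumption, a fallback lookup on the unversioned Ensembl ID could be sketched as follows. The helpers `build_unversioned_index` and `lookup_gene` are hypothetical, and the second and third query IDs are made up for illustration.

```python
def strip_version(gencode_id):
    """Drop the '.N' version suffix, e.g. 'ENSG00000223972.5' -> 'ENSG00000223972'."""
    return gencode_id.split(".")[0]


def build_unversioned_index(coords):
    """Index the coordinate dictionary by unversioned gene ID."""
    return {strip_version(gene_id): value for gene_id, value in coords.items()}


def lookup_gene(gencode_id, coords, unversioned):
    """Try an exact match first, then fall back to the unversioned ID."""
    if gencode_id in coords:
        return coords[gencode_id], "exact"
    hit = unversioned.get(strip_version(gencode_id))
    if hit is not None:
        return hit, "unversioned"
    return None, "missing"


# Toy dictionary shaped like gencode_dic from the cell above.
coords = {"ENSG00000223972.5": ["chr1", "11869", "14409"]}
index = build_unversioned_index(coords)

exact = lookup_gene("ENSG00000223972.5", coords, index)
fallback = lookup_gene("ENSG00000223972.4", coords, index)   # made-up version
missing = lookup_gene("ENSG00000000000.1", coords, index)    # made-up ID
```

Returning the match type lets any gene recovered through the fallback be flagged, since coordinates can change between gene versions.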


In [55]:
# sanity check: after stripping the quotes and the "_N" suffix,
# the first GENCODE ID in the annotation should be a key in the dictionary
if "ENSG00000223972.5" in gencode_dic:
    print("yes")

yes
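
LDSC-SEG tests the regions surrounding each gene rather than the gene body alone; Finucane et al. 2018 describe adding a 100 kb window on either side of the transcribed region. The sketch below shows that padding applied to rows shaped like the start/end file written above. The function name `pad_gene_window` and its default are assumptions for illustration, not this notebook's downstream code.

```python
def pad_gene_window(start, end, window=100_000):
    """Extend a 1-based gene interval by `window` bp on each side.

    The lower bound is clamped at position 1. The upper bound is not clamped
    to the chromosome length, which this sketch does not know.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    return max(1, start - window), end + window


# One row shaped like the output file: chr, gencode_id, start, end.
rows = [("chr1", "ENSG00000223972.5", 11869, 14409)]
padded = [(chrom, gene, *pad_gene_window(start, end)) for chrom, gene, start, end in rows]
print(padded)
```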
