# LDSC-SEG on TWAS Meta-Analysis Results for OUD

**Author**: Jesse Marks<br>
**NIH Project**: [Harnessing Knowledge of Gene Function in Brain Tissue for Discovering Biology Underlying Heroin Addiction](https://reporter.nih.gov/search/RC99reuHhEW0n_3WuFPU6g/project-details/10116351) <br>
**Charge Code**: 0215889.001.001<br>
**GitHub Issue**:  [Opioid Use Disorder TWAS Meta-analysis (Uniform Processing) #183](https://github.com/RTIInternational/bioinformatics/issues/183)<br>


**Description**:<br>
This notebook outlines the steps taken to test whether the significantly differentially expressed genes from our TWAS meta-analysis show heritability enrichment across 39 phenotypes.

To do so, we perform LD score regression (LDSC) analyses. More specifically, we use the approach described in [Finucane et al. 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5896795/), which is LD score regression applied to specifically expressed genes (LDSC-SEG). LDSC-SEG uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. This approach is a powerful way to leverage gene expression data to help interpret GWAS signal. The gene sets we use are from a TWAS meta-analysis of 4 OUD case/control datasets:

- [Corradin et al. (2022) Molecular Psychiatry](https://doi.org/10.1038/s41380-022-01477-y)
- [Mendez et al. (2021) Molecular Psychiatry](https://doi.org/10.1038/s41380-021-01259-y)
- [Seney et al. (2021) Biological Psychiatry](https://doi.org/10.1016/j.biopsych.2021.06.007)
- [Sosnowski et al. (2022) Drug and Alcohol Dependence Reports](https://doi.org/10.1016/j.dadr.2022.100040)

These meta-analysis results are published in [this GitHub comment](https://github.com/RTIInternational/bioinformatics/issues/183#issuecomment-1198311459). We use the results with singletons (genes that appear in only one study) removed. The file name is `meta_analysis_sumstats_no_singletons_20220727.tsv.gz` and it is under the Results section.

We filter those results to obtain two sets of significantly differentially expressed genes:
* List of genes with Benjamini-Hochberg FDR <0.05
* List of genes with Benjamini-Hochberg FDR <0.10

The `wfisher_adj_pvalue` column is the one we use to create these two gene lists. It contains the Benjamini-Hochberg-adjusted [weighted Fisher's p-value](https://www.nature.com/articles/s41598-021-86465-y), to which we apply the two FDR thresholds.
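
The thresholding step can be sketched in Python as follows. This is a minimal illustration rather than the code used for the analysis: the toy gene IDs and p-values are made up, and the helper name `genes_below_fdr` is hypothetical. Only the column names `gencode_id` and `wfisher_adj_pvalue` come from the summary statistics file.

```python
import csv
import io

# Toy stand-in for the summary statistics table.
# The gene IDs and adjusted p-values are made up for illustration only.
TOY_TSV = (
    "gencode_id\twfisher_adj_pvalue\n"
    "GENE_A.1\t0.03\n"
    "GENE_B.2\t0.07\n"
    "GENE_C.3\t0.20\n"
)


def genes_below_fdr(tsv_text, threshold, column="wfisher_adj_pvalue"):
    """Return the gene IDs whose adjusted p-value is strictly below `threshold`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["gencode_id"] for row in reader if float(row[column]) < threshold]


# One gene list per FDR threshold used in this notebook.
gene_lists = {t: genes_below_fdr(TOY_TSV, t) for t in (0.05, 0.10)}
for threshold, genes in gene_lists.items():
    print(threshold, genes)
```

Because the column already holds adjusted p-values, the filter is a plain comparison and no further multiple-testing correction is applied here.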

The GWAS results we use in these LDSC-SEG analyses are the same set used in the paper [Multi-trait genome-wide association study of opioid addiction: OPRM1 and Beyond](https://www.medrxiv.org/content/10.1101/2021.09.13.21263503v1). In particular, we test whether the significantly differentially expressed genes from the opioid use disorder TWAS meta-analysis show heritability enrichment in any of the following EUR-specific studies (which include the GENOA GWAS meta-analysis):

___

<br><br>

<details>
    <summary>phenotype list</summary>
    
* Age of Initiation  (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Alcohol Dependence (Walters et al., 2018 Nat Neurosci [30482948](https://pubmed.ncbi.nlm.nih.gov/30482948))
* Alcohol Drinks per Week (DPW) (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Alzheimer's Disease (Lambert et al., 2013 Nat Genet [24162737](https://pubmed.ncbi.nlm.nih.gov/24162737))
* Amyotrophic Lateral Sclerosis (Rheenen et al., 2016 Nat Genet [27455348](https://pubmed.ncbi.nlm.nih.gov/27455348))
* Anorexia Nervosa (Watson et al., 2019 Nat Genet [31308545](https://pubmed.ncbi.nlm.nih.gov/31308545))
* Attention Deficit Hyperactivity Disorder (Demontis et al., 2019 Nat Genet [30478444](https://pubmed.ncbi.nlm.nih.gov/30478444))
* Autism Spectrum Disorders (Grove et al., 2019 Nat Genet [30804558](https://pubmed.ncbi.nlm.nih.gov/30804558))
* Bipolar Disorder (Stahl et al., 2019 Nat Genet [31043756](https://pubmed.ncbi.nlm.nih.gov/31043756))
* Cannabis Use Disorder (CUD) (Demontis et al., 2019 Nat Neurosci [31209380](https://pubmed.ncbi.nlm.nih.gov/31209380))
* Childhood IQ (Benyamin et al., 2014 Mol Psychiatry [23358156](https://pubmed.ncbi.nlm.nih.gov/23358156))
* Cigarettes Per Day (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* College Completion (Rietveld et al., 2013 Science [23722424](https://pubmed.ncbi.nlm.nih.gov/23722424))
* Cotinine Levels (Ware et al., 2016 Sci Rep [26833182](https://pubmed.ncbi.nlm.nih.gov/26833182/))
* Fagerstrom Test for Nicotine Dependence (FTND) (Quach et al., 2020 Nat Commun [33144568](https://pubmed.ncbi.nlm.nih.gov/33144568/))
* Heaviness of Smoking Index (HSI) (Quach et al., 2020 Nat Commun [33144568](https://pubmed.ncbi.nlm.nih.gov/33144568/))
* Intelligence (Sniekers et al., 2017 Nat Genet [28530673](https://pubmed.ncbi.nlm.nih.gov/28530673))
* Lifetime Cannabis Use (Ever vs. Never) (Pasman et al., 2018 Nat Neurosci [30150663](https://pubmed.ncbi.nlm.nih.gov/30150663))
* Major Depressive Disorder (Howard et al., 2018 Nat Commun [29662059](https://pubmed.ncbi.nlm.nih.gov/29662059))
* Mean Accumbens Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Caudate Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Hippocampus Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Pallidum Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Putamen Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Mean Thalamus Volume (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Neo-conscientiousness (de Moor et al., 2012 Mol Psychiatry [21173776](https://pubmed.ncbi.nlm.nih.gov/21173776))
* Neo-openness to Experience (de Moor et al., 2012 Mol Psychiatry [21173776](https://pubmed.ncbi.nlm.nih.gov/21173776))
* Neuroticism (Okbay et al., 2016 Nat Genet [27089181](https://pubmed.ncbi.nlm.nih.gov/27089181))
* Opioid Addiction: GENOA GWAS meta-analysis
* Opioid Addiction: gSEM OA GWAS meta-analysis (i.e., GENOA, MVP-SAGE-YP, PGC-SUD, and Partners Health)
* Parkinson's Disease (Sanchez et al., 2009 Nat Genet [19915575](https://pubmed.ncbi.nlm.nih.gov/19915575))
* Post-traumatic Stress Disorder (Nievergelt et al., 2019 Nat Commun [31594949](https://pubmed.ncbi.nlm.nih.gov/31594949))
* Psychiatric Genetics Consortium Cross-disorder GWAS (Schizophrenia, Bipolar Disorder, MDD, ASD and ADHD) (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013 Lancet [23453885](https://pubmed.ncbi.nlm.nih.gov/23453885))
* Schizophrenia (Ripke et al., 2014 Nature [25056061](https://pubmed.ncbi.nlm.nih.gov/25056061))
* Smoking Cessation (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Smoking Initiation (Liu et al., 2019 Nat Genet [30643251](https://pubmed.ncbi.nlm.nih.gov/30643251/))
* Subjective Well Being (Okbay et al., 2016 Nat Genet [27089181](https://pubmed.ncbi.nlm.nih.gov/27089181))
* Total Intracranial Volume (ICV) (Hibar et al., 2015 Nature [25607358](https://pubmed.ncbi.nlm.nih.gov/25607358/))
* Years of Schooling (Okbay et al., 2016 Nature [27225129](https://pubmed.ncbi.nlm.nih.gov/27225129))
</details><br><br>
    
    

## Genomic start-end locations for GENCODE v30 IDs
Retrieve the genomic start and end locations for the GENCODE v30 IDs of the meta-analysis genes that are significant at FDR <0.05 and FDR <0.10 when singletons are excluded. Meta-analysis summary statistics without singletons are posted in this [comment](https://github.com/RTIInternational/bioinformatics/issues/183#issuecomment-1198311459) as file `meta_analysis_sumstats_no_singletons_20220727.tsv.gz`.

In [17]:
%%bash

gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | head -2

gencode_id	gene_name	base_mean_expression_corradin	base_mean_log2cpm_corradin	fold_change_corradin	log2_fold_change_corradin	log2_fold_change_se_corradin	test_statistic_corradin	b_statistic_corradin	pvalue_corradin	adjusted_pvalue_corradin	base_mean_expression_mendez	base_mean_log2cpm_mendez	fold_change_mendez	log2_fold_change_mendez	log2_fold_change_se_mendez	test_statistic_mendez	b_statistic_mendez	pvalue_mendez	adjusted_pvalue_mendez	base_mean_expression_seney	base_mean_log2cpm_seney	fold_change_seney	log2_fold_change_seney	log2_fold_change_se_seney	test_statistic_seney	b_statistic_seney	pvalue_seney	adjusted_pvalue_seney	base_mean_expression_sosnowski	base_mean_log2cpm_sosnowski	fold_change_sosnowski	log2_fold_change_sosnowski	log2_fold_change_se_sosnowski	test_statistic_sosnowski	b_statistic_sosnowski	pvalue_sosnowski	adjusted_pvalue_sosnowski	num_datasets	fc_sign_corradin	fc_sign_mendez	fc_sign_seney	fc_sign_sosnowski	wfisher_fc_sign	wfisher_pvalue	wfisher_adj_pvalue
ENSG00000000

In [24]:
%%bash

# verify the last column is the wfisher_adj_pvalue column (the Benjamini-Hochberg pvalue)
gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | head -2 | awk \
'{print $NF}'

# create FDR filtered file: 0.05
gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | \
  head -1 > meta_analysis_sumstats_no_singletons_20220727_fdr0.05.tsv

gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | tail -n +2 | awk \
'$NF < 0.05 ' >> meta_analysis_sumstats_no_singletons_20220727_fdr0.05.tsv


# create FDR filtered file: 0.10
gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | \
  head -1 > meta_analysis_sumstats_no_singletons_20220727_fdr0.10.tsv

gunzip --to-stdout meta_analysis_sumstats_no_singletons_20220727.tsv.gz | tail -n +2 | awk \
'$NF < 0.10 ' >> meta_analysis_sumstats_no_singletons_20220727_fdr0.10.tsv

wfisher_adj_pvalue
0.32213328944865


In [None]:
%%bash

# download gencode v30 annotation file
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh37_mapping/gencode.v30lift37.annotation.gtf.gz

gunzip --to-stdout gencode.v30lift37.annotation.gtf.gz | head

#description: evidence-based annotation of the human genome, version 30 (Ensembl 96), mapped to GRCh37 with gencode-backmap
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gtf
#date: 2019-04-02
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972.5_2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2_2"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.5_2"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; tag "basic"; havana_gene "OTTHUMG00000000961.2_2"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.5_2"; tr

<br><br>

Get the start and end positions for each gene. According to the [GENCODE data format description](https://www.gencodegenes.org/pages/data_format.html), the GTF columns we need are:

| column-number   | content                | values/format           |
|-----------------|------------------------|-------------------------|
| 4               | genomic start location | integer-value (1-based) |
| 5               | genomic end location   | integer-value           |
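
As a small illustration of how these two columns are read, the sketch below parses a single GTF `gene` line into an unversioned gene ID plus its coordinates. The function name `parse_gtf_gene_line` is hypothetical, and the example line is an abbreviated copy of the first annotation record shown above.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_PATTERN = re.compile(r'gene_id "([^"]+)"')


def parse_gtf_gene_line(line):
    """Return (gene_id, chrom, start, end) for a GTF 'gene' line, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None
    match = GENE_ID_PATTERN.search(fields[8])
    if match is None:
        return None
    # Drop the version suffix, e.g. ENSG00000223972.5_2 -> ENSG00000223972
    gene_id = match.group(1).split(".")[0]
    # Column 4 is the start and column 5 is the end (0-based indices 3 and 4).
    return gene_id, fields[0], int(fields[3]), int(fields[4])


example = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5_2"; '
    'gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2;'
)
parsed = parse_gtf_gene_line(example)
print(parsed)
```

Transcript and exon records return `None`, so only one coordinate pair is kept per gene.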

In [30]:
from IPython.display import display
import pandas as pd

# meta-analysis summary statistics (singletons removed)
df = pd.read_csv("meta_analysis_sumstats_no_singletons_20220727.tsv.gz", compression="gzip", sep="\t")

# GENCODE v30 annotation mapped to GRCh37; skip the 5 comment lines at the top of the file
ann = pd.read_csv("gencode.v30lift37.annotation.gtf.gz", sep="\t", compression="gzip", skiprows=5, header=None)
ann.columns = ["chromosome_name", "annotation_source", "feature_type","genomic_start_location","genomic_end_location","score(not_used)","genomic_strand","genomic_phase", "additional_information"]

display(df.head())
display(ann.head())

Unnamed: 0,gencode_id,gene_name,base_mean_expression_corradin,base_mean_log2cpm_corradin,fold_change_corradin,log2_fold_change_corradin,log2_fold_change_se_corradin,test_statistic_corradin,b_statistic_corradin,pvalue_corradin,...,pvalue_sosnowski,adjusted_pvalue_sosnowski,num_datasets,fc_sign_corradin,fc_sign_mendez,fc_sign_seney,fc_sign_sosnowski,wfisher_fc_sign,wfisher_pvalue,wfisher_adj_pvalue
0,ENSG00000000003.14,TSPAN6,280.590909,2.15464,0.972325,-0.04049,0.115515,-0.350514,-5.561858,0.728294,...,0.473151,0.887645,4,-1.0,1.0,1.0,1.0,1,0.091392,0.322133
1,ENSG00000000419.12,DPM1,393.045477,2.667048,0.888755,-0.170143,0.092914,-1.831176,-4.277967,0.076581,...,0.762672,0.9596,4,-1.0,-1.0,1.0,-1.0,-1,0.363189,0.609168
2,ENSG00000000457.14,SCYL3,543.378159,3.128846,0.934389,-0.097905,0.064771,-1.511541,-4.832444,0.140654,...,0.439635,0.876792,4,-1.0,-1.0,1.0,1.0,-1,0.403248,0.637638
3,ENSG00000000460.17,C1orf112,155.69175,1.32931,0.975023,-0.036492,0.12713,-0.287042,-5.374356,0.775967,...,0.401802,0.865274,4,-1.0,1.0,-1.0,-1.0,-1,0.832877,0.935349
4,ENSG00000000938.13,FGR,44.590909,-0.829102,0.824932,-0.277653,0.283458,-0.979521,-4.768122,0.334812,...,0.706047,0.949224,4,-1.0,1.0,1.0,1.0,1,0.266319,0.525174


Unnamed: 0,chromosome_name,annotation_source,feature_type,genomic_start_location,genomic_end_location,score(not_used),genomic_strand,genomic_phase,additional_information
0,chr1,HAVANA,gene,11869,14409,.,+,.,"gene_id ""ENSG00000223972.5_2""; gene_type ""tran..."
1,chr1,HAVANA,transcript,11869,14409,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."
2,chr1,HAVANA,exon,11869,12227,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."
3,chr1,HAVANA,exon,12613,12721,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."
4,chr1,HAVANA,exon,13221,14409,.,+,.,"gene_id ""ENSG00000223972.5_2""; transcript_id ""..."


In [10]:
import gzip
# https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/make_annot_sample_files/ENSG_coord.txt

fdr = "0.10"  # rerun this cell with fdr = "0.05" to create the FDR <0.05 coordinate file
in1 = "meta_analysis_sumstats_no_singletons_20220727.tsv.gz"
out1 = f"meta_analysis_sumstats_no_singletons_20220727_fdr{fdr}_coord.tsv"
annfile = "gencode.v30lift37.annotation.gtf.gz"

with gzip.open(in1, 'rt') as inF, gzip.open(annfile, 'rt') as annF, open(out1, 'w') as outF:
    for _ in range(5):
        next(annF)
    line = annF.readline()
    
    gencode_dic  = {}
    while line:
        sl = line.split("\t")
        if sl[2] == "gene":
            gencode = sl[8].split(";")[0] # remove all additional_info except "gene_id <ENSG...>"
            gencode = gencode.split(" ")[1].strip('"') # remove "gene_id" portion, and double quotes
            gencode = gencode.split(".")[0] # remove suffix  <ENSG...>
            gencode_dic[gencode] = [sl[0], sl[3], sl[4]] # chr, start-, and end-genomic position
        line = annF.readline()
    
    print(dict(list(gencode_dic.items())[0:2]))

    outF.write("GENE\tCHR\tSTART\tEND\n")
    next(inF)
    line = inF.readline()
    while line:
        sl = line.split()
        gencode = sl[0].split(".")[0]
        wfisher_adj_p = sl[-1]
        if float(wfisher_adj_p) < float(fdr):
            if gencode in gencode_dic:
                chrom = gencode_dic[gencode][0]
                start = gencode_dic[gencode][1]
                end = gencode_dic[gencode][2]
                outline = f"{gencode}\t{chrom}\t{start}\t{end}\n"
                outF.write(outline)
            else:
                print(gencode)
        line = inF.readline()

{'ENSG00000223972': ['chr1', '11869', '14409'], 'ENSG00000227232': ['chr1', '14404', '29570']}
ENSG00000270629
ENSG00000274020
ENSG00000275143
ENSG00000277209
ENSG00000278774
ENSG00000280105


In [8]:
%%bash

# extract just the gene IDs (first column), dropping the header line
tail -n +2 meta_analysis_sumstats_no_singletons_20220727_fdr0.05_coord.tsv | \
  cut -f1 > meta_analysis_sumstats_no_singletons_20220727_fdr0.05_geneset.tsv

tail -n +2 meta_analysis_sumstats_no_singletons_20220727_fdr0.10_coord.tsv | \
  cut -f1 > meta_analysis_sumstats_no_singletons_20220727_fdr0.10_geneset.tsv

## Munge sumstats
Some of the sumstats were already munged, so we use those when we can.
The others we have to download and munge. Struck-through entries in the list below have munged sumstats available.


* ~Age of Initiation (Liu et al., 2019 Nat Genet 30643251)~
* ~Alcohol Dependence (Walters et al., 2018 Nat Neurosci 30482948)~
* ~Alcohol Drinks per Week (DPW) (Liu et al., 2019 Nat Genet 30643251)~
* ~Alzheimer's Disease (Lambert et al., 2013 Nat Genet 24162737)~
* ~Amyotrophic Lateral Sclerosis (Rheenen et al., 2016 Nat Genet 27455348)~
* ~Anorexia Nervosa (Watson et al., 2019 Nat Genet 31308545)~
* ~Attention Deficit Hyperactivity Disorder (Demontis et al., 2019 Nat Genet 30478444)~
* ~Autism Spectrum Disorders (Grove et al., 2019 Nat Genet 30804558)~
* ~Bipolar Disorder (Stahl et al., 2019 Nat Genet 31043756)~
* ~Cannabis Use Disorder (CUD) (Demontis et al., 2019 Nat Neurosci 31209380)~
* Childhood IQ (Benyamin et al., 2014 Mol Psychiatry 23358156)
* ~Cigarettes Per Day (Liu et al., 2019 Nat Genet 30643251)~
* College Completion (Rietveld et al., 2013 Science 23722424)
* ~Cotinine Levels (Ware et al., 2016 Sci Rep 26833182)~
* ~Depressive Symptoms (Okbay et al., 2016 Nat Genet 27089181)~
* ~Fagerstrom Test for Nicotine Dependence (FTND) (Quach et al., 2020 Nat Commun 33144568)~
* ~Heaviness of Smoking Index (HSI) (Quach et al., 2020 Nat Commun 33144568)~
* ~Intelligence (Sniekers et al., 2017 Nat Genet 28530673)~
* ~Lifetime Cannabis Use (Ever vs. Never) (Pasman et al., 2018 Nat Neurosci 30150663)~
* ~Major Depressive Disorder (Howard et al., 2018 Nat Commun 29662059)~
* ~Mean Accumbens Volume (Hibar et al., 2015 Nature 25607358)~
* ~Mean Amygdala Volume (Hibar et al., 2015 Nature 25607358)~
* ~Mean Caudate Volume (Hibar et al., 2015 Nature 25607358)~
* ~Mean Hippocampus Volume (Hibar et al., 2015 Nature 25607358)~
* ~Mean Pallidum Volume (Hibar et al., 2015 Nature 25607358)~
* ~Mean Putamen Volume (Hibar et al., 2015 Nature 25607358)~
* ~Mean Thalamus Volume (Hibar et al., 2015 Nature 25607358)~
* Neo-conscientiousness (de Moor et al., 2012 Mol Psychiatry 21173776)
* Neo-openness to Experience (de Moor et al., 2012 Mol Psychiatry 21173776)
* ~Neuroticism (Okbay et al., 2016 Nat Genet 27089181)~
* ~Opioid Addiction: GENOA GWAS meta-analysis~
* ~Opioid Addiction: gSEM OA GWAS meta-analysis (i.e., GENOA, MVP-SAGE-YP, PGC-SUD, and Partners Health)~
* ~Parkinson's Disease (Sanchez et al., 2009 Nat Genet 19915575)~
* ~Post-traumatic Stress Disorder (Nievergelt et al., 2019 Nat Commun 31594949)~
* Psychiatric Genetics Consortium Cross-disorder GWAS (Schizophrenia, Bipolar Disorder, MDD, ASD and ADHD) (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013 Lancet 23453885)
* ~Schizophrenia (Ripke et al., 2014 Nature 25056061)~
* ~Smoking Cessation (Liu et al., 2019 Nat Genet 30643251)~
* ~Smoking Initiation (Liu et al., 2019 Nat Genet 30643251)~
* Subjective Well Being (Okbay et al., 2016 Nat Genet 27089181)
* Total Intracranial Volume (ICV) (Hibar et al., 2015 Nature 25607358)
* ~Years of Education (Okbay et al., 2022 Nature Genetics  35361970)~

In [None]:
cd sumstats/



aws s3 cp s3://rti-shared/ldsc/data/gscan_liu2019/munged/AgeOfInitiation.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/alcdep_walters2018/munged/pgc_alcdep.eur_discovery.aug2018_release.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/gscan_liu2019/munged/DrinksPerWeek.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/alzheimers_disease_lambert2013_nat_genet/munged/alzheimers_disease_lambert2013_nat_genet.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet/munged/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/anorexia_watson2019_nat_genet/munged/anorexia_watson2019_workflow_ready.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/adhd_demontis2018_nat_genet/munged/daner_meta_filtered_NA_iPSYCH23_PGC11_sigPCs_woSEX_2ell6sd_EUR_Neff_70.meta.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/autism_spectrum_disorder_grove2019_nat_genet/munged/iPSYCH-PGC_ASD_Nov2017.munged.merged.txt.gz . 
aws s3 cp s3://rti-shared/ldsc/data/bipolar_disorder_stahl2019_nat_genet/munged/daner_PGC_BIP32b_mds7a_0416a.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/cannabis_use_disorder_demontis2019_nat_neurosci/munged/CUD_GWAS_iPSYCH_June2019.munged.merged.txt.gz .
# Childhood IQ s3://rti-shared/gwas_publicly_available_sumstats/childhood_intelligence_benyamin2014_mol_psych/raw/CHIC_Summary_Benyamin2014.txt.gz
aws s3 cp s3://rti-shared/ldsc/data/gscan_liu2019/munged/CigarettesPerDay.txt.munged.merged.txt.gz .
# College completion s3://rti-shared/gwas_publicly_available_sumstats/educational_attainment_rietveld2013_science/raw/SSGAC_Rietveld2013.zip
aws s3 cp s3://rti-shared/ldsc/data/cotinine_levels_ware2016_sci_rep/munged/cotinine_ware2016_workflow_ready.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/depressive_symptoms_okbay2016/munged/DS_Full.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/nicotine_dependence_quach2020_nat_commun/munged/ftnd_wave3_eur_quach2020_workflow_ready.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/ukb_hsi/munged/ukb_gwa_003_workflow_ready.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/intelligence_sniekers2017_nat_genet/munged/intelligence_sniekers2017_nat_genet_sumstats_formatted.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/lifetime_cannabis_use_pasman2018_nat_neurosci/munged/cannabis_icc_ukb_workflow_ready.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/major_depressive_disorder_howard2018_nat_commun/munged/pgc_ukb_depression_gwas_workflow_ready.txt.munged.merged.txt.gz .

aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.sumstats.gz . # Mean Accumbens Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.sumstats.gz . # Mean Caudate Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.sumstats.gz . # Mean Hippocampus Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl.sumstats.gz . # Mean Pallidum Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.sumstats.gz . # Mean Putamen Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl.sumstats.gz . # Mean Thalamus Volume (Hibar et al., 2015 Nature 25607358)

# Neo-conscientiousness (de Moor et al., 2012 Mol Psychiatry 21173776)
# Neo-openness to Experience (de Moor et al., 2012 Mol Psychiatry 21173776)

aws s3 cp s3://rti-shared/ldsc/data/neuroticism_okbay2016_nat_genet/munged/neuroticism_okbay2016_nat_genet.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/opioid_addiction_gaddis_mathur2022_sci_rep/munged/cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.sumstats_formatted.sumstats.gz .
aws s3 cp s3://rti-heroin/rti-midas-data/studies/ngc/GenomicSEM/results/29/gSEM/final/munged/genomicSEM_GWAS.oaALL.MVP1_MVP2_YP_SAGE.PGC.Song.table.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/parkinsons_disease_sanchez2009_nat_genet/munged/parkinsons_disease_sanchez2009_nat_genet.sumstats.gz .
aws s3 cp s3://rti-shared/ldsc/data/ptsd_nievergelt2019_nat_commun/munged/pts_eur_freeze2_overall.results.munged.merged.txt.gz .
# pgc needs a liftover from hg18 (s3://rti-shared/gwas_publicly_available_sumstats/cross_disorder_gwas_pgc2013_lancet/raw/)
aws s3 cp s3://rti-shared/ldsc/data/schizophrenia_ripke2014_nature/munged/daner_natgen_pgc_eur.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/gscan_liu2019/munged/SmokingCessation.txt.munged.merged.txt.gz .
aws s3 cp s3://rti-shared/ldsc/data/gscan_liu2019/munged/SmokingInitiation.txt.munged.merged.txt.gz .
# Subjective Well Being (Okbay et al., 2016 Nat Genet 27089181)
# Total Intracranial Volume (ICV) (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/ldsc/data/years_schooling_okbay2022_nat_genet/munged/EA4_additive_excl_23andMe.sumstats.gz . # Years of Education (Okbay et al., 2022 Nature Genetics  35361970)

### Opioid Addiction
initiated restore 11/2/2022

In [None]:
# use ldsc tool munge_sumstats.py to convert to sumstats format (https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)

# source location: s3://rti-heroin/rti-midas-data/studies/ngc/meta/144/processing/
for chr in {1..22}; do
    aws s3 cp s3://rti-heroin/rti-midas-data/studies/ngc/meta/144/processing/oaall/cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.8.tsv.gz .
done

# combine into 1 file
# keep only SNPs with rsIDs, and just keep the rsID portion of MarkerName
zcat cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chr1.maf_gt_0.01.rsq_gt_0.8.tsv.gz  | head -1 > \
    cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.tsv

for chr in {1..22}; do
    awk '{split($1,a,":")} 
        {
            $1=a[1] 
            b=substr($1,1,2)
            {if (b=="rs") 
                {print $0}
            }
        }' OFS="\t" <(zcat cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chr$chr.maf_gt_0.01.rsq_gt_0.8.tsv.gz | tail -n +2) \
            >> cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.tsv
done


# munge:  docker interactive mode
docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.tsv \
    --snp MarkerName \
    --N-cas 7281 \
    --N-con 297550 \
    --a1 Allele1 \
    --a2 Allele2 \
    --p P-value \
    --signed-sumstats Effect,0 \
    --out cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.sumstats_formatted


    
# upload to s3
aws s3 cp cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.sumstats_formatted.sumstats.gz s3://rti-shared/ldsc/data/opioid_addiction_gaddis_mathur2022_sci_rep/munged/
aws s3 cp cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.sumstats_formatted.log s3://rti-shared/ldsc/data/opioid_addiction_gaddis_mathur2022_sci_rep/munged/
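
The rsID filter applied by the `awk` step in the cell above can be sketched in Python as follows. This is an illustrative equivalent only: the helper name `clean_marker_name` and the example `MarkerName` values are hypothetical, chosen to show a colon-delimited marker with and without an rsID prefix.

```python
def clean_marker_name(marker_name):
    """Return the rsID portion of a MarkerName, or None if it is not an rsID."""
    # Keep only the text before the first colon.
    rsid = marker_name.split(":", 1)[0]
    # Discard markers that do not start with "rs".
    return rsid if rsid.startswith("rs") else None


# Hypothetical MarkerName values, for illustration only.
examples = ["rs123:1000:A:G", "1:2000:C:T", "rs456"]
cleaned = [clean_marker_name(name) for name in examples]
print(cleaned)
```

Rows for which the helper returns `None` correspond to the lines that the `awk` step does not print.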

### Intelligence
Sniekers et al., 2017 Nat Genet [28530673](https://pubmed.ncbi.nlm.nih.gov/28530673)

```
## Association results of the meta-analysis for intelligence based on 78,308 individuals in 13 cohorts. 

## Version date: 10-07-2017

#Columns:
Chromosome: chromosome number
position: base pair position of the SNP on the chromosome (reported on GRCh37)
rsid: SNP rs number
ref: effect allele
alt: non-effect allele
N: sample size
MAF: minor allele frequency in UK Biobank
Beta: effect size of the effect allele
SE: standard error of the effect
Zscore: Z-score computed in METAL by a weighted Z-score method
p_value: P-value computed in METAL by a weighted Z-score method
direction: direction of the effect in each of the cohorts, order: CHIC (consisting of 6 cohorts), UKB-wb, UKB-ts, ERF, GENR, HU, MCTFR, STR

Beta/SE were calculated from METAL Z-scores using the formula from Zhu et al (Nature Genetics, 2016):

Beta = Zscore / sqrt( 2 * MAF * ( 1 - MAF) * ( N + Zscore^2 ) )
SE = 1 / sqrt( 2 * MAF * ( 1 - MAF ) * ( N + Zscore^2 ) )
```
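
The two formulas above translate directly into code. The sketch below is only a sanity check, not part of the munging workflow: the function name `beta_se_from_z` is hypothetical, and it assumes the overall sample size of 78,308 for the SNP, since per-SNP sample sizes are not shown in this notebook. The MAF and Z-score are taken from the first data row of the file (rs10875231).

```python
import math


def beta_se_from_z(zscore, maf, n):
    """Convert a METAL Z-score to (Beta, SE) using the formulas quoted above."""
    denom = math.sqrt(2.0 * maf * (1.0 - maf) * (n + zscore ** 2))
    return zscore / denom, 1.0 / denom


# rs10875231: MAF = 0.234588, Zscore = 0.05; N = 78308 is an assumption.
beta, se = beta_se_from_z(zscore=0.05, maf=0.234588, n=78308)
print(beta, se)
```

Under that assumption the result agrees closely with the reported `Beta` and `SE` for this SNP, which is consistent with passing `--N 78308` to `munge_sumstats.py`.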

In [None]:
# use ldsc tool munge_sumstats.py to convert to sumstats format (https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)

# intelligence s3://rti-shared/gwas_publicly_available_sumstats/intelligence_sniekers2017_nat_genet/raw/sumstats.txt.gz
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/intelligence_sniekers2017_nat_genet/raw/sumstats.txt.gz .
zcat sumstats.txt.gz  | head
#Chromosome      position        rsid    ref     alt     MAF     Beta    SE      Zscore  p_value direction
#1       100000012       rs10875231      T       G       0.234588        0.000298163293384453    0.00596326586768906     0.05    0.9599  +-++--+-

# interactive mode
docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats sumstats.txt.gz \
    --snp rsid \
    --N 78308 \
    --a1 ref \
    --a2 alt \
    --p p_value \
    --signed-sumstats Beta,0 \
    --out intelligence_sniekers2017_nat_genet_sumstats_formatted



# upload to s3
aws s3 cp intelligence_sniekers2017_nat_genet_sumstats_formatted.log s3://rti-shared/ldsc/data/intelligence_sniekers2017_nat_genet/munged/
aws s3 cp intelligence_sniekers2017_nat_genet_sumstats_formatted.sumstats.gz s3://rti-shared/ldsc/data/intelligence_sniekers2017_nat_genet/munged/

### Brain Volume
Use the LDSC tool `munge_sumstats.py` to convert each ENIGMA2 brain-volume file to the [sumstats format](https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format).
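
The cells in this section repeat the same `munge_sumstats.py` call for each brain structure, changing only the file name. As a sketch of how that repetition could be generated, the snippet below builds the argument list per structure without running anything. The helper `munge_command` is hypothetical; the flags, column names, and file-name pattern are copied from the cells in this section, and the structure names from the download list earlier in the notebook.

```python
STRUCTURES = [
    "Accumbens", "Amygdala", "Caudate", "Hippocampus",
    "Pallidum", "Putamen", "Thalamus",
]


def munge_command(structure):
    """Build the munge_sumstats.py argument list for one ENIGMA2 structure."""
    prefix = f"ENIGMA2_Mean{structure}_Combined_GenomeControlled_Jan23.tbl"
    return [
        "python", "/opt/ldsc/munge_sumstats.py",
        "--sumstats", f"{prefix}.gz",
        "--snp", "RSID",
        "--N-col", "N",
        "--a1", "Effect_Allele",
        "--a2", "Non_Effect_Allele",
        "--p", "Pvalue",
        "--signed-sumstats", "Effect_Beta,0",
        "--out", prefix,
    ]


commands = {structure: munge_command(structure) for structure in STRUCTURES}
print(" ".join(commands["Accumbens"]))
```

Each list could then be run inside the LDSC container, for example with `subprocess.run`, although the notebook itself runs the commands interactively.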

In [None]:
# Mean Accumbens Volume (Hibar et al., 2015 Nature 25607358)

aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.gz  | head
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 0.9454 1.1303 0.4029 13112

# interactive mode
docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl


# upload to s3
aws s3 cp ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/


In [None]:
# Mean Amygdala Volume (Hibar et al., 2015 Nature 25607358)

aws s3 cp  s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.gz  | head -2
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 2.3536 2.4545 0.3376 13160

# interactive mode
docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl


# upload to s3
aws s3 cp ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/


In [None]:
# Mean Caudate Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.gz  | head
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 3.1005 5.1190 0.5447 13171

docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl


# upload to s3
aws s3 cp ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/

In [None]:
# Mean Hippocampus Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.gz | head -2
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 -7.4896 4.9232 0.1282 13163


docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl

    
# upload to s3
aws s3 cp ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/


In [None]:
# Mean Pallidum Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl.gz  | head -2
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 -3.0672 2.0149 0.1279 13142

docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl


# upload to s3
aws s3 cp ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanPallidum_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/

In [None]:
# Mean Putamen Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.gz | head -2
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 -3.2910 6.2791 0.6002 13145

docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl


# Note: munge_sumstats.py failed on this file with
#   ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.13 (should be close to 0.0). This column may be mislabeled.
# The median was confirmed in R: median(df$Effect_Beta) returns 0.129.
# Following Raymond Walters' suggestion to relax the tolerance threshold slightly
# (see https://groups.google.com/g/ldsc_users/c/RLbVw3e_PU0), I manually edited
# munge_sumstats.py, replacing the original call with the modified one:
#   original: check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.10, sign_cname))
#   modified: check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.15, sign_cname))


# upload to s3
aws s3 cp ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
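
The median check that triggered the `ValueError` above can also be reproduced from the shell without loading the file into R. This is a minimal sketch, assuming whitespace-delimited input with a single header row; `column_median` is a hypothetical helper name and is not part of LDSC.

```shell
# Print the median of one numeric column (1-based index) read from stdin.
# Assumes whitespace-delimited input whose first line is a header.
column_median() {
    awk -v c="$1" 'NR > 1 { print $c }' |
        sort -g |
        awk '
            { v[NR] = $1 }
            END {
                if (NR == 0) { exit 1 }
                if (NR % 2) {
                    print v[(NR + 1) / 2]
                } else {
                    m = (v[NR / 2] + v[NR / 2 + 1]) / 2
                    print m
                }
            }'
}

# Effect_Beta is column 6 in the ENIGMA2 files:
# zcat ENIGMA2_MeanPutamen_Combined_GenomeControlled_Jan23.tbl.gz | column_median 6
```

Running this on each file before munging shows in advance which traits would exceed the median tolerance.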

In [None]:
# Mean Thalamus Volume (Hibar et al., 2015 Nature 25607358)
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl.gz  | head -2
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 -3.0636 6.5794 0.6415 13193

docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl


# upload to s3
aws s3 cp ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl.sumstats.gz s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp ENIGMA2_MeanThalamus_Combined_GenomeControlled_Jan23.tbl.log s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/

In [None]:
# Total Intracranial Volume (ICV) (Hibar et al., 2015 Nature 25607358) 
aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature/ENIGMA2_ICV_Combined_GenomeControlled_Jan23.tbl.gz .

zcat ENIGMA2_ICV_Combined_GenomeControlled_Jan23.tbl.gz  | head -2
#RSID CHR_BP_hg19b37 Effect_Allele Non_Effect_Allele Freq_European_1000Genomes Effect_Beta StdErr Pvalue N
#rs667647 5:29439275 T C 0.347 -148.8340 2029.8618 0.9415 11373

docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats ENIGMA2_ICV_Combined_GenomeControlled_Jan23.tbl.gz \
    --snp RSID \
    --N-col N \
    --a1 Effect_Allele  \
    --a2 Non_Effect_Allele \
    --p Pvalue \
    --signed-sumstats Effect_Beta,0 \
    --out ENIGMA2_ICV_Combined_GenomeControlled_Jan23.tbl


# upload to s3
aws s3 cp  s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
aws s3 cp  s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged/
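
The brain-volume cells above repeat the same download, munge, and upload steps for each ENIGMA2 structure and differ only in the structure name. The sketch below shows how those steps could be generated in a loop. It is a dry run that only prints the commands; `enigma2_prefix` and `print_enigma2_steps` are hypothetical helper names. The Putamen file would still need the relaxed median tolerance described above.

```shell
# Build the file prefix shared by the raw table, the munged sumstats, and the log.
enigma2_prefix() {
    printf 'ENIGMA2_%s_Combined_GenomeControlled_Jan23.tbl\n' "$1"
}

# Print (dry run) the download, munge, and upload steps for one structure.
# Nothing is executed; the output can be reviewed before running it.
print_enigma2_steps() {
    prefix=$(enigma2_prefix "$1")
    raw=s3://rti-shared/gwas_publicly_available_sumstats/brain_volume_hibar2015_nature
    munged=s3://rti-shared/ldsc/data/brain_volume_hibar2015_nature/munged
    echo "aws s3 cp $raw/$prefix.gz ."
    echo "python /opt/ldsc/munge_sumstats.py --sumstats $prefix.gz --snp RSID --N-col N --a1 Effect_Allele --a2 Non_Effect_Allele --p Pvalue --signed-sumstats Effect_Beta,0 --out $prefix"
    echo "aws s3 cp $prefix.sumstats.gz $munged/"
    echo "aws s3 cp $prefix.log $munged/"
}

for structure in MeanAccumbens MeanAmygdala MeanCaudate MeanHippocampus MeanPallidum MeanPutamen MeanThalamus ICV; do
    print_enigma2_steps "$structure"
done
```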

### Personality

In [None]:
# Neo-conscientiousness (de Moor et al., 2012 Mol Psychiatry 21173776)


In [None]:
# Neo-openness to Experience (de Moor et al., 2012 Mol Psychiatry 21173776)


### PGC

In [None]:
# use ldsc tool munge_sumstats.py to convert to sumstats format (https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)


### Subjective Well Being (Okbay et al., 2016 Nat Genet 27089181)


In [None]:
# use ldsc tool munge_sumstats.py to convert to sumstats format (https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)


### Total Intracranial Volume (ICV) (Hibar et al., 2015 Nature 25607358)

In [None]:
# use ldsc tool munge_sumstats.py to convert to sumstats format (https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)


### Years of Education (Okbay et al., 2022 Nature Genetics 35361970)

In [None]:
# use ldsc tool munge_sumstats.py to convert to sumstats format (https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format)

aws s3 cp s3://rti-shared/gwas_publicly_available_sumstats/years_schooling_okbay2022_nat_genet/raw/EA4_additive_excl_23andMe.txt.gz .

zcat EA4_additive_excl_23andMe.txt.gz  | head -2
#rsID    Chr     BP      Effect_allele   Other_allele    EAF_HRC Beta    SE      SE_unadj        P       P_unadj
#rs667647        5       29439275        T       C       0.376548        -0.00032        0.00179 0.00167 0.86    0.8504

docker run -it -v $PWD:/data/ rtibiocloud/ldsc:v1.0.1_9501d4d bash
python /opt/ldsc/munge_sumstats.py \
    --sumstats EA4_additive_excl_23andMe.txt.gz \
    --snp rsID \
    --N 765283 \
    --a1 Effect_allele  \
    --a2 Other_allele \
    --p P \
    --signed-sumstats Beta,0 \
    --out EA4_additive_excl_23andMe

# upload to s3
aws s3 cp EA4_additive_excl_23andMe.sumstats.gz s3://rti-shared/ldsc/data/years_schooling_okbay2022_nat_genet/munged/
aws s3 cp EA4_additive_excl_23andMe.log s3://rti-shared/ldsc/data/years_schooling_okbay2022_nat_genet/munged/
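
The EA4 file has no per-SNP sample-size column, which is why the cell above passes a fixed `--N` rather than `--N-col`. The sketch below checks a header before choosing between the two flags; `has_column` is a hypothetical helper name.

```shell
# Succeed if the header line read from stdin contains a column with exactly this name.
has_column() {
    head -n 1 | tr -s ' \t' '\n\n' | grep -qxF -- "$1"
}

header='rsID Chr BP Effect_allele Other_allele EAF_HRC Beta SE SE_unadj P P_unadj'
if echo "$header" | has_column N; then
    echo "use --N-col N"
else
    echo "use --N with a fixed sample size"
fi
```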

# Round 1

## Create an Annotation File
https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial#partitioned-ld-scores
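
The `--windowsize` option of `make_annot.py` appears to pad each gene in the gene set on both sides before SNPs are assigned to the annotation, with the padded start clipped at 1. The sketch below reproduces that interval arithmetic. It assumes a coordinate file with a header row and the columns `GENE CHR START END`; `pad_gene_windows` is a hypothetical helper, and the gene names and coordinates in the example are made up for illustration.

```shell
# Pad each gene interval by a fixed window on both sides, clipping the start at 1.
# Input: header line, then GENE CHR START END (whitespace-delimited).
pad_gene_windows() {
    awk -v w="$1" -v OFS='\t' '
        NR == 1 { $1 = $1; print; next }   # re-emit the header tab-delimited
        {
            start = $3 - w
            if (start < 1) { start = 1 }
            print $1, $2, start, $4 + w
        }'
}

printf 'GENE CHR START END\nGENE_A 1 50000 60000\nGENE_B 1 500000 520000\n' | pad_gene_windows 100000
```

With a 100 kb window, a gene that starts within 100 kb of the chromosome start has its padded interval begin at position 1.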

In [None]:
# download files needed for partitioned heritability analysis
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_baseline_ldscores.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_plinkfiles.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_frq.tgz

wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/weights_hm3_no_hla.tgz
#wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/hapmap3_snps.tgz

# extract files
tar -xvf 1000G_Phase3_baseline_ldscores.tgz
tar -xvf 1000G_Phase3_plinkfiles.tgz
tar -xvf 1000G_Phase3_frq.tgz
#tar -xvf hapmap3_snps.tgz
tar -xvf weights_hm3_no_hla.tgz


# interactive session
docker run -it -v $PWD:/data/ \
    rtibiocloud/ldsc:v1.0.1_0bb574e bash


In [None]:
window=100000
for fdr in {"0.05","0.10"}; do # loop through each FDR threshold
    coord_file=/data/deg_bedfiles/meta_analysis_sumstats_no_singletons_20220727_fdr${fdr}_coord.tsv
    geneset_file=/data/deg_bedfiles/meta_analysis_sumstats_no_singletons_20220727_fdr${fdr}_geneset.tsv

    # store annotation/LD score files and results for each FDR threshold in a separate dir
    mkdir -p /data/{annotations_ldscores,results}/fdr$fdr/

    for j in {1..22}; do # loop through each chromosome
        # create annotation files
        python /opt/ldsc/make_annot.py \
            --gene-set-file $geneset_file \
            --gene-coord-file $coord_file \
            --windowsize $window \
            --bimfile /data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j.bim \
            --annot-file /data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr$j.annot.gz

        # compute LD scores
        python /opt/ldsc/ldsc.py \
            --l2 \
            --thin-annot \
            --ld-wind-cm 1 \
            --print-snps /data/1000g/1000G_EUR_Phase3_baseline/print_snps.txt \
            --bfile /data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j \
            --annot /data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr$j.annot.gz \
            --out /data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr$j
    done # end chr loop


    for trait in {"age_of_initiation","alcohol_dependence","drinks_per_week","alzheimers_disease","als","anorexia","adhd","autism","bipolar","cannabis_use_disorder","cigarettes_per_day","cotinine_levels","depressive_symptoms","ftnd","heaviness_smoking_index","lifetime_cannabis_use","major_depressive_disorder","neuroticism","opioid_addiction_gsem","parkinsons","ptsd","schizophrenia","smoking_cessation","smoking_initiation"}; do  # loop through all traits
        case $trait in  # use the sumstats file that corresponds to the trait name for the h2 estimate
        
            "age_of_initiation") stats=/data/sumstats/AgeOfInitiation.txt.munged.merged.txt.gz ;;
            "alcohol_dependence") stats=/data/sumstats/pgc_alcdep.eur_discovery.aug2018_release.txt.munged.merged.txt.gz ;;
            "drinks_per_week") stats=/data/sumstats/DrinksPerWeek.txt.munged.merged.txt.gz ;;
            "alzheimers_disease") stats=/data/sumstats/alzheimers_disease_lambert2013_nat_genet.sumstats.gz ;;
            "als") stats=/data/sumstats/amyotrophic_lateral_sclerosis_rheenen2016_nat_genet.sumstats.gz ;;
            "anorexia") stats=/data/sumstats/anorexia_watson2019_workflow_ready.txt.munged.merged.txt.gz ;;
            "adhd") stats=/data/sumstats/daner_meta_filtered_NA_iPSYCH23_PGC11_sigPCs_woSEX_2ell6sd_EUR_Neff_70.meta.munged.merged.txt.gz ;;
            "autism") stats=/data/sumstats/iPSYCH-PGC_ASD_Nov2017.munged.merged.txt.gz ;;
            "bipolar") stats=/data/sumstats/daner_PGC_BIP32b_mds7a_0416a.munged.merged.txt.gz ;;
            "cannabis_use_disorder") stats=/data/sumstats/CUD_GWAS_iPSYCH_June2019.munged.merged.txt.gz ;;
            "cigarettes_per_day") stats=/data/sumstats/CigarettesPerDay.txt.munged.merged.txt.gz ;;
            "cotinine_levels") stats=/data/sumstats/cotinine_ware2016_workflow_ready.txt.munged.merged.txt.gz ;;
            "depressive_symptoms") stats=/data/sumstats/DS_Full.txt.munged.merged.txt.gz ;;
            "ftnd") stats=/data/sumstats/ftnd_wave3_eur_quach2020_workflow_ready.txt.munged.merged.txt.gz ;;
            "heaviness_smoking_index") stats=/data/sumstats/ukb_gwa_003_workflow_ready.txt.munged.merged.txt.gz ;;
            "lifetime_cannabis_use") stats=/data/sumstats/cannabis_icc_ukb_workflow_ready.txt.munged.merged.txt.gz ;;
            "major_depressive_disorder") stats=/data/sumstats/pgc_ukb_depression_gwas_workflow_ready.txt.munged.merged.txt.gz ;;
            "neuroticism") stats=/data/sumstats/neuroticism_okbay2016_nat_genet.sumstats.gz ;;
            "opioid_addiction_gsem") stats=/data/sumstats/genomicSEM_GWAS.oaALL.MVP1_MVP2_YP_SAGE.PGC.Song.table.sumstats.gz ;;
            "parkinsons") stats=/data/sumstats/parkinsons_disease_sanchez2009_nat_genet.sumstats.gz ;;
            "ptsd") stats=/data/sumstats/pts_eur_freeze2_overall.results.munged.merged.txt.gz ;;
            "schizophrenia") stats=/data/sumstats/daner_natgen_pgc_eur.munged.merged.txt.gz ;;
            "smoking_cessation") stats=/data/sumstats/SmokingCessation.txt.munged.merged.txt.gz ;;
            "smoking_initiation") stats=/data/sumstats/SmokingInitiation.txt.munged.merged.txt.gz ;;
        esac

        # compute partitioned heritability estimate
        python /opt/ldsc/ldsc.py \
            --h2 $stats \
            --overlap-annot \
            --print-coefficients \
            --w-ld-chr "/data/weights_hm3_no_hla/weights." \
            --frqfile-chr "/data/1000g/1000G_Phase3_frq/1000G.EUR.QC." \
            --ref-ld-chr "/data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr,/data/1000g/1000G_EUR_Phase3_baseline/baseline." \
            --out "/data/results/fdr$fdr/${trait}_with_oa_twas_meta_analysis_deg_genes_fdr${fdr}_window${window}"
    done
done
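
In a `case` statement like the one above, a trait name that matches no pattern leaves `$stats` holding the previous trait's path, so a misspelled name silently reuses the wrong summary statistics. The sketch below shows a lookup that fails loudly instead. `stats_for_trait` is a hypothetical helper, and only three of the traits are shown.

```shell
# Map a trait name to its munged sumstats path; unknown names are an error.
stats_for_trait() {
    case "$1" in
        "age_of_initiation") echo /data/sumstats/AgeOfInitiation.txt.munged.merged.txt.gz ;;
        "cannabis_use_disorder") echo /data/sumstats/CUD_GWAS_iPSYCH_June2019.munged.merged.txt.gz ;;
        "smoking_initiation") echo /data/sumstats/SmokingInitiation.txt.munged.merged.txt.gz ;;
        *) echo "unknown trait: $1" >&2; return 1 ;;
    esac
}

for trait in age_of_initiation cannabis_use_disorder smoking_initiation; do
    # skip (rather than mis-run) any trait without a known sumstats file
    stats=$(stats_for_trait "$trait") || continue
    echo "$trait -> $stats"
done
```

Adding a `*)` branch to the existing `case` block would give the same protection without restructuring the loop.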

In [None]:
for fdr in {"0.05","0.10"}; do
    outfile=fdr${fdr}/all_phenotypes_oa_twas_meta_analysis_deg_fdr${fdr}_window100000_final_results.tsv
    touch $outfile
    head -1 fdr${fdr}/smoking_initiation_with_oa_twas_meta_analysis_deg_genes_fdr${fdr}_window100000.results > $outfile
        
    for file in   fdr${fdr}/*_fdr${fdr}_window100000.results; do
        trait=$(echo $file |  sed "s/_with_oa_twas_meta_analysis_deg_genes_fdr.*//") # remove suffix
        trait=$(echo $trait |  sed "s/fdr$fdr\///") # remove directory prefix
        #echo $trait
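        # keep the first annotation row (the DEG annotation, listed first in --ref-ld-chr)
        # and replace its Category label with the trait name
        awk -v trait="$trait" -v OFS="\t" \
        '{ $1 = trait; print }' <(tail -n +2 $file | head -1) >> $outfile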
    done
done
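
The two `sed` calls above recover the trait name from each results path. The same extraction can be sketched with POSIX parameter expansion, which avoids starting extra processes for every file; `trait_from_results_path` is a hypothetical helper name.

```shell
# Strip the directory prefix and the fixed "_with_oa_twas_meta_analysis_deg_genes_fdr..." suffix.
trait_from_results_path() {
    base=${1##*/}    # drop everything up to and including the last "/"
    echo "${base%%_with_oa_twas_meta_analysis_deg_genes_fdr*}"    # drop the suffix
}

trait_from_results_path fdr0.05/smoking_initiation_with_oa_twas_meta_analysis_deg_genes_fdr0.05_window100000.results
```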


# Round 2

## Create an Annotation File
https://github.com/bulik/ldsc/wiki/LD-Score-Estimation-Tutorial#partitioned-ld-scores

In [None]:
# download files needed for partitioned heritability analysis
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_baseline_ldscores.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_plinkfiles.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_frq.tgz

wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/weights_hm3_no_hla.tgz
#wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/hapmap3_snps.tgz

# extract files
tar -xvf 1000G_Phase3_baseline_ldscores.tgz
tar -xvf 1000G_Phase3_plinkfiles.tgz
tar -xvf 1000G_Phase3_frq.tgz
#tar -xvf hapmap3_snps.tgz
tar -xvf weights_hm3_no_hla.tgz


# interactive session
docker run -it -v $PWD:/data/ \
    rtibiocloud/ldsc:v1.0.1_0bb574e bash


In [None]:
window=100000
for fdr in {"0.05","0.10"}; do # loop through each FDR threshold
    coord_file=/data/deg_bedfiles/meta_analysis_sumstats_no_singletons_20220727_fdr${fdr}_coord.tsv
    geneset_file=/data/deg_bedfiles/meta_analysis_sumstats_no_singletons_20220727_fdr${fdr}_geneset.tsv

    # store annotation/LD score files and results for each FDR threshold in a separate dir
    mkdir -p /data/{annotations_ldscores,results}/fdr$fdr/

    for j in {1..22}; do # loop through each chromosome
        # create annotation files
        python /opt/ldsc/make_annot.py \
            --gene-set-file $geneset_file \
            --gene-coord-file $coord_file \
            --windowsize $window \
            --bimfile /data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j.bim \
            --annot-file /data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr$j.annot.gz

        # compute LD scores
        python /opt/ldsc/ldsc.py \
            --l2 \
            --thin-annot \
            --ld-wind-cm 1 \
            --print-snps /data/1000g/1000G_EUR_Phase3_baseline/print_snps.txt \
            --bfile /data/1000g/1000G_EUR_Phase3_plink/1000G.EUR.QC.$j \
            --annot /data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr$j.annot.gz \
            --out /data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr$j
    done # end chr loop


    for trait in {"intelligence","opioid_addiction_144","mean_accumbens_volume","mean_amygdala_volume","mean_caudate_volume","mean_hippocampus_volume","mean_pallidum_volume","mean_putamen_volume","mean_thalamus_volume","years_of_education"}; do  # loop through all traits
        case $trait in  # use the sumstats file that corresponds to the trait name for the h2 estimate
        
            "intelligence") stats=/data/sumstats/intelligence_sniekers2017_nat_genet_sumstats_formatted.sumstats.gz ;;
            "opioid_addiction_144") stats=/data/sumstats/cats+coga+decode+kreek+odb+uhs+vidus+yale-penn.ea.chrall.maf_gt_0.01.rsq_gt_0.8.sumstats_formatted.sumstats.gz ;;
            "mean_accumbens_volume") stats=/data/sumstats/ENIGMA2_MeanAccumbens_Combined_GenomeControlled_Jan23.tbl.sumstats.gz ;;
            "mean_amygdala_volume") stats=/data/sumstats/ENIGMA2_MeanAmygdala_Combined_GenomeControlled_Jan23.tbl.sumstats.gz ;;
            "mean_caudate_volume") stats=/data/sumstats/ENIGMA2_MeanCaudate_Combined_GenomeControlled_Jan23.tbl.sumstats.gz ;;
            "mean_hippocampus_volume") stats=/data/sumstats/ENIGMA2_MeanHippocampus_Combined_GenomeControlled_Jan23.tbl.sumstats.gz ;;
            "mean_pallidum_volume") stats=/data/sumstats/ ;;
            "mean_putamen_volume") stats=/data/sumstats/ ;;
            "mean_thalamus_volume") stats=/data/sumstats/ ;;
            "years_of_education") stats=/data/sumstats/ ;;


        esac

        # compute partitioned heritability estimate
        python /opt/ldsc/ldsc.py \
            --h2 $stats \
            --overlap-annot \
            --print-coefficients \
            --w-ld-chr "/data/weights_hm3_no_hla/weights." \
            --frqfile-chr "/data/1000g/1000G_Phase3_frq/1000G.EUR.QC." \
            --ref-ld-chr "/data/annotations_ldscores/fdr$fdr/oa_twas_meta_fdr${fdr}genes_window${window}_chr,/data/1000g/1000G_EUR_Phase3_baseline/baseline." \
            --out "/data/results/fdr$fdr/${trait}_with_oa_twas_meta_analysis_deg_genes_fdr${fdr}_window${window}"
    done
done

In [None]:
for fdr in {"0.05","0.10"}; do
    outfile=fdr${fdr}/all_phenotypes_oa_twas_meta_analysis_deg_fdr${fdr}_window100000_final_results.tsv
    touch $outfile
    head -1 fdr${fdr}/smoking_initiation_with_oa_twas_meta_analysis_deg_genes_fdr${fdr}_window100000.results > $outfile
        
    for file in   fdr${fdr}/*_fdr${fdr}_window100000.results; do
        trait=$(echo $file |  sed "s/_with_oa_twas_meta_analysis_deg_genes_fdr.*//") # remove suffix
        trait=$(echo $trait |  sed "s/fdr$fdr\///") # remove directory prefix
        #echo $trait
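        # keep the first annotation row (the DEG annotation, listed first in --ref-ld-chr)
        # and replace its Category label with the trait name
        awk -v trait="$trait" -v OFS="\t" \
        '{ $1 = trait; print }' <(tail -n +2 $file | head -1) >> $outfile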
    done
done
