<h3>GWAS data subsetting (DEPRECATED)</h3>
This notebook contains code for investigating and retrieving data from a GWAS catalog associations file available for download <a href=https://www.ebi.ac.uk/gwas/docs/file-downloads>here</a>. This notebook was created for gwas_catalog_v1.0.2.<br>
A description of column headers can be found <a href=https://www.ebi.ac.uk/gwas/docs/fileheaders#_file_headers_for_catalog_version_1_0_1>here</a>.<br>

Traits of interest for this investigation are the following:
 -	Height (negative control)
 -	Systemic lupus erythematosus (SLE), an autoimmune disease
 -	Type 1 diabetes mellitus (T1D)




In [1]:
# Imports
import pandas as pd
import numpy as np
from pathlib import Path
import re

In [2]:
DATA_DIR = Path("../Data/GWAS")
MAGMA_DIR = Path("../Data/MAGMA")
ATAC_DIR = Path("../Data/ATAC")
GWAS_ASSO_FILE = "gwas_catalog_v1.0.2-associations_e113_r2025-01-30.tsv"


TRAITS = {
    "T1D" : "type 1 diabetes mellitus", 
    "SLE" : "systemic lupus erythematosus", 
    "Height": "body height" 
}

In [3]:
# Read GWAS associations file
GWAS_data = pd.read_csv(DATA_DIR / GWAS_ASSO_FILE, sep="\t")



In [4]:
# Available data
GWAS_data.keys()

Index(['DATE ADDED TO CATALOG', 'PUBMEDID', 'FIRST AUTHOR', 'DATE', 'JOURNAL',
       'LINK', 'STUDY', 'DISEASE/TRAIT', 'INITIAL SAMPLE SIZE',
       'REPLICATION SAMPLE SIZE', 'REGION', 'CHR_ID', 'CHR_POS',
       'REPORTED GENE(S)', 'MAPPED_GENE', 'UPSTREAM_GENE_ID',
       'DOWNSTREAM_GENE_ID', 'SNP_GENE_IDS', 'UPSTREAM_GENE_DISTANCE',
       'DOWNSTREAM_GENE_DISTANCE', 'STRONGEST SNP-RISK ALLELE', 'SNPS',
       'MERGED', 'SNP_ID_CURRENT', 'CONTEXT', 'INTERGENIC',
       'RISK ALLELE FREQUENCY', 'P-VALUE', 'PVALUE_MLOG', 'P-VALUE (TEXT)',
       'OR or BETA', '95% CI (TEXT)', 'PLATFORM [SNPS PASSING QC]', 'CNV',
       'MAPPED_TRAIT', 'MAPPED_TRAIT_URI', 'STUDY ACCESSION',
       'GENOTYPING TECHNOLOGY'],
      dtype='object')

In [5]:
# Scanning for traits of interest:
# -	Height negative control GWAS
# -	SLE autoimmune disease
# -	Type I diabetes mellitus
# There are 2 columns that can be used to identify the traits of interest: "DISEASE/TRAIT" and "MAPPED_TRAIT",
# where the former is more specific and the latter is more general.

for key, value in TRAITS.items():
    print(f"\033[1m{value} traits:\033[0m ")
    matches = GWAS_data.MAPPED_TRAIT[GWAS_data.MAPPED_TRAIT.str.contains(value, case=False, na=False)].unique()
    for mapped_trait in matches:
        print(mapped_trait)
    print()

type 1 diabetes mellitus traits:
type 1 diabetes mellitus
urinary albumin excretion rate, type 1 diabetes mellitus
autoimmune thyroid disease, systemic lupus erythematosus, type 1 diabetes mellitus, ankylosing spondylitis, psoriasis, common variable immunodeficiency, celiac disease, ulcerative colitis, Crohn's disease, autoimmune disease, juvenile idiopathic arthritis
autoimmune thyroid disease, systemic lupus erythematosus, type 1 diabetes mellitus, psoriasis, ankylosing spondylitis, common variable immunodeficiency, celiac disease, ulcerative colitis, Crohn's disease, autoimmune disease, juvenile idiopathic arthritis
disease free survival, type 1 diabetes mellitus
event free survival time, type 1 diabetes mellitus, autoantibody measurement
autoimmune thyroid disease, type 1 diabetes mellitus
age at diagnosis, type 1 diabetes mellitus
type 1 diabetes mellitus, latent autoimmune diabetes in adults
age of onset of type 1 diabetes mellitus
migraine disorder, type 1 diabetes mell

<h4><i>Subsetting data</i></h4>

In [10]:
# Subset data
trait_data = {}
for key, value in TRAITS.items():
    trait_subset = GWAS_data[(GWAS_data['MAPPED_TRAIT'] == value) & (GWAS_data['CHR_POS'].notna()) & (GWAS_data["P-VALUE"] < 5e-8)].copy()
    print(key)
    print(trait_subset.CHR_ID.unique())
    trait_subset["CHR_POS"] = trait_subset["CHR_POS"].astype(int)
    trait_subset["CHR_ID"] = trait_subset["CHR_ID"].replace({"X": 23, "Y": 24}).astype(int)
    trait_data[key] = trait_subset

# Check data
for trait, data in trait_data.items():
    print("".join(data["MAPPED_TRAIT"].unique()))
    print(f"Shape: {data.shape} \n")


T1D
['10' '12' '18' '21' '1' '6' '7' '14' '16' '20' '22' '2' '4' '17' '19' 'X'
 '11' '9' '15' '13' '5' '8' '3']
SLE
['6' '8' '2' '7' '16' '4' '1' '5' '11' '22' '10' '12' '13' '3' '15' '20'
 'X' '19' '9' '17' '14' '18']
Height
['16' '6' '5' '9' '3' '20' '15' '18' '1' '12' '17' '7' '14' '19' '2' '8'
 '13' '4' '11' '10' 'X' '21' '22' 2.0 5.0 14.0 9.0 21.0 1.0 20.0 3.0 4.0
 19.0 16.0 17.0 22.0 6.0 8.0 11.0 12.0 15.0 13.0 18.0 7.0 10.0]
type 1 diabetes mellitus
Shape: (317, 38) 

systemic lupus erythematosus
Shape: (904, 38) 

body height
Shape: (25663, 38) 
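
The Height output above lists chromosome labels both as strings ('16') and as floats (2.0), so any conversion has to accept both forms. A minimal sketch of a label normalizer follows; `normalize_chrom` is a hypothetical helper, not part of the notebook, and the X/Y codes simply mirror the `replace` mapping used in the subsetting cell.

```python
def normalize_chrom(label):
    """Map a chromosome label such as '16', 2.0 or 'X' to an integer code."""
    # Hypothetical helper: X -> 23 and Y -> 24 follow the mapping used in the notebook
    special = {"X": 23, "Y": 24}
    text = str(label).strip()
    if text in special:
        return special[text]
    # Go through float() so labels that were parsed as floats ('2.0') are accepted too
    return int(float(text))


labels = ["16", 2.0, "X", "21", 14.0]
print([normalize_chrom(label) for label in labels])
```

Labels that describe several chromosomes at once are not handled here and would raise a `ValueError`.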



There is still a discrepancy between the number of associations the website reports and the number seen here. For instance, T1D has 664 associations in the dataset used here (before filtering), while the website reports 896. My guess is that this is because the catalog file is an older version and therefore contains fewer associations, although the gap is fairly large.
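
Another possible contributor is the matching rule itself: the subsetting cell keeps only rows where `MAPPED_TRAIT` equals the trait exactly, while the scan output shows rows whose mapped trait lists several comma-separated terms. The sketch below uses made-up rows, not catalog data, to compare three rules: exact match, substring match, and membership in the comma-separated term list. `has_term` is a hypothetical helper, and it assumes that individual trait names never contain a comma.

```python
# Made-up MAPPED_TRAIT values in the comma-separated style seen in the scan output
mapped_traits = [
    "type 1 diabetes mellitus",
    "age at diagnosis, type 1 diabetes mellitus",
    "autoimmune thyroid disease, type 1 diabetes mellitus",
    "age of onset of type 1 diabetes mellitus",
    "type 2 diabetes mellitus",
]
target = "type 1 diabetes mellitus"


def has_term(mapped_trait, term):
    """True if `term` is one of the comma-separated terms in `mapped_trait`."""
    return term in [part.strip() for part in mapped_trait.split(",")]


exact_count = sum(value == target for value in mapped_traits)
substring_count = sum(target in value for value in mapped_traits)
term_count = sum(has_term(value, target) for value in mapped_traits)
print(f"exact: {exact_count}, substring: {substring_count}, term: {term_count}")
```

Running the same three counts on the real `MAPPED_TRAIT` column would show how much of the gap, if any, comes from the matching rule rather than from the catalog version.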

<h4><i>Calculating total samples per SNP</i></h4>

In [7]:
for trait, data in trait_data.items():
    print(data["INITIAL SAMPLE SIZE"][1:10].unique())
    print()
    print(data["REPLICATION SAMPLE SIZE"][1:10].unique())

['16,179 European ancestry individuals'
 '7,514 European ancestry cases, 9,045 European ancestry controls']

[nan
 '4,267 European ancestry cases, 4,670 European ancestry controls, 4,342 European ancestry trios from 2,319 families']
['431 European ancestry cases, 2,155 European ancestry controls'
 '811 anti-dsDNA positive European ancestry cases, 906 anti-dsDNA negative European ancestry cases, 4,813 European ancestry controls']

['447 European ancestry trios, 293 trios' nan]
['8,097 European ancestry tall individuals, 8,099 European ancestry short individuals']

['4,872 European ancestry tall individuals, 4,831 European ancestry short individuals']


In [8]:
pattern = r"(\d+(?:,\d{3})*)" # Matches integers with or without comma separators ("1000" stays one number instead of splitting into "100" and "0")

def calc_sample_size(sample_string):
    # Finds all numbers in the string, removes commas and sums them
    return sum(int(number.replace(',', '')) for number in re.findall(pattern, sample_string))

sample_string1 = "300 European cases,, and 250,000,100 controls from another planet, also 5 random guys"
sample_string2 = "200 Asian dudes and 50 random people"
assert calc_sample_size(sample_string1) + calc_sample_size(sample_string2) + calc_sample_size("NaN") == 250000655

for trait, data in trait_data.items():
    data["TOTAL_SAMPLES"] = data.apply(
        lambda row: calc_sample_size(str(row["INITIAL SAMPLE SIZE"]))
        + calc_sample_size(str(row["REPLICATION SAMPLE SIZE"])),
        axis=1,
    )

# for trait, data in trait_data.items():
    # print(data[["INITIAL SAMPLE SIZE", "REPLICATION SAMPLE SIZE","TOTAL_SAMPLES"]].iloc[0])

There might be a better way to get this data. Also, these totals mix all different kinds of cohorts (cases, controls, trios, families); does this not matter?
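
One way to look into the cohort question is to keep the counts separate instead of summing everything. The sketch below is an assumption about how that could be done, not a method from the notebook: `ENTRY` and `count_by_group` are hypothetical names, and the grouping relies only on the words "cases" and "controls" appearing in the description, as they do in the sample strings printed above.

```python
import re

# A number (with optional comma separators) followed by its description,
# where the description runs until the next comma or digit
ENTRY = re.compile(r"(\d+(?:,\d{3})*)\s+([^,\d]+)")


def count_by_group(sample_string):
    """Sum the sample counts in a catalog sample-size string per cohort group."""
    counts = {"cases": 0, "controls": 0, "other": 0}
    for number, description in ENTRY.findall(sample_string):
        value = int(number.replace(",", ""))
        words = description.lower().split()
        if "cases" in words:
            group = "cases"
        elif "controls" in words:
            group = "controls"
        else:
            # Individuals, trios, families and anything else not labelled case/control
            group = "other"
        counts[group] += value
    return counts


example = "7,514 European ancestry cases, 9,045 European ancestry controls"
print(count_by_group(example))
```

Strings such as '16,179 European ancestry individuals' end up entirely in the "other" group, which makes it easy to see which studies report a case/control split at all.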

<h4><i>Writing files</i></h4>

In [None]:
for trait, data in trait_data.items():
    # SNP location data file
    data[["SNPS", "CHR_ID", "CHR_POS"]].to_csv(MAGMA_DIR / "snp_locations" / f"{trait}_SNPLOC.tsv", sep="\t", index=False, header=False)
    # SNP P-values file
    data[["SNPS", "P-VALUE", "TOTAL_SAMPLES"]].to_csv(MAGMA_DIR / "pvals" / f"{trait}_PVAL.txt", sep="\t", index=False, header=True)
    # BED-style file with 5 columns (chrom, start, end, name, score); start and end are both CHR_POS
    data[["CHR_ID", "CHR_POS", "CHR_POS", "SNPS", "P-VALUE"]].to_csv(ATAC_DIR / "bed_files" / "snps" / f"{trait}.bed", sep="\t", index=False, header=False)

print("Files written successfully")

Files written successfully
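
The BED file above uses `CHR_POS` for both the start and the end column. Standard BED intervals are 0-based and half-open, so if `CHR_POS` is a 1-based coordinate (an assumption worth checking against the catalog documentation) and the downstream tool expects standard BED, a single-base SNP at position p would be written as start p - 1 and end p. A minimal sketch, with the hypothetical helper `snp_to_bed_fields` and a made-up SNP:

```python
def snp_to_bed_fields(chrom, pos, name, score):
    """Return BED fields for a single-base SNP given a 1-based position."""
    # BED is 0-based, half-open: a 1-based position p becomes the interval [p - 1, p)
    return [str(chrom), str(pos - 1), str(pos), str(name), str(score)]


# Made-up SNP, for illustration only
fields = snp_to_bed_fields(6, 32413051, "rs0000000", 1e-10)
print("\t".join(fields))
```

Whether the chromosome column should be a bare number, as written by the notebook, or carry a prefix such as "chr" depends on the tool that reads the file.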
