# FinnGen - Data Engineering Notebook

## Data Descriptions

### gwas


The `{endpoint}.gz` file has the following structure:

| Column name   | Description                                                 |
| ------------- | ----------------------------------------------------------- |
| #chrom        | chromosome on build GRCh38 (1-23)                           |
| pos           | position in base pairs on build GRCh38                       |
| ref           | reference allele                                            |
| alt           | alternative allele (effect allele)                           |
| rsids         | variant identifier                                          |
| nearest_genes | nearest gene(s) to the variant (comma-separated)             |
| pval          | p-value from [source]                                        |
| mlogp         | -log10(p-value)                                             |
| beta          | effect size (log(OR) scale) estimated with [source]          |
| sebeta        | standard error of effect size estimated with [source]        |
| af_alt        | alternative (effect) allele frequency                        |
| af_alt_cases  | alternative (effect) allele frequency among cases            |
| af_alt_controls | alternative (effect) allele frequency among controls         |
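
As a quick sanity check on the columns above, `mlogp` should equal `-log10(pval)`, and because `beta` is on the log(OR) scale, exponentiating it gives an odds ratio. The sketch below uses two rows copied from the `gwas` preview in this notebook; the `odds_ratio` column name is an illustrative assumption and is not part of the FinnGen file.

```python
import numpy as np
import pandas as pd

# Two rows copied from the gwas preview in this notebook
toy = pd.DataFrame({
    "pval": [0.944365, 0.844305],
    "mlogp": [0.024860, 0.073501],
    "beta": [-0.005926, 0.010088],
})

# mlogp is defined as -log10(p-value)
recomputed_mlogp = -np.log10(toy["pval"])
mlogp_matches = bool(np.allclose(recomputed_mlogp, toy["mlogp"], atol=1e-4))

# beta is a log odds ratio, so exp(beta) is the odds ratio
# ("odds_ratio" is a hypothetical column name used only here)
toy["odds_ratio"] = np.exp(toy["beta"])

print(mlogp_matches)
print(toy)
```

A negative `beta` maps to an odds ratio below 1 and a positive `beta` to one above 1.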


### causal

Data taken from:

[Functional characterization of T2D-associated SNP effects on baseline and ER stress-responsive β cell transcriptional activation](https://www.nature.com/articles/s41467-021-25514-6#MOESM8)

### finemap

`{endpoint}.SUSIE.snp.bgz` contains variant summaries with credible set information and has the following structure:

| Column name    | Description                                                        |
| -------------- | ------------------------------------------------------------------ |
| trait          | endpoint name                                                      |
| region         | chr:start-end                                                      |
| v              | variant identifier                                                 |
| rsid           | rs variant identifier                                              |
| chromosome     | chromosome on build GRCh38 (1-22, X)                                |
| position       | position in base pairs on build GRCh38                              |
| allele1        | reference allele                                                   |
| allele2        | alternative allele (effect allele)                                  |
| maf            | minor allele frequency                                             |
| beta           | effect size GWAS                                                   |
| se             | standard error GWAS                                                |
| p              | p-value GWAS                                                       |
| mean           | posterior expectation of true effect size                           |
| sd             | posterior standard deviation of true effect size                   |
| prob           | posterior probability of association                                |
| cs             | identifier of 95% credible set (-1 = variant is not part of credible set) |
| lead_r2        | r2 value to a lead variant (the one with maximum PIP) in a credible set |
| alphax         | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
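
To illustrate how `cs` and `prob` work together, the sketch below keeps only variants that belong to a credible set and picks the highest-`prob` variant in each set. All values are made up, and grouping by both `region` and `cs` rests on the assumption that credible-set identifiers are only unique within a region.

```python
import pandas as pd

# Made-up rows that mimic a few finemap columns
toy_finemap = pd.DataFrame({
    "region": ["chr1:100-200"] * 4,
    "v": ["1:110:G:A", "1:120:T:C", "1:130:C:T", "1:140:A:G"],
    "prob": [0.70, 0.25, 0.90, 0.01],
    "cs": [1, 1, 2, -1],
})

# cs == -1 means the variant is not part of any credible set
in_cs = toy_finemap[toy_finemap["cs"] != -1]

# Index label of the highest-prob variant within each (region, cs) group
top_idx = in_cs.groupby(["region", "cs"])["prob"].idxmax()
top_variants = toy_finemap.loc[top_idx, ["region", "cs", "v", "prob"]]

print(top_variants)
```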

## Libraries

In [1]:
import sys
import pandas as pd
import numpy as np
import requests
import time
from concurrent.futures import ThreadPoolExecutor
import myvariant



print("Python version:", sys.version)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)

Python version: 3.11.3 (tags/v3.11.3:f3909b8, Apr  4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)]
Pandas version: 2.0.1
NumPy version: 1.24.3


## Import data

In [2]:
# Read the 'finemap' file into a pandas DataFrame
finemap = pd.read_csv('C:/Users/falty/Desktop/gwas-graph/FinnGen/data/finemapping_full_finngen_R9_T2D.SUSIE.snp.tsv', low_memory=False, sep='\t')

# Read the 'causal' file into a pandas DataFrame
causal = pd.read_csv('C:/Users/falty/Desktop/gwas-graph/FinnGen/data/41467_2021_25514_MOESM8_ESM.csv', low_memory=False)

# Read the 'gwas' file into a pandas DataFrame
gwas = pd.read_csv('C:/Users/falty/Desktop/gwas-graph/FinnGen/data/summary_stats_finngen_R9_T2D.tsv', low_memory=False, sep='\t')
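
The data descriptions refer to compressed `{endpoint}.gz` files, while the cell above reads decompressed `.tsv` copies. As a minimal sketch, `pd.read_csv` can also read a gzip file directly, because it infers the compression from the `.gz` extension. The file name `toy_endpoint.gz` and its contents are hypothetical and written to a temporary directory only for the demonstration.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for a FinnGen summary-statistics download
    path = Path(tmp) / "toy_endpoint.gz"
    with gzip.open(path, "wt") as fh:
        fh.write("#chrom\tpos\tref\talt\n1\t13668\tG\tA\n")

    # compression defaults to 'infer', so the .gz suffix is enough
    toy = pd.read_csv(path, sep="\t")

print(toy)
```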

In [3]:
print("NaNs and missing values in 'gwas':")
empty = gwas.isna().sum()
print(empty)

NaNs and missing values in 'gwas':
#chrom                   0
pos                      0
ref                      0
alt                      0
rsids              1366396
nearest_genes       727855
pval                     0
mlogp                    0
beta                     0
sebeta                   0
af_alt                   0
af_alt_cases             0
af_alt_controls          0
dtype: int64


## Explore data

In [4]:
gwas

Unnamed: 0,#chrom,pos,ref,alt,rsids,nearest_genes,pval,mlogp,beta,sebeta,af_alt,af_alt_cases,af_alt_controls
0,1,13668,G,A,rs2691328,OR4F5,0.944365,0.024860,-0.005926,0.084918,0.005842,0.005729,0.005863
1,1,14773,C,T,rs878915777,OR4F5,0.844305,0.073501,0.010088,0.051369,0.013495,0.013547,0.013485
2,1,15585,G,A,rs533630043,OR4F5,0.841908,0.074735,0.031464,0.157751,0.001113,0.001125,0.001110
3,1,16549,T,C,rs1262014613,OR4F5,0.343308,0.464316,0.241377,0.254711,0.000561,0.000620,0.000550
4,1,16567,G,C,rs1194064194,OR4F5,0.129883,0.886447,0.130736,0.086319,0.004170,0.004250,0.004154
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20170001,23,155697920,G,A,,,0.027115,1.566790,-0.013475,0.006097,0.290961,0.286054,0.291879
20170002,23,155698443,C,A,,,0.178417,0.748564,-0.069907,0.051951,0.003259,0.003022,0.003304
20170003,23,155698490,C,T,,,0.279640,0.553400,-0.020245,0.018725,0.024406,0.024312,0.024423
20170004,23,155699751,C,T,,,0.078864,1.103120,-0.011284,0.006421,0.244829,0.241257,0.245498


In [16]:
causal

Unnamed: 0,Trait,IndexSNP,SNP,CHR,POSITION,LOCUS,REF_ALLELE,ALT_ALLELE,MAF_AFR,MAF_AMR,...,REF_HIGHER_Under_Erstress,ALT_HIGHER_Under_Erstress,ATAC_Peak,PDX1,FOXA2,H2A.Z,H3K27ac,MAFB,NKX6.1,CTCF
0,T2D,rs2820446,rs4846567,1,219750717,LYPAL1,G,T,0.05,0.41,...,0.0,-1,1,0,0,0,0,0,0,0
1,T2D,rs35072907,rs72904726,1,51339332,FAF1,T,C,0.02,0.08,...,0.0,0,1,0,0,0,0,0,0,1
2,T2D,rs1861612,rs10197480,2,230521253,DNER,C,T,0.53,0.54,...,0.0,0,0,0,0,0,0,0,0,0
3,T2D,rs34669198,rs13026123,2,45654050,SRBD1,T,A,0.09,0.09,...,1.0,0,0,0,0,0,0,0,0,0
4,T2D,rs3923113,rs6713419,2,165508300,"GRB14,COBLL1",T,C,0.76,0.29,...,0.0,0,0,0,0,0,0,0,0,0
5,T2D,rs6712932,rs10210149,2,105839324,"GPR45, LINC01918",C,T,0.27,0.29,...,0.0,0,1,0,0,0,0,0,0,0
6,T2D,rs6723108,rs6723108,2,135479980,TMEM163/RAB3GAP1/ACMSD,G,T,0.99,0.81,...,0.0,0,0,0,0,0,0,0,0,0
7,T2D,rs7578326,rs2943656,2,227121918,IRS1,A,G,0.6,0.73,...,0.0,0,1,0,0,1,0,0,0,0
8,T2D,rs7578597,rs13405776,2,43739121,THADA,C,T,0.34,0.07,...,0.0,0,0,0,0,0,0,0,0,0
9,T2D,rs9309245,rs10194328,2,53398669,ASB3,G,T,0.57,0.34,...,0.0,0,0,0,0,0,0,0,0,0


In [6]:
finemap.head()

Unnamed: 0,trait,region,v,rsid,chromosome,position,allele1,allele2,maf,beta,...,lbf_variable1,lbf_variable2,lbf_variable3,lbf_variable4,lbf_variable5,lbf_variable6,lbf_variable7,lbf_variable8,lbf_variable9,lbf_variable10
0,T2D,chr1:18908743-21908743,1:18908743:G:A,chr1_18908743_G_A,chr1,18908743,G,A,0.000438,-0.143377,...,-1.448766,-0.536436,-0.533681,-0.531823,-0.531294,-0.532036,-0.533612,-0.535413,-0.536846,-0.537469
1,T2D,chr1:18908743-21908743,1:18909023:T:C,chr1_18909023_T_C,chr1,18909023,T,C,0.198418,-0.008341,...,-1.325844,-0.352317,-0.35002,-0.348464,-0.348012,-0.348617,-0.349921,-0.351417,-0.352614,-0.353139
2,T2D,chr1:18908743-21908743,1:18909112:G:A,chr1_18909112_G_A,chr1,18909112,G,A,0.159798,-0.006078,...,-1.561368,-0.529649,-0.526932,-0.525097,-0.524573,-0.525301,-0.526853,-0.528628,-0.530041,-0.530656
3,T2D,chr1:18908743-21908743,1:18909164:T:C,chr1_18909164_T_C,chr1,18909164,T,C,0.1684,-0.00416,...,-1.657574,-0.652146,-0.649198,-0.647204,-0.646632,-0.64742,-0.649103,-0.651028,-0.652563,-0.653231
4,T2D,chr1:18908743-21908743,1:18909192:C:CTACG,chr1_18909192_C_CTACG,chr1,18909192,C,CTACG,0.000248,0.149161,...,-1.565444,-0.628124,-0.625089,-0.623034,-0.622444,-0.623254,-0.624986,-0.626968,-0.628549,-0.629239


In [7]:
def explore_dataframe(dataframe, dataframe_name):
    print("=== DataFrame Exploration: {} ===".format(dataframe_name))
    print("Number of Rows: {}".format(dataframe.shape[0]))
    print("Number of Columns: {}".format(dataframe.shape[1]))
    print("Column Names: {}".format(", ".join(dataframe.columns)))
    print("\nData Types of Columns:")
    print(dataframe.dtypes)
    print("\nNull Value Counts:")
    print(dataframe.isnull().sum())
    print("\nSummary Statistics:")
    print(dataframe.describe())
    print("=== End of DataFrame Exploration: {} ===\n".format(dataframe_name))
    
#explore_dataframe(gwas, "gwas")
#explore_dataframe(causal, "causal")
#explore_dataframe(finemap, "finemap")

## Data manipulation

### Adjust `chromosome` in `finemap`

In [8]:
# Extract number from 'chromosome' and replace 'X' with '23'
finemap['chromosome'] = finemap['chromosome'].str.extract(r'(\d+|X)', expand=False).replace('X', '23')

# Convert 'chromosome' column to 'int64'
finemap['chromosome'] = finemap['chromosome'].astype('int64')

# Assertions to verify the data manipulations
assert finemap['chromosome'].dtype == 'int64'  
assert finemap['chromosome'].isin(range(1, 24)).all()  
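
The effect of this transformation is easiest to see on a few sample values. The sketch below applies the same extract, replace, and cast steps to a toy Series; the input strings are illustrative and follow the `chr`-prefixed style shown in the `finemap` preview.

```python
import pandas as pd

chrom = pd.Series(["chr1", "chr22", "chrX"])

numeric = (
    chrom.str.extract(r"(\d+|X)", expand=False)  # pull out the digits or 'X'
    .replace("X", "23")                          # encode X as chromosome 23
    .astype("int64")                             # cast the strings to integers
)

print(numeric.tolist())
```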

### Adjust `v` in `finemap`

In [9]:
# Replace 'X' with '23' in 'v' column of finemap
finemap['v'] = finemap['v'].str.replace(r'(^X:)', '23:', regex=True)

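# Assert that no value in 'v' still starts with 'X:'
# (using `in` on a Series would test the index labels, not the values)
assert not finemap['v'].str.startswith('X:').any()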
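
A note on membership tests: Python's `in` operator applied to a pandas Series checks the index labels, not the values, so a value-level check needs the `.str` accessor or an explicit array. A minimal sketch on toy data:

```python
import pandas as pd

v = pd.Series(["X:100:G:A", "1:200:T:C"])

# `in` looks at the index labels (0 and 1 here), not at the stored strings
label_hit = "X:100:G:A" in v

# Value-level checks
has_x_prefix = bool(v.str.startswith("X:").any())
value_hit = "X:100:G:A" in v.to_numpy()

print(label_hit, has_x_prefix, value_hit)
```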

### Create `causal` in `gwas`

In [10]:
gwas['causal'] = gwas['rsids'].isin(causal['SNP']).astype(int)

# Assertions to verify the data manipulation
assert 'causal' in gwas.columns  
assert gwas['causal'].isin([0, 1]).all()
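
`isin` compares whole strings, so an `rsids` entry that holds several comma-separated identifiers would not match a single SNP. Whether such entries occur in this file is an assumption to verify against the data. The sketch below, on made-up identifiers, contrasts the exact match with a split-and-explode variant that flags a row if any of its identifiers matches.

```python
import pandas as pd

toy_gwas = pd.DataFrame({"rsids": ["rs1", "rs2,rs3", None, "rs4"]})
causal_snps = {"rs3", "rs4"}  # stand-in for causal['SNP']

# Exact string match, as in the cell above
exact = toy_gwas["rsids"].isin(causal_snps).astype(int)

# Split on commas, one identifier per row, then collapse back per original row
exploded = toy_gwas["rsids"].str.split(",").explode()
any_match = exploded.isin(causal_snps).groupby(level=0).any().astype(int)

print(exact.tolist())
print(any_match.tolist())
```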

### Create `LD` in `gwas`

In [11]:
# Filter finemap by 'lead_r2' strictly between 0.8 and 1 (both bounds exclusive)
filtered_finemap = finemap[(finemap['lead_r2'] > 0.8) & (finemap['lead_r2'] < 1)].copy()

# Extract the position from 'v' column in filtered_finemap
filtered_finemap['position'] = filtered_finemap['v'].str.split(':').str.get(1).astype(int)

# Create multi-index for fast selection and sort the index
filtered_finemap.set_index(['chromosome', 'position'], inplace=True)
filtered_finemap.sort_index(inplace=True)

# Check existence in filtered_finemap using index
gwas['LD'] = np.asarray(gwas.set_index(['#chrom', 'pos']).index.isin(filtered_finemap.index), dtype=int)

ld_sum = gwas['LD'].sum()
print(ld_sum)

# Assertions to verify the data manipulations
assert len(filtered_finemap) <= ld_sum
assert '#chrom' in gwas.columns
assert 'pos' in gwas.columns
assert 'LD' in gwas.columns
assert gwas['LD'].isin([0, 1]).all()
assert gwas['LD'].sum() >= 1

4143
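
The lookup above matches on chromosome and position only, so alleles are ignored and every `gwas` row at a flagged position gets `LD = 1`. A minimal sketch of the same MultiIndex membership test on toy data (all values are made up):

```python
import pandas as pd

toy_gwas = pd.DataFrame({"#chrom": [1, 1, 2], "pos": [100, 200, 100]})
toy_ld = pd.DataFrame({"chromosome": [1, 2], "position": [200, 300]})

# Build (chromosome, position) keys for both tables
gwas_keys = pd.MultiIndex.from_frame(toy_gwas[["#chrom", "pos"]])
ld_keys = pd.MultiIndex.from_frame(toy_ld[["chromosome", "position"]])

# True where the gwas key also appears among the LD keys
toy_gwas["LD"] = gwas_keys.isin(ld_keys).astype(int)

print(toy_gwas)
```

Only the row at chromosome 1, position 200 is flagged; chromosome 2, position 100 is not, because both parts of the key must match.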


### Create `lead` in `gwas`

`finemap` df:

| Column name    | Description                                                        |
| -------------- | ------------------------------------------------------------------ |
| trait          | endpoint name                                                      |
| region         | chr:start-end                                                      |
| v              | variant identifier                                                 |
| rsid           | rs variant identifier                                              |
| chromosome     | chromosome on build GRCh38 (1-23)                                |
| position       | position in base pairs on build GRCh38                              |
| allele1        | reference allele                                                   |
| allele2        | alternative allele (effect allele)                                  |
| maf            | minor allele frequency                                             |
| beta           | effect size GWAS                                                   |
| se             | standard error GWAS                                                |
| p              | p-value GWAS                                                       |
| mean           | posterior expectation of true effect size                           |
| sd             | posterior standard deviation of true effect size                   |
| prob           | posterior probability of association                                |
| cs             | identifier of 95% credible set (-1 = variant is not part of credible set) |
| lead_r2        | r2 value to a lead variant (the one with maximum PIP) in a credible set |
| alphax         | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |

`gwas` df:

| Column name   | Description                                                 |
| ------------- | ----------------------------------------------------------- |
| #chrom        | chromosome on build GRCh38 (1-23)                           |
| pos           | position in base pairs on build GRCh38                       |
| ref           | reference allele                                            |
| alt           | alternative allele (effect allele)                           |
| rsids         | variant identifier                                          |
| nearest_genes | nearest gene(s) to the variant (comma-separated)             |
| pval          | p-value from [source]                                        |
| mlogp         | -log10(p-value)                                             |
| beta          | effect size (log(OR) scale) estimated with [source]          |
| sebeta        | standard error of effect size estimated with [source]        |
| af_alt        | alternative (effect) allele frequency                        |
| af_alt_cases  | alternative (effect) allele frequency among cases            |
| af_alt_controls | alternative (effect) allele frequency among controls         |
| LD | 1 if variant is in linkage disequilibrium with lead variant, 0 if not        |

- create a new column in `gwas` named `id` that joins `#chrom`, `pos`, `ref`, and `alt` with `:` as the separator (string)
- create a new column in `finemap` named `id` that is a copy of the `v` column (string)
- for every target variant (i.e., row) where `gwas['LD'] == 1`:
    - find the lead variant for the target using `chromosome` and `position` in `finemap`:
        - a lead variant is defined as the SNP with the highest `prob` value within a 15,000 base pair window centred on the target variant (i.e., at most 7,500 bp away)
    - if a lead variant is found, write its `id` to `gwas['lead']` in that row
    - otherwise, write the string `'NaN'` to `gwas['lead']` in that row
- return the full `gwas` dataframe

In [12]:
# Make sure that the required columns exist
assert 'prob' in finemap.columns
assert 'v' in finemap.columns
assert 'chromosome' in finemap.columns
assert 'position' in finemap.columns

assert '#chrom' in gwas.columns
assert 'pos' in gwas.columns
assert 'ref' in gwas.columns
assert 'alt' in gwas.columns
assert 'LD' in gwas.columns

# Create a new column in gwas named 'id' combining #chrom, pos, ref, and alt
gwas['id'] = gwas['#chrom'].astype(str) + ':' + gwas['pos'].astype(str) + ':' + gwas['ref'] + ':' + gwas['alt']

# Create a new column in finemap named 'id' from finemap['v']
finemap['id'] = finemap['v']

# Find the lead variants in finemap
lead_variants = finemap[finemap['lead_r2'] == 1].copy()

def get_highest_prob_lead(row):
    # Filter lead_variants for the same chromosome
    same_chrom = lead_variants[lead_variants['chromosome'] == row['#chrom']]
    
    # If there are no lead variants on the same chromosome, return 'NaN'
    if same_chrom.empty:
        return 'NaN'

    # Keep lead variants within 7,500 bp of the target (a 15,000 bp window centred on it)
    same_chrom = same_chrom[abs(same_chrom['position'] - row['pos']) <= 7500]

    # If there are no lead variants within the distance, return 'NaN'
    if same_chrom.empty:
        return 'NaN'

    # Sort same_chrom by 'prob' in descending order and select the first row
    highest_prob = same_chrom.sort_values(by='prob', ascending=False).iloc[0]

    return highest_prob['id']

# Initialize 'lead' column with 'NaN'
gwas['lead'] = 'NaN'

# Apply the function to each row in gwas where LD = 1
for index, row in gwas[gwas['LD'] == 1].iterrows():
    gwas.at[index, 'lead'] = get_highest_prob_lead(row)

# Return full gwas dataframe 
gwas


Unnamed: 0,#chrom,pos,ref,alt,rsids,nearest_genes,pval,mlogp,beta,sebeta,af_alt,af_alt_cases,af_alt_controls,causal,LD,id,lead
0,1,13668,G,A,rs2691328,OR4F5,0.944365,0.024860,-0.005926,0.084918,0.005842,0.005729,0.005863,0,0,1:13668:G:A,
1,1,14773,C,T,rs878915777,OR4F5,0.844305,0.073501,0.010088,0.051369,0.013495,0.013547,0.013485,0,0,1:14773:C:T,
2,1,15585,G,A,rs533630043,OR4F5,0.841908,0.074735,0.031464,0.157751,0.001113,0.001125,0.001110,0,0,1:15585:G:A,
3,1,16549,T,C,rs1262014613,OR4F5,0.343308,0.464316,0.241377,0.254711,0.000561,0.000620,0.000550,0,0,1:16549:T:C,
4,1,16567,G,C,rs1194064194,OR4F5,0.129883,0.886447,0.130736,0.086319,0.004170,0.004250,0.004154,0,0,1:16567:G:C,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20170001,23,155697920,G,A,,,0.027115,1.566790,-0.013475,0.006097,0.290961,0.286054,0.291879,0,0,23:155697920:G:A,
20170002,23,155698443,C,A,,,0.178417,0.748564,-0.069907,0.051951,0.003259,0.003022,0.003304,0,0,23:155698443:C:A,
20170003,23,155698490,C,T,,,0.279640,0.553400,-0.020245,0.018725,0.024406,0.024312,0.024423,0,0,23:155698490:C:T,
20170004,23,155699751,C,T,,,0.078864,1.103120,-0.011284,0.006421,0.244829,0.241257,0.245498,0,0,23:155699751:C:T,
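
The row-wise loop above is easy to follow but can be slow on large inputs. One possible vectorised alternative, sketched here on toy data and not taken from the original notebook, pairs each LD target with the lead variants on its chromosome via a merge, filters by distance, and keeps the highest-`prob` lead per target. `MAX_DIST`, `toy_gwas`, and `toy_leads` are illustrative names; ties in `prob` may resolve differently than in the loop.

```python
import pandas as pd

toy_gwas = pd.DataFrame({
    "#chrom": [1, 1, 2],
    "pos": [1000, 50000, 1000],
    "LD": [1, 1, 1],
})
toy_leads = pd.DataFrame({
    "chromosome": [1, 1, 2],
    "position": [3000, 6000, 90000],
    "prob": [0.4, 0.9, 0.8],
    "id": ["1:3000:G:A", "1:6000:T:C", "2:90000:C:T"],
})
MAX_DIST = 7500  # same threshold as the loop above

# Pair every LD target with every lead variant on the same chromosome;
# reset_index() keeps the original row label in a column named "index"
targets = toy_gwas[toy_gwas["LD"] == 1].reset_index()
pairs = targets.merge(toy_leads, left_on="#chrom", right_on="chromosome")

# Keep pairs within the distance threshold, then the highest-prob lead per target
near = pairs[(pairs["position"] - pairs["pos"]).abs() <= MAX_DIST]
best = near.sort_values("prob", ascending=False).drop_duplicates("index")

# Default to the 'NaN' string sentinel, then fill in the matches
toy_gwas["lead"] = "NaN"
toy_gwas.loc[best["index"], "lead"] = best["id"].to_numpy()

print(toy_gwas)
```

The merge builds every target-lead pair per chromosome before filtering, so memory use grows with the product of the two counts; for a few thousand LD rows this is usually acceptable.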


In [13]:
# Count non-"NaN" values in 'lead' column
non_na_count = len(gwas[gwas['lead'] != 'NaN'])

# Print the count
print("Number of non-'NaN' values in 'lead' column: ", non_na_count)

Number of non-'NaN' values in 'lead' column:  1125


### Extract `trait` from `finemap` to `gwas`

In [14]:
unique_trait = finemap['trait'].unique()
trait_string = unique_trait[0]
gwas['trait'] = trait_string
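
Taking the first element of `unique()` silently ignores any additional traits. A minimal sketch of a guard that fails loudly when `finemap` holds more than one trait, using toy frames with made-up values:

```python
import pandas as pd

toy_finemap = pd.DataFrame({"trait": ["T2D", "T2D", "T2D"]})
toy_gwas = pd.DataFrame({"pos": [100, 200]})

unique_traits = toy_finemap["trait"].unique()

# Refuse to broadcast a trait label if the source is ambiguous
if len(unique_traits) != 1:
    raise ValueError(f"expected exactly one trait, found {list(unique_traits)}")

toy_gwas["trait"] = unique_traits[0]
print(toy_gwas)
```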

## Export csv

In [15]:
#gwas.to_csv('gwas-causal.csv', index=False)
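
One thing to keep in mind when exporting: the `lead` column stores the literal string `'NaN'`, and `pd.read_csv` treats that string as a missing value by default. A round trip therefore turns the sentinel into a real `NaN` unless `keep_default_na=False` is passed. The sketch below shows both behaviours on a toy frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "id": ["1:100:G:A", "1:200:T:C"],
    "lead": ["1:150:C:T", "NaN"],  # 'NaN' is a string sentinel here
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: the 'NaN' string becomes a real missing value
buffer.seek(0)
reloaded = pd.read_csv(buffer)
missing_flags = reloaded["lead"].isna().tolist()

# keep_default_na=False preserves the string as written
buffer.seek(0)
as_strings = pd.read_csv(buffer, keep_default_na=False)

print(missing_flags)
print(as_strings["lead"].tolist())
```

After a default reload, a comparison such as `lead != 'NaN'` is true for every row, so counts of non-`'NaN'` entries should use `notna()` instead.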