# FinnGen - Data Engineering Notebook

## Data Descriptions

### gwas


The `{endpoint}.gz` file has the following structure:

| Column name   | Description                                                 |
| ------------- | ----------------------------------------------------------- |
| #chrom        | chromosome on build GRCh38 (1-23)                           |
| pos           | position in base pairs on build GRCh38                       |
| ref           | reference allele                                            |
| alt           | alternative allele (effect allele)                           |
| rsids         | variant identifier                                          |
| nearest_genes | nearest gene(s) (comma separated) from variant               |
| pval          | p-value from [source]                                        |
| mlogp         | -log10(p-value)                                             |
| beta          | effect size (log(OR) scale) estimated with [source]          |
| sebeta        | standard error of effect size estimated with [source]        |
| af_alt        | alternative (effect) allele frequency                        |
| af_alt_cases  | alternative (effect) allele frequency among cases            |
| af_alt_controls | alternative (effect) allele frequency among controls         |
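
The relationships between `pval`, `mlogp`, `beta`, and `sebeta` described in the table can be sanity-checked on a couple of rows. The sketch below is illustrative only: it uses the first two rows of the `gwas` preview shown later in this notebook, the DataFrame name `toy` and the derived column names are hypothetical, and the confidence interval assumes a normal approximation on the log(OR) scale.

```python
import numpy as np
import pandas as pd

# Two rows copied from the `gwas` preview in this notebook (toy subset).
toy = pd.DataFrame({
    "pval": [0.106658, 0.620115],
    "mlogp": [0.972006, 0.207528],
    "beta": [-0.114822, -0.021548],
    "sebeta": [0.071168, 0.043470],
})

# mlogp should equal -log10(pval), up to rounding in the file.
toy["mlogp_check"] = -np.log10(toy["pval"])

# beta is on the log(OR) scale, so exponentiating gives the odds ratio.
toy["odds_ratio"] = np.exp(toy["beta"])

# Approximate 95% interval for the odds ratio (normal approximation, an assumption).
toy["or_ci_low"] = np.exp(toy["beta"] - 1.96 * toy["sebeta"])
toy["or_ci_high"] = np.exp(toy["beta"] + 1.96 * toy["sebeta"])

print(toy[["mlogp", "mlogp_check", "odds_ratio", "or_ci_low", "or_ci_high"]])
```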


### causal

Data taken from:

[Functional characterization of T2D-associated SNP effects on baseline and ER stress-responsive β cell transcriptional activation](https://www.nature.com/articles/s41467-021-25514-6#MOESM8)

### finemap

`{endpoint}.SUSIE.snp.bgz` contains variant summaries with credible set information and has the following structure:

| Column name    | Description                                                        |
| -------------- | ------------------------------------------------------------------ |
| trait          | endpoint name                                                      |
| region         | chr:start-end                                                      |
| v              | variant identifier                                                 |
| rsid           | rs variant identifier                                              |
| chromosome     | chromosome on build GRCh38 (1-22, X)                                |
| position       | position in base pairs on build GRCh38                              |
| allele1        | reference allele                                                   |
| allele2        | alternative allele (effect allele)                                  |
| maf            | minor allele frequency                                             |
| beta           | effect size GWAS                                                   |
| se             | standard error GWAS                                                |
| p              | p-value GWAS                                                       |
| mean           | posterior expectation of true effect size                           |
| sd             | posterior standard deviation of true effect size                   |
| prob           | posterior probability of association                                |
| cs             | identifier of 95% credible set (-1 = variant is not part of credible set) |
| lead_r2        | r2 value to a lead variant (the one with maximum PIP) in a credible set |
| alphax         | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
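
As a rough illustration of how the `cs` and `prob` columns fit together, the sketch below keeps only variants that belong to a credible set (`cs != -1`) and picks the variant with the highest `prob` in each set. All values are made up for the example, and `toy_finemap`, `in_cs`, and `lead_variants` are hypothetical names that are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows following the column layout described above.
toy_finemap = pd.DataFrame({
    "region": ["chr1:100-200"] * 4,
    "v": ["1:110:A:G", "1:120:C:T", "1:130:G:A", "1:140:T:C"],
    "prob": [0.70, 0.25, 0.01, 0.90],
    "cs": [1, 1, -1, 2],
})

# cs == -1 means the variant is not part of any credible set.
in_cs = toy_finemap[toy_finemap["cs"] != -1]

# For each (region, credible set) pair, find the row with the largest posterior probability.
lead_idx = in_cs.groupby(["region", "cs"])["prob"].idxmax()
lead_variants = toy_finemap.loc[lead_idx, ["region", "cs", "v", "prob"]]

print(lead_variants)
```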

## Libraries

In [1]:
import sys
import pandas as pd
import numpy as np
import requests
import time
from concurrent.futures import ThreadPoolExecutor



print("Python version:", sys.version)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)

Python version: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
Pandas version: 1.5.3
NumPy version: 1.24.1


## Import data

In [2]:
import os

# Get the current working directory
current_directory = os.getcwd()

print(current_directory)

C:\Users\Windows\Desktop\GeoGWAS\FinnGen\notebooks\finemapping


In [3]:
# Read the 'finemap' file into a pandas DataFrame
finemap = pd.read_csv('C:/Users/Windows/Desktop/GeoGWAS/FinnGen/data/finemapping_full_finngen_R9_I9_HYPTENS.FINEMAP.snp.tsv', low_memory=False, sep='\t')

# Read the 'precausal' file into a pandas DataFrame (currently disabled)
#precausal = pd.read_csv('C:/Users/Windows/Desktop/GeoGWAS/FinnGen/data/precausal-t2d.csv', low_memory=False)

# Read the 'causal' file into a pandas DataFrame
#causal = pd.read_csv('C:/Users/Windows/Desktop/GeoGWAS/FinnGen/data/causal-t2d.csv', low_memory=False)

# Read the 'gwas' file into a pandas DataFrame
gwas = pd.read_csv('C:/Users/Windows/Desktop/GeoGWAS/FinnGen/data/summary_stats_finngen_R9_I9_HYPTENS.tsv', low_memory=False, sep='\t')

In [4]:
print("NaNs and missing values in 'gwas':")
empty = gwas.isna().sum()
print(empty)

NaNs and missing values in 'gwas':
#chrom                   0
pos                      0
ref                      0
alt                      0
rsids              1366441
nearest_genes       727861
pval                     0
mlogp                    0
beta                     0
sebeta                   0
af_alt                   0
af_alt_cases             0
af_alt_controls          0
dtype: int64
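
Raw counts are hard to judge against a table of this size, so it can help to look at the share of missing values per column instead. The following is a minimal sketch on a made-up frame; `toy` and `missing_fraction` are hypothetical names, and the same call could be applied to `gwas`.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in the annotation columns only.
toy = pd.DataFrame({
    "rsids": ["rs1", np.nan, np.nan, "rs4"],
    "nearest_genes": ["OR4F5", "OR4F5", np.nan, "OR4F5"],
    "pval": [0.1, 0.2, 0.3, 0.4],
})

# isna() gives a boolean frame; the column-wise mean is the fraction of missing values.
missing_fraction = toy.isna().mean()

print(missing_fraction)
```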


## Explore data

In [5]:
gwas

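|          | #chrom | pos       | ref | alt | rsids        | nearest_genes | pval     | mlogp    | beta      | sebeta   | af_alt   | af_alt_cases | af_alt_controls |
| -------- | ------ | --------- | --- | --- | ------------ | ------------- | -------- | -------- | --------- | -------- | -------- | ------------ | --------------- |
| 0        | 1      | 13668     | G   | A   | rs2691328    | OR4F5         | 0.106658 | 0.972006 | -0.114822 | 0.071168 | 0.005846 | 0.005683     | 0.005914        |
| 1        | 1      | 14773     | C   | T   | rs878915777  | OR4F5         | 0.620115 | 0.207528 | -0.021548 | 0.043470 | 0.013501 | 0.013448     | 0.013524        |
| 2        | 1      | 15585     | G   | A   | rs533630043  | OR4F5         | 0.859628 | 0.065689 | -0.023716 | 0.134105 | 0.001112 | 0.001117     | 0.001109        |
| 3        | 1      | 16549     | T   | C   | rs1262014613 | OR4F5         | 0.321844 | 0.492355 | -0.215787 | 0.217818 | 0.000563 | 0.000556     | 0.000566        |
| 4        | 1      | 16567     | G   | C   | rs1194064194 | OR4F5         | 0.764225 | 0.116779 | 0.021523  | 0.071757 | 0.004192 | 0.004207     | 0.004186        |
| ...      | ...    | ...       | ... | ... | ...          | ...           | ...      | ...      | ...       | ...      | ...      | ...          | ...             |
| 20170229 | 23     | 155697920 | G   | A   | NaN          | NaN           | 0.435606 | 0.360906 | 0.004066  | 0.005215 | 0.291210 | 0.291159     | 0.291231        |
| 20170230 | 23     | 155698443 | C   | A   | NaN          | NaN           | 0.559723 | 0.252027 | 0.025926  | 0.044450 | 0.003263 | 0.003298     | 0.003248        |
| 20170231 | 23     | 155698490 | C   | T   | NaN          | NaN           | 0.007623 | 2.117900 | -0.043135 | 0.016165 | 0.024340 | 0.023984     | 0.024489        |
| 20170232 | 23     | 155699751 | C   | T   | NaN          | NaN           | 0.160090 | 0.795637 | 0.007715  | 0.005492 | 0.245151 | 0.245738     | 0.244904        |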


In [6]:
#precausal

In [7]:
#causal.head()

In [8]:
finemap

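|         | trait      | region                   | v                       | index | rsid                     | chromosome | position  | allele1   | allele2 | maf      | ... | cs  | cs2 | cs3 | cs4 | cs5 | cs6 | cs7 | cs8 | cs9 | cs10 |
| ------- | ---------- | ------------------------ | ----------------------- | ----- | ------------------------ | ---------- | --------- | --------- | ------- | -------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---- |
| 0       | I9_HYPTENS | chr1:1912095-4912095     | 1:1912100:C:G           | 1     | chr1_1912100_C_G         | chr1       | 1912100   | C         | G       | 0.015211 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 1       | I9_HYPTENS | chr1:1912095-4912095     | 1:1912140:T:G           | 2     | chr1_1912140_T_G         | chr1       | 1912140   | T         | G       | 0.163095 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 2       | I9_HYPTENS | chr1:1912095-4912095     | 1:1912582:C:T           | 3     | chr1_1912582_C_T         | chr1       | 1912582   | C         | T       | 0.002733 | ... | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 3       | I9_HYPTENS | chr1:1912095-4912095     | 1:1912607:C:T           | 4     | chr1_1912607_C_T         | chr1       | 1912607   | C         | T       | 0.486808 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 4       | I9_HYPTENS | chr1:1912095-4912095     | 1:1913112:AATTTTTTT:A   | 5     | chr1_1913112_AATTTTTTT_A | chr1       | 1913112   | AATTTTTTT | A       | 0.036309 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| ...     | ...        | ...                      | ...                     | ...   | ...                      | ...        | ...       | ...       | ...     | ...      | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...  |
| 3501810 | I9_HYPTENS | chrX:101186298-104186298 | X:104183802:G:A         | 11389 | chrX_104183802_G_A       | chrX       | 104183802 | G         | A       | 0.018745 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 3501811 | I9_HYPTENS | chrX:101186298-104186298 | X:104184456:G:A         | 11390 | chrX_104184456_G_A       | chrX       | 104184456 | G         | A       | 0.000862 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 3501812 | I9_HYPTENS | chrX:101186298-104186298 | X:104184803:C:G         | 11391 | chrX_104184803_C_G       | chrX       | 104184803 | C         | G       | 0.001040 | ... | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |
| 3501813 | I9_HYPTENS | chrX:101186298-104186298 | X:104184890:T:C         | 11392 | chrX_104184890_T_C       | chrX       | 104184890 | T         | C       | 0.019200 | ... | -1  | 2   | -1  | -1  | -1  | -1  | -1  | -1  | -1  | -1   |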


In [None]:
def explore_dataframe(dataframe, dataframe_name):
    print("=== DataFrame Exploration: {} ===".format(dataframe_name))
    print("Number of Rows: {}".format(dataframe.shape[0]))
    print("Number of Columns: {}".format(dataframe.shape[1]))
    print("Column Names: {}".format(", ".join(dataframe.columns)))
    print("\nData Types of Columns:")
    print(dataframe.dtypes)
    print("\nNull Value Counts:")
    print(dataframe.isnull().sum())
    print("\nSummary Statistics:")
    print(dataframe.describe())
    print("=== End of DataFrame Exploration: {} ===\n".format(dataframe_name))
    
#explore_dataframe(gwas, "gwas")
#explore_dataframe(causal, "causal")
#explore_dataframe(finemap, "finemap")

## Data manipulation

### Adjust `chromosome` in `finemap`

In [None]:
# Extract number from 'chromosome' and replace 'X' with '23'
finemap['chromosome'] = finemap['chromosome'].str.extract(r'(\d+|X)', expand=False).replace('X', '23')

# Convert 'chromosome' column to 'int64'
finemap['chromosome'] = finemap['chromosome'].astype('int64')

# Assertions to verify the data manipulations
assert finemap['chromosome'].dtype == 'int64'  
assert finemap['chromosome'].isin(range(1, 24)).all()  
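
The same normalisation can be wrapped in a small helper that fails loudly on labels it does not recognise, rather than producing `NaN` values that only surface at the integer cast. This is a sketch under assumptions: `normalize_chromosome` is a hypothetical name, and it assumes labels look like `chr1` ... `chr22`, `chrX`, or the same without the `chr` prefix.

```python
import pandas as pd


def normalize_chromosome(labels: pd.Series) -> pd.Series:
    """Map labels such as 'chr1' or 'chrX' to integers 1-23 (X becomes 23)."""
    # Anchored pattern: optional 'chr' prefix, then digits or a single 'X'.
    core = labels.astype(str).str.extract(r"^(?:chr)?(\d+|X)$", expand=False)
    if core.isna().any():
        bad = labels[core.isna()].unique().tolist()
        raise ValueError(f"Unrecognised chromosome labels: {bad}")
    out = core.replace("X", "23").astype("int64")
    if not out.between(1, 23).all():
        raise ValueError("Chromosome number outside the expected range 1-23.")
    return out


example = normalize_chromosome(pd.Series(["chr1", "chrX", "22"]))
print(example.tolist())
```

The anchored pattern is a deliberate choice here: an unanchored `(\d+|X)` would also match digits embedded in an unexpected label.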

### Adjust `v` in `finemap`

In [None]:
# Replace 'X' with '23' in 'v' column of finemap
finemap['v'] = finemap['v'].str.replace(r'(^X:)', '23:', regex=True)

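# Assert that no identifier in the 'v' column still starts with 'X:'
# (note: `'X' not in series` would test the index labels, not the values)
assert not finemap['v'].str.startswith('X:').any()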

### Create `finemapped` in `gwas`

In [None]:
# Create the 'id' column in the 'gwas' DataFrame
gwas['id'] = gwas['#chrom'].astype(str) + ':' + gwas['pos'].astype(str) + ':' + gwas['ref'] + ':' + gwas['alt']

# Create a set for faster lookup
finemap_set = set(finemap['v'].values)

# Flag variants whose 'id' appears in the finemap set (vectorised lookup)
gwas['finemapped'] = gwas['id'].isin(finemap_set).astype(int)

# Count the number of 1s in the 'finemapped' column
count_ones = gwas['finemapped'].sum()

# Perform assertions to validate the results
assert len(gwas) == len(gwas['id']) == len(gwas['finemapped']), "Lengths do not match."
assert count_ones <= len(gwas), "Invalid count of 1s."

print("Assertions passed successfully.")
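
Because the flag depends on the two ID formats agreeing exactly (`chrom:pos:ref:alt`), it may be worth checking the match in the other direction too: which finemap identifiers found no partner in `gwas`. The sketch below uses made-up rows; `toy_gwas`, `toy_finemap_ids`, and `unmatched` are hypothetical names.

```python
import pandas as pd

# Made-up GWAS rows and finemap identifiers.
toy_gwas = pd.DataFrame({
    "#chrom": [1, 1, 23],
    "pos": [100, 200, 300],
    "ref": ["A", "C", "G"],
    "alt": ["G", "T", "A"],
})
toy_finemap_ids = pd.Series(["1:100:A:G", "23:300:G:A", "2:5:T:C"])

# Build the same 'chrom:pos:ref:alt' identifier used in the notebook.
toy_gwas["id"] = (
    toy_gwas["#chrom"].astype(str) + ":" + toy_gwas["pos"].astype(str)
    + ":" + toy_gwas["ref"] + ":" + toy_gwas["alt"]
)

# Forward direction: flag GWAS variants that were fine-mapped.
toy_gwas["finemapped"] = toy_gwas["id"].isin(toy_finemap_ids).astype(int)

# Reverse direction: finemap identifiers with no matching GWAS row.
unmatched = toy_finemap_ids[~toy_finemap_ids.isin(toy_gwas["id"])]

print(toy_gwas[["id", "finemapped"]])
print(unmatched.tolist())
```

A large `unmatched` list on real data would hint at a formatting mismatch (for example, an `X` that was not recoded to `23`) rather than a genuine absence.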

### Create `causal` in `gwas`

### Create `precausal` in `gwas`

### Extract `trait` from `finemap` to `gwas`

In [None]:
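# The finemap file is expected to cover a single endpoint; copy its name to 'gwas'
unique_trait = finemap['trait'].unique()
assert len(unique_trait) == 1, "Expected exactly one trait in 'finemap'."
trait_string = unique_trait[0]
gwas['trait'] = trait_string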

## Export csv

In [None]:
print(gwas['finemapped'].sum())

In [None]:
gwas.to_csv('gwas-hyptens.csv', index=False)
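
With roughly 20 million rows, the exported file is large, so writing it gzip-compressed and reading it back once to confirm the columns survived may be worthwhile. The following is only a sketch on a made-up frame: the file name `toy-gwas.csv.gz` and the variables `toy` and `reloaded` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with the columns added in this notebook.
toy = pd.DataFrame({
    "id": ["1:100:A:G", "23:300:G:A"],
    "finemapped": [1, 0],
    "trait": ["I9_HYPTENS", "I9_HYPTENS"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy-gwas.csv.gz")
    # Write without the index, compressed with gzip.
    toy.to_csv(out_path, index=False, compression="gzip")
    # Read the file back; compression is inferred from the '.gz' suffix.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(reloaded.columns.tolist())
```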