# FinnGen - Data Engineering Notebook - Causal

## Data Descriptions

### gwas


The `{endpoint}.gz` file has the following structure:

| Column name   | Description                                                 |
| ------------- | ----------------------------------------------------------- |
| #chrom        | chromosome on build GRCh38 (1-23)                           |
| pos           | position in base pairs on build GRCh38                       |
| ref           | reference allele                                            |
| alt           | alternative allele (effect allele)                           |
| rsids         | variant identifier                                          |
| nearest_genes | nearest gene(s) (comma separated) from variant               |
| pval          | p-value from [source]                                        |
| mlogp         | -log10(p-value)                                             |
| beta          | effect size (log(OR) scale) estimated with [source]          |
| sebeta        | standard error of effect size estimated with [source]        |
| af_alt        | alternative (effect) allele frequency                        |
| af_alt_cases  | alternative (effect) allele frequency among cases            |
| af_alt_controls | alternative (effect) allele frequency among controls         |
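
The `pval`, `mlogp`, `beta`, and `sebeta` columns are linked by simple transformations. The following is a minimal sketch with made-up numbers (not FinnGen values); the 95% interval assumes a normal approximation on the log(OR) scale.

```python
import math

# Illustrative values only; not taken from the FinnGen files.
pval = 1e-8
beta = 0.05      # effect size on the log(OR) scale
sebeta = 0.01    # standard error of beta

# mlogp is the negative base-10 logarithm of the p-value.
mlogp = -math.log10(pval)

# An effect on the log(OR) scale converts to an odds ratio by exponentiation.
odds_ratio = math.exp(beta)

# Approximate 95% confidence interval for the odds ratio (normal approximation).
ci_low = math.exp(beta - 1.96 * sebeta)
ci_high = math.exp(beta + 1.96 * sebeta)

print(f"mlogp={mlogp:.2f}, OR={odds_ratio:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```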


### finemap

`{endpoint}.SUSIE.snp.bgz` contains variant summaries with credible set information and has the following structure:

| Column name    | Description                                                        |
| -------------- | ------------------------------------------------------------------ |
| trait          | endpoint name                                                      |
| region         | chr:start-end                                                      |
| v              | variant identifier                                                 |
| rsid           | rs variant identifier                                              |
| chromosome     | chromosome on build GRCh38 (1-22, X)                                |
| position       | position in base pairs on build GRCh38                              |
| allele1        | reference allele                                                   |
| allele2        | alternative allele (effect allele)                                  |
| maf            | minor allele frequency                                             |
| beta           | effect size GWAS                                                   |
| se             | standard error GWAS                                                |
| p              | p-value GWAS                                                       |
| mean           | posterior expectation of true effect size                           |
| sd             | posterior standard deviation of true effect size                   |
| prob           | posterior probability of association                                |
| cs             | identifier of 95% credible set (-1 = variant is not part of credible set) |
| lead_r2        | r2 value to a lead variant (the one with maximum PIP) in a credible set |
| alphax         | posterior inclusion probability for the x-th single effect (x := 1..L where L is the number of single effects (causal variants) specified; default: L = 10) |
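
In SuSiE, a variant's overall posterior inclusion probability combines its per-effect probabilities as PIP = 1 - prod_x (1 - alpha_x). The sketch below illustrates that relationship with made-up `alphax` values; `pip_from_alphas` is a hypothetical helper. Whether the `prob` column equals this quantity for every variant (for example after low-purity effects are filtered) is an assumption that should be checked against the data.

```python
import math

def pip_from_alphas(alphas):
    """Combine per-effect inclusion probabilities into a single PIP."""
    return 1.0 - math.prod(1.0 - a for a in alphas)

# Hypothetical alpha1..alpha3 values for one variant (not FinnGen data).
alphas = [0.60, 0.05, 0.00]
pip = pip_from_alphas(alphas)
print(round(pip, 4))
```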

## Libraries

In [1]:
import sys
import pandas as pd
import numpy as np
import requests
import time
from concurrent.futures import ThreadPoolExecutor



print("Python version:", sys.version)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)

Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
Pandas version: 2.1.0
NumPy version: 1.24.1


## Import data

In [2]:
import os

# Get the current working directory
current_directory = os.getcwd()

print(current_directory)

C:\Users\Windows\Desktop\Research\PhD\GeoGWAS\FinnGen\notebooks\causal


In [3]:
# Read the 'finemap' file into a pandas DataFrame
finemap = pd.read_csv('C:/Users/Windows/Desktop/Research/PhD/GeoGWAS/FinnGen/data/finemapping_full_finngen_R9_I9_HYPTENS.SUSIE.snp.txt', low_memory=False, sep='\t')


# Read the 'causal' file into a pandas DataFrame
causal = pd.read_csv('C:/Users/Windows/Desktop/Research/PhD/GeoGWAS/FinnGen/data/causal-hyp.tsv', low_memory=False, sep='\t')


# Read the 'gwas' file into a pandas DataFrame
gwas = pd.read_csv('C:/Users/Windows/Desktop/Research/PhD/GeoGWAS/FinnGen/data/summary_stats_finngen_R9_I9_HYPTENS.tsv', low_memory=False, sep='\t')
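
The summary-statistics table is large, so it can help to load only the columns that are needed and to give them compact types. The sketch below shows this with `usecols` and `dtype` on a tiny in-memory stand-in for the real file; the rows are made up, and the column subset and dtypes are illustrative assumptions rather than choices made in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the tab-separated summary statistics file (made-up rows).
toy_tsv = (
    "#chrom\tpos\tref\talt\trsids\tpval\tbeta\n"
    "1\t13668\tG\tA\trs1\t0.10\t-0.11\n"
    "23\t155697920\tG\tA\t\t0.43\t0.004\n"
)

toy = pd.read_csv(
    io.StringIO(toy_tsv),
    sep="\t",
    usecols=["#chrom", "pos", "ref", "alt", "pval", "beta"],  # skip unused columns
    dtype={"#chrom": "int8", "pos": "int32", "pval": "float64", "beta": "float32"},
)

print(toy.dtypes)
```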

In [4]:
print("NaNs and missing values in 'gwas':")
empty = gwas.isna().sum()
print(empty)

NaNs and missing values in 'gwas':
#chrom                   0
pos                      0
ref                      0
alt                      0
rsids              1366441
nearest_genes       727861
pval                     0
mlogp                    0
beta                     0
sebeta                   0
af_alt                   0
af_alt_cases             0
af_alt_controls          0
dtype: int64


## Explore data

In [5]:
gwas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20170234 entries, 0 to 20170233
Data columns (total 13 columns):
 #   Column           Dtype  
---  ------           -----  
 0   #chrom           int64  
 1   pos              int64  
 2   ref              object 
 3   alt              object 
 4   rsids            object 
 5   nearest_genes    object 
 6   pval             float64
 7   mlogp            float64
 8   beta             float64
 9   sebeta           float64
 10  af_alt           float64
 11  af_alt_cases     float64
 12  af_alt_controls  float64
dtypes: float64(7), int64(2), object(4)
memory usage: 2.0+ GB


In [6]:
finemap.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3501815 entries, 0 to 3501814
Data columns (total 66 columns):
 #   Column               Dtype  
---  ------               -----  
 0   trait                object 
 1   region               object 
 2   v                    object 
 3   rsid                 object 
 4   chromosome           object 
 5   position             int64  
 6   allele1              object 
 7   allele2              object 
 8   maf                  float64
 9   beta                 float64
 10  se                   float64
 11  p                    float64
 12  mean                 float64
 13  sd                   float64
 14  prob                 float64
 15  cs                   int64  
 16  cs_specific_prob     float64
 17  low_purity           float64
 18  lead_r2              float64
 19  mean_99              float64
 20  sd_99                float64
 21  prob_99              float64
 22  cs_99                int64  
 23  cs_specific_prob_99  float64
 24

In [7]:
causal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 31 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   association_info.ancestry                 49 non-null     object 
 1   association_info.doi                      11 non-null     object 
 2   association_info.gwas_catalog_id          12 non-null     object 
 3   association_info.neg_log_pval             49 non-null     float64
 4   association_info.otg_id                   49 non-null     object 
 5   association_info.pubmed_id                12 non-null     float64
 6   association_info.url                      26 non-null     object 
 7   gold_standard_info.evidence.class         49 non-null     object 
 8   gold_standard_info.evidence.confidence    49 non-null     object 
 9   gold_standard_info.evidence.curated_by    49 non-null     object 
 10  gold_standard_info.evidence.description 

## Data manipulation

### Drop `cs_99=-1` in `finemap`

In [8]:
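# Keep only variants with a cs_99 assignment (drop -1); copy so later column assignments do not act on a view
finemap = finemap[finemap['cs_99'] != -1].copy()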

### Print unique values of `finemap['cs_99']` across chroms

In [9]:
# Get unique values of cs_99 for each chromosome
pd.set_option('display.max_rows', None)
unique_values = finemap.groupby('chromosome')['cs_99'].unique()

# Print the unique values
for chrom, values in unique_values.items():
    print(f"{chrom}: {values}")

chr1: [1 2 3]
chr10: [1 2 4 3]
chr11: [3 1 2]
chr12: [1 2 4 3]
chr13: [1]
chr14: [1]
chr15: [1 2 3]
chr16: [2 1 4]
chr17: [4 2 1 3 6 5]
chr18: [1 2 7]
chr19: [1 3 2 4 6 5 8]
chr2: [2 1 4 5 3]
chr20: [1 3 2 4 5]
chr21: [1 3]
chr22: [1 2]
chr3: [1 2 9 3]
chr4: [1 4 2 3]
chr5: [1 2 3]
chr6: [1 2]
chr7: [2 1 3 8]
chr8: [1]
chr9: [1 2]
chrX: [1]
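
Because the `chromosome` labels are strings, the groups above are listed in lexicographic order (`chr1`, `chr10`, `chr11`, ...). The following is a small sketch of a natural sort key; `chrom_sort_key` is a hypothetical helper, and `X` is mapped to 23 to match the coding used in the `gwas` table.

```python
def chrom_sort_key(label):
    """Map 'chr1'..'chr22' to 1..22 and 'chrX' to 23 for sorting."""
    name = label.removeprefix("chr")
    return 23 if name == "X" else int(name)

# A few labels in the lexicographic order produced by groupby on strings.
labels = ["chr1", "chr10", "chr11", "chr2", "chr22", "chr3", "chrX"]
ordered = sorted(labels, key=chrom_sort_key)
print(ordered)
```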


In [10]:
# Get frequency of values of cs_99 for each chromosome
value_frequencies = finemap.groupby('chromosome')['cs_99'].value_counts()


# Print the frequency values
print(value_frequencies)
pd.reset_option('display.max_rows')


chromosome  cs_99
chr1        3        18445
            1         4187
            2          687
chr10       2         6928
            4         3812
            1          268
            3           54
chr11       3         2364
            1         2040
            2          373
chr12       4        20491
            3         3099
            1          106
            2           99
chr13       1          106
chr14       1          163
chr15       1          228
            2          176
            3           23
chr16       4        17727
            2          303
            1          255
chr17       2         7592
            4          306
            6          222
            3          182
            1           42
            5            7
chr18       7        20674
            1           63
            2           22
chr19       3        23989
            8        19346
            1           92
            6           65
            2           54
          

### Adjust `chromosome` in `finemap`

In [11]:
# Extract the chromosome label (digits or 'X') and replace 'X' with '23'
finemap['chromosome'] = finemap['chromosome'].str.extract(r'(\d+|X)', expand=False).replace('X', '23')

# Convert 'chromosome' column to 'int64'
finemap['chromosome'] = finemap['chromosome'].astype('int64')

# Assertions to verify the data manipulations
assert finemap['chromosome'].dtype == 'int64'  
assert finemap['chromosome'].isin(range(1, 24)).all()  

### Adjust `v` in `finemap`

In [12]:
# Replace 'X' with '23' in 'v' column of finemap
finemap['v'] = finemap['v'].str.replace(r'(^X:)', '23:', regex=True)

# Assert that no identifier in the 'v' column still starts with 'X:'
assert not finemap['v'].str.startswith('X:').any()
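
When checking a column for leftover values, note that the `in` operator applied to a pandas Series tests the index labels, not the values. The sketch below, on made-up variant identifiers, contrasts that behaviour with a test on the values themselves via `str.startswith`.

```python
import pandas as pd

# Made-up variant identifiers in chrom:pos:ref:alt form.
v = pd.Series(["X:100:A:G", "1:200:C:T"])

# `in` checks the index labels (0 and 1 here), not the values.
value_in_series = "X:100:A:G" in v
label_in_series = 0 in v
print(value_in_series, label_in_series)

# Testing the values explicitly.
has_x = bool(v.str.startswith("X:").any())
print(has_x)

# After the replacement, no identifier starts with 'X:'.
v_fixed = v.str.replace(r"^X:", "23:", regex=True)
assert not v_fixed.str.startswith("X:").any()
print(v_fixed.tolist())
```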

### Credible Set Analysis for `finemap`

#### lead_r2 selection

#### PIP > 0.5

### Create `finemapped` in `gwas`

In [13]:
# Create the 'id' column in the 'gwas' DataFrame
gwas['id'] = gwas['#chrom'].astype(str) + ':' + gwas['pos'].astype(str) + ':' + gwas['ref'] + ':' + gwas['alt']

# Create the 'id' column in the 'finemap' DataFrame
finemap['id'] = finemap['v']

# Create a set for faster lookup
finemap_set = set(finemap['id'].values)

# Flag variants whose 'id' is present in the finemap set (vectorized lookup)
gwas['finemapped'] = gwas['id'].isin(finemap_set).astype('int64')

# Count the number of 1s in the 'finemapped' column
count_ones = gwas['finemapped'].sum()

# Perform assertions to validate the results
assert len(gwas) == len(gwas['id']) == len(gwas['finemapped']), "Lengths do not match."
assert count_ones <= len(gwas), "Invalid count of 1s."

print("Assertions passed successfully.")

Assertions passed successfully.


### From `finemap`, add `prob`, `lead_r2`, and `cs_99` to gwas

In [14]:
# Use merge to add 'prob', 'lead_r2', and 'cs_99' columns from 'finemap' to 'gwas'
gwas = gwas.merge(finemap[['id', 'prob', 'lead_r2', 'cs_99']], on='id', how='left')

# Fill NA values for rows in 'gwas' that didn't have a matching 'id' in 'finemap'
gwas[['prob', 'lead_r2', 'cs_99']] = gwas[['prob', 'lead_r2', 'cs_99']].fillna(value=0)
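
A left merge keeps every `gwas` row, but it also duplicates a `gwas` row whenever its `id` occurs more than once on the right-hand side. Whether `finemap` contains repeated `id` values (for instance from overlapping regions) is not shown in this notebook, so the sketch below uses made-up data to illustrate one way to check: count duplicated keys and pass `validate='many_to_one'` so that pandas raises if the right-hand keys are not unique. Keeping the row with the highest `prob` is only one possible resolution and is an assumption here.

```python
import pandas as pd

left = pd.DataFrame({"id": ["1:10:A:G", "1:20:C:T", "2:30:G:A"]})
right = pd.DataFrame({
    "id": ["1:10:A:G", "1:10:A:G", "2:30:G:A"],   # first id appears twice
    "prob": [0.90, 0.40, 0.10],
})

# Count repeated keys on the right-hand side.
n_dup = int(right["id"].duplicated().sum())
print("duplicated ids on the right:", n_dup)

# validate='many_to_one' raises MergeError when right-hand keys are not unique.
try:
    left.merge(right, on="id", how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print("merge passed validation:", merge_ok)

# One possible resolution (an assumption): keep the highest `prob` per id.
right_unique = right.sort_values("prob", ascending=False).drop_duplicates("id")
merged = left.merge(right_unique, on="id", how="left", validate="many_to_one")
print(merged)
```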

In [15]:
gwas

Unnamed: 0,#chrom,pos,ref,alt,rsids,nearest_genes,pval,mlogp,beta,sebeta,af_alt,af_alt_cases,af_alt_controls,id,finemapped,prob,lead_r2,cs_99
0,1,13668,G,A,rs2691328,OR4F5,0.106658,0.972006,-0.114822,0.071168,0.005846,0.005683,0.005914,1:13668:G:A,0,0.0,0.0,0.0
1,1,14773,C,T,rs878915777,OR4F5,0.620115,0.207528,-0.021548,0.043470,0.013501,0.013448,0.013524,1:14773:C:T,0,0.0,0.0,0.0
2,1,15585,G,A,rs533630043,OR4F5,0.859628,0.065689,-0.023716,0.134105,0.001112,0.001117,0.001109,1:15585:G:A,0,0.0,0.0,0.0
3,1,16549,T,C,rs1262014613,OR4F5,0.321844,0.492355,-0.215787,0.217818,0.000563,0.000556,0.000566,1:16549:T:C,0,0.0,0.0,0.0
4,1,16567,G,C,rs1194064194,OR4F5,0.764225,0.116779,0.021523,0.071757,0.004192,0.004207,0.004186,1:16567:G:C,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20170229,23,155697920,G,A,,,0.435606,0.360906,0.004066,0.005215,0.291210,0.291159,0.291231,23:155697920:G:A,0,0.0,0.0,0.0
20170230,23,155698443,C,A,,,0.559723,0.252027,0.025926,0.044450,0.003263,0.003298,0.003248,23:155698443:C:A,0,0.0,0.0,0.0
20170231,23,155698490,C,T,,,0.007623,2.117900,-0.043135,0.016165,0.024340,0.023984,0.024489,23:155698490:C:T,0,0.0,0.0,0.0
20170232,23,155699751,C,T,,,0.160090,0.795637,0.007715,0.005492,0.245151,0.245738,0.244904,23:155699751:C:T,0,0.0,0.0,0.0


### Create `causal` in `gwas` from gold-standard

In [16]:
# create a combined column in both dataframes
causal['combined'] = causal['sentinel_variant.locus_GRCh38.chromosome'].astype(str) + '_' + causal['sentinel_variant.locus_GRCh38.position'].astype(str)
gwas['combined'] = gwas['#chrom'].astype(str) + '_' + gwas['pos'].astype(str)

# create a set from the 'combined' column in causal
causal_combined_set = set(causal['combined'])

# create a new column 'causal' in gwas that flags rows whose 'combined' key is in causal_combined_set
gwas['causal'] = gwas['combined'].isin(causal_combined_set).astype('int64')

print(gwas['causal'].sum())

33
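
Matching on chromosome and position alone flags every allele at a matching site, and it relies on both tables coding the chromosome the same way. The gold-standard file may use `X` where `gwas` uses 23; this is an assumption to verify, not something shown above. The sketch below uses made-up data and hypothetical short column names to normalize the chromosome and to run `merge` with `indicator=True`, which reveals which gold-standard loci found a match.

```python
import pandas as pd

gold = pd.DataFrame({"chromosome": ["1", "X", "5"], "position": [100, 200, 999]})
stats = pd.DataFrame({
    "#chrom": [1, 1, 23, 2],
    "pos": [100, 100, 200, 50],
    "alt": ["A", "T", "G", "C"],   # two alleles share site 1:100
})

# Normalize gold-standard chromosomes to the integer coding used by `stats`.
gold["#chrom"] = gold["chromosome"].replace("X", "23").astype("int64")
gold = gold.rename(columns={"position": "pos"})[["#chrom", "pos"]]

# Which gold-standard loci have at least one matching row in `stats`?
sites = stats[["#chrom", "pos"]].drop_duplicates()
check = gold.merge(sites, on=["#chrom", "pos"], how="left", indicator=True)
print(check)

# Flag rows at gold-standard sites; both alleles at 1:100 are flagged.
keys = set(zip(gold["#chrom"], gold["pos"]))
stats["causal"] = [int(k in keys) for k in zip(stats["#chrom"], stats["pos"])]
print(stats)
```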


### Create `causal` in `gwas` from `lead_r2_analysis`

### Create `causal` in `gwas` from `pip_50`

### Extract `trait` from `finemap` to `gwas`

In [17]:
unique_trait = finemap['trait'].unique()
trait_string = unique_trait[0]
gwas['trait'] = trait_string

### Create `gwas_fm` and combine `causal` and `causal_OTG`

In [18]:
# Create a subset where 'finemapped' is equal to 1
gwas_fm = gwas[gwas['finemapped'] == 1]

### Check for `finemapped=1` & `causal=1`

In [19]:
# Find rows where 'finemapped' and 'causal' are both equal to 1
filtered_rows = gwas_fm[(gwas_fm['finemapped'] == 1) & (gwas_fm['causal'] == 1)]

# Print the total number of rows meeting the criteria
print(len(filtered_rows))

17
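
The overlap between the two flags can also be summarized for all variants at once with a contingency table. The sketch below applies `pd.crosstab` to made-up flags.

```python
import pandas as pd

# Made-up flags for six variants.
flags = pd.DataFrame({
    "finemapped": [1, 1, 0, 0, 1, 0],
    "causal":     [1, 0, 0, 1, 1, 0],
})

# Rows index `finemapped`, columns index `causal`.
table = pd.crosstab(flags["finemapped"], flags["causal"])
print(table)

# The cell at (1, 1) counts variants that carry both flags.
both = int(table.loc[1, 1])
print(both)
```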


### Plot `pos` vs `mlogp`

### Print total number of unique genes

In [20]:
# Split the 'nearest_genes' column into multiple columns using comma as a delimiter
gwas_fm_split = gwas_fm['nearest_genes'].str.split(',', expand=True)

# Name the new columns with prefix 'gene_'
gwas_fm_split.columns = [f'gene_{n}' for n in range(gwas_fm_split.shape[1])]

# Join the new columns back to the original data frame
gwas_fm = gwas_fm.join(gwas_fm_split)

# List of new columns
new_cols = [f'gene_{n}' for n in range(gwas_fm_split.shape[1])]

# Get all unique values across the specified columns and remove NaN
unique_values = pd.unique(gwas_fm[new_cols].stack(dropna=True))

# Print the total number of unique strings
print(f'Total number of unique strings: {len(unique_values)}')

Total number of unique strings: 1577
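
A similar count can be obtained without adding `gene_*` columns, by splitting the comma-separated strings and exploding them into one row per gene. This is a sketch on made-up gene lists; stripping whitespace around names is an added assumption about how the values are delimited.

```python
import pandas as pd

# Made-up comma-separated gene lists, including a missing value.
nearest_genes = pd.Series(["GENE1,GENE2", "GENE2", None, "GENE3, GENE1"])

genes = (
    nearest_genes
    .dropna()          # ignore variants without a nearest gene
    .str.split(",")    # one list of names per variant
    .explode()         # one row per gene name
    .str.strip()       # remove stray whitespace around names
)
n_unique = genes.nunique()
print(n_unique)
```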


In [21]:
gwas_fm.columns

Index(['#chrom', 'pos', 'ref', 'alt', 'rsids', 'nearest_genes', 'pval',
       'mlogp', 'beta', 'sebeta', 'af_alt', 'af_alt_cases', 'af_alt_controls',
       'id', 'finemapped', 'prob', 'lead_r2', 'cs_99', 'combined', 'causal',
       'trait', 'gene_0', 'gene_1', 'gene_2'],
      dtype='object')

## Export to parquet

In [22]:
gwas_fm = gwas_fm.reset_index(drop=True)
gwas_fm.to_parquet('gwas_fm_hyp.parquet')