# Data Explorer Notebook

## Data Description

### [FinnGen](https://finngen.gitbook.io/documentation/)

- The FinnGen research project is an expedition to the frontier of genomics and medicine, with significant discoveries potentially arising from any one of Finland’s 500,000 biomedical pioneers.
- The project brings together a nation-wide network of Finnish biobanks, with every Finn able to participate in the study by giving biobank consent.
- As of the last update, there were 589,000 samples available, with a goal to reach 520,000 by 2023. The latest data freeze included combined genotype and health registry data from 473,681 individuals.
- The study utilizes samples collected by a nationwide network of Finnish biobanks and combines genome information with digital health care data from national health registries【8†source】.
- There's a need for samples from all over Finland as solutions in the field of personalized healthcare can be found only by looking at large populations. Every Finn can be a part of the FinnGen study by giving a biobank consent.
- The genome data produced during the project is owned by the Finnish biobanks and remains available for research purposes. The medical breakthroughs that arise from the project are expected to benefit health care systems and patients globally.
- The FinnGen research project is collaborative, involving all the same actors as drug development, with the aim to speed up the emergence of new innovations.
- The project's data freeze 9 results and summary statistics are now available, consisting of over 377,200 individuals, almost 20.2 M variants, and 2,272 disease endpoints. Results can be browsed online using the FinnGen web browser, and the summary statistics downloaded.
- The University of Helsinki is the organization responsible for the study, and the nationwide network of Finnish biobanks is participating in the study, thus covering the whole of Finland. The Helsinki Biobank coordinates the sample collection.
- For more information, the project can be contacted at finngen-info@helsinki.fi.

## Import data

In [1]:
import pandas as pd

In [2]:
finemap = pd.read_csv('C:/Users/falty/Desktop/geometric-omics/FinnGenn/data/finemapping_summary_finngen_R9_I9_HYPTENS.SUSIE.cred.summary.tsv', low_memory=False, sep='\t')
gwas = pd.read_csv('C:/Users/falty/Desktop/geometric-omics/FinnGenn/data/summary_stats_finngen_R9_I9_HYPTENS.tsv', low_memory=False, sep='\t')

In [3]:
print(len(finemap))
print(len(gwas))

288
20170234


In [4]:
# ref match
finemap['ref'] = finemap['rsid'].str.split('_').str[-1]
finemap['ref'] = finemap['ref'].astype(str)
gwas['ref'] = gwas['ref'].astype(str)

# alt match
finemap['alt'] = finemap['rsid'].str.split('_').str[-1].str[-1]
finemap['alt'] = finemap['alt'].astype(str)
gwas['alt'] = gwas['alt'].astype(str)

# pos match
finemap['v'] = finemap['v'].astype(str)
finemap['v'] = finemap['v'].str.split(':').str[1]
finemap['v'] = finemap['v'].astype('int64')

# chrom match
finemap['#chrom'] = finemap['rsid'].str.extract(r'chr(\d+)_')
finemap['#chrom'] = finemap['#chrom'].astype('int64')
gwas['#chrom'] = gwas['#chrom'].astype('int64')

# Create the new 'causal' column
gwas['causal'] = gwas.apply(lambda row: 1 if ((row['pos'] in finemap['v'].values) and (row['ref'] in finemap['ref'].values) and (row['#chrom'] in finemap['#chrom'].values) and (row['alt'] in finemap['alt'].values)) else 0, axis=1)

total = gwas['causal'].sum()
print(total)

280


In [5]:
unique_trait = finemap['trait'].unique()
trait_string = unique_trait[0]
gwas['trait'] = trait_string

In [6]:
gwas.head()

Unnamed: 0,#chrom,pos,ref,alt,rsids,nearest_genes,pval,mlogp,beta,sebeta,af_alt,af_alt_cases,af_alt_controls,causal,trait
0,1,13668,G,A,rs2691328,OR4F5,0.106658,0.972006,-0.114822,0.071168,0.005846,0.005683,0.005914,0,I9_HYPTENS
1,1,14773,C,T,rs878915777,OR4F5,0.620115,0.207528,-0.021548,0.04347,0.013501,0.013448,0.013524,0,I9_HYPTENS
2,1,15585,G,A,rs533630043,OR4F5,0.859628,0.065689,-0.023716,0.134105,0.001112,0.001117,0.001109,0,I9_HYPTENS
3,1,16549,T,C,rs1262014613,OR4F5,0.321844,0.492355,-0.215787,0.217818,0.000563,0.000556,0.000566,0,I9_HYPTENS
4,1,16567,G,C,rs1194064194,OR4F5,0.764225,0.116779,0.021523,0.071757,0.004192,0.004207,0.004186,0,I9_HYPTENS


In [7]:
gwas.to_csv('gwas-causal.csv', index=False)