# Data Explorer Notebook

## Data Description

### UKBB_94traits_release1.{tsv|bed}.gz

This file contains the variant-level results of a study of 94 complex diseases and traits in the UK Biobank. Each row describes a variant in the context of a given cohort, trait, and fine-mapping method, with columns for its genomic location, alleles, marginal association statistics, and fine-mapping results. It also flags variants in linkage disequilibrium (LD) with variants that fail Hardy-Weinberg equilibrium (HWE) or with common structural variants. This is the main file for exploring the genetic association results and the fine-mapping of these traits and diseases.

Columns:

- Chromosome: hg19 autosomes only
- Start: 0-indexed hg19 start position
- End: 0-indexed hg19 end position
- Variant: unique variant identifier (chr:pos:ref:alt)
- rsID: dbSNP rsID
- Allele1: reference allele in hg19
- Allele2: alternative allele in hg19
- Minor allele: minor allele in cohort
- Cohort: GWAS cohort
- Model_marginal: type of regression model used
- Method: fine-mapping method used
- Trait: abbreviation for phenotype in genetic association tests
- Region: fine-mapping region in hg19
- MAF: minor allele frequency in cohort
- Beta_marginal: marginal association effect size (effect allele: alternative)
- SE_marginal: standard error on marginal association effect size
- Chisq_marginal: test statistic for marginal association
- PIP: posterior probability of association from fine-mapping
- CS_ID: ID of 95% credible set (-1 if variant not in 95% CS)
- Beta_posterior: posterior expectation of true effect size (effect allele: alternative)
- SD_posterior: posterior standard deviation of true effect size
- LD_HWE: indicator for LD (R^2 > 0.6) with a variant that failed HWE (p < 10^-12) in UK10K LD
- LD_SV: indicator for LD (R^2 > 0.8) with a common structural variant in gnomAD European samples
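
The `Variant` identifier packs four fields into one string, so it is often convenient to split it back out. The sketch below uses a toy table with made-up values, the lowercase column names chosen later in this notebook, and the assumption that `Start` equals the 1-based position in the identifier minus one; check that relation on the real file before relying on it.

```python
import pandas as pd

# Toy rows with made-up values; only the chr:pos:ref:alt layout follows the column list.
toy_variants = pd.DataFrame({
    "variant": ["1:1000:A:G", "2:2500:CT:C"],
    "start": [999, 2499],
})

# Split the identifier into its four components.
parts = toy_variants["variant"].str.split(":", expand=True)
parts.columns = ["chrom", "pos", "ref", "alt"]
parts["pos"] = parts["pos"].astype(int)

# Assumption: the 0-indexed start is the 1-based position minus one.
toy_variants["start_matches"] = toy_variants["start"] == parts["pos"] - 1
print(parts)
print(toy_variants)
```

If `start_matches` is not all `True` on the real data, the identifier and the BED coordinates follow a different convention than the one assumed here.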

### UKBB_94traits_release1_regions.bed.gz

This file comes from the same study and lists the genomic regions used for fine-mapping. Each row represents a region, with columns for the cohort, the trait, and whether each fine-mapping method (FINEMAP, SuSiE) completed successfully. It also includes the identifier of variants located in these regions. This file is useful for exploring which parts of the genome were fine-mapped and how the fine-mapping runs turned out.

Columns:

- Chromosome: hg19 autosomes only
- Start: 0-indexed hg19 start position
- End: 0-indexed hg19 end position
- Cohort: GWAS cohort
- Trait: abbreviation for phenotype in genetic association tests
- Region: fine-mapping region in hg19
- Variant: unique variant identifier (chr:pos:ref:alt)
- Success_FINEMAP: indicator for successful FINEMAP completion
- Success_SuSiE: indicator for successful SuSiE completion

## .bed -> .csv

In [3]:
import pandas as pd

In [2]:
# For UKBB_94traits_release1.{tsv|bed}.gz
bed_columns_94traits = [
    "chromosome", "start", "end", "variant", "rsid", "allele1", "allele2", 
    "minorallele", "cohort", "model_marginal", "method", "trait", "region", 
    "maf", "beta_marginal", "se_marginal", "chisq_marginal", "pip", "cs_id", 
    "beta_posterior", "sd_posterior", "LD_HWE", "LD_SV"
]

# For UKBB_94traits_release1_regions.bed.gz
bed_columns_94traits_regions = [
    "chromosome", "start", "end", "cohort", "trait", "region", 
    "variant", "success_finemap", "success_susie"
]

In [5]:
# Load the main BED file; column names are supplied explicitly via `names`.
# Adjust the column list if your BED file has a different layout.
bed_data_94traits = pd.read_csv(
    '~/Desktop/geometric-omics/UKBB-fine-mapping/data/UKBB_94traits_release1.bed',
    sep='\t',
    names=bed_columns_94traits,
    comment='#',
)

# Write to a CSV file
bed_data_94traits.to_csv('UKBB_94traits_release1.csv', index=False)

# Repeat the process for the regions file
bed_data_94traits_regions = pd.read_csv(
    '~/Desktop/geometric-omics/UKBB-fine-mapping/data/UKBB_94traits_release1_regions.bed',
    sep='\t',
    names=bed_columns_94traits_regions,
    comment='#',
)

# Write to a CSV file
bed_data_94traits_regions.to_csv('UKBB_94traits_release1_regions.csv', index=False)
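
The conversion above loads each file fully into memory. For a table with several million rows, a chunked conversion that reads the `.gz` file directly may be gentler on memory. The sketch below is one possible approach, not part of the original notebook: `bed_to_csv` is a hypothetical helper, the chunk size is arbitrary, and the demo runs on a tiny made-up file rather than the real data.

```python
import gzip
import os
import tempfile

import pandas as pd


def bed_to_csv(bed_path, csv_path, columns, chunksize=500_000):
    """Convert a tab-separated BED file (plain or .gz) to CSV chunk by chunk."""
    # pandas infers gzip compression from the .gz extension.
    reader = pd.read_csv(bed_path, sep="\t", names=columns,
                         comment="#", chunksize=chunksize)
    n_rows = 0
    for i, chunk in enumerate(reader):
        # Write the header once, then append the remaining chunks.
        chunk.to_csv(csv_path, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)
        n_rows += len(chunk)
    return n_rows


# Demo on a made-up three-row file.
tmp_dir = tempfile.mkdtemp()
toy_bed = os.path.join(tmp_dir, "toy.bed.gz")
toy_csv = os.path.join(tmp_dir, "toy.csv")
with gzip.open(toy_bed, "wt") as fh:
    fh.write("chr1\t0\t10\nchr1\t10\t20\nchr2\t5\t15\n")

n_rows = bed_to_csv(toy_bed, toy_csv, ["chromosome", "start", "end"], chunksize=2)
print(n_rows)
```

On the real files, the same call would take the `.bed.gz` path and the matching column list defined earlier.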

In [9]:
data = pd.read_csv('UKBB_94traits_release1.csv')

In [10]:
len(data)

5377879
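
With millions of rows, a typical first step is to narrow the table to variants that fine-mapping supports. The sketch below shows two such filters on a toy table with made-up values: membership in a 95% credible set (`cs_id != -1`, following the column description) and a PIP cutoff. The 0.9 threshold and the trait labels are illustrative assumptions, not values taken from the study.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the lowercase names used in this notebook.
toy_results = pd.DataFrame({
    "variant": ["1:1000:A:G", "1:2000:C:T", "2:500:G:A", "2:900:T:C"],
    "trait": ["Height", "Height", "LDLC", "LDLC"],
    "pip": [0.95, 0.02, 0.40, 0.91],
    "cs_id": [1, -1, 1, 2],
})

# Variants that belong to some 95% credible set.
in_cs = toy_results[toy_results["cs_id"] != -1]

# Variants above an illustrative PIP threshold (0.9 is an assumption).
high_pip = toy_results[toy_results["pip"] >= 0.9]

# Number of high-PIP variants per trait.
per_trait = high_pip.groupby("trait").size()
print(per_trait)
```

The same expressions apply to `data` once the real CSV is loaded.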

In the graph, the SNP nodes could be encoded using the following features:

1. **Variant**: Unique identifier for the SNP.
2. **rsID**: dbSNP identifier, widely used to cross-reference variants across datasets.
3. **Minor allele**: Minor allele in the cohort, can help in SNP characterization.
4. **MAF**: Minor allele frequency in the cohort, provides insights into the rarity or commonality of the SNP.
5. **PIP**: Posterior probability of association from fine-mapping; it reflects how likely the SNP is to be causal for the trait rather than its statistical significance.
6. **Beta_marginal** and **SE_marginal**: These could be used to encode the effect size and its uncertainty.
7. **Beta_posterior** and **SD_posterior**: These represent the posterior expectation and uncertainty of the true effect size.
8. **LD_HWE** and **LD_SV**: Indicators for linkage disequilibrium with variants that failed HWE or with common structural variants; these provide information about the SNP's genetic context.
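
The features listed above can be sketched as a per-SNP feature table. Because the same variant can appear in several rows, the rows have to be collapsed to one per SNP; the aggregation rules below (first MAF, maximum PIP, maximum of each LD flag) are assumptions chosen for illustration, and the toy values are made up.

```python
import pandas as pd

# Toy rows with made-up values; the first variant appears twice.
toy_rows = pd.DataFrame({
    "variant": ["1:1000:A:G", "1:1000:A:G", "2:500:G:A"],
    "maf": [0.12, 0.12, 0.31],
    "pip": [0.95, 0.10, 0.40],
    "LD_HWE": [0, 0, 1],
    "LD_SV": [0, 0, 0],
})

# Collapse to one row per SNP (aggregation choices are assumptions).
snp_features = toy_rows.groupby("variant").agg(
    maf=("maf", "first"),
    max_pip=("pip", "max"),
    ld_hwe=("LD_HWE", "max"),
    ld_sv=("LD_SV", "max"),
)

# Numeric matrix with one row per SNP node, ordered like snp_features.index.
x = snp_features.to_numpy(dtype=float)
print(snp_features)
```

Identifiers such as **Variant** and **rsID** serve as node keys here rather than numeric features, and the effect-size columns could be aggregated in the same way.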

The gene nodes could be encoded using the following features:

1. **Trait**: Abbreviation for phenotype in genetic association tests, represents the phenotypic trait associated with the gene.
2. **Region**: Fine-mapping region in hg19; it gives the genomic interval the gene would fall in, although the region is a fine-mapping window rather than a gene boundary.