# UKBiobank GWAS Preprocessing
This notebook covers the preprocessing steps for our GWAS analysis. <br>
We use PLINK for the basic preprocessing steps.
<br>
The genotype data (`.bed` and `.fam` files) were downloaded with UKBiobank's `gfetch` tool. The `.bim` files were downloaded from the web and uncompressed in the terminal.

In [1]:
import pandas as pd
import numpy as np

# Data
This section describes the GWAS data. Our data directory holds 3 main file types, split by chromosome: `.bed`, `.bim` and `.fam` files. <br>

### BED
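The `.bed` file is the binary-encoded counterpart of the text PED file. In a PED file, each line corresponds to a sample: the first 6 columns hold the family ID, individual ID, paternal ID, maternal ID, sex (1 for male, 2 for female), and phenotype, and the remaining columns hold the SNP data. The binary `.bed` file stores only the genotype calls; the sample columns are kept in the `.fam` file and the SNP information in the `.bim` file.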
<br>
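To make the binary layout concrete, here is a minimal sketch of decoding PLINK 1 `.bed` bytes with NumPy. It assumes the standard variant-major layout (3 magic bytes, then 2 bits per sample, low-order bits first). The helper name `decode_bed` and the toy bytes are illustrative only; in practice the sample and variant counts would come from the `.fam` and `.bim` files.

```python
import numpy as np

# PLINK 1 .bed layout: 3 magic bytes, then one block per variant,
# each block holding ceil(n_samples / 4) bytes with 2 bits per sample.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# 2-bit code -> number of copies of the first .bim allele; -1 marks a missing call.
# 00 = homozygous first allele, 01 = missing, 10 = heterozygous, 11 = homozygous second allele
CODE_TO_A1_COUNT = np.array([2, -1, 1, 0], dtype=np.int8)


def decode_bed(raw: bytes, n_samples: int, n_variants: int) -> np.ndarray:
    """Return an (n_variants, n_samples) array of first-allele counts (hypothetical helper)."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4
    body = np.frombuffer(raw, dtype=np.uint8, offset=3)
    body = body.reshape(n_variants, bytes_per_variant)
    # Within each byte, the lowest two bits belong to the earliest sample.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (body[:, :, None] >> shifts) & 0b11
    # Flatten the per-byte axis and drop the padding bits of the last byte.
    codes = codes.reshape(n_variants, -1)[:, :n_samples]
    return CODE_TO_A1_COUNT[codes]


# Toy input: one variant, four samples packed into the single byte 0xDC.
toy_raw = BED_MAGIC + bytes([0xDC])
toy_genotypes = decode_bed(toy_raw, n_samples=4, n_variants=1)
print(toy_genotypes)
```

For the real UKBiobank files a dedicated reader is likely the better choice; this sketch is only meant to show why the `.bed` file cannot be inspected with `head`.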

### FAM
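The `.fam` file contains, in plain text, the same information as the first 6 columns of a PED file. In the output below, the sixth column holds a batch label (e.g. `Batch_b001`) rather than a phenotype value. **For example**: <br>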

In [2]:
!head data/ukb22418_c1_b0_v2_s488248.fam

1123282 1123282 0 0 1 Batch_b001
4183264 4183264 0 0 2 Batch_b001
4087767 4087767 0 0 2 Batch_b001
5528041 5528041 0 0 2 Batch_b001
5705948 5705948 0 0 2 Batch_b001
2717517 2717517 0 0 1 Batch_b001
1116903 1116903 0 0 2 Batch_b001
3899174 3899174 0 0 2 Batch_b001
2482884 2482884 0 0 1 Batch_b001
1083504 1083504 0 0 2 Batch_b001


### BIM
In the `.bim` file, each line represents one SNP. The columns contain, in order, information on the corresponding chromosome, SNP identifier, genetic distance in morgans (optional), SNP position, and two alleles of the SNP in the last two columns. **For example**:

In [3]:
!head data/ukb_snp_chr1_v2.bim

1	rs28659788	0	723307	C	G
1	rs116587930	0	727841	G	A
1	rs116720794	0	729632	C	T
1	rs3131972	0	752721	A	G
1	rs12184325	0	754105	C	T
1	rs3131962	0	756604	A	G
1	rs114525117	0	759036	G	A
1	rs3115850	0	761147	T	C
1	rs115991721	0	767096	A	G
1	rs12562034	0	768448	G	A
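The column layout described above can be made explicit when loading a `.bim` file into pandas. This is a small sketch: the column names in `BIM_COLUMNS` and the helper `read_bim` are our own labels (the file itself has no header), and the two rows are copied from the output above.

```python
import io

import pandas as pd

# Descriptive names for the six .bim columns (our own labels, not part of the file).
BIM_COLUMNS = ["chrom", "snp_id", "genetic_dist", "bp_pos", "allele1", "allele2"]


def read_bim(path_or_buffer) -> pd.DataFrame:
    """Load a whitespace-delimited .bim file with named columns (hypothetical helper)."""
    return pd.read_csv(
        path_or_buffer,
        sep=r"\s+",
        header=None,
        names=BIM_COLUMNS,
        dtype={"chrom": str, "snp_id": str},  # keep chromosome labels such as "X" as text
    )


# Two rows taken from the head of the chromosome 1 file shown above.
sample = io.StringIO("1\trs28659788\t0\t723307\tC\tG\n1\trs116587930\t0\t727841\tG\tA\n")
bim = read_bim(sample)
print(bim)
```

The same helper could be pointed at `data/ukb_snp_chr1_v2.bim` directly.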


In [4]:
# Load the chromosome 1 .fam file; a raw-string regex splits on any run of whitespace
chr1 = pd.read_csv("data/ukb22418_c1_b0_v2_s488248.fam", header=None, sep=r"\s+")
chr1

Unnamed: 0,0,1,2,3,4,5
0,1123282,1123282,0,0,1,Batch_b001
1,4183264,4183264,0,0,2,Batch_b001
2,4087767,4087767,0,0,2,Batch_b001
3,5528041,5528041,0,0,2,Batch_b001
4,5705948,5705948,0,0,2,Batch_b001
...,...,...,...,...,...,...
488372,3790691,3790691,0,0,2,UKBiLEVEAX_b11
488373,4960183,4960183,0,0,2,UKBiLEVEAX_b11
488374,3174069,3174069,0,0,1,UKBiLEVEAX_b11
488375,2426240,2426240,0,0,2,UKBiLEVEAX_b11


### Phenotype
Last is our phenotype data. <br>
In general phenotype data is stored in a separate `.txt` file containing 3 columns with one row per individual: <br>
* Family ID
* Individual ID
* Phenotype 

Columns are separated by white space (\s)
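As a sketch of how a table could be written in this three-column layout, the snippet below converts a 0/1 case indicator to PLINK's default case/control coding (1 = control, 2 = case, -9 = missing) and emits space-separated text without a header. The helper `to_plink_pheno` and the toy IDs are hypothetical; alternatively, PLINK 1.9's `--1` flag tells it to expect 0/1 coding.

```python
import numpy as np
import pandas as pd


def to_plink_pheno(df, fid_col, iid_col, status_col):
    """Build a FID / IID / phenotype table in PLINK's default 1/2 coding (hypothetical helper)."""
    status = df[status_col].to_numpy()
    # 0 (control) -> 1, 1 (case) -> 2, anything else (e.g. NaN) -> -9 (missing)
    coded = np.select([status == 0, status == 1], [1, 2], default=-9)
    return pd.DataFrame(
        {"FID": df[fid_col].to_numpy(), "IID": df[iid_col].to_numpy(), "PHENO": coded}
    )


# Toy data with made-up IDs: one control, one case, one missing phenotype.
toy = pd.DataFrame({"fid": [1, 2, 3], "iid": [1, 2, 3], "achalasia": [0.0, 1.0, np.nan]})
pheno = to_plink_pheno(toy, "fid", "iid", "achalasia")

# With no path given, to_csv returns the text instead of writing a file.
pheno_text = pheno.to_csv(sep=" ", header=False, index=False)
print(pheno_text)
```

Passing a file path to `to_csv` instead would produce a file usable with PLINK's `--pheno` option.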

In [5]:
achalasia = pd.read_csv("achalasia/achalasia_binary.csv")
achalasia

Unnamed: 0,eid,achalasia
0,1038831,1.0
1,1058137,1.0
2,1094461,1.0
3,1102623,1.0
4,1112794,1.0
...,...,...
502489,6024904,0.0
502490,6024916,0.0
502491,6024920,0.0
502492,6024937,0.0


In [9]:
# Merge on the individual ID from the .fam file (column 1) and eid from our achalasia file.
ach_pheno = pd.merge(chr1[[0,1]], achalasia, how="inner", left_on=1, right_on="eid")
ach_pheno = ach_pheno.drop(columns="eid")  # Drop the now-redundant eid column
ach_pheno

Unnamed: 0,0,1,achalasia
0,1123282,1123282,0.0
1,4183264,4183264,0.0
2,4087767,4087767,0.0
3,5528041,5528041,0.0
4,5705948,5705948,0.0
...,...,...,...
488243,3790691,3790691,0.0
488244,4960183,4960183,0.0
488245,3174069,3174069,0.0
488246,2426240,2426240,0.0
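An inner merge silently drops individuals that appear in only one of the two tables. One way to see how many rows fall out on each side is an outer merge with pandas' `indicator=True`. The sketch below uses toy tables with made-up IDs, and `merge_report` is a hypothetical helper name.

```python
import pandas as pd


def merge_report(fam, pheno, fam_id_col=1, pheno_id_col="eid"):
    """Count IDs present in both tables, only in the .fam table, or only in the phenotype table."""
    merged = pd.merge(
        fam, pheno, how="outer", left_on=fam_id_col, right_on=pheno_id_col, indicator=True
    )
    # The "_merge" column records where each row came from.
    return {
        key: int((merged["_merge"] == key).sum())
        for key in ("both", "left_only", "right_only")
    }


# Toy tables with made-up IDs: 11 and 13 match, 12 has no phenotype, 14 has no genotype.
toy_fam = pd.DataFrame({0: [11, 12, 13], 1: [11, 12, 13]})
toy_pheno = pd.DataFrame({"eid": [11, 13, 14], "achalasia": [0.0, 1.0, 0.0]})

report = merge_report(toy_fam, toy_pheno)
print(report)
```

Applied to `chr1[[0, 1]]` and `achalasia`, the `left_only` count would show how many genotyped individuals have no phenotype record.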


In [13]:
np.sum(ach_pheno["achalasia"] > 0)

1054

In [4]:
1054/488248

0.0021587390014910454

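Only 1,054 of the 488,248 individuals (about 0.2%) are achalasia cases. This extreme case-control imbalance is a potential concern for the association tests and will need to be raised.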

# Filter SNPs

Before running the association tests, several common filtering steps are applied to reduce the number of false positives. These include: 
* Ensuring SNPs are in Hardy-Weinberg Equilibrium (HWE). Deviations are assessed with a statistical test in PLINK, commonly using a p-value threshold of 1E-6.
* Excluding SNPs with a minor allele frequency (MAF) below a chosen threshold, commonly 0.05 or 0.01.
* Excluding individuals with a large number of missing genotypes, and SNPs with a high rate of missing genotypes across individuals. PLINK's default `--mind` threshold excludes individuals with more than 10% of their genotypes missing.
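The quantities behind these filters can be illustrated on a small genotype matrix. The sketch below is illustrative only: it assumes genotypes coded as allele counts 0/1/2 with -1 for missing, and it uses a 1-degree-of-freedom chi-square approximation for HWE, which is a simpler stand-in and not the test PLINK itself applies. The function names are our own.

```python
import math

import numpy as np


def variant_qc(genotypes: np.ndarray):
    """genotypes: (n_variants, n_samples) allele counts 0/1/2, with -1 for missing."""
    called = genotypes >= 0
    n_called = called.sum(axis=1)
    allele_sum = np.where(called, genotypes, 0).sum(axis=1)
    freq = allele_sum / (2 * n_called)        # frequency of the counted allele
    maf = np.minimum(freq, 1 - freq)          # minor allele frequency
    variant_missing = 1 - n_called / genotypes.shape[1]
    sample_missing = 1 - called.sum(axis=0) / genotypes.shape[0]
    return maf, variant_missing, sample_missing


def hwe_chi2_pvalue(n_hom1: int, n_het: int, n_hom2: int) -> float:
    """Chi-square (1 df) approximation to the HWE test from genotype counts."""
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic variant: no deviation can be measured
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom1, n_het, n_hom2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Toy matrix: 2 variants x 4 samples, one missing call.
toy = np.array([[0, 1, 2, -1], [0, 0, 0, 1]])
maf, variant_missing, sample_missing = variant_qc(toy)
print(maf, variant_missing, sample_missing)
print(hwe_chi2_pvalue(25, 50, 25), hwe_chi2_pvalue(50, 0, 50))
```

A variant would then be kept when its MAF is at or above the chosen threshold and its HWE p-value is not below the cutoff, mirroring the filters listed above.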


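Options explanation:
* `--bfile` sets the prefix of the input `.bed`/`.bim`/`.fam` fileset
* `--maf` sets the threshold for minor allele frequencies
* `--hwe` sets the p-value threshold for the Hardy-Weinberg equilibrium test
* `--mind` sets the maximum missing genotype rate allowed per individual
* `--make-bed` writes the filtered data as a new binary fileset
* `--out` sets the prefix of the output files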

In [7]:
%%bash
plink --bfile data/ukb22418_c1_b0_v2 --maf 0.01 --hwe 1e-6 --mind 0.1 --make-bed --out data/ukb_chr1_filtered

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/ukb_chr1_filtered.log.
Options in effect:
  --bfile data/ukb22418_c1_b0_v2
  --hwe 1e-6
  --maf 0.01
  --make-bed
  --mind 0.1
  --out data/ukb_chr1_filtered

385610 MB RAM detected; reserving 192805 MB for main workspace.
Allocated 108452 MB successfully, after larger attempt(s) failed.
63487 variants loaded from .bim file.
488377 people (223459 males, 264789 females, 129 ambiguous) loaded from .fam.
Ambiguous sex IDs written to data/ukb_chr1_filtered.nosex .
2 people removed due to missing genotype data (--mind).
IDs written to data/ukb_chr1_filtered.irem .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 488375 founders and 0 nonfounders present.
Calculating allele frequencies... [progress counter omitted]

--geno, and/or applying different p-value thresholds to distinct subsets of
your data.
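Since the data are split by chromosome, the same filter command could be generated for each autosome from Python. This is a sketch under the assumption that the other chromosomes follow the same file naming pattern as chromosome 1 (`ukb22418_c<N>_b0_v2`); `build_filter_cmd` is a hypothetical helper, and the commands are only built here, not executed.

```python
def build_filter_cmd(chrom, maf=0.01, hwe=1e-6, mind=0.1, data_dir="data"):
    """Build the PLINK filtering command for one chromosome as an argument list.

    The input/output prefixes assume every chromosome is named like chromosome 1.
    """
    return [
        "plink",
        "--bfile", f"{data_dir}/ukb22418_c{chrom}_b0_v2",
        "--maf", str(maf),
        "--hwe", str(hwe),
        "--mind", str(mind),
        "--make-bed",
        "--out", f"{data_dir}/ukb_chr{chrom}_filtered",
    ]


# One command per autosome (chromosomes 1 to 22).
commands = [build_filter_cmd(chrom) for chrom in range(1, 23)]
print(" ".join(commands[0]))

# To actually run a command, something like
#   subprocess.run(commands[0], check=True)
# could be used once the input files are confirmed to exist.
```

Building the argument list separately from running it makes it easy to inspect the commands before committing compute time to them.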
