# How to filter genomic data using Hail and save to PLINK files

**Introduction**

A common task in Hail is to filter a MatrixTable to a custom list of SNPs and samples and then save the filtered data in another format, such as PLINK files, for downstream analysis. This notebook presents an optimized filtering workflow in three steps: filter samples, then chromosomal intervals, and finally exact locus and alleles. In our testing, this workflow ran significantly faster than the same workflow without the second step (filtering chromosomal intervals).

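Given a list of SNPs and sample IDs, this notebook demonstrates how to extract data from a split Hail MatrixTable such as acaf_threshold_v7.1/splitMT and save the result to PLINK files. The code below reads the exome split MT through the `WGS_EXOME_SPLIT_HAIL_PATH` environment variable.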

This notebook is designed to run on a VM with 8 CPUs and 30 GB of memory, utilizing 2 workers and 2 preemptible workers. The estimated cost for running this notebook is approximately $1.25 per hour.

Additionally, this notebook includes examples demonstrating how to create a small dataset for testing runtime and cost estimation purposes (please refer to the last section "Tips").

**Prerequisite: We expect readers to have already gone through our tutorial genomic workspace (https://workbench.researchallofus.org/workspaces/aou-rw-b7598f6e/duplicateofhowtoworkwithallofusgenomicdatahailplinkv7prep/data) or to have basic knowledge of Hail and relevant genomic file types such as Hail MatrixTable and PLINK files.**

# Setup

In [None]:
### Set up python environment
from datetime import datetime
start = datetime.now()

In [None]:
import os
import pandas as pd
import hail as hl
hl.init(default_reference = "GRCh38")

In [None]:
bucket = os.getenv('WORKSPACE_BUCKET')
bucket

In [None]:
auxiliary_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux"

In [None]:
# use exome split MT
mt_wgs_path = os.getenv("WGS_EXOME_SPLIT_HAIL_PATH")
mt_wgs_path

# Read genomic data and filter out flagged samples

In [None]:
mt = hl.read_matrix_table(mt_wgs_path)

# Remove samples flagged in the relatedness analysis
flagged_samples = f'{auxiliary_path}/relatedness/relatedness_flagged_samples.tsv'
sample_to_remove = hl.import_table(flagged_samples, key="sample_id")

mt = mt.anti_join_cols(sample_to_remove)
mt.count()
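
As a mental model, `anti_join_cols` keeps only the columns whose key does not appear in the given table. The plain-Python sketch below mimics that behaviour on a toy list of sample IDs. The IDs and the helper name `anti_join` are made up for illustration, and this is not Hail code.

```python
def anti_join(sample_ids, flagged_ids):
    """Keep the samples whose ID is NOT in flagged_ids (toy stand-in for anti_join_cols)."""
    flagged = set(flagged_ids)
    return [s for s in sample_ids if s not in flagged]


all_samples = ["1001", "1002", "1003", "1004"]  # hypothetical sample IDs
flagged_toy = ["1002", "1004"]                  # hypothetical flagged IDs

kept = anti_join(all_samples, flagged_toy)
print(kept)
```

Only the two unflagged IDs survive, which is the effect the relatedness filter has on the MatrixTable columns.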

**Assuming we have a list of samples (5k) and SNPs (10k) that are already saved to the bucket**

In [None]:
!gsutil ls {bucket}/data/genomics/genomic_pid_s5000.csv

In [None]:
!gsutil ls {bucket}/data/genomics/snps_hg38_s10k.txt

# Filter custom samples first

In [None]:
pid = hl.import_table( f'{bucket}/data/genomics/genomic_pid_s5000.csv', delimiter = "," )
pid.count()

In [None]:
pid.show(5)

**Randomly choose a small subset of samples**

In [None]:
#choose an approximate fraction of samples
pid=pid.sample(0.01)
pid.count()
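
Because `sample(0.01)` keeps each row independently with probability 0.01, the resulting count is only approximately 1% of the input and changes between runs unless a seed is fixed (Hail's `Table.sample` accepts a `seed` argument). The sketch below imitates this in plain Python; `bernoulli_sample` is a hypothetical helper, not a Hail function.

```python
import random


def bernoulli_sample(rows, p, seed):
    """Keep each row independently with probability p, using a fixed seed."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < p]


rows = list(range(5000))  # stand-in for 5,000 participant rows
subset = bernoulli_sample(rows, 0.01, seed=42)

# The size is close to 50 but usually not exactly 50.
print(len(subset))
```

Fixing the seed makes the test subset reproducible, which helps when comparing runtimes across runs.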

In [None]:
# Key the table by person_id so it can be looked up by the MT column key
pid = pid.key_by( 'person_id' )

In [None]:
# Filter mt to desired participants
mt = mt.filter_cols( hl.is_defined( pid[ mt.col_key ] ) )

In [None]:
mt.count() 

# Then filter chromosome intervals

In [None]:
start3 = datetime.now()

In [None]:
# Load in SNP file from the bucket
file=f'{bucket}/data/genomics/snps_hg38_s10k.txt'
snps = hl.import_table( file, delimiter = '\t', key = 'SNP' )
snps.count()

In [None]:
snps.show(5)

**Randomly choose a small subset of SNPs**

In [None]:
snps=snps.sample(0.01)
snps.count()

**Split the SNP column into contig and position**

In [None]:
# Split the 'SNP' column into 'contig' and 'position'
snps = snps.annotate(contig=snps.SNP.split(":")[0],
                     position=hl.int(snps.SNP.split(":")[1]))

In [None]:
snps.show(5)
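
For readers who want to check the SNP identifiers before loading them into Hail, the same split can be sketched in plain Python. This assumes each identifier has four colon-separated fields, `contig:position:ref:alt`, which is the form `hl.parse_variant` is given later in this notebook; `split_snp` and the example identifier are hypothetical.

```python
def split_snp(snp):
    """Split 'contig:position:ref:alt' into typed parts."""
    contig, position, ref, alt = snp.split(":")
    return contig, int(position), ref, alt


print(split_snp("chr1:12345:A:T"))
```

An identifier with a different number of fields raises a `ValueError` here, which is a cheap way to spot malformed lines in the SNP file.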

In [None]:
# Create a one-base interval [position, position + 1) for each SNP.
# position is already a 32-bit integer from the hl.int() call above.
snps = snps.annotate(
    interval=hl.locus_interval(snps.contig, snps.position, snps.position + 1)
)

In [None]:
snps.show(5)
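
By default `hl.locus_interval` includes the start and excludes the end, so `(contig, position, position + 1)` covers exactly one base. A minimal plain-Python sketch of that half-open convention follows; `point_interval` and `contains` are illustrative names, not Hail functions.

```python
def point_interval(contig, position):
    """Half-open interval [position, position + 1) on one contig."""
    return (contig, position, position + 1)


def contains(interval, contig, position):
    """True if the locus falls inside the half-open interval."""
    c, start, end = interval
    return c == contig and start <= position < end


iv = point_interval("chr1", 12345)
print(contains(iv, "chr1", 12345), contains(iv, "chr1", 12346))
```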

**Filter the MT to these intervals**

In [None]:
# Note: collecting the intervals may cause memory issues if the number of variants is >50k

mt = hl.filter_intervals(mt, snps.interval.collect())

In [None]:
# comment out this step if needed to save runtime
mt.count()

In [None]:
end = datetime.now()
end-start3

# Lastly, filter locus+alleles

In [None]:
# create locus and allele and rekey
snps_loc = snps.annotate( locus_allele = hl.parse_variant( snps.SNP, reference_genome = 'GRCh38' ) )
snps_loc = snps_loc.annotate( locus = snps_loc.locus_allele.locus,
                             alleles = snps_loc.locus_allele.alleles )
snps_loc = snps_loc.key_by( snps_loc.locus, snps_loc.alleles )
snps_loc = snps_loc.drop( 'SNP', 'locus_allele' )
snps_loc.show( 5 )

**Filter the MatrixTable (MT) to rows whose locus and alleles match exactly**

In [None]:
mt2 = mt.semi_join_rows(snps_loc)

In [None]:
# comment out this step if you need to save runtime
mt2.count()
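
The interval step and the exact-match step are not interchangeable: an interval keeps every variant at the requested position, while the locus+alleles join keeps only the requested alleles. The toy example below shows the difference in plain Python, with made-up variants standing in for rows of a split MT.

```python
# Toy rows of a split MT: (contig, position, ref, alt)
rows = [
    ("chr1", 100, "A", "T"),
    ("chr1", 100, "A", "G"),  # same locus, different alt allele
    ("chr1", 250, "C", "G"),
]
wanted = {("chr1", 100, "A", "T")}  # the SNP list, keyed by locus and alleles

# Step 2 (intervals): a coarse prefilter on position only
wanted_loci = {(contig, pos) for contig, pos, _, _ in wanted}
after_intervals = [r for r in rows if (r[0], r[1]) in wanted_loci]

# Step 3 (locus + alleles): exact match on the much smaller remainder
after_exact = [r for r in after_intervals if r in wanted]

print(len(rows), len(after_intervals), len(after_exact))
```

The prefilter shrinks the data cheaply, and the exact match then removes the extra alleles that happen to share a position with a requested SNP.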

# Save to PLINK files

**Save filtered MT to PLINK files**

In [None]:
start5 = datetime.now()

In [None]:
file_path = f'{bucket}/data/test/plink_small2'
file_path

In [None]:
hl.export_plink(mt2, file_path, ind_id = mt2.s)
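
`hl.export_plink` treats its second argument as a prefix and writes three files next to each other: `.bed`, `.bim` and `.fam`. The small sketch below builds the expected paths so they can be checked or copied afterwards; the bucket name is a placeholder and `plink_outputs` is a hypothetical helper.

```python
def plink_outputs(prefix):
    """Return the three file paths written for a given PLINK output prefix."""
    return [f"{prefix}.{ext}" for ext in ("bed", "bim", "fam")]


# Placeholder prefix; in the notebook this would be file_path
paths = plink_outputs("gs://my-bucket/data/test/plink_small2")
for p in paths:
    print(p)
```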

In [None]:
end = datetime.now()
end-start5

In [None]:
end-start
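
The repeated `datetime.now()` pairs used for timing above could also be wrapped in a small helper. This is only one possible sketch; `timed` and the `timings` dictionary are illustrative names that do not appear in the original notebook.

```python
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    begin = datetime.now()
    try:
        yield
    finally:
        results[label] = datetime.now() - begin


timings = {}
with timed("toy step", timings):
    total = sum(range(100_000))  # stand-in for a Hail action such as export_plink

print(timings)
```

Collecting the durations in one dictionary makes it easier to compare runs with and without the interval-filtering step.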

In [None]:
# uncomment these steps if you want to save to another MatrixTable instead
# remember that a MatrixTable can't be saved to your local VM; it has to be written to the bucket

# file_path = f'{bucket}/data/test/small2.mt'
# mt2.write(file_path, overwrite=True)

# Tips

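1. Use `Table.sample()` (as done above for `pid` and `snps`) to choose a small dataset. To subsample a MatrixTable directly, use `mt.sample_rows()` or `mt.sample_cols()`.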


2. Feel free to test runtime with or without the 2nd step (filtering chromosomal intervals)


3. Skip optional Hail actions such as mt.count() if you want to save more runtime; each one forces Hail to evaluate the filters defined so far.


4. If you're dealing with a large dataset containing 1 million SNPs in 100,000 samples, it's advisable to start with smaller datasets for testing purposes before processing the entire dataset. These smaller datasets can include approximately 50 SNPs by 50 samples, 100 by 100, 1000 by 1000, 5000 by 5000, and up to 10,000 by 10,000. For example, when using a VM with 8 CPUs, 30 GB of RAM and utilizing 2/2 workers, processing a 5000 by 5000 dataset in this notebook took about 6 minutes. Extrapolating to a dataset with 1 million SNPs, it would take roughly 25 hours on the same VM setup. However, by scaling up to 50/50 workers, the cost increases to $20 per hour, but the runtime should decrease proportionally, potentially completing the task within 1-2 hours.


5. Use a background notebook for jobs that need more than 30 minutes of runtime:

https://workbench.researchallofus.org/workspaces/aou-rw-d56fb435/bestpracticesforaoudatascience/analysis/preview/00.How%20to%20Run%20Notebooks%20in%20the%20Background.ipynb
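
The extrapolation in tip 4 can be reproduced with a few lines of arithmetic. The sketch below assumes runtime scales linearly with the number of SNPs only, which is a simplification (sample count and cluster overhead are ignored). It uses the figures quoted in this notebook: 6 minutes for 5,000 SNPs, $1.25 per hour for the 2/2-worker setup, and $20 per hour for 50/50 workers.

```python
def extrapolate_hours(base_minutes, base_snps, target_snps):
    """Linear extrapolation of runtime in hours, scaling by SNP count only."""
    return base_minutes * (target_snps / base_snps) / 60


# Small cluster (2 workers + 2 preemptible workers)
hours_small = extrapolate_hours(6, 5_000, 1_000_000)
cost_small = hours_small * 1.25

# Large cluster (50/50 workers), using the 1 to 2 hour range from tip 4
cost_large_low, cost_large_high = 1 * 20, 2 * 20

print(hours_small, cost_small, cost_large_low, cost_large_high)
```

Under this naive model the small cluster needs about 20 hours, the same order of magnitude as the roughly 25 hours quoted in tip 4, and the total cost of the two setups is comparable even though the hourly rates differ widely.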
