<div class="alert alert-info" style="font-family:'arial';font-size:25px"> How to calculate the prevalence of BRCA1 pathogenic variant carriers in case/control cohorts using Hail </div>

**Introduction**

This use case was provided by John Baierl and Dr. Paul Pharoah at Cedars-Sinai Medical Center, CA, USA.

This notebook illustrates how to determine the prevalence of BRCA1 pathogenic variant carriers within case (ovarian cancer) and control (non-cancer) cohorts using the All of Us (AoU) exome dataset.

Within the notebook '05.Genomics_use_case_hail_to_plink.ipynb', we examine the effects of filtering the Hail matrix table (MT) based on chromosome intervals alone versus filtering based on exact locus and alleles.

In this notebook, we further demonstrate the effects of these two filtering methods on calculating the prevalence of BRCA1 pathogenic variants.

This notebook uses 4 CPUs and 26 GB of memory with 2/0 workers; the runtime is about 4-5 minutes.

**Prerequisite**: Please go through our featured genomic workspace before testing this notebook. 

In [None]:
from datetime import datetime
start = datetime.now()

**Setup**

In [None]:
import os
import pandas as pd
import numpy as np
import hail as hl
hl.init(default_reference = "GRCh38")

In [None]:
bucket = os.getenv('WORKSPACE_BUCKET')
bucket

In [None]:
mt_wgs_exome_path = os.getenv("WGS_EXOME_SPLIT_HAIL_PATH")
mt_wgs_exome_path

In [None]:
mt = hl.read_matrix_table(mt_wgs_exome_path)
mt.count()

In [None]:
test_intervals = ['chr17: 43047642-43047643']

In [None]:
mt2 = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x,)
     for x in test_intervals])

Run the cell below to show the shape of the data. The result shows 1 row (variant) and 414,830 columns (samples).

In [None]:
mt2.count()

**Get GT counts using agg function**

We have 1 sample that has 1 alternate allele.

In [None]:
mt2.aggregate_entries(hl.agg.counter(mt2.GT.n_alt_alleles()))
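
To make the shape of that result concrete, here is a plain-Python sketch (not Hail) that counts values the same way with `collections.Counter`. The list `n_alt_per_sample` is hypothetical toy data, not AoU output; `hl.agg.counter` similarly returns a mapping from each observed value to the number of entries with that value.

```python
from collections import Counter

# Hypothetical alternate-allele counts for five samples at one variant
# (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).
n_alt_per_sample = [0, 0, 1, 0, 0]

# Map each observed value to the number of samples carrying it.
genotype_counts = Counter(n_alt_per_sample)
print(dict(genotype_counts))  # {0: 4, 1: 1}
```

In this toy case, one sample has exactly 1 alternate allele and no sample is homozygous alternate.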

Alternatively, run the cell below to keep only the samples that have at least one non-homozygous-reference genotype.

In [None]:
# Filter samples that have at least one non-homozygous reference genotype
samples_with_alt = mt2.filter_cols(
    hl.agg.any(mt2.GT.is_non_ref())
)

samples_with_alt.count()

**Read BRCA1 pathogenic variants from VAT**

Please refer to the two notebooks '03.Genomics_use_case_VAT_Hail' and '04.Genomics_use_case_VAT_bigquery' for how to extract BRCA1 variants from the VAT. We recommend using BigQuery to extract variant information from the VAT.

Assuming the BRCA1 variant file has already been saved to the bucket, the cell below reads it back in.

In [None]:
vat_filename = f'{bucket}/data/test/vat6_brca1.tsv'
vat_table = hl.import_table(vat_filename,
                            impute = True,
                            )

In [None]:
vat_table.count()

In [None]:
vat_table.show(5)

**Import genomic data and filter out flagged samples**

In [None]:
start2 = datetime.now()

In [None]:
auxiliary_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux"
auxiliary_path

In [None]:
mt = hl.read_matrix_table(mt_wgs_exome_path)

# Prepare MT from relatedness, phenotypes, and ancestry_pred tables
relatedness = f'{auxiliary_path}/relatedness'
# kin_score_path = f'{relatedness}/relatedness.tsv'
related_samples_path = f'{relatedness}/relatedness_flagged_samples.tsv'

related_remove = hl.import_table(related_samples_path,
                                 types={"sample_id":"tstr"},
                                key="sample_id")
mt = mt.anti_join_cols(related_remove)
mt.count()

**Keep the case/control samples**

The case/control (1/0) labels for the samples are stored in the file "eoc_all_phenotypes.tsv" in the bucket.

In [None]:
# Import phenotypes table from bucket as a Hail table
phenotype_filename = f'{bucket}/data/test/eoc_all_phenotypes.tsv'

phenotypes = hl.import_table(phenotype_filename,
                            types = {'person_id':hl.tstr},
                            impute = True,
                            key = 'person_id')

phenotypes.count()

In [None]:
phenotypes.show(5)

**In total there are 735 person_ids in the case and 116259 person_ids in the control cohort**

In [None]:
phenotypes.aggregate(hl.agg.counter(phenotypes.is_case))

In [None]:
# Filters out person_ids not in phenotypes table
mt = mt.semi_join_cols(phenotypes)
mt = mt.annotate_cols(pheno = phenotypes[mt.s])
mt.count()
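
The two column joins used above can be read as set operations on sample IDs. The following plain-Python sketch (not Hail) mirrors that logic; every ID and set name in it is a hypothetical stand-in, not real AoU data.

```python
# Hypothetical sample IDs; none of these are real AoU identifiers.
all_samples = ["1001", "1002", "1003", "1004", "1005"]
flagged_related = {"1002"}              # stands in for relatedness_flagged_samples.tsv
phenotyped = {"1001", "1002", "1004"}   # stands in for the keys of the phenotypes table

# Anti-join: keep samples NOT present in the flagged set.
after_anti_join = [s for s in all_samples if s not in flagged_related]

# Semi-join: keep samples that ARE present in the phenotypes table.
after_semi_join = [s for s in after_anti_join if s in phenotyped]

print(after_semi_join)  # ['1001', '1004']
```

Sample "1002" has a phenotype but is dropped because it was flagged, which is why the order of the two steps does not change the final set.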

**Filter chromosome interval first**

In [None]:
# create interval column
vat_table = vat_table.transmute(position = hl.int32(vat_table.position))
vat_table = vat_table.annotate(interval = hl.locus_interval(vat_table.contig, 
                                                            vat_table.position, vat_table.position + 1))

In [None]:
mt1 = hl.filter_intervals(mt, vat_table.interval.collect())

In [None]:
mt1.count()

After filtering by interval, 246 variants remain.

**Calculate carrier prevalence in case/control**

In [None]:
# Compute whether there's at least one heterozygous genotype (is_het) in each individual
# in mt1, and annotate the column field 'ind_het' with the result
# mt_brca1_burden will have an additional column 'ind_het' indicating whether
# there is at least one heterozygous genotype for each individual
mt_brca1_burden = mt1.annotate_cols(
    ind_het = hl.agg.any(mt1.GT.is_het())
)

# Extract the column (sample) fields as a Hail Table. The resulting table
# contains the original column fields along with the newly added 'ind_het' field.
mt_brca1_burden = mt_brca1_burden.cols()

# Group the table by the 'pheno.is_case' field.
# Within each group, count the individuals with at least one heterozygous LoF (loss-of-function)
# genotype (n_w_lof) and the individuals without one (n_no_lof).

brca1_lof = mt_brca1_burden.group_by(mt_brca1_burden.pheno.is_case).aggregate(
    n_w_lof = hl.agg.sum(mt_brca1_burden.ind_het), 
    n_no_lof = hl.agg.sum(~mt_brca1_burden.ind_het)
)

# LoF contingency table in case/control
brca1_lof.show()

The result is a 2x2 table: case status (rows) by carrier status (n_w_lof, n_no_lof).
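
As a rough illustration of what the grouped aggregation computes, the plain-Python sketch below (not Hail) builds the same kind of 2x2 table from a hypothetical list of `(is_case, ind_het)` pairs. The six toy samples are assumptions made for the example only.

```python
from collections import defaultdict

# Hypothetical (is_case, ind_het) pairs, one per sample.
samples = [(1, True), (1, False), (0, False), (0, True), (0, False), (0, False)]

# One row per phenotype group, with carrier and non-carrier counts.
table = defaultdict(lambda: {"n_w_lof": 0, "n_no_lof": 0})
for is_case, ind_het in samples:
    column = "n_w_lof" if ind_het else "n_no_lof"
    table[is_case][column] += 1

for is_case in sorted(table):
    print(is_case, table[is_case])
# 0 {'n_w_lof': 1, 'n_no_lof': 3}
# 1 {'n_w_lof': 1, 'n_no_lof': 1}
```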

**Filter exact match on locus+alleles**

In [None]:
# Reformat variant strings, parse as variant type to link genomic data
vat_table = vat_table.annotate(vid_alt = hl.str('chr') + vat_table.vid.replace('-', ':'))
vat_table = vat_table.key_by(**hl.parse_variant(vat_table.vid_alt, reference_genome = 'GRCh38'))
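
The string reformatting in the cell above can be checked in plain Python. This sketch assumes the VAT `vid` has the form `contig-position-ref-alt` without a `chr` prefix, which is what the code above implies; the example value and the helper name `vid_to_variant_string` are hypothetical.

```python
def vid_to_variant_string(vid: str) -> str:
    """Convert 'contig-position-ref-alt' into 'chr<contig>:position:ref:alt'."""
    return "chr" + vid.replace("-", ":")

# Hypothetical vid, assuming contigs are stored without the 'chr' prefix.
example = vid_to_variant_string("17-43047642-G-A")
print(example)  # chr17:43047642:G:A
```

The output is in the colon-delimited `contig:position:ref:alt` form that `hl.parse_variant` parses.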

In [None]:
# Annotates genomic data with VAT then filters
mt2 = mt1.annotate_rows(vat = vat_table[mt1.row_key])
mt2 = mt2.semi_join_rows(vat_table)

In [None]:
mt2.count()

There are 196 variants left, compared with 246 variants after interval-only filtering.

**Calculate carrier prevalence in case/control**
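
One way to see why interval-only filtering retains extra variants is that several distinct alleles can share a position. The plain-Python sketch below (not Hail) uses hypothetical variants, not real AoU or VAT records, to contrast the two filters.

```python
# Hypothetical variants as (contig, position, ref, alt).
dataset_variants = [
    ("chr17", 43047642, "G", "A"),   # allele on the pathogenic list
    ("chr17", 43047642, "G", "T"),   # same position, different allele
    ("chr17", 43047642, "GA", "G"),  # deletion starting at the same position
    ("chr17", 43050000, "C", "T"),   # position not on the list
]
pathogenic_list = {("chr17", 43047642, "G", "A")}

# Interval-style filter: keep anything located at a listed position.
listed_loci = {(contig, pos) for contig, pos, _, _ in pathogenic_list}
by_interval = [v for v in dataset_variants if (v[0], v[1]) in listed_loci]

# Exact filter: keep only variants whose locus AND alleles are listed.
by_exact = [v for v in dataset_variants if v in pathogenic_list]

print(len(by_interval), len(by_exact))  # 3 1
```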

In [None]:
mt_brca1_burden2 = mt2.annotate_cols(
    ind_het = hl.agg.any(mt2.GT.is_het())
)

mt_brca1_burden2 = mt_brca1_burden2.cols()

brca1_lof2 = mt_brca1_burden2.group_by(mt_brca1_burden2.pheno.is_case).aggregate(
    n_w_lof = hl.agg.sum(mt_brca1_burden2.ind_het), 
    n_no_lof = hl.agg.sum(~mt_brca1_burden2.ind_het)
)

brca1_lof2.show()

In [None]:
end = datetime.now()
end-start

**Conclusion**

Comparing the two results, a substantial difference arises in the prevalence of BRCA1 variant carriers within the control group. When filtering only by intervals, the prevalence is about 1.8% (2090/(2090+114169)), whereas when filtering by exact locus and alleles, it is about 0.18% (206/(206+116053)), a difference of about 10x. Such a discrepancy will profoundly impact the comparison of BRCA1 carrier prevalence between the case and control groups. Therefore, when using a custom variant list with exact locus and allele information, it is imperative to filter AoU genomic datasets on the exact matched locus and alleles rather than on intervals alone.

In [None]:
2090/(2090+114169)

In [None]:
206/(206+116053)
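
The two fractions above are easier to compare as percentages with an explicit fold difference. This short sketch uses the control-group counts quoted in the conclusion; the helper name `prevalence` is introduced here for illustration only.

```python
def prevalence(n_carriers: int, n_non_carriers: int) -> float:
    """Fraction of individuals who carry at least one listed variant."""
    return n_carriers / (n_carriers + n_non_carriers)

interval_prev = prevalence(2090, 114169)  # interval-only filtering
exact_prev = prevalence(206, 116053)      # exact locus + allele filtering
fold = interval_prev / exact_prev

print(f"interval only:   {interval_prev:.2%}")  # 1.80%
print(f"exact match:     {exact_prev:.2%}")     # 0.18%
print(f"fold difference: {fold:.1f}x")          # 10.1x
```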