<div class="alert alert-info" style="font-family:'arial';font-size:25px"> How to work with the Variant Annotation Table (VAT) </div>

The All of Us (AoU) Researcher Workbench (RW) provides two ways to access the VAT: one uses Hail, the other uses BigQuery.

In this notebook, we demonstrate how to use Hail to extract variant information from the VAT for a given gene symbol.

Please be aware that this process takes a significant amount of time, so we recommend running it as a background notebook.

Runtime estimate: with the default Dataproc setting (2/0 workers), the notebook takes around 7-8 hours to finish. With 50/50 workers, it takes about 30-60 minutes.

More information about the VAT is available at https://support.researchallofus.org/hc/en-us/articles/4615256690836-Variant-Annotation-Table and in the tutorial notebook `01_Get Started with WGS Data.ipynb` in the featured workspace "How to Work with All of Us Genomic Data (Hail - Plink)(v7)".

In [None]:
from datetime import datetime
start = datetime.now()

In [None]:
import os
my_bucket=os.getenv("WORKSPACE_BUCKET")
my_bucket

In [None]:
import hail as hl
hl.init(default_reference="GRCh38")

# Where is the VAT table?
In [None]:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux

In [None]:
auxiliary_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux"
auxiliary_path

In [None]:
vat_path = f'{auxiliary_path}/vat/*.gz'
vat_path

**Check the file size**

In [None]:
!gsutil -u $$GOOGLE_PROJECT ls -l {vat_path}

**Let's use v7.1**

In [None]:
vat_path = f'{auxiliary_path}/vat/vat_complete_v7.1.bgz.tsv.gz'
vat_path

**Using Hail to import VAT**

In [None]:
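# force_bgz=True tells Hail to read the .gz file as block-gzipped (BGZ) data, which allows it to be read in parallel.
vat_table = hl.import_table(vat_path, force=True, quote='"', delimiter="\t", force_bgz=True)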

# Filter the VAT by gene symbol

In [None]:
gene='BRCA1'

In [None]:
vat_new = vat_table.filter(vat_table["gene_symbol"]==gene)
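
The filter above keeps only rows whose `gene_symbol` is exactly equal to `gene`. Because the full Hail job runs for a long time, it can help to check the matching logic on a tiny sample first. The following is a minimal plain-Python sketch of the same predicate, not the Hail implementation. The `SAMPLE_TSV` rows and the helper `filter_by_gene` are hypothetical placeholders, not real VAT records or part of any library.

```python
import csv
import io

# Hypothetical miniature excerpt with VAT-like columns.
# The values are placeholders, not real VAT records.
SAMPLE_TSV = (
    "vid\tgene_symbol\tconsequence\n"
    "placeholder-1\tBRCA1\tmissense_variant\n"
    "placeholder-2\tBRCA2\tsynonymous_variant\n"
    "placeholder-3\tbrca1\tintron_variant\n"
    "placeholder-4\t\tintergenic_variant\n"
)


def filter_by_gene(tsv_text, gene):
    """Return the rows whose gene_symbol equals `gene` exactly (case-sensitive)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["gene_symbol"] == gene]


matches = filter_by_gene(SAMPLE_TSV, "BRCA1")
print([row["vid"] for row in matches])
```

Note that the comparison is an exact string match, so a lowercase symbol such as `brca1` or an empty `gene_symbol` does not match `BRCA1`.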

**Only select the fields you need**

In [None]:
vat_new = vat_new.select(
    'vid', 'contig', 'position', 'genomic_location', 'gene_symbol', 'dbsnp_rsid',
    'ref_allele', 'alt_allele', 'consequence', 'clinvar_classification',
    'clinvar_phenotype', 'variant_type', 'gvs_all_ac', 'gvs_all_an', 'gvs_all_af'
)

In [None]:
vat_new.describe()

In [None]:
# Don't run this: it will take at least 2 hours.
# vat_new.show()

**Save the result to the bucket**

In [None]:
out_path = f'{my_bucket}/data/vat_gene_{gene}_v7_hail.tsv'
out_path

In [None]:
# Save the result to the bucket; this will take time.
vat_new.export(out_path)

**Check saved file in the bucket**

In [None]:
!gsutil ls -l {my_bucket}/data/vat_gene_{gene}_v7_hail.tsv
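
Once the per-gene file exists, it is small enough to inspect with ordinary Python after copying it to local disk. `hl.import_table` reads every column as a string unless types are specified, so the count and frequency columns need converting before numeric use. The sketch below is one possible way to do that, under several assumptions: the export is tab-separated with a header row, missing values appear as an empty string or `NA`, and the sample rows are placeholders rather than real VAT records. The helpers `to_number` and `load_export` and the 0.01 frequency cutoff are illustrative only.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for the exported file; not real VAT records.
SAMPLE_EXPORT = (
    "vid\tgene_symbol\tgvs_all_ac\tgvs_all_an\tgvs_all_af\n"
    "placeholder-1\tBRCA1\t2\t1000\t0.002\n"
    "placeholder-2\tBRCA1\t50\t1000\t0.05\n"
    "placeholder-3\tBRCA1\tNA\tNA\tNA\n"
)


def to_number(text, cast):
    """Convert a string field; '' and 'NA' are treated as missing (assumed markers)."""
    return None if text in ("", "NA") else cast(text)


def load_export(path):
    """Read a tab-separated export and convert the count and frequency columns."""
    rows = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            row["gvs_all_ac"] = to_number(row["gvs_all_ac"], int)
            row["gvs_all_an"] = to_number(row["gvs_all_an"], int)
            row["gvs_all_af"] = to_number(row["gvs_all_af"], float)
            rows.append(row)
    return rows


# Write the placeholder rows to a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "vat_gene_sample.tsv")
    with open(local_path, "w", newline="") as handle:
        handle.write(SAMPLE_EXPORT)
    rows = load_export(local_path)

# Illustrative cutoff: keep variants with an overall allele frequency below 1%.
rare = [r["vid"] for r in rows if r["gvs_all_af"] is not None and r["gvs_all_af"] < 0.01]
print(len(rows), rare)
```

In the notebook, `local_path` would instead point at a local copy of the exported file.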

In [None]:
# Total elapsed time since the first cell ran
end = datetime.now()
end-start