<div class="alert alert-info" style="font-family:'arial';font-size:25px"> How to work with the Variant Annotation Table (VAT) </div>

The All of Us (AoU) Researcher Workbench (RW) provides two ways to access the VAT: Hail and BigQuery.

In this notebook, we demonstrate how to extract variant information from the VAT using BigQuery.

Estimated runtime: a standard VM is sufficient, and the notebook finishes in about 1 to 2 minutes.

Limitations: only a subset of the VAT fields (columns) is available in BigQuery. For more details about the VAT, see this online article: https://support.researchallofus.org/hc/en-us/articles/4615256690836-Variant-Annotation-Table

In [None]:
from datetime import datetime
start = datetime.now()

In [None]:
import os
import pandas as pd
my_bucket=os.getenv("WORKSPACE_BUCKET")
my_bucket

In [None]:
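# read the CDR dataset path the same way as the bucket (plain Python instead of the %env magic)
dataset = os.getenv("WORKSPACE_CDR")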
dataset
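Both variables are provided by the workbench environment. The following is a minimal sketch of a guard that fails early with a clear message when one of them is missing. `require_env` is a hypothetical helper name introduced here for illustration; it is not part of any workbench API.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage inside the workbench (assumes both variables are defined there):
# my_bucket = require_env("WORKSPACE_BUCKET")
# dataset = require_env("WORKSPACE_CDR")
```

Failing here is easier to diagnose than a query that later contains the literal text `None` in place of the dataset path.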

# Where is the VAT table in bigquery?


Two subset tables of the original VAT are available in BigQuery.

The main table is cb_variant_attribute, which has 16 columns.

The other is cb_variant_attribute_rs_number, which stores the rsID of each variant in its rs_number column.

**What columns are there?**

In [None]:
query=f"""

SELECT * FROM {dataset}.cb_variant_attribute
LIMIT 2
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
# transpose the data frame
df.T
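As an alternative to reading two rows and transposing them, the column names and types could be requested from BigQuery's `INFORMATION_SCHEMA.COLUMNS` view. This is a sketch only: `build_columns_query` is a hypothetical helper, the dataset path in the example is a placeholder, and whether the workbench grants access to `INFORMATION_SCHEMA` for the CDR is an assumption that has not been verified here.

```python
def build_columns_query(dataset: str, table: str) -> str:
    """Build a query that lists column names and data types for one table."""
    return (
        "SELECT column_name, data_type\n"
        f"FROM {dataset}.INFORMATION_SCHEMA.COLUMNS\n"
        f"WHERE table_name = '{table}'\n"
        "ORDER BY ordinal_position\n"
    )


# "my-project.my_dataset" is a placeholder, not a real CDR path.
print(build_columns_query("my-project.my_dataset", "cb_variant_attribute"))
```

In the notebook, the returned string would be passed to `pd.read_gbq(..., dialect="standard")` with `dataset` taken from `WORKSPACE_CDR`.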

**Check cb_variant_attribute_rs_number table**

In [None]:
query=f"""

SELECT * FROM {dataset}.cb_variant_attribute_rs_number
LIMIT 2
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
df

**How many variants are there?**

In [None]:
query=f"""

SELECT COUNT(DISTINCT vid) AS n_variants FROM {dataset}.cb_variant_attribute

"""
df = pd.read_gbq(query, dialect="standard")
df

# Using BRCA1 as an example

**Extract all variants where the 'genes' column equals 'BRCA1'**

In [None]:
query=f"""
SELECT *
FROM {dataset}.cb_variant_attribute
-- JOIN {dataset}.cb_variant_attribute_rs_number USING (vid)
WHERE genes IN ('BRCA1')
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
df.genes.unique()

In [None]:
# how many variant ids
df.vid.nunique()
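The same query can be generated for any list of genes. The sketch below builds the SQL text only; `build_gene_query` and its `with_rsid` flag are hypothetical names introduced for illustration, and the check on gene symbols assumes that symbols contain only letters, digits, `.`, `_`, and `-`.

```python
import re
from typing import Sequence

# Assumption: gene symbols contain only letters, digits, '.', '_' and '-'.
_GENE_RE = re.compile(r"[A-Za-z0-9._-]+")


def build_gene_query(dataset: str, genes: Sequence[str], with_rsid: bool = False) -> str:
    """Build a SELECT on cb_variant_attribute for exact matches on `genes`."""
    if not genes:
        raise ValueError("genes must not be empty")
    for gene in genes:
        # Reject anything that could break out of the quoted SQL literal.
        if not _GENE_RE.fullmatch(gene):
            raise ValueError(f"Unexpected gene symbol: {gene!r}")
    gene_list = ", ".join(f"'{gene}'" for gene in genes)
    join = (
        f"JOIN {dataset}.cb_variant_attribute_rs_number USING (vid)\n"
        if with_rsid
        else ""
    )
    return (
        "SELECT *\n"
        f"FROM {dataset}.cb_variant_attribute\n"
        f"{join}"
        f"WHERE genes IN ({gene_list})\n"
    )


# "my-project.my_dataset" is a placeholder, not a real CDR path.
print(build_gene_query("my-project.my_dataset", ["BRCA1", "BRCA2"]))
```

Validating the symbols before interpolating them keeps a stray quote in user input from changing the query.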

**Extract all variants with an rsID where the 'genes' column equals 'BRCA1'**

In [None]:
query=f"""
SELECT *
FROM {dataset}.cb_variant_attribute
JOIN {dataset}.cb_variant_attribute_rs_number USING (vid)
WHERE genes IN ('BRCA1')
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
df.head()
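The row count after the join can differ from the count without it. An inner join keeps only variants that appear in the rs_number table, and it repeats a variant once per rs number if that variant has more than one. The sketch below shows both effects with pandas on made-up data; the vids and rs numbers are toy values, and whether the VAT actually contains variants with several rs numbers is an assumption to check against the real tables.

```python
import pandas as pd

# Toy data for illustration only; these vids and rs numbers are made up.
variants = pd.DataFrame({"vid": ["v1", "v2", "v3"], "genes": ["BRCA1"] * 3})
rs_numbers = pd.DataFrame(
    {"vid": ["v1", "v1", "v2"], "rs_number": ["rs1", "rs2", "rs3"]}
)

# Inner join on vid, the pandas analogue of JOIN ... USING (vid).
joined = variants.merge(rs_numbers, on="vid", how="inner")

print(joined.shape[0])       # number of rows after the join
print(joined.vid.nunique())  # number of distinct variants that have an rs number
```

Comparing `df.shape[0]` with `df.vid.nunique()` on the real result shows whether either effect occurs for BRCA1.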

**Check the values of clinical_significance_string**

In [None]:
df.clinical_significance_string.unique()

**Filter further using clinical_significance_string**

In [None]:
query=f"""
SELECT *
FROM {dataset}.cb_variant_attribute
JOIN {dataset}.cb_variant_attribute_rs_number USING (vid)
WHERE genes IN ('BRCA1')
-- '%pathogenic%' also matches 'likely pathogenic', so one pattern covers both
AND clinical_significance_string LIKE '%pathogenic%'
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
df.head()
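A substring pattern such as '%pathogenic%' matches any value that contains those letters, including longer words such as "pathogenicity". If that is not wanted, the filter can be tightened in pandas after the query. The sketch below uses toy values; the actual spellings and separators in clinical_significance_string are an assumption and should be checked against the unique values listed earlier.

```python
import pandas as pd

# Toy values for illustration; the real column may use other spellings or separators.
toy = pd.DataFrame(
    {
        "clinical_significance_string": [
            "pathogenic",
            "likely pathogenic",
            "conflicting interpretations of pathogenicity",
            "benign",
            None,
        ]
    }
)

# Match 'pathogenic' only when it is not part of a longer word such as 'pathogenicity'.
pattern = r"(?<![a-z])pathogenic(?![a-z])"
mask = toy.clinical_significance_string.str.contains(
    pattern, case=False, regex=True, na=False
)
print(toy[mask])
```

With these toy values, the first two rows are kept, while a plain substring match would also keep the third.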

**Extract all variants where the 'genes' column contains 'BRCA1'**

In [None]:
query=f"""

SELECT *
FROM {dataset}.cb_variant_attribute
-- JOIN {dataset}.cb_variant_attribute_rs_number USING (vid)
WHERE genes LIKE '%BRCA1%'
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

In [None]:
df.genes.unique()

In [None]:
# this number is the same as the one reported by the Cohort Builder
df.vid.nunique()
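Because LIKE '%BRCA1%' is a substring match, it also returns rows whose genes value merely contains those characters, which is why the unique values are inspected above. A stricter check for BRCA1 as a whole token could look like the sketch below. `mentions_gene` is a hypothetical helper, the example values are toy data, and the separators between multiple genes (commas, semicolons, or whitespace) are an assumption.

```python
import re


def mentions_gene(genes_value, symbol: str) -> bool:
    """Return True if `symbol` appears as a whole token in a `genes` value."""
    if not isinstance(genes_value, str):
        return False  # missing values never match
    # Assumption: multiple genes are separated by commas, semicolons, or whitespace.
    tokens = [token for token in re.split(r"[,;\s]+", genes_value) if token]
    return symbol in tokens


# Toy values for illustration only.
examples = ["BRCA1", "BRCA1, NBR2", "BRCA1P1", None]
print([mentions_gene(value, "BRCA1") for value in examples])
```

On the query result, this could be applied as `df[df.genes.apply(lambda v: mentions_gene(v, "BRCA1"))]`.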

**Save the result to the bucket**

In [None]:
out_path = f'{my_bucket}/data/vat_gene_BRCA1_v7_bigquery.tsv'
out_path

In [None]:
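# the file name ends in .tsv, so write tab-separated values and omit the index column
df.to_csv(out_path, sep='\t', index=False)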

**Check the file in the bucket**

In [None]:
!gsutil ls -l {out_path}
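A quick way to confirm that a saved table can be read back is a round trip with the same separator. The sketch below uses an in-memory buffer and a toy data frame instead of the bucket, so it can be run anywhere; the values are made up.

```python
import io

import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"vid": ["v1", "v2"], "genes": ["BRCA1", "BRCA1"]})

# Write tab-separated text without the index column.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)

# Read it back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored)
```

For the file in the bucket, the equivalent check would be `pd.read_csv(out_path, sep="\t")`, assuming the environment can read `gs://` paths the same way it writes them.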

# Conclusions

1. BigQuery is much faster than Hail for extracting variant information from the VAT.

2. We recommend using BigQuery to extract variant information from the VAT if you do not need the columns that are missing from the BigQuery tables.

In [None]:
# total time
end = datetime.now()
end-start