## dbSNP

Use this notebook to create Hail Tables for dbSNP after downloading the raw data from https://ftp.ncbi.nih.gov/snp/.

The raw data was downloaded with Hail Batch; see `hail/datasets/extract/extract_dbSNP.py`.

In [None]:
import hail as hl
hl.init()

### Create Hail Tables from GRCh37 and GRCh38 assembly reports

The contigs in the VCFs are [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) accession numbers, and need to be mapped back to the appropriate chromosome for each reference genome.

The GRCh37 assembly can be found [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25), and the assembly report mapping chromosomes to RefSeq sequences can be found [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt).

The GRCh38 assembly can be found [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39), and the assembly report mapping chromosomes to RefSeq sequences can be found [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt).
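
To make the mapping concrete, here is a minimal plain-Python sketch (no Hail required) of how a RefSeq-to-chromosome lookup can be read out of an assembly report. The rows in `SAMPLE_REPORT` are a short illustrative excerpt in the report's ten-column, tab-separated layout, and `parse_assembly_report` is a hypothetical helper, not part of the pipeline in this notebook.

```python
import csv
import io

# Column order of the NCBI assembly report (the same names used in the cells below).
FIELD_NAMES = [
    "sequence_name", "sequence_role", "assigned_molecule",
    "assigned_molecule_location/type", "genbank_accn", "relationship",
    "refseq_accn", "assembly_unit", "sequence_length", "ucsc_style_name",
]

# Illustrative excerpt only: one comment line and two data rows.
SAMPLE_REPORT = (
    "# Assembly name:  GRCh37.p13\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.1\t=\tNC_000001.10\t"
    "Primary Assembly\t249250621\tchr1\n"
    "MT\tassembled-molecule\tMT\tMitochondrion\tJ01415.2\t=\tNC_012920.1\t"
    "non-nuclear\t16569\tchrM\n"
)


def parse_assembly_report(text):
    """Return {refseq_accn: assigned_molecule}, skipping '#' comment lines."""
    lines = [line for line in io.StringIO(text) if not line.startswith("#")]
    reader = csv.DictReader(lines, fieldnames=FIELD_NAMES, delimiter="\t")
    mapping = {}
    for row in reader:
        # "na" marks a missing value in the report.
        if row["refseq_accn"] != "na":
            mapping[row["refseq_accn"]] = row["assigned_molecule"]
    return mapping


refseq_to_chr = parse_assembly_report(SAMPLE_REPORT)
print(refseq_to_chr)
```

The cells below do the same thing at scale with `hl.import_table`, using `comment="#"` and `missing="na"` for the two special cases handled by hand here.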

#### GRCh37

In [None]:
ht = hl.import_table("gs://hail-datasets-tmp/dbSNP/GCF_000001405.25_GRCh37.p13_assembly_report.txt", 
                     no_header=True, 
                     comment="#",
                     delimiter="\t", 
                     missing="na")

field_names = ['sequence_name','sequence_role','assigned_molecule',
               'assigned_molecule_location/type', 'genbank_accn', 'relationship', 
               'refseq_accn', 'assembly_unit', 'sequence_length', 'ucsc_style_name']

name = "dbSNP"
version = "154"
build = "GRCh37"
n_rows = ht.count()
n_partitions = ht.n_partitions()

ht = ht.annotate_globals(
    metadata=hl.struct(
        name=name,
        version=version,
        reference_genome=build,
        n_rows=n_rows,
        n_partitions=n_partitions
    )
)
ht = ht.rename(dict(zip([f"f{i}" for i in range(10)], field_names)))
ht = ht.drop("relationship").key_by("refseq_accn")

ht.write("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh37.ht")
ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh37.ht")
ht.describe()

#### GRCh38

In [None]:
ht = hl.import_table("gs://hail-datasets-tmp/dbSNP/GCF_000001405.39_GRCh38.p13_assembly_report.txt", 
                     no_header=True, 
                     comment="#",
                     delimiter="\t", 
                     missing="na")

field_names = ['sequence_name','sequence_role','assigned_molecule',
               'assigned_molecule_location/type', 'genbank_accn', 'relationship', 
               'refseq_accn', 'assembly_unit', 'sequence_length', 'ucsc_style_name']

name = "dbSNP"
version = "154"
build = "GRCh38"
n_rows = ht.count()
n_partitions = ht.n_partitions()

ht = ht.annotate_globals(
    metadata=hl.struct(
        name=name,
        version=version,
        reference_genome=build,
        n_rows=n_rows,
        n_partitions=n_partitions
    )
)
ht = ht.rename(dict(zip([f"f{i}" for i in range(10)], field_names)))
ht = ht.drop("relationship").key_by("refseq_accn")

ht.write("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh38.ht")
ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh38.ht")
ht.describe()

### Create Hail Tables for dbSNP

Now we can use the assembly report for each reference genome build to map from RefSeq accession numbers to chromosomes, and create Hail Tables. There are no samples or entries in the dbSNP VCFs. Some helpful information about the dbSNP VCFs is available [here](https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/).

We will create two Hail Tables for each reference genome build, both keyed by `["locus", "alleles"]`:

  - Table with all fields from the imported VCF (e.g. `gs://hail-datasets-us/dbSNP_154_GRCh37.ht`)
  - Table with only the rsID field (e.g. `gs://hail-datasets-us/dbSNP_rsid_154_GRCh37.ht`)

First, load the VCFs to get all the contigs present in each dataset, so we can create a mapping used to recode contigs from RefSeq accession numbers to GRCh37/38 contig names.

In [None]:
mt37 = hl.import_vcf(f"gs://hail-datasets-tmp/dbSNP/dbSNP_154_GRCh37.vcf.bgz", 
                     header_file=f"gs://hail-datasets-tmp/dbSNP/dbSNP_154_GRCh37_header_only.vcf.txt", 
                     reference_genome=None, 
                     min_partitions=512)

mt38 = hl.import_vcf(f"gs://hail-datasets-tmp/dbSNP/dbSNP_154_GRCh38.vcf.bgz", 
                     header_file=f"gs://hail-datasets-tmp/dbSNP/dbSNP_154_GRCh38_header_only.vcf.txt", 
                     reference_genome=None, 
                     min_partitions=512)

mt37 = mt37.checkpoint(f"gs://hail-datasets-tmp/checkpoints/dbSNP_154_GRCh37_no_coding.mt", 
                       _read_if_exists=True, 
                       overwrite=False)

mt38 = mt38.checkpoint(f"gs://hail-datasets-tmp/checkpoints/dbSNP_154_GRCh38_no_coding.mt", 
                       _read_if_exists=True, 
                       overwrite=False)

# To get all contigs present for recoding to correct reference genome mapping
contigs_present37 = mt37.aggregate_rows(hl.agg.collect_as_set(mt37.locus.contig))
contigs_present38 = mt38.aggregate_rows(hl.agg.collect_as_set(mt38.locus.contig))

In [None]:
# Load NCBI assembly reports with RefSeq mappings
assembly37_ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh37.ht")
assembly37_ht = assembly37_ht.annotate(
    contig = hl.if_else(assembly37_ht.sequence_role == "unlocalized-scaffold", 
                        assembly37_ht.genbank_accn, 
                        assembly37_ht.assigned_molecule)
)
assembly38_ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh38.ht")

# Map RefSeq identifiers to chromosomes for GRCh37 using the "contig" field we created in assembly report
rg37 = hl.get_reference("GRCh37")
refseq_to_chr37 = dict(zip(assembly37_ht.refseq_accn.collect(), assembly37_ht.contig.collect()))
refseq_to_chr37 = {k: v for k, v in refseq_to_chr37.items() if k in contigs_present37 and v in rg37.contigs}

# Map RefSeq identifiers to chromosomes for GRCh38 using the "ucsc_style_name" field in assembly report
rg38 = hl.get_reference("GRCh38")
refseq_to_chr38 = dict(zip(assembly38_ht.refseq_accn.collect(), assembly38_ht.ucsc_style_name.collect()))
refseq_to_chr38 = {k: v for k, v in refseq_to_chr38.items() if k in contigs_present38 and v in rg38.contigs}

recodings = {
    "GRCh37": refseq_to_chr37, 
    "GRCh38": refseq_to_chr38
}
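
The filtering step in the cell above can be illustrated with plain Python. This is only a sketch with small hand-written stand-ins for the collected values (`refseq_to_contig`, `contigs_in_vcf`, and `reference_contigs` are hypothetical, as is the `build_recoding` helper); it keeps a mapping entry only if the RefSeq accession actually occurs in the VCF and the target contig exists in the reference genome.

```python
# Hand-written stand-ins for the values collected from Hail in the cell above.
refseq_to_contig = {
    "NC_000001.10": "1",
    "NC_000023.10": "X",
    "NT_113878.1": "GL000191.1",    # scaffold mapped through its GenBank accession
    "NW_000000000.1": "SOME_PATCH",  # made-up entry whose target is not in the reference
}
contigs_in_vcf = {"NC_000001.10", "NT_113878.1", "NW_000000000.1"}
reference_contigs = ["1", "X", "GL000191.1"]


def build_recoding(mapping, present, valid):
    """Keep entries whose key is present in the VCF and whose value is a valid contig."""
    valid = set(valid)
    return {k: v for k, v in mapping.items() if k in present and v in valid}


recoding = build_recoding(refseq_to_contig, contigs_in_vcf, reference_contigs)
print(recoding)
```

Accessions that are dropped here keep their RefSeq names at import time, which is presumably why `skip_invalid_loci=True` is passed to `hl.import_vcf` later in the notebook.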

Use the function and known keys below to convert allele frequency arrays to structs:

In [None]:
# Convert an array of strings like hl.array(["GnomAD:.,1,3.187e-05","TOPMED:.,1,2.389e-05"]) to a struct
def arr_str_to_struct(hl_array, known_keys):
    # Build a dict from "_<source>" to array<float64>, treating "." as missing
    _dict = hl.dict(
        hl_array.map(
            lambda x: ("_" + x.split(":")[0], 
                       x.split(":")[1].split(",").map(lambda v: hl.if_else(v == ".", hl.missing(hl.tfloat), hl.float(v))))
        )
    )
    # Bind the dict once, then look up every known key (keys absent from the dict become missing)
    _struct = hl.rbind(_dict, lambda d: hl.struct(**{k: d.get(k) for k in known_keys}))
    return _struct

# To get all possible keys for allele frequency arrays after loading VCF as MatrixTable
#     known_keys_FREQ = mt.aggregate_rows(
#         hl.agg.explode(
#             lambda x: hl.agg.collect_as_set(x), mt.info.FREQ.split("\\|").map(lambda x: x.split(":")[0])
#         )
#     )

known_keys = ['GENOME_DK','TWINSUK','dbGaP_PopFreq','Siberian','Chileans',
              'FINRISK','HapMap','Estonian','ALSPAC','GoESP',
              'TOPMED','PAGE_STUDY','1000Genomes','Korea1K','ChromosomeY',
              'ExAC','Qatari','GoNL','MGP','GnomAD',
              'Vietnamese','GnomAD_exomes','PharmGKB','KOREAN','Daghestan',
              'HGDP_Stanford','NorthernSweden','SGDP_PRJ']
known_keys_FREQ = list(map(lambda x: f"_{x}", known_keys))
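
As a rough illustration of what `arr_str_to_struct` produces, the same parsing can be written in plain Python. This is a sketch of the logic only, not the Hail expression itself: `parse_freq` is a hypothetical helper, `None` stands in for Hail's missing values, and the input is the example string from the comment above joined with `|`.

```python
def parse_freq(freq, known_keys):
    """Parse 'GnomAD:.,1,3.187e-05|TOPMED:...' into a dict with one entry per known key."""
    parsed = {}
    for item in freq.split("|"):
        source, values = item.split(":")
        # "." marks a missing frequency; everything else is a float.
        parsed["_" + source] = [None if v == "." else float(v) for v in values.split(",")]
    # Known keys that do not occur in this record map to None.
    return {k: parsed.get(k) for k in known_keys}


example = parse_freq(
    "GnomAD:.,1,3.187e-05|TOPMED:.,1,2.389e-05",
    ["_GnomAD", "_TOPMED", "_ExAC"],
)
print(example)
```

Fixing the set of keys up front is what lets the Hail version return a struct with the same fields for every row, whichever studies report a frequency for that variant.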

Now we can read in the VCF files again as MatrixTables with the correct contig recodings, and reformat the allele frequency information in `info.FREQ` and the clinical attributes in `info`.

Note that we are specifying a separate header file in the `hl.import_vcf` calls in the cell below. 

To make parsing strings easier, the following INFO fields in the VCF headers were changed from `Number=.` to `Number=1`: FREQ, CLNHGVS, CLNVI, CLNORIGIN, CLNSIG, CLNDISDB, CLNDN, CLNREVSTAT, CLNACC.

The modified VCF headers used are available in `gs://hail-datasets-tmp/dbSNP`.
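
The notebook does not show how the headers were edited. One way such a change might be made is sketched below in plain Python; `patch_header_line` and the sample header lines are hypothetical, and only the list of INFO IDs comes from the text above.

```python
import re

# INFO fields that should be read as a single string (Number=1).
SINGLE_VALUE_IDS = ["FREQ", "CLNHGVS", "CLNVI", "CLNORIGIN", "CLNSIG",
                    "CLNDISDB", "CLNDN", "CLNREVSTAT", "CLNACC"]

# Match '##INFO=<ID=<one of the IDs>,Number=.' at the start of a header line.
_PATTERN = re.compile(
    r"^(##INFO=<ID=(?:%s),Number=)\.(,)" % "|".join(SINGLE_VALUE_IDS)
)


def patch_header_line(line):
    """Rewrite Number=. to Number=1 for the listed INFO IDs; leave other lines alone."""
    return _PATTERN.sub(r"\g<1>1\g<2>", line)


# Made-up header lines for illustration.
before = '##INFO=<ID=FREQ,Number=.,Type=String,Description="example">'
other = '##INFO=<ID=OTHER,Number=.,Type=String,Description="example">'

after = patch_header_line(before)
untouched = patch_header_line(other)
print(after)
print(untouched)
```

With `Number=1`, each of these fields is imported as a single string, so the splitting on `|` and `,` can be done explicitly in the next cell.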

In [None]:
name = "dbSNP"
version = "154"
builds = ["GRCh37", "GRCh38"]

for build in builds:
    mt = hl.import_vcf(f"gs://hail-datasets-tmp/{name}/{name}_{version}_{build}.vcf.bgz", 
                       header_file=f"gs://hail-datasets-tmp/{name}/{name}_{version}_{build}_header_only.vcf.txt", 
                       contig_recoding=recodings[build], 
                       skip_invalid_loci=True, 
                       reference_genome=build, 
                       min_partitions=512)

    # First annotation, go from str to array<str> for FREQ
    mt = mt.annotate_rows(
        info = mt.info.annotate(
            FREQ = hl.or_missing(hl.is_defined(mt.info.FREQ), mt.info.FREQ.split("\\|"))
        )
    )
    # Second annotation, turn array<str> into a struct for FREQ
    mt = mt.annotate_rows(
        info = mt.info.annotate(
            FREQ = hl.or_missing(hl.is_defined(mt.info.FREQ), 
                                 arr_str_to_struct(mt.info.FREQ, known_keys_FREQ))
        )
    )
    # Reformat clinical attributes from str to array, splitting on "|" or ","
    # (raw strings avoid invalid-escape warnings for the regex)
    mt = mt.annotate_rows(
        info = mt.info.annotate(
            # For CLNHGVS, "." entries are kept as missing elements rather than dropped
            CLNHGVS = hl.or_missing(
                hl.is_defined(mt.info.CLNHGVS), 
                mt.info.CLNHGVS.split(r"(?:(\|)|(\,))")).map(lambda x: hl.if_else((x == "."), hl.missing(hl.tstr), x)),
            CLNVI = hl.or_missing(
                hl.is_defined(mt.info.CLNVI), 
                mt.info.CLNVI.split(r"(?:(\|)|(\,))")).filter(lambda x: x != "."),
            CLNORIGIN = hl.or_missing(
                hl.is_defined(mt.info.CLNORIGIN), 
                mt.info.CLNORIGIN.split(r"(?:(\|)|(\,))")).filter(lambda x: x != "."),
            CLNSIG = hl.or_missing(
                hl.is_defined(mt.info.CLNSIG), 
                mt.info.CLNSIG.split(r"(?:(\|)|(\,))")).filter(lambda x: x != "."),
            CLNDISDB = hl.or_missing(
                hl.is_defined(mt.info.CLNDISDB), 
                mt.info.CLNDISDB.split(r"(?:(\|)|(\,))")).filter(lambda x: x != "."),
            CLNDN = hl.or_missing(
                hl.is_defined(mt.info.CLNDN), 
                mt.info.CLNDN.split(r"(?:(\|)|(\,))")).filter(lambda x: x != "."),
            CLNREVSTAT = hl.or_missing(
                hl.is_defined(mt.info.CLNREVSTAT), 
                mt.info.CLNREVSTAT.split(r"(?:(\|)|(\,))")).filter(lambda x: x != "."),
            CLNACC = hl.or_missing(
                hl.is_defined(mt.info.CLNACC), 
                mt.info.CLNACC.split(r"(?:(\|)|(\,))")).filter(lambda x: x != ".")
        )
    )
    
    mt = mt.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}.mt", 
                       _read_if_exists=True, 
                       overwrite=False)
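
For intuition, the split-and-filter applied to most of the clinical attributes behaves roughly like the plain-Python sketch below. It is not the Hail code path: `split_clinical` is a hypothetical helper, `None` stands in for a missing value, and the pattern is written as the character class `[|,]` because Python's `re.split` also returns the text of capturing groups, which the grouped pattern used in the cell above would trigger.

```python
import re


def split_clinical(value):
    """Split on '|' or ',' and drop '.' placeholders; a missing value stays missing."""
    if value is None:
        return None
    return [x for x in re.split(r"[|,]", value) if x != "."]


# Made-up attribute strings for illustration.
print(split_clinical("a|.,b"))
print(split_clinical("."))
print(split_clinical(None))
```

Note that `CLNHGVS` is handled slightly differently in the cell above: its `.` entries are turned into missing elements instead of being filtered out, so the array keeps its original length.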

Then we can just grab the `rows` table since we have no sample or entry information in the MatrixTable. 

From there, we need to filter the biallelic and multiallelic variants into separate tables, split the multiallelic variants, and then union the split multiallelic table rows back with the biallelic table rows.

The allele frequency arrays start with the reference allele, followed by the alternate alleles as ordered in the ALT column of the VCF. After splitting, we can therefore index each array with `a_index` to pull out the frequency of the relevant alternate allele.

In [None]:
name = "dbSNP"
version = "154"
builds = ["GRCh37", "GRCh38"]

for build in builds:
    # No samples or entries in MT, just grab table with the rows
    mt = hl.read_matrix_table(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}.mt")
    ht = mt.rows()
       
    ht_ba = ht.filter(hl.len(ht.alleles) <= 2)
    ht_ba = ht_ba.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}_biallelic.ht", 
                             _read_if_exists=True, 
                             overwrite=False)

    ht_ma = ht.filter(hl.len(ht.alleles) > 2)
    ht_ma = ht_ma.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}_multiallelic.ht", 
                             _read_if_exists=True, 
                             overwrite=False)

    ht_split = hl.split_multi(ht_ma, keep_star=True, permit_shuffle=True)
    ht_split = ht_split.repartition(64, shuffle=False)
    ht_split = ht_split.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}_split_multiallelic.ht", 
                                   _read_if_exists=True, 
                                   overwrite=False)
    
    # Next, have to fix indices and union ht_split with ht_ba
    ht_union = ht_ba.union(ht_split, unify=True)
    ht_union = ht_union.annotate(
        a_index = hl.if_else(hl.is_missing(ht_union.a_index), 1, ht_union.a_index),
        was_split = hl.if_else(hl.is_missing(ht_union.was_split), False, ht_union.was_split),
        old_locus = hl.if_else(hl.is_missing(ht_union.old_locus), ht_union.locus, ht_union.old_locus),
        old_alleles = hl.if_else(hl.is_missing(ht_union.old_alleles), ht_union.alleles, ht_union.old_alleles)
    )
    ht_union = ht_union.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}_unioned.ht", 
                                   _read_if_exists=True, 
                                   overwrite=False)
    
    # Arrays for AFs start w/ ref allele in index 0, so just use a_index to get alternate AFs
    ht = ht_union.annotate(
        info = ht_union.info.annotate(
            FREQ = ht_union.info.FREQ.annotate(
                **{k: hl.or_missing(hl.is_defined(ht_union.info.FREQ[k]), 
                                    ht_union.info.FREQ[k][ht_union.a_index]) 
                   for k in known_keys_FREQ}
            )
        )
    )
    ht = ht.repartition(512, shuffle=True)
    ht = ht.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}.ht", 
                       _read_if_exists=True, 
                       overwrite=False)

    n_rows = ht.count()
    n_partitions = ht.n_partitions()

    ht = ht.annotate_globals(
        metadata=hl.struct(
            name=name,
            version=version,
            reference_genome=build,
            n_rows=n_rows,
            n_partitions=n_partitions
        )
    )
    ht.write(f"gs://hail-datasets-us/{name}_{version}_{build}.ht")
    ht = hl.read_table(f"gs://hail-datasets-us/{name}_{version}_{build}.ht")
    ht.describe()
    print(str(hl.eval(ht.metadata)) + "\n")
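
The `a_index` lookup used above can be sketched in plain Python. This is a simplified model under stated assumptions: `split_row` is a hypothetical helper, the frequencies are made-up numbers, and the real `hl.split_multi` may also trim alleles to a minimal representation and shift the locus, which this sketch ignores.

```python
def split_row(alleles, freq):
    """Return one biallelic row per ALT allele, each with its own frequency.

    alleles: [ref, alt1, alt2, ...]
    freq:    {source: [ref_af, alt1_af, alt2_af, ...] or None}
    """
    rows = []
    for a_index, alt in enumerate(alleles[1:], start=1):
        rows.append({
            "alleles": [alleles[0], alt],
            "a_index": a_index,
            "was_split": len(alleles) > 2,
            # Index 0 is the reference allele, so a_index picks the matching ALT.
            "FREQ": {k: (None if v is None else v[a_index]) for k, v in freq.items()},
        })
    return rows


rows = split_row(["A", "C", "T"], {"_GnomAD": [0.9, 0.08, 0.02], "_TOPMED": None})
for row in rows:
    print(row)
```

A biallelic input yields a single row with `a_index` 1 and `was_split` False, which matches the defaults filled in for the unsplit rows after the union.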

Also write tables with only the rsID field, for smaller tables that just map `[locus, alleles]` to `rsID`.

In [None]:
name = "dbSNP"
version = "154"
builds = ["GRCh37", "GRCh38"]

for build in builds:
    # Write table with only rsid's
    ht_rsid = hl.read_table(f"gs://hail-datasets-us/{name}_{version}_{build}.ht")
    ht_rsid = ht_rsid.select("rsid")

    n_rows = ht_rsid.count()
    n_partitions = ht_rsid.n_partitions()

    ht_rsid = ht_rsid.annotate_globals(
        metadata=hl.struct(
            name=f"{name}_rsid",
            version=version,
            reference_genome=build,
            n_rows=n_rows,
            n_partitions=n_partitions
        )
    )
    ht_rsid.write(f"gs://hail-datasets-us/{name}_rsid_{version}_{build}.ht")
    ht_rsid = hl.read_table(f"gs://hail-datasets-us/{name}_rsid_{version}_{build}.ht")
    ht_rsid.describe()
    print(str(hl.eval(ht_rsid.metadata)) + "\n")

In [None]:
# To check uniqueness of keys
tables = ["gs://hail-datasets-us/dbSNP_rsid_154_GRCh37.ht", "gs://hail-datasets-us/dbSNP_rsid_154_GRCh38.ht"]
for table in tables:
    ht = hl.read_table(table)
    
    ht_count = ht.count()
    print(f"n = {ht_count}")
    ht_distinct_count = ht.distinct().count()
    print(f"n_distinct = {ht_distinct_count}")
    
    if ht_count == ht_distinct_count:
        print(f"{table} rows unique\n")
    else:
        print(f"{table} rows NOT unique\n")
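
The check above relies on `ht.distinct()` keeping one row per key, so equal counts mean no key is duplicated. If the counts ever differ, a follow-up like the plain-Python sketch below shows the idea of listing the offending keys; the `keys` list and the `duplicated_keys` helper are hypothetical and operate on ordinary tuples rather than a Hail Table.

```python
from collections import Counter

# Hypothetical (locus, alleles) keys with one deliberate duplicate.
keys = [
    ("1:10177", ("A", "AC")),
    ("1:10352", ("T", "TA")),
    ("1:10177", ("A", "AC")),
]


def duplicated_keys(keys):
    """Return {key: count} for every key that occurs more than once."""
    return {k: n for k, n in Counter(keys).items() if n > 1}


dupes = duplicated_keys(keys)
print(f"n = {len(keys)}, n_distinct = {len(set(keys))}")
print(dupes)
```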

### Add dbSNP to datasets API and annotation DB

Now we can add the tables we created to `hail/python/hail/experimental/datasets.json`:

In [None]:
import os
import json

datasets_path = os.path.abspath("../../hail/python/hail/experimental/datasets.json")
with open(datasets_path, "r") as f:
    datasets = json.load(f)

names = ["dbSNP", "dbSNP_rsid"]
version = "154"
builds = ["GRCh37", "GRCh38"]

gcs_us_url_root = "gs://hail-datasets-us"
gcs_eu_url_root = "gs://hail-datasets-eu"
aws_us_url_root = "s3://hail-datasets-us-east-1"

for name in names:
    json_entry = {
        "annotation_db": {
            "key_properties": []
        },
        "description": "dbSNP: Reference SNP (rs or RefSNP) Hail Table. The database includes both common and rare single-base nucleotide variation (SNV), short (=< 50bp) deletion/insertion polymorphisms, and other classes of small genetic variations.",
        "url": "https://www.ncbi.nlm.nih.gov/snp/docs/RefSNP_about/",
        "versions": [
            {
                "reference_genome": builds[0],
                "url": {
                    "aws": {
                        "us": f"{aws_us_url_root}/{name}_{version}_{builds[0]}.ht"
                    },
                    "gcp": {
                        "eu": f"{gcs_eu_url_root}/{name}_{version}_{builds[0]}.ht",
                        "us": f"{gcs_us_url_root}/{name}_{version}_{builds[0]}.ht"
                    }
                },
                "version": version
            },
            {
                "reference_genome": builds[1],
                "url": {
                    "aws": {
                        "us": f"{aws_us_url_root}/{name}_{version}_{builds[1]}.ht"
                    },
                    "gcp": {
                        "eu": f"{gcs_eu_url_root}/{name}_{version}_{builds[1]}.ht",
                        "us": f"{gcs_us_url_root}/{name}_{version}_{builds[1]}.ht"
                    }
                },
                "version": version
            }            
        ]
    }
    
    if name == "dbSNP_rsid":
        json_entry["description"] = "dbSNP: This Hail Table contains a mapping from locus/allele pairs to Reference SNP IDs (rsID). For the full dataset, see dbSNP."
    
    datasets[name] = json_entry

# Write new entries back to datasets.json config:
with open(datasets_path, "w") as f:
    json.dump(datasets, f, sort_keys=True, ensure_ascii=False, indent=2)
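
The two near-identical entries in `"versions"` differ only in the build, so the same structure could be generated rather than written out twice. The sketch below is one possible refactoring, not what the notebook uses: `make_versions` and `URL_ROOTS` are hypothetical names, and the bucket roots are the ones defined in the cell above.

```python
# Bucket roots per cloud and region, copied from the cell above.
URL_ROOTS = {
    "aws": {"us": "s3://hail-datasets-us-east-1"},
    "gcp": {"eu": "gs://hail-datasets-eu", "us": "gs://hail-datasets-us"},
}


def make_versions(name, version, builds, url_roots=URL_ROOTS):
    """Build the 'versions' list for one dataset, with one entry per reference build."""
    return [
        {
            "reference_genome": build,
            "url": {
                cloud: {
                    region: f"{root}/{name}_{version}_{build}.ht"
                    for region, root in regions.items()
                }
                for cloud, regions in url_roots.items()
            },
            "version": version,
        }
        for build in builds
    ]


versions = make_versions("dbSNP_rsid", "154", ["GRCh37", "GRCh38"])
print(versions[0]["url"]["gcp"]["us"])
```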

After adding tables to `datasets.json`, create .rst schema files for docs:

In [None]:
# Create/update schema .rst file
import textwrap

output_dir = os.path.abspath("../../hail/python/hail/docs/datasets/schemas")
datasets_path = os.path.abspath("../../hail/python/hail/experimental/datasets.json")
with open(datasets_path, "r") as f:
    datasets = json.load(f)

names = ["dbSNP", "dbSNP_rsid"]
for name in names:
    versions = sorted(set(dataset["version"] for dataset in datasets[name]["versions"]))
    if not versions:
        versions = [None]
    reference_genomes = sorted(set(dataset["reference_genome"] for dataset in datasets[name]["versions"]))
    if not reference_genomes:
        reference_genomes = [None]

    print(name)
    print(versions[0])
    print(str(reference_genomes[0]) + "\n")  # str() so a None placeholder does not raise

    path = [dataset["url"]["gcp"]["us"]
            for dataset in datasets[name]["versions"]
            if all([dataset["version"] == versions[0],
                    dataset["reference_genome"] == reference_genomes[0]])]
    assert len(path) == 1
    path = path[0]
    if path.endswith(".ht"):
        table = hl.methods.read_table(path)
        table_class = "hail.Table"
    else:
        table = hl.methods.read_matrix_table(path)
        table_class = "hail.MatrixTable"

    description = table.describe(handler=lambda x: str(x)).split("\n")
    description = "\n".join([line.rstrip() for line in description])

    template = """.. _{dataset}:

{dataset}
{underline1}

*  **Versions:** {versions}
*  **Reference genome builds:** {ref_genomes}
*  **Type:** :class:`{class}`

Schema ({version0}, {ref_genome0})
{underline2}

.. code-block:: text

{schema}

"""
    context = {
        "dataset": name,
        "underline1": len(name) * "=",
        "version0": versions[0],
        "ref_genome0": reference_genomes[0],
        "versions": ", ".join([str(version) for version in versions]),
        "ref_genomes": ", ".join([str(reference_genome) for reference_genome in reference_genomes]),
        "underline2": len("".join(["Schema (", str(versions[0]), ", ", str(reference_genomes[0]), ")"])) * "~",
        "schema": textwrap.indent(description, "    "),
        "class": table_class
    }
    with open(output_dir + f"/{name}.rst", "w") as f:
        f.write(template.format(**context).strip())
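
To preview what a generated schema page looks like without reading a table, the template logic can be exercised on a placeholder description. This is a cut-down sketch: `render` is a hypothetical helper, the template omits the bullet list of versions and builds, and the schema text is a stand-in rather than real `describe()` output.

```python
import textwrap

TEMPLATE = """.. _{dataset}:

{dataset}
{underline1}

Schema ({version0}, {ref_genome0})
{underline2}

.. code-block:: text

{schema}
"""


def render(dataset, version, ref_genome, description):
    """Fill the template, sizing each underline to the title above it."""
    schema_title = f"Schema ({version}, {ref_genome})"
    return TEMPLATE.format(
        dataset=dataset,
        underline1="=" * len(dataset),
        version0=version,
        ref_genome0=ref_genome,
        underline2="~" * len(schema_title),
        schema=textwrap.indent(description, "    "),
    ).strip()


# Placeholder schema text, not actual describe() output.
page = render("dbSNP_rsid", "154", "GRCh37", "Global fields:\n    'metadata': struct")
print(page)
```

reStructuredText expects a title underline to be at least as long as the title text, which is why both underlines are computed from the strings they sit under.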