## dbSNP

Use this notebook to create Hail Tables for dbSNP after downloading the raw data from https://ftp.ncbi.nih.gov/snp/.

The raw data was downloaded with Hail Batch; see `hail/datasets/extract/extract_dbSNP.py`.

In [None]:
import hail as hl
hl.init(spark_conf={'spark.hadoop.fs.gs.requester.pays.mode': 'AUTO', 
                    'spark.hadoop.fs.gs.requester.pays.project.id': 'broad-ctsa'})

### Create Hail Tables from GRCh37 and GRCh38 assembly reports

The contigs in the dbSNP VCFs are named by [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) accession numbers and need to be mapped back to the appropriate chromosome for each reference genome.

The GRCh37 assembly can be found [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25), and the assembly report mapping chromosomes to RefSeq sequences can be found [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt).

The GRCh38 assembly can be found [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39), and the assembly report mapping chromosomes to RefSeq sequences can be found [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt).
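
To make the mapping concrete, here is a minimal sketch in plain Python (no Hail) of how a tab-delimited assembly report can be turned into a RefSeq-accession-to-chromosome dictionary. The column names match the `field_names` list used in the cells below. The helper `refseq_to_molecule`, the `valid_contigs` argument, and the sample rows (in particular the `EXAMPLE_SCAFFOLD` row) are assumptions made for illustration; check real rows against the downloaded report.

```python
import csv
import io

# Column names used in this notebook for the 10 tab-separated report fields.
FIELD_NAMES = ["sequence_name", "sequence_role", "assigned_molecule",
               "assigned_molecule_location/type", "genbank_accn", "relationship",
               "refseq_accn", "assembly_unit", "sequence_length", "ucsc_style_name"]

# Illustrative rows written in the report's column layout (not the real file).
# The last row is hypothetical and shows a sequence with no RefSeq accession ("na").
SAMPLE_REPORT = (
    "# Assembly name:  GRCh37.p13\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.1\t=\tNC_000001.10\tPrimary Assembly\t249250621\tchr1\n"
    "X\tassembled-molecule\tX\tChromosome\tCM000685.1\t=\tNC_000023.10\tPrimary Assembly\t155270560\tchrX\n"
    "EXAMPLE_SCAFFOLD\tunplaced-scaffold\tna\tna\tEX000000.1\t<>\tna\tPrimary Assembly\t1000\tna\n"
)


def refseq_to_molecule(report_text, valid_contigs):
    """Map RefSeq accession -> assigned molecule, keeping only valid contigs."""
    mapping = {}
    # Skip comment lines, mirroring comment="#" in hl.import_table.
    lines = (line for line in io.StringIO(report_text) if not line.startswith("#"))
    for row in csv.reader(lines, delimiter="\t"):
        record = dict(zip(FIELD_NAMES, row))
        accn = record["refseq_accn"]
        molecule = record["assigned_molecule"]
        # "na" marks a missing value in the report.
        if accn != "na" and molecule in valid_contigs:
            mapping[accn] = molecule
    return mapping


mapping = refseq_to_molecule(SAMPLE_REPORT, valid_contigs={"1", "X"})
print(mapping)
```

The notebook itself does this step with Hail Tables, which also persists the parsed report for reuse; the sketch only shows the shape of the lookup.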

#### GRCh37

In [None]:
ht = hl.import_table("gs://hail-datasets-tmp/dbSNP/GCF_000001405.25_GRCh37.p13_assembly_report.txt", 
                     no_header=True, 
                     comment="#",
                     delimiter="\t", 
                     missing="na")

field_names = ['sequence_name','sequence_role','assigned_molecule',
               'assigned_molecule_location/type', 'genbank_accn', 'relationship', 
               'refseq_accn', 'assembly_unit', 'sequence_length', 'ucsc_style_name']

name = "dbSNP"
version = "154"
build = "GRCh37"
n_rows = ht.count()
n_partitions = ht.n_partitions()

ht = ht.annotate_globals(
    metadata=hl.struct(
        name=name,
        version=version,
        reference_genome=build,
        n_rows=n_rows,
        n_partitions=n_partitions
    )
)
ht = ht.rename(dict(zip([f"f{i}" for i in range(10)], field_names)))
ht = ht.drop("relationship").key_by("refseq_accn")

ht.write("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh37.ht")
ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh37.ht")
ht.describe()

#### GRCh38

In [None]:
ht = hl.import_table("gs://hail-datasets-tmp/dbSNP/GCF_000001405.39_GRCh38.p13_assembly_report.txt", 
                     no_header=True, 
                     comment="#",
                     delimiter="\t", 
                     missing="na")

field_names = ['sequence_name','sequence_role','assigned_molecule',
               'assigned_molecule_location/type', 'genbank_accn', 'relationship', 
               'refseq_accn', 'assembly_unit', 'sequence_length', 'ucsc_style_name']

name = "dbSNP"
version = "154"
build = "GRCh38"
n_rows = ht.count()
n_partitions = ht.n_partitions()

ht = ht.annotate_globals(
    metadata=hl.struct(
        name=name,
        version=version,
        reference_genome=build,
        n_rows=n_rows,
        n_partitions=n_partitions
    )
)
ht = ht.rename(dict(zip([f"f{i}" for i in range(10)], field_names)))
ht = ht.drop("relationship").key_by("refseq_accn")

ht.write("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh38.ht")
ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh38.ht")
ht.describe()

### Create Hail Tables for dbSNP

Now we can use the assembly report for each reference genome build to map from RefSeq accession numbers to chromosomes, and create Hail Tables. There are no samples or entries in the dbSNP VCFs. 
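
The effect of recoding contigs and then dropping loci that do not fit the reference genome can be pictured with a small plain-Python sketch. This is only an illustration of the idea, not Hail's implementation: in the notebook the work is done by the `contig_recoding` and `skip_invalid_loci` arguments of `hl.import_vcf`, whose validity check may cover more than the contig name. The function `recode_and_filter` and the records below are hypothetical.

```python
def recode_and_filter(records, contig_recoding, valid_contigs):
    """Rename contigs via the mapping, then keep only records on valid contigs."""
    kept = []
    for contig, position, rsid in records:
        # Fall back to the original name when no recoding is defined.
        contig = contig_recoding.get(contig, contig)
        if contig in valid_contigs:
            kept.append((contig, position, rsid))
    return kept


# Hypothetical sites-only records: (contig, position, rsID placeholder).
records = [
    ("NC_000001.10", 10019, "rs_example_1"),
    ("NC_000023.10", 60020, "rs_example_2"),
    ("NW_EXAMPLE.1", 5, "rs_example_3"),  # no recoding, so it is dropped
]
recoding = {"NC_000001.10": "1", "NC_000023.10": "X"}

result = recode_and_filter(records, recoding, valid_contigs={"1", "X"})
print(result)
```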

We will create two Hail Tables for each reference genome build, both keyed by `["locus", "alleles"]`:

  - Table with all fields from the imported VCF (e.g. `gs://hail-datasets-us/dbSNP_154_GRCh37.ht`)
  - Table with only the rsID field (e.g. `gs://hail-datasets-us/dbSNP_rsids_154_GRCh37.ht`)
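
The two output paths follow one naming pattern, `{name}_{version}_{build}.ht` under the bucket root. A minimal sketch of that pattern, assuming a hypothetical helper `dbsnp_table_paths` that is not part of the notebook:

```python
def dbsnp_table_paths(version, build, root="gs://hail-datasets-us"):
    """Return the full-table and rsID-only table paths for one build."""
    return {
        "dbSNP": f"{root}/dbSNP_{version}_{build}.ht",
        "dbSNP_rsids": f"{root}/dbSNP_rsids_{version}_{build}.ht",
    }


paths = dbsnp_table_paths("154", "GRCh37")
print(paths["dbSNP"])
print(paths["dbSNP_rsids"])
```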

In [None]:
# Load NCBI assembly reports with RefSeq mappings
assembly37_ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh37.ht")
assembly38_ht = hl.read_table("gs://hail-datasets-us/NCBI_assembly_report_p13_GRCh38.ht")

# Map RefSeq identifiers to chromosomes for GRCh37 using the "assigned_molecule" field in assembly report
rg37 = hl.get_reference("GRCh37")
refseq_to_chr37 = dict(zip(assembly37_ht.refseq_accn.collect(), assembly37_ht.assigned_molecule.collect()))
refseq_to_chr37 = {k: v for k, v in refseq_to_chr37.items() if v in rg37.contigs}

# Map RefSeq identifiers to chromosomes for GRCh38 using the "ucsc_style_name" field in assembly report
rg38 = hl.get_reference("GRCh38")
refseq_to_chr38 = dict(zip(assembly38_ht.refseq_accn.collect(), assembly38_ht.ucsc_style_name.collect()))
refseq_to_chr38 = {k: v.split("_")[0] for k, v in refseq_to_chr38.items() if v in rg38.contigs and "chrUn" not in v}
refseq_to_chr38.pop(None, None)

recodings = {"GRCh37": refseq_to_chr37, 
             "GRCh38": refseq_to_chr38}

# For use in filenames/metadata
name = "dbSNP"
version = "154"

for build in ["GRCh37", "GRCh38"]:
    mt = hl.import_vcf(f"gs://hail-datasets-tmp/{name}/{name}_{version}_{build}.vcf.bgz", 
                   contig_recoding=recodings[build],
                   skip_invalid_loci=True,
                   reference_genome=build)
    
    # No samples or entries, just grab table with the rows and write the full table
    ht = mt.rows()
    ht = ht.repartition(512, shuffle=True)
    ht = ht.checkpoint(f"gs://hail-datasets-tmp/checkpoints/{name}_{version}_{build}_repartitioned.ht", 
                       _read_if_exists=True,
                       overwrite=False)

    n_rows = ht.count()
    n_partitions = ht.n_partitions()

    ht = ht.annotate_globals(
        metadata=hl.struct(
            name=name,
            version=version,
            reference_genome=build,
            n_rows=n_rows,
            n_partitions=n_partitions
        )
    )
    ht.write(f"gs://hail-datasets-us/{name}_{version}_{build}.ht")
    ht = hl.read_table(f"gs://hail-datasets-us/{name}_{version}_{build}.ht")
    ht.describe()
    
    # Write a second table containing only the rsID field
    ht_rsid = hl.read_table(f"gs://hail-datasets-us/{name}_{version}_{build}.ht")
    ht_rsid = ht_rsid.select("rsid")

    n_rows = ht_rsid.count()
    n_partitions = ht_rsid.n_partitions()

    ht_rsid = ht_rsid.annotate_globals(
        metadata=hl.struct(
            name=f"{name}_rsids",
            version=version,
            reference_genome=build,
            n_rows=n_rows,
            n_partitions=n_partitions
        )
    )
    ht_rsid.write(f"gs://hail-datasets-us/{name}_rsids_{version}_{build}.ht")
    ht_rsid = hl.read_table(f"gs://hail-datasets-us/{name}_rsids_{version}_{build}.ht")
    ht_rsid.describe()

### Add dbSNP to datasets API and annotation DB

Now we can add the tables we created to `hail/python/hail/experimental/datasets.json` and create schemas for the docs.
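
Before writing to `datasets.json`, it can help to sanity-check the shape of each new entry. The sketch below is a hypothetical helper, `check_dataset_entry`, whose list of required keys is inferred from the entries built in this notebook rather than from any official schema.

```python
def check_dataset_entry(entry,
                        clouds_regions=(("aws", "us"), ("gcp", "eu"), ("gcp", "us"))):
    """Return a list of problems found in one datasets.json-style entry."""
    problems = []
    # Top-level keys assumed from the entries constructed in this notebook.
    for key in ("annotation_db", "description", "url", "versions"):
        if key not in entry:
            problems.append(f"missing top-level key: {key}")
    for i, item in enumerate(entry.get("versions", [])):
        for key in ("reference_genome", "url", "version"):
            if key not in item:
                problems.append(f"versions[{i}]: missing key: {key}")
        for cloud, region in clouds_regions:
            path = item.get("url", {}).get(cloud, {}).get(region)
            if not path or not path.endswith(".ht"):
                problems.append(f"versions[{i}]: missing or non-.ht path for {cloud}/{region}")
    return problems


entry = {
    "annotation_db": {"key_properties": []},
    "description": "example entry",
    "url": "https://www.ncbi.nlm.nih.gov/snp/docs/RefSNP_about/",
    "versions": [
        {
            "reference_genome": "GRCh37",
            "url": {
                "aws": {"us": "s3://hail-datasets-us-east-1/dbSNP_154_GRCh37.ht"},
                "gcp": {"eu": "gs://hail-datasets-eu/dbSNP_154_GRCh37.ht",
                        "us": "gs://hail-datasets-us/dbSNP_154_GRCh37.ht"},
            },
            "version": "154",
        }
    ],
}

problems = check_dataset_entry(entry)
print(problems)
```

An empty list means the entry has every key and path the check looks for; it does not confirm that the tables exist at those paths.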

In [None]:
import os
import json

output_dir = os.path.abspath("../../hail/python/hail/docs/datasets/schemas")
datasets_path = os.path.abspath("../../hail/python/hail/experimental/datasets.json")
with open(datasets_path, "r") as f:
    datasets = json.load(f)

names = ["dbSNP", "dbSNP_rsids"]
version = "154"
builds = ["GRCh37", "GRCh38"]

gcs_us_url_root = "gs://hail-datasets-us"
gcs_eu_url_root = "gs://hail-datasets-eu"
aws_us_url_root = "s3://hail-datasets-us-east-1"

for name in names:
    json_entry = {
        "annotation_db": {
            "key_properties": []
        },
        "description": "dbSNP: Reference SNP (rs or RefSNP) Hail Table. The database includes both common and rare single-base nucleotide variation (SNV), short (=< 50bp) deletion/insertion polymorphisms, and other classes of small genetic variations.",
        "url": "https://www.ncbi.nlm.nih.gov/snp/docs/RefSNP_about/",
        "versions": [
            {
                "reference_genome": builds[0],
                "url": {
                    "aws": {
                        "us": f"{aws_us_url_root}/{name}_{version}_{builds[0]}.ht"
                    },
                    "gcp": {
                        "eu": f"{gcs_eu_url_root}/{name}_{version}_{builds[0]}.ht",
                        "us": f"{gcs_us_url_root}/{name}_{version}_{builds[0]}.ht"
                    }
                },
                "version": version
            },
            {
                "reference_genome": builds[1],
                "url": {
                    "aws": {
                        "us": f"{aws_us_url_root}/{name}_{version}_{builds[1]}.ht"
                    },
                    "gcp": {
                        "eu": f"{gcs_eu_url_root}/{name}_{version}_{builds[1]}.ht",
                        "us": f"{gcs_us_url_root}/{name}_{version}_{builds[1]}.ht"
                    }
                },
                "version": version
            }            
        ]
    }
    
    if name == "dbSNP_rsids":
        json_entry["description"] = "dbSNP: This Hail Table only contains a mapping from Reference SNP IDs (rsID) to locus and alleles. For full dataset, see dbSNP."
    
    datasets[name] = json_entry

# Write new entries back to datasets.json config:
with open(datasets_path, "w") as f:
    json.dump(datasets, f, sort_keys=True, ensure_ascii=False, indent=2)

In [None]:
# {k: v for k, v in datasets.items() if "dbSNP" in k}

In [None]:
# Verify we can load tables
datasets_path = os.path.abspath("../../hail/python/hail/experimental/datasets.json")
with open(datasets_path, "r") as f:
    datasets = json.load(f)

# regions = [("aws", "us"), ("gcp", "us"), ("gcp", "eu")]
regions = [("gcp", "us"), ("gcp", "eu")]
for version in datasets["dbSNP"]["versions"]:
    v = version["version"]
    if v == "154":
        rg = version["reference_genome"]
        print(f"v{v}_{rg}")
        for r in regions:
            cloud, region = r
            print(f"{cloud}_{region}")
            url = version["url"][cloud][region]
            print(url)
            ht = hl.read_table(url)
            ht.describe()
            print("\n")
            
for version in datasets["dbSNP_rsids"]["versions"]:
    v = version["version"]
    if v == "154":
        rg = version["reference_genome"]
        print(f"v{v}_{rg}")
        for r in regions:
            cloud, region = r
            print(f"{cloud}_{region}")
            url = version["url"][cloud][region]
            print(url)
            ht = hl.read_table(url)
            ht.describe()
            print("\n")

In [None]:
# Create/update schema .rst file
import textwrap

output_dir = os.path.abspath("../../hail/python/hail/docs/datasets/schemas")
datasets_path = os.path.abspath("../../hail/python/hail/experimental/datasets.json")
with open(datasets_path, "r") as f:
    datasets = json.load(f)

names = ["dbSNP", "dbSNP_rsids"]
for name in names:
    versions = sorted(set(dataset["version"] for dataset in datasets[name]["versions"]))
    if not versions:
        versions = [None]
    reference_genomes = sorted(set(dataset["reference_genome"] for dataset in datasets[name]["versions"]))
    if not reference_genomes:
        reference_genomes = [None]

    print(name)
    print(versions[0])
    print(f"{reference_genomes[0]}\n")

    path = [dataset["url"]["gcp"]["us"]
            for dataset in datasets[name]["versions"]
            if all([dataset["version"] == versions[0],
                    dataset["reference_genome"] == reference_genomes[0]])]
    assert len(path) == 1
    path = path[0]
    if path.endswith(".ht"):
        table = hl.read_table(path)
        table_class = "hail.Table"
    else:
        table = hl.read_matrix_table(path)
        table_class = "hail.MatrixTable"

    description = table.describe(handler=lambda x: str(x)).split("\n")
    description = "\n".join([line.rstrip() for line in description])

    template = """.. _{dataset}:

{dataset}
{underline1}

*  **Versions:** {versions}
*  **Reference genome builds:** {ref_genomes}
*  **Type:** :class:`{class}`

Schema ({version0}, {ref_genome0})
{underline2}

.. code-block:: text

{schema}

"""
    context = {
        "dataset": name,
        "underline1": len(name) * "=",
        "version0": versions[0],
        "ref_genome0": reference_genomes[0],
        "versions": ", ".join([str(version) for version in versions]),
        "ref_genomes": ", ".join([str(reference_genome) for reference_genome in reference_genomes]),
        "underline2": len("".join(["Schema (", str(versions[0]), ", ", str(reference_genomes[0]), ")"])) * "~",
        "schema": textwrap.indent(description, "    "),
        "class": table_class
    }
    with open(os.path.join(output_dir, f"{name}.rst"), "w") as f:
        f.write(template.format(**context).strip())