<H1>SCN2A: Generate Transcript HGVS</H1>
<p>In this notebook, we generate phenopackets with transcript-level HGVS expressions for the data listed in 
  <a href="https://pubmed.ncbi.nlm.nih.gov/33731876/" target="__blank">Crawford et al., Computational analysis of 10,860 phenotypic annotations in individuals with SCN2A-related disorders</a>. In the Supplementary Table of this publication, variants are given in protein notation, but our software requires transcript-level HGVS notation. We used the <a href="http://useast.ensembl.org/Homo_sapiens/Tools/VR" target="__blank">Variant Recoder</a> tool of Ensembl to extract the corresponding data; the resulting mapping file is read below and used to generate the phenopackets.</p>
 <p>Note that the variant mapping file must first be generated with the notebook <tt>generate_hgvs</tt>.</p>
  

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
import math
from csv import DictReader
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import re
from IPython.display import display, HTML, JSON
from pyphetools.creation import *
from pyphetools.output import PhenopacketTable
# last tested with pyphetools version 0.3.1

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)

In [3]:
df = pd.read_csv("input/41436_2021_1120_MOESM2_ESM.csv", delimiter='\t')
df.head()

Unnamed: 0,famID,broad_phx,variant,variant_type_1,variant_type_2,location,domain,segment,pmid/pmcid,HPO,hpo_term
0,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,HP:0007359,Focal-onset seizure
1,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,HP:0002069,Generalized tonic-clonic seizures
2,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,NP:0003808,No Abnormal muscle tone
3,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,NP:0012443,No Abnormality of brain morphology
4,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,NP:0000717,No Autism


In [4]:
var_df = pd.read_csv("variant_mapping.tsv")

In [5]:
var_df.head()

Unnamed: 0,famID,var,MANE
0,fam1,A202V,c.605C>T
1,fam10,V423L,c.1267G>T
2,fam100,L501*,c.1501_1502delinsTA
3,fam101,N503K*19,c.1508dup
4,fam102,R583*,c.1747C>T


In [6]:
var_dict = defaultdict(list)
for _, row in var_df.iterrows():
    famID = row["famID"]
    var = row["var"]
    mane = row["MANE"]
    var_dict[famID] = [var, mane]

In [7]:
def row_to_hpo(hpo_id, hpo_label):
    """Transform a row of the dataframe to an HPO term
    """
    # excluded terms are coded with NP:0001234 instead of HP:0001234
    if hpo_id.startswith("NP"):
        hpo_id = "H" + hpo_id[1:]
        return HpTerm(hpo_id=hpo_id, label=hpo_label, observed=False) 
    else:
        return HpTerm(hpo_id=hpo_id, label=hpo_label)            
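
<p>The prefix convention handled by <tt>row_to_hpo</tt> can be illustrated without pyphetools. The following is a minimal sketch; the helper name <tt>split_observation</tt> is hypothetical and it returns a plain tuple instead of an <tt>HpTerm</tt>.</p>

```python
def split_observation(term_id):
    """Return (hpo_id, observed) for an identifier from the supplementary table.

    Assumption: an 'NP:' prefix marks an excluded (not observed) term,
    as stated in the comment inside row_to_hpo.
    """
    if term_id.startswith("NP:"):
        return "HP:" + term_id[3:], False
    return term_id, True


print(split_observation("NP:0000717"))  # excluded term, rewritten to the HP prefix
print(split_observation("HP:0007359"))  # observed term, returned unchanged
```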

In [8]:
patient_d = defaultdict(list)   # famID -> list of HpTerm objects
patient_pheno_d = {}            # famID -> phenotype group ("broad_phx")
skipped = 0                     # counts skipped table rows (one row per HPO annotation), not unique patients
for _, row in df.iterrows():
    patID = row["famID"]
    if patID not in var_dict:
        # no transcript-level variant is available for this family
        skipped += 1
        continue
    patient_pheno_d[patID] = row["broad_phx"]
    hpoid = str(row["HPO"])
    hpo_label = str(row["hpo_term"])
    if hpoid == "nan":
        continue  # A few rows are missing data
    hpo = row_to_hpo(hpo_id=hpoid, hpo_label=hpo_label)
    patient_d[patID].append(hpo)
print(f"We got {len(patient_d)} patients and skipped {skipped} patients because we could not decode variant")

We got 395 patients and skipped 146 patients because we could not decode variant
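
<p>Because the table has one row per HPO annotation, the counter <tt>skipped</tt> in the loop above is incremented once per row, so the reported number is a row count rather than a count of distinct families. The sketch below shows the difference on toy data; the family IDs <tt>famX</tt> and <tt>famY</tt> are hypothetical.</p>

```python
# Toy rows standing in for the long-format table (one row per HPO annotation).
rows = [
    {"famID": "fam1", "HPO": "HP:0007359"},
    {"famID": "fam1", "HPO": "HP:0002069"},
    {"famID": "famX", "HPO": "HP:0001250"},  # hypothetical family without a mapped variant
    {"famID": "famX", "HPO": "HP:0001249"},
    {"famID": "famY", "HPO": "HP:0000717"},  # hypothetical family without a mapped variant
]
mapped = {"fam1"}  # stand-in for the keys of var_dict

skipped_rows = 0
skipped_families = set()
for row in rows:
    if row["famID"] not in mapped:
        skipped_rows += 1
        skipped_families.add(row["famID"])

print(f"{skipped_rows} rows skipped, {len(skipped_families)} distinct families")
```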


<H2>Disease/Phenotype groups</H2>
<p>The authors of the SCN2A paper divide the probands into five groups, which we can use for clustering.</p>

In [9]:
phenotypes = {v for v in patient_pheno_d.values() }
for ph in  phenotypes:
    print(ph)

atypical
benign
encephalopathy
epilepsy
ASD


<h3>Get variant objects from VariantValidator</h3>

In [10]:
validator = VariantValidator(genome_build='hg38')
validated_var_d = {}  # full HGVS string -> validated variant object
transcript = "NM_001040142.2"
for patid, var_array in var_dict.items():
    mane_var = var_array[1]
    total_hgvs = f"{transcript}:{mane_var}"
    if total_hgvs in validated_var_d:
        continue  # validate each distinct variant only once
    print(f"{patid}:{total_hgvs}")
    try:
        v = validator.encode_hgvs(hgvs=mane_var, custom_transcript=transcript)
        print(v)
        validated_var_d[total_hgvs] = v
    except Exception as e:
        # report the failure and move on; the variant is then absent from validated_var_d
        print(e)
print(f"We got a total of {len(validated_var_d)} validated variants")

fam1:NM_001040142.2:c.605C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.605C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165308794C>T
fam10:NM_001040142.2:c.1267G>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1267G>T/NM_001040142.2?content-type=application%2Fjson
chr2:165313992G>T
fam100:NM_001040142.2:c.1501_1502delinsTA
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1501_1502delinsTA/NM_001040142.2?content-type=application%2Fjson
chr2:165315588CT>TA
fam101:NM_001040142.2:c.1508dup
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1508dup/NM_001040142.2?content-type=application%2Fjson
chr2:165315590G>GA
fam102:NM_001040142.2:c.1747C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1747C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165323231C

chr2:165374719C>A
fam155:NM_001040142.2:c.4013T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4013T>C/NM_001040142.2?content-type=application%2Fjson
chr2:165374725T>C
fam156:NM_001040142.2:c.4025T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4025T>C/NM_001040142.2?content-type=application%2Fjson
chr2:165374737T>C
fam16:NM_001040142.2:c.2644G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2644G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165344636G>A
fam160:NM_001040142.2:c.4193G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4193G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165374905G>A
fam161:NM_001040142.2:c.4259C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4259C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165377601C>

chr2:165296068A>G
fam204:NM_001040142.2:c.1136G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1136G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165313721G>A
fam206:NM_001040142.2:c.2056_2058delinsTGA
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2056_2058delinsTGA/NM_001040142.2?content-type=application%2Fjson
chr2:165326891AGT>TGA
fam207:NM_001040142.2:c.2810G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2810G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165344802G>A
fam209:NM_001040142.2:c.2549G>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2549G>C/NM_001040142.2?content-type=application%2Fjson
chr2:165342456G>C
fam21:NM_001040142.2:c.2567G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2567G>A/NM_001040142.2?content-type=applicati

chr2:165388851T>C
fam265:NM_001040142.2:c.4966T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4966T>C/NM_001040142.2?content-type=application%2Fjson
chr2:165388772T>C
fam266:NM_001040142.2:c.1705C>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1705C>G/NM_001040142.2?content-type=application%2Fjson
chr2:165323189C>G
fam269:NM_001040142.2:c.4044G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4044G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165374756G>A
fam27:NM_001040142.2:c.1270G>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1270G>C/NM_001040142.2?content-type=application%2Fjson
chr2:165313995G>C
fam270:NM_001040142.2:c.5317G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.5317G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165389123G>

chr2:165309386T>C
fam334:NM_001040142.2:c.3955C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.3955C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165373330C>T
fam336:NM_001040142.2:c.3850-2A>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.3850-2A>C/NM_001040142.2?content-type=application%2Fjson
chr2:165373223A>C
fam337:NM_001040142.2:c.3956G>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.3956G>C/NM_001040142.2?content-type=application%2Fjson
chr2:165373331G>C
fam340:NM_001040142.2:c.562C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.562C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165308751C>T
fam341:NM_001040142.2:c.5408A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.5408A>G/NM_001040142.2?content-type=application%2Fjson
chr2:16538921

chr2:165344580C>T
fam40:NM_001040142.2:c.2197G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2197G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165331377G>A
fam400:NM_001040142.2:c.4877G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.4877G>A/NM_001040142.2?content-type=application%2Fjson
chr2:165388683G>A
fam401:NM_001040142.2:c.2642T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.2642T>C/NM_001040142.2?content-type=application%2Fjson
chr2:165344634T>C
fam402:NM_001040142.2:c.5294T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.5294T>C/NM_001040142.2?content-type=application%2Fjson
chr2:165389100T>C
fam404:NM_001040142.2:c.5798A>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.5798A>T/NM_001040142.2?content-type=application%2Fjson
chr2:165389604A>

chr2:165309384T>A
fam84:NM_001040142.2:c.653C>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.653C>A/NM_001040142.2?content-type=application%2Fjson
chr2:165309399C>A
fam85:NM_001040142.2:c.658A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.658A>G/NM_001040142.2?content-type=application%2Fjson
chr2:165309404A>G
fam87:NM_001040142.2:c.707C>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.707C>G/NM_001040142.2?content-type=application%2Fjson
chr2:165310332C>G
fam88:NM_001040142.2:c.718G>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.718G>T/NM_001040142.2?content-type=application%2Fjson
chr2:165310343G>T
fam89:NM_001040142.2:c.754A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.754A>G/NM_001040142.2?content-type=application%2Fjson
chr2:165310379A>G
fam96:NM_001
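
<p>The request URLs in the log above follow a regular pattern. The sketch below reproduces that pattern with the standard library only. The function name <tt>variantvalidator_url</tt> is hypothetical, no request is sent, and this is not necessarily how pyphetools builds the URL internally.</p>

```python
from urllib.parse import quote


def variantvalidator_url(hgvs_c, transcript, build="hg38"):
    """Build a URL in the format seen in the log output (assumed from the log, not from the API docs)."""
    base = "https://rest.variantvalidator.org/VariantValidator/variantvalidator"
    # Encode ':' as %3A but keep '>' unescaped, matching the logged URLs.
    variant = quote(f"{transcript}:{hgvs_c}", safe=">")
    return f"{base}/{build}/{variant}/{transcript}?content-type=application%2Fjson"


url = variantvalidator_url("c.605C>T", "NM_001040142.2")
print(url)
```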

In [11]:
individual_list = []

for patID, var in var_dict.items():
    mane = var[1]
    total_hgvs = f"{transcript}:{mane}"
    v = validated_var_d.get(total_hgvs)
    if v is None:
        # validation of this variant failed in the previous cell, so skip the individual
        print(f"Could not find {total_hgvs}")
        continue
    v.set_heterozygous()
    pheno = patient_pheno_d.get(patID)
    pheno_id = f"CUSTOM:{pheno}"
    hpo_list = patient_d.get(patID)
    ind = Individual(individual_id=patID, hpo_terms=hpo_list, variant_list=[v],
                     disease_id=pheno_id, disease_label=pheno)
    individual_list.append(ind)
    
print(f"Created {len(individual_list)} individual objects")

Could not find NM_001040142.2:c.2379+1G>A
Could not find NM_001040142.2:c.698‐1G>T
Created 394 individual objects
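
<p>The second unresolved variant appears to contain a non-ASCII hyphen character between <tt>698</tt> and <tt>1</tt>, which may be why it could not be validated (an assumption; the first unresolved variant is plain ASCII and must have failed for another reason). A minimal sketch for detecting and normalizing such characters before validation follows; both helper names are hypothetical.</p>

```python
import unicodedata

# Unicode dash-like characters that are sometimes pasted into HGVS strings.
DASHES = {"\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"}


def normalize_hgvs(hgvs):
    """Replace dash-like Unicode characters with an ASCII hyphen and trim whitespace."""
    return "".join("-" if ch in DASHES else ch for ch in hgvs).strip()


def non_ascii_chars(hgvs):
    """List (character, Unicode name) pairs for every non-ASCII character."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in hgvs if ord(ch) > 127]


raw = "c.698\u20101G>T"  # U+2010 HYPHEN instead of '-'
print(non_ascii_chars(raw))
print(normalize_hgvs(raw))
```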


In [12]:
i1 = individual_list[10]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "fam109",
  "subject": {
    "id": "fam109"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0001272",
        "label": "Cerebellar atrophy"
      }
    },
    {
      "type": {
        "id": "HP:0002072",
        "label": "Chorea"
      }
    },
    {
      "type": {
        "id": "HP:0011193",
        "label": "EEG with focal spikes"
      }
    },
    {
      "type": {
        "id": "HP:0006892",
        "label": "Frontotemporal cerebral atrophy"
      }
    },
    {
      "type": {
        "id": "HP:0010818",
        "label": "Generalized tonic seizures"
      }
    },
    {
      "type": {
        "id": "HP:0002069",
        "label": "Generalized tonic-clonic seizures"
      }
    },
    {
      "type": {
        "id": "HP:0002079",
        "label": "Hypoplasia of the corpus callosum"
      }
    },
    {
      "type": {
        "id": "HP:0002521",
        "label": "Hypsarrhythmia"
      }
    },
    {
      "type": {
        "id": "HP:0012469",
   

In [13]:
ppacket_list = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individual_list]
table = PhenopacketTable(phenopacket_list=ppacket_list)
display(HTML(table.to_html()))

Individual,Genotype,Phenotypic features
fam1 (UNKNOWN; ),NM_001040142.2:c.605C>T (heterozygous),Focal-onset seizure (HP:0007359); Generalized tonic-clonic seizures (HP:0002069); Seizures (HP:0001250); Neonatal onset (HP:0003623)
fam10 (UNKNOWN; ),NM_001040142.2:c.1267G>T (heterozygous),"EEG with burst suppression (HP:0010851); Epileptic encephalopathy (HP:0200134); Focal autonomic seizure (HP:0011154); Generalized myoclonic seizures (HP:0002123); Generalized tonic seizures (HP:0010818); Intellectual disability, severe (HP:0010864); Microcephaly (HP:0000252); Multifocal epileptiform discharges (HP:0010841); Muscular hypotonia (HP:0001252); Refractory (HP:0031375); Neonatal onset (HP:0003623)"
fam100 (UNKNOWN; ),NM_001040142.2:c.1501_1502delinsTA (heterozygous),Intellectual disability (HP:0001249)
fam101 (UNKNOWN; ),NM_001040142.2:c.1508dup (heterozygous),"Absent speech (HP:0001344); Intellectual disability, severe (HP:0010864); Self-injurious behavior (HP:0100716)"
fam102 (UNKNOWN; ),NM_001040142.2:c.1747C>T (heterozygous),"Attention deficit hyperactivity disorder (HP:0007018); Autism (HP:0000717); Intellectual disability, moderate (HP:0002342)"
fam103 (UNKNOWN; ),NM_001040142.2:c.1825_1827delinsTGA (heterozygous),Intellectual disability (HP:0001249)
fam104 (UNKNOWN; ),NM_001040142.2:c.1831_1832del (heterozygous),"Autism (HP:0000717); Hypermetropia (HP:0000540); Intellectual disability, severe (HP:0010864); Inverted nipples (HP:0003186); Narrow palate (HP:0000189); Neurological speech impairment (HP:0002167); Pineal cyst (HP:0012683); Self-injurious behavior (HP:0100716); Sleep disturbance (HP:0002360); Tapered finger (HP:0001182)"
fam105 (UNKNOWN; ),NM_001040142.2:c.1945G>A (heterozygous),"Febrile seizures (HP:0002373); Focal seizures, afebril (HP:0040168); Seizures (HP:0001250)"
fam106 (UNKNOWN; ),NM_001040142.2:c.2021C>A (heterozygous),Autism (HP:0000717)
fam108 (UNKNOWN; ),NM_001040142.2:c.2558G>A (heterozygous),"Chorea (HP:0002072); Epileptic spasms (HP:0011097); Focal T2 hyperintense basal ganglia lesion (HP:0007183); Focal T2 hyperintense thalamic lesion (HP:0012692); Focal-onset seizure (HP:0007359); Generalized myoclonic seizures (HP:0002123); Global brain atrophy (HP:0002283); Hypsarrhythmia (HP:0002521); Infantile spasms (HP:0012469); Intellectual disability, severe (HP:0010864); Muscular hypotonia (HP:0001252); Refractory (HP:0031375); Infantile onset (HP:0003593)"


In [14]:
Individual.output_individuals_as_phenopackets(individual_list=individual_list,metadata=metadata.to_ga4gh())

394
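
<p>A quick sanity check on exported phenopacket JSON could look like the following sketch. It uses an abbreviated, hand-written stand-in for the JSON shown earlier; the excluded "Autism" entry is hypothetical and was added only to illustrate the <tt>excluded</tt> flag that the GA4GH Phenopacket Schema uses for features that were explicitly ruled out.</p>

```python
import json

# Abbreviated, hand-written stand-in for one exported phenopacket.
example = """
{
  "id": "fam109",
  "subject": {"id": "fam109"},
  "phenotypicFeatures": [
    {"type": {"id": "HP:0001272", "label": "Cerebellar atrophy"}},
    {"type": {"id": "HP:0002072", "label": "Chorea"}},
    {"type": {"id": "HP:0000717", "label": "Autism"}, "excluded": true}
  ]
}
"""


def split_features(phenopacket):
    """Return (observed_labels, excluded_labels) from a phenopacket dictionary."""
    observed, excluded = [], []
    for feature in phenopacket.get("phenotypicFeatures", []):
        label = feature["type"]["label"]
        (excluded if feature.get("excluded", False) else observed).append(label)
    return observed, excluded


observed, excluded = split_features(json.loads(example))
print("observed:", observed)
print("excluded:", excluded)
```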