<H1>SCN2A: Generate Transcript HGVS </H1>
<p>In this notebook, we generate HGVS expressions for phenopackets from the data listed in 
  <a href="https://pubmed.ncbi.nlm.nih.gov/33731876/" target="__blank">Crawford et al., Computational analysis of 10,860 phenotypic annotations in individuals with SCN2A-related disorders</a>. In the Supplementary Table of this publication, variants are given in protein notation. Our software, however, requires transcript-level HGVS notation. We used the Ensembl <a href="http://useast.ensembl.org/Homo_sapiens/Tools/VR" target="__blank">Variant Recoder</a> tool to obtain the corresponding transcript-level expressions, and here we create a mapping file that will be used in the other SCN2A notebook to generate phenopackets.</p>
 <p>Note that we first need to generate a variant file with the notebook <tt>generate_hgvs</tt>.</p>
  

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from IPython.display import display, HTML, JSON
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.9.8


In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
PMID = "PMID:33731876"
title = "Computational analysis of 10,860 phenotypic annotations in individuals with SCN2A-related disorders"
cite = Citation(pmid=PMID, title=title)
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199", citation=cite)
metadata.default_versions_with_hpo(version=hpo_version)
metadata.mondo()
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


In [3]:
df = pd.read_csv("input/41436_2021_1120_MOESM2_ESM.csv", delimiter='\t')
df.head()

Unnamed: 0,famID,broad_phx,variant,variant_type_1,variant_type_2,location,domain,segment,pmid/pmcid,HPO,hpo_term
0,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,HP:0007359,Focal-onset seizure
1,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,HP:0002069,Generalized tonic-clonic seizures
2,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,NP:0003808,No Abnormal muscle tone
3,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,NP:0012443,No Abnormality of brain morphology
4,fam1,benign,A202V,missense,missense,Helical Repeat I,DI,S3,28379373,NP:0000717,No Autism


In [4]:
var_df = pd.read_csv("variant_mapping.tsv")

In [5]:
var_df.head()

Unnamed: 0,famID,var,MANE
0,fam1,A202V,c.605C>T
1,fam10,V423L,c.1267G>T
2,fam100,L501*,c.1501_1502delinsTA
3,fam101,N503K*19,c.1508dup
4,fam102,R583*,c.1747C>T


In [6]:
from collections import defaultdict
var_dict = defaultdict(list)
for _, row in var_df.iterrows():
    famID = row["famID"]
    var = row["var"]
    mane = row["MANE"]
    var_dict[famID] = [var, mane]

In [7]:
def row_to_hpo(hpo_id, hpo_label):
    """Transform a row of the dataframe to an HPO term.

    hpo_label is the label from the supplementary table. It is not used,
    because the current label is always taken from the ontology.
    """
    obsolete_id = {
        "HP:0011398": "HP:0001252",  # Hypotonia
        "HP:0040168": "HP:0007359",  # Focal-onset seizure
        "HP:0007095": "HP:0012650",  # Perisylvian polymicrogyria
    }
    # replace obsolete ids with their current equivalents
    hpo_id = obsolete_id.get(hpo_id, hpo_id)
    # excluded terms are coded with NP:0001234 instead of HP:0001234
    observed = not hpo_id.startswith("NP")
    if not observed:
        hpo_id = "H" + hpo_id[1:]
    # use the current ontology label rather than the (possibly outdated) table label
    hpotk_term = hpo_ontology.get_term(hpo_id)
    return HpTerm(hpo_id=hpo_id, label=hpotk_term.name, observed=observed)
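
<p>The identifier handling in <tt>row_to_hpo</tt> can be tried out without loading the ontology. The following is a minimal sketch under stated assumptions: <tt>normalize_hpo_id</tt> and <tt>OBSOLETE_IDS</tt> are hypothetical names introduced here for illustration, and the map contains only two of the entries used above. Unlike the cell above, the sketch looks up obsolete ids after restoring the <tt>HP:</tt> prefix, so that excluded (<tt>NP:</tt>) ids would be covered as well.</p>

```python
from typing import Dict, Tuple

# Hypothetical subset of the obsolete-to-current id map (illustration only)
OBSOLETE_IDS: Dict[str, str] = {
    "HP:0011398": "HP:0001252",  # Hypotonia
    "HP:0040168": "HP:0007359",  # Focal-onset seizure
}


def normalize_hpo_id(raw_id: str) -> Tuple[str, bool]:
    """Return (HP id, observed flag) for an id taken from the supplementary table."""
    # Ids starting with "NP:" denote excluded terms
    observed = not raw_id.startswith("NP:")
    hpo_id = raw_id if observed else "HP:" + raw_id[3:]
    # Replace an obsolete id with its current equivalent, if one is known
    hpo_id = OBSOLETE_IDS.get(hpo_id, hpo_id)
    return hpo_id, observed


print(normalize_hpo_id("NP:0003808"))  # ('HP:0003808', False)
print(normalize_hpo_id("HP:0040168"))  # ('HP:0007359', True)
```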

In [8]:
patient_d = defaultdict(list)
patient_pheno_d = {}
skipped = 0  # counts table rows (not patients) without a decoded variant
for _, row in df.iterrows():
    patID = row["famID"]
    if patID not in var_dict:
        #print(f"Skipping {patID} because we could not decode variant")
        skipped += 1
        continue
    patient_pheno_d[patID] = row["broad_phx"]
    hpoid = str(row["HPO"])
    hpo_label = str(row["hpo_term"])
    if hpoid == "nan":
        continue  # A few rows are missing data
    hpo = row_to_hpo(hpo_id=hpoid, hpo_label=hpo_label)
    patient_d[patID].append(hpo)
print(f"We got {len(patient_d)} patients and skipped {skipped} rows because we could not decode the variant")

We got 395 patients and skipped 146 rows because we could not decode the variant
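
<p>The loop above increments its counter once per table row, so one family with many annotations contributes many skipped rows. The following sketch, which uses made-up rows and hypothetical variable names, shows one way to track skipped rows and skipped families separately.</p>

```python
from collections import defaultdict

# Made-up rows (famID, HPO id) for illustration; fam999 has no decoded variant
rows = [
    ("fam1", "HP:0007359"),
    ("fam1", "HP:0002069"),
    ("fam999", "HP:0001250"),
    ("fam999", "HP:0001249"),
    ("fam10", None),  # a row with missing HPO data
]
known_variants = {"fam1", "fam10"}

terms_by_family = defaultdict(list)
skipped_rows = 0
skipped_families = set()
for fam_id, hpo_id in rows:
    if fam_id not in known_variants:
        skipped_rows += 1
        skipped_families.add(fam_id)
        continue
    if hpo_id is None:
        continue  # nothing to record for this row
    terms_by_family[fam_id].append(hpo_id)

print(len(terms_by_family), skipped_rows, len(skipped_families))  # 1 2 1
```

<p>Note that a family whose rows all lack HPO data (<tt>fam10</tt> in this toy input) never gets an entry in <tt>terms_by_family</tt>.</p>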


<H2>Disease/Phenotype groups</H2>
<p>The authors of the SCN2A paper divide the probands into five groups, which we can use for clustering.</p>
<p>The authors state: The range of clinical presentation among the SCN2A-related disorders is perplexing. Historically, the SCN2A gene was identified independently in three distinct phenotypes: benign familial infantile seizures, autism spectrum disorders (ASD), and DEE. While these conditions still represent the most well-recognized SCN2A-related phenotypes, many clinical presentations overlap, and others have been suggested. It has been hypothesized that early-onset epilepsy phenotypes are mainly associated with GoF variants, while later-onset epilepsy and nonepilepsy phenotypes including autism and intellectual disability are associated with LoF variants.</p>
<p>We will map the five groups to disease identifiers as follows. 
    <ul>
        <li>ASD: autism spectrum disorder (MONDO:0005258)</li>
        <li>atypical: Atypical SCN2A-related disease. There is no code for this, but for the purposes of clustering we will label this group with OMIM:182390, which is the OMIM identifier for the SCN2A gene</li>
        <li>encephalopathy: Developmental and epileptic encephalopathy 11 (MONDO:0013388 or OMIM:613721)</li>
        <li>epilepsy: epilepsy (MONDO:0005027)</li>
        <li>benign: Seizures, benign familial infantile, 3 (MONDO:0011904 or OMIM:607745)</li>
    </ul>
    </p>
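
<p>The mapping listed above can also be written as a lookup table. This is a minimal sketch that returns plain (id, label) tuples rather than pyphetools objects; <tt>DISEASE_BY_GROUP</tt> and <tt>lookup_disease</tt> are hypothetical names, and the OMIM identifiers are used where the list gives both a MONDO and an OMIM option.</p>

```python
from typing import Dict, Tuple

# (disease id, disease label) per phenotype group, copied from the list above
DISEASE_BY_GROUP: Dict[str, Tuple[str, str]] = {
    "ASD": ("MONDO:0005258", "autism spectrum disorder"),
    "atypical": ("OMIM:182390", "Atypical SCN2A-related disease"),
    "encephalopathy": ("OMIM:613721", "Developmental and epileptic encephalopathy 11"),
    "epilepsy": ("MONDO:0005027", "epilepsy"),
    "benign": ("OMIM:607745", "Seizures, benign familial infantile, 3"),
}


def lookup_disease(group: str) -> Tuple[str, str]:
    """Return (disease id, label) for a group, or raise ValueError if unknown."""
    try:
        return DISEASE_BY_GROUP[group]
    except KeyError:
        raise ValueError(f"Did not recognize disease category {group}") from None


print(lookup_disease("benign"))  # ('OMIM:607745', 'Seizures, benign familial infantile, 3')
```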

In [9]:
phenotypes = set(patient_pheno_d.values())

<h3>Get variant objects from VariantValidator</h3>

In [10]:
validator = VariantValidator(genome_build='hg38')
validated_var_d = {}
transcript = "NM_001040142.2"
for patid, var_array in var_dict.items():
    mane_var = var_array[1]
    total_hgvs = f"{transcript}:{mane_var}"
    # query VariantValidator only once per distinct variant
    if total_hgvs in validated_var_d:
        continue
    try:
        v = validator.encode_hgvs(hgvs=mane_var, custom_transcript=transcript)
        print(v)
        validated_var_d[total_hgvs] = v
    except Exception as e:
        # report the failure and continue with the remaining variants
        print(e)
print(f"We got a total of {len(validated_var_d)} validated variants")

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.605C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165308794C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1267G>T/NM_001040142.2?content-type=application%2Fjson
chr2:165313992G>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1501_1502delinsTA/NM_001040142.2?content-type=application%2Fjson
chr2:165315588CT>TA
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1508dup/NM_001040142.2?content-type=application%2Fjson
chr2:165315590G>GA
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1747C>T/NM_001040142.2?content-type=application%2Fjson
chr2:165323231C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001040142.2%3Ac.1825_1827delinsTGA/NM_001040142.2?content-type=application%2Fjson
chr2:
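
<p>The request URLs printed above appear to follow a fixed pattern: genome build, then the transcript and variant joined by an encoded colon, then the transcript again. The sketch below only reconstructs that string from the printed output. It makes no network request and is not taken from the pyphetools source; <tt>build_request_url</tt> is a hypothetical helper name.</p>

```python
BASE_URL = "https://rest.variantvalidator.org/VariantValidator/variantvalidator"


def build_request_url(hgvs_c: str, transcript: str, genome_build: str = "hg38") -> str:
    """Rebuild a request URL in the shape of those printed in the output above."""
    # Only the colon between transcript and variant is percent-encoded (%3A),
    # mirroring the printed URLs, which leave characters such as '>' as they are
    return (
        f"{BASE_URL}/{genome_build}/{transcript}%3A{hgvs_c}/{transcript}"
        "?content-type=application%2Fjson"
    )


print(build_request_url("c.605C>T", "NM_001040142.2"))
```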

In [11]:
def get_disease(disease_category):
    if disease_category == "ASD":
        return Disease(disease_id= "MONDO:0005258", disease_label="autism spectrum disorder")
    elif disease_category == "atypical":
        return Disease(disease_id="OMIM:182390", disease_label="Atypical SCN2A-related disease")
    elif disease_category == "encephalopathy":
        return Disease(disease_id="OMIM:613721", disease_label="Developmental and epileptic encephalopathy 11")
    elif disease_category == "epilepsy": 
        return Disease(disease_id="MONDO:0005027", disease_label="epilepsy") 
    elif disease_category == "benign":
        return Disease(disease_id="OMIM:607745", disease_label="Seizures, benign familial infantile, 3")
    else:
        raise ValueError(f"Did not recognize disease category {disease_category}")

In [12]:
individual_list = []

for patID, var in var_dict.items():
    mane = var[1]
    total_hgvs = f"{transcript}:{mane}"
    v = validated_var_d.get(total_hgvs)
    if v is None:
        # should never happen
        print(f"Could not find {total_hgvs}")
        continue
    v.set_heterozygous()
    pheno = patient_pheno_d.get(patID)
    disease = get_disease(pheno)
    hpo_list = patient_d.get(patID)
    ind = Individual(individual_id=patID, 
                     hpo_terms=hpo_list, 
                     citation=cite,
                     interpretation_list=[v.to_ga4gh_variant_interpretation()], 
                     disease=disease)
    individual_list.append(ind)
    
print(f"Created {len(individual_list)} individual objects")

Created 396 individual objects


In [13]:
cvalidator = CohortValidator(cohort=individual_list, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.MONO_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

Level,Error category,Count
ERROR,CONFLICT,24
ERROR,INSUFFICIENT_HPOS,1
ERROR,OBSERVED_AND_EXCLUDED,1
WARNING,REDUNDANT,156

ID,Level,Category,Message,HPO Term
PMID_33731876_fam285,ERROR,INSUFFICIENT_HPOS,Minimum HPO terms required 1 but only 0 found,
PMID_33731876_fam353,ERROR,OBSERVED_AND_EXCLUDED,Term Seizure (HP:0001250) was annotated to be both observed and excluded.,Seizure (HP:0001250)


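fam353 is annotated with the HPO term Seizure (HP:0001250) as both observed and excluded. This is an irrecoverable error, so pyphetools will drop this case from the cohort. fam285 has no HPO terms and therefore does not meet the minimum of one term (min_hpo=1).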
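
<p>The observed-and-excluded conflict can be illustrated on plain tuples. This is a simplified sketch and not how pyphetools implements its check: it compares exact identifiers only and ignores the ontology hierarchy. The function name and the example terms are hypothetical.</p>

```python
from typing import Iterable, Set, Tuple


def observed_and_excluded(terms: Iterable[Tuple[str, bool]]) -> Set[str]:
    """Return the ids that occur both as observed (True) and as excluded (False)."""
    terms = list(terms)  # materialize so that the input can be scanned twice
    observed = {hpo_id for hpo_id, is_observed in terms if is_observed}
    excluded = {hpo_id for hpo_id, is_observed in terms if not is_observed}
    return observed & excluded


# Made-up annotations for a single individual
example_terms = [("HP:0001250", True), ("HP:0000717", True), ("HP:0001250", False)]
print(observed_and_excluded(example_terms))  # {'HP:0001250'}
```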

In [14]:
individuals = cvalidator.get_error_free_individual_list()
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.MONO_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

In [15]:
individuals = cvalidator.get_error_free_individual_list()
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
fam1 (UNKNOWN; ),"Seizures, benign familial infantile, 3 (OMIM:607745)",NM_001040142.2:c.605C>T (heterozygous),Focal-onset seizure (HP:0007359); Bilateral tonic-clonic seizure (HP:0002069); Neonatal onset (HP:0003623); excluded: Abnormal muscle tone (HP:0003808); excluded: Abnormal brain morphology (HP:0012443); excluded: Autism (HP:0000717); excluded: Neurodevelopmental abnormality (HP:0012759)
fam10 (UNKNOWN; ),Developmental and epileptic encephalopathy 11 (OMIM:613721),NM_001040142.2:c.1267G>T (heterozygous),"EEG with burst suppression (HP:0010851); Epileptic encephalopathy (HP:0200134); Focal autonomic seizure (HP:0011154); Generalized myoclonic seizure (HP:0002123); Generalized tonic seizure (HP:0010818); Intellectual disability, severe (HP:0010864); Microcephaly (HP:0000252); Multifocal epileptiform discharges (HP:0010841); Hypotonia (HP:0001252); Refractory (HP:0031375); Neonatal onset (HP:0003623); excluded: Autism (HP:0000717)"
fam100 (UNKNOWN; ),Atypical SCN2A-related disease (OMIM:182390),NM_001040142.2:c.1501_1502delinsTA (heterozygous),Intellectual disability (HP:0001249); excluded: Seizure (HP:0001250)
fam101 (UNKNOWN; ),Atypical SCN2A-related disease (OMIM:182390),NM_001040142.2:c.1508dup (heterozygous),"Absent speech (HP:0001344); Intellectual disability, severe (HP:0010864); Self-injurious behavior (HP:0100716); excluded: Autism (HP:0000717); excluded: Seizure (HP:0001250)"
fam102 (UNKNOWN; ),autism spectrum disorder (MONDO:0005258),NM_001040142.2:c.1747C>T (heterozygous),"Attention deficit hyperactivity disorder (HP:0007018); Autism (HP:0000717); Intellectual disability, moderate (HP:0002342); excluded: Short columella (HP:0002000); excluded: Absent speech (HP:0001344); excluded: Seizure (HP:0001250)"
fam103 (UNKNOWN; ),Atypical SCN2A-related disease (OMIM:182390),NM_001040142.2:c.1825_1827delinsTGA (heterozygous),Intellectual disability (HP:0001249); excluded: Seizure (HP:0001250)
fam104 (UNKNOWN; ),autism spectrum disorder (MONDO:0005258),NM_001040142.2:c.1831_1832del (heterozygous),"Autism (HP:0000717); Hypermetropia (HP:0000540); Intellectual disability, severe (HP:0010864); Inverted nipples (HP:0003186); Narrow palate (HP:0000189); Abnormality of speech or vocalization (HP:0002167); Pineal cyst (HP:0012683); Self-injurious behavior (HP:0100716); Sleep abnormality (HP:0002360); Tapered finger (HP:0001182); excluded: Seizure (HP:0001250)"
fam105 (UNKNOWN; ),epilepsy (MONDO:0005027),NM_001040142.2:c.1945G>A (heterozygous),Febrile seizure (within the age range of 3 months to 6 years) (HP:0002373); Focal-onset seizure (HP:0007359)
fam106 (UNKNOWN; ),autism spectrum disorder (MONDO:0005258),NM_001040142.2:c.2021C>A (heterozygous),Autism (HP:0000717); excluded: Seizure (HP:0001250)
fam107 (UNKNOWN; ),Developmental and epileptic encephalopathy 11 (OMIM:613721),NM_001040142.2:c.2388+1G>A (heterozygous),"Abnormal cerebral white matter morphology (HP:0002500); Ataxia (HP:0001251); Autism (HP:0000717); Cerebellar atrophy (HP:0001272); EEG with generalized spikes (HP:0012000); Focal-onset seizure (HP:0007359); Generalized tonic seizure (HP:0010818); Global brain atrophy (HP:0002283); Intellectual disability, severe (HP:0010864); Refractory (HP:0031375); Sleep abnormality (HP:0002360); Abnormal repetitive mannerisms (HP:0000733); Ventriculomegaly (HP:0002119); Childhood onset (HP:0011463)"


In [16]:
Individual.output_individuals_as_phenopackets(individual_list=individuals, 
                                              metadata=metadata)

We output 394 GA4GH phenopackets to the directory phenopackets


In [17]:
# pxf validate --hpo hp.json *.json
# No errors