<h1>ERI1: Guo et al. 2023</h1>
<p>Extract the clinical data from <a href="https://pubmed.ncbi.nlm.nih.gov/37352860/" target="_blank">Guo L, et al. (2023) Null and missense mutations of ERI1 cause a recessive phenotypic dichotomy in humans. Am J Hum Genet. PMID:37352860</a>.</p>
<p>The authors report a phenotypic dichotomy associated with bi-allelic ERI1 variants in eight affected individuals from seven unrelated families. Severe spondyloepimetaphyseal dysplasia (SEMD) was identified in five affected individuals with missense variants but not in those with bi-allelic null variants, who showed mild intellectual disability and digital anomalies.</p>

In [17]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
import math
from csv import DictReader
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import re
from pyphetools.creation import *
from pyphetools.output import PhenopacketTable
# last tested with pyphetools version 0.4.6

In [81]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)

In [123]:
df = pd.read_excel("input/Guo_2023.xlsx")

In [124]:
patient_id = df.columns

In [125]:
df = df.set_index('Individual').T.reset_index()
df["patient_id"] = df["index"]

In [126]:
scg = SimpleColumnMapperGenerator(df=df,
                                   observed='+',
                                   excluded='-',
                                  hpo_cr=hpo_cr)

In [127]:
column_mapper_d = scg.try_mapping_columns()

In [128]:
from IPython.display import display, HTML  # importing display from IPython.core.display is deprecated
display(HTML(scg.to_html()))

Result,Columns
Mapped,Syndactyly; Cardiac anomaly; Hydronephrosis; Vesicoureteral reflux; Asthma; Conductive hearing impairment; hypernasal speech; Dislocated radial head; Scoliosis; Hip pain; Short stature; Long face; Narrow face; proptosis; Coarse facies; Low-set ears; Limited elbow extension; Finger joint hypermobility; Clinodactyly of the 5th finger; Pes planus; Slender metacarpals; Increased vertebral height; Velopharyngeal insufficiency; Hip dislocation; Patellar dislocation; Narrow forehead; Upslanted palpebral fissure; High palate; Pectus excavatum; Tapered finger; Prominent forehead; Depressed nasal bridge; Micrognathia; Cutaneous syndactyly; Macrotia; Narrow chest; Pulmonary arterial hypertension; Oligodactyly; Tricuspid regurgitation; Platyspondyly; Intrauterine growth retardation; Motor delay; Failure to thrive; Trigonocephaly; Frontal bossing; Sparse hair; Pectus carinatum; Wormian bones; Osteopenia; Delayed skeletal maturation; Inguinal hernia; Ventricular septal defect; Brachycephaly; Anonychia; Strabismus; Low anterior hairline; Epicanthus
Unmapped,index; DNA; Protein; Sex; Age at last follow-up; Weight; Height; Consanguinity; Fetal ultrasound; Gestation age; Birth weight; Birth length; Spine anomaly; Metaphyseal anomaly; Epiphyseal anomaly; Brachydactyly/clinodactyly/camptodactyly; Intellectual disability/developmental delay; Zygomatic hypoplasia; Posteriorly rotated ear; Cupped ear ; patient_id


In [129]:
# Now get the unmapped columns and try option mappers
unmapped_columns = scg.get_unmapped_columns()
omit_columns = set(column_mapper_d.keys())
omit_columns.update(['index','DNA','Protein','Age at last follow-up','Consanguinity'])

In [130]:
auto_results = OptionColumnMapper.autoformat(df=df, concept_recognizer=hpo_cr, omit_columns=omit_columns)

In [131]:
print(auto_results)

index_d = {'1A': 'PLACEHOLDER',
 '1B': 'PLACEHOLDER',
 'Hoxha': 'PLACEHOLDER',
 'Choucair': 'PLACEHOLDER'}
indexMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=index_d)
indexMapper.preview_column(df['index'])
column_mapper_d['index'] = indexMapper

dna_d = {'c.[450A>T]; [893A>G]': 'PLACEHOLDER',
 'c.[464C>T]; [893A>C]': 'PLACEHOLDER',
 'c.[401A>G]; [895T>C]': 'PLACEHOLDER',
 'c.[464C>T]; [62C>A]': 'PLACEHOLDER',
 'c.[514C>T]; [514C>T]': 'PLACEHOLDER',
 'c.[730C>T]; [730C>T]': 'PLACEHOLDER',
 'c.[582+1G>A]; [582+1G>A]': 'PLACEHOLDER',
 'c.[ 352A>T]; [352A>T]': 'PLACEHOLDER',
 'g.[8783887_9068578del]; [8783887_9068578del]': 'PLACEHOLDER'}
dnaMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=dna_d)
dnaMapper.preview_column(df['DNA'])
column_mapper_d['DNA'] = dnaMapper

protein_d = {'p.[Glu150Asp]; [Asp298Gly]': 'PLACEHOLDER',
 'p.[Pro155Leu]; [Asp298Ala]': 'PLACEHOLDER',
 'p.[Asp134Gly]; [Ser299Pro]': 'PLACEHOLDER',
 'p.[Pro155Leu]; [Ser21∗]': 'PLACEHOLDER',


In [132]:
weight_d = {'24\xa0kg (−5 SD)': 'Decreased body weight',
 '26\xa0kg (−5 SD)': 'Decreased body weight',
 '3.3\xa0kg (- 4 SD)': 'Decreased body weight',
 'failure to thrive': 'Failure to thrive'}
excluded_d = {
    '22\xa0kg (8th centile)': 'Decreased body weight',
    '62\xa0kg (85th centile)': 'Decreased body weight',
    '27.6\xa0kg (50th centile)': 'Decreased body weight',
    'normal': 'Failure to thrive',
}
weightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=weight_d,
                                 excluded_d=excluded_d)
weightMapper.preview_column(df['Weight'])
column_mapper_d['Weight'] = weightMapper

In [133]:
height_d = {'112\xa0cm (−8 SD)': 'Short stature',
 '128\xa0cm (−7 SD)': 'Short stature',
 '50.3\xa0cm (−5 SD)': 'Short stature',
 'short stature': 'Short stature',
 '105\xa0cm (<3rd centile)': 'Short stature'}

excluded_d = {
    '130.8\xa0cm (46th centile)': 'Short stature',
 '155\xa0cm (25th centile)': 'Short stature',
 '130\xa0cm (90th centile)': 'Short stature',
}
heightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=height_d,
                                excluded_d=excluded_d)
heightMapper.preview_column(df['Height'])
column_mapper_d['Height'] = heightMapper
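<p>The weight and height dictionaries above were sorted by hand into observed and excluded entries. The sketch below shows one way the same decision could be derived from the cell text. The helper name <code>is_below_cutoff</code> and both cut-offs (at or below −2 SD, below the 3rd centile) are assumptions made for illustration and are not part of pyphetools or of the original curation.</p>

```python
import re

# Assumed cut-offs (not taken from the notebook): a z-score at or below -2 SD
# or a centile below the 3rd counts as abnormal.
SD_CUTOFF = -2.0
CENTILE_CUTOFF = 3.0

# Accept both the ASCII hyphen and the Unicode minus sign used in the table.
SD_PATTERN = re.compile(r"\(\s*([-−])?\s*(\d+(?:\.\d+)?)\s*SD\s*\)")
CENTILE_PATTERN = re.compile(
    r"\(\s*(<)?\s*(\d+(?:\.\d+)?)(?:st|nd|rd|th)\s+centile\s*\)"
)


def is_below_cutoff(cell):
    """Return True or False for a parseable cell, or None if nothing matched."""
    text = cell.replace("\xa0", " ")  # non-breaking spaces from the spreadsheet
    sd_match = SD_PATTERN.search(text)
    if sd_match:
        sign = -1.0 if sd_match.group(1) else 1.0
        return sign * float(sd_match.group(2)) <= SD_CUTOFF
    centile_match = CENTILE_PATTERN.search(text)
    if centile_match:
        value = float(centile_match.group(2))
        if centile_match.group(1):  # '<3rd centile' lies strictly below the value
            return value <= CENTILE_CUTOFF
        return value < CENTILE_CUTOFF
    return None


cells = [
    '112\xa0cm (−8 SD)',
    '105\xa0cm (<3rd centile)',
    '130.8\xa0cm (46th centile)',
    'short stature',
]
observed = [c for c in cells if is_below_cutoff(c) is True]
excluded = [c for c in cells if is_below_cutoff(c) is False]
unparsed = [c for c in cells if is_below_cutoff(c) is None]
```

<p>Cells that end up in <code>unparsed</code>, such as free-text entries, would still need manual review before being added to either dictionary.</p>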

In [134]:
fetal_ultrasound_d = {'hydronephrosis': 'Hydronephrosis',
 'short limbs': 'Limb undergrowth',
 'severe IUGR': 'Intrauterine growth retardation',
 }
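excluded = {
    'unremarkable': 'Intrauterine growth retardation'
}
# Pass the excluded dictionary so that 'unremarkable' is recorded as an excluded finding
fetal_ultrasoundMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=fetal_ultrasound_d,
                                            excluded_d=excluded)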
fetal_ultrasoundMapper.preview_column(df['Fetal ultrasound'])
column_mapper_d['Fetal ultrasound'] = fetal_ultrasoundMapper

In [135]:
birth_weight_d = {
 '2180\xa0g (−3.2 SD)': 'Small for gestational age',
 '000\xa0g (−3.3 SD)': 'Small for gestational age',}
birth_weightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=birth_weight_d)
birth_weightMapper.preview_column(df['Birth weight'])
column_mapper_d['Birth weight'] = birth_weightMapper

In [136]:
# Omitting these because we manually curated detailed phenotypes and added them to the input table
#spine_anomaly_d 
#metaphyseal_anomaly_d = {'nan': 'PLACEHOLDER'}
#epiphyseal_anomaly_d = {'+ (wrists)': 'PLACEHOLDER'}


In [137]:
id_gdd_d = {
 'Motor delay': 'Motor delay',
 'Delayed speech and language development': 'Delayed speech and language development',
 'generalized hypotonia': 'Generalized hypotonia',
 'Global developmental delay': 'Global developmental delay',
 'Autism': 'Autism',
 'Intellectual disability mild': 'Intellectual disability, mild',}
id_gddMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=id_gdd_d)
id_gddMapper.preview_column(df['Intellectual disability/developmental delay'])
column_mapper_d['Intellectual disability/developmental delay'] = id_gddMapper

<H2>Variants</H2>

In [138]:
variant_set = set()
for v_string in df["DNA"]:
    fields = v_string.split(";")
    if len(fields) != 2:
        raise ValueError(f"Malformed variant line {v_string}")
    for var in fields:
        variant_str = var.strip()
        variant_str = re.sub(r"(c\.)?\[", "", variant_str)  # escape the dot so only a literal "c." prefix is removed
        variant_str = variant_str.replace("]", "").strip()
        if "8783887" in variant_str:  # this is the structural variant
            variant_set.add("g.8783887_9068578del")
        else:
            variant_str = "c." + variant_str
            variant_set.add(variant_str)
for v in variant_set:
    print(f"\"{v}\"")

"c.893A>C"
"c.450A>T"
"c.464C>T"
"g.8783887_9068578del"
"c.401A>G"
"c.514C>T"
"c.352A>T"
"c.893A>G"
"c.730C>T"
"c.62C>A"
"c.582+1G>A"
"c.895T>C"
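<p>The allele splitting in the cell above can also be written as a small standalone function, which makes it easier to check against individual rows. This is a minimal sketch using only the standard library; <code>split_alleles</code> is a hypothetical helper name, and the input strings are copied from the DNA column shown earlier.</p>

```python
import re

# One bracketed allele, e.g. "[450A>T]" or "[ 352A>T]" (stray spaces tolerated)
ALLELE_PATTERN = re.compile(r"\[\s*([^\]]+?)\s*\]")


def split_alleles(dna_cell):
    """Split 'c.[450A>T]; [893A>G]' into ['c.450A>T', 'c.893A>G']."""
    prefix_match = re.match(r"\s*([cg])\.", dna_cell)
    if prefix_match is None:
        raise ValueError(f"No c. or g. prefix in {dna_cell!r}")
    prefix = prefix_match.group(1)
    alleles = ALLELE_PATTERN.findall(dna_cell)
    if len(alleles) != 2:
        raise ValueError(f"Malformed variant line {dna_cell!r}")
    return [f"{prefix}.{allele}" for allele in alleles]


dna_cells = [
    'c.[450A>T]; [893A>G]',
    'c.[ 352A>T]; [352A>T]',
    'g.[8783887_9068578del]; [8783887_9068578del]',
]
unique_variants = set()
for cell in dna_cells:
    unique_variants.update(split_alleles(cell))
```

<p>Because the prefix is read from the string itself, the structural variant keeps its <code>g.</code> prefix without a special case, and a row whose two alleles are identical can be recognized as homozygous by comparing the two returned strings.</p>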


In [139]:
validator = VariantValidator(genome_build='hg38', transcript="NM_153332.4")
validated_var_d = {}  # a plain dict suffices; defaultdict() without a factory adds nothing

In [140]:
for var in variant_set:
    print(f"Validating {var}")
    if var == 'g.8783887_9068578del':
        sv = StructuralVariant.chromosomal_deletion(cell_contents='Deletion exons 1-4',
                 gene_symbol="ERI1",
                 gene_id="HGNC:23994",
                 genotype="homozygous")
        validated_var_d[var] = sv
    else:
        var_object = validator.encode_hgvs(hgvs=var)
        validated_var_d[var] = var_object
print(f"We got {len(validated_var_d)} variant objects")

Validating c.893A>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_153332.4%3Ac.893A>C/NM_153332.4?content-type=application%2Fjson
Validating c.450A>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_153332.4%3Ac.450A>T/NM_153332.4?content-type=application%2Fjson
Validating c.464C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_153332.4%3Ac.464C>T/NM_153332.4?content-type=application%2Fjson
Validating g.8783887_9068578del
Validating c.401A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_153332.4%3Ac.401A>G/NM_153332.4?content-type=application%2Fjson
Validating c.514C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_153332.4%3Ac.514C>T/NM_153332.4?content-type=application%2Fjson
Validating c.352A>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_153332.4%3Ac.352A>T/NM_153332.4?content-type=application%2Fjson
Validati

In [146]:
ageMapper = AgeColumnMapper.by_year('Age at last follow-up')
#ageMapper.preview_column(df['Age at last follow-up'])
sexMapper = SexColumnMapper(male_symbol='M', female_symbol='F', column_name='Sex')
#sexMapper.preview_column(df['Sex'])

<h2>Disease diagnosis</h2>
<p>Diseases related to ERI1 are currently not represented in OMIM. For this reason, we represent the diagnosis as preliminary below. The authors write: "SEMD was present in the five individuals with at least one missense variant (Table 1). In contrast, three individuals with ERI1 null mutations and the Eri1 KO mice showed a much milder skeletal phenotype without any evidence for SEMD, consistent with the two individuals reported previously, who had homozygous a 284 kb deletion and p.Lys118∗. Notably, of the five individuals with SEMD, three died within 2 years after birth, suggesting missense variants lead to a poor prognosis."</p>

In [147]:
pmid="PMID:37352860"
encoder = CohortEncoder(df=df, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", agemapper=ageMapper, sexmapper=sexMapper,
                       metadata=metadata,
                       pmid=pmid)
encoder.set_disease(disease_id='Preliminary:Preliminary', label='ERI1-related disease')

In [148]:
individuals = encoder.get_individuals()

ValueError: Error: cell_contents argument (Individual
Hydronephrosis    +
Hydronephrosis    +
Name: 0, dtype: object) must be string but was <class 'pandas.core.series.Series'> -- coerced to string
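<p>The error above shows two Hydronephrosis entries where a single cell value was expected, which is what pandas returns when a column label occurs more than once. Assuming that is the cause here (the input spreadsheet is not shown), a quick check for repeated labels before encoding could look like the sketch below. <code>find_duplicate_labels</code> is a hypothetical helper, and the example labels are a shortened, illustrative list.</p>

```python
from collections import Counter


def find_duplicate_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label, n in counts.items() if n > 1]


# Illustrative labels only; in the notebook this would be list(df.columns)
example_labels = ["Sex", "Hydronephrosis", "Scoliosis", "Hydronephrosis"]
duplicates = find_duplicate_labels(example_labels)
```

<p>If the check reports a repeated label, one of the two rows in the input table would need to be renamed or merged before <code>encoder.get_individuals()</code> is called again.</p>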