<H1>Creation of phenopackets from PMID:31278393</H1>
<P>In this notebook, we show how to create phenopackets from Table 1 of <a href="https://pubmed.ncbi.nlm.nih.gov/31278393/" target="_blank">Dyment DA et al. (2019) De novo substitutions of TRPM3 cause intellectual disability and epilepsy. Eur J Hum Genet. 27:1611-1618</a>.</P>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import pyphetools
from pyphetools.creation import *
from pyphetools.validation import ContentValidator
from pyphetools.visualization import *

print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.8.2


In [2]:
# Import HPO data
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
PMID = "PMID:31278393"
title = "De novo substitutions of TRPM3 cause intellectual disability and epilepsy"
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199", pmid=PMID, pubmed_title=title)
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


In [3]:
df = pd.read_excel('data/PMID_31278393.xlsx')

In [4]:
df

Unnamed: 0,Individual,1,2,3,4,5,6,7,8
0,cDNA (NM_020952.4),c.2509G>A,c.2509G>A,c.2509G>A,c.2509G>A,c.2509G>A,c.2509G>A,c.2509G>A,c.2810C>A
1,Polypeptide (NP_066003.3),p.(Val837Met),p.(Val837Met),p.(Val837Met),p.(Val837Met),p.(Val837Met),p.(Val837Met),p.(Val837Met),p.(Pro937Gln)
2,Genomic DNA (NC_000009.11),g.73213379C>T,g.73213379C>T,g.73213379C>T,g.73213379C>T,g.73213379C>T,g.73213379C>T,g.73213379C>T,g.73168145G>T
3,Zygosity,Heterozygous,Heterozygous,Heterozygous,Heterozygous,Heterozygous,Heterozygous,Heterozygous,Heterozygous
4,Segregation,De novo,De novo,De novo,De novo,De novo,De novo,De novo,De novo
5,Clinical features,,,,,,,,
6,Gestation (weeks),38,40,42,39,38 + 3,40,39,Term
7,Perinatal history,C/S,N,N,N,N,N,C/S,C/S (repeat)
8,Birth weight (kg),NR,3.6,3.2,3.48,3.378,3.89,3.1,2.9
9,Sex,M,M,F,M,M,M,M,F


In [5]:
# Convert to a row-based format (one row per individual)
dft = df.transpose()
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft.head()
# The individual identifier is now the row index, but we need it to be available as a column.
# Therefore, copy it into an explicit new column.
dft['patient_id'] = dft.index
dft.head()

Individual,cDNA (NM_020952.4),Polypeptide (NP_066003.3),Genomic DNA (NC_000009.11),Zygosity,Segregation,Clinical features,Gestation (weeks),Perinatal history,Birth weight (kg),Sex,...,Craniofacial gestalt,Morphological features,Other clinical features,Brain MRI,Apparent heat or pain insensitivity,Genetic investigations,aCGH,Fragile X,Other (nondiagnostic) genetic investigations,patient_id
1,c.2509G>A,p.(Val837Met),g.73213379C>T,Heterozygous,De novo,,38,C/S,NR,M,...,Nondysmorphic,"Broad forehead, deeply set eyes, ptosis, bulbous nasal tip, micrognathia, prominent lobule of ear, tapering fingers",C1 spinal stenosis; Chiari I malformation; scoliosis; torticollis; plagiocephaly; thickened filum terminale; bilateral talipes equinovarus; strabismus (exotropia OU),Possible mild cerebral volume loss,+ (Heat),,Normal,Normal,"ID panel (170 genes), PHF6",1
2,c.2509G>A,p.(Val837Met),g.73213379C>T,Heterozygous,De novo,,40,N,3.6,M,...,Nondysmorphic,"Short philtrum, long nose, turricephaly",EMG/NCS normal,Normal,NR,,Normal,Normal,NR,2
3,c.2509G>A,p.(Val837Met),g.73213379C>T,Heterozygous,De novo,,42,N,3.2,F,...,Nondysmorphic,NR,−,Normal,NR,,Normal,Normal,"MECP2, SMA",3
4,c.2509G>A,p.(Val837Met),g.73213379C>T,Heterozygous,De novo,,39,N,3.48,M,...,NR,"Broad forehead, deeply set eyes, flat midface, short philtrum, micrognathia, broad halluces, fifth-finger clinodactyly, pectus excavatum",Strabismus,Normal,NR,,Normal,Normal,NR,4
5,c.2509G>A,p.(Val837Met),g.73213379C>T,Heterozygous,De novo,,38 + 3,N,3.378,M,...,NR,"Broad forehead, low nasal bridge, unilateral preauricular pit, short broad thumbs","Cryptorchidism, micropenis, bilateral talipes equinovarus","Ventriculomegaly, nonspecific periventricular white matter hyperintensities",+ (Pain),,Normal,,NR,5
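
<p>The reshaping step above can be illustrated on a toy table. This is a minimal sketch, assuming a made-up two-attribute table rather than the real spreadsheet; it uses <code>set_index</code> followed by <code>transpose</code>, which should give the same layout as the cell above. The names <code>raw</code> and <code>by_individual</code> are illustrative only.</p>

```python
import pandas as pd

# Toy table laid out like the published one:
# one row per attribute, one column per individual (values are made up)
raw = pd.DataFrame({
    "Individual": ["Sex", "Zygosity"],
    "1": ["M", "Heterozygous"],
    "2": ["F", "Heterozygous"],
})

# Use the attribute names as the index, then transpose
# so that each individual becomes one row
by_individual = raw.set_index("Individual").transpose()

# Expose the individual identifier as an ordinary column
by_individual["patient_id"] = by_individual.index

print(by_individual)
```

<p>After the transpose, each attribute can be addressed by column name, which is the layout the column mappers below operate on.</p>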


<h2>Column mappers</h2>

In [6]:
column_mapper_d = defaultdict(ColumnMapper)

In [7]:
# Developmental delay/intellectual disability -- code the entries as intellectual disability terms
severity_id = {'+ (Severe)': 'Intellectual disability, severe',
               '+ (Moderate)': 'Intellectual disability, moderate',
               '+ (Moderate-to-severe)':'Intellectual disability, moderate'}
idMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=severity_id)
idMapper.preview_column(dft['Developmental delay/intellectual disability'])
column_mapper_d['Developmental delay/intellectual disability'] = idMapper

In [24]:
# when building the encoder, inspect all the columns
# dft.columns

In [9]:
# By inspection, all entries of this column indicate delayed ability to walk, so we use a ConstantColumnMapper.
# The alternative would be to code each of the varied entries individually.
delayedWalkColumn = ConstantColumnMapper(hpo_id='HP:0031936', hpo_label='Delayed ability to walk')
#delayedWalkColumn.preview_column(dft['Ambulate independently (age achieved)'])
column_mapper_d['Ambulate independently (age achieved)'] = delayedWalkColumn

In [10]:
# The same reasoning applies to speech
delayedSpeechColumn = ConstantColumnMapper(hpo_id='HP:0000750', hpo_label='Delayed speech and language development')
# delayedSpeechColumn.preview_column(dft['Any speech (age attained)'])
column_mapper_d['Any speech (age attained)'] = delayedSpeechColumn

In [11]:
# 'Autism-like features' -> Autistic behavior (HP:0000729)
autisticFeaturesMapper = SimpleColumnMapper(hpo_id='HP:0000729', hpo_label='Autistic behavior', observed="+", excluded="−")
#autisticFeaturesMapper.preview_column(dft['Autism-like features'])
column_mapper_d['Autism-like features'] = autisticFeaturesMapper
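
<p>The <code>excluded</code> symbol in the mapper above appears to be the typographic minus sign (U+2212) copied from the published table, not the hyphen typed on a keyboard. The following sketch shows why the two do not match in a string comparison. <code>normalize_minus</code> is a hypothetical helper written for this illustration and is not part of pyphetools.</p>

```python
import unicodedata

UNICODE_MINUS = "\u2212"  # typographic minus sign, common in published tables
ASCII_HYPHEN = "-"        # hyphen-minus typed on a keyboard

# The two characters render almost identically but are different code points
print(unicodedata.name(UNICODE_MINUS))
print(unicodedata.name(ASCII_HYPHEN))
print(UNICODE_MINUS == ASCII_HYPHEN)


def normalize_minus(cell: str) -> str:
    """Hypothetical helper: trim a cell and map the typographic minus to '-'."""
    return cell.strip().replace(UNICODE_MINUS, ASCII_HYPHEN)


print(normalize_minus(" \u2212 "))
```

<p>Either copy the symbol directly from the data, as done above, or normalize the column first and use the plain hyphen in the mapper.</p>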

In [12]:
seizure_d = {'Absence': 'Typical absence seizure',
             'Infantile spasms': 'Infantile spasms',
             'GTC':'Bilateral tonic-clonic seizure',
             'ESES': 'Status epilepticus'}
seizureMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=seizure_d)
#seizureMapper.preview_column(dft['Seizure types'])
column_mapper_d['Seizure types'] = seizureMapper

In [13]:
# Hypotonia (HP:0001252) -- note that we also code '+ (mixed tone abnormality)' as Hypotonia
hypotoniaMapper = SimpleColumnMapper(hpo_id='HP:0001252', hpo_label='Hypotonia', 
                                     observed=['+', '+ (mixed tone abnormality)'], excluded='−')
#hypotoniaMapper.preview_column(dft['Hypotonia'])
column_mapper_d['Hypotonia'] = hypotoniaMapper

In [14]:
#dft['Morphological features']
morph_d = {
    'bulbous nasal tip': 'Bulbous nose'
}
morphologicalMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=morph_d)
morphologicalMapper.preview_column(dft['Morphological features'])
column_mapper_d['Morphological features'] = morphologicalMapper

In [15]:
other_d = {
    'Chiari I malformation': 'Chiari type I malformation',
    'C1 spinal stenosis':'Cervical spinal canal stenosis'
}
otherMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=other_d)
otherMapper.preview_column(dft['Other clinical features'])
column_mapper_d['Other clinical features'] = otherMapper

In [16]:
ageMapper = AgeColumnMapper.by_year('Age (years)')
#ageMapper.preview_column(dft['Age (years)'])
sexMapper = SexColumnMapper(male_symbol='M', female_symbol='F', column_name='Sex')
#sexMapper.preview_column(dft['Sex'])

In [17]:
hg38 = 'hg38'
TRPM3_transcript='NM_020952.6'
vvalidator = VariantValidator(genome_build=hg38, transcript=TRPM3_transcript)
var_list = dft['cDNA (NM_020952.4) '].unique()
var_d = {}
for v in var_list:
    var = vvalidator.encode_hgvs(v)
    var_d[v] = var

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_020952.6%3Ac.2509G>A /NM_020952.6?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_020952.6%3Ac.2810C>A /NM_020952.6?content-type=application%2Fjson


In [18]:
genome = 'hg38'
transcript='NM_020952.6' # latest version of the TRPM3 transcript (the publication used version 4)
# Note there is an extra space at the end of the column name
varMapper = VariantColumnMapper(variant_d=var_d,
                                variant_column_name='cDNA (NM_020952.4) ', 
                                default_genotype='heterozygous')
#varMapper.preview_column(column=dft['cDNA (NM_020952.4) '])
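
<p>The trailing space in the column name is easy to miss. One possible alternative, sketched here on a toy table and not applied in this notebook, is to strip whitespace from the headers (and, if needed, from the cell values) with pandas string methods before building the mappers. The table contents below are made up for illustration.</p>

```python
import pandas as pd

# Toy table with a trailing space in the header and in one cell
table = pd.DataFrame({"cDNA (NM_020952.4) ": ["c.2509G>A ", "c.2810C>A"]})

# Strip leading/trailing whitespace from every column name
table.columns = table.columns.str.strip()

# Strip whitespace from the variant strings as well
table["cDNA (NM_020952.4)"] = table["cDNA (NM_020952.4)"].str.strip()

print(list(table.columns))
print(table["cDNA (NM_020952.4)"].tolist())
```

<p>If the headers are cleaned this way, every later reference to the column has to use the stripped name as well.</p>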

In [20]:
encoder = CohortEncoder(df=dft, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        variant_mapper=varMapper,
                        metadata=metadata,
                        pmid=PMID)
omim_id = "OMIM:620224"
omim_label = "Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures"
disease = Disease(disease_id=omim_id, disease_label=omim_label)
encoder.set_disease(disease=disease)

In [21]:
individuals = encoder.get_individuals()

<h2>Validate</h2>
<p>pyphetools offers a quick validation that phenopackets contain a minimum number of variants and HPO terms.
We recommend additional validation with <a href="https://github.com/phenopackets/phenopacket-tools">phenopacket-tools</a>.</p>
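
<p>To make the idea of a minimum-content check concrete, here is a rough sketch that counts HPO terms and genomic interpretations in a phenopacket represented as a plain dictionary (the shape produced by <code>MessageToDict</code>). This is not how <code>ContentValidator</code> is implemented; <code>count_content</code> and <code>toy_packet</code> are hypothetical names, and the packet is a made-up stub rather than real cohort data.</p>

```python
def count_content(packet: dict) -> tuple:
    """Return (number of phenotypic features, number of genomic interpretations)."""
    n_hpo = len(packet.get("phenotypicFeatures", []))
    n_var = sum(
        len(interp.get("diagnosis", {}).get("genomicInterpretations", []))
        for interp in packet.get("interpretations", [])
    )
    return n_hpo, n_var


# Made-up stub using GA4GH Phenopacket Schema v2 JSON field names
toy_packet = {
    "id": "toy_individual",
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001252", "label": "Hypotonia"}},
    ],
    "interpretations": [
        {"diagnosis": {"genomicInterpretations": [
            {"subjectOrBiosampleId": "toy_individual"},
        ]}},
    ],
}

n_hpo, n_var = count_content(toy_packet)
# Thresholds of 1 mirror the min_hpo=1 and min_var=1 arguments used in this notebook
print(n_hpo >= 1 and n_var >= 1)
```

<p>A check of this kind only confirms that content is present; schema-level validation is what phenopacket-tools is for.</p>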

In [22]:
cvalidator = ContentValidator(min_var=1, min_hpo=1)
errors = cvalidator.validate_phenopacket_list([i.to_ga4gh_phenopacket(metadata.to_ga4gh()) for i in individuals])
print(f"We found {len(errors)} validation errors")

We found 0 validation errors


<h2>Visualization</h2>
<p>pyphetools can output summary tables of the main data contained in the cohort.</p>

In [23]:
from IPython.display import HTML, display
phenopackets = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individuals]
table = PhenopacketTable(phenopacket_list=phenopackets)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
1 (MALE; P16Y),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, severe (HP:0010864); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Autistic behavior (HP:0000729); Typical absence seizure (HP:0011147); Hypotonia (HP:0001252); Bulbous nose (HP:0000414); Cervical spinal canal stenosis (HP:0008445); Chiari type I malformation (HP:0007099)"
2 (MALE; P4Y9M),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, moderate (HP:0002342); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Infantile spasms (HP:0012469); Hypotonia (HP:0001252)"
3 (FEMALE; P6Y),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, moderate (HP:0002342); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Autistic behavior (HP:0000729); Bilateral tonic-clonic seizure (HP:0002069); Hypotonia (HP:0001252)"
4 (MALE; P5Y11M),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, severe (HP:0010864); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Autistic behavior (HP:0000729); Status epilepticus (HP:0002133); Hypotonia (HP:0001252)"
5 (MALE; P6Y3M),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, severe (HP:0010864); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Autistic behavior (HP:0000729); Hypotonia (HP:0001252)"
6 (MALE; P28Y),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, severe (HP:0010864); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Bilateral tonic-clonic seizure (HP:0002069); Typical absence seizure (HP:0011147)"
7 (MALE; P38Y),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2509G>A (heterozygous),"Intellectual disability, moderate (HP:0002342); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Typical absence seizure (HP:0011147); Hypotonia (HP:0001252); Bulbous nose (HP:0000414)"
8 (FEMALE; P8Y1M),"Neurodevelopmental disorder with hypotonia, dysmorphic facies, and skeletal anomalies, with or without seizures (OMIM:620224)",NM_020952.6:c.2810C>A (heterozygous),"Intellectual disability, moderate (HP:0002342); Delayed ability to walk (HP:0031936); Delayed speech and language development (HP:0000750); Typical absence seizure (HP:0011147); Hypotonia (HP:0001252)"


<h2>Output</h2>

In [23]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                             metadata=metadata.to_ga4gh(),
                                             pmid=PMID,
                                             outdir=output_directory)

We output 8 GA4GH phenopackets to the directory phenopackets
