<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/29907796/" target="__blank">Diets et al. (2019) A recurrent de novo missense pathogenic variant in SMARCB1 causes severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus</a>.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
import math
from csv import DictReader
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import re
from IPython.display import display, HTML, JSON
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.8.30


<h2>Importing HPO data</h2>

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
PMID = "PMID:29907796"
title = "A recurrent de novo missense pathogenic variant in SMARCB1 causes severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus"
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155", pmid=PMID, pubmed_title=title)
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


<h2>Importing the supplemental table</h2>
<p>The table from the Diets et al. (2019) paper was copied into an Excel file, which is read from the input subfolder.</p>

In [3]:
df = pd.read_excel('input/PMID_29907796.xlsx')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,Patient 1,Patient 2,Patient 3,Patient 4,Count features
0,Pathogenic variant,c.110G>A,c.110G>A,c.110G>A,c.110G>A,
1,,p.Arg37His,p.Arg37His,p.Arg37His,p.Arg37His,
2,Inheritance,De novo,De novo,De novo,De novo,
3,Age at examination,9.5 y,5 y 8 mo,12 y,17 mo,
4,Development,,,,,
5,Intellectual disability,Severe,Severe,Severe,Severe,4/4 (100%)
6,Speech delay,Severe,Severe,Severe,Severe,4/4 (100%)
7,Motor delay,Severe,Severe,Severe,Severe,4/4 (100%)
8,Congenital anomalies,,,,,
9,Congenital heart defect,-,+,-,+,2/4 (50%)


<h2>Converting to row-based format</h2>

In [5]:
dft = df.transpose()
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft = dft[dft.index.astype(str).str.contains('Patient')]
dft['patient_id'] = dft.index
dft.head()

Unnamed: 0,Pathogenic variant,NaN,Inheritance,Age at examination,Development,Intellectual disability,Speech delay,Motor delay,Congenital anomalies,Congenital heart defect,...,Brachycephaly,Joint hypermobility,Hip dysplasia,Contractures,Other,Obstructive sleep apnea,Precocious puberty,(History of) anemia,Thrombocytopenia,patient_id
Patient 1,c.110G>A,p.Arg37His,De novo,9.5 y,,Severe,Severe,Severe,,-,...,+,-,-,+,,+,-,+,-,Patient 1
Patient 2,c.110G>A,p.Arg37His,De novo,5 y 8 mo,,Severe,Severe,Severe,,+,...,-,+,-,-,,+,+,+,-,Patient 2
Patient 3,c.110G>A,p.Arg37His,De novo,12 y,,Severe,Severe,Severe,,-,...,-,+,-,+,,+,-,+,-,Patient 3
Patient 4,c.110G>A,p.Arg37His,De novo,17 mo,,Severe,Severe,Severe,,+,...,+,+,+,-,,+,,-,+,Patient 4


Some column names may have leading or trailing spaces, and a couple of columns are subheadings that contain only NaNs, so let's correct that:

In [6]:
dft.columns = dft.columns.str.strip()
dft = dft.dropna(axis=1, how='all')
dft['Intellectual disability, severe'] = dft['Intellectual disability']
del dft['Intellectual disability']
dft.head()

Unnamed: 0,Pathogenic variant,NaN,Inheritance,Age at examination,Speech delay,Motor delay,Congenital heart defect,Laryngomalacia,Kidney anomalies,Genital anomalies,...,Brachycephaly,Joint hypermobility,Hip dysplasia,Contractures,Obstructive sleep apnea,Precocious puberty,(History of) anemia,Thrombocytopenia,patient_id,"Intellectual disability, severe"
Patient 1,c.110G>A,p.Arg37His,De novo,9.5 y,Severe,Severe,-,-,-,+,...,+,-,-,+,+,-,+,-,Patient 1,Severe
Patient 2,c.110G>A,p.Arg37His,De novo,5 y 8 mo,Severe,Severe,+,-,+,-,...,-,+,-,-,+,+,+,-,Patient 2,Severe
Patient 3,c.110G>A,p.Arg37His,De novo,12 y,Severe,Severe,-,-,+,+,...,-,+,-,+,+,-,+,-,Patient 3,Severe
Patient 4,c.110G>A,p.Arg37His,De novo,17 mo,Severe,Severe,+,+,+,-,...,+,+,+,-,+,,-,+,Patient 4,Severe
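
The conversion above can be tried on a toy table. The following is a minimal sketch, assuming a small made-up DataFrame (the column name "Feature" and all values are illustrative and not taken from the paper). It uses set_index before transposing, which is an alternative to assigning the first row as the header.

```python
import pandas as pd

# Toy table in the same layout as the supplemental table: one column per individual.
# All names and values here are illustrative assumptions.
raw = pd.DataFrame({
    "Feature": ["Inheritance", "Development", " Speech delay ", "Congenital heart defect"],
    "Patient 1": ["De novo", None, "Severe", "-"],
    "Patient 2": ["De novo", None, "Severe", "+"],
    "Count features": [None, None, "2/2 (100%)", "1/2 (50%)"],
})

# Use the feature names as the index, then transpose so each individual becomes a row.
by_row = raw.set_index("Feature").transpose()
# Keep only the rows that describe individuals (drops the "Count features" summary).
by_row = by_row[by_row.index.astype(str).str.contains("Patient")].copy()
# Strip stray whitespace from the column labels.
by_row.columns = by_row.columns.str.strip()
# Drop subheading columns such as "Development" that hold no values.
by_row = by_row.dropna(axis=1, how="all")
# Store the identifier in a regular column as well.
by_row["patient_id"] = by_row.index

print(by_row)
```

Taking a copy after filtering means the later column assignment works on an independent DataFrame rather than on a slice of the transposed table.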


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cells we create a dictionary of ColumnMappers. The code is the same as in that notebook, except that we use dft.loc to select the relevant columns of the transposed table.</p>

In [7]:
hpo_cr = parser.get_hpo_concept_recognizer()
generator = SimpleColumnMapperGenerator(df=dft.loc[:,['Intellectual disability, severe','Speech delay', 'Motor delay']],
                                                    observed='Severe',
                                                    excluded='-',
                                                    hpo_cr=hpo_cr)

column_mapper_d = generator.try_mapping_columns()
from IPython.display import display, HTML
display(HTML(generator.to_html()))

Result,Columns
Mapped,"Intellectual disability, severe; Speech delay; Motor delay"
Unmapped,


In [8]:
generator_plus = SimpleColumnMapperGenerator(df=dft.loc[:,'Congenital heart defect': 'patient_id'],
                                                    observed='+',
                                                    excluded='-',
                                                    hpo_cr=hpo_cr)

column_mapper_d_plus = generator_plus.try_mapping_columns()
display(HTML(generator_plus.to_html()))

Result,Columns
Mapped,Congenital heart defect; Laryngomalacia; Genital anomalies; Hydrocephalus; Myopia; Hearing loss; Brachycephaly; Joint hypermobility; Hip dysplasia; Obstructive sleep apnea; Precocious puberty; Thrombocytopenia
Unmapped,Kidney anomalies; Childhood hypotonia; CVI; Eye movement disorder; Other eye problems; Contractures; (History of) anemia; patient_id
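
To make the role of the observed and excluded arguments concrete, here is a minimal sketch of how a cell value could be interpreted. The function classify_cell is hypothetical and is not part of pyphetools; it only illustrates the idea that '+' marks an observed feature, '-' marks an explicitly excluded one, and anything else (for example an empty cell) carries no information.

```python
from typing import Optional


def classify_cell(cell: Optional[str], observed: str = "+", excluded: str = "-") -> Optional[str]:
    """Hypothetical helper: return 'observed', 'excluded', or None for a table cell."""
    if cell is None:
        return None
    value = str(cell).strip()
    if value == observed:
        return "observed"
    if value == excluded:
        return "excluded"
    # Anything else (empty cell, NaN, free text) is treated as not assessed.
    return None


# Illustrative row; the values are assumptions, not copied from the paper.
row = {"Congenital heart defect": "-", "Hip dysplasia": "+", "Precocious puberty": None}
status = {feature: classify_cell(cell) for feature, cell in row.items()}
print(status)
```

Passing observed="Severe" would give the behaviour needed for the three development columns, which use 'Severe' rather than '+'.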


In [9]:
correct_hpo_ids = {}
correct_hpo_ids['Speech delay'] = 'HP:0000750'
correct_hpo_ids['Eye movement disorder'] = 'HP:0000496'
correct_hpo_ids['Contractures'] = 'HP:0001371'
correct_hpo_ids['Kidney anomalies'] = 'HP:0000077'
correct_hpo_ids['CVI'] = 'HP:0100704'
correct_hpo_ids['Genital anomalies'] = 'HP:0000078'
correct_hpo_ids['Congenital heart defect'] = 'HP:0001627'
correct_hpo_ids['Hearing loss'] = 'HP:0000365'
correct_hpo_ids['Other eye problems'] = 'HP:0000478'
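# Build a mapper for each manually curated column. 'Speech delay' is coded
# with 'Severe' in the table; all other columns use '+'.
for k, v in correct_hpo_ids.items():
    observed_symbol = 'Severe' if v == 'HP:0000750' else '+'
    t_mapper = SimpleColumnMapper(hpo_id=v,
                                  hpo_label=hpo_cr._id_to_primary_label[v],
                                  observed=observed_symbol,
                                  excluded='-')
    # Uncomment to inspect the mapping for a column:
    # print(t_mapper.preview_column(dft[k]))
    column_mapper_d[k] = t_mapper
# Note: only column_mapper_d is passed to the CohortEncoder below;
# column_mapper_d_plus is not merged into it in this notebook.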

<h2>Variant Data</h2>

In [10]:
genome = 'hg38'
default_genotype = 'heterozygous'
SMARCB1_transcript='NM_003073.3'
variant_list = dft['Pathogenic variant'].unique()
print(variant_list)
vvalidator = VariantValidator(genome_build=genome, transcript=SMARCB1_transcript)
# There is just one variant in this dataset
SMARCB1_var = vvalidator.encode_hgvs('c.110G>A')
print(SMARCB1_var)
variant_d = {'c.110G>A' : SMARCB1_var}

['c.110G>A']
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003073.3%3Ac.110G>A/NM_003073.3?content-type=application%2Fjson
chr22:23791772G>A


In [11]:
varMapper = VariantColumnMapper(variant_d=variant_d, variant_column_name='Pathogenic variant', default_genotype=default_genotype)

<h2>Demographic data</h2>

In [12]:
import numpy as np
# Ages are reported in mixed units ('9.5 y', '5 y 8 mo', '12 y', '17 mo'),
# so we enter them manually, rounded to whole years
dft['Age at examination'] = np.array([10, 6, 12, 1], dtype=int)
ageMapper = AgeColumnMapper.by_year('Age at examination')
ageMapper.preview_column(dft['Age at examination'])

Unnamed: 0,original column contents,age
0,10,P10Y
1,6,P6Y
2,12,P12Y
3,1,P1Y
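
Rounding to whole years loses the month information in the original table. As an alternative, the following minimal sketch converts the original age strings to ISO 8601 durations that keep the months. The function age_to_iso8601 is a hypothetical helper written for this illustration and is not part of pyphetools; it assumes ages are written with "y" for years and "mo" for months, as in this table.

```python
import re


def age_to_iso8601(text: str) -> str:
    """Hypothetical helper: convert '9.5 y', '5 y 8 mo' or '17 mo' to an ISO 8601 duration."""
    years_match = re.search(r"(\d+(?:\.\d+)?)\s*y", text)
    months_match = re.search(r"(\d+)\s*mo", text)
    if years_match is None and months_match is None:
        raise ValueError(f"Could not parse age: {text!r}")
    total_months = 0
    if years_match:
        # Fractional years (e.g. 9.5) are converted to the nearest whole month.
        total_months += round(float(years_match.group(1)) * 12)
    if months_match:
        total_months += int(months_match.group(1))
    years, months = divmod(total_months, 12)
    duration = "P"
    if years:
        duration += f"{years}Y"
    if months:
        duration += f"{months}M"
    # An age of zero is written as zero days.
    return duration if duration != "P" else "P0D"


for age in ["9.5 y", "5 y 8 mo", "12 y", "17 mo"]:
    print(age, "->", age_to_iso8601(age))
```

The resulting strings would still need to be passed to a mapper that accepts ISO 8601 durations; which pyphetools mapper to use for that is not covered here.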


In [13]:
# The table has no sex column; all individuals in this paper were female, so we add the column manually
dft['Sex'] = ['Female'] * 4
sexMapper = SexColumnMapper(male_symbol='Male', female_symbol='Female', column_name='Sex')
sexMapper.preview_column(dft['Sex'])

Unnamed: 0,original column contents,sex
0,Female,FEMALE
1,Female,FEMALE
2,Female,FEMALE
3,Female,FEMALE


In [14]:
encoder = CohortEncoder(df=dft, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        variant_mapper=varMapper, 
                        metadata=metadata,
                        pmid=PMID)

<h2>Disease diagnosis</h2>
<p>The authors name the diagnosis "severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus". This
diagnosis is not available in OMIM or MONDO, so we use a placeholder identifier formed by appending "a" to the OMIM id of the SMARCB1 gene:
OMIM:601607a.</p>

In [15]:
custom_disease_label ='severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus'
custom_disease_id = 'OMIM:601607a'
custom_disease = Disease(disease_id=custom_disease_id, disease_label=custom_disease_label)
encoder.set_disease(disease=custom_disease)

In [16]:
individuals = encoder.get_individuals()
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.MONO_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator, ontology=hpo_ontology)
display(HTML(qc.to_summary_html()))

Level,Error category,Count
WARNING,REDUNDANT,4


In [17]:
individuals = cvalidator.get_error_free_individual_list()
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
Patient 1 (FEMALE; P10Y),severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus (OMIM:601607a),NM_003073.3:c.110G>A (heterozygous),Intellectual disability (HP:0001249); Delayed speech and language development (HP:0000750); Motor delay (HP:0001270); Flexion contracture (HP:0001371); Cerebral visual impairment (HP:0100704); Abnormality of the genital system (HP:0000078); excluded: Abnormality of eye movement (HP:0000496); excluded: Abnormality of the kidney (HP:0000077); excluded: Abnormal heart morphology (HP:0001627); excluded: Hearing impairment (HP:0000365)
Patient 2 (FEMALE; P6Y),severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus (OMIM:601607a),NM_003073.3:c.110G>A (heterozygous),Intellectual disability (HP:0001249); Delayed speech and language development (HP:0000750); Motor delay (HP:0001270); Abnormality of eye movement (HP:0000496); Abnormality of the kidney (HP:0000077); Cerebral visual impairment (HP:0100704); Abnormal heart morphology (HP:0001627); Hearing impairment (HP:0000365); excluded: Flexion contracture (HP:0001371); excluded: Abnormality of the genital system (HP:0000078)
Patient 3 (FEMALE; P12Y),severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus (OMIM:601607a),NM_003073.3:c.110G>A (heterozygous),Intellectual disability (HP:0001249); Delayed speech and language development (HP:0000750); Motor delay (HP:0001270); Abnormality of eye movement (HP:0000496); Flexion contracture (HP:0001371); Abnormality of the kidney (HP:0000077); Abnormality of the genital system (HP:0000078); excluded: Cerebral visual impairment (HP:0100704); excluded: Abnormal heart morphology (HP:0001627); excluded: Hearing impairment (HP:0000365)
Patient 4 (FEMALE; P1Y),severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus (OMIM:601607a),NM_003073.3:c.110G>A (heterozygous),Intellectual disability (HP:0001249); Delayed speech and language development (HP:0000750); Motor delay (HP:0001270); Abnormality of eye movement (HP:0000496); Abnormality of the kidney (HP:0000077); Cerebral visual impairment (HP:0100704); Abnormal heart morphology (HP:0001627); Hearing impairment (HP:0000365); excluded: Flexion contracture (HP:0001371); excluded: Abnormality of the genital system (HP:0000078)


In [18]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                              metadata=metadata,
                                              outdir=output_directory)

We output 4 GA4GH phenopackets to the directory phenopackets
