<h1>Example 2: Creation of phenopackets from supplemental tables (individuals in columns)</h1>
<p>The Python Phenopackets Tools (pyphetools) library offers several functions for the creation, validation, and display of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a>.</p>
<p>In this notebook, we show how to create phenopackets from Table 1 of <a href="https://pubmed.ncbi.nlm.nih.gov/30945334/" target="__blank">Iwasawa S., et al. (2019) Recurrent de novo MAPK8IP3 variants cause neurological phenotypes</a>.</p>
<p>See <a href="https://github.com/monarch-initiative/pyphetools/blob/main/notebooks/Example_1_TRPM3_PMID_31278393.ipynb">Example 1</a> for additional explanations.</p>
<p>The main difference between the two examples is the layout of the source table: Example 1 lists individuals in rows, whereas this example lists them in columns.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
from IPython.display import display, HTML
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import pyphetools
from pyphetools.creation import *
from pyphetools.validation import CohortValidator
from pyphetools.visualization import *

print(f"Using pyphetools version {pyphetools.__version__}")

Using pyphetools version 0.9.77




<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept an hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. The download is skipped if the file is already present.</p>

In [2]:
parser = HpoParser()
hp_ontology = parser.get_ontology()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
PMID = "PMID:30945334"
title = "Recurrent de novo MAPK8IP3 variants cause neurological phenotypes"
cite = Citation(pmid=PMID, title=title)
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199", citation=cite)
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>Table 1 of the Iwasawa et al. (2019) paper was copied into an Excel file that is included in the data subfolder.</p>
<p>Here, we use the pandas library to import this file (note that the Python package called openpyxl must be installed to read Excel files with pandas, although the library does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input. Users can choose any input format supported by pandas, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>
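<p>As a minimal sketch of the last point, the following cell reads a small tab-separated table from an in-memory string instead of an Excel file. The table contents and the names <tt>tsv_text</tt> and <tt>df_tsv</tt> are invented for illustration; the only library call used is the standard pandas function <tt>pd.read_csv</tt> with a tab separator.</p>

```python
import io

import pandas as pd

# Toy tab-separated table with individuals in columns (illustrative values only)
tsv_text = (
    "identifier\tIndividual 1\tIndividual 2\n"
    "Sex\tMale\tFemale\n"
    "Age (yr)\t29\t27\n"
)

# Any route that ends in a pandas DataFrame can serve as input
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tsv.shape)
print(list(df_tsv.columns))
```

<p>For a file on disk, the same call would take a file path in place of the <tt>io.StringIO</tt> object.</p>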

In [3]:
df = pd.read_excel('data/PMID_30945334.xlsx')

In [4]:
df.head(2)

Unnamed: 0,identifier,Individual 1,Individual 2,Individual 3,Individual 4,Individual 5
0,"Variant (hg19, NM_015133.4)",c.1732C>T,c.1732C>T,c.1732C>T,c.3436C>T,c.3436C>T
1,Protein variant,(p.Arg578Cys),(p.Arg578Cys),(p.Arg578Cys),(p.Arg1146Cys),(p.Arg1146Cys)


<h1>Converting to row-based format</h1>
<p>To use pyphetools, we need to have the individuals represented as rows (one row per individual) and have the items of interest be encoded as column names. The required transformations for doing this may be different for different input data, but often we will want to transpose the table (using the pandas <tt>transpose</tt> function) and set the column names of the new table to the zero-th row. After this, we drop the zero-th row (otherwise, it will be interpreted as an individual by the pyphetools code).</p>
<p>After this step is completed, the remaining steps to create phenopackets are the same as in the 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>.</p>
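<p>The reshaping steps described above can be tried on a toy table first. This is only a sketch: the values and the names <tt>toy</tt> and <tt>toy_t</tt> are made up, and real tables may need extra cleaning before or after transposition.</p>

```python
import pandas as pd

# Toy column-based table (illustrative values): one column per individual
toy = pd.DataFrame({
    "identifier": ["Sex", "Age (yr)", "Short stature"],
    "Individual 1": ["Male", 29, "+"],
    "Individual 2": ["Female", 27, "+"],
})

toy_t = toy.transpose()              # individuals become rows
toy_t.columns = toy_t.iloc[0]        # promote the former 'identifier' column to the header
toy_t = toy_t.drop(toy_t.index[0])   # remove the row that now duplicates the header

print(list(toy_t.index))
print(list(toy_t.columns))
```

<p>Because each original column mixes text and numbers, the transposed columns have the generic <tt>object</tt> dtype, so numeric values may need explicit conversion later.</p>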

In [5]:
dft = df.transpose()

dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft.head(2)

identifier,"Variant (hg19, NM_015133.4)",Protein variant,Age (yr),Sex,Gestational ages (weeks),Delayed motor development,Age at head control (months),Age at rolling (months),Age at unsupported sitting (months),Age at crawling (months),...,Corpus callosum hypoplasia,Facial dysmorphism,Round face,Prominent nasal bridge,Thin upper lip,Others,Other,Short stature,Obesity,Precocious puberty
Individual 1,c.1732C>T,(p.Arg578Cys),29,Male,39,+,2.5,ND,7,468,...,++,,+,−,+,,,+,+,+
Individual 2,c.1732C>T,(p.Arg578Cys),27,Female,40,+,3.5,11,6,11,...,++,,+,−,+,,,+,+,+


<h2>Index vs. normal column</h2>
<p>Another thing to look out for is whether the individual identifiers (usually the first column) are stored as the index of the table or as an ordinary column.</p>
<p>If they are stored as the index, as is the case here after transposing, it is easiest to create a new column with the contents of the index, which pyphetools can then use. An example follows; we can now use 'individual_id' as the column name.</p>

In [6]:
dft.index
dft['individual_id'] = dft.index
dft.head(2)

identifier,"Variant (hg19, NM_015133.4)",Protein variant,Age (yr),Sex,Gestational ages (weeks),Delayed motor development,Age at head control (months),Age at rolling (months),Age at unsupported sitting (months),Age at crawling (months),...,Facial dysmorphism,Round face,Prominent nasal bridge,Thin upper lip,Others,Other,Short stature,Obesity,Precocious puberty,individual_id
Individual 1,c.1732C>T,(p.Arg578Cys),29,Male,39,+,2.5,ND,7,468,...,,+,−,+,,,+,+,+,Individual 1
Individual 2,c.1732C>T,(p.Arg578Cys),27,Female,40,+,3.5,11,6,11,...,,+,−,+,,,+,+,+,Individual 2


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a list to collect the ColumnMapper objects. Note that the code is identical except that we use the df.loc function to get the corresponding row data.</p>

In [7]:
column_mapper_list = list()

In [8]:
delayedMotorMapper = SimpleColumnMapper(column_name='Delayed motor development',
                                        hpo_id='HP:0001270',hpo_label='Motor delay',observed='+', excluded='-')
column_mapper_list.append(delayedMotorMapper)
delayedMotorMapper.preview_column(dft)

Unnamed: 0,mapping,count
0,"original value: ""+"" -> HP: Motor delay (HP:0001270) (observed)",5


<h2>ThresholdedColumnMapper</h2>
<p>Use this mapper for phenotypic features that are reported as ages (numbers). For instance, 
if "Age at head control (months)" is over 4 months, we would call <a href="https://hpo.jax.org/app/browse/term/HP:0032988" target="__blank">
Persistent head lag HP:0032988</a>.</p>
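<p>The idea behind thresholding can be sketched in plain Python. This is an illustration of the concept, not the pyphetools implementation: the function names are hypothetical, the strictly-greater comparison is an assumption based on the wording "over 4 months", and treating every other cell as "not measured" is an assumption that follows the two categories shown by the mapper preview when only a high term is supplied.</p>

```python
from typing import Optional


def parse_months(cell: object) -> Optional[float]:
    """Return the cell as a number of months, or None if it is not numeric (e.g. 'ND')."""
    try:
        return float(str(cell).strip())
    except ValueError:
        return None


def classify(cell: object, threshold_high: float) -> str:
    """Call the 'high' term only when a numeric value exceeds the threshold."""
    months = parse_months(cell)
    if months is not None and months > threshold_high:
        return "observed"
    # Non-numeric cells and values at or below the threshold carry no call here
    return "not measured"


cells = [2.5, 3.5, "ND", 5, "7"]
calls = [classify(c, threshold_high=4) for c in cells]
print(calls)
```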



In [9]:
persistentHL = HpTerm(hpo_id="HP:0032988", label="Persistent head lag")
headLag = Thresholder(hpo_term_high=persistentHL, threshold_high=4, unit="month")
headLagMapper = ThresholdedColumnMapper(column_name="Age at head control (months)",
                                        thresholder=headLag)
column_mapper_list.append(headLagMapper)
headLagMapper.preview_column(dft)

Unnamed: 0,mapping: None-4.0 month,count
0,Persistent head lag (HP:0032988): not measured,3
1,Persistent head lag (HP:0032988): observed,2


<p>Here is another example: <a href="https://hpo.jax.org/app/browse/term/HP:0032989" target="__blank">
Delayed ability to roll over (HP:0032989)</a>. We will use a threshold of 6 months.</p>

In [10]:
rollOver = HpTerm(hpo_id="HP:0032989", label="Delayed ability to roll over")
rot = Thresholder(hpo_term_high=rollOver, threshold_high=6, unit="month")
rollOverMapper = ThresholdedColumnMapper(column_name="Age at rolling (months)", thresholder=rot)
column_mapper_list.append(rollOverMapper)
rollOverMapper.preview_column(dft)

Unnamed: 0,mapping: None-6.0 month,count
0,Delayed ability to roll over (HP:0032989): not measured,3
1,Delayed ability to roll over (HP:0032989): observed,2


<h2>ThresholdedColumnMapper - special code</h2>
<p>In some cases, phrases such as 'not attained' are used to denote that a child has not attained a certain milestone at the
time of last examination, and this constitutes an abnormal finding. In this case, the optional argument <tt>observed_code</tt> should be used.</p>
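<p>A plain-Python sketch of how such a special code could interact with a numeric threshold is shown below. The function <tt>classify_milestone</tt> is hypothetical and is not part of pyphetools; the case-insensitive comparison and the strictly-greater test are assumptions made for this illustration.</p>

```python
from typing import Optional


def classify_milestone(cell: object, threshold_high: float,
                       observed_code: Optional[str] = None) -> str:
    """Classify a milestone age, treating one special phrase as an abnormal finding."""
    text = str(cell).strip()
    # The special phrase counts as abnormal even though it is not a number
    if observed_code is not None and text.lower() == observed_code.lower():
        return "observed"
    try:
        months = float(text)
    except ValueError:
        return "not measured"
    return "observed" if months > threshold_high else "not measured"


cells = [7, "12", "not attained", "ND"]
results = [classify_milestone(c, threshold_high=9, observed_code="not attained")
           for c in cells]
print(results)
```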

In [11]:
# Age at unsupported sitting (months); threshold: 9 months
delayedSit = HpTerm(hpo_id="HP:0025336", label="Delayed ability to sit")
sitThreshold = Thresholder(hpo_term_high=delayedSit, threshold_high=9, unit="month")
delayedSittingMapper =  ThresholdedColumnMapper(column_name="Age at unsupported sitting (months)", thresholder=sitThreshold)
column_mapper_list.append(delayedSittingMapper)
delayedSittingMapper.preview_column(dft)

Unnamed: 0,mapping: None-9.0 month,count
0,Delayed ability to sit (HP:0025336): not measured,2
1,Delayed ability to sit (HP:0025336): observed,3


In [12]:
# Age at walking (months) - 15 months -- Delayed ability to walk HP:0031936
delayedWalk = HpTerm(hpo_id="HP:0031936", label="Delayed ability to walk")
walkTh = Thresholder(hpo_term_high=delayedWalk, threshold_high=15, unit="month")
delayedWalkingMapper =  ThresholdedColumnMapper( column_name="Age at walking (months)", thresholder=walkTh)
column_mapper_list.append(delayedWalkingMapper)
delayedWalkingMapper.preview_column(dft)

Unnamed: 0,mapping: None-15.0 month,count
0,Delayed ability to walk (HP:0031936): observed,5


<h2>Other columns</h2>
<p>The following "simple" columns are created in a loop for simplicity. The key represents the words used in the table, and the value is a two-element array with the corresponding HPO label and term id. See 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>
    for more information.</p>
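<p>The logic of a symbol-based column can be sketched without pyphetools. The helper names below are hypothetical. The sketch also normalizes the typographic minus sign (U+2212), which tables copied from publications may contain and which does not compare equal to the ASCII hyphen; whether a given table needs this step is an assumption to verify against the data.</p>

```python
from typing import Dict, List

UNICODE_MINUS = "\u2212"  # typographic minus, distinct from the ASCII hyphen-minus "-"


def normalize_symbol(cell: object) -> str:
    """Strip whitespace and map the typographic minus to an ASCII hyphen."""
    return str(cell).strip().replace(UNICODE_MINUS, "-")


def call_simple_columns(row: Dict[str, object], items: Dict[str, List[str]],
                        observed: str = "+", excluded: str = "-") -> Dict[str, str]:
    """Map each table column to an observed/excluded/not measured call."""
    calls = {}
    for column_name, (label, hpo_id) in items.items():
        symbol = normalize_symbol(row.get(column_name, ""))
        if symbol == observed:
            status = "observed"
        elif symbol == excluded:
            status = "excluded"
        else:
            status = "not measured"
        calls[f"{label} ({hpo_id})"] = status
    return calls


# Keys are the words used in the table; values are [HPO label, HPO id]
items = {
    "Round face": ["Round face", "HP:0000311"],
    "Prominent nasal bridge": ["Prominent nasal bridge", "HP:0000426"],
    "Thin upper lip": ["Thin upper lip vermilion", "HP:0000219"],
}
row = {"Round face": "+", "Prominent nasal bridge": "\u2212", "Thin upper lip": "ND"}
calls = call_simple_columns(row, items)
print(calls)
```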

In [13]:
items = {
    'History of regression': ["Developmental regression","HP:0002376"],
    'Spastic diplegia': ['Spastic diplegia', 'HP:0001264'],
    'Autistic behavior': ['Autistic behavior', 'HP:0000729'],
    'Infantile hypotonia': ['Infantile muscular hypotonia', 'HP:0008947'],
    'Cerebral atrophy': ["Cerebral atrophy", "HP:0002059"],
    'Delayed myelination': ["Delayed CNS myelination", "HP:0002188"],
    'Corpus callosum hypoplasia': ['Hypoplasia of the corpus callosum', 'HP:0002079'],
    'Prominent nasal bridge': ['Prominent nasal bridge', 'HP:0000426'],
    'Thin upper lip':["Thin upper lip vermilion","HP:0000219"],
    "Round face":["Round face","HP:0000311"],
    "Short stature":["Short stature","HP:0004322"],
    "Obesity":["Obesity", "HP:0001513"],
    "Precocious puberty":["Precocious puberty", "HP:0000826"],
}
item_column_mapper_d = hpo_cr.initialize_simple_column_maps(column_name_to_hpo_label_map=items, observed='+',
    excluded='-')
print(f"We created {len(item_column_mapper_d)} simple column mappers")
# Transfer the new mappers to column_mapper_list
for k, v in item_column_mapper_d.items():
    column_mapper_list.append(v)

We created 13 simple column mappers


In [14]:
severity_d = {'Severe': 'Intellectual disability, severe',
             'Profound': 'Intellectual disability, profound'}
idMapper = OptionColumnMapper(column_name='Intellectual disability',
                              concept_recognizer=hpo_cr, option_d=severity_d)
column_mapper_list.append(idMapper)
idMapper.preview_column(dft)

Unnamed: 0,mapping,count
0,"Intellectual disability, severe (HP:0010864) (observed)",4
1,"Intellectual disability, profound (HP:0002187) (observed)",1


In [15]:
# Language skills
language_d = {'Simple two-word sentences': 'Delayed speech and language development',
             'Simple words': 'Delayed speech and language development',
             'Nonverbal': 'Absent speech'}
languageMapper = OptionColumnMapper(column_name='Language skills',concept_recognizer=hpo_cr, option_d=language_d)
column_mapper_list.append(languageMapper)
languageMapper.preview_column(dft)

Unnamed: 0,mapping,count
0,Delayed speech and language development (HP:0000750) (observed),3
1,Absent speech (HP:0001344) (observed),2


In [16]:
# Gross motor skills: Wheelchair bound, Wheelchair bound, Wheelchair bound, Cruising (5y), Walking (5y)
gms_d = {
    "Wheelchair bound": "Loss of ambulation",
    "Cruising": "Delayed gross motor development"
}
gmsMapper = OptionColumnMapper(column_name='Gross motor skills',concept_recognizer=hpo_cr, option_d=gms_d)
column_mapper_list.append(gmsMapper)
gmsMapper.preview_column(dft)


Unnamed: 0,mapping,count
0,Loss of ambulation (HP:0002505) (observed),3
1,Delayed gross motor development (HP:0002194) (observed),1


In [17]:
other_d = {
    "Long and thick eyebrows, ": ["Thick eyebrows", "Long eyebrows"],
    "upper slanted palpebral fissures": "Upslanted palpebral fissure",
}
otherMapper = OptionColumnMapper(column_name='Others',concept_recognizer=hpo_cr, option_d=other_d)
column_mapper_list.append(otherMapper)
otherMapper.preview_column(dft)

Unnamed: 0,mapping,count
0,Thick eyebrow (HP:0000574) (observed),1
1,Upslanted palpebral fissure (HP:0000582) (observed),1
2,Anteverted nares (HP:0000463) (observed),1
3,Short philtrum (HP:0000322) (observed),1


<h2>Variant Data</h2>
<p>The variant data (HGVS notation relative to the transcript) is listed in the Variant (hg19, NM_015133.4) column.</p>

In [18]:
MAPK8IP3_transcript='NM_015133.4'
vman = VariantManager(df=dft, individual_column_name="individual_id",transcript=MAPK8IP3_transcript,gene_symbol="MAPK8IP3",
                                     gene_id="HGNC:6884", allele_1_column_name="Variant (hg19, NM_015133.4)")

In [19]:
varMapper = VariantColumnMapper(variant_column_name="Variant (hg19, NM_015133.4)", variant_d=vman.get_variant_d(),default_genotype="heterozygous")
#varMapper.preview_column(dft['Variant (hg19, NM_015133.4)'])

<h1>Demographic data</h1>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>
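<p>Ages in phenopackets are written as ISO 8601 durations, for example <tt>P29Y</tt> for 29 years. The helper below is a hypothetical plain-Python sketch of that conversion for whole years only; it is not the pyphetools code and ignores months and days.</p>

```python
def years_to_iso8601(age_in_years: object) -> str:
    """Convert a whole number of years (int or numeric string) to an ISO 8601 duration."""
    years = int(str(age_in_years).strip())
    if years < 0:
        raise ValueError(f"Age cannot be negative: {age_in_years!r}")
    return f"P{years}Y"


ages = [29, "27", 16, 5]
durations = [years_to_iso8601(a) for a in ages]
print(durations)
```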

In [20]:
ageMapper = AgeColumnMapper.by_year('Age (yr)')
ageMapper.preview_column(dft)

Unnamed: 0,original column contents,age
0,P29Y,29
1,P27Y,27
2,P16Y,16
3,P5Y,5


In [21]:
sexMapper = SexColumnMapper(male_symbol='Male', female_symbol='Female', column_name='Sex')
sexMapper.preview_column(dft)

Unnamed: 0,original column contents,sex
0,Male,MALE
1,Female,FEMALE
2,Male,MALE
3,Male,MALE
4,Female,FEMALE


In [22]:
encoder = CohortEncoder(df=dft, 
                        hpo_cr=hpo_cr, 
                        column_mapper_list=column_mapper_list, 
                        individual_column_name="individual_id", 
                        age_at_last_encounter_mapper=ageMapper, 
                        sexmapper=sexMapper,
                        variant_mapper=varMapper,
                        metadata=metadata,
                        )
disease_id = "OMIM:618443"
disease_label = "Neurodevelopmental disorder with or without variable brain abnormalities"
NEDBA = Disease(disease_id=disease_id, disease_label=disease_label)
encoder.set_disease(disease=NEDBA)

In [23]:
individuals = encoder.get_individuals()

In [24]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "PMID_30945334_Individual_1",
  "subject": {
    "id": "Individual 1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P29Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0001270",
        "label": "Motor delay"
      }
    },
    {
      "type": {
        "id": "HP:0031936",
        "label": "Delayed ability to walk"
      }
    },
    {
      "type": {
        "id": "HP:0001264",
        "label": "Spastic diplegia"
      }
    },
    {
      "type": {
        "id": "HP:0002059",
        "label": "Cerebral atrophy"
      }
    },
    {
      "type": {
        "id": "HP:0002188",
        "label": "Delayed CNS myelination"
      }
    },
    {
      "type": {
        "id": "HP:0000219",
        "label": "Thin upper lip vermilion"
      }
    },
    {
      "type": {
        "id": "HP:0000311",
        "label": "Round face"
      }
    },
    {
      "type": {
        "id": "HP:0004322",
        "

<h2>Validate</h2>
<p>pyphetools offers a quick validation that phenopackets contain a minimum number of variants and HPO terms.
We recommend additional validation with <a href="https://github.com/phenopackets/phenopacket-tools">phenopacket-tools</a>.</p>
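<p>To make the idea of a minimum-content check concrete, here is a hypothetical plain-Python sketch that works on a phenopacket already converted to a dictionary (for instance with <tt>MessageToDict</tt>). It is not how <tt>CohortValidator</tt> is implemented. Counting only non-excluded features and counting one variant per genomic interpretation are assumptions made for this illustration.</p>

```python
from typing import Dict, List


def quick_check(phenopacket: Dict, min_hpo: int = 1, min_variants: int = 1) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    features = phenopacket.get("phenotypicFeatures", [])
    observed = [f for f in features if not f.get("excluded", False)]
    if len(observed) < min_hpo:
        problems.append(f"only {len(observed)} observed HPO term(s), need {min_hpo}")
    # Assumption: each genomic interpretation carries one variant
    n_variants = sum(
        len(interp.get("diagnosis", {}).get("genomicInterpretations", []))
        for interp in phenopacket.get("interpretations", [])
    )
    if n_variants < min_variants:
        problems.append(f"only {n_variants} variant(s), need {min_variants}")
    return problems


# Toy phenopacket-like dictionary with one observed term and no variants
toy_packet = {
    "id": "toy_individual",
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001270", "label": "Motor delay"}},
        {"type": {"id": "HP:0000426", "label": "Prominent nasal bridge"}, "excluded": True},
    ],
}
problems = quick_check(toy_packet, min_hpo=1, min_variants=1)
print(problems)
```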

In [25]:

cvalidator = CohortValidator(cohort=individuals, ontology=hp_ontology, min_hpo=1,
                                allelic_requirement=AllelicRequirement.MONO_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

Level,Error category,Count
WARNING,REDUNDANT,6
INFORMATION,NOT_MEASURED,37


<h2>Visualization</h2>
<p>pyphetools can output summary tables of the main data contained in the cohort.</p>

In [26]:
individuals = cvalidator.get_error_free_individual_list()
table = IndividualTable(individuals)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
Individual 1 (MALE; P29Y),Neurodevelopmental disorder with or without variable brain abnormalities (OMIM:618443),NM_015133.4:c.1732C>T (heterozygous),"Delayed ability to walk (HP:0031936); Spastic diplegia (HP:0001264); Cerebral atrophy (HP:0002059); Delayed CNS myelination (HP:0002188); Thin upper lip vermilion (HP:0000219); Round face (HP:0000311); Short stature (HP:0004322); Obesity (HP:0001513); Precocious puberty (HP:0000826); Intellectual disability, severe (HP:0010864); Delayed speech and language development (HP:0000750); Loss of ambulation (HP:0002505)"
Individual 2 (FEMALE; P27Y),Neurodevelopmental disorder with or without variable brain abnormalities (OMIM:618443),NM_015133.4:c.1732C>T (heterozygous),"Delayed ability to roll over (HP:0032989); Delayed ability to walk (HP:0031936); Spastic diplegia (HP:0001264); Cerebral atrophy (HP:0002059); Delayed CNS myelination (HP:0002188); Thin upper lip vermilion (HP:0000219); Round face (HP:0000311); Short stature (HP:0004322); Obesity (HP:0001513); Precocious puberty (HP:0000826); Intellectual disability, severe (HP:0010864); Delayed speech and language development (HP:0000750); Loss of ambulation (HP:0002505)"
Individual 3 (MALE; P16Y),Neurodevelopmental disorder with or without variable brain abnormalities (OMIM:618443),NM_015133.4:c.1732C>T (heterozygous),"Delayed ability to sit (HP:0025336); Delayed ability to walk (HP:0031936); Spastic diplegia (HP:0001264); Prominent nasal bridge (HP:0000426); Thin upper lip vermilion (HP:0000219); Short stature (HP:0004322); Obesity (HP:0001513); Intellectual disability, profound (HP:0002187); Delayed speech and language development (HP:0000750); Loss of ambulation (HP:0002505)"
Individual 4 (MALE; P5Y),Neurodevelopmental disorder with or without variable brain abnormalities (OMIM:618443),NM_015133.4:c.3436C>T (heterozygous),"Persistent head lag (HP:0032988); Delayed ability to roll over (HP:0032989); Delayed ability to sit (HP:0025336); Delayed ability to walk (HP:0031936); Autistic behavior (HP:0000729); Infantile muscular hypotonia (HP:0008947); Cerebral atrophy (HP:0002059); Prominent nasal bridge (HP:0000426); Thin upper lip vermilion (HP:0000219); Round face (HP:0000311); Short stature (HP:0004322); Intellectual disability, severe (HP:0010864); Absent speech (HP:0001344)"
Individual 5 (FEMALE; P5Y),Neurodevelopmental disorder with or without variable brain abnormalities (OMIM:618443),NM_015133.4:c.3436C>T (heterozygous),"Persistent head lag (HP:0032988); Delayed ability to sit (HP:0025336); Delayed ability to walk (HP:0031936); Spastic diplegia (HP:0001264); Autistic behavior (HP:0000729); Infantile muscular hypotonia (HP:0008947); Cerebral atrophy (HP:0002059); Delayed CNS myelination (HP:0002188); Prominent nasal bridge (HP:0000426); Thin upper lip vermilion (HP:0000219); Round face (HP:0000311); Intellectual disability, severe (HP:0010864); Absent speech (HP:0001344); Thick eyebrow (HP:0000574); Upslanted palpebral fissure (HP:0000582); Anteverted nares (HP:0000463); Short philtrum (HP:0000322)"


<h2>Output</h2>
<p>Finally, we output the five phenopackets to the "phenopackets" folder.</p>

In [27]:
Individual.output_individuals_as_phenopackets(individual_list=individuals, 
                                              metadata=metadata, 
                                              outdir="phenopackets")

We output 5 GA4GH phenopackets to the directory phenopackets
