<h1>Example 2: Creation of phenopackets from supplemental tables (individuals in columns)</h1>
<p>The Python Phenopackets Tools (pyphetools) library offers several functions for the creation, validation, and display of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a>.</p>
<p>See <a href="https://github.com/monarch-initiative/pyphetools/blob/main/notebooks/Example_1_(rowwise_table)_MAPK8IP3.ipynb">Example 1</a> for additional explanations.</p>
<p>Supplemental tables show data about individuals either in rows (see Example 1) or in columns (this example).</p>
<p>We will use <a href="https://pubmed.ncbi.nlm.nih.gov/30945334/" target="__blank">Iwasawa S., et al. (2019) Recurrent de novo MAPK8IP3 variants cause neurological phenotypes</a> as an example.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
from pyphetools.creation import *
import importlib.metadata
__version__ = importlib.metadata.version("pyphetools")
print(f"Using pyphetools version {__version__}")

Using pyphetools version 0.4.8


<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept a hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. It will not download the file again if it has already been downloaded.</p>

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>The table from the Iwasawa et al. (2019) paper was copied into an Excel file that is included in the data subfolder.</p>
<p>Here, we use the pandas library to import this file. The Python package openpyxl must be installed for pandas to read Excel files, although it does not need to be imported in this notebook. pyphetools expects a pandas DataFrame as input, so users can choose any input format that pandas supports, including CSV, TSV, and Excel, or use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>
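<p>As a minimal sketch of the point about input formats, the cell below reads a small CSV string with pandas instead of an Excel file. The CSV content is a made-up stand-in for a supplemental table, not data from the paper.</p>

```python
import io

import pandas as pd

# Hypothetical two-line CSV standing in for a supplemental table
csv_text = "identifier,Individual 1,Individual 2\nSex,Male,Female\n"

# read_csv accepts a file path or any file-like object;
# passing sep="\t" would read a tab-separated (TSV) file instead
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df.shape)  # (1, 3): one data row, three columns
```

<p>Whatever the source format, the result is an ordinary DataFrame that can be processed in the same way as the Excel import in this notebook.</p>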

In [3]:
df = pd.read_excel('data/PMID_30945334.xlsx')

In [4]:
df

Unnamed: 0,identifier,Individual 1,Individual 2,Individual 3,Individual 4,Individual 5
0,"Variant (hg19, NM_015133.4)",c.1732C>T,c.1732C>T,c.1732C>T,c.3436C>T,c.3436C>T
1,Protein variant,(p.Arg578Cys),(p.Arg578Cys),(p.Arg578Cys),(p.Arg1146Cys),(p.Arg1146Cys)
2,Age (yr),29,27,16,5,5
3,Sex,Male,Female,Male,Male,Female
4,Gestational ages (weeks),39,40,40,36,41
5,Delayed motor development,+,+,+,+,+
6,Age at head control (months),2.5,3.5,4,5,5
7,Age at rolling (months),ND,11,6,7,6
8,Age at unsupported sitting (months),7,6,Not acquired,15,11
9,Age at crawling (months),Not acquired,11,ND,18,18


<h2>Converting to row-based format</h2>
<p>To use pyphetools, we need to have the individuals represented as rows (one row per individual) and have the items of interest be encoded as column names. The required transformations for doing this may be different for different input data, but often we will want to transpose the table (using the pandas <tt>transpose</tt> function) and set the column names of the new table to the zero-th row. After this, we drop the zero-th row (otherwise, it will be interpreted as an individual by the pyphetools code).</p>
<p>After this step is completed, the remaining steps to create phenopackets are the same as in the 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>.</p>
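<p>The transposition described above can be tried on a toy table first. The values below are illustrative only, and the variable names <tt>toy</tt> and <tt>toy_t</tt> are not part of the notebook.</p>

```python
import pandas as pd

# Toy table with individuals in columns (values are illustrative only)
toy = pd.DataFrame({
    "identifier": ["Sex", "Age (yr)"],
    "Individual 1": ["Male", 29],
    "Individual 2": ["Female", 27],
})

toy_t = toy.transpose()              # individuals become rows
toy_t.columns = toy_t.iloc[0]        # the zero-th row holds the item names
toy_t = toy_t.drop(toy_t.index[0])   # drop the row that held the names

print(list(toy_t.columns))  # ['Sex', 'Age (yr)']
print(list(toy_t.index))    # ['Individual 1', 'Individual 2']
```

<p>After these three steps each individual occupies one row, and the individual identifiers form the index of the table.</p>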

In [5]:
dft = df.transpose()

dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft.head()

identifier,"Variant (hg19, NM_015133.4)",Protein variant,Age (yr),Sex,Gestational ages (weeks),Delayed motor development,Age at head control (months),Age at rolling (months),Age at unsupported sitting (months),Age at crawling (months),...,Corpus callosum hypoplasia,Facial dysmorphism,Round face,Prominent nasal bridge,Thin upper lip,Others,Other,Short stature,Obesity,Precocious puberty
Individual 1,c.1732C>T,(p.Arg578Cys),29,Male,39,+,2.5,ND,7,Not acquired,...,++,,+,−,+,,,+,+,+
Individual 2,c.1732C>T,(p.Arg578Cys),27,Female,40,+,3.5,11,6,11,...,++,,+,−,+,,,+,+,+
Individual 3,c.1732C>T,(p.Arg578Cys),16,Male,40,+,4.0,6,Not acquired,ND,...,++,,−,+,+,,,+,+,ND
Individual 4,c.3436C>T,(p.Arg1146Cys),5,Male,36,+,5.0,7,15,18,...,++,,+,+,+,,,+,−,−
Individual 5,c.3436C>T,(p.Arg1146Cys),5,Female,41,+,5.0,6,11,18,...,++,,+,+,+,"Long and thick eyebrows, upper slanted palpebral fissures, anteverted nares, short philtrum",,−,−,−


<h2>Index vs. normal column</h2>
<p>Another thing to look out for is whether the individual identifiers (usually the first column) are stored as the index of the table or as an ordinary column.</p>
<p>After the transposition above, the identifiers are the index. In this case, it is easiest to create a new column with the contents of the index, which works with the pyphetools software. An example follows; afterwards we can use 'patient_id' as the column name.</p>

In [6]:
dft.index
dft['patient_id'] = dft.index
dft.head()

identifier,"Variant (hg19, NM_015133.4)",Protein variant,Age (yr),Sex,Gestational ages (weeks),Delayed motor development,Age at head control (months),Age at rolling (months),Age at unsupported sitting (months),Age at crawling (months),...,Facial dysmorphism,Round face,Prominent nasal bridge,Thin upper lip,Others,Other,Short stature,Obesity,Precocious puberty,patient_id
Individual 1,c.1732C>T,(p.Arg578Cys),29,Male,39,+,2.5,ND,7,Not acquired,...,,+,−,+,,,+,+,+,Individual 1
Individual 2,c.1732C>T,(p.Arg578Cys),27,Female,40,+,3.5,11,6,11,...,,+,−,+,,,+,+,+,Individual 2
Individual 3,c.1732C>T,(p.Arg578Cys),16,Male,40,+,4.0,6,Not acquired,ND,...,,−,+,+,,,+,+,ND,Individual 3
Individual 4,c.3436C>T,(p.Arg1146Cys),5,Male,36,+,5.0,7,15,18,...,,+,+,+,,,+,−,−,Individual 4
Individual 5,c.3436C>T,(p.Arg1146Cys),5,Female,41,+,5.0,6,11,18,...,,+,+,+,"Long and thick eyebrows, upper slanted palpebral fissures, anteverted nares, short philtrum",,−,−,−,Individual 5


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cells we create a dictionary for the ColumnMappers. The code is the same as in that notebook, except that it operates on the transposed table <tt>dft</tt>.</p>

In [7]:
column_mapper_d = defaultdict(ColumnMapper)

In [8]:
delayedMotorMapper = SimpleColumnMapper(hpo_id='HP:0001270',
    hpo_label='Motor delay',
    observed='+',
    excluded='-')
delayedMotorMapper.preview_column(dft['Delayed motor development'])

Unnamed: 0,term,status
0,Motor delay (HP:0001270),observed
1,Motor delay (HP:0001270),observed
2,Motor delay (HP:0001270),observed
3,Motor delay (HP:0001270),observed
4,Motor delay (HP:0001270),observed


In [9]:
column_mapper_d['Delayed motor development'] = delayedMotorMapper
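<p>One detail worth checking with "+"/"-" columns: the table output in this notebook prints some cells with the Unicode minus sign (U+2212), which is a different character from the ASCII hyphen passed as <tt>excluded='-'</tt>. Whether pyphetools normalizes this internally is not shown here, so the following is only a sketch of how cells could be normalized before mapping. The helpers <tt>normalize_minus</tt> and <tt>simple_status</tt> are hypothetical names and are not part of pyphetools.</p>

```python
def normalize_minus(cell):
    """Replace the Unicode minus sign (U+2212) with an ASCII hyphen."""
    if isinstance(cell, str):
        return cell.replace("\u2212", "-").strip()
    return cell


def simple_status(cell, observed="+", excluded="-"):
    """Toy stand-in for a simple +/- mapping (not the pyphetools code)."""
    cell = normalize_minus(cell)
    if cell == observed:
        return "observed"
    if cell == excluded:
        return "excluded"
    return "not measured"


print([simple_status(c) for c in ["+", "\u2212", "ND"]])
# ['observed', 'excluded', 'not measured']
```

<p>On a DataFrame, something like <tt>dft.apply(lambda col: col.map(normalize_minus))</tt> would apply the same normalization to every cell.</p>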

<h2>ThresholdedColumnMapper</h2>
<p>Use this mapper for phenotypic features that are reported as ages (numbers). For instance, 
if "Age at head control (months)" is over 4 months, we would call <a href="https://hpo.jax.org/app/browse/term/HP:0032988" target="__blank">
Persistent head lag HP:0032988</a>.</p>



In [10]:
headLagMapper = ThresholdedColumnMapper(hpo_id="HP:0032988", hpo_label="Persistent head lag", 
                                        threshold=4, call_if_above=True)
headLagMapper.preview_column(dft["Age at head control (months)"])

Unnamed: 0,term,status
0,Persistent head lag (HP:0032988),excluded
1,Persistent head lag (HP:0032988),excluded
2,Persistent head lag (HP:0032988),excluded
3,Persistent head lag (HP:0032988),observed
4,Persistent head lag (HP:0032988),observed


In [11]:
column_mapper_d["Age at head control (months)"] = headLagMapper

<p>Here is another example:  <a href="https://hpo.jax.org/app/browse/term/HP:0032989" target="__blank">
Delayed ability to roll over (HP:0032989)</a>. We will use a threshold of 6 months.</p>

In [12]:
rollOverMappper = ThresholdedColumnMapper(hpo_id="HP:0032989", hpo_label="Delayed ability to roll over", 
                                        threshold=6, call_if_above=True)
rollOverMappper.preview_column(dft["Age at rolling (months)"])

Unnamed: 0,term,status
0,Delayed ability to roll over (HP:0032989),not measured
1,Delayed ability to roll over (HP:0032989),observed
2,Delayed ability to roll over (HP:0032989),excluded
3,Delayed ability to roll over (HP:0032989),observed
4,Delayed ability to roll over (HP:0032989),excluded


In [13]:
column_mapper_d["Age at rolling (months)"] = rollOverMappper

<h2>ThresholdedColumnMapper - special code</h2>
<p>In some cases, phrases such as 'not attained' are used to denote that a child has not attained a certain milestone at the
time of last examination, and this constitutes an abnormal finding. In this case, the optional argument <tt>observed_code</tt> should be used.</p>
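<p>The thresholding rule can be sketched in plain Python. This is a toy model that reproduces the statuses shown in the previews of this notebook, not the pyphetools implementation; the function name <tt>thresholded_status</tt> is hypothetical, and the behaviour for <tt>call_if_above=False</tt> (call the term when the value is below the threshold) is an assumption.</p>

```python
def thresholded_status(cell, threshold, call_if_above=True, observed_code=None):
    """Toy model of a thresholded mapping; not the pyphetools implementation."""
    if observed_code is not None and str(cell).strip() == observed_code:
        return "observed"            # e.g. the milestone was never reached
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return "not measured"        # e.g. 'ND'
    # Assumption: with call_if_above=False the term is called below the threshold
    abnormal = value > threshold if call_if_above else value < threshold
    return "observed" if abnormal else "excluded"


sitting = [7, 6, "Not acquired", 15, 11]
print([thresholded_status(c, 9, observed_code="Not acquired") for c in sitting])
# ['excluded', 'excluded', 'observed', 'observed', 'observed']
```

<p>Note that a value equal to the threshold counts as excluded in this sketch, which matches the preview for "Age at head control (months)", where 4 months is not called with a threshold of 4.</p>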

In [14]:
# Age at unsupported sitting (months) 	threshold: 9 months
delayedSittingMapper =  ThresholdedColumnMapper(hpo_id="HP:0025336", hpo_label="Delayed ability to sit", 
                                        threshold=9, call_if_above=True, observed_code='Not acquired')
delayedSittingMapper.preview_column(dft["Age at unsupported sitting (months)"])

Unnamed: 0,term,status
0,Delayed ability to sit (HP:0025336),excluded
1,Delayed ability to sit (HP:0025336),excluded
2,Delayed ability to sit (HP:0025336),observed
3,Delayed ability to sit (HP:0025336),observed
4,Delayed ability to sit (HP:0025336),observed


In [15]:
# Age at walking (months) - 15 months -- Delayed ability to walk HP:0031936
delayedWalkingMapper =  ThresholdedColumnMapper(hpo_id="HP:0031936", hpo_label="Delayed ability to walk", 
                                        threshold=15, call_if_above=True, observed_code='Not acquired')
delayedWalkingMapper.preview_column(dft["Age at walking (months)"])


Unnamed: 0,term,status
0,Delayed ability to walk (HP:0031936),observed
1,Delayed ability to walk (HP:0031936),observed
2,Delayed ability to walk (HP:0031936),observed
3,Delayed ability to walk (HP:0031936),observed
4,Delayed ability to walk (HP:0031936),observed


In [16]:
column_mapper_d["Age at unsupported sitting (months)"] = delayedSittingMapper
column_mapper_d["Age at walking (months)"] = delayedWalkingMapper

<h2>Other columns</h2>
<p>The following "simple" columns are created in a loop for simplicity. The key represents the words used in the table, and the value is a two-element array with the corresponding HPO label and term id. See 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>
    for more information.</p>
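<p>Because the dictionary is typed by hand, a quick format check before building the mappers can catch typos in the term ids. The sketch below only verifies the shape of each entry (a label plus an id of the form HP: followed by seven digits); it does not check that the id exists in the ontology or that the label matches it. The function <tt>check_items</tt> is a hypothetical helper, not a pyphetools function.</p>

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")


def check_items(items):
    """Return the keys whose value is not a [label, 'HP:NNNNNNN'] pair."""
    bad = []
    for column, value in items.items():
        ok = (
            isinstance(value, (list, tuple))
            and len(value) == 2
            and bool(value[0])
            and HPO_ID.fullmatch(str(value[1])) is not None
        )
        if not ok:
            bad.append(column)
    return bad


example = {
    "Round face": ["Round face", "HP:0000311"],
    "Obesity": ["Obesity", "HP:001513"],   # deliberately malformed id
}
print(check_items(example))  # ['Obesity']
```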

In [17]:
items = {
    'History of regression': ["Developmental regression","HP:0002376"],
    'Spastic diplegia':['Spastic diplegia', 'HP:0001264'],     #       
    'Autistic behavior': ['Autistic behavior', 'HP:0000729'],  # 
    'Infantile hypotonia':['Infantile muscular hypotonia','HP:0008947'], # 
    'Cerebral atrophy':["Cerebral atrophy","HP:0002059"], #
    'Delayed myelination':["Delayed CNS myelination","HP:0002188"], #
    'Corpus callosum hypoplasia':['Hypoplasia of the corpus callosum','HP:0002079'],#
    'Prominent nasal bridge':['Prominent nasal bridge','HP:0000426'], #
    'Thin upper lip':["Thin upper lip vermilion","HP:0000219"],
    "Round face":["Round face","HP:0000311"],
    "Short stature":["Short stature","HP:0004322"],
    "Obesity":["Obesity", "HP:0001513"],
    "Precocious puberty":["Precocious puberty", "HP:0000826"],
}
item_column_mapper_d = hpo_cr.initialize_simple_column_maps(column_name_to_hpo_label_map=items, observed='+',
    excluded='-')
print(f"We created {len(item_column_mapper_d)} simple column mappers")
# Transfer to column_mapper_d
for k, v in item_column_mapper_d.items():
    column_mapper_d[k] = v

We created 13 simple column mappers


In [18]:
severity_d = {'Severe': 'Intellectual disability, severe',
             'Profound': 'Intellectual disability, profound'}
idMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=severity_d)
#idMapper.preview_column(dft['Intellectual disability'])
column_mapper_d['Intellectual disability'] = idMapper

In [19]:
# Language skills
language_d = {'Simple two-word sentences': 'Delayed speech and language development',
             'Simple words': 'Delayed speech and language development',
             'Nonverbal': 'Absent speech'}
languageMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=language_d)
# languageMapper.preview_column(dft['Language skills'])
column_mapper_d['Language skills'] = languageMapper

In [20]:
# Gross motor skills Wheelchair bound 	Wheelchair bound 	Wheelchair bound 	Cruising (5y)	Walking  (5y)
gms_d = {
    "Wheelchair bound": "Loss of ambulation",
    "Cruising": "Delayed gross motor development"
}
gmsMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=gms_d)
# gmsMapper.preview_column(dft['Gross motor skills'])
column_mapper_d['Gross motor skills'] = gmsMapper

In [21]:
# Others
other_d = {'upper slanted palpebral fissures': 'Upslanted palpebral fissure'}
otherMapper = CustomColumnMapper(concept_recognizer=hpo_cr, custom_map_d=other_d)
#otherMapper.preview_column(dft['Others'])
column_mapper_d['Others'] = otherMapper

<h2>Variant Data</h2>
<p>The variant data (HGVS, transcript) is listed in the "Variant (hg19, NM_015133.4)" column.</p>
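<p>The column contains only the coding change (for example c.1732C&gt;T), while the preview of the variant mapper in this notebook shows the full expression with the transcript prefixed. A minimal sketch of that combination follows; <tt>to_hgvs</tt> is a hypothetical helper and not part of pyphetools.</p>

```python
def to_hgvs(transcript, c_change):
    """Join a transcript accession and a c. change into one HGVS string."""
    c_change = c_change.strip()
    if not c_change.startswith("c."):
        raise ValueError(f"expected a coding (c.) change, got {c_change!r}")
    return f"{transcript}:{c_change}"


print(to_hgvs("NM_015133.4", "c.1732C>T"))  # NM_015133.4:c.1732C>T
```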

In [22]:
genome = 'hg38'
transcript='NM_015133.4'
varMapper = VariantColumnMapper(assembly=genome,
                                column_name='Variant (hg19, NM_015133.4)', 
                                transcript=transcript, 
                                default_genotype='heterozygous')
varMapper.preview_column(dft['Variant (hg19, NM_015133.4)'])

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.3436C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.3436C>T/NM_015133.4?content-type=application%2Fjson


Unnamed: 0,variant
0,NM_015133.4:c.1732C>T
1,NM_015133.4:c.1732C>T
2,NM_015133.4:c.1732C>T
3,NM_015133.4:c.3436C>T
4,NM_015133.4:c.3436C>T


<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>
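<p>Ages are stored as ISO 8601 durations (for example P29Y for 29 years). The sketch below shows the conversion for a column of whole years; <tt>years_to_iso8601</tt> is a hypothetical helper written for illustration and does not reproduce how pyphetools parses ages.</p>

```python
def years_to_iso8601(cell):
    """Convert a whole number of years to an ISO 8601 duration, else None."""
    text = str(cell).strip()
    if not text.isdigit():
        return None                  # e.g. 'ND' or an empty cell
    return f"P{int(text)}Y"


print([years_to_iso8601(age) for age in [29, 27, 16, 5, 5]])
# ['P29Y', 'P27Y', 'P16Y', 'P5Y', 'P5Y']
```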

In [23]:
ageMapper = AgeColumnMapper.by_year('Age (yr)')
ageMapper.preview_column(dft['Age (yr)'])

Unnamed: 0,original column contents,age
0,29,P29Y
1,27,P27Y
2,16,P16Y
3,5,P5Y


In [24]:
sexMapper = SexColumnMapper(male_symbol='Male', female_symbol='Female', column_name='Sex')
sexMapper.preview_column(dft['Sex'])

Unnamed: 0,original column contents,sex
0,Male,MALE
1,Female,FEMALE
2,Male,MALE
3,Male,MALE
4,Female,FEMALE


In [25]:
pmid = "PMID:30945334"  # Iwasawa et al. (2019), the source of the table used in this notebook
encoder = CohortEncoder(df=dft, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        metadata=metadata,
                        variant_mapper=varMapper,
                        pmid=pmid)
disease_id = "OMIM:618443"
disease_label = "Neurodevelopmental disorder with or without variable brain abnormalities"
encoder.set_disease(disease_id=disease_id, label=disease_label)

In [26]:
individuals = encoder.get_individuals()

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.3436C>T/NM_015133.4?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.3436C>T/NM_015133.4?content-type=application%2Fjson


In [27]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "Individual 1",
  "subject": {
    "id": "Individual 1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P29Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0001270",
        "label": "Motor delay"
      }
    },
    {
      "type": {
        "id": "HP:0032988",
        "label": "Persistent head lag"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0025336",
        "label": "Delayed ability to sit"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0031936",
        "label": "Delayed ability to walk"
      }
    },
    {
      "type": {
        "id": "HP:0001264",
        "label": "Spastic diplegia"
      }
    },
    {
      "type": {
        "id": "HP:0002059",
        "label": "Cerebral atrophy"
      }
    },
    {
      "type": {
        "id": "HP:0002188",
        "label": "Delayed CNS myelination"
      }
    },
    {
      "type"

<h2>Output</h2>
<p>Finally, we output the five phenopackets to the "phenopackets" folder.</p>

In [29]:
Individual.output_individuals_as_phenopackets(individual_list=individuals, pmid=pmid, metadata=metadata.to_ga4gh(), outdir="phenopackets")

5
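
<p>Once the files are written, a quick sanity check is to load them again and count the phenotypic features in each. The sketch below uses only the standard library and assumes that the files in the output folder have the extension .json; the function <tt>summarize_phenopackets</tt> is a hypothetical helper, not part of pyphetools.</p>

```python
import json
from pathlib import Path


def summarize_phenopackets(outdir):
    """Return (id, number of phenotypic features) for each JSON file in outdir."""
    summary = []
    for path in sorted(Path(outdir).glob("*.json")):  # assumes a .json extension
        with open(path, encoding="utf-8") as fh:
            packet = json.load(fh)
        n_features = len(packet.get("phenotypicFeatures", []))
        summary.append((packet.get("id"), n_features))
    return summary


# Only runs if the output folder from the previous cell exists
if Path("phenopackets").is_dir():
    for packet_id, n_features in summarize_phenopackets("phenopackets"):
        print(packet_id, n_features)
```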