<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process the data from <a href="https://pubmed.ncbi.nlm.nih.gov/29907796/" target="__blank">Diets et al. (2019) SMARCB1 causes severe intellectual disability and choroid plexus hyperplasia with resultant hydrocephalus</a>.</p>
<p>pyphetools provides a convenient way of extracting HPO terms from the kinds of tables typically presented in publications and supplemental material. Such tables can have the individuals in columns or in rows. In this case, we extract data from the clinical phenotype table of the publication, in which data from four individuals are presented in columns, with the rows representing the categories of clinical data.</p>
<p>This note shows how to work through the table and set up the pyphetools encoder. The table is available only as a PDF table in the original publication. We copied the information into an Excel file for this workbook.</p>
<p>Users can work on one column at a time and then generate a collection of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a> to represent each patient included in the original supplemental material. These phenopackets can then be used for a variety of downstream applications.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys

sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools import *

<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept an hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. It will not download the file again if it has already been downloaded.</p>

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()

<h2>Importing the supplemental table</h2>
<p>The table from the Diets et al. (2019) paper was copied into an Excel file that is included in the data subfolder.</p>
<p>Here, we use the pandas library to import this file (note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input. Users can choose any input format that pandas supports, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>
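<p>As a minimal sketch of the point about input formats, any tabular source can first be read into a DataFrame. The tab-separated text below is a made-up two-patient excerpt used only for illustration; it is not a file shipped with this notebook.</p>

```python
import io
import pandas as pd

# Hypothetical TSV excerpt with individuals in columns (illustration only).
tsv_text = (
    "Feature\tPatient 1\tPatient 2\n"
    "Intellectual disability\tSevere\tSevere\n"
    "Congenital heart defect\t-\t+\n"
)

# read_csv handles TSV input when the separator is set to a tab character.
toy_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy_df.shape)
```

<p>The resulting DataFrame has the same individuals-in-columns layout as the Excel table and could be reshaped in the same way.</p>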

In [3]:
df = pd.read_excel('data/PMID_29907796.xlsx')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,Patient 1,Patient 2,Patient 3,Patient 4,Count features
0,Pathogenic variant,c.110G>A,c.110G>A,c.110G>A,c.110G>A,
1,,p.Arg37His,p.Arg37His,p.Arg37His,p.Arg37His,
2,Inheritance,De novo,De novo,De novo,De novo,
3,Age at examination,9.5 y,5 y 8 mo,12 y,17 mo,
4,Development,,,,,
5,Intellectual disability,Severe,Severe,Severe,Severe,4/4 (100%)
6,Speech delay,Severe,Severe,Severe,Severe,4/4 (100%)
7,Motor delay,Severe,Severe,Severe,Severe,4/4 (100%)
8,Congenital anomalies,,,,,
9,Congenital heart defect,-,+,-,+,2/4 (50%)


<h1>Converting to row-based format</h1>
<p>To use pyphetools, we need to have the individuals represented as rows (one row per individual) and have the items of interest be encoded as column names. The required transformations for doing this may be different for different input data, but often we will want to transpose the table (using the pandas <tt>transpose</tt> function) and set the column names of the new table to the zero-th row. After this, we drop the zero-th row (otherwise, it will be interpreted as an individual by the pyphetools code).</p>
<p>After this step is completed, the remaining steps to create phenopackets are the same as in the 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>.</p>
    
Furthermore, in this specific case the transposed table contains a Count features row that we want to drop, so we keep only the rows whose index label contains Patient.

In [15]:
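# Transpose so that each individual becomes one row
dft = df.transpose()
# Promote the first row (the feature names) to column headers, then drop it
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
# Keep only the patient rows (this removes 'Count features'); copy so that
# columns can be added later without a SettingWithCopyWarning
dft = dft[dft.index.astype(str).str.contains('Patient')].copy()
dft.head()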

Unnamed: 0,Pathogenic variant,NaN,Inheritance,Age at examination,Development,Intellectual disability,Speech delay,Motor delay,Congenital anomalies,Congenital heart defect,...,Musculoskeletal,Brachycephaly,Joint hypermobility,Hip dysplasia,Contractures,Other,Obstructive sleep apnea,Precocious puberty,(History of) anemia,Thrombocytopenia
Patient 1,c.110G>A,p.Arg37His,De novo,9.5 y,,Severe,Severe,Severe,,-,...,,+,-,-,+,,+,-,+,-
Patient 2,c.110G>A,p.Arg37His,De novo,5 y 8 mo,,Severe,Severe,Severe,,+,...,,-,+,-,-,,+,+,+,-
Patient 3,c.110G>A,p.Arg37His,De novo,12 y,,Severe,Severe,Severe,,-,...,,-,+,-,+,,+,-,+,-
Patient 4,c.110G>A,p.Arg37His,De novo,17 mo,,Severe,Severe,Severe,,+,...,,+,+,+,-,,+,,-,+
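<p>The same reshaping can be tried on a small stand-alone table. The sketch below uses made-up values and a hypothetical first column name (<tt>Feature</tt>). It sets that column as the index before transposing, which is an alternative to promoting the zero-th row after the transpose.</p>

```python
import pandas as pd

# Toy table with individuals in columns (values are illustrative only).
toy = pd.DataFrame({
    "Feature": ["Inheritance", "Congenital heart defect"],
    "Patient 1": ["De novo", "-"],
    "Patient 2": ["De novo", "+"],
    "Count features": [None, "1/2 (50%)"],
})

# Use the feature names as the index, then transpose so that each
# individual becomes one row and each feature becomes a column.
toy_t = toy.set_index("Feature").transpose()

# Keep only the rows whose index label mentions "Patient".
toy_t = toy_t[toy_t.index.astype(str).str.contains("Patient")]
print(toy_t)
```

<p>Either route gives one row per individual; which one is more convenient depends on whether the feature names arrive as a regular column or as the first row.</p>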


<h2>Index vs. normal column</h2>
<p>Another thing to look out for is whether the individual identifiers (usually the first column) are stored as the index of the table or as the first ordinary column.</p>
<p>If they are stored as the index, as is the case here after transposition, it is easiest to create a new column with the contents of the index, which pyphetools can then use. An example follows; afterwards we can use 'patient_id' as the column name.</p>

In [25]:
dft.index
dft['patient_id'] = dft.index
dft.head()

Unnamed: 0,Pathogenic variant,NaN,Inheritance,Age at examination,Development,Intellectual disability,Speech delay,Motor delay,Congenital anomalies,Congenital heart defect,...,Brachycephaly,Joint hypermobility,Hip dysplasia,Contractures,Other,Obstructive sleep apnea,Precocious puberty,(History of) anemia,Thrombocytopenia,patient_id
Patient 1,c.110G>A,p.Arg37His,De novo,9.5 y,,Severe,Severe,Severe,,-,...,+,-,-,+,,+,-,+,-,Patient 1
Patient 2,c.110G>A,p.Arg37His,De novo,5 y 8 mo,,Severe,Severe,Severe,,+,...,-,+,-,-,,+,+,+,-,Patient 2
Patient 3,c.110G>A,p.Arg37His,De novo,12 y,,Severe,Severe,Severe,,-,...,-,+,-,+,,+,-,+,-,Patient 3
Patient 4,c.110G>A,p.Arg37His,De novo,17 mo,,Severe,Severe,Severe,,+,...,+,+,+,-,,+,,-,+,Patient 4
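<p>An equivalent approach, shown here on a toy DataFrame, is to name the index and then move it into a regular column with <tt>reset_index</tt>. The data are made up; <tt>patient_id</tt> is the same column name used above.</p>

```python
import pandas as pd

# Toy row-based table whose individuals are stored in the index.
toy = pd.DataFrame(
    {"Brachycephaly": ["+", "-"]},
    index=["Patient 1", "Patient 2"],
)

# Name the index, then turn it into an ordinary column.
toy_flat = toy.rename_axis("patient_id").reset_index()
print(toy_flat.columns.tolist())
```

<p>Unlike the assignment used in the cell above, this variant replaces the index with a default integer index, so it suits cases where the patient labels are no longer needed as row labels.</p>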


Some column names might include leading or trailing spaces, so let's make sure that is not the case:

In [26]:
dft.columns = dft.columns.str.strip()
dft.head()

Unnamed: 0,Pathogenic variant,NaN,Inheritance,Age at examination,Development,Intellectual disability,Speech delay,Motor delay,Congenital anomalies,Congenital heart defect,...,Brachycephaly,Joint hypermobility,Hip dysplasia,Contractures,Other,Obstructive sleep apnea,Precocious puberty,(History of) anemia,Thrombocytopenia,patient_id
Patient 1,c.110G>A,p.Arg37His,De novo,9.5 y,,Severe,Severe,Severe,,-,...,+,-,-,+,,+,-,+,-,Patient 1
Patient 2,c.110G>A,p.Arg37His,De novo,5 y 8 mo,,Severe,Severe,Severe,,+,...,-,+,-,-,,+,+,+,-,Patient 2
Patient 3,c.110G>A,p.Arg37His,De novo,12 y,,Severe,Severe,Severe,,-,...,-,+,-,+,,+,-,+,-,Patient 3
Patient 4,c.110G>A,p.Arg37His,De novo,17 mo,,Severe,Severe,Severe,,+,...,+,+,+,-,,+,,-,+,Patient 4
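<p>The output above also contains a column whose name is NaN (the protein-level change, which has no row label in the source table). One possible way to give such a column a usable name is sketched below on a toy table. The replacement label <tt>Protein change</tt> is our own choice for illustration and is not part of pyphetools or of the original table.</p>

```python
import pandas as pd

# Toy table with one missing column label (illustration only).
toy = pd.DataFrame(
    [["c.110G>A", "p.Arg37His"]],
    index=["Patient 1"],
    columns=[" Pathogenic variant ", float("nan")],
)

# Replace missing labels with a placeholder and strip whitespace from the rest.
toy.columns = [
    "Protein change" if pd.isna(name) else str(name).strip()
    for name in toy.columns
]
print(toy.columns.tolist())
```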


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use the df.loc function to get the corresponding row data</p>

In [29]:
column_mapper_d = defaultdict(ColumnMapper)

In [None]:
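def try_mapping_columns(hpo_id, observed, excluded, df):
    # Unfinished helper: the body was left empty in this notebook.
    # The placeholder statement keeps the cell syntactically valid.
    pass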

In [30]:
congenital_heart_defect_mapper = SimpleColumnMapper(hpo_id='HP:0001627',
    hpo_label='Abnormal heart morphology',
    observed='+',
    excluded='-')
congenital_heart_defect_mapper.preview_column(dft['Congenital heart defect'])
column_mapper_d['Congenital heart defect'] = congenital_heart_defect_mapper

In [32]:
brachycephaly_mapper = SimpleColumnMapper(hpo_id='HP:0000248',
    hpo_label='Brachycephaly',
    observed='+',
    excluded='-')
brachycephaly_mapper.preview_column(dft['Brachycephaly'])
column_mapper_d['Brachycephaly'] = brachycephaly_mapper
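<p>Conceptually, a simple column mapper assigns one HPO term to a column and decides for each cell whether the feature was observed, excluded, or not recorded. The function below is a hypothetical stand-alone illustration of that idea in plain Python. It is not the pyphetools implementation, and <tt>classify_cell</tt> is a name introduced here.</p>

```python
def classify_cell(cell, observed="+", excluded="-"):
    """Return 'observed', 'excluded', or 'not recorded' for one table cell."""
    text = "" if cell is None else str(cell).strip()
    if text == observed:
        return "observed"
    if text == excluded:
        return "excluded"
    return "not recorded"

# Congenital heart defect values for the four patients, as in the table above.
cells = ["-", "+", "-", "+"]
print([classify_cell(c) for c in cells])
```

<p>Previewing a column before registering its mapper, as done in the cells above, is a quick way to spot cells that match neither symbol.</p>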

<h2>Other columns</h2>
<p>The following columns are shown without detailed explanations. See 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>
    for more information.</p>
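<p>Several mappers in this section use the Unicode minus sign (−, U+2212) as the excluded symbol, whereas the cells printed earlier in this notebook use the ASCII hyphen. If a table mixes the two, cells may silently fail to match. The sketch below normalizes dash-like characters in a toy DataFrame. The helper name <tt>normalize_dashes</tt> and the choice of the ASCII hyphen as the target are assumptions made for illustration.</p>

```python
import pandas as pd

# Dash-like characters that often stand in for "absent" in published tables.
DASHES = {"\u2212": "-", "\u2013": "-", "\u2014": "-"}

def normalize_dashes(value):
    """Replace Unicode minus and en/em dashes with an ASCII hyphen."""
    if not isinstance(value, str):
        return value  # leave missing values and numbers untouched
    for dash, hyphen in DASHES.items():
        value = value.replace(dash, hyphen)
    return value.strip()

toy = pd.DataFrame({"Short stature": ["+", "\u2212", None]})
toy["Short stature"] = toy["Short stature"].map(normalize_dashes)
print(toy["Short stature"].tolist())
```

<p>If the data are normalized this way, the <tt>excluded</tt> argument of each mapper must use the same character.</p>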

In [17]:
severity_d = {'Severe': 'Intellectual disability, severe',
             'Profound': 'Intellectual disability, profound'}
idMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=severity_d)
#idMapper.preview_column(dft['Intellectual disability'])
column_mapper_d['Intellectual disability'] = idMapper

In [18]:
#Autistic behavior  Autistic behavior HP:0000729
autisticMapper = SimpleColumnMapper(hpo_id='HP:0000729',
    hpo_label='Autistic behavior',
    observed='+',
    excluded='−')
#autisticMapper.preview_column(dft['Autistic behavior'])
column_mapper_d['Autistic behavior'] = autisticMapper

In [19]:
# Language skills

# Simple two-word sentences 	Simple words 	Nonverbal 	
language_d = {'Simple two-word sentences': 'Delayed speech and language development',
             'Simple words': 'Delayed speech and language development',
             'Nonverbal': 'Absent speech'}
languageMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=language_d)
# languageMapper.preview_column(dft['Language skills'])
column_mapper_d['Language skills'] = languageMapper

In [20]:
# Spastic diplegia 	Spastic diplegia HP:0001264
spasticDiplegiaMapper = SimpleColumnMapper(hpo_id='HP:0001264',
    hpo_label='Spastic diplegia',
    observed='+',
    excluded='−')
#spasticDiplegiaMapper.preview_column(dft['Spastic diplegia'])
column_mapper_d['Spastic diplegia'] = spasticDiplegiaMapper

In [21]:
# Gross motor skills Wheelchair bound 	Wheelchair bound 	Wheelchair bound 	Cruising (5y)	Walking  (5y)
gms_d = {
    "Wheelchair bound": "Loss of ambulation",
    "Cruising": "Delayed gross motor development"
}
gmsMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=gms_d)
# gmsMapper.preview_column(dft['Gross motor skills'])
column_mapper_d['Gross motor skills'] = gmsMapper

In [22]:
# Infantile hypotonia Infantile muscular hypotonia HP:0008947
hypotoniaMapper = SimpleColumnMapper(hpo_id='HP:0008947',
    hpo_label='Infantile muscular hypotonia',
    observed='+',
    excluded='−')
hypotoniaMapper.preview_column(dft['Infantile hypotonia'])
column_mapper_d['Infantile hypotonia'] = hypotoniaMapper

In [23]:
# History of regression Developmental regression HP:0002376
regressionMapper = SimpleColumnMapper(hpo_id='HP:0002376',
    hpo_label='Developmental regression',
    observed='+',
    excluded='−')
# regressionMapper.preview_column(dft['History of regression'])
column_mapper_d['History of regression'] = regressionMapper

In [24]:
# Epilepsy  Seizure HP:0001250
seizureMapper = SimpleColumnMapper(hpo_id='HP:0001250',
    hpo_label='Seizure',
    observed='+',
    excluded='−')
# seizureMapper.preview_column(dft['Epilepsy'])
column_mapper_d['Epilepsy'] = seizureMapper

In [25]:
# Cerebral atrophy 	Cerebral atrophy HP:0002059
# Note -- record the 'Mild' version as not measured
cerebralAtrophyMapper = SimpleColumnMapper(hpo_id='HP:0002059',
    hpo_label='Cerebral atrophy',
    observed='+',
    excluded='−')
# cerebralAtrophyMapper.preview_column(dft['Cerebral atrophy'])
column_mapper_d['Cerebral atrophy'] = cerebralAtrophyMapper

In [26]:
# Delayed myelination  Delayed CNS myelination HP:0002188
delayedCNSMy = SimpleColumnMapper(hpo_id='HP:0002188',
    hpo_label='Delayed CNS myelination',
    observed='+',
    excluded='−')
#delayedCNSMy.preview_column(dft['Delayed myelination'])
column_mapper_d['Delayed myelination'] = delayedCNSMy

In [27]:
# Corpus callosum hypoplasia  Hypoplasia of the corpus callosum HP:0002079
cchMapper = SimpleColumnMapper(hpo_id='HP:0002079',
    hpo_label='Hypoplasia of the corpus callosum',
    observed='++',
    excluded='−')
# cchMapper.preview_column(dft['Corpus callosum hypoplasia'])
column_mapper_d['Corpus callosum hypoplasia'] = cchMapper

In [28]:
# Round face  - Round face HP:0000311
roundFaceMapper = SimpleColumnMapper(hpo_id='HP:0000311',
    hpo_label='Round face',
    observed='+',
    excluded='−')
#roundFaceMapper.preview_column(dft['Round face'])
column_mapper_d['Round face'] = roundFaceMapper

In [29]:
# Prominent nasal bridge 	Prominent nasal bridge HP:0000426
promNasalBridgeMapper = SimpleColumnMapper(hpo_id='HP:0000426',
    hpo_label='Prominent nasal bridge',
    observed='+',
    excluded='−')
#promNasalBridgeMapper.preview_column(dft['Prominent nasal bridge'])
column_mapper_d['Prominent nasal bridge'] = promNasalBridgeMapper

In [30]:
# Thin upper lip -- Thin upper lip vermilion HP:0000219
upperLipMapper = SimpleColumnMapper(hpo_id='HP:0000219',
    hpo_label='Thin upper lip vermilion',
    observed='+',
    excluded='−')
#upperLipMapper.preview_column(dft['Thin upper lip'])
column_mapper_d['Thin upper lip'] = upperLipMapper

In [31]:
# Others
other_d = {'upper slanted palpebral fissures': 'Upslanted palpebral fissure'}
otherMapper = CustomColumnMapper(concept_recognizer=hpo_cr, custom_map_d=other_d)
#otherMapper.preview_column(dft['Others'])
column_mapper_d['Others'] = otherMapper

In [32]:
# Short stature  Short stature HP:0004322
shortStatureMapper = SimpleColumnMapper(hpo_id='HP:0004322',
    hpo_label='Short stature',
    observed='+',
    excluded='−')
#shortStatureMapper.preview_column(dft['Short stature'])
column_mapper_d['Short stature'] = shortStatureMapper

In [33]:
# Obesity -- Obesity HP:0001513
obesityMapper = SimpleColumnMapper(hpo_id='HP:0001513',
    hpo_label='Obesity',
    observed='+',
    excluded='−')
#obesityMapper.preview_column(dft['Obesity'])
column_mapper_d['Obesity'] = obesityMapper

In [34]:
# Precocious puberty -- Precocious puberty HP:0000826
precociousMapper = SimpleColumnMapper(hpo_id='HP:0000826',
    hpo_label='Precocious puberty',
    observed='+',
    excluded='−')
#precociousMapper.preview_column(dft['Precocious puberty'])
column_mapper_d['Precocious puberty'] = precociousMapper

<h2>Variant Data</h2>
<p>The variant data (HGVS, transcript) is listed in the Variant (hg19, NM_015133.4) column.</p>

In [35]:
genome = 'hg38'
default_genotype = 'heterozygous'
transcript='NM_015133.4'
varMapper = VariantColumnMapper(assembly=genome,column_name='Variant (hg19, NM_015133.4)', 
                                transcript=transcript, genotype=default_genotype)

In [36]:
#varMapper.preview_column(dft['Variant (hg19, NM_015133.4)'])
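<p>Before handing variants to the mapper, it can help to check that each cell looks like a coding-DNA HGVS expression. The regular expression below is a deliberately narrow sketch that only recognizes simple substitutions such as <tt>c.110G&gt;A</tt>. The helper name is hypothetical, and other variant types (deletions, duplications, intronic offsets) would need a broader pattern.</p>

```python
import re

# Matches simple coding substitutions only, e.g. "c.110G>A".
SUBSTITUTION = re.compile(r"^c\.\d+[ACGT]>[ACGT]$")

def is_simple_substitution(hgvs_c):
    """Return True if the string is a plain c. substitution."""
    return bool(SUBSTITUTION.match(hgvs_c.strip()))

for candidate in ["c.110G>A", "c.1732C>T", "p.Arg37His"]:
    print(candidate, is_simple_substitution(candidate))
```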

<h1>Demographic data</h1>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [37]:
ageMapper = AgeColumnMapper.by_year('Age (yr)')
ageMapper.preview_column(dft['Age (yr)'])


Unnamed: 0,original column contents,age
0,29,P29Y
1,27,P27Y
2,16,P16Y
3,5,P5Y
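<p>The table shown earlier in this notebook reports ages as free text (for example <tt>9.5 y</tt>, <tt>5 y 8 mo</tt>, and <tt>17 mo</tt>) rather than as plain numbers of years. The helper below is a hypothetical pre-processing sketch, not a pyphetools function, that converts such strings to ISO 8601 durations. It assumes that only the units <tt>y</tt> and <tt>mo</tt> occur, and it rounds fractional years to whole months.</p>

```python
import re

def age_to_iso8601(text):
    """Convert strings such as '5 y 8 mo' to an ISO 8601 duration."""
    years = re.search(r"(\d+(?:\.\d+)?)\s*y", text)
    months = re.search(r"(\d+)\s*mo", text)
    total_months = 0
    if years:
        total_months += round(float(years.group(1)) * 12)
    if months:
        total_months += int(months.group(1))
    y, m = divmod(total_months, 12)
    parts = (f"{y}Y" if y else "") + (f"{m}M" if m else "")
    return "P" + parts if parts else None

for raw in ["9.5 y", "5 y 8 mo", "12 y", "17 mo"]:
    print(raw, "->", age_to_iso8601(raw))
```

<p>Strings in which no age can be recognized yield <tt>None</tt>, which makes it easy to list the cells that need manual review.</p>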


In [38]:
sexMapper = SexColumnMapper(male_symbol='Male', female_symbol='Female', column_name='Sex')
sexMapper.preview_column(dft['Sex'])

Unnamed: 0,original column contents,sex
0,Male,MALE
1,Female,FEMALE
2,Male,MALE
3,Male,MALE
4,Female,FEMALE


In [39]:
pmid = "PMID:29907796"
encoder = CohortEncoder(df=dft, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", agemapper=ageMapper, sexmapper=sexMapper,
                       variant_mapper=varMapper,
                       pmid=pmid)

In [40]:
individuals = encoder.get_individuals()

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4
size of variant_list is {len(variant_list)}
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4
size of variant_list is {len(variant_list)}
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.1732C>T/NM_015133.4
size of variant_list is {len(variant_list)}
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.3436C>T/NM_015133.4
size of variant_list is {len(variant_list)}
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_015133.4%3Ac.3436C>T/NM_015133.4
size of variant_list is {len(variant_list)}


In [44]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket()
json_string = MessageToJson(phenopacket1)
print(json_string)

Individual, size of variants 1
{
  "id": "Individual 1",
  "subject": {
    "id": "Individual 1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P29Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0001270",
        "label": "Motor delay"
      }
    },
    {
      "type": {
        "id": "HP:0032988",
        "label": "Persistent head lag"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0025336",
        "label": "Delayed ability to sit"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0031936",
        "label": "Delayed ability to walk"
      }
    },
    {
      "type": {
        "id": "HP:0010864",
        "label": "Intellectual disability, severe"
      }
    },
    {
      "type": {
        "id": "HP:0000729",
        "label": "Autistic behavior"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0000750",
     

In [45]:
output_directory = "phenopackets"
encoder.output_phenopackets(outdir=output_directory, individual_list=individuals)

Individual, size of variants 1
Individual, size of variants 1
Individual, size of variants 1
Individual, size of variants 1
Individual, size of variants 1
Wrote 5 phenopackets to phenopackets
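<p>Phenopacket JSON can be inspected with the Python standard library alone. The sketch below parses a short excerpt modeled on the phenopacket printed above and separates observed from excluded features. The excerpt is abbreviated for illustration and is not a complete phenopacket.</p>

```python
import json

# Excerpt modeled on the phenopacket printed earlier (abbreviated).
phenopacket_json = """
{
  "id": "Individual 1",
  "phenotypicFeatures": [
    {"type": {"id": "HP:0001270", "label": "Motor delay"}},
    {"type": {"id": "HP:0032988", "label": "Persistent head lag"}, "excluded": true},
    {"type": {"id": "HP:0010864", "label": "Intellectual disability, severe"}}
  ]
}
"""

packet = json.loads(phenopacket_json)
features = packet["phenotypicFeatures"]

# A feature counts as observed unless it carries "excluded": true.
observed = [f["type"]["label"] for f in features if not f.get("excluded", False)]
excluded = [f["type"]["label"] for f in features if f.get("excluded", False)]
print("observed:", observed)
print("excluded:", excluded)
```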
