<h1>Explore creation of phenopackets from supplemental material</h1>
<p>Let's take <a href="https://pubmed.ncbi.nlm.nih.gov/30612693/" target="__blank">Platzer K., et al. (2019) De Novo Variants in MAPK8IP3 Cause Intellectual Disability with Variable Brain Anomalies</a> as an example.</p>
<p>pyphetools provides a convenient way of extracting HPO terms from the kinds of tables typically presented in supplemental material, in which columns either contain yes/no/not-observed indications for a specific phenotypic feature or contain variously formatted strings with one or more phenotypic features. pyphetools uses text mining to capture as many <a href="https://hpo.jax.org/app/" target="__blank">HPO</a> terms as possible and allows users to specify optional dictionaries that map words or phrases used in the table to the primary labels of HPO terms.</p>
<p>Users can work on one column at a time and then generate a collection of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a> to represent each patient included in the original supplemental material. These phenopackets can then be used for a variety of downstream applications.</p>
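<p>To make the target format concrete, the following sketch builds a minimal phenopacket-like dictionary for one individual. It assumes the GA4GH field names <tt>subject</tt>, <tt>phenotypicFeatures</tt>, <tt>type</tt> and <tt>excluded</tt>, and it omits the <tt>metaData</tt> block that a valid phenopacket also needs. The helper <tt>make_phenopacket</tt> is hypothetical and is not part of pyphetools.</p>

```python
import json


def make_phenopacket(individual_id, sex, features):
    """Build a minimal phenopacket-like dict (illustrative, not schema-validated).

    features: list of (hpo_id, hpo_label, observed) tuples.
    """
    return {
        "id": f"example-{individual_id}",
        "subject": {"id": individual_id, "sex": sex},
        "phenotypicFeatures": [
            # 'excluded' is True when the feature was explicitly ruled out
            {"type": {"id": hpo_id, "label": label}, "excluded": not observed}
            for hpo_id, label, observed in features
        ],
    }


# Individual 1 of the example table: ataxia observed, seizures reported as 'no'.
packet = make_phenopacket(
    "individual 1",
    "MALE",
    [("HP:0001251", "Ataxia", True),
     ("HP:0001250", "Seizure", False)],
)
print(json.dumps(packet, indent=2))
```

<p>Each observed or excluded feature becomes one entry of <tt>phenotypicFeatures</tt>, which is why the rest of this notebook concentrates on turning table cells into HPO terms with an observed/excluded status.</p>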

In [1]:
import phenopackets as pp
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys

sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools import *

<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>

In [2]:
hpo_json_path = '/home/peter/data/hpo/hp.json'
parser = HpoParser(hpo_json_file=hpo_json_path)
hpo_cr = parser.get_hpo_concept_recognizer()
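<p>The internals of <tt>HpoParser</tt> are not shown in this notebook. As a rough sketch of what a concept recognizer needs, the following code builds a label index from a tiny inline stand-in for hp.json, assuming the file follows the OBO Graphs JSON layout (<tt>graphs</tt>, <tt>nodes</tt>, <tt>id</tt>, <tt>lbl</tt>). The function <tt>build_label_index</tt> is hypothetical and only illustrates the idea.</p>

```python
import json

# Tiny stand-in for hp.json. The nesting (graphs -> nodes) is an assumption
# based on the OBO Graphs JSON layout; the real file has many thousands of nodes.
HP_JSON_SNIPPET = """
{"graphs": [{"nodes": [
  {"id": "http://purl.obolibrary.org/obo/HP_0001251", "lbl": "Ataxia", "type": "CLASS"},
  {"id": "http://purl.obolibrary.org/obo/HP_0001257", "lbl": "Spasticity", "type": "CLASS"},
  {"id": "http://purl.obolibrary.org/obo/HP_0001263", "lbl": "Global developmental delay", "type": "CLASS"}
]}]}
"""


def build_label_index(hp_json_text):
    """Map lower-cased primary labels to (HPO id, label) tuples."""
    index = {}
    for node in json.loads(hp_json_text)["graphs"][0]["nodes"]:
        iri, label = node.get("id", ""), node.get("lbl")
        if label is None or "/HP_" not in iri:
            continue  # skip nodes that are not labelled HPO terms
        hpo_id = "HP:" + iri.rsplit("HP_", 1)[1]
        index[label.lower()] = (hpo_id, label)
    return index


label_index = build_label_index(HP_JSON_SNIPPET)
print(label_index["ataxia"])
```

<p>A real recognizer would presumably also index synonyms, which this sketch leaves out.</p>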

<h2>Importing the supplemental table</h2>
<p>Supplemental Table S1 of the Platzer et al. (2019) paper is an Excel file that is included in the data subfolder and contains Detailed Clinical Information for All Individuals with Causative De Novo Variants in MAPK8IP3. We need to read this from the original Excel file because some of the cells contain newline characters.</p>
<p>Here, we use the pandas library to import this file (note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input. Users can choose any input format that pandas supports, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('data/mmc2.xlsx')

<h2>Stepwise encoding of the supplementary material</h2>
<p>pyphetools supports efficient and accurate HPO-encoding of typical supplementary tables that describe a cohort of 
individuals found to have a certain disease. We recommend using the <tt>df.head()</tt> and the <tt>df.columns</tt> commands
to view the data and the column headers as follows.</p>

In [4]:
df.head()

Unnamed: 0,Indvidual\nin\nmanuscript,g.(hg19) Chr16:,Transcript\nNM_015133.4\nc.,p.,origin,genetic testing,Sex,age at last assesment,prenatal period,Exam at birth,...,neurological examination,result of external MRI,seizures,Sz onset and Sz types,AEDs used,Sz outcome,EEG,Additional symptoms,family history,further results of genetic testing
0,1,1756405,c.65delG,p.Gly22Alafs*3,de novo,TrioWES,M,14 y 8 m,,41 weeks:\nlength: 53.3 cm\nweight: 3.941 kg\nOFC: NA,...,ataxia,"mild cerebellar atrophy, hypointensity of the globi pallidi and substantia nigra, possible mild degree of abnormal iron or mineral deposition",no,,,,,"speech is ataxic but speaks in sentences/short phrases; attention issues, impulse control and emotional lability, OCD symptoms; recently developed scoliosis",unremarkable,
1,2,1756419,c.79G>T,p.Glu27*,de novo,SingleWES,M,4 y,,length: 49 cm\nweigth: 3215 g\nOFC: 35 cm,...,ataxia,normal,no,,,,,pre-natal pelvi-ureteric junction stenosis (spontaneous resolution at 6 m),,
2,3,1756451,c.111C>G,p.Tyr37*,de novo,TrioWES,M,4 y,,length: 20.5 in\nweight: 8 lb 2 oz\nOFC: NA,...,,Stable areas of T2 hyperintensity involving the central tegmental tracts,no,,,,,Nystagmus,unremarkable,770 kb duplicaion of 20p12.3 on chromosome microarray
3,4,1798706,c.1198G>A,p.Gly400Arg,de novo,TrioWES,M,7 y 6 m,"no prenatal care, no known problems","32 weeks:\nlength: NA,\nweight: 4 lbs,\nOFC: NA\n\nhad a 30 day hospital course",...,,no MRI done,no,,,,,"Left hearing loss; Dysmorphic features: hypertelorism inner canthal distance 4.3cm; low set prominent ears, slight overhangin columella, hypodontia; 5th finger clinodactyly and 5th finger brachydactylky; synophrys; Encopresis",Mother with learning disorder; finished 11th grade; Father with ADHD and learning disorder; finished 9th grade; Full sister with learning disorder; Full sister no known problems; Full brother with learning disorder,
4,5,1810410,c.1331T>C,p.Leu444Pro,de novo,TrioWES,M,10 y,,"40 weeks, length: 52 cm\nweight: 3810 g\nOFC: 36 cm",...,,perisylvian polymicrogyria,yes,10 y:\none event of a generalized seizure,,,"pathological EEG with normal age-related background activity (alpha-type), increased appearance of slowing over temporal and occipital regions","no dysmorphism, small teeth, severe s-configured scoliosis of thoracic and lumbar spine",,


<h3>Step 1: Determine the columns of interest</h3>
<p>Typically, some but not all columns of supplemental tables include clinical phenotypic features that can be encoded using HPO. Inspect the table using the pandas <tt>head()</tt> function or the <tt>columns</tt> attribute and decide which columns to encode.</p>

In [5]:
df.columns

Index(['Indvidual\nin\nmanuscript', 'g.(hg19) Chr16:',
       'Transcript\nNM_015133.4\nc.', 'p.', 'origin', 'genetic testing', 'Sex',
       'age at last assesment', 'prenatal period', 'Exam at birth',
       'body measurements\n(at last assesment if not otherwise specified)',
       'DD', 'severity of ID', 'development', 'regression', 'autism',
       'hypotonia', 'movement disorder', 'CVI', 'neurological examination',
       'result of external MRI', 'seizures', 'Sz onset and Sz types',
       'AEDs used', 'Sz outcome', 'EEG', 'Additional symptoms',
       'family history', 'further results of genetic testing'],
      dtype='object')
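<p>Several headers contain newline characters, so the exact raw string must be used when indexing the DataFrame. The following sketch, which uses a hypothetical helper <tt>readable</tt> and a hand-copied subset of the headers above, builds a lookup from a readable name to the raw header without renaming anything.</p>

```python
import re

# A few of the raw headers shown above, copied verbatim (including newlines
# and the original spelling).
raw_columns = [
    'Indvidual\nin\nmanuscript',
    'Transcript\nNM_015133.4\nc.',
    'age at last assesment',
    'neurological examination',
]


def readable(name):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r'\s+', ' ', name).strip()


# Map readable name -> raw name, so the raw key can still be used with df[...]
readable_to_raw = {readable(c): c for c in raw_columns}
print(repr(readable_to_raw['Indvidual in manuscript']))
```

<p>Keeping the raw headers intact means the column names used later in the notebook still match the DataFrame.</p>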

<h3>Step 2: Encode each column of interest using the ColumnMapper class</h3>
<p>We will show how to work with the ColumnMapper class in detail using the 'neurological examination' column. The basic idea is to make one ColumnMapper object for each column of interest. A column mapper maps the cell contents to HPO terms using either default exact text matching or custom maps from arbitrary strings to HPO terms.</p>
<p>The first step is to create a column mapper object (here a CustomColumnMapper) and use the preview_column feature to see how many terms can be mapped using exact text matching.</p>

In [6]:
neuroMapper = CustomColumnMapper(concept_recognizer=hpo_cr)
neuroMapper.preview_column(df['neurological examination'])

Unnamed: 0,column,terms
0,ataxia,Ataxia (HP:0001251)
1,,
2,spastic paraplegia,Spastic paraplegia (HP:0001258)
3,"spasticity; nerve conduction and EMG studies with abnormal findings ""remarkable for the failure to activate the leg muscles due to an upper motor neuron pattern of aberrant motor unit potential firing rates. These findings are consistent with dysfunction of the corticospinal pathways rather than a lower motor unit."" Significant low extremity weakness.",Spasticity (HP:0001257)
4,spasticity/stiff legs,Spasticity (HP:0001257)
5,spastic diplegic cerebral palsy,Cerebral palsy (HP:0100021)
6,"orobuccal dyspraxia, awkward gross and fine motricity, difficulty in coordination, unstable gait",Stable (HP:0031915)
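<p>In row 6, 'unstable gait' yields <i>Stable</i> (HP:0031915), which is the kind of hit that plain substring matching produces. How pyphetools matches text internally is not shown here; the following sketch only contrasts substring matching with whole-word matching on a hypothetical mini-vocabulary, using the HPO ids that appear in the output above.</p>

```python
import re

# Hypothetical mini-vocabulary: lower-cased label -> HPO id.
LABELS = {
    'ataxia': 'HP:0001251',
    'spasticity': 'HP:0001257',
    'cerebral palsy': 'HP:0100021',
    'stable': 'HP:0031915',
}


def find_substring(text):
    """Naive matching: a label counts if it occurs anywhere in the text."""
    low = text.lower()
    return sorted(label for label in LABELS if label in low)


def find_whole_word(text):
    """Stricter matching: the label must be delimited by word boundaries."""
    low = text.lower()
    return sorted(label for label in LABELS
                  if re.search(r'\b' + re.escape(label) + r'\b', low))


cell = 'difficulty in coordination, unstable gait'
print(find_substring(cell))   # ['stable'], picked up inside 'unstable'
print(find_whole_word(cell))  # []
```

<p>Whichever strategy a tool uses, previewing each column before accepting the result is a cheap way to catch such false positives.</p>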


<h3>Adding manual mappings: CustomColumnMapper</h3>
<p>We can see that the string in the first row, 'ataxia', was mapped to the HPO term <i>Ataxia</i> (HP:0001251), and that several other concepts were identified. However, we missed several concepts that do not match HPO term labels or synonyms exactly but can be mapped using some domain knowledge. For instance, low extremity weakness appears to be equivalent to <i>Lower limb muscle weakness</i> (HP:0007340). In addition, 'unstable gait' in the last row was spuriously matched to <i>Stable</i> (HP:0031915).</p>
<p>To add these mappings, users should look up the primary label of the HPO terms in question and create a dictionary that maps the phrases used in the supplemental material to the HPO labels. The following cell shows a map for the 'neurological examination' column and calls the CustomColumnMapper constructor with this custom map.</p>

In [7]:
neuro_exam_custom_map = {'low extremity weakness': 'Lower limb muscle weakness',
                         'unstable gait': 'Unsteady gait',
                         'dysfunction of the corticospinal pathways': 'Upper motor neuron dysfunction',
                         'spastic': 'Spasticity',
                         'orobuccal dyspraxia': 'Oromotor apraxia',
                         'difficulty in coordination': 'Poor coordination'
                        }
neuroMapper = CustomColumnMapper(concept_recognizer=hpo_cr, custom_map_d=neuro_exam_custom_map)
neuroMapper.preview_column(df['neurological examination'])

Unnamed: 0,column,terms
0,ataxia,Ataxia (HP:0001251)
1,,
2,spastic paraplegia,Spastic paraplegia (HP:0001258)
3,"spasticity; nerve conduction and EMG studies with abnormal findings ""remarkable for the failure to activate the leg muscles due to an upper motor neuron pattern of aberrant motor unit potential firing rates. These findings are consistent with dysfunction of the corticospinal pathways rather than a lower motor unit."" Significant low extremity weakness.",Lower limb muscle weakness (HP:0007340); Upper motor neuron dysfunction (HP:0002493); Spasticity (HP:0001257)
4,spasticity/stiff legs,Spasticity (HP:0001257)
5,spastic diplegic cerebral palsy,Spasticity (HP:0001257); Cerebral palsy (HP:0100021)
6,"orobuccal dyspraxia, awkward gross and fine motricity, difficulty in coordination, unstable gait",Unsteady gait (HP:0002317); Oromotor apraxia (HP:0007301); Poor coordination (HP:0002370)
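<p>The effect of a custom map can be sketched as two passes over each cell: first look for the user-supplied phrases, then look for HPO labels. The code below is a simplified stand-in under that assumption and is not the pyphetools implementation; <tt>LABEL_TO_ID</tt> and <tt>map_cell</tt> are hypothetical names, and the ids are taken from the preview above.</p>

```python
# Hypothetical label table; in the notebook this role is played by hpo_cr.
LABEL_TO_ID = {
    'Spasticity': 'HP:0001257',
    'Cerebral palsy': 'HP:0100021',
    'Unsteady gait': 'HP:0002317',
    'Oromotor apraxia': 'HP:0007301',
    'Poor coordination': 'HP:0002370',
}


def map_cell(text, custom_map):
    """Return sorted 'Label (HP:id)' strings found via custom phrases or labels."""
    low = text.lower()
    hits = set()
    # Pass 1: phrases supplied by the user
    for phrase, label in custom_map.items():
        if phrase.lower() in low:
            hits.add(label)
    # Pass 2: primary labels of the (toy) ontology
    for label in LABEL_TO_ID:
        if label.lower() in low:
            hits.add(label)
    return sorted(f'{label} ({LABEL_TO_ID[label]})' for label in hits)


custom_map = {
    'unstable gait': 'Unsteady gait',
    'spastic': 'Spasticity',
    'orobuccal dyspraxia': 'Oromotor apraxia',
    'difficulty in coordination': 'Poor coordination',
}
print(map_cell('spastic diplegic cerebral palsy', custom_map))
```

<p>The short key 'spastic' is what lets 'spastic diplegic cerebral palsy' pick up <i>Spasticity</i> in addition to <i>Cerebral palsy</i>, matching row 5 of the preview.</p>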


<h3>Mapping yes/no columns: SimpleColumnMapper</h3>
<p>The 'DD' column only contains information about <i>Global developmental delay</i> (HP:0001263). We use
    the SimpleColumnMapper class for columns such as this that contain Yes/No information (and sometimes an indication that the item was not measured or information is not available).</p>

In [8]:
df['DD'].head()

0    yes
1    yes
2    yes
3    yes
4    yes
Name: DD, dtype: object

In [9]:
ddMapper = SimpleColumnMapper(hpo_id='HP:0001263',
    hpo_label='Global developmental delay',
    observed='yes',
    excluded='no')

In [10]:
ddMapper.preview_column(df['DD'])

Unnamed: 0,term,status
0,Global developmental delay (HP:0001263),observed
1,Global developmental delay (HP:0001263),observed
2,Global developmental delay (HP:0001263),observed
3,Global developmental delay (HP:0001263),observed
4,Global developmental delay (HP:0001263),observed
5,Global developmental delay (HP:0001263),observed
6,Global developmental delay (HP:0001263),observed
7,Global developmental delay (HP:0001263),observed
8,Global developmental delay (HP:0001263),observed
9,Global developmental delay (HP:0001263),observed
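<p>The logic of a yes/no column can be sketched in a few lines. The function below is a hypothetical stand-in for what a simple column mapper does, not the pyphetools code, and the status string 'not measured' for unrecognized cells is an assumption made for this example.</p>

```python
def map_yes_no(cell, hpo_id, hpo_label, observed='yes', excluded='no'):
    """Return (term, status); status is 'observed', 'excluded' or 'not measured'."""
    term = f'{hpo_label} ({hpo_id})'
    value = str(cell).strip().lower()
    if value == observed:
        return term, 'observed'
    if value == excluded:
        return term, 'excluded'
    # Anything else (empty cell, 'NA', free text) is treated as no information
    return term, 'not measured'


for cell in ['yes', 'no', 'NA']:
    print(map_yes_no(cell, 'HP:0001263', 'Global developmental delay'))
```

<p>Distinguishing 'excluded' from 'not measured' matters downstream, because only an explicit 'no' supports recording a feature as excluded in a phenopacket.</p>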


<h3>Adding manual mappings from a list of options: OptionColumnMapper</h3>
<p>In some cases, a column has just a few options that may need to be manually mapped. In our example, the column 'severity of ID' contains values that correspond to <i>Intellectual disability, severe</i> (HP:0010864), <i>Intellectual disability, moderate</i> (HP:0002342), and <i>Intellectual disability, mild</i> (HP:0001256).</p>
<p>We use the OptionColumnMapper class for such cases.</p>

In [13]:
df['severity of ID']

0            moderate\n(IQ 48)
1                       severe
2                     moderate
3                         mild
4     moderate\n(IQ 49 and 65)
5                         mild
6                         mild
7                       severe
8                     moderate
9                     moderate
10                    moderate
11                      severe
12           moderate\n(IQ 49)
Name: severity of ID, dtype: object

In [16]:
severity_d = {'moderate\n(IQ 48)': 'Intellectual disability, moderate',
              'moderate': 'Intellectual disability, moderate',
              'moderate\n(IQ 49)': 'Intellectual disability, moderate',
              'moderate\n(IQ 49 and 65)': 'Intellectual disability, moderate',  # value found in row 4
              'severe': 'Intellectual disability, severe',
              'mild': 'Intellectual disability, mild'}
severityOfIdMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=severity_d)
severityOfIdMapper.preview_column(df['severity of ID'])

IndexError: list index out of range
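<p>The column contains several spellings of the same option, for example 'moderate\n(IQ 48)' and 'moderate\n(IQ 49 and 65)', so listing every variant as a dictionary key is fragile. The sketch below shows one defensive alternative: normalize each cell before the lookup and return <tt>None</tt> for unknown options instead of failing. It is an illustration with hypothetical helper names, not a description of how OptionColumnMapper works, and it makes no claim about the cause of the error above.</p>

```python
def normalize(cell):
    """Keep only the text before the first newline, lower-cased and stripped."""
    return str(cell).split('\n', 1)[0].strip().lower()


SEVERITY_TO_LABEL = {
    'mild': 'Intellectual disability, mild',
    'moderate': 'Intellectual disability, moderate',
    'severe': 'Intellectual disability, severe',
}


def map_option(cell):
    """Return the HPO label for a cell, or None if the option is unknown."""
    return SEVERITY_TO_LABEL.get(normalize(cell))


cells = ['moderate\n(IQ 48)', 'severe', 'moderate\n(IQ 49 and 65)', 'mild']
for cell in cells:
    print(repr(cell), '->', map_option(cell))
```

<p>Normalizing in pandas first, for instance by deriving a cleaned column, would have the same effect before the values ever reach a mapper.</p>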

<h2>Collecting column mappings for the entire table</h2>
<p>pyphetools expects a dictionary whose keys are the column names used by the pandas DataFrame and whose values are the corresponding ColumnMapper objects. In the following, we create this dictionary and add a ColumnMapper object for each of the columns to be mapped.</p>

In [11]:
column_mapper_d = defaultdict(ColumnMapper)
column_mapper_d['neurological examination'] = neuroMapper
column_mapper_d['DD'] = ddMapper
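<p>Once every column of interest has a mapper, encoding the table amounts to walking over the rows and letting each mapper handle its own column. The sketch below illustrates that loop with plain functions and a small stand-in table whose values are illustrative (they are not the real data); the function names are hypothetical and the pyphetools classes are not used.</p>

```python
# Stand-in table: column name -> list of cell values, one per individual.
# Values are illustrative only.
table = {
    'neurological examination': ['ataxia', '', 'spastic paraplegia'],
    'DD': ['yes', 'yes', 'no'],
}


def neuro_mapper(cell):
    """Toy free-text mapper: recognizes two labels by exact match."""
    known = {'ataxia': 'Ataxia (HP:0001251)',
             'spastic paraplegia': 'Spastic paraplegia (HP:0001258)'}
    return [(known[cell], 'observed')] if cell in known else []


def dd_mapper(cell):
    """Toy yes/no mapper for Global developmental delay."""
    term = 'Global developmental delay (HP:0001263)'
    status = {'yes': 'observed', 'no': 'excluded'}.get(cell)
    return [(term, status)] if status else []


# Same shape as column_mapper_d: column name -> mapper for that column
column_mappers = {'neurological examination': neuro_mapper, 'DD': dd_mapper}

n_rows = len(table['DD'])
per_individual = []
for i in range(n_rows):
    features = []
    for column, mapper in column_mappers.items():
        features.extend(mapper(table[column][i]))
    per_individual.append(features)

for i, features in enumerate(per_individual):
    print(i, features)
```

<p>Each inner list holds the features of one individual and is the raw material for that individual's phenopacket.</p>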
