<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will use <a href="https://pubmed.ncbi.nlm.nih.gov/30945334/" target="__blank">Iwasawa S., et al. (2019) Recurrent de novo MAPK8IP3 variants cause neurological phenotypes</a> as an example.</p>
<p>pyphetools provides a convenient way of extracting HPO terms from the kind of tables commonly presented in supplemental material. Such tables can have the individuals in columns or in rows. In this case, we extract data from the table "Clinical Phenotype of Individuals With MAPK8IP3 Variants", in which data from five individuals are presented in columns, with the rows representing the category of clinical data.</p>
<p>This note shows how to work through the table and set up the pyphetools encoder. The table is available only as a PDF table in the original publication. We copied the information into an Excel file for this workbook.</p>
<p>Users can work on one column at a time and then generate a collection of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a> to represent each patient included in the original supplemental material. These phenopackets can then be used for a variety of downstream applications.</p>

In [7]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys

sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools import *

<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept a hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. If the file has already been downloaded, it is not downloaded again.</p>

In [8]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()

Length of valid_node_curies 16536


<h2>Importing the supplemental table</h2>
<p>The table of the Iwasawa et al. (2019) paper was copied into an Excel file that is included in the data subfolder.</p>
<p>Here, we use the pandas library to import this file. Note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook. pyphetools expects a pandas DataFrame as input. Users can choose any input format available for pandas, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [22]:
df = pd.read_excel('data/PMID_30945334.xlsx')
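<p>As a minimal sketch of the point that any pandas-readable format works, the following reads a small table from CSV text instead of Excel. The two-row excerpt and the names <tt>csv_text</tt> and <tt>csv_df</tt> are made up for illustration and are not part of the original notebook.</p>

```python
import io

import pandas as pd

# Hypothetical excerpt of the table, written as CSV text for illustration only.
csv_text = """identifier,Individual 1,Individual 2
Sex,Male,Female
Delayed motor development,+,+
"""

# dtype=str keeps every cell as text, so symbols such as '+' are left untouched.
csv_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# A TSV source could be read the same way by passing sep="\t".
print(csv_df)
```

<p>The resulting DataFrame has the same layout as the Excel import, with one column of row labels followed by one column per individual.</p>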

In [23]:
df

Unnamed: 0,identifier,Individual 1,Individual 2,Individual 3,Individual 4,Individual 5
0,"Variant (hg19, NM_015133.4)",c.1732C>T,c.1732C>T,c.1732C>T,c.3436C>T,c.3436C>T
1,Protein variant,(p.Arg578Cys),(p.Arg578Cys),(p.Arg578Cys),(p.Arg1146Cys),(p.Arg1146Cys)
2,Age (yr),29,27,16,5,5
3,Sex,Male,Female,Male,Male,Female
4,Gestational ages (weeks),39,40,40,36,41
5,Delayed motor development,+,+,+,+,+
6,Age at head control (months),2.5,3.5,4,5,5
7,Age at rolling (months),ND,11,6,7,6
8,Age at unsupported sitting (months),7,6,Not acquired,15,11
9,Age at crawling (months),Not acquired,11,ND,18,18


<h2>Column vs. row-based tables</h2>
<p>In the notebook "Create phenopackets from tabular data with individuals in rows" we show how to use the package for tables in which individuals are arranged in rows. In this notebook, we show how to use the package if the individuals are arranged in columns. The ColumnMapper classes of the package expect "pandas.core.series.Series" objects. When individuals are in columns, each clinical item occupies a row, so we set the index of the DataFrame to the column that holds the row labels and then use <tt>df.loc</tt> to get a Series object corresponding to a given row.</p>


In [24]:
df.set_index('identifier', inplace=True)
df.loc['Obesity']

Individual 1    +
Individual 2    +
Individual 3    +
Individual 4    −
Individual 5    −
Name: Obesity, dtype: object
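<p>The output above shows the negative entries as "−", which appears to be the Unicode minus sign (U+2212) rather than the ASCII hyphen "-". If the Excel file really contains that character, a mapper configured with an ASCII hyphen for excluded features might not match those cells. The sketch below shows one possible way to normalize the symbol. The small DataFrame <tt>demo</tt> is a stand-in built for this example, not the table loaded in this notebook.</p>

```python
import pandas as pd

# Stand-in for the indexed table; "\u2212" is the Unicode minus sign.
demo = pd.DataFrame(
    {
        "identifier": ["Obesity"],
        "Individual 1": ["+"],
        "Individual 4": ["\u2212"],
    }
).set_index("identifier")

# Replace every cell that consists of the Unicode minus sign with the ASCII hyphen.
normalized = demo.replace("\u2212", "-")

# One row of the table is a pandas Series with one entry per individual.
row = normalized.loc["Obesity"]
print(row)
```

<p>Normalizing once, directly after import, means that every mapper can use the same ASCII symbols.</p>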

<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cells we create a dictionary for the ColumnMappers. Note that the code is identical except that we use <tt>df.loc</tt> to get the corresponding row data rather than selecting a column.</p>

In [28]:
column_mapper_d = defaultdict(ColumnMapper)

In [30]:
delayedMotedMapper = SimpleColumnMapper(hpo_id='HP:0001270',
    hpo_label='Motor delay',
    observed='+',
    excluded='-')
delayedMotedMapper.preview_column(df.loc['Delayed motor development'])

Unnamed: 0,term,status
0,Motor delay (HP:0001270),observed
1,Motor delay (HP:0001270),observed
2,Motor delay (HP:0001270),observed
3,Motor delay (HP:0001270),observed
4,Motor delay (HP:0001270),observed


In [None]:
column_mapper_d['Delayed motor development'] = delayedMotedMapper
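<p>To illustrate why the dictionary is keyed by row label, the sketch below walks through a dictionary of that shape and fetches each row with <tt>df.loc</tt>. It does not use pyphetools: <tt>to_status</tt>, <tt>row_mapper_d</tt>, and <tt>demo</tt> are hypothetical stand-ins that only mimic the idea of translating table symbols into a status.</p>

```python
import pandas as pd

# Stand-in for the indexed table, using values from the rows shown earlier.
demo = pd.DataFrame(
    {
        "identifier": ["Sex", "Delayed motor development"],
        "Individual 1": ["Male", "+"],
        "Individual 2": ["Female", "+"],
    }
).set_index("identifier")


def to_status(cell, observed="+", excluded="-"):
    """Hypothetical helper (not part of pyphetools): map a table symbol to a status."""
    if cell == observed:
        return "observed"
    if cell == excluded:
        return "excluded"
    return "not measured"


# Keys are row labels, mirroring how column_mapper_d is keyed in this notebook.
row_mapper_d = {"Delayed motor development": to_status}

for row_label, mapper in row_mapper_d.items():
    series = demo.loc[row_label]     # one row as a Series indexed by individual
    statuses = series.map(mapper)    # apply the stand-in mapper to each cell
    print(row_label, statuses.to_dict())
```

<p>Because each key matches a row label, the same label serves both to look up the mapper and to retrieve the data it should be applied to.</p>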