<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/15007000/" target="__blank">Boutouyrie, et al. (2004) Increased carotid wall stress in vascular Ehlers-Danlos syndrome</a> (PMID:15007000).</p>
<p>We extract data from TABLE 1.</p>
<p>This note shows how to work through the table and set up the pyphetools encoder. The table was not available as a downloadable file; it was constructed from the data in the publication.</p>
<p>Users can work on one column at a time and then generate a collection of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a> to represent each patient included in the original supplemental material. These phenopackets can then be used for a variety of downstream applications.</p>
<p>Note that this publication does not include genetic data.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys
import numpy as np
from pyphetools.creation import *
from pyphetools.creation.simple_column_mapper import try_mapping_columns, get_separate_hpos_from_df
# last tested with pyphetools version 0.3.4

<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept an hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. It will not download the file again if it has already been downloaded.</p>
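<p>The download-if-missing behavior described above can be pictured with a minimal sketch. This is not the HpoParser implementation: the function <code>ensure_local_copy</code> and its <code>fetch</code> parameter are hypothetical names introduced only for illustration, and the sketch assumes that a file already present in the target directory is reused without contacting the network.</p>

```python
import os
import tempfile
import urllib.request


def ensure_local_copy(url, directory, filename, fetch=urllib.request.urlretrieve):
    """Return the path to a local copy of url, downloading only if it is missing.

    Hypothetical helper for illustration; not part of pyphetools.
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        # Not cached yet: fetch(url, path) is expected to write the file to path
        fetch(url, path)
    return path


# Demonstration with a stand-in fetcher, so that no network access is needed
calls = []


def fake_fetch(url, path):
    calls.append(url)
    with open(path, "w") as handle:
        handle.write("{}")


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, "hpo_data")
    url = "https://example.org/hp.json"  # placeholder URL, never contacted
    first = ensure_local_copy(url, cache_dir, "hp.json", fetch=fake_fetch)
    second = ensure_local_copy(url, cache_dir, "hp.json", fetch=fake_fetch)
    download_count = len(calls)  # the second call reuses the cached file
    same_path = first == second
```

<p>Passing the fetcher as a parameter keeps the caching logic testable offline; in real use the default <code>urllib.request.urlretrieve</code> would perform the download.</p>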

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155")
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import this file. Note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook. pyphetools expects a pandas DataFrame as input. Users can choose any input format supported by pandas, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('input/PMID_15007000.xlsx')

Some column names might include leading or trailing spaces, and a couple of columns are subheadings that contain only NaNs, so let's correct that:

In [4]:
df.columns = df.columns.str.strip()
df = df.dropna(axis=1, how='all')
df['patient_id'] = df.index
df

Unnamed: 0,Patients,Gender,"Age at Echo Examination, y","Age at First Complication, y",Arterial Dissection or Rupture,Organ Rupture,Family History of the Disease,Acrogeria,Excessive Bruising/ Hematoma,Thin Translucent Skin,patient_id
0,C.G.,F,36,28.0,"Carotid-cavernous fistula, iliac artery dissection",Uterus,Yes (son),++,+,++,0
1,I.A.,F,31,30.0,"AV fistula tibial artery, stroke",0,Yes (numerous),+++,+++,+,1
2,A.G.,M,26,24.0,Left vertebrobasilar dissection,"Colonic perforation, colectomy",No,+++,+++,+,2
3,J.S.,M,29,13.0,0,Total colectomy,Yes (mother),+,0,+,3
4,V.C.,F,22,19.0,0,2 Colonic perforations,No,+++,++,++,4
5,V. P.,F,31,28.0,0,Uterus,No,+++,++,+++,5
6,L.B.,M,14,13.0,Voluminous hematoma,0,No,+++,+++,++,6
7,E.Z.,F,19,,0,0,No,+,+,+,7
8,R.S.,F,51,50.0,Splenic artery dissection,0,No,+++,+++,+,8
9,J.C.,F,31,,Varicose,0,"Yes (father, aunt)",+,++,++,9


<h2>Column mappers</h2>
<p>Please see the <a href="https://monarch-initiative.github.io/pyphetools/" target="__blank">pyphetools documentation</a> for details on the software. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use the df.loc function to get the corresponding row data. For this notebook, we will not model severity, so we first make the table easier to parse by replacing all +++ and ++ with a single + and all 0 entries with -.</p>
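<p>One point to keep in mind is that <code>DataFrame.replace</code> called on the whole table acts on every column, including identifier columns that happen to contain the value 0. The following sketch uses a made-up toy table (the data and the name <code>symptom_cols</code> are assumptions for illustration only) to show the effect and one possible way to restrict the replacement to the symptom columns.</p>

```python
import pandas as pd

# Toy table, invented for illustration
toy = pd.DataFrame({
    "Acrogeria": ["++", "+++", "+"],
    "Organ Rupture": ["Uterus", 0, 0],
    "patient_id": [0, 1, 2],
})

# Replacing on the whole table also rewrites the identifier 0
whole = toy.replace("+++", "+").replace("++", "+").replace(0, "-")

# Restricting the replacement to selected columns leaves patient_id untouched
symptom_cols = ["Acrogeria", "Organ Rupture"]
scoped = toy.copy()
scoped[symptom_cols] = (
    scoped[symptom_cols].replace({"+++": "+", "++": "+"}).replace(0, "-")
)
```

<p>In <code>whole</code>, the first patient_id becomes '-', whereas <code>scoped</code> keeps the identifiers 0, 1, 2.</p>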

In [5]:
hpo_cr = parser.get_hpo_concept_recognizer()
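# Collapse the severity grades to a single '+' and encode 0 as '-' (excluded).
# Caution: replace(0, '-') applies to every column, so the patient_id 0 also becomes '-'.
df = df.replace('+++', '+').replace('++', '+').replace(0, '-')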
column_mapper_d, col_not_found = try_mapping_columns(df=df,
                                                    observed='+',
                                                    excluded='-',
                                                    hpo_cr=hpo_cr,
                                                    preview=True)
print(col_not_found)

                                term        status
0   Arterial dissection (HP:0005294)  not measured
1   Arterial dissection (HP:0005294)  not measured
2   Arterial dissection (HP:0005294)  not measured
3   Arterial dissection (HP:0005294)      excluded
4   Arterial dissection (HP:0005294)      excluded
5   Arterial dissection (HP:0005294)      excluded
6   Arterial dissection (HP:0005294)  not measured
7   Arterial dissection (HP:0005294)      excluded
8   Arterial dissection (HP:0005294)  not measured
9   Arterial dissection (HP:0005294)  not measured
10  Arterial dissection (HP:0005294)  not measured
11  Arterial dissection (HP:0005294)      excluded
12  Arterial dissection (HP:0005294)      excluded
13  Arterial dissection (HP:0005294)      excluded
14  Arterial dissection (HP:0005294)      excluded
15  Arterial dissection (HP:0005294)  not measured
                                term    status
0   Dermal translucency (HP:0010648)  observed
1   Dermal translucency (HP:0010648)  o

In [6]:
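# Caution: despite its name, AcrogeriaMapper does not encode acrogeria. It maps the
# 'Excessive Bruising/ Hematoma' column to Bruising susceptibility (HP:0000978), and its
# dictionary entry is overwritten by excessive_bruisingMapper below.
# This cell does not create a mapper for the 'Acrogeria' column.
AcrogeriaMapper = SimpleColumnMapper(hpo_id='HP:0000978',
    hpo_label='Bruising susceptibility',
    observed='+',
    excluded='-')
AcrogeriaMapper.preview_column(df['Excessive Bruising/ Hematoma'])
column_mapper_d['Excessive Bruising/ Hematoma'] = AcrogeriaMapper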

thinskinMapper = SimpleColumnMapper(hpo_id='HP:0010648',
    hpo_label='Dermal translucency',
    observed='+',
    excluded='-')
thinskinMapper.preview_column(df['Thin Translucent Skin'])
column_mapper_d['Thin Translucent Skin'] = thinskinMapper

excessive_bruisingMapper = SimpleColumnMapper(hpo_id='HP:0000978',
    hpo_label='Bruising susceptibility',
    observed='+',
    excluded='-')
excessive_bruisingMapper.preview_column(df['Excessive Bruising/ Hematoma'])
column_mapper_d['Excessive Bruising/ Hematoma'] = excessive_bruisingMapper

Let's generate draft mapper code automatically so that we can easily copy, paste, and adapt it.

In [7]:
hpo_cr = parser.get_hpo_concept_recognizer()
# For each column, print a draft option dictionary plus the mapper boilerplate around it
for y in range(df.shape[1]):
    temp_dict = {}
    for i in range(len(df)):
        # Skip single-character cells such as '+', '-', 'F', or 'M'
        if len(str(df.iloc[i, y])) > 1:
            for entry in str(df.iloc[i, y]).split(','):
                hpo_term = hpo_cr.parse_cell(entry.strip())
                if len(hpo_term) > 0:
                    temp_dict[entry.strip()] = hpo_term[0].label
                else:
                    # No HPO match: leave a marker to be replaced by hand
                    temp_dict[entry.strip()] = 'placeholder'
    # Column names with punctuation (e.g. a comma) yield names that are not valid
    # Python identifiers and need to be edited after pasting
    col_name = str(df.columns[y]).lower().replace(' ', '_')
    print(col_name + ' = ' + str(temp_dict).replace(',', ',\n'))
    print(col_name + 'Mapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=' + col_name + ')')
    print("print(" + col_name + 'Mapper' +  ".preview_column(df['" + str(df.columns[y]) + "']))")
    print("column_mapper_d['" + str(df.columns[y]) + "'] = " + col_name + "Mapper")
    print('')

patients = {'C.G.': 'placeholder',
 'I.A.': 'placeholder',
 'A.G.': 'placeholder',
 'J.S.': 'placeholder',
 'V.C.': 'placeholder',
 'V. P.': 'placeholder',
 'L.B.': 'placeholder',
 'E.Z.': 'placeholder',
 'R.S.': 'placeholder',
 'J.C.': 'placeholder',
 'P.C.': 'placeholder',
 'A.M.': 'placeholder',
 'V.L.': 'placeholder',
 'C.L.': 'placeholder',
 'M.D.': 'placeholder'}
patientsMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=patients)
print(patientsMapper.preview_column(df['Patients']))
column_mapper_d['Patients'] = patientsMapper

gender = {}
genderMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=gender)
print(genderMapper.preview_column(df['Gender']))
column_mapper_d['Gender'] = genderMapper

age_at_echo_examination,_y = {'36': 'placeholder',
 '31': 'placeholder',
 '26': 'placeholder',
 '29': 'placeholder',
 '22': 'placeholder',
 '14': 'placeholder',
 '19': 'placeholder',
 '51': 'placeholder',
 '46': 'placeholder',
 '17': 'placeholder',
 '32': 'placehold
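<p>The generated name <code>age_at_echo_examination,_y</code> above contains a comma and therefore cannot be used as a variable name without editing. A minimal sketch of one way to derive safe names from column headers follows; <code>to_identifier</code> is a hypothetical helper introduced here for illustration and is not part of pyphetools.</p>

```python
import keyword
import re


def to_identifier(column_name):
    """Turn a column header into a valid, lower-case Python identifier.

    Hypothetical helper for illustration; not part of pyphetools.
    """
    # Replace every run of non-word characters (spaces, commas, slashes) with '_'
    name = re.sub(r"\W+", "_", column_name.strip().lower()).strip("_")
    if not name or name[0].isdigit():
        # Identifiers may not be empty or start with a digit
        name = "col_" + name
    if keyword.iskeyword(name):
        # Avoid clashes with reserved words such as 'class'
        name += "_"
    return name


headers = ["Age at Echo Examination, y", "Excessive Bruising/ Hematoma", "Gender"]
names = [to_identifier(h) for h in headers]
```

<p>With these headers, <code>names</code> is <code>['age_at_echo_examination_y', 'excessive_bruising_hematoma', 'gender']</code>.</p>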

In [8]:
arterial_dissection_or_rupture = {'Carotid-cavernous fistula': 'Carotid cavernous fistula',
 'iliac artery dissection': 'Arterial dissection',
 'AV fistula tibial artery': 'Arteriovenous fistula',
 'stroke': 'Stroke',
 'Left vertebrobasilar dissection': 'Arterial dissection',
 'Voluminous hematoma': 'Spontaneous hematomas',
 'Splenic artery dissection': 'Arterial dissection',
 'Varicose': 'Varicose veins',
 'Iliac artery dissection': 'Arterial dissection',
 'Mesenteric artery aneurysm': 'Vascular dilatation',
 'renal artery dissection': 'Arterial dissection'}
arterial_dissection_or_ruptureMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=arterial_dissection_or_rupture)
#print(arterial_dissection_or_ruptureMapper.preview_column(df['Arterial Dissection or Rupture']))
column_mapper_d['Arterial Dissection or Rupture'] = arterial_dissection_or_ruptureMapper

organ_rupture = {'Uterus': 'Uterine rupture',
 'Colonic perforation': 'Colon perforation',
 'colectomy': 'Colon perforation',
 'Total colectomy': 'Colon perforation',
 '2 Colonic perforations': 'Colon perforation',
 'Colectomy': 'Colon perforation'}
organ_ruptureMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=organ_rupture)
#print(organ_ruptureMapper.preview_column(df['Organ Rupture']))
column_mapper_d['Organ Rupture'] = organ_ruptureMapper

                                                                                                                              terms
0                                        HP:0031157 (Carotid cavernous fistula/observed); HP:0005294 (Arterial dissection/observed)
1                                                         HP:0004947 (Arteriovenous fistula/observed); HP:0001297 (Stroke/observed)
2                                                                                         HP:0005294 (Arterial dissection/observed)
3                                                                                                                               n/a
4                                                                                                                               n/a
5                                                                                                                               n/a
6                                                                           
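<p>The option dictionaries above list some entries twice with different capitalization (for example 'colectomy' and 'Colectomy'). The following sketch illustrates the general idea of splitting a comma-separated cell and looking up each entry without regard to case. It uses a plain dictionary and a hypothetical function <code>map_cell</code>; it is not a description of how OptionColumnMapper works internally.</p>

```python
def map_cell(cell, option_d):
    """Return the distinct labels found for the comma-separated entries of a cell.

    Hypothetical helper for illustration; lookup ignores case and outer whitespace.
    """
    lookup = {key.casefold(): label for key, label in option_d.items()}
    labels = []
    for entry in str(cell).split(","):
        label = lookup.get(entry.strip().casefold())
        if label is not None and label not in labels:
            labels.append(label)
    return labels


# A subset of the dictionary used above
options = {
    "Carotid-cavernous fistula": "Carotid cavernous fistula",
    "iliac artery dissection": "Arterial dissection",
    "Splenic artery dissection": "Arterial dissection",
}

row0 = map_cell("Carotid-cavernous fistula, iliac artery dissection", options)
capitalized = map_cell("Iliac artery dissection", options)
no_finding = map_cell("-", options)
```

<p>Here <code>row0</code> contains both labels, <code>capitalized</code> still resolves to 'Arterial dissection', and <code>no_finding</code> is an empty list.</p>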

<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [9]:
ageMapper = AgeColumnMapper.by_year('Age at Echo Examination, y')
ageMapper.preview_column(df['Age at Echo Examination, y']).head(2)

Unnamed: 0,original column contents,age
0,36,P36Y
1,31,P31Y
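<p>The preview shows ages in years rendered as ISO 8601 durations such as P36Y. A minimal sketch of that conversion is given below. The function <code>years_to_iso8601</code> is a hypothetical stand-in, not the AgeColumnMapper code, and it assumes that missing ages (None or NaN) should yield no duration.</p>

```python
import math


def years_to_iso8601(value):
    """Format an age in whole years as an ISO 8601 duration, or None if missing.

    Hypothetical helper for illustration; fractional years are truncated.
    """
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return f"P{int(value)}Y"


ages = [years_to_iso8601(v) for v in (36, 31.0, float("nan"))]
```

<p>The resulting list is <code>['P36Y', 'P31Y', None]</code>.</p>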


In [10]:
sexMapper = SexColumnMapper(male_symbol='M', female_symbol='F', column_name='Gender')
sexMapper.preview_column(df['Gender']).head(2)

Unnamed: 0,original column contents,sex
0,F,FEMALE
1,F,FEMALE


In [11]:
pmid = "PMID:15007000"
encoder = CohortEncoder(df=df, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d,
                        individual_column_name="patient_id", agemapper=ageMapper, sexmapper=sexMapper,
                        metadata=metadata, pmid=pmid)
# Disease identifiers should be CURIEs, hence the OMIM prefix
encoder.set_disease(disease_id='OMIM:130050', label='Ehlers-Danlos syndrome, vascular type')

In [12]:
individuals = encoder.get_individuals()

In [13]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "-",
  "subject": {
    "id": "-",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P36Y"
      }
    },
    "sex": "FEMALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0031157",
        "label": "Carotid cavernous fistula"
      }
    },
    {
      "type": {
        "id": "HP:0005294",
        "label": "Arterial dissection"
      }
    },
    {
      "type": {
        "id": "HP:0010648",
        "label": "Dermal translucency"
      }
    },
    {
      "type": {
        "id": "HP:0000978",
        "label": "Bruising susceptibility"
      }
    },
    {
      "type": {
        "id": "HP:0100718",
        "label": "Uterine rupture"
      }
    }
  ],
  "metaData": {
    "created": "2023-07-02T12:36:54.612121343Z",
    "createdBy": "ORCID:0000-0002-5648-2155",
    "resources": [
      {
        "id": "geno",
        "name": "Genotype Ontology",
        "url": "http://purl.obolibrary.org/obo/geno.owl",
        "version": "2022-0

In [14]:
output_directory = "../../phenopackets/COL3A1/"
encoder.output_phenopackets(outdir=output_directory)

Wrote 16 phenopackets to ../../phenopackets/COL3A1/
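<p>After writing the files, a quick sanity check can be useful. The sketch below assumes that each phenopacket is stored as one file with a .json extension in the output directory (an assumption about the output layout), reads every file with the standard library, and reports identifiers that occur more than once. The function <code>summarize_directory</code> is hypothetical, and the demonstration uses made-up files in a temporary directory rather than the real output.</p>

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_directory(directory):
    """Return (observed labels per id, ids that occur more than once).

    Hypothetical helper for illustration; expects one phenopacket per .json file.
    """
    labels_by_id = {}
    id_counts = Counter()
    for path in sorted(Path(directory).glob("*.json")):
        packet = json.loads(path.read_text())
        id_counts[packet["id"]] += 1
        # Keep only features that are not flagged as excluded
        labels_by_id[packet["id"]] = [
            feature["type"]["label"]
            for feature in packet.get("phenotypicFeatures", [])
            if not feature.get("excluded", False)
        ]
    repeated = sorted(pid for pid, count in id_counts.items() if count > 1)
    return labels_by_id, repeated


# Demonstration with made-up files
with tempfile.TemporaryDirectory() as tmp:
    demo = [
        ("a.json", {"id": "1", "phenotypicFeatures": [
            {"type": {"id": "HP:0100718", "label": "Uterine rupture"}}]}),
        ("b.json", {"id": "2", "phenotypicFeatures": [
            {"type": {"id": "HP:0005294", "label": "Arterial dissection"}, "excluded": True}]}),
        ("c.json", {"id": "2"}),
    ]
    for name, packet in demo:
        Path(tmp, name).write_text(json.dumps(packet))
    labels_by_id, repeated = summarize_directory(tmp)
```

<p>In the demonstration, <code>repeated</code> is <code>['2']</code> because two files share that identifier. A check of this kind would also reveal an unintended identifier such as "-".</p>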
