<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/15007000/" target="__blank">Boutouyrie, et al. (2004) Increased carotid wall stress in vascular Ehlers-Danlos syndrome</a> (PMID:15007000).</p>
<p>Note that this publication does not report variants.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys
import numpy as np
from pyphetools.creation import *
from pyphetools.visualization import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.6.3


<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser accepts an optional hpo_json_file argument if you want to use a specific file. If the argument is not passed, it downloads the latest hp.json file from the HPO GitHub site and stores it in a new subdirectory called hpo_data. If the file has already been downloaded, it is not downloaded again.</p>
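<p>The "download only if not already present" behavior described above can be illustrated with a minimal sketch. This is not the pyphetools implementation: the helper name <code>ensure_cached</code> and the injected <code>fetch</code> callable are hypothetical, and the example writes a dummy payload to a temporary directory instead of contacting the HPO GitHub site.</p>

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write fetch() to `path` unless the file already exists.

    Returns True if the file was fetched, False if the cached copy was reused.
    (Hypothetical helper for illustration only.)
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True


# Demonstrate with a temporary directory and a dummy payload (no network access).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hpo_data" / "hp.json"
    first = ensure_cached(target, lambda: b"{}")   # file is missing, so it is "fetched"
    second = ensure_cached(target, lambda: b"{}")  # file exists, so the cached copy is reused
```

<p>Injecting the fetch step as a callable keeps the caching logic testable without a network connection.</p>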

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
PMID = "PMID:15007000"
title = "Increased carotid wall stress in vascular Ehlers-Danlos syndrome"
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155", pmid=PMID, pubmed_title=title)
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import the supplemental table. The Python package openpyxl must be installed for pandas to read Excel files, although it does not need to be imported in this notebook. pyphetools expects a pandas DataFrame as input, so users can choose any input format that pandas supports, including CSV, TSV, and Excel, or use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('input/PMID_15007000.xlsx')

Some column names might have leading or trailing spaces, and a couple of columns are subheadings that contain only NaNs, so let's correct that:

In [4]:
import re
# We add an index to the id because two patients have the same initials
indx = 0
def clean_id(patient_id):
    pat_id = re.sub(r"[\s\.]", "", str(patient_id))
    global indx
    indx += 1
    return f"{pat_id}_{indx}"
df.columns = df.columns.str.strip()
df = df.dropna(axis=1, how='all')
df['patient_id'] = df['Patients'].transform(lambda x: clean_id(x) )
df

Unnamed: 0,Patients,Gender,"Age at Echo Examination, y","Age at First Complication, y",Arterial Dissection or Rupture,Organ Rupture,Family History of the Disease,Acrogeria,Excessive Bruising/ Hematoma,Thin Translucent Skin,patient_id
0,C.G.,F,36,28.0,"Carotid-cavernous fistula, iliac artery dissection",Uterus,Yes (son),++,+,++,CG_1
1,I.A.,F,31,30.0,"AV fistula tibial artery, stroke",0,Yes (numerous),+++,+++,+,IA_2
2,A.G.,M,26,24.0,Left vertebrobasilar dissection,"Colonic perforation, colectomy",No,+++,+++,+,AG_3
3,J.S.,M,29,13.0,0,Total colectomy,Yes (mother),+,0,+,JS_4
4,V.C.,F,22,19.0,0,2 Colonic perforations,No,+++,++,++,VC_5
5,V. P.,F,31,28.0,0,Uterus,No,+++,++,+++,VP_6
6,L.B.,M,14,13.0,Voluminous hematoma,0,No,+++,+++,++,LB_7
7,E.Z.,F,19,,0,0,No,+,+,+,EZ_8
8,R.S.,F,51,50.0,Splenic artery dissection,0,No,+++,+++,+,RS_9
9,J.C.,F,31,,Varicose,0,"Yes (father, aunt)",+,++,++,JC_10
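<p>The ID clean-up above relies on a global counter, so re-running the cell keeps incrementing the suffix. A minimal sketch of the same idea without global state is shown below; the function name <code>clean_ids</code> is an assumption for illustration and is not part of pyphetools.</p>

```python
import re

# Matches whitespace and periods, the same characters removed in the cell above.
_STRIP = re.compile(r"[\s.]")


def clean_ids(raw_ids):
    """Strip whitespace and periods, then append a 1-based running index.

    The index keeps IDs unique even when two patients share the same initials.
    """
    return [f"{_STRIP.sub('', str(raw))}_{i}" for i, raw in enumerate(raw_ids, start=1)]


cleaned = clean_ids(["C.G.", "I.A.", "V. P."])
print(cleaned)  # ['CG_1', 'IA_2', 'VP_3']
```

<p>Because the function takes the whole sequence at once, calling it twice on the same input gives the same result.</p>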


<h2>Column mappers</h2>
<p>Please see the <a href="https://monarch-initiative.github.io/pyphetools/" target="__blank">pyphetools documentation</a> for details on the software. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use the df.loc function to get the corresponding row data. For this notebook, we will not model severity, so we first make the table easier to parse by replacing all +++ and ++ with a single +.</p>

In [9]:
hpo_cr = parser.get_hpo_concept_recognizer()
df = df.replace('+++', '+').replace('++', '+').replace(0, '-')
generator = SimpleColumnMapperGenerator(df=df, observed='+', excluded='-', hpo_cr=hpo_cr)
column_mapper_d = generator.try_mapping_columns()
print(generator.get_unmapped_columns())


['Patients', 'Gender', 'Age at Echo Examination, y', 'Age at First Complication, y', 'Arterial Dissection or Rupture', 'Organ Rupture', 'Family History of the Disease', 'Acrogeria', 'Excessive Bruising/ Hematoma', 'Thin Translucent Skin', 'patient_id']
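<p>To make the severity-collapsing step concrete, here is a minimal per-cell sketch in plain Python. The helper name <code>normalize_cell</code> is hypothetical; the notebook itself performs the equivalent substitution with <code>DataFrame.replace</code>.</p>

```python
def normalize_cell(value):
    """Collapse graded '+' marks to a single '+', and map the number 0 to '-'.

    All other values (free text, ages, and so on) are returned unchanged.
    """
    if isinstance(value, str) and value.strip() in {"+", "++", "+++"}:
        return "+"
    # Exclude booleans, because False == 0 in Python.
    if not isinstance(value, (str, bool)) and value == 0:
        return "-"
    return value


row = ["+++", "++", "+", 0, "Uterus", 36]
normalized = [normalize_cell(v) for v in row]
print(normalized)  # ['+', '+', '+', '-', 'Uterus', 36]
```

<p>Note that replacing 0 across the whole table, as the cell above does, also affects any other column that happens to contain the number 0.</p>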


In [6]:
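# NOTE: despite its name, this mapper encodes the 'Excessive Bruising/ Hematoma'
# column (HP:0000978), not the 'Acrogeria' column. It duplicates
# excessive_bruisingMapper below, which overwrites the same dictionary entry.
AcrogeriaMapper = SimpleColumnMapper(hpo_id='HP:0000978',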
    hpo_label='Bruising susceptibility',
    observed='+',
    excluded='-')
AcrogeriaMapper.preview_column(df['Excessive Bruising/ Hematoma'])
column_mapper_d['Excessive Bruising/ Hematoma'] = AcrogeriaMapper

thinskinMapper = SimpleColumnMapper(hpo_id='HP:0010648',
    hpo_label='Dermal translucency',
    observed='+',
    excluded='-')
thinskinMapper.preview_column(df['Thin Translucent Skin'])
column_mapper_d['Thin Translucent Skin'] = thinskinMapper

excessive_bruisingMapper = SimpleColumnMapper(hpo_id='HP:0000978',
    hpo_label='Bruising susceptibility',
    observed='+',
    excluded='-')
excessive_bruisingMapper.preview_column(df['Excessive Bruising/ Hematoma'])
column_mapper_d['Excessive Bruising/ Hematoma'] = excessive_bruisingMapper

<h2>Autoformatting code for the OptionColumnMapper</h2>
<p>Let's generate autoformatted code that we can easily copy, paste, and adapt. Mappers for some of the columns can be created in this way.</p>

In [10]:
output = OptionColumnMapper.autoformat(df=df, concept_recognizer=hpo_cr)
print(output)

age_at_echo_examination_y_d = {'36': 'PLACEHOLDER',
 '31': 'PLACEHOLDER',
 '26': 'PLACEHOLDER',
 '29': 'PLACEHOLDER',
 '22': 'PLACEHOLDER',
 '14': 'PLACEHOLDER',
 '19': 'PLACEHOLDER',
 '51': 'PLACEHOLDER',
 '46': 'PLACEHOLDER',
 '17': 'PLACEHOLDER',
 '32': 'PLACEHOLDER',
 '45': 'PLACEHOLDER'}
age_at_echo_examination_yMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=age_at_echo_examination_y_d)
age_at_echo_examination_yMapper.preview_column(df['Age at Echo Examination, y'])
column_mapper_d['Age at Echo Examination, y'] = age_at_echo_examination_yMapper

age_at_first_complication_y_d = {'28.0': 'PLACEHOLDER',
 '30.0': 'PLACEHOLDER',
 '24.0': 'PLACEHOLDER',
 '13.0': 'PLACEHOLDER',
 '19.0': 'PLACEHOLDER',
 'nan': 'PLACEHOLDER',
 '50.0': 'PLACEHOLDER',
 '37.0': 'PLACEHOLDER'}
age_at_first_complication_yMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=age_at_first_complication_y_d)
age_at_first_complication_yMapper.preview_column(df['Age at First Complication, y

In [11]:
arterial_dissection_or_rupture = {'Carotid-cavernous fistula': 'Carotid cavernous fistula',
 'iliac artery dissection': 'Arterial dissection',
 'AV fistula tibial artery': 'Arteriovenous fistula',
 'stroke': 'Stroke',
 'Left vertebrobasilar dissection': 'Arterial dissection',
 'Voluminous hematoma': 'Spontaneous hematomas',
 'Splenic artery dissection': 'Arterial dissection',
 'Varicose': 'Varicose veins',
 'Iliac artery dissection': 'Arterial dissection',
 'Mesenteric artery aneurysm': 'Vascular dilatation',
 'renal artery dissection': 'Arterial dissection'}
arterial_dissection_or_ruptureMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=arterial_dissection_or_rupture)
#print(arterial_dissection_or_ruptureMapper.preview_column(df['Arterial Dissection or Rupture']))
column_mapper_d['Arterial Dissection or Rupture'] = arterial_dissection_or_ruptureMapper

organ_rupture = {'Uterus': 'Uterine rupture',
 'Colonic perforation': 'Colon perforation',
 'colectomy': 'Colon perforation',
 'Total colectomy': 'Colon perforation',
 '2 Colonic perforations': 'Colon perforation',
 'Colectomy': 'Colon perforation'}
organ_ruptureMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=organ_rupture)
#print(organ_ruptureMapper.preview_column(df['Organ Rupture']))
column_mapper_d['Organ Rupture'] = organ_ruptureMapper
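<p>The option dictionaries above contain one key per comma-separated fragment of a cell, with the original capitalization (for example, both 'iliac artery dissection' and 'Iliac artery dissection'). The sketch below illustrates why such keys are useful when a single cell lists several findings. It is an assumption-laden illustration, not the OptionColumnMapper implementation; the function <code>map_cell</code> and its exact-match lookup are hypothetical.</p>

```python
def map_cell(cell, option_d, sep=","):
    """Return the HPO labels for each separator-delimited item found in option_d.

    Items are matched exactly after trimming whitespace; unknown items are skipped
    and duplicate labels are reported only once.
    """
    labels = []
    for item in str(cell).split(sep):
        label = option_d.get(item.strip())
        if label is not None and label not in labels:
            labels.append(label)
    return labels


option_d = {
    "Carotid-cavernous fistula": "Carotid cavernous fistula",
    "iliac artery dissection": "Arterial dissection",
    "Left vertebrobasilar dissection": "Arterial dissection",
}

hits = map_cell("Carotid-cavernous fistula, iliac artery dissection", option_d)
misses = map_cell("-", option_d)
print(hits)    # ['Carotid cavernous fistula', 'Arterial dissection']
print(misses)  # []
```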

<h2>Demographic data</h2>

In [12]:
ageMapper = AgeColumnMapper.by_year('Age at Echo Examination, y')
ageMapper.preview_column(df['Age at Echo Examination, y']).head(2)

Unnamed: 0,original column contents,age
0,36,P36Y
1,31,P31Y
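<p>The preview shows ages in years rendered as ISO 8601 durations such as P36Y. A minimal sketch of that conversion follows; the function <code>years_to_iso8601</code> is hypothetical and is not claimed to match how AgeColumnMapper works internally.</p>

```python
import math


def years_to_iso8601(years):
    """Format a whole number of years as an ISO 8601 duration, e.g. 36 -> 'P36Y'.

    Missing values (None or NaN) yield None. Fractional years are truncated,
    which is a simplifying assumption of this sketch.
    """
    if years is None or (isinstance(years, float) and math.isnan(years)):
        return None
    return f"P{int(years)}Y"


print(years_to_iso8601(36))    # P36Y
print(years_to_iso8601(28.0))  # P28Y
```

<p>Handling NaN explicitly matters for columns such as 'Age at First Complication, y', where some individuals have no value.</p>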


In [13]:
sexMapper = SexColumnMapper(male_symbol='M', female_symbol='F', column_name='Gender')
sexMapper.preview_column(df['Gender']).head(2)

Unnamed: 0,original column contents,sex
0,F,FEMALE
1,F,FEMALE


In [15]:
encoder = CohortEncoder(df=df, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        metadata=metadata,  
                        pmid=PMID)
encoder.set_disease(disease_id='OMIM:130050', label='Ehlers-Danlos syndrome, vascular type')

In [16]:
individuals = encoder.get_individuals()

In [17]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "PMID_15007000_CG_1",
  "subject": {
    "id": "CG_1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P36Y"
      }
    },
    "sex": "FEMALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0005294",
        "label": "Arterial dissection"
      }
    },
    {
      "type": {
        "id": "HP:0031157",
        "label": "Carotid cavernous fistula"
      }
    },
    {
      "type": {
        "id": "HP:0100718",
        "label": "Uterine rupture"
      }
    }
  ],
  "metaData": {
    "created": "2023-10-01T01:27:59.727237224Z",
    "createdBy": "ORCID:0000-0002-5648-2155",
    "resources": [
      {
        "id": "geno",
        "name": "Genotype Ontology",
        "url": "http://purl.obolibrary.org/obo/geno.owl",
        "version": "2022-03-05",
        "namespacePrefix": "GENO",
        "iriPrefix": "http://purl.obolibrary.org/obo/GENO_"
      },
      {
        "id": "hgnc",
        "name": "HUGO Gene Nomenclature Committee",

In [14]:
output_directory = "../../phenopackets/COL3A1/"
encoder.output_phenopackets(outdir=output_directory)

Wrote 16 phenopackets to ../../phenopackets/COL3A1/
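<p>Once written, a phenopacket is ordinary JSON and can be inspected with the standard library. The sketch below parses a trimmed copy of the first phenopacket printed above and lists its observed HPO terms. The helper <code>summarize</code> is hypothetical; the field names are the ones that appear in the JSON output, plus the optional <code>excluded</code> flag of a phenotypic feature.</p>

```python
import json

# Trimmed copy of the phenopacket printed earlier (metaData omitted).
phenopacket_json = """
{
  "id": "PMID_15007000_CG_1",
  "subject": {
    "id": "CG_1",
    "timeAtLastEncounter": {"age": {"iso8601duration": "P36Y"}},
    "sex": "FEMALE"
  },
  "phenotypicFeatures": [
    {"type": {"id": "HP:0005294", "label": "Arterial dissection"}},
    {"type": {"id": "HP:0031157", "label": "Carotid cavernous fistula"}},
    {"type": {"id": "HP:0100718", "label": "Uterine rupture"}}
  ]
}
"""


def summarize(packet):
    """Return (phenopacket id, sex, list of observed HPO labels)."""
    observed = [
        feature["type"]["label"]
        for feature in packet.get("phenotypicFeatures", [])
        if not feature.get("excluded", False)  # skip features marked as excluded
    ]
    return packet["id"], packet["subject"].get("sex"), observed


summary = summarize(json.loads(phenopacket_json))
print(summary)
```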
