<h1>Generate table from collection of phenopackets</h1>
<p>A common task when analyzing a cohort of individuals with pathogenic variants in a given gene is to generate a table that summarizes the findings. The pyphetools package can ingest a collection of phenopackets and generate several kinds of tables that may be useful for publications or supplementary material sections.</p>

In [1]:
import json
import os
import sys
from collections import defaultdict

import hpotk
import pandas as pd
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson, Parse, ParseDict
from phenopackets import Phenopacket

pd.set_option('display.max_colwidth', None)  # show entire column contents, important!

# Make the local pyphetools checkout importable
sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools.output import *

In [2]:
phenopacket_dir = "/Users/robinp/GIT/hrmd/notebooks/TRPM3/phenopackets"

In [3]:
# No longer needed: PhenopacketIngestor is given the directory directly (indir=phenopacket_dir)
phenopacket_paths = []
if not os.path.isdir(phenopacket_dir):
    raise ValueError(f"{phenopacket_dir} was not a directory")
for root, dirs, files in os.walk(phenopacket_dir):
    for file in files:
        if file.endswith(".json"):
            phenopacket_paths.append(os.path.join(root,file))
print(f"We extracted {len(phenopacket_paths)} GA4GH phenopackets")

We extracted 17 GA4GH phenopackets
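<p>The directory walk above could also be written with <code>pathlib</code>. The following is a minimal sketch, not part of pyphetools; the helper name <code>find_phenopacket_paths</code> and the demo file names are made up for illustration.</p>

```python
import tempfile
from pathlib import Path


def find_phenopacket_paths(directory):
    """Return a sorted list of all *.json files found below `directory`."""
    root = Path(directory)
    if not root.is_dir():
        raise ValueError(f"{directory} was not a directory")
    # rglob descends into subdirectories, like os.walk in the cell above
    return sorted(str(p) for p in root.rglob("*.json") if p.is_file())


# Self-contained demonstration on a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.json").write_text("{}")
    (Path(tmp) / "sub" / "b.json").write_text("{}")
    (Path(tmp) / "notes.txt").write_text("ignored")
    demo_paths = find_phenopacket_paths(tmp)
    print(f"We extracted {len(demo_paths)} GA4GH phenopackets")
```

<p>In the notebook, the call would be <code>find_phenopacket_paths(phenopacket_dir)</code>.</p>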


In [4]:
ingestor = PhenopacketIngestor(indir=phenopacket_dir)

In [5]:
patient_d = ingestor.get_patient_dictionary()
print(f"We got {len(patient_d)} phenopackets")

We got 17 phenopackets
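<p>To make concrete what is being tallied, here is a rough sketch that counts observed HPO terms directly from phenopacket JSON using only the standard library. It assumes the GA4GH JSON layout in which each entry of <code>phenotypicFeatures</code> has a <code>type</code> with an <code>id</code> and an optional <code>excluded</code> flag. The two phenopackets are made up, and this is not how pyphetools itself necessarily does the counting.</p>

```python
import json
from collections import Counter


def observed_hpo_ids(phenopacket):
    """Return the HPO ids of phenotypic features not marked as excluded."""
    features = phenopacket.get("phenotypicFeatures", [])
    return {f["type"]["id"] for f in features if not f.get("excluded", False)}


# Two made-up phenopackets in JSON form (for illustration only)
pp_json = [
    json.dumps({
        "id": "individual_A",
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001250", "label": "Seizure"}},
            {"type": {"id": "HP:0001252", "label": "Hypotonia"}, "excluded": True},
        ],
    }),
    json.dumps({
        "id": "individual_B",
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001250", "label": "Seizure"}},
            {"type": {"id": "HP:0001252", "label": "Hypotonia"}},
        ],
    }),
]

# Count in how many individuals each term was observed
term_counts = Counter()
for text in pp_json:
    term_counts.update(observed_hpo_ids(json.loads(text)))

for hpo_id, count in sorted(term_counts.items()):
    print(hpo_id, count)
```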


In [6]:
from hpotk.ontology import Ontology
from hpotk.ontology.load.obographs import load_ontology
if os.path.isfile('hpo_data/hp.json'):
    hpo_ontology = load_ontology('hpo_data/hp.json')
else:
    hpo_ontology = load_ontology('https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.json')

In [7]:
focus_id = 'PMID_35146895_Individual_5'
ftab = FocusCountTable(patient_d=patient_d, focus_id=focus_id, ontology=hpo_ontology)

In [8]:
df = ftab.get_simple_table()

In [9]:
pd.set_option('display.max_rows', None)
df

category,term,HP:id,focus,other,total,total_count
constitutional,Pain,HP:0012531,0,2,2/17 (11.8%),2
constitutional,Asthenia,HP:0025406,0,1,1/17 (5.9%),1
digestive,Gastroesophageal reflux,HP:0002020,0,2,2/17 (11.8%),2
digestive,Feeding difficulties,HP:0011968,1,0,1/17 (5.9%),1
digestive,Dysphagia,HP:0002015,0,1,1/17 (5.9%),1
digestive,Necrotizing enterocolitis,HP:0033165,0,1,1/17 (5.9%),1
ear,Hearing impairment,HP:0000365,0,1,1/17 (5.9%),1
ear,Low-set ears,HP:0000369,0,1,1/17 (5.9%),1
ear,Posteriorly rotated ears,HP:0000358,0,1,1/17 (5.9%),1
eye,Exotropia,HP:0000577,0,2,2/17 (11.8%),2


In [10]:
# df.set_index('category', inplace=True)
df.columns

Index(['term', 'HP:id', 'focus', 'other', 'total', 'total_count'], dtype='object')

In [11]:
df

category,term,HP:id,focus,other,total,total_count
constitutional,Pain,HP:0012531,0,2,2/17 (11.8%),2
constitutional,Asthenia,HP:0025406,0,1,1/17 (5.9%),1
digestive,Gastroesophageal reflux,HP:0002020,0,2,2/17 (11.8%),2
digestive,Feeding difficulties,HP:0011968,1,0,1/17 (5.9%),1
digestive,Dysphagia,HP:0002015,0,1,1/17 (5.9%),1
digestive,Necrotizing enterocolitis,HP:0033165,0,1,1/17 (5.9%),1
ear,Hearing impairment,HP:0000365,0,1,1/17 (5.9%),1
ear,Low-set ears,HP:0000369,0,1,1/17 (5.9%),1
ear,Posteriorly rotated ears,HP:0000358,0,1,1/17 (5.9%),1
eye,Exotropia,HP:0000577,0,2,2/17 (11.8%),2


In [12]:
df2 = ftab.get_thresholded_table(min_proportion=0.33)

Output terms with at least 6 counts
Could not find category for HP:0033127
Could not find category for HP:0000118
Could not find category for HP:0000001
Could not find category for HP:0000707
Could not find category for HP:0040064
Could not find category for HP:0000152
Could not find category for HP:0000478


In [13]:
ftab.get_thresholded_table(min_proportion=0.2)

Output terms with at least 3 counts
Could not find category for HP:0033127
Could not find category for HP:0025031
Could not find category for HP:0000118
Could not find category for HP:0000001
Could not find category for HP:0000707
Could not find category for HP:0040064
Could not find category for HP:0025142
Could not find category for HP:0000152
Could not find category for HP:0000478
Could not find category for HP:0001939


category,term,HP:id,focus,other,total,total_count
digestive,Abnormality of digestive system physiology,HP:0025032,1,3,4/17 (23.5%),4
digestive,Abnormality of the gastrointestinal tract,HP:0011024,0,3,3/17 (17.6%),3
digestive,Functional abnormality of the gastrointestinal tract,HP:0012719,0,3,3/17 (17.6%),3
eye,Abnormal eye physiology,HP:0012373,0,6,6/17 (35.3%),6
eye,Abnormality of eye movement,HP:0000496,0,4,4/17 (23.5%),4
eye,Abnormal conjugate eye movement,HP:0000549,0,3,3/17 (17.6%),3
eye,Strabismus,HP:0000486,0,3,3/17 (17.6%),3
head/neck,Abnormality of the head,HP:0000234,0,7,7/17 (41.2%),7
head/neck,Abnormality of the face,HP:0000271,0,6,6/17 (35.3%),6
head/neck,Abnormal skull morphology,HP:0000929,0,6,6/17 (35.3%),6
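<p>The messages "Output terms with at least 6 counts" (for <code>min_proportion=0.33</code>) and "Output terms with at least 3 counts" (for <code>min_proportion=0.2</code>) are both consistent with rounding <code>min_proportion * N</code> to the nearest integer for N = 17. The sketch below only reproduces that arithmetic; the function name is hypothetical and pyphetools may compute the threshold differently.</p>

```python
def min_count_for(min_proportion, n_individuals):
    """Assumed conversion: round the fractional count to the nearest integer."""
    return round(min_proportion * n_individuals)


n = 17  # number of phenopackets in this cohort
for proportion in (0.33, 0.2):
    # 0.33 * 17 = 5.61 and 0.2 * 17 = 3.4 before rounding
    print(f"min_proportion={proportion}: at least {min_count_for(proportion, n)} counts")
```

<p>Under this reading, a term seen in 3 of 17 individuals (17.6%) still passes <code>min_proportion=0.2</code>, which matches the rows shown in the table above.</p>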


In [14]:
# Rebuild the count table by hand; note that this relies on private attributes of FocusCountTable
N = len(patient_d)  # 17 phenopackets in this cohort
rows = []
min_count = 3  # skip terms with fewer than this many counts
for hpid, total_count in ftab._total_counts.items():
    if total_count < min_count:
        continue
    total_per = 100 * total_count / N
    total_s = f"{total_count}/{N} ({total_per:.1f}%)"  # e.g. "3/17 (17.6%)"
    hpterm = ftab._ontology.get_term(hpid)
    cat = ftab.get_category(termid=hpid)
    focus_count = ftab._focus_counts.get(hpid, 0)
    other_count = ftab._non_focus_counts.get(hpid, 0)
    d = {'category': cat, 'term': hpterm.name, 'HP:id': hpid, 'focus': focus_count,
         'other': other_count, 'total': total_s, 'total_count': total_count}
    rows.append(d)

In [15]:
df = pd.DataFrame(rows)
df.sort_values(['category', 'total_count'], ascending=[True, False], inplace=True)
df

index,category,term,HP:id,focus,other,total,total_count
12,head/neck,Macrocephaly,HP:0000256,0,3,3/17 (17.6%),3
13,head/neck,Frontal bossing,HP:0002007,0,3,3/17 (17.6%),3
0,limbs,Talipes equinovarus,HP:0001762,1,2,3/17 (17.6%),3
11,limbs,Hip dislocation,HP:0002827,0,3,3/17 (17.6%),3
9,musculoskeletal,Hypotonia,HP:0001252,5,6,11/17 (64.7%),11
5,nervous_system,Delayed ability to walk,HP:0031936,6,2,8/17 (47.1%),8
6,nervous_system,Delayed speech and language development,HP:0000750,6,2,8/17 (47.1%),8
3,nervous_system,Global developmental delay,HP:0001263,1,5,6/17 (35.3%),6
2,nervous_system,Seizure,HP:0001250,1,4,5/17 (29.4%),5
4,nervous_system,"Intellectual disability, severe",HP:0010864,4,0,4/17 (23.5%),4
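<p>The row-building and sorting logic can be tried without a <code>FocusCountTable</code> object. The sketch below is a standalone illustration using plain dictionaries; the function names are hypothetical, and the sample counts are copied from the tables shown in this notebook.</p>

```python
def format_fraction(count, n):
    """Format a count as 'count/n (percent%)' with one decimal place."""
    return f"{count}/{n} ({100 * count / n:.1f}%)"


def build_rows(terms, n, min_count):
    """Keep terms with at least `min_count` counts, sorted by category and
    then by descending total count (as in the sort_values call above)."""
    rows = []
    for hpo_id, (category, name, focus, other, total) in terms.items():
        if total < min_count:
            continue
        rows.append({
            "category": category, "term": name, "HP:id": hpo_id,
            "focus": focus, "other": other,
            "total": format_fraction(total, n), "total_count": total,
        })
    return sorted(rows, key=lambda r: (r["category"], -r["total_count"]))


# (category, term, focus, other, total), values taken from the outputs above
sample_terms = {
    "HP:0001250": ("nervous_system", "Seizure", 1, 4, 5),
    "HP:0001252": ("musculoskeletal", "Hypotonia", 5, 6, 11),
    "HP:0001263": ("nervous_system", "Global developmental delay", 1, 5, 6),
    "HP:0012531": ("constitutional", "Pain", 0, 2, 2),  # below min_count, dropped
}

demo_rows = build_rows(sample_terms, n=17, min_count=3)
for row in demo_rows:
    print(row["category"], row["term"], row["total"])
```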


In [16]:
hsa = HpoCategorySet(ontology=hpo_ontology)

In [18]:
hsa.get_category('HP:0000707')

Could not find category for HP:0000707


'not_found'

In [19]:
hsa.get_category('HP:3000025')

Could not find category for HP:3000025


'not_found'
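<p>Since the lookup returns the string <code>'not_found'</code> rather than raising an error, it can be handy to separate resolved from unresolved terms before building a table. The following is a small sketch; <code>split_by_category</code> and the toy lookup are hypothetical and merely stand in for a function such as <code>hsa.get_category</code>.</p>

```python
NOT_FOUND = "not_found"  # sentinel string returned in the outputs above


def split_by_category(term_ids, get_category):
    """Split term ids into {id: category} and a list of ids without a category."""
    categorized = {}
    uncategorized = []
    for term_id in term_ids:
        category = get_category(term_id)
        if category == NOT_FOUND:
            uncategorized.append(term_id)
        else:
            categorized[term_id] = category
    return categorized, uncategorized


# Toy lookup standing in for a real category resolver (an assumption for this demo)
toy_lookup = {"HP:0001250": "nervous_system"}
categorized, uncategorized = split_by_category(
    ["HP:0001250", "HP:0000707"],
    lambda term_id: toy_lookup.get(term_id, NOT_FOUND),
)
print(categorized)
print(uncategorized)
```

<p>In the notebook, one could pass <code>hsa.get_category</code> as the second argument.</p>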