<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/11179005/" target="__blank">Ferrante, et al. (2001) Identification of the gene for oral-facial-digital type I syndrome</a> in this notebook.</p>
<p>pyphetools provides a convenient way of extracting HPO terms from typical tables presented in supplemental material. Typical tables can have the individuals in columns or rows. In this case, we extract data from TABLE.</p>
<p>This notebook shows how to work through the table and set up the pyphetools encoder. The table was not provided as such in the original publication, but was constructed from the data it reports.</p>
<p>Users can work on one column at a time and then generate a collection of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a> to represent each patient included in the original supplemental material. These phenopackets can then be used for a variety of downstream applications.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
from pyphetools.creation import *
from pyphetools.visualization import *
import importlib.metadata
__version__ = importlib.metadata.version("pyphetools")
print(f"Using pyphetools version {__version__}")

Using pyphetools version 0.5.8


<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept an hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. It will not download the file again if it is already present.</p>
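<p>The "download only if missing" behaviour described above can be pictured with the following minimal sketch. It is not the pyphetools implementation: the helper name <code>ensure_cached</code> and the injected <code>fetch</code> callable are hypothetical, and only the <code>hpo_data</code> directory and <code>hp.json</code> file names come from the text.</p>

```python
from pathlib import Path
from typing import Callable


def ensure_cached(directory: str, filename: str, fetch: Callable[[], bytes]) -> Path:
    """Return the path of a cached file, calling fetch() only if it is missing.

    Hypothetical helper for illustration; pyphetools may do this differently.
    """
    target = Path(directory) / filename
    if target.exists():
        # The file is already present: reuse it and skip the download.
        return target
    # Create the cache directory (e.g. "hpo_data") on first use.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return target
```

<p>Passing the download step in as a callable keeps the caching logic separate from any network access, so it can be tried out without an internet connection.</p>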

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
pmid = "PMID:11179005"
title = "Identification of the gene for oral-facial-digital type I syndrome"
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155", pmid=pmid, pubmed_title=title)
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import this file. Note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook. pyphetools expects a pandas DataFrame as input. Users can choose any input format that pandas supports, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('input/PMID_11179005.xlsx')
df['Individual'] = df['(CASE)']

In [4]:
df

Unnamed: 0,(CASE),sex,age,variant,Downslanted palpebral fissures,Dolichocephaly,Facial asymmetry,Localized skin lesion,Dental anomalies,Oral cleft,...,Dry hair,Coarse hair,Alopecia,Short 2nd toe,Syndactyly,Polycystic kidney dysplasia,Polydactyly,Clinodactyly,Brachydactyly,Individual
0,1 (F),male,1,c.1303A>C,+,+,-,+,+,+,...,-,+,+,-,-,-,-,-,-,1 (F)
1,3 (F),male,3,c.312delG,-,-,-,-,-,+,...,-,-,-,+,-,+,+,,-,3 (F)
2,4 (F),male,4,c.294_312delTGGTTTGGCAAAAGAAAAG,-,-,+,-,-,+,...,+,-,+,-,+,+,-,+,-,4 (F)
3,6 (S),male,6,c.121C>T,-,-,-,+,-,+,...,-,-,-,-,+,-,-,-,-,6 (S)
4,10 (S),male,10,c.1071_1078delGAAGGATG,-,-,,,-,+,...,-,,,-,+,-,-,-,-,10 (S)
5,27 (S),male,27,c.312+2del,-,-,-,+,+,+,...,-,-,-,-,-,+,-,-,-,27 (S)
6,28 (S),male,28,c.1757delG,-,-,-,+,+,+,...,-,-,+,-,-,-,-,+,+,28 (S)


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use the df.loc function to get the corresponding row data.</p>

In [5]:
hpo_cr = parser.get_hpo_concept_recognizer()
generator = SimpleColumnMapperGenerator(df=df, observed="+", excluded="-", hpo_cr=hpo_cr)
column_mapper_d = generator.try_mapping_columns()

In [6]:
print(generator.get_mapped_columns())

['Downslanted palpebral fissures', 'Dolichocephaly', 'Facial asymmetry', 'Localized skin lesion', 'Dental anomalies', 'Oral cleft', 'Brain imaging abnormality', 'Developmental delay', 'Hepatic cysts', 'Dry hair', 'Coarse hair', 'Alopecia', 'Short 2nd toe', 'Syndactyly', 'Polycystic kidney dysplasia', 'Polydactyly', 'Clinodactyly', 'Brachydactyly']


In [7]:
print(generator.get_unmapped_columns())

['(CASE)', 'sex', 'age', 'variant', 'Individual']
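<p>As a rough illustration of what the <code>observed="+"</code> and <code>excluded="-"</code> arguments mean for a single table cell, consider the sketch below. The function <code>classify_cell</code> is hypothetical and is not part of pyphetools. Treating any other value (such as an empty cell) as "not recorded" is an assumption made here.</p>

```python
from typing import Optional


def classify_cell(value: Optional[str], observed: str = "+", excluded: str = "-") -> str:
    """Classify one table cell as 'observed', 'excluded', or 'not recorded'.

    Hypothetical helper; the handling of empty or unexpected cells is an assumption.
    """
    if value is None:
        return "not recorded"
    text = str(value).strip()
    if text == observed:
        return "observed"
    if text == excluded:
        return "excluded"
    # Anything else (empty string, free text) carries no yes/no information.
    return "not recorded"


# Example cells in the style of the table above (values chosen for illustration).
example_row = {"Polydactyly": "+", "Brachydactyly": "-", "Clinodactyly": None}
statuses = {column: classify_cell(cell) for column, cell in example_row.items()}
```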


<h2>Variant Data</h2>
<p>The OFD1 variant data (HGVS transcript) is listed in the variant column.</p>

In [8]:
genome = 'hg19'
default_genotype = 'hemizygous'
transcript='NM_003611.2'
varMapper = VariantColumnMapper(assembly=genome,
                                column_name='variant', 
                                transcript=transcript, 
                                default_genotype=default_genotype)

<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [9]:
ageMapper = AgeColumnMapper.by_year('age')
ageMapper.preview_column(df['age'])

Unnamed: 0,original column contents,age
0,1,P1Y
1,3,P3Y
2,4,P4Y
3,6,P6Y
4,10,P10Y
5,27,P27Y
6,28,P28Y
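<p>The <code>age</code> column of the preview contains ISO 8601 durations, in which a whole number of years <i>n</i> is written as <code>PnY</code>. The conversion can be sketched as follows. The function name <code>years_to_iso8601</code> is hypothetical and only illustrates the format; it is not the code used by <code>AgeColumnMapper</code>.</p>

```python
def years_to_iso8601(years: int) -> str:
    """Format a whole number of years as an ISO 8601 duration, e.g. 28 -> 'P28Y'."""
    if years < 0:
        raise ValueError("age in years must not be negative")
    return f"P{years}Y"


# The ages listed in the table above.
ages = [1, 3, 4, 6, 10, 27, 28]
durations = [years_to_iso8601(age) for age in ages]
```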


In [10]:
sexMapper = SexColumnMapper(male_symbol='male', female_symbol='female', column_name='sex')
sexMapper.preview_column(df['sex'])

Unnamed: 0,original column contents,sex
0,male,MALE
1,male,MALE
2,male,MALE
3,male,MALE
4,male,MALE
5,male,MALE
6,male,MALE


In [11]:
encoder = CohortEncoder(df=df, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="Individual", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        variant_mapper=varMapper, 
                        metadata=metadata,
                        pmid=pmid)
encoder.set_disease(disease_id='OMIM:311200', label='Orofaciodigital syndrome I')

In [12]:
individuals = encoder.get_individuals()

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.1303A>C/NM_003611.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.312delG/NM_003611.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.294_312delTGGTTTGGCAAAAGAAAAG/NM_003611.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.121C>T/NM_003611.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.1071_1078delGAAGGATG/NM_003611.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.312+2del/NM_003611.2?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NM_003611.2%3Ac.1757delG/NM_003611.2?content-t
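<p>The request URLs printed above follow a regular pattern: the assembly, the percent-encoded HGVS expression (transcript, colon, and the value from the <code>variant</code> column), and the transcript again. The sketch below only reconstructs that pattern from the printed output and sends no request. The names <code>BASE_URL</code> and <code>build_query_url</code> are hypothetical, and pyphetools may assemble its URLs differently.</p>

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://rest.variantvalidator.org/VariantValidator/variantvalidator"


def build_query_url(assembly: str, transcript: str, hgvs_c: str) -> str:
    """Rebuild a URL of the form printed above (illustrative only)."""
    hgvs = f"{transcript}:{hgvs_c}"
    # Keep '>' and '+' as they appear in the printed URLs; ':' becomes %3A.
    encoded = quote(hgvs, safe=">+")
    query = urlencode({"content-type": "application/json"})
    return f"{BASE_URL}/{assembly}/{encoded}/{transcript}?{query}"


url = build_query_url("hg19", "NM_003611.2", "c.1303A>C")
```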

In [13]:
i1 = individuals[-1]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "PMID_11179005_28_(S)",
  "subject": {
    "id": "28 (S)",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P28Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0000494",
        "label": "Downslanted palpebral fissures"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0000268",
        "label": "Dolichocephaly"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0000324",
        "label": "Facial asymmetry"
      },
      "excluded": true
    },
    {
      "type": {
        "id": "HP:0011355",
        "label": "Localized skin lesion"
      }
    },
    {
      "type": {
        "id": "HP:0000164",
        "label": "Abnormality of the dentition"
      }
    },
    {
      "type": {
        "id": "HP:0000202",
        "label": "Orofacial cleft"
      }
    },
    {
      "type": {
        "id": "HP:0001263",
        "label": "Global developm

In [14]:
from IPython.display import HTML, display

phenopackets = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individuals]
table = PhenopacketTable(phenopacket_list=phenopackets)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
1 (F) (MALE; P1Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.1303A>C (hemizygous),Downslanted palpebral fissures (HP:0000494); Dolichocephaly (HP:0000268); Localized skin lesion (HP:0011355); Abnormality of the dentition (HP:0000164); Orofacial cleft (HP:0000202); Brain imaging abnormality (HP:0410263); Global developmental delay (HP:0001263); Coarse hair (HP:0002208); Alopecia (HP:0001596)
3 (F) (MALE; P3Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.312+1del (hemizygous),Orofacial cleft (HP:0000202); Brain imaging abnormality (HP:0410263); Short 2nd toe (HP:0001885); Polycystic kidney dysplasia (HP:0000113); Polydactyly (HP:0010442)
4 (F) (MALE; P4Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.294_312del (hemizygous),Facial asymmetry (HP:0000324); Orofacial cleft (HP:0000202); Brain imaging abnormality (HP:0410263); Global developmental delay (HP:0001263); Hepatic cysts (HP:0001407); Dry hair (HP:0011359); Alopecia (HP:0001596); Syndactyly (HP:0001159); Polycystic kidney dysplasia (HP:0000113); Clinodactyly (HP:0030084)
6 (S) (MALE; P6Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.121C>T (hemizygous),Localized skin lesion (HP:0011355); Orofacial cleft (HP:0000202); Syndactyly (HP:0001159)
10 (S) (MALE; P10Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.1071_1078del (hemizygous),Orofacial cleft (HP:0000202); Syndactyly (HP:0001159)
27 (S) (MALE; P27Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.312+2del (hemizygous),Localized skin lesion (HP:0011355); Abnormality of the dentition (HP:0000164); Orofacial cleft (HP:0000202); Global developmental delay (HP:0001263); Polycystic kidney dysplasia (HP:0000113)
28 (S) (MALE; P28Y),Orofaciodigital syndrome I (OMIM:311200),NM_003611.2:c.1757del (hemizygous),Localized skin lesion (HP:0011355); Abnormality of the dentition (HP:0000164); Orofacial cleft (HP:0000202); Global developmental delay (HP:0001263); Alopecia (HP:0001596); Clinodactyly (HP:0030084); Brachydactyly (HP:0001156)
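<p>The summary table lists only the observed features, whereas the JSON printed earlier for individual 28 (S) also contains excluded features, marked with <code>"excluded": true</code>. A minimal sketch of how the two groups could be separated from the JSON with the standard library is given below. The function <code>split_features</code> is hypothetical, and the snippet reuses just two features from the output above.</p>

```python
import json
from typing import List, Tuple

snippet = """
{
  "phenotypicFeatures": [
    {"type": {"id": "HP:0000494", "label": "Downslanted palpebral fissures"}, "excluded": true},
    {"type": {"id": "HP:0011355", "label": "Localized skin lesion"}}
  ]
}
"""


def split_features(phenopacket: dict) -> Tuple[List[str], List[str]]:
    """Return (observed labels, excluded labels) from a phenopacket-style dict."""
    observed, excluded = [], []
    for feature in phenopacket.get("phenotypicFeatures", []):
        label = feature["type"]["label"]
        # As in the JSON above, observed features carry no "excluded" key.
        if feature.get("excluded", False):
            excluded.append(label)
        else:
            observed.append(label)
    return observed, excluded


observed, excluded = split_features(json.loads(snippet))
```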


In [15]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                             metadata=metadata.to_ga4gh(),
                                             pmid=pmid,
                                             outdir=output_directory)

We output 7 GA4GH phenopackets to the directory phenopackets
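<p>For downstream use, the files written above can be read back as plain dictionaries. The sketch below assumes that each phenopacket was written as a separate file with a <code>.json</code> extension. That file naming is an assumption, and <code>load_phenopackets</code> is a hypothetical helper rather than a pyphetools function.</p>

```python
import json
from pathlib import Path
from typing import List


def load_phenopackets(directory: str) -> List[dict]:
    """Load every *.json file in a directory as a dict, in file-name order."""
    packets = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            packets.append(json.load(handle))
    return packets
```

<p>Calling <code>load_phenopackets("phenopackets")</code> after the cell above would then return one dictionary per file found in the output directory.</p>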
