<h1>ZTTK syndrome (Dingemans, et al., 2022)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/34521999/" target="__blank">Dingemans, et al. (2022) Establishing the phenotypic spectrum of ZTTK syndrome by analysis of 52 individuals with variants in SON</a></p>
<p>Phenotypic abnormalities, systematically collected and analyzed in Human Phenotype Ontology, were found in all organ systems. Significant inter-individual phenotypic variability was observed, even in individuals with the same recurrent variant (n = 13). SON haploinsufficiency was previously shown to lead to downregulation of downstream genes, contributing to specific phenotypic features. Similar functional analysis for one missense variant, however, suggests a different mechanism than for heterozygous loss-of-function.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys
import numpy as np
import pyphetools
from pyphetools.creation import *
from pyphetools.visualization import *
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.6.3


<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser accepts an optional hpo_json_file argument if you want to use a specific file. If the argument is not passed, it downloads the latest hp.json file from the HPO GitHub site and stores it in a new subdirectory called hpo_data. The file is not downloaded again if it is already present.</p>

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
PMID = "PMID:34521999"
title = "Establishing the phenotypic spectrum of ZTTK syndrome by analysis of 52 individuals with variants in SON"
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155", pmid=PMID, pubmed_title=title)
metadata.default_versions_with_hpo(version=hpo_version)
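
<p>The download-once behaviour described above can be pictured with a small sketch. This is not the pyphetools implementation: the function name <code>ensure_cached</code>, its parameters, and the injected <code>fetch</code> callable are hypothetical and only illustrate "download the file only if it is not already present".</p>

```python
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write fetch() to path unless path already exists.

    Returns True if the file was fetched, False if the cached copy was reused.
    `fetch` is a hypothetical stand-in for the actual download step.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True
```

<p>Injecting the download step as a callable keeps the caching logic testable without network access.</p>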

<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import this file (note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input. Users can choose any input format that pandas supports, including CSV, TSV, and Excel, or use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('input/PMID_34521999.xlsx')

In [4]:
df

Unnamed: 0,Unnamed: 1,1,2,3,4,5,6,7,8,9,...,43,44,45,46,47,48,49,50,51,52
0,Gender,Male,Male,Female,Female,Male,Female,Male,Male,Female,...,Female,Female,Female,Female,Female,Female,Female,Male,Female,Male
1,Age at examination,5 years,2 years,2 years,4 years and 4 months,9 years and 11 months,15 years,7 years,3 years and 3 months,4 years,...,9 years,5 years,9 years,3 years,9 years,15 years,3 years,23 years,6 years,3 years and 5 months
2,Genomic position,g.34927290_34927293del,g.34927290_34927293del,g.34927290_34927293del,g.34924740C>G,g.34921994del,g.34921921del,g.34783136_34975848del,g.34927547del,g.34925248del,...,g.34923418_34923419del,g.34927086_34927087del,g.34925389_34925393del,g.34927065C>A,g.34924610dup,g.34927290_34927293del,g.34921823C>T,g.34929534del,g.34927290_34927293del,g.34926456_34926460del
3,cDNA change,c.5753_5756del,c.5753_5756del,c.5753_5756del,c.3203C>G,c.457del,c.384del,0.19Mb deletion,c.6010del,c.3711del,...,c.1881_1882del,c.5549_5550del,c.3852_3856del,c.5528C>A,c.3073dup,c.5753_5756del,c.286C>T,c.6233del,c.5753_5756del,c.4919_4923del
4,Predicted protein effect,p.(Val1918Glufs*87),p.(Val1918Glufs*87),p.(Val1918Glufs*87),p.(Ser1068*),p.(Asp153Ilefs*4),p.(Lys128Asnfs*21),,p.(Val2004Trpfs*2),p.(Ser1238Glnfs*3),...,p.(Val629Alafs*56),p.(Arg1850Ilefs*3),p.(Met1284Ilefs*2),p.(Ser1843Tyr),p.(Met1025Asnfs*6),p.(Val1918Glufs*87),p.(Gln96*),p.(Pro2078Hisfs*4),p.(Val1918Glufs*87),p.(Asp1640Glyfs*7)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,Other,"Pleural effusion,Wide intermamillary distance",,,"Asthma,abnormality of the respiratory system",,,Inguinal hernia,,,...,,,,,"Respiratory distress, Emphysema, Early respiratory failure","Neonatal respiratory distress, Respiratory failure requiring assisted ventilation","Respiratory failure requiring assisted ventilation, Laryngeal cleft, Laryngomalacia, Abnormality of the carotid arteries",,"Deep venous thrombosis, Respiratory failure requiring assisted ventilation",
97,PMID,,,,,,,,,,...,27545680;31005274,27545680,27545676,27545676,27545676,27545676,27545676,27545676,27545676,32705777
98,,,,,,,,,,,...,,,,,,,,,,
99,,,,,,,,,,,...,,,,,,,,,,
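
<p>Because only the resulting DataFrame matters, other formats can be read the same way. The sketch below reads the same toy table once as CSV and once as TSV from in-memory text; the data are invented for illustration and are not taken from the supplemental file.</p>

```python
import io

import pandas as pd

# Toy content, invented for illustration; not taken from the supplemental table.
csv_text = "feature,1,2\nGender,Male,Female\n"
tsv_text = "feature\t1\t2\nGender\tMale\tFemale\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both routes should produce equivalent DataFrames.
print(df_csv.equals(df_tsv))
```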


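<h2>Converting to row-based format</h2>
<p>The supplemental table lists individuals in columns and features in rows, whereas the mappers used below work on a table with one individual per row. We therefore transpose the table, use the first row of the transposed table as the column names, and then drop that row.</p>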

In [5]:
dft = df.transpose()
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft.head()

Unnamed: 0,Gender,Age at examination,Genomic position,cDNA change,Predicted protein effect,Other genomic variants potentially contributing to the phenotype,Head circumference (at birth) (HP:0011451 / HP:0004488),Head circumference (HP:0000252 / HP:0000256),Heigth (at birth) (HP:0003561 / HP:0003517),Heigth (HP:0004322 / HP:0000098),...,Recurrent otitis media (HP:0000403),Abnormality of the immunological system other,Abnormality of the endocrine system (HP:0000818),Abnormality of metabolism/homeostasis (HP:0001939),Neoplasia (HP:0002664),Other,PMID,NaN,NaN.1,"Description of the variants on genomic chromosomal level reported using NC_000021.8, and annotated based on NM_138927.2 unless indicated otherwise. Abbreviations: +, present; -, not present; NR, not reported; NA, not applicable; PMID, PubMed ID; U, unknown."
1,Male,5 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,P3 - P98,P3 - P98,NR,P3 - P98,...,-,,-,-,-,"Pleural effusion,Wide intermamillary distance",,,,
2,Male,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,NR,> P98,P3 - P98,< P3,...,-,,-,-,-,,,,,
3,Female,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,< P3,< P3,< P3,< P3,...,-,,-,-,-,,,,,
4,Female,4 years and 4 months,g.34924740C>G,c.3203C>G,p.(Ser1068*),-,P3 - P98,< P3,P3 - P98,< P3,...,-,Otitis media,+ (Hypothyroidism),-,-,"Asthma,abnormality of the respiratory system",,,,
5,Male,9 years and 11 months,g.34921994del,c.457del,p.(Asp153Ilefs*4),-,NR,< P3,NR,< P3,...,-,,+ (Growth hormone deficiency),-,-,,,,,
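
<p>The same three steps on a toy table may make the reshaping easier to follow. The data below are invented for illustration; only the pandas calls mirror the cell above.</p>

```python
import pandas as pd

# Toy table with individuals in columns, invented for illustration.
wide = pd.DataFrame(
    [["Gender", "Male", "Female"], ["Seizures", "+", "-"]],
    columns=["Unnamed: 0", 1, 2],
)

by_row = wide.transpose()
by_row.columns = by_row.iloc[0]        # the first row holds the feature names
by_row = by_row.drop(by_row.index[0])  # remove that header row from the data

print(by_row)
```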


Some column names might have leading or trailing spaces, and a couple of columns are subheadings that contain only NaNs, so let's correct that:

In [6]:
dft.columns = dft.columns.str.strip()
dft = dft.dropna(axis=1, how='all')
dft['patient_id'] = dft.index
dft.head()

Unnamed: 0,Gender,Age at examination,Genomic position,cDNA change,Predicted protein effect,Other genomic variants potentially contributing to the phenotype,Head circumference (at birth) (HP:0011451 / HP:0004488),Head circumference (HP:0000252 / HP:0000256),Heigth (at birth) (HP:0003561 / HP:0003517),Heigth (HP:0004322 / HP:0000098),...,Abnormality of the immune system (HP:0002715),Recurrent otitis media (HP:0000403),Abnormality of the immunological system other,Abnormality of the endocrine system (HP:0000818),Abnormality of metabolism/homeostasis (HP:0001939),Neoplasia (HP:0002664),Other,PMID,"Description of the variants on genomic chromosomal level reported using NC_000021.8, and annotated based on NM_138927.2 unless indicated otherwise. Abbreviations: +, present; -, not present; NR, not reported; NA, not applicable; PMID, PubMed ID; U, unknown.",patient_id
1,Male,5 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,P3 - P98,P3 - P98,NR,P3 - P98,...,-,-,,-,-,-,"Pleural effusion,Wide intermamillary distance",,,1
2,Male,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,NR,> P98,P3 - P98,< P3,...,-,-,,-,-,-,,,,2
3,Female,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,< P3,< P3,< P3,< P3,...,-,-,,-,-,-,,,,3
4,Female,4 years and 4 months,g.34924740C>G,c.3203C>G,p.(Ser1068*),-,P3 - P98,< P3,P3 - P98,< P3,...,+,-,Otitis media,+ (Hypothyroidism),-,-,"Asthma,abnormality of the respiratory system",,,4
5,Male,9 years and 11 months,g.34921994del,c.457del,p.(Asp153Ilefs*4),-,NR,< P3,NR,< P3,...,-,-,,+ (Growth hormone deficiency),-,-,,,,5
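
<p>A toy frame shows what each of these cleaning steps removes. The column names and values are invented for illustration.</p>

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: a padded header and an all-NaN subheading column.
toy = pd.DataFrame(
    {" Gender ": ["Male", "Female"], "Growth": [np.nan, np.nan], "Seizures": ["+", "-"]},
    index=[1, 2],
)

toy.columns = toy.columns.str.strip()  # ' Gender ' becomes 'Gender'
toy = toy.dropna(axis=1, how="all")    # removes 'Growth', which holds only NaN
toy["patient_id"] = toy.index          # keep the identifier as a regular column

print(list(toy.columns))
```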


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use the df.loc function to get the corresponding row data</p>

In [7]:
hpo_cr = parser.get_hpo_concept_recognizer()
generator = SimpleColumnMapperGenerator(df=dft,observed='+', excluded='-',hpo_cr=hpo_cr)
column_mapper_d = generator.try_mapping_columns()
print(generator.get_mapped_columns())

[]


In [8]:
print(generator.get_unmapped_columns())

['Gender', 'Age at examination', 'Genomic position', 'cDNA change', 'Predicted protein effect', 'Other genomic variants potentially contributing to the phenotype', 'Head circumference (at birth) (HP:0011451 / HP:0004488)', 'Head circumference (HP:0000252 / HP:0000256)', 'Heigth (at birth) (HP:0003561 / HP:0003517)', 'Heigth (HP:0004322 / HP:0000098)', 'Weigth (at birth) (HP:0001518 / HP:0001520)', 'Weigth (HP:0004325 / HP:0004324)', 'Motor delay (HP:0001270)', 'Speech delay (HP:0000750)', 'Intellectual disability (HP:0001249)', 'Severity of intellectual disability (HP:0001256 / HP:0002342 / HP:0010864)', 'Abnormality of prenatal development or birth (HP:0001197)', 'Premature birth (HP:0001622)', 'Caesarian section (HP:0011410)', 'Abnormality of prenatal development or birth other', 'Neurological abnormality (HP:0000707)', 'Hypotonia (HP:0001252)', 'Seizures (HP:0001250)', 'EEG abnormality (HP:0002353)', 'Neurological abnormality other', 'Abnormality of the brain (HP:0002363)', 'Abnorma
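
<p>Many headers in this table combine a label with one or more HPO ids in parentheses, for example 'Motor delay (HP:0001270)'. If a label-only or id-only view of the headers is useful, a helper along the following lines could separate the two. The function <code>split_header</code> is hypothetical and not part of pyphetools.</p>

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")


def split_header(header: str) -> tuple[str, list[str]]:
    """Return (label, HPO ids) for a header such as 'Motor delay (HP:0001270)'."""
    ids = HPO_ID.findall(header)
    # Remove a trailing parenthesised group only if it contains an HPO id.
    label = re.sub(r"\s*\([^()]*HP:\d{7}[^()]*\)\s*$", "", header).strip()
    return label, ids


print(split_header("Head circumference (at birth) (HP:0011451 / HP:0004488)"))
```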

In [9]:
headcircumference = {'> P98': 'Macrocephaly at birth',
                 '< P3': 'Primary microcephaly' }
headcircumferenceMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=headcircumference)
print(headcircumferenceMapper.preview_column(dft['Head circumference (at birth) (HP:0011451 / HP:0004488)']))
column_mapper_d['Head circumference (at birth) (HP:0011451 / HP:0004488)'] = headcircumferenceMapper

                                         terms
0                                          n/a
1                                          n/a
2   HP:0011451 (Primary microcephaly/observed)
3                                          n/a
4                                          n/a
5                                          n/a
6                                          n/a
7                                          n/a
8                                          n/a
9                                          n/a
10                                         n/a
11                                         n/a
12                                         n/a
13                                         n/a
14                                         n/a
15                                         n/a
16                                         n/a
17                                         n/a
18                                         n/a
19                                         n/a
20           

In [10]:
headcircumference = {'> P98': 'Macrocephaly',
                 '< P3': 'Microcephaly' }
headcircumferenceMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=headcircumference)
headcircumferenceMapper.preview_column(dft['Head circumference (HP:0000252 / HP:0000256)'])
column_mapper_d['Head circumference (HP:0000252 / HP:0000256)'] = headcircumferenceMapper
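
<p>Conceptually, an option dictionary turns a percentile category into an HPO term, while cells such as 'P3 - P98' or 'NR' yield no term. The plain-Python sketch below illustrates that idea with an exact-match lookup. It is not the OptionColumnMapper implementation, which may match cell contents differently, and <code>map_option</code> is a hypothetical name.</p>

```python
# Percentile category -> (HPO id, label); ids and labels as they appear in the
# summary table near the end of this notebook.
HEAD_CIRCUMFERENCE = {
    "> P98": ("HP:0000256", "Macrocephaly"),
    "< P3": ("HP:0000252", "Microcephaly"),
}


def map_option(cell, options):
    """Return the (id, label) pair for a cell, or None when no option applies."""
    if not isinstance(cell, str):
        return None  # e.g. NaN for an empty cell
    return options.get(cell.strip())


print(map_option("< P3", HEAD_CIRCUMFERENCE))
```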

In [11]:
birth_length = {'> P98': 'Birth length greater than 97th percentile',
                 '< P3': 'Birth length less than 3rd percentile'}
birth_lengthMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=birth_length)
#print(birth_lengthMapper.preview_column(dft['Heigth (at birth) (HP:0003561 / HP:0003517)']))
column_mapper_d['Heigth (at birth) (HP:0003561 / HP:0003517)'] = birth_lengthMapper

length = {'> P98': 'Tall stature',
                 '< P3': 'Short stature'}
lengthMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=length)
#print(lengthMapper.preview_column(dft['Heigth (HP:0004322 / HP:0000098)']))
column_mapper_d['Heigth (HP:0004322 / HP:0000098)'] = lengthMapper

In [12]:
birth_weight = {'> P98': 'Large for gestational age',
                 '< P3': 'Small for gestational age'}
birth_weightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=birth_weight)
#print(birth_weightMapper.preview_column(dft['Weigth (at birth) (HP:0001518 / HP:0001520)']))
column_mapper_d['Weigth (at birth) (HP:0001518 / HP:0001520)'] = birth_weightMapper

weight = {'> P98': 'Increased body weight',
                 '< P3': 'Decreased body weight'}
weightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=weight)
#print(weightMapper.preview_column(dft['Weigth (HP:0004325 / HP:0004324)']))
column_mapper_d['Weigth (HP:0004325 / HP:0004324)'] = weightMapper

In [13]:
id_severity = {'Mild': 'Intellectual disability, mild',
                 'Moderate': 'Intellectual disability, moderate',
         'Severe': 'Intellectual disability, severe'}
id_severityMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=id_severity)
#print(id_severityMapper.preview_column(dft['Severity of intellectual disability (HP:0001256 / HP:0002342 / HP:0010864)']))
column_mapper_d['Severity of intellectual disability (HP:0001256 / HP:0002342 / HP:0010864)'] = id_severityMapper

In this particular file, some table cells also contain additional terms, for example '+ (Hypothyroidism)', so we would need to loop over them, parse their contents, and add the resulting terms to the parser. This step is currently commented out.

In [14]:
#additional_hpos = get_additional_hpos_from_df(dft, hpo_cr)
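
<p>Cells such as '+ (Hypothyroidism)' combine an observed/excluded marker with a free-text detail. One way to separate the two, before handing the detail to a concept recognizer, might look like this. The helper <code>parse_cell</code> is hypothetical; it is not part of pyphetools and is not what the commented-out call above does.</p>

```python
import re

# A '+' or '-' marker, optionally followed by a parenthesised detail.
CELL = re.compile(r"^\s*([+-])\s*(?:\((.+)\))?\s*$")


def parse_cell(cell: str):
    """Split '+ (Hypothyroidism)' into ('+', 'Hypothyroidism').

    Returns None if the cell does not start with a '+' or '-' marker.
    """
    m = CELL.match(cell)
    if m is None:
        return None
    return m.group(1), m.group(2)


print(parse_cell("+ (Growth hormone deficiency)"))
```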

<h2>Variant Data</h2>
<p>The variant data (HGVS, relative to transcript NM_138927.2) is listed in the cDNA change column.</p>

In [15]:
genome = 'hg38'
default_genotype = 'heterozygous'
SON_transcript='NM_138927.2'
vvalidator = VariantValidator(genome_build=genome, transcript=SON_transcript)
#varMapper = VariantColumnMapper(assembly=genome,column_name='cDNA change', 
#                                transcript=transcript, default_genotype=default_genotype)
vars = dft['cDNA change'].unique()
print(vars)

['c.5753_5756del' 'c.3203C>G' 'c.457del' 'c.384del' '0.19Mb deletion'
 'c.6010del' 'c.3711del' 'c.348_351del' 'c.668C>T' 'c.3334C>T'
 'c.1881_del1882' 'c.1736C>G' 'c.4018del' 'c.4678del' 'c.1444del'
 'c.394C>T' 'c.5230del' 'Whole gene deletion' 'c.4549dup' 'c.4055del '
 'c.268del' 'c.2365del' 'c.4152_4172del' 'c.3597_3598dup' 'c.6087del'
 'c.4640del' 'c.4358_4359del ' 'c.6002_6003insCC' 'c.4999_5013del'
 'c.3852_3856del' 'c.5753_5756del ' 'c.1881_1882del ' 'c.5549_5550del'
 'c.5528C>A' 'c.3073dup' 'c.286C>T' 'c.6233del' 'c.4919_4923del']
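
<p>Several of the values above carry trailing spaces, so 'c.5753_5756del' and 'c.5753_5756del ' are counted as distinct entries, and one entry ('c.1881_del1882') is malformed. A normalization step could be sketched as follows. The function <code>normalize_variant</code> is hypothetical; note that if the keys are normalized, the raw cell contents would need the same treatment at lookup time.</p>

```python
# Known malformed entry in this table and the corrected HGVS form used in the
# decoding cell of this notebook.
CORRECTIONS = {"c.1881_del1882": "c.1881_1882del"}
# Entries that are not HGVS expressions and need separate handling.
STRUCTURAL = {"0.19Mb deletion", "Whole gene deletion"}


def normalize_variant(raw: str):
    """Strip whitespace and apply known corrections; return None for structural variants."""
    cleaned = raw.strip()
    if cleaned in STRUCTURAL:
        return None
    return CORRECTIONS.get(cleaned, cleaned)


print(normalize_variant("c.4055del "))
```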


In [16]:
variant_d = {}
son_id = "HGNC:11183"
structural = {"0.19Mb deletion","Whole gene deletion"} 
for v in vars:
    print(f"decoding {v}")
    if v in structural:
        var = StructuralVariant.chromosomal_deletion(cell_contents=v, gene_symbol="SON", gene_id=son_id)
    else:
        if v == "c.1881_del1882":
            var = vvalidator.encode_hgvs("c.1881_1882del")
        else:
            var = vvalidator.encode_hgvs(v)
    # All variants are heterozygous
    var.set_heterozygous()
    variant_d[v] = var
print(f"We extracted {len(variant_d)} variants") 

decoding c.5753_5756del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2?content-type=application%2Fjson
decoding c.3203C>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.3203C>G/NM_138927.2?content-type=application%2Fjson
decoding c.457del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.457del/NM_138927.2?content-type=application%2Fjson
decoding c.384del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.384del/NM_138927.2?content-type=application%2Fjson
decoding 0.19Mb deletion
decoding c.6010del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.6010del/NM_138927.2?content-type=application%2Fjson
decoding c.3711del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.3711del/NM_138927.2?content-type=application%2Fjson
decoding 
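
<p>The log lines above show one request URL per HGVS expression. The sketch below only reproduces the URL pattern visible in that output; it performs no request, and the function name <code>variantvalidator_url</code> is hypothetical.</p>

```python
BASE = "https://rest.variantvalidator.org/VariantValidator/variantvalidator"


def variantvalidator_url(hgvs_c: str, transcript: str = "NM_138927.2", genome: str = "hg38") -> str:
    """Assemble a URL in the same shape as the ones printed in the output above."""
    return (
        f"{BASE}/{genome}/{transcript}%3A{hgvs_c}/{transcript}"
        "?content-type=application%2Fjson"
    )


print(variantvalidator_url("c.457del"))
```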

<h2>Demographic data</h2>

In [17]:
ageMapper = AgeColumnMapper.by_year('Age at examination')
ageMapper.preview_column(dft['Age at examination']).head(2)

Unnamed: 0,original column contents,age
0,5 years,P5Y
1,2 years,P2Y
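
<p>In the summary table later in this notebook, individual 4 ('4 years and 4 months') appears as P4Y, so the by-year mapper evidently keeps whole years only. If months should be retained, the age strings could be converted along these lines. This is an alternative sketch, not what AgeColumnMapper does, and <code>to_iso8601</code> is a hypothetical name.</p>

```python
import re

# Matches '5 years' as well as '4 years and 4 months'.
AGE = re.compile(r"^\s*(\d+)\s*years?(?:\s*and\s*(\d+)\s*months?)?\s*$")


def to_iso8601(age: str):
    """Convert '4 years and 4 months' to 'P4Y4M'; return None if the text does not match."""
    m = AGE.match(age)
    if m is None:
        return None
    years, months = m.group(1), m.group(2)
    return f"P{years}Y" + (f"{months}M" if months else "")


print(to_iso8601("9 years and 11 months"))
```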


In [18]:
sexMapper = SexColumnMapper(male_symbol='Male', female_symbol='Female', column_name='Gender')
sexMapper.preview_column(dft['Gender']).head(2)

Unnamed: 0,original column contents,sex
0,Male,MALE
1,Male,MALE


In [19]:
varMapper = VariantColumnMapper(variant_d=variant_d, variant_column_name="cDNA change")
encoder = CohortEncoder(df=dft, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        variant_mapper=varMapper, 
                        metadata=metadata,
                        pmid=PMID)
encoder.set_disease(disease_id='OMIM:617140', label='ZTTK SYNDROME')

In [20]:
individuals = encoder.get_individuals()

In [21]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "PMID_34521999_1",
  "subject": {
    "id": "1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P5Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0004325",
        "label": "Decreased body weight"
      }
    }
  ],
  "interpretations": [
    {
      "id": "1",
      "progressStatus": "SOLVED",
      "diagnosis": {
        "disease": {
          "id": "OMIM:617140",
          "label": "ZTTK SYNDROME"
        },
        "genomicInterpretations": [
          {
            "subjectOrBiosampleId": "1",
            "interpretationStatus": "CAUSATIVE",
            "variantInterpretation": {
              "variationDescriptor": {
                "id": "var_xtJuymDjRWAOMLEFNGKiwcfwX",
                "geneContext": {
                  "valueId": "HGNC:11183",
                  "symbol": "SON"
                },
                "expressions": [
                  {
                    "syntax": "hgvs.c"

In [22]:
from IPython.display import HTML, display
phenopackets = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individuals]
table = PhenopacketTable(phenopacket_list=phenopackets)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
1 (MALE; P5Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.5753_5756del (heterozygous),Decreased body weight (HP:0004325)
2 (MALE; P2Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.5753_5756del (heterozygous),"Macrocephaly (HP:0000256); Short stature (HP:0004322); Intellectual disability, severe (HP:0010864)"
3 (FEMALE; P2Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.5753_5756del (heterozygous),Primary microcephaly (HP:0011451); Microcephaly (HP:0000252); Birth length less than 3rd percentile (HP:0003561); Short stature (HP:0004322); Small for gestational age (HP:0001518)
4 (FEMALE; P4Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.3203C>G (heterozygous),"Microcephaly (HP:0000252); Short stature (HP:0004322); Decreased body weight (HP:0004325); Intellectual disability, moderate (HP:0002342)"
5 (MALE; P9Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.457del (heterozygous),"Microcephaly (HP:0000252); Short stature (HP:0004322); Increased body weight (HP:0004324); Intellectual disability, severe (HP:0010864)"
6 (FEMALE; P15Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.384del (heterozygous),"Short stature (HP:0004322); Decreased body weight (HP:0004325); Intellectual disability, severe (HP:0010864)"
7 (MALE; P7Y),ZTTK SYNDROME (OMIM:617140),:0> (heterozygous),"Short stature (HP:0004322); Decreased body weight (HP:0004325); Intellectual disability, moderate (HP:0002342)"
8 (MALE; P3Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.6010del (heterozygous),
9 (FEMALE; P4Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.3711del (heterozygous),"Short stature (HP:0004322); Decreased body weight (HP:0004325); Intellectual disability, mild (HP:0001256)"
10 (MALE; P2Y),ZTTK SYNDROME (OMIM:617140),NM_138927.2:c.6010del (heterozygous),"Short stature (HP:0004322); Decreased body weight (HP:0004325); Intellectual disability, moderate (HP:0002342)"
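
<p>To get a quick overview of the cohort, the observed terms can be tallied across phenopackets. The sketch below works on plain dictionaries with the field names shown in the JSON output above (phenotypicFeatures, type, id, label) plus the excluded flag of the Phenopacket Schema; <code>count_observed_terms</code> is a hypothetical helper and the toy input is invented for illustration.</p>

```python
from collections import Counter


def count_observed_terms(phenopackets: list[dict]) -> Counter:
    """Count how many times each HPO term is listed as observed."""
    counts = Counter()
    for pp in phenopackets:
        for feature in pp.get("phenotypicFeatures", []):
            if feature.get("excluded", False):
                continue  # excluded features are recorded but were not observed
            term = feature["type"]
            counts[(term["id"], term["label"])] += 1
    return counts


# Toy input, invented for illustration.
toy_packets = [
    {"phenotypicFeatures": [{"type": {"id": "HP:0004322", "label": "Short stature"}}]},
    {"phenotypicFeatures": [
        {"type": {"id": "HP:0004322", "label": "Short stature"}},
        {"type": {"id": "HP:0000252", "label": "Microcephaly"}, "excluded": True},
    ]},
    {},
]
print(count_observed_terms(toy_packets).most_common())
```

<p>With the objects created in this notebook, the input could be obtained with <code>[MessageToDict(p) for p in phenopackets]</code>, using the MessageToDict function imported in the first cell.</p>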


In [23]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                              pmid=PMID,
                                              metadata=metadata.to_ga4gh(),
                                              outdir=output_directory)

We output 52 GA4GH phenopackets to the directory phenopackets
