<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/34521999/" target="__blank">Dingemans, et al. (2022) Establishing the phenotypic spectrum of ZTTK syndrome by analysis of 52 individuals with variants in SON</a>.</p>
<p>pyphetools provides a convenient way of extracting HPO terms from typical tables presented in supplemental material. Typical tables can have the individuals in columns or rows. In this case, we extract data from TABLE.</p>
<p>This notebook shows how to work through the table and set up the pyphetools encoder. The table was not available in this form in the original publication; it was constructed from the data presented there.</p>
<p>Users can work on one column at a time and then generate a collection of <a href="https://pubmed.ncbi.nlm.nih.gov/35705716/" target="__blank">GA4GH phenopackets</a> to represent each patient included in the original supplemental material. These phenopackets can then be used for a variety of downstream applications.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys
sys.path.insert(0, os.path.abspath('../../../pyphetools'))
sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools.creation import *
from pyphetools.creation.simple_column_mapper import try_mapping_columns
import numpy as np

<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept a hpo_json_file argument if you want to use a specific file. If the argument is not passed, it will download the latest hp.json file from the HPO GitHub site and store it in a new subdirectory called hpo_data. It will not download the file again if it has already been downloaded.</p>
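<p>The download-once behaviour can be pictured with a small sketch. This is not the pyphetools implementation: the function <tt>ensure_local_copy</tt> and the injected <tt>fake_fetch</tt> callable are hypothetical names used only for illustration, and no network request is made.</p>

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_local_copy(target: Path, fetch: Callable[[], str]) -> bool:
    """Write fetch() to target only if target does not exist yet.

    Returns True if a download happened, False if the cached file was reused.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(fetch(), encoding="utf-8")
    return True


# Hypothetical stand-in for the real network request; it records each call.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return '{"graphs": []}'


with tempfile.TemporaryDirectory() as tmp:
    hp_json = Path(tmp) / "hpo_data" / "hp.json"
    first = ensure_local_copy(hp_json, fake_fetch)   # file missing: fetch it
    second = ensure_local_copy(hp_json, fake_fetch)  # file present: reuse it
```

<p>Passing the fetch step in as a callable keeps the caching logic testable without touching the network.</p>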

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155")
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import this file (note that the Python package called openpyxl must be installed to read Excel files with pandas, although the library does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input, and users can choose any input format available for pandas, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>
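<p>As a minimal sketch of the point that the input format does not matter once the data are in a DataFrame, the following reads the same toy table from CSV and from TSV text. The three-column table is invented for illustration and is not taken from the publication.</p>

```python
import io

import pandas as pd

# Toy data for illustration only.
csv_text = "id,Gender,Motor delay\n1,Male,+\n2,Female,-\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv handles both formats; only the separator differs.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

<p>Both calls yield the same DataFrame, so everything downstream is independent of the original file type.</p>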

In [3]:
df = pd.read_excel('../../data/SON/PMID_34521999.xlsx')

In [4]:
df

Unnamed: 0,Unnamed: 1,1,2,3,4,5,6,7,8,9,...,43,44,45,46,47,48,49,50,51,52
0,Gender,Male,Male,Female,Female,Male,Female,Male,Male,Female,...,Female,Female,Female,Female,Female,Female,Female,Male,Female,Male
1,Age at examination,5 years,2 years,2 years,4 years and 4 months,9 years and 11 months,15 years,7 years,3 years and 3 months,4 years,...,9 years,5 years,9 years,3 years,9 years,15 years,3 years,23 years,6 years,3 years and 5 months
2,Genomic position,g.34927290_34927293del,g.34927290_34927293del,g.34927290_34927293del,g.34924740C>G,g.34921994del,g.34921921del,g.34783136_34975848del,g.34927547del,g.34925248del,...,g.34923418_34923419del,g.34927086_34927087del,g.34925389_34925393del,g.34927065C>A,g.34924610dup,g.34927290_34927293del,g.34921823C>T,g.34929534del,g.34927290_34927293del,g.34926456_34926460del
3,cDNA change,c.5753_5756del,c.5753_5756del,c.5753_5756del,c.3203C>G,c.457del,c.384del,0.19Mb deletion,c.6010del,c.3711del,...,c.1881_1882del,c.5549_5550del,c.3852_3856del,c.5528C>A,c.3073dup,c.5753_5756del,c.286C>T,c.6233del,c.5753_5756del,c.4919_4923del
4,Predicted protein effect,p.(Val1918Glufs*87),p.(Val1918Glufs*87),p.(Val1918Glufs*87),p.(Ser1068*),p.(Asp153Ilefs*4),p.(Lys128Asnfs*21),,p.(Val2004Trpfs*2),p.(Ser1238Glnfs*3),...,p.(Val629Alafs*56),p.(Arg1850Ilefs*3),p.(Met1284Ilefs*2),p.(Ser1843Tyr),p.(Met1025Asnfs*6),p.(Val1918Glufs*87),p.(Gln96*),p.(Pro2078Hisfs*4),p.(Val1918Glufs*87),p.(Asp1640Glyfs*7)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,Other,"Pleural effusion,Wide intermamillary distance",,,"Asthma,abnormality of the respiratory system",,,Inguinal hernia,,,...,,,,,"Respiratory distress, Emphysema, Early respiratory failure","Neonatal respiratory distress, Respiratory failure requiring assisted ventilation","Respiratory failure requiring assisted ventilation, Laryngeal cleft, Laryngomalacia, Abnormality of the carotid arteries",,"Deep venous thrombosis, Respiratory failure requiring assisted ventilation",
97,PMID,,,,,,,,,,...,27545680;31005274,27545680,27545676,27545676,27545676,27545676,27545676,27545676,27545676,32705777
98,,,,,,,,,,,...,,,,,,,,,,
99,,,,,,,,,,,...,,,,,,,,,,


<h2>Converting to row-based format</h2>
<p>To use pyphetools, we need to have the individuals represented as rows (one row per individual) and have the items of interest be encoded as column names. The required transformations for doing this may be different for different input data, but often we will want to transpose the table (using the pandas <tt>transpose</tt> function) and set the column names of the new table to the zero-th row. After this, we drop the zero-th row (otherwise, it will be interpreted as an individual by the pyphetools code).</p>
<p>After this step is completed, the remaining steps to create phenopackets are the same as in the 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>.</p>
    
Furthermore, for this specific case, there is a Count features row that we want dropped, so we filter out any row that does not have Patient in the first column.
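<p>The transpose-and-rename step can be tried on a toy table first. The sketch below uses two invented individuals and two features; the column name <tt>feature</tt> is an assumption made for this example only.</p>

```python
import pandas as pd

# Toy table with individuals in columns (values are illustrative only).
wide = pd.DataFrame({
    "feature": ["Gender", "Age at examination"],
    1: ["Male", "5 years"],
    2: ["Female", "2 years"],
})

long = wide.transpose()
long.columns = long.iloc[0]      # the first row holds the feature names
long = long.drop(long.index[0])  # drop it so it is not read as an individual
```

<p>After these three steps each remaining row is one individual and each column is one feature.</p>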

In [5]:
dft = df.transpose()
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft.head()

Unnamed: 0,Gender,Age at examination,Genomic position,cDNA change,Predicted protein effect,Other genomic variants potentially contributing to the phenotype,Head circumference (at birth) (HP:0011451 / HP:0004488),Head circumference (HP:0000252 / HP:0000256),Heigth (at birth) (HP:0003561 / HP:0003517),Heigth (HP:0004322 / HP:0000098),...,Recurrent otitis media (HP:0000403),Abnormality of the immunological system other,Abnormality of the endocrine system (HP:0000818),Abnormality of metabolism/homeostasis (HP:0001939),Neoplasia (HP:0002664),Other,PMID,NaN,NaN.1,"Description of the variants on genomic chromosomal level reported using NC_000021.8, and annotated based on NM_138927.2 unless indicated otherwise. Abbreviations: +, present; -, not present; NR, not reported; NA, not applicable; PMID, PubMed ID; U, unknown."
1,Male,5 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,P3 - P98,P3 - P98,NR,P3 - P98,...,-,,-,-,-,"Pleural effusion,Wide intermamillary distance",,,,
2,Male,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,NR,> P98,P3 - P98,< P3,...,-,,-,-,-,,,,,
3,Female,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,< P3,< P3,< P3,< P3,...,-,,-,-,-,,,,,
4,Female,4 years and 4 months,g.34924740C>G,c.3203C>G,p.(Ser1068*),-,P3 - P98,< P3,P3 - P98,< P3,...,-,Otitis media,+ (Hypothyroidism),-,-,"Asthma,abnormality of the respiratory system",,,,
5,Male,9 years and 11 months,g.34921994del,c.457del,p.(Asp153Ilefs*4),-,NR,< P3,NR,< P3,...,-,,+ (Growth hormone deficiency),-,-,,,,,


Some column names might include leading or trailing spaces, and a couple of columns are subheadings that contain only NaNs, so let's correct that:

In [6]:
dft.columns = dft.columns.str.strip()
dft = dft.dropna(axis=1, how='all')
dft['patient_id'] = dft.index
dft

Unnamed: 0,Gender,Age at examination,Genomic position,cDNA change,Predicted protein effect,Other genomic variants potentially contributing to the phenotype,Head circumference (at birth) (HP:0011451 / HP:0004488),Head circumference (HP:0000252 / HP:0000256),Heigth (at birth) (HP:0003561 / HP:0003517),Heigth (HP:0004322 / HP:0000098),...,Abnormality of the immune system (HP:0002715),Recurrent otitis media (HP:0000403),Abnormality of the immunological system other,Abnormality of the endocrine system (HP:0000818),Abnormality of metabolism/homeostasis (HP:0001939),Neoplasia (HP:0002664),Other,PMID,"Description of the variants on genomic chromosomal level reported using NC_000021.8, and annotated based on NM_138927.2 unless indicated otherwise. Abbreviations: +, present; -, not present; NR, not reported; NA, not applicable; PMID, PubMed ID; U, unknown.",patient_id
1,Male,5 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,P3 - P98,P3 - P98,NR,P3 - P98,...,-,-,,-,-,-,"Pleural effusion,Wide intermamillary distance",,,1
2,Male,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,NR,> P98,P3 - P98,< P3,...,-,-,,-,-,-,,,,2
3,Female,2 years,g.34927290_34927293del,c.5753_5756del,p.(Val1918Glufs*87),-,< P3,< P3,< P3,< P3,...,-,-,,-,-,-,,,,3
4,Female,4 years and 4 months,g.34924740C>G,c.3203C>G,p.(Ser1068*),-,P3 - P98,< P3,P3 - P98,< P3,...,+,-,Otitis media,+ (Hypothyroidism),-,-,"Asthma,abnormality of the respiratory system",,,4
5,Male,9 years and 11 months,g.34921994del,c.457del,p.(Asp153Ilefs*4),-,NR,< P3,NR,< P3,...,-,-,,+ (Growth hormone deficiency),-,-,,,,5
6,Female,15 years,g.34921921del,c.384del,p.(Lys128Asnfs*21),-,NR,NR,P3 - P98,< P3,...,-,-,,+ (Hypothyroidism),+ (Neonatal hypoglycemia),-,,,,6
7,Male,7 years,g.34783136_34975848del,0.19Mb deletion,,-,P3 - P98,P3 - P98,P3 - P98,< P3,...,+,-,"Hypogammaglobulinemia,Recurrent infections",-,-,-,Inguinal hernia,,-,7
8,Male,3 years and 3 months,g.34927547del,c.6010del,p.(Val2004Trpfs*2),-,NR,P3 - P98,P3 - P98,P3 - P98,...,-,-,,-,-,-,,,,8
9,Female,4 years,g.34925248del,c.3711del,p.(Ser1238Glnfs*3),-,NR,P3 - P98,NR,< P3,...,-,+,,-,-,-,,,,9
10,Male,2 years and 4 months,g.34927547del,c.6010del,p.(Val2004Trpfs*2),-,P3 - P98,P3 - P98,NR,< P3,...,+,-,Recurrent respiratory infections,NR,NR,-,,,,10


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use the df.loc function to get the corresponding row data.</p>
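<p>Conceptually, a simple column mapper sorts each cell into one of three states. The sketch below is a plain-Python illustration of that idea, not the pyphetools code: the function <tt>cell_status</tt> is hypothetical, and exact matching after stripping whitespace is an assumption.</p>

```python
def cell_status(cell: object, observed: str = "+", excluded: str = "-") -> str:
    """Classify one table cell as observed, excluded, or not measured."""
    if isinstance(cell, str):
        value = cell.strip()
        if value == observed:
            return "observed"
        if value == excluded:
            return "excluded"
    # Anything else (NR, NaN, free text) carries no usable information here.
    return "not measured"


statuses = [cell_status(c) for c in ["+", "-", "NR", float("nan"), " + "]]
```

<p>Cells that combine a symbol with free text, such as "+ (Hypothyroidism)", would fall into "not measured" under this strict rule, which is one reason to preview each mapper before relying on it.</p>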

In [7]:
hpo_cr = parser.get_hpo_concept_recognizer()
column_mapper_d, col_not_found = try_mapping_columns(df=dft,
                                                    observed='+',
                                                    excluded='-',
                                                    hpo_cr=hpo_cr,
                                                    preview=True)

                        term        status
0   Motor delay (HP:0001270)      observed
1   Motor delay (HP:0001270)      observed
2   Motor delay (HP:0001270)      observed
3   Motor delay (HP:0001270)      observed
4   Motor delay (HP:0001270)      observed
5   Motor delay (HP:0001270)      observed
6   Motor delay (HP:0001270)      observed
7   Motor delay (HP:0001270)      observed
8   Motor delay (HP:0001270)      observed
9   Motor delay (HP:0001270)      observed
10  Motor delay (HP:0001270)      observed
11  Motor delay (HP:0001270)      observed
12  Motor delay (HP:0001270)      observed
13  Motor delay (HP:0001270)      excluded
14  Motor delay (HP:0001270)      observed
15  Motor delay (HP:0001270)      observed
16  Motor delay (HP:0001270)      observed
17  Motor delay (HP:0001270)      observed
18  Motor delay (HP:0001270)      observed
19  Motor delay (HP:0001270)  not measured
20  Motor delay (HP:0001270)  not measured
21  Motor delay (HP:0001270)      observed
22  Motor d

                             term        status
0   Facial asymmetry (HP:0000324)      excluded
1   Facial asymmetry (HP:0000324)      excluded
2   Facial asymmetry (HP:0000324)      excluded
3   Facial asymmetry (HP:0000324)      excluded
4   Facial asymmetry (HP:0000324)      excluded
5   Facial asymmetry (HP:0000324)      excluded
6   Facial asymmetry (HP:0000324)      excluded
7   Facial asymmetry (HP:0000324)      excluded
8   Facial asymmetry (HP:0000324)      excluded
9   Facial asymmetry (HP:0000324)      excluded
10  Facial asymmetry (HP:0000324)  not measured
11  Facial asymmetry (HP:0000324)      excluded
12  Facial asymmetry (HP:0000324)      excluded
13  Facial asymmetry (HP:0000324)      excluded
14  Facial asymmetry (HP:0000324)      excluded
15  Facial asymmetry (HP:0000324)      excluded
16  Facial asymmetry (HP:0000324)      excluded
17  Facial asymmetry (HP:0000324)      excluded
18  Facial asymmetry (HP:0000324)      excluded
19  Facial asymmetry (HP:0000324)  not m

                                                                 term  \
0   Abnormality of the curvature of the vertebral column (HP:0010674)   
1   Abnormality of the curvature of the vertebral column (HP:0010674)   
2   Abnormality of the curvature of the vertebral column (HP:0010674)   
3   Abnormality of the curvature of the vertebral column (HP:0010674)   
4   Abnormality of the curvature of the vertebral column (HP:0010674)   
5   Abnormality of the curvature of the vertebral column (HP:0010674)   
6   Abnormality of the curvature of the vertebral column (HP:0010674)   
7   Abnormality of the curvature of the vertebral column (HP:0010674)   
8   Abnormality of the curvature of the vertebral column (HP:0010674)   
9   Abnormality of the curvature of the vertebral column (HP:0010674)   
10  Abnormality of the curvature of the vertebral column (HP:0010674)   
11  Abnormality of the curvature of the vertebral column (HP:0010674)   
12  Abnormality of the curvature of the vertebral c

                         term        status
0   Otitis media (HP:0000388)      excluded
1   Otitis media (HP:0000388)      excluded
2   Otitis media (HP:0000388)      excluded
3   Otitis media (HP:0000388)      excluded
4   Otitis media (HP:0000388)      excluded
5   Otitis media (HP:0000388)      excluded
6   Otitis media (HP:0000388)      excluded
7   Otitis media (HP:0000388)      excluded
8   Otitis media (HP:0000388)      observed
9   Otitis media (HP:0000388)      excluded
10  Otitis media (HP:0000388)  not measured
11  Otitis media (HP:0000388)  not measured
12  Otitis media (HP:0000388)  not measured
13  Otitis media (HP:0000388)      excluded
14  Otitis media (HP:0000388)      excluded
15  Otitis media (HP:0000388)      excluded
16  Otitis media (HP:0000388)      excluded
17  Otitis media (HP:0000388)      excluded
18  Otitis media (HP:0000388)      observed
19  Otitis media (HP:0000388)      excluded
20  Otitis media (HP:0000388)      excluded
21  Otitis media (HP:0000388)   

In [8]:
print(col_not_found)

['Gender', 'Age at examination', 'Genomic position', 'cDNA change', 'Predicted protein effect', 'Other genomic variants potentially contributing to the phenotype', 'Head circumference (at birth) (HP:0011451 / HP:0004488)', 'Head circumference (HP:0000252 / HP:0000256)', 'Heigth (at birth) (HP:0003561 / HP:0003517)', 'Heigth (HP:0004322 / HP:0000098)', 'Weigth (at birth) (HP:0001518 / HP:0001520)', 'Weigth (HP:0004325 / HP:0004324)', 'Facial (eye)', 'Facial (eye) other', 'Facial (mouth)', 'Facial (mouth) other', 'Facial (nose)', 'Facial (nose) other', 'Facial (ear)', 'Facial (ear) other', 'Facial other', 'Abnormality of the finger (HP:0001167)', 'Skeletal abnormality other', 'Gastrointestinal abnormality other', 'Abnormality of the immunological system other', 'Other', 'PMID', 'Description of the variants on genomic chromosomal level reported using NC_000021.8, and annotated based on NM_138927.2 unless indicated otherwise. Abbreviations: +, present; -, not present; NR, not reported; NA,

In [9]:
headcircumference = {'> P98': 'Macrocephaly at birth',
                 '< P3': 'Primary microcephaly',
        }
headcircumferenceMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=headcircumference)
print(headcircumferenceMapper.preview_column(dft['Head circumference (at birth) (HP:0011451 / HP:0004488)']))
column_mapper_d['Head circumference (at birth) (HP:0011451 / HP:0004488)'] = headcircumferenceMapper

headcircumference = {'> P98': 'Macrocephaly',
                 '< P3': 'Microcephaly',
        }
headcircumferenceMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=headcircumference)
print(headcircumferenceMapper.preview_column(dft['Head circumference (HP:0000252 / HP:0000256)']))
column_mapper_d['Head circumference (HP:0000252 / HP:0000256)'] = headcircumferenceMapper

                                         terms
0                                          n/a
1                                          n/a
2   HP:0011451 (Primary microcephaly/observed)
3                                          n/a
4                                          n/a
5                                          n/a
6                                          n/a
7                                          n/a
8                                          n/a
9                                          n/a
10                                         n/a
11                                         n/a
12                                         n/a
13                                         n/a
14                                         n/a
15                                         n/a
16                                         n/a
17                                         n/a
18                                         n/a
19                                         n/a
20           

In [10]:
birth_length = {'> P98': 'Birth length greater than 97th percentile',
                 '< P3': 'Birth length less than 3rd percentile',
        }
birth_lengthMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=birth_length)
print(birth_lengthMapper.preview_column(dft['Heigth (at birth) (HP:0003561 / HP:0003517)']))
column_mapper_d['Heigth (at birth) (HP:0003561 / HP:0003517)'] = birth_lengthMapper

length = {'> P98': 'Tall stature',
                 '< P3': 'Short stature',
        }
lengthMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=length)
print(lengthMapper.preview_column(dft['Heigth (HP:0004322 / HP:0000098)']))
column_mapper_d['Heigth (HP:0004322 / HP:0000098)'] = lengthMapper

                                                          terms
0                                                           n/a
1                                                           n/a
2   HP:0003561 (Birth length less than 3rd percentile/observed)
3                                                           n/a
4                                                           n/a
5                                                           n/a
6                                                           n/a
7                                                           n/a
8                                                           n/a
9                                                           n/a
10                                                          n/a
11                                                          n/a
12                                                          n/a
13                                                          n/a
14                                      

In [11]:
birth_weight = {'> P98': 'Large for gestational age',
                 '< P3': 'Small for gestational age',
        }
birth_weightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=birth_weight)
print(birth_weightMapper.preview_column(dft['Weigth (at birth) (HP:0001518 / HP:0001520)']))
column_mapper_d['Weigth (at birth) (HP:0001518 / HP:0001520)'] = birth_weightMapper

weight = {'> P98': 'Increased body weight',
                 '< P3': 'Decreased body weight',
        }
weightMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=weight)
print(weightMapper.preview_column(dft['Weigth (HP:0004325 / HP:0004324)']))
column_mapper_d['Weigth (HP:0004325 / HP:0004324)'] = weightMapper

                                              terms
0                                               n/a
1                                               n/a
2   HP:0001518 (Small for gestational age/observed)
3                                               n/a
4                                               n/a
5                                               n/a
6                                               n/a
7                                               n/a
8                                               n/a
9                                               n/a
10                                              n/a
11                                              n/a
12                                              n/a
13                                              n/a
14                                              n/a
15                                              n/a
16  HP:0001518 (Small for gestational age/observed)
17                                              n/a
18  HP:00015

In [12]:
id_severity = {'Mild': 'Intellectual disability, mild',
                 'Moderate': 'Intellectual disability, moderate',
         'Severe': 'Intellectual disability, severe'
        }
id_severityMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=id_severity)
print(id_severityMapper.preview_column(dft['Severity of intellectual disability (HP:0001256 / HP:0002342 / HP:0010864)']))
column_mapper_d['Severity of intellectual disability (HP:0001256 / HP:0002342 / HP:0010864)'] = id_severityMapper

                                                      terms
0                                                       n/a
1     HP:0010864 (Intellectual disability, severe/observed)
2                                                       n/a
3   HP:0002342 (Intellectual disability, moderate/observed)
4     HP:0010864 (Intellectual disability, severe/observed)
5     HP:0010864 (Intellectual disability, severe/observed)
6   HP:0002342 (Intellectual disability, moderate/observed)
7                                                       n/a
8       HP:0001256 (Intellectual disability, mild/observed)
9   HP:0002342 (Intellectual disability, moderate/observed)
10      HP:0001256 (Intellectual disability, mild/observed)
11    HP:0010864 (Intellectual disability, severe/observed)
12    HP:0010864 (Intellectual disability, severe/observed)
13      HP:0001256 (Intellectual disability, mild/observed)
14                                                      n/a
15                                      

For this particular file, some cells of the table also contain HPO terms as free text, so we loop over all cells, parse their contents, and collect the resulting terms for each individual.
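<p>Tables like this one contain empty (NaN) and numeric cells alongside text. A minimal sketch of skipping non-string cells before text mining is shown here; <tt>toy_parse_cell</tt> is a hypothetical stand-in for a concept recognizer, and the two-entry lookup uses labels that appear in this notebook.</p>

```python
# Toy lookup built from two labels that appear in this notebook; a real
# concept recognizer works on the full ontology.
TOY_LABELS = {"otitis media": "HP:0000388", "motor delay": "HP:0001270"}


def toy_parse_cell(cell: str) -> list:
    """Hypothetical stand-in for text mining: substring match on labels."""
    text = cell.lower()
    return [hpo_id for label, hpo_id in TOY_LABELS.items() if label in text]


def parse_row(cells) -> list:
    """Collect HPO ids from the string cells of one row, skipping NaN and numbers."""
    found = []
    for cell in cells:
        if isinstance(cell, str):
            found.extend(toy_parse_cell(cell))
    return found


row = ["-", float("nan"), "Otitis media", 27545680, "+ (Motor delay)"]
hits = parse_row(row)
```

<p>Filtering on <tt>isinstance(cell, str)</tt> means that only text ever reaches the parser, so no type coercion is needed.</p>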

In [13]:
additional_hpos = []

for i in range(len(dft)):
    temp_hpos = []
    for y in range(dft.shape[1]):
        hpo_term = hpo_cr.parse_cell(dft.iloc[i,y])
        if len(hpo_term) > 0:
            temp_hpos.extend(hpo_term)
    additional_hpos.append(temp_hpos)
print(additional_hpos)

Error: cell_contents argument (nan) must be string but was <class 'float'> -- coerced to string
Error: cell_contents argument (32045494) must be string but was <class 'int'> -- coerced to string
Error: cell_contents argument (27545680) must be string but was <class 'int'> -- coerced to string
...

<h2>Variant Data</h2>
<p>The variant data (HGVS, transcript level) is listed in the cDNA change column and is annotated relative to transcript NM_138927.2.</p>
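<p>Not every cell in a variant column is necessarily an HGVS expression; in this table one individual has a "0.19Mb deletion" recorded as free text. The sketch below flags such cells with a deliberately loose regular expression. It is an illustrative filter under that assumption, not a full HGVS grammar and not part of pyphetools.</p>

```python
import re

# Loose pattern for simple coding-DNA changes such as c.5753_5756del or
# c.3203C>G. It does not cover intronic offsets, UTR positions, and so on.
SIMPLE_C_HGVS = re.compile(
    r"^c\.\d+(?:_\d+)?(?:del|dup|ins[ACGT]+|delins[ACGT]+|[ACGT]>[ACGT])$"
)


def looks_like_c_hgvs(cell: object) -> bool:
    """Return True if the cell is a string matching the simple c. pattern."""
    return isinstance(cell, str) and SIMPLE_C_HGVS.match(cell.strip()) is not None


cells = ["c.5753_5756del", "c.3203C>G", "c.3073dup", "0.19Mb deletion"]
flags = [looks_like_c_hgvs(c) for c in cells]
```

<p>Cells that fail the check can be reviewed by hand, for example to encode a large deletion as a structural variant instead.</p>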

In [14]:
genome = 'hg38'
default_genotype = 'heterozygous'
transcript='NM_138927.2'
varMapper = VariantColumnMapper(assembly=genome,column_name='cDNA change', 
                                transcript=transcript, genotype=default_genotype)

<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>
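<p>Ages in phenopackets are encoded as ISO 8601 durations such as <tt>P5Y</tt>. As a rough sketch of what a year-based conversion involves, the function below parses strings of the form used in the "Age at examination" column. The name <tt>age_to_iso8601</tt> and the <tt>keep_months</tt> option are hypothetical and are not part of the pyphetools API.</p>

```python
import re

AGE_PATTERN = re.compile(
    r"^(?P<years>\d+)\s+years?(?:\s+and\s+(?P<months>\d+)\s+months?)?$"
)


def age_to_iso8601(text: str, keep_months: bool = False) -> str:
    """Convert e.g. '4 years and 4 months' to 'P4Y' (or 'P4Y4M')."""
    match = AGE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized age string: {text!r}")
    duration = f"P{int(match['years'])}Y"
    if keep_months and match["months"] is not None:
        duration += f"{int(match['months'])}M"
    return duration


years_only = age_to_iso8601("4 years and 4 months")
with_months = age_to_iso8601("4 years and 4 months", keep_months=True)
```

<p>A year-only conversion discards the months, which may or may not matter depending on the downstream analysis.</p>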

In [15]:
ageMapper = AgeColumnMapper.by_year('Age at examination')
ageMapper.preview_column(dft['Age at examination'])

Unnamed: 0,original column contents,age
0,5 years,P5Y
1,2 years,P2Y
2,4 years and 4 months,P4Y
3,9 years and 11 months,P9Y
4,15 years,P15Y
5,7 years,P7Y
6,3 years and 3 months,P3Y
7,4 years,P4Y
8,2 years and 4 months,P2Y
9,11 years,P11Y


In [16]:
sexMapper = SexColumnMapper(male_symbol='Male', female_symbol='Female', column_name='Gender')
sexMapper.preview_column(dft['Gender'])

Unnamed: 0,original column contents,sex
0,Male,MALE
1,Male,MALE
2,Female,FEMALE
3,Female,FEMALE
4,Male,MALE
5,Female,FEMALE
6,Male,MALE
7,Male,MALE
8,Female,FEMALE
9,Male,MALE


In [17]:
pmid = "PMID: 34521999"
encoder = CohortEncoder(df=dft, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", agemapper=ageMapper, sexmapper=sexMapper,
                       variant_mapper=varMapper, metadata=metadata,
                       pmid=pmid)
encoder.set_disease(disease_id='617140', label='ZTTK SYNDROME')

In [18]:
individuals = encoder.get_individuals(additional_hpos)

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.3203C>G/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.457del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.384del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3A0.19Mb deletion/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.6010del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.3711del/NM_138927.2


In [19]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "1",
  "subject": {
    "id": "1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P5Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0001561",
        "label": "Polyhydramnios"
      },
      "onset": {
        "age": {
          "iso8601duration": "P5Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0001288",
        "label": "Gait disturbance"
      },
      "onset": {
        "age": {
          "iso8601duration": "P5Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0002172",
        "label": "Postural instability"
      },
      "onset": {
        "age": {
          "iso8601duration": "P5Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0002317",
        "label": "Unsteady gait"
      },
      "onset": {
        "age": {
          "iso8601duration": "P5Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0003698",
        "

In [20]:
output_directory = "../../phenopackets/SON/"
encoder.output_phenopackets(outdir=output_directory)

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.5753_5756del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.3203C>G/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.457del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.384del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3A0.19Mb deletion/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.6010del/NM_138927.2
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_138927.2%3Ac.3711del/NM_138927.2
