<h1>Creation of phenopackets from tabular data (individuals in columns)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/34101994//" target="__blank">Kordes, et al. (2021) Evidence for a low-penetrant extended phenotype of rhabdoid tumor predisposition syndrome type 1 from a kindred with gain of SMARCB1 exon 6</a> and
    <a href="https://pubmed.ncbi.nlm.nih.gov/33470921/" target="__blank">Baker, et al. (2021) Epithelioid Sarcoma Arising in a Long-Term Survivor of an Atypical Teratoid/Rhabdoid Tumor in a Patient With Rhabdoid Tumor Predisposition Syndrome</a>
here.
We need a different approach because these papers contain no clinical tables.</p>


In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys
sys.path.insert(0, os.path.abspath('../../../pyphetools'))
sys.path.insert(0, os.path.abspath('../../pyphetools'))
from pyphetools.creation import *
from pyphetools.creation.simple_column_mapper import try_mapping_columns
import numpy as np

<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser accepts an hpo_json_file argument if you want to use a specific file. If the argument is not passed, it downloads the latest hp.json file from the HPO GitHub site and stores it in a new subdirectory called hpo_data. The file is not downloaded again if it is already present.</p>
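<p>The "use the given file, otherwise download once and reuse" behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pyphetools implementation: <tt>resolve_hpo_json</tt> and its <tt>download</tt> parameter are hypothetical names, and the downloader is injected so that the sketch runs without network access.</p>

```python
import tempfile
from pathlib import Path


def resolve_hpo_json(hpo_json_file=None, cache_dir="hpo_data", download=None):
    """Return the path of the hp.json file to parse (hypothetical helper).

    download: callable that writes hp.json to the given Path; it is injected
    so that this sketch can run without network access.
    """
    if hpo_json_file is not None:
        # An explicit file always wins; nothing is downloaded.
        return Path(hpo_json_file)
    target = Path(cache_dir) / "hp.json"
    if not target.exists():
        # First use: create the cache directory and fetch the file once.
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)
    return target


# Demonstration with a fake downloader that records how often it is called.
calls = []


def fake_download(path):
    calls.append(path)
    path.write_text("{}")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "hpo_data"
    first = resolve_hpo_json(cache_dir=cache, download=fake_download)
    second = resolve_hpo_json(cache_dir=cache, download=fake_download)
    download_count = len(calls)  # the second call reuses the cached file
    explicit = resolve_hpo_json(hpo_json_file="my/hp.json", download=fake_download)
```

<p>Pinning a specific hp.json file is useful when a cohort must be re-encoded with the same HPO release that was used originally.</p>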

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155")
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Creating and loading the table</h2>
<p>Neither paper contains a dedicated clinical table, so we first created one manually and now load it.</p>

In [3]:
df = pd.read_excel('../../data/SMARCB1/PMID_34101994_and_33470921.xlsx')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,pt_1,pt_2,pt_3
0,PMID,33470921,34101994,34101994
1,age,16,21,57
2,sex,male,male,male
3,pathogenic variant,,,
4,Feeding difficulties,+,,
5,Lethargy,+,,
6,Vomiting,+,,
7,Hydrocephalus,+,,
8,Neoplasm,+,+,+
9,Atypical teratoid/rhabdoid tumor,+,+,


<h2>Converting to row-based format</h2>
<p>To use pyphetools, we need to have the individuals represented as rows (one row per individual) and have the items of interest be encoded as column names. The required transformations for doing this may be different for different input data, but often we will want to transpose the table (using the pandas <tt>transpose</tt> function) and set the column names of the new table to the zero-th row. After this, we drop the zero-th row (otherwise, it will be interpreted as an individual by the pyphetools code).</p>
<p>After this step is completed, the remaining steps to create phenopackets are the same as in the 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>.</p>
    
For this specific table no further filtering is needed: it has no summary rows (such as a Count features row) that would have to be dropped.
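<p>To make the transformation concrete, here is a minimal sketch that performs the same steps on a toy table with plain Python lists instead of pandas. The values are a small excerpt of the table above; the variable names are illustrative only.</p>

```python
# Toy table with individuals in columns (excerpt of the table above).
table = [
    ["", "pt_1", "pt_2", "pt_3"],
    ["PMID", "33470921", "34101994", "34101994"],
    ["age", "16", "21", "57"],
    ["sex", "male", "male", "male"],
    ["Neoplasm", "+", "+", "+"],
]

# Transpose: each original column becomes a row.
transposed = [list(col) for col in zip(*table)]

# The first transposed row holds the item names. Use it as the header and
# drop it, so that it is not treated as an individual.
header, *rows = transposed
header[0] = "patient_id"
individuals = [dict(zip(header, row)) for row in rows]
```

<p>Each entry of <tt>individuals</tt> now describes one individual, keyed by item name, which is the layout the row-based workflow expects.</p>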

In [5]:
dft = df.transpose()
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
dft.head()

Unnamed: 0,PMID,age,sex,pathogenic variant,Feeding difficulties,Lethargy,Vomiting,Hydrocephalus,Neoplasm,Atypical teratoid/rhabdoid tumor,Flesh colored papules,Granuloma,Hemiparesis,Stroke,Leukemia,Ulcerative colitis,Specific learning disability,Ependymoma
pt_1,33470921,16,male,,+,+,+,+,+,+,+,+,+,+,-,-,-,-
pt_2,34101994,21,male,,,,,,+,+,,,,,-,-,+,-
pt_3,34101994,57,male,,,,,,+,,,,,,+,+,-,+


<h2>Index vs. normal column</h2>
<p>Another thing to look out for is whether the individuals (usually the first column) are regarded as the index of the table or as the first normal column.</p>
<p>If they are the index, as is the case here after transposing, it is easiest to create a new column with the contents of the index, which works with the pyphetools software. An example follows; afterwards we can use 'patient_id' as the column name.</p>

In [6]:
dft.index
dft['patient_id'] = dft.index
dft.head()

Unnamed: 0,PMID,age,sex,pathogenic variant,Feeding difficulties,Lethargy,Vomiting,Hydrocephalus,Neoplasm,Atypical teratoid/rhabdoid tumor,Flesh colored papules,Granuloma,Hemiparesis,Stroke,Leukemia,Ulcerative colitis,Specific learning disability,Ependymoma,patient_id
pt_1,33470921,16,male,,+,+,+,+,+,+,+,+,+,+,-,-,-,-,pt_1
pt_2,34101994,21,male,,,,,,+,+,,,,,-,-,+,-,pt_2
pt_3,34101994,57,male,,,,,,+,,,,,,+,+,-,+,pt_3


Some column names may have leading or trailing spaces, and some columns are subheadings that contain only NaNs, so let's correct that:

In [7]:
dft.columns = dft.columns.str.strip()
dft = dft.dropna(axis=1, how='all')
dft.head()

Unnamed: 0,PMID,age,sex,Feeding difficulties,Lethargy,Vomiting,Hydrocephalus,Neoplasm,Atypical teratoid/rhabdoid tumor,Flesh colored papules,Granuloma,Hemiparesis,Stroke,Leukemia,Ulcerative colitis,Specific learning disability,Ependymoma,patient_id
pt_1,33470921,16,male,+,+,+,+,+,+,+,+,+,+,-,-,-,-,pt_1
pt_2,34101994,21,male,,,,,+,+,,,,,-,-,+,-,pt_2
pt_3,34101994,57,male,,,,,+,,,,,,+,+,-,+,pt_3


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a dictionary of ColumnMappers. Note that the code is identical except that we use <tt>dft.loc</tt> to select the corresponding data.</p>
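<p>The preview printed by the next cell reports a status for every cell of a column. The rule behind it can be sketched roughly as follows, assuming that empty cells are treated as "not measured". <tt>cell_status</tt> is a hypothetical helper written for illustration, not part of the pyphetools API, and the "excluded" label is an assumption, since the preview shown here only contains "observed" and "not measured".</p>

```python
def cell_status(cell, observed="+", excluded="-"):
    """Classify one table cell (hypothetical helper, not the pyphetools API)."""
    value = "" if cell is None else str(cell).strip()
    if value == observed:
        return "observed"
    if value == excluded:
        return "excluded"
    # Empty or unrecognized cells carry no information about the feature.
    return "not measured"


# Columns in the order pt_1, pt_2, pt_3 (values from the table above).
hydrocephalus_column = ["+", None, None]
leukemia_column = ["-", "-", "+"]

hydrocephalus_status = [cell_status(c) for c in hydrocephalus_column]
leukemia_status = [cell_status(c) for c in leukemia_column]
```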

In [8]:
hpo_cr = parser.get_hpo_concept_recognizer()
column_mapper_d, col_not_found = try_mapping_columns(df=dft.loc[:,:],
                                                    observed='+',
                                                    excluded='-',
                                                    hpo_cr=hpo_cr,
                                                    preview=True)

                                term        status
0  Feeding difficulties (HP:0011968)      observed
1  Feeding difficulties (HP:0011968)  not measured
2  Feeding difficulties (HP:0011968)  not measured
                    term        status
0  Lethargy (HP:0001254)      observed
1  Lethargy (HP:0001254)  not measured
2  Lethargy (HP:0001254)  not measured
                    term        status
0  Vomiting (HP:0002013)      observed
1  Vomiting (HP:0002013)  not measured
2  Vomiting (HP:0002013)  not measured
                         term        status
0  Hydrocephalus (HP:0000238)      observed
1  Hydrocephalus (HP:0000238)  not measured
2  Hydrocephalus (HP:0000238)  not measured
                    term    status
0  Neoplasm (HP:0002664)  observed
1  Neoplasm (HP:0002664)  observed
2  Neoplasm (HP:0002664)  observed
                                            term        status
0  Atypical teratoid/rhabdoid tumor (HP:0034401)      observed
1  Atypical teratoid/rhabdoid tumor (HP:00

Let's check which columns have not been mapped yet.

In [9]:
print(col_not_found)

['PMID', 'age', 'sex', 'patient_id']


<h2>Variant Data</h2>
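<p>The variant data (HGVS, transcript) would normally be taken from the pathogenic variant column. That column is empty in this table and was dropped above together with the other all-NaN columns, so the variant mapper in the following cell is commented out.</p>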

In [10]:
#genome = 'hg38'
#default_genotype = 'heterozygous'
#transcript='NM_003073.3'
#varMapper = VariantColumnMapper(assembly=genome,column_name='Pathogenic variant', 
#                                transcript=transcript, genotype=default_genotype)

<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [11]:
ageMapper = AgeColumnMapper.by_year('age')
ageMapper.preview_column(dft['age'])

Unnamed: 0,original column contents,age
0,16,P16Y
1,21,P21Y
2,57,P57Y
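<p>The preview above shows ages in years rendered as ISO 8601 durations. A minimal sketch of that formatting, assuming whole years only, is given below; <tt>years_to_iso8601</tt> is a hypothetical helper, not a pyphetools function.</p>

```python
def years_to_iso8601(age_in_years):
    """Format a whole number of years as an ISO 8601 duration (hypothetical helper)."""
    years = int(age_in_years)
    if years < 0:
        raise ValueError("age cannot be negative")
    return f"P{years}Y"


ages = [16, 21, 57]  # the 'age' column of the table above
durations = [years_to_iso8601(a) for a in ages]
```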


In [12]:
# all three individuals in this table are male
sexMapper = SexColumnMapper(male_symbol='male', female_symbol='female', column_name='sex')
sexMapper.preview_column(dft['sex'])

Unnamed: 0,original column contents,sex
0,male,MALE
1,male,MALE
2,male,MALE


In [13]:
# dft.iloc[1:, :] selects pt_2 and pt_3, both reported in PMID 34101994 (see the PMID column)
pmid = "PMID: 34101994"
encoder = CohortEncoder(df=dft.iloc[1:,:], hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", agemapper=ageMapper, sexmapper=sexMapper, metadata=metadata,
                       pmid=pmid)
encoder.set_disease(disease_id='609322', label='Rhabdoid tumor predisposition syndrome-1')

In [14]:
individuals = encoder.get_individuals()

In [15]:
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

{
  "id": "pt_2",
  "subject": {
    "id": "pt_2",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P21Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0002664",
        "label": "Neoplasm"
      },
      "onset": {
        "age": {
          "iso8601duration": "P21Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0034401",
        "label": "Atypical teratoid/rhabdoid tumor"
      },
      "onset": {
        "age": {
          "iso8601duration": "P21Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0001909",
        "label": "Leukemia"
      },
      "excluded": true,
      "onset": {
        "age": {
          "iso8601duration": "P21Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0100279",
        "label": "Ulcerative colitis"
      },
      "excluded": true,
      "onset": {
        "age": {
          "iso8601duration": "P21Y"
        }
      }
    

In [16]:
output_directory = "../../phenopackets/SMARCB1/"
encoder.output_phenopackets(outdir=output_directory)

Wrote 2 phenopackets to ../../phenopackets/SMARCB1/
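<p>The exported JSON can be inspected with the standard library alone. The sketch below parses an abbreviated phenopacket in the shape printed above and separates observed from excluded features. It relies only on the fields visible in that output; the abbreviated JSON string is illustrative and is not read from the output directory.</p>

```python
import json

# Abbreviated phenopacket in the shape printed above.
phenopacket_json = """
{
  "id": "pt_2",
  "subject": {"id": "pt_2", "sex": "MALE"},
  "phenotypicFeatures": [
    {"type": {"id": "HP:0002664", "label": "Neoplasm"}},
    {"type": {"id": "HP:0001909", "label": "Leukemia"}, "excluded": true}
  ]
}
"""

packet = json.loads(phenopacket_json)
observed, excluded = [], []
for feature in packet.get("phenotypicFeatures", []):
    # In the output above, "excluded" appears only on excluded features.
    bucket = excluded if feature.get("excluded", False) else observed
    bucket.append(feature["type"]["label"])
```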


In [17]:
# dft.iloc[0, :] selects pt_1, reported in PMID 33470921 (see the PMID column)
pmid = "PMID: 33470921"
encoder = CohortEncoder(df=pd.DataFrame(dft.iloc[0,:]).T, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", agemapper=ageMapper, sexmapper=sexMapper, metadata=metadata,
                       pmid=pmid)
encoder.set_disease(disease_id='609322', label='Rhabdoid tumor predisposition syndrome-1')
individuals = encoder.get_individuals()
i1 = individuals[0]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)
output_directory = "../../phenopackets/SMARCB1/"
encoder.output_phenopackets(outdir=output_directory)

{
  "id": "pt_1",
  "subject": {
    "id": "pt_1",
    "timeAtLastEncounter": {
      "age": {
        "iso8601duration": "P16Y"
      }
    },
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0011968",
        "label": "Feeding difficulties"
      },
      "onset": {
        "age": {
          "iso8601duration": "P16Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0001254",
        "label": "Lethargy"
      },
      "onset": {
        "age": {
          "iso8601duration": "P16Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0002013",
        "label": "Vomiting"
      },
      "onset": {
        "age": {
          "iso8601duration": "P16Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0000238",
        "label": "Hydrocephalus"
      },
      "onset": {
        "age": {
          "iso8601duration": "P16Y"
        }
      }
    },
    {
      "type": {
        "id": "HP:0002664",
        "lab