<h1>WFS1: Strom et al. (1998)</h1>
<p>In this notebook we create phenopackets for the individuals described in <a href="https://pubmed.ncbi.nlm.nih.gov/9817917/" target="__blank">Strom et al. (1998) Diabetes insipidus, diabetes mellitus, optic atrophy and deafness (DIDMOAD) caused by mutations in a novel gene (wolframin) coding for a predicted transmembrane protein</a>.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import numpy as np
from pyphetools.creation import *
from pyphetools.visualization import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.6.4


<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser accepts an optional hpo_json_file argument if you want to use a specific file. If the argument is not passed, HpoParser downloads the latest hp.json file from the HPO GitHub site and stores it in a new subdirectory called hpo_data. The download is skipped if the file is already present.</p>
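<p>The "use the given file, else the cached copy, else download" behaviour described above can be sketched as follows. This is only an illustration of the decision logic, not the pyphetools implementation: the function name resolve_hpo_json, the cache_dir default, and the injectable fetch callable are all assumptions made for this example.</p>

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_hpo_json(hpo_json_file: Optional[str] = None,
                     cache_dir: str = "hpo_data",
                     fetch: Optional[Callable[[Path], None]] = None) -> Path:
    """Return the path of the hp.json file to parse (hypothetical helper)."""
    # 1. An explicitly supplied file always takes precedence.
    if hpo_json_file is not None:
        return Path(hpo_json_file)
    # 2. Otherwise reuse a previously downloaded copy if there is one.
    cached = Path(cache_dir) / "hp.json"
    if cached.is_file():
        return cached
    # 3. Only download when nothing is cached. The download itself is
    #    delegated to `fetch` so that this sketch needs no network access.
    if fetch is None:
        raise FileNotFoundError(f"{cached} is missing and no fetch function was given")
    cached.parent.mkdir(parents=True, exist_ok=True)
    fetch(cached)
    return cached
```

<p>Calling the function twice with the same cache directory triggers at most one download, which mirrors the "will not download the file if the file is already downloaded" rule.</p>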

In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
PMID = "PMID:9817917"
title = "Diabetes insipidus, diabetes mellitus, optic atrophy and deafness (DIDMOAD) caused by mutations in a novel gene (wolframin) coding for a predicted transmembrane protein"
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155")
metadata.default_versions_with_hpo(version=hpo_version)

<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import the supplemental table (note that the Python package openpyxl must be installed to read Excel files with pandas, although it does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input. Users can choose any input format that pandas supports, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('../../data/WFS1/PMID_9817917.xlsx')

In [4]:
df

Unnamed: 0,Family,Patient,Sex,Age,Diabetes mellitus,Progressive optic atrophy,Hearing impairment,Diabetes insipidus,Abnormality of the kidney,Neurological abnormalities,Other complications,Consangui,Variant
0,1.0,5519,f,22.0,+,+,+,+,+,"Ataxia, nystagmus","Retarded sexual maturation, depression",-,1380del9
1,1.0,13883,f,11.0,+,-,-,-,+,-,-,-,1380del9
2,2.0,13775,f,20.0,+,+,+,+,-,-,-,-,460+1G>A
3,2.0,13776,m,17.0,+,+,+,+,+,-,Retarded sexual maturation,-,460+1G>A
4,4.0,13070,f,22.0,+,+,+,-,+,Abnormal EEG,Psychiatric illness,-,599delT
5,5.0,13885,f,35.0,+,+,+,-,+,-,Cataract,-,1096C>T
6,6.0,13062,f,25.0,+,+,+,+,+,"Ataxia, nystagmus",,,676C>T
7,7.0,13076,m,26.0,+,+,+,+,-,-,"Retarded sexual maturation, mental retardation",-,599delT
8,8.0,13073,f,35.0,+,+,+,-,+,Ataxia,"Cataract, psychiatric illness, ragged red fibers",,1096C>T
9,9.0,13781,m,19.0,+,+,+,+,+,Abnormal EEG,Retarded sexual maturation,+,1558C>T


Some column names might have leading or trailing spaces, and a couple of columns are subheadings that contain only NaNs, so let's correct that. Furthermore, we remove individuals without a specified age or without a variant in this gene.

In [5]:
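# Remove leading and trailing whitespace from the column names
df.columns = df.columns.str.strip()
# Drop columns that contain only NaN values (subheadings)
df = df.dropna(axis=1, how='all')
# Use the patient number as the individual identifier
df['patient_id'] = df['Patient']
# Keep only rows that have both an age and a variant
df = df[~df['Age'].isna()]
df = df[~df['Variant'].isna()]
df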

Unnamed: 0,Family,Patient,Sex,Age,Diabetes mellitus,Progressive optic atrophy,Hearing impairment,Diabetes insipidus,Abnormality of the kidney,Neurological abnormalities,Other complications,Consangui,Variant,patient_id
0,1.0,5519,f,22.0,+,+,+,+,+,"Ataxia, nystagmus","Retarded sexual maturation, depression",-,1380del9,5519
1,1.0,13883,f,11.0,+,-,-,-,+,-,-,-,1380del9,13883
2,2.0,13775,f,20.0,+,+,+,+,-,-,-,-,460+1G>A,13775
3,2.0,13776,m,17.0,+,+,+,+,+,-,Retarded sexual maturation,-,460+1G>A,13776
4,4.0,13070,f,22.0,+,+,+,-,+,Abnormal EEG,Psychiatric illness,-,599delT,13070
5,5.0,13885,f,35.0,+,+,+,-,+,-,Cataract,-,1096C>T,13885
6,6.0,13062,f,25.0,+,+,+,+,+,"Ataxia, nystagmus",,,676C>T,13062
7,7.0,13076,m,26.0,+,+,+,+,-,-,"Retarded sexual maturation, mental retardation",-,599delT,13076
8,8.0,13073,f,35.0,+,+,+,-,+,Ataxia,"Cataract, psychiatric illness, ragged red fibers",,1096C>T,13073
9,9.0,13781,m,19.0,+,+,+,+,+,Abnormal EEG,Retarded sexual maturation,+,1558C>T,13781


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cells we create a dictionary of ColumnMappers. SimpleColumnMapperGenerator first tries to map each column automatically, treating '+' as observed and '-' as excluded; the free-text phenotype columns that it reports as unmapped are then mapped by hand.</p>

In [7]:
hpo_cr = parser.get_hpo_concept_recognizer()
generator = SimpleColumnMapperGenerator(df=df, observed='+', excluded='-', hpo_cr=hpo_cr)
column_mapper_d = generator.try_mapping_columns()

In [8]:
from IPython.display import HTML, display
display(HTML(generator.to_html()))

Result,Columns
Mapped,Diabetes mellitus; Hearing impairment; Diabetes insipidus; Abnormality of the kidney
Unmapped,Family; Patient; Sex; Age; Progressive optic atrophy; Neurological abnormalities; Other complications; Consangui; Variant; patient_id
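<p>To make the observed='+' / excluded='-' convention concrete, here is a minimal sketch of how a single cell could be classified. It is a conceptual illustration only and makes no claim about how SimpleColumnMapperGenerator works internally; the helper classify_cell and the example row are hypothetical.</p>

```python
from typing import Optional


def classify_cell(cell: object, observed: str = "+", excluded: str = "-") -> Optional[str]:
    """Classify one table cell as 'observed', 'excluded', or None (not recorded)."""
    text = str(cell).strip()  # tolerate stray whitespace around the symbol
    if text == observed:
        return "observed"
    if text == excluded:
        return "excluded"
    return None  # anything else carries no usable information


# Hypothetical row using column names from the table above
row = {
    "Diabetes mellitus": "+",
    "Diabetes insipidus": "-",
    "Hearing impairment": " + ",
    "Abnormality of the kidney": "n/a",
}
status = {column: classify_cell(value) for column, value in row.items()}
print(status)
```

<p>Cells that contain neither symbol are left unrecorded rather than being treated as excluded, so missing data is not mistaken for a negative finding.</p>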


In [10]:
neurological = {
    'Ataxia': 'Ataxia',
    'nystagmus': 'Nystagmus',
    'Abnormal EEG': 'EEG abnormality',
}
neurologicalMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=neurological)
neurologicalMapper.preview_column(df['Neurological abnormalities'])
column_mapper_d['Neurological abnormalities'] = neurologicalMapper

other = {
    'Retarded sexual maturation': 'Puberty and gonadal disorders',
    'depression': 'Depression',
    'psychiatric illness': 'Behavioral abnormality',
    'cataract': 'Cataract',
    'mental retardation': 'Intellectual disability',
    'ragged red fibers': 'Ragged-red muscle fibers',
}
otherMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=other)
otherMapper.preview_column(df['Other complications'])
column_mapper_d['Other complications'] = otherMapper
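<p>The option dictionaries above map text fragments found in a cell to HPO labels. The sketch below shows one way to apply such a dictionary to a comma-separated cell. Some cells in the table capitalize entries differently from the dictionary keys (for example "Cataract" versus 'cataract'), so this sketch compares case-insensitively. Whether OptionColumnMapper does the same is not assumed here, and the helper map_options is hypothetical.</p>

```python
from typing import Dict, List


def map_options(cell: str, option_d: Dict[str, str]) -> List[str]:
    """Return the HPO labels for every comma-separated entry found in option_d."""
    # Build a case-insensitive lookup table from the option dictionary
    lookup = {key.casefold(): label for key, label in option_d.items()}
    labels: List[str] = []
    for item in cell.split(","):
        label = lookup.get(item.strip().casefold())
        if label is not None and label not in labels:
            labels.append(label)  # keep the order of first appearance
    return labels


other = {
    'Retarded sexual maturation': 'Puberty and gonadal disorders',
    'depression': 'Depression',
    'psychiatric illness': 'Behavioral abnormality',
    'cataract': 'Cataract',
    'mental retardation': 'Intellectual disability',
    'ragged red fibers': 'Ragged-red muscle fibers',
}
print(map_options("Cataract, psychiatric illness, ragged red fibers", other))
```

<p>Entries that are not in the dictionary, such as "-", simply yield no label.</p>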

<h2>Variant Data</h2>
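<p>The variant data is listed in the Variant column as cDNA changes without the "c." prefix. Below, each entry is converted to an HGVS expression relative to the transcript NM_006005.3 and validated against the hg38 genome build.</p>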

In [16]:
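hg38 = 'hg38'
default_genotype = 'heterozygous'
WFS1_transcript='NM_006005.3'
vvalidator = VariantValidator(genome_build=hg38, transcript=WFS1_transcript)
# Collect the distinct raw strings from the Variant column
variant_list = df['Variant'].unique()
print(variant_list)
# Maps each raw string from the table to its encoded variant
variant_d = {}
for v in variant_list:
    if v == "1380del9":
        # The table uses a non-HGVS shorthand for this 9-bp deletion
        hgvs = "c.1385_1393del"
    else:
        # All other entries only lack the "c." prefix
        hgvs = f"c.{v}"
    print(f"{v} - {hgvs}")
    var = vvalidator.encode_hgvs(hgvs)
    print(f"{v}: {var}")
    # The raw string, including any leading space, is used as the key
    variant_d[v] = var
print(f"Extracted {len(variant_d)} unique variants")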

['1380del9' ' 460+1G>A' ' 599delT' ' 1096C>T' '676C>T' '599delT' '1096C>T'
 '1558C>T']
1380del9 - c.1385_1393del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_006005.3%3Ac.1385_1393del/NM_006005.3?content-type=application%2Fjson
1380del9: chr4:6301174CCACCGAGGT>C
 460+1G>A - c. 460+1G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_006005.3%3Ac. 460+1G>A/NM_006005.3?content-type=application%2Fjson
 460+1G>A: chr4:6289132G>A
 599delT - c. 599delT
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_006005.3%3Ac. 599delT/NM_006005.3?content-type=application%2Fjson
 599delT: chr4:6291334CT>C
 1096C>T - c. 1096C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_006005.3%3Ac. 1096C>T/NM_006005.3?content-type=application%2Fjson
 1096C>T: chr4:6300891C>T
676C>T - c.676C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_006005.3%3Ac.676C>T/NM_006005.3?conte

In [17]:
varMapper = VariantColumnMapper(variant_d=variant_d,variant_column_name='Variant', default_genotype=default_genotype)
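<p>The output above shows that some raw entries carry a leading space, so the same variant appears twice in the list (for example ' 599delT' and '599delT'). A minimal sketch of normalizing the raw strings before building HGVS expressions is given below. The names SPECIAL_CASES, to_hgvs_c, and hgvs_by_raw_value are hypothetical, and the sketch only prepares strings; it does not call VariantValidator.</p>

```python
from typing import Dict, Iterable

# Shorthand from the table that cannot be converted by adding a "c." prefix
SPECIAL_CASES = {"1380del9": "c.1385_1393del"}


def to_hgvs_c(raw: str) -> str:
    """Turn a raw table entry such as ' 460+1G>A' into 'c.460+1G>A'."""
    cleaned = raw.strip()  # remove leading and trailing whitespace
    if cleaned in SPECIAL_CASES:
        return SPECIAL_CASES[cleaned]
    return f"c.{cleaned}"


def hgvs_by_raw_value(raw_values: Iterable[str]) -> Dict[str, str]:
    """Map every raw cell value (kept verbatim as key) to its HGVS string."""
    return {raw: to_hgvs_c(raw) for raw in raw_values}


raw_variants = ["1380del9", " 460+1G>A", " 599delT", "599delT"]
mapping = hgvs_by_raw_value(raw_variants)
# Distinct HGVS expressions that would need to be validated
print(sorted(set(mapping.values())))
```

<p>Keeping the raw value as the dictionary key means rows can still be looked up by their unmodified cell contents, while each distinct HGVS expression needs to be validated only once.</p>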

<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [18]:
ageMapper = AgeColumnMapper.by_year('Age')
ageMapper.preview_column(df['Age'])

Unnamed: 0,original column contents,age
0,22.0,P22Y0M
1,11.0,P11Y0M
2,20.0,P20Y0M
3,17.0,P17Y0M
4,35.0,P35Y0M
5,25.0,P25Y0M
6,26.0,P26Y0M
7,19.0,P19Y0M
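<p>The preview above shows ages in years rendered as ISO 8601 durations such as P22Y0M. A minimal sketch of such a conversion follows. The function years_to_iso8601 is hypothetical, and its treatment of fractional years (rounding to whole months) is an assumption of this example, not a description of AgeColumnMapper.</p>

```python
def years_to_iso8601(age_years: float) -> str:
    """Format an age given in years as an ISO 8601 duration with years and months."""
    if age_years < 0:
        raise ValueError("age must not be negative")
    years = int(age_years)
    # Convert the fractional part of the year to whole months
    months = int(round((age_years - years) * 12))
    if months == 12:  # rounding can push the months up to a full year
        years, months = years + 1, 0
    return f"P{years}Y{months}M"


for age in (22.0, 11.0, 17.5):
    print(age, years_to_iso8601(age))
```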


In [19]:
sexMapper = SexColumnMapper(male_symbol='m', female_symbol='f', column_name='Sex')
sexMapper.preview_column(df['Sex'])

Unnamed: 0,original column contents,sex
0,f,FEMALE
1,f,FEMALE
2,f,FEMALE
3,m,MALE
4,f,FEMALE
5,f,FEMALE
6,f,FEMALE
7,m,MALE
8,f,FEMALE
9,m,MALE


In [20]:

encoder = CohortEncoder(df=df,
                        hpo_cr=hpo_cr,
                        column_mapper_d=column_mapper_d,
                        individual_column_name="patient_id",
                        agemapper=ageMapper,
                        sexmapper=sexMapper,
                        variant_mapper=varMapper,
                        metadata=metadata,
                        pmid=PMID)
encoder.set_disease(disease_id='OMIM:222300', label='Wolfram syndrome 1')

In [21]:
individuals = encoder.get_individuals()

In [22]:
phenopackets = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individuals]
table = PhenopacketTable(phenopacket_list=phenopackets)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
5519 (FEMALE; P22Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.1385_1393del (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Diabetes insipidus (HP:0000873); Abnormality of the kidney (HP:0000077); Ataxia (HP:0001251); Nystagmus (HP:0000639); Depression (HP:0000716); Puberty and gonadal disorders (HP:0008373)
13883 (FEMALE; P11Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.1385_1393del (heterozygous),Diabetes mellitus (HP:0000819); Abnormality of the kidney (HP:0000077)
13775 (FEMALE; P20Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.460+1G>A (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Diabetes insipidus (HP:0000873)
13776 (MALE; P17Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.460+1G>A (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Diabetes insipidus (HP:0000873); Abnormality of the kidney (HP:0000077); Puberty and gonadal disorders (HP:0008373)
13070 (FEMALE; P22Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.599del (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Abnormality of the kidney (HP:0000077); EEG abnormality (HP:0002353)
13885 (FEMALE; P35Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.1096C>T (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Abnormality of the kidney (HP:0000077)
13062 (FEMALE; P25Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.676C>T (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Diabetes insipidus (HP:0000873); Abnormality of the kidney (HP:0000077); Ataxia (HP:0001251); Nystagmus (HP:0000639)
13076 (MALE; P26Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.599del (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Diabetes insipidus (HP:0000873); Intellectual disability (HP:0001249); Puberty and gonadal disorders (HP:0008373)
13073 (FEMALE; P35Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.1096C>T (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Abnormality of the kidney (HP:0000077); Ataxia (HP:0001251); Behavioral abnormality (HP:0000708); Ragged-red muscle fibers (HP:0003200)
13781 (MALE; P19Y0M),Wolfram syndrome 1 (OMIM:222300),NM_006005.3:c.1558C>T (heterozygous),Diabetes mellitus (HP:0000819); Hearing impairment (HP:0000365); Diabetes insipidus (HP:0000873); Abnormality of the kidney (HP:0000077); EEG abnormality (HP:0002353); Puberty and gonadal disorders (HP:0008373)


In [23]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                             pmid=PMID,
                                             metadata=metadata.to_ga4gh(),
                                             outdir=output_directory)

We output 10 GA4GH phenopackets to the directory phenopackets
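<p>As a quick sanity check, the exported files can be read back with the standard library. The sketch below assumes that each phenopacket was written as a separate file with a .json extension and carries a top-level "id" field. The file naming is an assumption here, and list_phenopacket_ids is a hypothetical helper.</p>

```python
import json
from pathlib import Path
from typing import List


def list_phenopacket_ids(directory: str) -> List[str]:
    """Return the 'id' of every JSON file in the directory, sorted by file name."""
    ids: List[str] = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            packet = json.load(handle)
        # Fall back to the file name if the document has no "id" field
        ids.append(packet.get("id", path.stem))
    return ids
```

<p>For this notebook, len(list_phenopacket_ids("phenopackets")) would be expected to equal the number of individuals reported above, provided the directory contains no other JSON files.</p>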
