<h1>SMARCB1 Kordes</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/34101994/" target="_blank">Kordes et al. (2021) Evidence for a low-penetrant extended phenotype of rhabdoid tumor predisposition syndrome type 1 from a kindred with gain of SMARCB1 exon 6</a>.</p>

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import re
from IPython.display import display, HTML, JSON
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.9.8


<h2>Importing HPO data</h2>

In [2]:
PMID = "PMID:34101994"
title = "Evidence for a low-penetrant extended phenotype of rhabdoid tumor predisposition syndrome \
type 1 from a kindred with gain of SMARCB1 exon 6"
cite = Citation(pmid=PMID, title=title)
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155", citation=cite)
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


<h2>Creating and loading the table</h2>
<p>The paper does not include a dedicated clinical table. We therefore created one manually and load it here.</p>

In [3]:
df = pd.read_excel('input/PMID_34101994_Kordes.xlsx')

In [4]:
df

Unnamed: 0.1,Unnamed: 0,III.1,II.2
0,PMID,34101994,34101994
1,age,21,57
2,sex,male,male
3,pathogenic variant,,
4,Feeding difficulties,,
5,Lethargy,,
6,Vomiting,,
7,Hydrocephalus,,
8,Neoplasm,+,+
9,Atypical teratoid/rhabdoid tumor,+,


<h2>Converting to row-based format</h2>

In [5]:
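dft = df.transpose()
dft.columns = dft.iloc[0]  # the first row holds the attribute names
dft.drop(dft.index[0], inplace=True)
dft['patient_id'] = dft.index  # individual identifiers (III.1, II.2) become a column
dft.columns = dft.columns.str.strip()
dft = dft.dropna(axis=1, how='all')  # drop attributes with no value for any individual
dft.head()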

Unnamed: 0,PMID,age,sex,Neoplasm,Atypical teratoid/rhabdoid tumor,Leukemia,Ulcerative colitis,Specific learning disability,Ependymoma,patient_id
III.1,34101994,21,male,+,+,-,-,+,-,III.1
II.2,34101994,57,male,+,,+,+,-,+,II.2
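
<p>The transposition above can be sketched on a toy table. The table contents and the name <code>toy</code> are hypothetical and only mimic the spreadsheet layout (one row per attribute, one column per individual). The sketch uses <code>set_index(...).T</code> as a compact alternative to the steps in the cell above.</p>

```python
import pandas as pd

# Hypothetical toy table laid out like the spreadsheet:
# one row per attribute, one column per individual.
toy = pd.DataFrame({
    "Unnamed: 0": ["age", "sex", "Neoplasm"],
    "III.1": [21, "male", "+"],
    "II.2": [57, "male", "+"],
})

# Use the attribute names as the index, then transpose so that
# each individual becomes one row.
rows = toy.set_index("Unnamed: 0").T
rows.columns.name = None
rows["patient_id"] = rows.index

print(rows)
```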


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell, we create a dictionary of ColumnMappers. The code is identical except that we use <code>df.loc</code> to select the row data.</p>

In [6]:
generator = SimpleColumnMapperGenerator(df=dft.loc[:, :],
                                        observed='+',
                                        excluded='-',
                                        hpo_cr=hpo_cr)

column_mapper_d = generator.try_mapping_columns()
display(HTML(generator.to_html()))

Result,Columns
Mapped,Neoplasm; Atypical teratoid/rhabdoid tumor; Leukemia; Ulcerative colitis; Specific learning disability; Ependymoma
Unmapped,PMID; age; sex; patient_id
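
<p>To illustrate the <code>observed='+'</code> and <code>excluded='-'</code> convention used above, here is a minimal sketch of how a single cell could be classified. The function <code>cell_to_status</code> is hypothetical and is not part of pyphetools; it only shows the idea that any other cell content, including an empty cell, is treated as not measured.</p>

```python
def cell_to_status(cell, observed="+", excluded="-"):
    """Classify one table cell as observed, excluded, or not measured.

    Hypothetical helper for illustration; not the pyphetools implementation.
    """
    if cell is None:
        return "not measured"
    text = str(cell).strip()
    if text == observed:
        return "observed"
    if text == excluded:
        return "excluded"
    # Anything else (empty string, NaN, free text) carries no information here.
    return "not measured"


for value in ["+", "-", None, float("nan")]:
    print(repr(value), "->", cell_to_status(value))
```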


<h2>Variant Data</h2>

In [7]:
SMARCB1_transcript = "NM_003073.5"
SMARCB1_symbol = "SMARCB1"
SMARCB1_id = "HGNC:11103"
smarcb6_gain = "gain of SMARCB1 exon 6"
gain_var = StructuralVariant.chromosomal_duplication(cell_contents=smarcb6_gain,
                                                     gene_id=SMARCB1_id,
                                                     gene_symbol=SMARCB1_symbol)
gain_var.set_heterozygous()

<h2>Demographic data</h2>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [8]:
ageMapper = AgeColumnMapper.by_year('age')
ageMapper.preview_column(dft['age'])

Unnamed: 0,original column contents,age
0,21,P21Y
1,57,P57Y
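
<p>The preview above shows ages in whole years rendered as ISO 8601 durations (21 becomes P21Y). A minimal sketch of that conversion, using a hypothetical helper <code>years_to_iso8601</code> that is not part of pyphetools:</p>

```python
def years_to_iso8601(years):
    """Render an age in whole years as an ISO 8601 duration string.

    Hypothetical helper for illustration only.
    """
    years = int(years)
    if years < 0:
        raise ValueError("age cannot be negative")
    return f"P{years}Y"


print(years_to_iso8601(21))
print(years_to_iso8601(57))
```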


In [9]:
# sex is recorded in the 'sex' column; both individuals in this paper are male
sexMapper = SexColumnMapper(male_symbol='male', female_symbol='female', column_name='sex')
sexMapper.preview_column(dft['sex'])

Unnamed: 0,original column contents,sex
0,male,MALE
1,male,MALE


In [10]:
encoder = CohortEncoder(df=dft, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper, 
                        metadata=metadata)
rtps = Disease(disease_id='OMIM:609322', disease_label='Rhabdoid tumor predisposition syndrome-1')
encoder.set_disease(disease=rtps)

In [11]:
individuals = encoder.get_individuals()
for i in individuals:
    i.add_variant(gain_var)

In [12]:
cvalidator = CohortValidator(cohort=individuals,
                             ontology=hpo_ontology,
                             min_hpo=1,
                             allelic_requirement=AllelicRequirement.MONO_ALLELIC)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

Level,Error category,Count
WARNING,REDUNDANT,2
INFORMATION,NOT_MEASURED,1
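
<p>The REDUNDANT warnings are consistent with a general term such as Neoplasm being annotated together with a more specific descendant. Assuming that reading, the idea can be sketched with a toy hierarchy. The <code>ANCESTORS</code> map is a simplified assumption (the real HPO has intermediate terms), and <code>drop_redundant</code> is a hypothetical helper, not the pyphetools validator.</p>

```python
# Toy, simplified ancestor map (an assumption for illustration;
# the real HPO hierarchy has many intermediate terms).
ANCESTORS = {
    "Rhabdoid tumor": {"Neoplasm"},
    "Leukemia": {"Neoplasm"},
    "Ependymoma": {"Neoplasm"},
    "Neoplasm": set(),
    "Specific learning disability": set(),
}


def drop_redundant(observed_terms):
    """Remove observed terms that are ancestors of another observed term."""
    redundant = set()
    for term in observed_terms:
        redundant |= ANCESTORS.get(term, set())
    return [t for t in observed_terms if t not in redundant]


print(drop_redundant(["Neoplasm", "Rhabdoid tumor", "Specific learning disability"]))
```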


In [13]:
individuals = cvalidator.get_error_free_individual_list()
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
III.1 (MALE; P21Y),Rhabdoid tumor predisposition syndrome-1 (OMIM:609322),gain of SMARCB1 exon 6: chromosomal_duplication (SO:1000037),Rhabdoid tumor (HP:0034557); Specific learning disability (HP:0001328); excluded: Leukemia (HP:0001909); excluded: Ulcerative colitis (HP:0100279); excluded: Ependymoma (HP:0002888)
II.2 (MALE; P57Y),Rhabdoid tumor predisposition syndrome-1 (OMIM:609322),gain of SMARCB1 exon 6: chromosomal_duplication (SO:1000037),Leukemia (HP:0001909); Ulcerative colitis (HP:0100279); Ependymoma (HP:0002888); excluded: Specific learning disability (HP:0001328)


In [14]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                              metadata=metadata,
                                              outdir=output_directory)

We output 2 GA4GH phenopackets to the directory phenopackets
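
<p>To check an exported phenopacket, the observed and excluded features can be separated with the standard library alone. The sketch below parses a hand-written JSON fragment that follows the GA4GH Phenopacket Schema field names (<code>phenotypicFeatures</code>, <code>type</code>, <code>excluded</code>); the fragment is an illustrative assumption, not the file written by the cell above, and <code>split_features</code> is a hypothetical helper.</p>

```python
import json

# Hand-written fragment using GA4GH Phenopacket Schema field names.
# It is an illustrative assumption, not an actual output file.
EXAMPLE_JSON = """
{
  "id": "example",
  "phenotypicFeatures": [
    {"type": {"id": "HP:0034557", "label": "Rhabdoid tumor"}},
    {"type": {"id": "HP:0001909", "label": "Leukemia"}, "excluded": true}
  ]
}
"""


def split_features(phenopacket):
    """Return (observed, excluded) lists of feature labels."""
    observed, excluded = [], []
    for feature in phenopacket.get("phenotypicFeatures", []):
        label = feature["type"]["label"]
        if feature.get("excluded", False):
            excluded.append(label)
        else:
            observed.append(label)
    return observed, excluded


observed, excluded = split_features(json.loads(EXAMPLE_JSON))
print("observed:", observed)
print("excluded:", excluded)
```

<p>For a real file, the same function could be applied to the result of <code>json.load</code> on one of the files in the output directory.</p>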
