<h1>SMARCC2: Bosch et al. 2023</h1>
<p>Extract the clinical data from <a href="https://pubmed.ncbi.nlm.nih.gov/37551667/" target="_blank">Bosch E, et al. (2023) Elucidating the clinical and molecular spectrum of SMARCC2-associated NDD in a cohort of 65 affected individuals. Genet Med. PMID:37551667</a>.</p>
<p>The authors report that clinical presentation differed significantly, with LGD variants being predominantly inherited and associated with mildly reduced or normal cognitive development, while non-truncating variants were mostly de novo and presented with severe developmental delay. </p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
import math
from csv import DictReader
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import re
from pyphetools.creation import *
from pyphetools.output import PhenopacketTable
import importlib.metadata
__version__ = importlib.metadata.version("pyphetools")
print(f"Using pyphetools version {__version__}")

Using pyphetools version 0.4.8


In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)
pmid="37551667"

In [3]:
df = pd.read_excel("input/FileS2_cases_clinical-table.xlsx", index_col ="Patient_ID (in Project)", comment="##")

In [4]:
df.head()

Unnamed: 0_level_0,HPO,Ind-01,Ind-02,Ind-03,Ind-04,Ind-05,Ind-06,Ind-07,Ind-08,Ind-09,...,Machol_Ind_15,Li_Ind_1,Chen_Pat_123,Chen_Pat_124,Chen_Pat_126,Sun_case,Yi_case,Lo_twin_1,Lo_twin_2,Gofin_Subject_5
Patient_ID (in Project),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
#family,,Fam-01,Fam-02,Fam-03,Fam-04,Fam-05,Fam-06,Fam-07,Fam-08,Fam-09,...,Fam-45,Fam-46,Fam-47,Fam-48,Fam-49,Fam-50,Fam-51,Fam-52,Fam-52,Fam-53
#group,,novel,novel,novel,novel,novel,novel,novel,novel,novel,...,literature,literature,literature,literature,literature,literature,literature,literature,literature,literature
#analysis,,include,include,exclude,include,include,exclude,include,include,include,...,include,include,include,include,include,include,include,include,include,exclude
#Case published previously,,no,no,no,no,no,no,no,no,no,...,yes,yes,yes,yes,yes,yes,yes,yes,yes,yes
#Literature reference,,,,,,,,,,,...,PMID:30580808,PMID:34881817,PMID:34906496,PMID:34906496,PMID:34906496,PMID:35241061,PMID:35536477,PMID:35699097,PMID:35699097,PMID:35796094


<h2>Parse strategy</h2>
<p>This supplemental file already contains HPO IDs and all of the information needed for each patient.</p>

In [5]:
pat_id_list = df.columns

In [6]:
sex_pd_series = df.loc["#Sex"]
sex_pd_series = sex_pd_series.drop(labels=['HPO'])

In [7]:
hg38_var_list = df.loc["#Variant(s) in SMARCC2 (genomic hg38/GRCh38)"]
hg38_var_list = hg38_var_list.drop(labels=['HPO'])

In [8]:
variant_type_list = df.loc["#variant type"]
variant_type_list.unique()

array([nan, 'missense', 'missense; confirmed protein loss', 'truncating',
       'splice; potentially inframe', 'splice; confirmed inframe',
       'inframe', 'splice; potentially truncating',
       'splice; confirmed NMD'], dtype=object)

In [9]:
allelic_state_list = df.loc["#Allelic state"]
allelic_state_list = allelic_state_list.drop(labels=['HPO'])
allelic_state_list.unique()

array(['heterozygous', 'heterozygous '], dtype=object)

In [10]:
variant_hgvs_pd_series = df.loc["#Variant(s) in SMARCC2 (NM_003075.5: coding and protein)"]

<h2>HPO data</h2>
<p>The file contains rows that are already encoded as HPO terms. We look for a yes or no at the beginning of each cell, and do not parse any other text giving more detail.</p>
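<p>A minimal sketch of the yes/no convention described above, assuming that a cell either starts with <code>yes</code> or <code>no</code> (optionally followed by free text after a semicolon) or is empty/NaN when the feature was not assessed. The helper name <code>parse_yes_no</code> is hypothetical and is not part of pyphetools.</p>

```python
def parse_yes_no(cell):
    """Return True for 'yes...', False for 'no...', and None otherwise."""
    if not isinstance(cell, str):
        return None  # NaN or other non-text cell: feature not assessed
    # keep only the text before the first semicolon,
    # e.g. "yes; strabismus" -> "yes"
    status = cell.split(";", 1)[0].strip().lower()
    if status == "yes":
        return True
    if status == "no":
        return False
    return None
```

<p>Note that an exact comparison such as <code>item == "yes"</code> would skip cells like <code>yes; strabismus</code>, whereas a prefix-based check like this one would count them as observed.</p>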

In [11]:
df_hpo = df[df['HPO'].fillna('').str.contains("HP:")]
hpo_ids = df_hpo["HPO"].to_list()

In [12]:
df_hpo.head()

Unnamed: 0_level_0,HPO,Ind-01,Ind-02,Ind-03,Ind-04,Ind-05,Ind-06,Ind-07,Ind-08,Ind-09,...,Machol_Ind_15,Li_Ind_1,Chen_Pat_123,Chen_Pat_124,Chen_Pat_126,Sun_case,Yi_case,Lo_twin_1,Lo_twin_2,Gofin_Subject_5
Patient_ID (in Project),Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Microcephaly,HP:0000252,no,yes,yes,yes,no,no,no,no,yes,...,no,no,,,,,,,,
Macrocephaly,HP:0000256,no,no,no,no,no,no,no,yes,no,...,no,yes,,,,,,,,
Abnormal facial shape,HP:0001999,yes,no,yes,yes,yes; occipital plagiocephaly,no,no,yes; macrocephaly,no,...,no,yes; macrocephaly,,,,,no,,,
Abnormality of the eye,HP:0000478,no,yes; strabismus,yes; ptosis at 6y,no,yes; strabismus,yes; strabismus,no,no,no,...,no,,,,,,,,,
Abnormality of the hand,HP:0001155,no,no,no,yes; clinodactyly,no,no,no; slender appearance of hands and fingers without true arachnodactyly,no,no,...,no,yes; fetal finger pads bilaterally,,,,,,,,


In [13]:
df_hpo = df_hpo.set_index('HPO')
df_hpo.head()

Unnamed: 0_level_0,Ind-01,Ind-02,Ind-03,Ind-04,Ind-05,Ind-06,Ind-07,Ind-08,Ind-09,Ind-10,...,Machol_Ind_15,Li_Ind_1,Chen_Pat_123,Chen_Pat_124,Chen_Pat_126,Sun_case,Yi_case,Lo_twin_1,Lo_twin_2,Gofin_Subject_5
HPO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
HP:0000252,no,yes,yes,yes,no,no,no,no,yes,no,...,no,no,,,,,,,,
HP:0000256,no,no,no,no,no,no,no,yes,no,no,...,no,yes,,,,,,,,
HP:0001999,yes,no,yes,yes,yes; occipital plagiocephaly,no,no,yes; macrocephaly,no,no,...,no,yes; macrocephaly,,,,,no,,,
HP:0000478,no,yes; strabismus,yes; ptosis at 6y,no,yes; strabismus,yes; strabismus,no,no,no,no,...,no,,,,,,,,,
HP:0001155,no,no,no,yes; clinodactyly,no,no,no; slender appearance of hands and fingers without true arachnodactyly,no,no,no,...,no,yes; fetal finger pads bilaterally,,,,,,,,


In [14]:
patient_id_to_hpo_observation_list_d = defaultdict(list)
hpo_observed = 0
hpo_excluded = 0
for hpo_term_id, row in df_hpo.iterrows():
    # deal with multiple terms, e.g., HP:0001263; HP:0001249
    fields = hpo_term_id.split(";")
    for term_id in fields:
        term_id = term_id.strip()
        hpo_term = hpo_cr.get_term_from_id(term_id)
        hpo_term_label = hpo_term.label
        if hpo_term_label is None or len(hpo_term_label) < 5:
            raise ValueError(f"Could not find HPO term label for {term_id}")
        # Series.iteritems() was removed in pandas 2.0; items() is the replacement
        for patient_id, item in row.items():
            if item == "yes":
                hterm = HpTerm(label=hpo_term_label, hpo_id=term_id)
                hpo_observed += 1
            elif item == "no":
                hterm = HpTerm(label=hpo_term_label, hpo_id=term_id, observed=False)
                hpo_excluded += 1
            else:
                # skip if not exactly yes/no -- in this case nothing is recorded,
                # so a stale hterm from a previous cell is never appended
                continue
            patient_id_to_hpo_observation_list_d[patient_id].append(hterm)
print(f"Got {hpo_observed} observed HPO terms and {hpo_excluded} excluded terms for {len(patient_id_to_hpo_observation_list_d)} individuals")

Got 626 observed HPO terms and 2821 excluded terms for 65 individuals
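
<p>Some rows carry more than one HPO ID in a single cell (e.g., <code>HP:0001263; HP:0001249</code>). The following is a small sketch of how such a cell could be split and sanity-checked before looking up labels. It assumes the usual <code>HP:</code> prefix followed by seven digits; the helper name <code>split_hpo_ids</code> is hypothetical.</p>

```python
import re

# HPO identifiers have the form HP: followed by seven digits
HPO_ID_PATTERN = re.compile(r"HP:\d{7}")

def split_hpo_ids(cell):
    """Split a semicolon-separated cell into a list of validated HPO ids."""
    ids = [part.strip() for part in cell.split(";") if part.strip()]
    for term_id in ids:
        if HPO_ID_PATTERN.fullmatch(term_id) is None:
            raise ValueError(f"Malformed HPO id: '{term_id}'")
    return ids
```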


<h2>Variants</h2>
<p>All variants in this data set are heterozygous and refer to NM_003075.5.</p>
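<p>The unique allelic-state values printed earlier include one entry with a trailing space (<code>'heterozygous '</code>). A short sketch of how the claim above could be checked after normalizing whitespace; the helper name <code>normalized_states</code> is hypothetical, and the input list simply mirrors the two values shown in that output.</p>

```python
def normalized_states(values):
    """Strip whitespace and lowercase each text value; ignore non-text cells."""
    return {v.strip().lower() for v in values if isinstance(v, str)}

# the two raw values reported by allelic_state_list.unique()
states = normalized_states(["heterozygous", "heterozygous "])
if states != {"heterozygous"}:
    raise ValueError(f"Unexpected allelic states: {states}")
```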

In [15]:
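variant_patient_to_hgvs_d = {}
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for idx, item in variant_hgvs_pd_series.items():
    if idx == "HPO": continue # skip the HPO id column, which is not a patient
    # keep only the first comma-separated field of the cell
    fields = item.split(",")
    variant_patient_to_hgvs_d[idx] = fields[0].strip()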
#for k, v in variant_str_d.items():
#    print(f"\"{k}\" \"{v}\"")

In [16]:
genome = 'hg38'
transcript='NM_003075.5'
vv = VariantValidator(genome_build=genome, transcript=transcript)

In [18]:
variant_d = defaultdict(Variant)
variant_set = set(variant_patient_to_hgvs_d.values())
# Note that c.(?_-22)_(*1330_?)del is chr12-55960344-56192514-DEL
for hgvs in variant_set:
    print(f"Encoding {hgvs}")
    if hgvs == "c.(?_-22)_(*1330_?)del":
        variant = StructuralVariant.chromosomal_deletion(cell_contents="chr12-55960344-56192514-DEL", gene_symbol="SMARCC2",
                                                   gene_id="SMARCC2")
    else:
        variant = vv.encode_hgvs(hgvs)
    variant.set_heterozygous()
    variant_d[hgvs] = variant

Encoding c.1826T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003075.5%3Ac.1826T>C/NM_003075.5?content-type=application%2Fjson
Encoding c.(?_-22)_(*1330_?)del
Encoding c.1838T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003075.5%3Ac.1838T>C/NM_003075.5?content-type=application%2Fjson
Encoding c.2074G>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003075.5%3Ac.2074G>C/NM_003075.5?content-type=application%2Fjson
Encoding c.880_881del
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003075.5%3Ac.880_881del/NM_003075.5?content-type=application%2Fjson
Encoding c.2686A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003075.5%3Ac.2686A>G/NM_003075.5?content-type=application%2Fjson
Encoding c.326dup
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_003075.5%3Ac.326dup/NM_003075.5?content-type=application%2Fjson
Enco

<h2>Putting it all together</h2>

In [19]:
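# Inputs assembled above:
# patient_id_to_hpo_observation_list_d (HPO terms per patient)
# variant_patient_to_hgvs_d and variant_d (HGVS string per patient, variant object per HGVS string)
# pat_id_list (patient ids)
# sex_pd_series (sex per patient)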
disease_id =  "OMIM:618362"
disease_label = "Coffin-Siris syndrome 8"
individual_list = []
for pat_id in pat_id_list:
    if pat_id == "HPO": continue
    hpo_terms = patient_id_to_hpo_observation_list_d.get(pat_id)
    sex = sex_pd_series[pat_id]
    variant_hgvs = variant_patient_to_hgvs_d.get(pat_id)
    if variant_hgvs is None:
        raise ValueError(f"Could not find variant hgvs for patient id \"{pat_id}\"")
    variant = variant_d.get(variant_hgvs)
    if variant is None:
        raise ValueError(f"Could not find variant for {variant_hgvs}")
    indiv = Individual(individual_id=pat_id, 
                       hpo_terms=hpo_terms, 
                       sex=sex,
                       interpretation_list=[variant.to_ga4gh_variant_interpretation()],
                       disease_id=disease_id, 
                       disease_label=disease_label)
    individual_list.append(indiv)
print(f"We created {len(individual_list)} individual objects")

We created 65 individual objects


<h3>Output</h3>

In [20]:
Individual.output_individuals_as_phenopackets(individual_list=individual_list, 
                                              pmid=pmid, 
                                              metadata=metadata.to_ga4gh(), 
                                              outdir="phenopackets")

65

<h2>Validation</h2>
<p>Set up phenopacket-tools as described <a href="http://phenopackets.org/phenopacket-tools/stable/cli.html">here</a>.</p>
<p>Then execute the following command in the terminal: <code>pxf validate *.json</code></p>
<p>No errors were detected.</p>
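<p>As a quick pre-check before running phenopacket-tools, the output files can also be inspected from Python. This is only a sketch and not a substitute for <code>pxf validate</code>: it assumes that <code>id</code> and <code>metaData</code> are the required top-level keys of a phenopacket JSON file, and the function names are hypothetical.</p>

```python
import json
from pathlib import Path

# assumption: top-level keys expected in every phenopacket JSON file
REQUIRED_KEYS = ("id", "metaData")

def missing_keys(packet):
    """Return the required top-level keys that are absent from a parsed phenopacket."""
    return [key for key in REQUIRED_KEYS if key not in packet]

def check_directory(outdir="phenopackets"):
    """Map each JSON file name in outdir to its missing keys (only files with problems)."""
    problems = {}
    for path in sorted(Path(outdir).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            packet = json.load(fh)
        missing = missing_keys(packet)
        if missing:
            problems[path.name] = missing
    return problems
```

<p>Calling <code>check_directory("phenopackets")</code> returns an empty dictionary if every file parses as JSON and contains the assumed keys.</p>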