<h1>FBXL4</h1>
<p>TODO: add a short introduction to the gene and its associated diseases.</p>

In [1]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import pyphetools
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import ContentValidator


print(f"Using pyphetools version {pyphetools.__version__}")

Using pyphetools version 0.9.1


In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)

In [3]:
df = pd.read_excel('input/FBXL4-data.xlsx')

<h1>Data ingest strategy</h1>
<p>We use pyphetools classes to encode the columns that contain the disease identifier, age, sex, and phenotypes.
For now, we add a column with a placeholder identifier (an identifier is required); ideally the identifiers should be taken from the original publications.
We also add PubMed identifiers (instead of DOIs), for consistency with the other phenopackets in this project.</p>

In [4]:
df.head(2)

Unnamed: 0,omim_id,omim_title,omim_name,hgnc_id,gene_symbol,Zigosity,Location,Variant annotation,Consequence,Refseq,...,Protein structure,Age of diagnosis,Gender,Age at death,Phenotype,Prenatal ultrasound phenotype,MRI phenotype,Cardiac phenotype,Family history,Source
0,615471,MITOCHONDRIAL DNA DEPLETION SYNDROME 13 (ENCEPHALOMYOPATHIC TYPE),Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type),HGNC:13601,FBXL4,Compound heterozygous,NM_001278716.1/NM_001278716.1,c.772G>T(p.Gly258*)/c.1061G>C(p.Trp354Ser),nonsense/missense,NM_001278716.1(FBXL4):c.772G>T;(p.Gly258*)/NM_001278716.1(FBXL4):c.1060T>C;(p.Trp354Arg),...,,Fetus,,,"abnormal eyebrow morphology HP:0000534 , triangular face HP:0000325, abnormal shape of the palpebral fissure HP:0200005, mandibular prognathia HP:0000303, increased serum lactate HP:0002151 HP:0002151, increased serum pyruvate HP:0003542",,,,"Primigravid, no family history",https://doi.org/10.24953/turkjped.2021.05.025
1,615471,MITOCHONDRIAL DNA DEPLETION SYNDROME 13 (ENCEPHALOMYOPATHIC TYPE),Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type),HGNC:13601,FBXL4,homozygous,NM_001278716.1,c.993dupA (p.L332Tfs*3),,NM_001278716.1(FBXL4):c.993dup,...,truncated protein. Leucine position 332 changed to arginine and termination,1.5,,,"global developmental delay HP:0001263, hypotonia HP:0001252, abnormal eating behavior HP:0100738, hypertelorism HP:0000316, short palpebral fissure HP:0012745, internuclear ophthalmoplegia HP:0030773, wide nasal bridge HP:0000431",,abnormality of brain morphology HP:0012443,,no family history,https://doi.org/10.24953/turkjped.2021.05.025


In [5]:
# We need to provide an id for a valid phenopacket. TODO - get original IDs from publications if possible

import random
import string

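def get_random_string(length):
    """Return a random string of lowercase ASCII letters of the given length."""
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(length))

# One placeholder id per row; the omim_id value itself is not used
df['patient_id'] = df['omim_id'].apply(lambda _: get_random_string(10))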
df.set_index('patient_id', inplace=True)
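
<p>Because the placeholder identifiers are random, they change every time the notebook is run. A minimal sketch of a reproducible alternative, assuming stable identifiers are wanted until the original ones are available; the helper name <code>make_patient_ids</code> and the seed value are illustrative and not part of pyphetools.</p>

```python
import random
import string

def make_patient_ids(n, length=10, seed=42):
    """Return n distinct lowercase ids; the same seed always yields the same ids."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    seen = set()
    ordered = []
    while len(ordered) < n:
        candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in seen:  # guard against (unlikely) collisions
            seen.add(candidate)
            ordered.append(candidate)
    return ordered

ids = make_patient_ids(3)
print(ids)
```

<p>With a dataframe, the ids could be assigned with <code>df['patient_id'] = make_patient_ids(len(df))</code>.</p>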

<h3>PMID and title</h3>
<p>For consistency with the rest of the phenopackets, each record should carry a PMID and a title. The following cell adds both as new columns, looked up from the DOI in the Source column; it would be better to have the PMID in the original table.</p>

In [6]:
source_d = {'https://doi.org/10.24953/turkjped.2021.05.025': ["PMID:34738379", "Entero-encephalopathy due to FBXL4-related mtDNA depletion syndrome"],
    'https://doi.org/10.3389/fgene.2019.01300': ["PMID:31969900", "Molecular Characterization of New FBXL4 Mutations in Patients With mtDNA Depletion Syndrome"],
    'https://doi.org/10.24953/turkjped.2020.04.016' : ['PMID:32779419','Different clinical presentation in a patient with two novel pathogenic variants of the FBXL4 gene'],
    'https://doi.org/10.1111/cge.12894':["PMID:27743463","FBXL4 defects are common in patients with congenital lactic acidemia and encephalomyopathic mitochondrial DNA depletion syndrome"],       
    'doi: 10.1007/8904_2015_491':["PMID:26404457","Detailed Biochemical and Bioenergetic Characterization of FBXL4-Related Encephalomyopathic Mitochondrial DNA Depletion"],
    'https://doi.org/10.3389/fgene.2019.00039': ["PMID:30804983", "FBXL4-Related Mitochondrial DNA Depletion Syndrome 13 (MTDPS13): A Case Report With a Comprehensive Mutation Review"],
    'https://doi.org/10.1159/000515928':["PMID:34602956","A Mild Phenotype of Mitochondrial DNA Depletion Syndrome Type 13 with a Novel FBXL4 Variant"],
    'https://doi.org/10.1016/j.gene.2019.01.049':["PMID:30771478", "Whole exome sequencing revealed mutations in FBXL4, UNC80, and ADK in Thai patients with severe intellectual disabilities"],
    'https://doi.org/10.1016/j.ajhg.2013.07.016':["PMID:23993194", "Mutations in FBXL4, encoding a mitochondrial protein, cause early-onset mitochondrial encephalomyopathy"],
}


# Index the dictionary directly so that an unknown source raises a KeyError naming the missing DOI
df['PMID'] = df['Source'].apply(lambda x: source_d[x][0])
df['title'] = df['Source'].apply(lambda x: source_d[x][1])
df[['PMID', 'title']].head()

Unnamed: 0_level_0,PMID,title
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1
egzhrcpymu,PMID:34738379,Entero-encephalopathy due to FBXL4-related mtDNA depletion syndrome
hnkebhcukt,PMID:34738379,Entero-encephalopathy due to FBXL4-related mtDNA depletion syndrome
kzaszrhebc,PMID:34738379,Entero-encephalopathy due to FBXL4-related mtDNA depletion syndrome
bdqwaafjbu,PMID:34738379,Entero-encephalopathy due to FBXL4-related mtDNA depletion syndrome
rlcginzbsl,PMID:31969900,Molecular Characterization of New FBXL4 Mutations in Patients With mtDNA Depletion Syndrome
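
<p>The Source column mixes two DOI spellings (<code>https://doi.org/...</code> and <code>doi: ...</code>), so the lookup only works if each key is typed exactly as it appears in the table. A sketch of one way to make the lookup less brittle by normalizing every source to a bare DOI first; <code>normalize_doi</code> and <code>lookup_source</code> are hypothetical helpers, and the two table entries are copied from the dictionary above.</p>

```python
def normalize_doi(source):
    """Reduce 'https://doi.org/10.x/y' or 'doi: 10.x/y' to the bare DOI '10.x/y'."""
    s = source.strip()
    lowered = s.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            s = s[len(prefix):]
            break
    return s.strip()

def lookup_source(source, table):
    """Return the [PMID, title] entry for a source string, or raise a descriptive KeyError."""
    key = normalize_doi(source)
    if key not in table:
        raise KeyError(f"No PMID entry for source {source!r}")
    return table[key]

# Keys are bare DOIs; values are [PMID, title]
doi_table = {
    "10.1111/cge.12894": ["PMID:27743463", "FBXL4 defects are common in patients with congenital lactic acidemia and encephalomyopathic mitochondrial DNA depletion syndrome"],
    "10.1007/8904_2015_491": ["PMID:26404457", "Detailed Biochemical and Bioenergetic Characterization of FBXL4-Related Encephalomyopathic Mitochondrial DNA Depletion"],
}

print(lookup_source("doi: 10.1007/8904_2015_491", doi_table)[0])
```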


In [7]:
df.columns

Index(['omim_id', 'omim_title', 'omim_name', 'hgnc_id', 'gene_symbol',
       'Zigosity', 'Location', 'Variant annotation', 'Consequence', 'Refseq',
       'Protein ID', 'Genomic location', 'ACGM classification',
       'Protein structure', 'Age of diagnosis', 'Gender', 'Age at death',
       'Phenotype', 'Prenatal ultrasound phenotype', 'MRI phenotype',
       'Cardiac phenotype', 'Family history', 'Source', 'PMID', 'title'],
      dtype='object')

<h1>Variants</h1>
<p>Variant data is available as strings such as c.772G>T(p.Gly258*)/c.1061G>C(p.Trp354Ser) for compound heterozygous variants, or c.772G>T(p.Gly258*) for homozygous variants.
We need to extract the HGVS cDNA string of each variant to pass it to VariantValidator.</p>

In [8]:
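def extract_cdna(variant):
    """
    Split strings like c.772G>T(p.Gly258*) on the open-parenthesis symbol and return the first part.
    Surrounding whitespace is stripped, because some cells have a space before the parenthesis,
    e.g. 'c.993dupA (p.L332Tfs*3)', and the space would otherwise end up in the HGVS string.
    """
    return variant.split("(")[0].strip()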
    
def extract_variant_1(variants):
    """
    Split on the slash ("/") and return the first part (or entire string for homozygous)
    """
    v1 = variants.split("/")[0]
    return extract_cdna(v1)

def extract_variant_2(variants):
    """
    Split on the slash ("/") and return the second part (or entire string for homozygous)
    """
    fields = variants.split("/")
    if len(fields) == 2:
        return extract_cdna(fields[1])
    else:
        # there was only one variant
        return extract_cdna(variants)
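
<p>A quick, self-contained check of the splitting logic on two annotation strings from the table. The helper <code>split_annotation</code> is a hypothetical condensed version of the functions above that returns both alleles at once and strips whitespace.</p>

```python
def split_annotation(annotation):
    """Return (var1, var2) cDNA strings; var2 equals var1 when only one variant is listed."""
    fields = annotation.split("/")
    # Keep only the part before "(" and drop surrounding whitespace
    cdna = [field.split("(")[0].strip() for field in fields]
    if len(cdna) == 2:
        return cdna[0], cdna[1]
    return cdna[0], cdna[0]

print(split_annotation("c.772G>T(p.Gly258*)/c.1061G>C(p.Trp354Ser)"))
print(split_annotation("c.993dupA (p.L332Tfs*3)"))
```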

In [9]:
df["var1"] = df['Variant annotation'].apply(extract_variant_1)
df["var2"] = df['Variant annotation'].apply(extract_variant_2)

In [10]:
var1_list = df["var1"].unique()
var2_list = df["var2"].unique()
var_set = set()
var_set.update(var1_list)
var_set.update(var2_list)
variant_d = {}
hg38 = "hg38"
FBXL4_transcript = "NM_001278716.1"
vvalidator = VariantValidator(genome_build=hg38, transcript=FBXL4_transcript)
for v in var_set:
    var = vvalidator.encode_hgvs(v)
    variant_d[v] = var
    print(f"{v} - {var}")

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.1%3Ac.1510T>C /NM_001278716.1?content-type=application%2Fjson
c.1510T>C  - chr6:98875607A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.1%3Ac.1652T>A/NM_001278716.1?content-type=application%2Fjson
c.1652T>A - chr6:98875465A>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.1%3Ac.1750T>C/NM_001278716.1?content-type=application%2Fjson
c.1750T>C - chr6:98874394A>G
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.1%3Ac.1229C>T/NM_001278716.1?content-type=application%2Fjson
c.1229C>T - chr6:98899356G>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.1%3Ac.851_852del/NM_001278716.1?content-type=application%2Fjson
c.851_852del - chr6:98917379AAG>A
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.1%3Ac.1694A>G/NM_0012787

<h2>Mapping the age column</h2>
<p>We define a dictionary that maps the age strings found in the table to the corresponding ISO 8601 duration strings. Any value that is not in the dictionary will be mapped to "not provided".</p>

In [11]:
string_to_iso_dict = {
    "1.5": "P1Y6M",
    "3.5": "P3Y6M",
    'birth': "P1D",
    '51 days': "P1M21D",
    '19 months': "P1Y7M",
    '10 monhts': "P10M",
    '14 months': "P1Y2M",
    '23 months': "P1Y11M",
    '22 mohts': "P1Y10M",
    '9 months': "P9M",
    '12 years': "P12Y",
    '1 day of age': "P1D",
    '4.5 years': "P4Y6M",
    '4 months': "P4M",
    '1 month': "P1M"
}
ageMapper = AgeColumnMapper.custom_dictionary(column_name='Age of diagnosis', string_to_iso_d=string_to_iso_dict)
#ageMapper.preview_column(df['Age of diagnosis']).head()
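
<p>The regular entries of this dictionary ("19 months", "12 years") follow a simple pattern, so they could also be generated. A minimal sketch, assuming only whole numbers of days, months, or years; <code>age_to_iso8601</code> is a hypothetical helper and not part of pyphetools. Misspelled or free-text values return <code>None</code> and still need a manual entry, and days are not folded into months here (so "51 days" gives P51D rather than the hand-written P1M21D).</p>

```python
import re

def age_to_iso8601(text):
    """Convert '<n> days', '<n> months', or '<n> years' to an ISO 8601 duration, else None."""
    m = re.fullmatch(r"\s*(\d+)\s*(day|month|year)s?\s*", text.lower())
    if m is None:
        return None  # not a recognized pattern (e.g. a typo such as 'monhts')
    n, unit = int(m.group(1)), m.group(2)
    if unit == "year":
        return f"P{n}Y"
    if unit == "day":
        return f"P{n}D"
    # Months: split into whole years plus remaining months
    years, months = divmod(n, 12)
    if years and months:
        return f"P{years}Y{months}M"
    if years:
        return f"P{years}Y"
    return f"P{months}M"

for s in ["19 months", "9 months", "12 years", "10 monhts"]:
    print(s, "->", age_to_iso8601(s))
```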

In [12]:
sexMapper = SexColumnMapper(male_symbol="M", female_symbol="F", unknown_symbol="nan", column_name="Gender")
#sexMapper.preview_column(df['Gender']).head()

In [13]:
# TODO: add vital status to the encoder, or encode it as HPO terms
df['Age at death']

patient_id
egzhrcpymu                       NaN
hnkebhcukt                       NaN
kzaszrhebc                       NaN
bdqwaafjbu                  7 months
rlcginzbsl                 10 months
ufuvyqzezv                    4 days
nsukaiycdf           11 months alive
bmwqmwqedv           19 months alive
seuziqctok           10 months alive
ensmljhqhq           14 months alive
dqsihpwtpm                    5 days
vedoepzlav           23 months alive
dewhmocyye           22 months alive
uljskuasnt                       NaN
jsveocsain                    3 days
ferdlulhgd                  9 monhts
vqmlccrucq                  12 years
qkjedcruhr           14 months alive
sycpebcwme    2 years 8 months alive
lubvpsperx    4 years 6 months alive
zmxcdgvnuu                   3 years
ensdjttwfb                       NaN
bsucnpavjl                   4 years
uybclysyhy                       NaN
zpahipvihp                       NaN
dwdffxtxtp                       NaN
uyropypacq                 
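
<p>As a starting point for the TODO above, the column could first be split into an age string and a vital status. This is only a sketch under one assumption: values ending in "alive" are read as the age at last encounter of a living individual, and all other non-empty values as the age at death. <code>parse_age_at_death</code> is a hypothetical helper; the status names follow the GA4GH Phenopacket Schema vital status values.</p>

```python
import math

def parse_age_at_death(value):
    """Return (age_text, vital_status) for one 'Age at death' cell."""
    # Missing values arrive from pandas as float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None, "UNKNOWN_STATUS"
    text = str(value).strip()
    if not text:
        return None, "UNKNOWN_STATUS"
    if text.lower().endswith("alive"):
        # Assumption: '<age> alive' means the individual was alive at that age
        return text[: -len("alive")].strip(), "ALIVE"
    return text, "DECEASED"

for cell in ["7 months", "11 months alive", float("nan")]:
    print(repr(cell), "->", parse_age_at_death(cell))
```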

<h1>Mapping phenotypic features</h1>
<p>We use pyphetools column mapper classes, which use text mining to recognize one or more HPO terms in each table cell. The <code>preview_column</code>
function can be used to check the results, which appear very accurate. Each mapper is stored in a dictionary keyed by column name, which is used below to encode the entire cohort.</p>

In [14]:
mapper_d = {}

In [15]:
phenotypeColumnMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
phenotypeColumnMapper.preview_column(df['Phenotype'])
mapper_d['Phenotype'] = phenotypeColumnMapper
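
<p>In this table the phenotype cells already contain explicit HPO identifiers next to each label, so the mapper output can be cross-checked without any text mining. A minimal sketch using a regular expression; <code>extract_hpo_ids</code> is a hypothetical helper, not a pyphetools function, and it assumes identifiers of the form HP: followed by seven digits.</p>

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")

def extract_hpo_ids(cell):
    """Return the distinct HPO ids in a cell, in order of first appearance."""
    if not isinstance(cell, str):  # empty cells arrive from pandas as float NaN
        return []
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(HPO_ID.findall(cell)))

cell = "increased serum lactate HP:0002151 HP:0002151, increased serum pyruvate HP:0003542"
print(extract_hpo_ids(cell))
```

<p>Comparing these identifiers with the terms shown by <code>preview_column</code> would reveal cells where the two disagree.</p>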

In [16]:
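# CustomColumnMapper is not defined in pyphetools 0.9.1 (this cell raised a NameError),
# so we use OptionColumnMapper with the same arguments as for the 'Phenotype' column
prenatalUSmapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
prenatalUSmapper.preview_column(df['Prenatal ultrasound phenotype'])
mapper_d['Prenatal ultrasound phenotype'] = prenatalUSmapper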

In [None]:
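# OptionColumnMapper replaces CustomColumnMapper, which is not defined in pyphetools 0.9.1
mriMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})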
mriMapper.preview_column(df['MRI phenotype'])
mapper_d['MRI phenotype'] = mriMapper

In [None]:
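# OptionColumnMapper replaces CustomColumnMapper, which is not defined in pyphetools 0.9.1
cardiacMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})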
cardiacMapper.preview_column(df['Cardiac phenotype'])
mapper_d['Cardiac phenotype'] = cardiacMapper

<h1>Putting it all together</h1>

In [None]:
mdds13 = Disease(disease_id="OMIM:615471", disease_label="Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)")
disease_d = {"615471": mdds13}
diseaseMapper = DiseaseIdColumnMapper(column_name="omim_id", disease_id_map=disease_d)

In [None]:
encoder = MixedCohortEncoder(df=df,
                             hpo_cr=hpo_cr,
                             column_mapper_d=mapper_d,
                             individual_column_name="patient_id",
                             disease_id_mapper=diseaseMapper,
                             pmid_column="PMID",
                             title_column="title",
                             sexmapper=sexMapper,
                             agemapper=ageMapper,
                             metadata=metadata)

In [None]:
individuals = encoder.get_individuals()

In [None]:
import copy

# Retrieve the variant strings and add Variant objects to each individual.
# The individual id (i.id) is also the index of the pandas dataframe.
for i in individuals:
    row = df.loc[i.id]
    v1 = row['var1']
    v2 = row['var2']
    print(f"{i.id}: v1={v1} and v2={v2}")
    # variant_d is shared by all individuals. We work on copies so that setting the
    # genotype here cannot change a Variant object already added to another individual.
    var1 = copy.deepcopy(variant_d.get(v1))
    if v1 == v2:
        var1.set_homozygous()
        i.add_variant(var1)
    else:
        var2 = copy.deepcopy(variant_d.get(v2))
        var1.set_heterozygous()
        var2.set_heterozygous()
        i.add_variant(var1)
        i.add_variant(var2)

<h1>Validation</h1>
<p>pyphetools offers a quick validation that checks that each phenopacket contains a minimum number of variants and HPO terms.
We recommend additional validation with <a href="https://github.com/phenopackets/phenopacket-tools">phenopacket-tools</a>.</p>

In [None]:
cvalidator = ContentValidator(min_var=1, min_hpo=1)
errors = cvalidator.validate_phenopacket_list([i.to_ga4gh_phenopacket(metadata.to_ga4gh()) for i in individuals])
print(f"We found {len(errors)} validation errors")
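
<p>To make the idea of a minimum-content check concrete, here is a sketch that works on a phenopacket converted to a plain dictionary (for instance with <code>MessageToDict</code>, imported in the first cell). It is not the implementation of <code>ContentValidator</code>; the function names and the example record are made up for illustration, and the field names assume the JSON form of the GA4GH Phenopacket Schema version 2. Note that this simple count also includes excluded phenotypic features.</p>

```python
def count_content(ppkt):
    """Count HPO terms and variant interpretations in a phenopacket given as a dict."""
    n_hpo = len(ppkt.get("phenotypicFeatures", []))
    n_var = sum(
        len(interp.get("diagnosis", {}).get("genomicInterpretations", []))
        for interp in ppkt.get("interpretations", [])
    )
    return n_hpo, n_var

def check_minimum_content(ppkt, min_hpo=1, min_var=1):
    """Return a list of human-readable problems (empty if the minimums are met)."""
    n_hpo, n_var = count_content(ppkt)
    problems = []
    if n_hpo < min_hpo:
        problems.append(f"{ppkt.get('id')}: {n_hpo} HPO terms, expected at least {min_hpo}")
    if n_var < min_var:
        problems.append(f"{ppkt.get('id')}: {n_var} variants, expected at least {min_var}")
    return problems

example = {
    "id": "example-1",  # made-up identifier
    "phenotypicFeatures": [{"type": {"id": "HP:0001252", "label": "Hypotonia"}}],
    "interpretations": [],  # no variants recorded
}
print(check_minimum_content(example))
```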

<h2>Visualization</h2>

In [None]:
from IPython.display import HTML, display
phenopackets = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individuals]
table = PhenopacketTable(phenopacket_list=phenopackets)
display(HTML(table.to_html()))