# FBXL4

Variants in FBXL4 are associated with [Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)](https://omim.org/entry/615471).
This notebook collects clinical data from several publications.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
from IPython.display import display, HTML
import pyphetools
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import CohortValidator
print(f"Using pyphetools version {pyphetools.__version__}")

Using pyphetools version 0.9.11


In [2]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
metadata = MetaData(created_by="ORCID:0000-0002-0736-9199")
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


In [3]:
df = pd.read_excel('input/FBXL4-data.xlsx')

In [4]:
df.head(2)

Unnamed: 0,ID,omim_id,omim_title,omim_name,hgnc_id,gene_symbol,Zigosity,Location,Variant annotation,Consequence,...,Gender,Age at death,Phenotype,Prenatal ultrasound phenotype,MRI phenotype,Cardiac phenotype,Family history,Source,PMID,title
0,34738379_P1,615471,MITOCHONDRIAL DNA DEPLETION SYNDROME 13 (ENCEPHALOMYOPATHIC TYPE),Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type),HGNC:13601,FBXL4,Compound heterozygous,NM_001278716.1/NM_001278716.1,c.772G>T(p.Gly258*)/c.1061G>C(p.Trp354Ser),nonsense/missense,...,,,"abnormal eyebrow morphology HP:0000534 , triangular face HP:0000325, abnormal shape of the palpebral fissure HP:0200005, mandibular prognathia HP:0000303, increased serum lactate HP:0002151 HP:0002151, increased serum pyruvate HP:0003542",,,,"Primigravid, no family history",https://doi.org/10.24953/turkjped.2021.05.025,PMID:34738379,Entero-encephalopathy due to FBXL4-related mtDNA depletion syndrome
1,32559514_P1,615471,MITOCHONDRIAL DNA DEPLETION SYNDROME 13 (ENCEPHALOMYOPATHIC TYPE),Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type),HGNC:13601,FBXL4,homozygous,NM_001278716.1,c.993dupA (p.L332Tfs*3),,...,,,"global developmental delay HP:0001263, hypotonia HP:0001252, abnormal eating behavior HP:0100738, hypertelorism HP:0000316, short palpebral fissure HP:0012745, internuclear ophthalmoplegia HP:0030773, wide nasal bridge HP:0000431",,abnormality of brain morphology HP:0012443,,no family history,https://doi.org/10.1016/j.jns.2020.116948,PMID:32559514,Novel homozygous mutation in the FBXL4 gene is associated with mitochondria DNA depletion syndrome-13


In [5]:
df['patient_id'] = df['ID']
df.set_index('patient_id', inplace=True)

In [6]:
df.columns

Index(['ID', 'omim_id', 'omim_title', 'omim_name', 'hgnc_id', 'gene_symbol',
       'Zigosity', 'Location', 'Variant annotation', 'Consequence', 'Refseq',
       'Protein ID', 'Genomic location', 'ACGM classification',
       'Protein structure', 'Age of diagnosis', 'Gender', 'Age at death',
       'Phenotype', 'Prenatal ultrasound phenotype', 'MRI phenotype',
       'Cardiac phenotype', 'Family history', 'Source', 'PMID', 'title'],
      dtype='object')

<h1>Variants</h1>
<p>Variant data are stored as strings such as c.772G>T(p.Gly258*)/c.1061G>C(p.Trp354Ser) for compound heterozygous variants, or c.772G>T(p.Gly258*) for homozygous variants.
We need to extract the HGVS cDNA string from each entry to pass to VariantValidator.</p>

In [7]:
def extract_cdna(variant):
    """
    Split strings like c.772G>T(p.Gly258*) on the opening parenthesis and
    return the first part (the cDNA change) with all spaces removed.
    """
    v = variant.split("(")[0]
    v = v.replace(" ", "").replace("p.","")
    return v

def extract_variant_1(variants):
    """
    Split on the slash ("/") and return the first part (or entire string for homozygous)
    """
    v1 = variants.split("/")[0]
    return extract_cdna(v1)

def extract_variant_2(variants):
    """
    Split on the slash ("/") and return the second part (or entire string for homozygous)
    """
    fields = variants.split("/")
    if len(fields) == 2:
        return extract_cdna(fields[1])
    else:
        # there was only one variant
        return extract_cdna(variants)
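
As a quick check of the splitting logic, the sketch below applies the same idea to the two annotation strings shown in the table preview. `cdna_pair` is a hypothetical helper written for this illustration; it is not part of pyphetools and does not replace the functions above.

```python
def cdna_pair(annotation: str) -> tuple:
    """Return the two cDNA HGVS strings found in a 'Variant annotation' cell.

    A cell holding a single variant (homozygous) yields the same string twice.
    """
    # Keep the text before the opening parenthesis and drop any spaces.
    parts = [p.split("(")[0].replace(" ", "") for p in annotation.split("/")]
    if len(parts) == 1:
        parts = parts * 2
    return parts[0], parts[1]


print(cdna_pair("c.772G>T(p.Gly258*)/c.1061G>C(p.Trp354Ser)"))
# ('c.772G>T', 'c.1061G>C')
print(cdna_pair("c.993dupA (p.L332Tfs*3)"))
# ('c.993dupA', 'c.993dupA')
```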

In [8]:
df["var1"] = df['Variant annotation'].apply(extract_variant_1)
df["var2"] = df['Variant annotation'].apply(extract_variant_2)

In [9]:
from time import sleep
var1_list = df["var1"].unique()
var2_list = df["var2"].unique()
var_set = set()
var_set.update(var1_list)
var_set.update(var2_list)
variant_d = {}
hg38 = "hg38"
FBXL4_transcript = "NM_001278716.2"
vvalidator = VariantValidator(genome_build=hg38, transcript=FBXL4_transcript)
for v in var_set:
    print(f"{v}")
    var = vvalidator.encode_hgvs(v)
    variant_d[v] = var
    sleep(1)
print(f"extracted {len(variant_d)} variants with VariantValidator")

c.1790A>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.2%3Ac.1790A>C/NM_001278716.2?content-type=application%2Fjson
c.858+5G>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.2%3Ac.858+5G>C/NM_001278716.2?content-type=application%2Fjson
c.64C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.2%3Ac.64C>T/NM_001278716.2?content-type=application%2Fjson
c.1442T>C
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.2%3Ac.1442T>C/NM_001278716.2?content-type=application%2Fjson
c.1389+3_1389+6delAAGT
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.2%3Ac.1389+3_1389+6delAAGT/NM_001278716.2?content-type=application%2Fjson
c.1303C>T
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_001278716.2%3Ac.1303C>T/NM_001278716.2?content-type=application%2Fjson


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
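
The JSONDecodeError above stops the loop at the first response that cannot be parsed as JSON, so variant_d stays incomplete. A minimal sketch of a more tolerant loop follows. `encode_all` is a hypothetical helper, not part of pyphetools, and it assumes that the encoder raises a ValueError subclass (json.JSONDecodeError is one) when the response body is not valid JSON.

```python
import time


def encode_all(variants, encode, delay=1.0):
    """Encode every variant, collecting failures instead of aborting.

    `encode` is any callable that takes an HGVS string and returns a
    variant object, for example vvalidator.encode_hgvs.
    """
    encoded, failed = {}, {}
    for v in sorted(variants):
        try:
            encoded[v] = encode(v)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            failed[v] = str(err)
        time.sleep(delay)  # pause between requests, as in the loop above
    return encoded, failed


# Possible use in this notebook (not executed here):
# variant_d, failed = encode_all(var_set, vvalidator.encode_hgvs)
# print(f"encoded {len(variant_d)} variants, {len(failed)} failed: {sorted(failed)}")
```

Reporting the failed strings makes it easier to see whether a particular variant or a transient server problem caused the error.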

<h2>Mapping the age column</h2>
<p>We define a dictionary that maps the free-text age strings in the table to the corresponding ISO 8601 duration strings. Anything not listed here will be mapped to "not provided".</p>

In [10]:
string_to_iso_dict = {
    "1.5": "P1Y6M",
    "3.5": "P3Y6M",
    'birth': "P1D",
    '51 days': "P1M21D",
    '19 months': "P1Y7M",
    '10 monhts': "P10M",
    '14 months': "P1Y2M",
    '23 months': "P1Y11M",
    '22 mohts': "P1Y10M",
    '9 months': "P9M",
    '12 years': "P12Y",
    '1 day of age': "P1D",
    '4.5 years': "P4Y6M",
    '4 months': "P4M",
    '1 month': "P1M"
}
ageMapper = AgeColumnMapper.custom_dictionary(column_name='Age of diagnosis', string_to_iso_d=string_to_iso_dict)
#ageMapper.preview_column(df['Age of diagnosis']).head()
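
The dictionary above is written by hand. For the regular entries ("19 months", "12 years"), the conversion can be sketched as below. `age_to_iso8601` is a hypothetical helper and not part of pyphetools; it keeps day counts as days (so "51 days" would become "P51D" rather than "P1M21D") and returns None for anything it cannot parse, such as decimal ages or misspelled units, which still need a manual entry.

```python
import re


def age_to_iso8601(text):
    """Convert strings like '19 months' to an ISO 8601 duration, else None."""
    m = re.fullmatch(r"\s*(\d+)\s*(day|month|year)s?\s*", str(text))
    if m is None:
        return None
    n, unit = int(m.group(1)), m.group(2)
    if unit == "day":
        return f"P{n}D"
    if unit == "year":
        return f"P{n}Y"
    # Months: express whole years separately, e.g. 19 months -> P1Y7M
    years, months = divmod(n, 12)
    if years == 0:
        return f"P{months}M"
    return f"P{years}Y{months}M" if months else f"P{years}Y"


for s in ["19 months", "9 months", "12 years", "10 monhts"]:
    print(s, "->", age_to_iso8601(s))
```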

In [11]:
sexMapper = SexColumnMapper(male_symbol="M", female_symbol="F", unknown_symbol="nan", column_name="Gender")
#sexMapper.preview_column(df['Gender']).head()

In [12]:
# todo, add vital status to encoder or encode as HPO terms
df['Age at death']

patient_id
34738379_P1                        NaN
32559514_P1                        NaN
27182039_P1                        NaN
31442532_P1                   7 months
31969900_P1                  10 months
31969900_P2                     4 days
32779419_P1            11 months alive
27743463_P1            19 months alive
27743463_P2            10 months alive
27743463_P3            14 months alive
27743463_P4                     5 days
27743463_P5            23 months alive
27743463_P6            22 months alive
27743463_P7                        NaN
27743463_P8                     3 days
27743463_P9                   9 monhts
27743463_P10                  12 years
26404457_P1            14 months alive
30804983_P1     2 years 8 months alive
34602956_P1     4 years 6 months alive
30771478_P1                    3 years
23993194_P1                        NaN
23993194_P2                    4 years
23993194_P3                        NaN
23993194_P4                        NaN
23993194_P5   

In [13]:
aod_d = {
    "7 months": "P7M",
    "10 months": "P10M",
    "4 days": "P4D",
    "5 days": "P5D",
    "3 days": "P3D",
    "9 monhts": "P9M",
    "12 years": "P12Y",
    "3 years": "P3Y",
    "4 years": "P4Y",
    "2 years": "P2Y",
    "2.5 years": "P2Y6M",
    "16 months": "P1Y4M",
}
aodMapper = AgeOfDeathColumnMapper(column_name='Age at death', string_to_iso_d=aod_d)
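
Several cells in the 'Age at death' column end in "alive" and so record the age at last follow-up rather than an age at death. One way to separate the two cases before encoding vital status is sketched below. `split_vital_status` is a hypothetical helper, and treating every entry without the "alive" suffix as deceased is an assumption based only on the column name.

```python
def split_vital_status(value):
    """Return (age_text, status) for one 'Age at death' cell.

    status is 'alive', 'deceased' (assumed when no 'alive' suffix is present),
    or 'unknown' for missing values.
    """
    if not isinstance(value, str) or not value.strip():
        return None, "unknown"  # NaN or empty cell
    text = value.strip()
    if text.endswith("alive"):
        return text[: -len("alive")].strip(), "alive"
    return text, "deceased"


print(split_vital_status("11 months alive"))
print(split_vital_status("4 days"))
```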

<h1>Mapping phenotypic features</h1>
<p>We use the OptionColumnMapper class, which uses text mining (via the HPO concept recognizer) to recognize one or more HPO terms in each table cell. The preview_column
function can be used to check the results, which appear very accurate here. Each mapper is stored in the mapper_d dictionary, which is used below to encode the entire cohort.</p>
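
Because the cells in this table already carry explicit HPO identifiers next to each label, the text-mining results can be cross-checked with a plain regular expression. The sketch below is an independent check and does not reflect how pyphetools recognizes terms; `hpo_ids` is a hypothetical helper.

```python
import re

HPO_ID = re.compile(r"HP:\d{7}")


def hpo_ids(cell):
    """Return the distinct HPO ids in a cell, in order of first appearance."""
    if not isinstance(cell, str):
        return []  # NaN or other non-text cell
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(HPO_ID.findall(cell)))


cell = "increased serum lactate HP:0002151 HP:0002151, increased serum pyruvate HP:0003542"
print(hpo_ids(cell))
# ['HP:0002151', 'HP:0003542']
```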

In [14]:
mapper_d = {}

In [15]:
phenotypeColumnMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
phenotypeColumnMapper.preview_column(df['Phenotype'])
mapper_d['Phenotype'] = phenotypeColumnMapper

In [16]:
prenatalUSmapper =  OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
prenatalUSmapper.preview_column(df['Prenatal ultrasound phenotype'])
mapper_d['Prenatal ultrasound phenotype'] = prenatalUSmapper

In [17]:
mriMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
mriMapper.preview_column(df['MRI phenotype'])
mapper_d['MRI phenotype'] = mriMapper

In [18]:
cardiacMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d={})
cardiacMapper.preview_column(df['Cardiac phenotype'])
mapper_d['Cardiac phenotype'] = cardiacMapper

<h1>Putting it all together</h1>

In [19]:
mdds13 = Disease(disease_id="OMIM:615471", disease_label="Mitochondrial DNA depletion syndrome 13 (encephalomyopathic type)")
disease_d = {"615471": mdds13}
diseaseMapper = DiseaseIdColumnMapper(column_name="omim_id", disease_id_map=disease_d)

In [25]:
encoder = MixedCohortEncoder(df=df,
                             hpo_cr=hpo_cr,
                             column_mapper_d=mapper_d,
                             individual_column_name="patient_id",
                             disease_id_mapper=diseaseMapper,
                             pmid_column="PMID",
                             title_column="title",
                             sexmapper=sexMapper,
                             agemapper=ageMapper,
                             age_of_death_mapper=aodMapper,
                             metadata=metadata)

TypeError: MixedCohortEncoder.__init__() missing 1 required positional argument: 'disease_id_mapper'

In [21]:
individuals = encoder.get_individuals()

In [22]:
# retrieve the variant strings and add Variant objects to each individual
# the individual id (i.id) is also the index of the pandas dataframe
for i in individuals:
    row = df.loc[i.id] 
    v1 = row['var1']
    v2 = row['var2']
    #print(f"{i.id}: v1={v1} and v2={v2}")
    if v1 == v2:
        var1 = variant_d.get(v1)
        var1.set_homozygous()
        i.add_variant(var1)
    else:
        var1 = variant_d.get(v1)
        var2 = variant_d.get(v2)
        var1.set_heterozygous()
        var2.set_heterozygous()
        i.add_variant(var1)
        i.add_variant(var2)

AttributeError: 'NoneType' object has no attribute 'set_heterozygous'
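
The AttributeError above arises when variant_d.get() returns None, that is, when a variant string from the table was never encoded. A minimal sketch for listing the affected individuals before attaching variants follows. `find_missing_variants` is a hypothetical helper that works on plain dictionaries and is not part of pyphetools.

```python
def find_missing_variants(var_pairs, variant_d):
    """Map each individual id to the variant strings absent from variant_d.

    var_pairs maps an individual id to its (var1, var2) tuple.
    Individuals whose variants were all encoded are omitted from the result.
    """
    missing = {}
    for individual_id, pair in var_pairs.items():
        absent = sorted({v for v in pair if variant_d.get(v) is None})
        if absent:
            missing[individual_id] = absent
    return missing


# Possible use in this notebook (not executed here):
# var_pairs = {idx: (row["var1"], row["var2"]) for idx, row in df.iterrows()}
# print(find_missing_variants(var_pairs, variant_d))
```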

<h1>Validation</h1>
<p>pyphetools offers a quick validation that checks whether each phenopacket contains a minimum number of variants and HPO terms.
We recommend additional validation with <a href="https://github.com/phenopackets/phenopacket-tools">phenopacket-tools</a>.</p>

In [23]:
#cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.BI_ALLELIC)
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1)
qc = QcVisualizer(cohort_validator=cvalidator)
display(HTML(qc.to_summary_html()))

Level,Error category,Count


<h2>Visualization</h2>

In [24]:
individuals = cvalidator.get_error_free_individual_list()
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
34738379_P1 (UNKNOWN; ),,,Abnormal eyebrow morphology (HP:0000534); Triangular face (HP:0000325); Abnormal shape of the palpebral fissure (HP:0200005); Mandibular prognathia (HP:0000303); Increased serum lactate (HP:0002151); Increased serum pyruvate (HP:0003542)
32559514_P1 (UNKNOWN; P1Y6M),,,Global developmental delay (HP:0001263); Hypotonia (HP:0001252); Abnormal eating behavior (HP:0100738); Hypertelorism (HP:0000316); Short palpebral fissure (HP:0012745); Internuclear ophthalmoplegia (HP:0030773); Wide nasal bridge (HP:0000431); Abnormal brain morphology (HP:0012443)
27182039_P1 (UNKNOWN; P3Y6M),,,Metabolic acidosis (HP:0001942); Wolff-Parkinson-White syndrome (HP:0001716); Increased serum lactate (HP:0002151); Global developmental delay (HP:0001263); Nystagmus (HP:0000639); Hypotonia (HP:0001252); Focal impaired awareness seizure (HP:0002384); Prominent metopic ridge (HP:0005487); Thick eyebrow (HP:0000574); Ptosis (HP:0000508); Short palpebral fissure (HP:0012745); Epicanthus (HP:0000286); Strabismus (HP:0000486); Wide nasal bridge (HP:0000431); Smooth philtrum (HP:0000319); Short chin (HP:0000331); Hernia (HP:0100790); Hepatomegaly (HP:0002240); Hypospadias (HP:0000047); Cryptorchidism (HP:0000028); Periventricular cysts (HP:0007109); CNS hypomyelination (HP:0003429); Global brain atrophy (HP:0002283)
31442532_P1 (FEMALE; ),,,Encephalopathy (HP:0001298); Hypertrophic cardiomyopathy (HP:0001639); Increased serum lactate (HP:0002151); Polymicrogyria (HP:0002126); Cryptorchidism (HP:0000028); Hydronephrosis (HP:0000126); Cerebellar hypoplasia (HP:0001321)
31969900_P1 (FEMALE; P1D),,,Global developmental delay (HP:0001263); Hypotonia (HP:0001252); Nystagmus (HP:0000639); Increased serum lactate (HP:0002151); Gastroesophageal reflux (HP:0002020); Epileptic spasm (HP:0011097); Dysphagia (HP:0002015); Encephalopathy (HP:0001298); Single umbilical artery (HP:0001195); Enlarged fetal cisterna magna (HP:0011427); Abnormal cortical gyration (HP:0002536); Cerebellar hypoplasia (HP:0001321)
31969900_P2 (MALE; ),,,Growth delay (HP:0001510); Cyanosis (HP:0000961); Hypoxemia (HP:0012418); Hypospadias (HP:0000047); Cryptorchidism (HP:0000028)
32779419_P1 (MALE; P1M21D),,,Micrognathia (HP:0000347); Low-set ears (HP:0000369); Abnormality of the forehead (HP:0000290); Long philtrum (HP:0000343); Global developmental delay (HP:0001263); Hypotonia (HP:0001252); Optic atrophy (HP:0000648); Increased serum lactate (HP:0002151); Bloody diarrhea (HP:0025085); Ventriculomegaly (HP:0002119); Thin corpus callosum (HP:0033725); CNS hypomyelination (HP:0003429)
27743463_P1 (MALE; P1Y7M),,,Decreased body weight (HP:0004325); Short stature (HP:0004322); Microcephaly (HP:0000252); Abnormal facial shape (HP:0001999); Global developmental delay (HP:0001263); Hypotonia (HP:0001252); Leukodystrophy (HP:0002415); Cerebral atrophy (HP:0002059); Atrial septal defect (HP:0001631); Left ventricular hypertrophy (HP:0001712)
27743463_P2 (FEMALE; P10M),,,Decreased body weight (HP:0004325); Abnormal facial shape (HP:0001999); Global developmental delay (HP:0001263); Hypotonia (HP:0001252); Leukodystrophy (HP:0002415); Pulmonary arterial hypertension (HP:0002092); Right ventricular hypertrophy (HP:0001667)
27743463_P3 (MALE; P1Y2M),,,Decreased body weight (HP:0004325); Short stature (HP:0004322); Microcephaly (HP:0000252); Abnormal facial shape (HP:0001999); Global developmental delay (HP:0001263); Hypotonia (HP:0001252); Nystagmus (HP:0000639); Cerebral atrophy (HP:0002059); Tetralogy of Fallot (HP:0001636)


In [23]:
MixedCohortEncoder.output_individuals_as_phenopackets(individual_list=individuals)

We output 30 GA4GH phenopackets to the directory phenopackets
