<H1>SLC45A2: Oculo-Cutaneous Albinism Type 4 (OCA4) - Moreno-Artero et al., 2022</H1>
<p>This notebook extracts clinical data from <a href="https://pubmed.ncbi.nlm.nih.gov/36553465/" target="_blank">
Moreno-Artero E, et al. (2022). Oculo-Cutaneous Albinism Type 4 (OCA4): Phenotype-Genotype Correlation. Genes (Basel). 2022 Nov 23;13(12):2198</a> (PMID:36553465).</p>

In [None]:
import phenopackets as php
from google.protobuf.json_format import MessageToDict, MessageToJson
from google.protobuf.json_format import Parse, ParseDict
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from collections import defaultdict
import os
import sys
import numpy as np
from IPython.display import display, HTML, JSON
from pyphetools.creation import *
from pyphetools.creation.simple_column_mapper import try_mapping_columns
from pyphetools.output import PhenopacketTable
# last tested with pyphetools 0.4.5

In [None]:
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155")
metadata.default_versions_with_hpo(version=hpo_version)

<h3>Ingest the data</h3>
<p>The clinical and variant data were copied from Table 1 of the publication. For ease of parsing, we manually split the "Gender, Age" column into two columns.</p>
<p>The authors classify patients 1-20 as group 1 and patients 21-30 as group 2, and they describe the following genotype-phenotype correlation. The first phenotype, found in 20 patients, is clinically indistinguishable from the classical OCA1 phenotype. It is associated with homozygous or compound heterozygous nonsense variants or frameshift deletions that interrupt translation of the SLC45A2 gene.</p>
<p>The second phenotype, found in 10 patients, is characterized by very mild hypopigmentation of the hair (light brown or even dark hair) and by skin pigmentation similar to that of the general population. In this group, visual acuity is variable but can be subnormal, foveal hypoplasia can be low grade or absent, and nystagmus may be lacking. These mild to moderate phenotypes are associated with at least one missense mutation in SLC45A2.</p>
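<p>The grouping by patient number can be written as a small helper. This is only a sketch: the function name <tt>cohort_group</tt> is hypothetical, and it assumes that the patient identifiers in Table 1 are the integers 1 to 30.</p>

```python
def cohort_group(patient_number: int) -> int:
    """Return the phenotype group reported by the authors for a patient number.

    Assumption: patients are numbered 1-30 as in Table 1
    (1-20 = group 1, 21-30 = group 2).
    """
    if 1 <= patient_number <= 20:
        return 1
    if 21 <= patient_number <= 30:
        return 2
    raise ValueError(f"Patient number {patient_number} is outside the range 1-30")


# Example: tabulate the group for the first and last patient of each group
for number in (1, 20, 21, 30):
    print(number, cohort_group(number))
```

<p>Such a label is not part of the phenopackets created below; it could only serve as a sanity check when comparing the encoded phenotypes between the two groups.</p>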

In [None]:
df = pd.read_excel('input/Moreno-Artero2022_table1.xlsx')

In [None]:
df.head()

In [None]:

column_mapper_d = defaultdict(ColumnMapper)
nystagmusMapper = SimpleColumnMapper(hpo_id="HP:0000639", hpo_label="Nystagmus",observed='Yes',excluded='No')
nystagmusMapper.preview_column(df["Nystagmus"])
column_mapper_d["Nystagmus"] = nystagmusMapper
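
<p>Conceptually, a simple Yes/No column mapper turns each cell into "observed", "excluded", or "not recorded". The following stand-alone sketch illustrates that idea in plain Python; it is not the pyphetools implementation, and the function name <tt>map_yes_no</tt> is hypothetical.</p>

```python
from typing import Optional


def map_yes_no(cell, observed: str = "Yes", excluded: str = "No") -> Optional[bool]:
    """Interpret a table cell as observed (True), excluded (False) or unknown (None).

    Hypothetical helper for illustration only; pyphetools may treat
    whitespace, case, and missing values differently.
    """
    if cell is None:
        return None
    text = str(cell).strip()
    if text == observed:
        return True
    if text == excluded:
        return False
    return None  # anything else is treated as "not recorded"


# Example cells as they might appear in a Nystagmus column
for cell in ("Yes", "No", "", None):
    print(repr(cell), "->", map_yes_no(cell))
```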

In [None]:
# This was used to conveniently generate OptionColumnMapper code, but is no longer needed.
#result = OptionColumnMapper.autoformat(df, hpo_cr)
#print(result)

In [None]:
nevi_d = {'Present': 'Nevus',
 'amelanotic': 'Nevus',  ## TODO needs new HPO term
 'pigmented': 'Melanocytic nevus',
}
neviMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=nevi_d, 
                                negative_label="Nevus", negative_symbol="Absent")
#neviMapper.preview_column(df['Nevi'])
column_mapper_d['Nevi'] = neviMapper

In [None]:
eyes_d = {'Blue': 'Iris hypopigmentation',
 'Blue grey': 'Iris hypopigmentation',}
eyesMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=eyes_d,
                                negative_label="Iris hypopigmentation",
                                negative_symbol="Brown")
#eyesMapper.preview_column(df['Eyes'])
column_mapper_d['Eyes'] = eyesMapper

In [None]:
hair_d = {'White': 'Hypopigmentation of hair',
 'White blond': 'Hypopigmentation of hair',
 'Blond': 'Hypopigmentation of hair',
 'Dark blond': 'Hypopigmentation of hair',
 'Red blond': 'Hypopigmentation of hair'}
hairMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=hair_d)
hairMapper.preview_column(df['Hair'])
column_mapper_d['Hair'] = hairMapper

In [None]:
eyebrows_d = {'White': 'White eyebrow',
 'Blond': 'White eyebrow',
 'White + Blond': 'White eyebrow'}
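# negative_label is the HPO label to exclude; negative_symbol is the cell text that signals exclusion
eyebrowsMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=eyebrows_d,
                                    negative_label="White eyebrow", negative_symbol="Brown")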
eyebrowsMapper.preview_column(df['Eyebrows'])
column_mapper_d['Eyebrows'] = eyebrowsMapper

In [None]:
eyelashes_d = {'White': 'White eyelashes',
 'Blond': 'White eyelashes',
 'White + Blond': 'White eyelashes'}
eyelashesMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=eyelashes_d)
eyelashesMapper.preview_column(df['Eyelashes'])
column_mapper_d['Eyelashes'] = eyelashesMapper


In [None]:
# 'No' cells are handled by negative_symbol below, so no placeholder entry is needed here
strabismus_d = {'Yes': 'Strabismus',
 'esotropia': 'Esotropia',
 'left exotropia': 'Exotropia',
 'exotropia': 'Exotropia',
 'Yes microexotropia': 'Exotropia'}
strabismusMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=strabismus_d,
                                     negative_label='Strabismus', negative_symbol="No")
strabismusMapper.preview_column(df['Strabismus'])
column_mapper_d['Strabismus'] = strabismusMapper

In [None]:
va_d = {'1.6/10 RE; 2/10 LE': 'Reduced visual acuity',
 '1/20 RE; 1/20 LE': 'Reduced visual acuity',
 '2/10 RE; 2/10 LE': 'Reduced visual acuity',
 '2/10 RE; 3/10 LE': 'Reduced visual acuity',
 '1/10 RE; 1/10 LE': 'Reduced visual acuity',
 '3/10 RE; 3/10 LE': 'Reduced visual acuity',
 '2/10 RE; 2': 'Reduced visual acuity',
 '5/10 LE': 'Reduced visual acuity',
 '1.6/10 RE; 1.6/10 LE': 'Reduced visual acuity',
 '9/10 RE; 7/10 LE': 'Reduced visual acuity',
 '1.2/10 RE; 1.4/10 LE': 'Reduced visual acuity',
 '5/10 RE; 5/10 LE': 'Reduced visual acuity'}
vaMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=va_d)
vaMapper.preview_column(df['VA'])
#column_mapper_d['VA'] = vaMapper
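
<p>The visual acuity cells have the form <tt>2/10 RE; 3/10 LE</tt>. If the numeric values are needed, they could be parsed as in the sketch below. The function name <tt>parse_visual_acuity</tt> is hypothetical, and the sketch assumes that RE and LE denote the right and left eye. Incomplete entries such as <tt>2/10 RE; 2</tt> yield only the part that can be parsed.</p>

```python
import re
from typing import Dict

# numerator / denominator followed by the eye code (assumed: RE = right eye, LE = left eye)
_VA_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*(RE|LE)")


def parse_visual_acuity(cell: str) -> Dict[str, float]:
    """Return a dict such as {'RE': 0.2, 'LE': 0.3} for a cell like '2/10 RE; 3/10 LE'."""
    return {eye: float(num) / float(den) for num, den, eye in _VA_PATTERN.findall(cell)}


# Example cells taken from the dictionary above
for cell in ("2/10 RE; 3/10 LE", "5/10 LE", "2/10 RE; 2"):
    print(repr(cell), "->", parse_visual_acuity(cell))
```

<p>No threshold for "reduced" acuity is applied here; the notebook does not state one.</p>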

In [None]:
refraction_d = {'Hypermetropia astigmatism': 'Hypermetropia',
 'Hypermetropia Astigmatism': 'Astigmatism',
 'Hypermetropia\nAstigmatism': 'Astigmatism',
 'Hypermetropia': 'Hypermetropia',
 'HypermetropiaAstigmatism': 'Hypermetropia',
 'Myopia Astigmatism': 'Myopia'}
refractionMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=refraction_d)
refractionMapper.preview_column(df['Refraction'])
column_mapper_d['Refraction'] = refractionMapper
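
<p>Several refraction cells name two findings at once (for example <tt>Myopia Astigmatism</tt>), whereas the dictionary above assigns one label per cell. The sketch below shows one way to list every label contained in a cell. The helper <tt>refraction_terms</tt> is hypothetical, and whether <tt>OptionColumnMapper</tt> can accept more than one label per option is not assumed here.</p>

```python
from typing import List

# Labels that occur as values in refraction_d above
REFRACTION_LABELS = ("Hypermetropia", "Myopia", "Astigmatism")


def refraction_terms(cell: str) -> List[str]:
    """Return every refraction label mentioned in a cell, ignoring case and separators."""
    lowered = cell.lower()
    return [label for label in REFRACTION_LABELS if label.lower() in lowered]


# Example cells taken from the keys of refraction_d
for cell in ("Hypermetropia astigmatism", "HypermetropiaAstigmatism", "Myopia Astigmatism"):
    print(repr(cell), "->", refraction_terms(cell))
```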

In [None]:
iti_d = {
 'Grade IV': 'Iris transillumination defect',
 'Grade III': 'Iris transillumination defect',
 'Grade II': 'Iris transillumination defect',
 'Grade I': 'Iris transillumination defect'}
itiMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=iti_d,
                              negative_symbol="No",
                              negative_label="Iris transillumination defect")
itiMapper.preview_column(df['ITI'])
column_mapper_d['ITI'] = itiMapper

In [None]:
mt_d = {'nan': 'PLACEHOLDER',
 'Grade II': 'PLACEHOLDER',
 'Grade III': 'PLACEHOLDER',
 'Grade I': 'PLACEHOLDER'}
#mtMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=mt_d)
#mtMapper.preview_column(df['MT'])
#column_mapper_d['MT'] = mtMapper
# Macular transparency -- need HPO term

In [None]:
fhp_d = {'Grade IV': 'Hypoplasia of the fovea',
 'Grade III': 'Hypoplasia of the fovea',
 'Grade II': 'Hypoplasia of the fovea',
 'Grade I': 'Hypoplasia of the fovea'}
fhpMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=fhp_d)
fhpMapper.preview_column(df['FHP'])
column_mapper_d['FHP'] = fhpMapper

<h2>Variants</h2>
<p>The original table describes variants like this: <tt>NM_016180.5(SLC45A2):c.267_271del\nChr5(GRCh37):g.33984422_33984426del\np.(Ser90Glnfs*42)</tt>.
    The following code extracts the transcript variant - c.267_271del in this example.</p>

In [None]:
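def extract_var(cell_contents):
    """Return the transcript-level (c.) HGVS expression from a Table 1 variant cell."""
    prefix = "NM_016180.5(SLC45A2):"
    if not cell_contents.startswith(prefix):
        # e.g. 'Deletion exons 1-4' is passed through unchanged
        return cell_contents
    # drop the transcript prefix and keep only the first line (the c. expression)
    return cell_contents[len(prefix):].split('\n')[0]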
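
<p>The same extraction can be expressed with a regular expression, which avoids relying on the length of the prefix. This is an alternative sketch, not the code used in this notebook; the name <tt>extract_cdna</tt> is hypothetical.</p>

```python
import re

# transcript accession, gene symbol in parentheses, then the c. expression up to the first whitespace
_HGVS_C_PATTERN = re.compile(r"NM_\d+\.\d+\([A-Za-z0-9]+\):(c\.\S+)")


def extract_cdna(cell: str) -> str:
    """Return the c. expression if the cell starts with 'NM_...(GENE):', else the cell unchanged."""
    match = _HGVS_C_PATTERN.match(cell)
    return match.group(1) if match else cell


example = ("NM_016180.5(SLC45A2):c.267_271del\n"
           "Chr5(GRCh37):g.33984422_33984426del\n"
           "p.(Ser90Glnfs*42)")
print(extract_cdna(example))
print(extract_cdna("Deletion exons 1-4"))
```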
    

In [None]:
df["var1"] = df["Variant 1 (SLC45A2 NM_016180.5)"].apply(extract_var)

In [None]:
df["var2"] = df["Variant 2 (SLC45A2 NM_016180.5)"].apply(extract_var)

In [None]:
all_variant_set = set(df["var1"]).union(df["var2"])
hgvs_to_variant_d = defaultdict(Variant)

In [None]:
validator = VariantValidator(genome_build='hg38', transcript="NM_016180.5")
validated_var_d = {}  # maps each variant string to its validated variant object

In [None]:
for var in all_variant_set:
    if var == 'Deletion exons 1-4':
        sv = StructuralVariant.chromosomal_deletion(cell_contents='Deletion exons 1-4',
                 gene_symbol="SLC45A2",
                 gene_id="HGNC:16472",
                 genotype="heterozygous")
        validated_var_d[var] = sv
    else:
        var_object = validator.encode_hgvs(hgvs=var)
        validated_var_d[var] = var_object
print(f"We got {len(validated_var_d)} variant objects")

In [None]:
ageMapper = AgeColumnMapper.by_year('Age (Years)')
#ageMapper.preview_column(df['Age (Years)'])
sexMapper = SexColumnMapper(male_symbol='M', female_symbol='F', column_name='Gender')
#sexMapper.preview_column(df['Gender'])

In [None]:
pmid = "PMID:36553465"
encoder = CohortEncoder(df=df, hpo_cr=hpo_cr, column_mapper_d=column_mapper_d, 
                        individual_column_name="Patients", agemapper=ageMapper, sexmapper=sexMapper,
                       metadata=metadata,
                       pmid=pmid)
encoder.set_disease(disease_id='OMIM:606574', label='Albinism, oculocutaneous, type IV')

In [None]:
individuals = encoder.get_individuals()

In [None]:
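import copy

for i in individuals:
    rows = df.loc[df['Patients'] == i.id]
    if len(rows) != 1:
        raise ValueError(f"Got {len(rows)} rows but expected only 1")
    var1 = rows.iloc[0]['var1']
    var2 = rows.iloc[0]['var2']
    # Work on copies: the objects in validated_var_d are shared between individuals,
    # so setting the genotype in place would also change it for other individuals.
    if var1 == var2:
        # homozygous
        var_object = copy.deepcopy(validated_var_d.get(var1))
        var_object.set_homozygous()
        i.add_variant(var_object)
    else:
        # compound heterozygous
        var1_object = copy.deepcopy(validated_var_d.get(var1))
        var2_object = copy.deepcopy(validated_var_d.get(var2))
        var1_object.set_heterozygous()
        var2_object.set_heterozygous()
        i.add_variant(var1_object)
        i.add_variant(var2_object)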
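
<p>The zygosity rule used in the loop above (identical entries in both variant columns mean homozygous, different entries mean compound heterozygous) can be isolated as a pure function. This is a sketch with a hypothetical name, <tt>genotype_calls</tt>; it works on plain strings and does not touch pyphetools objects.</p>

```python
from typing import List, Tuple


def genotype_calls(var1: str, var2: str) -> List[Tuple[str, str]]:
    """Return (variant, genotype) pairs for the two variant columns of one patient."""
    if var1 == var2:
        # the same variant on both alleles
        return [(var1, "homozygous")]
    # two different variants, one on each allele
    return [(var1, "heterozygous"), (var2, "heterozygous")]


# Example with the variant from the text above and the exon deletion
print(genotype_calls("c.267_271del", "c.267_271del"))
print(genotype_calls("c.267_271del", "Deletion exons 1-4"))
```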

In [None]:
i1 = individuals[-1]
phenopacket1 = i1.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh())
json_string = MessageToJson(phenopacket1)
print(json_string)

In [None]:
ppacket_list = [i.to_ga4gh_phenopacket(metadata=metadata.to_ga4gh()) for i in individuals]
table = PhenopacketTable(phenopacket_list=ppacket_list)
display(HTML(table.to_html()))