<h1>COL3A1: Vandervore (2017)</h1>
<p>We will process <a href="https://pubmed.ncbi.nlm.nih.gov/28258187/" target="__blank">Vandervore, et al. (2017) Bi-allelic variants in COL3A1 encoding the ligand to GPR56 are associated with cobblestone-like cortical malformation, white matter changes and cerebellar cysts</a>.</p>

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
from IPython.display import HTML, display
from pyphetools.creation import *
from pyphetools.visualization import *
from pyphetools.validation import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.9.1


<h2>Importing HPO data</h2>
<p>pyphetools uses the Human Phenotype Ontology (HPO) to encode phenotypic features. The recommended way of doing this is to ingest the hp.json file using HpoParser, which in turn creates an HpoConceptRecognizer object. </p>
<p>The HpoParser can accept an hpo_json_file argument if you want to use a specific file. If the argument is not passed, it downloads the latest hp.json file from the HPO GitHub site and stores it in a new subdirectory called hpo_data. The file is not downloaded again if it is already present.</p>

In [2]:
PMID = "PMID:28258187"
title = "Bi-allelic variants in COL3A1 encoding the ligand to GPR56 are associated with cobblestone-like cortical malformation, white matter changes and cerebellar cysts"
cite = Citation(pmid=PMID, title=title)
parser = HpoParser()
hpo_cr = parser.get_hpo_concept_recognizer()
hpo_version = parser.get_version()
hpo_ontology = parser.get_ontology()
metadata = MetaData(created_by="ORCID:0000-0002-5648-2155", citation=cite)
metadata.default_versions_with_hpo(version=hpo_version)
print(f"HPO version {hpo_version}")

HPO version 2023-10-09


<h2>Importing the supplemental table</h2>
<p>Here, we use the pandas library to import this file (note that the Python package called openpyxl must be installed to read Excel files with pandas, although the library does not need to be imported in this notebook). pyphetools expects a pandas DataFrame as input. Users can choose any input format that pandas supports, including CSV, TSV, and Excel, or can use any other method to transform their input data into a pandas DataFrame before using pyphetools.</p>

In [3]:
df = pd.read_excel('input/PMID_28258187.xlsx')

In [4]:
df.head()

Unnamed: 0,Clinical features,Patient 1 (11.3 this manuscript),Patient 2 (11.4 this manuscript),"Patient 3 (Plancke et al, 2009)","Patient 4 (Jergensen et al, 2014)"
0,patient_id,Patient 1,Patient 2,Patient 3,Patient 4
1,Sex,female,male,female,female
2,Age at examination(years),7,3.5,10,19
3,Mutation in COL3A1,c.145C>G,c.145C>G,c.479dupT,c.1786C>T
4,Second mutation in COL3A1,c.145C>G,c.145C>G,c.479dupT,c.3851G>A


<h1>Converting to row-based format</h1>
<p>To use pyphetools, we need to have the individuals represented as rows (one row per individual) and have the items of interest be encoded as column names. The required transformations for doing this may be different for different input data, but often we will want to transpose the table (using the pandas <tt>transpose</tt> function) and set the column names of the new table to the zero-th row. After this, we drop the zero-th row (otherwise, it will be interpreted as an individual by the pyphetools code).</p>
<p>After this step is completed, the remaining steps to create phenopackets are the same as in the 
    <a href="http://localhost:8888/notebooks/notebooks/Create%20phenopackets%20from%20tabular%20data%20with%20individuals%20in%20rows.ipynb" target="__blank">row-based notebook</a>.</p>
    
Furthermore, for this specific case, there is a Count features row that we want dropped, so we filter out any row that does not have Patient in the first column.
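
The transposition described above can be tried on a toy table first. This is a minimal sketch that reuses a few values from the <tt>df.head()</tt> output; the variable names <tt>toy</tt> and <tt>toy_t</tt> are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy table in the same orientation as the supplemental file:
# one row per feature, one column per patient.
toy = pd.DataFrame({
    "Clinical features": ["patient_id", "Sex", "Age at examination(years)"],
    "Patient 1 (11.3 this manuscript)": ["Patient 1", "female", 7],
    "Patient 2 (11.4 this manuscript)": ["Patient 2", "male", 3.5],
})

toy_t = toy.transpose()              # patients become rows
toy_t.columns = toy_t.iloc[0]        # the zero-th row holds the feature names
toy_t = toy_t.drop(toy_t.index[0])   # drop it so it is not treated as a patient
toy_t.columns = toy_t.columns.str.strip()
toy_t = toy_t.set_index("patient_id")
print(toy_t)
```

After these steps each patient occupies one row, indexed by the short identifier, and each clinical feature is a column.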

In [5]:
dft = df.transpose()
dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)
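# remove stray whitespace from the feature names and drop columns that are entirely empty
dft.columns = dft.columns.str.strip()
dft = dft.dropna(axis=1, how='all')
# index by the short id (e.g., "Patient 1") instead of headers such as "Patient 1 (11.3 this manuscript)"
dft.set_index("patient_id", inplace=True)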
dft.head(2)

patient_id,Sex,Age at examination(years),Mutation in COL3A1,Second mutation in COL3A1,Major features,Minor features,Additional features,Congenital anomalies,Neurological examination,Head circumference,...,Basal ganglia,Corpus callosum,Hippocampus,Cortex,White matter_5,Vermis,Post fossa,Pituitary,Arachnoid cysts,Vessels
Patient 1,female,7.0,c.145C>G,c.145C>G,-,-,-,-,"Global developmental delay, walks without support, uses a few words",90th centile,...,"Thalamus normal putamen/globus pallidus small, accentuated Virchow- Robin spaces","Present, elongated and mildly thickened",Normal,"Dysplastic cerebellar cortex, multiple cortical cysts superior>inferior","No hypoplasia, multifocal lesions in cerebellar white matter",Vermis hypoplasia cysts,Mega cisterna magna,Normal,,Intracranial segment of the A carotis interna is normal
Patient 2,male,3.5,c.145C>G,c.145C>G,-,-,-,-,"Global developmental delay. sits independently, no words",>97th centile,...,"Normal volume and signal intensity, accentuated Virchow-Robin spaces",Present elongated and mildly thickened,Normal,"Dysplastic cerebellar cortex, multiple cortical cysts superior>inferior",No hypoplasia multifocal lesions in cerebellar white matter,Vermis hypoplasia cysts,Mega cisterna magna,Normal,Bilateral temporal pole arachnoidal cysts,Intracranial segment of the A carotis interna is normal


<h2>Column mappers</h2>
<p>Please see the notebook "Create phenopackets from tabular data with individuals in rows" for explanations. In the following cell we create a dictionary for the ColumnMappers. Note that the code is identical except that we use <tt>dft.loc</tt> to retrieve the row data for each individual.</p>
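
<p>To make the role of the option dictionaries concrete, here is a conceptual sketch of how a free-text cell could be matched against such a dictionary. It is not the pyphetools implementation: the helper <tt>map_cell_to_hpo_labels</tt> is hypothetical, and plain substring matching is an assumption made for illustration. The dictionary and the cell text are taken from this notebook.</p>

```python
def map_cell_to_hpo_labels(cell, option_d):
    """Return the HPO labels whose free-text key occurs in the cell (hypothetical helper)."""
    if not isinstance(cell, str):
        return []  # empty cells (e.g., NaN) yield no terms
    labels = []
    for text, hpo_label in option_d.items():
        # assumption: a key matches when it occurs verbatim inside the cell
        if text in cell and hpo_label not in labels:
            labels.append(hpo_label)
    return labels

neurological_examination = {
    'Global developmental delay': 'Global developmental delay',
    'no words': 'Absent speech',
    'Delayed motor milestones': 'Motor delay',
}
cell = 'Global developmental delay. sits independently, no words'
matched_labels = map_cell_to_hpo_labels(cell, neurological_examination)
print(matched_labels)  # ['Global developmental delay', 'Absent speech']
```

<p>Under this assumption the keys must reproduce the table text exactly, including any misspellings in the source, which is why several dictionaries below carry near-duplicate keys.</p>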

In [6]:
column_mapper_d = {}

Let's try to get the code autoformatted so that we can easily copy, paste, and adapt it.

In [7]:
# This was used to help generate the following code
#output = OptionColumnMapper.autoformat(df=dft, concept_recognizer=hpo_cr)
major_features = {'Easy bruising thin translucent skin': 'Bruising susceptibility',
 'arterial tissue fragility': 'Abnormal arterial physiology',
 'Easy bruising': 'Bruising susceptibility',
 'thin translucent skin': 'Dermal translucency',
 'arterial dissections': 'Arterial dissection',}
major_featuresMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=major_features)
#print(major_featuresMapper.preview_column(dft['Major features']))
column_mapper_d['Major features'] = major_featuresMapper

cortex = {'Dysplastic cerebellar cortex': 'Abnormal cerebellar cortex morphology'}
cortexMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=cortex)
#print(cortexMapper.preview_column(dft['Cortex']))
column_mapper_d['Cortex'] = cortexMapper

minor_features = {'Early-onset varicose veins': 'Varicose veins',
 'small joint hypermobility': 'Joint hypermobility',
 'tendon rupture': 'Tendon rupture'}
minor_featuresMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=minor_features)
#print(minor_featuresMapper.preview_column(dft['Minor features']))
column_mapper_d['Minor features'] = minor_featuresMapper

additional_features = {'Pulmonary valve stenosis': 'Pulmonic stenosis',
 'pronounced atrophic scars': 'Atrophic scars',
 'multiple gingival recessions': 'Gingival recession',
 'slender fingers': 'Slender finger'}
additional_featuresMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=additional_features)
#print(additional_featuresMapper.preview_column(dft['Additional features']))
column_mapper_d['Additional features'] = additional_featuresMapper


congenital_anomalies = {'Talipes equinovarus': 'Talipes equinovarus'}
congenital_anomaliesMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=congenital_anomalies)
#print(congenital_anomaliesMapper.preview_column(dft['Congenital anomalies']))
column_mapper_d['Congenital anomalies'] = congenital_anomaliesMapper

neurological_examination = {'Global developmental delay': 'Global developmental delay',
 'no words': 'Absent speech',
 'Delayed motor milestones': 'Motor delay'}
neurological_examinationMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=neurological_examination)
#print(neurological_examinationMapper.preview_column(dft['Neurological examination']))
column_mapper_d['Neurological examination'] = neurological_examinationMapper

head_circumference = {'>97th centile': 'Macrocephaly'}
head_circumferenceMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=head_circumference)
#print(head_circumferenceMapper.preview_column(dft['Head circumference']))
column_mapper_d['Head circumference'] = head_circumferenceMapper

epilepsy_onset = {'Spasms/5 years': 'Seizure',
 'Spasms/26 months': 'Seizure',
 'Absence seizures/unknown': 'Typical absence seizure'}
epilepsy_onsetMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=epilepsy_onset)
#print(epilepsy_onsetMapper.preview_column(dft['Epilepsy/onset']))
column_mapper_d['Epilepsy/onset'] = epilepsy_onsetMapper


gyral_pattern = {'Diffuse thickened cobblestone cortex with relative sparing of the temporal lobes': 'Dysgyria with thickened cortex',
 'Diffuse thickened cobblestone cortex with relative sparing of the temporal poles': 'Dysgyria with thickened cortex',
 'Frontal cobblestone cortex': 'Dysgyria with thickened cortex',
 'parietal polymicrogyria': 'Polymicrogyria',
 'Bilateral frontal polymicrogyria including cingulate gyri': 'Polymicrogyria'}
gyral_patternMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=gyral_pattern)
#print(gyral_patternMapper.preview_column(dft['Gyral pattern']))
column_mapper_d['Gyral pattern'] = gyral_patternMapper


white_matter = {'Globale reduction of white matter': 'Hypointensity of cerebral white matter on MRI',
 'Diffuse hypomyelination': 'Cerebral hypomyelination'}
white_matterMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=white_matter)
#print(white_matterMapper.preview_column(dft['White matter']))
column_mapper_d['White matter'] = white_matterMapper

white_matter_2 = {'Prominent perivascular spaces': 'Dilation of Virchow-Robin spaces',
 'Prominent perivascular spaces bilateral frontal': 'Dilation of Virchow-Robin spaces'}
white_matter_2Mapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=white_matter_2)
#print(white_matter_2Mapper.preview_column(dft['White matter_2']))
column_mapper_d['White matter_2'] = white_matter_2Mapper

white_matter_3 = {'Frontal nodular heterotopia (beads)': 'Gray matter heterotopia',
 'perisylvian and occipital band heterotopia': 'Gray matter heterotopia'}
white_matter_3Mapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=white_matter_3)
#print(white_matter_3Mapper.preview_column(dft['White matter_3']))
column_mapper_d['White matter_3'] = white_matter_3Mapper


lateral_ventricles = {'Ventriculomegaly': 'Lateral ventricle dilatation',
 'Mild enlargement': 'Lateral ventricle dilatation'}
lateral_ventriclesMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=lateral_ventricles)
#print(lateral_ventriclesMapper.preview_column(dft['Lateral ventricles']))
column_mapper_d['Lateral ventricles'] = lateral_ventriclesMapper

third_ventricle = {'Ventriculomegaly': 'Dilated third ventricle',
 'Mild enlargement': 'Dilated third ventricle'}
third_ventricleMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=third_ventricle)
#print(third_ventricleMapper.preview_column(dft['Third ventricle']))
column_mapper_d['Third ventricle'] = third_ventricleMapper

brainstem = {'Hypoplastic': 'Abnormal brainstem morphology',
 'Mildly hypoplastic': 'Abnormal brainstem morphology',
 'Hypoplasia of the pons': 'Hypoplasia of the pons'}
brainstemMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=brainstem)
#print(brainstemMapper.preview_column(dft['Brainstem']))
column_mapper_d['Brainstem'] = brainstemMapper

basal_ganglia = {'Thalamus normal putamen/globus pallidus small': 'Abnormal globus pallidus morphology',
 'accentuated Virchow- Robin spaces': 'Dilation of Virchow-Robin spaces'}
basal_gangliaMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=basal_ganglia)
#print(basal_gangliaMapper.preview_column(dft['Basal ganglia']))
column_mapper_d['Basal ganglia'] = basal_gangliaMapper

corpus_callosum = {'elongated and mildly thickened': 'Abnormal length of corpus callosum',
 'Present elongated and mildly thickened': 'Abnormal length of corpus callosum',
 'elongated': 'Abnormal length of corpus callosum'}
corpus_callosumMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=corpus_callosum)
#print(corpus_callosumMapper.preview_column(dft['Corpus callosum']))
column_mapper_d['Corpus callosum'] = corpus_callosumMapper

vermis = {'Vermis hypoplasia cysts': 'Cerebellar vermis hypoplasia',
 'Mild atrophy': 'Cerebellar vermis hypoplasia'}
vermisMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=vermis)
#print(vermisMapper.preview_column(dft['Vermis']))
column_mapper_d['Vermis'] = vermisMapper

post_fossa = {'Mega cisterna magna': 'Enlarged cisterna magna',
 'Mega cistema magna': 'Enlarged cisterna magna'}
post_fossaMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=post_fossa)
#print(post_fossaMapper.preview_column(dft['Post fossa']))
column_mapper_d['Post fossa'] = post_fossaMapper

vessels = { 'Dilatation left A carotis interna': 'Carotid artery dilatation',
 'stenosis right A carotis interna': 'Carotid artery stenosis'}
vesselsMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=vessels)
#print(vesselsMapper.preview_column(dft['Vessels']))
column_mapper_d['Vessels'] = vesselsMapper

arachnoid_cysts = {'Bilateral temporal pole arachnoidal cysts': 'Arachnoid cyst'}
arachnoid_cystsMapper = OptionColumnMapper(concept_recognizer=hpo_cr, option_d=arachnoid_cysts)
#print(arachnoid_cystsMapper.preview_column(dft['Arachnoid cysts']))
column_mapper_d['Arachnoid cysts'] = arachnoid_cystsMapper

<h2>Variant Data</h2>
<p>The variant data (HGVS, transcript NM_000090.3) are listed in the <tt>Mutation in COL3A1</tt> and <tt>Second mutation in COL3A1</tt> columns.</p>

In [8]:
allele_1 = dft["Mutation in COL3A1"].unique()
allele_2 = dft["Second mutation in COL3A1"].unique()
alleles = set(allele_1)
alleles.update(allele_2)
genome = 'hg38'
default_genotype = 'homozygous'
COL3A1_transcript='NM_000090.3'
vvalidator = VariantValidator(genome_build=genome, transcript=COL3A1_transcript)
variant_d = {}
for v in alleles:
    var = vvalidator.encode_hgvs(v)
    variant_d[v] = var
print(f"We extracted {len(variant_d)} variants")

https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_000090.3%3A c.3851G>A/NM_000090.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_000090.3%3A c.479dupT/NM_000090.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_000090.3%3Ac.479dupT/NM_000090.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_000090.3%3Ac.1786C>T/NM_000090.3?content-type=application%2Fjson
https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg38/NM_000090.3%3Ac.145C>G/NM_000090.3?content-type=application%2Fjson
We extracted 5 variants
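
<p>The request URLs above contain both <tt>c.479dupT</tt> and <tt> c.479dupT</tt> (with a leading space), which is why five keys were extracted from four distinct HGVS strings. The following is a minimal sketch of normalizing the strings before collecting them, assuming that stray whitespace is the only difference. The helper <tt>normalize_hgvs</tt> is hypothetical, and which cells carry the leading space is an assumption made for illustration.</p>

```python
def normalize_hgvs(hgvs):
    """Strip surrounding whitespace from an HGVS string (hypothetical helper)."""
    return hgvs.strip()

# Allele strings as listed in the two mutation columns; the placement of the
# leading spaces is an assumption, not taken from the spreadsheet itself.
allele_1 = ["c.145C>G", "c.145C>G", "c.479dupT", "c.1786C>T"]
allele_2 = ["c.145C>G", "c.145C>G", " c.479dupT", " c.3851G>A"]

raw_alleles = set(allele_1) | set(allele_2)
clean_alleles = {normalize_hgvs(a) for a in raw_alleles}
print(len(raw_alleles), len(clean_alleles))  # 5 4
```

<p>If the strings are normalized at this point, the same normalization has to be applied wherever the row values are later used as keys into <tt>variant_d</tt> or compared with each other; otherwise lookups and equality checks would still see the unstripped text.</p>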


<h1>Demographic data</h1>
<p>pyphetools can be used to capture information about age, sex, and individual identifiers. This information is stored in a map of "IndividualMapper" objects. Special treatment may be required for the identifiers, which may be used as the column names or row index.</p>

In [9]:
ageMapper = AgeColumnMapper.by_year('Age at examination(years)')
ageMapper.preview_column(dft['Age at examination(years)'])

Unnamed: 0,original column contents,age
0,7.0,P7Y
1,3.5,P3Y6M
2,10.0,P10Y
3,19.0,P19Y
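
<p>The preview above renders decimal years as ISO 8601 durations (3.5 becomes P3Y6M). The following is a minimal sketch of such a conversion, written independently of pyphetools. The function name <tt>years_to_iso8601</tt> is hypothetical, and rounding the fractional part to whole months is an assumption.</p>

```python
def years_to_iso8601(age_years):
    """Convert an age in decimal years to an ISO 8601 duration (hypothetical helper)."""
    years = int(age_years)
    months = round((age_years - years) * 12)  # assumption: round to whole months
    if months == 12:  # e.g., 2.99 years rounds up to a full year
        years += 1
        months = 0
    return f"P{years}Y{months}M" if months else f"P{years}Y"

for age in (7.0, 3.5, 10.0, 19.0):
    print(age, years_to_iso8601(age))
```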


In [10]:
sexMapper = SexColumnMapper(male_symbol='male', female_symbol='female', column_name='Sex')
sexMapper.preview_column(dft['Sex'])

Unnamed: 0,original column contents,sex
0,female,FEMALE
1,male,MALE
2,female,FEMALE
3,female,FEMALE


In [11]:
disease = Disease(disease_id='OMIM:618343', disease_label='Polymicrogyria with or without vascular-type EDS')
encoder = CohortEncoder(df=dft, 
                        hpo_cr=hpo_cr, 
                        column_mapper_d=column_mapper_d, 
                        individual_column_name="patient_id", 
                        agemapper=ageMapper, 
                        sexmapper=sexMapper,
                        metadata=metadata,
                        citation=cite)
encoder.set_disease(disease)

# Add variant data

In [12]:
individuals = encoder.get_individuals()
for indi in individuals:
    row = dft.loc[indi.id]
    v1 = row['Mutation in COL3A1']
    v2 = row['Second mutation in COL3A1']
    if v1 == v2:
        var = variant_d.get(v1)
        var.set_homozygous()
        indi.add_variant(var)
    else:
        var1 = variant_d.get(v1)
        var1.set_heterozygous()
        var2 = variant_d.get(v2)
        var2.set_heterozygous()
        indi.add_variant(var1)
        indi.add_variant(var2)

In [13]:
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.BI_ALLELIC)
validated_individuals = cvalidator.get_validated_individual_list()
qc = QcVisualizer(ontology=hpo_ontology, cohort_validator=cvalidator)
display(HTML(qc.to_html()))

ID,Level,Category,Message,HPO Term
PMID_28258187_Patient_1,WARNING,REDUNDANT,Dilation of Virchow-Robin spaces is listed multiple times,Dilation of Virchow-Robin spaces (HP:0012520)


## Clean annotations
We use the error-free individual list from the validator to get a version of the phenopackets without the redundant term.
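
A redundancy of this kind can arise when more than one column mapper targets the same HPO term. The sketch below shows the general idea of detecting and removing repeated term identifiers. It is not how CohortValidator is implemented; both helper names and the example list are hypothetical.

```python
from collections import Counter

def find_redundant_terms(term_ids):
    """Return the term ids that occur more than once (hypothetical helper)."""
    counts = Counter(term_ids)
    return [term_id for term_id, n in counts.items() if n > 1]

def drop_duplicate_terms(term_ids):
    """Keep the first occurrence of each term id, preserving order (hypothetical helper)."""
    return list(dict.fromkeys(term_ids))

# Illustrative list in which HP:0012520 was produced by two different columns.
terms = ["HP:0012520", "HP:0001263", "HP:0012520", "HP:0001250"]
redundant = find_redundant_terms(terms)
unique_terms = drop_duplicate_terms(terms)
print(redundant)
print(unique_terms)
```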

In [14]:
individuals = cvalidator.get_error_free_individual_list()
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1, allelic_requirement=AllelicRequirement.BI_ALLELIC)
validated_individuals = cvalidator.get_validated_individual_list()
qc = QcVisualizer(ontology=hpo_ontology, cohort_validator=cvalidator)
display(HTML(qc.to_html()))

In [15]:
table = PhenopacketTable(individual_list=individuals, metadata=metadata)
display(HTML(table.to_html()))

Individual,Disease,Genotype,Phenotypic features
Patient 1 (FEMALE; P7Y),Polymicrogyria with or without vascular-type EDS (OMIM:618343),NM_000090.3:c.145C>G (homozygous),Cerebellar vermis hypoplasia (HP:0001320); Abnormal brainstem morphology (HP:0002363); Dilation of Virchow-Robin spaces (HP:0012520); Dysgyria with thickened cortex (HP:0032400); Dilated third ventricle (HP:0007082); Renal cortical cysts (HP:0000803); Abnormal length of corpus callosum (HP:0200011); Lateral ventricle dilatation (HP:0006956); Enlarged cisterna magna (HP:0002280); Global developmental delay (HP:0001263); Seizure (HP:0001250); Abnormal cerebellar cortex morphology (HP:0031422); Hypointensity of cerebral white matter on MRI (HP:0007103)
Patient 2 (MALE; P3Y6M),Polymicrogyria with or without vascular-type EDS (OMIM:618343),NM_000090.3:c.145C>G (homozygous),Abnormal cerebellar cortex morphology (HP:0031422); Renal cortical cysts (HP:0000803); Global developmental delay (HP:0001263); Absent speech (HP:0001344); Macrocephaly (HP:0000256); Seizure (HP:0001250); Dysgyria with thickened cortex (HP:0032400); Cerebral hypomyelination (HP:0006808); Dilation of Virchow-Robin spaces (HP:0012520); Lateral ventricle dilatation (HP:0006956); Dilated third ventricle (HP:0007082); Abnormal brainstem morphology (HP:0002363); Abnormal length of corpus callosum (HP:0200011); Cerebellar vermis hypoplasia (HP:0001320); Enlarged cisterna magna (HP:0002280); Arachnoid cyst (HP:0100702)
Patient 3 (FEMALE; P10Y),Polymicrogyria with or without vascular-type EDS (OMIM:618343),NM_000090.3:c.479dup (heterozygous) NM_000090.3:c.479dup (heterozygous),Bruising susceptibility (HP:0000978); Abnormal arterial physiology (HP:0025323); Renal cortical cysts (HP:0000803); Varicose veins (HP:0002619); Joint hypermobility (HP:0001382); Pulmonic stenosis (HP:0001642); Atrophic scars (HP:0001075); Gingival recession (HP:0030816); Talipes equinovarus (HP:0001762); Motor delay (HP:0001270); Typical absence seizure (HP:0011147); Dysgyria with thickened cortex (HP:0032400); Polymicrogyria (HP:0002126); Dilation of Virchow-Robin spaces (HP:0012520); Lateral ventricle dilatation (HP:0006956); Dilated third ventricle (HP:0007082); Hypoplasia of the pons (HP:0012110); Abnormal length of corpus callosum (HP:0200011); Arachnoid cyst (HP:0100702)
Patient 4 (FEMALE; P19Y),Polymicrogyria with or without vascular-type EDS (OMIM:618343),NM_000090.3:c.1786C>T (heterozygous) NM_000090.3:c.3851G>A (heterozygous),Bruising susceptibility (HP:0000978); Dermal translucency (HP:0010648); Arterial dissection (HP:0005294); Renal cortical cysts (HP:0000803); Joint hypermobility (HP:0001382); Tendon rupture (HP:0100550); Slender finger (HP:0001238); Polymicrogyria (HP:0002126); Dilation of Virchow-Robin spaces (HP:0012520); Cerebellar vermis hypoplasia (HP:0001320); Enlarged cisterna magna (HP:0002280); Carotid artery dilatation (HP:0012163); Carotid artery stenosis (HP:0100546)


In [16]:
output_directory = "phenopackets"
Individual.output_individuals_as_phenopackets(individual_list=individuals,
                                              metadata=metadata,
                                              outdir=output_directory)

We output 4 GA4GH phenopackets to the directory phenopackets
