<H1>MAPK8IP3 genotype phenotype correlations</H1>

In [15]:
import genophenocorr
import hpotk
from genophenocorr.preprocessing import configure_caching_cohort_creator, load_phenopacket_folder
from genophenocorr.view import CohortViewable, StatsViewable
from genophenocorr.model import VariantEffect
from genophenocorr.analysis import configure_cohort_analysis, CohortAnalysisConfiguration
from genophenocorr.analysis.predicate import PatientCategories

from IPython.display import display, HTML

store = hpotk.configure_ontology_store()
hpo = store.load_minimal_hpo(release='v2023-10-09')
print(f'Loaded HPO v{hpo.version}')
print(f'hpotk version {hpotk.__version__}')
print(f"Using genophenocorr version {genophenocorr.__version__}")

Loaded HPO v2023-10-09
hpotk version 0.5.0
Using genophenocorr version 0.1.1dev


### Setup

Here we specify the path to the folder containing the phenopackets to be analyzed and the transcript to use for the analysis (in general, the MANE transcript should be used). We use `NM_001318852.2`, the MANE transcript of the *MAPK8IP3* gene.

In [5]:
fpath_phenopackets = 'phenopackets'
MAPK8IP3_id = 'NM_001318852.2'

### Create the cohort



In [7]:
cohort_creator = configure_caching_cohort_creator(hpo)
cohort = load_phenopacket_folder(fpath_phenopackets, cohort_creator)
cv = CohortViewable(hpo=hpo, transcript_id=MAPK8IP3_id)
html = cv.process(cohort=cohort)

display(HTML(html))

Patients Created: 100%|██████████| 20/20 [00:00<00:00, 389.83it/s]
Validated under none policy


HPO Term,ID,Annotation Count
Global developmental delay,HP:0001263,19
Hypotonia,HP:0001252,11
Thin corpus callosum,HP:0033725,9
"Intellectual disability, moderate",HP:0002342,7
Seizure,HP:0001250,6
Intellectual disability,HP:0001249,6
Spastic diplegia,HP:0001264,6
Delayed ability to walk,HP:0031936,6
Cerebral atrophy,HP:0002059,5
Thin upper lip vermilion,HP:0000219,5

Variant,Variant name,Variant Count
16_1762843_1762843_C_T,todo,6
16_1767834_1767834_C_T,todo,5
16_1760409_1760409_T_C,todo,2
16_1762388_1762388_G_A,todo,1
16_1706418_1706418_G_T,todo,1
16_1748705_1748705_G_A,todo,1
16_1706402_1706403_CG_C,todo,1
16_1706384_1706384_C_G,todo,1
16_1766768_1766768_C_G,todo,1
16_1706450_1706450_C_G,todo,1

Disease,Annotation Count
OMIM:618443,20

Variant effect,Annotation Count
MISSENSE_VARIANT,16
STOP_GAINED,3
FRAMESHIFT_VARIANT,1


## Configure the analysis
For the first run, we will not remove unneeded terms prior to statistical analysis.

In [9]:
analysis_config = CohortAnalysisConfiguration()
analysis_config.missing_implies_excluded = True
analysis_config.pval_correction = 'fdr_bh'
analysis_config.min_perc_patients_w_hpo = 0.1
analysis = configure_cohort_analysis(cohort, hpo, config=analysis_config)

In [11]:
frameshift = analysis.compare_by_variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id=MAPK8IP3_id)
frameshift.summarize(hpo, PatientCategories.YES)

FRAMESHIFT_VARIANT on NM_001318852.2,Yes (Count),Yes (Percent),No (Count),No (Percent),p value,Corrected p value
Autism [HP:0000717],1/1,100%,1/14,7%,0.133333,1.0
Abnormality of movement [HP:0100022],0/1,0%,8/13,62%,0.428571,1.0
Abnormal axial skeleton morphology [HP:0009121],1/1,100%,3/3,100%,1.000000,1.0
Motor deterioration [HP:0002333],0/0,0%,1/1,100%,1.000000,1.0
Abnormal cerebellum morphology [HP:0001317],1/1,100%,1/1,100%,1.000000,1.0
...,...,...,...,...,...,...
Hypoplasia of the brainstem [HP:0002365],0/0,0%,1/1,100%,1.000000,1.0
Finger clinodactyly [HP:0040019],0/0,0%,1/1,100%,1.000000,1.0
Finger joint contracture [HP:0034681],0/0,0%,1/1,100%,1.000000,1.0
Abnormal phalangeal joint morphology of the hand [HP:0006261],0/0,0%,1/1,100%,1.000000,1.0
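
The p values in the table above are consistent with a two-sided Fisher's exact test on the 2x2 table of genotype group by phenotype status. The following is a minimal sketch under that assumption, not genophenocorr's own implementation; the function name `fisher_exact_two_sided` is hypothetical. It recomputes the p values for the Autism and Abnormality of movement rows from their counts.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: patients WITH the genotype who have / lack the phenotype
    c, d: patients WITHOUT the genotype who have / lack the phenotype
    (Hypothetical helper for illustration only.)
    """
    n = a + b + c + d
    row1 = a + b  # patients with the genotype
    col1 = a + c  # patients with the phenotype

    def prob(k: int) -> float:
        # Hypergeometric probability that k genotype carriers have the phenotype
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_observed = prob(a)
    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    probs = [prob(k) for k in range(k_min, k_max + 1)]
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in probs if p <= p_observed * (1 + 1e-9))


# Autism: 1/1 with the frameshift variant vs. 1/14 without
print(round(fisher_exact_two_sided(1, 0, 1, 13), 6))  # 0.133333
# Abnormality of movement: 0/1 with the variant vs. 8/13 without
print(round(fisher_exact_two_sided(0, 1, 8, 5), 6))   # 0.428571
```

With a single frameshift carrier in the cohort, no term can reach a small p value, which is why every corrected p value in this comparison is 1.0.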


## Heuristic to omit terms from statistical testing
Terms that are unlikely to yield significant results are removed prior to testing to reduce the multiple-testing burden.

In [12]:
analysis_config = CohortAnalysisConfiguration()
analysis_config.missing_implies_excluded = True
analysis_config.pval_correction = 'fdr_bh'
analysis_config.min_perc_patients_w_hpo = 0.1
analysis_config.heuristic_strategy()
analysis = configure_cohort_analysis(cohort, hpo, config=analysis_config)

In [13]:
frameshift = analysis.compare_by_variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id=MAPK8IP3_id)
frameshift.summarize(hpo, PatientCategories.YES)

FRAMESHIFT_VARIANT on NM_001318852.2,Yes (Count),Yes (Percent),No (Count),No (Percent),p value,Corrected p value
Hypotonia [HP:0001252],1/1,100%,10/17,59%,1.0,1.0
Abnormal muscle tone [HP:0003808],1/1,100%,17/17,100%,1.0,1.0
Abnormal muscle physiology [HP:0011804],1/1,100%,17/17,100%,1.0,1.0
Abnormal upper lip morphology [HP:0000177],0/0,0%,7/7,100%,1.0,1.0
Abnormal lip morphology [HP:0000159],0/0,0%,7/7,100%,1.0,1.0
Abnormal oral cavity morphology [HP:0000163],0/0,0%,9/9,100%,1.0,1.0
Abnormal oral morphology [HP:0031816],0/0,0%,9/9,100%,1.0,1.0
Abnormality of the mouth [HP:0000153],0/0,0%,9/9,100%,1.0,1.0
Abnormality of the face [HP:0000271],0/0,0%,9/9,100%,1.0,1.0
Upper motor neuron dysfunction [HP:0002493],0/0,0%,10/10,100%,1.0,1.0


In [14]:
report = frameshift.mtc_filter_report
sv = StatsViewable(filter_method_name=report.filter_method(),
                   mtc_name=report.mtc_method(),
                   filter_results_map=report.skipped_terms_dict(),
                   term_count=report.n_terms_tested())



In [8]:

analysis_config = CohortAnalysisConfiguration()
analysis_config.missing_implies_excluded = True
analysis_config.pval_correction = 'fdr_bh'
analysis_config.min_perc_patients_w_hpo = 0.1
analysis_config.heuristic_strategy()
analysis = configure_cohort_analysis(cohort, hpo, config=analysis_config)


In [9]:
frameshift = analysis.compare_by_variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id=MAPK8IP3_id)
frameshift.summarize(hpo, PatientCategories.YES)

FRAMESHIFT_VARIANT on NM_001318852.2,Yes (Count),Yes (Percent),No (Count),No (Percent),p value,Corrected p value
Abnormal nervous system physiology [HP:0012638],1/1,100%,19/19,100%,1.0,1.0
Global developmental delay [HP:0001263],1/1,100%,18/18,100%,1.0,1.0
Neurodevelopmental delay [HP:0012758],1/1,100%,18/18,100%,1.0,1.0
Neurodevelopmental abnormality [HP:0012759],1/1,100%,19/19,100%,1.0,1.0
Intellectual disability [HP:0001249],1/1,100%,18/18,100%,1.0,1.0
Abnormality of mental function [HP:0011446],1/1,100%,19/19,100%,1.0,1.0
Thin corpus callosum [HP:0033725],0/0,0%,9/10,90%,1.0,1.0
Abnormal corpus callosum morphology [HP:0001273],0/0,0%,10/10,100%,1.0,1.0
Abnormal cerebral white matter morphology [HP:0002500],0/0,0%,10/10,100%,1.0,1.0
Abnormal cerebral subcortex morphology [HP:0010993],0/0,0%,10/10,100%,1.0,1.0


In [16]:
report = frameshift.mtc_filter_report

# Summarize which terms the heuristic skipped, and why.
# Named `sv` so that it does not shadow the CohortViewable `cv` created earlier.
sv = StatsViewable(filter_method_name=report.filter_method(),
                   mtc_name=report.mtc_method(),
                   filter_results_map=report.skipped_terms_dict(),
                   term_count=report.n_terms_tested())
html = sv.process(cohort=cohort)

display(HTML(html))

Skipped,Count
Skipping term with only 1 observations (not powered for 2x2),150
Skipping term with only 2 observations (not powered for 2x2),33
Skipping term with only 3 observations (not powered for 2x2),27
Skipping term with only 4 observations (not powered for 2x2),17
Skipping term with only 6 observations (not powered for 2x2),15
Skipping top level term,12
Skipping term with only 5 observations (not powered for 2x2),12
Child term with same counts previously tested,11
Skipping non phenotype term,2
Skipping term HP:0000343 because no term has more than one count,1
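
One way to see why terms with six or fewer observations are reported above as "not powered for 2x2" is to compute the smallest p value any 2x2 table with that many observations could produce. The sketch below assumes a two-sided Fisher's exact test and a nominal significance level of 0.05; both are assumptions, and the helper names `two_sided_fisher` and `min_attainable_p` are hypothetical and not part of genophenocorr.

```python
from math import comb


def two_sided_fisher(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k in the top-left cell
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_observed = prob(a)
    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    probs = [prob(k) for k in range(k_min, k_max + 1)]
    return sum(p for p in probs if p <= p_observed * (1 + 1e-9))


def min_attainable_p(n_observations: int) -> float:
    """Smallest two-sided p value over every 2x2 table with this many observations."""
    best = 1.0
    for a in range(n_observations + 1):
        for b in range(n_observations + 1 - a):
            for c in range(n_observations + 1 - a - b):
                d = n_observations - a - b - c
                best = min(best, two_sided_fisher(a, b, c, d))
    return best


for n in range(1, 8):
    print(n, round(min_attainable_p(n), 4))
```

Under these assumptions, six observations can do no better than about 0.067, whereas seven observations can reach about 0.029. Testing a term with six or fewer observations therefore adds to the multiple-testing burden without any chance of a nominally significant result.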


## Correlation analysis for c.1735C>T

`NM_001318852.2:c.1735C>T` is the most commonly encountered variant in our cohort. In the following code, we investigate whether this variant displays significant genotype-phenotype correlations.

For the purpose of the analysis, the variant is denoted by its key: `16_1762843_1762843_C_T`.
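
The keys listed in the cohort summary appear to follow the pattern `chromosome_start_end_ref_alt`. The sketch below splits a key into those fields; the field meanings are inferred from the keys in this notebook, and `VariantKey` and `parse_variant_key` are hypothetical helpers rather than part of the genophenocorr API. It only handles the five-field keys seen here.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """Fields inferred from the keys in this notebook (an assumption)."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str


def parse_variant_key(key: str) -> VariantKey:
    """Split a five-field key such as '16_1762843_1762843_C_T' into its parts."""
    parts = key.split('_')
    if len(parts) != 5:
        raise ValueError(f'Expected 5 underscore-separated fields, got {len(parts)}: {key!r}')
    chrom, start, end, ref, alt = parts
    return VariantKey(chrom, int(start), int(end), ref, alt)


print(parse_variant_key('16_1762843_1762843_C_T'))
print(parse_variant_key('16_1706402_1706403_CG_C'))
```

Parsing the key this way makes it easier to cross-check a variant against its HGVS name, for example to confirm that a single-base substitution has identical start and end positions.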

Let's run the analysis and summarize the results.

In [12]:
from genophenocorr.analysis.predicate import PatientCategories

variant_key = '16_1762843_1762843_C_T'

by_variant = analysis.compare_by_variant_key(variant_key=variant_key)
by_variant.summarize(hpo, PatientCategories.YES)

>=1 allele of the variant 16_1762843_1762843_C_T,Yes (Count),Yes (Percent),No (Count),No (Percent),p value,Corrected p value
Seizure [HP:0001250],4/5,80%,2/13,15%,0.021709,0.436364
Inability to walk [HP:0002540],3/3,100%,1/8,12%,0.024242,0.436364
Gait disturbance [HP:0001288],3/3,100%,2/8,25%,0.060606,0.727273
Abnormality of movement [HP:0100022],4/4,100%,4/10,40%,0.084915,0.764236
Thin corpus callosum [HP:0033725],6/6,100%,3/4,75%,0.4,1.0
Spastic diplegia [HP:0001264],4/4,100%,2/3,67%,0.428571,1.0
Abnormal nervous system physiology [HP:0012638],6/6,100%,14/14,100%,1.0,1.0
Global developmental delay [HP:0001263],6/6,100%,13/13,100%,1.0,1.0
Neurodevelopmental delay [HP:0012758],6/6,100%,13/13,100%,1.0,1.0
Neurodevelopmental abnormality [HP:0012759],6/6,100%,14/14,100%,1.0,1.0
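
The corrected p values depend on how many terms were tested. `'fdr_bh'` is the name statsmodels uses for the Benjamini-Hochberg procedure, and the sketch below applies a plain-Python version of that procedure to the p values shown above. It is an illustration, not the library's code: the function name `benjamini_hochberg` is hypothetical, the total of 36 tests is an assumption inferred from the ratio of corrected to raw p values in the table, and the p values not displayed in the table are padded with 1.0.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that the adjusted
    # values never decrease as the raw p values increase.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# The ten p values displayed in the table above
shown = [0.021709, 0.024242, 0.060606, 0.084915, 0.4, 0.428571,
         1.0, 1.0, 1.0, 1.0]
# ASSUMPTION: 36 tests in total; the undisplayed p values are padded with 1.0
n_tests = 36
padded = shown + [1.0] * (n_tests - len(shown))

for raw, adj in zip(shown[:4], benjamini_hochberg(padded)[:4]):
    print(f'{raw:.6f} -> {adj:.6f}')
```

Under these assumptions the first four adjusted values agree with the reported 0.436364, 0.436364, 0.727273 and 0.764236 up to rounding of the inputs. Seizure and Inability to walk have nominal p values below 0.05, but neither remains below 0.05 after correction.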


TODO - finalize!