**Check if the phenopacket-store phenopackets validate with the C2S2 sanitizer**

This notebook loads the phenopacket-store cohorts into C2S2 domain models, runs the validation, and reports the actions that were applied to sanitize the phenopackets for use with C2S2.

First, we find the cohort directories. Not all cohorts can be loaded, so we skip those for now (see [Pending cohorts](#Pending-cohorts) for more details).

Next, we load the cohorts into the C2S2 domain model and report the total number of cohort members.

Last, we run the validation and report the sanitation actions.


**Imports**

The notebook will not run without these libraries:

In [1]:
import os

import c2s2
import hpotk

# Load cohorts

## Pending cohorts

These cohorts have no `phenopackets` folder, so they cannot be loaded yet:

In [2]:
blacklist = {
    'HMGCR', 
    'LAMB2',
    'LYN',
    'ZMYM3',
}
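The list above is maintained by hand. As a rough sketch, the same check could be automated by scanning the cohort directories for the expected folder. The helper name `find_cohorts_without_phenopackets` and the demo directory layout are hypothetical and not part of C2S2 or phenopacket-store; only the folder names `phenopackets` and `v2phenopackets` come from this notebook.

```python
import os
import tempfile


def find_cohorts_without_phenopackets(notebooks_dir,
                                      folder_names=('phenopackets', 'v2phenopackets')):
    """Return sorted names of cohort directories that contain none of `folder_names`.

    Hypothetical helper for illustration; not part of C2S2.
    """
    missing = []
    for entry in sorted(os.listdir(notebooks_dir)):
        cohort_dir = os.path.join(notebooks_dir, entry)
        if not os.path.isdir(cohort_dir):
            continue  # skip stray files that are not cohort directories
        has_folder = any(
            os.path.isdir(os.path.join(cohort_dir, name)) for name in folder_names
        )
        if not has_folder:
            missing.append(entry)
    return missing


# Demonstration on a throwaway directory tree (made-up layout, not the real store).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'ABCB7', 'phenopackets'))
    os.makedirs(os.path.join(tmp, 'LIRICAL', 'v2phenopackets'))
    os.makedirs(os.path.join(tmp, 'HMGCR'))  # no phenopackets folder
    demo_missing = find_cohorts_without_phenopackets(tmp)

print(demo_missing)
```

Run against the real `notebooks` directory, the result could be compared with the hand-written `blacklist` to spot cohorts that have since gained a `phenopackets` folder.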

## Find cohort directories

In [3]:
notebooks_dir = os.path.abspath(os.path.join(os.pardir, 'notebooks'))
notebooks_dir

'/home/ielis/data/phenopacket-store/notebooks'

## Read samples

In [4]:
cohort_names = sorted(filter(lambda cn: cn not in blacklist, os.listdir(notebooks_dir)))

pp_dirs = {}
for cohort_name in cohort_names:
    if cohort_name == 'LIRICAL':
        pp_dir = os.path.join(notebooks_dir, cohort_name, 'v2phenopackets')
    else:
        pp_dir = os.path.join(notebooks_dir, cohort_name, 'phenopackets')
    pp_dirs[cohort_name] = pp_dir

f'Got {len(pp_dirs)} cohorts'

'Got 99 cohorts'

In [5]:
from c2s2.ingest import read_phenopacket_dir

cohorts = {}

for cohort_name, pp_dir in pp_dirs.items():
    cohorts[cohort_name] = read_phenopacket_dir(pp_dir)

total_count = 0
for pps in cohorts.values():
    total_count += len(pps)
f'Read {total_count} samples'

'Read 4189 samples'
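Besides the grand total, a per-cohort breakdown can help spot cohorts that loaded with no samples. The sketch below assumes only what the cell above already relies on, namely that each value in `cohorts` supports `len()`. The helper name `count_samples` and the demo data are made up for illustration.

```python
def count_samples(cohorts):
    """Return (per-cohort sample counts, total count) for a name -> samples mapping."""
    per_cohort = {name: len(samples) for name, samples in cohorts.items()}
    return per_cohort, sum(per_cohort.values())


# Made-up stand-in for the `cohorts` dictionary built above.
demo_cohorts = {'A': ['s1', 's2'], 'B': ['s3'], 'C': []}

per_cohort, total = count_samples(demo_cohorts)
empty_cohorts = [name for name, n in per_cohort.items() if n == 0]

print(f'Read {total} samples; empty cohorts: {empty_cohorts}')
```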

# Sanitize

## Load HPO

We use HPO release `v2024-04-04`, the latest available at the time of writing.

In [6]:
store = hpotk.configure_ontology_store()
hpo = store.load_minimal_hpo(release='v2024-04-04')
hpo.version

'2024-04-04'

## Sanitize

In [7]:
from c2s2.preprocessing.sanitize import sanitize_samples

for cohort_name, cohort in cohorts.items():
    print(f'Sanitizing {cohort_name} '.ljust(120, '-'))
    sanitation_result = sanitize_samples(cohort, hpo)

    found_an_issue = False
    for sample, actions in sanitation_result.get_samples_and_actions():
        found_an_issue = True
        print(f'Sample: {sample.labels}')
        for action in actions:
            print(f' - {action}')
    
    if not found_an_issue:
        print(' OK')
    else:
        print('')  # blank line after a cohort with findings

Sanitizing 11q_terminal_deletion ---------------------------------------------------------------------------------------
 OK
Sanitizing ABCB7 -------------------------------------------------------------------------------------------------------
 OK
Sanitizing ACTB --------------------------------------------------------------------------------------------------------
 OK
Sanitizing ADAMTS15 ----------------------------------------------------------------------------------------------------
 OK
Sanitizing ANKH --------------------------------------------------------------------------------------------------------
 OK
Sanitizing ANKRD11 -----------------------------------------------------------------------------------------------------
 OK
Sanitizing ARPC5 -------------------------------------------------------------------------------------------------------
 OK
Sanitizing ATP13A2 -----------------------------------------------------------------------------------------------------
 OK
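When many cohorts report findings, a tally of the actions can be easier to scan than the per-sample listing. The following is a minimal sketch that consumes `(sample, actions)` pairs, the shape that `get_samples_and_actions()` is iterated over in the loop above. The helper name `summarize_actions` is hypothetical, and the action strings in the demo are invented placeholders, not real C2S2 output.

```python
from collections import Counter


def summarize_actions(samples_and_actions):
    """Count affected samples and tally actions by their string form.

    Hypothetical helper; expects an iterable of (sample, actions) pairs.
    """
    n_samples = 0
    counts = Counter()
    for _sample, actions in samples_and_actions:
        n_samples += 1
        for action in actions:
            counts[str(action)] += 1
    return n_samples, counts


# Invented placeholder data, standing in for a real sanitation result.
demo_pairs = [
    ('sample-1', ['placeholder action A', 'placeholder action B']),
    ('sample-2', ['placeholder action A']),
]

n_affected, action_counts = summarize_actions(demo_pairs)
for action, count in action_counts.most_common():
    print(f'{count:>4} x {action}')
```

In the notebook, the call would presumably be `summarize_actions(sanitation_result.get_samples_and_actions())` inside the per-cohort loop.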

