**Setup for visualizers**

The notebook loads a cohort of patients with mutations in *PTPN11*, performs the functional variant annotation, and collates the data into a `Cohort` object.

In [1]:
%matplotlib inline

import os
import hpotk

hpotk.util.setup_logging()

# Setup resources

Set up paths to resources that work on the system. 

We need just the path to the local copy of the *phenopacket-store* repo.

In [2]:
fpath_phenopacket_store = '/home/ielis/data/phenopacket-store'

fpath_ptpn11 = os.path.join(fpath_phenopacket_store, 'notebooks', 'PTPN11', 'phenopackets')
assert os.path.isdir(fpath_ptpn11), 'Update path to folder with PTPN11 phenopackets' 

## Load HPO

Use HPO release *2023-10-09*. The ontology is downloaded from the GitHub release URL.

In [3]:
import hpotk

fpath_hpo = 'https://github.com/obophenotype/human-phenotype-ontology/releases/download/v2023-10-09/hp.json'

hpo = hpotk.load_minimal_ontology(fpath_hpo)
hpo.version

'2023-10-09'

# Load samples

## Configure patient creator

The patient creator transforms phenopackets into `Patient`s, the internal representation of the sample data.

The transformation uses HPO to check that all phenotypic features are annotated with current HPO terms.

### Setup phenotypic feature validation

We ensure that the phenotypic features of the subjects meet the following validation requirements:
- the phenotypic features are represented using current (non-obsolete) HPO term IDs
- all phenotypic features are descendants of the *Phenotypic abnormality* term of HPO
- the terms do not violate the annotation propagation rule: a subject is not annotated with both a term and its ancestor or descendant

In [4]:
from hpotk.validate import ValidationRunner
from hpotk.validate import ObsoleteTermIdsValidator, PhenotypicAbnormalityValidator, AnnotationPropagationValidator

validation_runner = ValidationRunner(
    validators=(
        ObsoleteTermIdsValidator(hpo), 
        PhenotypicAbnormalityValidator(hpo), 
        AnnotationPropagationValidator(hpo)
        ))
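To make the annotation propagation rule concrete, here is a minimal pure-Python sketch that does not use `hpotk`. The hierarchy (`PARENTS`) is a simplified toy graph and the helper names are hypothetical. The sketch only considers observed features, so it is an illustration of the idea rather than a description of how `AnnotationPropagationValidator` works internally.

```python
# Toy hierarchy: each term maps to its direct parents (simplified, not the real HPO graph).
PARENTS = {
    'Arachnodactyly': {'Long fingers'},
    'Long fingers': {'Abnormal finger morphology'},
    'Abnormal finger morphology': {'Phenotypic abnormality'},
    'Seizure': {'Phenotypic abnormality'},
    'Phenotypic abnormality': set(),
}


def get_ancestors(term, parents=PARENTS):
    """Collect all ancestors of `term` by walking the parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def find_propagation_violations(annotations, parents=PARENTS):
    """Return (term, ancestor) pairs where both terms annotate the same subject."""
    annotated = set(annotations)
    violations = []
    for term in sorted(annotated):
        for ancestor in sorted(get_ancestors(term, parents) & annotated):
            violations.append((term, ancestor))
    return violations


violations = find_propagation_violations(['Arachnodactyly', 'Long fingers', 'Seizure'])
print(violations)  # [('Arachnodactyly', 'Long fingers')]
```

The subject is annotated with both *Arachnodactyly* and its ancestor *Long fingers*, so the more general term is redundant and the pair is reported.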

In [5]:
from genophenocorr.preprocessing import configure_caching_patient_creator

patient_creator = configure_caching_patient_creator(hpo, validation_runner=validation_runner)

## Load phenopackets

Walk the directory, find all JSON files, load them into phenopackets, and transform the phenopackets to patients.

In [6]:
from phenopackets import Phenopacket
from google.protobuf.json_format import Parse

samples = []
for dirpath, dirnames, filenames in os.walk(fpath_ptpn11):
    for filename in filenames:
        if filename.endswith('.json'):
            fpath_pp = os.path.join(dirpath, filename)
            pp = Phenopacket()
            with open(fpath_pp) as fh:
                Parse(fh.read(), pp)
            patient = patient_creator.create_patient(pp)
            samples.append(patient)

f'Loaded {len(samples)} samples'

'Loaded 42 samples'
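The directory walk can also be written with `pathlib`. The sketch below is a standard-library-only illustration of the file discovery step; the helper name `find_json_files` is hypothetical. It sorts the paths so that the samples load in a reproducible order, which `os.walk` does not guarantee. Parsing the files into phenopackets is left out to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def find_json_files(root):
    """Return all *.json files under `root`, searched recursively, in sorted order."""
    return sorted(p for p in Path(root).rglob('*.json') if p.is_file())


# Demonstrate on a throwaway directory instead of a real phenopacket store.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'nested').mkdir()
    (root / 'b.json').write_text(json.dumps({'id': 'b'}))
    (root / 'nested' / 'a.json').write_text(json.dumps({'id': 'a'}))
    (root / 'notes.txt').write_text('not a phenopacket')
    found = [p.relative_to(root).as_posix() for p in find_json_files(root)]

print(found)  # ['b.json', 'nested/a.json']
```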

## Gather samples into cohort

Gather the samples and calculate the cohort summary statistics such as transcripts affected by the variants.

In [7]:
from genophenocorr.model import Cohort

cohort = Cohort.from_patients(samples)
cohort.all_transcripts

{'NM_001330437.2', 'NM_001374625.1', 'NM_002834.5', 'NM_080601.3'}

# Gather data for visualization

Here we get the data required for visualizing the variants on the selected transcript or protein.

## Choose the transcript

We need to choose the transcript and protein IDs. Currently this is done manually, but we will find a way to do it automatically, e.g. by using the MANE transcript.

The MANE transcript for *PTPN11* is [NM_002834.5](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:9644).

In [8]:
tx_id = 'NM_002834.5'
protein_id = 'NP_002825.3'

## Gather the data for visualization

We need to get:
- variants
- transcript coordinates
- protein metadata

### Variants

Variants are easy: `Cohort` exposes all the variants via the `all_variants` property:

In [9]:
variants = cohort.all_variants
len(variants)

42

### Transcript coordinates

Transcript coordinates can be fetched from the VariantValidator API:

In [10]:
from genophenocorr.model.genome import GRCh38 
from genophenocorr.preprocessing import VVTranscriptCoordinateService

txc_service = VVTranscriptCoordinateService(genome_build=GRCh38)
tx_coordinates = txc_service.fetch(tx_id)
tx_coordinates

HTTPError: 503 Server Error: Service Unavailable for url: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/NM_002834.5
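The 503 response means the remote service was unavailable when the notebook ran, so `tx_coordinates` was never assigned and the cells that use it fail with `NameError`. One way to make the fetch more robust is to retry with an increasing delay. The helper below is a hypothetical sketch and not part of `genophenocorr`; the demo uses a stand-in function that fails twice, so it runs without network access.

```python
import time


def call_with_retries(func, attempts=3, initial_delay=1.0, backoff=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `func()` up to `attempts` times, waiting longer after each failure."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, propagate the last error
            sleep(delay)
            delay *= backoff


# Stand-in for a flaky remote call: fails twice, then succeeds.
calls = []
waits = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('503 Service Unavailable')
    return 'coordinates'


# Record the requested delays instead of actually sleeping.
result = call_with_retries(flaky_fetch, sleep=waits.append)
print(result, len(calls), waits)  # coordinates 3 [1.0, 2.0]
```

In the notebook, the wrapped call could look like `call_with_retries(lambda: txc_service.fetch(tx_id))`. Narrowing `retry_on` to the exception class that the service actually raises is advisable, but that class is an assumption to verify against the traceback.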

The `TranscriptCoordinates` object knows about the number of coding bases and amino acid codons.

Note, the counts of coding bases and codons do *not* include the termination codon.

In [None]:
print(f'{tx_id} has {tx_coordinates.get_coding_base_count():,} coding bases')
print(f'{tx_id} has {tx_coordinates.get_codon_count():,} codons')

NameError: name 'tx_coordinates' is not defined

We can get the UTR regions (both 5' and 3') as well as the CDS regions.

Note, for simplicity, the CDS regions include *both* initiation and termination codons!

5' UTR regions:

In [None]:
for utr in tx_coordinates.get_five_prime_utrs():
    print(f'{utr.start:,}-{utr.end:,}')

CDS regions:

In [None]:
for cds in tx_coordinates.get_cds_regions():
    print(f'{cds.start:,}-{cds.end:,}')

3' UTR regions:

In [None]:
for utr in tx_coordinates.get_three_prime_utrs():
    print(f'{utr.start:,}-{utr.end:,}')

### Protein metadata

Last, we fetch the protein metadata from UniProt.

The fetch logs a warning whose significance is still unclear and needs to be investigated.

In [None]:
from genophenocorr.preprocessing import UniprotProteinMetadataService

pms = UniprotProteinMetadataService()

protein_metas = pms.annotate(protein_id)

assert len(protein_metas) == 1
protein_meta = protein_metas[0]
protein_meta

We get metadata with 4 features (3 domains and 1 region), which is in line with the UniProt [Family & Domains section](https://www.uniprot.org/uniprotkb/Q06124/entry#family_and_domains).

In [None]:
for feature in protein_meta.protein_features:
    print(feature)

What we *do not* get is the length of the protein sequence. In the worst case, we can calculate it from the CDS length.
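As a rough sketch of that fallback, the protein length can be derived from the CDS regions. The example assumes each region is a `(start, end)` pair in 0-based, half-open coordinates (so the length is `end - start`) and that the regions include the termination codon, as noted for the CDS regions above. The function name and the toy coordinates are hypothetical.

```python
def protein_length_from_cds(cds_regions):
    """Compute the number of amino acids encoded by the given CDS regions.

    Assumes (start, end) pairs in 0-based, half-open coordinates that
    together cover the initiation through the termination codon.
    """
    n_bases = sum(end - start for start, end in cds_regions)
    if n_bases % 3 != 0:
        raise ValueError(f'CDS length {n_bases} is not a multiple of 3')
    n_codons = n_bases // 3
    return n_codons - 1  # the termination codon does not encode an amino acid


# Toy regions: 60 + 45 + 18 = 123 bases, i.e. 41 codons including the stop.
toy_cds = [(100, 160), (200, 245), (300, 318)]
protein_length = protein_length_from_cds(toy_cds)
print(protein_length)  # 40
```

Since the codon count reported by `get_codon_count()` is described above as excluding the termination codon, that value may already equal the protein length. This should be checked against the sequence length listed in UniProt.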

# Visualize the data

### Draw the figures

Now, let's draw the plots:

In [None]:
from genophenocorr.view import VariantTranscriptProteinArtist

artist = VariantTranscriptProteinArtist()

### TODO - implement

In [None]:
artist.draw_variants(variants, tx_coordinates, protein_meta)