<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#load-input-data" data-toc-modified-id="load-input-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>load input data</a></span></li><li><span><a href="#set-data-splits" data-toc-modified-id="set-data-splits-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>set data splits</a></span><ul class="toc-item"><li><span><a href="#MHC-dimension" data-toc-modified-id="MHC-dimension-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>MHC dimension</a></span></li><li><span><a href="#Protein-dimension" data-toc-modified-id="Protein-dimension-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Protein dimension</a></span></li></ul></li></ul></div>

In [1]:
import numpy as np
import pandas as pd
import pdb
from argparse import ArgumentParser
import shlex
from tqdm import tqdm

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

import pMHC
from pMHC import OUTPUT_FOLDER, SEP, \
    SPLITS, SPLIT_TRAIN, SPLIT_VAL, SPLIT_TEST, \
    VIEWS, VIEW_SA, VIEW_SAMA, VIEW_DECONV, \
    INPUT_PEPTIDE, INPUT_CONTEXT
from pMHC.logic import PresentationPredictor
from pMHC.data import from_input
from pMHC.data.example import Sample, Observation, MhcAllele

tqdm.pandas()

In [2]:
import networkx as nx
from pMHC.data.example import Sample, Observation, MhcAllele
from pMHC.data.split import overview_split, \
    find_connected_mhc_alleles, suggest_split_mhc_alleles, save_split_mhc_alleles, load_split_mhc_alleles, \
    find_split_proteins, load_split_proteins

In [3]:
pMHC.set_paths(r"C:\Users\tux\Documents\MScProject")

Update project folder to: C:\Users\tux\Documents\MScProject
Load permutation


In [4]:
parser = ArgumentParser()
parser = PresentationPredictor.add_argparse_args(parser)
parser = Trainer.add_argparse_args(parser)

# Only the output folder is interpolated, so the other pieces are plain strings.
# Single quotes keep the backslashes of a Windows path intact in shlex.split.
argString = "--gpus 0 --num_workers 0 "
argString += "--precision 32 --datasources 'Edi;Atlas' "
argString += f"--default_root_dir '{OUTPUT_FOLDER}{SEP}' "
argString += "--batch_size 8 --learning_rate 0.00001 "
argString += "--input_mode CONTEXT --mhc_rep FULL "
args = parser.parse_args(shlex.split(argString))

print(args)

trainer = Trainer.from_argparse_args(args, checkpoint_callback=False, logger=False)
args_dict = vars(args)
model = PresentationPredictor(**args_dict)

GPU available: True, used: False
TPU available: False, using: 0 TPU cores


Namespace(accelerator=None, accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, backbone='TAPE', batch_size=8, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, datasources='Edi;Atlas', decoys_per_obs=1, default_root_dir='\\', deterministic=False, distributed_backend=None, fast_dev_run=False, flush_logs_every_n_steps=100, gpus=0, gradient_clip_algorithm='norm', gradient_clip_val=0.0, head='Cls', head_hidden_features=512, input_mode='CONTEXT', learning_rate=1e-05, limit_predict_batches=1.0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_every_n_steps=50, log_gpu_memory=None, logger=True, max_epochs=None, max_seq_length=260, max_steps=None, max_time=None, mhc_rep='FULL', min_epochs=None, min_steps=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', name='model', num_nodes=1, num_processes=1, num_sanity_val_steps=2, num_workers=0, 

Global seed set to 42
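
The argument string in cell In [4] wraps `--default_root_dir` in single quotes before passing it through `shlex.split`. The sketch below illustrates why that matters on Windows: in its default POSIX mode, `shlex.split` treats an unquoted backslash as an escape character and drops it, while text inside single quotes is kept verbatim. The folder name and the two-argument parser are hypothetical and serve only as an illustration; they are not part of `pMHC`.

```python
import shlex
from argparse import ArgumentParser

# Hypothetical Windows-style folder, used only for illustration.
folder = r"C:\Users\tux\out"

# Unquoted: each backslash escapes the next character and is dropped.
unquoted = shlex.split(f"--default_root_dir {folder} --batch_size 8")
# Single-quoted: the path survives tokenization unchanged.
quoted = shlex.split(f"--default_root_dir '{folder}' --batch_size 8")

print(unquoted[1])  # path with the backslashes stripped
print(quoted[1])    # path as written

# Minimal stand-in parser (assumption: not the PresentationPredictor parser).
parser = ArgumentParser()
parser.add_argument("--default_root_dir", type=str)
parser.add_argument("--batch_size", type=int)
args = parser.parse_args(quoted)
print(args.default_root_dir, args.batch_size)
```

Building the list of tokens directly, instead of assembling and re-splitting a string, would avoid the quoting question altogether; the string form mainly mirrors what a command line would look like.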


# load input data

In [5]:
from_input()

MhcAllele.from_input


MhcAlleles from input: 11074it [00:01, 10599.69it/s]


Protein.from_input


Proteins from input: 185930it [00:16, 11031.29it/s]


Sample.from_input


Samples from input: 472it [00:00, 9142.35it/s]


Peptide.from_input


Peptides from input: 429339it [00:39, 10780.40it/s]


Observation.from_input


Observations from input: 1959736it [03:09, 10351.38it/s]


# set data splits

In [6]:
overview_split()



ABSOLUTE
                             train        val       test   val-prot  test-prot    val-mhc   test-mhc 
Observations in splits:  1,959,736          0          0          0          0          0          0 
                    SA:    293,334          0          0          0          0          0          0 
                    MA:  1,666,402          0          0          0          0          0          0 


RELATIVE
                             train        val       test   val-prot  test-prot    val-mhc   test-mhc 
Observations in splits:     100.0%       0.0%       0.0%       0.0%       0.0%       0.0%       0.0% 
                    SA:      15.0%       0.0%       0.0%       0.0%       0.0%       0.0%       0.0% 
                    MA:      85.0%       0.0%       0.0%       0.0%       0.0%       0.0%       0.0% 


## MHC dimension

In [6]:
suggest_split_mhc_alleles()

Global seed set to 42


group name:    samples    observations
HLA-A01   :         99         359,049
HLA-A02   :        156         505,417
HLA-A03   :         92         644,773
HLA-A11   :         80         200,286
HLA-A23   :         25          73,141
HLA-A24   :         60         299,863
HLA-A25   :          3          14,163
HLA-A26   :          9          16,779
HLA-A29   :         24         244,649
HLA-A30   :         16          22,321
HLA-A31   :         18         199,800
HLA-A32   :         39         272,779
HLA-A33   :          2           8,015
HLA-A34   :          2           7,749
HLA-A36   :          1           3,960
HLA-A66   :          1           2,532
HLA-A68   :         75         246,119
HLA-A69   :         22          50,198
HLA-A74   :          1           3,543
HLA-B07   :         66         378,396
HLA-B08   :         37         128,339
HLA-B13   :         18          42,331
HLA-B14   :         47         219,850
HLA-B15   :         75         228,832
HLA-B18   :         25   

In [7]:
# Allele groups suggested directly for the test and validation splits
test_mhc_allele_groups = ['HLA-C17', 'HLA-C01', 'HLA-C15', 'HLA-B58', 'HLA-B37']
val_mhc_allele_groups = ['HLA-B51', 'HLA-B53', 'HLA-B39', 'HLA-B47', 'HLA-B08']

# Add insular allele groups, i.e. groups that are not connected to any other group
test_mhc_allele_groups += ['HLA-A33', 'HLA-A36', 'HLA-A74', 'HLA-B46', 'HLA-B54']
val_mhc_allele_groups += ['HLA-A34', 'HLA-A66', 'HLA-B42', 'HLA-B52', 'HLA-B56']

save_split_mhc_alleles(test_mhc_allele_groups, val_mhc_allele_groups)
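
The cell above distinguishes allele groups that are connected to others from insular ones. As a rough sketch of what "connected" could mean, the code below assumes that two allele groups are linked whenever they occur together in the same sample, and then collects the connected components of that co-occurrence graph. This is an assumption for illustration only, not the implementation of `find_connected_mhc_alleles`; the sample compositions and the helper `connected_groups` are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: each maps to the allele groups it expresses.
samples = {
    "s1": ["HLA-A01", "HLA-B08"],
    "s2": ["HLA-A01", "HLA-B07"],
    "s3": ["HLA-A33"],  # never observed together with another group
    "s4": ["HLA-A02", "HLA-B51"],
}

def connected_groups(samples):
    """Return the connected components of the co-occurrence graph."""
    neighbours = defaultdict(set)
    for groups in samples.values():
        for group in groups:
            neighbours[group].update(groups)  # includes the group itself

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited group.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

components = connected_groups(samples)
# Insular groups are components with a single member.
insular = [c[0] for c in components if len(c) == 1]
print(components)
print(insular)
```

Holding out a whole component keeps its allele groups from leaking into training through multi-allelic samples. With `networkx`, which the notebook imports, `nx.connected_components` on an equivalent graph would give the same components.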

In [8]:
load_split_mhc_alleles()

In [9]:
overview_split()



ABSOLUTE
                             train        val       test   val-prot  test-prot    val-mhc   test-mhc 
Observations in splits:  1,549,098    204,306    206,332          0          0    204,306    206,332 
                    SA:    226,796     23,771     42,767          0          0     23,771     42,767 
                    MA:  1,322,302    180,535    163,565          0          0    180,535    163,565 


RELATIVE
                             train        val       test   val-prot  test-prot    val-mhc   test-mhc 
Observations in splits:      79.0%      10.4%      10.5%       0.0%       0.0%      10.4%      10.5% 
                    SA:      11.6%       1.2%       2.2%       0.0%       0.0%       1.2%       2.2% 
                    MA:      67.5%       9.2%       8.3%       0.0%       0.0%       9.2%       8.3% 
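
The RELATIVE table can be reproduced from the ABSOLUTE one by dividing every count by the total number of observations. The sketch below does this for the train, val and test columns, with the counts copied from the output above; the dictionary layout is only an illustration and not how `overview_split` stores its data.

```python
# Absolute counts copied from the overview_split() output above.
absolute = {
    "SA": {"train": 226_796, "val": 23_771, "test": 42_767},
    "MA": {"train": 1_322_302, "val": 180_535, "test": 163_565},
}

# Every observation is either SA or MA, so the grand total is the sum of both rows.
total = sum(sum(row.values()) for row in absolute.values())

# Share of each cell in the grand total, in percent with one decimal.
relative = {
    kind: {split: round(100 * n / total, 1) for split, n in row.items()}
    for kind, row in absolute.items()
}

# Share of each split across SA and MA combined.
per_split = {
    split: round(100 * (absolute["SA"][split] + absolute["MA"][split]) / total, 1)
    for split in ("train", "val", "test")
}

print(total)
print(relative)
print(per_split)
```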


## Protein dimension

In [10]:
find_split_proteins()

100%|█████████████████████████████████████████████████████████████████████████| 47699/47699 [00:01<00:00, 26638.14it/s]
100%|███████████████████████████████████████████████████████████████████████| 196110/196110 [00:02<00:00, 80687.05it/s]
77027it [00:01, 75506.38it/s]
100%|█████████████████████████████████████████████████████████████████████████| 67135/67135 [00:00<00:00, 85929.81it/s]
108903it [00:02, 46735.54it/s]
100%|█████████████████████████████████████████████████████████████████████| 5263223/5263223 [01:35<00:00, 54946.27it/s]
Global seed set to 42


In [11]:
load_split_proteins()

In [12]:
overview_split()



ABSOLUTE
                             train        val       test   val-prot  test-prot    val-mhc   test-mhc 
Observations in splits:  1,407,876    274,378    277,482     70,072     71,150    204,306    206,332 
                    SA:    205,591     33,920     53,823     10,149     11,056     23,771     42,767 
                    MA:  1,202,285    240,458    223,659     59,923     60,094    180,535    163,565 


RELATIVE
                             train        val       test   val-prot  test-prot    val-mhc   test-mhc 
Observations in splits:      71.8%      14.0%      14.2%       3.6%       3.6%      10.4%      10.5% 
                    SA:      10.5%       1.7%       2.7%       0.5%       0.6%       1.2%       2.2% 
                    MA:      61.3%      12.3%      11.4%       3.1%       3.1%       9.2%       8.3% 
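
In the final overview, the val and test counts equal the sums of their protein and MHC parts, and the observations removed from train by the protein split equal val-prot plus test-prot. The sketch below checks this arithmetic with the numbers copied from the tables above; reading the two dimensions as non-overlapping is an inference from the counts, not something the notebook states.

```python
# "Observations in splits" row after the MHC split only (cell In [9]).
train_after_mhc = 1_549_098

# "Observations in splits" row after the protein split (cell In [12]).
final = {
    "train": 1_407_876, "val": 274_378, "test": 277_482,
    "val-prot": 70_072, "test-prot": 71_150,
    "val-mhc": 204_306, "test-mhc": 206_332,
}

# val and test are the sums of their protein and MHC parts.
val_ok = final["val"] == final["val-prot"] + final["val-mhc"]
test_ok = final["test"] == final["test-prot"] + final["test-mhc"]

# Observations that the protein split moved out of train.
moved = train_after_mhc - final["train"]
moved_ok = moved == final["val-prot"] + final["test-prot"]

# The three top-level splits still cover every observation.
total = final["train"] + final["val"] + final["test"]

print(val_ok, test_ok, moved_ok, moved, total)
```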
