**Installing PowerNovo from PyPI**

In [1]:
!pip install powernovo==1.0.9

Collecting powernovo==1.0.9
  Downloading powernovo-1.0.9-py3-none-any.whl (53 kB)
Collecting numpy~=1.26.3 (from powernovo==1.0.9)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Collecting transformers~=4.36.2 (from powernovo==1.0.9)
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Collecting pandas~=2.1.4 (from powernovo==1.0.9)
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)

**Mount Google Drive to load the MGF file**

In [2]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


**Example 1. Simple pipeline launch**

In [4]:
import os
from powernovo.run import run_inference
# Path to the MGF file
input_mgf = os.path.join('/content/gdrive/My Drive', 'MassiveKB_Test.mgf')
# Run the pipeline
run_inference(input_mgf, batch_size=4, use_bert=True)

powernovo 2024-03-07 13:05:02,084 INFO Check environment...
INFO:powernovo:Check environment...
powernovo 2024-03-07 13:05:02,091 INFO Environment check completed successfully
INFO:powernovo:Environment check completed successfully
powernovo 2024-03-07 13:05:02,527 INFO Model loaded successfully pwn_work/models/pwn_spectrum.pt
INFO:powernovo:Model loaded successfully pwn_work/models/pwn_spectrum.pt
powernovo 2024-03-07 13:05:02,531 INFO Use device: cpu
INFO:powernovo:Use device: cpu
powernovo 2024-03-07 13:05:02,536 INFO Load peptide bert model...
INFO:powernovo:Load peptide bert model...
powernovo 2024-03-07 13:05:06,375 INFO Peptide bert model loaded successfully
INFO:powernovo:Peptide bert model loaded successfully
powernovo 2024-03-07 13:05:06,400 INFO Preprocessing input file: /content/gdrive/My Drive/MassiveKB_Test.mgf
INFO:powernovo:Preprocessing input file: /content/gdrive/My Drive/MassiveKB_Test.mgf
powernovo 2024-03-07 13:05:06,428 INFO Preprocessing completed successfully
IN

KeyboardInterrupt: 

**Example 2. Launch the pipeline, receive each batch of predictions in a callback function, and print them**

The callback receives one dictionary per batch with the following structure:

`{scan_id: {"predicted": predicted_record, "annotation": annotation_string}}`

**Where:**


```
predicted_record = {'sequence': str,       # predicted sequence as a string
                    'canonical_seq': str,  # canonical form of the predicted sequence (without modifications)
                    'mass_error': float,   # mass error (delta ppm)
                    'score': float,        # confidence score [0..1] of the prediction
                    'mod_dict': dict,      # dict of modifications with positions
                    'mod_str': str,        # string representation of the modifications
                    'aa_scores': str       # confidence score [0..1] of each amino acid (space-separated string)
                    }
```
annotation: the annotation as a string, if present in the MGF file.
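
As an alternative to printing, a callback can flatten each batch into rows for later analysis. The following is a minimal sketch that relies only on the structure described above. The helper name `collect_predictions`, the `rows` list, and every value in `mock_batch` are hypothetical and hand-written for illustration; they do not come from PowerNovo. The mock record contains only the fields the helper reads.

```python
from typing import Any


def collect_predictions(batch: dict[str, dict[str, Any]], rows: list[dict[str, Any]]) -> None:
    """Append one flat row per scan in `batch` to `rows` (hypothetical helper)."""
    for scan_id, item in batch.items():
        record = item["predicted"]
        # 'aa_scores' is documented as a space-separated string of per-residue scores
        aa_scores = [float(s) for s in record["aa_scores"].split()]
        rows.append({
            "scan_id": scan_id,
            "sequence": record["sequence"],
            "score": record["score"],
            "mass_error": record["mass_error"],
            "min_aa_score": min(aa_scores) if aa_scores else None,
            "annotation": item.get("annotation"),
        })


# Hand-written batch that follows the documented structure; all values are made up.
mock_batch = {
    "index=0": {
        "predicted": {
            "sequence": "PEPTIDE",
            "canonical_seq": "PEPTIDE",
            "mass_error": 1.5,
            "score": 0.9,
            "aa_scores": "0.9 0.8 0.95 0.7 0.99 0.6 0.85",
        },
        "annotation": "PEPTIDE",
    }
}

rows: list[dict[str, Any]] = []
collect_predictions(mock_batch, rows)
print(rows[0]["scan_id"], rows[0]["min_aa_score"])
```

To use this with the pipeline, the helper could be wrapped so that it matches the single-argument callback shown in this notebook, for example `callback_fn=lambda batch: collect_predictions(batch, rows)`.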

In [5]:
# The MGF file from MassIVE-KB is annotated, so we print both the predictions
# and the original peptides from the annotations.
def my_callback(batch):
    for scan_id in batch:
        predicted_seq = batch[scan_id]['predicted']['sequence']
        score = batch[scan_id]['predicted']['score']
        true_seq = batch[scan_id]['annotation']
        print(f'ScanID: {scan_id}')
        print(f'Predicted seq.: {predicted_seq}   Score: {score}')
        print(f'True seq: {true_seq}')


Run the pipeline, passing our callback function as a parameter:

In [6]:
import os
from powernovo.run import run_inference
# Path to the MGF file
input_mgf = os.path.join('/content/gdrive/My Drive', 'MassiveKB_Test.mgf')
# Run the pipeline
run_inference(input_mgf, annotated_spectrum=True, batch_size=4, use_bert=True, callback_fn=my_callback)

powernovo 2024-03-07 13:17:15,272 INFO Check environment...
INFO:powernovo:Check environment...
powernovo 2024-03-07 13:17:15,277 INFO Environment check completed successfully
INFO:powernovo:Environment check completed successfully
powernovo 2024-03-07 13:17:15,726 INFO Model loaded successfully pwn_work/models/pwn_spectrum.pt
INFO:powernovo:Model loaded successfully pwn_work/models/pwn_spectrum.pt
powernovo 2024-03-07 13:17:15,730 INFO Use device: cpu
INFO:powernovo:Use device: cpu
powernovo 2024-03-07 13:17:15,735 INFO Load peptide bert model...
INFO:powernovo:Load peptide bert model...
powernovo 2024-03-07 13:17:19,968 INFO Peptide bert model loaded successfully
INFO:powernovo:Peptide bert model loaded successfully
powernovo 2024-03-07 13:17:19,999 INFO Preprocessing input file: /content/gdrive/My Drive/MassiveKB_Test.mgf
INFO:powernovo:Preprocessing input file: /content/gdrive/My Drive/MassiveKB_Test.mgf
powernovo 2024-03-07 13:17:20,015 INFO The datafile already exists, but the pa

/content/gdrive/My Drive/MassiveKB_Test.mgf: 0spectra [00:00, ?spectra/s]

powernovo 2024-03-07 13:17:35,856 INFO Preprocessing completed successfully
INFO:powernovo:Preprocessing completed successfully
powernovo 2024-03-07 13:17:35,881 INFO Start processing. Total items: 3488 
INFO:powernovo:Start processing. Total items: 3488 
  0%|          | 1/3488 [01:43<99:52:49, 103.12s/it]

ScanID: index=0
Predicted seq.: (+42.01)AAAADSFSGGPAGVRLPR   Score: 0.75
True seq: [Acetyl]-AAAADSFSGGPAGVRLPR
ScanID: index=1
Predicted seq.: (+42.01)AAAEGPVGDGELWQTWLPNHVVFLR   Score: 0.82
True seq: [Acetyl]-AAAEGPVGDGELWQTWLPNHVVFLR
ScanID: index=2
Predicted seq.: (+42.01)AADALEQQER   Score: 1.0
True seq: [Acetyl]-AADALEEQQR
ScanID: index=3
Predicted seq.: (+42.01)AAFRDLEEVSQGLLSLLGANGV   Score: 0.72
True seq: [Acetyl]-AAFRDLEEVSQGLLSLLGANR


  0%|          | 2/3488 [03:17<94:46:58, 97.88s/it] 

ScanID: index=4
Predicted seq.: (+42.01)AALGPSSQNVTEMVLLTR   Score: 0.67
True seq: [Acetyl]-AALGPSSQNVTEYVVRVP
ScanID: index=5
Predicted seq.: (+42.01)AALMTPGTGAPPAPGDFSGEGSQGLPDPSPEPK   Score: 1.0
True seq: [Acetyl]-AALMTPGTGAPPAPGDFSGEGSQGLPDPSPEPK
ScanID: index=6
Predicted seq.: (+42.01)AALMTPGTGAPPAPGDFSGEGSQGLPDPSPEPK   Score: 1.0
True seq: [Acetyl]-AALMTPGTGAPPAPGDFSGEGSQGLPDPSPEPK
ScanID: index=7
Predicted seq.: (+42.01)AALTHHPAAMSNGNMNTMGHMMEMMGSR   Score: 1.0
True seq: [Acetyl]-AALTHHPAAMSNGNMNTMGHMMEMMGSR


  0%|          | 3/3488 [05:13<102:50:57, 106.24s/it]

ScanID: index=8
Predicted seq.: (+42.01)AANATTNPSQLLPLELVDKC(+57.02)LGSR   Score: 0.34
True seq: [Acetyl]-AANATTNPSQLLPLELVDKC[Carbamidomethyl]LGSR
ScanID: index=9
Predicted seq.: (+42.01)AAPEGSGLGEDARLEQD   Score: 0.68
True seq: [Acetyl]-AAPEGSGLGEDARLDQE
ScanID: index=10
Predicted seq.: AASDELSGLTLSPMVMDAK   Score: 1.0
True seq: [Acetyl]-AASDELSKTLSPMVMDAK
ScanID: index=11
Predicted seq.: (+42.01)AASSLTVTLGR   Score: 0.68
True seq: [Acetyl]-AASSLTVTLGR


  0%|          | 4/3488 [07:18<110:00:40, 113.67s/it]

ScanID: index=12
Predicted seq.: (+42.01)AATSGTDEPVSGELVSVAHALSLPAESYGND   Score: 0.83
True seq: [Acetyl]-AATSGTDEPVSGELVSVAHALSLPAESYGND
ScanID: index=13
Predicted seq.: EAVLTLHLQTSGLK   Score: 0.99
True seq: [Acetyl]-AAVKTLNPKAEVAR
ScanID: index=14
Predicted seq.: (+42.01)AC(+57.02)NALEDAQSTR   Score: 0.46
True seq: [Acetyl]-AC[Carbamidomethyl]NALEDAQSTR
ScanID: index=15
Predicted seq.: (+42.01)AC(+57.02)PALGLEALGAPLQPEPPPEPAFSEA   Score: 0.65
True seq: [Acetyl]-AC[Carbamidomethyl]PALGLEALQPLQPEPPPEPAFSEA


  0%|          | 5/3488 [08:37<97:56:52, 101.24s/it] 

ScanID: index=16
Predicted seq.: (+42.01)ADDFGFFSSSSEGAPEAAEEDPAAA   Score: 0.76
True seq: [Acetyl]-ADDFGFFSSSESGAPEAAEEDPAAA
ScanID: index=17
Predicted seq.: (+42.01)ADDLGKGGNEESTKTGNAGSR   Score: 1.0
True seq: [Acetyl]-ADDLGKGGNEESTKTGNAGSR
ScanID: index=18
Predicted seq.: (+42.01)ADGYAGNPDSK   Score: 0.6
True seq: [Acetyl]-ADGYNQPDSK


  0%|          | 6/3488 [10:54<109:39:22, 113.37s/it]

ScanID: index=20
Predicted seq.: (+42.01)AEALTYADLRFVKAPLKK   Score: 0.77
True seq: [Acetyl]-AEALTYADLRFVKAPLKK
ScanID: index=21
Predicted seq.: AEAMDLGKDPNGPTHSSTLFVRDFGSSMSEYVRPSPAR   Score: 0.98
True seq: [Acetyl]-AEAMDLGKDPNGPTHSSTLFVRDDGSSMSFYVRPSPAK
ScanID: index=22
Predicted seq.: (+42.01)AEELVLERC(+57.02)DLELETNGRDHHTADLC(+57.02)REK   Score: 0.58
True seq: [Acetyl]-AEELVLERC[Carbamidomethyl]DLELETNGRDHHTADLC[Carbamidomethyl]REK
ScanID: index=23
Predicted seq.: (+42.01)AEELVLERC(+57.02)DLELETNGRDHHTADLC(+57.02)REK   Score: 0.58
True seq: [Acetyl]-AEELVLERC[Carbamidomethyl]DLELETNGRDHHTADLC[Carbamidomethyl]REK


  0%|          | 7/3488 [12:52<111:05:49, 114.89s/it]

ScanID: index=24
Predicted seq.: VTLQQ(+.98)LEQQLEEAQTENFNLK   Score: 0.73
True seq: [Acetyl]-AELQSLEQQLEEAQTENFNLK
ScanID: index=25
Predicted seq.: (+42.01)AERAALEELVKLQGERVRGLK   Score: 0.77
True seq: [Acetyl]-AERAALEELVKLQGERVRGLK
ScanID: index=26
Predicted seq.: (+42.01)AETLSGLGDSGAAGAAALSSASSETGTR   Score: 1.0
True seq: [Acetyl]-AETLSGLGDSGAAGAAALSSASSETGTR
ScanID: index=27
Predicted seq.: PFADAM(+15.99)EVLPSTLAENAGLNPLSTVTELR   Score: 0.77
True seq: [Acetyl]-AFADAMEVLPSTLAENAGLNPLSTVTELR


  0%|          | 7/3488 [14:22<119:11:52, 123.27s/it]


KeyboardInterrupt: 
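
In the output above, predictions use mass-shift notation such as `(+42.01)` and `C(+57.02)`, while the annotations use modification names such as `[Acetyl]-` and `C[Carbamidomethyl]`. To compare the two automatically, one option is to rewrite the annotation into mass-shift notation first. The sketch below assumes only the two name-to-shift pairs visible in this output. The names `MOD_TO_SHIFT`, `annotation_to_mass_notation`, and `is_exact_match` are hypothetical and are not part of PowerNovo.

```python
import re

# Name-to-shift pairs as they appear in the output above; extend as needed.
MOD_TO_SHIFT = {"Acetyl": "+42.01", "Carbamidomethyl": "+57.02"}


def annotation_to_mass_notation(annotation: str) -> str:
    """Rewrite '[Acetyl]-AC[Carbamidomethyl]K' as '(+42.01)AC(+57.02)K'."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Unknown modification names are left untouched so mismatches stay visible
        return f"({MOD_TO_SHIFT[name]})" if name in MOD_TO_SHIFT else match.group(0)

    # The optional '-' covers the N-terminal form '[Acetyl]-'
    return re.sub(r"\[([^\]]+)\]-?", replace, annotation)


def is_exact_match(predicted: str, annotation: str) -> bool:
    """Strict string comparison after converting the annotation."""
    return predicted == annotation_to_mass_notation(annotation)


# (predicted, annotation) pairs copied from the output above
pairs = [
    ("(+42.01)AAAADSFSGGPAGVRLPR", "[Acetyl]-AAAADSFSGGPAGVRLPR"),
    ("(+42.01)AADALEQQER", "[Acetyl]-AADALEEQQR"),
    ("(+42.01)AC(+57.02)NALEDAQSTR", "[Acetyl]-AC[Carbamidomethyl]NALEDAQSTR"),
]
matches = [is_exact_match(predicted, annotation) for predicted, annotation in pairs]
print(matches)  # [True, False, True]
```

The comparison is deliberately strict: it treats the isobaric residues I and L as different and does not tolerate swapped neighbours. The second pair (index=2) shows why a check like this is useful, since that prediction was reported with a score of 1.0 but differs from the annotation.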