# Precursor Charge State Prediction

This notebook presents a short walkthrough of the process of reading a dataset and training a model for precursor charge state prediction. The dataset is an example dataset extracted from a ProteomeTools dataset generated in the **Chair of Bioanalytics** at the **School of Life Sciences** at the **Technical University of Munich**.

DLOmix is the framework used here; it is a custom wrapper on top of Keras/TensorFlow.

In [None]:
# install the DLOmix package in the current environment using pip

!python -m pip install -q dlomix

In [1]:
import dlomix
dlomix.__version__

'0.1.9'

The available modules in the framework are as follows:

- `constants`: constants to be used in the framework (e.g. amino acid alphabet mapping)
- `data`: classes for representing datasets, wrappers around Hugging Face datasets that process input data and generate tensor datasets
- `eval`: custom evaluation metrics implemented in Keras/TF to work as `metrics` for model training
- `layers`: custom layer implementations required for the different models
- `models`: different model implementations (e.g. for Retention Time Prediction or Charge State Prediction)
- `pipelines`: complete pipelines to run a task (e.g. Retention Time prediction)

**Note**: `reports` and `pipelines` are work in progress; some functionality is not complete.

In [2]:
from dlomix import constants, data, eval, layers, models, pipelines, reports
print([x for x in dir(dlomix) if not x.startswith("_")])


Avaliable feature extractors are (use the key of the following dict and pass it to features_to_extract in the Dataset Class):
{
   "atom_count": "Atom count of PTM.",
   "delta_mass": "Delta mass of PTM.",
   "mod_gain": "Gain of atoms due to PTM.",
   "mod_loss": "Loss of atoms due to PTM.",
   "red_smiles": "Reduced SMILES representation of PTM."
}.
When writing your own feature extractor, you can either
    (1) use the FeatureExtractor class or
    (2) write a function that can be mapped to the Hugging Face dataset.
In both cases, you can access the parsed sequence information from the dataset using the following keys, which all provide python lists:
    - _parsed_sequence: parsed sequence
    - _n_term_mods: N-terminal modifications
    - _c_term_mods: C-terminal modifications



2025-02-11 15:22:12.499138: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-11 15:22:12.501060: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-11 15:22:12.528817: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-11 15:22:12.528846: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-11 15:22:12.529725: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

['META_DATA', 'constants', 'data', 'eval', 'layers', 'losses', 'models', 'pipelines', 'reports', 'types']


## Required Imports

In [3]:
import tensorflow as tf

from dlomix.constants import PTMS_ALPHABET
from dlomix.data import ChargeStateDataset
from dlomix.eval import adjusted_mean_absolute_error
from dlomix.models import ChargeStatePredictor

## 1. Load Data

We can import the dataset class and create an object of type `ChargeStateDataset`. This object wraps a Hugging Face dataset and can generate TensorFlow Dataset objects or Torch Dataset objects for training, validation, and testing. The splits can be controlled with the arguments `val_ratio`, `val_data_source`, and `test_data_source`.

The most important columns of the charge state dataset are:
* "modified_sequence": the peptide sequences, with modifications annotated using the UNIMOD encoding.
* "charge_state_dist": the relative charge state distribution per peptide. Use it with `model_flavour="relative"`, which is the default.
* "most_abundant_charge_state": the most abundant charge state per peptide, as a binary vector. Use it with `model_flavour="dominant"`.
* "observed_charge_states": all observed charge states per peptide, as a binary vector. Use it with `model_flavour="observed"`.
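To make the relationship between the three label columns more concrete, the following sketch derives each encoding from one set of per-charge intensities. The intensity values, the helper names, and the assumption that the six positions correspond to charges 1 to 6 are illustrative only; they are not taken from the code that prepared the dataset.

```python
def relative_distribution(intensities):
    """Normalize intensities so that they sum to 1 (relative flavour)."""
    total = sum(intensities)
    return [value / total for value in intensities]


def most_abundant_one_hot(intensities):
    """Binary vector with a single 1 at the most intense charge (dominant flavour)."""
    top = max(range(len(intensities)), key=intensities.__getitem__)
    return [1 if index == top else 0 for index in range(len(intensities))]


def observed_multi_hot(intensities):
    """Binary vector with a 1 for every charge that was seen at all (observed flavour)."""
    return [1 if value > 0 else 0 for value in intensities]


# Hypothetical intensities for charges 1..6 of a single peptide
intensities = [0.0, 30.0, 10.0, 0.0, 0.0, 0.0]

print("charge_state_dist:         ", relative_distribution(intensities))
print("most_abundant_charge_state:", most_abundant_one_hot(intensities))
print("observed_charge_states:    ", observed_multi_hot(intensities))
```

In this example the peptide was observed with charges 2 and 3, charge 2 being the most abundant, so the three columns carry increasingly coarse views of the same measurement.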

In [None]:
DATA_PATH = "Wilhelmlab/prospect-ptms-charge"   # complete PROSPECT dataset prepared for charge state prediction
BATCH_SIZE = 8

In [None]:
d = ChargeStateDataset(
    data_format="hub",
    data_source=DATA_PATH,
    sequence_column="modified_sequence",
    label_column="charge_state_dist",   # use this column for relative charge state distribution
    max_seq_len=30,
    batch_size=BATCH_SIZE,
)

Now we have a charge state (CS) dataset that can be used directly with standard or custom `Keras` models. This wrapper contains the splits we chose when creating it; the printed dataset below lists `train`, `val`, and `test`. To get the TF Datasets for training and validation, we use the attributes `.tensor_train_data` and `.tensor_val_data`.

In [6]:
print("Hugging Face Dataset:", d)

print("Training examples:", len(d["train"]))
print("one training batch looks like:")
for x in d.tensor_train_data:
    print(x)
    break

print("Validation examples:", len(d["val"]))

Hugging Face Dataset: DatasetDict({
    train: Dataset({
        features: ['modified_sequence', 'charge_state_dist', '_parsed_sequence', '_n_term_mods', '_c_term_mods'],
        num_rows: 1138142
    })
    val: Dataset({
        features: ['modified_sequence', 'charge_state_dist', '_parsed_sequence', '_n_term_mods', '_c_term_mods'],
        num_rows: 326769
    })
    test: Dataset({
        features: ['modified_sequence', 'charge_state_dist', '_parsed_sequence', '_n_term_mods', '_c_term_mods'],
        num_rows: 161588
    })
})
Training examples: 1138142
one training batch looks like:


Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


(<tf.Tensor: shape=(8, 30), dtype=int64, numpy=
array([[21,  7,  2, 18,  3, 13,  1, 18,  8,  1,  1,  8,  8, 16, 15, 22,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21, 16, 13, 11, 20, 16,  8,  8, 17, 13, 12,  8, 10, 15, 10,  4,
        22,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21,  8, 10,  2, 16,  8, 14,  6,  5,  9,  3, 22,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21, 11, 10,  5, 10, 13,  4, 16,  1, 15, 19,  8, 14, 15, 22,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21, 12,  6, 19, 16,  7,  9,  3, 10, 10, 15, 22,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21, 17,  3, 17,  6, 20,  6, 14, 16, 16, 20, 22,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [21, 16,  4, 20, 12, 18, 17, 12, 12, 16, 22,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0

## 2. Model

We can now create the model. We will use the relative charge state distribution version (set via the parameter `model_flavour="relative"`) of the Prosit-based precursor charge state prediction model `ChargeStatePredictor`. Its default arguments work out of the box, but most of the parameters can be customized.

**Note**: It is important to ensure that the padding length used for the dataset object (`max_seq_len`) is equal to the sequence length passed to the model (`seq_length`).
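The batch printed in the data loading section shows each encoded sequence filled with zeros up to 30 positions. The sketch below mimics that padding for one of those rows and checks the result against the model's sequence length. The helper `pad_sequence` is hypothetical and not part of DLOmix; in particular, raising an error for overly long sequences is an assumption, since how the dataset class treats them is not shown here.

```python
def pad_sequence(encoded, max_len, pad_value=0):
    """Right-pad an integer-encoded sequence to max_len (hypothetical helper)."""
    if len(encoded) > max_len:
        raise ValueError(f"sequence length {len(encoded)} exceeds max_len {max_len}")
    return list(encoded) + [pad_value] * (max_len - len(encoded))


MAX_SEQ_LEN = 30        # value passed as max_seq_len to the dataset
MODEL_SEQ_LENGTH = 30   # value passed as seq_length to the model

# Non-zero part of the third row of the training batch printed earlier
encoded = [21, 8, 10, 2, 16, 8, 14, 6, 5, 9, 3, 22]
padded = pad_sequence(encoded, MAX_SEQ_LEN)

# The model expects inputs of exactly this length
assert len(padded) == MODEL_SEQ_LENGTH
print(padded)
```

Keeping both values in named constants, or reusing a single constant for both, is one simple way to avoid the two settings drifting apart.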

The three model flavours of `ChargeStatePredictor` are:

1. Dominant Charge State Prediction:
   - Task: Predict the dominant charge state of a given peptide sequence.
   - Model: Uses a deep learning model (RNN-based) inspired by Prosit's architecture to predict the most likely charge state.

2. Observed Charge State Prediction:
   - Task: Predict the observed charge states for a given peptide sequence.
   - Model: Uses a multi-label classification approach to predict all possible charge states.

3. Relative Charge State Prediction:
   - Task: Predict the proportion of each charge state for a given peptide sequence.
   - Model: Uses a regression approach to predict the proportion of each charge state.
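The three flavours differ in the kind of target they predict, which usually also affects the choice of loss. The sketch below computes, in plain Python, one loss commonly paired with each kind of target. This notebook only uses mean squared error for the relative flavour; the pairings shown for the dominant and observed flavours are common conventions and an assumption here, not something DLOmix prescribes. All target and prediction values are made up.

```python
import math


def mean_squared_error(target, predicted):
    """Regression loss, used later in this notebook for the relative flavour."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def categorical_crossentropy(target, predicted, eps=1e-7):
    """Single-label classification loss (assumed pairing for the dominant flavour)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def binary_crossentropy(target, predicted, eps=1e-7):
    """Multi-label classification loss (assumed pairing for the observed flavour)."""
    total = 0.0
    for t, p in zip(target, predicted):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)


# Hypothetical targets and predictions for one peptide, six charge classes
relative_target, relative_pred = [0.0, 0.75, 0.25, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0, 0.0, 0.0]
dominant_target, dominant_pred = [0, 1, 0, 0, 0, 0], [0.05, 0.8, 0.1, 0.03, 0.01, 0.01]
observed_target, observed_pred = [0, 1, 1, 0, 0, 0], [0.1, 0.9, 0.7, 0.2, 0.05, 0.01]

mse = mean_squared_error(relative_target, relative_pred)
cce = categorical_crossentropy(dominant_target, dominant_pred)
bce = binary_crossentropy(observed_target, observed_pred)

print(f"relative / mean squared error:        {mse:.4f}")
print(f"dominant / categorical cross-entropy: {cce:.4f}")
print(f"observed / binary cross-entropy:      {bce:.4f}")
```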

In [7]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

In [8]:
model = ChargeStatePredictor(
    num_classes=6, seq_length=30, alphabet=PTMS_ALPHABET, model_flavour="relative"
)

## 3. Training

We can then train the model like a standard Keras model. You can observe the loss value decreasing during training.

In [10]:
model.compile(
    optimizer=optimizer,
    loss="mean_squared_error",
    metrics=[adjusted_mean_absolute_error],
)

In [11]:
history = model.fit(
    d.tensor_train_data, 
    validation_data=d.tensor_val_data,
    epochs=1,  # reduced for demonstration
)

  3554/142268 [..............................] - ETA: 3:33:36 - loss: 0.0442 - adjusted_mean_absolute_error: 0.1117

KeyboardInterrupt: 
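The step count in the progress bar above follows directly from the dataset size and the batch size: 1,138,142 training examples in batches of 8 give 142,268 steps per epoch. The short calculation below reproduces that number and shows how the step count would change for larger batch sizes. The alternative batch sizes are arbitrary examples, and fewer steps do not necessarily shorten training time proportionally.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


TRAIN_EXAMPLES = 1_138_142  # size of the train split printed earlier

for batch_size in (8, 64, 512):  # 8 is the value used in this notebook
    steps = steps_per_epoch(TRAIN_EXAMPLES, batch_size)
    print(f"batch_size={batch_size:>3}: {steps} steps per epoch")
```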

### Train History

In [23]:
import matplotlib.pyplot as plt

In [24]:
def plot_learning_curves(history, title='Learning Curves'):
    history_dict = history.history
    loss = history_dict['loss']
    val_loss = history_dict.get('val_loss', [])
    
    epochs = range(1, len(loss) + 1)
    
    plt.figure(figsize=(8, 5))
    plt.plot(epochs, loss, 'b-', label='Training Loss')
    if val_loss:
        plt.plot(epochs, val_loss, 'r-', label='Validation Loss')
    plt.title(title)
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

In [None]:
plot_learning_curves(history, title='Charge State Distribution Model')

## 4. Testing

The ChargeStateDataset also contains a test dataset to test our model.

**Note**: Currently there is no reporting module available for CS prediction.

In [26]:
test_targets = d["test"]["charge_state_dist"]
test_sequences = d["test"]["modified_sequence"]

In [27]:
predictions = model.predict(test_sequences)
print(test_sequences[:5])
print(test_targets[:5])
print(predictions[:5])
print(predictions.shape, len(test_targets))

1022/5050 [=====>........................] - ETA: 3:32

KeyboardInterrupt: 
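Once predictions are available, they can be compared with the targets using simple summary numbers. The sketch below uses small made-up arrays in place of `test_targets` and `predictions` and computes a plain mean absolute error as well as the fraction of peptides for which the most abundant charge state agrees. Both helpers are hypothetical; in particular, this plain mean absolute error is not claimed to match DLOmix's `adjusted_mean_absolute_error`.

```python
def mean_absolute_error(targets, predictions):
    """Average absolute difference over all peptides and charge classes."""
    total, count = 0.0, 0
    for target_row, predicted_row in zip(targets, predictions):
        for t, p in zip(target_row, predicted_row):
            total += abs(t - p)
            count += 1
    return total / count


def dominant_charge_agreement(targets, predictions):
    """Fraction of peptides whose highest-scoring class matches the target's."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(t) == argmax(p) for t, p in zip(targets, predictions))
    return matches / len(targets)


# Made-up stand-ins for test_targets and predictions (two peptides, six classes)
example_targets = [
    [0.0, 0.75, 0.25, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
]
example_predictions = [
    [0.0, 0.25, 0.75, 0.0, 0.0, 0.0],
    [0.0, 0.25, 0.75, 0.0, 0.0, 0.0],
]

mae = mean_absolute_error(example_targets, example_predictions)
agreement = dominant_charge_agreement(example_targets, example_predictions)
print("mean absolute error:", mae)
print("dominant charge agreement:", agreement)
```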

## 5. Saving and Loading Models

Models can be saved the same way any Keras model would be saved. It is better to save the weights rather than the whole model, since this makes loading the model again easier and more platform-independent. The extra step needed is to create a model object and then load the weights.

In [None]:
# save the model weights

save_path = "./output/csd_model"
model.save_weights(save_path)

In [None]:
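# models can be loaded later by creating a model object and then loading the weights;
# the constructor arguments must match those of the trained model so that the weights fit

trained_model = ChargeStatePredictor(
    num_classes=6, seq_length=30, alphabet=PTMS_ALPHABET, model_flavour="relative"
)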
trained_model.load_weights(save_path)
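Because only the weights are saved, the model has to be rebuilt with the same constructor arguments before loading them. One possible way to keep the two in sync is to store those arguments in a small JSON file next to the weights. The sketch below shows this with the standard library only; the helper names and the file name are hypothetical, and the DLOmix calls appear only as comments because they need the trained model from this notebook.

```python
import json
import os
import tempfile

# Constructor arguments used for the model in this notebook. The alphabet is
# left out here and passed separately, since it is imported from dlomix.constants.
MODEL_KWARGS = {"num_classes": 6, "seq_length": 30, "model_flavour": "relative"}


def save_model_kwargs(kwargs, path):
    """Write the constructor arguments to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(kwargs, handle, indent=2)


def load_model_kwargs(path):
    """Read the constructor arguments back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "csd_model_kwargs.json")  # hypothetical file name
    save_model_kwargs(MODEL_KWARGS, config_path)
    restored_kwargs = load_model_kwargs(config_path)

print(restored_kwargs == MODEL_KWARGS)

# With DLOmix available, the model could then be rebuilt and the weights loaded:
# trained_model = ChargeStatePredictor(alphabet=PTMS_ALPHABET, **restored_kwargs)
# trained_model.load_weights(save_path)
```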