# Fine-tuning with the ProkBERT Model Family
This notebook demonstrates how to utilize ProkBERT's pre-trained models for transfer learning tasks. We will apply the model to identify promoter sequences, framed as a binary classification problem where each segment is assigned a label.

The main steps include:
- Preparing the dataset and defining the label for each segment.
- Tokenizing nucleotide sequences.
- Configuring training parameters such as learning rate, epochs, batch size, etc.
- Training and evaluating the model.


## Setting Up the Environment

While ProkBERT can operate on CPUs, leveraging GPUs significantly accelerates the process. Google Colab offers free GPU usage (subject to time and memory limits), making it an ideal platform for trying and experimenting with ProkBERT models.

## Enabling and Testing the GPU (if you are using Google Colab)

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down


Next, we install the `datasets` library and import the modules used throughout the notebook:


In [None]:
# Install the Hugging Face datasets library
!pip install datasets

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import matthews_corrcoef, accuracy_score, roc_auc_score, recall_score
from datasets import load_dataset
import torch
import os


Next, we'll confirm that PyTorch can access the GPU:

In [None]:
# Check if CUDA (GPU support) is available
if not torch.cuda.is_available():
    raise SystemError('GPU device not found')
else:
    device_name = torch.cuda.get_device_name(0)
    print(f'Found GPU at: {device_name}')
num_cores = os.cpu_count() 
print(f'Number of available CPU cores: {num_cores}')    

## Loading the Pretrained Model and Tokenizer for Transfer Learning

In this section, we load the ProkBERT model and its tokenizer from Hugging Face to perform transfer learning for sequence classification tasks. Transfer learning allows us to leverage the knowledge captured during pretraining on large genomic datasets and fine-tune the model for specific downstream tasks, such as promoter sequence classification.

### Model Overview

The `ProkBertForSequenceClassification` class is a specialized architecture for sequence classification tasks. It extends the `ProkBertPreTrainedModel` by adding a custom classification head on top of the ProkBERT base model. Key features include:

- **Weighted Sum Pooling**: Instead of relying solely on the `[CLS]` token, the model computes a weighted sum of the hidden states across all sequence positions. The weights are produced by a learned linear layer and normalized with a softmax, so every position can contribute to the pooled representation.

- **Classification Head**: A fully connected layer maps the pooled output to logits, corresponding to the number of labels in the task.

- **Loss Computation**: For binary classification, the model computes the Cross-Entropy loss based on the logits and ground truth labels.
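
To make the weighted sum pooling concrete, here is a minimal, framework-free sketch. It is not the ProkBERT implementation: the function name `weighted_sum_pool`, the `score_weights` and `score_bias` parameters, and the toy numbers are illustrative assumptions, with the scoring weights standing in for the learned linear layer described above.

```python
import math

def weighted_sum_pool(hidden_states, score_weights, score_bias=0.0):
    """Pool per-position hidden vectors into one vector.

    hidden_states: list of L vectors, each of dimension H.
    score_weights: H weights of a hypothetical linear scoring layer.
    """
    # One scalar score per position: w . h_t + b
    scores = [
        sum(w * h for w, h in zip(score_weights, vec)) + score_bias
        for vec in hidden_states
    ]
    # Softmax over positions (subtract the max for numerical stability)
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Weighted sum of the hidden vectors, one output value per dimension
    dim = len(hidden_states[0])
    pooled = [
        sum(a * vec[d] for a, vec in zip(attn, hidden_states))
        for d in range(dim)
    ]
    return pooled, attn

# Toy input: 3 positions, hidden size 2
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, attn = weighted_sum_pool(hidden, score_weights=[0.5, -0.5])
print("position weights:", attn)
print("pooled vector:", pooled)
```

With all scoring weights set to zero the softmax becomes uniform and the pooled vector reduces to the mean of the hidden states, which is a useful sanity check.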

### Tokenizer Alignment

The tokenizer, which uses Local Context Aware (LCA) tokenization, processes the nucleotide sequences into input tokens for the model. It is crucial that the tokenizer parameters, such as `k-mer` size and `shift`, align with the settings used during the model's pretraining. For this example, we use the `neuralbioinfo/prokbert-mini` model with its pre-configured tokenizer.
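
The role of the k-mer size and shift can be illustrated with a simplified sketch. This is not the ProkBERT tokenizer, which also handles vocabulary lookup and special tokens; the helper `kmer_windows` and the parameter values below are assumptions chosen only for illustration.

```python
def kmer_windows(sequence, k, shift):
    """Split a nucleotide sequence into k-mers taken every `shift` bases."""
    if k <= 0 or shift <= 0:
        raise ValueError("k and shift must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, shift)]

seq = "ATGCATGCAT"
# Overlapping windows that advance one base at a time
print(kmer_windows(seq, k=6, shift=1))
# The same k-mer size, advancing two bases at a time
print(kmer_windows(seq, k=6, shift=2))
```

The two calls produce different token streams for the same sequence, which is why the tokenizer should be loaded from the same checkpoint as the model rather than configured by hand.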

### Loading the Model

We use the `AutoTokenizer` and `AutoModelForSequenceClassification` classes to load the tokenizer and model from Hugging Face. This ensures compatibility with the model's architecture and pretrained weights.



In [None]:
model_name_path = 'neuralbioinfo/prokbert-mini'
tokenizer = AutoTokenizer.from_pretrained(model_name_path, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name_path, trust_remote_code=True)



## Sequence Data Preparation

This project focuses on prokaryotic promoter sequences, labeled with `y=1` for known promoters and `y=0` otherwise. Each sequence includes an 80 bp region with the transcription start site (TSS) located at position 60.

The dataset is preprocessed to ensure clean nucleotide sequences without empty or invalid entries. For more details on preprocessing, refer to the [segmentation notebook](https://github.com/nbrg-ppcu/prokbert/blob/main/examples/Segmentation.ipynb).
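
If you bring your own sequences, a quick validity check along these lines may help. This is a sketch under the assumption that "clean" means a non-empty string over the alphabet A, C, G, T; the helper `is_clean_segment` is hypothetical and does not reproduce the preprocessing in the segmentation notebook.

```python
VALID_BASES = set("ACGT")

def is_clean_segment(segment):
    """Return True for a non-empty string made up only of A, C, G, T."""
    if not isinstance(segment, str) or not segment:
        return False
    # Case-insensitive check against the assumed nucleotide alphabet
    return set(segment.upper()) <= VALID_BASES

candidates = ["ACGTACGT", "", "ACGNNNGT", "acgtacgt", None]
clean = [s for s in candidates if is_clean_segment(s)]
print(clean)
```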

## Tokenizing and Preparing the Dataset

This block handles the preparation of prokaryotic promoter sequences for model training by tokenizing, masking, and labeling the data.

### Key Steps:

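1. **Tokenization**:
   - Sequences are tokenized using `batch_encode_plus` with padding and special tokens added.
   - The attention mask is set to 0 at positions holding token IDs 2 and 3, so the model ignores those tokens.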

2. **Label Preparation**:
   - The `y` column from the dataset is converted into a PyTorch tensor labeled as `labels`, which are included in the output dictionary.

3. **Dataset Loading and Preprocessing**:
   - The dataset is loaded from Hugging Face (`neuralbioinfo/bacterial_promoters`).
   - It is randomized using a fixed seed (`42`) for reproducibility.

4. **Tokenization Mapping**:
   - The `tokenize_function` is applied across the dataset in batches using multiple CPU cores for efficiency.

This prepares the dataset for training with ProkBERT, ensuring clean and compatible inputs for the model.


In [None]:
def tokenize_function(examples):
    # Tokenize the input sequences
    encoded = tokenizer.batch_encode_plus(
        examples["segment"],
        padding=True,
        add_special_tokens=True,
        return_tensors="pt",
    )
    
    # Get the input_ids and attention_mask
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]
    
    # Mask tokens with IDs 2 and 3 in a vectorized way
    mask_tokens = (input_ids == 2) | (input_ids == 3)  # Identify where tokens 2 and 3 occur
    attention_mask[mask_tokens] = 0  # Set attention_mask to 0 for these tokens
    y = torch.tensor(examples["y"], dtype=torch.int64)    
    
    # Return the updated dictionary
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": y,  # Include labels for training
    }
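
# Load the promoter dataset and shuffle it reproducibly
dataset = load_dataset("neuralbioinfo/bacterial_promoters")
randomize_dataset = dataset.shuffle(seed=42)
# flatten_indices returns a new dataset, so keep the result
randomize_dataset = randomize_dataset.flatten_indices()
# Tokenize in batches across the available CPU cores
tokenized_dataset = randomize_dataset.map(tokenize_function, batched=True, num_proc=num_cores)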


### Evaluation Metrics

For promoter prediction, the following metrics are computed to evaluate model performance:

- **MCC (Matthews Correlation Coefficient)**: A balanced metric that considers true and false positives and negatives. 

- **Accuracy**: The proportion of correct predictions among all samples, providing an overall sense of the model's performance.

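- **Recall**: The fraction of samples of a class that the model identifies correctly; for the positive class, this is the share of true promoters that are recovered. Here it is computed with `average='weighted'`, which averages the per-class recalls weighted by class support.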

- **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**: Evaluates the model's capability to distinguish between classes. For binary classification, it focuses on the positive class probabilities; for multiclass classification, it uses a one-vs-rest approach.
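
The definitions above can be checked by hand on a small example. The sketch below uses plain Python and toy labels (both illustrative, not taken from the dataset) to compute MCC from the confusion counts, returning 0 when the denominator is zero. It also shows that, for binary labels, support-weighted recall coincides with accuracy.

```python
import math

def confusion_counts(labels, predictions):
    """Count TP, TN, FP, FN for binary labels (1 = promoter, 0 = other)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def weighted_recall(tp, tn, fp, fn):
    """Per-class recall averaged with weights equal to class support."""
    pos, neg = tp + fn, tn + fp
    recall_pos = tp / pos if pos else 0.0
    recall_neg = tn / neg if neg else 0.0
    return (pos * recall_pos + neg * recall_neg) / (pos + neg)

labels = [1, 1, 0, 0, 0, 1]
predictions = [1, 0, 0, 0, 1, 1]
counts = confusion_counts(labels, predictions)
accuracy = (counts[0] + counts[1]) / len(labels)
print("TP, TN, FP, FN:", counts)
print("MCC:", mcc(*counts))
print("weighted recall:", weighted_recall(*counts), "accuracy:", accuracy)
```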



In [None]:
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)  # Predicted class indices

    # Convert logits to class probabilities with a numerically stable softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    probs = exp_shifted / exp_shifted.sum(axis=-1, keepdims=True)

    # Calculate metrics
    mcc = matthews_corrcoef(labels, predictions)
    acc = accuracy_score(labels, predictions)
    recall = recall_score(labels, predictions, average='weighted')

    try:
        # Adjust for binary or multiclass classification
        if probs.shape[1] == 2:  # Binary classification
            roc_auc = roc_auc_score(labels, probs[:, 1])  # Positive class probabilities
        else:  # Multiclass classification
            roc_auc = roc_auc_score(labels, probs, multi_class='ovr')  # Probabilities for all classes
    except ValueError:
        # Handle edge cases where ROC-AUC cannot be computed
        roc_auc = float('nan')

    return {
        "mcc": mcc,
        "accuracy": acc,
        "recall": recall,
        "roc_auc": roc_auc,
    }





### Training the ProkBERT Model

We fine-tune the ProkBERT model for promoter prediction using a customized training setup. The key steps include configuring training parameters, initializing the Trainer, and starting the training process.

#### Training Configuration
The `TrainingArguments` define the training and evaluation setup:
- **Batch Size**: Both training and evaluation use a batch size of 128 for efficient processing.
- **Evaluation Strategy**: Evaluations are conducted every 50 steps to monitor progress.
- **Logging**: Training metrics are logged every 50 steps, and the logs are stored in the `./logs` directory.
- **Learning Rate**: A learning rate of `0.0004` is used for gradient updates.
- **Epochs**: The model is trained for 1 epoch, suitable for demonstration or initial fine-tuning.
- **Checkpointing**: At most one checkpoint is kept on disk (`save_total_limit=1`) to conserve storage.
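
To get a feel for how these settings interact, the arithmetic can be sketched as follows. The helper `training_schedule` is hypothetical, and the example count of 200,000 training segments is a placeholder rather than the actual dataset size. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

def training_schedule(num_examples, batch_size=128, epochs=1, eval_steps=50):
    """Return (steps per epoch, total steps, number of periodic evaluations)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # One evaluation each time the step counter reaches a multiple of eval_steps
    num_evals = total_steps // eval_steps
    return steps_per_epoch, total_steps, num_evals

# Placeholder dataset size, used only to illustrate the calculation
steps_per_epoch, total_steps, num_evals = training_schedule(200_000)
print(steps_per_epoch, total_steps, num_evals)
```

Halving the batch size doubles the number of optimizer steps per epoch, and with `eval_steps` fixed it also roughly doubles how often the evaluation set is scored.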

#### Trainer Initialization
The Hugging Face `Trainer` API simplifies model training by:
- Automatically handling batching, gradient updates, and evaluation.
- Using `compute_metrics` to evaluate the model's performance on the test dataset (`test_sigma70`).

#### Training Process
The `trainer.train()` method starts the training loop, fine-tuning the ProkBERT model on the provided training dataset. Progress is logged, and metrics are computed periodically to assess the model's learning.

This setup efficiently adapts the pre-trained ProkBERT model to the promoter prediction task.


In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=128,
    per_device_train_batch_size=128,
    logging_dir="./logs",
    report_to="none",
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=50,
    num_train_epochs=1,
    learning_rate=0.0004,
    logging_steps=50,
    save_total_limit=1,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test_sigma70'],
    compute_metrics=compute_metrics,
)
# Start training
trainer.train()

# Fine-tuned Model

Checkpoints saved during training are written to the `output_dir` set in `TrainingArguments` (`./results` here), from where the fine-tuned model can be loaded for further evaluation or deployment. The configuration above is a starting point: tuning the hyperparameters can noticeably improve performance on a specific task or dataset.
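
To keep an explicit copy of the final weights, one option is to save the model and tokenizer side by side once training has finished. The sketch below relies on `Trainer.save_model` and the tokenizer's `save_pretrained`; the helper name `save_finetuned` and the directory in the usage comment are hypothetical, and whether the custom ProkBERT tokenizer reloads cleanly from that directory is an assumption worth verifying.

```python
def save_finetuned(trainer, tokenizer, output_dir):
    """Persist the fine-tuned model and its tokenizer in one directory."""
    # Writes the model weights and configuration
    trainer.save_model(output_dir)
    # Stores the tokenizer files next to the model so both reload together
    tokenizer.save_pretrained(output_dir)
    return output_dir

# Usage after trainer.train() (the path is illustrative):
# save_finetuned(trainer, tokenizer, "./prokbert-mini-promoter")
```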

## Considerations for Further Optimization

- **Experiment with Hyperparameters**: Adjust learning rate, batch size, number of epochs, and other training parameters to find the optimal configuration for your specific use case.
- **Cross-validation**: Use cross-validation techniques to ensure that your model generalizes well across different subsets of your data.
- **Data Augmentation**: Explore data augmentation strategies for sequence data, such as introducing random mutations or utilizing synthetic data generation, to increase the robustness of your model.
- **Advanced Architectures**: Consider experimenting with different model architectures or adding layers (e.g., convolutional layers) to improve the model's capacity to capture complex patterns in the data.
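
As one example of the data augmentation idea above, random point mutations can be sketched in a few lines. The helper `mutate_sequence` and the default rate are illustrative assumptions. Whether a mutated promoter should keep its original label is a biological question to validate before using such data for training.

```python
import random

def mutate_sequence(sequence, rate=0.02, seed=None):
    """Return a copy of `sequence` with random single-base substitutions."""
    rng = random.Random(seed)  # Local generator so results are reproducible
    bases = "ACGT"
    mutated = []
    for base in sequence:
        if base in bases and rng.random() < rate:
            # Substitute with one of the three other bases
            mutated.append(rng.choice([b for b in bases if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)

original = "ACGTACGTACGTACGTACGT"
augmented = mutate_sequence(original, rate=0.1, seed=42)
print(original)
print(augmented)
```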

## Closing Remarks

Fine-tuning a pre-trained model like ProkBERT offers a powerful approach to leveraging large language models for biological sequence analysis. By carefully selecting and optimizing your model's hyperparameters, you can achieve significant improvements in performance.
