# Inference and Evaluation with the Finetuned Models

In this notebook, we demonstrate how one can evaluate various finetuned models on both the promoter and phage test datasets.

The main steps are:
  * Preparing the models and datasets
  * Setting up the parameters for the evaluation
  * Running inference and collecting the results


## Setting Up the Environment

While ProkBERT can operate on CPUs, leveraging GPUs significantly accelerates the process. Google Colab offers free GPU usage (subject to time and memory limits), making it an ideal platform for trying and experimenting with ProkBERT models.


### Enabling and testing the GPU (if you are using Google Colab)

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down



### Setting up the packages and the installs
First, we'll install the `datasets` package and import the libraries used throughout this notebook:


In [None]:
!pip install datasets

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import matthews_corrcoef, accuracy_score, roc_auc_score, recall_score
from datasets import load_dataset
import torch
import os


Next, we'll confirm that we can connect to the GPU with PyTorch:

In [None]:
# Check if CUDA (GPU support) is available
if not torch.cuda.is_available():
    raise SystemError('GPU device not found')
else:
    device_name = torch.cuda.get_device_name(0)
    print(f'Found GPU at: {device_name}')
num_cores = os.cpu_count() 
print(f'Number of available CPU cores: {num_cores}')    
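
The check above stops the notebook when no GPU is present. Since ProkBERT can also run on CPUs, a softer variant is sketched below. It falls back to the CPU instead of raising. The helper name `pick_device` is hypothetical and not part of the original notebook; the `Trainer` used later selects its device on its own, so this is only for reporting.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

Inference on a CPU works the same way but is considerably slower, so a smaller test split may be preferable in that case.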

## Loading the Finetuned Model and the Tokenizer for Promoter Identification

Next, we will download the finetuned model for promoter identification. For more details about the model, please see the Finetuning notebook.


In [None]:
finetuned_model = "neuralbioinfo/prokbert-mini-promoter"
tokenizer = AutoTokenizer.from_pretrained(finetuned_model, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(finetuned_model, trust_remote_code=True)



## Loading the Dataset

In this section, we will evaluate the test sets of the promoter dataset. We have two different types of tests: a sigma70 test (known E. coli promoters) referred to as the 'test_sigma70' set, and a multispecies dataset, which consists of promoters from various species as well as CDS sequences (non-promoters) and randomly generated sequences. For a more detailed description, see: [Bacterial Promoters Dataset](https://huggingface.co/datasets/neuralbioinfo/bacterial_promoters)



### Tokenization of Dataset

The following code defines a function to tokenize the input sequences and applies it to a prokaryotic promoter dataset. 

#### Key Components:

1. **`tokenize_function`**:
   - Tokenizes the nucleotide sequences found in the `segment` field of the dataset.
   - Utilizes `batch_encode_plus` from the tokenizer to:
     - Add padding for uniform sequence lengths.
     - Include special tokens required by the model.
     - Return the encoded output as PyTorch tensors (`return_tensors="pt"`).

2. **Dataset Loading**:
   - The dataset is sourced from Hugging Face's `neuralbioinfo/bacterial_promoters` dataset, specifically using the `test_sigma70` split.

3. **Tokenization Mapping**:
   - The `tokenize_function` is applied to the dataset in a **batched** manner.
   - The `num_proc` parameter allows the use of multiple CPU cores to speed up the tokenization process.
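
To make the effect of `padding=True` concrete, here is a toy sketch in plain Python. It is not the ProkBERT tokenizer: the token IDs are made up, and the padding ID of `0` is an assumption chosen for illustration. It only shows how shorter sequences are padded to the longest one in the batch and how the attention mask marks the padded positions.

```python
def pad_batch(token_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in token_ids)
    input_ids, attention_mask = [], []
    for seq in token_ids:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token sequences of different lengths
batch = pad_batch([[2, 15, 27, 3], [2, 15, 3]])
print(batch["input_ids"])       # [[2, 15, 27, 3], [2, 15, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```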


In [None]:
def tokenize_function(examples):
    # Tokenize the input sequences
    encoded = tokenizer.batch_encode_plus(
        examples["segment"],
        padding=True,
        add_special_tokens=True,
        return_tensors="pt",
    )
    # Return the updated dictionary
    return encoded
    
dataset = load_dataset("neuralbioinfo/bacterial_promoters", split='test_sigma70')
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=num_cores)





### Performing Inference with ProkBERT

The following code demonstrates how to perform inference using the finetuned ProkBERT model with the Hugging Face `Trainer` API.

#### Key Components:

1. **Defining `TrainingArguments`**:
   - **`output_dir`**: Specifies the directory where results and logs will be stored.
   - **`per_device_eval_batch_size`**: Sets the batch size for inference to 128 per device; lower it if the GPU runs out of memory.
   - **`logging_dir`**: Directory for logging during evaluation.
   - **`report_to`**: Disables logging to external services like Weights & Biases (W&B).

2. **Initializing the Trainer**:
   - The `Trainer` is initialized with:
     - **`model`**: The finetuned ProkBERT model used for inference.
     - **`args`**: Inference-specific arguments defined in `training_args`.

3. **Performing Inference**:
   - The `predict` method of the `Trainer` is used to perform inference on the `tokenized_dataset`.
   - The output, stored in `predictions`, contains the model's predictions for the input sequences.
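
As a rough illustration of what a batch size of 128 means for inference, the sketch below splits a list of examples into consecutive batches. It is not how `Trainer` is implemented internally; `iter_batches` and the placeholder sequences are hypothetical names used only here, and a single device is assumed.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 300 placeholder examples standing in for tokenized sequences
sequences = [f"seq_{i}" for i in range(300)]
batches = list(iter_batches(sequences, batch_size=128))

print(len(batches))               # 3
print([len(b) for b in batches])  # [128, 128, 44]
```

The last batch is smaller than the others whenever the dataset size is not a multiple of the batch size.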


In [None]:
from transformers import Trainer, TrainingArguments

# Define TrainingArguments for inference
training_args = TrainingArguments(
    output_dir="./results",              # Directory for storing logs/results
    per_device_eval_batch_size=128,     # Batch size for inference
    logging_dir="./logs",               # Directory for logs
    report_to="none",                   # Disable reporting to W&B or other loggers
)

# Initialize Trainer for inference
trainer = Trainer(
    model=model,                        # Finetuned ProkBERT model
    args=training_args,                 # Inference arguments
)

# Perform inference
predictions = trainer.predict(tokenized_dataset)

### Processing Predictions and Creating Final DataFrame

This code processes model predictions to generate class probabilities, predicted labels, and a final DataFrame for analysis.

#### Steps:

1. **Compute Probabilities**:
   - Apply `softmax` to model outputs to calculate probabilities for each class (`prob_class_0` and `prob_class_1`).

2. **Predict Classes**:
   - Use `np.argmax` to determine the predicted class (`predicted_y`) with the highest probability.

3. **Prepare DataFrame**:
   - Convert the dataset to a Pandas DataFrame.
   - Drop unnecessary columns (`segment`, `Strand`, `ppd_original_SpeciesName`).

4. **Add Predictions**:
   - Add columns for class probabilities, predicted class, and human-readable labels (`promoter` or `non_promoter`).


In [None]:
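import numpy as np
from scipy.special import softmax

# Process predictions: convert logits to class probabilities, then pick the most likely class
probabilities = softmax(predictions.predictions, axis=1)
predicted_y = np.argmax(probabilities, axis=1)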
dataset_df = dataset.to_pandas()
dataset_df.drop(columns=["segment", "Strand", "ppd_original_SpeciesName"], inplace=True)
dataset_df["prob_class_0"] = probabilities[:, 0]
dataset_df["prob_class_1"] = probabilities[:, 1]
dataset_df["predicted_y"] = predicted_y

# Add 'promoter' or 'non_promoter' label
dataset_df["predicted_label"] = dataset_df["predicted_y"].apply(lambda x: "promoter" if x == 1 else "non_promoter")

# Display final dataframe
dataset_df

## Evaluating the Promoter Model's Prediction Performance

This code evaluates the ProkBERT model on a labeled dataset using Hugging Face's `Trainer` and evaluation utilities from the ProkBERT package.

**Install ProkBERT**:
   - Install the ProkBERT package: `!pip install prokbert`.
   - Import the `compute_metrics_eval_prediction` function for computing evaluation metrics.


In [None]:
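!pip install prokbert

# Module path assumed from the ProkBERT package layout; adjust if your installed version differs
from prokbert.training_utils import compute_metrics_eval_prediction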


### Tokenizing the Evaluation Dataset

This code redefines `tokenize_function` to preprocess the evaluation dataset for ProkBERT. It tokenizes the `segment` sequences, adds padding and special tokens, and sets the `attention_mask` to 0 at every position whose token ID is `2` or `3`, so that the model ignores those tokens. The labels (`y`) are converted into PyTorch tensors and returned under the `labels` key, which is the name `Trainer` expects. The function is applied to the entire dataset in batches, using multiple CPU cores, to create a tokenized dataset with labels.
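
The attention-mask adjustment can be sketched on a small NumPy array. The token IDs below are made up; only the rule "zero the mask wherever the ID is 2 or 3" is taken from the notebook code.

```python
import numpy as np

# Made-up batch of two padded sequences (0 is used as padding here)
input_ids = np.array([[2, 15, 27, 3], [2, 15, 3, 0]])
attention_mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0]])

# Positions holding token ID 2 or 3 are removed from the attention mask
to_ignore = np.isin(input_ids, [2, 3])
adjusted_mask = attention_mask.copy()
adjusted_mask[to_ignore] = 0

print(adjusted_mask.tolist())  # [[0, 1, 1, 0], [0, 1, 0, 0]]
```

The padded position in the second row stays at 0, so after the adjustment only the remaining sequence tokens contribute to attention.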


In [None]:
# Tokenize the evaluation dataset
def tokenize_function(examples):
    encoded = tokenizer.batch_encode_plus(
        examples["segment"],
        padding=True,
        add_special_tokens=True,
        return_tensors="pt",
    )
    input_ids = encoded["input_ids"].clone().detach()
    attention_mask = encoded["attention_mask"].clone().detach()
    mask_tokens = (input_ids == 2) | (input_ids == 3)
    attention_mask[mask_tokens] = 0
    y = torch.tensor(examples["y"], dtype=torch.int64)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": y,
    }

tokenized_dataset_with_labels = dataset.map(tokenize_function, batched=True, num_proc=num_cores)


### Evaluating the Model

The `Trainer` is initialized for evaluation with the ProkBERT model, using the tokenized dataset and predefined evaluation arguments. The `compute_metrics_eval_prediction` function calculates performance metrics such as accuracy, precision, recall, and F1 score. The evaluation is performed using the `evaluate` method, and the results are stored in `evaluation_results`.
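
For readers who want to see how such metrics follow from the predictions, here is a minimal sketch that derives accuracy, recall, specificity, and the Matthews correlation coefficient from a binary confusion matrix. It is not the implementation of `compute_metrics_eval_prediction`; `binary_metrics` is a hypothetical helper and the labels are made up.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # MCC is defined as 0 here when any marginal count is zero
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up labels: 1 = promoter, 0 = non-promoter
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.75, recall 0.75, specificity 0.75, mcc 0.5
```

The same quantities are available from `sklearn.metrics` (`accuracy_score`, `recall_score`, `matthews_corrcoef`), which the notebook imports at the start.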


In [None]:
# Initialize the Trainer for evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_dataset_with_labels,
    compute_metrics=compute_metrics_eval_prediction,
)

# Perform evaluation
print("Starting evaluation...")
evaluation_results = trainer.evaluate()
evaluation_results

### Final Remarks
Stay curious, stay caffeinated, and happy coding! 😎💻
