#### Binary sentiment prediction using transformer-based models:
- **Objective**: Predict whether a movie review is negative (0) or positive (1).
- **Approach**:
  1. **Step 1**: Take a pre-trained transformer model and convert textual data into numerical representations (embeddings).
  2. **Step 2**: Fine-tune the model on labeled data (e.g., reviews with positive/negative labels) to perform the classification task.
  
- The process begins by converting text into numerical form with a transformer-based architecture.


#### Using the Rotten Tomatoes dataset for binary classification:
- **Task**: Binary classification, a common predictive task.
- **Application**: Sentiment analysis.
- **Objective**: Detect whether a document (e.g., review) is positive or negative.
- **Dataset**: Contains customer reviews labeled as either positive or negative (binary labels).


In [4]:
from datasets import load_dataset

In [5]:
dataset = load_dataset('cornell-movie-review-data/rotten_tomatoes')

In [6]:
import pandas as pd

In [7]:
# Convert the splits to pandas DataFrames for easier inspection
train_df = pd.DataFrame(dataset["train"])
eval_df  = pd.DataFrame(dataset["test"])

train_df.shape, eval_df.shape

((8530, 2), (1066, 2))

In [8]:
#pip install simpletransformers

#### Training a classifier with a transformer-based model:

- **Step 1**: Use a pre-trained Large Language Model (LLM), such as BERT, to convert textual data into numerical representations (embeddings).
  
  - **Optimization**: The LLM's weights are "frozen" during training to speed up the process, but this may reduce accuracy.

- **Step 2**: Add a classification head on top of the pre-trained model.
  - A single linear layer (classification head) is placed on top, which is fine-tuned for the binary classification task.
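The two steps above can be illustrated without downloading a model. The sketch below is a toy, pure-Python stand-in: `embedding` plays the role of the frozen encoder's output, and `LinearHead` is a hypothetical single linear layer that produces one logit per class. The sizes and weights are made up for illustration only.

```python
import random


class LinearHead:
    """Single linear layer: logits = W @ h + b (hypothetical toy version)."""

    def __init__(self, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        # Small random weights, as a freshly initialised head would have.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                       for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def __call__(self, h):
        # One dot product per class, plus that class's bias.
        return [sum(w * x for w, x in zip(row, h)) + b
                for row, b in zip(self.weight, self.bias)]


# Stand-in for a frozen encoder's output (real BERT-base uses 768 dimensions).
embedding = [0.2, -0.1, 0.4, 0.0]
head = LinearHead(hidden_size=4, num_labels=2)
logits = head(embedding)

# The predicted label is the index of the largest logit.
predicted_label = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted_label)
```

Only `head.weight` and `head.bias` would be updated during training in this setup; the embedding is treated as a fixed input.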


- The `simpletransformers` library is an easy-to-use wrapper around the Hugging Face Transformers library, designed to simplify the process of training and fine-tuning transformer models.
  
- It abstracts many of the complexities of working with transformer models, making it a great choice for users who want to quickly train models for various NLP tasks without diving into too much detail.
  
- `simpletransformers` supports several common NLP tasks, including:

    - Text classification (binary, multi-label, and multi-class)
    - Named entity recognition (NER)
    - Question-answering (QA)
    - Language modeling
    - Sequence-to-sequence tasks (e.g., translation, summarization)
    - Text generation

In [9]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [10]:
import pandas as pd

# Prepare your data
# Data should be in a DataFrame with two columns: "text" and "labels"
train_data = [
    ["I love transformers!", 1],
    ["Transformers are challenging.", 0],
    ["I enjoy learning about NLP.", 1],
    ["Sometimes transformers are difficult.", 0]
]

In [11]:
df = pd.DataFrame(train_data, columns=["text", "labels"])

In [12]:
# Define the model
# For this example, we'll use a RoBERTa model for binary classification
# model_type: The type of model (bert, xlnet, xlm, roberta, distilbert)
# model_name: The exact architecture and trained weights to use. 
#             This may be a Hugging Face Transformers compatible pre-trained model, 
#             a community model, or the path to a directory containing model files.
# tokenizer_type: The type of tokenizer 
#             (auto, bert, xlnet, xlm, roberta, distilbert, etc.)

# Specify model and tokenizer type
model_type     = "roberta"
model_name     = "roberta-base"
tokenizer_type = "roberta"

model = ClassificationModel(
    model_type    = model_type,
    model_name    = model_name,
    tokenizer_type= tokenizer_type,
    use_cuda      = False,
    args={
            'overwrite_output_dir': True
        }
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# Train the model with your data
global_step, training_details = model.train_model(df,
                                                show_running_loss       = True,
                                                evaluate_during_training= False)

0it [00:00, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 1:   0%|          | 0/1 [00:00<?, ?it/s]

In [14]:
# Output the results
print(f"Global Steps: {global_step}")
print(f"Training Details: {training_details}")

Global Steps: 1
Training Details: 0.7056629657745361


In [15]:
# Sample input for prediction
test_texts = ["Weather is very bad today"]

# predict() returns the predicted labels and the raw model outputs (logits),
# not probabilities
predictions, raw_outputs = model.predict(test_texts)

# Display predictions
print(predictions)   # predicted class label for each input text
print(raw_outputs)   # one row of logits per input text, one column per class

0it [00:00, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

[0]
[[ 0.05071136 -0.16328129]]
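The second value printed above is a pair of raw logits, not probabilities. A minimal sketch of how such logits can be turned into class probabilities with a softmax follows; the `softmax` helper is written here for illustration and is not part of `simpletransformers`.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


raw_outputs = [[0.05071136, -0.16328129]]  # logits printed above
probs = [softmax(row) for row in raw_outputs]
labels = [max(range(len(p)), key=p.__getitem__) for p in probs]

print(probs)   # class 0 gets roughly 0.55
print(labels)  # [0]
```

With only four training examples and a single optimizer step, the two probabilities are close to 0.5, so the prediction carries little confidence.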


Back to the original Rotten Tomatoes dataset:

In [19]:
pd.set_option('display.max_colwidth', None)

In [23]:
train_df.sample(5)

|      | text | label |
|------|------|-------|
| 6344 | imagine a really bad community theater production of west side story without the songs . | 0 |
| 8354 | this sort of cute and cloying material is far from zhang's forte and it shows . | 0 |
| 3388 | we've seen it all before in one form or another , but director hoffman , with great help from kevin kline , makes us care about this latest reincarnation of the world's greatest teacher . | 1 |
| 1586 | 'no es la mejor cinta de la serie , ni la mejor con brosnan a la cabeza , pero de que entretiene ni duda cabe . ' | 1 |
| 5359 | less worrying about covering all the drama in frida's life and more time spent exploring her process of turning pain into art would have made this a superior movie . | 0 |


In [25]:
# Set up model arguments
model_args = ClassificationArgs()

In [29]:
type(model_args)

simpletransformers.config.model_args.ClassificationArgs

In [32]:
import pprint

pprint.pprint(model_args)

ClassificationArgs(adafactor_beta1=None,
                   adafactor_clip_threshold=1.0,
                   adafactor_decay_rate=-0.8,
                   adafactor_eps=(1e-30, 0.001),
                   adafactor_relative_step=True,
                   adafactor_scale_parameter=True,
                   adafactor_warmup_init=True,
                   adam_betas=(0.9, 0.999),
                   adam_epsilon=1e-08,
                   best_model_dir='outputs/best_model',
                   cache_dir='cache_dir/',
                   config={},
                   cosine_schedule_num_cycles=0.5,
                   custom_layer_parameters=[],
                   custom_parameter_groups=[],
                   dataloader_num_workers=0,
                   do_lower_case=False,
                   dynamic_quantize=False,
                   early_stopping_consider_epochs=False,
                   early_stopping_delta=0,
                   early_stopping_metric='eval_loss',
                   early_stop

In [33]:
model_args.train_custom_parameters_only = True

#### `train_custom_parameters_only` Parameter Explanation

- **Default Value:** `False`
- **Purpose:** Controls whether training updates only the parameters named in `custom_parameter_groups` or `custom_layer_parameters`, or the entire model (including the base model).

---

##### When `train_custom_parameters_only = True`
- **What happens:**
  - Only the parameters listed in `custom_parameter_groups` or `custom_layer_parameters` are trained (e.g., the classification head on top of RoBERTa).
  - All other parameters, including the **pre-trained RoBERTa layers**, are **frozen**: their weights are not updated during training.

- **Use Case:**
  - Useful when you want **quick training** with limited data or computational resources.
  - The pre-trained RoBERTa layers stay unchanged; the model adapts to your dataset through the classifier (or other listed parameters) alone.

---

##### When `train_custom_parameters_only = False` (default)
- **What happens:**
  - The entire model, including both the pre-trained RoBERTa layers and the custom layers, is trained.
  - This means all parameters are updated during the fine-tuning process.

- **Use Case:**
  - Used when you want to **fine-tune the entire model** to fully adapt RoBERTa to the nuances of your specific dataset.
  - This is the typical setting for most fine-tuning tasks where you adjust both the base model and the custom layers for better task-specific performance.

---
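As a rough mental model (not the library's actual implementation), freezing means that only a chosen subset of named parameters receives gradient updates. The sketch below works on parameter names alone; `select_trainable` is a hypothetical helper and the name list is an abbreviated, illustrative subset of a RoBERTa classifier's parameters.

```python
def select_trainable(all_names, custom_names, train_custom_only):
    """Return the parameter names that would receive gradient updates."""
    if not train_custom_only:
        return list(all_names)  # full fine-tuning: everything trains
    custom = set(custom_names)
    return [n for n in all_names if n in custom]  # the rest stay frozen


# Abbreviated, illustrative subset of RoBERTa classifier parameter names.
all_names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "classifier.dense.weight",
    "classifier.dense.bias",
    "classifier.out_proj.weight",
    "classifier.out_proj.bias",
]
head_names = [n for n in all_names if n.startswith("classifier.")]

frozen_run = select_trainable(all_names, head_names, train_custom_only=True)
full_run = select_trainable(all_names, head_names, train_custom_only=False)
print(len(frozen_run), len(full_run))  # 4 6
```

In the frozen run only the four `classifier.*` entries are trainable, which is why this mode is faster but gives the encoder no chance to adapt to the task.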


In [34]:
model_args.custom_parameter_groups = [
    {
        "params": ["classifier.weight"],
        "lr": 1e-3,
    },
    {
        "params": ["classifier.bias"],
        "lr": 1e-3,
        "weight_decay": 0.0,
    },
]

#### `custom_parameter_groups`

- **Type:** List of dictionaries
- **Purpose:** Allows you to define custom parameter groups for fine-tuning specific parts of the model, assigning them different learning rates, weight decays, or other settings during optimization.

---

##### Structure of `custom_parameter_groups`:

Each dictionary in the list can include:
- **`params`:** A list of parameter names, as reported by `model.named_parameters()` (e.g., `"classifier.weight"`), to which the settings apply.
- **`lr`:** Learning rate for the specific parameter group.
- **`weight_decay`:** Weight decay for the specific group.
- You can also include other optimizer-specific settings.
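Assuming `params` must hold full parameter names (as in the `classifier.weight` and `classifier.bias` entries used earlier), a small helper can expand a name prefix into the matching names. `expand_prefix` and `build_group` are hypothetical helpers, and the name list is an abbreviated illustration; in practice it could be collected from the wrapped model's `named_parameters()`.

```python
def expand_prefix(all_names, prefix):
    """Return every parameter name that starts with `prefix`."""
    return [n for n in all_names if n.startswith(prefix)]


def build_group(all_names, prefix, **optimizer_kwargs):
    """Build one parameter-group dict with explicit parameter names."""
    return {"params": expand_prefix(all_names, prefix), **optimizer_kwargs}


# Abbreviated, illustrative list of BERT classifier parameter names.
all_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.output.dense.weight",
    "bert.encoder.layer.0.output.dense.bias",
    "classifier.weight",
    "classifier.bias",
]

groups = [
    build_group(all_names, "bert.encoder.layer.0.", lr=4e-5),
    build_group(all_names, "classifier.", lr=1e-3, weight_decay=0.0),
]
for group in groups:
    print(group)
```

The resulting list has the same shape as `custom_parameter_groups`: one dict per group, each with explicit names and its own optimizer settings.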

---

##### Example Usage:

```python
custom_parameter_groups = [
    {
        "params": ["roberta.embeddings.word_embeddings.weight"],            # One embedding parameter
        "lr": 5e-5,
        "weight_decay": 0.01
    },
    {
        "params": ["roberta.encoder.layer.0.attention.self.query.weight"],  # One parameter in encoder layer 0
        "lr": 4e-5,
        "weight_decay": 0.01
    },
    {
        "params": ["classifier.out_proj.weight"],                           # Output projection of the classification head
        "lr": 1e-4,
        "weight_decay": 0.0
    }
]
```

The table below lists name patterns for the main parameter groups of a RoBERTa classifier. The `*` and `[N]` markers are placeholders for readability; in `params`, list the full parameter names that match a pattern.

| **Layer Group**                   | **Name Pattern**                                 | **Description**                                           |
|-----------------------------------|--------------------------------------------------|-----------------------------------------------------------|
| **Embedding Layers**              | `"roberta.embeddings.*"`                         | Parameters for the embedding layers                       |
| **Transformer Encoder Layers**    | `"roberta.encoder.layer.*"`                      | Parameters for all transformer encoder layers             |
| **Specific Encoder Layers**       | `"roberta.encoder.layer.0.*"`                    | Parameters for a specific encoder layer (e.g., layer 0)   |
| **Attention Layers**              | `"roberta.encoder.layer.[N].attention.*"`        | Parameters for the attention sub-layer in layer N         |
| **Feedforward Layers**            | `"roberta.encoder.layer.[N].intermediate.*"`     | Parameters for the intermediate feedforward layer in N    |
| **Output Layers**                 | `"roberta.encoder.layer.[N].output.*"`           | Parameters for the output layer in transformer block N    |
| **Layer Normalization Layers**    | `"roberta.encoder.layer.[N].output.LayerNorm.*"` | Parameters for the LayerNorm in the output sub-layer of N |
| **Pooler Layer**                  | `"roberta.pooler.*"`                             | Parameters for the pooler layer                           |
| **Task-Specific Layers**          | `"classifier.*"`                                 | Parameters for the task-specific classification head      |


In [56]:
# Specify model and tokenizer type
model_type     = "bert"
model_name     = "bert-base-cased"
tokenizer_type = "bert"

model = ClassificationModel(
    model_type    = model_type,
    model_name    = model_name,
    tokenizer_type= tokenizer_type,
    use_cuda      = False,
    args={
            'overwrite_output_dir': True
        }
)

model = ClassificationModel(
    model_type    = model_type,
    model_name    = model_name,
    tokenizer_type= tokenizer_type,
    use_cuda      = False,
    args={
            # 'custom_layer_parameters' sets optimizer options (e.g. 'lr') per
            # encoder layer index, such as {"layer": 11, "lr": 1e-3}. It does not
            # add Dense or Dropout layers, so the default linear head is used.
            'custom_layer_parameters': [],
            'overwrite_output_dir': True,
            'num_train_epochs': 3,  # Number of epochs for training
            'train_batch_size': 16,  # Batch size for training
            'learning_rate': 5e-5,  # Learning rate
        }
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Using a Classification Head

The classification head is the small network added on top of the pre-trained model to adapt it to a classification task.
The `ClassificationModel` class from the `simpletransformers` library wraps the pre-trained model (here `bert-base-cased`) with a classification head suitable for binary or multi-class classification.

For BERT, this head is a dropout layer followed by a single fully connected (linear) layer that maps the model's pooled output
(whose size equals the hidden size of the base model) to one logit per class. The warning above confirms this: only `classifier.weight` and `classifier.bias` are newly initialized.
The head returns raw logits; softmax is applied only when computing the loss or when converting logits to probabilities.

In this case, since the labels are binary (0 and 1), the head outputs two logits, one per class,
which is why taking `np.argmax` over the model outputs yields the predicted label.
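For a sense of scale, the size of a single linear head follows directly from the hidden size and the number of output logits. The sketch assumes a hidden size of 768 (BERT-base); `linear_head_param_count` is a hypothetical helper.

```python
def linear_head_param_count(hidden_size, num_labels):
    """Weights (hidden_size x num_labels) plus one bias per label."""
    return hidden_size * num_labels + num_labels


two_logit_head = linear_head_param_count(hidden_size=768, num_labels=2)
one_logit_head = linear_head_param_count(hidden_size=768, num_labels=1)

print(two_logit_head)  # 1538
print(one_logit_head)  # 769
```

Either way the head is tiny compared with the encoder beneath it, which is why training only the head is so much cheaper than full fine-tuning.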


In [53]:
import numpy as np
from sklearn.metrics import f1_score

In [57]:
%%time
# Train the model
# takes a long time on CPUs
# 5 mins for 100 samples

model.train_model(train_df.sample(100))



0it [00:00, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/7 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/7 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/7 [00:00<?, ?it/s]

CPU times: total: 4min 22s
Wall time: 4min 2s


(21, 0.5545366051651183)
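The first value in the returned tuple, 21, can be checked by hand from the settings used above: 100 samples, a batch size of 16, and 3 epochs. The sketch assumes one optimizer step per batch (no gradient accumulation); `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps when every batch triggers one update."""
    batches_per_epoch = math.ceil(num_samples / batch_size)  # last batch may be partial
    return batches_per_epoch * num_epochs


steps = total_training_steps(num_samples=100, batch_size=16, num_epochs=3)
print(steps)  # 21
```

This matches the progress bars, which show 7 batches in each of the 3 epochs.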

In [58]:
# Predict unseen instances
eval_df_samples = eval_df.sample(5)
result, model_outputs, wrong_predictions = model.eval_model(eval_df_samples)



0it [00:00, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

In [59]:
y_pred = np.argmax(model_outputs, axis=1)

In [60]:
from sklearn.metrics import classification_report

# eval_df_samples.label holds the true labels; y_pred holds the predicted labels
print(classification_report(eval_df_samples.label, y_pred))

              precision    recall  f1-score   support

           0       0.60      1.00      0.75         3
           1       0.00      0.00      0.00         2

    accuracy                           0.60         5
   macro avg       0.30      0.50      0.38         5
weighted avg       0.36      0.60      0.45         5
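The report is consistent with the model predicting class 0 for all five samples: recall for class 0 is 1.00, and its precision is 3/5 = 0.60. The sketch below recomputes the per-class numbers by hand. The label order in `y_true` is an assumption (only the counts matter), and `per_class_metrics` is a hypothetical helper that returns 0.0 where a metric is undefined.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts consistent with the report: 3 negatives, 2 positives, all predicted 0.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]

class0 = per_class_metrics(y_true, y_pred, label=0)
class1 = per_class_metrics(y_true, y_pred, label=1)
print(class0)  # precision 0.6, recall 1.0, F1 about 0.75
print(class1)  # all zeros: class 1 was never predicted
```

Five evaluation samples are too few to judge the model; evaluating on the full `eval_df` would give a far more reliable report.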



