# **Model 1: `MarieAngeA13/Sentiment-Analysis-BERT`**

### ***IMPORTING LIBRARIES***

> The code below provides the foundation for setting up a Natural Language Processing (NLP) text classification project using the Hugging Face Transformers library. It begins by importing pandas, a powerful library for managing and analyzing tabular data. This is often used to preprocess and structure text data into a format suitable for modeling. The train_test_split function from sklearn is then brought in to divide the dataset into training and testing subsets, ensuring the model can be effectively evaluated.

> Key components of the Hugging Face Transformers library are also imported. The AutoTokenizer helps preprocess the text data by converting it into tokens that a transformer model can process. The AutoModelForSequenceClassification is used to load a pretrained model fine-tuned for text classification tasks.

> To manage training, the Trainer and TrainingArguments classes streamline the process by allowing configurations such as learning rate and batch size. The torch package and the Dataset class from torch.utils.data are imported to define a custom dataset and to use PyTorch's tensor operations. Together, these tools lay the groundwork for building a text classification system.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

### ***LOADING DATASET***

> The code below loads the dataset from an Excel file into a Pandas DataFrame for further analysis. The variable file_path specifies the location of the Excel file, in this case /content/TLC_student_feedback_dataset.xlsx. The pd.read_excel() function reads the data from that file and returns a DataFrame, a tabular structure that makes the dataset easy to manipulate, analyze, and visualize. The later cells expect this DataFrame to contain a feedback column with the texts and a label column with the target classes.

In [None]:
file_path = "/content/TLC_student_feedback_dataset.xlsx"
df = pd.read_excel(file_path)

### ***SPLITTING DATA INTO TRAINING AND VALIDATION SETS***

> The code below divides the dataset into training and validation subsets, an essential step in machine learning for evaluating model performance. The feedback column (the input texts) and the label column (the target labels) of df are each converted to plain Python lists with .tolist() and passed to the train_test_split function from sklearn, which splits them in parallel so that every text stays paired with its label.

> The test_size=0.2 parameter specifies that 20% of the data will be allocated to the validation set, leaving 80% for training. This balanced split helps the model learn patterns from the majority of the data while reserving a smaller portion for unbiased evaluation. The random_state=42 ensures reproducibility by fixing the random seed, allowing consistent splits across multiple runs. As a result, train_texts and val_texts store the training and validation texts, while train_labels and val_labels hold the corresponding labels.

In [None]:

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['feedback'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42
)
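
> To make the roles of test_size and random_state concrete, the following is a minimal pure-Python sketch of a seeded shuffle-and-split. It is a simplified stand-in, not sklearn's actual implementation, so it will not select the same rows as train_test_split. The function name simple_split and the toy data are assumptions made for illustration only.

```python
import math
import random


def simple_split(texts, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out a fraction for validation.

    Hypothetical helper for illustration; not part of sklearn.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    # A dedicated Random instance makes the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )


# Toy data standing in for the feedback and label columns.
toy_texts = [f"comment {i}" for i in range(10)]
toy_labels = [i % 3 for i in range(10)]

split_a = simple_split(toy_texts, toy_labels)
split_b = simple_split(toy_texts, toy_labels)
```

> Calling the function twice with the same seed yields identical splits, which is the property random_state=42 provides in the real pipeline.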

### ***LOADING THE TOKENIZER***

> The line below initializes a tokenizer using the Hugging Face Transformers library, specifically from the pretrained model "MarieAngeA13/Sentiment-Analysis-BERT." The AutoTokenizer.from_pretrained() method loads the tokenizer configuration and vocabulary tailored to this model.

> The tokenizer's role is to preprocess raw text data by breaking it down into tokens, which are smaller chunks (such as words, subwords, or characters) that the model can process. It also converts these tokens into numerical representations, adds special tokens required by the model, and ensures consistent input formatting. By using a tokenizer pretrained with the same model, the input data is prepared in a way that aligns perfectly with the model's expectations, improving its ability to perform text classification tasks.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("MarieAngeA13/Sentiment-Analysis-BERT")

### ***TOKENIZING THE DATA***

> The code below defines a custom dataset class named FeedbackDataset, designed to handle text and label data for a feedback classification task. It inherits from PyTorch's Dataset class, making it compatible with data loaders and enabling efficient batch processing during model training.

> When initialized, the class takes in the texts, their corresponding labels, a tokenizer, and an optional maximum sequence length (default set to 128). The `__len__` method returns the total number of text samples, helping PyTorch determine the dataset size. The `__getitem__` method retrieves a specific sample by its index.

> For each sample, the `__getitem__` method uses the tokenizer to preprocess the text. This involves breaking the text into tokens, ensuring all samples are the same length (padding or truncating to the maximum length), and creating numerical representations in the form of input IDs and attention masks. These processed values, along with the corresponding label, are packaged into a dictionary and returned. By organizing the data in this structured way, the class ensures the model receives the input in the exact format it needs for training or evaluation.

In [None]:
class FeedbackDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encodings = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        # Return input IDs, attention masks, and the label
        return {
            "input_ids": encodings["input_ids"].squeeze(0),
            "attention_mask": encodings["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }
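
> The effect of truncation=True and padding="max_length" can be illustrated without loading a tokenizer. The sketch below works on a list of made-up token IDs and assumes a padding ID of 0; the function name pad_or_truncate is hypothetical. A real tokenizer does more than this (for example, it keeps the closing special token when truncating), so treat it only as a picture of how input IDs and the attention mask relate.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long.

    Simplified illustration: real tokenizers also preserve special tokens
    when truncating, which this sketch does not attempt.
    """
    ids = list(token_ids[:max_len])        # truncate if too long
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Made-up token IDs, chosen only for the example.
short_ids, short_mask = pad_or_truncate([101, 2307, 2465, 102], max_len=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

> Every sample ends up with the same length, which is what lets the Trainer stack samples into a batch tensor.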

### ***PREPARING TRAINING AND VALIDATION DATASETS***

> This code creates structured datasets for training and validation by using the FeedbackDataset class. First, train_texts and train_labels, which contain the training feedback texts and their corresponding labels, are passed to create train_dataset. Similarly, val_texts and val_labels are used to set up val_dataset for validation.

In [None]:
train_dataset = FeedbackDataset(train_texts, train_labels, tokenizer)
val_dataset = FeedbackDataset(val_texts, val_labels, tokenizer)

### ***LOADING THE PRETRAINED SENTIMENT ANALYSIS MODEL***

> The code below initializes a pretrained sentiment analysis model using Hugging Face's AutoModelForSequenceClassification. The specific model, "MarieAngeA13/Sentiment-Analysis-BERT", is tailored for text classification tasks. The parameter num_labels=3 ensures that the model is configured to predict one of three possible labels, aligning with the number of unique categories in the dataset.

> Because the checkpoint has already been trained for sentiment classification, fine-tuning starts from weights that are suited to analyzing feedback sentiment rather than from scratch. The rest of this notebook repeats the same pipeline with four other pretrained models, so that the results can be compared and the best-performing model for this dataset and task identified.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "MarieAngeA13/Sentiment-Analysis-BERT",
    num_labels=3  # Ensure the number of labels matches your dataset
)
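
> Since num_labels=3 and the dataset class converts each label with torch.long, the label column has to contain integers in the range 0 to 2. The following sketch shows one way such a check could be written before training. The function name find_invalid_labels and the sample values are assumptions for illustration, not part of the original notebook.

```python
def find_invalid_labels(labels, num_labels=3):
    """Return the distinct label values that are not integers in [0, num_labels)."""
    invalid = []
    for value in labels:
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not (is_int and 0 <= value < num_labels) and value not in invalid:
            invalid.append(value)
    return invalid


# Hypothetical label lists standing in for df['label'].tolist().
clean = find_invalid_labels([0, 1, 2, 1])
dirty = find_invalid_labels([0, 3, "positive", 3])

print(clean)
print(dirty)
```

> An empty result means the labels are compatible with a three-class classification head; anything else would need to be remapped first.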

### ***CONFIGURING TRAINING PARAMETERS FOR THE MODEL***

> The code below sets up the training configuration using the TrainingArguments class from Hugging Face. These arguments define how the model will be trained and evaluated. The output_dir specifies where the training results and model checkpoints will be saved. Both evaluation and saving are set to occur at the end of each epoch, ensuring the process stays consistent throughout the training.

> The learning rate is set to 5e-5, controlling how much the model adjusts its weights at each update. A batch size of 16 per device is used for both training and evaluation, balancing memory usage and processing speed. Training runs for 5 epochs, meaning the model passes through the entire training set five times. To help limit overfitting, weight decay of 0.01 is applied. Logging is configured to record progress every 10 steps and store the logs in a directory named ./logs.

> Additional options enhance performance and resource management. For example, the model will automatically reload the best version based on accuracy at the end of training, and only the two most recent checkpoints are kept to save storage. Finally, while the setup allows for integration with Hugging Face's model hub, this option is turned off with push_to_hub=False. Together, these parameters create a structured and efficient training process.

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    save_total_limit=2,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)
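
> The settings above determine how many optimizer steps, log entries, and evaluations a run produces. The arithmetic can be sketched as follows, assuming a single device, no gradient accumulation, and a final partial batch that is kept. The training-set size of 480 is an assumed figure used only for the example, and training_schedule is a hypothetical helper.

```python
import math


def training_schedule(n_train, batch_size=16, epochs=5, logging_steps=10):
    """Estimate step counts for a run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last partial batch is kept
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_entries": total_steps // logging_steps,   # one entry every logging_steps
        "evaluations": epochs,                         # one evaluation per epoch
    }


# 480 training examples is an assumption for illustration.
schedule = training_schedule(n_train=480)
print(schedule)
```

> Because evaluation and checkpointing both happen once per epoch, each checkpoint has a matching set of validation metrics, which is what load_best_model_at_end relies on.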



### ***DEFINING COMPUTE METRICS FUNCTION***

> The code below defines a function called compute_metrics that calculates various evaluation metrics to measure how well a machine learning model is performing. It takes in eval_pred, which includes the predicted values (logits) and the actual labels (labels). The function first determines the predicted classes by identifying the index of the highest value in each prediction using the torch.argmax method. Then, it compares these predictions to the true labels to calculate four important metrics.

> The accuracy measures the proportion of correct predictions out of all predictions made. Precision calculates how many of the samples predicted as a given class truly belong to it, while recall assesses how many of the samples of a given class were successfully predicted. The F1 score combines precision and recall into a single value. Because this is a three-class problem, precision, recall, and F1 are computed per class and then averaged with average="weighted", which weights each class by its number of true samples. Finally, the function returns these four metrics as a dictionary, making it easy to use in model evaluation. For precision and recall, the zero_division=0 parameter sets the score to 0 instead of raising a warning when a class has no predicted or no true samples.

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    precision = precision_score(labels, predictions, average="weighted", zero_division=0)
    recall = recall_score(labels, predictions, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
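
> To show what average="weighted" computes, here is a minimal pure-Python sketch of support-weighted precision, recall, and F1 with zero-division handled as 0. It is intended as an illustration of the formulas rather than a replacement for the sklearn functions used above; the function name weighted_scores and the toy labels are assumptions.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    total = len(y_true)
    support = Counter(y_true)                 # true samples per class
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        p_c = tp / predicted if predicted else 0.0       # zero_division=0
        r_c = tp / support[c] if support[c] else 0.0     # zero_division=0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[c] / total           # class weight = share of true samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for three classes.
scores = weighted_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(scores)
```

> With support weighting, recall reduces to the fraction of correctly classified samples, so it equals accuracy. This is why the Accuracy and Recall columns in the training logs show the same values.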

### ***INITIALIZING THE TRAINER***

> The code below initializes a Trainer object, which is a convenient tool for training and evaluating transformer models. The model parameter specifies the model that will be trained, while training_args contains the configuration details, such as the number of epochs, learning rate, and batch size. The datasets for training and validation are provided through train_dataset and eval_dataset, respectively.

> The tokenizer is also passed in. The texts are already tokenized inside FeedbackDataset, so here the tokenizer mainly lets the Trainer batch samples consistently and save the tokenizer files alongside each checkpoint. Additionally, the compute_metrics function is passed to evaluate the model's performance during validation, using metrics like accuracy and F1 score. This setup simplifies the training pipeline by bundling all the necessary components into a single object.

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


### ***TRAINING THE MODEL AND SAVING THE FINE-TUNED MODEL***

> The code below trains the model using the trainer.train() method, which starts the training process with the configurations and datasets specified earlier. Once training is complete, the fine-tuned model is saved to a folder named ./bert_fine_tuned_model using the trainer.save_model() method. The tokenizer is saved to the same folder with tokenizer.save_pretrained(). This ensures that both the model and its tokenizer can be reloaded together later for making predictions or further fine-tuning.

In [None]:
# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./bert_fine_tuned_model")
tokenizer.save_pretrained("./bert_fine_tuned_model")

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.635 | 0.436572 | 0.875 | 0.876815 | 0.880088 | 0.875 |
| 2 | 0.2846 | 0.300412 | 0.883333 | 0.883004 | 0.883182 | 0.883333 |
| 3 | 0.1161 | 0.365925 | 0.875 | 0.876385 | 0.891326 | 0.875 |
| 4 | 0.1144 | 0.508139 | 0.875 | 0.875245 | 0.882811 | 0.875 |
| 5 | 0.1133 | 0.523568 | 0.875 | 0.875124 | 0.878807 | 0.875 |


Non-default generation parameters: {'max_length': 64}


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

### ***EVALUATING THE MODEL***

> The code below evaluates the fine-tuned model with the trainer.evaluate() method, which runs the model on the validation dataset passed to the Trainer. The resulting metrics, including loss, accuracy, precision, recall, and F1 score, are stored in the variable eval_results and displayed with the print() function. Because load_best_model_at_end=True, the model evaluated here is the checkpoint with the highest validation accuracy, not necessarily the weights from the final epoch. The same validation set was also used to select that checkpoint, so these figures may be somewhat optimistic compared with performance on a separate held-out test set.

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 0.3004123568534851, 'eval_accuracy': 0.8833333333333333, 'eval_f1': 0.8830041315049226, 'eval_precision': 0.8831816838598755, 'eval_recall': 0.8833333333333333, 'eval_runtime': 1.1472, 'eval_samples_per_second': 104.604, 'eval_steps_per_second': 6.974, 'epoch': 5.0}
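
> The evaluation output reports 'epoch': 5.0, yet its eval_loss and eval_accuracy match the epoch 2 row of the training log. This is consistent with load_best_model_at_end=True reloading the checkpoint that scored highest on metric_for_best_model="accuracy". The selection can be sketched as below, using the per-epoch values from the Model 1 log above; the variable names are illustrative assumptions.

```python
# (epoch, validation_loss, accuracy) rows copied from the Model 1 training log.
history = [
    (1, 0.436572, 0.875),
    (2, 0.300412, 0.883333),
    (3, 0.365925, 0.875),
    (4, 0.508139, 0.875),
    (5, 0.523568, 0.875),
]

# Pick the epoch with the highest accuracy; max() keeps the earliest on ties.
best_epoch, best_loss, best_accuracy = max(history, key=lambda row: row[2])

print(f"best epoch: {best_epoch}, loss: {best_loss}, accuracy: {best_accuracy}")
```

> Validation loss rises after epoch 2 while training loss keeps falling, which suggests the later epochs are beginning to overfit.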


# **Model 2: `lxyuan/distilbert-base-multilingual-cased-sentiments-student`**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

# Load your dataset
file_path = "/content/TLC_student_feedback_dataset.xlsx"
df = pd.read_excel(file_path)

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['feedback'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")

# Tokenize the data
class FeedbackDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encodings = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        # Return input IDs, attention masks, and the label
        return {
            "input_ids": encodings["input_ids"].squeeze(0),
            "attention_mask": encodings["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }

train_dataset = FeedbackDataset(train_texts, train_labels, tokenizer)
val_dataset = FeedbackDataset(val_texts, val_labels, tokenizer)

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    num_labels=3  # Ensure the number of labels matches your dataset
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    save_total_limit=2,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    precision = precision_score(labels, predictions, average="weighted", zero_division=0)
    recall = recall_score(labels, predictions, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./distilbert_fine_tuned_model")
tokenizer.save_pretrained("./distilbert_fine_tuned_model")



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.7417 | 0.571357 | 0.775 | 0.756333 | 0.811499 | 0.775 |
| 2 | 0.3013 | 0.316452 | 0.841667 | 0.84256 | 0.851282 | 0.841667 |
| 3 | 0.2135 | 0.338188 | 0.866667 | 0.868203 | 0.88797 | 0.866667 |
| 4 | 0.1703 | 0.352927 | 0.85 | 0.852038 | 0.860972 | 0.85 |
| 5 | 0.1178 | 0.399938 | 0.858333 | 0.860185 | 0.867125 | 0.858333 |


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 0.338188499212265, 'eval_accuracy': 0.8666666666666667, 'eval_f1': 0.8682034632034631, 'eval_precision': 0.8879695885509838, 'eval_recall': 0.8666666666666667, 'eval_runtime': 0.4836, 'eval_samples_per_second': 248.133, 'eval_steps_per_second': 16.542, 'epoch': 5.0}


# **Model 3: `j-hartmann/sentiment-roberta-large-english-3-classes`**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

# Load your dataset
file_path = "/content/TLC_student_feedback_dataset.xlsx"
df = pd.read_excel(file_path)

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['feedback'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("j-hartmann/sentiment-roberta-large-english-3-classes")

# Tokenize the data
class FeedbackDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encodings = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        # Return input IDs, attention masks, and the label
        return {
            "input_ids": encodings["input_ids"].squeeze(0),
            "attention_mask": encodings["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }

train_dataset = FeedbackDataset(train_texts, train_labels, tokenizer)
val_dataset = FeedbackDataset(val_texts, val_labels, tokenizer)

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "j-hartmann/sentiment-roberta-large-english-3-classes",
    num_labels=3  # Ensure the number of labels matches your dataset
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    save_total_limit=2,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    precision = precision_score(labels, predictions, average="weighted", zero_division=0)
    recall = recall_score(labels, predictions, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./robert_fine_tuned_model")
tokenizer.save_pretrained("./robert_fine_tuned_model")

Some weights of the model checkpoint at j-hartmann/sentiment-roberta-large-english-3-classes were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3494 | 0.509322 | 0.866667 | 0.865379 | 0.876058 | 0.866667 |
| 2 | 0.3802 | 0.330654 | 0.9 | 0.900989 | 0.902563 | 0.9 |
| 3 | 0.2517 | 0.252783 | 0.85 | 0.851377 | 0.875983 | 0.85 |
| 4 | 0.1418 | 0.265049 | 0.85 | 0.850263 | 0.865442 | 0.85 |
| 5 | 0.1249 | 0.370808 | 0.858333 | 0.858652 | 0.866071 | 0.858333 |


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.json',
 './fine_tuned_model/merges.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 0.3306542932987213, 'eval_accuracy': 0.9, 'eval_f1': 0.9009885303699737, 'eval_precision': 0.902563316993464, 'eval_recall': 0.9, 'eval_runtime': 2.6877, 'eval_samples_per_second': 44.647, 'eval_steps_per_second': 2.976, 'epoch': 5.0}


# **Model 4: `blanchefort/rubert-base-cased-sentiment-rusentiment`**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

# Load your dataset
file_path = "/content/TLC_student_feedback_dataset.xlsx"
df = pd.read_excel(file_path)

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['feedback'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("blanchefort/rubert-base-cased-sentiment-rusentiment")

# Tokenize the data
class FeedbackDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encodings = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        # Return input IDs, attention masks, and the label
        return {
            "input_ids": encodings["input_ids"].squeeze(0),
            "attention_mask": encodings["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }

train_dataset = FeedbackDataset(train_texts, train_labels, tokenizer)
val_dataset = FeedbackDataset(val_texts, val_labels, tokenizer)

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "blanchefort/rubert-base-cased-sentiment-rusentiment",
    num_labels=3  # Ensure the number of labels matches your dataset
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    save_total_limit=2,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    precision = precision_score(labels, predictions, average="weighted", zero_division=0)
    recall = recall_score(labels, predictions, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./rubert_fine_tuned_model")
tokenizer.save_pretrained("./rubert_fine_tuned_model")


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4657 | 0.434491 | 0.816667 | 0.809106 | 0.8285 | 0.816667 |
| 2 | 0.2885 | 0.451812 | 0.841667 | 0.838806 | 0.846303 | 0.841667 |
| 3 | 0.1198 | 0.661376 | 0.816667 | 0.813807 | 0.82643 | 0.816667 |
| 4 | 0.1096 | 0.648007 | 0.808333 | 0.804807 | 0.813582 | 0.808333 |
| 5 | 0.0935 | 0.800077 | 0.808333 | 0.804807 | 0.813582 | 0.808333 |


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 0.45181208848953247, 'eval_accuracy': 0.8416666666666667, 'eval_f1': 0.8388056816572075, 'eval_precision': 0.84630277079405, 'eval_recall': 0.8416666666666667, 'eval_runtime': 0.9362, 'eval_samples_per_second': 128.176, 'eval_steps_per_second': 8.545, 'epoch': 5.0}
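
> With four models evaluated so far, their results can be placed side by side. The sketch below ranks them by the eval_accuracy values printed above. It also converts the gap between the top two into a number of validation examples, assuming a validation set of about 120 examples, which is what the reported runtime and samples-per-second figures suggest but is not stated directly in the notebook.

```python
# eval_accuracy values copied from the evaluation outputs of Models 1 to 4.
results = {
    "MarieAngeA13/Sentiment-Analysis-BERT": 0.8833333333333333,
    "lxyuan/distilbert-base-multilingual-cased-sentiments-student": 0.8666666666666667,
    "j-hartmann/sentiment-roberta-large-english-3-classes": 0.9,
    "blanchefort/rubert-base-cased-sentiment-rusentiment": 0.8416666666666667,
}

# Sort from highest to lowest validation accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{accuracy:.4f}  {name}")

# Assumed validation-set size, inferred from the logs rather than stated.
N_VAL_ASSUMED = 120
gap_in_examples = round((ranked[0][1] - ranked[1][1]) * N_VAL_ASSUMED)
print(f"gap between top two: about {gap_in_examples} examples")
```

> Under that assumption the leading model is ahead by only a couple of validation examples, so the ranking should be read as indicative rather than conclusive.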


# **Model 5: `jbeno/electra-large-classifier-sentiment`**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset

# Load your dataset
file_path = "/content/TLC_student_feedback_dataset.xlsx"
df = pd.read_excel(file_path)

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['feedback'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("jbeno/electra-large-classifier-sentiment")

# Dataset wrapper that tokenizes each feedback text when it is accessed
class FeedbackDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encodings = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        # Return input IDs, attention masks, and the label
        return {
            "input_ids": encodings["input_ids"].squeeze(0),
            "attention_mask": encodings["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }

train_dataset = FeedbackDataset(train_texts, train_labels, tokenizer)
val_dataset = FeedbackDataset(val_texts, val_labels, tokenizer)

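# Load the model
# Note: the load warning in the cell output reports that the classifier head
# is newly initialized for this checkpoint, so the head is trained from scratch here.
model = AutoModelForSequenceClassification.from_pretrained(
    "jbeno/electra-large-classifier-sentiment",
    num_labels=3  # Ensure the number of labels matches your dataset
)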

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    save_total_limit=2,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Compute accuracy plus support-weighted F1, precision, and recall from the logits
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    precision = precision_score(labels, predictions, average="weighted", zero_division=0)
    recall = recall_score(labels, predictions, average="weighted", zero_division=0)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./electra_fine_tuned_model")
tokenizer.save_pretrained("./electra_fine_tuned_model")

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at jbeno/electra-large-classifier-sentiment and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4236 | 0.301889 | 0.908333 | 0.906427 | 0.91176 | 0.908333 |
| 2 | 0.3882 | 0.401389 | 0.891667 | 0.891893 | 0.893341 | 0.891667 |
| 3 | 0.762 | 0.62051 | 0.8 | 0.794104 | 0.834463 | 0.8 |
| 4 | 0.2943 | 0.327475 | 0.908333 | 0.908984 | 0.909848 | 0.908333 |
| 5 | 0.2538 | 0.315676 | 0.9 | 0.900409 | 0.901 | 0.9 |


('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [None]:
# Evaluate the model on the validation set
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 0.30188852548599243, 'eval_accuracy': 0.9083333333333333, 'eval_f1': 0.9064268346595933, 'eval_precision': 0.9117599067599068, 'eval_recall': 0.9083333333333333, 'eval_runtime': 2.7651, 'eval_samples_per_second': 43.397, 'eval_steps_per_second': 2.893, 'epoch': 5.0}
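
> In these results `eval_recall` is identical to `eval_accuracy`. This is expected when recall is computed with `average="weighted"`: each class's recall is weighted by its share of the samples, so the weighted sum reduces to the overall fraction of correct predictions. The sketch below illustrates this in plain Python. The functions `weighted_recall` and `accuracy` are hypothetical helpers written for this illustration, and the labels are made up rather than taken from the feedback dataset.

```python
from collections import Counter


def weighted_recall(y_true, y_pred):
    """Support-weighted mean of per-class recall (the idea behind average="weighted")."""
    support = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # hits per class
    total = len(y_true)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for a three-class problem (not from the dataset).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 1, 2, 2, 0]

print("weighted recall:", weighted_recall(y_true, y_pred))
print("accuracy:       ", accuracy(y_true, y_pred))
```

> Because of this identity, the recall column adds no information beyond accuracy. If per-class behaviour matters, macro-averaged or per-class recall may be a more informative choice.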


In [24]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!git clone https://hachikaze:ghp_YourPATHere@github.com/hachikaze/Sentiment-Analysis-on-Student-Satisfaction-Feedback.git