#**Text Classification by Fine-tuning Language Models**
##**1. Data Loading**

In [None]:
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('/content/new_dataset.csv')

# Rename columns to the names used in the rest of the notebook
# (assumes the CSV columns are 'INPUT' and 'INTENT'; adjust if yours differ)
data = data.rename(columns={'INPUT': 'Summary', 'INTENT': 'Sentiment'})

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure

# Class distribution of the label column
# (check the output of data.info() above if the column name differs)
print(data['Sentiment'].value_counts())

# Split dataset into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.3, random_state=42)

# Build two-column frames (text first, label second) for SimpleTransformers
train_df = pd.DataFrame({
    'summary': train_data['Summary'],
    'sentiment': train_data['Sentiment']
})

val_df = pd.DataFrame({
    'summary': val_data['Summary'],
    'sentiment': val_data['Sentiment']
})

# Display the first few rows of the training and validation data
print("Training Data:")
print(train_df.head())

print("\nValidation Data:")
print(val_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Summary    1100 non-null   object
 1   Sentiment  1100 non-null   object
dtypes: object(2)
memory usage: 17.3+ KB
None
Sentiment
positive    921
negative    138
neutral      41
Name: count, dtype: int64
Training Data:
                                               summary sentiment
221  this is aws product at healthy price its worth...  positive
235                                           good one  positive
433                                               nice  positive
599                                               good  positive
305  good onebut still sounds generated by cooler a...  positive

Validation Data:
                                               summary sentiment
328                                  very nice product  positive
688                                    awesome product 
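The class counts above are heavily skewed (921 / 138 / 41), and a purely random split does not guarantee that each class keeps the same proportion in both sets. The following is a minimal sketch of a stratified split, not the split that produced the results in this notebook. The `labels` and `texts` lists are synthetic stand-ins that only mirror the printed class counts; `stratify` is a standard parameter of `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the class counts printed above (an assumption
# for illustration; in the notebook you would pass data['Sentiment'])
labels = ["positive"] * 921 + ["negative"] * 138 + ["neutral"] * 41
texts = [f"review {i}" for i in range(len(labels))]

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts,
    labels,
    test_size=0.3,
    random_state=42,
    stratify=labels,  # keep class proportions the same in both splits
)

print(Counter(train_labels))
print(Counter(val_labels))
```

With stratification, roughly 30% of each class, including the small neutral class, ends up in the validation set.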

##**2. Text Preprocessing**

In [None]:
import re
import pandas as pd  # Import pandas for data manipulation

# Define a function to clean text data
def clean_text(summary):
    # Check if the input is a string before applying string methods
    if isinstance(summary, str):
        # Convert to lowercase
        summary = summary.lower()

        # Remove special characters and numbers
        summary = re.sub(r'[^a-zA-Z\s]', '', summary)

        # Strip leading and trailing whitespace
        summary = summary.strip()

        return summary
    else:
        # If not a string, return the original value (or handle it as needed)
        return summary

# Apply the cleaning function to the dataset
train_df['summary'] = train_df['summary'].apply(clean_text)
val_df['summary'] = val_df['summary'].apply(clean_text)

# Display the first few rows of the cleaned training data
print("Cleaned Training Data:")
print(train_df.head())

# Display the first few rows of the cleaned validation data
print("\nCleaned Validation Data:")
print(val_df.head())

Cleaned Training Data:
                                               summary sentiment
221  this is aws product at healthy price its worth...  positive
235                                           good one  positive
433                                               nice  positive
599                                               good  positive
305  good onebut still sounds generated by cooler a...  positive

Cleaned Validation Data:
                                               summary sentiment
328                                  very nice product  positive
688                                    awesome product  positive
413                                             thanks  positive
788  best dessert cooler in plastic body with lower...  positive
244  air cooler is better im happy recived your pro...  positive
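The cleaned output contains merged tokens such as "onebut", because punctuation is deleted rather than replaced by a space. A possible variant is sketched below; `clean_text_spaced` is a hypothetical name, and the example strings are made up for illustration. It replaces non-letters with a space and then collapses repeated whitespace.

```python
import re

def clean_text_spaced(text):
    """Lowercase, turn non-letters into spaces, and collapse whitespace."""
    if not isinstance(text, str):
        # Leave non-string values (e.g. NaN) untouched
        return text
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space,
    # so "one,but" becomes "one but" instead of "onebut"
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

# Made-up example inputs
examples = ["Good one,but still noisy!!", "  Very   nice product 100% "]
cleaned = [clean_text_spaced(t) for t in examples]
print(cleaned)
```

Whether this helps is an empirical question: BERT and RoBERTa tokenizers handle punctuation themselves, so lighter cleaning may work just as well.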


##**3. Text Embedding using BERT and RoBERTa**

In [None]:
from simpletransformers.classification import ClassificationModel

# Number of unique labels (sentiment classes) in the dataset
num_labels = len(data['Sentiment'].unique())

# Create a BERT model for text classification
bert_model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    use_cuda=False  # Runs on CPU; set to True to use a GPU
)

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=num_labels,
    use_cuda=True  # Requires a CUDA-capable GPU; set to False to run on CPU
)

print("BERT and RoBERTa models initialized successfully!")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT and RoBERTa models initialized successfully!


##**4. Model Training with BERT and RoBERTa**

In [None]:
from sklearn.preprocessing import LabelEncoder
from simpletransformers.classification import ClassificationArgs

# Convert string labels to integer labels using LabelEncoder
label_encoder = LabelEncoder()
train_df['sentiment'] = label_encoder.fit_transform(train_df['sentiment'])
val_df['sentiment'] = label_encoder.transform(val_df['sentiment'])

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=3,       # Start with 3 epochs
    train_batch_size=8,       # Use a batch size of 8
    eval_batch_size=8,        # Same for evaluation
    learning_rate=3e-5,       # Learning rate
    max_seq_length=128,       # Max sequence length
    weight_decay=0.01,        # Weight decay
    warmup_steps=0,           # Optional: adjust based on total steps
    logging_steps=50,         # Log training progress every 50 steps
    save_steps=200,           # Save the model every 200 steps
    overwrite_output_dir=True,  # Overwrite the output directory
    output_dir='outputs',     # Directory to save model outputs
)

# Train the BERT model with custom hyperparameters
bert_model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    args=model_args,
    use_cuda=True  # Set to True if using GPU
)
bert_model.train_model(train_df)

# Train the RoBERTa model with custom hyperparameters
roberta_model = ClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=num_labels,
    args=model_args,
    use_cuda=True  # Set to True if using GPU
)
roberta_model.train_model(train_df)

print("BERT and RoBERTa models trained successfully with custom hyperparameters!")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



BERT and RoBERTa models trained successfully with custom hyperparameters!


##**5. Evaluation on Validation Set**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

# Decode predictions back to original labels
bert_predictions = np.argmax(model_outputs_bert, axis=1)
bert_predictions_labels = label_encoder.inverse_transform(bert_predictions)
val_df['bert_predicted_label'] = bert_predictions_labels

# Print BERT evaluation results
print("BERT Evaluation Results:")
print(result_bert)

# Classification report for BERT
print("\nBERT Classification Report:")
print(classification_report(val_df['sentiment'], bert_predictions, target_names=label_encoder.classes_))

# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

# Decode predictions back to original labels
roberta_predictions = np.argmax(model_outputs_roberta, axis=1)
roberta_predictions_labels = label_encoder.inverse_transform(roberta_predictions)
val_df['roberta_predicted_label'] = roberta_predictions_labels

# Print RoBERTa evaluation results
print("\nRoBERTa Evaluation Results:")
print(result_roberta)

# Classification report for RoBERTa
print("\nRoBERTa Classification Report:")
print(classification_report(val_df['sentiment'], roberta_predictions, target_names=label_encoder.classes_))





BERT Evaluation Results:
{'mcc': np.float64(0.6679195252311672), 'eval_loss': 0.41689983152207877}

BERT Classification Report:
              precision    recall  f1-score   support

    negative       0.76      0.77      0.76        44
     neutral       0.00      0.00      0.00        13
    positive       0.93      0.97      0.95       273

    accuracy                           0.91       330
   macro avg       0.56      0.58      0.57       330
weighted avg       0.87      0.91      0.89       330








RoBERTa Evaluation Results:
{'mcc': np.float64(0.6173329743466499), 'eval_loss': 0.5648269341105506}

RoBERTa Classification Report:
              precision    recall  f1-score   support

    negative       0.83      0.66      0.73        44
     neutral       0.00      0.00      0.00        13
    positive       0.91      0.98      0.94       273

    accuracy                           0.90       330
   macro avg       0.58      0.55      0.56       330
weighted avg       0.86      0.90      0.88       330
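The gap between accuracy (about 0.9) and macro-average scores (about 0.56) comes from how the averages are formed: the macro average gives every class equal weight, while the weighted average scales each class by its support. The sketch below recomputes both for BERT's precision from the per-class values printed in the report above; the helper names are illustrative only.

```python
# Per-class values copied from the BERT classification report above:
# (precision, recall, f1, support)
bert_report = {
    "negative": (0.76, 0.77, 0.76, 44),
    "neutral": (0.00, 0.00, 0.00, 13),
    "positive": (0.93, 0.97, 0.95, 273),
}

def macro_average(report, index):
    """Unweighted mean of one metric over all classes."""
    values = [row[index] for row in report.values()]
    return sum(values) / len(values)

def weighted_average(report, index):
    """Mean of one metric weighted by each class's support."""
    total = sum(row[3] for row in report.values())
    return sum(row[index] * row[3] for row in report.values()) / total

macro_precision = macro_average(bert_report, 0)
weighted_precision = weighted_average(bert_report, 0)
print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
```

This prints 0.56 and 0.87, matching the report. The neutral class, with 13 of 330 examples, barely moves the weighted average but accounts for a third of the macro average.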





In [None]:
import pandas as pd

# Summary table for BERT and RoBERTa, using the values reported above
# (named summary_table so it does not overwrite the `data` DataFrame)
summary_table = {
    "No.": [1, 2],
    "Model Name": ["BERT", "RoBERTa"],
    "Precision": [0.56, 0.58],  # Macro avg precision from classification reports
    "Recall": [0.58, 0.55],     # Macro avg recall from classification reports
    "F1 Score": [0.57, 0.56],   # Macro avg F1-score from classification reports
    "Accuracy": [0.91, 0.90],   # Accuracy from classification reports
    "MCC": [0.668, 0.617]       # MCC from evaluation results
}

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame(summary_table)

# Display the table
df

   No. Model Name  Precision  Recall  F1 Score  Accuracy    MCC
0    1       BERT       0.56    0.58      0.57      0.91  0.668
1    2    RoBERTa       0.58    0.55      0.56      0.90  0.617
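Typing metric values into a table by hand makes it easy for the table to drift from the actual evaluation output. One way to avoid that is to compute each row directly from the labels and predictions. The sketch below assumes scikit-learn is available; `summarize` is a hypothetical helper, and the toy labels are not the notebook's data.

```python
from sklearn.metrics import (
    accuracy_score,
    matthews_corrcoef,
    precision_recall_fscore_support,
)

def summarize(model_name, y_true, y_pred):
    """Collect macro-averaged metrics, accuracy, and MCC for one model."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "Model Name": model_name,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "Accuracy": accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }

# Toy labels (hypothetical, not the notebook's data): class 1 is never predicted
y_true = [0, 0, 1, 2, 2, 2, 2, 2]
y_pred = [0, 2, 2, 2, 2, 2, 2, 2]

row = summarize("toy model", y_true, y_pred)
print(row)
```

In this notebook the same helper could be called as `summarize("BERT", val_df['sentiment'], bert_predictions)` and `summarize("RoBERTa", val_df['sentiment'], roberta_predictions)`, and the two resulting dictionaries passed as a list to `pd.DataFrame`.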


##**6. Saving the Model**

In [None]:
# Save the BERT model manually
bert_model.model.save_pretrained("bert_model")
bert_model.tokenizer.save_pretrained("bert_model")
print("BERT model saved manually!")
# Save the RoBERTa model manually
roberta_model.model.save_pretrained("roberta_model")
roberta_model.tokenizer.save_pretrained("roberta_model")
print("RoBERTa model saved manually!")

BERT model saved manually!
RoBERTa model saved manually!
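The saved model directories hold the weights and tokenizer, but not the mapping from integer ids back to label names. Section 7 relies on `label_encoder` still being in memory. A minimal sketch of persisting that mapping follows. The file name `label_classes.json` and the temporary directory are assumptions for illustration; the class list is written out by hand here, but in the notebook it would come from `label_encoder.classes_.tolist()`.

```python
import json
import os
import tempfile

# Classes in the order LabelEncoder assigns ids (alphabetical); in the notebook
# this list would come from label_encoder.classes_.tolist()
classes = ["negative", "neutral", "positive"]

# Hypothetical location; in practice this could sit inside "bert_model/"
path = os.path.join(tempfile.mkdtemp(), "label_classes.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(classes, f)

# Later, in a fresh session: reload the mapping and decode predicted ids
with open(path, encoding="utf-8") as f:
    loaded_classes = json.load(f)

def decode(pred_ids):
    """Map integer class ids back to their string labels."""
    return [loaded_classes[i] for i in pred_ids]

print(decode([2, 0, 1]))
```

Storing the list next to the model files keeps predictions interpretable after a kernel restart.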


##**7. Prediction on Real-World Input**

In [None]:
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_model', use_cuda=False)

# Real-world input text (aligned with your dataset's context)
real_world_text = [
    "Fast delivery and great product quality!",
    "Customer service responded, but took some time.",
    "Customer support was unhelpful and rude."
]

# Predict the class using BERT
predictions_bert, _ = bert_model.predict(real_world_text)

# Decode predictions back to original labels
predictions_bert_labels = label_encoder.inverse_transform(predictions_bert)

# Print BERT predictions
print("BERT Predictions:")
for text, pred_label in zip(real_world_text, predictions_bert_labels):
    print(f"Text: {text} -> Predicted Intent: {pred_label}")

# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_model', use_cuda=False)

# Predict the class using RoBERTa
predictions_roberta, _ = roberta_model.predict(real_world_text)

# Decode predictions back to original labels
predictions_roberta_labels = label_encoder.inverse_transform(predictions_roberta)

# Print RoBERTa predictions
print("\nRoBERTa Predictions:")
for text, pred_label in zip(real_world_text, predictions_roberta_labels):
    print(f"Text: {text} -> Predicted Intent: {pred_label}")


BERT Predictions:
Text: Fast delivery and great product quality! -> Predicted Intent: positive
Text: Customer service responded, but took some time. -> Predicted Intent: positive
Text: Customer support was unhelpful and rude. -> Predicted Intent: negative




RoBERTa Predictions:
Text: Fast delivery and great product quality! -> Predicted Intent: positive
Text: Customer service responded, but took some time. -> Predicted Intent: positive
Text: Customer support was unhelpful and rude. -> Predicted Intent: negative


##**8. Analysis**
### Discussion of Results

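1. **BERT**:
   - **Performance**: Achieved an MCC of 0.668 and an accuracy of 0.91 on the 330-example validation set. Per-class F1 was 0.95 for positive and 0.76 for negative, but 0.00 for neutral, which pulls the macro-average F1 down to 0.57.
   - **Analysis**: BERT captures enough contextual information to separate positive from negative reviews, but it did not classify a single neutral example correctly. The headline accuracy therefore mostly reflects the dominant positive class (273 of 330 validation examples).

2. **RoBERTa**:
   - **Performance**: Scored slightly below BERT, with an MCC of 0.617 and an accuracy of 0.90. Negative-class precision was higher (0.83 vs. 0.76) but recall was lower (0.66 vs. 0.77), and neutral F1 was again 0.00.
   - **Analysis**: RoBERTa is an optimized version of BERT, pretrained with an improved procedure on a larger corpus, but that advantage did not carry over here. With only 770 training examples and three epochs, the gap between the two models is small and may not be meaningful.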


### Best Performing Feature Set

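- **Transformer Models (BERT and RoBERTa)**: Of the two models evaluated in this notebook, BERT gave the better results (MCC 0.668 vs. 0.617, accuracy 0.91 vs. 0.90). Traditional NLP features (BoW, TF-IDF, FastText) are not evaluated here, so these results alone do not support a comparison with them. The expected advantage of transformer models is that they capture contextual relationships in text, which matters for reading sentiment in short customer reviews.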

### Challenges and Interesting Findings

- **BERT vs. RoBERTa**: The two models performed almost identically (accuracy 0.91 vs. 0.90), with BERT ahead on MCC. RoBERTa's larger pretraining corpus gave no measurable benefit on this dataset.
- **Class Imbalance**: The dataset is heavily skewed (921 positive, 138 negative, 41 neutral). Neither model handled this well: both scored 0.00 precision and recall on the neutral class, and the high overall accuracy hides that failure.
- **Training Time**: Transformer models require more computational resources and time compared to traditional models.
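One common response to the class imbalance noted above is to weight the loss by inverse class frequency. The sketch below computes "balanced" weights, n_samples / (n_classes * class_count), in plain Python. The `labels` list is synthetic and mirrors the full-dataset counts from Section 1. The SimpleTransformers documentation describes a `weight` argument on `ClassificationModel` that takes one weight per label; treat that as an assumption to verify against the installed version.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Sort class names so the order matches LabelEncoder's alphabetical ids
    return {
        name: n_samples / (n_classes * counts[name])
        for name in sorted(counts)
    }

# Synthetic labels mirroring the full-dataset counts from Section 1; in
# practice, compute the weights from the training split only
labels = ["positive"] * 921 + ["negative"] * 138 + ["neutral"] * 41
weights = balanced_class_weights(labels)
for name, weight in weights.items():
    print(f"{name}: {weight:.2f}")
```

The neutral class receives a weight of about 8.94 against 0.40 for positive, so each neutral error would count roughly 22 times as much in the loss. Whether this recovers neutral recall on this dataset would need to be tested.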

### Potential Improvements and Further Experiments

1. **Fine-Tuning**: Further fine-tune BERT and RoBERTa on domain-specific data to improve performance.
2. **Data Augmentation**: Use data augmentation techniques to balance class distribution and improve model generalization.
3. **Ensemble Methods**: Combine BERT/RoBERTa with other models to leverage their strengths.
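
As an illustration of the ensemble idea, the sketch below averages the class probabilities of two models and takes the most probable class. It assumes each model returns raw scores (logits) of shape (n_texts, n_classes), as `model_outputs_bert` and `model_outputs_roberta` do in Section 5. The logits here are hypothetical values, and `ensemble_predict` is an illustrative helper, not part of SimpleTransformers.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Average two models' class probabilities and pick the top class."""
    probs_a = softmax(np.asarray(logits_a, dtype=float))
    probs_b = softmax(np.asarray(logits_b, dtype=float))
    probs = (probs_a + probs_b) / 2.0
    return probs.argmax(axis=1), probs

# Hypothetical logits for two texts and three classes
bert_logits = [[0.2, 0.1, 2.0], [1.5, 0.3, 1.4]]
roberta_logits = [[0.1, 0.0, 2.5], [0.4, 0.2, 1.9]]

pred_ids, probs = ensemble_predict(bert_logits, roberta_logits)
print(pred_ids)
print(probs.round(3))
```

For the second text the first model alone narrowly prefers class 0, while the averaged probabilities favor class 2. Averaging probabilities rather than raw logits keeps one over-confident model from dominating purely through scale.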