<a href="https://colab.research.google.com/github/mahikajain20/LHL_LLM_Project/blob/main/notebooks/3-pre-trained-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import transformers as tr
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import hamming_loss, f1_score
from transformers import BertTokenizer, BertModel
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import requests
import getpass
import warnings
warnings.filterwarnings('ignore')

In [3]:
# `pipeline` is already imported above; only the names used below are needed here
from sklearn.metrics import f1_score, classification_report
import joblib
from tqdm import tqdm

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
#Load the data
ds_train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ds_train_data.csv')
ds_test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ds_test_data.csv')

The pre-trained model that I used for this analysis is SiEBERT (English-Language Sentiment Classification), loaded below as `siebert/sentiment-roberta-large-english`.
It enables reliable binary sentiment analysis for various types of English-language text. For each instance, it predicts either positive (1) or negative (0) sentiment.
I expect it to suit the IMDB dataset used in this analysis, and the predictions below check that expectation against the data that I have.

In [6]:
print("Initializing sentiment analysis pipeline...")
pipe = pipeline('sentiment-analysis', model='siebert/sentiment-roberta-large-english')

Initializing sentiment analysis pipeline...



In [10]:
# Analyzing the whole dataset was not feasible with the compute available on my computer or Colab,
# so I used a random sample of 100 reviews, sized to the computation time I had.
ds_testing = ds_test.sample(100, random_state=42)

In [11]:
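# Preparing test data
# Note: model_max_length is a limit in tokens, but the slice below cuts each review
# to that many characters, so the model only sees the opening of longer reviews.
max_length = pipe.tokenizer.model_max_length
data = [text[:max_length] for text in ds_testing['text']]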

In [12]:
preds = pipe(data)
preds_df = pd.DataFrame(preds)
#save to csv
preds_df.to_csv('preds.csv', index=False)
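
The truncation and compute limits described above can be handled together. This is a minimal sketch, not the method used for the results in this notebook: it feeds reviews to a classifier in small batches and asks the tokenizer to truncate at the token limit. The helper `classify_in_batches` and the stand-in `fake_classifier` are hypothetical names introduced for illustration. It assumes the classifier accepts `truncation=True`, as the Hugging Face text-classification pipeline does.

```python
from typing import Callable, Dict, List, Sequence


def classify_in_batches(
    classifier: Callable[..., List[Dict]],
    texts: Sequence[str],
    batch_size: int = 16,
) -> List[Dict]:
    """Run a text-classification callable over `texts` in fixed-size chunks."""
    results: List[Dict] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        # truncation=True asks the tokenizer to cut each input at the model's
        # token limit, so no manual character slicing is needed.
        results.extend(classifier(chunk, truncation=True))
    return results


# Hypothetical stand-in so the sketch runs without downloading the real model.
def fake_classifier(batch, truncation=False):
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
        for text in batch
    ]


demo = classify_in_batches(
    fake_classifier, ["good film", "dull plot", "good cast"], batch_size=2
)
print([d["label"] for d in demo])
```

With the real pipeline, the call would look like `classify_in_batches(pipe, ds_testing['text'].tolist())`. Results may differ from those reported below, because the model would then see more of each review.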

In [13]:
print("\nPrediction distribution:")
print(preds_df['label'].value_counts())


Prediction distribution:
label
NEGATIVE    50
POSITIVE    50
Name: count, dtype: int64


In [14]:
# Convert predictions to binary format
predictions = [1 if label == 'POSITIVE' else 0 for label in preds_df['label']]

# Get corresponding true labels
true_labels = ds_testing['label'].tolist()

# Print classification report
print("\nClassification Report:")
print(classification_report(true_labels, predictions))


Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.94      0.91        47
           1       0.94      0.89      0.91        53

    accuracy                           0.91       100
   macro avg       0.91      0.91      0.91       100
weighted avg       0.91      0.91      0.91       100



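**Prediction Distribution**

NEGATIVE: 50
POSITIVE: 50

The predictions split evenly between the two classes. The true labels in the sample are 47 negative and 53 positive, so there is no meaningful class imbalance.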

**Model Performance:**

Overall Performance:

Accuracy: 0.91 (91%)
The model correctly classifies 91 of the 100 sampled reviews. That is a strong result, although a sample of 100 leaves noticeable uncertainty around the figure.


Class-wise Performance:
For class 0 (negative sentiment):

Precision: 0.88
Recall: 0.94
F1-score: 0.91
Support: 47 instances

For class 1 (positive sentiment):

Precision: 0.94
Recall: 0.89
F1-score: 0.91
Support: 53 instances


Balance:

The sample is close to balanced (47 negative, 53 positive instances).
Both classes have the same rounded F1-score (0.91), indicating balanced performance.


Precision vs Recall:

For class 0: Recall (0.94) is higher than precision (0.88). The model finds almost all of the negative reviews, but 12% of its negative predictions are actually positive reviews.
For class 1: Precision (0.94) is higher than recall (0.89). When the model predicts positive it is usually correct, but it misses about 11% of the positive reviews.
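
The precision and recall figures can be traced back to raw counts. The sketch below uses a confusion matrix reconstructed from the report and the prediction distribution (50 predicted negative, 50 predicted positive, 47 reviews labeled 0, 53 labeled 1). The counts are an inference from those rounded figures, not output produced by the notebook, and `prf` is a hypothetical helper name.

```python
# Counts inferred from the classification report (assumption, not notebook output):
# 44 negatives predicted negative, 3 negatives predicted positive,
# 6 positives predicted negative, 47 positives predicted positive.
tn, fp, fn, tp = 44, 3, 6, 47


def prf(hits, false_alarms, misses):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# For class 0, a "hit" is a negative review predicted negative; its false alarms
# are positive reviews predicted negative (fn), and its misses are fp.
class0 = prf(tn, fn, fp)
class1 = prf(tp, fp, fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print("class 0:", [round(v, 2) for v in class0])
print("class 1:", [round(v, 2) for v in class1])
print("accuracy:", round(accuracy, 2))
```

Under these counts, six of the nine errors are positive reviews predicted as negative, which is the source of the lower recall for class 1.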

The next steps might include testing on a larger dataset, fine-tuning the model if needed, and potentially deploying it for real-world use if the performance holds on larger test sets.
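
To gauge how far the 91% figure could move on a larger test set, one option is a confidence interval for the accuracy. The sketch below computes a 95% Wilson score interval for 91 correct out of 100. The choice of the Wilson interval and the helper name `wilson_interval` are assumptions made for illustration and are not part of the original analysis.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(91, 100)
print(f"95% interval for accuracy: {low:.3f} to {high:.3f}")
```

With only 100 reviews the interval spans roughly 84% to 95%, which is why a larger sample is worth running before drawing firm conclusions.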

In [15]:
# Sample of misclassified instances
misclassified = pd.DataFrame({
    'true_label': true_labels,
    'predicted': predictions,
    'text': data
})
misclassified = misclassified[misclassified['true_label'] != misclassified['predicted']]

print("\nSample of misclassified instances:")
for _, row in misclassified.head().iterrows():
    print(f"True label: {row['true_label']}")
    print(f"Predicted: {row['predicted']}")
    print(f"Text: {row['text'][:100]}...")  # Print first 100 characters
    print()


Sample of misclassified instances:
True label: 1
Predicted: 0
Text: I just watched it for the second time today and I must say with all my heart it is about damn time t...

True label: 1
Predicted: 0
Text: The first Disney animated film without the strong involvement of Disney himself, this film suffers f...

True label: 0
Predicted: 1
Text: Man with the Screaming Brain is a story of greed, betrayal and revenge in the a small Bulgarian town...

True label: 1
Predicted: 0
Text: This, which was shown dubbed in Italian at a Rome cinema (not as bad as it sounds) after being prese...

True label: 1
Predicted: 0
Text: First they came for the Communists, and I didn't speak up, because I wasn't a Communist. Then they c...



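Positive Misclassifications: Four of the five misclassified instances shown are positive reviews incorrectly classified as negative. The classification report points the same way across the whole sample: its figures imply nine errors in total, six positive reviews predicted as negative and three negative reviews predicted as positive. This suggests that the model leans slightly towards negative classifications, or struggles more with identifying positive sentiment, although nine errors are too few to be conclusive.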


---


Ambiguous Language: Some of these reviews use language that could be interpreted as negative out of context. For example:

"it is about damn time" could be seen as frustration without further context.
"this film suffers" sounds negative but might be part of a larger positive statement.
"greed, betrayal and revenge" are negative themes but don't necessarily indicate a negative review.



---


Complex Sentiments: The last example, quoting a famous poem about the Holocaust, demonstrates that some reviews might discuss serious or negative topics while still positively reviewing the film. This complexity can be challenging for sentiment analysis models.



---


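Contextual Understanding: The model seems to struggle with reviews that require broader contextual understanding. For instance, the review mentioning Disney might be positive overall despite mentioning some drawbacks. The input truncation applied before prediction likely contributes, because the model only saw the opening of each long review and a verdict stated later would be cut off.


---


Non-English Content: The mention of "Rome cinema" and "Italian" suggests that some reviews might contain non-English elements, which could confuse the model since it is trained for English-language text. The excerpt shown is itself in English, so this remains a possibility rather than a confirmed cause.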


---

To conclude, the model might need fine-tuning to better recognize positive sentiments, especially when they're expressed alongside negative elements or serious themes. These misclassifications highlight the challenges of sentiment analysis in domains like film reviews, where critiques often contain both positive and negative elements.