# Assignment

## Instructions

Use the following code as a starting point to load the Rotten Tomatoes dataset:

In [1]:
from datasets import load_dataset

# Load the Rotten Tomatoes dataset
dataset = load_dataset("rotten_tomatoes")

# Print the dataset information
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


In [2]:
# Example: Accessing the training split
train_dataset = dataset["train"]

# Print the first example in the training set
print(train_dataset[0])

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}


**Model Application:**

- Load a pre-trained sentiment analysis model from Hugging Face Transformers.
- Apply the model to a subset of the chosen dataset (e.g., the first 1000 samples from the training set).
- Evaluate the model's performance. You can start with qualitative analysis (inspecting predictions) and then explore quantitative metrics.

In [3]:
import torch
from transformers import pipeline, AutoTokenizer
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd

### Data Preparation

In [4]:
len(train_dataset)

8530

In [5]:
type(train_dataset)

datasets.arrow_dataset.Dataset

In [6]:
val_dataset  = dataset["validation"]
test_dataset = dataset["test"]

#### Convert to Pandas DataFrame

In [7]:
ds_train = train_dataset.shuffle(seed=0).select(range(1000))
ds_val = val_dataset.shuffle(seed=0).select(range(500))
ds_test = test_dataset.shuffle(seed=0).select(range(500))

In [8]:
ds_train = pd.DataFrame(ds_train)
ds_val = pd.DataFrame(ds_val)
ds_test = pd.DataFrame(ds_test)

#### Checking if the target is skewed

In [9]:
ds_train.label.value_counts()

label
0    513
1    487
Name: count, dtype: int64

In [10]:
ds_val.label.value_counts()

label
0    253
1    247
Name: count, dtype: int64

In [11]:
ds_test.label.value_counts()

label
0    253
1    247
Name: count, dtype: int64
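The three subsets look balanced. As a hedged sketch, the counts printed above can be turned into class proportions and a majority-class baseline (the accuracy of always predicting the most frequent label), which gives a reference point for the model accuracies reported later. The helper name `majority_baseline` is illustrative and not part of the original notebook.

```python
# Label counts copied from the value_counts() outputs above.
label_counts = {
    "train": {0: 513, 1: 487},
    "validation": {0: 253, 1: 247},
    "test": {0: 253, 1: 247},
}

def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())

for split, counts in label_counts.items():
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    print(split, shares, f"baseline={majority_baseline(counts):.3f}")
```

On the DataFrames themselves, `ds_train.label.value_counts(normalize=True)` gives the same proportions directly.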

### List of Pre-trained Models

In [12]:
# model_0 = "distilbert-base-uncased-finetuned-sst-2-english"  # not used: it returns POSITIVE/NEGATIVE labels instead of LABEL_0/LABEL_1
model_1 = 'textattack/xlnet-base-cased-rotten-tomatoes'
model_2 = 'textattack/roberta-base-rotten-tomatoes'
model_3 = 'textattack/albert-base-v2-rotten-tomatoes'
model_4 = "textattack/distilbert-base-uncased-rotten-tomatoes"
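The commented-out `model_0` reports `POSITIVE`/`NEGATIVE`, while the textattack checkpoints report `LABEL_0`/`LABEL_1`. A minimal sketch of how both label schemes could be mapped to the dataset's integer labels, so that models with different label names can be compared. The function name `normalize_label` and the mapping table are assumptions for illustration; check each model's `id2label` config before relying on it.

```python
# Hypothetical helper: map pipeline label strings to the dataset's 0/1 labels.
LABEL_TO_INT = {
    "LABEL_0": 0, "LABEL_1": 1,    # textattack/*-rotten-tomatoes checkpoints
    "NEGATIVE": 0, "POSITIVE": 1,  # e.g. distilbert-base-uncased-finetuned-sst-2-english
}

def normalize_label(raw_label):
    """Return 0 (negative) or 1 (positive); raise on unknown labels instead of guessing."""
    try:
        return LABEL_TO_INT[raw_label.upper()]
    except KeyError:
        raise ValueError(f"Unknown label: {raw_label!r}") from None

print([normalize_label(x) for x in ["LABEL_1", "NEGATIVE", "positive"]])
```

Raising on an unknown label avoids the silent `NaN` values that a plain dictionary `.map` would produce for unmapped strings.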

### Model Configuration

In [13]:
MODEL = model_2

sentiment = pipeline(
    "sentiment-analysis",
    model=MODEL,
    tokenizer=MODEL,
    batch_size=64,
    top_k=None,
)

Some weights of the model checkpoint at textattack/roberta-base-rotten-tomatoes were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [14]:
# (Optional) move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sentiment.model.to(device)

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

### Make Predictions on the Validation Dataset

In [15]:
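# Return the top label and its score for a single text.
# With top_k=None the pipeline returns a score for every label; [0][0] takes the first, highest-scoring entry.
def sentiment_predict(text):
    pred = sentiment(text)[0][0]
    return pred['label'], pred['score']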

In [16]:
# Make predictions on the validation dataset
predictions = ds_val['text'].apply(sentiment_predict)
ds_val['pred_label'] = [x[0] for x in predictions]
ds_val['pred_score'] = [x[1] for x in predictions]
ds_val


| | text | label | pred_label | pred_score |
|---|---|---|---|---|
| 0 | sensitively examines general issues of race an... | 1 | LABEL_1 | 0.982658 |
| 1 | a bodice-ripper for intellectuals . | 1 | LABEL_0 | 0.925646 |
| 2 | crummy | 0 | LABEL_0 | 0.991047 |
| 3 | more of the same old garbage hollywood has bee... | 0 | LABEL_0 | 0.996948 |
| 4 | reggio and glass put on an intoxicating show . | 1 | LABEL_1 | 0.995789 |
| ... | ... | ... | ... | ... |
| 495 | like its parade of predecessors , this hallowe... | 0 | LABEL_0 | 0.594332 |
| 496 | the film does give a pretty good overall pictu... | 1 | LABEL_1 | 0.934354 |
| 497 | insufferably naive . | 0 | LABEL_0 | 0.968480 |
| 498 | the movie is silly beyond comprehension , and ... | 0 | LABEL_0 | 0.972574 |
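Calling the pipeline once per row through `.apply` does not benefit from `batch_size=64`, because each call only ever sees one text. A hedged sketch of a batched alternative: pass the whole list of texts in one call and keep the highest-scoring entry per text. It assumes that, with `top_k=None`, the pipeline returns one list of `{'label', 'score'}` dicts per input text; `predict_batch` and the stand-in classifier are illustrative names, not part of the original notebook.

```python
def predict_batch(classifier, texts):
    """Run the classifier once on all texts; return (labels, scores) of the top prediction."""
    results = classifier(list(texts))
    labels, scores = [], []
    for per_text in results:
        # Pick the highest-scoring entry explicitly rather than relying on ordering.
        best = max(per_text, key=lambda d: d["score"])
        labels.append(best["label"])
        scores.append(best["score"])
    return labels, scores

# Stand-in for the real pipeline so the sketch runs without downloading a model.
def fake_classifier(texts):
    return [
        [{"label": "LABEL_0", "score": 0.9}, {"label": "LABEL_1", "score": 0.1}]
        if "crummy" in t else
        [{"label": "LABEL_1", "score": 0.8}, {"label": "LABEL_0", "score": 0.2}]
        for t in texts
    ]

labels, scores = predict_batch(fake_classifier, ["crummy", "an intoxicating show ."])
print(labels, scores)
```

With the notebook's objects, the call would look like `predict_batch(sentiment, ds_val['text'].tolist())`, and the two returned lists can be assigned to `pred_label` and `pred_score`.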


In [17]:
label_map = {"LABEL_0": 0,"LABEL_1": 1}
ds_val['pred_label'] = ds_val['pred_label'].map(label_map)
ds_val

| | text | label | pred_label | pred_score |
|---|---|---|---|---|
| 0 | sensitively examines general issues of race an... | 1 | 1 | 0.982658 |
| 1 | a bodice-ripper for intellectuals . | 1 | 0 | 0.925646 |
| 2 | crummy | 0 | 0 | 0.991047 |
| 3 | more of the same old garbage hollywood has bee... | 0 | 0 | 0.996948 |
| 4 | reggio and glass put on an intoxicating show . | 1 | 1 | 0.995789 |
| ... | ... | ... | ... | ... |
| 495 | like its parade of predecessors , this hallowe... | 0 | 0 | 0.594332 |
| 496 | the film does give a pretty good overall pictu... | 1 | 1 | 0.934354 |
| 497 | insufferably naive . | 0 | 0 | 0.968480 |
| 498 | the movie is silly beyond comprehension , and ... | 0 | 0 | 0.972574 |


### Display Correct and Incorrect Classification

In [18]:
ds_val["correct"] = ds_val["label"] == ds_val["pred_label"]

print("### Correct examples")
display(ds_val[ds_val["correct"]].sample(5)[["text","label","pred_label","pred_score"]])

print("\n### Incorrect examples")
display(ds_val[~ds_val["correct"]].sample(5)[["text","label","pred_label","pred_score"]])

### Correct examples


| | text | label | pred_label | pred_score |
|---|---|---|---|---|
| 359 | about schmidt is nicholson's goofy , heartfelt... | 1 | 1 | 0.989127 |
| 320 | when 'science fiction' takes advantage of the ... | 0 | 0 | 0.989822 |
| 291 | some of the visual flourishes are a little too... | 1 | 1 | 0.994428 |
| 464 | es una de esas películas de las que uno sale r... | 1 | 1 | 0.894351 |
| 197 | halfway through , however , having sucked dry ... | 0 | 0 | 0.995286 |



### Incorrect examples


| | text | label | pred_label | pred_score |
|---|---|---|---|---|
| 385 | " home movie " is the film equivalent of a lov... | 1 | 0 | 0.963905 |
| 242 | ou've got to love a disney pic with as little ... | 1 | 0 | 0.567726 |
| 84 | if you're the kind of parent who enjoys intent... | 1 | 0 | 0.96443 |
| 28 | the salton sea has moments of inspired humour ... | 1 | 0 | 0.852354 |
| 273 | full of bland hotels , highways , parking lots... | 1 | 0 | 0.548238 |
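For the qualitative analysis, the most informative mistakes are often those made with high confidence. A small sketch that ranks the five incorrect examples displayed above by `pred_score`; the rows are copied from that sample, and the variable names are illustrative.

```python
# (index, true label, predicted label, score) copied from the incorrect-examples sample above.
incorrect = [
    (385, 1, 0, 0.963905),
    (242, 1, 0, 0.567726),
    (84, 1, 0, 0.96443),
    (28, 1, 0, 0.852354),
    (273, 1, 0, 0.548238),
]

# Most confident mistakes first.
ranked = sorted(incorrect, key=lambda row: row[3], reverse=True)
for idx, label, pred, score in ranked:
    print(f"row {idx}: label={label} pred={pred} score={score:.3f}")

# Mistakes the model was unsure about (score below an arbitrary 0.6 threshold).
borderline = [row[0] for row in incorrect if row[3] < 0.6]
print("borderline rows:", borderline)
```

On the full DataFrame the same ranking can be obtained with `ds_val[~ds_val["correct"]].sort_values("pred_score", ascending=False)`.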


### Confusion Matrix and Classification Report

In [19]:
y_true = ds_val["label"]
y_pred = ds_val["pred_label"]

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.3f}\n")

print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=["negative","positive"]))

print("Confusion Matrix:")
cm = confusion_matrix(y_true, y_pred)
print(cm)

Accuracy: 0.912

Classification Report:
              precision    recall  f1-score   support

    negative       0.88      0.96      0.92       253
    positive       0.96      0.86      0.91       247

    accuracy                           0.91       500
   macro avg       0.92      0.91      0.91       500
weighted avg       0.92      0.91      0.91       500

Confusion Matrix:
[[243  10]
 [ 34 213]]
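The figures in the classification report can be recomputed by hand from the confusion matrix, where rows are true labels and columns are predicted labels (the scikit-learn convention). A short worked sketch using the matrix printed above:

```python
# Confusion matrix from the validation run above: rows = true, columns = predicted.
cm = [[243, 10],
      [34, 213]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # of the predicted positives, the share that is truly positive
recall_pos = tp / (tp + fn)     # of the true positives, the share that was found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy={accuracy:.3f}")
print(f"negative: precision={precision_neg:.2f} recall={recall_neg:.2f}")
print(f"positive: precision={precision_pos:.2f} recall={recall_pos:.2f}")
```

The lower positive recall (34 positive reviews predicted as negative, against 10 errors in the other direction) indicates that the model errs toward the negative class on this subset.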


**Both models, `model_1` = `textattack/xlnet-base-cased-rotten-tomatoes` and `model_2` = `textattack/roberta-base-rotten-tomatoes`, reach a similar accuracy of about 91% on the validation subset (the output above is for `model_2`). Both models are applied to the test data next.**

### Apply Models to Test Data

In [20]:
model_1 = 'textattack/xlnet-base-cased-rotten-tomatoes' 
model_2 = 'textattack/roberta-base-rotten-tomatoes'

In [25]:
MODEL = model_1

sentiment = pipeline(
    "sentiment-analysis",
    model=MODEL,
    tokenizer=MODEL,
    batch_size=64,
    top_k=None,
)

Device set to use cpu


In [26]:
# Make predictions on the test dataset
predictions = ds_test['text'].apply(sentiment_predict)
ds_test['pred_label'] = [x[0] for x in predictions]
ds_test['pred_score'] = [x[1] for x in predictions]
ds_test

| | text | label | pred_label | pred_score |
|---|---|---|---|---|
| 0 | missteps take what was otherwise a fascinating... | 1 | LABEL_0 | 0.998013 |
| 1 | a forceful drama of an alienated executive who... | 1 | LABEL_1 | 0.999465 |
| 2 | it's supposed to be post-feminist breezy but e... | 0 | LABEL_0 | 0.997268 |
| 3 | no amount of burning , blasting , stabbing , a... | 0 | LABEL_0 | 0.992949 |
| 4 | insomnia loses points when it surrenders to a ... | 1 | LABEL_1 | 0.902779 |
| ... | ... | ... | ... | ... |
| 495 | cattaneo reworks the formula that made the ful... | 0 | LABEL_0 | 0.990291 |
| 496 | as it turns out , you can go home again . | 1 | LABEL_0 | 0.779619 |
| 497 | don't hate el crimen del padre amaro because i... | 0 | LABEL_0 | 0.974460 |
| 498 | by the end , you just don't care whether that ... | 0 | LABEL_0 | 0.820570 |


In [27]:
ds_test['pred_label'] = ds_test['pred_label'].map(label_map)
ds_test

| | text | label | pred_label | pred_score |
|---|---|---|---|---|
| 0 | missteps take what was otherwise a fascinating... | 1 | 0 | 0.998013 |
| 1 | a forceful drama of an alienated executive who... | 1 | 1 | 0.999465 |
| 2 | it's supposed to be post-feminist breezy but e... | 0 | 0 | 0.997268 |
| 3 | no amount of burning , blasting , stabbing , a... | 0 | 0 | 0.992949 |
| 4 | insomnia loses points when it surrenders to a ... | 1 | 1 | 0.902779 |
| ... | ... | ... | ... | ... |
| 495 | cattaneo reworks the formula that made the ful... | 0 | 0 | 0.990291 |
| 496 | as it turns out , you can go home again . | 1 | 0 | 0.779619 |
| 497 | don't hate el crimen del padre amaro because i... | 0 | 0 | 0.974460 |
| 498 | by the end , you just don't care whether that ... | 0 | 0 | 0.820570 |


In [28]:
y_true = ds_test["label"]
y_pred = ds_test["pred_label"]

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.3f}\n")

print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=["negative","positive"]))

print("Confusion Matrix:")
cm = confusion_matrix(y_true, y_pred)
print(cm)

Accuracy: 0.884

Classification Report:
              precision    recall  f1-score   support

    negative       0.86      0.92      0.89       253
    positive       0.91      0.85      0.88       247

    accuracy                           0.88       500
   macro avg       0.89      0.88      0.88       500
weighted avg       0.89      0.88      0.88       500

Confusion Matrix:
[[232  21]
 [ 37 210]]


### Model Evaluation

**On the test subset, the two pre-trained models produce different results: `textattack/xlnet-base-cased-rotten-tomatoes` reaches an accuracy of 88%, whereas `textattack/roberta-base-rotten-tomatoes` reaches 90% (only the XLNet run is shown above). The two accuracies are very close, but based on these results we will use `textattack/roberta-base-rotten-tomatoes` as our preferred model.**
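To judge how close 88% and 90% are on 500 samples, one rough check is a normal-approximation 95% confidence interval for an accuracy estimate. This is a sketch under the assumption that the test samples are independent; it is not part of the original analysis, and `accuracy_interval` is an illustrative name.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# XLNet test accuracy reported above: 0.884 on 500 samples.
low, high = accuracy_interval(0.884, 500)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The interval spans roughly three percentage points on either side of the estimate and contains 0.90, so the gap between the two models on this subset may well be within sampling noise. Evaluating on the full test split would narrow the interval.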

### Further Improvement

- **Develop a loop that evaluates every model for comparison; currently each model is evaluated manually.**
- **Research whether and how a pre-trained model can be fine-tuned.**
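The evaluation loop from the first point could look like the following sketch. It assumes each classifier is a callable that takes a list of texts and returns, per text, a list of `{'label', 'score'}` dicts (as the pipeline with `top_k=None` does in this notebook). `compare_models`, `make_pipeline`, and `make_stub` are hypothetical names introduced for illustration.

```python
LABEL_MAP = {"LABEL_0": 0, "LABEL_1": 1}

def compare_models(model_names, texts, labels, make_classifier):
    """Return {model_name: accuracy} for every model on the same texts and labels."""
    results = {}
    for name in model_names:
        classifier = make_classifier(name)
        outputs = classifier(list(texts))
        preds = [LABEL_MAP[max(out, key=lambda d: d["score"])["label"]] for out in outputs]
        correct = sum(int(p == y) for p, y in zip(preds, labels))
        results[name] = correct / len(labels)
    return results

def make_pipeline(name):
    """Real factory: builds a Hugging Face pipeline (downloads the model on first use)."""
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=name, tokenizer=name, batch_size=64, top_k=None)

# Stub factory so the sketch runs offline: every text is predicted as LABEL_0.
def make_stub(name):
    return lambda texts: [[{"label": "LABEL_0", "score": 1.0}] for _ in texts]

print(compare_models(["stub-model"], ["crummy", "insufferably naive ."], [0, 0], make_stub))
```

With the real models, the call would be `compare_models([model_1, model_2], ds_val['text'].tolist(), ds_val['label'].tolist(), make_pipeline)`, which evaluates both checkpoints on identical data in one pass.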

## End