# IMDB Sentiment Analysis

## Authors
1. Jakub Swistak
2. Nikita Kozlov
3. Jacek Zalewski
4. Zosia Lagiewka

## Dataset
We use the IMDB dataset with its predefined train/test split, available [here](https://huggingface.co/datasets/stanfordnlp/imdb).

## Methods
We will try several methods based on embedding models.

## Outcome
The outcome will be a set of metrics for every tested model and data-processing pipeline.


## Introduction
In this notebook, we will perform sentiment analysis on the IMDB dataset using various embedding-based models. The goal is to compare the performance of different models and data-processing pipelines.


In [21]:
!pip uninstall -y numpy pandas
!pip install llmware numpy pandas seaborn

Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Found existing installation: pandas 2.2.3
Uninstalling pandas-2.2.3:
  Successfully uninstalled pandas-2.2.3
Collecting numpy
  Using cached numpy-2.1.2-cp310-cp310-macosx_14_0_arm64.whl.metadata (60 kB)
Collecting pandas
  Using cached pandas-2.2.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (89 kB)
Collecting numpy
  Using cached numpy-1.26.4-cp310-cp310-macosx_11_0_arm64.whl.metadata (61 kB)
Using cached pandas-2.2.3-cp310-cp310-macosx_11_0_arm64.whl (11.3 MB)
Using cached numpy-1.26.4-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
Installing collected packages: numpy, pandas
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
hdbscan 0.8.33 requires cython<3,>=0.27, but you have cython 3.0.10 which is incompatible.
sybil 1.5.0 requires 

In [1]:
# Load the IMDB dataset
# %pip install transformers datasets torch

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json
import re
from sklearn.metrics import f1_score
from textblob import TextBlob
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from llmware.models import ModelCatalog



splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
imdb_dataset = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])

In [2]:
imdb_dataset.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [5]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model_scores = pd.DataFrame(columns=["model", "f1", "accuracy", "precision", "recall"])

### TextBlob

In [7]:
def get_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    return sentiment

# Score every review with TextBlob, then binarize once:
# polarity > 0 counts as positive (1), anything else as negative (0)
imdb_dataset['sentiment_blob'] = imdb_dataset['text'].apply(get_sentiment)
pred_textblob = (imdb_dataset['sentiment_blob'] > 0).astype(int)

f1_textblob = f1_score(imdb_dataset['label'], pred_textblob)
accuracy_textblob = accuracy_score(imdb_dataset['label'], pred_textblob)
precision_textblob = precision_score(imdb_dataset['label'], pred_textblob)
recall_textblob = recall_score(imdb_dataset['label'], pred_textblob)

model_scores = pd.concat([model_scores, pd.DataFrame([["TextBlob", f1_textblob, accuracy_textblob, precision_textblob, recall_textblob]], columns=["model", "f1", "accuracy", "precision", "recall"])])

model_scores


Unnamed: 0,model,f1,accuracy,precision,recall
0,TextBlob,0.750198,0.68516,0.621758,0.94552
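
As a sanity check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two reported columns. The sketch below does this in plain Python; the helper name `f1_from_precision_recall` is illustrative and not part of the notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Binary F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when the model has no true positives
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for TextBlob in the table above
textblob_f1 = f1_from_precision_recall(0.621758, 0.94552)
print(round(textblob_f1, 4))
```

The result agrees with the reported F1 of 0.750198 up to rounding. The combination of high recall (0.95) and low precision (0.62) indicates that the `polarity > 0` rule labels most reviews as positive.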


### distilbert-base-uncased-finetuned-sst-2-english

In [14]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

total = len(imdb_dataset)
current = 0  # running count of processed reviews, used only for the progress output

# Pick the device once and move the model there, instead of on every call
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)
model.eval()

def get_bert_sentiment(text):
    global current
    current += 1
    # Truncate long reviews to the 512-token limit of the model
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

    with torch.no_grad():
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        logits = model(**inputs).logits
    predicted_label = model.config.id2label[logits.argmax().item()]

    print(f"[{current}/{total}] {text[:10]} -> {predicted_label}")
    return predicted_label

imdb_dataset['sentiment_bert'] = imdb_dataset['text'].apply(get_bert_sentiment).map({'NEGATIVE': 0, 'POSITIVE': 1})

[1/25000] I rented I -> POSITIVE
[2/25000] "I Am Curi -> NEGATIVE
[3/25000] If only to -> NEGATIVE
[4/25000] This film  -> POSITIVE
[5/25000] Oh, brothe -> NEGATIVE
[6/25000] I would pu -> NEGATIVE
[7/25000] Whoever wr -> NEGATIVE
[8/25000] When I fir -> NEGATIVE
[9/25000] Who are th -> NEGATIVE
[10/25000] This is sa -> NEGATIVE
[11/25000] It was gre -> POSITIVE
[12/25000] I can't be -> NEGATIVE
[13/25000] Never cast -> NEGATIVE
[14/25000] Its not th -> NEGATIVE
[15/25000] Today I fo -> NEGATIVE
[16/25000] This film  -> NEGATIVE
[17/25000] My interes -> NEGATIVE
[18/25000] I have thi -> NEGATIVE
[19/25000] I think I  -> NEGATIVE
[20/25000] Pros: Noth -> NEGATIVE
[21/25000] If the cre -> NEGATIVE
[22/25000] 1st watche -> NEGATIVE
[23/25000] There's to -> NEGATIVE
[24/25000] En route t -> NEGATIVE
[25/25000] Without wi -> NEGATIVE
[26/25000] My girlfri -> NEGATIVE
[27/25000] Amateur, n -> NEGATIVE
[28/25000] OK its not -> NEGATIVE
[29/25000] Some films -> NEGATIVE
[30/25000] I received -

KeyboardInterrupt: 
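
The run shown above was interrupted after about 30 of 25000 reviews, with one forward pass per review. One common way to reduce per-call overhead is to process texts in batches. The sketch below shows only the batching logic in plain Python. `iter_batches`, `predict_in_batches`, and `toy_predict` are hypothetical names; `predict_batch` stands in for any function that takes a list of texts and returns one label per text, and is not how the notebook itself runs the model.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Apply `predict_batch` to each batch and concatenate the labels in order."""
    labels: List[str] = []
    for batch in iter_batches(texts, batch_size):
        batch_labels = predict_batch(batch)
        if len(batch_labels) != len(batch):
            raise ValueError("predict_batch must return one label per text")
        labels.extend(batch_labels)
    return labels


def toy_predict(batch: Sequence[str]) -> List[str]:
    """Toy stand-in for a real model, used only to make the sketch runnable."""
    return ["POSITIVE" if "great" in text.lower() else "NEGATIVE" for text in batch]


demo_labels = predict_in_batches(
    ["It was great", "Never cast", "Great fun"], toy_predict, batch_size=2
)
print(demo_labels)
```

With a real model, `predict_batch` would tokenize the whole batch with padding and truncation and take the argmax over each row of logits.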

In [51]:
# Save the DataFrame to CSV
imdb_dataset.to_csv('imdb_dataset.csv')

In [15]:
imdb_dataset = pd.read_csv('imdb_dataset.csv')

In [16]:
f1_bert = f1_score(imdb_dataset['label'], imdb_dataset['sentiment_bert'])
accuracy_bert = accuracy_score(imdb_dataset['label'], imdb_dataset['sentiment_bert'])
precision_bert = precision_score(imdb_dataset['label'], imdb_dataset['sentiment_bert'])
recall_bert = recall_score(imdb_dataset['label'], imdb_dataset['sentiment_bert'])

model_scores = pd.concat([model_scores, pd.DataFrame([["distilbert-base-uncased-finetuned-sst-2-english", f1_bert, accuracy_bert, precision_bert, recall_bert]], columns=["model", "f1", "accuracy", "precision", "recall"])])

model_scores

Unnamed: 0,model,f1,accuracy,precision,recall
0,TextBlob,0.750198,0.68516,0.621758,0.94552
0,distilbert-base-uncased-finetuned-sst-2-english,0.884697,0.88852,0.916117,0.85536


## Split the dataset, since running every sample takes a long time

In [60]:
# split the data into train and test
from sklearn.model_selection import train_test_split
train, test = train_test_split(imdb_dataset, test_size=0.2, random_state=42)


### Slim sentiment analysis

In [61]:
from llmware.models import ModelCatalog
slim_model = ModelCatalog().load_model("llmware/slim-sentiment")

def get_sentiment_llm(text):
    response = slim_model.function_call(text, params=["sentiment"], function="classify")
    return response

test['sentiment_slim_unprocessed'] = test['text'].apply(get_sentiment_llm)


INFO: update: function call output could not be automatically converted, but remediation was successful to type - dict
INFO: update: function call output could not be automatically converted, but remediation was successful to type - dict


In [62]:
imdb_dataset.to_csv('imdb_dataset2.csv')

In [17]:
imdb_dataset = pd.read_csv('imdb_dataset2.csv')

In [67]:
test.to_csv('test.csv')
test.head()

Unnamed: 0,text,label,sentiment,sentiment_bert,sentiment_slim_unprocessed
6868,"Dumb is as dumb does, in this thoroughly unint...",0,-0.040799,0,"{'llm_response': {}, 'usage': {'input': 189, '..."
24016,I dug out from my garage some old musicals and...,1,0.351402,1,"{'llm_response': {'sentiment': ['positive']}, ..."
9668,After watching this movie I was honestly disap...,0,-0.105758,0,"{'llm_response': {'sentiment': ['negative']}, ..."
13640,This movie was nominated for best picture but ...,1,0.412727,0,"{'llm_response': {'sentiment': ['negative']}, ..."
14018,Just like Al Gore shook us up with his painful...,1,0.231805,1,"{'llm_response': {'sentiment': ['positive']}, ..."


In [19]:
test["sentiment_slim_processed"] = test["sentiment_slim_unprocessed"].apply(lambda x: x['llm_response'])

TypeError: string indices must be integers

In [20]:
test["sentiment_slim"] = test["sentiment_slim_processed"].apply(lambda x: 1 if x.get('sentiment', ['negative'])[0] == "positive" else 0)

KeyError: 'sentiment_slim_processed'
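
The `TypeError` above is consistent with the `sentiment_slim_unprocessed` column holding strings rather than dicts, which is what happens when a DataFrame is written to CSV and read back. The `KeyError` in the next cell then follows because the processed column was never created. Below is a minimal sketch of one way to recover the label, assuming the stored strings are Python dict literals like those shown in `test.head()`. The helper name `slim_label` and the sample strings are illustrative, not taken from the notebook.

```python
import ast


def slim_label(raw) -> int:
    """Return 1 for a positive prediction and 0 otherwise."""
    # Parse the string form back into a dict; real dicts pass through unchanged
    response = ast.literal_eval(raw) if isinstance(raw, str) else raw
    llm_response = response.get("llm_response", {})
    if not isinstance(llm_response, dict):
        # Unexpected shape: treat it like a missing prediction
        return 0
    sentiment = llm_response.get("sentiment", ["negative"])
    return 1 if sentiment and sentiment[0] == "positive" else 0


# Hypothetical strings modeled on the truncated values in test.head()
samples = [
    "{'llm_response': {'sentiment': ['positive']}, 'usage': {'input': 189}}",
    "{'llm_response': {'sentiment': ['negative']}, 'usage': {'input': 189}}",
    "{'llm_response': {}, 'usage': {'input': 189}}",
]
labels = [slim_label(s) for s in samples]
print(labels)
```

With pandas this could be applied as `test['sentiment_slim_unprocessed'].apply(slim_label)`. As in the original cell, an empty `llm_response` is counted as negative.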

In [76]:
test.to_csv('test2.csv')

In [21]:
test = pd.read_csv('test2.csv')

In [24]:
f1_slim = f1_score(test['label'], test['sentiment_slim'])
accuracy_slim = accuracy_score(test['label'], test['sentiment_slim'])
precision_slim = precision_score(test['label'], test['sentiment_slim'])
recall_slim = recall_score(test['label'], test['sentiment_slim'])

model_scores = pd.concat([model_scores, pd.DataFrame([["slim-sentiment", f1_slim, accuracy_slim, precision_slim, recall_slim]], columns=["model", "f1", "accuracy", "precision", "recall"])])

model_scores

Unnamed: 0,model,f1,accuracy,precision,recall
0,TextBlob,0.750198,0.68516,0.621758,0.94552
0,distilbert-base-uncased-finetuned-sst-2-english,0.884697,0.88852,0.916117,0.85536
0,slim-sentiment,0.901526,0.9006,0.887978,0.915493


### 