# IMDB Sentiment Analysis

## Authors
1. Jakub Swistak
2. Nikita Kozlov
3. Jacek Zalewski
4. Zosia Lagiewka

## Dataset
We are using the IMDB dataset with a defined split into train/test, which can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb).

## Methods
We will try several methods based on embedding models.

## Outcome
The outcome will be a set of metrics for all tested models and data-processing pipelines.


## Introduction
In this notebook, we will perform sentiment analysis on the IMDB dataset using various embedding-based models. The goal is to compare the performance of different models and data-processing pipelines.


In [75]:
# Load the IMDB dataset
# Uncomment the next line to install the dependencies inside the notebook:
# %pip install transformers datasets torch

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json
import re
from sklearn.metrics import f1_score
from textblob import TextBlob
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from llmware.models import ModelCatalog



splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
imdb_dataset = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])


In [36]:
imdb_dataset.head()

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


### TextBlob

In [45]:
def get_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    return sentiment

# Score every review with TextBlob (polarity ranges from -1 to 1)
imdb_dataset['sentiment_blob'] = imdb_dataset['text'].apply(get_sentiment)
# Treat strictly positive polarity as label 1, everything else as label 0
textblob_pred = (imdb_dataset['sentiment_blob'] > 0).astype(int)
f1_textblob = f1_score(imdb_dataset['label'], textblob_pred)
print(f1_textblob)


0.7501983560252626
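
For reference, `f1_score` with its default settings reports the F1 of the positive class (label 1), that is, the harmonic mean of precision and recall. The sketch below recomputes that quantity in plain Python on a toy example. `polarity_to_label` and `binary_f1` are illustrative helper names that do not appear in the notebook, and the labels and polarity values are made up.

```python
def polarity_to_label(polarity: float) -> int:
    """Map a polarity in [-1, 1] to a binary label.

    Mirrors the notebook's rule: strictly positive polarity -> 1, else 0,
    so a neutral score of exactly 0.0 counts as negative.
    """
    return 1 if polarity > 0 else 0


def binary_f1(y_true, y_pred) -> float:
    """F1 of the positive class (label 1) for two equal-length sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, made up for illustration only
labels = [1, 1, 0, 0, 1]
polarities = [0.4, -0.1, 0.2, -0.3, 0.0]
preds = [polarity_to_label(p) for p in polarities]
toy_f1 = binary_f1(labels, preds)
```

Because reviews with a polarity of exactly 0 fall into the negative class, moving the threshold would change the score; that sensitivity comes from the decision rule, not from TextBlob itself.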


### distilbert-base-uncased-finetuned-sst-2-english

In [48]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")



def get_bert_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class_id = logits.argmax().item()
    return model.config.id2label[predicted_class_id]

imdb_dataset['sentiment_bert'] = imdb_dataset['text'].apply(get_bert_sentiment).map({'NEGATIVE': 0, 'POSITIVE': 1})






In [50]:
# Use a separate variable so the TextBlob score is not overwritten;
# the predictions are already 0/1 after the label mapping above
f1_bert = f1_score(imdb_dataset['label'], imdb_dataset['sentiment_bert'])
print(f1_bert)


0.8846965371726448
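
Calling the model once per review through `apply` is simple but slow on 25,000 rows. One common alternative is to score reviews in batches. The sketch below shows only the batching logic with a stand-in classifier; `iter_batches`, `predict_in_batches`, and `toy_predict` are hypothetical names. In the notebook, the stand-in would presumably be replaced by a function that tokenizes a list of texts with `padding=True` and takes the arg-max over each row of logits.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(
    texts: Sequence[str],
    predict_fn: Callable[[Sequence[str]], List[int]],
    batch_size: int = 32,
) -> List[int]:
    """Apply `predict_fn` batch by batch and return labels in input order."""
    labels: List[int] = []
    for batch in iter_batches(texts, batch_size):
        batch_labels = predict_fn(batch)
        if len(batch_labels) != len(batch):
            raise ValueError("predict_fn must return one label per input text")
        labels.extend(batch_labels)
    return labels


def toy_predict(batch: Sequence[str]) -> List[int]:
    """Stand-in classifier (NOT DistilBERT): 1 if the text contains 'good'."""
    return [1 if "good" in text.lower() else 0 for text in batch]


reviews = ["Good film", "Dull plot", "So good!", "Awful", "Not bad"]
batched_labels = predict_in_batches(reviews, toy_predict, batch_size=2)
```

Keeping the batching separate from the model call makes it easy to test the loop without downloading any weights.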


In [51]:
# Save the DataFrame to CSV
imdb_dataset.to_csv('imdb_dataset.csv')

## Split the dataset, since running the models on all of the samples takes a long time

In [60]:
# split the data into train and test
from sklearn.model_selection import train_test_split
train, test = train_test_split(imdb_dataset, test_size=0.2, random_state=42)
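
With `test_size=0.2`, the 25,000-row train split yields 20,000 rows in `train` and 5,000 rows in `test`, and `random_state=42` makes the shuffle repeatable. The plain-Python sketch below illustrates the same idea (a seeded shuffle followed by a cut). It is not scikit-learn's implementation and will not reproduce its exact row selection; `seeded_split` is a hypothetical helper.

```python
import math
import random


def seeded_split(n_rows: int, test_fraction: float, seed: int):
    """Return (train_idx, test_idx) from a reproducible shuffle of range(n_rows)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A local RNG keeps the global random state untouched
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = seeded_split(25_000, test_fraction=0.2, seed=42)
```

Only `test` is scored by the SLIM model, so its F1 is computed on 5,000 reviews, whereas the TextBlob and DistilBERT scores cover all 25,000 rows.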


### Slim sentiment analysis

In [61]:
from llmware.models import ModelCatalog
slim_model = ModelCatalog().load_model("llmware/slim-sentiment")

def get_sentiment_llm(text):
    response = slim_model.function_call(text, params=["sentiment"], function="classify")
    return response

test['sentiment_slim_unprocessed'] = test['text'].apply(get_sentiment_llm)


INFO: update: function call output could not be automatically converted, but remediation was successful to type - dict


In [62]:
imdb_dataset.to_csv('imdb_dataset2.csv')

In [67]:
test.to_csv('test.csv')
test.head()

| | text | label | sentiment | sentiment_bert | sentiment_slim_unprocessed |
|---|---|---|---|---|---|
| 6868 | Dumb is as dumb does, in this thoroughly unint... | 0 | -0.040799 | 0 | {'llm_response': {}, 'usage': {'input': 189, '... |
| 24016 | I dug out from my garage some old musicals and... | 1 | 0.351402 | 1 | {'llm_response': {'sentiment': ['positive']}, ... |
| 9668 | After watching this movie I was honestly disap... | 0 | -0.105758 | 0 | {'llm_response': {'sentiment': ['negative']}, ... |
| 13640 | This movie was nominated for best picture but ... | 1 | 0.412727 | 0 | {'llm_response': {'sentiment': ['negative']}, ... |
| 14018 | Just like Al Gore shook us up with his painful... | 1 | 0.231805 | 1 | {'llm_response': {'sentiment': ['positive']}, ... |


In [68]:
test["sentiment_slim_processed"] = test["sentiment_slim_unprocessed"].apply(lambda x: x['llm_response'])

In [70]:
test["sentiment_slim"] = test["sentiment_slim_processed"].apply(lambda x: 1 if x.get('sentiment', ['negative'])[0] == "positive" else 0)
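
The two cells above unwrap the `llm_response` dictionary and treat anything other than an explicit `positive` as negative. That includes rows where the model returned an empty response, such as row 6868 in the preview. A small sketch that makes this fallback explicit and counts how often it fires is shown below. `parse_slim_response` is a hypothetical helper, and the sample dictionaries only imitate the shapes visible in the preview.

```python
from typing import List, Optional


def parse_slim_response(response) -> Optional[int]:
    """Return 1 for 'positive', 0 for 'negative', None if no usable answer."""
    if not isinstance(response, dict):
        return None
    llm_response = response.get("llm_response")
    if not isinstance(llm_response, dict):
        return None
    values = llm_response.get("sentiment")
    if not isinstance(values, list) or not values:
        return None  # empty response, e.g. {'llm_response': {}}
    first = str(values[0]).strip().lower()
    if first == "positive":
        return 1
    if first == "negative":
        return 0
    return None  # any other value is treated as unusable


# Made-up samples that imitate the shapes shown in the preview
samples = [
    {"llm_response": {"sentiment": ["positive"]}, "usage": {"input": 100}},
    {"llm_response": {"sentiment": ["negative"]}, "usage": {"input": 120}},
    {"llm_response": {}, "usage": {"input": 189}},
]
parsed: List[Optional[int]] = [parse_slim_response(s) for s in samples]
n_missing = sum(label is None for label in parsed)
# Same fallback as the notebook: missing answers count as negative
slim_labels = [0 if label is None else label for label in parsed]
```

Reporting `n_missing` alongside the F1 score shows how much of the result depends on the default rather than on an actual model answer.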

In [74]:
f1_slim = f1_score(test['label'], test['sentiment_slim'])
print(f1_slim)

0.9015256588072122


In [76]:
test.to_csv('test2.csv')