# Sentiment Analysis on Movie Reviews
<hr style="clear:both">

This notebook demonstrates our approach to sentiment analysis on a dataset of movie reviews. We used a pre-trained DistilBERT model from HuggingFace (`distilbert-base-uncased-finetuned-sst-2-english`) to classify each review as either positive or negative. The run was done on Google Colab on a T4 GPU, and took around 4 hours to complete.

**Project Mentor:** [Shuo Wen](http://personnes.epfl.ch/shuo.wen) ([Email](mailto:shuo.wen@epfl.ch)),
**Authors:** Mahmoud Dokmak, Matthieu Borello, Léo Brunneau, Loïc Domingos, Bastien Armstrong

<hr style="clear:both">

In [1]:
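# Uncomment the lines below when running on Google Colab:
# they mount Google Drive and switch to the project folder.
#from google.colab import drive
#drive.mount('/content/drive')
#import os
#os.chdir('/content/drive/MyDrive/ada-2024-project-gear5_mah')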

In [2]:
import pandas as pd
from src.models.sentiment_analysis import sentiment_analysis

In [3]:
df_for_reviews = pd.read_pickle("./data/merged_data_with_reviews.pkl")

In [4]:
df_for_reviews

Unnamed: 0,index,item_id,review,freebase_movie_id,movie_name,vote_average,vote_count,popularity,movie_countries_final,movie_genres_final,combined_release_date,Box_Office,imdb_id,avgRating
828299,2044046,1,Horrible message; I was well prepared to like ...,/m/0dyb1,Toy Story,7.971,17152,78.404,United States of America,"Buddy film, Adventure, Children's/Family, Comp...",1995-11-19,361958736.0,114709,3.89146
828302,2044054,1,Great As The Other One!; This Toy Story is as ...,/m/0dyb1,Toy Story,7.971,17152,78.404,United States of America,"Buddy film, Adventure, Children's/Family, Comp...",1995-11-19,361958736.0,114709,3.89146
828305,2044063,1,Pixar kick ass!; Yap. Those talented people at...,/m/0dyb1,Toy Story,7.971,17152,78.404,United States of America,"Buddy film, Adventure, Children's/Family, Comp...",1995-11-19,361958736.0,114709,3.89146
828306,2044069,1,Great!; What a great movie!! Once again Pixar ...,/m/0dyb1,Toy Story,7.971,17152,78.404,United States of America,"Buddy film, Adventure, Children's/Family, Comp...",1995-11-19,361958736.0,114709,3.89146
828308,2044077,1,Decent enough but no milestone; Think of momen...,/m/0dyb1,Toy Story,7.971,17152,78.404,United States of America,"Buddy film, Adventure, Children's/Family, Comp...",1995-11-19,361958736.0,114709,3.89146
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1066793,2349835,182351,one of the best movies I've seen; It seemed al...,/m/0b6g2ym,Fidelity,5.900,41,11.397,France,Drama,2000-04-05,,204761,3.00000
1066794,2349836,182351,nearly 3 hours of lost time; This movie was a ...,/m/0b6g2ym,Fidelity,5.900,41,11.397,France,Drama,2000-04-05,,204761,3.00000
1066795,2349837,182351,Test your loyalty to the limit; Billed as a hi...,/m/0b6g2ym,Fidelity,5.900,41,11.397,France,Drama,2000-04-05,,204761,3.00000
1066796,2349838,182351,a bold but flawed attempt; I like French cinem...,/m/0b6g2ym,Fidelity,5.900,41,11.397,France,Drama,2000-04-05,,204761,3.00000


In [5]:
df_reviews_copy = df_for_reviews.copy()

In [6]:
output_file = "./data/sentiment_analysis.csv"

In [7]:
df_with_sentiment = sentiment_analysis(df_reviews_copy, "review", "item_id", "index", output_file)

distilbert-base-uncased-finetuned-sst-2-english


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

  0%|          | 11/1293917 [00:01<26:43:20, 13.45it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 1293917/1293917 [2:09:25<00:00, 166.62it/s]
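
The warning above suggests that feeding reviews to the pipeline one at a time leaves the GPU underused. The internals of `sentiment_analysis` are not shown in this notebook, so the following is only a minimal sketch of how reviews could be grouped into batches before classification. The names `batched`, `classify_in_batches`, and `fake_classifier` are hypothetical and are not part of our script; `fake_classifier` is a keyword-based stand-in so the example runs without a model download.

```python
from typing import Callable, Iterable, Iterator, List


def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def classify_in_batches(
    texts: Iterable[str],
    classify_batch: Callable[[List[str]], List[dict]],
    batch_size: int = 32,
) -> List[dict]:
    """Apply `classify_batch` to each batch and keep results in input order."""
    results: List[dict] = []
    for batch in batched(texts, batch_size):
        results.extend(classify_batch(batch))
    return results


def fake_classifier(batch: List[str]) -> List[dict]:
    """Hypothetical stand-in for a real model: a crude keyword rule."""
    return [
        {"label": "POSITIVE" if "great" in text.lower() else "NEGATIVE", "score": 1.0}
        for text in batch
    ]


reviews = ["Great!", "nearly 3 hours of lost time", "Great As The Other One!"]
predictions = classify_in_batches(reviews, fake_classifier, batch_size=2)
print(predictions)
```

With the real model, `classify_batch` could be a thin wrapper around a HuggingFace `pipeline` object, which also accepts a `batch_size` argument directly. Whether batching actually speeds things up depends on the hardware and on how much the review lengths vary, so it is worth timing on a small sample first.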


In [8]:
# Our script returns a DataFrame and also saves it as a CSV file
df_with_sentiment

Unnamed: 0,sentiment_analysis,item_id,index
828299,"{'label': 'NEGATIVE', 'score': 0.9858629107475...",1,2044046
828302,"{'label': 'POSITIVE', 'score': 0.9998522996902...",1,2044054
828305,"{'label': 'POSITIVE', 'score': 0.9990456700325...",1,2044063
828306,"{'label': 'POSITIVE', 'score': 0.9998844861984...",1,2044069
828308,"{'label': 'POSITIVE', 'score': 0.9972469806671...",1,2044077
...,...,...,...
1066793,"{'label': 'POSITIVE', 'score': 0.9996471405029...",182351,2349835
1066794,"{'label': 'NEGATIVE', 'score': 0.9994913339614...",182351,2349836
1066795,"{'label': 'NEGATIVE', 'score': 0.9910902976989...",182351,2349837
1066796,"{'label': 'POSITIVE', 'score': 0.7443559765815...",182351,2349838


In [9]:
# Now we can reload the saved CSV file containing the sentiment analysis results
df_sa = pd.read_csv("./data/sentiment_analysis.csv")
df_sa

Unnamed: 0.1,Unnamed: 0,sentiment_analysis,item_id,index
0,828299,"{'label': 'NEGATIVE', 'score': 0.9858629107475...",1,2044046
1,828302,"{'label': 'POSITIVE', 'score': 0.9998522996902...",1,2044054
2,828305,"{'label': 'POSITIVE', 'score': 0.9990456700325...",1,2044063
3,828306,"{'label': 'POSITIVE', 'score': 0.9998844861984...",1,2044069
4,828308,"{'label': 'POSITIVE', 'score': 0.9972469806671...",1,2044077
...,...,...,...,...
1293912,1066793,"{'label': 'POSITIVE', 'score': 0.9996471405029...",182351,2349835
1293913,1066794,"{'label': 'NEGATIVE', 'score': 0.9994913339614...",182351,2349836
1293914,1066795,"{'label': 'NEGATIVE', 'score': 0.9910902976989...",182351,2349837
1293915,1066796,"{'label': 'POSITIVE', 'score': 0.7443559765815...",182351,2349838
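
After the CSV round trip, the `sentiment_analysis` column should hold the text representation of each dict rather than a dict object, so it needs parsing before the label and score can be used. The sketch below shows one way to do this with the standard library. The helper names `parse_sentiment` and `signed_score` are hypothetical, the signed-score convention is our assumption rather than something the model outputs, and the example cell uses an illustrative value, not an actual row.

```python
import ast
from typing import Tuple


def parse_sentiment(cell: str) -> Tuple[str, float]:
    """Turn a stringified {'label': ..., 'score': ...} dict into (label, score)."""
    record = ast.literal_eval(cell)  # only evaluates Python literals, not arbitrary code
    return record["label"], float(record["score"])


def signed_score(label: str, score: float) -> float:
    """Assumed convention: keep the score for POSITIVE, negate it otherwise."""
    return score if label == "POSITIVE" else -score


# Illustrative cell in the same shape as the column above (not a real row).
cell = "{'label': 'NEGATIVE', 'score': 0.98}"
label, score = parse_sentiment(cell)
print(label, signed_score(label, score))
```

On the full table, the same helper could be applied column-wise, for example with `df_sa["sentiment_analysis"].apply(parse_sentiment)`.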
