<a href="https://colab.research.google.com/github/grosa1/sentiment-analysis-example/blob/main/analysis_with_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis on IMDb dataset - pt.2

This notebook reports an experiment with sentence embeddings for sentiment analysis. The aim is to build a resource-efficient model that classifies the sentiment of movie reviews from the IMDb dataset with the highest possible accuracy.
In particular, it uses `SentenceTransformers` to extract sentence embeddings, which serve as feature vectors for several machine learning models.
The results show that *SVC* is the best-performing model, with an accuracy of 0.897 (about 0.90). *LogisticRegression* follows closely at 0.893 and, in this run, needed about 2 seconds to fit and predict, against 434 seconds for *SVC*.

The remainder of this notebook is structured as follows: Section 0 installs the required dependencies, Section 1 reports the steps to download and load the IMDb dataset, and Section 2 reports the model training and evaluation.

## 0. Initial setup

In [1]:
!pip install sentence-transformers==2.2.2

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=87935dcf931e7c3dc2ed0bb1a3d2f0fd1f675f6aec75d0c70975a8525166da37
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-t

## 1. Load dataset

In [2]:
import os

import pandas as pd
from bs4 import BeautifulSoup

def remove_html_tags(text):
    """Strip HTML markup (e.g. <br /> tags) and return the plain text."""
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

def load_reviews(split_dir):
    """Load one split as a list of records; 'pos' maps to label 1, 'neg' to 0."""
    records = []
    for sentiment, label in (('pos', 1), ('neg', 0)):
        folder = os.path.join(split_dir, sentiment)
        for f in os.listdir(folder):
            # File names follow <id>_<rating>.txt; read explicitly as UTF-8
            # rather than relying on the platform default encoding
            with open(os.path.join(folder, f), 'r', encoding='utf-8') as file:
                records.append({
                    "text": remove_html_tags(file.read()),
                    "score": int(f.split('.')[0].split('_')[1]),
                    "label": label
                })
    return records

# Download and extract the IMDB dataset
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -O - | tar -xz

TRAIN_DIR = 'aclImdb/train/'
df_train = pd.DataFrame(load_reviews(TRAIN_DIR))
df_train["text_len"] = df_train["text"].str.len()  # review length in characters

TEST_DIR = 'aclImdb/test/'
df_test = pd.DataFrame(load_reviews(TEST_DIR))
df_test.head()

--2023-12-20 21:37:46--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘STDOUT’


2023-12-20 21:38:04 (4.55 MB/s) - written to stdout [84125825/84125825]





|   | text | score | label |
|---|------|-------|-------|
| 0 | A nurse travels to a rural psychiatric clinic ... | 8 | 1 |
| 1 | as a 'physically challenged' person (god, how ... | 9 | 1 |
| 2 | This has got to be one of the best episodes of... | 10 | 1 |
| 3 | I was surprised and impressed to find out this... | 10 | 1 |
| 4 | There I was on vacation when my host suggested... | 9 | 1 |
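
The `score` column comes from the file name: in the IMDb archive, review files are named `<id>_<rating>.txt`. A minimal sketch of that parsing step follows. `parse_rating` is a hypothetical helper, not part of the notebook, and the file names are made up for illustration.

```python
def parse_rating(filename):
    """Return the rating encoded in a '<id>_<rating>.txt' file name."""
    stem = filename.split('.')[0]      # '123_9.txt' -> '123_9'
    return int(stem.split('_')[1])     # '123_9'     -> 9

# Illustrative file names, not taken from the dataset listing
for name in ("0_9.txt", "10327_10.txt", "42_1.txt"):
    print(name, "->", parse_rating(name))
```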


In [3]:
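# drop_duplicates() returns a new DataFrame; the results are not assigned here,
# so df_train and df_test stay unchanged and all 25,000 rows per split are used below.
df_train.drop_duplicates()
df_test.drop_duplicates()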

|   | text | score | label |
|---|------|-------|-------|
| 0 | A nurse travels to a rural psychiatric clinic ... | 8 | 1 |
| 1 | as a 'physically challenged' person (god, how ... | 9 | 1 |
| 2 | This has got to be one of the best episodes of... | 10 | 1 |
| 3 | I was surprised and impressed to find out this... | 10 | 1 |
| 4 | There I was on vacation when my host suggested... | 9 | 1 |
| ... | ... | ... | ... |
| 24995 | I was fascinated to read the range of opinions... | 2 | 0 |
| 24996 | I went into this movie with an open mind. I ha... | 1 | 0 |
| 24997 | Worst movie ever seen. Worst acting too. I can... | 1 | 0 |
| 24998 | From reading the back of the box my first thou... | 1 | 0 |
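
Note that `drop_duplicates()` returns a new DataFrame rather than modifying the original in place. A toy sketch of the difference, with data invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"text": ["good", "good", "bad"], "label": [1, 1, 0]})

df.drop_duplicates()             # result discarded, df is unchanged
rows_before = len(df)

# Keep the result and rebuild a contiguous 0..n-1 index
deduped = df.drop_duplicates().reset_index(drop=True)
rows_after = len(deduped)

print(rows_before, rows_after)
```

Resetting the index keeps row labels contiguous, which matters if the text column is later accessed by position.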


## 2. Model training and validation

In [4]:
from sentence_transformers import SentenceTransformer
import numpy as np
import joblib

In [5]:
# Load a pre-trained SentenceTransformer model on the GPU (requires a CUDA runtime)
encoder = SentenceTransformer('all-mpnet-base-v2', device='cuda')

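
The cell above assumes a GPU runtime. As a hedged sketch, the device could instead be chosen at run time. `pick_device` is a hypothetical helper, not part of the notebook; `torch.cuda.is_available()` is the standard PyTorch check.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to the CPU label
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print("Using device:", device)
# encoder = SentenceTransformer('all-mpnet-base-v2', device=device)
```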

### 2.1 Feature extraction

In [6]:
# Pass a plain list of strings so encoding does not depend on the DataFrame index
embeddings_train = encoder.encode(df_train["text"].tolist(), convert_to_tensor=False, show_progress_bar=True)

Batches:   0%|          | 0/782 [00:00<?, ?it/s]

In [7]:
# Pass a plain list of strings so encoding does not depend on the DataFrame index
embeddings_test = encoder.encode(df_test["text"].tolist(), convert_to_tensor=False, show_progress_bar=True)

Batches:   0%|          | 0/782 [00:00<?, ?it/s]
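
The progress bars report 782 batches per split. Assuming `encode` runs with its default `batch_size` of 32 and each split holds 25,000 reviews, the count follows from a ceiling division:

```python
import math

n_reviews = 25_000   # reviews per split (assumed from the dataset size)
batch_size = 32      # assumed default of SentenceTransformer.encode

n_batches = math.ceil(n_reviews / batch_size)
print(n_batches)  # 782
```

A larger `batch_size` would reduce the number of batches at the cost of more GPU memory per step.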

In [8]:
joblib.dump(embeddings_train, "embeddings_train.joblib")
joblib.dump(embeddings_test, "embeddings_test.joblib")

['embeddings_test.joblib']
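
Since encoding is the slowest preprocessing step, the saved files can be reused on later runs. A minimal sketch, assuming the file names used above; `load_or_compute` is a hypothetical helper and `compute_fn` stands in for the call to `encoder.encode`.

```python
import os
import joblib

def load_or_compute(path, compute_fn):
    """Load a cached object from `path`, or compute it once and cache it."""
    if os.path.exists(path):
        return joblib.load(path)
    result = compute_fn()
    joblib.dump(result, path)
    return result

# Example (commented out because it needs the encoder and the dataset):
# embeddings_train = load_or_compute(
#     "embeddings_train.joblib",
#     lambda: encoder.encode(df_train["text"].tolist(), show_progress_bar=True),
# )
```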

In [9]:
y_train = df_train['label']
y_test = df_test['label']

### 2.2 Prediction

In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score
import time

In [16]:
models = [
    RandomForestClassifier(random_state=1),
    LogisticRegression(random_state=1),
    SVC(random_state=1),
    KNeighborsClassifier(),
    AdaBoostClassifier(random_state=1),
    GradientBoostingClassifier(random_state=1),
]

In [17]:
res = list()
preds = list()

for m in models:
    t_start = time.time()
    model = m.fit(embeddings_train, y_train)
    y_pred = model.predict(embeddings_test)
    # The elapsed time covers both fit() and predict() on the test set
    print("==> model:", m.__class__.__name__, "- training time (s):", int(time.time() - t_start))
    print(classification_report(y_test, y_pred))
    print()

    preds.append(y_pred)
    res.append({
        "model": model.__class__.__name__,
        # NOTE: the features are sentence embeddings; the "tfidf" label is kept as recorded in the results
        "features": "tfidf",
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        # Computed from hard 0/1 predictions, so this equals the mean of the per-class recalls
        "roc_auc": roc_auc_score(y_test, y_pred)
    })

==> model: RandomForestClassifier - training time (s): 99
              precision    recall  f1-score   support

           0       0.86      0.83      0.84     12500
           1       0.83      0.86      0.85     12500

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000


==> model: LogisticRegression - training time (s): 2
              precision    recall  f1-score   support

           0       0.90      0.89      0.89     12500
           1       0.89      0.90      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000


==> model: SVC - training time (s): 434
              precision    recall  f1-score   support

           0       0.89      0.90      0.90     12500
           1       0.90      0.89      0.90     12500

    accuracy                 

In [18]:
joblib.dump(res, 'res_embeddings.pkl')
joblib.dump(preds, 'preds_embeddings.pkl')

['preds_embeddings.pkl']
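
The time printed in the loop above covers both `fit` and `predict`. To compare resource use more precisely, the two phases could be timed separately. The sketch below uses a hypothetical `timed_fit_predict` helper that works with any object exposing `fit` and `predict`; `ConstantModel` is a stand-in used only to keep the example self-contained.

```python
import time

def timed_fit_predict(model, X_train, y_train, X_test):
    """Fit and predict, returning predictions plus separate timings in seconds."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    y_pred = model.predict(X_test)
    t2 = time.perf_counter()
    return y_pred, t1 - t0, t2 - t1

class ConstantModel:
    """Toy stand-in: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

y_pred, fit_s, predict_s = timed_fit_predict(
    ConstantModel(), [[0], [1], [2]], [1, 1, 0], [[3], [4]]
)
print(y_pred, f"fit={fit_s:.6f}s", f"predict={predict_s:.6f}s")
```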

### 2.3 Results

In [19]:
metrics_df = pd.DataFrame(res).sort_values(by="accuracy", ascending=False)
metrics_df

|   | model | features | accuracy | precision | recall | f1 | roc_auc |
|---|-------|----------|----------|-----------|--------|----|---------|
| 2 | SVC | tfidf | 0.897 | 0.899782 | 0.89352 | 0.89664 | 0.897 |
| 1 | LogisticRegression | tfidf | 0.89328 | 0.888003 | 0.90008 | 0.894001 | 0.89328 |
| 5 | GradientBoostingClassifier | tfidf | 0.85836 | 0.850152 | 0.87008 | 0.860001 | 0.85836 |
| 0 | RandomForestClassifier | tfidf | 0.84552 | 0.832589 | 0.86496 | 0.848466 | 0.84552 |
| 4 | AdaBoostClassifier | tfidf | 0.82972 | 0.825476 | 0.83624 | 0.830823 | 0.82972 |
| 3 | KNeighborsClassifier | tfidf | 0.785 | 0.839706 | 0.70448 | 0.766172 | 0.785 |
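
The `roc_auc` column equals `accuracy` in every row. This follows from how the score is computed: with hard 0/1 predictions the ROC curve has a single operating point, so the area reduces to the mean of the two per-class recalls, and with balanced classes (12,500 reviews each) that mean is the accuracy. A worked check using the SVC row follows; the negative-class recall is derived here from accuracy and recall, not read from the table.

```python
accuracy = 0.897       # SVC accuracy from the table
recall_pos = 0.89352   # SVC recall on the positive class (true positive rate)

# With balanced classes: accuracy = (recall_pos + recall_neg) / 2
recall_neg = 2 * accuracy - recall_pos

tpr = recall_pos
fpr = 1 - recall_neg

# Trapezoidal area under the ROC curve through (0, 0), (fpr, tpr), (1, 1)
auc_hard = 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)

print(round(recall_neg, 5), round(auc_hard, 5))
```

Passing continuous scores to `roc_auc_score`, such as the output of `decision_function` or `predict_proba`, would give a threshold-independent value instead.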


In [20]:
metrics_df.to_csv("results_embeddings.csv", index=False)