# Project Overview

The goal of this project is to perform sentiment analysis on movie reviews, classifying each review as either positive or negative. I preprocess the text, extract features from it, and train a machine learning model to classify the reviews.

The dataset I use is the Large Movie Review Dataset, which can be found here: http://ai.stanford.edu/~amaas/data/sentiment/

I use two approaches:
- Bag of Words with a classical machine learning model (Logistic Regression)
- Word embeddings (GloVe, Global Vectors) with an LSTM-based deep learning model

## Citation

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142-150). Portland, Oregon, USA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P11-1015

In [19]:
#! wget --header="Host: ai.stanford.edu" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7" --header="Accept-Language: en-US,en;q=0.9,ur;q=0.8" --header="Referer: http://ai.stanford.edu/~amaas/data/sentiment/" "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" -c -O 'aclImdb_v1.tar.gz'

In [4]:
#!tar -xvzf aclImdb_v1.tar.gz

## Loading the data

In [20]:
# Importing the libraries

import os
import glob
import re
import nltk
import numpy as np

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [5]:
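def read_reviews(path):
    """Read every review under path/pos and path/neg.

    Returns two parallel lists: the raw review texts and their labels
    (1 for positive, 0 for negative).
    """
    reviews = []
    labels = []

    for label in ['pos', 'neg']:
        folder = os.path.join(path, label)
        for file_path in glob.glob(os.path.join(folder, '*.txt')):
            with open(file_path, 'r', encoding='utf-8') as f:
                reviews.append(f.read())
                labels.append(1 if label == 'pos' else 0)

    return reviews, labels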

train_path = './aclImdb/train'
test_path = './aclImdb/test'

train_reviews, train_labels = read_reviews(train_path)
test_reviews, test_labels = read_reviews(test_path)


In [8]:
# Sample review
train_reviews[0]

'Following my experience of Finland for slightly more than a week, I\'d say this movie depicts the nature of the Finnish society very accurately. Especially the young-couple-with-a-baby-having-serious-issues phenomenon is very familiar to me, as I witnessed the exact same thing in person when I was in Finland. The relationships and problems of people, fragility of the marriage institution, the drinking culture, unemployment and the ascending money problem, all are very well put, without any subjectivity or exaggeration.<br /><br />There are some points in the film that are not necessarily easy to comprehend and tie to each other, but the joint big picture is nonetheless rewarding. Not each one of the short stories is exciting or profound, but as said above, the big picture does not fail to deliver the feeling of "real life" and captivate the viewer. I happen to think in a calm moment: What is happening in the lives of all these people on the street? Well, this is what is happening. Mov

## Preprocessing the text

In [9]:
nltk.download('punkt')

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # Replace every non-word character (punctuation, markup symbols) with a space
    text = text.lower()  # Convert to lowercase
    return word_tokenize(text)  # Split into a list of tokens

train_reviews = [preprocess_text(review) for review in train_reviews]
test_reviews = [preprocess_text(review) for review in test_reviews]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
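
The sample review shown earlier contains `<br />` markup. Because `preprocess_text` only replaces non-word characters with spaces, each tag leaves a stray `br` token behind. The sketch below is one possible way to strip tags first. It is not what produced the results in this notebook: the function name `preprocess_text_no_html` is hypothetical, the tag regex is a simple assumption rather than a full HTML parser, and `str.split` stands in for `word_tokenize` so the example has no NLTK dependency.

```python
import re

# Assumption: review markup is limited to simple tags such as <br />.
TAG_RE = re.compile(r'<[^>]+>')

def preprocess_text_no_html(text):
    """Hypothetical variant of preprocess_text that drops HTML tags first."""
    text = TAG_RE.sub(' ', text)        # Remove tags such as <br />
    text = re.sub(r'\W', ' ', text)     # Replace non-word characters with spaces
    text = text.lower()                 # Convert to lowercase
    return text.split()                 # Whitespace split (stand-in for word_tokenize)

sample = 'Very well put.<br /><br />There are some points'
tokens = preprocess_text_no_html(sample)
print(tokens)
```

Whether removing these tokens changes the accuracy figures reported below has not been measured here.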


## Feature Extraction
Using Bag of Words

In [12]:
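# The reviews are already lists of tokens, so the vectorizer gets an identity
# tokenizer. Lowercasing is switched off because it expects strings, not lists.
vectorizer = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
X_train = vectorizer.fit_transform(train_reviews)  # Learn the vocabulary from the training set only
X_test = vectorizer.transform(test_reviews)  # Reuse that vocabulary for the test set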
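
To make the Bag of Words representation concrete, here is a minimal plain-Python sketch of the counting step on two toy documents. It only illustrates the idea: `bag_of_words` is a hypothetical helper, and the real `CountVectorizer` returns a sparse matrix rather than nested lists.

```python
from collections import Counter

def bag_of_words(tokenized_docs):
    """Hypothetical helper: build a sorted vocabulary and per-document counts."""
    # Every distinct token gets one column, in sorted order.
    all_tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    vocab = {tok: idx for idx, tok in enumerate(all_tokens)}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # One row per document, one count per vocabulary entry.
        rows.append([counts.get(tok, 0) for tok in vocab])
    return vocab, rows

docs = [['good', 'movie', 'good'], ['bad', 'movie']]
vocab, rows = bag_of_words(docs)
print(vocab)  # {'bad': 0, 'good': 1, 'movie': 2}
print(rows)   # [[0, 2, 1], [1, 0, 1]]
```

Each review becomes a vector of token counts, so word order is discarded. That is the main limitation the LSTM approach later in the notebook tries to address.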



## Train a Classifier

In [13]:
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Evaluating the model

In [14]:
predicted_labels = classifier.predict(X_test)
accuracy = accuracy_score(test_labels, predicted_labels)
report = classification_report(test_labels, predicted_labels)

print("Accuracy:", accuracy)
print("Classification Report:")
print(report)

Accuracy: 0.86944
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.88      0.87     12500
           1       0.87      0.86      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000
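
For reference, the precision, recall and F1 columns in the report can be computed by hand from true and predicted labels. The sketch below does this with NumPy on toy labels. `binary_report` is a hypothetical helper and the labels are made up, so the numbers do not relate to the results above.

```python
import numpy as np

def binary_report(y_true, y_pred, positive=1):
    """Hypothetical helper: precision, recall and F1 for one class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))  # true positives
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))  # false positives
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = binary_report(toy_true, toy_pred)
print(precision, recall, f1)
```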



## Feature Extraction
Using Word Embeddings - Global Vectors (GloVe)

In [16]:
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove.6B.zip

In [21]:
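max_words = 10000    # Vocabulary cap passed to Tokenizer(num_words=...)
maxlen = 100         # Number of tokens kept per review after padding/truncation
embedding_dim = 100  # Must match the GloVe file used below (glove.6B.100d)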

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_reviews)
sequences = tokenizer.texts_to_sequences(train_reviews)
X_train = pad_sequences(sequences, maxlen=maxlen)

sequences = tokenizer.texts_to_sequences(test_reviews)
X_test = pad_sequences(sequences, maxlen=maxlen)

train_labels = np.asarray(train_labels)
test_labels = np.asarray(test_labels)
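
A note on `pad_sequences`: by default it pads and truncates at the front (`padding='pre'`, `truncating='pre'`), so with `maxlen = 100` only the last 100 tokens of a long review are kept. The plain-Python sketch below mimics that default on a single sequence. `pad_pre` is a hypothetical helper written for illustration and assumes `maxlen > 0`.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper mimicking pad_sequences defaults for one sequence."""
    seq = list(seq)[-maxlen:]                    # Truncate: keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # Pad: fill the front with zeros

short = pad_pre([5, 8, 2], 5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

Many reviews are longer than 100 tokens, so a larger `maxlen` might be worth trying. That has not been tested in this notebook.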

In [23]:
# Load pre-trained GloVe embeddings
embeddings_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [24]:
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
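
Words that do not appear in GloVe keep an all-zero row in the matrix, and row 0 also stays zero because the Keras `Tokenizer` starts its word indices at 1. The sketch below wraps the same loop in a function that also records which words were not found, which makes coverage easy to check. `build_embedding_matrix` and the toy dictionaries are hypothetical and only stand in for `tokenizer.word_index` and `embeddings_index`.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, max_words, embedding_dim):
    """Hypothetical helper: fill the matrix and collect words without a vector."""
    matrix = np.zeros((max_words, embedding_dim))
    missing = []
    for word, i in word_index.items():
        if i >= max_words:
            continue                      # Outside the vocabulary cap
        vector = embeddings_index.get(word)
        if vector is None:
            missing.append(word)          # Row stays all zeros
        else:
            matrix[i] = vector
    return matrix, missing

# Toy stand-ins for tokenizer.word_index and the loaded GloVe vectors.
toy_word_index = {'the': 1, 'movie': 2, 'zzyzx': 3, 'rare': 4}
toy_embeddings = {
    'the': np.ones(3, dtype='float32'),
    'movie': np.full(3, 2.0, dtype='float32'),
    'rare': np.full(3, 3.0, dtype='float32'),
}
matrix, missing = build_embedding_matrix(toy_word_index, toy_embeddings, 4, 3)
print(missing)          # ['zzyzx']
print(matrix[0].sum())  # 0.0
```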

In [25]:
# Building the model

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

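# Load the pre-trained GloVe vectors into the embedding layer and freeze them
# so they are not updated during training.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False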

In [26]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, train_labels, epochs=5, batch_size=32, validation_split=0.2)

# Evaluation
test_loss, test_acc = model.evaluate(X_test, test_labels)
print("Test accuracy:", test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: 0.8214799761772156
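
One caveat about `validation_split=0.2` in the training cell: Keras takes the validation samples from the end of the arrays, before any shuffling. Since `read_reviews` appends all positive reviews first and all negative reviews second, the last 20% of the training data appears to consist entirely of negative reviews, which would make the validation metrics unrepresentative. A minimal sketch of shuffling features and labels together before calling `fit` follows. `shuffle_together`, the seed and the toy arrays are assumptions for illustration, and the test accuracy above was obtained without this step.

```python
import numpy as np

def shuffle_together(X, y, seed=42):
    """Hypothetical helper: apply one random permutation to X and y."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    return X[order], y[order]

# Toy data laid out like the real arrays: all positives, then all negatives.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1] * 5 + [0] * 5)
print("last 20% before shuffling:", y_demo[-2:])  # [0 0]

X_shuf, y_shuf = shuffle_together(X_demo, y_demo)
print("labels after shuffling:", y_shuf)
```

Applied to the notebook, this would be called on `X_train` and `train_labels` before `model.fit`.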
