# IMDB Reviews

* **Dataset:** https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [1]:
import pandas as pd

In [2]:
imdb = pd.read_csv('../datasets/imdb/imdb.csv')
imdb

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [3]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

# Build these once rather than on every call; a set makes stopword lookups fast.
lemmatizer = WordNetLemmatizer()
english_stopwords = set(stopwords.words('english'))

def preprocess_text(text):
    """Preprocesses a review: converts it to lowercase, strips HTML tags,
    replaces non-letter characters with spaces, drops English stopwords,
    and lemmatizes the remaining words.

    Args:
        text: The text string to preprocess.

    Returns:
        The preprocessed text string.
    """
    text = text.lower()

    # Replace tags with a space so words on either side of "<br />" stay separate.
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-z]', ' ', text)

    words = nltk.word_tokenize(text)
    words = [word for word in words if word not in english_stopwords]

    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(lemmatized_words)

imdb["review"] = imdb["review"].apply(preprocess_text)

[nltk_data] Downloading package punkt to /home/pedro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/pedro/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/pedro/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/pedro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
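
When HTML tags are stripped with a regular expression, the replacement string matters whenever a tag sits directly between two words, as `<br />` often does in these reviews. The following is a minimal sketch of the difference. The helper `strip_tags` and the sample sentence are hypothetical and are not part of the notebook's pipeline.

```python
import re

TAG_RE = re.compile(r'<.*?>')

def strip_tags(text, replacement):
    """Replace every HTML tag in `text` with `replacement` (illustrative helper)."""
    return TAG_RE.sub(replacement, text)

# Hypothetical sample with tags wedged between two words.
sample = "great acting<br /><br />terrible script"

glued = strip_tags(sample, '').split()
separated = strip_tags(sample, ' ').split()

print(glued)      # ['great', 'actingterrible', 'script']
print(separated)  # ['great', 'acting', 'terrible', 'script']
```

With an empty replacement the two neighbouring words fuse into a single token that matches neither word in the TF-IDF vocabulary; replacing with a space keeps them apart.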


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

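# Note: the vectorizer is fitted on the full dataset before the split, so the
# vocabulary and IDF weights also reflect reviews that later land in the test set.
tv = TfidfVectorizer()
tv_matrix = tv.fit_transform(imdb["review"])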

In [6]:
from sklearn.model_selection import train_test_split

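# The default test_size of 0.25 holds out 12,500 of the 50,000 reviews.
# No random_state is set, so the split (and the scores below) vary between runs.
train_x, test_x, train_y, test_y = train_test_split(tv_matrix, imdb["sentiment"])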
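
One way to keep the TF-IDF statistics restricted to the training data is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that `fit` only ever sees the training split. The sketch below runs on a tiny hypothetical corpus rather than the IMDB data, and the `test_size`, `stratify`, and `random_state` values are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for the preprocessed reviews (hypothetical data).
texts = [
    "wonderful film great acting", "terrible plot bad acting",
    "loved every minute", "boring waste time",
    "great story wonderful cast", "bad dialogue awful script",
    "brilliant moving film", "dull predictable mess",
]
labels = ["positive", "negative"] * 4

# Split the raw text first; a fixed random_state makes the split repeatable.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# At prediction time the test texts are transformed with the training vocabulary.
predictions = pipeline.predict(test_texts)
print(predictions)
```

On the real data the same pattern would take `imdb["review"]` and `imdb["sentiment"]` in place of the toy lists.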

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [8]:
models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier()
}

In [9]:
from sklearn.metrics import accuracy_score, classification_report

results = {}
for model_name, model in models.items():
    model.fit(train_x, train_y)
    predictions = model.predict(test_x)
    accuracy = accuracy_score(test_y, predictions)
    report = classification_report(test_y, predictions)
    results[model_name] = {"accuracy": accuracy, "report": report}
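
The loop above stores each classification report as a formatted string, which is convenient for printing but awkward to compare programmatically. As a small sketch, `classification_report` also accepts `output_dict=True` and then returns nested dictionaries. The labels below are hypothetical and only illustrate the shape of the result.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels, for illustration only.
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, output_dict=True)

print(accuracy)                      # 0.75
print(report["positive"]["recall"])  # 0.5
print(report["accuracy"])            # 0.75
```

Per-class metrics are then available by key, for example `report["negative"]["precision"]`, without parsing the printed table.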

In [10]:
for model_name, result in results.items():
    print(f"--- {model_name} ---")
    print(f"Accuracy: {result['accuracy']}")
    print(f"Classification Report:\n{result['report']}\n")

--- Logistic Regression ---
Accuracy: 0.89528
Classification Report:
              precision    recall  f1-score   support

    negative       0.91      0.88      0.89      6227
    positive       0.88      0.91      0.90      6273

    accuracy                           0.90     12500
   macro avg       0.90      0.90      0.90     12500
weighted avg       0.90      0.90      0.90     12500


--- Support Vector Machine ---
Accuracy: 0.9024
Classification Report:
              precision    recall  f1-score   support

    negative       0.92      0.89      0.90      6227
    positive       0.89      0.92      0.90      6273

    accuracy                           0.90     12500
   macro avg       0.90      0.90      0.90     12500
weighted avg       0.90      0.90      0.90     12500


--- Random Forest ---
Accuracy: 0.85512
Classification Report:
              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      6227
    positive       0.86      0.8