# CS345 Project

For our dataset, we use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). The dataset is too large to upload to GitHub, so please download it yourself, extract it, and move the extracted `aclImdb` folder into the `data` directory. The loading code below reads from `data/aclImdb/train` and `data/aclImdb/test`.
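
Before loading anything, it can help to confirm that the extracted archive sits where the notebook expects it. The following is a minimal sketch, assuming the layout `data/aclImdb/{train,test}/{pos,neg}`; `missing_dataset_dirs` is a hypothetical helper name and not part of the project code.

```python
import os

def missing_dataset_dirs(root="data/aclImdb"):
    """Return the expected review directories that do not exist under root."""
    expected = [
        os.path.join(root, split, sentiment)
        for split in ("train", "test")
        for sentiment in ("pos", "neg")
    ]
    return [path for path in expected if not os.path.isdir(path)]

missing = missing_dataset_dirs()
if missing:
    print("Missing directories:", missing)
else:
    print("Dataset layout looks correct.")
```

If any directories are reported as missing, the archive was most likely extracted to a different location or one level too deep.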

### Imports

In [15]:
import os
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from tqdm import tqdm

### Load the data

In [16]:
def load_data(data_dir):
    data = []
    for sentiment in ['pos', 'neg']:
        sentiment_dir = os.path.join(data_dir, sentiment)
        print(f"Processing '{sentiment}' reviews...")
        file_list = os.listdir(sentiment_dir)
        # Loading can take a while, so show a tqdm progress bar to confirm the loop isn't hanging
        for filename in tqdm(file_list, desc=f"Loading {sentiment} files"):
            if filename.endswith('.txt'):
                filepath = os.path.join(sentiment_dir, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    review = f.read()
                data.append({
                    'review': review,
                    'sentiment': 1 if sentiment == 'pos' else 0,
                })

    df = pd.DataFrame(data)
    print(f"Loaded {len(df)} reviews from '{data_dir}'")
    return df

train_data = load_data('data/aclImdb/train')
test_data = load_data('data/aclImdb/test')

Processing 'pos' reviews...
Loading pos files: 100%|██████████| 12500/12500 [00:02<00:00, 4183.81it/s]
Processing 'neg' reviews...
Loading neg files: 100%|██████████| 12500/12500 [00:02<00:00, 4675.40it/s]
Loaded 25000 reviews from 'data/aclImdb/train'
Processing 'pos' reviews...
Loading pos files: 100%|██████████| 12500/12500 [00:02<00:00, 4676.72it/s]
Processing 'neg' reviews...
Loading neg files: 100%|██████████| 12500/12500 [00:02<00:00, 4456.95it/s]
Loaded 25000 reviews from 'data/aclImdb/test'

### Let's check out some of the data

In [17]:
print(train_data.head())

                                              review  sentiment
0  Bromwell High is a cartoon comedy. It ran at t...          1
1  Homelessness (or Houselessness as George Carli...          1
2  Brilliant over-acting by Lesley Ann Warren. Be...          1
3  This is easily the most underrated film inn th...          1
4  This is not the typical Mel Brooks film. It wa...          1
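
Because `load_data` reads every `pos` file before any `neg` file, the first rows all carry the label 1, so `head()` alone says little about the label mix. A minimal sketch of two complementary checks follows, using a small hypothetical stand-in frame (`demo_df`) with the same columns as `train_data` so it runs without the dataset.

```python
import pandas as pd

# Hypothetical stand-in with the same columns as train_data, ordered pos-then-neg.
demo_df = pd.DataFrame({
    'review': ['great film', 'loved it', 'a classic', 'dull plot', 'terrible acting'],
    'sentiment': [1, 1, 1, 0, 0],
})

# Count how many reviews carry each label.
label_counts = demo_df['sentiment'].value_counts().to_dict()
print(label_counts)

# Draw a reproducible random sample instead of the first (all-positive) rows.
print(demo_df.sample(n=3, random_state=0))
```

The same two calls can be applied to `train_data` and `test_data` directly.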


### Data Preprocessing

In [18]:
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text, stop_words, lemmatizer):
    # Replace HTML tags (e.g. <br />) with a space so the words on either side are not merged
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

def preprocess_with_nltk(data):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    
    tqdm.pandas(desc="Preprocessing text")
    data['review'] = data['review'].progress_apply(lambda x: preprocess_text(x, stop_words, lemmatizer))
    return data

train_data = preprocess_with_nltk(train_data)
test_data = preprocess_with_nltk(test_data)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jaxfu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jaxfu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Preprocessing text: 100%|██████████| 25000/25000 [00:19<00:00, 1263.22it/s]
Preprocessing text: 100%|██████████| 25000/25000 [00:18<00:00, 1358.68it/s]
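
To make the cleaning steps concrete, here is a minimal standalone sketch that traces them on one string. It is a simplification, not the notebook's function: it skips lemmatization, it uses a tiny hypothetical stopword set (`DEMO_STOP_WORDS`) in place of NLTK's English list, and it assumes tags are replaced with a space so that words on either side of a `<br />` stay separate.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only).
DEMO_STOP_WORDS = {"this", "is", "a", "the", "it", "was", "in"}

def clean_demo(text, stop_words=DEMO_STOP_WORDS):
    # Replace HTML tags with a space.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Drop everything except letters and whitespace (punctuation, digits).
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lowercase, split on whitespace, and filter out stopwords.
    tokens = text.lower().split()
    return ' '.join(token for token in tokens if token not in stop_words)

sample = "This is a GREAT film!<br /><br />It was released in 1999."
print(clean_demo(sample))  # great film released
```

Note that removing non-letter characters outright also joins hyphenated words, so a word such as "over-acting" becomes "overacting".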


Let's compare the data with what we had before preprocessing:

In [19]:
print(train_data.head())

                                              review  sentiment
0  bromwell high cartoon comedy ran time program ...          1
1  homelessness houselessness george carlin state...          1
2  brilliant overacting lesley ann warren best dr...          1
3  easily underrated film inn brook cannon sure f...          1
4  typical mel brook film much less slapstick mov...          1
