# IMDb Movie Reviews

1. Dataset
2. Cleaning Steps
3. Deliverables


Tasks
1. Data Loading
2. Text Cleaning Pipeline
3. Apply Cleaning
4. Evaluation
5. Next Steps (Optional): use the cleaned reviews to train a sentiment classifier (Logistic Regression, Naive Bayes, etc.).

In [None]:
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Fetch the NLTK resources used for stopword removal and lemmatization
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [6]:
english_stopwords = set(stopwords.words('english'))
text_lemmatizer = WordNetLemmatizer()

In [8]:
import pandas as pd
import glob

def read_review_data(directory, label_type):
    """Read every .txt review in `directory` and tag it with `label_type`."""
    rows = []
    for path in glob.glob(f"{directory}/*.txt"):
        # A context manager closes each file handle as soon as it is read
        with open(path, encoding='utf-8') as handle:
            rows.append([handle.read(), label_type])
    return pd.DataFrame(rows, columns=['review_text', 'sentiment'])

# Load training datasets
pos_train = read_review_data("IMDb Movie Reviews data/train/pos", "positive")
neg_train = read_review_data("IMDb Movie Reviews data/train/neg", "negative")

# Merge datasets
train_df = pd.concat([pos_train, neg_train], ignore_index=True)
train_df

index,review_text,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,positive
1,Homelessness (or Houselessness as George Carli...,positive
2,Brilliant over-acting by Lesley Ann Warren. Be...,positive
3,This is easily the most underrated film inn th...,positive
4,This is not the typical Mel Brooks film. It wa...,positive
...,...,...
24995,"Towards the end of the movie, I felt it was to...",negative
24996,This is the kind of movie that my enemies cont...,negative
24997,I saw 'Descent' last night at the Stockholm Fi...,negative
24998,Some films that you pick up for a pound turn o...,negative
24999,"This is one of the dumbest films, I've ever se...",negative


In [9]:
train_df.head(7)

index,review_text,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,positive
1,Homelessness (or Houselessness as George Carli...,positive
2,Brilliant over-acting by Lesley Ann Warren. Be...,positive
3,This is easily the most underrated film inn th...,positive
4,This is not the typical Mel Brooks film. It wa...,positive
5,"This isn't the comedic Robin Williams, nor is ...",positive
6,Yes its an art... to successfully make a slow ...,positive


In [10]:
train_df.tail(10)

index,review_text,sentiment
24990,Yeti: Curse of the Snow Demon starts aboard a ...,negative
24991,"Hmmm, a sports team is in a plane crash, gets ...",negative
24992,"I saw this piece of garbage on AMC last night,...",negative
24993,Although the production and Jerry Jameson's di...,negative
24994,Capt. Gallagher (Lemmon) and flight attendant ...,negative
24995,"Towards the end of the movie, I felt it was to...",negative
24996,This is the kind of movie that my enemies cont...,negative
24997,I saw 'Descent' last night at the Stockholm Fi...,negative
24998,Some films that you pick up for a pound turn o...,negative
24999,"This is one of the dumbest films, I've ever se...",negative


In [11]:
train_df.index

RangeIndex(start=0, stop=25000, step=1)
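
Beyond looking at the head, tail, and index, it can help to confirm that the merged data is balanced and contains no empty or duplicated reviews. The sketch below does this in plain Python on a list of (text, label) pairs. The helper `summarize_reviews` and the sample rows are hypothetical and not part of the notebook. With the DataFrame above, the same input could be built from `zip(train_df['review_text'], train_df['sentiment'])`.

```python
from collections import Counter

def summarize_reviews(rows):
    """Return label counts plus the number of empty and duplicated texts."""
    rows = list(rows)
    label_counts = Counter(label for _, label in rows)
    # A review counts as empty if it contains only whitespace
    empty = sum(1 for text, _ in rows if not text.strip())
    text_counts = Counter(text for text, _ in rows)
    # Every copy beyond the first counts as one duplicate
    duplicates = sum(n - 1 for n in text_counts.values() if n > 1)
    return {"labels": dict(label_counts), "empty": empty, "duplicates": duplicates}

# Hypothetical toy rows standing in for (review_text, sentiment) pairs
sample_rows = [
    ("Great film.", "positive"),
    ("Great film.", "positive"),
    ("   ", "negative"),
    ("Dull and slow.", "negative"),
]
summary = summarize_reviews(sample_rows)
print(summary)
```

If the duplicate or empty counts are nonzero on the real data, dropping those rows before sampling is one option, though whether that is appropriate depends on the downstream task.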

In [20]:
sampled_data = train_df.sample(n=10000, random_state=42, ignore_index=True)
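
`DataFrame.sample` draws rows uniformly, so the 10,000-row sample is only approximately balanced between the two labels. If an exact split matters, a stratified draw is one alternative. The sketch below illustrates the idea in plain Python. The function `stratified_sample` and the toy rows are hypothetical, and this is not what the notebook does. In pandas, a per-group draw such as `train_df.groupby('sentiment').sample(n=5000, random_state=42)` would be a comparable approach.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, per_label, seed=42):
    """Draw `per_label` rows from each label group, reproducibly."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for text, label in rows:
        groups[label].append((text, label))
    picked = []
    for label in sorted(groups):  # sorted for a deterministic group order
        # random.sample raises ValueError if a group has fewer than per_label rows
        picked.extend(rng.sample(groups[label], per_label))
    rng.shuffle(picked)  # mix the labels so the result is not grouped
    return picked

# Hypothetical toy data: 20 positive and 20 negative placeholder reviews
rows = [(f"review {i}", "positive") for i in range(20)]
rows += [(f"review {i}", "negative") for i in range(20, 40)]

sample = stratified_sample(rows, per_label=5)
print(Counter(label for _, label in sample))
```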

In [23]:
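def text_cleaning_pipeline(input_text):
    """Lowercase, strip HTML, URLs, e-mails and non-letters, drop stopwords, lemmatize."""
    text_clean = input_text.lower()
    # A space separator keeps the words on either side of a tag such as <br /> apart
    text_clean = BeautifulSoup(text_clean, "html.parser").get_text(separator=" ")
    # Remove URLs and e-mail addresses
    text_clean = re.sub(r'http\S+|\S+@\S+\.\S+', '', text_clean)
    # Keep only lowercase letters and whitespace
    text_clean = re.sub(r'[^a-z\s]', '', text_clean)
    
    tokens = text_clean.split()
    # Drop stopwords and tokens of one or two characters, then lemmatize the rest
    filtered_tokens = [
        text_lemmatizer.lemmatize(token)
        for token in tokens
        if len(token) > 2 and token not in english_stopwords
    ]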
    
    return " ".join(filtered_tokens)
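
To see what the two regular expressions and the token filter do, the sketch below replays those steps on one sentence and returns each intermediate result. It leaves out the BeautifulSoup and lemmatization steps so that it runs with the standard library alone. `regex_stages`, `TOY_STOPWORDS`, and the sample sentence are hypothetical stand-ins, and the stopword set is far smaller than NLTK's list.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's list
TOY_STOPWORDS = {"the", "and", "was", "for", "this"}

def regex_stages(text):
    """Return the intermediate results of the regex and filtering steps."""
    lowered = text.lower()
    # Same pattern as the pipeline: URLs starting with http, and e-mail addresses
    no_links = re.sub(r'http\S+|\S+@\S+\.\S+', '', lowered)
    # Same pattern as the pipeline: drop everything except a-z and whitespace
    letters_only = re.sub(r'[^a-z\s]', '', no_links)
    tokens = [
        tok for tok in letters_only.split()
        if len(tok) > 2 and tok not in TOY_STOPWORDS
    ]
    return lowered, no_links, letters_only, tokens

sample = "I've seen this 3 times! See http://example.com or mail me@example.com"
for stage in regex_stages(sample):
    print(stage)
```

Because punctuation is deleted rather than replaced with a space, contractions lose their apostrophe: "I've" becomes "ive", which also appears in the notebook's cleaned output. Stopword entries that contain an apostrophe would therefore no longer match after this step.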


In [24]:
sampled_data['cleaned_review'] = sampled_data['review_text'].apply(text_cleaning_pipeline)
sampled_data.head()


index,review_text,sentiment,cleaned_review
0,In Panic In The Streets Richard Widmark plays ...,positive,panic street richard widmark play navy doctor ...
1,If you ask me the first one was really better ...,negative,ask first one really better one look sarah rea...
2,I am a big fan a Faerie Tale Theatre and I've ...,positive,big fan faerie tale theatre ive seen one best ...
3,I just finished reading a book about Dillinger...,negative,finished reading book dillinger movie horribly...
4,Greg Davis and Bryan Daly take some crazed sta...,negative,greg davis bryan daly take crazed statement te...


In [26]:
sampled_data[['review_text', 'cleaned_review']].head(5)

index,review_text,cleaned_review
0,In Panic In The Streets Richard Widmark plays ...,panic street richard widmark play navy doctor ...
1,If you ask me the first one was really better ...,ask first one really better one look sarah rea...
2,I am a big fan a Faerie Tale Theatre and I've ...,big fan faerie tale theatre ive seen one best ...
3,I just finished reading a book about Dillinger...,finished reading book dillinger movie horribly...
4,Greg Davis and Bryan Daly take some crazed sta...,greg davis bryan daly take crazed statement te...
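
A side-by-side view shows individual rows; for the evaluation step, a numeric summary of how much text the pipeline removes may also be useful. The sketch below computes average whitespace-token counts before and after cleaning. `cleaning_stats` and the toy pairs are hypothetical. On the real data the pairs could come from `zip(sampled_data['review_text'], sampled_data['cleaned_review'])`.

```python
def cleaning_stats(pairs):
    """Average token counts before and after cleaning, plus percent reduction."""
    pairs = list(pairs)
    if not pairs:
        return {"avg_raw": 0.0, "avg_clean": 0.0, "reduction_pct": 0.0}
    # Tokens are counted by splitting on whitespace, as the pipeline does
    raw_total = sum(len(raw.split()) for raw, _ in pairs)
    clean_total = sum(len(clean.split()) for _, clean in pairs)
    reduction = 100.0 * (raw_total - clean_total) / raw_total if raw_total else 0.0
    return {
        "avg_raw": raw_total / len(pairs),
        "avg_clean": clean_total / len(pairs),
        "reduction_pct": reduction,
    }

# Hypothetical (raw, cleaned) pairs, not taken from the dataset
toy_pairs = [
    ("This is a great film", "great film"),
    ("I did not like it at all", "like"),
]
stats = cleaning_stats(toy_pairs)
print(stats)
```

The second toy pair also shows a known side effect of stopword removal: dropping "not" leaves only "like", which can flip the apparent sentiment.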


In [27]:
sampled_data.to_csv("cleaned_imdb_sample.csv", index=False)
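
For the optional next step in the task list, the cleaned reviews could feed a sentiment classifier. As a rough illustration of the Naive Bayes option, the sketch below implements a small multinomial Naive Bayes with add-one smoothing in plain Python. The function names and the four toy documents are hypothetical; in practice a library implementation trained on `cleaned_review` and `sentiment` would be the more usual choice.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for token in text.split():
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior for the class
        score = math.log(label_counts[label] / total_docs)
        # Add-one smoothing: every vocabulary word gets one extra count
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy documents in the style of cleaned reviews
toy_docs = [
    ("great film loved acting", "positive"),
    ("wonderful story great cast", "positive"),
    ("boring plot terrible acting", "negative"),
    ("awful film waste time", "negative"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, "great story"))
print(predict(model, "terrible waste"))
```

A held-out test split would be needed to judge how well any such classifier works on the real reviews.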