**Assignment: Building a Text‑Classification Pipeline & Word‑Embedding Exploration**

by Khalida Salahuddin
DS15

**IMDB Movies Sentiment Analysis**

# Real‑life Use‑Case Framing

IMDB is a leading website for finding movies and TV shows. Users can search for titles and write positive or negative reviews. Because people often spend a long time searching for something they would enjoy, IMDB wants to recommend movies based on each user's preferences.

For this purpose, we will build an NLP-based text-classification model that predicts the sentiment of a user's reviews. If successful, the model can help IMDB recommend movies based on the preferences expressed in each user's past reviews.

IMDB can benefit from this model by improving user retention and engagement.

# Data Acquisition & Exploration

In [123]:
!pip install datasets pandas



In [124]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [125]:
from datasets import load_dataset

dataset = load_dataset("imdb")

In [126]:
dataset.shape

{'train': (25000, 2), 'test': (25000, 2), 'unsupervised': (50000, 2)}

In [127]:
dataset['train']
dataset['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [128]:
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

print(train_df.head())

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


In [129]:
print(train_df['label'].value_counts())

label
0    12500
1    12500
Name: count, dtype: int64


In [130]:
for label in train_df['label'].unique():
    print(f"\nExamples for label {label}:")
    samples = train_df[train_df['label']==label].sample(5, random_state=42)
    for i, text in enumerate(samples['text']):
        print(f"{i+1}: {text[:200]}...")


Examples for label 0:
1: Wow, what a total let down! The fact people think this film is scary is ridiculous. The special effects were a direct rip-off of "The ring." The story? Was there one? Not in my opinion..Just a bunch o...
2: If Bob Ludlum was to see this mini series, he would have cried. This was complete waste of time and money. I have read the book and even though movies are not exactly what the book may be, CBS wasted ...
3: To call a film about a crippled ghost taking revenge from beyond the grave lame and lifeless would be too ironical but this here is an undeniably undistinguished combination of GASLIGHT (1939 & 1944) ...
4: they have sex with melons in Asia.<br /><br />okay. first, i doubted that, but after seeing the wayward cloud, i changed my mind and was finally convinced that they have sex with watermelons, with peo...
5: Although the production and Jerry Jameson's direction are definite improvements, "Airport '77" isn't much better than "Airport 1975": slick, comme

Exploring data for preprocessing

In [131]:
dataset['train'][:2]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

The sample above suggests the following preprocessing steps:

1. Convert the text to lowercase.
2. Remove punctuation marks and other irrelevant symbols.
3. Remove HTML tags and URLs.
4. Remove stop words such as "is", "an", "a", "the", and "for".
5. Remove any emojis.
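Step 5 (emoji removal) has no dedicated cell among those shown in this section. A minimal sketch of how it could be done with the standard library is below. The function name `remove_emojis` and the choice of Unicode ranges are illustrative assumptions, and the ranges are not exhaustive:

```python
import re

# Assumed, non-exhaustive set of Unicode ranges that cover most emojis.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text):
    """Replace runs of emojis with a space, then collapse repeated whitespace."""
    return " ".join(EMOJI_PATTERN.sub(" ", text).split())

print(remove_emojis("great movie \U0001F600\U0001F44D loved it"))
# great movie loved it
```

To use it with the Hugging Face dataset, the function could be wrapped in the same `map` pattern as the other steps.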

# Pre‑processing Pipeline

## Convert text into lowercase

In [132]:
def lowercase(example):
    return {'clean_text_lowercase': example['text'].lower()}

dataset['train'] = dataset['train'].map(lowercase)
dataset['test'] = dataset['test'].map(lowercase)
dataset['unsupervised'] = dataset['unsupervised'].map(lowercase)


In [133]:
dataset['train'][0]['clean_text_lowercase']

'i rented i am curious-yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u.s. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" i really had to see this for myself.<br /><br />the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />what kills me about i am curious-yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, ev

## Remove HTML tags and URLs

In [134]:
import re

def full_clean(example):
    text = example['clean_text_lowercase']
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    text = text.strip()
    return {'clean_text_html_url': text}

# Apply to all splits
dataset['train'] = dataset['train'].map(full_clean)
dataset['test'] = dataset['test'].map(full_clean)
dataset['unsupervised'] = dataset['unsupervised'].map(full_clean)

In [135]:
dataset['train'][0]['clean_text_html_url']

'i rented i am curious-yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u.s. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" i really had to see this for myself.the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.what kills me about i am curious-yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it\'s not shot l
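In the output above, `myself.<br /><br />the` became `myself.the` because each tag was replaced with an empty string, so the words on either side of a tag end up touching. One possible variant, offered as a sketch rather than as the notebook's method, substitutes a space and then collapses repeated whitespace. The name `strip_html_and_urls` is hypothetical:

```python
import re

def strip_html_and_urls(text):
    """Replace HTML tags and URLs with spaces so neighbouring words stay separate."""
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags such as <br />
    text = re.sub(r"http\S+|www\S+", " ", text)   # URLs
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

sample = "i really had to see this for myself.<br /><br />the plot is centered"
print(strip_html_and_urls(sample))
# i really had to see this for myself. the plot is centered
```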

## Remove punctuation marks

In [136]:
import string

def remove_punctuation(example):
    text = example['clean_text_html_url']

    text = text.translate(str.maketrans('', '', string.punctuation))
    return {'clean_text_punctuation': text}

dataset['train'] = dataset['train'].map(remove_punctuation)
dataset['test'] = dataset['test'].map(remove_punctuation)
dataset['unsupervised'] = dataset['unsupervised'].map(remove_punctuation)


In [137]:
dataset['train'][0]['clean_text_punctuation']

'i rented i am curiousyellow from my video store because of all the controversy that surrounded it when it was first released in 1967 i also heard that at first it was seized by us customs if it ever tried to enter this country therefore being a fan of films considered controversial i really had to see this for myselfthe plot is centered around a young swedish drama student named lena who wants to learn everything she can about life in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states in between asking politicians and ordinary denizens of stockholm about their opinions on politics she has sex with her drama teacher classmates and married menwhat kills me about i am curiousyellow is that 40 years ago this was considered pornographic really the sex and nudity scenes are few and far between even then its not shot like some cheaply made
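Deleting punctuation outright joins hyphenated words: `curious-yellow` became `curiousyellow` in the output above. A hedged alternative is to map each punctuation character to a space. The names `PUNCT_TO_SPACE` and `punctuation_to_space` are assumptions introduced for this sketch:

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_space(text):
    """Swap punctuation for spaces and collapse the resulting whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

print(punctuation_to_space("i rented i am curious-yellow from my video store."))
# i rented i am curious yellow from my video store
```

The trade-off is that contractions are split as well (`it's` becomes `it s`), so apostrophes may deserve separate handling.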

## Convert chat word abbreviations to full form

In [138]:
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime Anywhere Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It Is Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek You",
    "ILU": "I Love You",
    "IMHO": "In My Humble Opinion",
    "IMO": "In My Opinion",
    "IRL": "In Real Life",
    "LOL": "Laughing Out Loud",
    "LMAO": "Laughing My Ass Off",
    "OIC": "Oh I See",
    "PITA": "Pain In The Ass",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFMAO": "Rolling On The Floor Laughing My Ass Off",
    "SK8": "Skate",
    "STATS": "Your Sex And Age",
    "ASL": "Age Sex Location",
    "THX": "Thank You",
    "TTFN": "Ta Ta For Now",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours Forever",
    "WB": "Welcome Back",
    "WTF": "What The Fuck",
    "WTG": "Way To Go",
    "WUF": "Where Are You From",
    "W8": "Wait",
    "7K": "Sick Laugher"
}

def chat_conversion_hf(example):
    text = example['clean_text_punctuation']
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return {'clean_text_chatwords': " ".join(new_text)}


dataset['train'] = dataset['train'].map(chat_conversion_hf)
dataset['test'] = dataset['test'].map(chat_conversion_hf)
dataset['unsupervised'] = dataset['unsupervised'].map(chat_conversion_hf)


In [139]:
dataset['train'][0]['clean_text_chatwords']

'i rented i am curiousyellow from my video store because of all the controversy that surrounded it when it was first released in 1967 i also heard that at first it was seized by us customs if it ever tried to enter this country therefore being a fan of films considered controversial i really had to see this for myselfthe plot is centered around a young swedish drama student named lena who wants to learn everything she can about life in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states in between asking politicians and ordinary denizens of stockholm about their opinions on politics she has sex with her drama teacher classmates and married menwhat kills me about i am curiousyellow is that 40 years ago this was considered pornographic really the sex and nudity scenes are few and far between even then its not shot like some cheaply made
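The displayed portion of the first review contains none of the listed abbreviations, so it is unchanged by this step. To show the effect, here is a small sketch on a made-up sentence. It uses a hypothetical three-entry subset of the table and, as an assumption that differs from the cell above, lowercases each expansion so the text stays consistent with the earlier lowercase step:

```python
# Hypothetical subset of the chat-word table, for illustration only.
chat_words_subset = {"BTW": "By The Way", "IMO": "In My Opinion", "GR8": "Great"}

def expand_chat_words(text, table):
    """Expand whole-word abbreviations and lowercase the expansions."""
    words = []
    for word in text.split():
        expansion = table.get(word.upper())
        words.append(expansion.lower() if expansion else word)
    return " ".join(words)

print(expand_chat_words("btw this movie was gr8 imo", chat_words_subset))
# by the way this movie was great in my opinion
```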

## Stop word removal

In [140]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [144]:
import nltk

# Download the tokenizer models and the English stop word list
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [145]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(example):
    text = example['clean_text_chatwords']  # output of the previous step
    tokens = word_tokenize(text)  # split into words
    filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
    return {'clean_text_stopword': ' '.join(filtered_tokens)}  # stored in a new column

dataset['train'] = dataset['train'].map(remove_stopwords)
dataset['test'] = dataset['test'].map(remove_stopwords)
dataset['unsupervised'] = dataset['unsupervised'].map(remove_stopwords)



In [146]:
dataset['train'][0]['clean_text_stopword']

'rented curiousyellow video store controversy surrounded first released 1967 also heard first seized us customs ever tried enter country therefore fan films considered controversial really see myselfthe plot centered around young swedish drama student named lena wants learn everything life particular wants focus attentions making sort documentary average swede thought certain political issues vietnam war race issues united states asking politicians ordinary denizens stockholm opinions politics sex drama teacher classmates married menwhat kills curiousyellow 40 years ago considered pornographic really sex nudity scenes far even shot like cheaply made porno countrymen mind find shocking reality sex nudity major staple swedish cinema even ingmar bergman arguably answer good old boy john ford sex scenes filmsi commend filmmakers fact sex shown film shown artistic purposes rather shock people make money shown pornographic theaters america curiousyellow good film anyone wanting study meat po
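Comparing this output with the previous step, the phrase "even then its not shot like some cheaply made" has become "even shot like cheaply made": the NLTK English list includes negations such as "not", "no", and "nor", which can flip the sentiment of a sentence. A minimal sketch of keeping them follows. To stay dependency-free it uses a tiny hand-written stop list; in the notebook the starting set would be `set(stopwords.words('english'))`. The names `negations` and `remove_stopwords_keep_negation` are assumptions:

```python
# Tiny illustrative stop list standing in for NLTK's English stop words.
illustrative_stop_words = {"i", "it", "is", "the", "a", "this", "was", "not", "no", "nor"}

# Assumed set of negation words to keep because they can flip sentiment.
negations = {"not", "no", "nor"}
sentiment_stop_words = illustrative_stop_words - negations

def remove_stopwords_keep_negation(text, stop_words=sentiment_stop_words):
    """Drop stop words but leave negation words in place."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords_keep_negation("this was not a good film"))
# not good film
```

Combined with the bigram features used later, a kept negation can yield terms such as "not good" instead of a bare "good".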

## Tokenization

In [147]:
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

def tokenize_words(example):
    text = example['clean_text_stopword']
    tokens = word_tokenize(text)
    return {'tokens': tokens}

dataset['train'] = dataset['train'].map(tokenize_words)
dataset['test'] = dataset['test'].map(tokenize_words)
dataset['unsupervised'] = dataset['unsupervised'].map(tokenize_words)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



In [148]:
dataset['train'][0]['tokens']

['rented',
 'curiousyellow',
 'video',
 'store',
 'controversy',
 'surrounded',
 'first',
 'released',
 '1967',
 'also',
 'heard',
 'first',
 'seized',
 'us',
 'customs',
 'ever',
 'tried',
 'enter',
 'country',
 'therefore',
 'fan',
 'films',
 'considered',
 'controversial',
 'really',
 'see',
 'myselfthe',
 'plot',
 'centered',
 'around',
 'young',
 'swedish',
 'drama',
 'student',
 'named',
 'lena',
 'wants',
 'learn',
 'everything',
 'life',
 'particular',
 'wants',
 'focus',
 'attentions',
 'making',
 'sort',
 'documentary',
 'average',
 'swede',
 'thought',
 'certain',
 'political',
 'issues',
 'vietnam',
 'war',
 'race',
 'issues',
 'united',
 'states',
 'asking',
 'politicians',
 'ordinary',
 'denizens',
 'stockholm',
 'opinions',
 'politics',
 'sex',
 'drama',
 'teacher',
 'classmates',
 'married',
 'menwhat',
 'kills',
 'curiousyellow',
 '40',
 'years',
 'ago',
 'considered',
 'pornographic',
 'really',
 'sex',
 'nudity',
 'scenes',
 'far',
 'even',
 'shot',
 'like',
 'cheapl

## Lemmatization

In [149]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(example):
    tokens = example['tokens']
    lemmatized = [lemmatizer.lemmatize(w) for w in tokens]
    return {'tokens_lemmatized': lemmatized}

dataset['train'] = dataset['train'].map(lemmatize_tokens, num_proc=4)
dataset['test'] = dataset['test'].map(lemmatize_tokens, num_proc=4)
dataset['unsupervised'] = dataset['unsupervised'].map(lemmatize_tokens, num_proc=4)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...



In [150]:
dataset['train'][0]['tokens_lemmatized']

['rented',
 'curiousyellow',
 'video',
 'store',
 'controversy',
 'surrounded',
 'first',
 'released',
 '1967',
 'also',
 'heard',
 'first',
 'seized',
 'u',
 'custom',
 'ever',
 'tried',
 'enter',
 'country',
 'therefore',
 'fan',
 'film',
 'considered',
 'controversial',
 'really',
 'see',
 'myselfthe',
 'plot',
 'centered',
 'around',
 'young',
 'swedish',
 'drama',
 'student',
 'named',
 'lena',
 'want',
 'learn',
 'everything',
 'life',
 'particular',
 'want',
 'focus',
 'attention',
 'making',
 'sort',
 'documentary',
 'average',
 'swede',
 'thought',
 'certain',
 'political',
 'issue',
 'vietnam',
 'war',
 'race',
 'issue',
 'united',
 'state',
 'asking',
 'politician',
 'ordinary',
 'denizen',
 'stockholm',
 'opinion',
 'politics',
 'sex',
 'drama',
 'teacher',
 'classmate',
 'married',
 'menwhat',
 'kill',
 'curiousyellow',
 '40',
 'year',
 'ago',
 'considered',
 'pornographic',
 'really',
 'sex',
 'nudity',
 'scene',
 'far',
 'even',
 'shot',
 'like',
 'cheaply',
 'made',
 'p
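Because `lemmatize` is called without a part-of-speech argument, every token is treated as a noun, and plural-looking words lose their final "s". In the output above this turned `us` into `u`. One way to guard specific tokens is sketched below. The wrapper `lemmatize_with_exceptions` and the `protected` set are hypothetical, and a naive stand-in replaces the real lemmatizer so the example runs without WordNet data:

```python
def lemmatize_with_exceptions(tokens, lemmatize, protected=frozenset({"us"})):
    """Apply `lemmatize` to each token except those listed in `protected`."""
    return [tok if tok in protected else lemmatize(tok) for tok in tokens]

def naive_plural_stripper(token):
    """Stand-in for WordNetLemmatizer().lemmatize: drops one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

print(lemmatize_with_exceptions(["seized", "us", "customs"], naive_plural_stripper))
# ['seized', 'us', 'custom']
```

In the notebook, `lemmatizer.lemmatize` would be passed as the second argument in place of the stand-in.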

# Feature Engineering

In [154]:
# Convert token lists into strings
X_train_text = [" ".join(tokens) for tokens in dataset['train']['tokens_lemmatized']]
X_test_text  = [" ".join(tokens) for tokens in dataset['test']['tokens_lemmatized']]

y_train = dataset['train']['label']
y_test  = dataset['test']['label']


In [155]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(
    ngram_range=(1, 2),   # unigram + bigram
    max_features=5000    # limit vocabulary size
)

X_train_bow = bow_vectorizer.fit_transform(X_train_text)
X_test_bow = bow_vectorizer.transform(X_test_text)

print("BoW Train Shape:", X_train_bow.shape)
print("BoW Test Shape:", X_test_bow.shape)


BoW Train Shape: (25000, 5000)
BoW Test Shape: (25000, 5000)
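To make `ngram_range=(1, 2)` and `max_features` concrete, the sketch below counts unigrams and bigrams by hand on two made-up documents and keeps the most frequent terms. It is a simplification under stated assumptions: it splits on whitespace, whereas `CountVectorizer` applies its own token pattern, and the helper name `unigrams_and_bigrams` is hypothetical:

```python
from collections import Counter

def unigrams_and_bigrams(text):
    """Return every single token plus every pair of adjacent tokens."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

docs = ["good film good cast", "bad film"]

# Corpus-wide term counts, analogous to what max_features ranks by.
counts = Counter(term for doc in docs for term in unigrams_and_bigrams(doc))

# Keep the 2 most frequent terms, then sort them alphabetically.
vocabulary = sorted(term for term, _ in counts.most_common(2))
print(vocabulary)
# ['film', 'good']
```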


In [156]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigram + bigram
    max_features=5000
)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)

print("TF-IDF Train Shape:", X_train_tfidf.shape)
print("TF-IDF Test Shape:", X_test_tfidf.shape)


TF-IDF Train Shape: (25000, 5000)
TF-IDF Test Shape: (25000, 5000)
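The TF-IDF matrix has the same shape as the bag-of-words matrix; only the cell values differ. As a rough illustration, the sketch below reproduces the weighting that scikit-learn applies with its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) on two made-up documents. The function `tfidf_rows` is hypothetical and uses plain whitespace tokenisation:

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(["good film", "bad film"])
print(rows[0])
```

Here "film" appears in both documents and "good" in only one, so "good" receives the larger weight in the first row.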


In [157]:
print(bow_vectorizer.get_feature_names_out()[:20])
print(tfidf_vectorizer.get_feature_names_out()[:20])

['10' '10 10' '10 minute' '10 year' '100' '1010' '11' '110' '12' '13' '14'
 '15' '15 minute' '15 year' '16' '17' '18' '1930s' '1950s' '1960s']
['10' '10 10' '10 minute' '10 year' '100' '1010' '11' '110' '12' '13' '14'
 '15' '15 minute' '15 year' '16' '17' '18' '1930s' '1950s' '1960s']
