## Dataset:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [6]:
import os
import pandas as pd
import re

## Setting up Kaggle API Credentials

To use the Kaggle API, you need to set up your credentials. You can get your API key from your Kaggle account settings.

In [7]:
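# Set up Kaggle credentials for this session.
# Never commit a real API key: replace the placeholders below, or export
# KAGGLE_USERNAME and KAGGLE_KEY in your shell before starting the notebook.
os.environ.setdefault('KAGGLE_USERNAME', 'your_kaggle_username')
os.environ.setdefault('KAGGLE_KEY', 'your_kaggle_api_key')
print("Kaggle credentials have been set for this session.")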

# Create a directory for the dataset if it doesn't exist
os.makedirs("data", exist_ok=True)

Kaggle credentials have been set for this session.
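As an alternative to typing credentials into a cell, the key can be read from the `kaggle.json` file that the Kaggle account page lets you download. The sketch below assumes that file holds `username` and `key` fields and lives at `~/.kaggle/kaggle.json`; the helper name `load_kaggle_credentials` is hypothetical and not part of the Kaggle package.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(config_path=None):
    """Copy username/key from a kaggle.json file into the environment.

    Returns True if the file was found and loaded, False otherwise.
    """
    # Assumed default location of the token file.
    path = Path(config_path) if config_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return False
    creds = json.loads(path.read_text(encoding="utf-8"))
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return True
```

Calling `load_kaggle_credentials()` before importing `kaggle` keeps the key out of the notebook itself.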


## Fetching Data Directly Using Kaggle API

Instead of downloading the dataset manually, we'll use the Kaggle API to fetch it directly.

In [8]:
# Check if kaggle is installed
try:
    import kaggle
    # Method 1: Using kaggle API
    print("Using kaggle API to download dataset...")
    kaggle.api.dataset_download_files(
        "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
        path="data",
        unzip=True
    )
    print("Dataset downloaded successfully to 'data' directory")
except ImportError:
    print("Kaggle module not found. Please download the dataset manually from:")
    print("https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
    print("and place it in a 'data' directory.")
    # os is already imported at the top of the notebook
    os.makedirs("data", exist_ok=True)

# Set the path to the downloaded dataset
path = "data"

Using kaggle API to download dataset...
Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Dataset downloaded successfully to 'data' directory


In [9]:
# Load the dataset
data_path = os.path.join(path, "IMDB Dataset.csv")
df = pd.read_csv(data_path)
print(f"Dataset loaded with shape: {df.shape}")

# Check if the dataset was loaded correctly
if df.shape[0] > 0:
    print("Dataset loaded successfully!")
else:
    print("Error: Dataset is empty.")

Dataset loaded with shape: (50000, 2)
Dataset loaded successfully!


In [10]:
# Display the first few rows
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [11]:
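# For demonstration purposes, let's work with a smaller sample.
# .copy() makes an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
df = df.head(100).copy()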
df.shape

(100, 2)

# Text Preprocessing Steps

## 1. Convert to lowercase

In [12]:
# Display an example review before preprocessing
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [13]:
# Convert all text to lowercase
df['review'] = df['review'].str.lower()

In [14]:
# Display the same review after lowercase conversion
df['review'][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

## 2. Remove HTML tags

In [15]:
def remove_html_tags(text):
    """Remove HTML tags from text"""
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [16]:
# Apply HTML tag removal
df['review'] = df['review'].apply(remove_html_tags)

In [17]:
# Display the review after HTML tag removal
df['review'][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."
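Note in the output above that deleting tags outright glues neighboring sentences together (`time.this`, `zombie.ok`). One possible variant, shown here as a sketch rather than the notebook's method, replaces each tag with a space and then collapses repeated whitespace; the function name `remove_html_tags_spaced` is made up for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def remove_html_tags_spaced(text):
    """Replace each HTML tag with a space, then collapse whitespace runs."""
    no_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


print(remove_html_tags_spaced("all the time.<br /><br />this movie is slower"))
```

This keeps word boundaries intact for the later stopword and stemming steps.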

## Remove URLs

In [18]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [19]:
text1 = 'Check out my youtube https://www.youtube.com/dswithbappy dswithbappy'
text2 = 'Check out my linkedin https://www.linkedin.com/in/boktiarahmed73/'
text3 = 'Google search here www.google.com'
text4 = 'For data click https://www.kaggle.com/'

In [20]:
remove_url(text2)

'Check out my linkedin '
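Removing a URL leaves the surrounding spaces behind (the trailing space in the output above, or a double space when the URL sits mid-sentence). A small sketch that uses the same pattern and also normalizes whitespace; `remove_url_clean` is a hypothetical name.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")


def remove_url_clean(text):
    """Drop URLs, then rejoin the remaining words with single spaces."""
    return " ".join(URL_RE.sub(" ", text).split())


print(remove_url_clean("Check out my youtube https://www.youtube.com/dswithbappy dswithbappy"))
```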

## Punctuation handling

In [21]:
import string
import time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [22]:
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [23]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,'')
    return text


In [24]:
text = 'string. With. Punctuation?'

In [25]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1*50000)  # rough extrapolation to all 50,000 reviews, in seconds

string With Punctuation
54.09717559814453


In [26]:
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))

In [27]:
start = time.time()
remove_punc1(text)
time2 = time.time() - start
print(time2*50000)

0.0


In [28]:
# Compare execution times (avoid division by zero)
if time2 > 0:
    print(f"Method 1 is {time1/time2:.2f} times slower than Method 2")
else:
    print("Method 2 is too fast to measure accurately")

Method 2 is too fast to measure accurately
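A single `time.time()` measurement is too coarse for a call this short, which is why the second method registers as `0.0`. A more reliable comparison, sketched here with the standard-library `timeit` module, repeats each call many times. The function names and the repeat count of 10,000 are illustrative choices, not taken from the notebook.

```python
import string
import timeit

exclude = string.punctuation
# Build the translation table once instead of on every call.
table = str.maketrans("", "", exclude)


def remove_punc_loop(text):
    """Remove punctuation with one str.replace call per character."""
    for char in exclude:
        text = text.replace(char, "")
    return text


def remove_punc_translate(text):
    """Remove punctuation in a single pass with str.translate."""
    return text.translate(table)


sample = "string. With. Punctuation?"
n = 10_000
loop_seconds = timeit.timeit(lambda: remove_punc_loop(sample), number=n)
translate_seconds = timeit.timeit(lambda: remove_punc_translate(sample), number=n)
print(f"loop: {loop_seconds:.4f}s, translate: {translate_seconds:.4f}s for {n} calls")
```

Because both totals are accumulated over many calls, neither rounds down to zero and their ratio is meaningful.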


In [29]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [30]:
remove_punc1(df['review'][5])

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
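Deleting punctuation merges hyphenated words, as `alltime` and `dressedup` show in the output above. If that matters for the task, one option is to map punctuation to spaces instead. This is a sketch with a hypothetical helper name, and it has its own trade-off: contractions such as `it's` split into `it s`.

```python
import string

# Map every punctuation character to a single space.
SPACE_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def punct_to_space(text):
    """Replace punctuation with spaces and collapse the resulting gaps."""
    return " ".join(text.translate(SPACE_TABLE).split())


print(punct_to_space("probably my all-time favorite movie, it's great"))
```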

## Chat word handling

In [31]:
chat_words = {
    'AFAIK':'As Far As I Know',
    'AFK':'Away From Keyboard',
    'ASAP':'As Soon As Possible'
}


{
    "FYI": "For Your Information",
    "ASAP": "As Soon As Possible",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "OMG": "Oh My God",
    "IMO": "In My Opinion",
    "LOL": "Laugh Out Loud",
    "TTYL": "Talk To You Later",
    "GTG": "Got To Go",
    "TTYT": "Talk To You Tomorrow",
    "IDK": "I Don't Know",
    "TMI": "Too Much Information",
    "IMHO": "In My Humble Opinion",
    "ICYMI": "In Case You Missed It",
    "AFAIK": "As Far As I Know",
    "FAQ": "Frequently Asked Questions",
    "TGIF": "Thank God It's Friday",
    "FYA": "For Your Action",
}


{'FYI': 'For Your Information',
 'ASAP': 'As Soon As Possible',
 'BRB': 'Be Right Back',
 'BTW': 'By The Way',
 'OMG': 'Oh My God',
 'IMO': 'In My Opinion',
 'LOL': 'Laugh Out Loud',
 'TTYL': 'Talk To You Later',
 'GTG': 'Got To Go',
 'TTYT': 'Talk To You Tomorrow',
 'IDK': "I Don't Know",
 'TMI': 'Too Much Information',
 'IMHO': 'In My Humble Opinion',
 'ICYMI': 'In Case You Missed It',
 'AFAIK': 'As Far As I Know',
 'FAQ': 'Frequently Asked Questions',
 'TGIF': "Thank God It's Friday",
 'FYA': 'For Your Action'}

In [32]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [33]:
chat_conversion('Do this work ASAP')

'Do this work As Soon As Possible'
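Splitting on whitespace misses abbreviations that carry punctuation, such as `ASAP!` or `btw,`. A possible variant, sketched below, matches alphabetic runs with a regular expression so the punctuation is left in place. The small dictionary and the name `chat_conversion_punct` are assumptions made for the example.

```python
import re

# Small illustrative lookup table.
chat_words = {
    "ASAP": "As Soon As Possible",
    "BTW": "By The Way",
    "IMO": "In My Opinion",
}


def chat_conversion_punct(text):
    """Expand chat abbreviations while keeping adjacent punctuation."""
    def expand(match):
        word = match.group(0)
        return chat_words.get(word.upper(), word)

    # Only runs of letters are looked up; everything else passes through.
    return re.sub(r"[A-Za-z]+", expand, text)


print(chat_conversion_punct("Do this work ASAP!"))
```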

## Incorrect text handling

In [34]:
%pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [35]:
from textblob import TextBlob

In [36]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

# Try to use TextBlob if available
try:
    from textblob import TextBlob
    textBlb = TextBlob(incorrect_text)
    print(textBlb.correct().string)
except ImportError:
    print("TextBlob not available. Cannot perform spelling correction.")
    print("Original text: ", incorrect_text)
    print("To install TextBlob: pip install textblob")

certain conditions during several generations are modified in the same manner.


## Emoji handling

In [37]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [38]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [39]:
remove_emoji("Lmao 😂😂")

'Lmao '
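One caveat about the pattern above: the range `\U000024C2-\U0001F251` is very wide and also covers CJK ideographs and many other non-emoji characters, so it would strip text in languages such as Chinese or Japanese. A narrower sketch is shown below. The exact set of ranges is an assumption and does not cover every emoji.

```python
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # dingbats
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "]+"
)


def remove_emoji_narrow(text):
    """Strip characters in the listed emoji blocks only."""
    return EMOJI_RE.sub("", text)


print(remove_emoji_narrow("Lmao \U0001F602\U0001F602"))
```

For full coverage, a maintained library such as the `emoji` package is usually a safer choice than a hand-written range list.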

In [40]:
# Try to use emoji module if available
try:
    import emoji
    print(emoji.demojize('Python is 🔥'))
except ImportError:
    print("Emoji module not available.")
    print("To install emoji: pip install emoji")

Python is :fire:


In [41]:
# Try to use emoji module if available
try:
    import emoji
    print(emoji.demojize('Loved the movie. It was 😘'))
except ImportError:
    print("Emoji module not available.")
    print("To install emoji: pip install emoji")

Loved the movie. It was :face_blowing_a_kiss:


## 3. Remove special characters and numbers

In [42]:
def remove_special_chars(text):
    """Remove special characters and numbers"""
    pattern = r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

In [43]:
# Apply special character removal
df['review'] = df['review'].apply(remove_special_chars)

In [44]:
# Display the review after special character removal
df['review'][3]

'basically theres a family where a little boy jake thinks theres a zombie in his closet  his parents are fighting all the timethis movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombieok first of all when youre going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing  arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots out of  just for the well playing parents  descent dialogs as for the shots with jake just ignore them'
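A letters-only character class is easy to get subtly wrong. The range `A-z` (with a lowercase `z`) spans ASCII 65 through 122, so besides letters it also matches the characters `[`, `\`, `]`, `^`, `_` and the backtick. The quick check below illustrates the difference on a made-up string.

```python
import re

sample = "snake_case [x]"

# 'A-z' keeps the underscore and brackets because they sit between 'Z' and 'a'.
loose = re.sub(r"[^a-zA-z\s]", "", sample)

# 'A-Z' keeps letters and whitespace only.
strict = re.sub(r"[^a-zA-Z\s]", "", sample)

print(repr(loose))
print(repr(strict))
```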

## 4. Remove stopwords

In [45]:
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

In [46]:
def remove_stopwords(text):
    """Remove stopwords from text"""
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

In [47]:
# Apply stopword removal
df['review'] = df['review'].apply(remove_stopwords)

In [48]:
# Display the review after stopword removal
df['review'][3]

'basically theres family little boy jake thinks theres zombie closet parents fighting timethis movie slower soap opera suddenly jake decides become rambo kill zombieok first youre going make film must decide thriller drama drama movie watchable parents divorcing arguing like real life jake closet totally ruins film expected see boogeyman similar movie instead watched drama meaningless thriller spots well playing parents descent dialogs shots jake ignore'
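For sentiment analysis it is worth remembering that NLTK's English stopword list includes negations such as `not` and `no`, so "not good" is reduced to "good". A sketch of one workaround follows. It uses a tiny hypothetical stopword set so that it runs without NLTK, and the helper name is made up.

```python
# Hypothetical miniature stopword list standing in for stopwords.words('english').
stop_words = {"this", "was", "the", "a", "is", "not", "no"}

# Words we want to keep because they flip the sentiment of what follows.
NEGATIONS = {"not", "no", "nor", "never"}


def remove_stopwords_keep_negations(text, stop_words):
    """Remove stopwords but leave negation words in place."""
    effective_stop_words = stop_words - NEGATIONS
    return " ".join(w for w in text.split() if w.lower() not in effective_stop_words)


print(remove_stopwords_keep_negations("this movie was not good", stop_words))
```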

## 5. Stemming

In [49]:
from nltk.stem import PorterStemmer

# Initialize stemmer
stemmer = PorterStemmer()

In [50]:
def stem_words(text):
    """Apply stemming to words"""
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

In [51]:
# Apply stemming
df['review'] = df['review'].apply(stem_words)

In [52]:
# Display the review after stemming
df['review'][3]

'basic there famili littl boy jake think there zombi closet parent fight timethi movi slower soap opera suddenli jake decid becom rambo kill zombieok first your go make film must decid thriller drama drama movi watchabl parent divorc argu like real life jake closet total ruin film expect see boogeyman similar movi instead watch drama meaningless thriller spot well play parent descent dialog shot jake ignor'

## 6. Lemmatization (Alternative to Stemming)

In [53]:
from nltk.stem import WordNetLemmatizer

# Download WordNet if not already downloaded
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [54]:
def lemmatize_words(text):
    """Apply lemmatization to words"""
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

In [55]:
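# Create a new column for lemmatized text.
# Note: 'review' was already stemmed in step 5, so the lemmatizer sees stems
# (e.g. 'famili', 'movi') and leaves most of them unchanged.
df['lemmatized_review'] = df['review'].apply(lemmatize_words)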

In [56]:
# Display the review after lemmatization
df['lemmatized_review'][3]

'basic there famili littl boy jake think there zombi closet parent fight timethi movi slower soap opera suddenli jake decid becom rambo kill zombieok first your go make film must decid thriller drama drama movi watchabl parent divorc argu like real life jake closet total ruin film expect see boogeyman similar movi instead watch drama meaningless thriller spot well play parent descent dialog shot jake ignor'

## Summary of Preprocessing Steps

1. Downloaded the dataset with the Kaggle API into the local `data` directory and loaded it with pandas
2. Converted text to lowercase
3. Removed HTML tags
4. Removed special characters and numbers
5. Removed stopwords
6. Applied stemming
7. Applied lemmatization (as an alternative to stemming)

These preprocessing steps help clean the text data and prepare it for further analysis or modeling.
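The cleaning steps listed above can be chained into one function. The sketch below is a minimal version under stated assumptions: it takes a caller-supplied stopword set (the small set in the usage line is hypothetical), and it leaves out stemming and lemmatization so that it runs with the standard library alone.

```python
import re


def preprocess(text, stop_words):
    """Lowercase, strip HTML tags, drop non-letters, and remove stopwords."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)        # remove HTML tags
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # keep letters and whitespace only
    return " ".join(w for w in text.split() if w not in stop_words)


print(preprocess("A wonderful little production. <br /><br />The filming", {"a", "the"}))
```

With pandas, the same function could be applied to a whole column via `df['review'].apply(...)`, as the individual steps were.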

In [57]:
# Display final processed data
df[['review', 'lemmatized_review', 'sentiment']].head()

Unnamed: 0,review,lemmatized_review,sentiment
0,one review mention watch oz episod youll hook ...,one review mention watch oz episod youll hook ...,positive
1,wonder littl product film techniqu unassum old...,wonder littl product film techniqu unassum old...,positive
2,thought wonder way spend time hot summer weeke...,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,petter mattei love time money visual stun film...,positive


# Tokenization

In [58]:
# Using split function
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [59]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [60]:
# Problems with split function
sent3 = 'I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [61]:
sent4 = 'Where do you think I should go? I have 3 day holiday'
sent4.split('.')

['Where do you think I should go? I have 3 day holiday']

In [62]:
# Regular expression
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)
tokens

['I', 'am', 'going', 'to', 'delhi']

In [63]:

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

["Lorem Ipsum is simply dummy text of the printing and typesetting industry?\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]
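The split above returns a single string because the pattern requires a literal space after the punctuation mark, while the text has a newline after `?` and nothing after the final `.`. A slightly more forgiving regex, sketched here with a hypothetical helper name, splits on any whitespace that follows `.`, `!` or `?` and keeps the punctuation attached to its sentence.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    # The lookbehind keeps the punctuation with the preceding sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Where do you think I should go? I have 3 day holiday"))
```

This is still a heuristic: an abbreviation followed by a space, such as "Dr. Smith", would be split incorrectly, which is one reason to prefer a trained tokenizer.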

In [64]:
#NLTK
from nltk.tokenize import word_tokenize,sent_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [65]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [66]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [67]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [68]:
word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'nks',
 '@',
 'gmail.com']

In [69]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

In [70]:
%pip show spacy

Note: you may need to restart the kernel to use updated packages.




In [71]:
%pip install spacy


Note: you may need to restart the kernel to use updated packages.
Collecting spacy
  Using cached spacy-3.8.5-cp39-cp39-win_amd64.whl.metadata (28 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Using cached spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Using cached murmurhash-1.0.12-cp39-cp39-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Using cached cymem-2.0.11-cp39-cp39-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Using cached preshed-3.0.9-cp39-cp39-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Using cached thinc-8.3.6-cp39-cp39-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Using cached wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.

  error: subprocess-exited-with-error
  
  × Building wheel for blis (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      BLIS_COMPILER? None
      !!
      
              ********************************************************************************
              Please consider removing the following classifiers in favor of a SPDX license expression:
      
              License :: OSI Approved :: BSD License
      
              See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
              ********************************************************************************
      
      !!
        self._finalize_license_expression()
      running bdist_wheel
      running build
      running build_py
      creating build\lib.win-amd64-cpython-39\blis
      copying blis\about.py -> build\lib.win-amd64-cpython-39\blis
      copying blis\benchmark.py -> build\lib.win-amd64-cpython-39\blis
      c

In [72]:
# Try to use spacy if available
try:
    import spacy
    nlp = spacy.load('en_core_web_sm')
    print("Spacy loaded successfully!")
except (ImportError, OSError):
    print("Spacy or the model 'en_core_web_sm' is not available.")
    print("To install spacy: pip install spacy")
    print("To download the model: python -m spacy download en_core_web_sm")
    nlp = None

Spacy or the model 'en_core_web_sm' is not available.
To install spacy: pip install spacy
To download the model: python -m spacy download en_core_web_sm


In [73]:
# Try to use spacy for tokenization if available
if nlp is not None:
    doc1 = nlp(sent5)
    doc2 = nlp(sent6)
    doc3 = nlp(sent7)
    doc4 = nlp(sent1)
    print("Tokenization with spaCy successful!")
else:
    print("Spacy not available, using NLTK tokenization instead")
    # Use NLTK tokenization as a fallback
    import nltk
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        print("Downloading NLTK punkt tokenizer...")
        nltk.download('punkt')
    from nltk.tokenize import word_tokenize
    doc1 = word_tokenize(sent5)
    doc2 = word_tokenize(sent6)
    doc3 = word_tokenize(sent7)
    doc4 = word_tokenize(sent1)

Spacy not available, using NLTK tokenization instead


In [74]:
# Display tokens if available
if 'doc4' in locals():
    print(doc4)
else:
    print("doc4 is not defined. Run the previous cell first.")

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']


In [75]:
# Print tokens if available
if 'doc4' in locals():
    for token in doc4:
        print(token)
else:
    print("doc4 is not defined. Run the previous cells first.")

I
am
going
to
visit
delhi
!


In [76]:
df.head()

Unnamed: 0,review,sentiment,lemmatized_review
0,one review mention watch oz episod youll hook ...,positive,one review mention watch oz episod youll hook ...
1,wonder littl product film techniqu unassum old...,positive,wonder littl product film techniqu unassum old...
2,thought wonder way spend time hot summer weeke...,positive,thought wonder way spend time hot summer weeke...
3,basic there famili littl boy jake think there ...,negative,basic there famili littl boy jake think there ...
4,petter mattei love time money visual stun film...,positive,petter mattei love time money visual stun film...
