SECTION 1 — Downloading NLP Corpus and Required NLTK Resources

The Natural Language Toolkit (NLTK) ships with many built-in corpora for practicing NLP tasks.
In this section, we download the Movie Reviews corpus together with the resources used later: the Punkt tokenizer models (punkt and punkt_tab), the stopword lists, and WordNet for lemmatization.

In [6]:
import nltk

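nltk.download('movie_reviews')  # the review corpus itself
nltk.download('punkt')          # Punkt tokenizer models
nltk.download('stopwords')      # stopword lists
nltk.download('wordnet')        # lexical database used by WordNetLemmatizer
nltk.download('punkt_tab')      # Punkt tables expected by recent NLTK releases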


from nltk.corpus import movie_reviews

# Join the corpus's pre-tokenized words into one space-separated string
text = " ".join(movie_reviews.words())

print("Sample Original Text:\n", text[:300])


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Sample Original Text:
 plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the 


SECTION 2 — Tokenization (Splitting Text into Words & Sentences)

Tokenization is the process of breaking a large text into smaller meaningful units.

Word tokenization splits text into individual words.

Sentence tokenization splits text into complete sentences.

Tokenization comes first in most NLP pipelines, because later steps such as stopword removal, stemming, and lemmatization operate on tokens.

In [7]:
from nltk.tokenize import word_tokenize, sent_tokenize

word_tokens = word_tokenize(text)
sent_tokens = sent_tokenize(text)

print("First 20 word tokens:\n", word_tokens[:20])
print("\nFirst 2 sentence tokens:\n", sent_tokens[:2])


First 20 word tokens:
 ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']

First 2 sentence tokens:
 ['plot : two teen couples go to a church party , drink and then drive .', 'they get into an accident .']
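
To make the idea of tokenization concrete, here is a minimal regex-based sketch using only the standard library. It is not how NLTK's word_tokenize or sent_tokenize work internally (those rely on the Punkt models); the function names simple_word_tokenize and simple_sent_tokenize are hypothetical and exist only for illustration.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "they get into an accident . one of the guys dies ."
words = simple_word_tokenize(sample)
sentences = simple_sent_tokenize(sample)

print(words)
print(sentences)
```

A rule this simple also splits after abbreviations such as "Dr.", which is one reason a trained sentence tokenizer is usually preferred on real text.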


SECTION 3 — Regular Expression (Regex) Cleaning

Raw text contains punctuation, numbers, symbols, and uppercase/lowercase variations.
To prepare text for analysis, we clean it using Regular Expressions (regex), which help remove unwanted characters.
Here’s what we do:

Convert text to lowercase

Remove punctuation and numbers

Remove extra spaces

This creates a neat, uniform text for further NLP processing.

In [8]:
import re

clean_text = text.lower()                                     # lowercase
clean_text = re.sub(r'[^a-z\s]', ' ', clean_text)             # replace anything except a-z and whitespace with a space
clean_text = re.sub(r'\s+', ' ', clean_text).strip()          # collapse runs of whitespace and trim the ends

print("Cleaned Text Preview:\n", clean_text[:300])


Cleaned Text Preview:
 plot two teen couples go to a church party drink and then drive they get into an accident one of the guys dies but his girlfriend continues to see him in her life and has nightmares what s the deal watch the movie and sorta find out critique a mind fuck movie for the teen generation that touches on 
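
The three cleaning steps can be tried on a short string to see exactly what each one removes. The sketch below wraps the same regular expressions in a helper; the function name clean and the sample sentence are assumptions made for illustration.

```python
import re

def clean(raw):
    """Lowercase, replace non-letters with spaces, then collapse whitespace."""
    lowered = raw.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return re.sub(r"\s+", " ", letters_only).strip()

sample = "What's the deal?  Watch the movie (1999) and find out..."
cleaned = clean(sample)
print(cleaned)
```

Two side effects are worth knowing. Contractions are split at the apostrophe, so "What's" becomes "what s", which matches the "what s the deal" fragment in the preview above. The pattern also keeps only ASCII letters, so a word like "café" would lose its accented character.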


SECTION 4 — Removing Stopwords

Stopwords are very common English words such as the, is, of, and, and you, which carry little meaning on their own.
Removing them shifts the focus to content words such as movie, film, action, and actor.

We use NLTK's built-in stopwords list and remove all stopwords from our cleaned tokens.

In [9]:
# Section 4 — Stopwords Removal
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

tokens_clean = clean_text.split()
filtered_tokens = [w for w in tokens_clean if w not in stop_words]

print("Before Stopwords Count:", len(tokens_clean))
print("After Stopwords Count:", len(filtered_tokens))
print("Sample tokens:", filtered_tokens[:20])


Before Stopwords Count: 1331272
After Stopwords Count: 702479
Sample tokens: ['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get', 'accident', 'one', 'guys', 'dies', 'girlfriend', 'continues', 'see', 'life', 'nightmares', 'deal']
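
The filtering step is a plain membership test, sketched below on a single sentence. TOY_STOPWORDS is a hypothetical hand-written list used only for illustration; NLTK's English list is much longer. The last lines redo the arithmetic on the two counts printed above to show how much of the corpus the filter removed.

```python
# Hypothetical mini stopword list, for illustration only.
TOY_STOPWORDS = {"the", "and", "to", "a", "then", "they", "into", "an", "of"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stop_words]

tokens = "two teen couples go to a church party drink and then drive".split()
kept = remove_stopwords(tokens, TOY_STOPWORDS)
print(kept)

# Token counts copied from the output above.
before, after = 1331272, 702479
removed_share = (before - after) / before
print(f"{removed_share:.1%} of the tokens were stopwords")
```

Storing the stopwords in a set, as the cell above also does, keeps each membership test fast even for a corpus of more than a million tokens.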


SECTION 5 — Word Count (Frequency Distribution)

Word frequency analysis shows which words appear most often in the corpus.
Such counts are a common starting point for keyword extraction, text summarization, and exploratory linguistic analysis. Here they are computed on the stopword-filtered tokens.

In [10]:
from nltk import FreqDist

freq = FreqDist(filtered_tokens)

print("Top 20 most common words:\n")
for word, count in freq.most_common(20):
    print(f"{word} → {count}")


Top 20 most common words:

film → 9519
one → 5854
movie → 5775
like → 3691
even → 2565
good → 2411
time → 2411
story → 2170
would → 2110
much → 2050
character → 2020
also → 1967
get → 1949
two → 1912
well → 1906
characters → 1859
first → 1836
see → 1749
way → 1693
make → 1642
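
The same kind of counting can be sketched with the standard library's collections.Counter, which, as far as I know, is the class FreqDist builds on. The token list below is made up for illustration; the example also derives relative frequencies, which are easier to compare across corpora of different sizes than raw counts.

```python
from collections import Counter

# Made-up token list standing in for filtered_tokens.
tokens = ["film", "one", "movie", "film", "good", "film", "movie"]

counts = Counter(tokens)
top = counts.most_common(2)          # list of (word, count), highest count first
total = sum(counts.values())
relative = {word: count / total for word, count in top}

for word, count in top:
    print(f"{word} → {count} ({relative[word]:.1%})")
```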


SECTION 6 — Stemming

Stemming reduces words to a common stem by stripping suffixes. The stem does not have to be a dictionary word.
For example:

running → run

movies → movi

It is a fast, rule-based process.
We use the Porter Stemmer, one of the most common stemming algorithms.

In [11]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

stemmed_tokens = [stemmer.stem(w) for w in filtered_tokens]

print("Original vs Stemmed:")
for original, stemmed in zip(filtered_tokens[:10], stemmed_tokens[:10]):
    print(original, "→", stemmed)


Original vs Stemmed:
plot → plot
two → two
teen → teen
couples → coupl
go → go
church → church
party → parti
drink → drink
drive → drive
get → get
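
To show what "rule-based" means here, the sketch below applies a small ordered list of plural-suffix rules, loosely modeled on the first rule group of Porter's algorithm. It is a toy and not the Porter Stemmer: the function name strip_plural is hypothetical, and the real algorithm applies several further rule groups afterwards (which is why the cell above prints "coupl" rather than "couple").

```python
def strip_plural(word):
    """Apply plural-suffix rules in order; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # movies -> movi
    if word.endswith("ss"):
        return word           # caress stays caress
    if word.endswith("s"):
        return word[:-1]      # couples -> couple
    return word

for w in ["movies", "couples", "caresses", "caress", "drink"]:
    print(w, "→", strip_plural(w))
```

The rule order matters: checking the longer suffixes first prevents "caresses" from being handled by the generic "s" rule.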


SECTION 7 — Lemmatization

Lemmatization converts words to their dictionary base form (lemma).
Unlike stemming, it relies on a vocabulary and on the word's part of speech, so its output is always a valid word.
Examples:

better → good (when treated as an adjective, pos="a")

studies → study

cars → car

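We use WordNetLemmatizer, which looks words up in the WordNet lexical database. Called without a pos argument, as in the cell below, it treats every token as a noun, so many verbs and adjectives come back unchanged.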

In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Default pos="n": every token is lemmatized as a noun
lemm_tokens = [lemmatizer.lemmatize(w) for w in filtered_tokens]

print("Original vs Lemmatized:")
for original, lemma in zip(filtered_tokens[:10], lemm_tokens[:10]):
    print(original, "→", lemma)


Original vs Lemmatized:
plot → plot
two → two
teen → teen
couples → couple
go → go
church → church
party → party
drink → drink
drive → drive
get → get
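
Why the part of speech matters can be sketched with a toy lookup table keyed by (word, pos). Everything here is hypothetical: TOY_LEMMAS is a hand-written stand-in for a real lexical database, and toy_lemmatize only mimics the idea of defaulting to the noun reading.

```python
# Hypothetical hand-written table standing in for a real lexical database.
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("cars", "n"): "car",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("better"))            # looked up as a noun
print(toy_lemmatize("better", pos="a"))   # looked up as an adjective
print(toy_lemmatize("running"))           # looked up as a noun
print(toy_lemmatize("running", pos="v"))  # looked up as a verb
```

With the real WordNetLemmatizer, the analogous call is lemmatizer.lemmatize("better", pos="a"). Supplying a part of speech for each token, for example from a POS tagger, is the usual way to get adjective and verb lemmas.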
