# 🧹 Stop Words Removal using NLTK

> **Objective:**  
> Remove common words (like *the*, *is*, *in*, *and*) that usually carry **little or no semantic meaning** in NLP tasks.  
> These words are called **Stop Words**.

---

## 📘 1. What are Stop Words?

- **Stop words** are high-frequency words that appear in most texts but contribute little to meaning.  
  Examples: *a, an, the, in, on, is, was, were, of, for, and, to, at*.
- Removing them helps:
  - Reduce dataset size.
  - Improve model focus on **meaningful tokens**.
  - Increase efficiency of vectorization (BoW / TF-IDF).

However, whether to remove stop words depends on **task context**.  
For example, you may want to keep them in **sentiment analysis**: NLTK's English list includes *not*, so "not good" and "good" would both reduce to "good".
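
One way to act on this is to subtract negation words from the stop-word set before filtering. The sketch below is only an illustration: `SAMPLE_STOP_WORDS` is a small hand-picked stand-in for NLTK's English list, and `NEGATIONS` and `remove_stop_words` are hypothetical names introduced here.

```python
# Small stand-in for NLTK's English stop-word list (assumption: the real
# list is much longer; these entries are enough for the demonstration).
SAMPLE_STOP_WORDS = {"the", "is", "a", "this", "not", "no"}

# Hypothetical set of negation words we choose to keep for sentiment tasks.
NEGATIONS = {"not", "no", "nor", "never"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["This", "movie", "is", "not", "good"]

# Default behaviour: the negation disappears along with the other stop words.
plain = remove_stop_words(tokens, SAMPLE_STOP_WORDS)

# Sentiment-friendly variant: take the negations out of the stop-word set first.
sentiment_safe = remove_stop_words(tokens, SAMPLE_STOP_WORDS - NEGATIONS)

print(plain)           # ['movie', 'good']
print(sentiment_safe)  # ['movie', 'not', 'good']
```

With the real NLTK list the same set difference should apply, since `stopwords.words('english')` is wrapped in a `set` later in this notebook.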

---

## 🧩 2. NLTK Stopword Corpus

NLTK provides a ready-made list of English stop words via:

$$
\text{stopwords.words('english')}
$$

> ✅ Pipeline:
> **Sentence Tokenization → Word Tokenization → Stopword Removal → Lemmatization**


In [4]:
# 🏏 MS Dhoni motivational paragraph
paragraph = """
When you step into any challenge, you must bring one thing above all — consistency in your actions.
Success is not a sudden peak — it’s a steady climb built on daily habits.
You don’t wake up one morning and find you’re great; you become great because you kept showing up, kept trying, and kept learning.
Mistakes will happen — that’s inevitable.
What matters is whether you let them define you or refine you.
Stay calm under pressure, focus only on what you can control — your preparation, your intent, and your effort.
The scoreboard may shift, the crowd may silence, but your character stays intact if you’ve done the right things.
Let your purpose drive your performance, not the applause.
Because in the end, it’s not about trophies or titles — it’s about the process, the grind, and the faith that carried you through.
"""
print(paragraph)



When you step into any challenge, you must bring one thing above all — consistency in your actions.
Success is not a sudden peak — it’s a steady climb built on daily habits.
You don’t wake up one morning and find you’re great; you become great because you kept showing up, kept trying, and kept learning.
Mistakes will happen — that’s inevitable.
What matters is whether you let them define you or refine you.
Stay calm under pressure, focus only on what you can control — your preparation, your intent, and your effort.
The scoreboard may shift, the crowd may silence, but your character stays intact if you’ve done the right things.
Let your purpose drive your performance, not the applause.
Because in the end, it’s not about trophies or titles — it’s about the process, the grind, and the faith that carried you through.



## 🧩 Step 1: Sentence Tokenization

**Goal:**  
Break a paragraph into individual **sentences** using NLTK's `sent_tokenize()`.

$$
\text{Sentences} = \text{sent\_tokenize}(paragraph)
$$

This step helps process long documents efficiently and enables per-sentence analysis for later stages like POS tagging or sentiment scoring.
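
To see why a trained sentence tokenizer is worth using, compare it with a naive rule. The sketch below is not how `sent_tokenize()` works internally; `naive_sentence_split` is a hypothetical helper that simply splits after `.`, `!` or `?` followed by whitespace.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = (
    "What matters is whether you let them define you or refine you. "
    "Let your purpose drive your performance, not the applause."
)
print(naive_sentence_split(sample))   # two sentences, as expected

# The same rule breaks on abbreviations: "Dr." is treated as a sentence end.
tricky = "Dr. Smith arrived. He sat."
print(naive_sentence_split(tricky))   # ['Dr.', 'Smith arrived.', 'He sat.']
```

The regex is fine for clean text like the paragraph used here, but a tokenizer that knows about common abbreviations is generally the safer choice for real documents.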


In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [9]:
# Step 1: Sentence Tokenization
sentences = sent_tokenize(paragraph)
print(f"Total Sentences: {len(sentences)}\n")

for i, s in enumerate(sentences, 1):
    print(f"{i}. {s.strip()}")


Total Sentences: 9

1. When you step into any challenge, you must bring one thing above all — consistency in your actions.
2. Success is not a sudden peak — it’s a steady climb built on daily habits.
3. You don’t wake up one morning and find you’re great; you become great because you kept showing up, kept trying, and kept learning.
4. Mistakes will happen — that’s inevitable.
5. What matters is whether you let them define you or refine you.
6. Stay calm under pressure, focus only on what you can control — your preparation, your intent, and your effort.
7. The scoreboard may shift, the crowd may silence, but your character stays intact if you’ve done the right things.
8. Let your purpose drive your performance, not the applause.
9. Because in the end, it’s not about trophies or titles — it’s about the process, the grind, and the faith that carried you through.


## 🧩 Step 2: Word Tokenization (Per Sentence)

After splitting into sentences, we tokenize each sentence into words.

$$
\text{Tokens}(S_i) = \text{word\_tokenize}(S_i)
$$

This hierarchical tokenization (sentence → word) gives us better control over context.
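
The paragraph in this notebook uses typographic apostrophes (U+2019), which the tokenizer treats as a separate symbol, so contractions end up split into three pieces. A possible pre-processing step, sketched here as an assumption rather than as part of the original pipeline, is to map curly quotes to ASCII before tokenizing. `QUOTE_MAP` and `normalize_apostrophes` are hypothetical names.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark (used as apostrophe)
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_apostrophes(text):
    """Replace curly quotes with straight ones so contractions stay recognizable."""
    return text.translate(QUOTE_MAP)


raw = "You don\u2019t wake up one morning and find you\u2019re great."
clean = normalize_apostrophes(raw)
print(clean)  # You don't wake up one morning and find you're great.
```

With straight apostrophes, `word_tokenize` would typically split `don't` into `do` and `n't`, so applying this step changes the token counts relative to the outputs shown in this notebook.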


In [11]:
# Tokenize each sentence into words

tokenized_sentences = [word_tokenize(s) for s in sentences]

for i, tokens in enumerate(tokenized_sentences, 1):
    print(f"\nSentence {i} Tokens ({len(tokens)}):\n{tokens}")


Sentence 1 Tokens (20):
['When', 'you', 'step', 'into', 'any', 'challenge', ',', 'you', 'must', 'bring', 'one', 'thing', 'above', 'all', '—', 'consistency', 'in', 'your', 'actions', '.']

Sentence 2 Tokens (18):
['Success', 'is', 'not', 'a', 'sudden', 'peak', '—', 'it', '’', 's', 'a', 'steady', 'climb', 'built', 'on', 'daily', 'habits', '.']

Sentence 3 Tokens (31):
['You', 'don', '’', 't', 'wake', 'up', 'one', 'morning', 'and', 'find', 'you', '’', 're', 'great', ';', 'you', 'become', 'great', 'because', 'you', 'kept', 'showing', 'up', ',', 'kept', 'trying', ',', 'and', 'kept', 'learning', '.']

Sentence 4 Tokens (9):
['Mistakes', 'will', 'happen', '—', 'that', '’', 's', 'inevitable', '.']

Sentence 5 Tokens (13):
['What', 'matters', 'is', 'whether', 'you', 'let', 'them', 'define', 'you', 'or', 'refine', 'you', '.']

Sentence 6 Tokens (23):
['Stay', 'calm', 'under', 'pressure', ',', 'focus', 'only', 'on', 'what', 'you', 'can', 'control', '—', 'your', 'preparation', ','

## 🧹 Step 3: Stopword Removal (Per Sentence)

Now that we have **word tokens per sentence**, we'll remove **stop words** from each sentence independently.

**Definition**

Given a sentence token list $T_i = [w_1,\dots,w_m]$ and a stopword set $S$,

$$
T_i' = \{\, w \in T_i \mid w.\mathrm{lower}() \notin S \ \land\ w.\mathrm{isalpha}() \,\}
$$

Although written in set-builder style, $T_i'$ is a list: token order and duplicates are preserved.
This keeps **alphabetic content words** in their original casing and drops common function words (the, is, and, …).
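
As a worked instance of the definition, the sketch below applies the filter to the tokens of sentence 4. The set `S` here is an assumption: a three-word stand-in holding only the entries from NLTK's list that matter for this sentence.

```python
# Tokens of sentence 4, as produced by word_tokenize in Step 2.
tokens = ["Mistakes", "will", "happen", "\u2014", "that", "\u2019", "s", "inevitable", "."]

# Assumption: a tiny stand-in for the stop-word set S.
S = {"will", "that", "s"}

# Keep a token only if it is not a stop word (case-insensitive) and is alphabetic.
filtered = [w for w in tokens if w.lower() not in S and w.isalpha()]
print(filtered)  # ['Mistakes', 'happen', 'inevitable']
```

The dash, the apostrophe, and the full stop are removed by `isalpha()`, while `will`, `that`, and `s` are removed by the stop-word test.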


In [12]:
import nltk
# Ensure resources
nltk.download('stopwords', quiet=True)

True

In [14]:
from nltk.corpus import stopwords

# Stopword set
stop_words = set(stopwords.words('english'))

# tokens per sentence are assumed in `tokenized_sentences`
# 'tokenized_sentences' is a list of lists.
# Example:
# [
#   ['When', 'you', 'step', 'into', 'any', 'challenge', ','],
#   ['Success', 'is', 'not', 'a', 'sudden', 'peak', ...],
#   ...
# ]

# We want to remove stopwords ('the', 'is', 'and', etc.)
# and keep only alphabetic words (ignore punctuation, numbers).

filtered_sentences = [
    # 👇 Inner list comprehension: process one sentence (list of tokens)
    [
        w                                   # keep this word
        for w in tokens                     # iterate over all words in this sentence
        if w.lower() not in stop_words      # condition 1: skip stop words (case-insensitive)
        and w.isalpha()                     # condition 2: skip non-alphabetic tokens (e.g., ',', '.', '—')
    ]
    # 👇 Outer loop: repeat for every sentence in tokenized_sentences
    for tokens in tokenized_sentences
]

for i, (orig, filt) in enumerate(zip(tokenized_sentences, filtered_sentences), 1):
    print(f"\nSentence {i}:")
    print("Original tokens:", orig)
    print("After stopword removal:", filt)


Sentence 1:
Original tokens: ['When', 'you', 'step', 'into', 'any', 'challenge', ',', 'you', 'must', 'bring', 'one', 'thing', 'above', 'all', '—', 'consistency', 'in', 'your', 'actions', '.']
After stopword removal: ['step', 'challenge', 'must', 'bring', 'one', 'thing', 'consistency', 'actions']

Sentence 2:
Original tokens: ['Success', 'is', 'not', 'a', 'sudden', 'peak', '—', 'it', '’', 's', 'a', 'steady', 'climb', 'built', 'on', 'daily', 'habits', '.']
After stopword removal: ['Success', 'sudden', 'peak', 'steady', 'climb', 'built', 'daily', 'habits']

Sentence 3:
Original tokens: ['You', 'don', '’', 't', 'wake', 'up', 'one', 'morning', 'and', 'find', 'you', '’', 're', 'great', ';', 'you', 'become', 'great', 'because', 'you', 'kept', 'showing', 'up', ',', 'kept', 'trying', ',', 'and', 'kept', 'learning', '.']
After stopword removal: ['wake', 'one', 'morning', 'find', 'great', 'become', 'great', 'kept', 'showing', 'kept', 'trying', 'kept', 'learning']

Sentence 4:
Original 

## 📘 Step 4: Lemmatization with POS (Per Sentence)

We now lemmatize the **filtered tokens** of each sentence using **WordNet Lemmatizer** with **POS tags**.

**POS Mapping**

$$
\text{map}(t)=
\begin{cases}
\text{wn.ADJ},& t\text{ starts with }J\\
\text{wn.VERB},& t\text{ starts with }V\\
\text{wn.NOUN},& t\text{ starts with }N\\
\text{wn.ADV},& t\text{ starts with }R\\
\text{wn.NOUN},& \text{otherwise (default)}
\end{cases}
$$

**Lemmatization**

$$
\text{Lemma}(w,\text{POS})=\operatorname{lookup}_{\text{WordNet}}\big(\text{morph}(w),\text{POS}\big)
$$
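
The case analysis above can also be written as a lookup on the first letter of the Penn Treebank tag. This is a sketch under one stated assumption: that the WordNet POS constants are the single letters `'a'`, `'v'`, `'n'`, and `'r'`. `PENN_PREFIX_TO_WN` and `penn_to_wn_letter` are hypothetical names used only here.

```python
# Assumption: WordNet POS constants are the single letters below
# (wn.ADJ == 'a', wn.VERB == 'v', wn.NOUN == 'n', wn.ADV == 'r').
PENN_PREFIX_TO_WN = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wn_letter(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PENN_PREFIX_TO_WN.get(tag[:1], "n")


for tag in ["JJ", "VBD", "NNS", "RB", "MD"]:
    print(tag, "->", penn_to_wn_letter(tag))
# JJ -> a, VBD -> v, NNS -> n, RB -> r, MD -> n (default)
```

The default matters in practice: tags such as `MD` (modal) have no WordNet counterpart, so the token is looked up as a noun and usually comes back unchanged.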


In [15]:
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def penn_to_wn(tag: str):
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return wn.NOUN

def lemmatize_tokens(tokens):
    """POS-aware lemmatization for a list of tokens."""
    tagged = pos_tag(tokens)
    lemmas = []
    for w, t in tagged:
        wn_pos = penn_to_wn(t)
        lemmas.append(lemmatizer.lemmatize(w.lower(), pos=wn_pos))
    return lemmas, tagged

In [17]:
# Apply per sentence
lemma_sentences = []
tagged_sentences = []

for filt in filtered_sentences:
    lemmas, tagged = lemmatize_tokens(filt)
    lemma_sentences.append(lemmas)
    tagged_sentences.append(tagged)

for i, (filt, lem) in enumerate(zip(filtered_sentences, lemma_sentences), 1):
    print(f"\nSentence {i}:")
    print("After stopword removal:", filt)
    print("Lemmas:", lem)


Sentence 1:
After stopword removal: ['step', 'challenge', 'must', 'bring', 'one', 'thing', 'consistency', 'actions']
Lemmas: ['step', 'challenge', 'must', 'bring', 'one', 'thing', 'consistency', 'action']

Sentence 2:
After stopword removal: ['Success', 'sudden', 'peak', 'steady', 'climb', 'built', 'daily', 'habits']
Lemmas: ['success', 'sudden', 'peak', 'steady', 'climb', 'build', 'daily', 'habit']

Sentence 3:
After stopword removal: ['wake', 'one', 'morning', 'find', 'great', 'become', 'great', 'kept', 'showing', 'kept', 'trying', 'kept', 'learning']
Lemmas: ['wake', 'one', 'morning', 'find', 'great', 'become', 'great', 'kept', 'show', 'keep', 'try', 'keep', 'learn']

Sentence 4:
After stopword removal: ['Mistakes', 'happen', 'inevitable']
Lemmas: ['mistake', 'happen', 'inevitable']

Sentence 5:
After stopword removal: ['matters', 'whether', 'let', 'define', 'refine']
Lemmas: ['matter', 'whether', 'let', 'define', 'refine']

Sentence 6:
After stopword removal: ['Stay', 'calm', 'pre

## 📊 Step 5: Per-Sentence Summary Table

We'll summarize each sentence with:
- Original sentence
- Token counts (original → filtered)
- Lemma list (joined for readability)


In [18]:
import pandas as pd

rows = []
for i, (s, orig_tokens, filt_tokens, lemmas) in enumerate(
    zip(sentences, tokenized_sentences, filtered_sentences, lemma_sentences), 1
):
    rows.append({
        "Sentence #": i,
        "Sentence": s,
        "Tokens (orig)": len(orig_tokens),
        "Tokens (filtered)": len(filt_tokens),
        "Lemmas": ", ".join(lemmas)
    })

df = pd.DataFrame(rows, columns=["Sentence #", "Sentence", "Tokens (orig)", "Tokens (filtered)", "Lemmas"])
df


| | Sentence # | Sentence | Tokens (orig) | Tokens (filtered) | Lemmas |
|---|---|---|---|---|---|
| 0 | 1 | \nWhen you step into any challenge, you must b... | 20 | 8 | step, challenge, must, bring, one, thing, cons... |
| 1 | 2 | Success is not a sudden peak — it’s a steady c... | 18 | 8 | success, sudden, peak, steady, climb, build, d... |
| 2 | 3 | You don’t wake up one morning and find you’re ... | 31 | 13 | wake, one, morning, find, great, become, great... |
| 3 | 4 | Mistakes will happen — that’s inevitable. | 9 | 3 | mistake, happen, inevitable |
| 4 | 5 | What matters is whether you let them define yo... | 13 | 5 | matter, whether, let, define, refine |
| 5 | 6 | Stay calm under pressure, focus only on what y... | 23 | 8 | stay, calm, pressure, focus, control, preparat... |
| 6 | 7 | The scoreboard may shift, the crowd may silenc... | 24 | 12 | scoreboard, may, shift, crowd, may, silence, c... |
| 7 | 8 | Let your purpose drive your performance, not t... | 11 | 5 | let, purpose, drive, performance, applause |
| 8 | 9 | Because in the end, it’s not about trophies or... | 32 | 7 | end, trophy, title, process, grind, faith, carry |


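## 💡 Observations

- **Stopword removal** keeps content words and cuts the token count sharply (sentence 1 goes from 20 tokens to 8). It also drops negations such as *not*, which changes the meaning of sentence 2.
- **POS-aware lemmatization** normalizes tense, plurality, and irregular forms (e.g., *built → build*, *habits → habit*, *carried → carry*). It depends on the tagger: in sentence 3 the first *kept* stays *kept* while the other two become *keep*, most likely because tagging runs on tokens that have already lost their surrounding stop words.
- Working **per sentence** preserves structure and enables sentence-level analytics (e.g., sentiment by sentence, clause-level features).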
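
To put a number on the reduction, the sketch below totals the per-sentence token counts from the summary table. The variable names are introduced here for illustration. Note that the removed share covers punctuation as well as stop words, because the filter drops both.

```python
# Token counts per sentence, copied from the summary table above.
orig_counts = [20, 18, 31, 9, 13, 23, 24, 11, 32]
filtered_counts = [8, 8, 13, 3, 5, 8, 12, 5, 7]

total_orig = sum(orig_counts)
total_filtered = sum(filtered_counts)

# Share of tokens removed by the combined stop-word and isalpha() filter.
removed_pct = 100 * (total_orig - total_filtered) / total_orig

print(f"{total_orig} tokens -> {total_filtered} tokens ({removed_pct:.1f}% removed)")
# 181 tokens -> 69 tokens (61.9% removed)
```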

---

## ✅ Optional: Paragraph-Level Lemma Stats

Quickly view the **most frequent lemmas** across the whole paragraph.


In [19]:
from collections import Counter

all_lemmas = [lemma for sent in lemma_sentences for lemma in sent]
freq = Counter(all_lemmas).most_common(20)

print("Top 20 Lemmas:")
for w, c in freq:
    print(f"{w:>15} : {c}")


Top 20 Lemmas:
            one : 2
          thing : 2
          great : 2
           keep : 2
            let : 2
           stay : 2
            may : 2
           step : 1
      challenge : 1
           must : 1
          bring : 1
    consistency : 1
         action : 1
        success : 1
         sudden : 1
           peak : 1
         steady : 1
          climb : 1
          build : 1
          daily : 1


## 🎯 What You Now Have

A complete, end-to-end normalization pipeline:

> **sent_tokenize → word_tokenize → stopword filter → POS-aware lemmatize → summarize**


# 🧮 Stopword Removal & Lemmatization Flowchart

This flow summarizes our entire preprocessing sequence applied to the **MS Dhoni motivational paragraph** 🏏

Each stage transforms the text gradually from a raw paragraph to clean, lemmatized tokens ready for analysis.

---

$$
\begin{array}{c}
\boxed{\text{Raw Paragraph}} \\[6pt]
\Downarrow\ \text{sent\_tokenize()} \\[6pt]
\boxed{\text{Sentence List } \{S_1, S_2, \dots, S_n\}} \\[6pt]
\Downarrow\ \text{word\_tokenize()} \\[6pt]
\boxed{\text{Word Tokens per Sentence}} \\[6pt]
\Downarrow\ \text{Remove Non-Alpha Tokens} \\[6pt]
\Downarrow\ \text{Stopword Filter (case-insensitive match)} \\[6pt]
\boxed{\text{Filtered Tokens } T_i' = \{\, w \in T_i \mid w.\mathrm{lower}() \notin S \ \land\ w.\mathrm{isalpha}() \,\}} \\[6pt]
\Downarrow\ \text{POS Tagger } \text{tag}(w) \\[6pt]
\Downarrow\ \text{Map Penn} \to \text{WordNet POS} \\[6pt]
\boxed{\text{POS-aware Lemmatizer (lowercases each token)}} \\[6pt]
\Downarrow \\[6pt]
\boxed{\text{Lemmatized Sentences } L_i} \\[6pt]
\Downarrow\ \text{Summarize per sentence / paragraph} \\[6pt]
\boxed{\text{Structured DataFrame or Clean Text Output}}
\end{array}
$$

---

### 💡 Mathematical View

For each sentence $S_i$:

$$
S_i \xrightarrow{\text{word\_tokenize}} T_i
\xrightarrow{\text{filter}} T_i' = \{\, w \in T_i \mid w.\mathrm{lower}() \notin S \ \land\ w.\mathrm{isalpha}() \,\}
\xrightarrow{\text{lemmatize}} L_i
$$

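and finally the per-sentence lemma lists are concatenated, keeping repeats so that frequencies can be counted:

$$
\text{Paragraph Lemmas} = \text{concat}(L_1, L_2, \dots, L_n)
$$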

---

✅ **Summary:**
- `sent_tokenize()` → separates sentences  
- `word_tokenize()` → breaks each sentence into words  
- Stopword filter → removes frequent, low-value words  
- POS-aware Lemmatizer → normalizes meaningful words  
- Summarize → combines clean lemmas for NLP analysis (BoW, TF-IDF, etc.)
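
As an illustration of the last step, the sketch below turns a few lemma lists into bag-of-words count vectors using only the standard library. The three lists are copied from the lemma outputs for sentences 4, 5, and 8; `lemma_sentences_sample`, `vocab`, and `bow` are names introduced here, and a real project might use a vectorizer from a library instead.

```python
from collections import Counter

# Lemma lists for sentences 4, 5 and 8, copied from the outputs above.
lemma_sentences_sample = [
    ["mistake", "happen", "inevitable"],
    ["matter", "whether", "let", "define", "refine"],
    ["let", "purpose", "drive", "performance", "applause"],
]

# Vocabulary: every distinct lemma, sorted so column positions are stable.
vocab = sorted({lemma for sent in lemma_sentences_sample for lemma in sent})

# One count vector per sentence, aligned with the vocabulary.
bow = []
for sent in lemma_sentences_sample:
    counts = Counter(sent)
    bow.append([counts[term] for term in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each row has one column per vocabulary term, so the lemma `let`, which appears in two of the sentences, contributes a 1 in the same column of both rows.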
