# Stopwords Removal with NLTK

This notebook demonstrates how to remove **stopwords** from text with NLTK. Stopwords are very common words (such as 'the', 'is', 'in') that usually carry little meaning on their own, so many NLP pipelines drop them before further processing.
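
As a minimal sketch of the idea, the filter below uses a tiny hand-written stopword set and a plain `split()` instead of a real tokenizer. Both are assumptions made for illustration; the notebook itself uses NLTK's much larger English list and `nltk.word_tokenize`.

```python
from typing import List

# Hypothetical mini stopword set, for illustration only.
# The notebook uses NLTK's English list via stopwords.words('english').
MINI_STOPWORDS = {"the", "is", "in", "of", "we", "that"}

def drop_stopwords(text: str) -> List[str]:
    """Split on whitespace and keep the tokens that are not stopwords."""
    # Compare in lowercase so capitalised stopwords such as "The" are caught too.
    return [tok for tok in text.split() if tok.lower() not in MINI_STOPWORDS]

print(drop_stopwords("The freedom that we must protect is in danger"))
# ['freedom', 'must', 'protect', 'danger']
```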

### Step 1: Input Text
We use a famous speech by Dr. APJ Abdul Kalam as our sample text.

In [4]:
# Speech of Dr. APJ Abdul Kalam
paragraph = """
I have three visions for India. In 3000 years of our history, people from all over the 
world have come and invaded us, captured our lands, conquered our minds. From Alexander 
onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the 
Dutch, all of them came and looted us, took over what was ours. Yet we have not done this 
to any other nation. We have not conquered anyone. We have not grabbed their land, their 
culture, their history and tried to enforce our way of life on them. Why? Because we 
respect the freedom of others.

That is why my first vision is freedom. I believe that India got its first vision of this 
in 1857, when we started the War of Independence. It is this freedom that we must protect 
and nurture and build on. If we are not free, no one will respect us. My second vision 
for India’s development. For fifty years we have been a developing nation. It is time we 
see ourselves as a developed nation. We are among the top 5 nations of the world in terms 
of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling. 
Our achievements are being globally recognised today. Yet we lack the self-confidence to 
see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?

I have a third vision. India must stand up to the world. Because I believe that unless 
India stands up to the world, no one will respect us. Only strength respects strength. 
We must be strong not only as a military power but also as an economic power. Both must 
go hand-in-hand. My good fortune was to have worked with three great minds. Dr. Vikram 
Sarabhai of the Dept. of space, Professor Satish Dhawan, who succeeded him and Dr. 
Brahm Prakash, father of nuclear material. I was lucky to have worked with all three 
of them closely and consider this the great opportunity of my life.
"""

### Step 2: Import NLTK and Download Resources
We need to download the `stopwords` list, the `punkt_tab` tokenizer data, and `wordnet` (plus `omw-1.4`, the Open Multilingual WordNet data) for lemmatization.

In [8]:
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Step 3: Tokenization and Stemming Setup
We'll use a `PorterStemmer` and `SnowballStemmer` to reduce words to their stems while removing stopwords.

In [7]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer

stemmer = PorterStemmer()
snowball = SnowballStemmer("english")
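
To make the idea of a stem concrete, here is a toy suffix stripper. It is deliberately not the Porter or Snowball algorithm, both of which apply many more rules and conditions; the suffix list and the `toy_stem` name are made up for this illustration. It only shows why a stem need not be a dictionary word.

```python
# Toy suffix stripper: NOT the Porter or Snowball algorithm, only an illustration.
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["visions", "conquered", "invaded", "us"]])
# ['vision', 'conquer', 'invad', 'us']
```

Note that `invad` is not an English word. Stemmers trade readability for speed and simplicity, which is the gap lemmatization is meant to close.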

### Step 4: Sentence Tokenization
Break the paragraph into individual sentences.

In [6]:
sentences = nltk.sent_tokenize(paragraph)
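
To see why a trained sentence tokenizer is worth the download, compare it with a naive approach. The sketch below uses only the standard library and a regex chosen purely for illustration: it splits after every `.`, `?` or `!`, which wrongly ends a sentence after each abbreviation.

```python
import re

# Excerpt from the speech above, containing the abbreviations "Dr." and "Dept."
text = ("My good fortune was to have worked with three great minds. "
        "Dr. Vikram Sarabhai of the Dept. of space, Professor Satish Dhawan")

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for piece in naive_sentences:
    print(repr(piece))
# 'My good fortune was to have worked with three great minds.'
# 'Dr.'
# 'Vikram Sarabhai of the Dept.'
# 'of space, Professor Satish Dhawan'
```

In the Snowball output of Step 6, `dr.` and `vikram` appear on the same line, so `sent_tokenize` did not split after that abbreviation.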

### Step 5: Stopwords Removal and Porter Stemming
We loop through each sentence, tokenize it into words, keep only the words that are NOT in the NLTK English stopwords list, and stem each remaining word with the Porter stemmer.

In [10]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Keep non-stopwords (case-insensitive check) and stem them
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)

print("Porter Stemmer Results:")
for sentence in sentences:
    print(sentence)
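
Steps 5 to 7 run the same loop and differ only in how each surviving word is normalized. As a hedged sketch, that shared step can be written as one helper that takes the normalizer as an argument. The name `filter_and_normalize`, the three-word stopword subset, and the hand-written token list are assumptions for illustration, so the block runs without NLTK.

```python
from typing import Callable, Iterable, List, Set

def filter_and_normalize(tokens: Iterable[str],
                         stop_words: Set[str],
                         normalize: Callable[[str], str]) -> List[str]:
    """Drop stopwords (case-insensitive), then normalize what remains."""
    return [normalize(tok) for tok in tokens if tok.lower() not in stop_words]

# Hypothetical subset of a stopword list and hand-written tokens.
stop_words = {"we", "have", "not"}
tokens = ["We", "have", "not", "conquered", "anyone", "."]

# str.lower stands in for a real normalizer here.
print(filter_and_normalize(tokens, stop_words, str.lower))
# ['conquered', 'anyone', '.']
```

With NLTK loaded, `stemmer.stem`, `snowball.stem`, or `lambda w: lemmatizer.lemmatize(w, pos='v')` could be passed as `normalize` in place of `str.lower`.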

### Step 6: Using Snowball Stemmer

> [!NOTE]
> The **Snowball Stemmer** (its English variant is also known as Porter2) is a revised version of the Porter Stemmer by the same author. It is generally regarded as an improvement over the original algorithm. Let's apply it to the same paragraph.

In [11]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer("english")
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

# Re-tokenize to get original sentences
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    # Lowercase the sentence so every token matches the lowercase stopword list
    sentences[i] = sentences[i].lower()
    words = nltk.word_tokenize(sentences[i])
    words = [snowball.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

print("Snowball Stemmer Results:")
for sentence in sentences:
    print(sentence)

Snowball Stemmer Results:
three vision india .
3000 year histori , peopl world come invad us , captur land , conquer mind .
alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .
yet done nation .
conquer anyon .
grab land , cultur , histori tri enforc way life .
?
respect freedom other .
first vision freedom .
believ india got first vision 1857 , start war independ .
freedom must protect nurtur build .
free , one respect us .
second vision india ’ develop .
fifti year develop nation .
time see develop nation .
among top 5 nation world term gdp .
10 percent growth rate area .
poverti level fall .
achiev global recognis today .
yet lack self-confid see develop nation , self-reli self-assur .
’ incorrect ?
third vision .
india must stand world .
believ unless india stand world , one respect us .
strength respect strength .
must strong militari power also econom power .
must go hand-in-hand .
good fortun work three great mind .
dr. vikram sara

### Step 7: Using Lemmatization

> [!IMPORTANT]
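> **Lemmatization** is usually a more accurate way of finding the root form because it looks words up in a dictionary (WordNet) instead of stripping suffixes. However, as we saw in the previous notebook, it works best when you provide **Part-of-Speech (POS)** tags; without the right tag, a word may be left unchanged.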
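
One way to supply POS tags is to map Penn Treebank tags (the tag set `nltk.pos_tag` returns by default) to the single-letter codes that `WordNetLemmatizer.lemmatize` accepts for `pos`. The sketch below is an illustration under assumptions: the helper name `penn_to_wordnet` and the hand-written tagged tokens are not part of the notebook, and the mapping needs only the standard library.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ... -> verb
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS -> adverb
        return "r"
    return "n"                # default to noun, as the lemmatizer itself does

# Hand-written (token, Penn tag) pairs, standing in for a tagger's output.
tagged = [("invaded", "VBD"), ("lands", "NNS"), ("globally", "RB"), ("great", "JJ")]

print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('invaded', 'v'), ('lands', 'n'), ('globally', 'r'), ('great', 'a')]
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=pos)`, so that every token is lemmatized with its own tag.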

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

# Re-tokenize to get original sentences
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Remove stopwords, then lemmatize every remaining token as a verb (pos='v').
    # This is a shortcut, not real POS tagging: nouns such as 'visions' stay unchanged.
    words = [lemmatizer.lemmatize(word, pos='v') for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)

print("Lemmatization Results (with Verb POS tagging):")
for sentence in sentences:
    print(sentence)

Lemmatization Results (with Verb POS tagging):
three visions India .
3000 years history , people world come invade us , capture land , conquer mind .
Alexander onwards , Greeks , Turks , Moguls , Portuguese , British , French , Dutch , come loot us , take .
Yet do nation .
conquer anyone .
grab land , culture , history try enforce way life .
?
respect freedom others .
first vision freedom .
believe India get first vision 1857 , start War Independence .
freedom must protect nurture build .
free , one respect us .
second vision India ’ development .
fifty years develop nation .
time see develop nation .
among top 5 nations world term GDP .
10 percent growth rate areas .
poverty level fall .
achievements globally recognise today .
Yet lack self-confidence see develop nation , self-reliant self-assured .
’ incorrect ?
third vision .
India must stand world .
believe unless India stand world , one respect us .
strength respect strength .
must strong military power also economic power .
must 