# Text Preprocessing --> Stopwords

Stopwords are common words in a language (such as "the", "is", "in", and "and") that are often filtered out during text preprocessing in natural language processing (NLP) tasks. They typically carry little meaning on their own, so removing them shifts the focus to the more informative words in a text. This reduces noise and vocabulary size (dimensionality), which can make text analysis more efficient and, for many tasks, more accurate. Most NLP libraries provide predefined stopword lists for various languages, and these lists can be customized to suit the requirements of a project.
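
As a rough illustration of customizing a stopword list, the sketch below starts from a small hand-written set (a stand-in for a library list such as NLTK's, so that it runs without any downloads), keeps the negation "not", and adds a project-specific word. The names `base_stopwords`, `custom_stopwords`, and `remove_stopwords` are hypothetical and chosen only for this example.

```python
# A tiny illustrative stopword set; a real project would start from a
# library-provided list instead of this hand-written one.
base_stopwords = {"the", "is", "in", "and", "of", "to", "we", "not", "this", "have"}

# Customize: keep the negation "not" (it can flip the meaning of a sentence)
# and add a word that is assumed to be uninformative for this project.
custom_stopwords = (base_stopwords - {"not"}) | {"nation"}


def remove_stopwords(text, stop_words):
    """Return the whitespace-separated words of `text` that are not stopwords."""
    return [word for word in text.split() if word.lower() not in stop_words]


sentence = "Yet we have not done this to any other nation"
default_tokens = remove_stopwords(sentence, base_stopwords)
custom_tokens = remove_stopwords(sentence, custom_stopwords)
print(default_tokens)
print(custom_tokens)
```

Whether negations or domain words belong in the list depends on the task; sentiment analysis, for instance, usually benefits from keeping "not".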

In [1]:
corpus = """I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. Yet we have not done this to any other nation.

Why? Because we respect the freedom of others. That is why my first vision is that of FREEDOM.

I believe that India got its first vision of this in 1857, when we started the war of independence. It is this freedom that we must protect and nurture and build on.

My second vision for India is DEVELOPMENT. For fifty years we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top 5 nations of the world in terms of GDP. We have 10 percent growth rate in most areas. Our poverty levels are falling. Our achievements are being globally recognized today. Yet we lack the self-confidence to see ourselves as a developed nation.

My third vision is that India must stand up to the world. Unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand.

I believe that if we have to build our nation, we must develop three members of society who can make a difference. They are the father, the mother and the teacher."""

In [2]:
from nltk.tokenize import sent_tokenize

In [15]:
sentences = sent_tokenize(corpus)

In [16]:
sentences


['I have three visions for India.',
 'In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.',
 'Yet we have not done this to any other nation.',
 'Why?',
 'Because we respect the freedom of others.',
 'That is why my first vision is that of FREEDOM.',
 'I believe that India got its first vision of this in 1857, when we started the war of independence.',
 'It is this freedom that we must protect and nurture and build on.',
 'My second vision for India is DEVELOPMENT.',
 'For fifty years we have been a developing nation.',
 'It is time we see ourselves as a developed nation.',
 'We are among the top 5 nations of the world in terms of GDP.',
 'We have 10 percent growth rate in most areas.',
 'Our poverty levels are falling.',
 'Our achievements are being globally recognized today.',
 'Yet we lack the self-confidence to see ourselves as a developed nation.',
 'My third vision is that India must stand up to the world.

In [5]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Prithivi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [7]:
from nltk.corpus import stopwords

In [8]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [9]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [10]:
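# Build the stopword set once; membership tests on a set are faster than on a list
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    # str.split() keeps punctuation attached to words (e.g. 'on.', 'Why?')
    words = sentences[i].split()
    words = [ps.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)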

In [11]:
sentences

['three vision india.',
 '3000 year history, peopl world come invad us, captur lands, conquer minds.',
 'yet done nation.',
 'why?',
 'respect freedom others.',
 'first vision freedom.',
 'believ india got first vision 1857, start war independence.',
 'freedom must protect nurtur build on.',
 'second vision india development.',
 'fifti year develop nation.',
 'time see develop nation.',
 'among top 5 nation world term gdp.',
 '10 percent growth rate areas.',
 'poverti level falling.',
 'achiev global recogn today.',
 'yet lack self-confid see develop nation.',
 'third vision india must stand world.',
 'unless india stand world, one respect us.',
 'strength respect strength.',
 'must strong militari power also econom power.',
 'must go hand-in-hand.',
 'believ build nation, must develop three member societi make difference.',
 'father, mother teacher.']
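
The output above still contains tokens such as 'why?' and 'on.' because `str.split()` leaves punctuation attached, so 'on.' never matches the stopword 'on'. One possible fix, sketched below under the assumption that stripping leading and trailing punctuation is acceptable for the task, is to normalize each token before the lookup. The helper `filter_tokens` and the small `stop_words` set are hypothetical stand-ins for the notebook's NLTK list.

```python
import string

# Small stand-in for stopwords.words('english'), so the sketch needs no downloads.
stop_words = {"why", "on", "it", "is", "this", "that", "we", "and"}


def filter_tokens(sentence, stop_words):
    """Split on whitespace, strip surrounding punctuation, drop stopwords."""
    kept = []
    for raw in sentence.split():
        # Only leading/trailing punctuation is removed; 'hand-in-hand' stays intact.
        token = raw.strip(string.punctuation).lower()
        if token and token not in stop_words:
            kept.append(token)
    return kept


# Naive check: 'Why?' is compared with its question mark and survives.
naive = [w for w in "Why?".split() if w.lower() not in stop_words]
# Normalized check: 'why' is recognized as a stopword and removed.
cleaned = filter_tokens("Why?", stop_words)
build_on = filter_tokens(
    "It is this freedom that we must protect and nurture and build on.", stop_words
)
print(naive, cleaned, build_on)
```

A tokenizer that separates punctuation into its own tokens achieves a similar effect without manual stripping.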

In [17]:
from nltk.stem import SnowballStemmer
snb = SnowballStemmer('english')

In [18]:
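# Build the stopword set once so each lookup is fast
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    # word_tokenize splits punctuation into separate tokens
    words = nltk.word_tokenize(sentences[i])
    # Stem with the Snowball stemmer created in the previous cell
    words = [snb.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)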

In [19]:
sentences

['three vision india .',
 '3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'yet done nation .',
 '?',
 'respect freedom other .',
 'first vision freedom .',
 'believ india got first vision 1857 , start war independ .',
 'freedom must protect nurtur build .',
 'second vision india develop .',
 'fifti year develop nation .',
 'time see develop nation .',
 'among top 5 nation world term gdp .',
 '10 percent growth rate area .',
 'poverti level fall .',
 'achiev global recogn today .',
 'yet lack self-confid see develop nation .',
 'third vision india must stand world .',
 'unless india stand world , one respect us .',
 'strength respect strength .',
 'must strong militari power also econom power .',
 'must go hand-in-hand .',
 'believ build nation , must develop three member societi make differ .',
 'father , mother teacher .']
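
Porter and Snowball often agree on common words, so a side-by-side comparison on a few extra words can help show where they diverge. The sketch below uses only the `PorterStemmer` and `SnowballStemmer` classes already imported in this notebook; the sample words and the `compare_stemmers` helper are hypothetical, and no expected output is listed because results may vary between NLTK versions.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')


def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) tuples for each word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]


# Hypothetical sample words; adverbs ending in "-ly" are one place to look for differences.
for word, p_stem, s_stem in compare_stemmers(["running", "fairly", "generously", "history"]):
    marker = "" if p_stem == s_stem else "  <- differs"
    print(f"{word:12} porter={p_stem:10} snowball={s_stem:10}{marker}")
```

Running this on words from your own corpus is a quick way to decide which stemmer suits a project.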

In [22]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [23]:
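# Note: `sentences` still holds the stemmed text from the previous step, so the
# lemmatizer sees stems such as 'histori' and turns 'us' into 'u'.
# Re-run sentences = sent_tokenize(corpus) first to lemmatize the original words.
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # lemmatize() treats every word as a noun unless a pos argument is given
    words = [wnl.lemmatize(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)
sentences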

['three vision india .',
 '3000 year histori , peopl world come invad u , captur land , conquer mind .',
 'yet done nation .',
 '?',
 'respect freedom .',
 'first vision freedom .',
 'believ india got first vision 1857 , start war independ .',
 'freedom must protect nurtur build .',
 'second vision india develop .',
 'fifti year develop nation .',
 'time see develop nation .',
 'among top 5 nation world term gdp .',
 '10 percent growth rate area .',
 'poverti level fall .',
 'achiev global recogn today .',
 'yet lack self-confid see develop nation .',
 'third vision india must stand world .',
 'unless india stand world , one respect u .',
 'strength respect strength .',
 'must strong militari power also econom power .',
 'must go hand-in-hand .',
 'believ build nation , must develop three member societi make differ .',
 'father , mother teacher .']