## What are Stopwords?

Stopwords are common words in a language that are usually filtered out before processing natural language data. In English, examples of stopwords include "the," "is," "in," "and," "but," "or," "with," and "on." These words are often removed because they typically carry little meaning on their own and can clutter the analysis or reduce the efficiency of tasks such as search indexing, text classification, and training machine learning models.

Removing stopwords lets the analysis focus on the words that are more likely to contribute to the meaning or context of the text. However, whether to remove them depends on the application. In sentiment analysis, for example, negations such as "not" and "no" appear in many stopword lists, yet dropping them can reverse the meaning of a sentence.

In natural language processing (NLP), libraries like NLTK (Natural Language Toolkit) and spaCy provide predefined lists of stopwords for various languages, and they also allow these lists to be customized to the requirements of the task at hand.
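
To make the point about context concrete, here is a minimal sketch of stopword removal with a customized list. `BASE_STOPWORDS` is a small hand-written subset used only for illustration (an assumption, not a library list), and `remove_stopwords` is a hypothetical helper; the same set-difference idea should carry over to a library-provided list.

```python
# Illustrative subset of an English stopword list (an assumption for this
# sketch; a library list is much longer).
BASE_STOPWORDS = {"the", "is", "in", "and", "but", "or", "with", "on",
                  "not", "no", "this", "a"}

# Words we want to keep for a sentiment-style task.
NEGATIONS = {"not", "no"}


def remove_stopwords(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


review = "This movie is not good"

default_result = remove_stopwords(review, BASE_STOPWORDS)
custom_result = remove_stopwords(review, BASE_STOPWORDS - NEGATIONS)

print(default_result)  # ['movie', 'good']
print(custom_result)   # ['movie', 'not', 'good']
```

With the default list the negation disappears and the review reads as positive; removing the negations from the stopword set keeps the original sense.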


In [117]:
## Speech by Netaji Subhas Chandra Bose ("Give me blood and I promise you freedom")
paragraph = """We should have but one desire today- the desire to die so that India may live- the desire to face a martyr’s death, so that the path to freedom may be paved with the martyr’s blood. Friends! my comrades in the War of Liberation! Today I demand of you one thing, above all. I demand of you blood. It is blood alone that can avenge the blood that the enemy has spilt. It is blood alone that can pay the price of freedom. Give me blood and I promise you freedom."""

In [118]:
from nltk.corpus import stopwords

In [119]:
import sys
import nltk
import os

# Find the location of the NLTK library
nltk_location = os.path.dirname(sys.modules['nltk'].__file__)

# Add the NLTK library location to the data path
nltk.data.path.append(nltk_location)

# Download the stopword lists for all supported languages
nltk.download('stopwords', download_dir=nltk_location)

[nltk_data] Downloading package stopwords to c:\GenAI_udemy\TextPrepro
[nltk_data]     cessing\venv_TextPreprocessing\Lib\site-
[nltk_data]     packages\nltk...
[nltk_data]   Package stopwords is already up-to-date!


True

In [120]:
if 'so' in stopwords.words('english'):
    print('so is a stopword in english')

so is a stopword in english


In [121]:
# English stopwords are all lowercase
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [122]:
# stopwords in arabic
stopwords.words('arabic')

['إذ',
 'إذا',
 'إذما',
 'إذن',
 'أف',
 'أقل',
 'أكثر',
 'ألا',
 'إلا',
 'التي',
 'الذي',
 'الذين',
 'اللاتي',
 'اللائي',
 'اللتان',
 'اللتيا',
 'اللتين',
 'اللذان',
 'اللذين',
 'اللواتي',
 'إلى',
 'إليك',
 'إليكم',
 'إليكما',
 'إليكن',
 'أم',
 'أما',
 'أما',
 'إما',
 'أن',
 'إن',
 'إنا',
 'أنا',
 'أنت',
 'أنتم',
 'أنتما',
 'أنتن',
 'إنما',
 'إنه',
 'أنى',
 'أنى',
 'آه',
 'آها',
 'أو',
 'أولاء',
 'أولئك',
 'أوه',
 'آي',
 'أي',
 'أيها',
 'إي',
 'أين',
 'أين',
 'أينما',
 'إيه',
 'بخ',
 'بس',
 'بعد',
 'بعض',
 'بك',
 'بكم',
 'بكم',
 'بكما',
 'بكن',
 'بل',
 'بلى',
 'بما',
 'بماذا',
 'بمن',
 'بنا',
 'به',
 'بها',
 'بهم',
 'بهما',
 'بهن',
 'بي',
 'بين',
 'بيد',
 'تلك',
 'تلكم',
 'تلكما',
 'ته',
 'تي',
 'تين',
 'تينك',
 'ثم',
 'ثمة',
 'حاشا',
 'حبذا',
 'حتى',
 'حيث',
 'حيثما',
 'حين',
 'خلا',
 'دون',
 'ذا',
 'ذات',
 'ذاك',
 'ذان',
 'ذانك',
 'ذلك',
 'ذلكم',
 'ذلكما',
 'ذلكن',
 'ذه',
 'ذو',
 'ذوا',
 'ذواتا',
 'ذواتي',
 'ذي',
 'ذين',
 'ذينك',
 'ريث',
 'سوف',
 'سوى',
 'شتان',
 'عدا',
 'عسى',
 'عل'

# Stemmers

## Using PorterStemmer

In [123]:
# create a Porter stemmer instance
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()

In [124]:
# converting the paragraphs to sentences
sentences=nltk.sent_tokenize(paragraph)
sentences

['We should have but one desire today- the desire to die so that India may live- the desire to face a martyr’s death, so that the path to freedom may be paved with the martyr’s blood.',
 'Friends!',
 'my comrades in the War of Liberation!',
 'Today I demand of you one thing, above all.',
 'I demand of you blood.',
 'It is blood alone that can avenge the blood that the enemy has spilt.',
 'It is blood alone that can pay the price of freedom.',
 'Give me blood and I promise you freedom.']

In [125]:
type(sentences)

list

In [126]:
## Apply Stopwords And Filter And then Apply Stemming

# build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    # sentence to words
    words = nltk.word_tokenize(sentences[i])
    # removing stopwords and applying stemming
    words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)  # join the stemmed words back into a sentence

In [127]:
sentences

['one desir today- desir die india may live- desir face martyr ’ death , path freedom may pave martyr ’ blood .',
 'friend !',
 'comrad war liber !',
 'today demand one thing , .',
 'demand blood .',
 'blood alon aveng blood enemi spilt .',
 'blood alon pay price freedom .',
 'give blood promis freedom .']
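
The stemmed output above still contains standalone punctuation tokens such as `,`, `.` and `!`. If those are not wanted, one option is to drop tokens that have no alphanumeric character in the same pass that removes stopwords. The sketch below is only an illustration: `clean_tokens` is a hypothetical helper, and the hand-written `tokens` and `stop_set` are assumptions standing in for the output of `nltk.word_tokenize` and `stopwords.words('english')`.

```python
def clean_tokens(tokens, stop_set, normalize=str.lower):
    """Drop stopwords and punctuation-only tokens, then normalize the rest."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_set:
            continue  # stopword
        if not any(ch.isalnum() for ch in tok):
            continue  # token such as ',' or '.' with no letters or digits
        kept.append(normalize(tok))
    return kept


# Hand-written stand-ins (assumptions): a tokenized sentence and a tiny stopword set.
tokens = ["Today", "I", "demand", "of", "you", "one", "thing", ",", "above", "all", "."]
stop_set = {"i", "of", "you", "above", "all"}

cleaned = clean_tokens(tokens, stop_set)
print(" ".join(cleaned))  # today demand one thing
```

Passing `stemmer.stem` as `normalize` would apply the Porter stemmer instead of plain lowercasing.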

## Using SnowballStemmer

In [128]:
from nltk.stem import SnowballStemmer
snowballstemmer=SnowballStemmer('english')

In [129]:
# converting the paragraphs to sentences
sentences=nltk.sent_tokenize(paragraph)
sentences

['We should have but one desire today- the desire to die so that India may live- the desire to face a martyr’s death, so that the path to freedom may be paved with the martyr’s blood.',
 'Friends!',
 'my comrades in the War of Liberation!',
 'Today I demand of you one thing, above all.',
 'I demand of you blood.',
 'It is blood alone that can avenge the blood that the enemy has spilt.',
 'It is blood alone that can pay the price of freedom.',
 'Give me blood and I promise you freedom.']

In [130]:
## Apply Stopwords And Filter And then Apply Snowball Stemming

# build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [snowballstemmer.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)  # join the stemmed words back into a sentence

In [131]:
sentences

['one desir today- desir die india may live- desir face martyr ’ death , path freedom may pave martyr ’ blood .',
 'friend !',
 'comrad war liber !',
 'today demand one thing , .',
 'demand blood .',
 'blood alon aveng blood enemi spilt .',
 'blood alon pay price freedom .',
 'give blood promis freedom .']
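
For this paragraph the Snowball output is identical to the Porter output, but the two algorithms do diverge on other words. The sketch below uses the example given in the NLTK documentation for `SnowballStemmer`; the `"porter"` option selects the original Porter algorithm through the Snowball interface (NLTK's separate `PorterStemmer` class used earlier is its own implementation and is not shown here). Neither stemmer needs a corpus download.

```python
from nltk.stem import SnowballStemmer

english_stemmer = SnowballStemmer("english")  # Snowball English stemmer
porter_stemmer = SnowballStemmer("porter")    # original Porter algorithm

word = "generously"
snowball_stem = english_stemmer.stem(word)
porter_stem = porter_stemmer.stem(word)

print(snowball_stem)  # generous
print(porter_stem)    # gener
```

The Snowball English stemmer tends to be a little less aggressive, which is why it keeps `generous` where the original Porter rules cut down to `gener`.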

# Lemmatizer

In [132]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

In [133]:
# paragraphs to sentences
sentences=nltk.sent_tokenize(paragraph)
sentences

['We should have but one desire today- the desire to die so that India may live- the desire to face a martyr’s death, so that the path to freedom may be paved with the martyr’s blood.',
 'Friends!',
 'my comrades in the War of Liberation!',
 'Today I demand of you one thing, above all.',
 'I demand of you blood.',
 'It is blood alone that can avenge the blood that the enemy has spilt.',
 'It is blood alone that can pay the price of freedom.',
 'Give me blood and I promise you freedom.']

In [134]:
# Find the location of the NLTK library
nltk_location = os.path.dirname(sys.modules['nltk'].__file__)

# Add the NLTK library location to the data path
nltk.data.path.append(nltk_location)

# Download the 'wordnet' corpus (required by WordNetLemmatizer) into the NLTK library location
nltk.download('wordnet', download_dir=nltk_location)

[nltk_data] Downloading package wordnet to c:\GenAI_udemy\TextPreproce
[nltk_data]     ssing\venv_TextPreprocessing\Lib\site-packages\nltk...
[nltk_data]   Package wordnet is already up-to-date!


True

In [135]:
## Apply Stopwords And Filter And then Apply Lemmatization

# build the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # pos='v' treats every token as a verb, so nouns such as 'friends' are left unchanged
    words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)  # join the lemmatized words back into a sentence

In [136]:
sentences

['one desire today- desire die india may live- desire face martyr ’ death , path freedom may pave martyr ’ blood .',
 'friends !',
 'comrades war liberation !',
 'today demand one thing , .',
 'demand blood .',
 'blood alone avenge blood enemy spill .',
 'blood alone pay price freedom .',
 'give blood promise freedom .']