## What are stop words?

A stop word is a commonly used word, such as "the", "a", "an", or "in", that a search engine is programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

These words take up space in a database and consume processing time without adding much information. We can filter them out easily by keeping a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) for Python ships ready-made stop word lists for 16 or more languages, depending on the version of the corpus.
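The idea of keeping your own list can be sketched in a few lines of plain Python. The set `CUSTOM_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names chosen for illustration, and the whitespace split is a deliberately naive tokenizer.

```python
# Hypothetical hand-picked stop word list; a real one would be much longer.
CUSTOM_STOP_WORDS = {"the", "a", "an", "in"}


def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    # Lowercase each token before the lookup so that "The" matches "the".
    return [token for token in text.split() if token.lower() not in stop_words]


tokens = remove_stop_words("The cat sat in a box")
print(tokens)  # ['cat', 'sat', 'box']
```

Storing the list as a `set` keeps each membership test cheap, which matters once the list grows to a few hundred words.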

### Why remove stop words?

Whether to remove stop words depends on the NLP task at hand. In text classification, where the goal is to assign text to distinct categories, removing stop words is common practice because it focuses attention on the words that carry the text's meaning. Content words such as "book" and "table" contribute significantly to that meaning, whereas function words such as "is" and "on" contribute little.
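One reason the decision is task-dependent: NLTK's English list contains the negations "no", "nor", and "not", so blanket removal can change what a sentence says. The sketch below uses a small hand-copied subset of that list (an assumption made to keep the example dependency-free) and a hypothetical `filter_tokens` helper.

```python
# Small subset of NLTK's English stop words, copied by hand for illustration.
STOP_SUBSET = {"the", "is", "no", "nor", "not"}
# Negations we may want to keep for meaning-sensitive tasks.
NEGATIONS = {"no", "nor", "not"}


def filter_tokens(tokens, stop_words):
    """Drop every token that appears in `stop_words`."""
    return [t for t in tokens if t not in stop_words]


tokens = "the movie is not good".split()

# Removing all stop words loses the negation.
without_negation = filter_tokens(tokens, STOP_SUBSET)
print(without_negation)  # ['movie', 'good']

# Subtracting the negations from the stop list preserves it.
with_negation = filter_tokens(tokens, STOP_SUBSET - NEGATIONS)
print(with_negation)  # ['movie', 'not', 'good']
```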

#### Checking the English stop word list

```python
import nltk
from nltk.corpus import stopwords

# download the stop word corpus (skipped if it is already up to date)
nltk.download('stopwords')

# print the English stop word list
print(stopwords.words('english'))
```

```
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mulombi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
```


#### Removing stop words with NLTK

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# note: word_tokenize needs NLTK's Punkt tokenizer data
# ('punkt', or 'punkt_tab' in recent NLTK releases), fetched via nltk.download

# sample sentence
example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""

# load english stop words as a set for fast membership tests
stop_words = set(stopwords.words('english'))

# tokenize and convert to lowercase
word_tokens = word_tokenize(example_sent.lower())

# remove stop words
filtered_sentence = [word for word in word_tokens if word not in stop_words]

# output
print("Original Tokens (lowercase):", word_tokens)
print("Filtered Tokens (no stop words):", filtered_sentence)
```

```
Original Tokens (lowercase): ['this', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
Filtered Tokens (no stop words): ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
```
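The filtered list still contains the punctuation tokens `','` and `'.'`, because punctuation is not part of the stop word list. If those should go too, one option (a sketch, not part of the original example) is to keep only alphabetic tokens with `str.isalpha()`. The token list is copied from the output above so the snippet runs without NLTK.

```python
# Filtered tokens copied from the output of the NLTK example.
filtered_sentence = ['sample', 'sentence', ',', 'showing',
                     'stop', 'words', 'filtration', '.']

# Keep only tokens made up entirely of letters; this drops ',' and '.'.
# Caveat: it would also drop tokens such as "2024" or "don't".
words_only = [token for token in filtered_sentence if token.isalpha()]

print(words_only)
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```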


#### Removing stop words with spaCy

```python
import spacy

# load spaCy english model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# sample text
text = "There is a pen on the table"

# process the text using spaCy
doc = nlp(text)

# remove stopwords: every token carries an is_stop flag
filtered_words = [token.text for token in doc if not token.is_stop]

# join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", clean_text)
```

```
Original Text: There is a pen on the table
Text after Stopword Removal: pen table
```


#### Removing stop words with Gensim

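```python
from gensim.parsing.preprocessing import remove_stopwords

# another sample text
new_text = "The majestic mountains provide a breathtaking view."

# remove stopwords using Gensim
# note: remove_stopwords splits on whitespace and compares tokens
# case-sensitively against a lowercase list, so lowercase the text first
# if capitalized stop words such as "The" should also be removed
new_filtered_text = remove_stopwords(new_text)

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_filtered_text)
```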

#### Removing stop words with scikit-learn

```python
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# another sample text
new_text = "The quick brown fox jumps over the lazy dog."

# tokenize the new text using NLTK
new_words = word_tokenize(new_text)

# remove stopwords using scikit-learn's built-in English stop word list
# (ENGLISH_STOP_WORDS is a frozenset, so each lookup is fast)
new_filtered_words = [
    word for word in new_words if word.lower() not in ENGLISH_STOP_WORDS]

# join the filtered words to form a clean text
new_clean_text = ' '.join(new_filtered_words)

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_clean_text)
```

```
Original Text: The quick brown fox jumps over the lazy dog.
Text after Stopword Removal: quick brown fox jumps lazy dog .
```
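scikit-learn can also apply its built-in English stop word list during vectorization. The sketch below assumes scikit-learn is installed and uses `CountVectorizer(stop_words="english")` together with `build_analyzer()`, which returns the callable the vectorizer uses to turn a document into tokens. Unlike the NLTK tokenizer, the default analyzer also lowercases the text and discards punctuation and single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

new_text = "The quick brown fox jumps over the lazy dog."

# stop_words="english" selects scikit-learn's built-in English list.
vectorizer = CountVectorizer(stop_words="english")

# build_analyzer() returns the preprocessing, tokenizing, and filtering
# callable that the vectorizer applies to each document.
analyze = vectorizer.build_analyzer()
tokens = analyze(new_text)

print("Original Text:", new_text)
print("Tokens after Stopword Removal:", tokens)
```

Because the filtering happens inside the vectorizer, the same stop word handling is applied automatically when the vectorizer is later fitted on a corpus.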
