<a href="https://colab.research.google.com/github/goel4ever/machine-learning-notebooks/blob/main/nlp_stop_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Stop words

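`Stop words` are very common words, such as "the", "an", and "of", that carry little meaning on their own and are usually ignored during text analysis. This notebook shows how to filter stop words out of a text using Natural Language Processing (NLP).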

We'll use the `NLTK` package for the implementation. A collection of texts is called a `corpus` (plural: corpora). NLTK provides several corpora, covering everything from novels hosted by Project Gutenberg to inaugural speeches by presidents of the United States. This notebook works with the Brown corpus.

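To analyze texts in NLTK, you first need to download them. A one-off run of `nltk.download()` with no arguments opens the interactive downloader, where you can fetch all the resources in one go. Note: downloading everything takes some time. This notebook only needs the `brown` corpus and the `stopwords` list, so the cells below download just those two. `sent_tokenize` and `word_tokenize` also rely on the Punkt tokenizer models (`punkt`, or `punkt_tab` in recent NLTK releases), so download those as well if they are missing from your environment.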
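
As a rough sketch of the download step, the helper below checks whether each resource is already installed and downloads only the missing ones. The function name `ensure_nltk_resources`, the `REQUIRED` mapping, and the optional `find` / `download` hooks are hypothetical names introduced for illustration. By default the helper falls back to `nltk.data.find` and `nltk.download`. The tokenizer entry is an assumption about your NLTK version.

```python
# Resource path (as understood by nltk.data.find) -> package name for nltk.download.
# Assumption: recent NLTK releases may need "tokenizers/punkt_tab" / "punkt_tab"
# instead of the "punkt" entry below.
REQUIRED = {
    "corpora/brown": "brown",
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
}


def ensure_nltk_resources(resources, find=None, download=None):
    """Download every resource that is not installed yet.

    Returns the list of package names that had to be downloaded.
    """
    if find is None or download is None:
        # Imported lazily so the helper can also be exercised with stand-ins.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources.items():
        try:
            find(path)          # raises LookupError when the resource is missing
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched
```

In the notebook, a single call such as `ensure_nltk_resources(REQUIRED)` could stand in for the individual `nltk.download(...)` cells.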

In [2]:
import nltk

In [54]:
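# nltk.download()        # interactive downloader for all resources
# nltk.download('book')  # everything used in the NLTK book
nltk.download('brown')   # only the Brown corpus is needed here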

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [55]:
# from nltk.book import *
from nltk.corpus import brown

In [56]:
# Download stop words
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [57]:
# Imports for tokenizing words and sentences
from nltk.tokenize import sent_tokenize, word_tokenize

In [58]:
# Get English stop words
stop_words = set(stopwords.words("english"))

In [59]:
# Join the corpus tokens (over a million of them) into a single space-separated string
text1_str = ' '.join(brown.words())

In [60]:
# Break text into sentences
sentences = sent_tokenize(text1_str)

# Print first 5 sentences
sentences[:5]

["The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .",
 "The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .",
 "The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .",
 "`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .",
 "The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' ."]

In [61]:
# Tokenize first sentence into words
words = word_tokenize(sentences[0])
words

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 'Atlanta',
 "'s",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 '``',
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.']

In [63]:
# Filter out stop words
filtered_list = []
for word in words:
    # casefold() makes the comparison case-insensitive, so 'The' matches 'the'
    if word.casefold() not in stop_words:
        filtered_list.append(word)

filtered_list

['Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'investigation',
 'Atlanta',
 "'s",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'evidence',
 '``',
 'irregularities',
 'took',
 'place',
 '.']
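
The same filtering step can be written as a small reusable function built on a list comprehension. This is only a sketch: `remove_stop_words` and `sample_stop_words` are hypothetical names, and the stop-word set is a hand-picked stand-in for `stopwords.words("english")` so that the example runs without any NLTK data.

```python
def remove_stop_words(tokens, stop_words):
    """Keep the tokens whose casefolded form is not in stop_words."""
    return [token for token in tokens if token.casefold() not in stop_words]


# Tokens of the first Brown sentence, copied from the word_tokenize output above.
tokens = [
    "The", "Fulton", "County", "Grand", "Jury", "said", "Friday", "an",
    "investigation", "of", "Atlanta", "'s", "recent", "primary", "election",
    "produced", "``", "no", "evidence", "``", "that", "any",
    "irregularities", "took", "place", ".",
]

# Assumption: a tiny stand-in for NLTK's English list, containing only the
# entries that matter for this sentence.
sample_stop_words = {"the", "an", "of", "no", "that", "any"}

filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)
```

In the notebook itself, `remove_stop_words(words, stop_words)` would give the same list as the loop above. Note that punctuation tokens and words like `said` survive, because they are not part of the stop-word list.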

In [83]:
# Customize the stop words list based on your needs
custom_stop_words = {"said", "'s", "``", "."}

In [84]:
# Filter out both the NLTK stop words and the custom ones
filtered_list = []
final_stop_words_set = stop_words | custom_stop_words  # set union

for word in words:
    # casefold() makes the comparison case-insensitive, so 'The' matches 'the'
    if word.casefold() not in final_stop_words_set:
        filtered_list.append(word)

filtered_list

['Fulton',
 'County',
 'Grand',
 'Jury',
 'Friday',
 'investigation',
 'Atlanta',
 'recent',
 'primary',
 'election',
 'produced',
 'evidence',
 'irregularities',
 'took',
 'place']
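
Listing every punctuation token by hand does not scale well to a whole corpus. One possible alternative, sketched below, is to drop any token that contains no letter or digit and keep the custom set for real words only. The helper names `is_punctuation` and `remove_noise` are hypothetical, and the rule itself is an assumption about what counts as noise for your analysis.

```python
def is_punctuation(token):
    """True when the token contains no letter or digit, e.g. '``' or '.'."""
    return not any(ch.isalnum() for ch in token)


def remove_noise(tokens, extra_stop_words):
    """Drop punctuation-only tokens and any token in extra_stop_words."""
    return [
        token for token in tokens
        if not is_punctuation(token) and token.casefold() not in extra_stop_words
    ]


# The sentence after the NLTK stop words were removed, copied from the output above.
tokens = [
    "Fulton", "County", "Grand", "Jury", "said", "Friday", "investigation",
    "Atlanta", "'s", "recent", "primary", "election", "produced", "``",
    "evidence", "``", "irregularities", "took", "place", ".",
]

cleaned = remove_noise(tokens, {"said", "'s"})
print(cleaned)
```

For this sentence the result matches the list produced with the hand-written custom set, without having to name `` `` `` or `.` explicitly.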