Stop words are common words (e.g., "and", "the", "in") that are often removed from text during preprocessing in Natural Language Processing (NLP). These words typically do not carry significant meaning and can clutter the analysis, so removing them helps improve the performance of models by focusing on the more informative words.

However, there are cases where keeping stop words is important, such as chatbots, Q&A systems, language translation, or any task where small function words carry meaning. For example, in sentiment analysis, stop words like "not" or "really" can change the meaning of a sentence; if they are removed, the sentiment could be misinterpreted. In such cases, keeping stop words helps preserve the context and overall meaning of the text.
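
As a minimal sketch of this risk, the snippet below filters two sentences against a small hand-picked stop list and shows that they become indistinguishable once "not" is dropped. `TOY_STOP_WORDS` and `remove_stop_words` are illustrative names, and the list is an assumption for demonstration, not spaCy's actual stop list.

```python
# Hypothetical, hand-picked stop list for illustration only
TOY_STOP_WORDS = {"the", "was", "not", "a", "is"}

def remove_stop_words(sentence, stop_words):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(w for w in sentence.lower().split() if w not in stop_words)

positive = remove_stop_words("The service was good", TOY_STOP_WORDS)
negative = remove_stop_words("The service was not good", TOY_STOP_WORDS)

print(positive)              # service good
print(negative)              # service good
print(positive == negative)  # True: the negation has been lost
```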

In [56]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

print(list(STOP_WORDS)[:10])
print("Total Stopping Words: ", len(STOP_WORDS))

['top', 'together', '‘ve', 'whose', 'some', 'move', 'thru', 'no', 'hereafter', 'both']
Total Stopping Words:  325


In [57]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog.")

# Print the stop words found in the sentence above
for token in doc:
  if token.is_stop:
    print(token)



The
over
the


In production NLP applications, a common approach is to define a function that cleans a text or document by applying steps such as stemming, lemmatization, and stop word removal. The function below covers one of these steps: it removes stop words and punctuation.

In [19]:
def prepocessing(text):
  doc = nlp(text)

  # list comprehension: keep tokens that are neither stop words nor punctuation
  not_stopping_words = [token.text for token in doc if not token.is_stop and not token.is_punct]

  return " ".join(not_stopping_words)

In [20]:
prepocessing("The quick brown fox jumps over the lazy dog.")

'quick brown fox jumps lazy dog'

# Removing Stop Words From a Pandas DataFrame

In [22]:
import pandas as pd

df = pd.read_json("/content/doj_press.json", lines=True)

df.head()

Unnamed: 0,id,title,contents,date,topics,components
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,[],[National Security Division (NSD)]
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,[],[Environment and Natural Resources Division]
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,[],[Environment and Natural Resources Division]
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,[],[Environment and Natural Resources Division]
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]


In [23]:
df.shape

(13087, 6)

In [26]:
type(df['topics'][0])

list

In [27]:
df.describe()

Unnamed: 0,id,title,contents,date,topics,components
count,12810,13087,13087,13087,13087,13087
unique,12672,12887,13080,2400,253,810
top,13-526,Northern California Real Estate Investor Agree...,"WASHINGTON – ING Bank N.V., a financial inst...",2018-04-13T00:00:00-04:00,[],[Criminal Division]
freq,3,8,2,20,8399,2680


Filter out those rows that do not have any topics associated with the case

In [31]:
# Keep only the rows whose 'topics' list is non-empty
df = df[df['topics'].str.len() != 0]
df.head()

Unnamed: 0,id,title,contents,date,topics,components
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division]
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U..."
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division]
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]"


In [32]:
df.shape

(4688, 6)

Reduce the dataset to 50 rows to keep preprocessing fast, since we will not be applying any machine learning algorithm here.

In [34]:
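# .copy() makes the subset an independent DataFrame, which avoids a possible SettingWithCopyWarning when a column is added later
df = df.head(50).copy()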
df.shape

(50, 6)

In [35]:
# length (in characters) of the 'contents' text in the fifth row (position 4) of our dataset
len(df["contents"].iloc[4])

5504

In [36]:
#adding new column to dataset and applying preprocessing function
df['contents_new'] = df['contents'].apply(prepocessing)
df.head()

Unnamed: 0,id,title,contents,date,topics,components,contents_new
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division],U.S. Department Justice U.S. Environmental Pro...
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division],131 count criminal indictment unsealed today B...
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U...",United States Attorney Office Middle District ...
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division],21st Century Oncology LLC agreed pay $ 19.75 m...
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]",21st Century Oncology Inc. certain subsidiarie...


In [37]:
len(df['contents_new'].iloc[4])

4217

After removing stop words and punctuation, the text in this row (position 4) shrinks from 5504 to 4217 characters.
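
To put the reduction in perspective, the two lengths reported above can be turned into a percentage. This is plain arithmetic on the numbers from the notebook output; the variable names are illustrative.

```python
original_chars = 5504  # len(df["contents"].iloc[4]) from the output above
cleaned_chars = 4217   # len(df["contents_new"].iloc[4]) from the output above

removed_chars = original_chars - cleaned_chars
reduction_pct = removed_chars / original_chars * 100

print(f"Removed {removed_chars} characters ({reduction_pct:.1f}% of the original)")
# Removed 1287 characters (23.4% of the original)
```

Keep in mind that the preprocessing function also drops punctuation, so not all of the removed characters come from stop words.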

# Exercise

**Exercise 1:**

- From a given text, count the number of stop words in it.
- Print the percentage of stop word tokens compared to all tokens in a given text.

In [59]:
#import spacy and load the model

import spacy
nlp = spacy.load("en_core_web_sm")

In [74]:
text = '''
Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and
distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).
The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,
Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),
Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.
'''

#step1: Create the object 'doc' for the given text using nlp()
doc = nlp(text)

#step2: define the variables to keep track of stopwords count and total words count
stopwords_dic = {}
total_words = 0

#step3: iterate through all the words in the document
for token in doc:
  if token.is_stop:
    stopwords_dic[token.text] = 1 + stopwords_dic.get(token.text, 0)
  total_words += 1

#step4: print the count of stop words
print("Total Stopping Words are: ", sum(stopwords_dic.values()))

#step5: print the percentage of stop words compared to total words in the text
print("Percentage of Stop Words: " + str(sum(stopwords_dic.values()) / total_words * 100) + "%")

Total Stopping Words are:  40
Percentage of Stop Words: 25.0%
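
Note that step 3 above counts every token spaCy produces, including punctuation, so the percentage depends on what is counted as a token. The library-independent sketch below illustrates the effect with a regex tokenizer and a toy stop list; both are assumptions for illustration and do not reproduce spaCy's tokenization or stop list.

```python
import re

# Hypothetical stop list for illustration only
TOY_STOP_WORDS = {"it", "is", "the", "to", "and", "in"}
text = "It is the sequel to Thor: Ragnarok (2017)."

# Split into word tokens and single punctuation tokens
tokens = re.findall(r"\w+|[^\w\s]", text)
# Keep only alphanumeric tokens (drops punctuation)
words = [t for t in tokens if t.isalnum()]
stop_count = sum(1 for t in words if t.lower() in TOY_STOP_WORDS)

pct_of_all_tokens = stop_count / len(tokens) * 100
pct_of_words = stop_count / len(words) * 100

print(f"{stop_count} stop words out of {len(tokens)} tokens: {pct_of_all_tokens:.1f}%")
# 4 stop words out of 12 tokens: 33.3%
print(f"{stop_count} stop words out of {len(words)} words: {pct_of_words:.1f}%")
# 4 stop words out of 8 words: 50.0%
```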


**Exercise 2:**

- spaCy's default stop word list includes "not". In some scenarios, however, removing "not" completely changes the meaning of the statement/text. For example, consider these two statements:

  - this is a good movie       ----> Positive Statement
  - this is not a good movie   ----> Negative Statement
- After stop word removal, both statements reduce to "good movie", so the polarity/sentiment of the text is lost.

- Now, your task is to remove "not" from spaCy's stop words so that the two texts can be distinguished.

In [78]:
#use this pre-processing function to pass the text and to remove all the stop words and finally get the cleaned form
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)

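#Step1: remove the stopword 'not' in spacy
# Discarding it from the defaults affects words spaCy has not looked up yet;
# clearing the flag on the vocab entry also covers a 'not' that was already cached
nlp.Defaults.stop_words.discard('not')
nlp.vocab['not'].is_stop = False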

#step2: send the two texts given above into the pre-process function and store the transformed texts
text1 = preprocess('this is a good movie')
text2 = preprocess('this is not a good movie')

#step3: finally print those 2 transformed texts
print(text1)
print(text2)

good movie
not good movie
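
An alternative that avoids changing spaCy's shared defaults is to filter against your own copy of the stop list. The sketch below is library-independent: `BASE_STOP_WORDS`, `NEGATIONS`, and `preprocess_keep_negations` are illustrative names with made-up contents, and whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical base stop list and negation words, for illustration only
BASE_STOP_WORDS = frozenset({"this", "is", "a", "not", "no", "the"})
NEGATIONS = frozenset({"not", "no", "never"})

# Build a task-specific stop list without modifying the base list
SENTIMENT_STOP_WORDS = BASE_STOP_WORDS - NEGATIONS

def preprocess_keep_negations(text):
    """Remove stop words but keep negation words."""
    return " ".join(w for w in text.split() if w.lower() not in SENTIMENT_STOP_WORDS)

text1 = preprocess_keep_negations("this is a good movie")
text2 = preprocess_keep_negations("this is not a good movie")

print(text1)  # good movie
print(text2)  # not good movie
```

With spaCy, the same idea could be applied by testing `token.text.lower()` against a custom set built from `STOP_WORDS` instead of relying on `token.is_stop`.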


**Exercise 3:**

- From a given text, output the most frequently used token after removing all stop words and punctuation.

In [72]:
text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.
It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,
One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the
first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be
granted test cricket status.
'''

#step1: Create the object 'doc' for the given text using nlp()
doc = nlp(text)

#step2: remove all the stop words and punctuations and store all the remaining tokens in a new list
non_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]

#step3: create a new dictionary and get the frequency of words by iterating through the list which contains stored tokens
freq_words = {}
for word in non_stop_words:
  freq_words[word] = 1 + freq_words.get(word, 0)

#step4: get the maximum frequency word
max_freq_word = max(freq_words, key=freq_words.get)

#step5: finally print the result
print(f"The most frequent word is '{max_freq_word}' with frequency: {freq_words[max_freq_word]}")

The most frequent word is 'India' with frequency: 6
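
The count above is case-sensitive, so "Cricket" and "cricket" in the exercise text are tallied as different words. The sketch below uses `collections.Counter` on a short, hypothetical token list (not the exercise text) to show how lowercasing before counting can change which word comes out on top.

```python
from collections import Counter

# Hypothetical tokens left after stop word and punctuation removal
tokens = ["India", "cricket", "team", "India", "Cricket",
          "Council", "cricket", "India", "Cricket"]

# Case-sensitive: "cricket" and "Cricket" are separate keys
case_sensitive = Counter(tokens)
# Case-folded: all spellings are merged into one lowercase key
case_folded = Counter(t.lower() for t in tokens)

top_sensitive = case_sensitive.most_common(1)[0]
top_folded = case_folded.most_common(1)[0]

print(top_sensitive)  # ('India', 3)
print(top_folded)     # ('cricket', 4)
```

Whether to fold case is a design choice: it merges spelling variants, but it also loses the distinction between proper nouns and common words.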
