## Stop Words
**Stop words** in NLP (Natural Language Processing) are words that occur very frequently in a language but carry little meaning of their own for many text-analysis tasks. They are often removed during preprocessing because they add noise rather than signal. Examples in English include "the," "and," "a," "an," "in," and "of." Removing them shrinks the data and, for tasks such as topic or keyword extraction, can improve the results.

* When we build a bag-of-words representation for several business news documents, the word counts help us tell which company each document is about. Some words, however, appear in almost every document and do nothing to show that, say, a given document is about Tesla. These words are called **Stop Words**.

<img src = "img.png" width = "800px" height = "400px"></img>

<img src = "img1.png" width = "800px" height = "400px"></img>

* Removing stop words shrinks the vocabulary, which makes the bag-of-words model less sparse.
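The effect on vocabulary size can be sketched with plain Python. This is a minimal illustration, not the notebook's method: the two documents, the tiny `STOP` set, and the helper names `bag_of_words` and `vocabulary` are all made up for the example, and spaCy's real stop list is much larger.

```python
from collections import Counter

# Hypothetical stop list for illustration; spaCy's STOP_WORDS is much larger.
STOP = {"the", "is", "a", "of", "in", "and", "to"}

# Two made-up documents about different companies.
docs = [
    "tesla is the maker of the model 3",
    "apple is the maker of the iphone",
]

def bag_of_words(text, stop_words=frozenset()):
    """Count lowercase whitespace-separated tokens, skipping stop words."""
    return Counter(t for t in text.lower().split() if t not in stop_words)

def vocabulary(bags):
    """Sorted union of all tokens seen across the documents."""
    return sorted(set().union(*bags))

with_stops = [bag_of_words(d) for d in docs]
without_stops = [bag_of_words(d, STOP) for d in docs]

vocab_with = vocabulary(with_stops)
vocab_without = vocabulary(without_stops)

print(len(vocab_with), len(vocab_without))  # 9 6
print(vocab_without)  # ['3', 'apple', 'iphone', 'maker', 'model', 'tesla']
```

Each document becomes a vector with one entry per vocabulary word, so dropping three words that say nothing about Tesla or Apple leaves shorter vectors with fewer uninformative columns.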

* NLP practitioners remove stop words most of the time, but some tasks need them in the vocabulary, so in those cases we keep them. The most important such case is **Sentiment Analysis.**

<img src = "img2.png" width = "800px" height = "400px"></img>

* Next is **sentence translation**, where we can't remove stop words. For example, if we remove the stop words from the sentence 'How are you doing Dhaval?', only 'Dhaval' is left, which no longer conveys the sense of the sentence and can't be translated into another language.

<img src = "img3.png" width = "800px" height = "400px"></img>

* Next is a **chatbot.** If we remove the stop words, the user's message no longer makes proper sense.

<img src = "img4.png" width = "800px" height = "400px"></img>

In [1]:
# First we import the spaCy library and check how many stop words its English language data defines:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS)   

326

In [3]:
# To see the stop words:
# STOP_WORDS

In [5]:
# Now let's load the spaCy English pipeline, define a simple document, and print the tokens that are stop words.
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")
for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [10]:
# To remove the stop words from a document, we define a function that takes a text and returns a list of its tokens
# without the stop words:
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return no_stop_words 

In [11]:
# We pass the text into the function here:
preprocess("We just opened our wings, the flying part is coming soon")

['opened', 'wings', ',', 'flying', 'coming', 'soon']

In [12]:
# The output above still contains the comma. To remove it, we also skip tokens that are punctuation marks.
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return no_stop_words

In [13]:
# Now again we pass the text:
preprocess("We just opened our wings, the flying part is coming soon")

['opened', 'wings', 'flying', 'coming', 'soon']
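For comparison, the same filtering idea can be approximated without spaCy. This is only a rough sketch: the `STOP` set below is a hand-picked assumption covering just this sentence, `simple_preprocess` is a hypothetical helper, and the regex tokenizer is much cruder than spaCy's.

```python
import re

# Hypothetical stop list covering only this sentence; spaCy's list is far larger.
STOP = {"we", "just", "our", "the", "part", "is"}

def simple_preprocess(text, stop_words=STOP):
    """Return word tokens that are not stop words.

    The regex only matches runs of word characters, so punctuation
    such as commas never becomes a token in the first place.
    """
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.lower() not in stop_words]

print(simple_preprocess("We just opened our wings, the flying part is coming soon"))
# ['opened', 'wings', 'flying', 'coming', 'soon']
```

The regex splits a contraction such as "don't" into "don" and "t", which is one reason to prefer a real tokenizer like spaCy's for anything beyond a quick experiment.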

In [14]:
# If we want the result as a single string instead of a list, we can join the tokens with spaces:
def preprocess(text):
    doc = nlp(text)
    
    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(no_stop_words)  

In [15]:
# Passing the text:
preprocess("Musk wants time to prepare for a trial over his")

'Musk wants time prepare trial'

**Remove stop words from pandas dataframe text column**

The dataset is downloaded from https://www.kaggle.com/datasets/jbencina/department-of-justice-20092018-press-releases and contains press releases on different court cases from the Department of Justice (DOJ). The releases contain information such as outcomes of criminal cases, notable actions taken against felons, or other updates about the current administration.

In [17]:
# How do we apply this preprocess function to a pandas DataFrame? We usually load a dataset into a DataFrame before
# building an NLP model, so we want to run the preprocessing on a DataFrame column.
# Here we have a JSON file of Department of Justice press releases, which we load into a DataFrame.
import pandas as pd
df = pd.read_json("doj_press.json",lines=True)
df.shape

(13087, 6)

In [18]:
# Let's look at 5 randomly sampled records:
df.sample(5)

Unnamed: 0,id,title,contents,date,topics,components
1672,17-773,Clinical Psychologist and Owner of Psychologic...,Two owners of psychological services companies...,2017-07-14T00:00:00-04:00,[Health Care Fraud],"[Criminal Division, USAO - Louisiana, Eastern]"
8980,16-1341,Nevada Woman Indicted For Evading Payment of T...,Concealed Personal Funds and Assets from IRS C...,2016-11-16T00:00:00-05:00,[Tax],[Tax Division]
3727,18-13,Former CFO of Arthrocare Corporation Sentenced...,The former chief financial officer (CFO) of Ar...,2018-01-05T00:00:00-05:00,[],"[Criminal Division, USAO - Texas, Western]"
11146,16-1116,Syrian Electronic Army Hacker Pleads Guilty,"Peter Romar, 37, a Syrian national affiliated ...",2016-09-28T00:00:00-04:00,"[Counterintelligence and Export Control, Finan...","[National Security Division (NSD), USAO - Virg..."
11913,15-725,Two Georgia Sisters-in-Law and Former Tax Retu...,Two Georgia sisters-in-law were sentenced toda...,2015-06-10T00:00:00-04:00,[],[Tax Division]


In [19]:
# Let's say we're building a model that automatically extracts topics from the text.
# For that we keep only the articles that have topics.
# First, check the type of the values in the topics column:
type(df.topics[0])

list

In [20]:
# The type of the topics values is list.
# Some basic exploration of the dataset:
df.describe()

Unnamed: 0,id,title,contents,date,topics,components
count,12810,13087,13087,13087,13087,13087
unique,12672,12887,13080,2400,253,810
top,13-526,Northern California Real Estate Investor Agree...,"WASHINGTON – ING Bank N.V., a financial inst...",2018-04-13T00:00:00-04:00,[],[Criminal Division]
freq,3,8,2,20,8399,2680


In [21]:
# Each topics value is a list, so we filter the rows with a boolean condition inside the square brackets.
# The condition: drop every row whose topics list is empty.
df = df[df["topics"].str.len() != 0] # Keeps only the rows with a non-empty topics list.
df.head()

Unnamed: 0,id,title,contents,date,topics,components
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division]
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U..."
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division]
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]"


In [22]:
# After filtering if we check the DataFrame shape:
df.shape

(4688, 6)

In [23]:
# Our main goal here is to apply the preprocess function to a pandas DataFrame column.
# To keep things simple, we create a small DataFrame from the first 100 rows (copied, so adding a column later
# does not trigger a SettingWithCopyWarning):
df = df.head(100).copy()
df.shape

(100, 6)

In [24]:
# Now we apply the preprocess function to the 'contents' column of the DataFrame to remove all the stop words and keep
# the remaining words. From the remaining words we can then build a simple NLP application.
# Let's look at the fifth row of the 'contents' column:
df.contents.iloc[4]

'21st Century Oncology Inc. and certain of its subsidiaries and affiliates have agreed to pay $26 million to the government to resolve a self-disclosure relating to the submission of false attestations regarding the company’s use of electronic health records software and separate allegations that they violated the False Claims Act by submitting, or causing the submission of, claims for certain services provided pursuant to referrals from physicians with whom they had improper financial relationships. \xa0 “The Justice Department is committed to zealously investigating improper financial relationships that have the potential to compromise physicians’ medical judgment,” said Acting Assistant Attorney General Chad A. Readler of the Justice Department’s Civil Division.\xa0 “However, we will work with companies that accept responsibility for their past compliance failures and promptly take corrective action.”  \xa0 21st Century Oncology, which is headquartered in Fort Myers, Florida, owns a

In [25]:
# The length of the first row is large; we want to remove the stop words from rows like this:
len(df.contents.iloc[0])

6286

In [27]:
# To remove stop words from the 'contents' column, we create a new column from the existing 'contents' column by
# passing preprocess to the 'apply()' function. We want a string back, not a list, which is why preprocess already
# uses the 'join()' function with a space separator in cell '[14]'.
df["contents_new"] = df.contents.apply(preprocess)
df.head(5)

Unnamed: 0,id,title,contents,date,topics,components,contents_new
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division],U.S. Department Justice U.S. Environmental Pro...
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division],131 count criminal indictment unsealed today B...
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U...",United States Attorney Office Middle District ...
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division],21st Century Oncology LLC agreed pay $ 19.75 m...
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]",21st Century Oncology Inc. certain subsidiarie...


In [31]:
# The newly created 'contents_new' column has the stop words removed. To check, let's look at the
# first one:
df.contents_new.iloc[0][:300]  # Shows the first 300 characters

'U.S. Department Justice U.S. Environmental Protection Agency EPA Rhode Island Department Environmental Management RIDEM announced today subsidiaries Stanley Black Decker Inc.—Emhart Industries Inc. Black Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor Restoration Pro'

In [30]:
# Comparing the character length of the first row in 'contents' with the one in 'contents_new', the new one is shorter:
len(df.contents.iloc[0]), len(df.contents_new.iloc[0])

(6286, 4574)
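The two lengths above translate into a relative saving with a line of arithmetic. The variable names are illustrative; the numbers are the ones printed by the cell above, and they count characters rather than tokens.

```python
original_len = 6286  # len(df.contents.iloc[0]) from the output above
cleaned_len = 4574   # len(df.contents_new.iloc[0]) from the output above

# Fraction of characters removed by dropping stop words and punctuation.
reduction = 1 - cleaned_len / original_len
print(f"{reduction:.1%} fewer characters")  # 27.2% fewer characters
```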

**Examples where removing stop words can create a problem**

**1. Sentiment detection: Not always, but in some cases, depending on your dataset, removing stop words can change the sentiment of a sentence.**

In [32]:
preprocess("this is a good movie")

'good movie'

In [33]:
preprocess("this is not a good movie")

'good movie'
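One possible workaround, sketched here under assumptions rather than taken from the notebook, is to exempt negation words from removal. The `STOP` and `NEGATIONS` sets and the function `preprocess_keep_negation` are hypothetical and use a plain regex tokenizer so the example runs without spaCy.

```python
import re

# Hypothetical stop list and negation list, for illustration only.
STOP = {"this", "is", "a", "not", "no", "never"}
NEGATIONS = {"not", "no", "never"}

def preprocess_keep_negation(text, stop_words=STOP, keep=NEGATIONS):
    """Drop stop words, except those listed in `keep`."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words or t in keep)

print(preprocess_keep_negation("this is a good movie"))      # good movie
print(preprocess_keep_negation("this is not a good movie"))  # not good movie
```

With spaCy the same idea would be a condition along the lines of `not token.is_stop or token.lower_ in keep` inside the list comprehension. Contracted negations tokenized as "n't" would likely need to be added to the keep set as well.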

**2. Language translation: Say you want to translate the following sentence from English to Telugu. If you remove the stop words before translating, the result will be very poor.**

In [34]:
preprocess("how are you doing dhaval?")

'dhaval'

**3. Chatbot or any Q&A system**

In [35]:
preprocess("I don't find yoga mat on your website. Can you help?")

'find yoga mat website help'