## Stop-Words

Perform the following stop-word operations on the email dataset using the spaCy, Gensim and NLTK libraries:

a. Display the existing stop words in the default list  
b. Remove stop words from the default list  
c. Add stop words to the default list  
d. Apply stop-word elimination to the email dataset  

In [1]:
import pandas as pd

data = pd.read_csv("./datasets/emails.csv", usecols=["text", "spam"])
print(data.head(10))
print(data.info())

                                                text spam
0  Subject: naturally it's your irresistible your...    1
1  Subject: the stock trading gunslinger  fanny i...    1
2  Subject: unbelievable new homes made easy  im ...    1
3  Subject: 4 color printing special  request add...    1
4  Subject: do not have money , get software cds ...    1
5  Subject: great nnews  hello , welcome to medzo...    1
6  Subject: here ' s a hot play in motion  homela...    1
7  Subject: save your money buy getting this thin...    1
8  Subject: undeliverable : home based business f...    1
9  Subject: save your money buy getting this thin...    1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5730 entries, 0 to 5729
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5730 non-null   object
 1   spam    5728 non-null   object
dtypes: object(2)
memory usage: 89.7+ KB
None

## Stop-word operations using spaCy


In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [3]:
# Set of default English stop words in spaCy
all_stopwords = nlp.Defaults.stop_words
print(all_stopwords)

{'latterly', 'however', 'formerly', 'someone', 'through', 'who', 'neither', 'were', 'could', 'any', 'amongst', 'make', 'should', 'other', "'ll", 'ourselves', 'wherever', 'bottom', '‘ve', 'their', 'so', 'already', 'of', 'least', 'onto', 'as', 'meanwhile', 'seemed', 'she', 'beside', 'out', '‘ll', 'get', 'if', 'per', 'n’t', 'this', 'hence', 'for', 'anywhere', 'hereby', 'among', 'thence', 'herself', 'put', 'next', 'hundred', "'ve", 'him', 'others', 'sometimes', "'re", 'becoming', 'becomes', 'those', 'you', 'no', 'been', 'i', 'until', 'whole', 'during', 'beforehand', 'do', 'hereafter', 'up', 'alone', 'yourselves', 'about', 'fifty', 'mine', 'thereupon', 'less', 'hereupon', 'made', 'because', 'own', 'six', 'ours', 'anyone', 'was', 'across', 'though', 'now', 'without', 'although', 'several', 'n‘t', 'last', 'sometime', 'really', 'all', 'herein', 'whom', 'often', 'your', 'themselves', 'every', 'with', 'go', 'a', 'via', 'ca', 'by', 'they', 'anyhow', 'when', 'seem', 'under', 'afterwards', 'yours',

In [4]:
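# all_stopwords and nlp.Defaults.stop_words are the same set object,
# so a change made through either name is visible through both.
all_stopwords.remove("when")  # remove a stop word

nlp.Defaults.stop_words.add("naturally")  # add stop words
nlp.Defaults.stop_words.add("company")

print(all_stopwords)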

{'latterly', 'however', 'formerly', 'someone', 'through', 'who', 'neither', 'were', 'could', 'any', 'amongst', 'make', 'should', 'other', "'ll", 'ourselves', 'wherever', 'bottom', '‘ve', 'their', 'so', 'already', 'of', 'least', 'onto', 'as', 'meanwhile', 'seemed', 'she', 'beside', 'out', '‘ll', 'get', 'if', 'per', 'n’t', 'this', 'hence', 'for', 'anywhere', 'hereby', 'among', 'thence', 'herself', 'put', 'next', 'hundred', "'ve", 'him', 'naturally', 'others', 'sometimes', "'re", 'becoming', 'becomes', 'those', 'you', 'no', 'been', 'i', 'until', 'whole', 'during', 'beforehand', 'do', 'hereafter', 'up', 'alone', 'yourselves', 'about', 'fifty', 'mine', 'thereupon', 'less', 'hereupon', 'made', 'because', 'own', 'six', 'ours', 'anyone', 'was', 'across', 'though', 'now', 'without', 'although', 'several', 'n‘t', 'last', 'sometime', 'really', 'all', 'herein', 'whom', 'often', 'your', 'themselves', 'every', 'with', 'go', 'a', 'via', 'ca', 'by', 'they', 'anyhow', 'seem', 'under', 'afterwards', 'yo
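
In the cell above, `all_stopwords` and `nlp.Defaults.stop_words` are two names for one mutable set, so every `add` or `remove` changes the defaults for the rest of the session. The following is a minimal sketch of that behaviour using a small stand-in set (`default_stop_words` is a hypothetical placeholder, not spaCy's actual list), together with one possible way to customise a copy instead.

```python
# Stand-in for nlp.Defaults.stop_words (hypothetical, much shorter than the real list)
default_stop_words = {"when", "the", "a", "of"}

# Plain assignment creates a second name for the same set object
alias = default_stop_words
alias.remove("when")
print("when" in default_stop_words)  # the removal is visible through both names

# Copying first leaves the defaults untouched
custom_stop_words = set(default_stop_words)
custom_stop_words.add("naturally")
print("naturally" in custom_stop_words)
print("naturally" in default_stop_words)
```

Working on a copy is a design choice, not a requirement: it only matters when later cells are expected to see the unmodified defaults.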

In [5]:
Text = data["text"][0]
Text

"Subject: naturally it's your irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we don't promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you'll see logo drafts within three business days . affordability : yo

In [6]:
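doc = nlp(Text)
# Keep only the tokens whose text is not in the stop-word set
tokens_without_sw = [token.text for token in doc if token.text not in all_stopwords]
print(tokens_without_sw)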

['Subject', ':', 'irresistible', 'corporate', 'identity', ' ', 'lt', 'hard', 'recollect', ':', ' ', 'market', 'suqgestions', 'information', 'isoverwhelminq', ';', 'good', ' ', 'catchy', 'logo', ',', 'stylish', 'statlonery', 'outstanding', 'website', ' ', 'task', 'easier', '.', ' ', 'promise', 'havinq', 'ordered', 'iogo', ' ', 'automaticaily', 'world', 'ieader', ':', 'isguite', 'ciear', ' ', 'good', 'products', ',', 'effective', 'business', 'organization', 'practicable', 'aim', ' ', 'hotat', 'nowadays', 'market', ';', 'promise', 'marketing', 'efforts', ' ', 'effective', '.', 'list', 'clear', ' ', 'benefits', ':', 'creativeness', ':', 'hand', '-', ',', 'original', 'logos', ',', 'specially', ' ', 'reflect', 'distinctive', 'image', '.', 'convenience', ':', 'logo', 'stationery', ' ', 'provided', 'formats', ';', 'easy', '-', '-', 'use', 'content', 'management', 'system', 'letsyou', ' ', 'change', 'website', 'content', 'structure', '.', 'promptness', ':', 'logo', 'drafts', 'business', 'days',
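
The output above still contains whitespace tokens (`' '`) and punctuation, because the filter only removes exact stop-word matches. The following is a minimal sketch, on plain strings, of one way to drop those tokens as well. `drop_noise`, `sample_tokens` and `sample_stop_words` are hypothetical names introduced for illustration. With spaCy tokens, the `Token.is_space` and `Token.is_punct` attributes could serve a similar purpose.

```python
def drop_noise(tokens, stop_words):
    """Keep tokens that are not stop words and contain at least one letter or digit."""
    kept = []
    for tok in tokens:
        if tok in stop_words:
            continue  # exact stop-word match
        if not any(ch.isalnum() for ch in tok):
            continue  # whitespace-only or punctuation-only token
        kept.append(tok)
    return kept


# A few tokens in the style of the output above, with a tiny stand-in stop-word set
sample_tokens = ["Subject", ":", "irresistible", " ", "your", "logo", ",", "easy", "-", "use"]
sample_stop_words = {"your", "the", "a"}

print(drop_noise(sample_tokens, sample_stop_words))
```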

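## Stop-word operations using Gensim

Gensim exposes its default stop words as `STOPWORDS`, a `frozenset`. Because a `frozenset` is immutable, it has no `add` or `remove` methods (unlike the mutable sets used in the spaCy and NLTK examples). Stop words are therefore added with set union and removed with set difference, each of which returns a new set.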
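
The set operations described above can be sketched with a small stand-in `frozenset` (`base` is a hypothetical placeholder, not Gensim's actual list). The last part illustrates a pitfall worth checking for: `difference` accepts any iterable, so a bare string is treated as a sequence of characters.

```python
# Stand-in for gensim's STOPWORDS (hypothetical, much shorter than the real frozenset)
base = frozenset({"a", "the", "really", "of"})

# frozenset has no add/remove methods, so in-place edits are not possible
print(hasattr(base, "add"), hasattr(base, "remove"))

# union and difference return new frozensets and leave the original unchanged
extended = base.union({"naturally", "company"})
trimmed = extended.difference({"really"})
print(sorted(trimmed))

# Pitfall: a bare string is iterated character by character,
# so this removes "a" (a character of "really") and keeps "really"
wrong = extended.difference("really")
print(sorted(wrong))
```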

In [7]:
from gensim.parsing.preprocessing import STOPWORDS

# Gensim's default English stop words (an immutable frozenset)
all_stopwords = STOPWORDS
print(all_stopwords)

frozenset({'co', 'latterly', 'however', 'formerly', 'someone', 'through', 'who', 'neither', 'were', 'could', 'any', 'amongst', 'should', 'make', 'other', 'ourselves', 'wherever', 'bottom', 'their', 'so', 'already', 'of', 'least', 'onto', 'as', 'meanwhile', 'seemed', 'she', 'beside', 'out', 'get', 'if', 'per', 'this', 'hence', 'for', 'anywhere', 'hereby', 'among', 'thence', 'herself', 'put', 'found', 'next', 'kg', 'hundred', 'him', 'others', 'sometimes', 'becoming', 'becomes', 'those', 'you', 'no', 'been', 'i', 'until', 'whole', 'couldnt', 'interest', 'during', 'beforehand', 'do', 'hereafter', 'up', 'alone', 'yourselves', 'about', 'fifty', 'mine', 'thereupon', 'less', 'hereupon', 'made', 'own', 'because', 'six', 'ours', 'anyone', 'was', 'across', 'though', 'now', 'without', 'although', 'several', 'last', 'sometime', 'really', 'all', 'herein', 'whom', 'often', 'fire', 'your', 'themselves', 'every', 'with', 'go', 'a', 'via', 'by', 'they', 'anyhow', 'when', 'thin', 'seem', 'under', 'don', 

In [8]:
# Add stop words with union
all_stopwords_gensim = STOPWORDS.union({"naturally", "company"})
# Remove a stop word with difference. The argument must be a collection of words:
# a bare string such as "really" would be split into its characters.
all_stopwords_gensim = all_stopwords_gensim.difference({"really"})

print(all_stopwords_gensim)

frozenset({'inc', 'whoever', 'co', 'latterly', 'however', 'which', 'formerly', 'someone', 'through', 'here', 'who', 'neither', 'twelve', 'two', 'were', 'could', 'any', 'ten', 'amongst', 'quite', 'should', 'make', 'other', 'eleven', 'is', 'ourselves', 'might', 'his', 'very', 'are', 'wherever', 'bottom', 'their', 'cannot', 'else', 'bill', 'regarding', 'so', 'its', 'already', 'of', 'back', 'indeed', 'at', 'least', 'onto', 'as', 'few', 'behind', 'see', 'meanwhile', 'otherwise', 'say', 'seemed', 'she', 'towards', 'beside', 'me', 'doing', 'what', 'thick', 'but', 'out', 'sixty', 'call', 'beyond', 'none', 'move', 'into', 'nor', 'get', 'mostly', 'if', 'most', 'whenever', 'in', 'per', 'this', 'hence', 'since', 'for', 'there', 'anywhere', 'thus', 'hereby', 'among', 'how', 'again', 'serious', 'due', 'done', 'whereby', 'thence', 'herself', 'whereupon', 'almost', 'put', 'found', 'computer', 'next', 'kg', 'hundred', 'him', 'naturally', 'others', 'sometimes', 'our', 'becoming', 'further', 'becomes', '

In [9]:
Text = data["text"][0]
Text

"Subject: naturally it's your irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we don't promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you'll see logo drafts within three business days . affordability : yo

In [10]:
from gensim.utils import tokenize

# tokenize() yields alphabetic tokens; collect them in a list
tokens = list(tokenize(Text))
tokens_without_sw = [word for word in tokens if word not in all_stopwords_gensim]

print(tokens_without_sw)

['Subject', 's', 'irresistible', 'corporate', 'identity', 'lt', 'really', 'hard', 'recollect', 'market', 'suqgestions', 'information', 'isoverwhelminq', 'good', 'catchy', 'logo', 'stylish', 'statlonery', 'outstanding', 'website', 'task', 'easier', 't', 'promise', 'havinq', 'ordered', 'iogo', 'automaticaily', 'world', 'ieader', 'isguite', 'ciear', 'good', 'products', 'effective', 'business', 'organization', 'practicable', 'aim', 'hotat', 'nowadays', 'market', 'promise', 'marketing', 'efforts', 'effective', 'list', 'clear', 'benefits', 'creativeness', 'hand', 'original', 'logos', 'specially', 'reflect', 'distinctive', 'image', 'convenience', 'logo', 'stationery', 'provided', 'formats', 'easy', 'use', 'content', 'management', 'letsyou', 'change', 'website', 'content', 'structure', 'promptness', 'll', 'logo', 'drafts', 'business', 'days', 'affordability', 'marketing', 'break', 'shouldn', 't', 'gaps', 'budget', 'satisfaction', 'guaranteed', 'provide', 'unlimited', 'changes', 'extr

## Stop-word operations using NLTK

In [11]:
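import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # fetch the stop-word corpus if it is not already present

# stopwords.words() returns a list; a set supports add/remove and fast lookup
stop_words = set(stopwords.words("english"))
stop_words.remove("as")  # remove a stop word
stop_words.add("naturally")  # add a stop word

print(stop_words)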

{'was', "aren't", 'll', 'aren', 'now', "hadn't", 'above', 'having', "shan't", 'which', 'mightn', "you'll", 'all', 'through', 's', 'here', 'who', 'won', 'y', 'were', 'whom', 'any', 'your', 'then', "you've", 'should', 'other', 'yourself', 'is', 'ourselves', 'themselves', 'with', 'his', 'very', 'a', 'are', 't', 'wasn', "weren't", 'their', 'by', 'they', 'each', 'when', 'both', 'its', 'so', 'of', 'at', "shouldn't", 'd', 'theirs', 'few', "couldn't", 'had', 'under', 'just', "needn't", 'don', 'she', "that'll", 'yours', 'some', 'hers', 'shan', 'me', 'where', 'hasn', 'doing', 'himself', 'or', 'isn', 'what', 'but', 'o', "mightn't", 'out', 'once', "it's", "wouldn't", 'below', 'into', 'why', 'nor', "haven't", 'most', 'if', 'am', 'itself', 'not', 'in', "she's", 'haven', 'this', 'an', 'for', 'there', 'needn', 'he', 'doesn', 'how', 'again', 'after', 'mustn', 'herself', "doesn't", 'can', 'while', 'the', "isn't", 'does', 'didn', 'on', "won't", "should've", 'couldn', 'only', 'it', 'myself', 'have', 'thes

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\student\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
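def remove_stopwords(text):
    """Return text with whitespace-separated stop words removed."""
    return " ".join(word for word in str(text).split() if word not in stop_words)


# Apply stop-word elimination to every email in the dataset
data["text"] = data["text"].apply(remove_stopwords)
data["text"][0]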

'Subject: irresistible corporate identity lt really hard recollect company : market full suqgestions information isoverwhelminq ; good catchy logo , stylish statlonery outstanding website make task much easier . promise havinq ordered iogo company automaticaily become world ieader : isguite ciear without good products , effective business organization practicable aim hotat nowadays market ; promise marketing efforts become much effective . list clear benefits : creativeness : hand - made , original logos , specially done reflect distinctive company image . convenience : logo stationery provided formats ; easy - - use content management system letsyou change website content even structure . promptness : see logo drafts within three business days . affordability : marketing break - make gaps budget . 100 % satisfaction guaranteed : provide unlimited amount changes extra fees surethat love result collaboration . look portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
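
The `remove_stopwords` function above compares each word exactly as written, so a capitalised stop word does not match NLTK's lowercase list, and punctuation attached to a word by whitespace splitting stays in place. The following is a minimal sketch of a case-insensitive variant. `remove_stopwords_ci` and `sample_stop_words` are hypothetical names introduced for illustration, and the stop-word set is a tiny stand-in for the NLTK list.

```python
def remove_stopwords_ci(text, stop_words):
    """Drop whitespace-separated words whose lowercase form is a stop word."""
    return " ".join(w for w in str(text).split() if w.lower() not in stop_words)


sample_stop_words = {"the", "is", "your", "naturally"}
sample = "Subject: Naturally THE market is full of your logos"

print(remove_stopwords_ci(sample, sample_stop_words))
```

Only the lookup is lowercased, so the surviving words keep their original casing.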