<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Removing Stop Words from Strings</p>

In [1]:
'''
Stop words are words in natural language that carry very little meaning,
such as "is", "an", "the", etc.
'''

In [None]:
'''Stop words are often removed from the text before training deep learning and machine learning models,
since they occur in abundance and carry little distinguishing information.
'''
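
As a minimal sketch of the idea, stop word removal amounts to filtering tokens against a set of words. The set `TOY_STOP_WORDS` and the helper `remove_stop_words` below are hypothetical names made up for illustration; they are not part of any library, and real stop word lists are much longer.

```python
# Hypothetical, hand-written stop word set used only for illustration.
TOY_STOP_WORDS = {"is", "an", "the", "a", "of", "to"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The cat is an expert climber".split()
content_tokens = remove_stop_words(tokens, TOY_STOP_WORDS)
print(content_tokens)  # ['cat', 'expert', 'climber']
```

Lowercasing each token before the lookup is a design choice: stop word lists are usually stored in lowercase, so "The" at the start of a sentence would otherwise slip through.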

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Using Python's NLTK Library</p>

In [4]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)

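# stopwords.words() with no language argument returns the stop words of every
# language NLTK ships; build the set once instead of on every iteration.
all_language_stopwords = set(stopwords.words())
tokens_without_sw = [word for word in text_tokens if word not in all_language_stopwords]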

print(tokens_without_sw)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\38067\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['Nick', 'likes', 'play', 'football', ',', 'fond', 'tennis', '.']


In [None]:
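'''You can see that the words to, however, he, is, not, too, and of have been removed from the sentence.
Because stopwords.words() is called without a language here, the lists for all languages are combined,
which is presumably why however is removed even though it is not in the English list.'''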

In [5]:
filtered_sentence = (" ").join(tokens_without_sw)
print(filtered_sentence)

Nick likes play football , fond tennis .
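
Joining the tokens with a single space leaves a space in front of the comma and the full stop, as the output above shows. One possible way to tidy this up, sketched here with a hypothetical helper `join_tokens` that only handles a few common punctuation marks, is to strip the whitespace before punctuation after joining:

```python
import re


def join_tokens(tokens):
    """Join tokens with spaces, then drop the space before punctuation."""
    joined = " ".join(tokens)
    # Remove whitespace that precedes , . ; : ! or ?
    return re.sub(r"\s+([,.;:!?])", r"\1", joined)


tokens = ['Nick', 'likes', 'play', 'football', ',', 'fond', 'tennis', '.']
print(join_tokens(tokens))  # Nick likes play football, fond tennis.
```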


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Adding or Removing Stop Words in NLTK's Default Stop Word List</p>

In [6]:
'''Let's see the list of all the English stop words supported by NLTK:'''
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Adding Stop Words to Default NLTK Stop Word List</p>

In [7]:
all_stopwords = stopwords.words('english')
all_stopwords.append('play')

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'football', ',', 'however', 'fond', 'tennis', '.']


In [None]:
'''The output shows that the word play has been removed.'''


In [8]:
'''You can also add a list of words to the stopwords.words list using the extend method, as shown below:'''
sw_list = ['likes','play']
all_stopwords.extend(sw_list)

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'football', ',', 'however', 'fond', 'tennis', '.']


In [9]:
'''The script above adds two words, likes and play, to the stopwords.words list.
In the output above, neither of these two words appears.'''
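
The difference between `append` and `extend` is easy to get wrong, so here is a small sketch on a plain Python list. The list `custom_stop_words` is a hypothetical stand-in for the list returned by `stopwords.words('english')`.

```python
# Hypothetical starting list standing in for an NLTK stop word list.
custom_stop_words = ["the", "is"]

custom_stop_words.append("play")            # adds a single element
custom_stop_words.extend(["likes", "not"])  # adds each element of the iterable
print(custom_stop_words)  # ['the', 'is', 'play', 'likes', 'not']

# Passing a list to append() nests it as one element, so the words
# inside it are not found by a membership test.
nested = ["the", "is"]
nested.append(["likes", "play"])
print(nested)             # ['the', 'is', ['likes', 'play']]
print("likes" in nested)  # False
```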

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Removing Stop Words from Default NLTK Stop Word List</p>

In [10]:
'''The following script removes the stop word "not" from the list of English stop words returned by NLTK:'''
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'however', 'not', 'fond', 'tennis', '.']


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Using Python's Gensim Library</p>

In [11]:
from gensim.parsing.preprocessing import remove_stopwords

text = "Nick likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)

print(filtered_sentence)

Nick likes play football, fond tennis.


In [None]:
'''It is important to mention that the output after removing stop words
using the NLTK and Gensim libraries can differ, because each library ships its own stop word list.'''
'''For example, Gensim considers the word however to be a stop word, while NLTK's English list does not,
so NLTK leaves it in place when only the English list is used.'''
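
To find out where two stop word lists disagree, one option is a set difference. The sketch below uses two small, hand-written stand-in sets (`list_a` and `list_b` are hypothetical excerpts, not the real library lists); with both libraries installed, the same comparison could presumably be run on `set(stopwords.words('english'))` and `STOPWORDS`.

```python
# Hypothetical excerpts standing in for two libraries' stop word lists.
list_a = {"to", "he", "is", "not", "too", "of"}
list_b = {"to", "he", "is", "not", "too", "of", "however"}

only_in_a = sorted(list_a - list_b)  # words that only the first list removes
only_in_b = sorted(list_b - list_a)  # words that only the second list removes

print(only_in_a)  # []
print(only_in_b)  # ['however']
```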

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Adding and Removing Stop Words in Default Gensim Stop Words List</p>

In [12]:
import gensim
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)

frozenset({'except', 'together', 'above', 'regarding', 'name', 'thick', 'becoming', 'fifty', 'seems', 'between', 'himself', 'next', 'unless', 'doesn', 'other', 'herself', 'did', 'elsewhere', 'through', 'nothing', 'neither', 'because', 'whereafter', 'before', 'thereafter', 'least', 'whatever', 'side', 'should', 'behind', 'themselves', 'then', 'couldnt', 'un', 'else', 'seeming', 'nor', 're', 'describe', 'move', 'though', 'interest', 'keep', 'thereby', 'somehow', 'below', 'any', 'beyond', 'always', 'km', 'could', 'further', 'yourselves', 'eg', 'done', 'fire', 'whenever', 'all', 'very', 'that', 'along', 'anyway', 'otherwise', 'really', 'serious', 'almost', 'will', 'by', 'three', 'yours', 'him', 'so', 'why', 'seemed', 'upon', 'fifteen', 'however', 'anywhere', 'beside', 'meanwhile', 'it', 'during', 'therefore', 'without', 'her', 'sixty', 'i', 'whole', 'cannot', 'show', 'everything', 'afterwards', 'eleven', 'less', 'still', 'full', 'such', 'on', 'onto', 'were', 'various', 'thin', 'whether', '

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Adding Stop Words to Default Gensim Stop Words List</p>

In [13]:
'''A frozen set in Python is an immutable set:
you cannot add elements to it or remove elements from it in place.'''
'''Hence, to add stop words, call the union() method on the frozen set
and pass it the set of new stop words; union() returns a new frozen set.'''
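
The behaviour of frozen sets can be checked with the standard library alone. In this sketch, `base` is a hypothetical three-word frozen set standing in for Gensim's `STOPWORDS`; it illustrates that in-place modification fails while `union()` and `difference()` return new frozen sets and leave the original untouched.

```python
# Hypothetical small frozen set standing in for a library's stop word set.
base = frozenset({"is", "an", "the"})

# A frozenset has no add() method, so in-place modification fails.
add_failed = False
try:
    base.add("play")
except AttributeError:
    add_failed = True
print(add_failed)  # True

# union() and difference() build new frozen sets instead.
extended = base.union({"likes", "play"})
reduced = base.difference({"the"})

print(sorted(extended))  # ['an', 'is', 'likes', 'play', 'the']
print(sorted(reduced))   # ['an', 'is']
print(sorted(base))      # ['an', 'is', 'the']  (unchanged)
```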

In [14]:
from gensim.parsing.preprocessing import STOPWORDS

all_stopwords_gensim = STOPWORDS.union({'likes', 'play'})

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]

print(tokens_without_sw)

['Nick', 'football', ',', 'fond', 'tennis', '.']


In [None]:
'''From the output above, you can see that the words likes and play have been treated as stop words
and consequently have been removed from the input sentence.'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Removing Stop Words from Default Gensim Stopword List</p>

In [None]:
'''To remove stop words from the frozen set, pass the set of words you want to remove to the difference() method,
which returns a new frozen set.'''

In [16]:
from gensim.parsing.preprocessing import STOPWORDS

sw_list = {"not"}
all_stopwords_gensim = STOPWORDS.difference(sw_list)

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'not', 'fond', 'tennis', '.']


In [None]:
'''Since the word "not" has now been removed from the stop word set,
it is kept in the input sentence after stop word removal.'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Using the SpaCy Library</p>

In [17]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'fond', 'tennis', '.']


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Adding and Removing Stop Words in SpaCy Default Stop Word List</p>


In [18]:
print(len(all_stopwords))
print(all_stopwords)

326
{'except', 'together', 'above', 'regarding', '‘re', 'name', 'becoming', 'fifty', 'seems', 'between', 'himself', 'next', 'unless', 'other', 'herself', 'did', 'elsewhere', 'through', '’ve', 'nothing', 'neither', 'because', '‘s', 'whereafter', 'before', 'thereafter', 'least', 'whatever', 'side', 'should', 'behind', 'themselves', 'then', "'d", 'else', 'nor', 'seeming', 're', 'move', 'though', 'keep', 'thereby', 'somehow', 'below', 'any', 'beyond', 'always', 'could', 'further', 'yourselves', 'done', 'whenever', 'all', 'very', 'that', 'along', 'anyway', 'otherwise', 'really', 'serious', 'almost', 'will', 'by', 'three', 'yours', 'him', 'so', 'why', 'seemed', 'upon', 'fifteen', 'anywhere', 'however', 'beside', 'meanwhile', 'it', 'during', 'therefore', 'without', 'her', 'sixty', 'i', 'whole', 'cannot', 'show', 'eleven', 'afterwards', 'everything', 'less', 'full', 'still', 'such', 'on', 'onto', 'were', 'various', 'whether', '’m', 'a', 'only', 'his', 'been', 'their', 'none', "'re", 'via', 'mu

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Adding Stop Words to Default SpaCy Stop Words List</p>

In [19]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words
all_stopwords.add("tennis")

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'fond', '.']


In [None]:
'''The output shows that the word tennis has been removed from the input sentence.'''

In [None]:
'''You can also add multiple words to the list of stop words in SpaCy as shown below. 
The following script adds likes and tennis to the list of stop words in SpaCy:'''

In [20]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words
all_stopwords |= {"likes", "tennis"}

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'play', 'football', ',', 'fond', '.']


In [None]:
'''The output shows that the words likes and tennis have both been removed from the input sentence.'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Removing Stop Words from Default SpaCy Stop Words List</p>

In [21]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words
all_stopwords.remove('not')

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'play', 'football', ',', 'not', 'fond', '.']


In [None]:
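'''In the output, you can see that the word "not" has not been removed from the input sentence.
The words likes and tennis are still missing, which suggests that the additions made to
sp.Defaults.stop_words in the earlier cells persist within the same session.'''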
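
The output above still lacks likes and tennis, which is what one would expect if the stop word set is a single mutable object shared across cells. The sketch below reproduces that effect with a hypothetical class `ToyDefaults` (a stand-in, not spaCy's actual class) and shows that working on a copy keeps the shared set intact.

```python
class ToyDefaults:
    # Hypothetical stand-in for a library's class-level default stop word set.
    stop_words = {"is", "not", "the"}


# Binding a name to the attribute gives the same object, not a copy.
shared = ToyDefaults.stop_words
shared.add("tennis")
print("tennis" in ToyDefaults.stop_words)  # True: the default set was changed

# Copying first keeps later edits local.
private = set(ToyDefaults.stop_words)
private.discard("not")
print("not" in ToyDefaults.stop_words)  # True: the default set is untouched
print("not" in private)                 # False
```

Working on a copy is a design choice that keeps each experiment independent of the cells that ran before it.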