<small><small><i>
All the IPython Notebooks in **[Python Natural Language Processing](https://github.com/milaan9/Python_Python_Natural_Language_Processing)** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9)**
</i></small></small>

<a href="https://colab.research.google.com/github/milaan9/Python_Python_Natural_Language_Processing/blob/main/03_StopWords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 03 Stop Words
Stop words are very common words, such as *the*, *a*, and *is*, that contribute little to the deeper meaning of a phrase. For some applications, such as document classification, it may make sense to remove them. NLTK provides lists of commonly agreed-upon stop words for a variety of languages, including English, and spaCy ships its own default English list.
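
As a rough illustration of why removal can help in a task such as document classification, the sketch below counts the words of one sentence with and without a stop list. The `SAMPLE_STOP_WORDS` set and the sample sentence are assumptions made up for this example; they are not the NLTK or spaCy lists.

```python
from collections import Counter

# Hypothetical, hand-picked stop list for illustration only
# (this is NOT NLTK's or spaCy's list)
SAMPLE_STOP_WORDS = {"the", "a", "and", "is", "on"}

sentence = "the cat is on the mat and the dog is on the rug"
tokens = sentence.split()

# Word counts before and after dropping stop words
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in SAMPLE_STOP_WORDS)

print(all_counts["the"])        # how often the most frequent stop word occurs
print(sorted(content_counts))   # the words left after filtering
```

Without filtering, the most frequent token is a stop word; after filtering, only the content-bearing words remain.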

In [1]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

{'otherwise', 'last', 'will', 'sixty', 'that', 'yourself', 'hereafter', 'became', 'in', 'get', 'anywhere', 'everything', 'my', 'former', 'amongst', 'already', 'been', 'however', 'quite', 'should', 'once', 'whole', 'thru', 'your', 'do', 'next', '‘ve', 'toward', 'again', 'is', 'then', 'about', 'fifteen', 'seem', 'into', 'you', 'this', 'alone', 'due', 'beforehand', 'above', 'who', 'ours', 'really', 'regarding', 'amount', 'everyone', 'forty', 'by', 'why', 'nobody', '‘s', 'first', 'were', "'ll", 'twelve', 'up', 'itself', 'never', 'often', 'what', 'very', 'whatever', 'rather', 'same', 'more', 'somewhere', 'after', 'below', 'only', 'did', 'yet', 'call', 'so', 'am', 'few', 'since', 'enough', 'whereupon', 'nowhere', 'back', 'but', 'yourselves', 'throughout', 'go', 'while', 'n’t', 'empty', 'take', 'whither', 'seeming', 'i', 'onto', 'least', 'own', 'name', 'become', 'two', 'sometimes', 'between', 'side', 'moreover', 'therefore', 'becomes', 'both', 'her', 'someone', 'anything', 'somehow', 'on', 'w

In [4]:
from nltk.corpus import stopwords
stopwords.words('english')  # all NLTK stop words for English

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [5]:
len(nlp.Defaults.stop_words)

326

### **Check whether a word is a stop word (using `nlp.vocab`)**

In [6]:
# Look up the word's lexeme in nlp.vocab and read its is_stop attribute
nlp.vocab['myself'].is_stop  # is "myself" a stop word?

True

In [7]:
nlp.vocab['mystery'].is_stop

False

### **Add a stop word**

In [8]:
# step-1: Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('mystery')

In [9]:
# step-2: Set the stop_word tag on the lexeme
nlp.vocab['mystery'].is_stop = True

In [10]:
len(nlp.Defaults.stop_words)

327

In [11]:
nlp.vocab['mystery'].is_stop

True

### **Remove a stop word**
Alternatively, you may decide that `'and'` should not be considered a stop word.

In [12]:
# step-1: Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('and')

# step-2: Remove the stop_word tag from the lexeme
nlp.vocab['and'].is_stop = False

In [13]:
len(nlp.Defaults.stop_words)

326

In [14]:
nlp.vocab['and'].is_stop

False
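
The add and remove steps above amount to ordinary set operations on the stop list. The sketch below shows the same idea on a plain Python set, without spaCy. The names `base_stop_words`, `extra_words`, and `keep_words` are hypothetical, and the small set only stands in for a real stop list.

```python
# Hypothetical small stop list standing in for a real one
base_stop_words = {"the", "a", "and", "is", "of"}

extra_words = {"mystery"}   # words to start treating as stop words
keep_words = {"and"}        # words that should no longer count as stop words

# Union adds the extra words; difference drops the ones to keep.
# This builds a new set and leaves base_stop_words unchanged.
custom_stop_words = (base_stop_words | extra_words) - keep_words

print(len(base_stop_words), len(custom_stop_words))
print("mystery" in custom_stop_words, "and" in custom_stop_words)
```

Unlike the in-place `add` and `remove` calls used on `nlp.Defaults.stop_words`, this variant keeps the original set intact. With spaCy, the lexeme's `is_stop` flag still has to be updated separately, as in the cells above.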

### **Remove stop words from a line**

In [15]:
import re
import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')  # tokenizer models used by word_tokenize

# load data
text = 'The Quick brown fox jump over the lazy dog!'

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
# split into words
tokens = word_tokenize(text)
print(tokens)

['The', 'Quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '!']


In [17]:
# convert to lower case
tokens = [w.lower() for w in tokens]
print(tokens)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '!']


In [18]:
# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
print(re_punc)

re.compile('[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]')
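
To make the pattern above concrete: `re.escape` backslash-escapes the special characters in `string.punctuation`, so each punctuation mark is matched literally inside the character class. The sketch below applies the pattern to a few sample tokens (chosen for this example) and compares it with `str.translate`, a standard-library alternative that should give the same result here.

```python
import re
import string

re_punc = re.compile('[%s]' % re.escape(string.punctuation))

samples = ["dog!", "don't", "e-mail", "fox"]  # sample tokens for illustration

# Regex substitution: delete every punctuation character
via_regex = [re_punc.sub('', w) for w in samples]

# Alternative: a translation table that deletes each punctuation character
table = str.maketrans('', '', string.punctuation)
via_translate = [w.translate(table) for w in samples]

print(via_regex)
print(via_translate == via_regex)
```

Note that both approaches join the parts of a word around internal punctuation, so `don't` becomes `dont` rather than being split.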


In [19]:
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]
print(stripped)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '']


In [20]:
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
print(words)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']


In [21]:
# filter out stop words, keeping only the non-stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words)

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
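
The steps in this section can be folded into one helper. This is a minimal sketch under stated assumptions: `remove_stop_words` is a hypothetical name, it splits on whitespace instead of calling `word_tokenize` (a deliberate simplification so the example runs without NLTK data), and `sample_stop_words` is a tiny hand-picked set rather than NLTK's list.

```python
import re
import string

RE_PUNC = re.compile('[%s]' % re.escape(string.punctuation))

def remove_stop_words(text, stop_words):
    """Lowercase, strip punctuation, keep alphabetic tokens, drop stop words."""
    tokens = text.lower().split()                     # whitespace split, not word_tokenize
    stripped = (RE_PUNC.sub('', t) for t in tokens)   # remove punctuation characters
    words = [w for w in stripped if w.isalpha()]      # drop empty or non-alphabetic tokens
    return [w for w in words if w not in stop_words]

# Hypothetical tiny stop list; NLTK's English list is much longer
sample_stop_words = {"the", "over", "a", "and"}

result = remove_stop_words('The Quick brown fox jump over the lazy dog!', sample_stop_words)
print(result)
```

To use NLTK's list instead, pass `set(stopwords.words('english'))` as the second argument.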


In [22]:
# Check whether a remaining token is a stop word according to spaCy
nlp.vocab['dog'].is_stop

False