### <center>3. Stop Words</center>

#### 1. Stop words are those words that do not contribute to the deeper meaning of the phrase. 

#### 2. They are typically the most common words in a language, such as *the*, *a*, and *is*. 

#### 3. For some applications, such as document classification, it may make sense to remove stop words. 

#### 4. NLTK provides lists of commonly agreed-upon stop words for a variety of languages, including English. 
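
To make the points above concrete, here is a minimal sketch using only the standard library. The sample sentence and the tiny stop word set are made up for illustration; they are not NLTK's or spaCy's lists.

```python
from collections import Counter

# Hypothetical sample text and a tiny hand-picked stop word set
# (an assumption for illustration, not a library-provided list).
text = "the cat sat on the mat and the dog sat on the rug"
tiny_stop_words = {"the", "a", "and", "is", "on"}

# Count word frequencies: the most frequent word is a stop word.
counts = Counter(text.split())
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)

# Words that remain once the stop words are dropped.
content_words = [w for w in text.split() if w not in tiny_stop_words]
print(content_words)
```

The remaining words carry most of the topical content, which is why stop word removal can help in tasks such as document classification.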

#### 1. Import the libraries

In [2]:
# Perform standard imports:

import spacy
nlp = spacy.load('en_core_web_lg')

In [4]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 5.8 MB/s eta 0:00:00
Installing collected packages: nltk
Successfully installed nltk-3.7

[notice] A new release of pip available: 22.1.2 -> 22.2.1
[notice] To update, run: pip install --upgrade pip


#### 2. Download the stop words

In [6]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/v-ajay.nikumbh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Print the set of spaCy's default stop words (remember that sets are unordered):

print(nlp.Defaults.stop_words)

{'twelve', 'her', 'since', 'thru', 'out', 'six', 'what', 'many', 'one', 'please', 'does', 'otherwise', 'and', '’ll', 'third', "'m", 'can', 'ours', 'been', 'them', 'several', "'d", 'under', 'more', 'as', 'hence', 'between', 'either', 'side', 'herein', 'already', 'at', 'hers', 'somewhere', 'anywhere', 'whom', 'say', 'i', 'until', 'except', 'using', 'nothing', 'everywhere', 'both', '‘ve', 'unless', 'latter', 'its', 'keep', 'seeming', 'hereby', '’re', 'of', '‘re', 'two', 'moreover', 'sixty', 'become', 'nevertheless', 'various', 'yet', 'beyond', 'this', 'front', 'go', '’ve', 'very', 'show', 'than', 'make', 'thus', 'nobody', 'call', 'seem', 'by', 'ever', '‘m', 'all', 'cannot', 'full', 'often', 'rather', 'amount', 'again', 'seemed', 'five', 'their', 'my', 'n‘t', 'therein', 'we', 'elsewhere', 'get', 'anyhow', 'top', '’s', 'themselves', 'whence', 'any', 'amongst', 'whose', 'should', 'back', 'though', "'ll", 'have', 'whereas', 'ten', 'someone', 'formerly', 'herself', 'those', 'over', 'will', 'al

In [8]:
len(nlp.Defaults.stop_words)

326

#### 3. Check whether a word is a stop word, and add a new one

In [9]:
nlp.vocab['myself'].is_stop

True

In [10]:
nlp.vocab['mystery'].is_stop

False

In [11]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('mystery')

In [12]:
# Set the stop_word tag on the lexeme
nlp.vocab['mystery'].is_stop = True

In [13]:
len(nlp.Defaults.stop_words)

327

In [14]:
nlp.vocab['mystery'].is_stop

True

#### 4. Remove a stop word


In [15]:
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [16]:
len(nlp.Defaults.stop_words)

326

In [17]:
nlp.vocab['beyond'].is_stop

False
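
The cells above update two places each time: the `nlp.Defaults.stop_words` set and the `is_stop` flag on the lexeme. A small helper can keep the two in sync. This is a sketch only; `set_stop_word` is a hypothetical name, not part of spaCy, and it assumes an `nlp` object that exposes `Defaults.stop_words` and `vocab` as used in this notebook.

```python
def set_stop_word(nlp, word, is_stop=True):
    """Hypothetical helper: update the stop word set and the lexeme flag together."""
    word = word.lower()  # the stop word set holds lowercase entries
    if is_stop:
        nlp.Defaults.stop_words.add(word)
    else:
        # discard() does not raise if the word is absent, unlike remove()
        nlp.Defaults.stop_words.discard(word)
    # Mirror the change on the lexeme, as the cells above do by hand
    nlp.vocab[word].is_stop = is_stop
    return nlp.vocab[word].is_stop
```

With the pipeline loaded earlier, `set_stop_word(nlp, 'mystery')` would mirror cells 11 and 12, and `set_stop_word(nlp, 'beyond', is_stop=False)` would mirror cell 15.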

#### 5. Example: removing stop words with NLTK

In [18]:
import re
import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# Download the Punkt tokenizer models used by word_tokenize
nltk.download('punkt')

# Sample text
text = 'The Quick brown fox jump over the lazy dog!'

[nltk_data] Downloading package punkt to
[nltk_data]     /home/v-ajay.nikumbh/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [19]:
# split into words
tokens = word_tokenize(text)
print(tokens)

['The', 'Quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '!']


In [20]:
# convert to lower case
tokens = [w.lower() for w in tokens]
print(tokens)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '!']


In [21]:
# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
print(re_punc)

re.compile('[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]')
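
The compiled pattern is a character class that matches any single character in `string.punctuation`, so substituting with an empty string deletes punctuation wherever it occurs inside a token. A small sketch with made-up sample tokens:

```python
import re
import string

# Same pattern as in the cell above
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

# Hypothetical sample tokens, chosen to show different cases
samples = ["dog!", "don't", "e-mail", "!"]
cleaned = [re_punc.sub('', s) for s in samples]
print(cleaned)
```

Note that a token made up only of punctuation becomes an empty string, which is why a later step filters out non-alphabetic tokens.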


In [22]:
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]
print(stripped)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '']


In [23]:
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
print(words)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']


In [24]:
# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words)

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
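
The steps in this example can be collected into one function. The sketch below makes two assumptions so that it runs without any downloads: a simple regex tokenizer stands in for `word_tokenize`, and the stop words are passed in as a set (here a tiny hand-picked one rather than `stopwords.words('english')`). The name `remove_stop_words` is hypothetical.

```python
import re
import string

# Pattern that matches any single punctuation character
RE_PUNC = re.compile('[%s]' % re.escape(string.punctuation))

def remove_stop_words(text, stop_words):
    """Lowercase, strip punctuation, keep alphabetic tokens, drop stop words."""
    # Assumption: a regex tokenizer replaces nltk.word_tokenize here
    tokens = re.findall(r"\w+|[^\w\s]", text)
    tokens = [t.lower() for t in tokens]
    stripped = [RE_PUNC.sub('', t) for t in tokens]
    words = [w for w in stripped if w.isalpha()]
    return [w for w in words if w not in stop_words]

# Tiny hand-picked stop word set, for illustration only
demo_stop_words = {"the", "over"}
result = remove_stop_words('The Quick brown fox jump over the lazy dog!', demo_stop_words)
print(result)
```

To reproduce the notebook's behavior, pass `set(stopwords.words('english'))` as `stop_words` and swap the regex tokenizer for `word_tokenize`.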
