A note here: tokenization has to be performed before removing any stopwords, because stopwords are matched against individual tokens.

Removing stopwords is not a fixed rule in NLP; it depends on the task we are working on. For tasks like text classification, where the text is to be assigned to different categories, stopwords are removed from the text so that more weight falls on the words that define its meaning.

However, in tasks like machine translation and text summarization, removing stopwords is not advisable.

Important reasons for removing stopwords:

* Removing stopwords shrinks the dataset, so the time needed to train the model also decreases

* Removing stopwords can potentially improve performance, since fewer and more meaningful tokens are left. Thus, it could increase classification accuracy

* Search engines like Google have traditionally ignored stopwords for fast and relevant retrieval of data from the database
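To get a feel for how much text disappears, here is a minimal sketch using plain Python. The stopword set and the sentence are made up for illustration; a real list such as NLTK's is much longer.

```python
# Hypothetical mini stopword list, for illustration only
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "in", "on"}

sentence = "the cat is on the mat and the dog is in the garden"
tokens = sentence.split()

# Keep only tokens that are not in the stopword set
kept = [t for t in tokens if t not in STOP_WORDS]

print(len(tokens), "tokens before:", tokens)
print(len(kept), "tokens after: ", kept)
```

In this toy sentence most of the tokens are stopwords, which is why the filtered version is so much shorter.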

The question arises: when should I remove stopwords, and when not?

We can remove stopwords in the following cases:
* Text Classification
* Spam Filtering
* Language Classification
* Genre Classification
* Caption Generation
* Auto-Tag Generation
 

We should avoid removing stopwords in the following cases:
* Machine Translation
* Language Modeling
* Text Summarization
* Question-Answering problems
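One way to see why removal can hurt in these tasks: stopword lists usually contain negations (NLTK's English list includes "not", for example). The sketch below uses a hypothetical mini list and a made-up helper `drop_stopwords` to show two opposite sentences collapsing into the same tokens.

```python
# Hypothetical mini stopword list; like NLTK's English list, it contains "not"
STOP_WORDS = {"this", "is", "not", "a", "the", "was"}

def drop_stopwords(sentence):
    """Drop stopwords from a lowercased, whitespace-tokenized sentence."""
    return [t for t in sentence.lower().split() if t not in STOP_WORDS]

positive = drop_stopwords("This movie is good")
negative = drop_stopwords("This movie is not good")

print(positive)
print(negative)
print("Identical after removal:", positive == negative)
```

A translation or summarization model fed the filtered text would have no way to tell the two sentences apart.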

NLTK provides lists of stopwords for **16 different languages** (the exact number depends on the version of the stopwords corpus).

In [1]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
nltk.download('punkt')
print(len(set(stopwords.words('english'))))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
179


In [2]:
# sample sentence
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# set of stop words
stop_words = set(stopwords.words('english')) 

# tokens of words  
word_tokens = word_tokenize(text) 
    
filtered_sent = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sent.append(w) 



print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens)) 

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sent)) 



Original Sentence 


He determined to drop his litigation with the monastry , and relinguish his claims to the wood-cuting and fishery rihgts at once . He was the more ready to do this becuase the rights had become much less valuable , and he had indeed the vaguest idea where the wood and river in question were .


Filtered Sentence 


He determined drop litigation monastry , relinguish claims wood-cuting fishery rihgts . He ready becuase rights become much less valuable , indeed vaguest idea wood river question .
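Notice that "He" survives in the filtered sentence. NLTK's stopword list is lowercase, and the membership check in the cell above is case-sensitive. A minimal sketch of a case-insensitive variant, using a hypothetical mini stopword set and a hand-written token list instead of the NLTK objects:

```python
# Hypothetical mini stopword list (lowercase, like NLTK's)
stop_words = {"he", "to", "his", "with", "the", "and", "at"}

# Tokens as a tokenizer might return them (assumed input)
word_tokens = ["He", "determined", "to", "drop", "his", "litigation", "."]

# Case-sensitive check, as in the cell above: "He" is kept
case_sensitive = [w for w in word_tokens if w not in stop_words]

# Lowercase only for the comparison, so kept tokens retain their casing
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase is a design choice: it removes more stopwords, but for some tasks the capitalized form carries information we may want to keep.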


**Stopword Removal using spaCy**

**spaCy has its own list of stopwords, which can be imported as STOP_WORDS from the spacy.lang.en.stop_words module.**

In [4]:
from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer (no tagger, parser or NER)
nlp = English()

text = """Apartment Therapy is a blog focusing on interior design. It was launched by Maxwell Ryan in 2001. Ryan is an interior designer who turned to blogging (using the moniker “the apartment therapist”). The blog has reached 20 million followers and has expanded into a full-scale media company."""

#  "nlp" Object is used to create documents with linguistic annotations.
my_doc = nlp(text)
print(my_doc)
print("_______________________________________________")

# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)


print(token_list)
print("__________________________________________________")

from spacy.lang.en.stop_words import STOP_WORDS

# Create list of word tokens after removing stopwords
filtered_sentence = []

for word in token_list:
    # is_stop flags words that belong to spaCy's English stopword list
    lexeme = nlp.vocab[word]
    if not lexeme.is_stop:
        filtered_sentence.append(word)
print("-------------------------------------------------------------")
print(filtered_sentence)

Apartment Therapy is a blog focusing on interior design. It was launched by Maxwell Ryan in 2001. Ryan is an interior designer who turned to blogging (using the moniker “the apartment therapist”). The blog has reached 20 million followers and has expanded into a full-scale media company.
_______________________________________________
['Apartment', 'Therapy', 'is', 'a', 'blog', 'focusing', 'on', 'interior', 'design', '.', 'It', 'was', 'launched', 'by', 'Maxwell', 'Ryan', 'in', '2001', '.', 'Ryan', 'is', 'an', 'interior', 'designer', 'who', 'turned', 'to', 'blogging', '(', 'using', 'the', 'moniker', '“', 'the', 'apartment', 'therapist', '”', ')', '.', 'The', 'blog', 'has', 'reached', '20', 'million', 'followers', 'and', 'has', 'expanded', 'into', 'a', 'full', '-', 'scale', 'media', 'company', '.']
__________________________________________________
-------------------------------------------------------------
['Apartment', 'Therapy', 'blog', 'focusing', 'interior', 'design', '.', 'launch

**Stopword Removal using Gensim**

We can easily import the remove_stopwords function from the gensim.parsing.preprocessing module.

In [5]:
from gensim.parsing.preprocessing import remove_stopwords
t="""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, 
and he had indeed the vaguest idea where the wood and river in question were."""
# pass the sentence to the remove_stopwords function
result = remove_stopwords(t)

print('\n\n Filtered Sentence \n\n')
print(result) 



 Filtered Sentence 


He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts once. He ready becuase rights valuable, vaguest idea wood river question were.
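In the output above, "once." and "were." are still there even though "at" and "the" are gone. This suggests the text is split on whitespace only, so a word with punctuation attached no longer matches the stopword list. The sketch below reproduces the effect in plain Python with a hypothetical mini stopword set; it does not call Gensim.

```python
import re

# Hypothetical mini stopword list, for illustration only
STOP_WORDS = {"at", "once", "he", "had", "where", "the", "were", "in"}

text = "at once. He had the vaguest idea where the wood were."

# Whitespace splitting keeps punctuation glued to the word: "once." != "once"
whitespace_kept = [t for t in text.split() if t.lower() not in STOP_WORDS]

# Splitting words and punctuation apart first lets the lookup succeed
tokens = re.findall(r"\w+|[^\w\s]", text)
regex_kept = [t for t in tokens if t.lower() not in STOP_WORDS]

print(whitespace_kept)
print(regex_kept)
```

So if we want every stopword gone, it helps to separate punctuation from words before filtering.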


**Text Normalization**

In any natural language, words can be written or spoken in more than one form depending on the situation. That’s what makes the language such a thrilling part of our lives, right? 

For example:

* Lisa ate the food and washed the dishes.
* They were eating noodles at a cafe.
* Don’t you want to eat before we leave?
* We have just eaten our breakfast.
* It also eats fruit and vegetables.

In the examples above, we can see that the word "eat" appears in several forms. For us, it is easy to understand that eating is the activity here, so it doesn’t really matter whether it is ‘ate’, ‘eat’, or ‘eaten’: we know what is going on.

Unfortunately, that is not the case with machines. They treat these words differently. Therefore, we need to normalize them to their root word, which is “eat” in our example.

Hence, text normalization is a process of transforming a word into a single canonical form. This can be done by two processes, stemming and lemmatization. Let’s understand what they are in detail.

**Stemming**

Stemming is a text normalization technique that cuts off the end or the beginning of a word, using a list of common prefixes and suffixes that could be found in that word. In practice, this mostly means stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
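To make the idea concrete, here is a toy suffix stripper. It is only a sketch of the principle, not the Porter algorithm; the function name `naive_stem` and the minimum stem length are assumptions made for this example.

```python
# Suffixes from the text, ordered so that "es" is tried before "s"
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["eating", "quickly", "dishes", "eats", "noodles", "is"]:
    print(w, "->", naive_stem(w))
```

Note that "noodles" becomes "noodl": a stem does not have to be a real word, which is the usual trade-off of rule-based stemming.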


**Lemmatization**

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (a dictionary of words) and morphological analysis (word structure and grammar relations).
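The vocabulary part can be pictured as a lookup table. The sketch below uses a tiny hand-written table as a stand-in for a real vocabulary such as WordNet; the table and the helper `lookup_lemma` are hypothetical.

```python
# Hypothetical lookup table standing in for a real vocabulary
LEMMA_TABLE = {
    "ate": "eat",
    "eating": "eat",
    "eaten": "eat",
    "eats": "eat",
    "washed": "wash",
    "dishes": "dish",
}

def lookup_lemma(word):
    """Return the lemma from the table, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for w in ["ate", "Eaten", "eats", "cafe"]:
    print(w, "->", lookup_lemma(w))
```

An irregular form like "ate" cannot be reached by cutting off suffixes, which is where a dictionary-based approach has the edge over stemming.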

**Why do we need to perform stemming or lemmatization?**

Let’s consider two sentences, for example “He was driving” and “He went for a drive”.

We can easily state that both sentences convey the same meaning, that is, a driving activity in the past. A machine will treat the two sentences differently. Thus, to make the text understandable for the machine, we need to perform stemming or lemmatization.

Another benefit of text normalization is that it reduces the number of unique words in the text data. This helps in bringing down the training time of the machine learning model (and don’t we all want that?).
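We can check this on the "eat" sentences from earlier. The sketch below counts unique words before and after mapping the different forms of "eat" to one root; the mapping is hand-written for this example rather than produced by a library.

```python
sentences = [
    "Lisa ate the food and washed the dishes",
    "They were eating noodles at a cafe",
    "We have just eaten our breakfast",
    "It also eats fruit and vegetables",
]

# Hand-written mapping for illustration only
EAT_FORMS = {"ate": "eat", "eating": "eat", "eaten": "eat", "eats": "eat"}

tokens = [w.lower() for s in sentences for w in s.split()]
normalized = [EAT_FORMS.get(t, t) for t in tokens]

vocab_before = set(tokens)
vocab_after = set(normalized)

print("Unique words before:", len(vocab_before))
print("Unique words after: ", len(vocab_after))
```

Here only one word family is normalized; on a real corpus the same effect applies to every inflected word, so the reduction is much larger.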

**Methods to Perform Text Normalization**


**Text Normalization using NLTK**

The NLTK library has a lot of amazing tools for the different steps of data preprocessing. Classes like PorterStemmer() and WordNetLemmatizer() perform stemming and lemmatization, respectively.

**Stemming**

In [7]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer

# sample sentence

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english')) 
  
word_tokens = word_tokenize(text) 
    
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 

# stem every remaining token with the Porter stemmer
stem_words = []
ps = PorterStemmer()
for w in filtered_sentence:
    root_word = ps.stem(w)
    stem_words.append(root_word)
print(filtered_sentence)
print(stem_words)

['He', 'determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'rights', 'become', 'much', 'less', 'valuable', ',', 'indeed', 'vaguest', 'idea', 'wood', 'river', 'question', '.']
['He', 'determin', 'drop', 'litig', 'monastri', ',', 'relinguish', 'claim', 'wood-cut', 'fisheri', 'rihgt', '.', 'He', 'readi', 'becuas', 'right', 'becom', 'much', 'less', 'valuabl', ',', 'inde', 'vaguest', 'idea', 'wood', 'river', 'question', '.']


**Lemmatization**

In [9]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [12]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
from nltk.stem import WordNetLemmatizer
# sample sentence

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english')) 
  
word_tokens = word_tokenize(text) 
    
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 
print(filtered_sentence) 

print("--------------------------------------------------------------------------------------------")

lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for w in filtered_sentence:
    # No POS tagger is used here: each token is lemmatized as a noun,
    # then as a verb, then as an adjective, chaining the results
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

['He', 'determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'rights', 'become', 'much', 'less', 'valuable', ',', 'indeed', 'vaguest', 'idea', 'wood', 'river', 'question', '.']
--------------------------------------------------------------------------------------------
['He', 'determine', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claim', 'wood-cuting', 'fishery', 'rihgts', '.', 'He', 'ready', 'becuase', 'right', 'become', 'much', 'le', 'valuable', ',', 'indeed', 'vague', 'idea', 'wood', 'river', 'question', '.']

Note that 'less' comes out as 'le': because every token is pushed through the noun lemmatizer regardless of its actual part of speech, the trailing 's' is treated like a plural ending.


**Text Normalization using spaCy**

**spaCy only supports lemmatization; it does not provide a stemmer**

In [14]:
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were.""")

lemma_word1 = [] 
for token in doc:
    lemma_word1.append(token.lemma_)
q=lemma_word1

In [15]:
q

['-PRON-',
 'determine',
 'to',
 'drop',
 '-PRON-',
 'litigation',
 'with',
 'the',
 'monastry',
 ',',
 'and',
 'relinguish',
 '-PRON-',
 'claim',
 'to',
 'the',
 'wood',
 '-',
 'cut',
 'and',
 '\n',
 'fishery',
 'rihgts',
 'at',
 'once',
 '.',
 '-PRON-',
 'be',
 'the',
 'more',
 'ready',
 'to',
 'do',
 'this',
 'becuase',
 'the',
 'right',
 'have',
 'become',
 'much',
 'less',
 'valuable',
 ',',
 'and',
 '-PRON-',
 'have',
 '\n',
 'indeed',
 'the',
 'vague',
 'idea',
 'where',
 'the',
 'wood',
 'and',
 'river',
 'in',
 'question',
 'be',
 '.']

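Here -PRON- is the placeholder lemma that spaCy 2.x assigns to pronouns; it can easily be removed using regular expressions (spaCy 3 no longer uses this placeholder). The benefit of spaCy is that we do not have to pass any pos parameter to perform lemmatization, because the pipeline tags parts of speech itself.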
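A minimal sketch of that cleanup step, working on a short hand-copied excerpt of the lemma list above rather than on a live spaCy Doc. It also drops the whitespace-only tokens such as '\n', which is an extra assumption about what we want removed.

```python
import re

# A short excerpt of the lemma list printed above
lemmas = ["-PRON-", "determine", "to", "drop", "-PRON-",
          "litigation", "and", "\n", "fishery"]

# Matches the -PRON- placeholder or a token made only of whitespace
unwanted = re.compile(r"-PRON-|\s+")

cleaned = [t for t in lemmas if not unwanted.fullmatch(t)]
print(cleaned)
```

Using `fullmatch` means a token is dropped only when the whole token matches, so ordinary words are left untouched.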

**Text Normalization using TextBlob**

**Lemmatization**

In [16]:
# import the Word class from the textblob library
from textblob import Word 

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

lem = []
for i in text.split():
    word1 = Word(i).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())
print(lem)

['He', 'determine', 'to', 'drop', 'his', 'litigation', 'with', 'the', 'monastry,', 'and', 'relinguish', 'his', 'claim', 'to', 'the', 'wood-cuting', 'and', 'fishery', 'rihgts', 'at', 'once.', 'He', 'wa', 'the', 'more', 'ready', 'to', 'do', 'this', 'becuase', 'the', 'right', 'have', 'become', 'much', 'le', 'valuable,', 'and', 'he', 'have', 'indeed', 'the', 'vague', 'idea', 'where', 'the', 'wood', 'and', 'river', 'in', 'question', 'were.']
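In the output above, tokens like 'monastry,' and 'once.' still carry their punctuation because the text was split on whitespace only, and a word with punctuation attached is unlikely to be found in the lemmatizer's dictionary. A minimal sketch of stripping punctuation before lemmatizing, in plain Python with a shortened sample text (the lemmatization step itself is left out):

```python
import string

text = "litigation with the monastry, and fishery rihgts at once. in question were."

# Plain whitespace splitting, as in the cell above
raw_tokens = text.split()

# Strip leading/trailing punctuation and drop tokens that become empty
clean_tokens = [t.strip(string.punctuation) for t in raw_tokens]
clean_tokens = [t for t in clean_tokens if t]

print(raw_tokens)
print(clean_tokens)
```

Since `strip` only touches the ends of a token, hyphenated words such as "wood-cuting" would keep their inner hyphen.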
