
# NLP: Using Stopwords

**How to use the default stopwords corpus included in the Natural Language Toolkit (NLTK).**


**Stopwords are words that occur very frequently in text and carry little meaning on their own, for example: a, the, is, are.**
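
To see what "frequently occurring" means in practice, a quick word count over a small sample is enough. The sketch below uses only the standard library, and the sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample text, used only for illustration
sample = "the cat sat on the mat and the dog sat on the rug"

# Count how often each whitespace-separated word occurs
counts = Counter(sample.split())

# The single most frequent word and its count
print(counts.most_common(1))
```

In this sample the most frequent word is "the", which tells us nothing about the topic of the sentence.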

In [20]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Loading stopwords Corpus**

In [21]:
from nltk.corpus import stopwords

# List the languages for which the corpus provides a stopword list
print(stopwords.fileids())

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']


In [22]:
# Print the stopword lists for two of the available languages
print('English Stopwords : ', stopwords.words('english'))
print('German Stopwords : ', stopwords.words('german'))


English Stopwords :  ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 's

**Tokenize Words**

We split the text (a sentence or a paragraph) into a list of words and punctuation marks. Each item in the list is called a token.

In [23]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data required by word_tokenize

text = "Hello my name is Nitin Barthwal. I am a Machine Learning Student."

# Normalize text
# String comparison is case-sensitive, so "Fox" and "fox" are treated
# as two different words, and the English stopword list is lowercase.
# Hence, we convert all letters of our text to lowercase.
text = text.lower()

# Tokenize text
words = word_tokenize(text)

print(words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['hello', 'my', 'name', 'is', 'nitin', 'barthwal', '.', 'i', 'am', 'a', 'machine', 'learning', 'student', '.']
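
The effect of lowercasing can be checked with a small count. This is a minimal sketch that uses `str.split()` instead of `word_tokenize` so that it runs without any NLTK data; the sample sentence is made up.

```python
from collections import Counter

# Made-up sample text, used only for illustration
sample = "Fox and fox are the same animal"

# Without lowercasing, "Fox" and "fox" are counted as different tokens
raw_counts = Counter(sample.split())

# After lowercasing, both spellings collapse into one token
normalized_counts = Counter(sample.lower().split())

print(raw_counts["fox"], normalized_counts["fox"])
```

The first count is 1 and the second is 2, because the capitalized "Fox" only matches after normalization.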


**Removing Punctuation**



In [24]:
# Keep only purely alphabetic tokens; this drops punctuation such as '.'
words = [w for w in words if w.isalpha()]
print(words)

['hello', 'my', 'name', 'is', 'nitin', 'barthwal', 'i', 'am', 'a', 'machine', 'learning', 'student']
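
Note that `isalpha()` keeps only purely alphabetic tokens, so it also drops numbers and tokens that contain an apostrophe or a hyphen. If that is too aggressive for your data, one possible alternative is to drop only the tokens made up entirely of punctuation characters. A minimal sketch, using a hand-written token list as a stand-in for tokenizer output:

```python
import string

# Hand-written tokens, standing in for tokenizer output
tokens = ["it", "'s", "a", "state-of-the-art", "model", ",",
          "released", "in", "2019", "."]

# isalpha() drops every token that contains a non-letter character
alpha_only = [t for t in tokens if t.isalpha()]

# Alternative: drop a token only if all of its characters are punctuation
punctuation = set(string.punctuation)
no_punct = [t for t in tokens if not all(ch in punctuation for ch in t)]

print(alpha_only)
print(no_punct)
```

The second list keeps `'s`, `state-of-the-art` and `2019`, while the first one loses them along with the comma and the period.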


**Removing Stop Words**

Here, we will remove stop words from our text data using the default stopwords corpus present in NLTK.

First, get the list of English stopwords, then keep only the tokens that are not in it.

In [30]:
stop_words = stopwords.words('english')

print('Length of Stopwords', len(stop_words))

# Keep only the tokens that are not stopwords
words_filtered = [word for word in words if word not in stop_words]

print(words_filtered)

Length of Stopwords 179
['hello', 'name', 'nitin', 'barthwal', 'machine', 'learning', 'student']
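
The filtering step can also be wrapped in a small reusable function. The sketch below is one possible way to do it: `remove_stopwords` is a hypothetical helper name, and the short `demo_stop_words` list is a hand-picked stand-in for `stopwords.words('english')` so that the example runs without downloading the corpus.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    stop_set = set(stop_words)  # set membership checks are O(1) on average
    return [token for token in tokens if token not in stop_set]


# Small hand-picked stand-in for stopwords.words('english')
demo_stop_words = ["i", "my", "is", "am", "a"]

demo_tokens = ["hello", "my", "name", "is", "nitin", "barthwal",
               "i", "am", "a", "machine", "learning", "student"]

print(remove_stopwords(demo_tokens, demo_stop_words))
```

Converting the stopword list to a set once avoids scanning the whole list for every token, which matters for longer documents.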


**Updating Stop Words Corpus**

Suppose you want to keep some stopwords in your text analysis instead of filtering them out. In that case, you have to remove those words from the stopwords list.

For example, suppose you want to keep the word "my". It is present in the stopwords corpus by default.
Let's remove it from the stopwords list.

In [34]:

# Set difference (-) drops the given entries. Here we remove 'my' from the stopwords.
stop_words = set(stopwords.words('english')) - {'my'}
print('Length of Stopwords', len(stop_words))
print(stop_words)

Length of Stopwords 178
{'myself', "that'll", "you'd", 'an', 'once', 'you', 'too', 'very', 'will', 'these', 'having', 'couldn', 'just', 'through', 'it', 'am', "haven't", 'further', 'i', 'but', 'between', 'why', 'only', 'below', 'the', 'such', 't', 'll', 'was', 'each', 'some', 'mustn', 'and', 'hadn', 'again', 'off', 'can', 're', 'is', 'out', 'few', 'on', "hadn't", "you're", "weren't", 'been', 'most', 'now', 'not', 'mightn', 'don', "didn't", 'didn', "shouldn't", 'his', 'so', "aren't", "you'll", 'ours', 'into', 'which', 'with', 'we', 'of', 'were', 'because', 'until', "don't", 'should', 'yourself', 'that', 'doesn', "it's", 'their', 'ourselves', 'who', 'as', 'ain', 'him', "couldn't", "wouldn't", 'this', 'by', 'any', 'up', 'while', 'here', 'to', 's', 'where', 'after', 'do', 'during', 'when', 'there', 'more', 'yours', "doesn't", 'our', 'against', 'me', 'shan', 'its', 'ma', "mustn't", 'needn', "isn't", "should've", 'from', 'won', 'be', "shan't", 'down', 'than', 'weren', 'then', 'if', 'y', 'her

In [35]:
# Keep only the tokens that are not in the updated stopword set
words_filtered = [word for word in words if word not in stop_words]

print(words_filtered)

['hello', 'my', 'name', 'nitin', 'barthwal', 'machine', 'learning', 'student']
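
The same set operations work in the other direction as well: a union adds extra words to the stopword set. The sketch below assumes a small hand-picked set in place of `set(stopwords.words('english'))`, and treating "hello" as a stopword is a hypothetical choice made only for illustration.

```python
# Small hand-picked stand-in for set(stopwords.words('english'))
base_stop_words = {"i", "my", "is", "am", "a"}

# Set difference: stop treating "my" as a stopword
custom_stop_words = base_stop_words - {"my"}

# Set union: additionally treat "hello" as a stopword (hypothetical choice)
custom_stop_words = custom_stop_words | {"hello"}

tokens = ["hello", "my", "name", "is", "nitin", "barthwal",
          "i", "am", "a", "machine", "learning", "student"]

filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)
```

Here "my" survives the filtering and "hello" is dropped, while the original `base_stop_words` set is left unchanged.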
