<a href="https://colab.research.google.com/github/jainnipun/MachineLearning/blob/master/TextAnalytics/NLP_Semantic_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Using Stopwords

**How to use the default stopwords corpus that ships with the Natural Language Toolkit (NLTK).**

**Stopwords are words that occur very frequently in text but carry little meaning on their own, for example: a, the, is, are, etc.**
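To see why such words are called "frequently occurring", here is a minimal sketch that counts word frequencies with the standard library only. The sample sentence is a made-up example, not data from this notebook.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
sample = "the cat sat on the mat and the dog lay on the rug"

# Count how often each whitespace-separated word occurs.
counts = Counter(sample.split())

# The two most frequent words are function words, not content words.
top_two = counts.most_common(2)
print(top_two)
```

Words like "the" and "on" dominate the counts even in a tiny text, which is why they are usually filtered out before analysis.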

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

**Loading the Stopwords Corpus**

In [2]:
from nltk.corpus import stopwords

# List the languages for which NLTK ships a stopwords file
print(stopwords.fileids())

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']


In [3]:
print('English Stopwords : ', stopwords.words('english'))
print('French Stopwords : ', stopwords.words('french'))


English Stopwords :  ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 's

**Tokenize Words**

Tokenization splits a sentence or paragraph into a list of words and punctuation marks. Each item in the list is called a token.

In [4]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# word_tokenize relies on the 'punkt' tokenizer models
nltk.download('punkt')
text = "Hello my name is Nipun Jain. I am a Machine Learning enthusiast."

# Normalize text
# String comparison in Python is case-sensitive.
# For example: 'Tree' and 'tree' would be treated as two different tokens.
# The English stopwords list is all lowercase, so we lowercase the text to match it.
text = text.lower()

# Tokenize text
words = word_tokenize(text)

print(words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['hello', 'my', 'name', 'is', 'nipun', 'jain', '.', 'i', 'am', 'a', 'machine', 'learning', 'enthusiast', '.']


**Removing Punctuation**



In [5]:
words = [w for w in words if w.isalpha()]
print(words)

['hello', 'my', 'name', 'is', 'nipun', 'jain', 'i', 'am', 'a', 'machine', 'learning', 'enthusiast']
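Note that `str.isalpha()` is a strict filter: it drops every token containing a non-letter character, not only punctuation. The sketch below uses a hypothetical token list (not produced by this notebook) to show the effect, together with one possible, more permissive alternative.

```python
# Hypothetical tokens, chosen to illustrate edge cases.
tokens = ["do", "n't", "state-of-the-art", "covid19", "nlp", ".", "!"]

# Strict filter: keep only tokens made entirely of letters.
alpha_only = [t for t in tokens if t.isalpha()]

# More permissive filter (an alternative, not the notebook's method):
# keep any token that contains at least one letter or digit.
keeps_mixed = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(alpha_only)
print(keeps_mixed)
```

Which filter is appropriate depends on whether hyphenated words, contractions, and alphanumeric terms matter for the analysis.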


**Removing Stop Words**

Here, we remove stop words from our text using the default stopwords corpus that ships with NLTK.

First, get the list of English stopwords, then filter the tokens against it:

In [6]:
# Convert the list to a set so that membership tests are fast
stop_words = set(stopwords.words('english'))

print('Length of Stopwords', len(stop_words))

# Keep only the tokens that are not stop words (order is preserved)
words_filtered = [word for word in words if word not in stop_words]

print(words_filtered)

Length of Stopwords 179
['hello', 'name', 'nipun', 'jain', 'machine', 'learning', 'enthusiast']
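The filtering step can also be wrapped in a small reusable function. This is a minimal sketch: the function name `remove_stopwords` and the short stop word list are assumptions made for illustration, standing in for `stopwords.words('english')`.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not stop words, keeping order and duplicates."""
    stop_set = set(stop_words)  # build the set once for fast lookups
    return [t for t in tokens if t not in stop_set]


# Stand-in stop word list (assumption for illustration only).
sample_stop_words = ["my", "is", "i", "am", "a"]

# Repeated tokens are handled naturally: every occurrence is checked.
tokens = ["my", "name", "is", "my", "name"]
result = remove_stopwords(tokens, sample_stop_words)
print(result)
```

Because the function accepts any iterable of stop words, the same call works with the default NLTK list or with a customized set.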


**Updating Stop Words Corpus**

Suppose you want to keep certain stopwords in your text analysis. In that case, you have to remove those words from the stopwords list.

For example, suppose you want to keep the word “my”. It is present in the stopwords corpus by default.
Let’s remove it from our stopwords set.

In [7]:

# Set difference (-) drops the listed entries. Here we remove 'my' from the stopwords.
stop_words = set(stopwords.words('english')) - {'my'}
print('Length of Stopwords', len(stop_words))
print(stop_words)

Length of Stopwords 178
{'at', 'so', 'they', 'not', 'me', 'yourself', 'too', 'why', 'himself', "wasn't", 'has', 'both', 'aren', 'here', 'but', "hadn't", 'during', 'you', 'which', 'is', 'about', 's', "haven't", 'd', 'his', 'i', 'did', 'ours', 'doesn', 'into', "shan't", 'each', 'can', 'are', 'nor', 'if', 'this', 'don', 'through', 'hasn', 'because', "you'd", 'couldn', 'again', "weren't", 'out', "shouldn't", 'only', "it's", 'few', 'yours', 'he', 'any', "she's", 'hers', 'and', 've', 'most', 'or', 'were', 'for', 'from', 'no', 'didn', 'm', 'mustn', 'itself', 'shan', 'we', 'as', "that'll", 'with', 'above', "won't", 'under', 'when', 'them', 'herself', 'should', "you're", 'off', "aren't", 'was', "didn't", 'wasn', 'mightn', 'its', 'your', 'up', "you'll", 'same', 'y', 'that', 'own', "mightn't", 'before', 'these', 'their', "doesn't", "mustn't", 'our', 'over', 'ain', 'those', 'other', 'hadn', "needn't", 'being', "wouldn't", "you've", 'on', 'isn', 'than', 'further', 'ma', 'all', 'her', 'it', 'needn',

In [8]:
# Filter again with the updated stop word set; 'my' is now kept
words_filtered = [word for word in words if word not in stop_words]

print(words_filtered)

['hello', 'my', 'name', 'nipun', 'jain', 'machine', 'learning', 'enthusiast']
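The same set operations work in the other direction, for adding domain-specific stop words. The sketch below is illustrative only: the small base set stands in for the NLTK English list, and the choice of words to keep or add is hypothetical.

```python
# Stand-in for set(stopwords.words('english')) (assumption for illustration).
base_stop_words = {"i", "my", "is", "am", "a", "over", "under"}

keep = {"over", "under"}   # words we want to survive filtering
extra = {"hello"}          # extra words we want to filter out

# Set difference removes entries, set union adds them.
custom_stop_words = (base_stop_words - keep) | extra

tokens = ["hello", "my", "name", "is", "nipun", "over", "under"]
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)
```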
