
# Stop Word Removal using NLTK 

In [9]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)

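# Load the English list once; stopwords.words() with no argument returns the words of every language
english_stopwords = set(stopwords.words('english'))
tokens_without_sw = [word for word in text_tokens if word not in english_stopwords]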

print("Original Tokens: ",text_tokens)
print("Tokens after StopWord Removal: ",tokens_without_sw)

Original Tokens:  ['Nick', 'likes', 'to', 'play', 'football', ',', 'however', 'he', 'is', 'not', 'too', 'fond', 'of', 'tennis', '.']
Tokens after StopWord Removal:  ['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']


In the script above, we first import the stopwords collection from the nltk.corpus module and the word_tokenize() function from the nltk.tokenize module. We then create a variable text, which contains a simple sentence, and tokenize it (divide it into words and punctuation marks) with word_tokenize(). Next, we iterate through the tokens in the text_tokens list and check whether each one exists in the stop word collection. Tokens that are not stop words are added to the tokens_without_sw list, which is then printed.
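One detail worth keeping in mind: the NLTK English stop words are all lowercase, so a plain membership test is case-sensitive and a capitalized token such as `The` would slip through. The sketch below shows one way to compare case-insensitively with a set. `SAMPLE_STOPWORDS` is a small hand-picked stand-in for the NLTK list and `filter_tokens` is a hypothetical helper, both assumed here so the snippet runs without any downloads.

```python
# Small stand-in for stopwords.words('english'); an assumption for illustration only.
SAMPLE_STOPWORDS = {"to", "he", "is", "not", "too", "of", "the"}


def filter_tokens(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    stop_set = set(stop_words)  # set membership is O(1) on average
    return [tok for tok in tokens if tok.lower() not in stop_set]


tokens = ["The", "match", "is", "not", "over"]
print(filter_tokens(tokens, SAMPLE_STOPWORDS))  # ['match', 'over']
```

The original casing of the surviving tokens is preserved; only the comparison is lowercased.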



## Adding or Removing Stop Words in NLTK's Default Stop Word List

In [10]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Adding Stop Words to Default NLTK Stop Word List

To add a word to the NLTK stop word collection, first store the list returned by stopwords.words('english') in a variable. Then use the list's append() method to add any word to it.

The following script adds the word play to the NLTK stop word collection. We then filter the text variable again to check whether play is removed.

In [11]:
all_stopwords = stopwords.words('english')
all_stopwords.append('play')

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'football', ',', 'however', 'fond', 'tennis', '.']


The output shows that the word play has been removed.

You can also add a list of words to the stop word list using the extend() method, as shown below:

In [12]:
sw_list = ['likes','play']
all_stopwords.extend(sw_list)

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'football', ',', 'however', 'fond', 'tennis', '.']


The script above adds two words, likes and play, to the stop word list. Neither word appears in the output.
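Because all_stopwords already contained play from the previous cell, extending it with sw_list stores play twice. That is harmless for filtering, but it is one reason a set can be a more natural container for stop words. A minimal sketch with stand-in data (the `base` and `extra` lists are assumptions, not the NLTK list):

```python
base = ["to", "of", "play"]   # stand-in list that already holds 'play'
extra = ["likes", "play"]

as_list = base + extra            # a list keeps the duplicate 'play'
as_set = set(base) | set(extra)   # a set stores each word once

print(as_list.count("play"))  # 2
print(sorted(as_set))         # ['likes', 'of', 'play', 'to']
```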

### Removing Stop Words from Default NLTK Stop Word List

Since stopwords.words('english') returns a plain Python list, you can remove items from it like from any other list. The simplest way to do so is via the remove() method. This is helpful when your application needs to keep a particular stop word. For example, you may need to keep the word not in a sentence to know when a statement is being negated.

The following script removes the stop word not from the default list of stop words in NLTK:

In [13]:
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')

text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'however', 'not', 'fond', 'tennis', '.']


# Stop Word Removal using Gensim

The Gensim library is another useful option for removing stop words from a string in Python. Import the remove_stopwords() function from the gensim.parsing.preprocessing module and pass it the sentence you want to clean. It returns the text as a string without the stop words.

Let's take a look at a simple example of how to remove stop words via the Gensim library.

In [14]:
from gensim.parsing.preprocessing import remove_stopwords

text = "Nick likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)

print(filtered_sentence)

Nick likes play football, fond tennis.


It is important to mention that the output after removing stop words using the NLTK and Gensim libraries is different. For example, the Gensim library considered the word however to be a stop word while NLTK did not, and hence didn't remove it. This shows that there is no hard and fast rule as to what a stop word is and what it isn't. It all depends upon the task that you are going to perform.
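To make the difference concrete, the two outputs printed earlier can be compared directly. The sketch below copies both results from the cells above and uses a rough regular-expression tokenizer (an assumption, not Gensim's own tokenization) to split the Gensim string:

```python
import re

# Results copied from the NLTK and Gensim cells above.
nltk_tokens = ['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']
gensim_text = "Nick likes play football, fond tennis."

# Split the Gensim string into words and punctuation marks.
gensim_tokens = re.findall(r"\w+|[^\w\s]", gensim_text)

# Tokens that NLTK kept but Gensim dropped.
kept_only_by_nltk = [tok for tok in nltk_tokens if tok not in gensim_tokens]
print(kept_only_by_nltk)  # ['however']
```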

## Adding and Removing Stop Words in Default Gensim Stop Words List

Let's first take a look at the stop words in Python's Gensim library:

In [15]:
import gensim
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)

frozenset({'some', 'becoming', 'have', 'whose', 'never', 'everything', 'eg', 'rather', 'mill', 'together', 'would', 'within', 'is', 'yourselves', 'interest', 'inc', 'via', 'and', 'though', 'behind', 'most', 'who', 'really', 'computer', 'upon', 'give', 'found', 'these', 'among', 'whom', 'own', 'least', 'ever', 'me', 'a', 'well', 'cry', 'be', 'full', 're', 'where', 'that', 'same', 'unless', 'hers', 'regarding', 'for', 'about', 'than', 'fifteen', 'hundred', 'whether', 'she', 'whereafter', 'mine', 'are', 'km', 'even', 'neither', 'whereas', 'your', 'kg', 'to', 'became', 'an', 'latterly', 'sometimes', 'herself', 'us', 'co', 'say', 'ourselves', 'system', 'further', 'we', 'how', 'indeed', 'throughout', 'get', 'you', 'whereupon', 'nothing', 'anywhere', 'see', 'eight', 'i', 'else', 'bill', 'something', 'if', 'perhaps', 'don', 'anyone', 'etc', 'ie', 'might', 'afterwards', 'both', 'front', 'hereafter', 'them', 'various', 'moreover', 'nowhere', 'cant', 'empty', 'done', 'himself', 'amongst', 'amoung

You can see that Gensim's default collection of stop words is larger than NLTK's. Also, Gensim stores its default stop words in a frozenset object.

### Adding Stop Words to Default Gensim Stop Words List

To access the Gensim stop words, you need to import the frozen set STOPWORDS from the gensim.parsing.preprocessing module. A frozen set in Python is an immutable set: you cannot add or remove elements in place. Hence, to add an element, you call the union() method on the frozen set and pass it the set of new stop words. The union() method returns a new set that contains the original stop words plus the newly added ones.

The following script adds likes and play to the list of stop words in Gensim:

In [16]:
from gensim.parsing.preprocessing import STOPWORDS

all_stopwords_gensim = STOPWORDS.union(set(['likes', 'play']))

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]

print(tokens_without_sw)

['Nick', 'football', ',', 'fond', 'tennis', '.']


From the output above, you can see that the words likes and play have been treated as stop words and consequently have been removed from the input sentence.
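The immutability described above can be checked with any frozenset. In this minimal sketch, `base` is a tiny stand-in for Gensim's STOPWORDS (an assumption so the snippet runs without Gensim installed):

```python
base = frozenset({"to", "of", "not"})  # stand-in for gensim's STOPWORDS

extended = base.union({"likes", "play"})  # returns a new frozenset

print("likes" in extended)           # True
print("likes" in base)               # False: the original is untouched
print(hasattr(base, "add"))          # False: frozenset has no add() method
print(type(extended) is frozenset)   # True
```

The `|` operator is equivalent here, so `base | {"likes", "play"}` produces the same result.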

### Removing Stop Words from Default Gensim Stopword List
To remove stop words from Gensim's list of stop words, you have to call the difference() method on the frozen set object, which contains the list of stop words. You need to pass a set of stop words that you want to remove from the frozen set to the difference() method. The difference() method returns a set which contains all the stop words except those passed to the difference() method.

The following script removes the word not from the set of stop words in Gensim:

In [17]:
from gensim.parsing.preprocessing import STOPWORDS

# Words to drop from the default stop word set
sw_list = {"not"}
all_stopwords_gensim = STOPWORDS.difference(sw_list)

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'not', 'fond', 'tennis', '.']


Since the word not has now been removed from the stop word set, you can see that it has not been removed from the input sentence after stop word removal.
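One practical difference from the list-based remove() used with NLTK is that difference() does not complain when a word is absent from the set. A small sketch with stand-in data (`base` and `word_list` are assumptions, not the real Gensim or NLTK collections):

```python
base = frozenset({"to", "of", "not"})  # stand-in for gensim's STOPWORDS

# 'football' is not in base; difference() simply ignores it.
trimmed = base.difference({"not", "football"})
print(sorted(trimmed))  # ['of', 'to']

word_list = ["to", "of", "not"]  # stand-in for an NLTK-style list
try:
    word_list.remove("football")  # list.remove() raises if the item is missing
except ValueError as exc:
    print("ValueError:", exc)
```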

# Stop Word Removal using SpaCy

The SpaCy library is yet another useful Python library for natural language processing.

To install SpaCy, you have to execute the following script on your command terminal:


```
pip install -U spacy
```

Once the library is downloaded, you also need to download the language model. Several models exist in SpaCy for different languages. We will be installing the English language model. Execute the following command in your terminal:



```
python -m spacy download en_core_web_sm
```


Once the language model is downloaded, you can remove stop words from text using SpaCy. Look at the following script:



In [18]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw= [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'fond', 'tennis', '.']


In the script above, we first load the language model and store it in the sp variable. sp.Defaults.stop_words is the set of default stop words for the English language model in SpaCy. Next, we tokenize the input text with NLTK's word_tokenize() and keep only the tokens that are not in SpaCy's stop word set.



## Adding and Removing Stop Words in SpaCy Default Stop Word List

As with the other NLP libraries, you can add or remove stop words from the default stop word list in SpaCy. Before doing so, let's look at the existing stop words:

In [19]:
print(len(all_stopwords))
print(all_stopwords)

326
{'some', 'becoming', 'have', 'whose', 'never', 'everything', 'rather', '‘m', 'together', 'within', 'would', "'ve", 'fifty', 'is', 'yourselves', 'via', 'and', 'though', 'behind', 'most', 'who', 'really', 'upon', 'give', 'these', 'among', 'whom', 'own', 'least', 'ever', 'me', 'a', 'well', 'be', 'full', 're', 'where', 'that', 'same', 'unless', 'hers', 'for', 'regarding', 'about', 'fifteen', 'hundred', 'than', 'whether', 'she', 'whereafter', 'mine', 'are', 'even', "n't", "'s", 'neither', 'whereas', 'your', 'became', 'to', 'an', 'latterly', 'sometimes', 'herself', 'us', 'say', 'ourselves', 'further', 'we', 'how', 'indeed', 'throughout', 'get', 'you', 'whereupon', 'nothing', 'anywhere', 'see', 'eight', 'i', 'else', 'something', '’ve', 'if', 'perhaps', 'anyone', 'might', 'afterwards', 'both', 'front', 'hereafter', 'them', 'various', 'moreover', 'nowhere', 'empty', 'amongst', 'done', 'himself', 'by', 'in', '‘d', 'other', 'here', "'ll", 'each', 'anyway', 'several', 'her', 'third', 'formerly

The output shows that there are 326 stop words in the default list of stop words in the SpaCy library.

### Adding Stop Words to Default SpaCy Stop Words List

The SpaCy stop word list is basically a set of strings. You can add a new word to the set like you would add any new item to a set.

The following script adds the word `tennis` to the existing set of stop words in SpaCy:

In [20]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words
all_stopwords.add("tennis")

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'fond', '.']


The output shows that the word `tennis` has been removed from the input sentence.

You can also add multiple words to the list of stop words in SpaCy as shown below. The following script adds `likes` and `tennis` to the list of stop words in SpaCy:

In [21]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words
all_stopwords |= {"likes","tennis",}

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'play', 'football', ',', 'fond', '.']


### Removing Stop Words from Default SpaCy Stop Words List
To remove a word from the set of stop words in SpaCy, you can pass the word to remove to the `remove` method of the set.

The following script removes the word `not` from the set of stop words in SpaCy:

In [22]:
import spacy
sp = spacy.load('en_core_web_sm')

all_stopwords = sp.Defaults.stop_words
all_stopwords.remove('not')

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]

print(tokens_without_sw)

['Nick', 'play', 'football', ',', 'not', 'fond', '.']


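In the output, you can see that the word not is no longer removed from the input sentence. The words likes and tennis are still missing because the earlier cells added them to the same set object in place, and those changes persist for the rest of the session.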
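Since `add()`, `|=` and `remove()` all modify the set in place, one way to keep experiments from leaking into each other is to work on a copy. The sketch below uses a stand-in set; `DEFAULT_STOPWORDS` is hypothetical, and the assumption is that you would copy `sp.Defaults.stop_words` the same way:

```python
DEFAULT_STOPWORDS = {"to", "of", "not"}  # stand-in for a library's shared default set

custom = set(DEFAULT_STOPWORDS)  # independent copy; edits do not touch the default
custom.add("tennis")
custom.discard("not")            # discard() is a no-op if the word is absent

print(sorted(custom))             # ['of', 'tennis', 'to']
print(sorted(DEFAULT_STOPWORDS))  # ['not', 'of', 'to']
```

Unlike `remove()`, `discard()` does not raise a `KeyError` when the word is missing, which makes the cell safe to run more than once.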

# Using Custom Script to Remove Stop Words

So far, we have seen how to use various libraries to remove stop words from a string in Python. If you want full control over stop word removal, you can write your own script instead.

The first step in this regard is to define a list of words that you want treated as stop words. Let's create a list of some of the most commonly used stop words:

In [23]:
my_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Next, we will define a function that will accept a string as a parameter and will return the sentence without the stop words:

In [24]:
def remove_mystopwords(sentence):
    # Split on single spaces; punctuation stays attached to the neighbouring word
    tokens = sentence.split(" ")
    tokens_filtered = [word for word in tokens if word not in my_stopwords]
    return " ".join(tokens_filtered)

Let's now try to remove stop words from a sample sentence:


In [25]:
text = "Nick likes to play football, however he is not too fond of tennis."
filtered_text = remove_mystopwords(text)
print(filtered_text)

Nick likes play football, however fond tennis.


You can see that the stop words that exist in the my_stopwords list have been removed from the input sentence.

Since my_stopwords is a simple list of strings, you can add or remove words using the list's append() and remove() methods:


```
# To add to list
my_stopwords.append("football")

# To remove from list
my_stopwords.remove("football")
```
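Splitting on spaces leaves punctuation attached to words, so a stop word followed by a comma (for example `not,`) would not match the list, and neither would a capitalized one. The following is a minimal sketch of a more forgiving variant. `remove_stopwords_loose` and `SAMPLE_STOPWORDS` are hypothetical names, and the small set stands in for my_stopwords:

```python
import string

# Stand-in subset of my_stopwords; an assumption for illustration only.
SAMPLE_STOPWORDS = {"to", "he", "is", "not", "too", "of"}


def remove_stopwords_loose(sentence, stop_words):
    """Drop words whose lowercase, punctuation-stripped form is a stop word."""
    kept = []
    for word in sentence.split():
        core = word.strip(string.punctuation).lower()
        if core not in stop_words:
            kept.append(word)
    return " ".join(kept)


result = remove_stopwords_loose("He is not, to be fair, fond of tennis.", SAMPLE_STOPWORDS)
print(result)  # be fair, fond tennis.
```

Note that a dropped word takes its attached punctuation with it, which may or may not be what you want for your task.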



