# Remove Stop Words

In natural language processing, stop words are very common words (such as “the”, “a”, “an”, “in”) that carry little meaning on their own. Search engines are often programmed to ignore them, both when indexing entries for searching and when retrieving them as the result of a search query.

To remove stop words, we import the `stopwords` corpus from the `nltk.corpus` package.

In [14]:
from nltk.corpus import stopwords

Word tokenization is implemented using `word_tokenize`.

In [3]:
from nltk.tokenize import word_tokenize

The following sample text is extracted from The Hindu Editorial page: https://www.thehindu.com/opinion/editorial/failure-of-justice/article25814414.ece

In [15]:
text = "It is unfortunate that the families of the victims do not have the consolation of anyone being brought to justice. While Sohrabuddin’s killing has ‘encounter’ as an explanation, his wife’s disappearance remains a mystery. It was not proved that she was taken to a farm, killed and her body burnt. And it cannot be a coincidence that Prajapati was killed a year later in Rajasthan in another encounter. It was under a cloud of suspicion over the circumstances of their death that Sohrabuddin’s brother had approached the Supreme Court and obtained an order for an investigation, which was subsequently handed over to the CBI. In losing this case, the CBI has shown that it continues to struggle when it comes to handling cases with political overtones. The 2014 discharge of Mr. Shah and the subsequent pre-trial exoneration of senior police officer D.G. Vanzara had come as a boost to the BJP. The final decision in the trial is also likely to be interpreted as a justification for some encounters that took place in Gujarat when Narendra Modi was Chief Minister. Mr. Vanzara has implied as much in controversial tweets. He has also claimed that such ‘pre-emptive encounters’ were needed to save Mr. Modi. This is a tacit acknowledgement that these may not have been chance encounters, as genuine ones are supposed to be, but part of a plan to eliminate a threat to the leader’s life through extrajudicial killings. It is regrettable that such a triumphalist narrative is sought to be built around such incidents."

NLTK provides stop word lists for multiple languages. Here, we use the English list.

In [16]:
stop_words = set(stopwords.words("english"))

Many words carry little significance when it comes to extracting context from the sample text. Words like 'who', 'themselves', 'each', 'no', "haven't", etc. contribute little to the meaning on their own.

In [6]:
print(stop_words)

{'who', 'themselves', 'each', 'no', "haven't", 'these', 'yourselves', "doesn't", 'it', 'yours', 'out', 'off', 'against', 'll', 'there', 'very', 'most', 'during', 'you', 'here', 'than', 'before', 'from', "it's", 'she', 'needn', "didn't", 'while', 'but', "you'd", 'same', 'any', 'him', 'shan', 'down', 'them', 'd', 'myself', 'are', 'once', 'of', 'if', 'more', 'ourselves', 'above', 'won', 'whom', "shouldn't", 'what', 'for', 'below', 'will', 'as', 'after', "isn't", 'over', 'shouldn', 'does', 'hasn', 'on', 'where', 's', 'other', 'again', "don't", "you're", "you've", 'itself', "she's", 'an', 'up', 'few', 'i', 'not', 'isn', 'hers', 'mightn', 'your', 'a', 'just', 'we', "should've", 'and', 'then', 'they', 'wasn', "won't", 'hadn', 'to', 'were', 'his', 'those', "wouldn't", 'under', "needn't", 'do', 'with', 'some', 'now', 'did', "shan't", "mightn't", "you'll", 'theirs', 'only', 're', 'nor', 'all', 'himself', 'is', 'both', 'our', "aren't", 'doesn', "hadn't", 'have', 'was', "that'll", 'y', 'wouldn', '
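Whether a word is insignificant depends on the task. Negations such as 'no' and 'not' appear in the list above, yet they can change the meaning of a sentence. The following is a minimal sketch of adjusting a stop word set with ordinary set operations. It uses a small hand-written subset instead of the full NLTK list so that it runs on its own; the names `sample_stop_words`, `custom_stop_words`, and the added words are illustrative assumptions, not part of NLTK.

```python
# A small hand-picked subset of the English stop words printed above,
# used here so the example runs without downloading NLTK data.
sample_stop_words = {"it", "is", "no", "not", "a", "of"}

# Keep negations by removing them from the stop word set (set difference).
custom_stop_words = sample_stop_words - {"no", "not"}

# Add words that carry little meaning in a particular corpus (hypothetical choices).
custom_stop_words |= {"mr.", "said"}

tokens = ["it", "is", "not", "a", "coincidence"]
kept = [t for t in tokens if t not in custom_stop_words]
print(kept)  # ['not', 'coincidence']
```

The same operations should apply unchanged to the `stop_words` set built from `stopwords.words("english")`, since it is a plain Python `set`.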

## Filtering the Text from Stop Words

In [17]:
words = word_tokenize(text)

Create a new list to hold the words of contextual relevance. That is, (original text - stop words) = words of contextual relevance.

In [18]:
filtered_sentence = []

Loop over the tokens: if `w` is not in the `stop_words` set, append it to the `filtered_sentence` list; otherwise, skip it.

In [19]:
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)

In [20]:
print(filtered_sentence)

['It', 'unfortunate', 'families', 'victims', 'consolation', 'anyone', 'brought', 'justice', '.', 'While', 'Sohrabuddin', '’', 'killing', '‘', 'encounter', '’', 'explanation', ',', 'wife', '’', 'disappearance', 'remains', 'mystery', '.', 'It', 'proved', 'taken', 'farm', ',', 'killed', 'body', 'burnt', '.', 'And', 'coincidence', 'Prajapati', 'killed', 'year', 'later', 'Rajasthan', 'another', 'encounter', '.', 'It', 'cloud', 'suspicion', 'circumstances', 'death', 'Sohrabuddin', '’', 'brother', 'approached', 'Supreme', 'Court', 'obtained', 'order', 'investigation', ',', 'subsequently', 'handed', 'CBI', '.', 'In', 'losing', 'case', ',', 'CBI', 'shown', 'continues', 'struggle', 'comes', 'handling', 'cases', 'political', 'overtones', '.', 'The', '2014', 'discharge', 'Mr.', 'Shah', 'subsequent', 'pre-trial', 'exoneration', 'senior', 'police', 'officer', 'D.G', '.', 'Vanzara', 'come', 'boost', 'BJP', '.', 'The', 'final', 'decision', 'trial', 'also', 'likely', 'interpreted', 'justification', 'en

Now, compare the above output with the original text:

"It is unfortunate that the families of the victims do not have the consolation of anyone being brought to justice. While Sohrabuddin’s killing has ‘encounter’ as an explanation, his wife’s disappearance remains a mystery. It was not proved that she was taken to a farm, killed and her body burnt. And it cannot be a coincidence that Prajapati was killed a year later in Rajasthan in another encounter. It was under a cloud of suspicion over the circumstances of their death that Sohrabuddin’s brother had approached the Supreme Court and obtained an order for an investigation, which was subsequently handed over to the CBI. In losing this case, the CBI has shown that it continues to struggle when it comes to handling cases with political overtones. The 2014 discharge of Mr. Shah and the subsequent pre-trial exoneration of senior police officer D.G. Vanzara had come as a boost to the BJP. The final decision in the trial is also likely to be interpreted as a justification for some encounters that took place in Gujarat when Narendra Modi was Chief Minister. Mr. Vanzara has implied as much in controversial tweets. He has also claimed that such ‘pre-emptive encounters’ were needed to save Mr. Modi. This is a tacit acknowledgement that these may not have been chance encounters, as genuine ones are supposed to be, but part of a plan to eliminate a threat to the leader’s life through extrajudicial killings. It is regrettable that such a triumphalist narrative is sought to be built around such incidents."

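It can be observed that the lowercase stop words have been removed. Capitalized forms such as 'It', 'And', and 'The' remain because the membership test is case-sensitive and the NLTK list is lowercase. Punctuation tokens also remain, since they are not stop words.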
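One common way to also catch sentence-initial stop words such as 'It' and 'The' is to lowercase each token only for the membership test, which leaves the original casing of the kept words intact. The sketch below illustrates the difference on the first sentence of the sample text. It assumes a small hand-written stop word set (`sample_stop_words`) in place of the full NLTK list so that it runs standalone.

```python
# Hand-written subset of the NLTK English stop words (all lowercase, like the real list).
sample_stop_words = {"it", "is", "that", "the", "of", "do", "not", "have"}

tokens = ["It", "is", "unfortunate", "that", "the", "families", "of", "the", "victims"]

# Case-sensitive test: 'It' does not match 'it', so it is kept.
case_sensitive = [w for w in tokens if w not in sample_stop_words]

# Case-insensitive test: compare the lowercased token, keep the original token.
case_insensitive = [w for w in tokens if w.lower() not in sample_stop_words]

print(case_sensitive)    # ['It', 'unfortunate', 'families', 'victims']
print(case_insensitive)  # ['unfortunate', 'families', 'victims']
```

With the notebook's own variables, the equivalent change would be `w.lower() not in stop_words` in the loop condition.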

### One-Line Code to Filter Out the Stop Words

In [12]:
filtered_sentence = [w for w in words if w not in stop_words]

In [13]:
print(filtered_sentence)

['It', 'unfortunate', 'families', 'victims', 'consolation', 'anyone', 'brought', 'justice', '.', 'While', 'Sohrabuddin', '’', 'killing', '‘', 'encounter', '’', 'explanation', ',', 'wife', '’', 'disappearance', 'remains', 'mystery', '.', 'It', 'proved', 'taken', 'farm', ',', 'killed', 'body', 'burnt', '.', 'And', 'coincidence', 'Prajapati', 'killed', 'year', 'later', 'Rajasthan', 'another', 'encounter', '.', 'It', 'cloud', 'suspicion', 'circumstances', 'death', 'Sohrabuddin', '’', 'brother', 'approached', 'Supreme', 'Court', 'obtained', 'order', 'investigation', ',', 'subsequently', 'handed', 'CBI', '.', 'In', 'losing', 'case', ',', 'CBI', 'shown', 'continues', 'struggle', 'comes', 'handling', 'cases', 'political', 'overtones', '.', 'The', '2014', 'discharge', 'Mr.', 'Shah', 'subsequent', 'pre-trial', 'exoneration', 'senior', 'police', 'officer', 'D.G', '.', 'Vanzara', 'come', 'boost', 'BJP', '.', 'The', 'final', 'decision', 'trial', 'also', 'likely', 'interpreted', 'justification', 'en

The list comprehension above gives the same output as the loop.
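Both versions leave punctuation tokens such as '.', ',' and '’' in the result, because `word_tokenize` emits them as separate tokens and they are not in the stop word list. If they are unwanted, one possible approach is to keep only tokens that contain at least one alphanumeric character. This is a sketch under that assumption; the helper name `drop_punctuation` is hypothetical, and the token list is a short excerpt typed by hand from the output above.

```python
def drop_punctuation(tokens):
    """Keep tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


# A few tokens copied from the filtered output above.
tokens = ["Sohrabuddin", "’", "killing", "‘", "encounter", "’",
          "explanation", ",", "Mr.", "pre-trial", "."]

cleaned = drop_punctuation(tokens)
print(cleaned)
# ['Sohrabuddin', 'killing', 'encounter', 'explanation', 'Mr.', 'pre-trial']
```

Checking for an alphanumeric character, rather than comparing against `string.punctuation`, also removes the curly quotes in this text while keeping tokens like 'Mr.' and 'pre-trial'.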