## Stop words (low-information words)

Stop words are the most common words in a natural language. When analyzing text data and building NLP models, they often add little to the meaning of a document. Typical examples in English are “the”, “is”, “in”, “for”, “where”, “when”, “to”, and “at”.

In [1]:
# Requires the NLTK stop-word corpus; if it is missing, run nltk.download("stopwords") once
from nltk.corpus import stopwords

#### check the stop words in the English language

In [2]:
stop_words = stopwords.words("english")
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
type(stop_words)

list
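Because `stop_words` is a plain list, every `in` check scans it from the start. A minimal sketch of converting it to a set for faster lookups follows; the short list here is a hand-picked stand-in (an assumption) so the snippet runs without the NLTK corpus.

```python
# Hand-picked stand-in for stopwords.words("english") (assumption, not the full NLTK list)
stop_words = ["i", "me", "my", "is", "the", "some", "off", "this"]

# A set gives average O(1) membership tests, versus O(n) for a list
stop_word_set = set(stop_words)

print("the" in stop_word_set)     # True
print("sample" in stop_word_set)  # False
```

With the real list the same one-line conversion applies, and the filtering loops only need to test against the set instead of the list.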

#### removing the stopwords

In [4]:
from nltk.tokenize import word_tokenize

In [9]:
text = "This is some sample text, showing off the stop words filtration"
word_tokens = word_tokenize(text)

new_list = []

for val in word_tokens:
    if val not in stop_words:
        new_list.append(val)
        
new_list

['This', 'sample', 'text', ',', 'showing', 'stop', 'words', 'filtration']
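Note that `'This'` survives in the output above: the NLTK list is all lowercase and the comparison is case-sensitive. One way to handle this, sketched below, is to lowercase each token only for the comparison. The token list is written out by hand and `stop_words` is a small stand-in subset (both assumptions, so the snippet runs without NLTK data).

```python
# Small stand-in for stopwords.words("english") (assumption)
stop_words = {"this", "is", "some", "off", "the"}

# Tokens for the sample sentence, written out by hand (assumption)
word_tokens = ["This", "is", "some", "sample", "text", ",",
               "showing", "off", "the", "stop", "words", "filtration"]

# Compare the lowercased token, but keep the original casing in the result
filtered = [tok for tok in word_tokens if tok.lower() not in stop_words]

print(filtered)
# ['sample', 'text', ',', 'showing', 'stop', 'words', 'filtration']
```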

#### removing the stopwords and the punctuation

In [3]:
import string
punc = string.punctuation

In [7]:
text = "This is some sample text, showing off the stop words filtration"
word_tokens = word_tokenize(text)

new_list = []

for val in word_tokens:
    if val not in stop_words and val not in punc:
        new_list.append(val)
        
new_list

['This', 'sample', 'text', 'showing', 'stop', 'words', 'filtration']
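Since `punc` is a string, `val not in punc` is a substring test. It works for single-character tokens such as `,`, but a multi-character token such as `...` is not a substring of `string.punctuation` and would be kept. A sketch of a stricter check that drops any token made up entirely of punctuation characters; the helper name `is_punctuation` is hypothetical.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

tokens = ["text", ",", "...", "stop-words", "!"]
kept = [tok for tok in tokens if not is_punctuation(tok)]

print(kept)  # ['text', 'stop-words']
```

Tokens that merely contain punctuation, like `stop-words`, are left alone.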

In [5]:
def remove_punc_and_sw(txt):
    txt_no_punc = "".join([val for val in txt if val not in string.punctuation])
    tokens = word_tokenize(txt_no_punc.lower())
    txt_no_sw = " ".join([val for val in tokens if val not in stop_words])
    return txt_no_sw

##### 3 steps:

1. remove the punctuation characters and join the remaining characters back into a string.
2. convert the text to lowercase and tokenize it into words.
3. remove the stop words and join the remaining tokens with spaces.

In [6]:
text = "This is some sample text, showing off the stop words filtration."

remove_punc_and_sw(text)

'sample text showing stop words filtration'
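One caveat with this order of steps: deleting all punctuation first turns a contraction such as `you're` into `youre`, which no longer matches the `"you're"` entry in the stop-word list. A possible alternative, sketched below, strips punctuation only from the edges of each token. It uses `str.split` instead of `word_tokenize` and a small stand-in stop-word set (both assumptions), and the function name is hypothetical.

```python
import string

# Small stand-in for stopwords.words("english") (assumption)
stop_words = {"you're", "to", "the", "is"}

def remove_sw_keep_contractions(txt):
    """Strip punctuation only at token edges so contractions still match the stop list."""
    kept = []
    for raw in txt.lower().split():
        # Inner apostrophes survive; leading/trailing punctuation is dropped
        token = raw.strip(string.punctuation)
        if token and token not in stop_words:
            kept.append(token)
    return " ".join(kept)

print(remove_sw_keep_contractions("You're going to like the sample text."))
# going like sample text
```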

#### Observation:

A stop word can be thought of as a word that occurs with high frequency in a corpus. Words such as articles and some verbs are usually treated as stop words because they do little to convey the context or meaning of a sentence. For many tasks, they can be removed with little or no harm to the model you are training.

> **Why remove stop words?**

Removing stop words shrinks the dataset, which typically speeds up training: with fewer tokens to process, training time should decrease.
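To get a rough sense of the saving, one can compare token counts before and after filtering. The sketch below uses hand-written tokens and a small stand-in stop-word set (both assumptions); real corpora will give different ratios.

```python
# Small stand-in for stopwords.words("english") (assumption)
stop_words = {"this", "is", "some", "off", "the"}

# Lowercased tokens of the sample sentence, written out by hand (assumption)
tokens = ["this", "is", "some", "sample", "text", "showing",
          "off", "the", "stop", "words", "filtration"]

filtered = [tok for tok in tokens if tok not in stop_words]

# Share of tokens that were dropped by the filter
removed_share = 1 - len(filtered) / len(tokens)

print(f"{len(tokens)} -> {len(filtered)} tokens ({removed_share:.0%} removed)")
# 11 -> 6 tokens (45% removed)
```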

**What's the main point?**

Word importance varies with the dataset, and also with the goal you are trying to achieve. Problems like sentiment analysis are much more sensitive to stop-word removal than document classification.

In [8]:
# an example
example = "I told you that she was not happy"

# remove stop words
remove_punc_and_sw(example)

'told happy'

In the example above, the output `told happy` reads as positive, while the original sentence, `I told you that she was not happy`, is negative.

For sentiment analysis, the filtered sentence would come across as positive, which is the opposite of what was actually said. Removing stop words is also a problem when the goal is to generate text or to work with search engines, where these tokens are needed.
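One common workaround for sentiment tasks is to keep negation words out of the stop-word list. The sketch below shows the idea; the stand-in stop-word set, the choice of negation words, and the use of `str.split` in place of `word_tokenize` are all assumptions made so the snippet runs without NLTK data.

```python
# Small stand-in for stopwords.words("english") (assumption)
stop_words = {"i", "you", "that", "she", "was", "not", "no", "nor"}

# Words that flip sentiment; this particular selection is an assumption
negations = {"not", "no", "nor"}

# Stop-word set with the negations taken out
sentiment_stop_words = stop_words - negations

example = "I told you that she was not happy"
kept = [w for w in example.lower().split() if w not in sentiment_stop_words]

print(" ".join(kept))  # told not happy
```

The negation now survives filtering, so the result no longer reads as positive.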

> **So, when should I remove stop words?**

You should remove these tokens only if they don’t add any new information for your problem. Classification problems normally don’t need stop words, because the general topic of a text can still be identified after they are removed.