
### **NLP BASICS**

# Tokenization

**Document.**

The examples below use the text narrated by Steve Jobs in Apple’s “Think Different” commercial.

In [1]:
text = """Here’s to the crazy ones, the misfits, the rebels, the troublemakers,
the round pegs in the square holes. The ones who see things differently — they’re not fond of
rules. You can quote them, disagree with them, glorify
or vilify them, but the only thing you can’t do is ignore them because they
change things. They push the human race forward, and while some may see them
as the crazy ones, we see genius, because the ones who are crazy enough to think
that they can change the world, are the ones who do."""

## 1. Simple tokenization with `.split`

In [2]:
# Word tokenization
text.split()

['Here’s',
 'to',
 'the',
 'crazy',
 'ones,',
 'the',
 'misfits,',
 'the',
 'rebels,',
 'the',
 'troublemakers,',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes.',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 '—',
 'they’re',
 'not',
 'fond',
 'of',
 'rules.',
 'You',
 'can',
 'quote',
 'them,',
 'disagree',
 'with',
 'them,',
 'glorify',
 'or',
 'vilify',
 'them,',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can’t',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things.',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward,',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones,',
 'we',
 'see',
 'genius,',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world,',
 'are',
 'the',
 'ones',
 'who',
 'do.']

In [3]:
# Sentence tokenization
text.split('.')

['Here’s to the crazy ones, the misfits, the rebels, the troublemakers, \nthe round pegs in the square holes',
 ' The ones who see things differently — they’re not fond of \nrules',
 ' You can quote them, disagree with them, glorify\nor vilify them, but the only thing you can’t do is ignore them because they\nchange things',
 ' They push the human race forward, and while some may see them\nas the crazy ones, we see genius, because the ones who are crazy enough to think\nthat they can change the world, are the ones who do',
 '']
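Splitting on `'.'` drops the periods and leaves a trailing empty string, as the output above shows. The following is a minimal standard-library sketch of one way to tidy this, assuming every sentence ends in `.`, `!` or `?` followed by whitespace. The `sample` string (a shortened excerpt of the text) and the `naive_sentences` helper are illustrative names, not part of the original notebook.

```python
import re

# Shortened excerpt of the text above, used only for illustration.
sample = """Here’s to the crazy ones, the misfits, the rebels, the troublemakers,
the round pegs in the square holes. They push the human race forward."""

def naive_sentences(raw):
    """Split after '.', '!' or '?' followed by whitespace, keeping the punctuation."""
    # The lookbehind keeps the terminator attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", raw.strip())
    # Collapse internal line breaks and drop any empty pieces.
    return [" ".join(part.split()) for part in parts if part]

for sentence in naive_sentences(sample):
    print(sentence)
```

This rule is still naive: an abbreviation such as "Mr. Smith" would be split in the middle, which is one reason to prefer a dedicated sentence tokenizer.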

## 2. Tokenization with **NLTK**

NLTK stands for Natural Language Toolkit. It is a suite of Python libraries and programs for symbolic and statistical natural language processing, aimed mainly at English.

NLTK contains a module called ***tokenize*** with a ***word_tokenize()*** function that splits a text into word tokens and a ***sent_tokenize()*** function that splits it into sentences. Once you have installed NLTK, you can tokenize text with the following code.

In [4]:
#Installing nltk
!pip install nltk



In [5]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
# Word tokenizer: split the text into words with NLTK's word_tokenize.
word_tokenize(text)

['Here',
 '’',
 's',
 'to',
 'the',
 'crazy',
 'ones',
 ',',
 'the',
 'misfits',
 ',',
 'the',
 'rebels',
 ',',
 'the',
 'troublemakers',
 ',',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes',
 '.',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 '—',
 'they',
 '’',
 're',
 'not',
 'fond',
 'of',
 'rules',
 '.',
 'You',
 'can',
 'quote',
 'them',
 ',',
 'disagree',
 'with',
 'them',
 ',',
 'glorify',
 'or',
 'vilify',
 'them',
 ',',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can',
 '’',
 't',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things',
 '.',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward',
 ',',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones',
 ',',
 'we',
 'see',
 'genius',
 ',',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world',
 ',',
 'are',
 'the',
 'ones',
 'who',
 'do',
 '.']
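Unlike `.split()`, the output above separates punctuation from the words it is attached to. A rough standard-library approximation of that behaviour is sketched below. It is not NLTK's actual algorithm (NLTK uses a Treebank-style tokenizer with many more rules), and `split_words_and_punct` is a hypothetical helper name.

```python
import re

# Short excerpt of the text above, used only for illustration.
sample = "Here’s to the crazy ones, the misfits."

def split_words_and_punct(raw):
    """Return word runs and single punctuation characters as separate tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", raw)

print(split_words_and_punct(sample))
```

On this excerpt the regular expression yields the same tokens as the first entries of the NLTK output, including the curly apostrophe as a token of its own.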

In [7]:
# Sentence tokenizer: split the text into sentences with NLTK's sent_tokenize.
sent_tokenize(text)

['Here’s to the crazy ones, the misfits, the rebels, the troublemakers, \nthe round pegs in the square holes.',
 'The ones who see things differently — they’re not fond of \nrules.',
 'You can quote them, disagree with them, glorify\nor vilify them, but the only thing you can’t do is ignore them because they\nchange things.',
 'They push the human race forward, and while some may see them\nas the crazy ones, we see genius, because the ones who are crazy enough to think\nthat they can change the world, are the ones who do.']

## 3. Tokenize text in different languages with **spaCy**

When you need to tokenize text written in a language other than English, you can use spaCy. This is a library for advanced natural language processing, written in Python and Cython, that supports tokenization for more than 65 languages.

Let’s tokenize the same Steve Jobs text, this time translated into Spanish.

Note that spaCy treats each punctuation mark, and each line break, as a separate token.

In [8]:
from spacy.lang.es import Spanish

spac = Spanish()

In [9]:
text_spanish = """Por los locos. Los marginados. Los rebeldes. Los problematicos.
Los inadaptados. Los que ven las cosas de una manera distinta. A los que no les gustan
las reglas. Y a los que no respetan el “status quo”. Puedes citarlos, discrepar de ellos,
ensalzarlos o vilipendiarlos. Pero lo que no puedes hacer es ignorarlos… Porque ellos
cambian las cosas, empujan hacia adelante la raza humana y, aunque algunos puedan
considerarlos locos, nosotros vemos en ellos a genios. Porque las personas que están
lo bastante locas como para creer que pueden cambiar el mundo, son las que lo logran."""

In [10]:
doc = spac(text_spanish)

In [11]:
doc

Por los locos. Los marginados. Los rebeldes. Los problematicos. 
Los inadaptados. Los que ven las cosas de una manera distinta. A los que no les gustan
las reglas. Y a los que no respetan el “status quo”. Puedes citarlos, discrepar de ellos,
ensalzarlos o vilipendiarlos. Pero lo que no puedes hacer es ignorarlos… Porque ellos
cambian las cosas, empujan hacia adelante la raza humana y, aunque algunos puedan
considerarlos locos, nosotros vemos en ellos a genios. Porque las personas que están
lo bastante locas como para creer que pueden cambiar el mundo, son las que lo logran.

In [12]:
tokens = [token.text for token in doc]
print(tokens)

['Por', 'los', 'locos', '.', 'Los', 'marginados', '.', 'Los', 'rebeldes', '.', 'Los', 'problematicos', '.', '\n', 'Los', 'inadaptados', '.', 'Los', 'que', 'ven', 'las', 'cosas', 'de', 'una', 'manera', 'distinta', '.', 'A', 'los', 'que', 'no', 'les', 'gustan', '\n', 'las', 'reglas', '.', 'Y', 'a', 'los', 'que', 'no', 'respetan', 'el', '“', 'status', 'quo', '”', '.', 'Puedes', 'citarlos', ',', 'discrepar', 'de', 'ellos', ',', '\n', 'ensalzarlos', 'o', 'vilipendiarlos', '.', 'Pero', 'lo', 'que', 'no', 'puedes', 'hacer', 'es', 'ignorarlos', '…', 'Porque', 'ellos', '\n', 'cambian', 'las', 'cosas', ',', 'empujan', 'hacia', 'adelante', 'la', 'raza', 'humana', 'y', ',', 'aunque', 'algunos', 'puedan', '\n', 'considerarlos', 'locos', ',', 'nosotros', 'vemos', 'en', 'ellos', 'a', 'genios', '.', 'Porque', 'las', 'personas', 'que', 'están', '\n', 'lo', 'bastante', 'locas', 'como', 'para', 'creer', 'que', 'pueden', 'cambiar', 'el', 'mundo', ',', 'son', 'las', 'que', 'lo', 'logran', '.']
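The token list above also contains `'\n'` entries for the line breaks in the source text. If those are not wanted, they can be filtered out. The sketch below works on a hand-copied excerpt of the printed list so that it runs without spaCy. When working with a spaCy `Doc` directly, the `Token.is_space` attribute serves the same purpose.

```python
# Excerpt of the token list printed above (hand-copied for illustration).
tokens_excerpt = ['Los', 'problematicos', '.', '\n', 'Los', 'inadaptados', '.']

# Keep every token that is not made up purely of whitespace.
cleaned = [tok for tok in tokens_excerpt if not tok.isspace()]

print(cleaned)
```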


In this case, we imported `Spanish` from `spacy.lang.es`. If you’re working with English text, import `English` from `spacy.lang.en` instead. [Check the list of languages available here.](https://spacy.io/usage/models)

## 4. Tokenization with **Gensim**

Gensim (short for “Generate Similar”) is a library for unsupervised topic modeling and natural language processing that also includes a tokenizer. Once you have installed Gensim, tokenizing text is as simple as the following code.

Gensim’s tokenizer is strict about punctuation: it splits wherever a punctuation mark occurs and drops the mark itself, so “Here’s” becomes `'Here'` and `'s'`.

In [13]:
from gensim.utils import tokenize

In [14]:
#word tokenization
print(list(tokenize(text)))

['Here', 's', 'to', 'the', 'crazy', 'ones', 'the', 'misfits', 'the', 'rebels', 'the', 'troublemakers', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', 'The', 'ones', 'who', 'see', 'things', 'differently', 'they', 're', 'not', 'fond', 'of', 'rules', 'You', 'can', 'quote', 'them', 'disagree', 'with', 'them', 'glorify', 'or', 'vilify', 'them', 'but', 'the', 'only', 'thing', 'you', 'can', 't', 'do', 'is', 'ignore', 'them', 'because', 'they', 'change', 'things', 'They', 'push', 'the', 'human', 'race', 'forward', 'and', 'while', 'some', 'may', 'see', 'them', 'as', 'the', 'crazy', 'ones', 'we', 'see', 'genius', 'because', 'the', 'ones', 'who', 'are', 'crazy', 'enough', 'to', 'think', 'that', 'they', 'can', 'change', 'the', 'world', 'are', 'the', 'ones', 'who', 'do']


# **Stop Words**

Stop words are a set of very common words in a language. Examples in English are “a”, “the”, “is”, and “are”. In text mining and NLP, stop words are often removed because they occur so frequently that they carry very little useful information.


<br>

*Why do we remove stop words?*

Removing them strips low-information words from the text so that the model can focus on the words that matter. Stop words rarely contribute significant information to a model.

<br>

*Do we always remove stop words?*

Not always. It depends heavily on the use case. For example, tasks like text classification generally do not need stop words, because the other words in the dataset are more important and convey the general idea of the text. So we generally remove stop words in such tasks.

However, in tasks like sentiment analysis, you might want to keep the stop words, as the following example shows.

**Movie review:** *“The movie was not good at all.”*

**Text after removal of stop words:** *“movie good”*

We can clearly see that the review for the movie was negative. However, after the removal of stop words, the review became positive, which is not the reality. Thus, the removal of stop words can be problematic here.
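The effect described above can be reproduced in a few lines. This is a minimal sketch with a deliberately tiny, hypothetical stop word set. The `negations` set and the `remove_stop_words` helper are assumptions made for illustration, not part of any library.

```python
import string

# Hypothetical, deliberately tiny stop word set for illustration only.
stop_words = {"the", "was", "not", "at", "all"}
# Negation words we choose to keep for sentiment tasks (an assumption of this sketch).
negations = {"not", "no", "never"}

def remove_stop_words(sentence, stop_set):
    """Drop every word found in stop_set, ignoring case and surrounding punctuation."""
    # Strip surrounding ASCII punctuation so that "all." matches "all".
    tokens = [word.strip(string.punctuation) for word in sentence.split()]
    return " ".join(tok for tok in tokens if tok and tok.lower() not in stop_set)

review = "The movie was not good at all."
print(remove_stop_words(review, stop_words))              # movie good
print(remove_stop_words(review, stop_words - negations))  # movie not good
```

Taking the negations out of the stop word set before filtering keeps the word that carries the sentiment.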

## Removing Stop words with Natural Language Toolkit (NLTK)

In [15]:
#Downloading and importing stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
# Stop words in English
sw_nltk = stopwords.words('english')
print(sw_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [17]:
#Number of the stopwords
print(len(sw_nltk))

179


In [18]:
text2 = "Here's to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes."

Apply the stop word list to `text2`:

In [19]:
words = [word for word in text2.split() if word.lower() not in sw_nltk]

new_text = " ".join(words)

In [20]:
print(new_text)
print("Old length: ", len(text2))
print("New length: ", len(new_text))

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length:  105
New length:  75


## Removing Stop words with spaCy

In [21]:
import spacy

#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words
print(sw_spacy)

{'whereby', 'seem', 'either', 'make', 'somewhere', 'name', 'herein', 'them', 'mine', 'been', 'onto', 'beforehand', 'four', 'nor', 'at', 'became', 'were', 'so', 'under', 'almost', 'eleven', 'noone', 'done', 'five', 'seeming', '‘re', 'ours', 'moreover', 'my', 'quite', 'whatever', '’d', 'next', 'cannot', 'i', 'becoming', 'thereupon', 'ourselves', 'until', 'in', 'had', 'during', 'between', 'am', 'whither', 'does', 'own', 'into', 'are', 'through', 'therefore', 'beyond', 'indeed', 'all', 'anything', 'among', 'made', 'their', 'perhaps', 'your', 'already', 'six', 'down', 'only', 'then', 'hereupon', 'this', 'we', "'s", 'rather', 'here', 'there', 'how', 'formerly', 'its', 'just', 'whole', 'nobody', 'was', 'together', '’ll', 'no', 'nothing', 'eight', 'amongst', 'using', "'ll", 'but', 'off', 'by', 'itself', 'one', 'above', 'whereafter', 'fifty', 'used', 'really', 'few', 'whenever', 'part', 'therein', 'latter', 'sixty', 'or', 'and', 'throughout', 'has', 'same', 'whom', 'while', 'below', 'more', 'mu

In [22]:
print(len(sw_spacy))

326


In [23]:
words = [word for word in text2.split() if word.lower() not in sw_spacy]

new_text = " ".join(words)

In [24]:
print(new_text)
print("Old length: ", len(text2))
print("New length: ", len(new_text))

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length:  105
New length:  75


NLTK and spaCy produce the same result in this case, but their stop word lists differ (179 versus 326 entries), so the results may differ on other texts.
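To make the difference concrete, here is a small sketch in which the two lists disagree. The two sets are short hand-picked excerpts standing in for the real lists, and the example sentence is made up. With the libraries installed you would use `sw_nltk` and `sw_spacy` directly.

```python
# Hand-picked excerpts standing in for the real lists (the full lists are far longer).
nltk_excerpt = {"a", "the", "to", "in", "only"}
spacy_excerpt = {"a", "the", "to", "in", "only", "make", "four", "quite"}

def filter_words(sentence, stop_set):
    """Remove every whitespace-separated word that appears in stop_set."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_set)

sentence = "The four rebels make quite a change"

print(filter_words(sentence, nltk_excerpt))   # four rebels make quite change
print(filter_words(sentence, spacy_excerpt))  # rebels change

# Words that only the larger excerpt treats as stop words.
print(sorted(spacy_excerpt - nltk_excerpt))   # ['four', 'make', 'quite']
```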

## Removing Stop words with Gensim

In [25]:
import gensim

#importing stopwords from gensim
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS

In [26]:
print(STOPWORDS)

frozenset({'whereby', 'seem', 'either', 'eg', 'somewhere', 'make', 'cant', 'name', 'couldnt', 'herein', 'them', 'cry', 'mine', 'system', 'been', 'onto', 'beforehand', 'four', 'nor', 'at', 'became', 'were', 'so', 'under', 'almost', 'eleven', 'noone', 'done', 'five', 'seeming', 'ours', 'moreover', 'my', 'quite', 'don', 'whatever', 'next', 'cannot', 'i', 'con', 'becoming', 'thereupon', 'ourselves', 'until', 'in', 'fill', 'had', 'during', 'between', 'am', 'whither', 'does', 'own', 'found', 'are', 'into', 'through', 'km', 'therefore', 'beyond', 'indeed', 'all', 'anything', 'among', 'de', 'made', 'their', 'perhaps', 'your', 'already', 'six', 'down', 'only', 'then', 'hereupon', 'this', 'we', 'ie', 'rather', 'here', 'there', 'how', 'formerly', 'its', 'just', 'whole', 'was', 'nobody', 'together', 'ltd', 'no', 'nothing', 'eight', 'amongst', 'using', 'but', 'off', 'by', 'itself', 'one', 'above', 'whereafter', 'fifty', 'doesn', 'used', 'really', 'few', 'whenever', 'part', 'therein', 'latter', 'six

In [27]:
print(len(STOPWORDS))

337


In [28]:
new_text = remove_stopwords(text2)
print(new_text)
print("Old length: ", len(text2))
print("New length: ", len(new_text))

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length:  105
New length:  75


## Removing Stop words with Scikit-Learn

In [29]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)

frozenset({'whereby', 'seem', 'either', 'eg', 'somewhere', 'cant', 'name', 'couldnt', 'herein', 'cry', 'them', 'mine', 'system', 'been', 'onto', 'beforehand', 'four', 'nor', 'at', 'became', 'were', 'so', 'under', 'almost', 'eleven', 'noone', 'done', 'five', 'seeming', 'ours', 'moreover', 'my', 'whatever', 'next', 'cannot', 'i', 'con', 'becoming', 'thereupon', 'ourselves', 'until', 'in', 'fill', 'had', 'during', 'between', 'am', 'whither', 'own', 'found', 'are', 'into', 'through', 'therefore', 'beyond', 'indeed', 'all', 'anything', 'among', 'de', 'made', 'their', 'perhaps', 'your', 'already', 'six', 'down', 'only', 'then', 'hereupon', 'this', 'we', 'ie', 'rather', 'here', 'there', 'how', 'formerly', 'its', 'whole', 'nobody', 'was', 'together', 'ltd', 'no', 'nothing', 'eight', 'amongst', 'but', 'off', 'by', 'itself', 'one', 'above', 'whereafter', 'fifty', 'few', 'whenever', 'part', 'therein', 'latter', 'sixty', 'or', 'and', 'throughout', 'has', 'same', 'whom', 'while', 'below', 'more', '

In [30]:
print(len(ENGLISH_STOP_WORDS))

318


In [31]:
words = [word for word in text2.split() if word.lower() not in ENGLISH_STOP_WORDS]

new_text = " ".join(words)

In [32]:
print(new_text)
print("Old length: ", len(text2))
print("New length: ", len(new_text))

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length:  105
New length:  75


## Adding custom Stop Words

You can also add your own stop words to the lists these libraries provide.

Here is the code to add some custom stop words to NLTK’s stop word list:

In [33]:
# Note: 'me' is already in NLTK's list, so extend() adds it a second time
sw_nltk.extend(['first', 'second', 'third', 'me'])
print(len(sw_nltk))

183
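The count rises by four even though `'me'` was already in NLTK's list, because a Python list accepts duplicates. If unique entries matter, a set union is one option. The sketch below uses a four-word stand-in for the real list, so its counts are illustrative only.

```python
# Stand-in for the first few entries of NLTK's list (the real list has 179 entries).
base_stop_words = ["i", "me", "my", "myself"]
custom = ["first", "second", "third", "me"]

# Concatenating lists (like list.extend) keeps duplicates: 'me' appears twice.
extended = base_stop_words + custom
print(len(extended))  # 8

# A set union keeps each word exactly once.
merged = set(base_stop_words) | set(custom)
print(len(merged))    # 7
```

A set also makes the `word.lower() not in ...` membership test faster than a list on long texts.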


## Removing Stop Words

You can also remove words from the stop word lists these libraries provide, for example to keep a negation such as “not”.

Here is the code using the NLTK library:

In [34]:

sw_nltk.remove('not')
print(len(sw_nltk))

182
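One caveat: `list.remove()` deletes only the first occurrence and raises `ValueError` when the word is absent, which happens if the cell is run twice. A more forgiving variant is sketched below on a small stand-in list. `drop_word` is a hypothetical helper name.

```python
# Tiny stand-in for a stop word list (the real NLTK list is much longer).
stop_words = ["no", "nor", "not", "only"]

def drop_word(words, word):
    """Return a copy of words without any occurrence of word; absent words are ignored."""
    return [w for w in words if w != word]

stop_words = drop_word(stop_words, "not")
stop_words = drop_word(stop_words, "not")  # second call is a harmless no-op

print(stop_words)  # ['no', 'nor', 'only']
```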


## Create Custom Stop Words

In [35]:

text2 = "Here's to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes."

In [36]:
# Create your own custom stop word list
my_stop_words = ['to','the','in']
words = [word for word in text2.split() if word.lower() not in my_stop_words]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text2))
print("New length: ", len(new_text))

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length:  105
New length:  75
