# Performing tokenization with the following process:

1. Simple tokenization with .split
2. Tokenization with NLTK
3. Convert a corpus to a vector of token counts with CountVectorizer (sklearn)
4. Tokenize text in different languages with spaCy
5. Tokenization with Gensim


In [2]:
# 1.Simple Tokenization 
text = """Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes. The ones who see things differently — they’re not fond of rules. You can quote them, disagree with them, glorify or vilify them, but the only thing you can’t do is ignore them because they change things. They push the human race forward, and while some may see them as the crazy ones, we see genius, because the ones who are crazy enough to think
that they can change the world, are the ones who do."""
text.split()

['Here’s',
 'to',
 'the',
 'crazy',
 'ones,',
 'the',
 'misfits,',
 'the',
 'rebels,',
 'the',
 'troublemakers,',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes.',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 '—',
 'they’re',
 'not',
 'fond',
 'of',
 'rules.',
 'You',
 'can',
 'quote',
 'them,',
 'disagree',
 'with',
 'them,',
 'glorify',
 'or',
 'vilify',
 'them,',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can’t',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things.',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward,',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones,',
 'we',
 'see',
 'genius,',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world,',
 'are',
 'the',
 'ones',
 'who',
 'do.']

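The split() method separates text only on whitespace, so punctuation stays attached to the neighboring word ('ones,', 'holes.') instead of becoming a separate token. This can change your project results, for example because 'ones' and 'ones,' are counted as two different tokens.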
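If punctuation should become its own token without pulling in an extra library, a regular expression is one option. The sketch below is only an illustration: the helper name `simple_tokenize` and the pattern are assumptions made for this example, not part of the original notebook.

```python
import re


def simple_tokenize(text):
    """Return word tokens and single punctuation tokens (hypothetical helper)."""
    # \w+      matches a run of word characters (letters, digits, underscore)
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "the crazy ones, the misfits, the rebels."

# str.split keeps the punctuation glued to the words
print(sample.split())

# the regex version emits the punctuation as separate tokens
print(simple_tokenize(sample))
```

This is still a very rough tokenizer: contractions such as "can’t" are split into three pieces, which is one reason to reach for a dedicated library.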

In [None]:
# 2. Tokenization with NLTK
# NLTK contains a module called tokenize with a word_tokenize() method that will help us split a text into tokens.

In [5]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
text = """Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes. The ones who see things differently — they’re not fond of rules. You can quote them, disagree with them, glorify or vilify them, but the only thing you can’t do is ignore them because they change things. They push the human race forward, and while some may see them as the crazy ones, we see genius, because the ones who are crazy enough to think
that they can change the world, are the ones who do."""
word_tokenize(text)

['Here',
 '’',
 's',
 'to',
 'the',
 'crazy',
 'ones',
 ',',
 'the',
 'misfits',
 ',',
 'the',
 'rebels',
 ',',
 'the',
 'troublemakers',
 ',',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes',
 '.',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 '—',
 'they',
 '’',
 're',
 'not',
 'fond',
 'of',
 'rules',
 '.',
 'You',
 'can',
 'quote',
 'them',
 ',',
 'disagree',
 'with',
 'them',
 ',',
 'glorify',
 'or',
 'vilify',
 'them',
 ',',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can',
 '’',
 't',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things',
 '.',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward',
 ',',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones',
 ',',
 'we',
 'see',
 'genius',
 ',',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world',
 ',',
 'are',
 'the',
 'ones',
 'who',
 'do',
 '.']

3. Convert a corpus to a vector of token counts with CountVectorizer (sklearn)

The previous two methods only return a list of strings for each text, which is not practical for a large corpus, where the tokens need a numeric representation. CountVectorizer converts a collection of text documents into a matrix of token counts, so in the end we get a vector representation of each document.

In [1]:
import pandas as pd
texts = [
"""Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes. The ones who see things differently — they’re not fond of rules. You can quote them, disagree with them, glorify or vilify them, but the only thing you can’t do is ignore them because they change things. They push the human race forward, and while some may see them as the crazy ones, we see genius, because the ones who are crazy enough to think that they can change the world, are the ones who do.""" ,
 
'I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.'
]
df = pd.DataFrame({'author': ['jobs', 'gates'], 'text':texts})

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize
cv = CountVectorizer(stop_words='english') 
cv_matrix = cv.fit_transform(df['text']) 
# create document term matrix
df_dtm = pd.DataFrame(cv_matrix.toarray(), index=df['author'].values, columns=cv.get_feature_names_out())


In [4]:
df_dtm

author,change,choose,crazy,differently,disagree,easy,fond,forward,genius,glorify,...,round,rules,square,thing,things,think,troublemakers,vilify,way,world
jobs,2,0,3,1,1,0,1,1,1,1,...,1,1,1,1,2,1,1,1,0,1
gates,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [None]:
# This method is useful when the DataFrame contains a large corpus, because it produces a document-term matrix in which each column is a token and each cell is an integer count.
# CountVectorizer accepts several parameters, such as stop_words, which is used here to drop common English words.
# The default regexp used by CountVectorizer selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
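To make the default token rule concrete, here is a minimal standard-library sketch of the counting step. It assumes the default pattern is `(?u)\b\w\w+\b` and that text is lowercased first; the name `count_tokens` is hypothetical, and stop-word removal is deliberately left out.

```python
import re
from collections import Counter

# Assumption: this mirrors CountVectorizer's default token_pattern,
# i.e. tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(document):
    """Lowercase the document and count its tokens (hypothetical helper)."""
    return Counter(TOKEN_PATTERN.findall(document.lower()))


gates = ("I choose a lazy person to do a hard job. "
         "Because a lazy person will find an easy way to do it.")

counts = count_tokens(gates)

# One-character tokens such as "I" and "a" never appear in the counts
print(sorted(counts.items()))
```

Single-character words are dropped by the pattern itself, before any stop-word list is applied, which is why they can never show up as columns in the document-term matrix.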

4. Tokenize text in different languages with spaCy

In [None]:
# When you need to tokenize text written in a language other than English, you can use spaCy. 

In [5]:
from spacy.lang.es import Spanish
nlp = Spanish()

text_spanish = """Por los locos. Los marginados. Los rebeldes. Los problematicos. 
Los inadaptados. Los que ven las cosas de una manera distinta. A los que no les gustan
las reglas. Y a los que no respetan el “status quo”. Puedes citarlos, discrepar de ellos,
ensalzarlos o vilipendiarlos. Pero lo que no puedes hacer es ignorarlos… Porque ellos
cambian las cosas, empujan hacia adelante la raza humana y, aunque algunos puedan
considerarlos locos, nosotros vemos en ellos a genios. Porque las personas que están
lo bastante locas como para creer que pueden cambiar el mundo, son las que lo logran."""

doc = nlp(text_spanish)

tokens = [token.text for token in doc]
print(tokens)

['Por', 'los', 'locos', '.', 'Los', 'marginados', '.', 'Los', 'rebeldes', '.', 'Los', 'problematicos', '.', '\n', 'Los', 'inadaptados', '.', 'Los', 'que', 'ven', 'las', 'cosas', 'de', 'una', 'manera', 'distinta', '.', 'A', 'los', 'que', 'no', 'les', 'gustan', '\n', 'las', 'reglas', '.', 'Y', 'a', 'los', 'que', 'no', 'respetan', 'el', '“', 'status', 'quo', '”', '.', 'Puedes', 'citarlos', ',', 'discrepar', 'de', 'ellos', ',', '\n', 'ensalzarlos', 'o', 'vilipendiarlos', '.', 'Pero', 'lo', 'que', 'no', 'puedes', 'hacer', 'es', 'ignorarlos', '…', 'Porque', 'ellos', '\n', 'cambian', 'las', 'cosas', ',', 'empujan', 'hacia', 'adelante', 'la', 'raza', 'humana', 'y', ',', 'aunque', 'algunos', 'puedan', '\n', 'considerarlos', 'locos', ',', 'nosotros', 'vemos', 'en', 'ellos', 'a', 'genios', '.', 'Porque', 'las', 'personas', 'que', 'están', '\n', 'lo', 'bastante', 'locas', 'como', 'para', 'creer', 'que', 'pueden', 'cambiar', 'el', 'mundo', ',', 'son', 'las', 'que', 'lo', 'logran', '.']


In [None]:
# We imported Spanish from spacy.lang.es, but if you’re working with text in English, just import English from spacy.lang.en instead.
# spaCy treats punctuation symbols as separate tokens (even the newline characters '\n' were included as tokens).
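If the newline tokens are not wanted, they can be filtered out after tokenization. The sketch below uses plain Python on a few tokens copied from the output above, so it runs without spaCy installed; when working with the `doc` object directly, spaCy tokens also expose an `is_space` attribute that serves the same purpose.

```python
# A short slice of the spaCy output above, including one newline token
tokens = ['Los', 'problematicos', '.', '\n', 'Los', 'inadaptados', '.']

# str.isspace() is True for strings made up only of whitespace, such as '\n'
words_and_punct = [tok for tok in tokens if not tok.isspace()]

print(words_and_punct)
```

Punctuation is kept on purpose here; whether to drop it as well depends on the downstream task.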

5. Tokenization with Gensim

In [7]:
# Gensim is a library for unsupervised topic modeling and natural language processing and also contains a tokenizer.
from gensim.utils import tokenize
text = """Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes. The ones who see things differently — they’re not fond of rules. You can quote them, disagree with them, glorify or vilify them, but the only thing you can’t do is ignore them because they change things. They push the human race forward, and while some may see them as the crazy ones, we see genius, because the ones who are crazy enough to think
that they can change the world, are the ones who do."""
list(tokenize(text))

['Here',
 's',
 'to',
 'the',
 'crazy',
 'ones',
 'the',
 'misfits',
 'the',
 'rebels',
 'the',
 'troublemakers',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 'they',
 're',
 'not',
 'fond',
 'of',
 'rules',
 'You',
 'can',
 'quote',
 'them',
 'disagree',
 'with',
 'them',
 'glorify',
 'or',
 'vilify',
 'them',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can',
 't',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones',
 'we',
 'see',
 'genius',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world',
 'are',
 'the',
 'ones',
 'who',
 'do']

In [None]:
# Gensim's tokenize() keeps only runs of alphabetic characters, so it splits every time it encounters a punctuation symbol and drops the symbol itself.
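The behavior seen in the output above can be approximated with a single regular expression. This is only a sketch of the idea, not Gensim's actual implementation: the pattern and the name `alphabetic_tokens` are assumptions chosen for illustration.

```python
import re

# Assumption: approximates the "maximal runs of alphabetic characters" rule.
# (?!\d)\w matches a word character that is not a digit.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")


def alphabetic_tokens(text):
    """Return runs of non-digit word characters (hypothetical helper)."""
    return ALPHABETIC.findall(text)


# Punctuation, including the apostrophe, acts purely as a separator
print(alphabetic_tokens("Here’s to the crazy ones, the misfits."))

# Digits are dropped as well
print(alphabetic_tokens("route 66"))
```

Compared with NLTK, nothing is emitted for the punctuation itself, so the token list is shorter and contractions lose their apostrophes.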

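Summary:
1. The .split method is a simple tokenizer that separates text on whitespace and leaves punctuation attached to the words.
2. NLTK and Gensim do a similar job, but with different punctuation rules: NLTK keeps punctuation as separate tokens, while Gensim discards it.
3. spaCy offers tokenizers for many languages, and sklearn's CountVectorizer helps turn a large corpus into a matrix of token counts.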