# Introduction
Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller units, called tokens, such as words or sentences. Splitting text into words is called 'Word Tokenization'; splitting it into sentences is called 'Sentence Tokenization'.

# Why Is Tokenization Required?
A sentence gets its meaning from the words in it, so analyzing those words is the first step toward interpreting the text. Once we have a list of words, we can also apply statistical tools and methods, such as frequency counts, to gain more insight into the text.
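
As a minimal sketch of that idea, the snippet below counts word frequencies with the standard library's `collections.Counter`. The sample sentence and the naive lowercase-and-split tokenization are illustrative assumptions, not part of any particular library's pipeline.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration
sample = "the cat sat on the mat and the dog sat on the rug"

# Naive word tokenization: lowercase, then split on whitespace
words = sample.lower().split()

# Count how often each token occurs
word_counts = Counter(words)

# The three most frequent tokens with their counts
top_three = word_counts.most_common(3)
print(top_three)
```

Here `'the'` comes out as the most frequent token, which hints at why later pipeline stages often filter out very common words.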

# Tokenization Techniques 
There are multiple ways to tokenize text data. The right method depends on the language, the library, and the purpose of the modeling.

## Tokenization Using Python's Built-in Methods

In [1]:
# Word Tokenization
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""

# Split text on whitespace (split() with no argument handles spaces, tabs and newlines)
tokens = text.split()
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language,', 'library', 'and', 'purpose', 'of', 'modeling.']
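
Note that whitespace splitting leaves punctuation attached to neighbouring words (for example `'data.'` above). One possible workaround, sketched here under the assumption that only leading and trailing punctuation matters, is to strip `string.punctuation` from each token. The variable names are illustrative.

```python
import string

sentence = "We can choose any method based on language, library and purpose of modeling."

# Split on whitespace, then strip punctuation from both ends of each token
raw_tokens = sentence.split()
clean_tokens = [tok.strip(string.punctuation) for tok in raw_tokens]

# Drop tokens that consisted of punctuation only
clean_tokens = [tok for tok in clean_tokens if tok]
print(clean_tokens)
```

This keeps punctuation inside a word (such as an apostrophe) but cannot emit punctuation marks as separate tokens.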


In [2]:
# Sentence Tokenization
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

# Split into sentences on '.' (note that '!' is not treated as a separator)
text.split('.')

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 " But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method",
 '']
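
One way around the single-separator limit is `re.split()`, which accepts a pattern rather than a fixed string. The sketch below assumes that a sentence ends with `.`, `!` or `?` followed by whitespace. That rule is a simplification and will mis-split abbreviations such as "Dr. Smith".

```python
import re

paragraph = "Periods end most sentences. Exclamation marks work too! Do question marks? Yes."

# Split on whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the terminator attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", paragraph)
print(sentences)
```

Unlike `text.split('.')`, this keeps the closing punctuation and does not produce a trailing empty string.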

## Tokenization Using Regular Expressions (RegEx)

In [4]:
# Word Tokenization
import re

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""

# \w+ matches runs of letters, digits and underscores; the raw string avoids invalid-escape warnings
tokens = re.findall(r"\w+", text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', 'library', 'and', 'purpose', 'of', 'modeling']
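
The pattern above keeps only word characters, so punctuation is discarded entirely. If punctuation should survive as separate tokens, an alternation can be added. The pattern below is one possible choice for illustration, not a standard.

```python
import re

sentence = "We can choose any method based on language, library and purpose of modeling."

# \w+      -> runs of letters, digits and underscores
# [^\w\s]  -> any single character that is neither a word character nor whitespace
tokens_with_punct = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens_with_punct)
```

The comma and the final period now appear as tokens of their own.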


# Tokenization Using NLTK 
1. Natural Language Toolkit (NLTK) is a library written in Python for natural language processing.
2. NLTK provides the functions word_tokenize() for word tokenization and sent_tokenize() for sentence tokenization.

In [2]:
# Word Tokenization
# Requires the Punkt tokenizer data, e.g. nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab')
from nltk.tokenize import word_tokenize

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""

tokens = word_tokenize(text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


In [4]:
# Sentence Tokenization
from nltk.tokenize import sent_tokenize

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

tokens = sent_tokenize(text)
print(tokens)

['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', "So sentence tokenization won't be foolproof with split() method."]


# Tokenization Using spaCy 
1. spaCy is an open-source library for advanced natural language processing, written in Python and Cython.
2. In spaCy, we create a language object, which is then used for word and sentence tokenization.

In [5]:
# Word Tokenization
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no trained components
nlp = English()

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""

my_doc = nlp(text)

# Collect the text of each token
token_list = [token.text for token in my_doc]

print(token_list)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


In [None]:
# Sentence Tokenization
from spacy.lang.en import English

nlp = English()

# Add the rule-based sentence boundary detector by name (spaCy v3+).
# In spaCy v2 this was: nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe('sentencizer')

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

# nlp object is used to create documents with linguistic annotations
doc = nlp(text)

# Create list of sentence tokens
sentence_list = [sentence.text for sentence in doc.sents]
print(sentence_list)

# Tokenization Using Keras

In [11]:
# Word Tokenization
# Note: keras.preprocessing.text is marked as deprecated in recent TensorFlow releases
# and may not be available in newer Keras versions.
from keras.preprocessing.text import text_to_word_sequence

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""

# Lowercases the text and removes punctuation by default
tokens = text_to_word_sequence(text)
print(tokens)

['there', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'we', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', 'library', 'and', 'purpose', 'of', 'modeling']


In [12]:
# Sentence Tokenization
from keras.preprocessing.text import text_to_word_sequence

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

# Characters in `filters` are replaced by the `split` string before splitting,
# so '!', '.' and newline all act as sentence boundaries here
text_to_word_sequence(text, split='.', filters='!.\n')

['characters like periods, exclamation point and newline char are used to separate the sentences',
 ' but one drawback with split() method, that we can only use one separator at a time',
 " so sentence tokenization won't be foolproof with split() method"]
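
To make the behaviour of `text_to_word_sequence` less opaque, here is a rough pure-Python approximation of what it appears to do: lowercase the text, replace every character in `filters` with the `split` string, split, and drop empty pieces. The function name `simple_word_sequence` is hypothetical, the default filter set is an assumption meant to mirror the Keras default, and the real Keras function may differ in details.

```python
def simple_word_sequence(
    text,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',  # assumed default: punctuation (no apostrophe), tab, newline
    lower=True,
    split=" ",
):
    """Approximate Keras-style splitting using only the standard library."""
    if lower:
        text = text.lower()
    # Map each filtered character to the split string
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Split on the separator and discard empty strings
    return [piece for piece in text.split(split) if piece]


words = simple_word_sequence("Hello, World! Tokenize me.")
parts = simple_word_sequence("One. Two! Three", split=".", filters="!.\n")
print(words)
print(parts)
```

The second call shows why the sentence pieces above keep their leading space: only the characters listed in `filters` are rewritten, and whitespace is not among them.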

# Tokenization Using Gensim 

In [14]:
# Word Tokenization
from gensim.utils import tokenize

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""

# tokenize() returns a generator, so wrap it in list() to materialize the tokens
tokens = list(tokenize(text))
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', 'library', 'and', 'purpose', 'of', 'modeling']
