![NLP_Header_Tokenization](https://raw.githubusercontent.com/satishgunjal/images/master/NLP_Header_Tokenization.png)

# Index

* [Introduction](#1)
* [Why Tokenization is Required?](#2)
* [Tokenization Techniques](#3)
  - [Tokenization Using Python's Inbuilt Method](#4)
  - [Tokenization Using Regular Expressions(RegEx)](#5)
  - [Tokenization Using NLTK](#6)
  - [Tokenization Using spaCy](#7)
  - [Tokenization using Keras](#8)
  - [Tokenization using Gensim](#9)
* [Conclusion](#10)
* [References](#11)

**This tutorial gives a friendly description of several tokenization methods, with Python code for each.**

# Introduction <a id ="1"></a>

Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller chunks, called tokens, which are usually words or sentences. Splitting text into words is called 'Word Tokenization', and splitting it into sentences is called 'Sentence Tokenization'. Generally, whitespace is used to separate words, and characters such as periods, exclamation points and newlines are used to separate sentences. We have to choose the appropriate method for the task at hand. Depending on the method, some characters, such as spaces and punctuation, are dropped and do not appear in the final list of tokens.

![NLP_Tokenization](https://raw.githubusercontent.com/satishgunjal/images/master/NLP_Tokenization.png)

# Why Tokenization is Required? <a id ="2"></a>
A sentence gets its meaning from the words in it, so analyzing those words helps us interpret the text. Once we have a list of words, we can also apply statistical tools and methods to get more insight into the text. For example, we can use word counts and word frequencies to estimate how important a word is in a sentence or document.
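
As a small illustration of the word-frequency idea, here is a minimal sketch that counts tokens with the standard library's `collections.Counter`. The sample sentence is made up for this example, and the whitespace split is only a stand-in for the tokenizers described later.

```python
from collections import Counter

# Made-up sample text for illustration
text = "the cat sat on the mat and the dog sat too"

# Whitespace tokenization, then count how often each token occurs
tokens = text.split()
freq = Counter(tokens)

# The three most frequent tokens with their counts
top_three = freq.most_common(3)
print(top_three)
```

Frequent tokens such as 'the' are often not the most informative ones, which is why counts like these are usually combined with further cleaning steps.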

# Tokenization Techniques <a id ="3"></a>
There are multiple ways to tokenize text data. We can choose a method based on the language, the library and the purpose of modeling.

## Tokenization Using Python's Inbuilt Method <a id ="4"></a>

![NLP_Tokenization](https://raw.githubusercontent.com/satishgunjal/images/master/python_split_syntax.png)

* We can use the **split()** method to split a string into a list where each word is a list item.
* By default split() uses whitespace as the separator, but we can change it to any other string.
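
The two points above can be sketched as follows. The sample strings are made up for this example; `maxsplit` is the optional second argument of `str.split()`.

```python
record = "2020-10-05,tokenization,nlp"

# Custom separator: split on commas instead of whitespace
fields = record.split(",")
print(fields)

# maxsplit limits the number of splits that are performed
head, rest = record.split(",", 1)
print(head, rest)

# With no argument, runs of whitespace count as one separator;
# with an explicit " ", every single space is a separator
spaced = "a  b"
print(spaced.split())     # no empty strings
print(spaced.split(" "))  # an empty string appears between the two spaces
```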

### Word Tokenization

In [1]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
# Split text by whitespace
tokens = text.split()
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge,', 'library', 'and', 'purpose', 'of', 'modeling.']


Observe that in the above list, words like 'data.' and 'modeling.' still have punctuation attached. **Python's split() method does not treat punctuation as a separate token.**
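
One simple way to deal with the attached punctuation is to strip it from both ends of each token after splitting. This is only a sketch using the standard library's `string.punctuation`. It does not touch punctuation inside a word, and the sample sentence is a shortened version of the text above.

```python
import string

text = "We can choose any method based on language, library and purpose of modeling."

# Split on whitespace, then strip punctuation from both ends of every token
tokens = [tok.strip(string.punctuation) for tok in text.split()]

# Drop tokens that were punctuation only and are now empty
tokens = [tok for tok in tokens if tok]
print(tokens)
```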

### Sentence Tokenization

In [2]:
# Let's split the given text on full stops (.)
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
text.split(". ")  # Splitting on ". " (with the space) avoids an empty element at the end of the list and leading spaces in the sentences

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method.']

As you can see, because split() accepts only one separator at a time, it failed to split the last two sentences at the exclamation mark (!). We can work around this by applying split() several times with different separators, but there are better ways to do it.
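
For completeness, the "apply split() several times" workaround could look like the sketch below. The helper name `split_on_all` and the sample text are made up for this example.

```python
def split_on_all(text, separators):
    """Split text on each separator in turn (hypothetical helper)."""
    pieces = [text]
    for sep in separators:
        next_pieces = []
        for piece in pieces:
            # Re-split every piece produced so far on the next separator
            next_pieces.extend(piece.split(sep))
        pieces = next_pieces
    # Drop empty pieces and surrounding whitespace
    return [p.strip() for p in pieces if p.strip()]


text = "First sentence. Second one! Is this the third? Yes."
sentences = split_on_all(text, [". ", "! ", "? "])
print(sentences)
```

The separators themselves are consumed by split(), so only the last sentence keeps its punctuation. The nested loops also show why this approach gets clumsy quickly.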

## Tokenization Using Regular Expressions(RegEx) <a id ="5"></a>

![python_regex_syntax](https://raw.githubusercontent.com/satishgunjal/images/master/python_regex_syntax.png)

* A regular expression is a sequence of characters that defines a search pattern.
* Using RegEx we can match character combinations in a string and perform word or sentence tokenization.
* You can test your regular expressions on [regex101](https://regex101.com/).
* We can use Python's **re** library for RegEx-related operations.


### Word Tokenization

In [3]:
import re

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = re.findall(r"[\w]+", text)  # Raw string, so that \w is passed to the regex engine unchanged
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


Based on the RegEx pattern we are able to generate the list of words. The parts of the pattern are explained below.
```
[] :	A set of characters.
\w :    Matches any word character (letters, digits and the underscore _ character).
+  :	One or more occurrences.
```

So our RegEx pattern finds every run of consecutive word characters. Each run ends as soon as any other character (a space or a punctuation mark, for example) is encountered.
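
One consequence of this pattern is that apostrophes, hyphens and decimal points also end a token. The sketch below shows that on a made-up sentence, together with one possible alternative pattern (an assumption for this example, not part of the original tutorial) that keeps apostrophes and hyphens inside a word.

```python
import re

text = "Don't split e-mail or snake_case_names, please: 3.14!"

# Plain \w+ : every non-word character ends the current token
basic_tokens = re.findall(r"\w+", text)
print(basic_tokens)

# Alternative: allow ' or - inside a token when word characters follow it
joined_tokens = re.findall(r"\w+(?:['-]\w+)*", text)
print(joined_tokens)
```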

### Sentence Tokenization

In [4]:
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
tokens_sent = re.compile('[.!?] ').split(text) # The character class [.!?] matches any one of the three separators when it is followed by a space
tokens_sent

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time',
 'So sentence tonenization wont be foolproof with split() method.']

As you can see from the above result, we are able to split the text on multiple separators. Note that each matched separator is removed, so only the last sentence keeps its punctuation.
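
If the sentence-ending punctuation should be kept, one option is to split on the whitespace that follows it, using a lookbehind assertion. This is a sketch on a made-up text, not a complete sentence splitter. Abbreviations such as "e.g. " would still be split.

```python
import re

text = "Stop here. Wait, really! Are you sure? Yes."

# (?<=[.!?]) checks that the whitespace is preceded by ., ! or ?
# without consuming that character, so it stays with its sentence
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
```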

## Tokenization Using NLTK <a id ="6"></a>
* Natural Language Toolkit (NLTK) is a library written in Python for natural language processing.
* NLTK provides the **word_tokenize()** function for word tokenization and **sent_tokenize()** for sentence tokenization.
* The syntax to install NLTK is as below
```
!pip install --user -U nltk
```
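* Note that we put "!" before the command to tell the notebook to run it as a command-line command.
* Both functions rely on NLTK's Punkt tokenizer data. If it is not installed, NLTK raises a `LookupError` that names the resource to fetch with `nltk.download()`.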

### Word Tokenization

In [5]:
!pip install --user -U nltk

In [6]:
from nltk.tokenize import word_tokenize

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = word_tokenize(text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


Notice that NLTK word tokenization also treats punctuation marks as tokens. We have to account for this during text cleaning.
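
A simple cleaning step is to drop the tokens that consist only of punctuation. The sketch below uses a hard-coded token list in the same style as the output above, so that it runs without NLTK installed. The list itself is an assumption made for this example.

```python
import string

# Token list in the style produced by word_tokenize() (hard-coded for illustration)
tokens = ['We', 'can', 'choose', 'any', 'method', ',', 'library', 'and', 'purpose', '.']

# Keep a token unless every one of its characters is a punctuation mark
words = [tok for tok in tokens
         if not all(ch in string.punctuation for ch in tok)]
print(words)
```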

### Sentence Tokenization

In [7]:
from nltk.tokenize import sent_tokenize

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
sent_tokenize(text)

['Characters like periods, exclamation point and newline char are used to separate the sentences.',
 'But one drawback with split() method, that we can only use one separator at a time!',
 'So sentence tonenization wont be foolproof with split() method.']

## Tokenization Using spaCy <a id ="7"></a>
* spaCy is an open-source library for advanced natural language processing, written in Python and Cython.
* In spaCy we create a language object, which is then used for word and sentence tokenization.
* The syntax to install the spaCy library and the English model is as below
```
!pip install spacy
!python -m spacy download en_core_web_sm
```
* Note that we put "!" before the command to tell the notebook to run it as a command-line command.

### Word Tokenization

In [8]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [9]:
# Import the English language class from spaCy
from spacy.lang.en import English

# Create a blank English pipeline.
# It contains only the tokenizer, with no tagger, parser, NER or word vectors.
nlp = English()

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

# Process the text with the 'nlp' object to get a tokenized 'Doc' object
my_doc = nlp(text)

# The Doc holds Token objects, so collect the text of each token into a list
token_list = [token.text for token in my_doc]

print(token_list)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


### Sentence Tokenization

In [10]:
# Create a blank English pipeline (tokenizer only)
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline.
# This is the spaCy v3 call; spaCy v2 used nlp.add_pipe(nlp.create_pipe('sentencizer')).
nlp.add_pipe('sentencizer')

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

# Process the text; the sentencizer marks the sentence boundaries
doc = nlp(text)

# Create list of sentence tokens
sentence_list = [sentence.text for sentence in doc.sents]
print(sentence_list)

['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', 'So sentence tonenization wont be foolproof with split() method.']


## Tokenization using Keras <a id ="8"></a>
* Keras is an open-source neural network library written in Python. It is easy to use and runs on top of a backend such as TensorFlow.
* To perform word tokenization we use the **text_to_word_sequence()** function from the **keras.preprocessing.text** module.
* By default, this function automatically does 3 things:
    * Splits words by space (`split=" "`).
    * Filters out punctuation (``filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'``).
    * Converts text to lowercase (`lower=True`).
* The syntax to install Keras is as below
```
!pip install Keras
```

In [11]:
!pip install Keras


### Word Tokenization

In [12]:
from keras.preprocessing.text import text_to_word_sequence

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

tokens = text_to_word_sequence(text)
print(tokens)

['there', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'we', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


As you can see, all words are also converted to lowercase. This is the default behavior; we can change it through the arguments, e.g. text_to_word_sequence(text, lower=False).
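
To make the three default steps concrete, here is a rough plain-Python imitation. It is only a sketch of the documented behavior, not Keras's actual implementation. The function name `simple_word_sequence` and the sample strings are made up for this example.

```python
# Default filter characters as documented for text_to_word_sequence()
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical stand-in that mimics the three default steps."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split string
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Split and drop the empty strings left by consecutive separators
    return [tok for tok in text.split(split) if tok]


tokens = simple_word_sequence("Hello, World! Keras-style tokens.")
print(tokens)

# Same text, but keeping the original case
cased = simple_word_sequence("Hello, World! Keras-style tokens.", lower=False)
print(cased)
```

Because the filtered characters are replaced by the split string before splitting, changing `split` and `filters` together changes what counts as a token boundary.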

### Sentence Tokenization
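For sentence tokenization we can pass the sentence-ending characters as filters, e.g. "!.\n", together with split=".". Every character in filters is first replaced by the split string, and the text is then split on it. The resulting sentences are lowercased and keep their leading spaces.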

In [13]:
from keras.preprocessing.text import text_to_word_sequence

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

text_to_word_sequence(text, split= ".", filters="!.\n")

['characters like periods, exclamation point and newline char are used to separate the sentences',
 ' but one drawback with split() method, that we can only use one separator at a time',
 ' so sentence tonenization wont be foolproof with split() method']

## Tokenization using Gensim <a id ="9"></a>
* Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.
* We are going to use **tokenize()** from the **gensim.utils** module for word tokenization.
* For sentence tokenization Gensim has a separate function, **split_sentences()**, in the **gensim.summarization.textcleaner** module. Note that the summarization package was removed in Gensim 4.0, so this function needs an older release.
* The syntax to install Gensim is as below
```
!pip install gensim
```

In [14]:
!pip install gensim


### Word Tokenization

In [15]:
from gensim.utils import tokenize

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

tokens = list(tokenize(text))
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']
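
Note that the punctuation is gone from the output above. As far as its documentation describes it, Gensim's tokenizer yields runs of alphabetic characters, so digits are dropped as well. The sketch below imitates that with the standard `re` module. The pattern is an assumption made for this example, not Gensim's own code, and the sample sentence is made up.

```python
import re

text = "Python 3 was released in 2008, wasn't it?"

# A word character that is not a digit, repeated one or more times
alphabetic = re.compile(r"(?:(?!\d)\w)+")

tokens = alphabetic.findall(text)
print(tokens)
```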


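For comparison, the Keras text_to_word_sequence() function with split="." and the default filters over-splits the text, because every filtered punctuation character (commas and parentheses included) is first replaced by the split string:

In [16]: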
from keras.preprocessing.text import text_to_word_sequence

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be fullproof with split() method."""

tokens = text_to_word_sequence(text, split= ".")
print(tokens)

['characters like periods', ' exclamation point and newline char are used to separate the sentences', ' but one drawback with split', ' method', ' that we can only use one separator at a time', ' so sentence tonenization wont be fullproof with split', ' method']


### Sentence Tokenization

In [17]:
# Requires gensim < 4.0 (the summarization package was removed in Gensim 4.0)
from gensim.summarization.textcleaner import split_sentences

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

list(split_sentences(text))

['Characters like periods, exclamation point and newline char are used to separate the sentences.',
 'But one drawback with split() method, that we can only use one separator at a time!',
 'So sentence tonenization wont be foolproof with split() method.']

# Conclusion <a id ="10"></a>
There are multiple ways to tokenize text. We can choose a library depending on our requirements and the features it supports. Feel free to try the above code with different text snippets to get a feel for how tokenization works.

# References <a id ="11"></a>
* https://keras.io/api/preprocessing/text/#text_to_word_sequence
* https://www.nltk.org/
* https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/
* https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
* https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
