# TEXT PREPROCESSING - TOKENIZATION


Text preprocessing is an integral part of any Natural Language Processing (NLP) pipeline. Although far more powerful preprocessing methods are available today, an intuition for the traditional methods is a solid foundation for understanding what comes later.

The first text preprocessing concept we will discuss is **Tokenization**: breaking text down into smaller pieces, known as tokens. Tokens are the units an NLP model works with, so tokenization is usually the first step in building a model or analysing the meaning of textual data.

_Here is an example of tokenization_

| Before Tokenization | After Tokenization |
| --- | --- | 
| "This is a test sentence" | 'This' 'is' 'a' 'test' 'sentence' |
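
The table above can be reproduced with nothing but the Python standard library. This is a minimal sketch that assumes whitespace is the only separator, which is exactly the assumption the library tokenizers below relax.

```python
# Whitespace tokenization using only the standard library.
text = "This is a test sentence"

# str.split() with no arguments splits on any run of whitespace.
tokens = text.split()
print(tokens)  # ['This', 'is', 'a', 'test', 'sentence']
```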


We will use the NLTK and Keras libraries to illustrate the concepts. Examples using Gensim, spaCy, and TextBlob will be added later.

In [1]:
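#Natural Language Toolkit (NLTK) is an NLP package for text data processing
#sent_tokenize() and word_tokenize() need NLTK's Punkt data, e.g. nltk.download('punkt') (the resource name may differ by NLTK version)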
import nltk

paragraph = """Paragraphs are the building blocks of papers. Many students define paragraphs \
in terms of length. A paragraph is a group of at least five sentences. Paragraph \
is half a page long, etc."""


In [2]:
#nltk.sent_tokenize() splits raw text into a list of sentences
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

['Paragraphs are the building blocks of papers.', 'Many students define paragraphs in terms of length.', 'A paragraph is a group of at least five sentences.', 'Paragraph is half a page long, etc.']
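
To see why a trained sentence tokenizer is useful, compare it with a naive rule. The sketch below is not how `nltk.sent_tokenize()` works; `naive_sent_tokenize` is a hypothetical helper that simply splits after `.`, `!` or `?` when whitespace follows. It copes with simple text but breaks on abbreviations.

```python
import re

def naive_sent_tokenize(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

simple = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length."
print(naive_sent_tokenize(simple))
# ['Paragraphs are the building blocks of papers.', 'Many students define paragraphs in terms of length.']

# The abbreviation "Dr." is wrongly treated as the end of a sentence.
tricky = "Dr. Smith arrived. He was late."
print(naive_sent_tokenize(tricky))
# ['Dr.', 'Smith arrived.', 'He was late.']
```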


In [3]:
#nltk.word_tokenize() splits raw text into a list of words and punctuation tokens
words = nltk.word_tokenize(paragraph)
print(words)

['Paragraphs', 'are', 'the', 'building', 'blocks', 'of', 'papers', '.', 'Many', 'students', 'define', 'paragraphs', 'in', 'terms', 'of', 'length', '.', 'A', 'paragraph', 'is', 'a', 'group', 'of', 'at', 'least', 'five', 'sentences', '.', 'Paragraph', 'is', 'half', 'a', 'page', 'long', ',', 'etc', '.']


In [4]:
#nltk.TweetTokenizer tokenizes tweets - keeps emoticons and hashtags intact

tweet = " Martinelli to start today <3. Hope he recreates magic at #StamfordBridge #ARSCHE #COYG"

tweet_tokenizer = nltk.TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(tweet)
print(tweet_tokens)

normal_tokens = nltk.word_tokenize(tweet)
print(normal_tokens)

['Martinelli', 'to', 'start', 'today', '<3', '.', 'Hope', 'he', 'recreates', 'magic', 'at', '#StamfordBridge', '#ARSCHE', '#COYG']
['Martinelli', 'to', 'start', 'today', '<', '3', '.', 'Hope', 'he', 'recreates', 'magic', 'at', '#', 'StamfordBridge', '#', 'ARSCHE', '#', 'COYG']
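
The difference between the two outputs comes down to which character sequences are treated as a single unit. The sketch below is a deliberately simplified stand-in, not the actual `TweetTokenizer` implementation: the pattern is an assumption that only knows about the `<3` emoticon, hashtags, mentions, and plain words.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TWEET_PATTERN = re.compile(r"<3|[#@]\w+|\w+|\S")

tweet = " Martinelli to start today <3. Hope he recreates magic at #StamfordBridge #ARSCHE #COYG"
simple_tweet_tokens = TWEET_PATTERN.findall(tweet)
print(simple_tweet_tokens)
# ['Martinelli', 'to', 'start', 'today', '<3', '.', 'Hope', 'he', 'recreates',
#  'magic', 'at', '#StamfordBridge', '#ARSCHE', '#COYG']
```

A real tweet tokenizer covers far more cases (other emoticons, URLs, repeated characters), which is why a library implementation is preferable in practice.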


In [5]:
#nltk.RegexpTokenizer() splits the text into tokens based on a regular expression

sample_sentence = "I earned $400 as winnings of Microsoft Hackathon."
#Raw string (r'...') so that \w, \$, \d and \S reach the regex engine unchanged
regex_tokenizer = nltk.RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
regex_tokens = regex_tokenizer.tokenize(sample_sentence)
print(regex_tokens)

normal_tokens = nltk.word_tokenize(sample_sentence)
print(normal_tokens)

['I', 'earned', '$400', 'as', 'winnings', 'of', 'Microsoft', 'Hackathon', '.']
['I', 'earned', '$', '400', 'as', 'winnings', 'of', 'Microsoft', 'Hackathon', '.']
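
To our understanding, `RegexpTokenizer` with its default settings returns the pattern's matches in order, much like `re.findall`. The sketch below uses only the standard library to show how the three alternatives in the pattern behave, including one edge case worth knowing about: `[\d\.]+` also swallows a period that directly follows an amount.

```python
import re

PATTERN = r"\w+|\$[\d\.]+|\S+"

# Same sentence as above: '$400' is matched by the second alternative.
tokens_a = re.findall(PATTERN, "I earned $400 as winnings of Microsoft Hackathon.")
print(tokens_a)
# ['I', 'earned', '$400', 'as', 'winnings', 'of', 'Microsoft', 'Hackathon', '.']

# Edge case: the sentence-final period is absorbed into the amount.
tokens_b = re.findall(PATTERN, "It cost $4.50.")
print(tokens_b)
# ['It', 'cost', '$4.50.']
```

If that trailing period matters, a stricter amount pattern such as `\$\d+(?:\.\d+)?` is one possible alternative.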


In [None]:
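#Tokenization can also be done with Keras, a deep learning API written in Python
#Note: text_to_word_sequence is marked as deprecated in recent TensorFlow/Keras releases, so this import may not work on every version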
import keras
from keras.preprocessing.text import text_to_word_sequence

words = text_to_word_sequence(paragraph)
print(words)
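
The cell above has no recorded output. As a rough guide to what to expect, here is a standard-library approximation of `text_to_word_sequence`, assuming its documented defaults (lowercase the text, replace a fixed set of punctuation characters, split on spaces). `simple_word_sequence` and `FILTERS` are hypothetical names for this sketch, and the real function may differ in details.

```python
# Punctuation filtered by default, as we understand the Keras documentation.
# Note that the apostrophe is not in this set.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True):
    """Hypothetical stand-in: lowercase, blank out filtered characters, split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()

print(simple_word_sequence("Hello, World! I'm here."))
# ['hello', 'world', "i'm", 'here']
```

Unlike `nltk.word_tokenize()`, this style of tokenization drops punctuation entirely and normalises case, which is often convenient for building a vocabulary but loses information such as sentence boundaries.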

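**Challenges with Tokenization:**

* Word boundaries - languages written without spaces between words, such as Chinese, give no simple marker for where one word ends and the next begins
* Symbols and other noise - real-world text contains many symbols and other noise, such as the hashtags and emoticons in the tweet example above
* Contractions and short forms - e.g. "I'm" for "I am", which a tokenizer has to keep whole, split, or expand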
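
Two of these challenges are easy to demonstrate with plain Python. The sketch below is illustrative only: the Chinese sentence is our own example, and `CONTRACTIONS` is a tiny hypothetical lookup table, not a complete resource.

```python
# 1. No spaces between words: whitespace splitting returns one giant token.
chinese = "我喜欢自然语言处理"  # "I like natural language processing"
print(chinese.split())          # a single token
chars = list(chinese)           # character-level fallback, not true word segmentation
print(chars)

# 2. Contractions: whitespace splitting keeps "I'm" and "it's" as single tokens.
sentence = "I'm sure it's fine"
print(sentence.split())         # ["I'm", 'sure', "it's", 'fine']

# One option is to expand known contractions before tokenizing.
CONTRACTIONS = {"i'm": "i am", "it's": "it is"}  # hypothetical, incomplete table

def expand_contractions(text):
    """Replace each whitespace-separated word found in CONTRACTIONS."""
    return " ".join(CONTRACTIONS.get(word.lower(), word) for word in text.split())

expanded_tokens = expand_contractions(sentence).split()
print(expanded_tokens)          # ['i', 'am', 'sure', 'it', 'is', 'fine']
```

Proper word segmentation for languages like Chinese usually needs a dictionary or a trained model, which is beyond what a whitespace or regex rule can do.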