# Tokenization

Tokenization splits text into smaller units (tokens), such as sentences or words. This notebook uses the NLTK library to do it.

### Install and Import NLTK
Use pip to install NLTK, then import the library.

In [4]:
pip install nltk


Note: you may need to restart the kernel to use updated packages.


Import NLTK:
When you run nltk.download('punkt'), you might get an error about SSL certificates. Instead of trying to bypass certificate verification, try the following:

macOS users: There should be a file named Install Certificates.command in the Python application folder (for example, /Applications/Python 3.6/Install Certificates.command). Double-click it and a script will run.

After this, the download should go through smoothly.

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/sahithimv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Tokenize Sentences
A sentence tokenizer marks the boundaries between sentences in a text. This is a foundational step for any analysis at the sentence level.

In [16]:
from nltk.tokenize import sent_tokenize
text = "Messi is the best Footballer in the world. He is an inspiration to all footballers. Goals win games,gameplay wins seasons. Messi has scored a whopping 672 goals and 269 assists for @FCBarcelona!"
sentences = sent_tokenize(text)
print(sentences)

['Messi is the best Footballer in the world.', 'He is an inspiration to all footballers.', 'Goals win games,gameplay wins seasons.', 'Messi has scored a whopping 672 goals and 269 assists for @FCBarcelona!']
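
For comparison, here is a minimal sketch of a naive rule that splits after '.', '!' or '?' whenever whitespace follows. This is not how sent_tokenize works internally, and the helper name naive_sent_split is made up for this illustration. It copes with simple text but breaks on abbreviations, which is the kind of case a trained sentence tokenizer is meant to handle.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

# Simple text is split where expected
print(naive_sent_split("He is an inspiration to all footballers. Goals win games,gameplay wins seasons."))

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_sent_split("Dr. Smith scored twice. The crowd cheered!"))
```

The second call returns 'Dr.' as a sentence of its own, which shows why a plain punctuation rule is usually not enough.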


### Tokenize Words
A word tokenizer splits a sentence into words and punctuation marks for further processing.
(If needed, apply it to each of the tokenized sentences from the previous step.)

In [17]:
from nltk import word_tokenize
text = "Messi is the best Footballer in the world."
tokens = word_tokenize(text)
print(tokens)

['Messi', 'is', 'the', 'best', 'Footballer', 'in', 'the', 'world', '.']
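
To see why a tokenizer is used instead of str.split(), the sketch below compares whitespace splitting with a simple regular expression that separates punctuation. The regex is only a rough approximation for illustration. It is not the algorithm behind word_tokenize, and it will behave differently on contractions and other special cases.

```python
import re

text = "Messi is the best Footballer in the world."

# Whitespace splitting leaves the full stop attached to the last word
whitespace_tokens = text.split()
print(whitespace_tokens)

# Runs of word characters, or any single character that is neither
# a word character nor whitespace (i.e. punctuation and symbols)
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

With whitespace splitting the last token is 'world.', while the regex version yields 'world' and '.' as separate tokens.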


### Custom Tokenization
In some cases you might want a tokenization process that is specific to your use case. For example:
1) Tokenization based on delimiters: splits the text at a chosen set of delimiter characters.

In [18]:
import re
sentence = "Custom tokenization, using various delimiters; such as, comma, semicolon, and colon."
delimiters = [',', ';', ':']
pattern = '|'.join(map(re.escape, delimiters))
tokens = re.split(pattern, sentence)
ans = [token for token in tokens if token]
print(ans)

['Custom tokenization', ' using various delimiters', ' such as', ' comma', ' semicolon', ' and colon.']
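
The tokens in this output keep the space that followed each delimiter. A minimal sketch of one way to tidy this up, assuming you want surrounding whitespace trimmed, is shown below. The helper name split_on_delimiters is hypothetical.

```python
import re

def split_on_delimiters(text, delimiters):
    """Hypothetical helper: split on any delimiter, trim whitespace, drop empty tokens."""
    pattern = '|'.join(map(re.escape, delimiters))
    stripped = (token.strip() for token in re.split(pattern, text))
    return [token for token in stripped if token]

sentence = "Custom tokenization, using various delimiters; such as, comma, semicolon, and colon."
print(split_on_delimiters(sentence, [',', ';', ':']))
```

The result contains the same six tokens as before, without the leading spaces.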


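2) Tokenization based on word length: keeps only the words whose length falls within a given range. Because the text is split on whitespace, attached punctuation counts toward a word's length.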

In [19]:
sentence = "Tokenize this text based on word length, considering min and max lengths."
min_length = 4
max_length = 7
words = sentence.split()
ans = [word for word in words if min_length <= len(word) <= max_length]
print(ans)

['this', 'text', 'based', 'word', 'length,']
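
In this output, 'length,' is kept with its comma, and 'lengths.' is left out because the full stop makes it eight characters long. The sketch below is one possible variant, assuming you want punctuation ignored: it strips leading and trailing punctuation before measuring each word. The helper name words_in_length_range is hypothetical.

```python
import string

def words_in_length_range(text, min_length, max_length):
    """Hypothetical helper: filter words by length after trimming punctuation."""
    # Remove punctuation from both ends of each whitespace-separated word
    cleaned = (word.strip(string.punctuation) for word in text.split())
    return [word for word in cleaned if min_length <= len(word) <= max_length]

sentence = "Tokenize this text based on word length, considering min and max lengths."
print(words_in_length_range(sentence, 4, 7))
```

With the punctuation stripped, 'length' loses its comma and 'lengths' (seven characters) now falls inside the range.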


### Language-Specific Tokenization
Occasionally you may need to tokenize text in a language other than English. word_tokenize accepts a language argument for this. Let's take a look at an example with some Spanish text.

In [20]:
spanish_text = "Esto es un ejemplo. Tokenizarlo con NLTK."
spanish_tokens = word_tokenize(spanish_text, language='spanish')
print(spanish_tokens)

['Esto', 'es', 'un', 'ejemplo', '.', 'Tokenizarlo', 'con', 'NLTK', '.']


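### Tokenizing Abbreviated Words
A regular expression can keep abbreviations such as U.S.A. together as a single token. Note that the pattern below drops other punctuation and symbols, such as = and ^.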

In [21]:
import re
sentence = "U.S.A. is a country. E=mc^2 is an equation."
pattern = re.compile(r'\b(?:[A-Z]\.)+|[a-zA-Z0-9-]+')
tokens = re.findall(pattern, sentence)
print(tokens)

['U.S.A.', 'is', 'a', 'country', 'E', 'mc', '2', 'is', 'an', 'equation']
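
The pattern used here discards punctuation and would split a decimal such as 3.5 into two tokens. The sketch below is one possible extension, not a definitive tokenizer: it keeps dotted abbreviations (upper or lower case), decimal numbers, and hyphenated words together, and emits every other symbol as its own token. The pattern and the name tokenize_with_abbreviations are assumptions made for this illustration.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"        # dotted abbreviations: U.S.A., e.g.
    r"|\d+(?:\.\d+)?"            # integers and decimals: 2, 3.5
    r"|[A-Za-z]+(?:-[A-Za-z]+)*" # words, including hyphenated ones
    r"|[^\w\s]"                  # any other single symbol
)

def tokenize_with_abbreviations(text):
    """Hypothetical helper: return all tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_with_abbreviations("The U.S.A. spent 3.5 billion, e.g. on state-of-the-art labs."))
print(tokenize_with_abbreviations("U.S.A. is a country. E=mc^2 is an equation."))
```

One limitation remains: when an abbreviation ends a sentence, the final full stop is absorbed into the abbreviation token.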


Explore the spaCy library if you feel it fits your requirements better. It should handle the scenarios above in a broadly similar way.