# Tokenization

Tokenization is the process of splitting text into smaller units called tokens, such as sentences or words. It is a crucial early step in most NLP pipelines.

## Load libraries

In [1]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import LineTokenizer
from nltk.tokenize import SpaceTokenizer
from nltk.tokenize.util import spans_to_relative

## Load data

In [2]:
nltk.download('punkt')
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ursar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Input text

In [3]:
text=" Welcome readers. I hope you find it interesting. Please do reply."
text_1=" She secured 90.56 % in class X \n. She is a meritorious student"

## Tokenization of text into sentences

In [4]:
sent_tokenize(text)

[' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

In [8]:
tokenizer.tokenize(text)

[' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']
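Both calls above rely on the pre-trained Punkt model rather than a fixed rule. To see why a model helps, here is a minimal sketch of a naive rule-based splitter using only the standard `re` module. The helper name `naive_sent_split` is hypothetical and is not part of NLTK. It handles the sample text, but splits wrongly after an abbreviation such as "Dr.".

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper for illustration only: it has no notion of
    abbreviations, so it is not a substitute for sent_tokenize.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sent_split(" Welcome readers. I hope you find it interesting. Please do reply.")
print(simple)   # ['Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

tricky = naive_sent_split("Dr. Smith arrived. He sat down.")
print(tricky)   # ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second result breaks the sentence after "Dr.", which is the kind of case a trained sentence tokenizer is meant to handle.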

## Tokenization of sentences into words

In [9]:
texts=nltk.word_tokenize(text)
print(texts)

['Welcome', 'readers', '.', 'I', 'hope', 'you', 'find', 'it', 'interesting', '.', 'Please', 'do', 'reply', '.']


In [10]:
print("The length of text is",len(word_tokenize(text)),"words")

The length of text is 14 words
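Note that the count of 14 includes the three period tokens. If only words should be counted, one possible approach (a sketch, using the token list printed above rather than calling NLTK again) is to keep tokens that contain at least one alphanumeric character:

```python
# Token list copied from the word_tokenize output above.
tokens = ['Welcome', 'readers', '.', 'I', 'hope', 'you', 'find', 'it',
          'interesting', '.', 'Please', 'do', 'reply', '.']

# Keep only tokens with at least one letter or digit, dropping bare punctuation.
words_only = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(len(tokens), "tokens,", len(words_only), "words")   # 14 tokens, 11 words
```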


## Tokenization using TreebankWordTokenizer

In [19]:
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

['Welcome',
 'readers.',
 'I',
 'hope',
 'you',
 'find',
 'it',
 'interesting.',
 'Please',
 'do',
 'reply',
 '.']

In [14]:
texts=nltk.word_tokenize(text)
print(texts)

['Welcome', 'readers', '.', 'I', 'hope', 'you', 'find', 'it', 'interesting', '.', 'Please', 'do', 'reply', '.']
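The two outputs differ because TreebankWordTokenizer expects a single sentence and only detaches a period at the very end of its input, whereas word_tokenize splits the text into sentences first. The sketch below assumes the sentences have already been separated (they are hard-coded here, so no Punkt data is needed) and applies the Treebank tokenizer to each one.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences as returned by sent_tokenize above, with surrounding whitespace removed.
sentences = ['Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

treebank = TreebankWordTokenizer()

# Tokenize one sentence at a time so every sentence-final period is detached.
per_sentence_tokens = [tok for sent in sentences for tok in treebank.tokenize(sent)]
print(per_sentence_tokens)
```

With per-sentence input, each final period becomes its own token, as in the word_tokenize output.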


Another word tokenizer is PunktWordTokenizer, which was available in older NLTK releases. It splits on punctuation but keeps the punctuation attached to the word instead of creating a separate token. WordPunctTokenizer, in contrast, splits on punctuation and makes each run of punctuation a separate token.

In [18]:
tokenizer=WordPunctTokenizer()
tokenizer.tokenize(text)

['Welcome',
 'readers',
 '.',
 'I',
 'hope',
 'you',
 'find',
 'it',
 'interesting',
 '.',
 'Please',
 'do',
 'reply',
 '.']
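The behaviour of WordPunctTokenizer can be approximated with one regular expression: a token is either a run of word characters or a run of characters that are neither word characters nor whitespace. The sketch below uses only the standard `re` module on text_1; the constant name `WORD_OR_PUNCT` is an assumption made for this example.

```python
import re

# Either a run of word characters, or a run of non-word, non-space characters.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

text_1 = " She secured 90.56 % in class X \n. She is a meritorious student"
word_punct_tokens = WORD_OR_PUNCT.findall(text_1)
print(word_punct_tokens)
# ['She', 'secured', '90', '.', '56', '%', 'in', 'class', 'X', '.',
#  'She', 'is', 'a', 'meritorious', 'student']
```

The number 90.56 is split into three tokens, which is worth keeping in mind when the text contains decimals.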

## Tokenization using regular expressions

The tokenization of words can be performed by constructing regular expressions in
these two ways:
• By matching with words
• By matching spaces or gaps

In [39]:
tokenizer=RegexpTokenizer(r"[\w]+")
tokenizer.tokenize(text_1)

['She',
 'secured',
 '90',
 '56',
 'in',
 'class',
 'X',
 'She',
 'is',
 'a',
 'meritorious',
 'student']

In [32]:
print(regexp_tokenize(text, pattern=r'\w+|\$[\d\.]+|\S+'))

['Welcome', 'readers', '.', 'I', 'hope', 'you', 'find', 'it', 'interesting', '.', 'Please', 'do', 'reply', '.']


RegexpTokenizer uses the re.findall() function to perform tokenization
by matching tokens. With gaps=True, it uses the re.split() function to perform
tokenization by matching the gaps or spaces between tokens.
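The two modes can be sketched directly with the standard `re` module. This is an illustration of the idea, not NLTK's implementation. Empty strings are filtered out of the split result here on the assumption that they are not wanted as tokens.

```python
import re

text = " Welcome readers. I hope you find it interesting. Please do reply."

# Matching tokens: the pattern describes what a token looks like.
by_tokens = re.findall(r"\w+", text)

# Matching gaps: the pattern describes what separates tokens.
# The leading space produces an empty first piece, which is dropped.
by_gaps = [piece for piece in re.split(r"\s+", text) if piece]

print(by_tokens)  # ['Welcome', 'readers', 'I', 'hope', ..., 'reply']
print(by_gaps)    # ['Welcome', 'readers.', 'I', 'hope', ..., 'reply.']
```

Matching on `\w+` discards punctuation entirely, while splitting on gaps leaves it attached to the neighbouring word.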

#### How to tokenize using whitespace:

In [33]:
tokenizer=RegexpTokenizer(r'\s+',gaps=True)
tokenizer.tokenize(text)

['Welcome',
 'readers.',
 'I',
 'hope',
 'you',
 'find',
 'it',
 'interesting.',
 'Please',
 'do',
 'reply.']

#### To select the words starting with a capital letter

In [35]:
capt = RegexpTokenizer(r'[A-Z]\w+')
capt.tokenize(text_1)

['She', 'She']
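The pattern `[A-Z]\w+` requires at least one word character after the capital, so the single-letter token X is missed. It is also not anchored to the start of a word. The sketch below compares it with one possible alternative, `\b[A-Z]\w*`, using the standard `re` module. The alternative pattern and the sample string "an iPhone" are assumptions added for illustration.

```python
import re

text_1 = " She secured 90.56 % in class X \n. She is a meritorious student"

loose = re.findall(r"[A-Z]\w+", text_1)        # capital plus one or more word chars
anchored = re.findall(r"\b[A-Z]\w*", text_1)   # capital at a word boundary

print(loose)      # ['She', 'She']
print(anchored)   # ['She', 'X', 'She']

# Without the word boundary, a capital inside a word also starts a match.
print(re.findall(r"[A-Z]\w+", "an iPhone"))    # ['Phone']
print(re.findall(r"\b[A-Z]\w*", "an iPhone"))  # []
```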

#### Strings can also be tokenized on whitespace (tab, space, or newline)

In [43]:
WhitespaceTokenizer().tokenize(text_1)

['She',
 'secured',
 '90.56',
 '%',
 'in',
 'class',
 'X',
 '.',
 'She',
 'is',
 'a',
 'meritorious',
 'student']

#### BlanklineTokenizer splits text on blank lines; LineTokenizer splits text into lines:

In [50]:
BlanklineTokenizer().tokenize(text_1)

[' She secured 90.56 % in class X \n. She is a meritorious student']

In [51]:
LineTokenizer(blanklines='keep').tokenize(text_1)

[' She secured 90.56 % in class X ', '. She is a meritorious student']
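BlanklineTokenizer returns text_1 as a single item because text_1 contains a newline but no blank line. The difference between the two tokenizers is easier to see on a string with a real paragraph break. The sketch below uses only the standard library. The sample string and the blank-line pattern are assumptions chosen to mirror the behaviour shown above, not NLTK's own code.

```python
import re

two_paragraphs = "First line\nsecond line\n\nNew paragraph"
text_1 = " She secured 90.56 % in class X \n. She is a meritorious student"

# Line-level split: every newline ends an item, so the blank line shows up as ''.
lines = two_paragraphs.splitlines()

# Paragraph-level split: only a blank line (two newlines, optional spaces) separates items.
BLANK_LINE = r"\s*\n\s*\n\s*"
paragraphs = re.split(BLANK_LINE, two_paragraphs)
single = re.split(BLANK_LINE, text_1)

print(lines)        # ['First line', 'second line', '', 'New paragraph']
print(paragraphs)   # ['First line\nsecond line', 'New paragraph']
print(len(single))  # 1, because text_1 has no blank line
```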

#### SpaceTokenizer works similarly to sent.split(' '):

In [52]:
SpaceTokenizer().tokenize(text_1)

['',
 'She',
 'secured',
 '90.56',
 '%',
 'in',
 'class',
 'X',
 '\n.',
 'She',
 'is',
 'a',
 'meritorious',
 'student']
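The same contrast exists in plain Python. Splitting on a single space character keeps the leading empty string and leaves the newline attached to the period, as in the SpaceTokenizer output. Calling split() with no argument splits on any run of whitespace, as in the WhitespaceTokenizer output. A minimal sketch:

```python
text_1 = " She secured 90.56 % in class X \n. She is a meritorious student"

on_single_space = text_1.split(" ")   # separator is exactly one space
on_any_whitespace = text_1.split()    # separator is any run of whitespace

print(on_single_space[:2], on_single_space[8])   # ['', 'She'] '\n.'
print(on_any_whitespace[:2], on_any_whitespace[7])  # ['She', 'secured'] '.'
```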

#### span_tokenize returns a sequence of (start, end) tuples giving the character offsets of each token in the text; helpers for working with such spans live in the nltk.tokenize.util module

In [54]:
list(WhitespaceTokenizer().span_tokenize(text_1))

[(1, 4),
 (5, 12),
 (13, 18),
 (19, 20),
 (21, 23),
 (24, 29),
 (30, 31),
 (33, 34),
 (35, 38),
 (39, 41),
 (42, 43),
 (44, 55),
 (56, 63)]
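Each span is a (start, end) pair that can be used to slice the original string, so the tokens can be recovered without storing them separately. The sketch below reproduces the whitespace spans with the standard `re` module (an approximation, not NLTK's code) and slices text_1 with them.

```python
import re

text_1 = " She secured 90.56 % in class X \n. She is a meritorious student"

# One span per run of non-whitespace characters.
spans = [m.span() for m in re.finditer(r"\S+", text_1)]

# Slicing the original text with each span gives back the token.
tokens_from_spans = [text_1[start:end] for start, end in spans]

print(spans[:3])               # [(1, 4), (5, 12), (13, 18)]
print(tokens_from_spans[:3])   # ['She', 'secured', '90.56']
```

Keeping offsets rather than strings is useful when annotations need to point back into the original text.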

#### Given a sequence of spans, spans_to_relative returns the relative spans: each tuple is (gap since the previous token ended, token length)

In [58]:
list(spans_to_relative(WhitespaceTokenizer().span_tokenize(text_1)))

[(1, 3),
 (1, 7),
 (1, 5),
 (1, 1),
 (1, 2),
 (1, 5),
 (1, 1),
 (2, 1),
 (1, 3),
 (1, 2),
 (1, 1),
 (1, 11),
 (1, 7)]
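The conversion is simple arithmetic: the first number is the distance from the end of the previous token to the start of the current one, and the second is the token length. The following sketch shows the calculation and its inverse in plain Python. The function names `to_relative` and `to_absolute` are hypothetical and are not part of NLTK.

```python
def to_relative(spans):
    """Turn absolute (start, end) spans into (gap, length) pairs."""
    prev_end = 0
    for start, end in spans:
        yield start - prev_end, end - start
        prev_end = end

def to_absolute(relative):
    """Inverse of to_relative: rebuild (start, end) spans from (gap, length) pairs."""
    pos = 0
    for gap, length in relative:
        start = pos + gap
        yield start, start + length
        pos = start + length

# First four spans from the span_tokenize output above.
spans = [(1, 4), (5, 12), (13, 18), (19, 20)]
relative = list(to_relative(spans))
print(relative)                      # [(1, 3), (1, 7), (1, 5), (1, 1)]
print(list(to_absolute(relative)))   # [(1, 4), (5, 12), (13, 18), (19, 20)]
```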