# Tokenization

Tokenization is the process of breaking a stream of text into words, terms, sentences, symbols, or other meaningful elements called tokens. These tokens can then be represented as vectors and fed into machine learning models.
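
As a minimal sketch of what representing tokens as vectors can look like, the snippet below builds a vocabulary from a list of tokens and turns that list into a bag-of-words count vector. The helper names `build_vocab` and `bag_of_words` and the sample tokens are hypothetical, chosen only for illustration; real pipelines typically use library vectorizers or learned embeddings instead.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def bag_of_words(tokens, vocab):
    """Return a count vector with one position per vocabulary entry."""
    counts = Counter(tokens)
    vector = [0] * len(vocab)
    for tok, idx in vocab.items():
        vector[idx] = counts[tok]
    return vector

# Hypothetical token list used only for this illustration
tokens = ["deep", "learning", "machine", "learning"]
vocab = build_vocab(tokens)
vector = bag_of_words(tokens, vocab)
print(vocab)   # {'deep': 0, 'learning': 1, 'machine': 2}
print(vector)  # [1, 2, 1]
```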

## Why Tokenization?

For a computer to work with text, we need to break sentences down into units that it can process. That is where tokenization comes in within Natural Language Processing (NLP). It is typically the first step in building an NLP model.

## Installing NLTK

In [2]:
!pip install nltk



# NLTK

In [3]:
import nltk

from nltk import word_tokenize, sent_tokenize

In [8]:
# Newer NLTK releases may need nltk.download('punkt_tab') instead.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [24]:
text = "Data Science. Machine Learning. Deep Learning."

## Word Tokenization
* Word tokenization is the most commonly used form of tokenization. It splits text into words based on a delimiter (typically whitespace).
* This can be done with the word_tokenize function from nltk, which also separates punctuation marks into their own tokens.

In [25]:
word_tokens = word_tokenize(text)

In [26]:
print(word_tokens)

['Data', 'Science', '.', 'Machine', 'Learning', '.', 'Deep', 'Learning', '.']
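
To see why a plain delimiter split is usually not enough, the sketch below compares splitting on whitespace with a simple regular expression that also separates punctuation. The regex is an illustrative assumption, not the rule set that word_tokenize applies internally; it just happens to give the same tokens for this input.

```python
import re

text = "Data Science. Machine Learning. Deep Learning."

# Splitting on whitespace alone leaves punctuation attached to words.
whitespace_tokens = text.split()

# Simple pattern: runs of word characters, or a single punctuation character.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
# ['Data', 'Science.', 'Machine', 'Learning.', 'Deep', 'Learning.']
print(regex_tokens)
# ['Data', 'Science', '.', 'Machine', 'Learning', '.', 'Deep', 'Learning', '.']
```

A pattern like this breaks down on contractions, abbreviations, and URLs, which is one reason to prefer a library tokenizer for real text.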


## Sentence Tokenization
* Sentence tokenization breaks a paragraph into sentences.
* A natural question is why sentence tokenization is needed when word tokenization already exists. Some tasks depend on sentence boundaries: for example, to calculate the average number of words per sentence, the text must first be split into sentences.
* This can be done with the sent_tokenize function from nltk.

In [27]:
sent_tokens = sent_tokenize(text)

In [28]:
print(sent_tokens)

['Data Science.', 'Machine Learning.', 'Deep Learning.']
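
The average-words-per-sentence task mentioned earlier can be sketched as follows. To keep the example self-contained, the sentence list is copied from the sent_tokenize output shown above, and words are counted with a simple whitespace split; the helper `count_words` is a hypothetical name, and a real pipeline might call word_tokenize on each sentence instead.

```python
import string

# Sentence list copied from the sent_tokenize output shown above.
sent_tokens = ['Data Science.', 'Machine Learning.', 'Deep Learning.']

def count_words(sentence):
    """Count whitespace-separated items that still contain text after stripping punctuation."""
    words = [w.strip(string.punctuation) for w in sentence.split()]
    return sum(1 for w in words if w)

word_counts = [count_words(s) for s in sent_tokens]
average = sum(word_counts) / len(word_counts)

print(word_counts)  # [2, 2, 2]
print(average)      # 2.0
```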


# spaCy

In [29]:
!pip install spacy



In [30]:
from spacy.lang.en import English

In [31]:
nlp = English()

## Word Tokenization

In [32]:
doc = nlp(text)

In [33]:
# Collect the text of each token produced by the tokenizer
word_tokens = [token.text for token in doc]

In [34]:
print(word_tokens)

['Data', 'Science', '.', 'Machine', 'Learning', '.', 'Deep', 'Learning', '.']


## Sentence Tokenization

In [36]:
# spaCy v3+: add the rule-based sentencizer by name
# (v2 used nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

doc = nlp(text)

In [37]:
sent_tokens = []

# doc.sents yields Span objects, one per sentence
for sent in doc.sents:
    sent_tokens.append(sent)

In [38]:
print(sent_tokens)

[Data Science., Machine Learning., Deep Learning.]
