# Tokenization

Tokenization is the process of converting a sequence of text into smaller parts, known as **tokens**. Tokens can be as small as individual characters or words (word tokenization) or as long as whole sentences (sentence tokenization).

We can perform tokenization in many ways:
- Using Python's built-in `split()` method
- RegEx
- NLTK library
- spaCy
- TensorFlow/Keras

In [None]:
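# Packages required!
#!pip install regex nltk spacy tensorflow
# The spaCy English model loaded later in this notebook is installed separately:
#!python -m spacy download en_core_web_sm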


## Using built-in methods
In Python, the `split()` method is commonly used for simple tokenization. It breaks a string into smaller parts at a delimiter such as whitespace (the default), a comma, or any other character(s). Here's how you can use the `split()` method for tokenization:



In [None]:
text = """As dawn broke over the sleepy village, a gentle breeze stirred the leaves, whispering secrets to the awakening world. The scent of dew-kissed grass hung in the air, mingling with the aroma of freshly brewed coffee wafting from the local café. Birds chirped joyfully, heralding the start of a new day, while the sun peeked over the horizon, casting a golden glow upon the cobblestone streets. In the distance, the silhouette of a lone figure could be seen, ambling along the winding path with purposeful strides. Each step seemed to echo the rhythm of the village, a harmonious melody of life unfolding with every passing moment. As the day unfolded, the village embraced the promise of adventure and possibility, basking in the beauty of the present moment."""

# Split on whitespace (the default) to get word tokens
word = text.split()
# Split on full stops to get rough sentence tokens
sent = text.split(".")

print('Word Token: ', word)
print('\n')
print('Sentence Token: ', sent)



Word Token:  ['As', 'dawn', 'broke', 'over', 'the', 'sleepy', 'village,', 'a', 'gentle', 'breeze', 'stirred', 'the', 'leaves,', 'whispering', 'secrets', 'to', 'the', 'awakening', 'world.', 'The', 'scent', 'of', 'dew-kissed', 'grass', 'hung', 'in', 'the', 'air,', 'mingling', 'with', 'the', 'aroma', 'of', 'freshly', 'brewed', 'coffee', 'wafting', 'from', 'the', 'local', 'café.', 'Birds', 'chirped', 'joyfully,', 'heralding', 'the', 'start', 'of', 'a', 'new', 'day,', 'while', 'the', 'sun', 'peeked', 'over', 'the', 'horizon,', 'casting', 'a', 'golden', 'glow', 'upon', 'the', 'cobblestone', 'streets.', 'In', 'the', 'distance,', 'the', 'silhouette', 'of', 'a', 'lone', 'figure', 'could', 'be', 'seen,', 'ambling', 'along', 'the', 'winding', 'path', 'with', 'purposeful', 'strides.', 'Each', 'step', 'seemed', 'to', 'echo', 'the', 'rhythm', 'of', 'the', 'village,', 'a', 'harmonious', 'melody', 'of', 'life', 'unfolding', 'with', 'every', 'passing', 'moment.', 'As', 'the', 'day', 'unfolded,', 'the',
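
The output above shows that `split()` leaves punctuation attached to words (`'village,'`, `'world.'`). A minimal sketch of that behaviour, plus one possible cleanup for the sentence pieces, using a shortened sample string chosen here for illustration:

```python
sample = "Birds chirped joyfully, heralding the start of a new day. The sun peeked over the horizon."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
words = sample.split()

# Splitting on "." leaves leading spaces and a trailing empty string.
raw_sentences = sample.split(".")

# Strip whitespace and drop empty pieces to get cleaner sentence tokens.
sentences = [s.strip() for s in raw_sentences if s.strip()]

print(words[:3])
print(raw_sentences)
print(sentences)
```

The cleanup step is optional; it only matters when the sentence pieces are used further downstream.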

## Using RegEx
Regular expressions (RegEx) are powerful tools for pattern matching and manipulation of text. In Python, the `re` module provides support for working with regular expressions.

- `re.findall()`: This function is used to find all non-overlapping matches of a pattern in a string and return them as a list of strings.

- `re.compile()`: This function is used to compile a regular expression pattern into a regex object, which can then be used for matching operations.

In [None]:
import re

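# findall() returns every run of word characters (the raw string avoids an invalid-escape warning)
word = re.findall(r'\w+', text)
# compile() builds a reusable pattern; split() then breaks the text at each full stop
sent = re.compile(r'[.]').split(text)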

print("word Token: ", word)
print('\n')
print('Sentence Token: ', sent)

word Token:  ['As', 'dawn', 'broke', 'over', 'the', 'sleepy', 'village', 'a', 'gentle', 'breeze', 'stirred', 'the', 'leaves', 'whispering', 'secrets', 'to', 'the', 'awakening', 'world', 'The', 'scent', 'of', 'dew', 'kissed', 'grass', 'hung', 'in', 'the', 'air', 'mingling', 'with', 'the', 'aroma', 'of', 'freshly', 'brewed', 'coffee', 'wafting', 'from', 'the', 'local', 'café', 'Birds', 'chirped', 'joyfully', 'heralding', 'the', 'start', 'of', 'a', 'new', 'day', 'while', 'the', 'sun', 'peeked', 'over', 'the', 'horizon', 'casting', 'a', 'golden', 'glow', 'upon', 'the', 'cobblestone', 'streets', 'In', 'the', 'distance', 'the', 'silhouette', 'of', 'a', 'lone', 'figure', 'could', 'be', 'seen', 'ambling', 'along', 'the', 'winding', 'path', 'with', 'purposeful', 'strides', 'Each', 'step', 'seemed', 'to', 'echo', 'the', 'rhythm', 'of', 'the', 'village', 'a', 'harmonious', 'melody', 'of', 'life', 'unfolding', 'with', 'every', 'passing', 'moment', 'As', 'the', 'day', 'unfolded', 'the', 'village', 
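
Note that `\w+` splits `dew-kissed` into two tokens in the output above. The sketch below shows one way the pattern could be adjusted to keep hyphenated words together, and a sentence pattern that keeps the closing punctuation. Both patterns and the sample string are illustrative choices, not the only reasonable ones:

```python
import re

sample = "The scent of dew-kissed grass hung in the air. Birds chirped joyfully! Was it dawn?"

# \w+ treats the hyphen as a separator, so "dew-kissed" becomes two tokens.
plain_words = re.findall(r"\w+", sample)

# Allowing optional "-word" groups keeps hyphenated words together.
hyphen_words = re.findall(r"\w+(?:-\w+)*", sample)

# Split at whitespace that follows ., ! or ?; the lookbehind keeps the
# punctuation mark attached to its sentence.
sentence_pattern = re.compile(r"(?<=[.!?])\s+")
sentences = sentence_pattern.split(sample)

print(plain_words)
print(hyphen_words)
print(sentences)
```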

## Using NLTK

`NLTK` stands for Natural Language Toolkit. It is a leading platform for building Python programs that work with human language data.

It provides tools for `classification`, `tokenization`, `stemming`, `tagging`, `parsing`, and `semantic reasoning`.

The command `nltk.download('punkt')` downloads the `Punkt` tokenizer models for `NLTK`. `Punkt` is a tokenizer included in `NLTK` that is designed to split text into sentences.

In [None]:
import nltk
#downloading punkt tokenizer : splits texts into sentences
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

- `nltk.sent_tokenize()` to tokenize the text into sentences. This function takes a string of text as input and returns a list of sentences.
- `nltk.word_tokenize()` to tokenize the text into words. This function takes a string of text as input and returns a list of words.

In [None]:
#importing methods
from nltk.tokenize import word_tokenize, sent_tokenize
#words
word = word_tokenize(text)
#sentences
sent = sent_tokenize(text)

print('Word Token: ', word )
print('\n')
print('Sentence Token: ', sent)

Word Token:  ['As', 'dawn', 'broke', 'over', 'the', 'sleepy', 'village', ',', 'a', 'gentle', 'breeze', 'stirred', 'the', 'leaves', ',', 'whispering', 'secrets', 'to', 'the', 'awakening', 'world', '.', 'The', 'scent', 'of', 'dew-kissed', 'grass', 'hung', 'in', 'the', 'air', ',', 'mingling', 'with', 'the', 'aroma', 'of', 'freshly', 'brewed', 'coffee', 'wafting', 'from', 'the', 'local', 'café', '.', 'Birds', 'chirped', 'joyfully', ',', 'heralding', 'the', 'start', 'of', 'a', 'new', 'day', ',', 'while', 'the', 'sun', 'peeked', 'over', 'the', 'horizon', ',', 'casting', 'a', 'golden', 'glow', 'upon', 'the', 'cobblestone', 'streets', '.', 'In', 'the', 'distance', ',', 'the', 'silhouette', 'of', 'a', 'lone', 'figure', 'could', 'be', 'seen', ',', 'ambling', 'along', 'the', 'winding', 'path', 'with', 'purposeful', 'strides', '.', 'Each', 'step', 'seemed', 'to', 'echo', 'the', 'rhythm', 'of', 'the', 'village', ',', 'a', 'harmonious', 'melody', 'of', 'life', 'unfolding', 'with', 'every', 'passing'
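
One reason to prefer a trained sentence tokenizer over splitting on `.` is abbreviations. The sketch below uses only the standard library to illustrate the problem; it is not how Punkt works internally. The abbreviation set and both helper functions are hypothetical names introduced for this example:

```python
import re

# Hypothetical, hand-picked abbreviation list for illustration only; Punkt
# learns such cases from data rather than relying on a fixed set like this.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st"}

def naive_sentences(text):
    """Split on every full stop, as str.split('.') would."""
    return [s.strip() for s in text.split(".") if s.strip()]

def abbreviation_aware_sentences(text):
    """Split after ., ! or ? unless the preceding word is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, keep scanning
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "Dr. Rao walked to the café. She ordered coffee."
print(naive_sentences(sample))
print(abbreviation_aware_sentences(sample))
```

The naive version cuts the text after `Dr`, while the second version keeps `Dr. Rao` inside the first sentence.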

## Using spaCy
`spaCy` is an open-source natural language processing (NLP) library for Python. It's designed to be fast, efficient, and production-ready, making it suitable for building real-world NLP applications. `spaCy` provides an easy-to-use interface for common NLP tasks such as `tokenization`, `part-of-speech tagging`, `named entity recognition`, `dependency parsing`, and more.
- Tokenization: `spaCy` offers robust tokenization capabilities, allowing you to split text into words, punctuation marks, and other meaningful units.

- `spacy.load()`: This is a function provided by the spaCy library to load a pre-trained model or language pipeline. It takes a string argument specifying the name or path of the model to load.
- `'en_core_web_sm'`: This is the name of the pre-trained model being loaded. In this case, `'en_core_web_sm'` refers to the English language pipeline with a small model.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # load the small English pipeline
doc = nlp(text)  # Doc object holding the processed text

# Each item in doc is a Token object; .text gives its string form
word = [t.text for t in doc]

# doc.sents yields one Span per detected sentence
sent = [s.text for s in doc.sents]

print('Word Token: ', word)
print(type(word))
print('\n')
print('Sentence Token: ', sent)

Word Token:  ['As', 'dawn', 'broke', 'over', 'the', 'sleepy', 'village', ',', 'a', 'gentle', 'breeze', 'stirred', 'the', 'leaves', ',', 'whispering', 'secrets', 'to', 'the', 'awakening', 'world', '.', 'The', 'scent', 'of', 'dew', '-', 'kissed', 'grass', 'hung', 'in', 'the', 'air', ',', 'mingling', 'with', 'the', 'aroma', 'of', 'freshly', 'brewed', 'coffee', 'wafting', 'from', 'the', 'local', 'café', '.', 'Birds', 'chirped', 'joyfully', ',', 'heralding', 'the', 'start', 'of', 'a', 'new', 'day', ',', 'while', 'the', 'sun', 'peeked', 'over', 'the', 'horizon', ',', 'casting', 'a', 'golden', 'glow', 'upon', 'the', 'cobblestone', 'streets', '.', 'In', 'the', 'distance', ',', 'the', 'silhouette', 'of', 'a', 'lone', 'figure', 'could', 'be', 'seen', ',', 'ambling', 'along', 'the', 'winding', 'path', 'with', 'purposeful', 'strides', '.', 'Each', 'step', 'seemed', 'to', 'echo', 'the', 'rhythm', 'of', 'the', 'village', ',', 'a', 'harmonious', 'melody', 'of', 'life', 'unfolding', 'with', 'every', '
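
In the output above, punctuation marks and the hyphen in `dew-kissed` appear as separate tokens. A rough standard-library approximation of that shape is sketched below. It is not spaCy's tokenizer, which applies its own prefix, suffix, infix and exception rules; the pattern and sample string are assumptions made for illustration:

```python
import re

sample = "The scent of dew-kissed grass hung in the air, mingling with coffee."

# Match either a run of word characters, or a single character that is
# neither a word character nor whitespace (each punctuation mark on its own).
tokens = re.findall(r"\w+|[^\w\s]", sample)

print(tokens)
```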

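### Basics of spaCy

Each token exposes lexical attributes, for example `is_alpha` (alphabetic characters only), `is_punct` (punctuation), `like_num` (resembles a number) and `is_currency` (currency symbol).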

In [None]:
for t in doc:
    print(t,"\nis_alpha:",t.is_alpha,
         "\nis_punct:",t.is_punct,
         "\nlike_num:",t.like_num,
         "\nis_currency:",t.is_currency,"\n")

As 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

dawn 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

broke 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

over 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

the 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

sleepy 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

village 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

, 
is_alpha: False 
is_punct: True 
like_num: False 
is_currency: False 

a 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

gentle 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

breeze 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

stirred 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: False 

the 
is_alpha: True 
is_punct: False 
like_num: False 
is_currency: Fals
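
To build intuition for what these flags check, here is a rough standard-library analogue. It is only an approximation under stated assumptions, not spaCy's implementation: for instance, spaCy's `like_num` also accepts number words such as "ten", which this sketch ignores. The function name `rough_flags` is hypothetical:

```python
import unicodedata

def rough_flags(token):
    """Approximate a few spaCy-style token flags with plain string checks."""
    return {
        "is_alpha": token.isalpha(),
        # Punctuation: every character is in a Unicode "P*" category.
        "is_punct": bool(token) and all(
            unicodedata.category(ch).startswith("P") for ch in token
        ),
        # Digits with optional "," separators and at most one "."; number
        # words are deliberately not handled here.
        "like_num": token.replace(",", "").replace(".", "", 1).isdigit(),
        # Currency symbols fall in the Unicode "Sc" category.
        "is_currency": bool(token) and all(
            unicodedata.category(ch) == "Sc" for ch in token
        ),
    }

for tok in ["village", ",", "42", "$", "café"]:
    print(tok, rough_flags(tok))
```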

## Using TensorFlow/Keras
TensorFlow is one of the most popular deep learning frameworks, and it offers extensive support for natural language processing (NLP) tasks. It provides various tools, modules, and pre-trained models that can be used for building NLP applications.
- `keras.preprocessing.text` is a module within the Keras deep learning library that provides various utilities for preprocessing text data. It includes functions and classes for tasks such as tokenization, converting text to sequences of integers, padding sequences, and more.
- `text_to_word_sequence` is a function in the `keras.preprocessing.text` module that converts text to a list of words. By default it lowercases the text, removes punctuation, and splits on white space.

In [None]:
from keras.preprocessing.text import text_to_word_sequence

# Default behaviour: lowercase, strip punctuation, split on white space
word = text_to_word_sequence(text)
# Split on '.' only; filters='' stops other punctuation from also being
# replaced by the split character, which would break sentences at commas
sent = text_to_word_sequence(text, filters='', split='.')

print('word token', word)
print('\n')
print('sentence token:', sent)

word token ['as', 'dawn', 'broke', 'over', 'the', 'sleepy', 'village', 'a', 'gentle', 'breeze', 'stirred', 'the', 'leaves', 'whispering', 'secrets', 'to', 'the', 'awakening', 'world', 'the', 'scent', 'of', 'dew', 'kissed', 'grass', 'hung', 'in', 'the', 'air', 'mingling', 'with', 'the', 'aroma', 'of', 'freshly', 'brewed', 'coffee', 'wafting', 'from', 'the', 'local', 'café', 'birds', 'chirped', 'joyfully', 'heralding', 'the', 'start', 'of', 'a', 'new', 'day', 'while', 'the', 'sun', 'peeked', 'over', 'the', 'horizon', 'casting', 'a', 'golden', 'glow', 'upon', 'the', 'cobblestone', 'streets', 'in', 'the', 'distance', 'the', 'silhouette', 'of', 'a', 'lone', 'figure', 'could', 'be', 'seen', 'ambling', 'along', 'the', 'winding', 'path', 'with', 'purposeful', 'strides', 'each', 'step', 'seemed', 'to', 'echo', 'the', 'rhythm', 'of', 'the', 'village', 'a', 'harmonious', 'melody', 'of', 'life', 'unfolding', 'with', 'every', 'passing', 'moment', 'as', 'the', 'day', 'unfolded', 'the', 'village', 'e
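
The lowercase, punctuation-free tokens above follow from the defaults of `text_to_word_sequence`. The sketch below approximates that behaviour in plain Python, assuming the default filter set given in the Keras documentation. `simple_word_sequence` is a hypothetical stand-in written for this example, not the library function itself:

```python
# Default filter characters as documented for Keras (punctuation, tab, newline).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Approximate text_to_word_sequence: lowercase, filter, split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split string.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Split and drop the empty strings left by consecutive separators.
    return [piece for piece in text.split(split) if piece]

sample = "As dawn broke, a dew-kissed breeze stirred. Birds chirped."
print(simple_word_sequence(sample))
print(simple_word_sequence(sample, split="."))
print(simple_word_sequence(sample, filters="", split="."))
```

With `split="."` and the default filters, commas and hyphens are replaced by `.` as well, so the text breaks at every punctuation mark. Passing an empty `filters` string restricts the breaks to full stops.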