## Learning Tokenization

- Tokenization is the process of breaking text into smaller pieces, called tokens
- Types of tokenization methods
    - Word-based ('Hello', 'World')
    - Character-based ('H', 'E', 'L', 'L', 'O', ...)
    - Subword-based (combines the strengths of word-based and character-based tokenization)
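
The three approaches above can be sketched in plain Python. This is only an illustration: the `toy_subwords` vocabulary and the `subword_tokenize` helper are hypothetical, and the greedy longest-match rule stands in for real subword algorithms (such as BPE or WordPiece), which learn their vocabulary from data.

```python
text = "Hello World"

# Word-based: split on whitespace.
word_tokens = text.split()

# Character-based: every character (spaces included) becomes a token.
char_tokens = list(text)

# Subword-based: greedy longest-match against a tiny, made-up vocabulary.
toy_subwords = {"token", "ization", "un", "break", "able"}

def subword_tokenize(word, vocab):
    """Split `word` into the longest known pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, shrinking until one matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

print(word_tokens)
print(char_tokens)
print(subword_tokenize("tokenization", toy_subwords))
print(subword_tokenize("unbreakable", toy_subwords))
```

Word-based tokens keep the vocabulary readable but cannot represent unseen words, character-based tokens can represent anything but produce long sequences, and subword tokens sit between the two.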

#### Understanding Tokenization using PyTorch

In [None]:
!pip install nltk
!pip install transformers==4.42.1
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install scikit-learn
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install numpy==1.26.0

- Let's create a small dataset of (label, sentence) pairs

In [1]:
dataset = [(1, "This is a positive review."),
           (1, "This is a negative review."),
           (2, "I love this product!"),
           (2, "I hate this product."),
           (1, "Absolutely fantastic!"),
           (3, "Terrible experience."),
           (4, "Highly recommend it."),
           (3, "Would not buy again.")]

In [2]:
# Importing a function from torchtext
from torchtext.data.utils import get_tokenizer

# Use get_tokenizer to obtain a tokenizer and apply it to a sample sentence
# NOTE: "basic_english" is a simple tokenizer that lowercases the text and
# splits it into words, with punctuation marks as separate tokens.
tokenizer = get_tokenizer("basic_english")
# get_tokenizer returns a callable (named tokenizer here)
# that takes a string and returns a list of tokens
tokenizer(dataset[0][1])


['this', 'is', 'a', 'positive', 'review', '.']

- Now let's create an iterator that goes through every sentence in the dataset and yields its tokens

In [3]:
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
my_iterator = yield_tokens(dataset)
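
Because `yield_tokens` is a generator function, the iterator it returns is lazy and can be consumed only once. The sketch below illustrates this with a stand-in tokenizer (`str.split` instead of the torchtext tokenizer) and a hypothetical two-sentence `sample` list, so it runs without torchtext installed.

```python
def yield_tokens(data_iter, tokenize=str.split):
    # Same shape as the cell above; `tokenize` is a stand-in parameter
    # so the example does not depend on torchtext.
    for _, text in data_iter:
        yield tokenize(text)

sample = [(1, "good movie"), (2, "bad movie")]  # hypothetical data

token_iter = yield_tokens(sample)
first = next(token_iter)      # tokens of the first sentence only
rest = list(token_iter)       # consumes everything that is left
exhausted = list(token_iter)  # nothing remains after one full pass

print(first)
print(rest)
print(exhausted)
```

This is one reason to call `yield_tokens(dataset)` afresh whenever a full pass over the tokens is needed, rather than reusing an iterator that may already be partly consumed.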


- The 'build_vocab_from_iterator' function builds a vocabulary from these tokens, mapping each token to an index


In [4]:
from torchtext.vocab import build_vocab_from_iterator
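# Build the vocabulary from the token stream; "<unk>" is reserved as a special token
vocab = build_vocab_from_iterator(yield_tokens(dataset), specials=["<unk>"])
# Map any out-of-vocabulary token to the index of "<unk>"
vocab.set_default_index(vocab["<unk>"])
vocab.get_stoi()  # Returns the string-to-index mapping of the vocabulary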

{'this': 2,
 'love': 17,
 '.': 1,
 'review': 8,
 '<unk>': 0,
 'product': 7,
 'i': 5,
 '!': 3,
 'a': 4,
 'is': 6,
 'buy': 11,
 'absolutely': 9,
 'again': 10,
 'experience': 12,
 'fantastic': 13,
 'recommend': 21,
 'hate': 14,
 'highly': 15,
 'it': 16,
 'positive': 20,
 'negative': 18,
 'not': 19,
 'terrible': 22,
 'would': 23}
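
To make the mapping above less of a black box, here is a minimal pure-Python sketch that arrives at the same indices. It is not torchtext's implementation: `simple_tokenize`, `build_vocab`, and `lookup` are hypothetical helpers, the regular expression only approximates "basic_english", and the ordering rule (special tokens first, then descending frequency, ties broken alphabetically) is an assumption inferred from the output shown above.

```python
import re
from collections import Counter

dataset = [(1, "This is a positive review."),
           (1, "This is a negative review."),
           (2, "I love this product!"),
           (2, "I hate this product."),
           (1, "Absolutely fantastic!"),
           (3, "Terrible experience."),
           (4, "Highly recommend it."),
           (3, "Would not buy again.")]

def simple_tokenize(text):
    # Rough stand-in for "basic_english": lowercase, then split into
    # words and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(token_lists, specials=("<unk>",)):
    # Count every token, then order by descending frequency,
    # breaking ties alphabetically; specials take the first indices.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

stoi = build_vocab(simple_tokenize(text) for _, text in dataset)

def lookup(tokens, unk_token="<unk>"):
    # Unknown tokens fall back to the index of "<unk>",
    # which is the role set_default_index plays above.
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

print(lookup(simple_tokenize("I love this movie!")))
```

The word "movie" never appears in the dataset, so it is mapped to index 0, the index of "<unk>".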