### **Tutorial 10: Tokenizer**

This tutorial explains what a tokenizer is and demonstrates how to use character-level, word-level, and n-gram tokenizers in Python. A tokenizer converts text into tokens (smaller units, such as words or characters), usually represented as integer IDs, and in most cases can also convert those tokens back into text.

In this tutorial, we build:
1. A character-level tokenizer that encodes each distinct character as a unique token ID and decodes token IDs back to text.
2. A word-level tokenizer that encodes each distinct word as a unique token ID and decodes token IDs back to text.
3. An n-gram tokenizer that groups consecutive tokens into sequences of length n, a fundamental concept in language modeling.
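To make the encode/decode idea concrete, here is a minimal sketch of a character-level tokenizer. `SimpleCharTokenizer` is a hypothetical stand-in written for illustration; it is not the `CharTokenizer` from `utils.tokenizer` used later, and its choice of a sorted vocabulary is an assumption. The last few lines show that the word-level case follows the same pattern.

```python
class SimpleCharTokenizer:
    """Hypothetical stand-in for illustration; not utils.tokenizer.CharTokenizer."""

    def __init__(self, vocab):
        # Map each symbol to an integer ID and back.
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @classmethod
    def train_from_text(cls, text):
        # Assumption: the vocabulary is the sorted set of distinct characters.
        return cls(sorted(set(text)))

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


tok = SimpleCharTokenizer.train_from_text("hello world")
ids = tok.encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(tok.decode(ids))  # hello

# The word-level case follows the same pattern, with whitespace-separated
# words as the symbols instead of characters.
words = "the cat saw the dog".split()
word_to_id = {w: i for i, w in enumerate(sorted(set(words)))}
word_ids = [word_to_id[w] for w in words]
print(word_ids)         # [3, 0, 2, 3, 1]
```

Note that this sketch raises a `KeyError` for any character that was not seen during training; a fuller tokenizer would typically reserve an "unknown" token for that case.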

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [2]:
def is_extraction_successful(text):
    """Return True if `text` is a non-empty, non-whitespace string."""
    return isinstance(text, str) and bool(text.strip())

In [3]:
from utils import text_helper
from utils.tokenizer import CharTokenizer, WordTokenizer, NGramsTokenizer

url = "https://medium.com/letters-to-my-younger-self-embracing-emotions-and/bridging-theory-and-practice-1b277456400d"  
text = text_helper.extract_medium_post_content(url)
if not is_extraction_successful(text):
    # Stop here: the tokenizers below cannot be trained on empty or missing text.
    raise ValueError("Failed to extract valid content from the URL.")
print("Content successfully extracted.")

char_tokenizer = CharTokenizer.train_from_text(text)
encoded = char_tokenizer.encode(text)

ngrams_tokenizer = NGramsTokenizer(n=3)
ngrams = ngrams_tokenizer.generate_ngrams(encoded)
print("Sample of Character-level 3-grams:", ngrams[0])


word_tokenizer = WordTokenizer.train_from_text(text)
encoded = word_tokenizer.encode(text)

ngrams_tokenizer = NGramsTokenizer(n=2)
ngrams = ngrams_tokenizer.generate_ngrams(encoded)

print("Sample of Word-level 2-grams:", ngrams[0])

encoded

Content successfully extracted.
Sample of Character-level 3-grams: tensor([ 5, 32,  1])
Sample of Word-level 2-grams: tensor([ 0, 12])


tensor([  0,  12,  36,   4,  71,  55, 103,  44,  51, 123, 119,  31,  95, 113,
        101, 117,  28, 111,  27, 123,  90,  15,   5,  39,  58,  80,  23,  45,
         79,  13,  34,  50, 110,  30,  57,   6, 110,  77,  52,   4,  54,  67,
        103, 122,  37, 125, 116, 110,  65,  49,  41,  85, 114,  78,  83,  25,
        107,   9,   4,  17,  73, 104, 102,  11,  56,  91,  56,  81, 120,  60,
          7,  43, 108,  77,  97, 127, 104,  16, 125,  72,  90,  89,  76, 126,
         24, 117, 100,  83,  87, 127,  23,  35,  18,  63, 116,  74,  32,   3,
         65,  38,  64,  56,  45,  10,  96,  70,  59, 105,  78, 115,  92,  86,
         72,  74,  94,   4,  71,  54, 117,  93, 103, 109,  33,  13,  83,  98,
         69,  68,  42,  22,  26,   8, 110, 128,  20, 110,  66,  61,  40, 127,
         69,  46,  59, 106,  22, 117,  21,  10,  99,  48,   2, 114,  47,  19,
        118,  82, 121,  62,  29, 117,  83,  14, 108, 100,  90,  88,   1, 103,
        122,  45,  34,  50, 112,  13,  84,  75,  53, 124])
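The n-gram samples printed above are short runs of consecutive token IDs. A minimal sketch of how such runs can be produced with a sliding window is shown below. `generate_ngrams` here is a hypothetical plain-Python helper, not the `NGramsTokenizer.generate_ngrams` method from `utils.tokenizer` (which, judging by the output, returns tensors), and the input IDs are the first five character-level IDs printed in this tutorial.

```python
def generate_ngrams(token_ids, n):
    """Return every run of n consecutive token IDs (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L yields L - n + 1 overlapping windows.
    return [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]


char_ids = [5, 32, 1, 14, 27]
trigrams = generate_ngrams(char_ids, n=3)
print(trigrams)  # [(5, 32, 1), (32, 1, 14), (1, 14, 27)]
```

The first trigram matches the character-level sample shown above. The same function applied to word-level IDs with `n=2` yields bigrams.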

In [4]:
from torch.utils.data import DataLoader, RandomSampler
from utils import sequentialdataset as sd
import numpy as np

# Using the character-level tokenizer
tokenized_text = char_tokenizer.encode(text)
dataset = sd.SequentialDataset(tokenized_text, seq_len=24, label_len=12)
seq_x, seq_y = dataset[0] 
print(f"input sample: {np.round(seq_x.detach().flatten().tolist(),2)}")
print(f"output sample: {np.round(seq_y.detach().flatten().tolist(),2)}")


sampler = RandomSampler(dataset, replacement=True)
dataloader = DataLoader(dataset, batch_size=2, sampler=sampler)
x, y = next(iter(dataloader))

print(f"Input tensor: {x[0]} and decoded text is '{char_tokenizer.decode(x[0])}'")
print(f"Output tensor: {y[0]} and decoded text is '{char_tokenizer.decode(y[0])}'")


# Using the word-level tokenizer
print("------------------------------")
print("Using word level tokenizer")
tokenized_text = word_tokenizer.encode(text) 
dataset = sd.SequentialDataset(tokenized_text, seq_len=5, label_len=5)
seq_x, seq_y = dataset[0] 
print(f"input sample: {np.round(seq_x.detach().flatten().tolist(),2)}")
print(f"output sample: {np.round(seq_y.detach().flatten().tolist(),2)}")

sampler = RandomSampler(dataset, replacement=True)
dataloader = DataLoader(dataset, batch_size=2, sampler=sampler)
x, y = next(iter(dataloader))
print(f"Input tensor: {x[0]} and decoded text is '{word_tokenizer.decode(x[0])}'")
print(f"Output tensor: {y[0]} and decoded text is '{word_tokenizer.decode(y[0])}'")


input sample: [ 5 32  1 14 27  1 18 17 34 16 14 33 28 31  2  1  8  1 28 19 33 18 27  1]
output sample: [21 18 14 31  1 32 33 34 17 18 27 33]
Input tensor: tensor([33, 21, 14, 33, 33, 21, 18, 28, 31, 18, 33, 22, 16, 14, 25,  1, 17, 18,
        29, 33, 21,  1, 14, 27]) and decoded text is 'thattheoretical depth an'
Output tensor: tensor([17,  1, 29, 31, 14, 16, 33, 22, 16, 14, 25,  1]) and decoded text is 'd practical '
------------------------------
Using word level tokenizer
input sample: [ 0 12 36  4 71]
output sample: [ 55 103  44  51 123]
Input tensor: tensor([ 96,  70,  59, 105,  78]) and decoded text is 'sense of intellectual superiority over'
Output tensor: tensor([115,  92,  86,  72,  74]) and decoded text is 'those relying primarily on online'
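Judging from the printed samples, each input window of `seq_len` tokens appears to be paired with the `label_len` tokens that immediately follow it. The sketch below reproduces that pairing in plain Python. `WindowedPairs` is a hypothetical class inferred from the outputs above; it is not the source of `SequentialDataset`, and the stride of one token between windows is an assumption.

```python
class WindowedPairs:
    """Hypothetical sketch: pair each input window with the tokens that follow it."""

    def __init__(self, token_ids, seq_len, label_len):
        self.token_ids = list(token_ids)
        self.seq_len = seq_len
        self.label_len = label_len

    def __len__(self):
        # Number of start positions that leave room for both input and label.
        return max(0, len(self.token_ids) - self.seq_len - self.label_len + 1)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        split = idx + self.seq_len
        x = self.token_ids[idx:split]                      # input window
        y = self.token_ids[split:split + self.label_len]   # continuation to predict
        return x, y


# First eleven word-level IDs from the encoded tensor printed earlier.
word_ids = [0, 12, 36, 4, 71, 55, 103, 44, 51, 123, 119]
pairs = WindowedPairs(word_ids, seq_len=5, label_len=5)
x, y = pairs[0]
print(x)  # [0, 12, 36, 4, 71]
print(y)  # [55, 103, 44, 51, 123]
```

Because the class defines `__len__` and `__getitem__`, an object like this could in principle be wrapped by a `DataLoader` after converting the lists to tensors.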
