### **Tutorial 10: Tokenizer**

This tutorial explains what a tokenizer is and demonstrates how to use basic character-level, word-level, and n-gram tokenizers in Python. A tokenizer converts text into tokens (smaller units such as words or characters), typically mapping each token to an integer ID, and can usually convert those IDs back into text.

In this tutorial, we use three tokenizers:
1. A character-level tokenizer, which encodes each character as a unique token ID and decodes token IDs back to text.
2. A word-level tokenizer, which encodes each word as a unique token ID and decodes token IDs back to text.
3. An n-gram tokenizer, which groups consecutive tokens into sequences of length n, a fundamental concept in language modeling.
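
The tokenizer classes imported later from `utils.tokenizer` are not shown in this tutorial. As a rough illustration of the idea, here is a minimal sketch of character-level and word-level tokenizers. The class names `SimpleCharTokenizer` and `SimpleWordTokenizer` are hypothetical, and the choices made here (a vocabulary built from the sorted set of tokens, whitespace splitting for words, plain lists instead of tensors) are assumptions, not a description of the actual `utils.tokenizer` code.

```python
class SimpleCharTokenizer:
    """Hypothetical sketch; not the utils.tokenizer implementation."""

    def __init__(self, vocab):
        # The position of a token in the vocabulary list is its token ID
        self.itos = list(vocab)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    @classmethod
    def train_from_text(cls, text):
        # Assumption: the vocabulary is the sorted set of characters seen
        return cls(sorted(set(text)))

    def encode(self, text):
        # Raises KeyError for characters that were not seen during training
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


class SimpleWordTokenizer(SimpleCharTokenizer):
    """Same idea, but one token ID per whitespace-separated word."""

    @classmethod
    def train_from_text(cls, text):
        return cls(sorted(set(text.split())))

    def encode(self, text):
        return [self.stoi[word] for word in text.split()]

    def decode(self, ids):
        return " ".join(self.itos[i] for i in ids)


tok = SimpleCharTokenizer.train_from_text("hello world")
ids = tok.encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(tok.decode(ids))  # hello
```

A real tokenizer would usually also reserve an ID for unknown tokens; this sketch simply fails on unseen input.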

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [2]:
def is_extraction_successful(text):
    """Return True if `text` is a non-empty, non-whitespace string."""
    return isinstance(text, str) and bool(text.strip())

In [4]:
from utils import text_helper
from utils.tokenizer import CharTokenizer, WordTokenizer, NGramsTokenizer

url = "https://medium.com/letters-to-my-younger-self-embracing-emotions-and/bridging-theory-and-practice-1b277456400d"  
text = text_helper.extract_medium_post_content(url)
# Stop here if extraction failed; the tokenizers below need valid text
if not is_extraction_successful(text):
    raise ValueError("Failed to extract valid content from the URL.")
print("Content successfully extracted.")

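# Character-level: build a vocabulary from the text, then encode it as token IDs
char_tokenizer = CharTokenizer.train_from_text(text)
encoded = char_tokenizer.encode(text)

# Group the character IDs into 3-grams and show the first one
ngrams_tokenizer = NGramsTokenizer(n=3)
ngrams = ngrams_tokenizer.generate_ngrams(encoded)
print("Sample of Character-level 3-grams:", ngrams[0])


# Word-level: same steps, with one token ID per word
word_tokenizer = WordTokenizer.train_from_text(text)
encoded = word_tokenizer.encode(text)

# Group the word IDs into 2-grams and show the first one
ngrams_tokenizer = NGramsTokenizer(n=2)
ngrams = ngrams_tokenizer.generate_ngrams(encoded)
print("Sample of Word-level 2-grams:", ngrams[0])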



Content successfully extracted.
Sample of Character-level 3-grams: tensor([ 5, 32,  1])
Sample of Word-level 2-grams: tensor([ 0, 12])
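
The first 3-gram printed above matches the first three character IDs of the encoded text, which is what a sliding window over the token sequence would produce. The sketch below shows that approach in plain Python. It assumes a stride of 1 and returns tuples rather than tensors; the function here is a hypothetical stand-in, not the `NGramsTokenizer.generate_ngrams` implementation.

```python
def generate_ngrams(token_ids, n):
    """Return every window of n consecutive token IDs (stride 1)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A sequence of length L yields L - n + 1 windows (none if L < n)
    return [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]


# The first five character IDs from the encoded text in this tutorial
ids = [5, 32, 1, 14, 27]
print(generate_ngrams(ids, 3))  # [(5, 32, 1), (32, 1, 14), (1, 14, 27)]
```

Each n-gram overlaps the previous one by n - 1 tokens, which is why n-gram counts are a natural basis for predicting the next token from the preceding ones.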


In [5]:
from torch.utils.data import DataLoader, RandomSampler
from utils import sequentialdataset as sd
import numpy as np

# Using the character-level tokenizer: each sample pairs 24 input tokens
# with the 12 tokens that immediately follow them
tokenized_text = char_tokenizer.encode(text)
dataset = sd.SequentialDataset(tokenized_text, seq_len=24, label_len=12)
seq_x, seq_y = dataset[0]
print(f"input sample: {np.round(seq_x.detach().flatten().tolist(),2)}")
print(f"output sample: {np.round(seq_y.detach().flatten().tolist(),2)}")


sampler = RandomSampler(dataset, replacement=True)
dataloader = DataLoader(dataset, batch_size=2, sampler=sampler)
x, y = next(iter(dataloader))

print(f"Input tensor: {x[0]} and decoded text is '{char_tokenizer.decode(x[0])}'")
print(f"Output tensor: {y[0]} and decoded text is '{char_tokenizer.decode(y[0])}'")


# Using the word-level tokenizer: each sample pairs 5 input words
# with the 5 words that immediately follow them
print("------------------------------")
print("Using word level tokenizer")
tokenized_text = word_tokenizer.encode(text)
dataset = sd.SequentialDataset(tokenized_text, seq_len=5, label_len=5)
seq_x, seq_y = dataset[0]
print(f"input sample: {np.round(seq_x.detach().flatten().tolist(),2)}")
print(f"output sample: {np.round(seq_y.detach().flatten().tolist(),2)}")

sampler = RandomSampler(dataset, replacement=True)
dataloader = DataLoader(dataset, batch_size=2, sampler=sampler)
x, y = next(iter(dataloader))
print(f"Input tensor: {x[0]} and decoded text is '{word_tokenizer.decode(x[0])}'")
print(f"Output tensor: {y[0]} and decoded text is '{word_tokenizer.decode(y[0])}'")


input sample: [ 5 32  1 14 27  1 18 17 34 16 14 33 28 31  2  1  8  1 28 19 33 18 27  1]
output sample: [21 18 14 31  1 32 33 34 17 18 27 33]
Input tensor: tensor([18, 26, 32,  1, 39,  1, 16, 14, 27,  1, 18, 14, 32, 22, 25, 38,  1, 15,
        18,  1, 25, 18, 14, 31]) and decoded text is 'ems — can easily be lear'
Output tensor: tensor([27, 18, 17,  1, 33, 21, 31, 28, 34, 20, 21,  1]) and decoded text is 'ned through '
------------------------------
Using word level tokenizer
input sample: [ 0 12 36  4 71]
output sample: [ 55 103  44  51 123]
Input tensor: tensor([38, 64, 56, 45, 10]) and decoded text is 'equations makes her feel a'
Output tensor: tensor([ 96,  70,  59, 105,  78]) and decoded text is 'sense of intellectual superiority over'
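
In the decoded samples above, the output text continues exactly where the input text stops ('...be lear' followed by 'ned through '). That suggests each sample is an input window followed directly by a label window. The sketch below reproduces that behaviour in plain Python. `WindowedDataset` is a hypothetical name, and the stride of 1 and the list-based return values are assumptions; the real `SequentialDataset` in `utils.sequentialdataset` is not shown here and may differ.

```python
class WindowedDataset:
    """Hypothetical stand-in for utils.sequentialdataset.SequentialDataset."""

    def __init__(self, tokens, seq_len, label_len):
        self.tokens = list(tokens)
        self.seq_len = seq_len
        self.label_len = label_len

    def __len__(self):
        # Count the start positions that leave room for input plus label
        return max(0, len(self.tokens) - self.seq_len - self.label_len + 1)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        split = idx + self.seq_len
        x = self.tokens[idx:split]                    # input window
        y = self.tokens[split:split + self.label_len]  # tokens that follow it
        return x, y


ds = WindowedDataset(range(10), seq_len=4, label_len=2)
print(len(ds))  # 5
print(ds[0])    # ([0, 1, 2, 3], [4, 5])
print(ds[4])    # ([4, 5, 6, 7], [8, 9])
```

To use a class like this with `DataLoader`, it would typically subclass `torch.utils.data.Dataset` and return tensors instead of lists.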
