## Program to demonstrate a simple tokenizer

Takes a collection of strings as input.

Builds the vocabulary by splitting each string on spaces. **Converts text to lowercase.**

Encodes a sentence into a list of input IDs.

Decodes a list of IDs back into a string.

Uses the `<UNK>` token for any word not present in the vocabulary.

In [117]:
class Tokenizer:
    """Demonstrate simple Tokenizer class in python"""
    def __init__(self):
        self.vocab = {"<UNK>": 0}
        self.id2word = {0: "<UNK>"}
        self.next_id = 1


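    def build_vocab(self, texts):
        """Build the vocabulary from a collection of strings."""
        for text in texts:
            words = text.lower().split(" ")
            unique_words = set(words)

            # Set iteration order is arbitrary, so the assigned IDs can vary between runs.
            for word in unique_words:
                # Skip words that already have an ID (e.g. seen in an earlier text).
                if word in self.vocab:
                    continue
                self.vocab[word] = self.next_id
                self.id2word[self.next_id] = word
                self.next_id += 1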

    def encode(self, text):
        """Convert input text to list of ids"""
        encoded_ids = []
        words = text.lower().split(" ")
        for word in words:
            encoded_ids.append(self.vocab.get(word, 0))

        return encoded_ids

    def decode(self, ids):
        """Convert a list of IDs back to a string."""
        words = []
        # Use token_id rather than id to avoid shadowing the built-in id().
        for token_id in ids:
            words.append(self.id2word.get(token_id, "<UNK>"))
        return " ".join(words)

## Demonstration

In [118]:
input_str = ["I love learning AI"]

simple_tokenizer = Tokenizer()
simple_tokenizer.build_vocab(input_str)

print(f"Word2Idx: {simple_tokenizer.vocab}")
print(f"Idx2Word: {simple_tokenizer.id2word}")

Word2Idx: {'<UNK>': 0, 'love': 1, 'i': 2, 'learning': 3, 'ai': 4}
Idx2Word: {0: '<UNK>', 1: 'love', 2: 'i', 3: 'learning', 4: 'ai'}
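
The IDs printed above come from iterating over a `set`, so their order can differ from one run to the next. As a minimal sketch, assuming a hypothetical helper named `build_vocab_ordered` (it is not part of the `Tokenizer` class above), IDs can instead be assigned in first-seen order, skipping any word that already has one:

```python
def build_vocab_ordered(texts):
    """Hypothetical helper (an assumption, not the notebook's method):
    assign IDs in the order words are first seen."""
    vocab = {"<UNK>": 0}
    for text in texts:
        for word in text.lower().split(" "):
            # Only new words receive an ID, so repeated words keep their first ID.
            if word not in vocab:
                vocab[word] = len(vocab)
    # Invert the mapping for decoding.
    id2word = {idx: word for word, idx in vocab.items()}
    return vocab, id2word


vocab, id2word = build_vocab_ordered(["I love learning AI", "I love math"])
print(vocab)
# {'<UNK>': 0, 'i': 1, 'love': 2, 'learning': 3, 'ai': 4, 'math': 5}
```

With this variant the same input always produces the same IDs, which makes saved encodings reusable across runs.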


### String to IDs

In [119]:
encoded_ids = simple_tokenizer.encode("I love learning AI")
print(f"Encoded IDs: {encoded_ids}")

Encoded IDs: [2, 1, 3, 4]


### IDs to string

In [120]:
string = simple_tokenizer.decode(encoded_ids)
print(f"Decoded text: {string}")

Decoded text: i love learning ai
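
The decoded text is lowercase because casing is discarded before lookup, so an encode/decode round trip does not restore the original string exactly. The sketch below illustrates this with standalone functions that mirror the lookup rules of the class; the vocabulary is copied from the output shown earlier, and the names `encode_text` and `decode_ids` are hypothetical:

```python
# Vocabulary copied from the notebook output; IDs may differ on another run.
vocab = {"<UNK>": 0, "love": 1, "i": 2, "learning": 3, "ai": 4}
id2word = {idx: word for word, idx in vocab.items()}


def encode_text(text):
    """Lowercase, split on a single space, map unknown words to ID 0."""
    return [vocab.get(word, 0) for word in text.lower().split(" ")]


def decode_ids(ids):
    """Map IDs back to words, using <UNK> for IDs not in the table."""
    return " ".join(id2word.get(token_id, "<UNK>") for token_id in ids)


original = "I love learning AI"
round_trip = decode_ids(encode_text(original))

print(round_trip == original)          # False: casing is lost
print(round_trip == original.lower())  # True: words and order are preserved
```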


### Handling unknown words

In [122]:
input_string = "I love learning math"
encoded_ids = simple_tokenizer.encode(input_string)
print(f"Encoded IDs: {encoded_ids}")
string = simple_tokenizer.decode(encoded_ids)
print(f"Decoded text: {string}")

Encoded IDs: [2, 1, 3, 0]
Decoded text: i love learning <UNK>
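
Because the text is split only on single spaces, unseen words are not the only source of `<UNK>`. Attached punctuation and repeated spaces also produce tokens that are missing from the vocabulary. A small sketch, assuming the vocabulary from the output shown earlier and a hypothetical standalone `encode` function that follows the same lookup rule as `Tokenizer.encode`:

```python
# Vocabulary copied from the notebook output; IDs may differ on another run.
vocab = {"<UNK>": 0, "love": 1, "i": 2, "learning": 3, "ai": 4}


def encode(text, vocab):
    """Lowercase, split on a single space, fall back to the <UNK> ID."""
    unk_id = vocab["<UNK>"]
    return [vocab.get(word, unk_id) for word in text.lower().split(" ")]


# "ai!" is a different token from "ai", so it maps to <UNK>.
print(encode("I love AI!", vocab))   # [2, 1, 0]

# A double space produces an empty-string token, which is also unknown.
print(encode("I  love AI", vocab))   # [2, 0, 1, 4]
```

Stripping punctuation or using `split()` with no argument would avoid these cases, at the cost of departing from the split-on-space rule described at the top.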
