# Week 4: Tokenization

### What we are building
Tokenization is the task of chopping a text up into pieces, called tokens. As you might have observed while going through the projects in Week 1, vectorization and tokenization have a huge influence on the output of the exact same model.

We will compare different tokenizers at different vocabulary sizes on Botchan, a novel written by Natsume Sōseki in 1906. For each size, we'll see what percentage of the tokens in the text would be considered OOV (out-of-vocabulary).

### Instructions

1. We have provided scaffolding for all the different tokenizers and added an assert to make sure the output is the same as expected.
1. You have already seen most of these tokenizers, so we'll dive deeper into SentencePiece.

### Code Overview

- Dependencies: Install and import Python dependencies
- Tokenizers
  - WhitespaceTokenizer
  - CharacterTokenizer
  - SpacyTokenizer
  - BERTTokenizer
  - SentencePieceTokenizer
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [1]:
%%capture
# Install all the required dependencies for the project
!pip install spacy --quiet
!python -m spacy download en_core_web_md
!pip install transformers
!pip install sentencepiece

Import all the necessary libraries we need throughout the project.



In [2]:
# Import all the relevant libraries
import re
import en_core_web_md
import sentencepiece as spm

from collections import defaultdict
from transformers import BertTokenizer

Now let's load the spaCy model, which comes with pre-trained embeddings. This process is expensive, so only do it once.

In [3]:
loaded_spacy_model = en_core_web_md.load()

### Dataset

Download the Botchan novel from the [SentencePiece repository](https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt).


In [4]:
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

lines = []
with open('botchan.txt', 'r') as the_file:
  lines = the_file.readlines()

--2022-08-08 03:42:46--  https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’


2022-08-08 03:42:47 (6.93 MB/s) - ‘botchan.txt’ saved [278779/278779]



Constants for the sample sentence and the vocabulary-size thresholds we'll be using throughout.

In [5]:
SAMPLE_SENTENCE = "I'm learning NLP. Aren't my projects awesome?"
THRESHOLDS = [1000, 2000, 3000, 4000, 5000, 7500, 10000]

### Base Tokenizer class that implements the coverage report function

In [6]:
class Tokenizer:
  def __init__(self, lines):
    # NOTE: despite its name, vocab_size holds the total number of token
    # occurrences in the corpus; it is the denominator used by coverage().
    self.vocab_size = 0
    token_dict = defaultdict(int)

    for line in lines:
      for token in self.tokenize(line):
        self.vocab_size += 1
        token_dict[token] += 1

    # (token, count) pairs, most frequent first
    self.token_counts = sorted(token_dict.items(), key=lambda x: x[1], reverse=True)

  def tokenize(self, sentence):
    # Subclasses must override this and return a list of string tokens
    raise NotImplementedError("TO BE IMPLEMENTED")

  def coverage(self, threshold):
    # Fraction of token occurrences covered by the `threshold` most frequent tokens
    return sum(count for _, count in self.token_counts[:threshold]) / self.vocab_size

  def coverage_report(self, thresholds):
    # For each threshold print the percentage of coverage and OOV
    for tv in thresholds:
      coverage = self.coverage(tv) * 100
      print("For vocab size: %d, coverage is: %.2f%% and oov is: %.2f%%" % (tv, coverage, 100-coverage))
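To make the coverage number concrete, here is a minimal sketch of the same calculation on a toy sentence. The helper name `toy_coverage` and the example text are made up for illustration; only the idea (the share of token occurrences that fall within the most frequent tokens) comes from the class above.

```python
from collections import Counter

def toy_coverage(tokens, vocab_size):
  # Share of token occurrences covered by the `vocab_size` most frequent tokens
  counts = Counter(tokens)
  covered = sum(count for _, count in counts.most_common(vocab_size))
  return covered / len(tokens)

# Hypothetical toy corpus: 11 tokens, "the" appears 3 times and "sat" twice
tokens = "the cat sat on the mat and the dog sat too".split()

print(toy_coverage(tokens, 2))    # only "the" and "sat" are in the vocabulary
print(toy_coverage(tokens, 100))  # vocabulary larger than the number of distinct tokens
```

With a vocabulary of 2, only 5 of the 11 occurrences are covered; once the vocabulary is at least as large as the number of distinct tokens, coverage reaches 100%.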

## Assignment Part:1 - Whitespace based separators
##### <font color='red'>Expected vocab: 1000, coverage: 72.79%</font>
##### <font color='red'>Expected vocab: 10000, coverage: 99.43%</font>

Tokenizer that splits each sentence into tokens on spaces.

In [8]:
class WhiteSpaceTokenizer(Tokenizer):
  def tokenize(self, sentence):
    ### TO BE IMPLEMENTED ###
    output = sentence.split(' ')
    ### TO BE IMPLEMENTED ###
    
    return output

white_space_tokenizer = WhiteSpaceTokenizer(lines)
assert white_space_tokenizer.tokenize(SAMPLE_SENTENCE) == ["I'm", 'learning', 'NLP.', "Aren't", 'my', 'projects', 'awesome?']
white_space_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 72.79% and oov is: 27.21%
For vocab size: 2000, coverage is: 80.11% and oov is: 19.89%
For vocab size: 3000, coverage is: 84.41% and oov is: 15.59%
For vocab size: 4000, coverage is: 87.66% and oov is: 12.34%
For vocab size: 5000, coverage is: 89.62% and oov is: 10.38%
For vocab size: 7500, coverage is: 94.52% and oov is: 5.48%
For vocab size: 10000, coverage is: 99.43% and oov is: 0.57%
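One detail worth noticing in the implementation above: `split(' ')` splits on single space characters only. The sketch below uses a made-up line to show how that differs from `split()` with no argument. The expected numbers in this notebook were produced with `split(' ')`, so switching would likely change them.

```python
# Hypothetical line with a double space and a trailing newline, as readlines() would return it
line = "He said  hello.\n"

space_tokens = line.split(' ')  # splits on every single space
any_ws_tokens = line.split()    # splits on runs of any whitespace

print(space_tokens)   # ['He', 'said', '', 'hello.\n']
print(any_ws_tokens)  # ['He', 'said', 'hello.']
```

With `split(' ')`, the last word of every line keeps its newline and consecutive spaces yield empty-string tokens, so the same word can be counted as several different tokens.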


## Assignment Part:2 - Character based tokenizer
##### <font color='red'>Expected vocab: 1000, coverage: 100%</font>
##### <font color='red'>Expected vocab: 10000, coverage: 100%</font>

Tokenizer that splits the string sentences into individual characters.

In [9]:
class CharacterTokenizer(Tokenizer):
  def tokenize(self, sentence):
    ### TO BE IMPLEMENTED ###
    output = list(sentence)
    ### TO BE IMPLEMENTED ###
    
    return output

character_tokenizer = CharacterTokenizer(lines)
assert character_tokenizer.tokenize(SAMPLE_SENTENCE) == ['I', "'", 'm', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'N', 'L', 'P', '.', ' ', 'A', 'r', 'e', 'n', "'", 't', ' ', 'm', 'y', ' ', 'p', 'r', 'o', 'j', 'e', 'c', 't', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '?']
character_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 2000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 3000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 4000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 5000, coverage is: 100.00% and oov is: 0.00%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
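The 100% figures follow from how few distinct characters a text contains. As a small illustration (using the sample sentence rather than the full novel), we can count them directly:

```python
text = "I'm learning NLP. Aren't my projects awesome?"

# Every distinct character is one entry in a character-level vocabulary
unique_chars = sorted(set(text))

print(len(text))          # number of character tokens
print(len(unique_chars))  # number of distinct characters
```

As long as the number of distinct characters stays below the smallest threshold (1000 here), every character fits in the vocabulary and nothing is out-of-vocabulary.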


## Assignment Part:3 - Spacy Tokenizer
##### <font color='red'>Expected vocab: 1000, coverage: 86%</font>
##### <font color='red'>Expected vocab: 10000, coverage: 100%</font>

Tokenizer that splits each sentence into individual tokens using spaCy's built-in tokenizer.

In [12]:
class SpacyTokenizer(Tokenizer):
  def tokenize(self, sentence):
    ### TO BE IMPLEMENTED ###
    spacy_doc = loaded_spacy_model(sentence)
    output = [token.text for token in spacy_doc]
    ### TO BE IMPLEMENTED ###
    
    return output

spacy_tokenizer = SpacyTokenizer(lines)
assert spacy_tokenizer.tokenize(SAMPLE_SENTENCE) == ['I', "'m", 'learning', 'NLP', '.', 'Are', "n't", 'my', 'projects', 'awesome', '?']
spacy_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 86.00% and oov is: 14.00%
For vocab size: 2000, coverage is: 91.94% and oov is: 8.06%
For vocab size: 3000, coverage is: 95.08% and oov is: 4.92%
For vocab size: 4000, coverage is: 96.67% and oov is: 3.33%
For vocab size: 5000, coverage is: 98.21% and oov is: 1.79%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
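One plausible reason the coverage here is higher than for the whitespace tokenizer is that punctuation is split away from words, so `awesome?` and `awesome` no longer count as different tokens. The sketch below illustrates the effect with a crude regular expression. This is an assumption-laden stand-in, not spaCy's actual rules, and the toy text and the helper `toy_coverage` are made up.

```python
import re
from collections import Counter

def toy_coverage(tokens, vocab_size):
  # Share of token occurrences covered by the `vocab_size` most frequent tokens
  counts = Counter(tokens)
  return sum(c for _, c in counts.most_common(vocab_size)) / len(tokens)

text = "tea. tea, tea! tea? tea"

whitespace_tokens = text.split(' ')
# Crude approximation: runs of word characters, or single punctuation marks
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(toy_coverage(whitespace_tokens, 2))  # 5 distinct tokens, each seen once
print(toy_coverage(punct_tokens, 2))       # "tea" is now one frequent token
```

In this toy case the same two-entry vocabulary covers 2 of 5 whitespace tokens but 6 of 9 punctuation-split tokens.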


## Assignment Part:4 - BERT Tokenizer
##### <font color='red'>Expected vocab: 1000, coverage: 85.46%</font>
##### <font color='red'>Expected vocab: 10000, coverage: 100%</font>

BERT tokenizer provided by the Hugging Face `transformers` library.

In [13]:
class BERTTokenizer(Tokenizer):
  def __init__(self, tokenizer, lines):
    self.tokenizer = tokenizer
    super(BERTTokenizer, self).__init__(lines)

  def tokenize(self, sentence):
    ### TO BE IMPLEMENTED ###
    output = self.tokenizer.tokenize(sentence)
    ### TO BE IMPLEMENTED ###
    
    return output

raw_bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokenizer = BERTTokenizer(raw_bert_tokenizer, lines)
assert bert_tokenizer.tokenize(SAMPLE_SENTENCE) == ['i', "'", 'm', 'learning', 'nl', '##p', '.', 'aren', "'", 't', 'my', 'projects', 'awesome', '?'], bert_tokenizer.tokenize(SAMPLE_SENTENCE)
bert_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 85.46% and oov is: 14.54%
For vocab size: 2000, coverage is: 92.11% and oov is: 7.89%
For vocab size: 3000, coverage is: 95.61% and oov is: 4.39%
For vocab size: 4000, coverage is: 97.51% and oov is: 2.49%
For vocab size: 5000, coverage is: 99.04% and oov is: 0.96%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
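The `'nl', '##p'` split in the assert above comes from BERT's WordPiece scheme, where a word missing from the vocabulary is broken into known sub-pieces and continuation pieces are marked with `##`. Below is a minimal sketch of the greedy longest-match-first idea. The function name `wordpiece` and the tiny vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and length limits.

```python
def wordpiece(word, vocab, unk="[UNK]"):
  # Greedy longest-match-first split of a single, already lowercased word
  pieces, start = [], 0
  while start < len(word):
    end = len(word)
    match = None
    # Try the longest remaining substring first, then shrink from the right
    while start < end:
      candidate = word[start:end]
      if start > 0:
        candidate = "##" + candidate  # continuation pieces carry the ## prefix
      if candidate in vocab:
        match = candidate
        break
      end -= 1
    if match is None:
      return [unk]  # no piece fits, so the whole word becomes unknown
    pieces.append(match)
    start = end
  return pieces

# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary
toy_vocab = {"learning", "nl", "##p", "awe", "##some"}

print(wordpiece("nlp", toy_vocab))      # ['nl', '##p']
print(wordpiece("awesome", toy_vocab))  # ['awe', '##some']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

Because rare words decompose into frequent sub-pieces, a subword vocabulary can reach high coverage with far fewer entries than a whole-word vocabulary.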


## Assignment Part:5 - SentencePieceTokenizer
##### <font color='red'>Expected vocab: 1000, coverage: 99.87%</font>
##### <font color='red'>Expected vocab: 10000, coverage: 100%</font>

The SentencePiece tokenizer works a bit differently from everything we've used so far. It needs to be trained on a text corpus with a target vocabulary size.

In [14]:
class SentencePieceTokenizer(Tokenizer):
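  def __init__(self, lines):
    # The base-class constructor is not called here: coverage() is overridden
    # below, so the token counts it would build are not needed.
    self.lines = lines
    # Train a 4000-piece model on the corpus; this writes m.model and m.vocab
    spm.SentencePieceTrainer.train(input='botchan.txt', model_prefix='m', vocab_size=4000)
    self.sp = spm.SentencePieceProcessor()
    self.sp.load('m.model')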

  def tokenize(self, sentence):
    ### TO BE IMPLEMENTED ###
    output = self.sp.encode_as_pieces(sentence)
    ### TO BE IMPLEMENTED ###
    
    return output

  def coverage(self, threshold):
    # Train a new SentencePiece tokenizer for the threshold provided
    try:
      spm.SentencePieceTrainer.train(input='botchan.txt', model_prefix=f'm{threshold}', vocab_size=threshold)
    except RuntimeError:
      # Training fails when the requested vocabulary is too large for this
      # corpus (above 5239 pieces here); report full coverage in that case
      return 1.0

    sp = spm.SentencePieceProcessor()
    # Load our recently trained model
    sp.load(f'm{threshold}.model')

    # Count the number of times UNK (id=0) was assigned in the entire dataset from the model
    total = 0
    unk = 0
    for line in self.lines:
      ids = sp.encode_as_ids(line)
      unk += ids.count(0)
      total += len(ids)
    return (total - unk) / total

sp_tokenizer = SentencePieceTokenizer(lines)
assert sp_tokenizer.tokenize(SAMPLE_SENTENCE) == ['▁I', "'", 'm', '▁learn', 'ing', '▁N', 'L', 'P', '.', '▁A', 'ren', "'", 't', '▁my', '▁pro', 'j', 'e', 'c', 't', 's', '▁awe', 'some', '?'], sp_tokenizer.tokenize(SAMPLE_SENTENCE)
sp_tokenizer.coverage_report(THRESHOLDS)

For vocab size: 1000, coverage is: 99.87% and oov is: 0.13%
For vocab size: 2000, coverage is: 99.85% and oov is: 0.15%
For vocab size: 3000, coverage is: 99.84% and oov is: 0.16%
For vocab size: 4000, coverage is: 99.83% and oov is: 0.17%
For vocab size: 5000, coverage is: 99.83% and oov is: 0.17%
For vocab size: 7500, coverage is: 100.00% and oov is: 0.00%
For vocab size: 10000, coverage is: 100.00% and oov is: 0.00%
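Notice the `▁` (U+2581) symbols in the expected pieces above: SentencePiece treats whitespace as an ordinary symbol and marks it with `▁`, which makes the tokenization reversible. As a small sketch, the helper below (a hypothetical `detokenize`, written by hand rather than taken from the library) rebuilds the sample sentence from the pieces in the assert:

```python
pieces = ['▁I', "'", 'm', '▁learn', 'ing', '▁N', 'L', 'P', '.', '▁A', 'ren', "'", 't',
          '▁my', '▁pro', 'j', 'e', 'c', 't', 's', '▁awe', 'some', '?']

def detokenize(pieces):
  # Concatenate the pieces, turn each "▁" marker back into a space,
  # and drop the space that precedes the first word
  return ''.join(pieces).replace('▁', ' ').lstrip(' ')

print(detokenize(pieces))  # I'm learning NLP. Aren't my projects awesome?
```

None of the other tokenizers in this notebook keep enough information to rebuild the original spacing this way.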


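As you can see, the OOV percentage is very low, but it creeps up slightly as the vocabulary size grows from 1000 to 5000. An intuitive way to think about this: with a small vocabulary, the algorithm learns something close to character-level pieces, which cover almost everything. Given a larger budget, it tries to learn more language semantics through longer pieces and trades off some vocabulary coverage. Note that the 100% figures for 7500 and 10000 are not measured: training raises a runtime error for vocabulary sizes above 5239 on this corpus, and `coverage` returns 1.0 in that case.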

🎉 WOOHOOO, we've covered Part 1 of this week! Let's keep making progress and proceed to the Generation notebook. But do come back to try out some of the extensions.

# Extensions

Now that you've worked through Part 1 of the project, there is a lot more for us to try:

- Try the new tokenizers in the Week 1 EmbeddingBag model.
- Similarly, change the tokenizer in the Week 2 LSTM.
- Compare the tokenizers on non-English data.
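
As a starting point for the last extension, here is a tiny illustration of why the choice of tokenizer matters more for some languages. The example sentence is the title of another Sōseki novel ("I Am a Cat"), chosen here purely for illustration. Japanese is written without spaces, so a space-based tokenizer cannot find word boundaries at all.

```python
sentence = "吾輩は猫である"  # "I am a cat", written without spaces

space_tokens = sentence.split(' ')  # same approach as WhiteSpaceTokenizer
char_tokens = list(sentence)        # same approach as CharacterTokenizer

print(space_tokens)  # the whole sentence comes back as a single token
print(char_tokens)   # one token per character
```

A trained subword model such as SentencePiece does not depend on spaces, which is one reason it might be a natural candidate for this comparison.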