# Tokenization

Task: Convert text to numbers; interpret subword tokenization.

There are many ways to convert text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subword tokens).

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

**Note**: If you're running this on the lab machines, you should **re-run the class setup script**:

In [1]:
!/home/cs/344/setup-cs344.sh

Anaconda is set up.
TORCH_HOME looks ok.
HF_HOME looks ok.
Scratch configured in ~/.fastai/config.ini.
Done.


Then you will need to **LOG OUT AND LOG BACK IN**. (If you know what you're doing and want to avoid the log out: that added a definition of `HF_HOME` to `~/.profile`; you can set it here with `os.environ` if you want.)

Now let's install the library.

In [2]:
!pip install -q transformers[sentencepiece]

In [3]:
import torch
from torch import tensor

### Download and load the model

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)


In [5]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


## Task

Consider the following phrase:

In [8]:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

### Getting familiar with tokens

1: Use `tokenizer.tokenize` to convert the phrase into a list of tokens. (What do you think the `Ġ` means?)

In [9]:
tokenizer.tokenize(phrase)

['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']

2: Use `tokenizer.convert_tokens_to_string` to convert the tokens back into a string.


In [12]:
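# Pass the list of tokens, not the raw phrase; the leading space comes from `add_prefix_space=True`
tokenizer.convert_tokens_to_string(tokenizer.tokenize(phrase))

' I visited Muskegon'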

3: Use `tokenizer.encode` to convert the original phrase into token ids. (*Note: this is equivalent to `tokenize` followed by `convert_tokens_to_ids`*.) Call the result `input_ids`.


In [26]:
input_ids = tokenizer.encode(phrase)
input_ids

[314, 8672, 2629, 365, 14520]
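The note in step 3 (that `encode` is `tokenize` followed by `convert_tokens_to_ids`) can be pictured as a dictionary lookup. The following is a minimal sketch, not the library's implementation: `toy_vocab` and `toy_convert_tokens_to_ids` are hypothetical names, and the vocabulary contains only the five tokens and ids shown in this notebook.

```python
# Toy vocabulary: only the five token/id pairs that appear in this notebook.
toy_vocab = {"ĠI": 314, "Ġvisited": 8672, "ĠMus": 2629, "ke": 365, "gon": 14520}


def toy_convert_tokens_to_ids(tokens, vocab):
    """Look up each token string in the vocabulary (hypothetical helper)."""
    return [vocab[token] for token in tokens]


# The token list that `tokenizer.tokenize(phrase)` returned above.
tokens = ["ĠI", "Ġvisited", "ĠMus", "ke", "gon"]
ids = toy_convert_tokens_to_ids(tokens, toy_vocab)
print(ids)  # [314, 8672, 2629, 365, 14520]
```

The real vocabulary works the same way in principle, just with 50257 entries instead of five.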

4: Turn `input_ids` back into a readable string. Try this two ways: (1) using `convert_ids_to_tokens` and (2) using `tokenizer.decode`.

In [37]:
# using convert_ids_to_tokens, then convert_tokens_to_string
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids))

' I visited Muskegon'

In [28]:
tokenizer.decode(input_ids)

' I visited Muskegon'
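Both approaches in step 4 give the same string, including the leading space. Here is a rough sketch of why, using a hypothetical `toy_decode` helper and the same five-entry toy vocabulary taken from this notebook. It simplifies the real decoder by handling only the `Ġ` marker.

```python
# Toy vocabulary: only the five token/id pairs that appear in this notebook.
toy_vocab = {"ĠI": 314, "Ġvisited": 8672, "ĠMus": 2629, "ke": 365, "gon": 14520}
# Invert the mapping so ids can be turned back into token strings.
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_decode(ids):
    """Map ids to tokens, glue them together, and turn each Ġ back into a space."""
    tokens = [id_to_token[i] for i in ids]
    return "".join(tokens).replace("Ġ", " ")


decoded = toy_decode([314, 8672, 2629, 365, 14520])
print(repr(decoded))  # ' I visited Muskegon'
```

The leading space appears because the first token is `ĠI`, which is a consequence of loading the tokenizer with `add_prefix_space=True`.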

### Applying what you learned

5: Use `model.generate(tensor([input_ids]))` to generate a completion of this phrase. (Note that we needed to add `[]`s to give a "batch" dimension to the input.) Call the result `output_ids`.


In [30]:
output_ids = model.generate(tensor([input_ids]))
output_ids

tensor([[  314,  8672,  2629,   365, 14520,    11,   290,   314,   373,  6655,
           284,  1064,   326,   262,  1748,   550,   407,   587,  1498,   284]])
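The "batch" dimension from step 5 can be pictured with plain nested lists. This is only an illustration: `shape` is a hypothetical helper, and the assumption is that a tensor built from these lists reports the same dimensions.

```python
def shape(nested):
    """Return the dimensions of a regularly nested list (hypothetical helper)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


input_ids = [314, 8672, 2629, 365, 14520]
batch = [input_ids]  # wrapping in [] adds a leading dimension of size 1

print(shape(input_ids))  # (5,)
print(shape(batch))      # (1, 5)
print(batch[0] == input_ids)  # True: indexing with [0] removes the batch dimension
```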

6: Convert your `output_ids` into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use `output_ids[0]`.)

In [38]:
tokenizer.decode(output_ids[0])

' I visited Muskegon, and I was surprised to find that the city had not been able to'

Note: `generate` uses greedy decoding by default, but it is highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

- Turn on `do_sample=True`. Run it a few times to see what it gives.
- Set `top_k=5`. Or 50.

In [53]:
output_ids_5 = model.generate(tensor([input_ids]), do_sample=True, top_k=5)
output_ids_5

tensor([[  314,  8672,  2629,   365, 14520,    11,   810,   262,  1748,   286,
          2629,   365, 14520,   318,  5140,    13,   198,   198,   198,   198]])

In [54]:
tokenizer.decode(output_ids_5[0])

' I visited Muskegon, where the city of Muskegon is located.\n\n\n\n'

In [55]:
output_ids_50 = model.generate(tensor([input_ids]), do_sample=True, top_k=50)
output_ids_50

tensor([[  314,  8672,  2629,   365, 14520,   329,   262,   938,  1115,   812,
            13,   314,  2911,   339,   857,   407,   466,   884,   281,  2801]])

In [56]:
tokenizer.decode(output_ids_50[0])

' I visited Muskegon for the last three years. I hope he does not do such an ill'
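The difference between greedy decoding and top-k sampling can be sketched on a single step. This is a simplified illustration, not how `generate` is implemented: the probabilities in `next_token_probs` are made up, and `greedy_pick` and `top_k_sample` are hypothetical helpers.

```python
import random


def greedy_pick(probs):
    """Greedy decoding: always take the single most probable next token."""
    return max(probs, key=probs.get)


def top_k_sample(probs, k, rng):
    """Top-k sampling: keep the k most probable tokens, then sample among them."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    # random.choices treats the weights as relative, so no renormalizing is needed.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up probabilities for the token that follows "I visited Muskegon".
next_token_probs = {",": 0.40, " for": 0.25, " and": 0.20, " last": 0.10, " zebra": 0.05}

rng = random.Random(0)
greedy_choice = greedy_pick(next_token_probs)
samples = [top_k_sample(next_token_probs, 2, rng) for _ in range(5)]
print(greedy_choice)  # ,
print(samples)        # varies with the seed, but only "," and " for" can appear
```

Greedy decoding gives the same continuation every time, while sampling varies from run to run. A larger `k` lets less likely tokens through, which may explain why the `top_k=50` output above wanders more than the `top_k=5` one.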

## Analysis

Q1: Write a brief explanation of what a tokenizer does. Note that we worked with two parts of a tokenizer in this exercise (one that deals only with strings, and another that deals with numbers); make sure your explanation addresses both parts.

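A tokenizer prepares text for a model and turns the model's output back into text. It has two parts. The string part splits text into subword tokens (`tokenize`) and joins tokens back into a string (`convert_tokens_to_string`). The number part maps each token to its integer id in the vocabulary (`convert_tokens_to_ids`) and maps ids back to tokens (`convert_ids_to_tokens`). `encode` and `decode` run both parts in a single call.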

Q2: What are the smallest and largest numbers you've seen in `input_ids`? How does this relate to the number of words in the tokenizer's vocabulary? (See the `print` statement just after loading the model.)

Smallest: 314  
Largest: 14520  
The tokenizer has 50257 strings in its vocabulary, and each id is the index of one token in that vocabulary. Every id must therefore fall between 0 and 50256, and both numbers above are well inside that range.

Q3: What do you think the `Ġ` means? (Hint: it replaces a single well-known character.)

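It stands for a space. A token that starts with `Ġ` is preceded by a space in the original text, which usually means it begins a new word; tokens without it (like `ke` and `gon`) continue the previous word.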
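The reason the space shows up as `Ġ` specifically is that GPT-2's tokenizer works on bytes and gives each byte a printable stand-in character. The sketch below re-creates that byte-to-character table as I understand it from the reference GPT-2 encoder; it is a re-implementation for illustration, not the library's own code, and the variable names are my own.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character.

    Bytes that are already printable map to themselves; the rest (including
    space and newline) are shifted to code points starting at 256.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shifted = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shifted)
            shifted += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
print(table[ord(" ")])   # Ġ
print(table[ord("\n")])  # Ċ
print(table[ord("A")])   # A
```

The space is byte 32, the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is `Ġ`.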

Q4: Suppose you add some personal flair to your writing by adding some extra syllables to the end of some words. Explain what this tokenizer will do with your embellished writing.

The tokenizer works with subword pieces, so a word does not need to be in the vocabulary as a whole. "Muskegon" shows this: it is probably not in the vocabulary, so it was split into `ĠMus`, `ke`, and `gon`. Embellished words would be handled the same way. For example, if I embellished "awesome" to "awesomeingerly", the result would not be kept as a single token, because that string is very unlikely to be in the vocabulary. Instead it would be broken into several shorter tokens that are in the vocabulary, likely a piece for the original word followed by pieces for the extra syllables. The embellished text would therefore use more tokens than the plain version.
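The splitting described in this answer can be sketched with a toy version of byte-pair encoding (BPE), the family of algorithm GPT-2's tokenizer uses. Everything here is an assumption for illustration: the merge list `toy_merges` is invented and tiny, `bpe_split` is a hypothetical helper, and the real tokenizer's merges would likely split "awesomeingerly" differently.

```python
def bpe_split(word, merges):
    """Split a word by repeatedly applying the highest-priority merge rule."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair has a merge rule left
        merged, i = [], 0
        while i < len(pieces):
            if i < len(pieces) - 1 and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


# Invented merge rules, earliest first; enough to build "awesome", "ing", "ly".
toy_merges = [
    ("a", "w"), ("e", "s"), ("o", "m"), ("aw", "es"), ("om", "e"),
    ("awes", "ome"), ("i", "n"), ("in", "g"), ("l", "y"),
]

print(bpe_split("awesome", toy_merges))         # ['awesome']
print(bpe_split("awesomeingerly", toy_merges))  # ['awesome', 'ing', 'e', 'r', 'ly']
```

With these rules the familiar word survives as one piece, and the added syllables fall apart into smaller pieces, down to single characters where no merge rule applies.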