<a href="https://colab.research.google.com/github/kcarnold/cs344/blob/main/portfolio/fundamentals/012_tokenization-solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `012` Tokenization

Task: Convert text to numbers; interpret subword tokenization.

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/81/91/61d69d58a1af1bd81d9ca9d62c90a6de3ab80d77f27c5df65d9a2c1f5626/transformers-4.5.0-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/08/cd/342e584ee544d044fb573ae697404ce22ede086c9e87ce5960772084cad0/sacremoses-0.0.44.tar.gz (862kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.44-cp37-none-any.whl size=886084 sha256=d6b6fdd3399

In [2]:
import torch
from torch import tensor

### Download and load the model

In [40]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# distilgpt2 is a smaller version of GPT-2.
# add_prefix_space=True tokenizes the first word as if a space preceded it, like every later word.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# Use the EOS token as the PAD token to avoid warnings.
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

In [41]:
print(f"The model has {model.num_parameters():,d} parameters.")

The model has 81,912,576 parameters.


## Task

Consider the following phrase: "In a shocking finding, scientists discovered a herd of unicorns living in"

**Getting familiar with tokens:**

1. Use `tokenizer.tokenize` to convert the phrase into a list of tokens. What do you think the `Ġ` means?
2. Use `tokenizer.convert_tokens_to_string` to convert the tokens back into a string.
3. Use `tokenizer.encode` to convert the original phrase into token ids. (*Note: this is equivalent to `tokenize` followed by `convert_tokens_to_ids`*.)
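4. Use `tokenizer.decode` to convert the token ids back into text. (Because the tokenizer was loaded with `add_prefix_space=True`, expect the original phrase with a leading space.)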

**Applying what you learned:**

5. Use `model.generate(tensor([input_ids]))` to generate a completion of this phrase. (The extra `[]` gives the input a "batch" dimension, which the model expects.)
6. Convert the result of `generate` into a readable form. (Recall the batch dimension from the previous step: the result is a batch containing one sequence.)

In [54]:
phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"
#phrase = "Next weekend I plan to"

In [55]:
# your code here
tokens = tokenizer.tokenize(phrase)
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))

['ĠIn', 'Ġa', 'Ġshocking', 'Ġfinding', ',', 'Ġscientists', 'Ġdiscovered', 'Ġa', 'Ġherd', 'Ġof', 'Ġunic', 'orns', 'Ġliving', 'Ġin']
 In a shocking finding, scientists discovered a herd of unicorns living in
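
The `Ġ` marks a token that begins with a space: GPT-2's byte-level tokenizer renders the space byte as `Ġ` (U+0120) so that every token is a printable string. The sketch below replays the token-to-string and token-to-id steps in plain Python, using only the tokens and ids printed in this notebook. The `toy_vocab` dictionary and both `toy_*` helpers are hypothetical stand-ins for illustration, not the tokenizer's actual implementation, and the simple `Ġ`-to-space replacement is only assumed to hold for ASCII text like this phrase.

```python
# Tokens and ids copied from this notebook's outputs for the example phrase.
tokens = ['ĠIn', 'Ġa', 'Ġshocking', 'Ġfinding', ',', 'Ġscientists', 'Ġdiscovered',
          'Ġa', 'Ġherd', 'Ġof', 'Ġunic', 'orns', 'Ġliving', 'Ġin']
ids = [554, 257, 14702, 4917, 11, 5519, 5071, 257, 27638, 286, 28000, 19942, 2877, 287]

# Hypothetical toy vocabulary: just the token -> id pairs seen above.
toy_vocab = dict(zip(tokens, ids))

def toy_tokens_to_ids(toks):
    """Look each token up in the toy vocabulary (stand-in for convert_tokens_to_ids)."""
    return [toy_vocab[t] for t in toks]

def toy_tokens_to_string(toks):
    """Concatenate tokens and turn each 'Ġ' back into a space (ASCII-only stand-in)."""
    return "".join(toks).replace("Ġ", " ")

text = toy_tokens_to_string(tokens)
print(repr(text))                       # note the leading space
print(toy_tokens_to_ids(tokens) == ids)
```

Two details are worth noticing. The leading space in the reconstructed string comes from `add_prefix_space=True`, which is why the first token is `ĠIn` rather than `In`. And "unicorns" is split into `Ġunic` and `orns`: only the first piece carries the marker, because the second continues the same word.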


In [56]:
# your code here
input_ids = tokenizer.encode(phrase)
input_ids

[554,
 257,
 14702,
 4917,
 11,
 5519,
 5071,
 257,
 27638,
 286,
 28000,
 19942,
 2877,
 287]

In [57]:
# your code here
tokenizer.decode(input_ids)

' In a shocking finding, scientists discovered a herd of unicorns living in'

In [58]:
# your code here
output_ids = model.generate(tensor([input_ids]))
output_ids

tensor([[  554,   257, 14702,  4917,    11,  5519,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  8222,   287,   262,  7840,   636]])

In [59]:
# your code here
tokenizer.decode(output_ids[0])

' In a shocking finding, scientists discovered a herd of unicorns living in a forest in the northern part'
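
To see what `generate` actually added, it can help to separate the prompt from the continuation. This is a small sketch using plain Python lists and the ids printed above; `batch`, `sequence`, and `new_ids` are names introduced here for illustration. It assumes, as this output shows, that the returned sequence starts with the prompt ids.

```python
# Ids copied from this notebook's outputs.
input_ids = [554, 257, 14702, 4917, 11, 5519, 5071, 257, 27638, 286, 28000, 19942, 2877, 287]
batch = [[554, 257, 14702, 4917, 11, 5519, 5071, 257, 27638, 286,
          28000, 19942, 2877, 287, 257, 8222, 287, 262, 7840, 636]]

# The outer list is the batch dimension; index 0 selects the only sequence.
sequence = batch[0]

# The output repeats the prompt, so everything after it is newly generated.
assert sequence[:len(input_ids)] == input_ids
new_ids = sequence[len(input_ids):]

print(len(input_ids), len(sequence))
print(new_ids)
```

The same slicing should work on the real tensor: `tokenizer.decode(output_ids[0][len(input_ids):])` would decode only the continuation, here the six tokens behind " a forest in the northern part". The total of 20 tokens is consistent with a default length limit of 20 in `generate`, though that is an assumption about the library version used in this run.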