# Tokenization

Task: Convert text to numbers; interpret subword tokenization.

There are many ways to convert text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subword tokens).
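As a rough illustration of the "parts of words" idea, the sketch below splits a word by greedy longest-match against a tiny made-up vocabulary. Both `TOY_VOCAB` and `toy_subword_split` are hypothetical names invented for this example. GPT-2's real tokenizer uses byte-pair-encoding merges learned from data, which this sketch does not reproduce.

```python
# Hypothetical toy vocabulary: subword string -> id (NOT GPT-2's real vocabulary).
TOY_VOCAB = {"token": 0, "ization": 1, "ize": 2, "s": 3}

def toy_subword_split(word, vocab):
    """Split `word` into vocabulary pieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a piece is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces

pieces = toy_subword_split("tokenization", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces, ids)
```

The point is only that one word can map to several ids, and that the same piece (`token`) is shared by different words.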

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

**Note**: If you're running this on the lab machines, you should **re-run the class setup script**:

In [4]:
!/home/cs/344/setup-cs344.sh

Anaconda is set up.
TORCH_HOME looks ok.
HF_HOME looks ok.
Scratch configured in ~/.fastai/config.ini.
Done.


Then you will need to **LOG OUT AND LOG BACK IN**. (If you know what you're doing and want to avoid logging out: the setup script added a definition of `HF_HOME` to `~/.profile`; you can set the same variable here with `os.environ` instead.)
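If you take the `os.environ` route, a minimal sketch could look like the following. The path is a placeholder assumption, not the value the setup script uses; copy the real one from your `~/.profile`. As far as I know, the variable needs to be set before `transformers` is imported for the library to pick it up.

```python
import os

# Placeholder path (an assumption): replace it with the HF_HOME value from ~/.profile.
os.environ["HF_HOME"] = "/tmp/hf-cache-example"

# Do this before `import transformers` so the library reads the value when it loads.
print(os.environ["HF_HOME"])
```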

Now let's install the library.

In [5]:
!pip install -q transformers[sentencepiece]

In [6]:
import torch
from torch import tensor

### Download and load the model

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

In [8]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


In [9]:
#phrase = "May the Force be" [damned]
phrase = "To be or not to"

In [10]:
batch = tokenizer(phrase, return_tensors='pt')
# A batch of phrases can be passed through the model at the same time, which speeds up training
batch

{'input_ids': tensor([[1675,  307,  393,  407,  284]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [11]:
batch['input_ids'].shape
# [batch size, number of input ids]

torch.Size([1, 5])
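With only one phrase, the batch dimension is 1 and every entry of `attention_mask` is 1. The sketch below shows, in plain Python, what could happen when phrases of different lengths share a batch: shorter ones are padded, and the mask marks which positions are real. `pad_batch` is a hypothetical helper, it pads on the right for simplicity, and the pad id 50256 is GPT-2's end-of-text id (the EOS token that the setup above reuses as the pad token).

```python
PAD_ID = 50256  # GPT-2's end-of-text id, reused here as the pad id

def pad_batch(sequences, pad_id):
    """Right-pad lists of token ids to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The ids for " To be or not to" from above, plus a shorter prefix of the same phrase.
padded = pad_batch([[1675, 307, 393, 407, 284], [1675, 307]], PAD_ID)
print(padded)
```

Every row now has the same length, so the batch has shape [2, 5].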

In [17]:
tokenizer.convert_ids_to_tokens([1675])

['ĠTo']

In [18]:
tokenizer.convert_ids_to_tokens(batch['input_ids'][0])

['ĠTo', 'Ġbe', 'Ġor', 'Ġnot', 'Ġto']

In [32]:
out = model(**batch)  # calling the model runs forward(); optional extras: labels=..., output_hidden_states=True

In [45]:
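input_ids = batch['input_ids']
# Run the transformer body by itself, then apply the language-model head by hand.
transformer_outputs = model.transformer(input_ids)
hidden_states = transformer_outputs[0]  # last-layer hidden states
lm_logits = model.lm_head(hidden_states)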

In [47]:
hidden_states.shape

torch.Size([1, 5, 768])

In [48]:
model.lm_head

Linear(in_features=768, out_features=50257, bias=False)

In [49]:
hidden_states[0, 1]

tensor([-1.0535e-01, -3.1119e-01,  1.8547e-01, -2.1496e-01,  4.7452e-02,
         1.0167e-01, -5.4309e-01,  1.0478e+00,  1.3330e-01,  3.2923e-01,
         3.2632e-01,  1.2584e-01, -1.9127e-01,  4.0045e-01, -2.8676e-02,
        -2.3159e-01,  3.8155e-02, -8.4357e-01, -1.5539e-01,  1.4244e-01,
        -1.4554e-01, -5.5682e-02,  1.4789e-01,  4.7947e-01,  2.0871e-01,
         8.7035e-03,  3.0157e-01, -3.0594e-01,  2.3922e-01,  3.7042e-01,
         1.3656e-01,  2.4162e-02,  1.9577e-01, -1.6666e-01,  2.0682e-01,
        -4.3732e-01,  3.8166e+01,  3.9355e-01, -1.4582e-01,  2.6889e-01,
        -1.8951e-01, -5.9843e-02, -8.7268e-03,  1.4158e-01,  1.8710e-02,
         2.4323e-01,  1.1892e-01, -2.4620e-01,  2.2336e-01,  5.8412e-01,
        -1.0754e-01,  1.5730e-01, -2.6348e-01, -1.1958e-01,  7.4822e-02,
         1.4612e+00, -2.7473e-02, -5.6583e-01, -2.1155e-01,  1.7774e-01,
         8.1083e-02, -2.7762e-01,  2.4878e-01,  3.9668e-02, -1.2293e+00,
        -1.6452e-02,  1.7181e-01,  7.7464e-02, -1.3

In [50]:
model.lm_head.weight[257].shape

torch.Size([768])

In [53]:
hidden_states[0, 1] @ model.lm_head.weight[257]

tensor(-51.5463, grad_fn=<DotBackward0>)
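The dot product above is one entry of the logits: because `lm_head` is a linear layer with `bias=False`, the logit for vocabulary entry `v` at a position is the dot product of that position's hidden state with row `v` of the weight matrix. Here is a minimal plain-Python sketch of that computation, using made-up numbers (3 hidden features and 4 vocabulary entries instead of 768 and 50257); `dot` and `linear_head` are hypothetical helper names.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def linear_head(hidden, weight):
    """Bias-free linear layer: one logit per row of `weight`."""
    return [dot(hidden, row) for row in weight]

# Made-up values for illustration only.
hidden = [1.0, -2.0, 0.5]
weight = [
    [0.5, 0.0, 2.0],
    [1.0, 1.0, 0.0],
    [0.0, -1.0, 4.0],
    [2.0, 0.5, -2.0],
]
logits = linear_head(hidden, weight)
print(logits)  # [1.5, -1.0, 4.0, 0.0]
```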

In [54]:
lm_logits.shape

torch.Size([1, 5, 50257])

In [36]:
vars(out).keys()

dict_keys(['loss', 'logits', 'past_key_values', 'hidden_states', 'attentions', 'cross_attentions'])

In [37]:
out.logits.shape
# [batch size, input ids, number of strings in vocabulary]

torch.Size([1, 5, 50257])

In [1]:
next_token_logits = out.logits[0, -1]
# indexing with a single integer removes that dimension, leaving one vector of vocabulary size
# these are the logits for the token that would follow the phrase
# a higher logit means the model considers that token more likely to come next
next_token_logits


In [39]:
next_token_probs = next_token_logits.softmax(dim=0)
# softmax is unchanged when a constant is added to every logit; it gives the probability of each possible next token
next_token_probs.max()
# to find which token this is, look up the index where the maximum occurs

tensor(0.6485, grad_fn=<MaxBackward1>)

In [40]:
next_token_probs.argmax()

tensor(307)

In [41]:
next_token_probs[307]
# look up the probability at the index returned by argmax()

tensor(0.6485, grad_fn=<SelectBackward0>)
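To make the softmax and argmax steps concrete, here is a small plain-Python sketch with made-up logits. The helper names `softmax` and `argmax` are defined here for illustration and are not the PyTorch methods used above. It also checks the shift property mentioned in the comments: adding the same constant to every logit leaves the probabilities unchanged.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

toy_logits = [2.0, 0.5, -1.0, 3.0]  # made-up numbers
probs = softmax(toy_logits)
shifted_probs = softmax([x + 100.0 for x in toy_logits])
best = argmax(probs)
print(best, probs[best])
```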

In [42]:
# convert ids to tokens
tokenizer.convert_ids_to_tokens([307])

['Ġbe']

In [43]:
# find more than one word with high probability
tokenizer.convert_ids_to_tokens(
    next_token_probs.topk(5).indices
)

['Ġbe', 'Ġhave', 'Ġthe', 'Ġdo', ',']
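For intuition, a top-k lookup can be sketched in plain Python as a sort followed by a slice. The probabilities below are made up for illustration (only the token strings are borrowed from the output above), and `topk` here is a hypothetical helper, not `torch.Tensor.topk`.

```python
def topk(values, k):
    """Return (indices, values) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return order, [values[i] for i in order]

# Made-up probabilities, one per token string.
toy_tokens = ["Ġdo", "Ġbe", ",", "Ġhave", "Ġthe"]
toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]

indices, values = topk(toy_probs, 3)
print([toy_tokens[i] for i in indices])
```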

In [46]:
# same idea with the hand-computed lm_logits, at position 1 ('Ġbe'): top 10 candidates for the token after "To be"
tokenizer.convert_ids_to_tokens(
    lm_logits[0, 1]
    .softmax(dim=0)
    .topk(10)
    .indices
)

['Ġa',
 'Ġable',
 'Ġthe',
 'Ġsure',
 'Ġin',
 'Ġused',
 'Ġan',
 'Ġmore',
 'Ġon',
 'Ġhonest']

## Task

Consider the following phrase:

In [6]:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

### Getting familiar with tokens

1: Use `tokenizer.tokenize` to convert the phrase into a list of tokens. (What do you think the `Ġ` means?)

In [7]:
tokenizer.tokenize(phrase)

['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']

2: Use `tokenizer.convert_tokens_to_string` to convert the tokens back into a string.


In [12]:
tokenizer.convert_tokens_to_string(tokenizer.tokenize(phrase))

' I visited Muskegon'

3: Use `tokenizer.encode` to convert the original phrase into token ids. (*Note: this is equivalent to `tokenize` followed by `convert_tokens_to_ids`*.) Call the result `input_ids`.


In [26]:
input_ids = tokenizer.encode(phrase)
input_ids

[314, 8672, 2629, 365, 14520]
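The equivalence in the note can be pictured as a dictionary lookup. The sketch below hard-codes only the five token/id pairs seen in the outputs above; the real vocabulary has 50257 entries, and the `toy_` functions are hypothetical stand-ins rather than the tokenizer's own methods.

```python
# Token/id pairs copied from the outputs above (a tiny slice of the real vocabulary).
TOKEN_TO_ID = {"ĠI": 314, "Ġvisited": 8672, "ĠMus": 2629, "ke": 365, "gon": 14520}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}

def toy_convert_tokens_to_ids(tokens):
    """Look up each token string in the vocabulary."""
    return [TOKEN_TO_ID[t] for t in tokens]

def toy_convert_ids_to_tokens(ids):
    """Inverse lookup: id back to token string."""
    return [ID_TO_TOKEN[i] for i in ids]

toy_ids = toy_convert_tokens_to_ids(["ĠI", "Ġvisited", "ĠMus", "ke", "gon"])
print(toy_ids)  # [314, 8672, 2629, 365, 14520]
```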

4: Turn `input_ids` back into a readable string. Try this two ways: (1) using `convert_ids_to_tokens` and (2) using `tokenizer.decode`.

In [37]:
# using convert_ids_to_tokens, then joining the tokens back into a string
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids))

' I visited Muskegon'

In [28]:
tokenizer.decode(input_ids)

' I visited Muskegon'
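The leading space in both results comes from the `Ġ` at the start of the first token. In GPT-2's byte-level vocabulary, `Ġ` (code point 288) stands for the space byte (32). A simplified sketch of the token-to-string step for this phrase follows; `toy_tokens_to_string` is a hypothetical helper that handles only the space character, whereas the real tokenizer maps every byte through a lookup table.

```python
def toy_tokens_to_string(tokens):
    """Join tokens and turn the 'Ġ' marker back into a space (spaces only)."""
    return "".join(tokens).replace("Ġ", " ")

tokens = ["ĠI", "Ġvisited", "ĠMus", "ke", "gon"]
text = toy_tokens_to_string(tokens)
print(repr(text))  # ' I visited Muskegon'

# 'Ġ' is the space byte shifted up by 256 so that it is a printable character.
print(ord("Ġ") - 256 == ord(" "))  # True
```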

### Applying what you learned

5: Use `model.generate(tensor([input_ids]))` to generate a completion of this phrase. (Note that we needed to add `[]`s to give a "batch" dimension to the input.) Call the result `output_ids`.


In [30]:
output_ids = model.generate(tensor([input_ids]))
output_ids

tensor([[  314,  8672,  2629,   365, 14520,    11,   290,   314,   373,  6655,
           284,  1064,   326,   262,  1748,   550,   407,   587,  1498,   284]])

6: Convert your `output_ids` into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use `output_ids[0]`.)

In [38]:
tokenizer.decode(output_ids[0])

' I visited Muskegon, and I was surprised to find that the city had not been able to'
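Conceptually, generation repeats one step: score the candidates for the next token, pick one, append it, and continue until a length limit is reached (the output above has 20 ids in total). Below is a toy sketch of the greedy version of that loop. `TOY_NEXT_SCORES` is a made-up stand-in for the model that looks only at the last token, whereas the real model conditions on the whole sequence so far; `greedy_generate` is a hypothetical name.

```python
# Made-up stand-in for a language model: last token -> scores for the next token.
TOY_NEXT_SCORES = {
    "I": {"visited": 2.0, "was": 1.0},
    "visited": {"Muskegon": 1.5, "the": 1.0},
    "Muskegon": {",": 2.0, ".": 1.0},
    ",": {"and": 3.0, "I": 0.5},
    "and": {"I": 2.0, "the": 1.0},
}

def greedy_generate(prompt, next_scores, max_length):
    """Repeatedly append the highest-scoring next token until max_length is reached."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        scores = next_scores.get(tokens[-1])
        if not scores:  # no known continuation: stop early
            break
        tokens.append(max(scores, key=scores.get))
    return tokens

generated = greedy_generate(["I", "visited", "Muskegon"], TOY_NEXT_SCORES, max_length=8)
print(generated)
```

Because the toy model always takes the top choice, it falls into a repeating cycle, which is one reason sampling can give more interesting text.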

Note: `generate` uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

- Turn on `do_sample=True`. Run it a few times to see what it gives.
- Set `top_k=5`. Or 50.
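A rough sketch of what these two settings do for a single step: keep only the `k` most probable tokens, then draw one of them at random in proportion to its probability. The probabilities are made up, and `sample_top_k` is a hypothetical helper rather than the library's implementation. With `k=1` it reduces to greedy decoding.

```python
import random

def sample_top_k(probs, k, rng):
    """Draw one index from the k most probable entries, weighted by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # random.choices renormalizes these weights
    return rng.choices(top, weights=weights, k=1)[0]

toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]  # made-up next-token probabilities
rng = random.Random(0)  # seeded so reruns draw the same samples

samples = [sample_top_k(toy_probs, k=2, rng=rng) for _ in range(20)]
greedy = sample_top_k(toy_probs, k=1, rng=rng)
print(samples, greedy)
```

A larger `k` lets less likely tokens through, which is why `top_k=50` tends to wander further than `top_k=5`.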

In [53]:
output_ids_5 = model.generate(tensor([input_ids]), do_sample=True, top_k=5)
output_ids_5

tensor([[  314,  8672,  2629,   365, 14520,    11,   810,   262,  1748,   286,
          2629,   365, 14520,   318,  5140,    13,   198,   198,   198,   198]])

In [54]:
tokenizer.decode(output_ids_5[0])

' I visited Muskegon, where the city of Muskegon is located.\n\n\n\n'

In [55]:
output_ids_50 = model.generate(tensor([input_ids]), do_sample=True, top_k=50)
output_ids_50

tensor([[  314,  8672,  2629,   365, 14520,   329,   262,   938,  1115,   812,
            13,   314,  2911,   339,   857,   407,   466,   884,   281,  2801]])

In [56]:
tokenizer.decode(output_ids_50[0])

' I visited Muskegon for the last three years. I hope he does not do such an ill'