
# Transformers

In [2]:
import transformers
import torch
import textwrap

import plotly.express as px

## Text Generation

In [3]:
model_version = 'EleutherAI/gpt-neo-125m'

tokenizer = transformers.AutoTokenizer.from_pretrained(model_version)
model = transformers.AutoModelForCausalLM.from_pretrained(model_version)


### Manual generation

In [40]:
prompt = "The sun is"

ids = tokenizer(prompt, return_tensors='pt')['input_ids']
ids

tensor([[ 464, 4252,  318]])

In [41]:
with torch.no_grad():
    out = model(ids)

In [42]:
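# Softmax over the vocabulary axis turns logits into probabilities.
# After dropping the batch axis the shape is (sequence_length, vocab_size).
probabilities = torch.softmax(out.logits, dim=-1).squeeze(0).numpy()
# Row -1 is the distribution for the token that follows the prompt;
# take the ids of its 10 most probable tokens, most probable first.
top10 = probabilities[-1].argsort()[:-11:-1]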

decoded_most_probable = [repr(tokenizer.decode(i)) for i in top10]

px.bar(
    x = decoded_most_probable,
    y = probabilities[-1][top10],
    title = 'Next token probability'
)
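
To make the softmax-and-sort step above concrete without loading a model, here is a minimal sketch in plain Python. The five-word vocabulary and the logits are made-up values for illustration, not output from GPT-Neo, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(probs, k):
    """Indices of the k largest probabilities, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Hypothetical vocabulary and logits for the position after "The sun is"
vocab = [" shining", " a", " the", " hot", " blue"]
logits = [2.0, 1.0, 0.5, 1.5, -1.0]

probs = softmax(logits)
top3 = top_k_indices(probs, 3)
for i in top3:
    print(f"{vocab[i]!r}: {probs[i]:.3f}")
```

The notebook cell does the same thing with `torch.softmax` and `argsort`, only over the full vocabulary instead of five tokens.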

### Automatic generation

Text generation works by having the model predict, at each step, a probability for every token in the vocabulary at the next position. A 'greedy' approach simply selects the most likely token at each step and moves on to the next position. However, greedy decoding often produces repetitive, generic text (see https://arxiv.org/abs/1904.09751).
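
The following sketch illustrates greedy decoding and why it can get stuck. It assumes a toy "model" that is just a hypothetical lookup table of next-token probabilities keyed by the previous token; a real language model conditions on the whole prefix.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "the": {"sun": 0.6, "sky": 0.4},
    "sun": {"is": 0.7, "and": 0.3},
    "sky": {"is": 0.9, "and": 0.1},
    "is": {"the": 0.5, "bright": 0.3, "blue": 0.2},
    "and": {"the": 1.0},
    "bright": {"and": 1.0},
    "blue": {"and": 1.0},
}

def greedy_decode(prompt, max_new_tokens):
    """Append the single most probable next token at each step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens

generated = greedy_decode(["the", "sun", "is"], max_new_tokens=9)
print(" ".join(generated))
```

Because the most probable continuation of "is" leads back to "the sun is", this toy decoder cycles through the same three tokens. Sampling or a repetition penalty are two ways to break such loops.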

To improve the quality of the generated text, several settings can be adjusted:

* repetition_penalty: Tokens that already appear in the sequence are made less likely; values above 1 penalize repetition, and 1.0 applies no penalty.
* do_sample: Instead of always selecting the most probable token, a token is drawn at random from the predicted probability distribution.
    * temperature: The logits are divided by the temperature before the softmax. Values above 1 flatten the distribution, so less likely tokens are selected more often; values below 1 sharpen it.
    * top_k and top_p: Constraints on which tokens can be sampled:
        * top_k: Only the k most probable tokens are allowed.
        * top_p: Only the smallest set of most probable tokens whose cumulative probability reaches p is allowed.
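
The sampling settings in the list above can be sketched in plain Python as follows. This is a simplified illustration on made-up logits, not the code `transformers` runs internally; all function names are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalise(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)

logits = [2.0, 1.0, 0.5, 1.5, -1.0]  # made-up values

base = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # favours the top token more strongly
flat = softmax(logits, temperature=2.0)   # spreads probability more evenly
only_top2 = top_k_filter(base, k=2)
nucleus = top_p_filter(base, p=0.8)

# Draw one token id from the filtered distribution (seeded for repeatability)
rng = random.Random(0)
sampled_index = rng.choices(range(len(logits)), weights=nucleus, k=1)[0]
print(sampled_index)
```

Tokens removed by a filter get probability zero, so they can never be sampled, while the remaining tokens keep their relative proportions.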

In [43]:
prompt = 'The sun is'

output_sequences = model.generate(
    **tokenizer(prompt, return_tensors='pt'),
    max_length = 50, # Maximum total length in tokens, prompt included
    repetition_penalty = None,
    do_sample = True,
    temperature = 0.7,
    top_k = None, # set to None to ignore or 0 < k <= vocab_size
    top_p = None, # set to None to ignore or 0 < p <= 1
    pad_token_id = tokenizer.eos_token_id
)

In [44]:
decoded = tokenizer.decode(output_sequences[0])
print(textwrap.fill(decoded, width = 80, replace_whitespace=False))

The sun is rising at an altitude of about 3,000 meters and the sky is a
beautiful, sunny picture of the day. We are heading northwest for the first time
this winter, and are exploring the North Pole. We have a great view of


### Tasks

Let's experiment with GPT-Neo using different parameter settings and observe their effect on the quality of the generated text. We'll document our observations for each configuration as we go.

**Experiment 1:** `repetition_penalty = None` and `do_sample = False`

With these settings, generation is greedy: the model selects the most probable token at each step and applies no penalty for repetition.
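
As a point of comparison for this experiment, the sketch below shows one way a repetition penalty can change the greedy choice. The rescaling rule used here (divide positive logits by the penalty, multiply negative ones) is our assumption about how such a penalty is typically applied; the function name, logits, and token ids are hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-seen tokens less likely by rescaling their logits.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a penalty above 1 always lowers the score.
    """
    penalised = list(logits)
    for token_id in set(seen_token_ids):
        score = penalised[token_id]
        penalised[token_id] = score * penalty if score < 0 else score / penalty
    return penalised

def argmax(values):
    """Index of the largest value, i.e. the greedy choice."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 1.0, 0.5, 1.5, -1.0]  # made-up values
seen = [0, 4]                        # hypothetical ids already in the text

penalised = apply_repetition_penalty(logits, seen, penalty=1.5)
greedy_before = argmax(logits)
greedy_after = argmax(penalised)
print(greedy_before, greedy_after)
```

Without the penalty, greedy decoding would pick token 0 again; with it, the previously second-ranked token 3 wins.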

**Experiment 2:** `do_sample = True` with varying `temperature` and `top_p`

This setting samples from the predicted probability distribution, which introduces randomness into the generation. The `temperature` and `top_p` parameters control how diverse the generated text is.
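
Before running the experiment, it can help to see how temperature alone changes the spread of a distribution. The sketch below uses Shannon entropy as a rough, assumed proxy for diversity, computed on made-up logits rather than on model output.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means a more spread-out distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 1.5, -1.0]  # made-up values

entropies = {t: entropy_bits(softmax(logits, t)) for t in (0.3, 0.7, 1.0, 1.5)}
for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.2f} bits")
```

Entropy rises with temperature and is bounded by `log2` of the vocabulary size, which is reached only when all tokens are equally likely.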