<a href="https://colab.research.google.com/github/jansoe/Inno/blob/main/TextGeneration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Causal Language Modelling

#### Preparation

In [1]:
!pip install transformers -q

In [2]:
import transformers
import torch
import textwrap
import plotly.express as px

## Choice of Model

In [3]:
model_version = 'EleutherAI/gpt-neo-125m'
#model_version = 'dbmdz/german-gpt2'
#model_version = 'flax-community/nordic-gpt-wiki'

## Model Input: Tokens
To process a text with a Transformer neural network, the text has to be split into **tokens**.

In [4]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_version)

Each token is some frequently appearing combination of letters:

In [5]:
tokens = tokenizer.tokenize('This is a test of the tokenizer')
tokens

['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġof', 'Ġthe', 'Ġtoken', 'izer']

In the output above, the `Ġ` prefix marks a token that starts with a space. Each token has a unique ID:

In [7]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[1212, 318, 257, 1332, 286, 262, 11241, 7509]

Token IDs can be converted back to tokens

In [8]:
tokenizer.convert_ids_to_tokens(token_ids)

['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġof', 'Ġthe', 'Ġtoken', 'izer']

or directly to text

In [9]:
tokenizer.decode(token_ids)

'This is a test of the tokenizer'

The smallest tokens are single characters. They are used when part of a word does not match any known letter combination:

In [11]:
tokenizer.tokenize('n0th1ng is corect')

['n', '0', 'th', '1', 'ng', 'Ġis', 'Ġcore', 'ct']

In [12]:
tokenizer.tokenize('Ein Satz auf deutsch')

['E', 'in', 'ĠSat', 'z', 'Ġa', 'uf', 'Ġde', 'utsch']

In [13]:
len(tokenizer.vocab)

50257
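
The fallback to single characters can be illustrated with a toy tokenizer. This is only a minimal sketch and not the byte-level BPE algorithm that the real tokenizer uses: it greedily takes the longest matching entry of a small, invented vocabulary. The names `TOY_VOCAB`, `toy_tokenize`, `toy_encode` and `toy_decode` are hypothetical and not part of `transformers`.

```python
# Toy vocabulary: a few multi-letter pieces plus single characters as fallback.
# The entries are invented for illustration, not taken from the real vocabulary.
TOY_VOCAB = ["this", "is", "test", "th", "ng", " "] + list(
    "abcdefghijklmnopqrstuvwxyz0123456789"
)
TOY_IDS = {token: i for i, token in enumerate(TOY_VOCAB)}


def toy_tokenize(text):
    """Split text by always taking the longest vocabulary entry that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; single characters are tried last.
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in TOY_IDS:
                tokens.append(piece)
                pos += length
                break
        else:
            raise ValueError(f"no token for character {text[pos]!r}")
    return tokens


def toy_encode(text):
    """Map text to a list of token ids."""
    return [TOY_IDS[token] for token in toy_tokenize(text)]


def toy_decode(ids):
    """Map token ids back to text."""
    return "".join(TOY_VOCAB[i] for i in ids)


print(toy_tokenize("this is"))   # ['this', ' ', 'is']
print(toy_tokenize("n0th1ng"))   # ['n', '0', 'th', '1', 'ng']
```

Known words map to few tokens, while an unusual spelling is broken into many short pieces, which mirrors the behaviour of the real tokenizer shown above.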

## Model Output: Next Token Probability

We first initialize the model

In [114]:
model = transformers.AutoModelForCausalLM.from_pretrained(model_version)

Then the input sentence is converted to token ids

In [115]:
#prompt = "The sun is"
prompt = "Answer the question: Is Peter a male or female name? Answer: Peter is a male name."

ids = tokenizer(prompt, return_tensors='pt')['input_ids']
ids

tensor([[33706,   262,  1808,    25,  1148,  5613,   257,  4257,   393,  4048,
          1438,    30, 23998,    25,  5613,   318,   257,  4257,  1438,    13]])

and processed by the model

In [116]:
with torch.no_grad():
    out = model.forward(ids)

The model outputs a score (logit) for each token in the vocabulary. A softmax turns these scores into the probability of each token being the next one in the sentence:

In [125]:
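# Softmax over the vocabulary dimension turns the logits into probabilities
probabilities = torch.softmax(out.logits, dim=-1).numpy().squeeze()

# Indices of the 10 most probable next tokens, most probable first
top10 = probabilities[-1].argsort()[:-11:-1]
decoded_most_probable = [repr(tokenizer.decode(i)) for i in top10]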

px.bar(
    x = decoded_most_probable,
    y = probabilities[-1][top10],
    title = 'Probability of the 10 most likely next tokens'
)
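
The two steps in the cell above, softmax and picking the most probable entries, can be sketched in plain Python. The scores and candidate tokens below are invented for illustration; `softmax` and `top_indices` are hypothetical helper names, not library functions.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_indices(values, n):
    """Indices of the n largest values, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:n]


# Made-up scores for five made-up candidate tokens.
toy_tokens = [" shining", " a", " the", " not", " up"]
toy_logits = [2.0, 0.5, 1.0, -1.0, 0.0]

probs = softmax(toy_logits)
for i in top_indices(probs, 3):
    print(f"{toy_tokens[i]!r}: {probs[i]:.3f}")
```

`top_indices(probs, 10)` plays the same role as `argsort()[:-11:-1]` in the notebook cell: it returns the positions of the ten largest probabilities in descending order.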

For text generation, the model predicts the probability of the next token. However, always choosing the most probable token often leads to very repetitive texts.
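
Why greedy selection tends to loop can be seen in a toy example. The sketch below assumes a hypothetical "model" (`TOY_MODEL`) that only looks at the previous token and whose probabilities are invented; a real language model conditions on the whole preceding text, but the effect is similar.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
# All values are invented for illustration.
TOY_MODEL = {
    "The": {"sun": 0.6, "stars": 0.4},
    "sun": {"is": 0.9, "was": 0.1},
    "is": {"shining": 0.5, "bright": 0.3, "hot": 0.2},
    "shining": {".": 0.7, "brightly": 0.3},
    ".": {"The": 0.8, "It": 0.2},
}


def greedy_decode(start, steps):
    """Always append the most probable next token."""
    tokens = [start]
    for _ in range(steps):
        candidates = TOY_MODEL[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_decode("The", 10)))
# The sun is shining . The sun is shining . The
```

Because the same context always yields the same choice, the sequence returns to "The" and repeats from there. The less probable alternatives are never explored.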

To improve text generation quality, different strategies exist:
- `repetition_penalty`: Tokens that have been chosen before are less likely to be chosen again.
- `do_sample`: Instead of always taking the token with the highest probability, a token is drawn at random according to its probability.
  - `temperature`: The higher the temperature, the more likely even low-probability tokens are chosen.
  - `top_k` and `top_p`: Do not draw from all tokens, but only
    - `top_k`: from the k most probable tokens
    - `top_p`: from the most probable tokens whose combined probability reaches p (a value between 0 and 1)

A more detailed explanation can be found [here](https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277).
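
The sampling parameters can be sketched in plain Python on invented scores. This is a simplified illustration, not the library's implementation: the helpers `softmax`, `filter_top_k` and `filter_top_p` are hypothetical names, and they operate on probabilities for readability, whereas an actual implementation may filter the logits before the softmax.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Probabilities from scores; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out all entries not in keep and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_k(probs, k):
    """Keep only the k most probable entries."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def filter_top_p(probs, p):
    """Keep the smallest set of most probable entries whose total reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # invented scores for five tokens
rng = random.Random(10)                  # fixed seed for reproducibility

probs = filter_top_k(softmax(toy_logits, temperature=0.7), k=3)
choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print("sampled token index:", choice)
```

With `k=3`, only the three highest-scoring entries can ever be drawn. Lowering the temperature shifts probability towards the top entry, and raising it spreads probability more evenly.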

In [175]:
prompt = 'The sun is'

torch.manual_seed(10) # the seed controls the randomness

output_sequences = model.generate(
    **tokenizer(prompt, return_tensors='pt'),
    max_length = 128,
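    repetition_penalty = None, # set to None to ignore or to a value > 1 to penalize repeats
    do_sample = False, # False: always pick the most probable token (greedy)
    temperature = 0.7, # only used when do_sample = True
    top_k = 10, # only used when do_sample = True; set to None to ignore or 0 < k <= vocab_size
    top_p = None, # only used when do_sample = True; set to None to ignore or 0 < p <= 1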
    pad_token_id = tokenizer.eos_token_id
)

In [176]:
output_sequences

tensor([[  464,  4252,   318, 22751,   319,   262,   995,    11,   290,   262,
          5788,   389, 22751,   319,   262,   995,    13,   198,   198,   464,
          4252,   318, 22751,   319,   262,   995,    11,   290,   262,  5788,
           389, 22751,   319,   262,   995,    13,   198,   198,   464,  4252,
           318, 22751,   319,   262,   995,    11,   290,   262,  5788,   389,
         22751,   319,   262,   995,    13,   198,   198,   464,  4252,   318,
         22751,   319,   262,   995,    11,   290,   262,  5788,   389, 22751,
           319,   262,   995,    13,   198,   198,   464,  4252,   318, 22751,
           319,   262,   995,    11,   290,   262,  5788,   389, 22751,   319,
           262,   995,    13,   198,   198,   464,  4252,   318, 22751,   319,
           262,   995,    11,   290,   262,  5788,   389, 22751,   319,   262,
           995,    13,   198,   198,   464,  4252,   318, 22751,   319,   262,
           995,    11,   290,   262,  5788,   389, 2

In [177]:
decoded = tokenizer.decode(output_sequences[0])
print(textwrap.fill(decoded, width = 80, replace_whitespace=False))

The sun is shining on the world, and the stars are shining on the world.

The
sun is shining on the world, and the stars are shining on the world.

The sun is
shining on the world, and the stars are shining on the world.

The sun is
shining on the world, and the stars are shining on the world.

The sun is
shining on the world, and the stars are shining on the world.

The sun is
shining on the world, and the stars are shining on the world.

The sun is
shining on the world, and the stars are shining on
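
The output above loops because nothing discourages the model from picking the same tokens again. A repetition penalty lowers the scores of tokens that already occur in the text. The sketch below assumes a commonly used rule (divide positive scores by the penalty, multiply negative scores by it); treat the exact rule as an assumption about how `repetition_penalty` works rather than a description of the library's code. The function name and the scores are invented.

```python
def apply_repetition_penalty(logits, previous_ids, penalty):
    """Lower the score of every token id that already occurs in previous_ids."""
    penalized = list(logits)
    for token_id in set(previous_ids):
        score = penalized[token_id]
        # For penalty > 1, dividing a positive score and multiplying a
        # negative score both make the token less likely.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


toy_logits = [3.0, 2.5, -1.0, 0.5]  # invented scores for four token ids
previous_ids = [0, 2, 0]            # token ids generated so far

print(apply_repetition_penalty(toy_logits, previous_ids, penalty=1.5))
# [2.0, 2.5, -1.5, 0.5]
```

Before the penalty, token 0 has the highest score. Afterwards token 1 does, so even greedy selection would move on to a token that has not been used yet.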


### Task: Parameter Exploration

Experiment with the chosen model for text generation. How can you obtain "good" texts?


- What do you observe with `repetition_penalty = None` and `do_sample = False`?
- What happens with `do_sample = True` and different values of `temperature`?
  - Repeat the same with some values for `top_k`. What changes?
- What happens if you start with `prompt = '<|endoftext|>'`?