In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [11]:
model = AutoModelForCausalLM.from_pretrained(
        "sapienzanlp/Minerva-7B-Instruct-v1.0",
        torch_dtype=torch.bfloat16,
    )

tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-7B-Instruct-v1.0")


## Zero-Shot Prompting

A common way to use LLMs today is through prompting. Thanks to their large number of parameters and their instruction fine-tuning, these models can be asked directly to solve a task, with a reasonable expectation that they have the knowledge to solve it.

Zero-shot prompting simply asks the model to solve the task directly, without providing any examples.

In [15]:
messages = [
    {"role": "user", "content": "Il sentimento della seguente frase è positivo, negativo o neutro?\nLa pizza era deliziosa ma il servizio era pessimo."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)


with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=10
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 user 

Il sentimento della seguente frase è positivo, negativo o neutro?
La pizza era deliziosa ma il servizio era pessimo. assistant 

 Il sentimento della frase è negativo.


## Few-Shot Prompting

In contrast to zero-shot prompting, in few-shot prompting (specifically, in-context few-shot) we first provide a number of examples of how we would like the model to solve the task, and then provide the input for which we expect an output.
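
The alternating user/assistant layout used for in-context examples can also be built programmatically. The sketch below assumes a hypothetical helper, `build_few_shot_messages`, which is not part of `transformers`; it only produces the list of message dictionaries that `apply_chat_template` expects.

```python
def build_few_shot_messages(instruction, examples, query):
    """Build a chat-style message list for in-context few-shot prompting.

    `examples` is a list of (input_text, expected_label) pairs.
    Hypothetical helper for illustration; not a transformers API.
    """
    messages = []
    for text, label in examples:
        # Each demonstration is one user turn followed by the desired answer.
        messages.append({"role": "user", "content": f"{instruction}\n{text}"})
        messages.append({"role": "assistant", "content": label})
    # The final user turn has no answer: the model is expected to produce it.
    messages.append({"role": "user", "content": f"{instruction}\n{query}"})
    return messages


few_shot_messages = build_few_shot_messages(
    instruction="Il sentimento della seguente frase è positivo, negativo o neutro?",
    examples=[
        ("La pizza era deliziosa e il servizio era eccellente.", "Positivo"),
        ("La pizza era immangiabile e il servizio era pessimo.", "Negativo"),
    ],
    query="La pizza era buona e il servizio era normale.",
)
```

The resulting list can be passed to `tokenizer.apply_chat_template(...)` exactly like a hand-written `messages` list.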

In [16]:
messages = [
    {"role": "user", "content": "Il sentimento della seguente frase è positivo, negativo o neutro?\nLa pizza era deliziosa ma il servizio era pessimo."},
    {"role": "assistant", "content": "Neutro"},
    {"role": "user", "content": "Il sentimento della seguente frase è positivo, negativo o neutro?\nLa pizza era deliziosa e il servizio era eccellente."},
    {"role": "assistant", "content": "Positivo"},
    {"role": "user", "content": "Il sentimento della seguente frase è positivo, negativo o neutro?\nLa pizza era immangiabile e il servizio era pessimo."},
    {"role": "assistant", "content": "Negativo"},
    {"role": "user", "content": "Il sentimento della seguente frase è positivo, negativo o neutro?\nLa pizza era buona e il servizio era normale."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)


with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=10
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 user 

Il sentimento della seguente frase è positivo, negativo o neutro?
La pizza era deliziosa ma il servizio era pessimo. assistant 

Neutro user 

Il sentimento della seguente frase è positivo, negativo o neutro?
La pizza era deliziosa e il servizio era eccellente. assistant 

Positivo user 

Il sentimento della seguente frase è positivo, negativo o neutro?
La pizza era immangiabile e il servizio era pessimo. assistant 

Negativo user 

Il sentimento della seguente frase è positivo, negativo o neutro?
La pizza era buona e il servizio era normale. assistant 

Positivo


## Decoding methods

While the model internals stay the same, there are several ways to decode the model's logits over the vocabulary, and simply choosing the highest-probability token at every step is not always the best one.

In [17]:
messages = [
    {"role": "user", "content": "Scrivi una frase su di un Robot che impara a dipingere."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

### Greedy Decoding

Greedy decoding selects the **highest probability token** at each step.

```python
generate(..., do_sample=False)
```

`do_sample=False` disables sampling; the model always chooses the most likely next token. There is no randomness, so the output is **deterministic** and **repeatable**. However, it may lack creativity or diversity, and it can get stuck in repetitive loops.
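
To make the determinism and the risk of loops concrete, here is a minimal sketch of greedy decoding over a toy bigram "model". The table `toy_model`, its tokens, and its probabilities are made up for illustration and have nothing to do with Minerva's actual distribution.

```python
# Toy bigram "model": next-token probabilities given the previous token.
# All tokens and numbers are invented for illustration.
toy_model = {
    "il": {"robot": 0.7, "colore": 0.3},
    "robot": {"dipinge": 0.6, "il": 0.4},
    "dipinge": {"il": 0.8, "<eos>": 0.2},
    "colore": {"<eos>": 1.0},
}


def greedy_decode(start, max_new_tokens):
    """Always append the single most likely next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        dist = toy_model[tokens[-1]]
        next_token = max(dist, key=dist.get)  # argmax over the distribution
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


first = greedy_decode("il", 6)
second = greedy_decode("il", 6)
print(first)
```

Both calls return the same sequence, and because the end-of-sequence token is never the most likely choice in this toy table, the output cycles through "il robot dipinge" until the token budget runs out.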

In [20]:
output_greedy = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False
)
tokenizer.decode(output_greedy[0], skip_special_tokens=True)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


' user \n\nScrivi una frase su di un Robot che impara a dipingere. assistant \n\n "Il robot pittore sta imparando ad esprimere la sua creatività attraverso l\'arte della pittura, con ogni pennellata che aggiunge un nuovo strato di colore e bellezza alla tela."'

### Top-k Sampling

Top-k sampling limits the token pool to the **k most likely tokens**, then randomly samples from them.

```python
generate(..., do_sample=True, top_k=50)
```

`do_sample=True` enables sampling. 

`top_k=50`: at each step, only the 50 most likely tokens are considered. One is then randomly selected according to their renormalized probabilities.  
This means that higher $k \rightarrow$ more diversity; lower $k \rightarrow$ more deterministic.
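
The filtering step can be sketched in plain Python. The function `top_k_filter` and the distribution `toy_probs` are hypothetical and only illustrate the idea; the real implementation in `transformers` works on logit tensors rather than on lists of probabilities.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [probs[i] / total if i in top else 0.0 for i in range(len(probs))]


toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up next-token distribution
filtered = top_k_filter(toy_probs, k=2)  # only tokens 0 and 1 survive

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = rng.choices(range(len(filtered)), weights=filtered, k=10)
```

With `k=1` the filter leaves a single candidate, so sampling reduces to greedy decoding.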

In [26]:
output_topk = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    temperature=0.9
)
tokenizer.decode(output_topk[0], skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


' user \n\nScrivi una frase su di un Robot che impara a dipingere. assistant \n\n Il robot è in grado di sviluppare abilità sempre più avanzate nella pittura, grazie alla sua capacità di apprendere e perfezionare le tecniche di disegno e colore.'

### Top-p (Nucleus) Sampling

Top-p sampling dynamically chooses the **smallest set of tokens whose cumulative probability ≥ p**, then samples from them.

```python
generate(..., do_sample=True, top_p=0.9)
```
`top_p=0.9`: keeps only the most likely tokens that together make up at least 90% of the probability mass. Again, `do_sample=True` enables sampling.
This is more adaptive than top-k: the number of candidate tokens changes with the shape of the distribution.
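
The adaptive behaviour is easy to see on two toy distributions. The sketch below uses a hypothetical `top_p_filter` function and invented probabilities; it is not the `transformers` implementation, which operates on logit tensors.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


peaked = [0.92, 0.05, 0.02, 0.01]  # the model is confident
flat = [0.3, 0.25, 0.25, 0.2]      # the model is uncertain

kept_peaked = sum(1 for x in top_p_filter(peaked, 0.9) if x > 0)
kept_flat = sum(1 for x in top_p_filter(flat, 0.9) if x > 0)
print(kept_peaked, kept_flat)
```

With the same `p`, the peaked distribution keeps a single candidate while the flat one keeps all four, whereas top-k would keep the same number in both cases.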

In [23]:
output_topp = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.9
)
tokenizer.decode(output_topp[0], skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


' user \n\nScrivi una frase su di un Robot che impara a dipingere. assistant \n\n "Il robot sta imparando ad utilizzare pennelli e colori per creare opere d\'arte uniche ed emozionanti."'

### Temperature Sampling

Temperature controls the **sharpness** of the probability distribution used for sampling.

```python
generate(..., do_sample=True, temperature=1.2)
```
How does temperature affect generation? The logits are divided by the temperature before the softmax is applied:

`temperature=1.0`: Default setting (no change).

`temperature > 1.0`: Flattens the distribution, meaning more **randomness**.

`temperature < 1.0`: Sharpens the distribution, meaning more **deterministic** output.

Temperature works best when combined with `top_k` or `top_p`.
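
The effect on the distribution can be checked numerically. This is a minimal sketch with a hypothetical helper, `softmax_with_temperature`, and made-up logits; it only shows how dividing the logits by the temperature changes the resulting probabilities.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens

sharp = softmax_with_temperature(toy_logits, 0.5)    # temperature < 1
default = softmax_with_temperature(toy_logits, 1.0)  # unchanged
flat = softmax_with_temperature(toy_logits, 2.0)     # temperature > 1

for name, dist in [("T=0.5", sharp), ("T=1.0", default), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

The probability of the most likely token grows as the temperature drops below 1 and shrinks as it rises above 1, while the ranking of the tokens never changes.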

In [24]:
output_temp = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=1.2
)
tokenizer.decode(output_temp[0], skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


" user \n\nScrivi una frase su di un Robot che impara a dipingere. assistant \n\n Un robot sta imparando a dipingere, cercando di catturare l'emozione e la maestria dei grandi artisti attraverso la sua creazione artistica."

### Beam Search

Beam search keeps **multiple candidate sequences (beams)** at each step and expands them in parallel.

```python
generate(..., num_beams=5, early_stopping=True, no_repeat_ngram_size=2)
```

`num_beams=5`: at every decoding step, the 5 highest-scoring candidate sequences (beams) are kept, and auto-regressive generation continues for all 5.

`early_stopping=True`: stops the search as soon as `num_beams` finished candidates are available, instead of continuing to look for better ones.

`no_repeat_ngram_size=2`: prevents any 2-gram from appearing more than once in the output (for example, "il robot" cannot occur twice).

Beam search tends to find sequences with a higher overall probability than greedy decoding and, like greedy decoding, it is deterministic unless combined with `do_sample=True`. In the end, the sequence with the best score is chosen:
$$\text{Score} = \frac{\log P(\text{sequence})}{\text{sequence length}^{\text{length penalty}}}$$

The score of each beam is the sum of the log-probabilities of the tokens in the sequence, normalized by its length.

The `length_penalty` parameter can be passed to the `generate(...)` method. The default value is $1.0$, which divides the total log-probability by the sequence length, so beams are compared by their average per-token log-probability. Because log-probabilities are negative, higher values favor longer sequences, while lower values favor shorter ones. With `length_penalty=0`, the length of the sequence is not taken into account when scoring the beams.
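
The score formula above can be tried on two hand-made beams. The function `beam_score` and the log-probabilities are hypothetical; the exact length used by `transformers` (for instance, whether prompt tokens are counted) may differ between versions.

```python
def beam_score(token_log_probs, length_penalty=1.0):
    """Sum of token log-probabilities divided by length ** length_penalty."""
    return sum(token_log_probs) / (len(token_log_probs) ** length_penalty)


# Two made-up finished beams.
short_beam = [-0.9, -0.9]             # 2 tokens, total log-prob -1.8
long_beam = [-0.5, -0.5, -0.5, -0.5]  # 4 tokens, total log-prob -2.0

# length_penalty=0: the raw total log-probability decides.
raw = (beam_score(short_beam, 0.0), beam_score(long_beam, 0.0))

# length_penalty=1: the average per-token log-probability decides.
normalized = (beam_score(short_beam, 1.0), beam_score(long_beam, 1.0))

print(raw, normalized)
```

Without length normalization the short beam wins, because every extra token adds a negative term to the sum. With `length_penalty=1.0` the long beam wins, because its tokens are individually more likely.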

In [25]:
output_beam = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2
)
tokenizer.decode(output_beam[0], skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


' user \n\nScrivi una frase su di un Robot che impara a dipingere. assistant \n\n "Il robot pittore sta imparando ad esprimere la sua creatività attraverso l\'uso di pennelli e colori, creando opere d\'arte uniche e originali."'

## Perplexity Extraction

We'll now see how to extract the perplexity of a model generation. Perplexity is defined as:
$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(x_i \mid x_{<i})\right)$$

Where $N$ is the number of tokens and $x_i$ is the token in position $i$. In general, a **lower** perplexity means a better prediction. Perplexity quantifies the model's uncertainty: mathematically, it is the exponentiated average negative log-likelihood of the predicted tokens.
But what does this mean? We take the logarithm of the predicted probabilities (which turns the product of token probabilities into a sum), compute their average over the sequence, negate it, and then apply the exponential to undo the logarithm. This yields the inverse of the geometric mean of the token probabilities, a number that can be interpreted as the **effective number of equally likely choices** the model had per token.

So, for example, a perplexity of 10 means that, on average, the model is as uncertain as if it had to choose uniformly at random among 10 equally likely tokens of its vocabulary at each generation step.
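
The "equally likely choices" reading can be verified with a few lines of Python. The `perplexity` function below is a hypothetical helper that takes the probability assigned to each generated token; the input values are invented.

```python
import math


def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the given tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Every token was chosen from 10 equally likely options (probability 0.1).
uniform_ppl = perplexity([0.1] * 20)

# The model was confident about every token.
confident_ppl = perplexity([0.9, 0.8, 0.95])

print(f"{uniform_ppl:.2f} {confident_ppl:.2f}")
```

A sequence whose tokens each had probability 0.1 gets a perplexity of 10, regardless of its length, while confidently predicted tokens push the value towards its minimum of 1.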

While useful, perplexity has limitations: it only measures *token-level probability accuracy*, not semantic coherence, relevance, or factual correctness. A model can have low perplexity while still generating irrelevant or nonsensical outputs. It also tends to penalize rare but valid completions, which can discourage creative or diverse generation in open-ended tasks.


In [27]:
messages = [
    {"role": "user", "content": "Continua la seguente frase: 'Il robot si guardò intorno e vide"}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

In [38]:
import math

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    return_dict_in_generate=True, # to return scores
    output_scores=True
)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [39]:
decoded_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
generated_ids = outputs.sequences[0][inputs["input_ids"].shape[1]:]  # generated tokens only
scores = outputs.scores  # tuple of processed scores (one tensor per generation step)

# Compute log probs for each generated token
log_probs = []
for i, score in enumerate(scores):
    logits = score[0]                     # shape: [vocab_size]
    probs = torch.nn.functional.log_softmax(logits, dim=-1)
    token_id = generated_ids[i]
    log_probs.append(probs[token_id].item())

# Average NLL and perplexity
nll = -sum(log_probs) / len(log_probs)
perplexity = math.exp(nll)
print(f"Decoded text: {decoded_text} has a Perplexity of {perplexity:.2f}")

Decoded text:  user 

Continua la seguente frase: 'Il robot si guardò intorno e vide assistant 

 che c'era un altro robot vicino a lui.' has a Perplexity of 2.20


In [32]:
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=2.5,
    return_dict_in_generate=True, # to return scores
    output_scores=True
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [37]:
decoded_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
generated_ids = outputs.sequences[0][inputs["input_ids"].shape[1]:]  # generated tokens only
scores = outputs.scores  # processed scores per step (here: logits after temperature scaling)

# Compute log probs for each generated token
log_probs = []
for i, score in enumerate(scores):
    logits = score[0]                     # shape: [vocab_size]
    probs = torch.nn.functional.log_softmax(logits, dim=-1)
    token_id = generated_ids[i]
    log_probs.append(probs[token_id].item())

# Average NLL and perplexity
nll = -sum(log_probs) / len(log_probs)
perplexity = math.exp(nll)
print(f"Decoded text: {decoded_text} has a Perplexity of {perplexity:.2f}")

Decoded text:  user 

Continua la seguente frase: 'Il robot si guardò intorno e vide assistant 

 nessun uomo presente.' Ciò implica l'inanimata non presenza ma la condizione potrebbe essere temporaneamente sostituita inserendo nella lista un attore chiave se richiesto e gradito dalle diverse condizioni operative nel quadro contestativo previsto nei dettagli dal compito. Naturalmente i parametri vanno discre has a Perplexity of 41.55
