# 2 Hello, Llama: generation parameters  

This example shows how to adjust the generation parameters of the Llama model. These parameters control the sampling behavior and determine whether the output is more diverse or more focused.

#### Import required libraries

In [1]:
import transformers
import torch
import os
import time
import json

#### Choose the model you want to use

The model can be downloaded from Hugging Face, for example from https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct. You can clone the repository locally after creating a Hugging Face account and accepting Meta's license terms.  

_Note: you can also configure the transformers library to download the model without cloning the repository manually._

In [8]:
# change the following folder to point to the path where you have stored the model you want to use
base_folder = "FILL_WITH_BASE_FOLDER" # Example: "C:/Users/username/Documents/HuggingFace"

model_name = "Llama-3.2-3B-Instruct"

# set the model id
model_id = os.path.join(base_folder, model_name)
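
As the note above mentions, a local clone is optional: `transformers.pipeline` also accepts a Hub repository id as `model` and downloads the files on first use. The sketch below falls back to the Hub id when the local folder is missing. The helper name `resolve_model_id` is hypothetical, and the sketch assumes you are already authenticated for the gated repository (for example through `huggingface-cli login` or the `HF_TOKEN` environment variable).

```python
import os

# Hub repository id, taken from the URL given above.
HUB_ID = "meta-llama/Llama-3.2-3B-Instruct"


def resolve_model_id(base_folder, model_name, hub_id=HUB_ID):
    """Return the local model folder if it exists, otherwise the Hub id.

    Hypothetical helper: it only chooses a string; the download itself
    happens later, when the pipeline is built with that string.
    """
    local_path = os.path.join(base_folder, model_name)
    return local_path if os.path.isdir(local_path) else hub_id


# With the placeholder folder left unchanged, this normally yields the Hub id.
model_id = resolve_model_id("FILL_WITH_BASE_FOLDER", "Llama-3.2-3B-Instruct")
print(model_id)
```

The resulting `model_id` can be passed to the pipeline in the next step in place of the local path.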

#### Build the transformers pipeline on the CPU device

In [3]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.float32},
    device_map="cpu",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Define message to be processed by Llama  

__System prompt__: a predefined instruction, generally set by the backend, that determines how the LLM (Large Language Model) behaves.  
__User prompt__: the input provided by the user who interacts with the LLM, often a query or command that the model needs to respond to.

In [4]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Say hello to the world"},
]
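
To make the two roles concrete, the sketch below renders a message list into a single prompt string in the Llama 3 style. It is a simplified illustration only: the function name `render_llama3_prompt` is hypothetical, and the actual rendering is done by the chat template shipped with the model's tokenizer, which may add further content (such as date information) to the system turn.

```python
def render_llama3_prompt(messages):
    """Simplified rendering of chat messages into a Llama 3 style prompt."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header followed by the content and an end marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # An open assistant header asks the model to write the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Say hello to the world"},
]
prompt = render_llama3_prompt(example_messages)
print(prompt)
```

When a list of messages is passed to the pipeline, this conversion happens internally, so the notebook never needs to build the string by hand.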

#### How to tune Processing  

- The `max_new_tokens` parameter sets the maximum number of tokens the model is allowed to generate in its response.
- The `temperature` _(typical range: just above 0 to 1.5)_ parameter controls the randomness of the model's output. A lower temperature (closer to 0) produces more deterministic responses, while a higher temperature results in more creative and random outputs. The transformers library requires a strictly positive value, so a temperature of exactly 0 is rejected; a very small value such as 0.01 makes the output nearly deterministic.
- The `top_k` _(Range 1 to ∞)_ parameter limits the selection of possible next tokens to the k most probable ones. A higher value gives the model more options, leading to more diverse responses. If k is set to 1, the model always selects the most probable token, which removes variability from the generated output.
- The `top_p` _(Range 0 to 1)_ parameter (nucleus sampling) keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p` and discards the rest. A lower value focuses the model on the most probable tokens, while a value of 1.0 allows all tokens to be considered, resulting in more randomness in the output. With a value of 0 only the single most probable token is kept.
- The `repetition_penalty` parameter penalizes repetition. A value of 1.0 applies no penalty; a value greater than 1.0 (e.g., 1.2) encourages the model to avoid repeating the same phrases or tokens, and higher values discourage repetition more strongly.
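
The effect of these parameters can be sketched on a toy distribution over four tokens. The code below is a simplified pure-Python illustration, not the library's implementation: all function names and the toy logits are assumptions made for this example, and the filters work on probabilities rather than on the raw logits the library manipulates.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero every token outside `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_k(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def filter_top_p(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)  # the most probable token is always kept
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return renormalize(probs, keep)


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the logits of tokens that already appeared in the sequence."""
    return [
        (x / penalty if x > 0 else x * penalty) if i in seen_ids else x
        for i, x in enumerate(logits)
    ]


def sample_next(logits, temperature, top_k, top_p, rng):
    """Sample one token id after temperature scaling, top-k, then top-p."""
    probs = softmax(logits, temperature)
    probs = filter_top_k(probs, top_k)
    probs = filter_top_p(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.5, -1.0]  # made-up scores for four tokens
rng = random.Random(0)

# Same settings as the deterministic cell: only token 0 can ever be chosen.
deterministic = [sample_next(toy_logits, 0.01, 1, 0.0, rng) for _ in range(5)]
# Same settings as the creative cell: several tokens remain candidates.
creative = [sample_next(toy_logits, 0.9, 50, 0.9, rng) for _ in range(5)]
penalized = apply_repetition_penalty(toy_logits, {0, 3}, 1.2)

print("deterministic:", deterministic)  # always [0, 0, 0, 0, 0]
print("creative:     ", creative)
print("penalized:    ", penalized)
```

With `top_k=1` the candidate set shrinks to a single token, so the other sampling settings no longer matter; the creative settings leave several candidates, so repeated runs with different seeds can produce different tokens.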

####  Processing with deterministic tuning  

Run the following code more than once to verify that the model always generates the same response.

In [5]:
# Start the timer
start_time = time.time()


# Run the pipeline 
# Generate a response with the pipeline, including custom generation parameters
outputs = pipeline(
    messages,
    max_new_tokens=256,
    temperature=0.01,
    top_k=1,
    top_p=0,
    repetition_penalty=1.0
)

# End the timer
end_time = time.time()

# Calculate processing time
processing_time = end_time - start_time

# Print the processing time
print(f"Processing completed.\nInference time: {processing_time:.2f} seconds.\n")

# print only generated text

output_text_deterministic = (outputs[0]["generated_text"][-1]['content']).strip()

print(f"The hello from the Pirate Chatbot:\n--------------------------------------\n{output_text_deterministic}\n--------------------------------------")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Processing completed.
Inference time: 77.86 seconds.

The hello from the Pirate Chatbot:
--------------------------------------
Yer lookin' fer a hearty "hello" to the world, eh? Alright then, matey! *pounds chest with fist* HEY THERE, MATEYS! Arrrr, 'tis a grand day to be sailin' the seven seas, don't ye think? So hoist the colors, me hearties, and let's set sail fer a swashbucklin' adventure!
--------------------------------------


####  Processing with more creative tuning 

Run the following code more than once to see that the model (usually) generates different responses.

In [6]:
# Start the timer
start_time = time.time()


# Run the pipeline 
# Generate a response with the pipeline, including custom generation parameters
outputs = pipeline(
    messages,
    max_new_tokens=256,
    temperature=0.9,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.0
)

# End the timer
end_time = time.time()

# Calculate processing time
processing_time = end_time - start_time

# Print the processing time
print(f"Processing completed.\nInference time: {processing_time:.2f} seconds.\n")

# print only generated text

output_text_creative = (outputs[0]["generated_text"][-1]['content']).strip()

print(f"The hello from the Pirate Chatbot:\n--------------------------------------\n{output_text_creative}\n--------------------------------------")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Processing completed.
Inference time: 28.22 seconds.

The hello from the Pirate Chatbot:
--------------------------------------
Yer lookin' fer a hearty "hello", eh? Alright then, matey! Yer gettin' a swashbucklin' "HELLO, WORLD!" from yer scurvy dog o' a chatbot! Arrr!
--------------------------------------
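
The two cells above repeat the same timing and text-extraction steps. One possible way to factor them out is sketched below. The helper name `timed_generate` is hypothetical, and `fake_pipeline` is a stand-in that only mimics the output structure used in this notebook so that the sketch runs without loading the model.

```python
import time


def timed_generate(pipe, messages, **generation_kwargs):
    """Run a text-generation pipeline and return (reply_text, seconds).

    Assumes the output structure used in this notebook: a list whose first
    item holds the whole conversation under "generated_text", with the
    assistant reply as the last message.
    """
    start = time.perf_counter()
    outputs = pipe(messages, **generation_kwargs)
    elapsed = time.perf_counter() - start
    text = outputs[0]["generated_text"][-1]["content"].strip()
    return text, elapsed


def fake_pipeline(messages, **generation_kwargs):
    """Stand-in for the real pipeline: appends a fixed assistant reply."""
    reply = {"role": "assistant", "content": "  Ahoy, world!  "}
    return [{"generated_text": list(messages) + [reply]}]


text, seconds = timed_generate(
    fake_pipeline,
    [{"role": "user", "content": "Say hello to the world"}],
    max_new_tokens=256,
    temperature=0.9,
    top_k=50,
    top_p=0.9,
)
print(f"Inference time: {seconds:.2f} seconds.")
print(text)
```

Passing the notebook's `pipeline` object and `messages` list instead of the stand-ins would run the real model with the same keyword arguments.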


#### Check the difference

In [7]:
print(f"Text generated with deterministic tuning:\n{output_text_deterministic}")
print("\n-----------------------------------------------------------\n")
print(f"Text generated with more creative tuning: \n{output_text_creative}")

Text generated with deterministic tuning:
Yer lookin' fer a hearty "hello" to the world, eh? Alright then, matey! *pounds chest with fist* HEY THERE, MATEYS! Arrrr, 'tis a grand day to be sailin' the seven seas, don't ye think? So hoist the colors, me hearties, and let's set sail fer a swashbucklin' adventure!

-----------------------------------------------------------

Text generated with more creative tuning: 
Yer lookin' fer a hearty "hello", eh? Alright then, matey! Yer gettin' a swashbucklin' "HELLO, WORLD!" from yer scurvy dog o' a chatbot! Arrr!
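
Beyond reading the two texts side by side, the difference can be quantified. The sketch below uses the standard-library `difflib` module to compute a similarity ratio and a word-level diff. The helper names are hypothetical, and the two sample strings are only the opening sentences of the outputs shown above.

```python
import difflib


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def word_diff(a, b):
    """List the words removed from a ("- ") and the words added in b ("+ ")."""
    changes = difflib.ndiff(a.split(), b.split())
    return [c for c in changes if c.startswith(("- ", "+ "))]


deterministic_sample = 'Yer lookin\' fer a hearty "hello" to the world, eh?'
creative_sample = 'Yer lookin\' fer a hearty "hello", eh?'

print(f"Similarity: {similarity(deterministic_sample, creative_sample):.2f}")
for change in word_diff(deterministic_sample, creative_sample):
    print(change)
```

In the notebook, the same helpers could be applied to `output_text_deterministic` and `output_text_creative`.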
