<a href="https://colab.research.google.com/github/mbb15761/SEIS767/blob/main/Fall2025_SEIS767_hw06_YANNI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 6 Prompt Engineering


## Using Text Generation Models


### Choosing a Text Generation Model
Foundation models (available sizes):

Llama (7B/13B/33B/65B), StableLM (3B/7B), Falcon (7B/40B/180B), Llama 2 (7B/13B/70B), Mistral (7B)

### Loading a Text Generation Model


In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [2]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
    use_cache=False,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!


Device set to use cuda
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [3]:
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

output = pipe(messages)
print(output[0]["generated_text"])



 Why did the chicken join the band? Because it had the drumsticks!


In [4]:
print(output)

[{'generated_text': ' Why did the chicken join the band? Because it had the drumsticks!'}]


In [5]:
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny joke about chickens.<|end|>
<|endoftext|>
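
The printed prompt shows how the chat template wraps each message in role and end markers. A minimal sketch of that formatting, inferred only from the output above (the real template ships with the tokenizer and may differ in details such as whitespace; `render_chat` is a hypothetical helper, not part of `transformers`):

```python
def render_chat(messages):
    """Approximate the printed Phi-3 chat format: role tag, content, end marker."""
    parts = [
        f"<|{message['role']}|>\n{message['content']}<|end|>\n"
        for message in messages
    ]
    # The printed prompt ends with an end-of-text marker
    return "".join(parts) + "<|endoftext|>"


demo_messages = [{"role": "user", "content": "Create a funny joke about chickens."}]
rendered = render_chat(demo_messages)
print(rendered)
```

In practice `tokenizer.apply_chat_template` should be used, since it stays in sync with the model's own template.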


### Controlling Model Output
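* `temperature`: controls the randomness of the output; higher values flatten the token distribution, while values near 0 make the output almost deterministic
* `top_p`: nucleus sampling; the LLM samples only from the smallest set of most likely tokens whose cumulative probability reaches `top_p` (`top_p=1` considers all tokens)
* `top_k`: the LLM samples only from the `k` most likely tokens
* These settings only take effect when `do_sample=True`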
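
To make the three parameters concrete, here is a rough sketch of how they reshape a next-token distribution. The vocabulary and scores are invented for illustration, and the helper functions are hypothetical; the actual implementation inside `transformers` differs in detail.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_p(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


# Hypothetical vocabulary and scores, for illustration only
vocab = ["drumsticks", "road", "egg", "coop"]
logits = [2.0, 1.0, 0.5, 0.1]

sharp = softmax(logits, temperature=0.5)  # concentrates mass on the top token
flat = softmax(logits, temperature=2.0)   # spreads mass more evenly
top2 = filter_top_k(softmax(logits), k=2)
nucleus = filter_top_p(softmax(logits), top_p=0.8)

rng = random.Random(0)  # seeded so the sketch is repeatable
sampled = rng.choices(vocab, weights=nucleus, k=1)[0]
print(sampled)
```

With `do_sample=False` the pipeline skips sampling and always picks the most likely token, which is why the first joke above is the same on every run.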

In [6]:
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

 Why do chickens never stop talking? Because they have nothing better to do!


I get a different joke each time I rerun the code above.

In [7]:
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 Why did the chicken stop in the road? Because it wanted to take a break and do some "pecking"!


## Intro to Prompt Engineering
### In-context learning: providing examples


In [8]:
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|user|>
A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:<|end|>
<|assistant|>
I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.<|end|>
<|user|>
To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:<|end|>
<|endoftext|>


In [9]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 During the medieval reenactment, the knight skillfully screeged the wooden target, impressing the onlookers with his prowess.
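
The one-shot prompt above follows a regular pattern: each example becomes a user turn followed by an assistant turn, and the real question comes last. A small sketch of a helper that builds such prompts for any number of examples (`build_few_shot_messages` is a hypothetical name, not a library function):

```python
def build_few_shot_messages(examples, question):
    """Interleave (question, answer) examples as user/assistant turns, then add the new question."""
    messages = []
    for example_question, example_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. "
        "An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    )
]
few_shot = build_few_shot_messages(
    examples,
    "To 'screeg' something is to swing a sword at it. "
    "An example of a sentence that uses the word screeg is:",
)
print([message["role"] for message in few_shot])
```

The resulting list has the same shape as `one_shot_prompt`, so it could be passed to `pipe` in the same way.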


### Chain Prompting: breaking up the problem

In [10]:
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

 Name: ChatSage
Slogan: "Your AI Companion for Smart Conversations"


In [11]:
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]
outputs = pipe(sales_prompt)
sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

 Introducing ChatSage, your AI Companion for Smart Conversations. With ChatSage, you'll have a personalized and intelligent assistant at your fingertips, ready to engage in meaningful dialogue, provide helpful information, and enhance your daily interactions. Experience the future of communication with ChatSage – your smart conversation partner.
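
The two cells above form a chain: the output of the first prompt is pasted into the second. A minimal sketch of that pattern as a reusable loop, assuming a `generate` callable that maps a prompt string to a response string (`run_chain` and `fake_generate` are hypothetical; the stand-in lets the sketch run without a GPU):

```python
def run_chain(generate, templates, initial_input):
    """Feed each step's output into the next prompt template."""
    result = initial_input
    history = []
    for template in templates:
        prompt = template.format(previous=result)
        result = generate(prompt)
        history.append((prompt, result))
    return result, history


def fake_generate(prompt):
    # Stand-in for the model call; it only echoes the prompt it received
    return f"<answer to: {prompt}>"


templates = [
    "Create a name and slogan for {previous}.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]
final, history = run_chain(fake_generate, templates, "a chatbot that leverages LLMs")
for prompt, response in history:
    print(prompt, "->", response)
```

In the notebook, `generate` could wrap the pipeline, for example `lambda p: pipe([{"role": "user", "content": p}])[0]["generated_text"]`.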


## Reasoning with Generative Models

### Chain-of-Thought: Think Before Answering

In [12]:
# Answering without explicit reasoning
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Run generative model
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])

 The cafeteria started with 23 apples. They used 20 apples to make lunch, so they had 23 - 20 = 3 apples left. After buying 6 more apples, they now have 3 + 6 = 9 apples.


In [13]:
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: Start with the initial number of apples in the cafeteria, which is 23.

Step 2: Subtract the number of apples used to make lunch, which is 20.
23 - 20 = 3 apples remaining.

Step 3: Add the number of apples bought, which is 6.
3 + 6 = 9 apples.

So, the cafeteria now has 9 apples.
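
Because a chain-of-thought response mixes reasoning with the answer, the final value usually has to be pulled out afterwards. A small sketch, assuming the answer is the last integer in the text (a heuristic that can fail on other response styles; both helper names are hypothetical):

```python
import re

COT_SUFFIX = " Let's think step-by-step."


def add_cot(question):
    """Append the zero-shot chain-of-thought trigger phrase to a question."""
    return question + COT_SUFFIX


def extract_final_number(text):
    """Return the last integer found in the text, or None if there is none."""
    numbers = re.findall(r"-?\d+", text)
    return int(numbers[-1]) if numbers else None


# Excerpt of the response printed above
reasoning = (
    "23 - 20 = 3 apples remaining.\n"
    "3 + 6 = 9 apples.\n"
    "So, the cafeteria now has 9 apples."
)
answer = extract_final_number(reasoning)
print(answer, answer == 23 - 20 + 6)
```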


### Tree-of-Thought: Exploring Intermediate Steps

* Considers multiple reasoning paths; useful for tasks such as creative writing
* Mimics calling the generative model multiple times by emulating a conversation between multiple experts
* The experts question each other until they reach a consensus


In [14]:
# Zero-shot Tree-of-Thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

In [15]:
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

 Expert 1:
Step 1: Start with the initial number of apples, which is 23.

Expert 2:
Step 1: Subtract the number of apples used for lunch, which is 20. This leaves us with 3 apples.
Step 2: Add the number of apples bought, which is 6. This results in a total of 9 apples.

Expert 3:
Step 1: Begin with the initial number of apples, which is 23.
Step 2: Subtract the number of apples used for lunch, which is 20. This leaves us with 3 apples.
Step 3: Add the number of apples bought, which is 6. This results in a total of 9 apples.

Discussion:
All three experts arrived at the same answer, which is 9 apples. This indicates that their calculations were correct. The cafeteria started with 23 apples, used 20 for lunch, and then bought 6 more, resulting in a total of 9 apples.


**Expert 1 only completed step 1 and never reached 9 apples, even though the discussion claims all three experts agreed; Experts 2 and 3 both reached the correct answer.**

## Output Verification
* Why? Structured output, valid output, ethics, accuracy
* 3 ways to control the output:
  * examples
  * grammar
  * fine-tuning the model

In [16]:
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])

 ```json
{
  "name": "Aria Stormbringer",
  "class": "Warrior",
  "race": "Human",
  "level": 10,
  "attributes": {
    "strength": 18,
    "dexterity": 12,
    "constitution": 16,
    "intelligence": 8,
    "wisdom": 10,
    "charisma": 14
  },
  "skills": {
    "melee": 15,
    "ranged": 10,
    "magic": 5,
    "stealth": 12,
    "acrobatics": 10,
    "survival": 14
  },
  "equipment": {
    "weapon": "Two-handed Axe",
    "armor": "Chainmail Hauberk",
    "shield": "Warhammer",
    "accessories": [
      "Ring of Protection",
      "Warrior's Bracer"
    ]
  },
  "background": "Aria was born into a noble family, but her life took a turn when her father was killed in battle. She trained as a warrior to avenge his death and protect her homeland from invaders."
}
```


In [17]:
one_shot_template = """Create a short character profile for an RPG game. Make
sure to only use this format:
{
 "description": "A SHORT DESCRIPTION",
 "name": "THE CHARACTER'S NAME",
 "armor": "ONE PIECE OF ARMOR",
 "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
 {"role": "user", "content": one_shot_template}
]
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])


 {
 "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
 "name": "Shadowblade",
 "armor": "Leather Vest",
 "weapon": "Dagger"
}
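
Providing an example format does not guarantee that the model follows it, so the response is worth checking before it is used. A minimal sketch of such a check, assuming the four keys from the template above are required (`strip_code_fence` and `validate_profile` are hypothetical helpers):

```python
import json

REQUIRED_KEYS = {"description", "name", "armor", "weapon"}
FENCE = "`" * 3  # Markdown code fence marker


def strip_code_fence(text):
    """Remove a surrounding Markdown code fence if the model added one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


def validate_profile(text, required_keys=REQUIRED_KEYS):
    """Return (profile, problems); profile is None when the text is not a JSON object."""
    try:
        profile = json.loads(strip_code_fence(text))
    except json.JSONDecodeError as error:
        return None, [f"invalid JSON: {error.msg}"]
    if not isinstance(profile, dict):
        return None, ["top-level value is not a JSON object"]
    missing = sorted(required_keys - set(profile))
    return profile, [f"missing key: {key}" for key in missing]


good = (
    FENCE + "json\n"
    '{"description": "A cunning rogue", "name": "Shadowblade", '
    '"armor": "Leather Vest", "weapon": "Dagger"}\n' + FENCE
)
profile, problems = validate_profile(good)
print(profile["name"], problems)

_, problems_missing = validate_profile('{"name": "Aria"}')
truncated_profile, problems_truncated = validate_profile('{"name": "Aria')
print(problems_missing, problems_truncated)
```

A check like this only detects a bad response after generation; it does not prevent one.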


### Grammar: Constrained Sampling
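* Problem with giving examples: the model does not always follow the format shown in the example
* Constrained sampling restricts which tokens can be generated, so the output has to match the grammar (here, valid JSON)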

In [18]:
import gc
import torch
del model, tokenizer, pipe
# Flush memory
gc.collect()
torch.cuda.empty_cache()

In [19]:
!pip install llama-cpp-python


In [20]:
from llama_cpp.llama import Llama
# Load Phi-3
llm = Llama.from_pretrained(
 repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
 filename="*fp16.gguf",
 n_gpu_layers=-1,
 n_ctx=4096,
 verbose=False
)

In [21]:
output = llm.create_chat_completion(
 messages=[
 {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
 ],
 response_format={"type": "json_object"},
 temperature=0,
)['choices'][0]['message']["content"]

In [22]:
import json
# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "warrior": {
        "name": "Eldric Stormbringer",
        "class": "Warrior",
        "level": 5,
        "attributes": {
            "strength": 18,
            "dexterity": 10,
            "constitution": 16,
            "intelligence": 8,
            "wisdom": 10,
            "charisma": 12
        },
        "skills": [
            {
                "name": "Martial Arts",
                "proficiency": 20,
                "description": "Expert in hand-to-hand combat and weapon handling."
            },
            {
                "name": "Shield Block",
                "proficiency": 18,
                "description": "Highly skilled at deflecting attacks with a shield."
            },
            {
                "name": "Heavy Armor",
                "proficiency": 16,
                "description": "Expertly equipped with heavy armor for protection."
            },
            {
                "name": "Survival",
                "proficiency": 14,
                "

## Summary
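* Generative models can be steered through prompt engineering and output verification.
* In-context learning supplies examples in the prompt; chain-of-thought prompting asks the model to reason step by step before answering.
* Prompt engineering is crucial to getting useful output from these models.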