# Week_7_Lesson_Notebook_Simple_Prompt_Examples

**Description:** There are a variety of ways of accessing Gen AI models.  In this notebook we will show you several very simple and free ones.<br>

Section 0 is environment set up.

Section 1 is about using the Hugging Face Transformers library and the models in its repository.  Here we are using a model called [Llama3.1](https://ai.meta.com/blog/meta-llama-3-1/), recently released as open source by Meta, that is very performant despite its small size.  This model is part of a current trend to improve efficiency and to squeeze more and more performance out of smaller and smaller models.  The regular 8B parameter model needs an A100 to run and uses up most of its GPU memory.  Fortunately we can leverage Hugging Face's quantization libraries and run the model on a T4 with some room to spare.

Section 2 uses another open sourced model called [Gemma](https://ai.google.dev/gemma) from Google.  These are lightweight models trained in the same way as Gemini but with a much smaller number of parameters.  

Section 3 is about using a year-old model, called [Mistral](https://mistral.ai/), released as open source by a French startup in December, that is very performant despite its small size.  This model is part of a current trend to improve efficiency and to squeeze more and more performance out of smaller and smaller models.  The regular 7B parameter model needs an A100 to run and uses up most of its GPU memory.  Fortunately we can leverage Hugging Face's quantization libraries and run the model on a T4 with some room to spare.

Section 4 uses another open sourced model built for reasoning called [Qwen3](https://huggingface.co/Qwen/Qwen3-8B) from Alibaba.  These are dense models with a much smaller number of parameters that are distilled from a larger mixture of experts (MoE) model.  We can run these models on a T4 GPU in free Colab.

Section 5 uses another commercial service called [Cohere](https://dashboard.cohere.com/).  They offer a free playground and an API service to non-commercial users.  You will need to sign up and add your key as a secret in Colab in order to be able to use it.  You will need it for assignment 5.  We'll use their API to call their endpoints which include completion and embeddings.

Section 6 is about accessing [ChatGPT via a web interface](https://chat.openai.com/).  This allows you to experiment and to copy and paste but it does not allow you to access the model programmatically.  We'll talk about that later.

<a id = 'returnToTop'></a>

## Notebook Contents
  * 0. [Setup](#setup)
  * 1. [Llama3.1-8B-instruct](#llama3.18Binst)
  * 2. [Gemma 2](#gemma)
  * 3. [Mistral](#mistral-ift)
  * 4. [Qwen 3](#qwen3)
  * 5. [Cohere](#cohere)
  * 6. [ChatGPT](#chatgpt)
  
**To run this notebook** you should copy it to your Berkeley Google Drive or your personal Colab Plus Google account by uploading it into that Google Drive. From there you can open it as a Colab notebook and run it.  Note we were able to run it in the free version of Colab but the RAM was maxed out.  It needs a T4 GPU to run Sections 1, 2, 3, and 4, but Sections 5 and 6 access web services and therefore require far fewer resources and no GPU.

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

# Setup

In [1]:
#let's make longer model output readable without horizontal scrolling
from pprint import pprint

In [2]:
%%capture
!pip install -q -U transformers


In [3]:
!pip install -q -U accelerate
!pip install -q -U bitsandbytes

In [4]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

This is the bitsandbytes configuration, where we specify our quantization arguments.  You can read about it [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [5]:
from transformers import BitsAndBytesConfig


quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
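
To see why 4-bit loading matters on a T4 (which has 16 GB of GPU memory), here is a rough back-of-the-envelope sketch. It counts only the bytes needed to store the weights and assumes every parameter is stored at the stated precision; it ignores activations, the KV cache, quantization constants, and any layers that stay unquantized, so real usage will be somewhat higher. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 8e9  # an "8B" model, rounded (assumption for illustration)

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(n_params, bits):.0f} GB for weights")
```

At 16 bits per parameter the weights alone are about as large as the T4's entire memory, while at 4 bits they take roughly a quarter of that, which leaves room for the rest of the computation.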


[Return to Top](#returnToTop)  
<a id = 'llama3.18Binst'></a>
## LLaMa 3.1

LLaMa 3.1 is a recent open weights model from Meta. We've used it extensively in this class. Check out [the model card](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) for further details. It is open-sourced.  To use it, you need to log in to your Hugging Face account and get permission.  We're using the 8 billion parameter version but quantized so it has a much smaller memory footprint.  We talked about quantization in week 5.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
myprompt = (
    "write a three sentence description for the following product: Cuisinart Airfryer, 6-Qt Basket Air Fryer Oven that Roasts, "
    "Bakes, Broils & Air Frys Quick & Easy Meals - Digital Display with 5 Presets, Non Stick & Dishwasher Safe, AIR-200"
    )

In [None]:
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a copy editor who writes very successful ad copy among other things!"},
    {"role": "user", "content": myprompt},
]

prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    **input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

pprint(tokenizer.decode(output[0], skip_special_tokens=True), compact=True)


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


('system\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 'You are a copy editor who writes very successful ad copy among other '
 'things!user\n'
 '\n'
 'write a three sentence description for the following product: Cuisinart '
 'Airfryer, 6-Qt Basket Air Fryer Oven that Roasts, Bakes, Broils & Air Frys '
 'Quick & Easy Meals - Digital Display with 5 Presets, Non Stick & Dishwasher '
 'Safe, AIR-200assistant\n'
 '\n'
 "Here's a three-sentence description for the Cuisinart Airfryer:\n"
 '\n'
 '"Transform your kitchen with the Cuisinart Airfryer, a 6-quart powerhouse '
 'that effortlessly cooks a variety of dishes with ease. This versatile '
 'appliance roasts, bakes, broils, and air fries your favorite meals with '
 'precision, thanks to its 5 preset settings and digital display. With its '
 'non-stick basket and dishwasher-safe design, cleanup is a breeze - making it '
 'the perfect addition to any home chef\'s arsenal."')


In [None]:
print(model.quantization_method)

QuantizationMethod.BITS_AND_BYTES


What if we ask the model to write a review of a product based solely on the product's name, without any other information?  Let's also make the voice a variable.

In [None]:
#Try some different tasks/prompts
myvoice = "millenial parent"
myprompt = f"Write a very positive three sentence review in the voice of a {myvoice} for a Cuisinart 6-Qt Airfryer"

messages = [
        {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
#model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'Write a very positive three sentence review in the voice of a millenial '
 'parent for a Cuisinart 6-Qt '
 'Airfryer<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
 '\n'
 '"Oh my gosh, I am OBSESSED with our Cuisinart 6-Qt Airfryer - it\'s '
 "literally been a game-changer for our family's health and cooking routine, "
 "and the kids are loving the 'fries' they can make with it without all the "
 'extra grease! The cleanup is also a breeze, which is a total win in my book. '
 "I've already ordered a few extra accessories to take our airfrying to the "
 "next level - this thing is a total investment, but it's worth every "
 'penny!"<|eot_id|>')
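
The special tokens visible in the output above show how the chat template wraps each message before the model sees it. As a rough illustration only, the sketch below imitates that layout in plain Python. It is not the tokenizer's actual template (which is stored with the tokenizer and also inserts the system block with the dates shown above); the function name `format_llama3_style` is hypothetical.

```python
def format_llama3_style(messages, add_generation_prompt=True):
    """Hand-rolled imitation of the Llama 3 chat layout seen in the output above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to start its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo = [{"role": "user", "content": "Write a review for a Cuisinart 6-Qt Airfryer"}]
print(format_llama3_style(demo))
```

In practice `tokenizer.apply_chat_template` should always be preferred, since each model family uses its own markers.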


Now we'll ask the model to generate a review of the product using the product description we just generated.  We'll also ask that the model generate the review in the voice of a baby boomer.

In [None]:
myprompt = (
    "Write a very positive three sentence review in the voice of a boomer based on the following product description: "
    "The Cuisinart Airfryer, model number AIR-200, is a 6-Qt capacity basket air fryer oven that allows for roasting, "
    "baking, broiling, and air frying to prepare quick and easy meals. Its 'digital display includes 5 preset functions, "
    "making cooking convenient and hassle-free. Featuring a non-stick interior and dishwasher safe parts, it ensures effortless "
    "cleanup."
)

messages = [
        {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
#model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'Write a very positive three sentence review in the voice of a boomer based '
 'on the following product description: The Cuisinart Airfryer, model number '
 'AIR-200, is a 6-Qt capacity basket air fryer oven that allows for roasting, '
 'baking, broiling, and air frying to prepare quick and easy meals. Its '
 "'digital display includes 5 preset functions, making cooking convenient and "
 'hassle-free. Featuring a non-stick interior and dishwasher safe parts, it '
 'ensures effortless '
 'cleanup.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
 '\n'
 '"Oh my stars, this Cuisinart Airfryer is just the bee\'s knees! It\'s like '
 'having my own personal chef in the kitchen, with all these fancy features '
 'like digital displays and preset functions that make cooking a
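
The warnings printed with these outputs ask for an `attention_mask`, and the decoded text repeats the whole prompt. One possible way to address both, sketched here under the assumption that the installed `transformers` version supports `return_dict=True` in `apply_chat_template`, is to request a dictionary of inputs and decode only the newly generated tokens. The helper name `generate_reply` is hypothetical and is not part of the notebook or the library.

```python
def generate_reply(model, tokenizer, user_prompt, max_new_tokens=256):
    """Generate a reply to a single user prompt and return only the new text."""
    messages = [{"role": "user", "content": user_prompt}]

    # return_dict=True yields both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # set explicitly to silence the pad-token notice
    )

    # Drop the prompt tokens so only the model's reply is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

With a model and tokenizer already loaded, a call such as `pprint(generate_reply(model, tokenizer, myprompt), compact=True)` would print just the review.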

Now we'll ask the model to write a review from the same product description, this time in the voice of a Gen Z gamer.  We can use this kind of functionality to generate synthetic data that can augment our data set if we have small amounts of data or if we have a class imbalance.

In [None]:
myprompt = (
    "Write a very positive three sentence review in the voice of a Gen Z gamer based on the following product description: "
    "The Cuisinart Airfryer, model number AIR-200, is a 6-Qt capacity basket air fryer oven that allows for roasting, "
    "baking, broiling, and air frying to prepare quick and easy meals. Its 'digital display includes 5 preset functions, "
    "making cooking convenient and hassle-free. Featuring a non-stick interior and dishwasher safe parts, it ensures effortless "
    "cleanup."
)

messages = [
        {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
#model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'Write a very positive three sentence review in the voice of a Gen Z gamer '
 'based on the following product description: The Cuisinart Airfryer, model '
 'number AIR-200, is a 6-Qt capacity basket air fryer oven that allows for '
 'roasting, baking, broiling, and air frying to prepare quick and easy meals. '
 "Its 'digital display includes 5 preset functions, making cooking convenient "
 'and hassle-free. Featuring a non-stick interior and dishwasher safe parts, '
 'it ensures effortless '
 'cleanup.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
 '\n'
 'OMG, the Cuisinart Airfryer AIR-200 is literally a GAME CHANGER in the '
 'kitchen! With its 6-Qt capacity, I can whip up a ton of delicious meals for '
 "my squad, from crispy fries to juicy chicken, and it's so e

In [None]:
#Simple cell to experiment
#Try some different tasks/prompts
# Replace the text below with your own prompt (it must be a single string, not a tuple).
myprompt = "Your idea here."

messages = [
        {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
#model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'Your idea here.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
 '\n'
 "Let's create a story together. I'll start by setting the scene, and then you "
 'can add to it, and so on.\n'
 '\n'
 '**The Mysterious Island**\n'
 '\n'
 'You are a young adventurer who has always been drawn to the sea. One day, '
 'while sailing on a small boat, you stumble upon a mysterious island that '
 'appears out of nowhere. The island is shrouded in mist, and the air is thick '
 'with an otherworldly energy.\n'
 '\n'
 'As you approach the island, you feel a strange sensation, like the island is '
 'calling to you. Your boat is drawn towards the shore, as if by an unseen '
 'force.\n'
 '\n'
 'You step off the boat and onto the sandy beach, feeling the warm sun on your '
 'skin. The air is 

Because of the extensive pre-training of [the model](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), it has many different capabilities.  Let's see if it can translate Chinese into English.

In [None]:
myprompt = (
    "Translate the following Chinese into English: boys, 我想告诉你，这个 "
 "CuisinartAirfryer，模型号是AIR-200，真是一件事! 六quirelit的容量,你问？绝对足够丰盛家人晚餐。接下来，烤、烘、�� "
 "broil和空炐烤全在一个机器上，就像拥有整个厨房一样。但是等下， Digital display你知道什么？五种预设功能？就像拥有personal "
 "chef一样，使cooking轻松无难度。最後，非沾锅内部和洗装盘，清理真是可以比作�arts simple as "
 "pie。简直神赞，这个Airfryer真是棒棒哇!"

)

messages = [
        {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
#model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 26 Jul 2024\n'
 '\n'
 '<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'Translate the following Chinese into English: boys, 我想告诉你，这个 '
 'CuisinartAirfryer，模型号是AIR-200，真是一件事! 六quirelit的容量,你问？绝对足够丰盛家人晚餐。接下来，烤、烘、�� '
 'broil和空炐烤全在一个机器上，就像拥有整个厨房一样。但是等下， Digital display你知道什么？五种预设功能？就像拥有personal '
 'chef一样，使cooking轻松无难度。最後，非沾锅内部和洗装盘，清理真是可以比作�arts simple as '
 'pie。简直神赞，这个Airfryer真是棒棒哇!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
 '\n'
 'Here is the translation:\n'
 '\n'
 '"Boys, I want to tell you, this Cuisinart Airfryer, model number is AIR-200, '
 'is truly amazing! Six quarts of capacity, you ask? Absolutely enough for a '
 'plentiful family dinner. Next, baking, roasting, broiling, and air frying '
 'all in one machine, just like having a whole kitchen. But wait, the digital '
 'display, you know what? Five pre-set functions? Just lik

Let's run some of the same prompts that we ran above to see how well this model performs.  Note that it takes a lot longer to generate answers because this model has 8 billion rather than 3 billion parameters.  The next cell takes about 2 minutes to complete.

How well do the outputs from Llama3.1 compare with the outputs from Gemma 2?  How can we measure their performance? How can we compare the two models quantitatively?

[Return to Top](#returnToTop)  
<a id = 'gemma'></a>

## Gemma 2


[Gemma 2](https://huggingface.co/google/gemma-2-9b-it) is a model produced by Google.  We'll use the 9B model that's been instruction fine-tuned.  We'll use quantization to be able to run it in free Colab on its base GPU.

Why are we looking at a second model? Because the behavior of these models is idiosyncratic and it is important to try multiple models to identify these differences.

In [None]:
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "google/gemma-2-9b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
pprint(tokenizer.decode(outputs[0]), compact=True)

('<bos>Write me a poem about Machine Learning.\n'
 '\n'
 'A mind of silicon, a heart of code,\n'
 'Learning patterns, a path to be trod')


That's a really short poem.  Too short, and stopping that abruptly is suspicious.  Why did this happen?

Hint: what are the defaults?

In [None]:
prompt_text = "Write me a very postive review of a Cuisinart 6 Qt Air Fryer"
input_ids = tokenizer(prompt_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=512)
pprint(tokenizer.decode(outputs[0]), compact=True)

('<bos>Write me a very postive review of a Cuisinart 6 Qt Air Fryer Toaster '
 'Oven Combo\n'
 '\n'
 "This is a kitchen game-changer! I've had the Cuisinart 6 Qt Air Fryer "
 "Toaster Oven Combo for a few weeks now, and I can honestly say it's the best "
 "appliance purchase I've made in years. \n"
 '\n'
 'First off, the versatility is incredible. I use it for everything from '
 'perfectly crispy air-fried chicken to golden-brown toast, and even baking '
 'cookies. The six-quart capacity is generous, allowing me to cook for a '
 'family of four without any issues. \n'
 '\n'
 'The air fryer function is truly exceptional. Food cooks quickly and evenly, '
 'with a delicious crispy texture that rivals deep-frying. The toaster oven '
 'function is equally impressive, with precise temperature control and a '
 'spacious interior. \n'
 '\n'
 "I love the intuitive controls and the clear digital display. It's so easy to "
 "use, even for someone who isn't a cooking expert. The pre-programmed "
 

In [None]:
prompt_text = "Write a very positive three sentence review in the voice of a boomer based on the following product description: The Cuisinart Airfryer, model number AIR-200, is a 6-Qt capacity basket air fryer oven that allows for roasting, baking, broiling, and air frying to prepare quick and easy meals. Its 'digital display includes 5 preset functions, making cooking convenient and hassle-free. Featuring a non-stick interior and dishwasher safe parts, it ensures effortless cleanup."

input_ids = tokenizer(prompt_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=512)
pprint(tokenizer.decode(outputs[0]), compact=True)

('<bos>Write a very positive three sentence review in the voice of a boomer '
 'based on the following product description: The Cuisinart Airfryer, model '
 'number AIR-200, is a 6-Qt capacity basket air fryer oven that allows for '
 'roasting, baking, broiling, and air frying to prepare quick and easy meals. '
 "Its 'digital display includes 5 preset functions, making cooking convenient "
 'and hassle-free. Featuring a non-stick interior and dishwasher safe parts, '
 'it ensures effortless cleanup.\n'
 '\n'
 "This Cuisinart Air Fryer is a real game changer!  It's so easy to use, even "
 'my grandkids could figure it out, and the food comes out crispy and '
 'delicious every time.  I love that it can do so much more than just fry, '
 "like roast and bake, so I'm using it all the time now.\n"
 '\n'
 '\n'
 '<end_of_turn><eos>')


In [None]:
prompt_text = "Q: A juggler can juggle 16 balls.  Half of the balls are golf balls and half of the golf balls are blue.  How many blue golf balls are there? A: Let's think step by step. "
input_ids = tokenizer(prompt_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=200)
pprint(tokenizer.decode(outputs[0]), compact=True)

('<bos>Q: A juggler can juggle 16 balls.  Half of the balls are golf balls and '
 'half of the golf balls are blue.  How many blue golf balls are there? A: '
 "Let's think step by step. 1. Find the number of golf balls: 16 balls / 2 = 8 "
 'golf balls 2. Find the number of blue golf balls: 8 golf balls / 2 = 4 blue '
 'golf balls Answer: There are **4** blue golf balls.<end_of_turn>\n'
 '<eos>')


[Return to Top](#returnToTop)  
<a id = 'mistral-ift'></a>

## Mistral 7B

[Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) is a small but highly performant model. It can also be used commercially. The model has been instruction fine-tuned by Mistral.ai so it should be able to follow our prompts and return good, on-point output.  We'll also use a quantized version (down to 4 bits) so we know it can load on our small GPU.

If you've already run the Llama and Gemma models, you need to shut down and disconnect your session.  Then reconnect and run setup at the top of the notebook to load the libraries necessary for it to work.  Then return here and you can try the Mistral model.

This model has been trained to work with dialog, meaning instances where we have multiple utterance and response pairs to create the context so the model can reply. For these examples we'll populate the context with only our prompt and not have any back and forth.

First we'll ask the model to generate a product description based on the content of the product title.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
myprompt = (
    "write a three sentence description for the following product: Cuisinart Airfryer, 6-Qt Basket Air Fryer Oven that Roasts, "
    "Bakes, Broils & Air Frys Quick & Easy Meals - Digital Display with 5 Presets, Non Stick & Dishwasher Safe, AIR-200"
    )

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
#Note: It can take up to 3 minutes to download this model

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)


generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


('<s>[INST] write a three sentence description for the following product: '
 'Cuisinart Airfryer, 6-Qt Basket Air Fryer Oven that Roasts, Bakes, Broils & '
 'Air Frys Quick & Easy Meals - Digital Display with 5 Presets, Non Stick & '
 'Dishwasher Safe, AIR-200[/INST] The Cuisinart Airfryer, model AIR-200, is a '
 'versatile 6-Quart Air Fryer Oven that offers innovative cooking options such '
 'as roasting, baking, broiling, and air frying. Equipped with a digital '
 'display and 5 convenient presets, this device streamlines meal preparation '
 'for a quick and easy cooking experience. Additionally, it features a '
 'non-stick and dishwasher-safe basket for effortless clean-up after every '
 'use.</s>')


In [None]:
print(model.quantization_method)

QuantizationMethod.BITS_AND_BYTES


What if we ask the model to write a review of a product based solely on the product's name, without any other information?  Let's also make the voice a variable.

In [None]:
#Try some different tasks/prompts
myvoice = "millenial parent"
myprompt = f"Write a very positive three sentence review in the voice of a {myvoice} for a Cuisinart 6-Qt Airfryer"

messages = [
        {"role": "user", "content": myprompt}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
#model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
pprint(decoded[0], compact=True)
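
Because the voice is a variable, the same prompt template can be filled in for several personas at once. The sketch below only builds the chat messages and does not call a model; the helper `build_review_prompts`, the constant `REVIEW_TEMPLATE`, and the list of voices are illustrative choices, not part of the notebook.

```python
REVIEW_TEMPLATE = (
    "Write a very positive three sentence review in the voice of a {voice} "
    "for a {product}"
)


def build_review_prompts(voices, product):
    """Return one single-turn chat (a list with one user message) per voice."""
    return [
        [{"role": "user", "content": REVIEW_TEMPLATE.format(voice=voice, product=product)}]
        for voice in voices
    ]


voices = ["millennial parent", "boomer", "Gen Z gamer"]
chats = build_review_prompts(voices, "Cuisinart 6-Qt Airfryer")

for chat in chats:
    print(chat[0]["content"])
```

Each element of `chats` has the same shape as the `messages` list used in the cell above, so it could be passed to `tokenizer.apply_chat_template` in a loop to collect one review per voice.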

[Return to Top](#returnToTop)  
<a id = 'qwen3'></a>
## Qwen 3

Qwen 3 is the latest open source model from Alibaba's Qwen group. We'll use it extensively in this class. Check out [the model card](https://huggingface.co/Qwen/Qwen3-8B) for further details. It makes it easy to toggle between "thinking" and "non-thinking" modes.  This allows us to easily see if the added thinking helps or hurts the performance of the model.  We're using the 8 billion parameter version but quantized so it has a much smaller memory footprint.  We talked about quantization in week 5.
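
When thinking mode is on, the generated tokens contain the reasoning, then a `</think>` marker, then the final answer. The sketch below isolates the splitting step on plain Python lists so it can be tried without loading a model. It assumes the marker's token id is 151668, the value used in the code cell later in this section; the function name `split_thinking` is hypothetical.

```python
THINK_END_ID = 151668  # token id used for "</think>" in this notebook's Qwen3 cell


def split_thinking(output_ids, think_end_id=THINK_END_ID):
    """Split generated ids into (thinking_ids, answer_ids) at the last marker."""
    try:
        # Index just past the last occurrence of the marker.
        index = len(output_ids) - output_ids[::-1].index(think_end_id)
    except ValueError:
        # No marker found: treat everything as the answer.
        index = 0
    return output_ids[:index], output_ids[index:]


thinking_ids, answer_ids = split_thinking([11, 12, THINK_END_ID, 21, 22])
print(thinking_ids, answer_ids)
```

The thinking part keeps the marker itself; decoding with `skip_special_tokens=True` is what the notebook relies on to clean up the text. With `enable_thinking=False` the marker is normally absent, so the thinking part comes back empty.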

The cell below loads the model using the AutoModel abstraction from Hugging Face.  There are several ways of interacting with a model.  One is through the AutoModel and AutoTokenizer classes.  Another is via the Pipeline class.  Let's look very briefly at the AutoModel approach.

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B" #runs barely in T4 so you must quantize

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content - separate thoughts from final answer
try:
    # find the position just after the last occurrence of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)


('thinking content:', '')
('content:',
 'Large language models (LLMs) are advanced artificial intelligence systems '
 'designed to understand and generate human-like text. These models are '
 'trained on vast amounts of text data from the internet, books, and other '
 'sources, allowing them to recognize patterns, grasp context, and produce '
 'coherent responses across a wide range of topics. LLMs can perform tasks '
 'such as answering questions, writing stories, coding, and even engaging in '
 'conversations, making them powerful tools in various fields like education, '
 'customer service, and content creation. Their ability to process and '
 'generate natural language has revolutionized the way we interact with '
 'technology.')
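
The thinking/answer split used in the cell above can be wrapped in a small helper so that it is easy to reuse and to check on toy input. This is a sketch under the same assumption as the cell, namely that token id 151668 marks `</think>`; the function name `split_thinking` is hypothetical.

```python
END_THINK_ID = 151668  # token id of </think>, as used in the cell above

def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Split generated token ids into (thinking_ids, answer_ids).

    The split point is just after the LAST occurrence of end_think_id.
    If the marker is absent, everything is treated as the answer.
    """
    try:
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]

# Toy ids, not real tokenizer output
thinking_ids, answer_ids = split_thinking([11, 12, END_THINK_ID, 21, 22])
print(thinking_ids)  # [11, 12, 151668]
print(answer_ids)    # [21, 22]
```

With real model output, each half would then go through `tokenizer.decode(..., skip_special_tokens=True)` as in the cell above.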


Let's run some of the same prompts that we ran above to see how well this model performs.  Note that it takes a lot longer to generate answers because this model "thinks" before it answers.  The next cell can take about 2 minutes to complete.

How well do the outputs from Qwen 3 compare with the outputs from the previous models?  How can we measure their performance? How can we compare the two models' outputs quantitatively?
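
As one very rough starting point for these questions, simple automatic checks can be computed on any pair of outputs. The sketch below checks whether a response follows a sentence-count instruction and measures word overlap between two responses. Both metrics, the function names, and the sample strings are illustrative assumptions, and neither metric says anything about factual accuracy or style.

```python
import re

def count_sentences(text):
    """Naive sentence count: split on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])

def jaccard_overlap(text_a, text_b):
    """Share of distinct lowercase words that two texts have in common (0 to 1)."""
    words_a = set(re.findall(r"[a-z']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z']+", text_b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# Made-up stand-ins for two model outputs
output_a = "LLMs store knowledge in their weights. They learn it from text."
output_b = "LLMs learn patterns from text. Knowledge lives in the weights."

print(count_sentences(output_a))
print(round(jaccard_overlap(output_a, output_b), 2))
```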

In [7]:
# prepare the model input
messages = [
    {"role": "system", "content": "You are a science communicator who makes technology accessible to everyone!"},
    {"role": "user", "content": "Please write a five sentence explanation of how LLMs do knowledge representation."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # find the position just after the last occurrence of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, the user wants a five-sentence explanation of how LLMs do knowledge '
 'representation. Let me start by recalling what knowledge representation '
 'means in the context of LLMs. They use embeddings to convert text into '
 'numerical vectors, right? So the first sentence should mention embeddings '
 'and how they capture semantic meaning.\n'
 '\n'
 'Next, I need to explain how these embeddings work. Maybe talk about training '
 "on large datasets to learn patterns. Then, mention that the model's layers "
 "process these embeddings through attention mechanisms. That's important "
 'because attention allows the model to focus on relevant parts of the input.\n'
 '\n'
 "Then, I should address how the model stores and retrieves knowledge. It's "
 'not like a traditional database, but through learned patterns. Maybe include '
 'that it can infer relationships and generate new information based on that. '
 "Finally, wrap it up by saying it's a dynamic

A second way of interacting with these models is to use the Pipeline class.

In [8]:
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

messages = [
    {"role": "user", "content": "How many r's in raspberry?"},
]

# Runs with quantization: uses ~7.0 of 15.0 GB of GPU memory and takes about 3 minutes to respond
pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen3-8B",
    model_kwargs={"torch_dtype": torch.bfloat16, "quantization_config": quantization_config},
    device_map="auto",
)

outputs = pipe(messages, max_new_tokens=2048,)


pprint(outputs[0]["generated_text"][-1], compact=True)

Device set to use cuda:0


{'content': '<think>\n'
            "Okay, the user is asking how many times the letter 'r' appears in "
            'the word "raspberry." Let me break this down step by step. First, '
            'I need to write out the word and check each letter one by one. '
            'The word is "raspberry." Let me spell it out: R-A-S-P-B-E-R-R-Y. '
            "Now, I'll go through each letter and count the 'r's.\n"
            '\n'
            "Starting with the first letter: R. That's one. Then the next "
            "letters are A, S, P, B, E. None of those are 'r'. The next letter "
            "is R again. That's the second 'r'. Then there's another R. So "
            "that's the third 'r'. Wait, let me make sure I'm not missing any. "
            'Let me count again. R (1), then A, S, P, B, E, R (2), R (3), Y. '
            "So that's three 'r's. Wait, but sometimes people might miscount. "
            'Let me double-check. The word is spelled R-A-S-P-B-E-R-R-Y. So '
            "posit

By default, Qwen 3 will think.  However, you can turn thinking off by adding `/no_think` at the end of your prompt.

In [9]:
#let's try a second example
messages = [
    {"role": "user", "content": "What is the ratio of the circumference of the world to the circumference of the sun? /no_think"},
]

outputs = pipe(messages, max_new_tokens=1024)


pprint(outputs[0]["generated_text"][-1], compact=True)

{'content': '<think>\n'
            '\n'
            '</think>\n'
            '\n'
            'To find the **ratio of the circumference of the world to the '
            'circumference of the Sun**, we need to compare the '
            '**circumference of the Earth** (often referred to as the "world" '
            'in historical or poetic contexts) to the **circumference of the '
            'Sun**.\n'
            '\n'
            '---\n'
            '\n'
            '### Step 1: Find the circumference of the Earth\n'
            '\n'
            'The **Earth** is an oblate spheroid, but for simplicity, we can '
            'use the **equatorial circumference**:\n'
            '\n'
            '$$\n'
            'C_{\\text{Earth}} \\approx 40,075 \\text{ kilometers}\n'
            '$$\n'
            '\n'
            '---\n'
            '\n'
            '### Step 2: Find the circumference of the Sun\n'
            '\n'
            'The **Sun** is a star, and its radius is approximately
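
As the output above shows, `/no_think` still leaves an empty `<think>` block at the start of the response. Below is a minimal sketch for separating any `<think>...</think>` block from the answer text. `split_think_text` is a hypothetical helper, the sample string is made up, and the sketch assumes the tags appear literally in the decoded string, as they do in the pipeline output here.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_think_text(text):
    """Return (thinking, answer) from a decoded response string."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No complete think block: treat the whole string as the answer
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()
    return thinking, answer

sample = "<think>\n\n</think>\n\nTo find the ratio, compare the two circumferences."
thinking, answer = split_think_text(sample)
print(repr(thinking))  # ''
print(answer)          # To find the ratio, compare the two circumferences.
```

If generation stops before the closing `</think>` tag is produced, for example because `max_new_tokens` runs out, this helper returns the whole string as the answer, so it is worth checking for a leftover `<think>` prefix in that case.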

[Return to Top](#returnToTop)  
<a id = 'cohere'></a>

## Cohere


[Cohere](https://cohere.com) is a company with commercial LLMs and endpoints you can access via an API call. They let you try before you buy: non-commercial users can open a free account, which comes with an API key.  You should open a free account and get an API key.  

**DO NOT PUT THE API KEY DIRECTLY INTO YOUR NOTEBOOK**

Make sure you either use Colab secrets or a .env file so the key is hidden from view.

We'll use the Cohere embeddings endpoint in a future assignment so now is the time to get your account open.
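
To illustrate the advice above about keeping the key out of the notebook, here is a minimal sketch that looks for the key in Colab secrets first and then falls back to an environment variable. The `load_api_key` helper is hypothetical and not part of the original notebook; it assumes the secret or variable is named `COHERE_API_KEY`.

```python
import os

def load_api_key(name="COHERE_API_KEY"):
    """Look up a secret without hard-coding it in the notebook."""
    try:
        # google.colab is available only inside Google Colab
        from google.colab import userdata
        key = userdata.get(name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is not set or not shared there
        pass
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} not found in Colab secrets or environment variables")
    return key

# Usage (left commented out so the block runs without a real key):
# COHERE_API_KEY = load_api_key()
```

Outside Colab, the variable could be set in the shell, or loaded from a `.env` file with a package such as `python-dotenv`, before calling `load_api_key()`.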

In [None]:
#access the hidden API key
from google.colab import userdata
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

In [None]:
import pprint

In [None]:
!pip install -q cohere

This prompt includes a paragraph from [Wikipedia](https://en.wikipedia.org/wiki/Grammar) about grammar.  The paragraph contains a number of places and dates.  We can use the model to extract these entities and return them to us in a JSON record.  This can be a powerful way of extracting value from noisy data.

In [None]:
myprompt = (
    "Please identify all of the places and dates in the following paragraph "
    "and return them in a JSON record with fields for date and loc: "
    "Belonging to the trivium of the seven liberal arts, grammar was taught "
    "as a core discipline throughout the Middle Ages, following the influence "
    "of authors from Late Antiquity, such as Priscian. Treatment of vernaculars "
    "began gradually during the High Middle Ages, with isolated works such as "
    "the First Grammatical Treatise, but became influential only in the Renaissance "
    "and Baroque periods. In 1486, Antonio de Nebrija published Las introduciones "
    "Latinas contrapuesto el romance al Latin, and the first Spanish grammar, "
    "Gramática de la lengua castellana, in 1492. During the 16th-century Italian "
    "Renaissance, the Questione della lingua was the discussion on the status and "
    "ideal form of the Italian language, initiated by Dante's de vulgari eloquentia "
    "(Pietro Bembo, Prose della volgar lingua Venice 1525). The first grammar of "
    "Slovene was written in 1583 by Adam Bohorič, and Grammatica Germanicae Linguae, "
    "the first grammar of German, was published in 1578."
)


In [None]:
import cohere

co = cohere.Client(COHERE_API_KEY)  # key loaded from Colab secrets above
response = co.generate(
    prompt=myprompt,
    model='command',
    max_tokens=300,
    temperature=0.5,
    k=0,
    p=0.96,
    stop_sequences=[],
    return_likelihoods='NONE',
)
pprint.pprint('Prediction: {}'.format(response.generations[0].text), compact=True)
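
Because the prompt asks for a JSON record, the returned text can be parsed into Python objects for further processing. Models often wrap the JSON in extra prose, so the sketch below scans for the first span that parses as JSON. The `extract_json` helper and the sample response are made up for illustration; the actual model output may be shaped differently.

```python
import json

def extract_json(text):
    """Return the first JSON object or array embedded in text, or None if none parses."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "[{":
            try:
                value, _ = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue
    return None

# Made-up response text in the shape the prompt asks for
sample_response = (
    'Here are the results:\n'
    '[{"date": "1525", "loc": "Venice"}, {"date": "1492", "loc": null}]'
)
records = extract_json(sample_response)
for record in records:
    print(record["date"], record["loc"])
```

With a real response, `response.generations[0].text` would take the place of `sample_response`.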

[Return to Top](#returnToTop)  
<a id = 'chatgpt'></a>
## ChatGPT

You can access a [free version of ChatGPT here](https://chat.openai.com/).  You will need to create an account (unless you already have one) using your personal account and **NOT your @berkeley.edu account**.  You want to use ChatGPT 4o-mini and not anything higher.  For now, we will access this model only through a web browser.  Later we will create pay-as-you-go accounts that will allow you to call these models from a notebook using your unique key.  We'll do it in a way that minimizes your expenses.

Let's access ChatGPT and try a couple of super simple prompts to see how well it performs.  It can be a very handy tool. (If you prefer, you can repeat the prompts we used with Mistral.)

Prompt 1 - ```write a blurb for the following book: Lee McIntyre, “On Disinformation: How to Fight for Truth and Protect Democracy” (MIT 2023)```

Prompt 2 - ```write a three sentence review of the book described in the following blurb: <insert generated blurb>```

Prompt 3 - ```What is the sentiment expressed in this review:```

Prompt 4 - ```Write a negative and pedantic three sentence review of the book described in the following blurb:```
  

How did those perform?  

Is the performance of the three models equivalent, or are there noticeable differences?