# lesson_1_Simple_Gen_AI_Examples

**Purpose:**

In the Friday lecture we described a number of attributes we expect an intelligent person to have.  We want our Gen AI models to display many of those same attributes:

  * **Knowledgeable** - we want to see evidence of *breadth* of knowledge.  As a generalist, the model should be well-informed about a very wide variety of topics
  * **Erudite** - we also want to see evidence of *depth* of knowledge equivalent to advanced academic study of a subject
  * **Eloquent** - we want the model to be able to express ideas well to a variety of different audiences
  * **Rational** - we want the model to be able to reason. Given a problem or puzzle, we want the model to solve it correctly.
  * **Novel** - we want the model to emit unique content and not to simply copy and paste together old content
  * **Obedient** - we want the model to accurately interpret and follow our instructions as in "do what I mean."

We will try to ask some questions/give some instructions that require a model to display those attributes.  You should also run this notebook and test the model(s) yourself.



**Description:**

There are a variety of ways of accessing Gen AI models.  In this notebook we will show you two smaller and free ones.<br>

Section 1 is about accessing ChatGPT via a web interface.  This allows you to experiment and to copy and paste but it does not allow you to access the model programmatically.  We'll talk about that later.

Section 2 is about using the Hugging Face Transformers library and the models in its repository.  Here we are using a new model just deployed by Microsoft that is very performant despite its small size.  This model is part of a current trend to improve efficiency and to squeeze more and more performance out of smaller and smaller models.  This Microsoft model will fit onto one GPU.  Other models may require multiple GPUs to load and run.

Section 3 demonstrates calling OpenAI's top model.  You'll need your own paid API key to run that code.

Section 4 runs the latest models from Microsoft (Phi-4-mini) and Alibaba (Qwen3).  We'll make extensive use of Hugging Face models in this class because we can run them in free Colab.

<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. Commercial Models
    * 1.1 [Open AI ChatGPT](#chatgpt)
    * 1.2 [Anthropic Claude](#claude)
    * 1.3 [Google Gemini](#gemini)
  * 2. [Hugging Face and Phi-4-mini](#huggingface)  
  * 3. [Open AI API Access](#openai)
  * 4. [Qwen 3](#qwen3)

  
**To run this notebook** you should copy it to your personal Colab Plus Google account by uploading it into your Google Drive. From there you can open it as a Colab notebook and run it.  It needs a GPU to run and it needs enough memory to load these 3.84 billion parameter models.

[Return to Top](#returnToTop)  
<a id = 'chatgpt'></a>
## 1.1 ChatGPT

You can access a [free version of ChatGPT here](https://chatgpt.com/).  You will need to create an account (unless you already have one) using your personal and **NOT your @berkeley.edu account**.  This model can only be accessed via a web browser.  Later we will create pay-as-you-go accounts that will allow you to call these models from a notebook using your unique key.  We'll do it in a way that minimizes your expenses.

Let's access ChatGPT and try a couple of super simple prompts to see how well it performs.  It can be a very handy tool.




### 1.1.1 Prompts

As noted above there are a number of attributes we want to see demonstrated from the models if we are to think of them as intelligent.  Here are seven prompts that capture some of those attributes.  Think about which attributes they capture and how well they capture them.  You'll be able to run this notebook and test your own prompts.

Prompt 1 - ```Please write a five sentence explanation of how LLMs do knowledge representation.```

Prompt 2 - ```Complete the following limerick that begins: There once was a man from Gibraltar ```

Prompt 3 - ```You are a world renowned baker with many awards and Michelin stars.  Give us your world famous recipe for chocolate chip cookies.```

Prompt 4 - ```Complete the following haiku about LLMs that begins: Modeling language ```  

Prompt 5 - ```Write a function in python to take an input string, break it down into words and return all of the word level tri-grams it contains.```

Prompt 6 - ```Miriam has seven books and two podcasts.  Marwan has five papers and a regular lecture series.  Who has published more?```

Prompt 7 - ```List the countries in Europe along with their capital cities.```

How did those perform with ChatGPT?  These prompts are very simplistic.  Later in the semester we'll talk about a variety of ways of improving the output.  We'll also look at much more complex prompts.

[Return to Top](#returnToTop)  
<a id = 'claude'></a>
## 1.2 Anthropic Claude

You can access a [free version of Claude here](https://claude.ai/).  You will need to create an account (unless you already have one) using your personal and **NOT your @berkeley.edu account**.  You want to use Claude 3.7 Sonnet if possible.  This model can only be accessed via a web browser.  Later we will create pay-as-you-go accounts that will allow you to call these models from a notebook using your unique key.  We'll do it in a way that minimizes your expenses.

Let's access Claude and try a couple of super simple prompts to see how well it performs.  It can be a very handy tool.


[Return to Top](#returnToTop)  
<a id = 'gemini'></a>
## 1.3 Google Gemini

You can access a [free version of Gemini here](https://gemini.google.com/).  You will need to create an account (unless you already have one) using your personal and **NOT your @berkeley.edu account**.  You want to use Gemini 2.0 Flash.  This model can only be accessed via a web browser.  Later we will create pay-as-you-go accounts that will allow you to call these models from a notebook using your unique key.  We'll do it in a way that minimizes your expenses.

Let's access Gemini and try a couple of super simple prompts to see how well it performs.  It can be a very handy tool.


[Return to Top](#returnToTop)  
<a id = 'huggingface'></a>
## 2. Hugging Face

Hugging Face is a company that offers a library of "transformers" as well as pre-trained models geared for a variety of tasks.  We are going to explore several ways of working with these models at a very high level.  In later classes, when we have covered how a transformer works, we'll come back and look at them at a deeper level.  This tutorial is designed to look at the Hugging Face library at the highest, most abstract level -- the pipeline.

Note that Hugging Face supports PyTorch and an alternative called TensorFlow.  The default framework for Hugging Face is PyTorch.  They port many of their models to TensorFlow.  When using Hugging Face, just pay attention to which version you're using.  When the model you're using is TensorFlow, the model class name often begins with TF, as in TFBert or TFDistilBert.  If it doesn't have a TF at the beginning of the name, it is using PyTorch, which is the framework we'll be using in this class.


The [HuggingFace web site](https://huggingface.co/transformers) offers an interesting set of resources.  Their [model documentation](https://huggingface.co/transformers/model_summary.html) provides an excellent explanation of transformers as well as the growing variety of models they offer (see the left hand navigation column).  In addition, their collection of [notebooks](https://huggingface.co/transformers/notebooks.html) is a valuable set of examples.  Their [blog](https://huggingface.co/blog) has some interesting and useful posts about how transformers of all varieties work. Their LLM leader board keeps track of the [performance of open source models.](https://huggingface.co/open-llm-leaderboard)  Finally, they offer [an excellent set of tutorials](https://huggingface.co/docs/transformers/index) as well as a Quick Start Guide with videos on the use of the full Hugging Face family of resources.

---

One word of caution:  this is a rapidly evolving resource and as a result you can often run into bugs.  They will get fixed, eventually, but may be buggy for a while.  

In [1]:
!pip install -q -U transformers  #>=4.51.0
!pip install einops
!pip install -q -U accelerate  #>=1.5.0
!pip install -q -U bitsandbytes


We'll use a brand new model from Microsoft called Phi-4-mini.  It stands out because it is relatively small at 3.8 billion parameters but has a 128K context window.  Let's look at [the model card](https://huggingface.co/microsoft/Phi-4-mini-instruct) from Hugging Face to get more background on just what distinguishes it from others.  Note it is optimized for common sense, language understanding, math, code, long context and logical reasoning.
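As a rough check on whether a model of this size fits on a single GPU, the memory needed for the weights can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: `estimate_weight_memory_gb` is a hypothetical helper, the listed precisions are assumptions about how the weights might be stored, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


phi4_mini_params = 3.8e9  # parameter count quoted in the text above

# Compare a few storage precisions (assumed, for illustration)
for bits in (32, 16, 4):
    gb = estimate_weight_memory_gb(phi4_mini_params, bits)
    print(f"{bits:>2}-bit weights: about {gb:.1f} GB")
```

At 16 bits per parameter the weights alone come to roughly 7.6 GB, which is why GPU memory matters even for a "small" model.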

In [2]:
#In case we want to know our installed transformers library version
!pip list | grep transformers
!pip list | grep accelerate

sentence-transformers                 3.4.1
transformers                          4.51.3
accelerate                            1.6.0


Add the following to make sure we are using the GPU if it will fit in the GPU's memory.

In [3]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

This will let us print long output wrapped across several lines rather than as one long line with a horizontal scroll bar.

In [4]:
from pprint import pprint

Now, let's load some Hugging Face abstractions -- AutoModelForCausalLM, AutoTokenizer, and the pipeline.  These make it very easy to just try a model and see how it performs.

In [5]:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

<torch._C.Generator at 0x7985257707b0>

## Instantiate Model

Now let's load the actual Microsoft model.  We'll be using Phi-4-mini-instruct.  This is a model designed to give quick answers.  Sometimes, depending on your problem, such a short thought model is what's best for you.  Other times, such as when doing math, logic, or puzzles, a longer thought reasoning model is most appropriate.  After you've run all seven prompts with the short thought model, you can uncomment the reasoning model lines and comment out the instruct lines, then try the reasoning model to see how it performs.

In [None]:
phi_model = AutoModelForCausalLM.from_pretrained(
    #"microsoft/Phi-4-mini-reasoning",
    "microsoft/Phi-4-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="eager"
)
#phi_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")
phi_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")


What is the meaning of the string `microsoft/Phi-4-mini-instruct`?  The `microsoft` portion means it comes from Microsoft.  `Phi-4` is the name of the model.  `mini` refers to the variant of the model, usually indicating the number of parameters.  Finally, `instruct` means this model has been instruction fine-tuned and should be good at following our instructions.

Now we can run the model.  We'll construct our prompt which we'll put in the messages list.  Note that the model is trained to do some dialog.  We can toggle back and forth between the 'user' and 'assistant' roles.  We can also just feed in the initial 'user' field if we just want one prompt.
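The turn-taking structure can be sketched with a small helper that checks a `messages` list before it is sent to a model. This is an illustrative sketch only: `validate_messages` is a hypothetical helper, not part of the Hugging Face API, and the rule it enforces (an optional leading `system` message, then alternating `user` and `assistant` turns ending on `user`) is an assumption about a typical chat format.

```python
def validate_messages(messages: list[dict]) -> None:
    """Raise ValueError if the conversation is not in the expected turn order."""
    if not messages:
        raise ValueError("messages must contain at least one turn")

    # An optional system message may come first; the rest must alternate
    turns = messages[1:] if messages[0]["role"] == "system" else messages
    if not turns:
        raise ValueError("a system message must be followed by a user turn")

    for position, message in enumerate(turns):
        expected = "user" if position % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"turn {position} has role {message['role']!r}, expected {expected!r}"
            )

    if turns[-1]["role"] != "user":
        raise ValueError("the final turn should come from the user")


# Made-up conversations for illustration
single_turn = [{"role": "user", "content": "What is 2 + 2?"}]
multi_turn = [
    {"role": "user", "content": "Name a fruit."},
    {"role": "assistant", "content": "A banana."},
    {"role": "user", "content": "Name another one."},
]

validate_messages(single_turn)
validate_messages(multi_turn)
print("both conversations are well formed")
```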

Since this model is pre-trained for math and logical reasoning, let's try it with a simple equation.

In [None]:
#This shows the turn-taking conversational approach. The first user and assistant pair represents the first turn in the conversation.
#We'll typically just use one 'user' input.
messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

phi_pipe = pipeline(
    "text-generation",
    model=phi_model,
    tokenizer=phi_tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.1,
    "do_sample": True,
}

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'],compact=True)


Device set to use cuda


('To solve the equation 2x + 3 = 7, follow these steps:\n'
 '\n'
 '1. Subtract 3 from both sides of the equation to isolate the term with the '
 'variable (x) on one side:\n'
 '   2x + 3 - 3 = 7 - 3\n'
 '   2x = 4\n'
 '\n'
 '2. Divide both sides of the equation by 2 to solve for x:\n'
 '   2x / 2 = 4 / 2\n'
 '   x = 2\n'
 '\n'
 'So, the solution to the equation 2x + 3 = 7 is x = 2.')


Before we try the prompts we specified in the intro, let's start with a completion task example.  We'll give the beginning of the sentence and let it predict what comes next.  

In [None]:
phi_pipe = pipeline("text-generation",
                model=phi_model,
                tokenizer=phi_tokenizer,
                trust_remote_code=True)

# A plain string (not a messages list) asks the model to continue the text
output = phi_pipe("Today is such a wonderful", do_sample=True, top_p=0.95, temperature=0.7, max_new_tokens=50)
pprint(output[0]['generated_text'], compact=True)

Device set to use cuda


('Today is such a wonderful day and the weather is gorgeous. We have just '
 'moved into our new home and we are still getting used to it. I hope the '
 'weather stays nice all week. I have been planning a picnic for this Saturday '
 'and I want to take my friends with')
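The call above sets `temperature` and `top_p`. As a rough intuition for what those two settings do to the next-token distribution, here is a minimal pure-Python sketch on made-up scores. The function names are hypothetical and this is not the library's implementation; it only illustrates the general idea of temperature scaling followed by nucleus (top-p) filtering.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[int]:
    """Return indices of the smallest set of tokens whose probability mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

# Lower temperatures concentrate probability on the highest-scoring token
for temperature in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Indices of the tokens that survive nucleus filtering at temperature 0.7
print(top_p_filter(softmax_with_temperature(toy_logits, 0.7), top_p=0.95))
```

With a temperature near 0.1 almost all of the probability lands on one token, which is why low-temperature runs tend to look nearly deterministic.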


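Let's do that again, but this time we'll pass the prompt as a chat message and reuse the `generation_args` we defined earlier.  Because the pipeline applies the model's chat template to a messages list, the model treats the text as a conversational turn and replies to it instead of continuing the sentence.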

In [None]:
messages = [
    {"role": "user", "content": "Today is such a wonderful"},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)


("I'm glad to hear that you're having a wonderful day! If there's anything "
 "specific you'd like to talk about or do, feel free to let me know. I'm here "
 'to help!')


Let's ask the model how it thinks LLMs represent knowledge.

In [None]:
#prompt 1
messages = [
    {"role": "user", "content": "Please write a five sentence explanation of how LLMs do knowledge representation."},
]

generation_args['max_new_tokens'] = 1500
output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

('Large Language Models (LLMs) represent knowledge through a combination of '
 'vast amounts of data, sophisticated algorithms, and neural network '
 'architectures. They ingest and process diverse text sources, enabling them '
 'to understand and generate human-like text. Through techniques like '
 'attention mechanisms and transformer architectures, LLMs can capture '
 'relationships between words and concepts, facilitating nuanced understanding '
 'and context-aware responses. By leveraging pre-training on extensive '
 'corpora, LLMs develop a broad knowledge base that encompasses various '
 'domains and topics. Fine-tuning further refines their capabilities, allowing '
 'them to apply this knowledge effectively in specific tasks and applications.')


We want our model to be a generalist.  Let's try a completely unrelated task, specifically let's have the model write some poetry, starting with a limerick.  This is challenging because it requires rhyming.

In [None]:
#prompt 2
messages = [
    {"role": "user", "content": "Complete the following limerick that begins There once was a man from Gibraltar "},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

('There once was a man from Gibraltar,\n'
 'Whose love for the sea was quite fair.\n'
 'He sailed on the waves,\n'
 'With a hearty, brave laugh,\n'
 'And found treasures beyond compare.')


Now let's see if the model can give us a good recipe for chocolate chip cookies.  We care about the ingredients and we care about the instructions.  Let's try and encourage a good recipe with our prompt.

In [None]:
#prompt 3
messages = [
    {"role": "user", "content": "You are a world renowned baker with many awards and Michelin stars.  Give us your world famous recipe for chocolate chip cookies."},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

("Thank you for your kind words! I'm thrilled to share my world-famous "
 'chocolate chip cookie recipe with you. This recipe has been perfected over '
 'years of baking and has earned numerous accolades, including Michelin stars. '
 'Here it is:\n'
 '\n'
 '**World-Famous Chocolate Chip Cookies Recipe**\n'
 '\n'
 '**Ingredients:**\n'
 '\n'
 '- 1 cup (2 sticks) unsalted butter, softened\n'
 '- 1 cup granulated sugar\n'
 '- 1 cup packed brown sugar\n'
 '- 2 large eggs\n'
 '- 2 teaspoons vanilla extract\n'
 '- 3 cups all-purpose flour\n'
 '- 1 teaspoon baking soda\n'
 '- 1/2 teaspoon baking powder\n'
 '- 1/2 teaspoon salt\n'
 '- 2 cups semisweet chocolate chips\n'
 '- 1 cup chopped walnuts (optional)\n'
 '\n'
 '**Instructions:**\n'
 '\n'
 '1. **Preheat the Oven:**\n'
 '   Preheat your oven to 350°F (175°C). Line baking sheets with parchment '
 'paper or silicone baking mats.\n'
 '\n'
 '2. **Cream the Butter and Sugars:**\n'
 '   In a large mixing bowl, cream together the softened butter, 

Do you think the generated recipe contains a good set of ingredients?  Would you want to eat the cookie as described?  Let's try a more complex task for the model.  Can it produce a haiku, which requires counting the syllables in the words?

In [None]:
#prompt 4
messages = [
    {"role": "user", "content": "Complete the following haiku about LLMs that begins: Modeling language "},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

'Modeling language,\nInfinite knowledge unfolds,\nWisdom in bytes weaves.'


The Phi-4 model is also trained to output code.  Let's ask it to write us a function.




In [None]:
#prompt 5
messages = [
    {"role": "user", "content": "Write a function in python to take an input string, breaks it down into words and return all of the word level tri-grams it contains"},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

('Certainly! A tri-gram is a sequence of three consecutive words in a given '
 'text. Below is a Python function that takes an input string, breaks it down '
 'into words, and returns all the word-level tri-grams it contains:\n'
 '\n'
 '```python\n'
 'def get_word_trigrams(input_string):\n'
 '    # Split the input string into words\n'
 '    words = input_string.split()\n'
 '    \n'
 '    # Initialize an empty list to store the tri-grams\n'
 '    tri_grams = []\n'
 '    \n'
 '    # Loop through the words and create tri-grams\n'
 '    for i in range(len(words) - 2):\n'
 '        tri_gram = (words[i], words[i + 1], words[i + 2])\n'
 '        tri_grams.append(tri_gram)\n'
 '    \n'
 '    return tri_grams\n'
 '\n'
 '# Example usage:\n'
 'input_string = "This is an example string for generating tri-grams"\n'
 'tri_grams = get_word_trigrams(input_string)\n'
 'print(tri_grams)\n'
 '```\n'
 '\n'
 'This function works as follows:\n'
 '1. It splits the input string into a list of words using the 
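One way to judge generated code like the function above is to run it on inputs with known answers, including edge cases. The sketch below is a compact reference version written for comparison; the name `word_trigrams` is hypothetical and the `zip`-based approach is just one reasonable way to do it, not the model's answer.

```python
def word_trigrams(text: str) -> list[tuple[str, str, str]]:
    """Return every run of three consecutive whitespace-separated words."""
    words = text.split()
    # zip stops at the shortest input, so texts under three words yield an empty list
    return list(zip(words, words[1:], words[2:]))


# Known-answer checks, including a text that is too short to contain a tri-gram
assert word_trigrams("the quick brown fox") == [
    ("the", "quick", "brown"),
    ("quick", "brown", "fox"),
]
assert word_trigrams("too short") == []

print(word_trigrams("modeling language is fun"))
```

The same assertions can be pointed at the model's `get_word_trigrams` to see whether it agrees with the reference.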

Now let's ask it a logic question, but we won't give it enough information to really answer the question.  What will it do?

In [None]:
#Prompt 6
messages = [
    {"role": "user", "content": "Miriam has seven books and two podcasts. Marwan has five papers and a regular lecture series. Who has published more? "},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)


('Miriam has published more. She has seven books and two podcasts, which '
 'totals nine published works. Marwan has five papers and a regular lecture '
 'series, which totals six published works. Therefore, Miriam has published '
 'more than Marwan.')


Now let's ask a simple knowledge retrieval question.

In [None]:
#Prompt 7
messages = [
    {"role": "user", "content": "List the countries in Europe along with their capital cities. "},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

('Sure, here is a list of countries in Europe along with their capital '
 'cities:\n'
 '\n'
 '1. Albania - Tirana\n'
 '2. Andorra - Andorra la Vella\n'
 '3. Armenia - Yerevan\n'
 '4. Austria - Vienna\n'
 '5. Azerbaijan - Baku\n'
 '6. Belarus - Minsk\n'
 '7. Belgium - Brussels\n'
 '8. Bosnia and Herzegovina - Sarajevo\n'
 '9. Bulgaria - Sofia\n'
 '10. Croatia - Zagreb\n'
 '11. Cyprus - Nicosia\n'
 '12. Czech Republic - Prague\n'
 '13. Denmark - Copenhagen\n'
 '14. Estonia - Tallinn\n'
 '15. Finland - Helsinki\n'
 '16. France - Paris\n'
 '17. Georgia - Tbilisi\n'
 '18. Germany - Berlin\n'
 '19. Greece - Athens\n'
 '20. Hungary - Budapest\n'
 '21. Iceland - Reykjavik\n'
 '22. Ireland - Dublin\n'
 '23. Italy - Rome\n'
 '24. Kazakhstan - Nur-Sultan (Astana)\n'
 '25. Kosovo - Pristina\n'
 '26. Latvia - Riga\n'
 '27. Liechtenstein - Vaduz\n'
 '28. Lithuania - Vilnius\n'
 '29. Luxembourg - Luxembourg City\n'
 '30. Malta - Valletta\n'
 '31. Moldova - Chișinău\n'
 '32. Monaco - Monaco\n'
 '33. M

Does the list cut off in the output?  To make the list longer, increase the value of ```max_new_tokens``` in the ```generation_args``` dictionary we created when we first called the model.
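A minimal sketch of that change is shown here. It rebuilds the dictionary with the values it was first given so the snippet stands alone, and the new limit of 3000 is an arbitrary value chosen for illustration.

```python
# Same settings as the dictionary when it was first defined in this notebook
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.1,
    "do_sample": True,
}

# Copy the dictionary and override one key so the original stays unchanged
longer_args = {**generation_args, "max_new_tokens": 3000}

print(longer_args)
# In the notebook this would then be passed as: phi_pipe(messages, **longer_args)
```

Copying rather than mutating keeps the shorter limit available for the other prompts.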

Now, you can try your own prompt and see what sort of results you get.

In [None]:
#Try your own
messages = [
    {"role": "user", "content": "PUT YOUR PROMPT TEXT HERE "},
]

output = phi_pipe(messages, **generation_args)
pprint(output[0]['generated_text'], compact=True)

[Return to Top](#returnToTop)  
<a id = 'openai'></a>

## 3. OpenAI

OpenAI is the creator of ChatGPT.  In order to use it programmatically you must have an API key, which requires a paid account.  You should never put your API key directly into the notebook.  Google offers the ability to attach secrets to your Colab notebooks.  This cell assumes you have added your API key as a secret in Colab.

In [None]:
!pip install -q openai

In [None]:
import os
import time
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPEN_AI_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
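Outside Colab, `google.colab.userdata` is not available, so the key has to come from somewhere else. The sketch below shows one possible fallback: read the key from an environment variable and fail with a clear message if it is missing. Both `require_api_key` and `mask_key` are hypothetical helpers, and `DEMO_API_KEY` with its value is a placeholder, not a real key.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in an environment variable, or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it as a Colab secret or export it in your shell; "
            "do not paste the key into the notebook."
        )
    return key


def mask_key(key: str) -> str:
    """Show only the last four characters so the key never appears in cell output."""
    return "*" * max(len(key) - 4, 0) + key[-4:]


os.environ["DEMO_API_KEY"] = "sk-demo-1234"  # placeholder value for illustration only
print(mask_key(require_api_key("DEMO_API_KEY")))
```

Masking the key before printing avoids leaking it when the notebook and its outputs are shared.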

Let's try the limerick problem and see if a commercial model can handle the challenge.

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {"role": "system", "content": "You are the American national poet."},
    {"role": "user", "content": "Complete the following limerick that begins There once was a man from Gibraltar "},
  ]
)

print(response.choices[0].message.content)

There once was a man from Gibraltar,  
Whose demeanor was suave, like a palter.  
He danced with such grace,  
In a curious place,  
And charmed every soul in the altar.  


[Return to Top](#returnToTop)  
<a id = 'qwen3'></a>
## 4. Qwen 3

Qwen 3 is one of the latest models from Alibaba.  We'll use it and others extensively in this class.  Check out [the model card](https://huggingface.co/Qwen/Qwen3-4B) for further details.  It is open source, so you can just use it.

In [6]:
!pip install -q -U bitsandbytes flash_attn



This is the config for quantization and we'll leave it here as a placeholder even though we aren't using it right now.

In [7]:
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; computation runs in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

Now let's load the Qwen model.

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer

#model_name = "Qwen/Qwen3-8B" #runs barely in T4 so you might quantize
model_name = "Qwen/Qwen3-4B"   #we'll use 4 billion parameter model

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content - separate thoughts from final answer
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)



('thinking content:',
 '<think>\n'
 'Okay, the user is asking for a short introduction to large language models. '
 "Let me start by recalling what I know about them. First, they're a type of "
 "AI model that's really good at understanding and generating human language. "
 'They\'re called "large" because they have a huge number of parameters, which '
 'allows them to handle complex tasks.\n'
 '\n'
 'I should mention their main applications, like text generation, translation, '
 "and answering questions. Maybe also touch on how they're trained on vast "
 'amounts of data. Oh, and the different types, like GPT, BERT, and others. '
 'Wait, but the user wants it short, so I need to be concise.\n'
 '\n'
 "I should also explain why they're powerful—maybe because of their ability to "
 'learn patterns from data, leading to high accuracy. But I need to avoid '
 "getting too technical. Also, note that they're used in various fields like "
 "healthcare, finance, and more. Maybe mention that th

Let's run some of the same prompts that we ran above to see how well this model performs.  Note that it takes a lot longer to generate answers because this model "thinks" before it answers.  The next cell can take about 2 minutes to complete.

How well do the outputs from Qwen 3 compare with the outputs from Phi-4?  How can we measure their performance? How can we compare the two models' outputs quantitatively?
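One simple quantitative check, as a starting point, is whether a response actually follows a countable instruction such as "five sentences". The sketch below counts sentences with a regular expression. The helper names are hypothetical, the sample response is made up, and the splitting rule is a rough heuristic that will miscount text containing abbreviations or decimals.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting on ., ! or ? followed by whitespace."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([piece for piece in pieces if piece])


def follows_length_instruction(text: str, expected_sentences: int = 5) -> bool:
    """True when the response has exactly the requested number of sentences."""
    return count_sentences(text) == expected_sentences


# Made-up response used only to exercise the helpers
sample_response = (
    "LLMs store knowledge in their weights. They learn it from text. "
    "Embeddings capture meaning. Attention links related words. "
    "Fine-tuning adapts that knowledge."
)

print(count_sentences(sample_response), follows_length_instruction(sample_response))
```

Running the same check on each model's answer to the first prompt gives one comparable number per model, which can sit alongside more subjective judgments of quality.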

In [None]:
# prepare the model input
messages = [
    {"role": "system", "content": "You are a science communicator who makes technology accessible to everyone!"},
    {"role": "user", "content": "Please write a five sentence explanation of how LLMs do knowledge representation."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, the user wants a five-sentence explanation of how LLMs do knowledge '
 'representation. Let me start by recalling what I know about LLMs. They '
 'process text, right? So knowledge representation here probably refers to how '
 'they store and use information. I need to break it down into key points.\n'
 '\n'
 'First, I should mention that LLMs use large datasets to learn patterns. '
 'Then, maybe talk about the neural network structure, like layers and '
 "weights. Oh, and they use embeddings to represent words and concepts. That's "
 'important for capturing meaning. Also, they can handle complex relationships '
 'through training on diverse data. Finally, they generate coherent responses '
 "by combining learned knowledge. Wait, that's four points. Let me check if I "
 'can condense or add another. Maybe mention the role of attention mechanisms? '
 'Or perhaps the dynamic nature of their knowledge? Hmm, the user asked for '
 'five sentences.
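
The cells in this section all repeat the same step of splitting the generated token ids at the last `</think>` token. As a minimal sketch, that step could be pulled into one function. `split_at_last` and `END_THINK_ID` are hypothetical names introduced here; the id 151668 is taken from the cells themselves.

```python
END_THINK_ID = 151668  # token id the cells in this section use for </think>


def split_at_last(output_ids, marker=END_THINK_ID):
    """Split a list of token ids just after the last occurrence of `marker`.

    Returns (ids up to and including the marker, ids after the marker).
    If the marker never appears, the first part is empty.
    """
    try:
        index = len(output_ids) - output_ids[::-1].index(marker)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]
```

With this helper, each cell could call `thinking_ids, answer_ids = split_at_last(output_ids)` and decode the two parts separately with `tokenizer.decode`, as the existing cells do.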

Let's see if "thinking" helps with limerick generation.

In [None]:
# prepare the model input
messages = [
    {"role": "system", "content": "You are a well known poet famous for your limmericks."},
    {"role": "user", "content": "Please finish the limmerick that begins There once was a man from Gibraltar "},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, the user wants me to finish a limmerick that starts with "There once '
 'was a man from Gibraltar." Let me recall what a limmerick is. It\'s a type '
 'of rhyming verse with a specific structure: four lines, each with eight '
 'syllables, and the rhyme scheme is AABB. So the first line is "There once '
 'was a man from Gibraltar," which is 12 syllables. Wait, that\'s too long. '
 "Wait, no, maybe I'm mixing up the structure. Let me check.\n"
 '\n'
 'Wait, actually, limmericks are typically four lines with eight syllables '
 'each, and the rhyme scheme is AABB. So the first line should be eight '
 'syllables. Let me count: "There once was a man from Gibraltar." That\'s 12 '
 "syllables. Hmm, that's too long. Maybe the user made a mistake? Or maybe the "
 'original line is longer, and the rest of the lines need to adjust? Wait, '
 'maybe the user is using a different structure. Alternatively, perhaps the '
 'user is okay with the first line bein

What about the recipe generation?

In [None]:
# prepare the model input
messages = [
    {"role": "system", "content": "You are a world renowned baker with many awards and Michelin stars."},
    {"role": "user", "content": "Give us your world famous recipe for chocolate chip cookies."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, the user is asking for my world-famous chocolate chip cookie recipe. '
 'Let me start by recalling the key elements that make a chocolate chip cookie '
 'stand out. First, the dough needs to be just right—soft but not too runny. '
 'The balance between butter and flour is crucial. I remember that using a '
 'high-quality butter, like European butter, gives a richer flavor. Also, the '
 'chocolate chips should be high quality, maybe a dark chocolate with a good '
 'balance of cocoa and sugar.\n'
 '\n'
 'Wait, the user mentioned "world famous," so I should think about the classic '
 "recipe that's been perfected over time. Maybe the original Toll House "
 'recipe? But I need to make sure to add my own twists. Maybe using a higher '
 'percentage of butter for a fluffier texture. Also, the temperature of the '
 'oven is important—usually around 350°F (175°C) for even baking.\n'
 '\n'
 'I should list the ingredients clearly, making sure to specify 

Now let's try the haiku.  Does thinking help the model better deal with syllables?

In [None]:
# prepare the model input
messages = [
    {"role": "system", "content": "You are the national poet of the United States."},
    {"role": "user", "content": "Complete the following haiku about LLMs that begins: Modeling language  "},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, the user wants a haiku about LLMs that starts with "Modeling '
 'language". Let me think. A haiku is a traditional Japanese poem with three '
 'lines, syllable structure 5-7-5. So first line is 5 syllables: "Modeling '
 'language" – that\'s 5 syllables. Good. Now the second line needs 7 '
 'syllables. Maybe something about the process of learning or the data they '
 'use. Like "From data\'s vast sea" – that\'s 5 syllables. Wait, need 7. Maybe '
 '"From data\'s vast sea, I learn" – that\'s 7. Then the third line, 5 '
 'syllables. Something about the outcome or the purpose. Maybe "To speak in '
 'your name." That\'s 5. Let me check the syllables again. First line: '
 'Modeling (2) language (2) – wait, "Modeling language" is 5 syllables? Let me '
 'count: Mod-el-ing lan-guage. Hmm, maybe "Modeling language" is 5? Wait, '
 '"Modeling" is 3 syllables (Mod-el-ing), and "language" is 2 (lan-guage). '
 'Wait, that would be 3 + 2 = 5? Wait, no. Wait, "
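
The model spends much of its thinking counting syllables, and those counts can be checked mechanically. The sketch below is a crude heuristic, not a reliable syllable counter: it counts groups of vowels and drops a silent final "e", so it will be wrong for many English words. `count_syllables` and `line_syllables` are hypothetical helper names.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups, ignoring a silent final 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())  # keep letters only
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def line_syllables(line: str) -> int:
    """Sum the estimated syllables of every whitespace-separated word in a line."""
    return sum(count_syllables(w) for w in line.split())
```

Applying `line_syllables` to each line of a generated haiku gives a pattern that can be compared with 5-7-5. Because the heuristic is approximate, disagreements should be checked by hand.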

Okay, what about generating some code?  Thinking can definitely help with that.

In [None]:
# prepare the model input
messages = [
    {"role": "system", "content": "You are a well-known educator in computer science and coding boot camps."},
    {"role": "user", "content": "Write a function in python to take an input string, breaks it down into words and return all of the word level tri-grams it contains"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, I need to write a Python function that takes an input string, splits '
 'it into words, and returns all the word-level trigrams. Let me think about '
 'how to approach this.\n'
 '\n'
 "First, what's a trigram? Oh right, a trigram is a sequence of three "
 'consecutive words. So if the input is "I love programming", the trigrams '
 'would be ["I love programming"], but wait, that\'s only one word. Wait, no, '
 'the input is split into words. Wait, the example might be different. Let me '
 'think again. For example, if the input is "the quick brown fox jumps over '
 'the lazy dog", then the trigrams would be sequences of three consecutive '
 'words. So like ["the quick brown"], ["quick brown fox"], ["brown fox '
 'jumps"], etc.\n'
 '\n'
 'So the steps are: split the input string into words, then iterate through '
 'the list of words, taking each consecutive triplet. But how to split the '
 'string into words? Well, the default split() method in 
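
For comparison with whatever the model produces, here is a minimal reference sketch of the approach the thinking trace describes: split on whitespace, then take every run of three consecutive words. It assumes plain whitespace splitting with no punctuation handling, and `word_trigrams` is a hypothetical name.

```python
from typing import List, Tuple


def word_trigrams(text: str) -> List[Tuple[str, ...]]:
    """Return every run of three consecutive whitespace-separated words."""
    words = text.split()
    # range() is empty when there are fewer than three words.
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
```

Running the model's function and this one on the same inputs, including strings with fewer than three words, is one quick way to test the generated code.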

Now what about our flawed logic puzzle?

In [None]:
# prepare the model input
messages = [
    {"role": "user", "content": "Miriam has seven books and two podcasts. Marwan has five papers and a regular lecture series. Who has published more?"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 "Okay, let's try to figure out who has published more between Miriam and "
 'Marwan. The question says Miriam has seven books and two podcasts. Marwan '
 'has five papers and a regular lecture series. Hmm, the term "published" is a '
 'bit tricky here. \n'
 '\n'
 'First, I need to understand what each of these items represents. Books and '
 'podcasts are types of publications, right? But wait, the question is about '
 'who has published more. So, I need to count how many publications each '
 'person has. \n'
 '\n'
 "Miriam has seven books and two podcasts. So that's 7 + 2 = 9 publications. "
 "But wait, are podcasts considered publications? That's a bit unclear. In "
 'academic contexts, a podcast might not be considered a publication, but in a '
 "more general sense, maybe it is. The question doesn't specify the context, "
 'so maybe I should take it at face value. \n'
 '\n'
 'Marwan has five papers and a regular lecture series. Papers are definitel

Finally, what about our knowledge retrieval problem?  Does thinking help?  Does it result in error correction?

In [None]:
# prepare the model input
messages = [
    {"role": "user", "content": "List all of the countries in Europe as well as their capital cities."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:',
 '<think>\n'
 'Okay, the user is asking for a list of all the countries in Europe along '
 'with their capital cities. First, I need to make sure I have the correct '
 'list of European countries. I remember that there are 44 countries in Europe '
 "according to the European Union, but that's not the total. The EU has 27 "
 'members, but there are other countries that are not part of the EU but are '
 'still in Europe. So, I should check the list of all countries in Europe.\n'
 '\n'
 'Wait, but I need to be careful. The European Union includes 27 countries, '
 'but there are also countries like Turkey, which is not part of the EU but is '
 'in Europe. Also, some countries like the United Kingdom are in Europe but '
 "not in the EU anymore. Then there's the issue of countries that are part of "
 'the European Union but not part of the EU, like the UK. Hmm, no, the UK is '
 "part of the EU before Brexit. Wait, the UK left the EU, so now it's not part "
 'of the EU. 

Let's try generating that list again but without "thinking."

In [9]:
# prepare the model input
messages = [
    {"role": "user", "content": "List all of the countries in Europe as well as their capital cities."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

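# parsing thinking content
try:
    # find the position just past the last </think> token (id 151668)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> token in the output, so everything is treated as the answer
    index = 0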

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

pprint(("thinking content:", thinking_content), compact=True)
pprint(("content:", content), compact=True)

('thinking content:', '')
('content:',
 'There are 44 sovereign countries in Europe. Here is a list of all the '
 'countries in Europe along with their capital cities:\n'
 '\n'
 '1. **Albania** – Tirana  \n'
 '2. **Andorra** – Andorra la Vella  \n'
 '3. **Austria** – Vienna  \n'
 '4. **Belarus** – Minsk  \n'
 '5. **Belgium** – Brussels  \n'
 '6. **Bosnia and Herzegovina** – Sarajevo  \n'
 '7. **Bulgaria** – Sofia  \n'
 '8. **Croatia** – Zagreb  \n'
 '9. **Cyprus** – Nicosia  \n'
 '10. **Czech Republic** – Prague  \n'
 '11. **Denmark** – Copenhagen  \n'
 '12. **Estonia** – Tallinn  \n'
 '13. **Finland** – Helsinki  \n'
 '14. **France** – Paris  \n'
 '15. **Georgia** – Tbilisi  \n'
 '16. **Germany** – Berlin  \n'
 '17. **Greece** – Athens  \n'
 '18. **Hungary** – Budapest  \n'
 '19. **Iceland** – Reykjavík  \n'
 '20. **Ireland** – Dublin  \n'
 '21. **Italy** – Rome  \n'
 '22. **Kazakhstan** – Nur-Sultan (formerly Astana)  \n'
 '23. **Kosovo** – Pristina  \n'
 '24. **Latvia** – Riga  \n'
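
A list like this can be checked automatically rather than by eye. The sketch below parses lines in the format shown above (`N. **Country** – Capital`) into a dictionary and reports disagreements with a reference. `parse_country_list` and `mismatches` are hypothetical helper names, and the reference dictionary is something you would need to supply from a source you trust.

```python
import re

# Matches lines such as "1. **Albania** – Tirana" (en dash or hyphen).
ENTRY = re.compile(
    r"^\s*\d+\.\s+\*\*(?P<country>.+?)\*\*\s+[–-]\s+(?P<capital>.+?)\s*$"
)


def parse_country_list(content: str) -> dict:
    """Extract {country: capital} from a numbered markdown list."""
    pairs = {}
    for line in content.splitlines():
        match = ENTRY.match(line)
        if match:
            pairs[match.group("country")] = match.group("capital")
    return pairs


def mismatches(parsed: dict, reference: dict) -> dict:
    """Map each reference country to (model answer, reference answer) when they differ."""
    return {
        country: (parsed.get(country), capital)
        for country, capital in reference.items()
        if parsed.get(country) != capital
    }
```

Calling `mismatches(parse_country_list(content), reference)` returns the wrong entries and the countries missing from the model's list, which gives a simple error count for comparing thinking and non-thinking runs.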


You should take advantage of this notebook to experiment.  See how far you can push these models.  What are they capable of doing?  Where and how do they break down?  You can compare the performance of these two smaller models with the output you can generate using your free ChatGPT account or your free Claude account.  This will give you a sense of the capabilities of the full-size commercial models.