# Comparing Llama Models

**Update: Llama 3 was released on April 18 and this notebook has been updated to compare Llama 3 and Llama 2 models hosted on Together.ai.**

- Load the helper functions for prompting Llama models

In [1]:
from utils import llama, llama_chat

### Task 1: Sentiment Classification
- Compare the models on few-shot sentiment classification.
- You are asking the model to return a one-word response.

In [6]:
prompt = '''
Message: Hi Amit, thanks for the thoughtful birthday card!
Sentiment: Positive
Message: Hi Dad, you're 20 minutes late to my piano recital!
Sentiment: Negative
Message: Can't wait to order pizza for dinner tonight!
Sentiment: ?

Give a one word response.
'''
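The few-shot prompt above is written by hand. As a minimal sketch, the same layout can be assembled from a list of labeled examples. `build_few_shot_prompt` is a hypothetical helper introduced here for illustration; it is not part of the course's `utils` module.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Give a one word response."):
    """Assemble a few-shot sentiment prompt (hypothetical helper, not from `utils`).

    examples: list of (message, sentiment) pairs shown to the model.
    query:    the unlabeled message the model should classify.
    """
    lines = []
    for message, sentiment in examples:
        lines.append(f"Message: {message}")
        lines.append(f"Sentiment: {sentiment}")
    # The final message is left unlabeled so the model fills in the answer.
    lines.append(f"Message: {query}")
    lines.append("Sentiment: ?")
    return "\n" + "\n".join(lines) + "\n\n" + instruction + "\n"


examples = [
    ("Hi Amit, thanks for the thoughtful birthday card!", "Positive"),
    ("Hi Dad, you're 20 minutes late to my piano recital!", "Negative"),
]
few_shot_prompt = build_few_shot_prompt(
    examples, "Can't wait to order pizza for dinner tonight!")
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to add or swap examples without retyping the `Message:` / `Sentiment:` scaffolding.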

- First, use the 7B parameter chat model (`llama-2-7b-chat`) to get the response.

**Note: the model names accepted by Together.ai are case-insensitive and can be either "META-LLAMA/LLAMA-2-7B-CHAT-HF" or "togethercomputer/llama-2-7b-chat". Names starting with "META-LLAMA" are now preferred.**

In [9]:
response = llama(prompt,
                 #model="togethercomputer/LLama-2-7b-chat")
                 model = "META-LLAMA/Llama-2-7B-CHAT-HF")
print(response)

  Hungry


- Now, use the 32B parameter model (`Qwen/QwQ-32B-Preview`) on the same task.

In [10]:
response = llama(prompt,
                 model="Qwen/QwQ-32B-Preview")
print(response)

Positive
Negative
Positive



**Using Llama 3 chat models**

In [11]:
response = llama(prompt,
                 model = "META-LLAMA/Llama-3-8B-CHAT-HF")
print(response)

Excited


In [12]:
response = llama(prompt,
                 model = "META-LLAMA/Llama-3-70B-CHAT-HF")
print(response)

Positive
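Since the prompt asks for a one-word response, the outputs above can also be checked mechanically. The sketch below assumes that only "Positive" and "Negative" count as valid labels and that anything longer than one word fails the instruction; `extract_label` is a hypothetical helper, not part of `utils`.

```python
ALLOWED_LABELS = {"positive", "negative"}  # assumption: the only valid labels


def extract_label(response, allowed=ALLOWED_LABELS):
    """Return the label if the response is exactly one allowed word, else None."""
    words = response.split()
    if len(words) == 1:
        word = words[0].strip(".,!").lower()
        if word in allowed:
            return word.capitalize()
    return None


# Outputs copied from the cells above.
outputs = {
    "Llama-2-7B-CHAT-HF": "  Hungry",
    "QwQ-32B-Preview": "Positive\nNegative\nPositive",
    "Llama-3-8B-CHAT-HF": "Excited",
    "Llama-3-70B-CHAT-HF": "Positive",
}
for name, text in outputs.items():
    print(name, "->", extract_label(text))
```

Under these assumptions, only the Llama 3 70B output yields a label. The other three either use a word outside the label set or return more than one word.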


### Task 2: Summarization
- Compare the models on a summarization task.
- This is the same "email" as the one you used previously in the course.

In [41]:
email = """
Dear Amit,

An increasing variety of large language models (LLMs) are open source, or close to it. The proliferation of models with relatively permissive licenses gives developers more options for building applications.

Here are some different ways to build applications based on LLMs, in increasing order of cost/complexity:

Prompting. Giving a pretrained LLM instructions lets you build a prototype in minutes or hours without a training set. Earlier this year, I saw a lot of people start experimenting with prompting, and that momentum continues unabated. Several of our short courses teach best practices for this approach.
One-shot or few-shot prompting. In addition to a prompt, giving the LLM a handful of examples of how to carry out a task — the input and the desired output — sometimes yields better results.
Fine-tuning. An LLM that has been pretrained on a lot of text can be fine-tuned to your task by training it further on a small dataset of your own. The tools for fine-tuning are maturing, making it accessible to more developers.
Pretraining. Pretraining your own LLM from scratch takes a lot of resources, so very few teams do it. In addition to general-purpose models pretrained on diverse topics, this approach has led to specialized models like BloombergGPT, which knows about finance, and Med-PaLM 2, which is focused on medicine.
For most teams, I recommend starting with prompting, since that allows you to get an application working quickly. If you’re unsatisfied with the quality of the output, ease into the more complex techniques gradually. Start one-shot or few-shot prompting with a handful of examples. If that doesn’t work well enough, perhaps use RAG (retrieval augmented generation) to further improve prompts with key information the LLM needs to generate high-quality outputs. If that still doesn’t deliver the performance you want, then try fine-tuning — but this represents a significantly greater level of complexity and may require hundreds or thousands more examples. To gain an in-depth understanding of these options, I highly recommend the course Generative AI with Large Language Models, created by AWS and DeepLearning.AI.

(Fun fact: A member of the DeepLearning.AI team has been trying to fine-tune Llama-2-7B to sound like me. I wonder if my job is at risk? 😜)

Additional complexity arises if you want to move to fine-tuning after prompting a proprietary model, such as GPT-4, that’s not available for fine-tuning. Is fine-tuning a much smaller model likely to yield superior results than prompting a larger, more capable model? The answer often depends on your application. If your goal is to change the style of an LLM’s output, then fine-tuning a smaller model can work well. However, if your application has been prompting GPT-4 to perform complex reasoning — in which GPT-4 surpasses current open models — it can be difficult to fine-tune a smaller model to deliver superior results.

Beyond choosing a development approach, it’s also necessary to choose a specific model. Smaller models require less processing power and work well for many applications, but larger models tend to have more knowledge about the world and better reasoning ability. I’ll talk about how to make this choice in a future letter.

Keep learning!

Andrew
"""

prompt = f"""
Summarize this email and extract some key points.

What did the author say about llama models?
```
{email}
```
"""

- First, use the 7B parameter chat model (`llama-2-7b-chat`) to summarize the email.

In [42]:
response_7b = llama(prompt,
                model="META-LLAMA/Llama-2-7B-CHAT-HF")
print(response_7b)

  The author of the email discusses the use of large language models (LLMs) and how they can be used to build applications. The email highlights several ways to build applications based on LLMs, ranging from simple prompting to fine-tuning.

Key points from the email include:

1. Increasing variety of LLMs are open source or close to it, giving developers more options for building applications.
2. Different ways to build applications based on LLMs, in increasing order of cost/complexity: prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.
3. For most teams, starting with prompting is recommended, as it allows for quick development of an application.
4. If unsatisfied with the quality of the output, developers can ease into more complex techniques gradually.
5. Fine-tuning a smaller model can work well for changing the style of an LLM's output, but may not yield superior results than prompting a larger, more capable model.
6. Choosing a specific model also needs to 

- Now, use the `meta-llama/Llama-Vision-Free` model to summarize the email.

In [43]:
response_free = llama(prompt,
                model="meta-llama/Llama-Vision-Free")
print(response_free)

**Summary:**
The email discusses the increasing availability of large language models (LLMs) and provides guidance on how to build applications using these models. The author, Andrew, recommends starting with prompting, which allows for quick prototyping, and gradually moving to more complex techniques like fine-tuning if needed. He also touches on the importance of choosing the right model for the application and the trade-offs between smaller and larger models.

**Key points:**

* LLMs are becoming more open-source, giving developers more options for building applications.
* There are different ways to build applications using LLMs, including prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.
* Prompting is a good starting point, as it allows for quick prototyping without requiring a large training set.
* One-shot or few-shot prompting can improve results by providing examples of how to carry out a task.
* Fine-tuning is a more complex technique that requires a 

- Next, use the 32B parameter model (`Qwen/QwQ-32B-Preview`) to summarize the email.

In [19]:
response_qwen = llama(prompt,
                model="Qwen/QwQ-32B-Preview")
print(response_qwen)

This email is from Andrew, likely Andrew Ng, a well-known figure in the field of artificial intelligence and machine learning. He is addressing someone named Amit and discussing the growing availability of large language models (LLMs), particularly those that are open-source or have permissive licenses. Andrew emphasizes the benefits of this trend for developers looking to build applications using LLMs.

He outlines four methods for building applications based on LLMs, arranged in order of increasing cost and complexity:

1. **Prompting**: This involves providing pre-trained LLMs with instructions to create prototypes quickly, often in minutes or hours, without needing a training dataset. Andrew mentions that many people have been experimenting with this approach and that his short courses teach best practices for it.

2. **One-shot or few-shot prompting**: Beyond simple prompting, this method includes giving the LLM a few examples of inputs and desired outputs to improve performance.


**Using Llama 3 chat models**

In [20]:
response_llama3_8b = llama(prompt,
                model="META-LLAMA/Llama-3-8B-CHAT-HF")
print(response_llama3_8b)

Here is a summary of the email and some key points:

**Summary:** The author, Andrew, discusses the various ways to build applications using large language models (LLMs), including prompting, one-shot or few-shot prompting, fine-tuning, and pretraining. He recommends starting with prompting and gradually moving to more complex techniques if needed.

**Key Points:**

* LLMs with permissive licenses provide developers with more options for building applications.
* Prompting is a simple and fast way to build a prototype, but may not yield high-quality results.
* One-shot or few-shot prompting can improve results by providing examples of desired outputs.
* Fine-tuning an LLM requires a small dataset and more resources, but can lead to better results.
* Pretraining an LLM from scratch is resource-intensive and typically not done by most teams.
* The author recommends starting with prompting and gradually moving to more complex techniques if needed.
* Choosing the right model (smaller vs. la

In [21]:
response_llama3_70b = llama(prompt,
                model="META-LLAMA/Llama-3-70B-CHAT-HF")
print(response_llama3_70b)

Here is a summary of the email and some key points:

**Summary:** The email discusses the increasing availability of open-source large language models (LLMs) and the various ways to build applications using them, ranging from simple prompting to fine-tuning and pretraining. The author recommends starting with prompting and gradually moving to more complex techniques if needed.

**Key Points:**

1. Open-source LLMs are becoming more available, giving developers more options for building applications.
2. There are four ways to build applications using LLMs, in increasing order of cost/complexity: prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.
3. Prompting is a quick and easy way to build a prototype, while fine-tuning and pretraining require more resources and expertise.
4. The choice of development approach depends on the application and the desired output.
5. The author recommends starting with prompting and gradually moving to more complex techniques if neede
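The cells above repeat the same call once per model. A minimal sketch of collecting the responses in a loop follows. It assumes the helper can be called as `ask(prompt, model=...)` and returns a string, which matches how `llama` is called in this notebook; `collect_responses` and `fake_llama` are hypothetical names, and `fake_llama` stands in for the real helper so the example runs without network access.

```python
def collect_responses(prompt, models, ask):
    """Query each model with the same prompt and return {model_name: response}.

    ask: any callable with the shape ask(prompt, model=...) -> str.
         In this notebook that would be the `llama` helper (an assumption
         based on how it is called in the cells above).
    """
    responses = {}
    for model in models:
        responses[model] = ask(prompt, model=model)
    return responses


# Stand-in for `llama` so the sketch runs offline.
def fake_llama(prompt, model):
    return f"[{model}] summary of {len(prompt)} characters"


models = [
    "META-LLAMA/Llama-2-7B-CHAT-HF",
    "META-LLAMA/Llama-3-8B-CHAT-HF",
    "META-LLAMA/Llama-3-70B-CHAT-HF",
]
summaries = collect_responses("Summarize this email.", models, ask=fake_llama)
for name, text in summaries.items():
    print(name, "->", text)
```

Passing `ask=llama` instead of `ask=fake_llama` would send the prompt to the hosted models. The resulting dictionary keeps each response paired with the model that produced it.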

#### Model-Graded Evaluation: Summarization

- Interestingly, you can ask an LLM to evaluate the responses of other LLMs.
- This is known as **Model-Graded Evaluation**.

- Create a `prompt` that will evaluate these five responses using the 70B parameter Llama 3 chat model (`Llama-3-70B-CHAT-HF`).
- In the `prompt`, provide the "email", "name of the models", and the "summary" generated by each model.

In [22]:
prompt = f"""
Given the original text denoted by `email`
and the name of several models: `model: <name of model>`
as well as the summary generated by that model: `summary`

Provide an evaluation of each model's summary:
- Does it summarize the original text well?
- Does it follow the instructions of the prompt?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

email: ```{email}```

model: llama-2-7b-chat
summary: {response_7b}

model: meta-llama/Llama-Vision-Free
summary: {response_free}

model: Qwen/QwQ-32B-Preview
summary: {response_qwen}

model: llama-3-8b-chat
summary: {response_llama3_8b}

model: llama-3-70b-chat
summary: {response_llama3_70b}
"""

response_eval = llama(prompt,
                model="META-LLAMA/Llama-3-70B-CHAT-HF")
print(response_eval)

Here is the evaluation of each model's summary:

**Model: llama-2-7b-chat**

* The summary is concise and covers the main points of the original email, but it lacks details and specific examples.
* The summary follows the instructions of the prompt, but it's too brief and doesn't provide much insight into the original email.
* The model's output is straightforward and lacks creativity.

**Model: llama-2-13b-chat**

* The summary is extremely brief and doesn't provide any meaningful information about the original email.
* The summary doesn't follow the instructions of the prompt, as it's not a summary of the email.
* The model's output is unhelpful and doesn't demonstrate any understanding of the original email.

**Model: Qwen/QwQ-32B-Preview**

* The summary is detailed and covers all the main points of the original email, including specific examples and nuances.
* The summary follows the instructions of the prompt and provides a clear and concise overview of the email.
* The model's o
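A model-graded evaluation prompt like the one above can also be generated from a dictionary of model names and summaries, so every model entry is delimited the same way. This is a sketch under assumptions: `build_eval_prompt` is a hypothetical helper, and wrapping each summary in triple backticks is a choice made here, not something the original prompt does.

```python
FENCE = "`" * 3  # triple backtick, built at runtime to avoid nesting fences


def build_eval_prompt(source_text, summaries):
    """Build a model-graded evaluation prompt (hypothetical helper).

    source_text: the original email.
    summaries:   dict mapping model name -> summary produced by that model.
    """
    header = (
        "Given the original text denoted by `email`\n"
        "and the name of several models: `model: <name of model>`\n"
        "as well as the summary generated by that model: `summary`\n"
        "\n"
        "Provide an evaluation of each model's summary:\n"
        "- Does it summarize the original text well?\n"
        "- Does it follow the instructions of the prompt?\n"
        "- Are there any other interesting characteristics of the model's output?\n"
        "\n"
        "Then compare the models based on their evaluation "
        "and recommend the models that perform the best.\n"
    )
    sections = [f"email: {FENCE}{source_text}{FENCE}"]
    for name, summary in summaries.items():
        # Each summary gets the same opening and closing delimiter.
        sections.append(f"model: {name}\nsummary: {FENCE}{summary}{FENCE}")
    return header + "\n" + "\n\n".join(sections) + "\n"


demo = build_eval_prompt(
    "Example email text.",
    {"model-a": "First summary.", "model-b": "Second summary."},
)
print(demo)
```

Because the model names come from the dictionary keys, the names shown to the evaluator always match the models that actually produced each summary.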

### Task 3: Reasoning ###
- Compare the models' performance on a reasoning task.

In [34]:
context = """
Jeff and Tommy are neighbors

Tommy and Eddy are not neighbors
"""

In [35]:
query = """
Are Jeff and Eddy neighbors?
"""

In [36]:
prompt = f"""
Given this context: ```{context}```,

and the following query:
```{query}```

Please answer the questions in the query and explain your reasoning.
If there is not enough information to answer, please say
"I do not have enough information to answer this question."
"""

- First, use the 7B parameter chat model (`llama-2-7b-chat`) for the response.

In [37]:
response_7b_chat = llama(prompt,
                        model="META-LLAMA/Llama-2-7B-CHAT-HF")
print(response_7b_chat)

  Sure, I'd be happy to help! Based on the information provided, we can answer the query as follows:

Are Jeff and Eddy neighbors?

No, Jeff and Eddy are not neighbors. The context states that Jeff and Tommy are neighbors, but does not mention Eddy as a neighbor. Therefore, we can conclude that Eddy and Jeff are not neighbors.


- Now, use the `meta-llama/Llama-Vision-Free` model for the response.

In [39]:
response_free_chat = llama(prompt,
                        model="meta-llama/Llama-Vision-Free")
print(response_free_chat)

I do not have enough information to answer this question.

The context only provides information about the relationships between Jeff, Tommy, and Eddy, but it does not provide any information about the relationship between Jeff and Eddy specifically.


- Then, use the 32B parameter model (`Qwen/QwQ-32B-Preview`) for the response.

In [28]:
response_qwen_chat = llama(prompt,
                        model="Qwen/QwQ-32B-Preview")
print(response_qwen_chat)

I do not have enough information to answer this question.

Explanation:

From the given context, we know two things:

1. Jeff and Tommy are neighbors.

2. Tommy and Eddy are not neighbors.

The question is whether Jeff and Eddy are neighbors.

Let's think about this step by step.

First, since Jeff and Tommy are neighbors, they live next to each other.

Second, Tommy and Eddy are not neighbors, so Eddy does not live next to Tommy.

However, there is no direct information about the relationship between Jeff's and Eddy's residences.

It's possible that Jeff and Eddy are neighbors, especially if Eddy lives next to Jeff, skipping Tommy.

Alternatively, Eddy could live farther away, not being neighbors with either Jeff or Tommy.

Without more information about the layout of their homes or Eddy's location relative to Jeff, we cannot definitively say whether Jeff and Eddy are neighbors.

Therefore, I do not have enough information to answer this question.

**Final Answer**

\[ \boxed{\text{I 

- Lastly, use the Llama 3 models.

In [29]:
response_llama3_8b_chat = llama(prompt,
                        model="META-LLAMA/Llama-3-8B-CHAT-HF")
print(response_llama3_8b_chat)

Based on the given context, I can answer the query as follows:

Are Jeff and Eddy neighbors?

I do not have enough information to answer this question.

The context only provides information about Jeff and Tommy being neighbors, and Tommy and Eddy not being neighbors. It does not provide any information about Jeff and Eddy's relationship. Therefore, I cannot determine whether Jeff and Eddy are neighbors or not.


In [30]:
response_llama3_70b_chat = llama(prompt,
                        model="META-LLAMA/Llama-3-70B-CHAT-HF")
print(response_llama3_70b_chat)

Based on the given context, I can conclude that:

**Jeff and Eddy are not neighbors.**

Here's my reasoning:

1. We know that Jeff and Tommy are neighbors.
2. We also know that Tommy and Eddy are not neighbors.
3. Since Tommy is a neighbor of Jeff, but not a neighbor of Eddy, it implies that Jeff and Eddy do not share a common neighbor (Tommy).
4. Therefore, Jeff and Eddy are not neighbors.

So, I can confidently answer that Jeff and Eddy are not neighbors.
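The models disagree, so it helps to check which answer the context actually supports. The sketch below treats "neighbors" as a plain symmetric relation between pairs of people with no further geometric constraints (an assumption made here), enumerates every assignment that matches the two stated facts, and looks at what those assignments say about Jeff and Eddy.

```python
from itertools import combinations, product

people = ["Jeff", "Tommy", "Eddy"]
# Unordered pairs, because being neighbors is treated as symmetric.
pairs = list(combinations(people, 2))


def consistent(world):
    """True if an assignment of neighbor / not-neighbor matches the context."""
    return world[("Jeff", "Tommy")] and not world[("Tommy", "Eddy")]


possible_answers = set()
for values in product([True, False], repeat=len(pairs)):
    world = dict(zip(pairs, values))
    if consistent(world):
        possible_answers.add(world[("Jeff", "Eddy")])

if possible_answers == {True, False}:
    verdict = "not enough information"
elif possible_answers == {True}:
    verdict = "neighbors"
else:
    verdict = "not neighbors"
print(verdict)  # prints "not enough information"
```

Both outcomes for Jeff and Eddy are compatible with the context, so under this reading the responses that say there is not enough information are the ones the context supports.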


#### Model-Graded Evaluation: Reasoning

- Again, ask an LLM to compare the five responses.
- Create a `prompt` that will evaluate these responses using the 70B parameter Llama 3 chat model (`Llama-3-70B-CHAT-HF`).
- In the `prompt`, provide the `context`, `query`, "name of the models", and the "response" generated by each model.

In [40]:
prompt = f"""
Given the context `context:`,
and also given the query (the task): `query:`
and given the name of several models: `model: <name of model>`,
as well as the response generated by that model: `response:`

Provide an evaluation of each model's response:
- Does it answer the query accurately?
- Does it provide a contradictory response?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

context: ```{context}```

query: ```{query}```

model: llama-2-7b-chat
response: ```{response_7b_chat}```

model: meta-llama/Llama-Vision-Free
response: ```{response_free_chat}```

model: Qwen/QwQ-32B-Preview
response: ```{response_qwen_chat}```

model: llama-3-8b-chat
response: ```{response_llama3_8b_chat}```

model: llama-3-70b-chat
response: ```{response_llama3_70b_chat}```
"""

In [32]:
response_eval = llama(prompt, 
                      model="META-LLAMA/Llama-3-70B-CHAT-HF")

print(response_eval)

Here's the evaluation of each model's response:

**Llama-2-7b-chat**

* Does it answer the query accurately? Yes, the model correctly concludes that Jeff and Eddy are not neighbors based on the given context.
* Does it provide a contradictory response? No, the response is consistent with the context.
* Are there any other interesting characteristics of the model's output? The model provides a clear and concise explanation for its answer.

**Llama-2-13b-chat**

* Does it answer the query accurately? No, the model's response is incomplete and does not provide an answer to the query.
* Does it provide a contradictory response? No, the response is incomplete and does not provide any information.
* Are there any other interesting characteristics of the model's output? The model's response is incomplete and does not provide any useful information.

**Qwen/QwQ-32B-Preview**

* Does it answer the query accurately? No, the model states that it does not have enough information to answer the ques