# Comparing Llama Models

- Load the helper functions used to prompt Llama models.

In [18]:
from utils import llama, llama_chat

### Task 1: Sentiment Classification
- Compare the models on few-shot sentiment classification.
- You are asking the model to return a one-word response.

In [19]:
prompt = '''
Message: Hi Amit, thanks for the thoughtful birthday card!
Sentiment: Positive
Message: Hi Dad, you're 20 minutes late to my piano recital!
Sentiment: Negative
Message: Can't wait to order pizza for dinner tonight!
Sentiment: ?

Give a one word response.
'''
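
The few-shot prompt above follows a fixed pattern: labeled examples first, then the unlabeled message, then the instruction. A minimal sketch of assembling it from a list of examples is shown below; `build_few_shot_prompt` is a hypothetical helper introduced here for illustration and is not part of `utils`.

```python
def build_few_shot_prompt(examples, new_message,
                          instruction="Give a one word response."):
    """Assemble a few-shot sentiment prompt from (message, sentiment) pairs."""
    lines = []
    for message, sentiment in examples:
        lines.append(f"Message: {message}")
        lines.append(f"Sentiment: {sentiment}")
    # The unlabeled message goes last, with "?" marking the slot to fill.
    lines.append(f"Message: {new_message}")
    lines.append("Sentiment: ?")
    return "\n" + "\n".join(lines) + "\n\n" + instruction + "\n"


examples = [
    ("Hi Amit, thanks for the thoughtful birthday card!", "Positive"),
    ("Hi Dad, you're 20 minutes late to my piano recital!", "Negative"),
]
few_shot_prompt = build_few_shot_prompt(
    examples, "Can't wait to order pizza for dinner tonight!"
)
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to add or swap examples when comparing how different models respond.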

- First, use the 7B parameter chat model (`llama-2-7b-chat`) to get the response.

In [20]:
response = llama(prompt,
                 model="togethercomputer/llama-2-7b-chat")
print(response)

  Hungry


- Now, use the 70B parameter chat model (`llama-2-70b-chat`) on the same task

In [21]:
response = llama(prompt,
                 model="togethercomputer/llama-2-70b-chat")
print(response)

  Positive
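
The two outputs above show that a one-word reply is not guaranteed to be a valid label ("Hungry" is not a sentiment). One way to handle this, sketched below, is to normalize each reply and accept it only if it is in an allowed set. The names `normalize_label`, `compare_models`, and `replay` are hypothetical; `replay` stands in for the real helper and simply returns the outputs recorded above, so the sketch runs without network access.

```python
def normalize_label(response, allowed=("positive", "negative")):
    """Return the canonical label, or None if the reply is not an allowed one."""
    cleaned = response.strip().strip(".!\"'").lower()
    return cleaned.capitalize() if cleaned in allowed else None


def compare_models(prompt, models, generate):
    """Call generate(prompt, model=...) once per model and collect the labels."""
    return {model: normalize_label(generate(prompt, model=model))
            for model in models}


# Stand-in for the real helper: replays the two outputs recorded above.
recorded = {
    "togethercomputer/llama-2-7b-chat": "  Hungry",
    "togethercomputer/llama-2-70b-chat": "  Positive",
}


def replay(prompt, model):
    return recorded[model]


results = compare_models("(prompt text)", list(recorded), replay)
print(results)
```

In the notebook itself, passing `llama` in place of `replay` should work the same way, assuming it accepts the `model` keyword as in the cells above.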


### Task 2: Summarization
- Compare the models on a summarization task.
- This is the same "email" as the one you used previously in the course.

In [22]:
email = """
Dear Amit,

An increasing variety of large language models (LLMs) are open source, or close to it. The proliferation of models with relatively permissive licenses gives developers more options for building applications.

Here are some different ways to build applications based on LLMs, in increasing order of cost/complexity:

Prompting. Giving a pretrained LLM instructions lets you build a prototype in minutes or hours without a training set. Earlier this year, I saw a lot of people start experimenting with prompting, and that momentum continues unabated. Several of our short courses teach best practices for this approach.
One-shot or few-shot prompting. In addition to a prompt, giving the LLM a handful of examples of how to carry out a task — the input and the desired output — sometimes yields better results.
Fine-tuning. An LLM that has been pretrained on a lot of text can be fine-tuned to your task by training it further on a small dataset of your own. The tools for fine-tuning are maturing, making it accessible to more developers.
Pretraining. Pretraining your own LLM from scratch takes a lot of resources, so very few teams do it. In addition to general-purpose models pretrained on diverse topics, this approach has led to specialized models like BloombergGPT, which knows about finance, and Med-PaLM 2, which is focused on medicine.
For most teams, I recommend starting with prompting, since that allows you to get an application working quickly. If you’re unsatisfied with the quality of the output, ease into the more complex techniques gradually. Start one-shot or few-shot prompting with a handful of examples. If that doesn’t work well enough, perhaps use RAG (retrieval augmented generation) to further improve prompts with key information the LLM needs to generate high-quality outputs. If that still doesn’t deliver the performance you want, then try fine-tuning — but this represents a significantly greater level of complexity and may require hundreds or thousands more examples. To gain an in-depth understanding of these options, I highly recommend the course Generative AI with Large Language Models, created by AWS and DeepLearning.AI.

(Fun fact: A member of the DeepLearning.AI team has been trying to fine-tune Llama-2-7B to sound like me. I wonder if my job is at risk? 😜)

Additional complexity arises if you want to move to fine-tuning after prompting a proprietary model, such as GPT-4, that’s not available for fine-tuning. Is fine-tuning a much smaller model likely to yield superior results than prompting a larger, more capable model? The answer often depends on your application. If your goal is to change the style of an LLM’s output, then fine-tuning a smaller model can work well. However, if your application has been prompting GPT-4 to perform complex reasoning — in which GPT-4 surpasses current open models — it can be difficult to fine-tune a smaller model to deliver superior results.

Beyond choosing a development approach, it’s also necessary to choose a specific model. Smaller models require less processing power and work well for many applications, but larger models tend to have more knowledge about the world and better reasoning ability. I’ll talk about how to make this choice in a future letter.

Keep learning!

Andrew
"""

prompt = f"""
Summarize this email and extract some key points.

What did the author say about llama models?
```
{email}
```
"""

- First, use the 7B parameter chat model (`llama-2-7b-chat`) to summarize the email.

In [23]:
response_7b = llama(prompt,
                    model="togethercomputer/llama-2-7b-chat")
print(response_7b)

  The author discusses the proliferation of large language models (LLMs) and the various ways to build applications using them, ranging from prompting to fine-tuning. Key points include:

1. LLMs are becoming more open source, giving developers more options for building applications.
2. Different approaches to building applications include prompting, one-shot or few-shot prompting, fine-tuning, and pretraining.
3. The author recommends starting with prompting for most teams, as it allows for quick development, and gradually moving to more complex techniques if needed.
4. Fine-tuning a smaller model can yield superior results than prompting a larger, more capable model in some cases, depending on the application.
5. Choosing a specific model also requires consideration, as smaller models require less processing power but may not have as much knowledge about the world or reasoning ability as larger models.
6. The author plans to discuss how to make this choice in a future letter.

Regard

- Now, use the 13B parameter chat model (`llama-2-13b-chat`) to summarize the email.

In [24]:
response_13b = llama(prompt,
                     model="togethercomputer/llama-2-13b-chat")
print(response_13b)

  Sure! Here's a summary of the email and some key points:

Summary:
The author discusses different approaches to building applications using large language models (LLMs), ranging from prompting to fine-tuning, and provides recommendations on when to use each approach. They also discuss the trade-offs between using smaller or larger models and the importance of choosing the right model for the application.

Key points:

1. Prompting: Giving a pretrained LLM instructions to build a prototype quickly, without a training set.
2. One-shot or few-shot prompting: Providing a handful of examples to the LLM for better results.
3. Fine-tuning: Training an LLM further on a small dataset for a specific task.
4. Pretraining: Training an LLM from scratch, but requires significant resources.
5. Choosing a development approach: Depending on the application, the author recommends starting with prompting and gradually moving to more complex techniques if needed.
6. Choosing a specific model: Smaller mo

- Lastly, use the 70B parameter chat model (`llama-2-70b-chat`) to summarize the email.

In [25]:
response_70b = llama(prompt,
                     model="togethercomputer/llama-2-70b-chat")
print(response_70b)

  The author of the email, Andrew, discusses the various ways to build applications using large language models (LLMs), including prompting, one-shot or few-shot prompting, fine-tuning, and pretraining. He recommends starting with prompting and gradually moving on to more complex techniques if necessary. He also mentions the challenges of fine-tuning a proprietary model like GPT-4 and notes that smaller models may not always deliver superior results.

The author also mentions a "fun fact" about a member of the DeepLearning.AI team trying to fine-tune Llama-2-7B to sound like him, which suggests that the author may be exploring the possibility of creating a personalized AI model.

Some key points from the email include:

1. There are several ways to build applications using LLMs, each with increasing cost and complexity.
2. Prompting is a quick and easy way to build a prototype, and it's a good starting point for most teams.
3. Fine-tuning is a more complex approach that requires a smal
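
All three summaries above stop mid-sentence, which suggests the replies hit an output length limit (an assumption; the notebook does not state the cause). A rough heuristic for flagging such replies before comparing them is sketched below; `looks_truncated` is a hypothetical helper, and it will misjudge replies that legitimately end without punctuation.

```python
def looks_truncated(text, terminators=".!?\"')`"):
    """Heuristic: a reply that does not end in closing punctuation was probably cut off."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in terminators


# Endings copied from the recorded outputs above, plus one complete sentence.
print(looks_truncated("6. The author plans to discuss how to make this choice in a future letter.\n\nRegard"))
print(looks_truncated("6. Choosing a specific model: Smaller mo"))
print(looks_truncated("No, Jeff and Eddy are not neighbors."))
```

In the notebook, calling `looks_truncated(response_7b)` and its siblings would indicate which summaries are incomplete, which matters when a later step grades them.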

#### Model-Graded Evaluation: Summarization

- Interestingly, you can ask an LLM to evaluate the responses of other LLMs.
- This is known as **Model-Graded Evaluation**.

- Create a `prompt` that will evaluate these three responses using the 70B parameter chat model (`llama-2-70b-chat`).
- In the `prompt`, provide the "email", "name of the models", and the "summary" generated by each model.

In [26]:
prompt = f"""
Given the original text denoted by `email`
and the name of several models: `model:<name of model>`
as well as the summary generated by that model: `summary`

Provide an evaluation of each model's summary:
- Does it summarize the original text well?
- Does it follow the instructions of the prompt?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

email: ```{email}```

model: llama-2-7b-chat
summary: {response_7b}

model: llama-2-13b-chat
summary: {response_13b}

model: llama-2-70b-chat
summary: {response_70b}
"""

response_eval = llama(prompt,
                      model="togethercomputer/llama-2-70b-chat")
print(response_eval)

  Based on the summaries provided, it seems that all three models (llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat) were able to capture the main points of the email. However, there are some differences in the way the information is presented and the level of detail provided.

Llama-2-7b-chat's summary is the shortest and most concise, focusing on the key points of the email. It does not provide any additional information or insights beyond what is mentioned in the email.

Llama-2-13b-chat's summary is slightly longer and provides more context, including the author's recommendations for choosing a development approach and the trade-offs between using smaller or larger models. It also mentions the "fun fact" about the DeepLearning.AI team trying to fine-tune Llama-2-7B.

Llama-2-70b-chat's summary is the longest and most detailed, providing a comprehensive overview of the email's content. It includes all the key points mentioned in the other two summaries and adds additional inf
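
The evaluation prompt repeats the same `model:` / `summary:` block for each model, so it can be generated from a dictionary. The sketch below shows one way to do that; `build_eval_prompt` is a hypothetical helper, and the short strings in the usage example are placeholders, not real model output.

```python
def build_eval_prompt(email, summaries):
    """Build a model-graded evaluation prompt from {model name: summary}."""
    header = (
        "Given the original text denoted by `email`\n"
        "and the name of several models: `model:<name of model>`\n"
        "as well as the summary generated by that model: `summary`\n\n"
        "Provide an evaluation of each model's summary:\n"
        "- Does it summarize the original text well?\n"
        "- Does it follow the instructions of the prompt?\n"
        "- Are there any other interesting characteristics of the model's output?\n\n"
        "Then compare the models based on their evaluation "
        "and recommend the models that perform the best.\n\n"
    )
    sections = [f"email: ```{email}```"]
    for name, summary in summaries.items():
        sections.append(f"model: {name}\nsummary: {summary}")
    return "\n" + header + "\n\n".join(sections) + "\n"


# Placeholder inputs so the sketch runs on its own.
eval_prompt = build_eval_prompt(
    "Dear Amit, (email text)",
    {
        "llama-2-7b-chat": "summary A",
        "llama-2-13b-chat": "summary B",
        "llama-2-70b-chat": "summary C",
    },
)
print(eval_prompt)
```

In the notebook, the dictionary values would be `response_7b`, `response_13b`, and `response_70b`. Adding a fourth model then only requires one more dictionary entry.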

### Task 3: Reasoning
- Compare the three models' performance on reasoning tasks.

In [27]:
context = """
Jeff and Tommy are neighbors

Tommy and Eddy are not neighbors
"""

In [28]:
query = """
Are Jeff and Eddy neighbors?
"""

In [29]:
prompt = f"""
Given this context: ```{context}```,

and the following query:
```{query}```

Please answer the question in the query and explain your reasoning.
If there is not enough information to answer, please say
"I do not have enough information to answer this question."
"""

- First, use the 7B parameter chat model (`llama-2-7b-chat`) for the response.

In [30]:
response_7b_chat = llama(prompt,
                        model="togethercomputer/llama-2-7b-chat")
print(response_7b_chat)

  Sure, I'd be happy to help! Based on the information provided, we can answer the query as follows:

Are Jeff and Eddy neighbors?

No, Jeff and Eddy are not neighbors.

Reasoning:

* Jeff and Tommy are neighbors (given)
* Tommy and Eddy are not neighbors (given)

So, Jeff and Eddy are not neighbors because they are not in a direct relationship (neighborhood) with each other.


- Now, use the 13B parameter chat model (`llama-2-13b-chat`) for the response.

In [31]:
response_13b_chat = llama(prompt,
                        model="togethercomputer/llama-2-13b-chat")
print(response_13b_chat)

  Based on the information provided, I do not have enough information to answer the question "Are Jeff and Eddy neighbors?" because there is no information about the relationship between Jeff and Eddy. The only information provided is that Tommy and Eddy are not neighbors, but there is no information about Jeff's relationship with either Tommy or Eddy. Therefore, I cannot determine whether Jeff and Eddy are neighbors or not.


- Lastly, use the 70B parameter chat model (`llama-2-70b-chat`) for the response.

In [32]:
response_70b_chat = llama(prompt,
                        model="togethercomputer/llama-2-70b-chat")
print(response_70b_chat)

  No, Jeff and Eddy are not neighbors.

The given information states that Jeff and Tommy are neighbors, and Tommy and Eddy are not neighbors. Since Tommy is not a neighbor of Eddy, it means that Eddy is not a neighbor of Tommy. Therefore, Jeff, who is a neighbor of Tommy, cannot be a neighbor of Eddy.
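
As a sanity check on the three answers above, the puzzle is small enough to enumerate. The sketch below assumes that "neighbors" is a symmetric relation with no further constraints (in particular, it is not assumed to be transitive) and lists every assignment consistent with the two stated facts. Under that assumption, both "yes" and "no" remain possible for Jeff and Eddy.

```python
from itertools import product

JEFF_TOMMY = frozenset(("Jeff", "Tommy"))
TOMMY_EDDY = frozenset(("Tommy", "Eddy"))
JEFF_EDDY = frozenset(("Jeff", "Eddy"))
pairs = [JEFF_TOMMY, TOMMY_EDDY, JEFF_EDDY]

# Enumerate every neighbor / not-neighbor assignment for the three pairs and
# keep the value of (Jeff, Eddy) in assignments that match the given facts.
consistent_answers = set()
for values in product([True, False], repeat=len(pairs)):
    world = dict(zip(pairs, values))
    if world[JEFF_TOMMY] and not world[TOMMY_EDDY]:
        consistent_answers.add(world[JEFF_EDDY])

if consistent_answers == {True, False}:
    verdict = "not enough information"
elif consistent_answers == {True}:
    verdict = "yes"
else:
    verdict = "no"
print(verdict)
```

For example, three houses in a row ordered Tommy, Jeff, Eddy satisfy both facts with Jeff and Eddy as neighbors, while moving Eddy to another street satisfies them with Jeff and Eddy not neighbors.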


#### Model-Graded Evaluation: Reasoning

- Again, ask an LLM to compare the three responses.
- Create a `prompt` that will evaluate these three responses using the 70B parameter chat model (`llama-2-70b-chat`).
- In the `prompt`, provide the `context`, `query`, "name of the models", and the "response" generated by each model.

In [33]:
prompt = f"""
Given the context `context:`,
also given the query (the task): `query:`
and given the name of several models: `model:<name of model>`,
as well as the response generated by that model: `response:`

Provide an evaluation of each model's response:
- Does it answer the query accurately?
- Does it provide a contradictory response?
- Are there any other interesting characteristics of the model's output?

Then compare the models based on their evaluation \
and recommend the models that perform the best.

context: ```{context}```

query: ```{query}```

model: llama-2-7b-chat
response: ```{response_7b_chat}```

model: llama-2-13b-chat
response: ```{response_13b_chat}```

model: llama-2-70b-chat
response: ```{response_70b_chat}```
"""

In [34]:
response_eval = llama(prompt, 
                      model="togethercomputer/llama-2-70b-chat")

print(response_eval)

  Evaluation of each model's response:

1. llama-2-7b-chat:
	* Accuracy: The model's response accurately answers the query by stating that Jeff and Eddy are not neighbors.
	* Contradictory response: No, the response does not provide a contradictory answer.
	* Other characteristics: The model's output provides a clear and concise explanation of the reasoning behind the answer, citing the given information and using logical deduction.
2. llama-2-13b-chat:
	* Accuracy: The model's response does not provide an answer to the query, stating that there is not enough information to determine whether Jeff and Eddy are neighbors.
	* Contradictory response: No, the response does not provide a contradictory answer.
	* Other characteristics: The model's output acknowledges the limitations of the given information and does not make an incorrect assumption or provide a misleading answer.
3. llama-2-70b-chat:
	* Accuracy: The model's response accurately states that Jeff and Eddy are not neighbors.
	* 
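
Because the reasoning prompt prescribes an exact sentence for the "not enough information" case, model-graded evaluation can be complemented by a simple string check. The sketch below is one hedged way to do that; `abstained` is a hypothetical helper, and the strings are the relevant sentences copied from the three recorded responses above. It only detects whether a model abstained and does not judge the quality of the reasoning.

```python
ABSTAIN_PHRASE = "do not have enough information"


def abstained(response):
    """True if the reply contains the abstention phrase the prompt asked for."""
    return ABSTAIN_PHRASE in response.lower()


# Key sentences copied from the recorded responses above.
recorded = {
    "llama-2-7b-chat": "No, Jeff and Eddy are not neighbors.",
    "llama-2-13b-chat": ("Based on the information provided, I do not have "
                         "enough information to answer the question"),
    "llama-2-70b-chat": "No, Jeff and Eddy are not neighbors.",
}

abstentions = {name: abstained(text) for name, text in recorded.items()}
print(abstentions)
```

In the notebook, the full `response_7b_chat`, `response_13b_chat`, and `response_70b_chat` strings could be passed instead. A check like this gives a reproducible signal to compare against the grader model's verdict.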