# Prompt Engineering with Llama

- https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/Prompt_Engineering_with_Llama.ipynb
- https://colab.research.google.com/github/meta-llama/llama-cookbook/blob/main/getting-started/Prompt_Engineering_with_Llama.ipynb

Prompt engineering is using natural language to produce a desired response from a large language model (LLM).

This interactive guide covers prompt engineering & best practices with Llama.

Note: This notebook can be adapted to any of the latest Llama models.

## Introduction

### Why now?

[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered in an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.

Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**.

### Llama Models

In 2023, Meta introduced the [Llama language models](https://www.llama.com/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.

Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.

#### Llama 3.1
1. `llama-3.1-8b` - base pretrained 8 billion parameter model
1. `llama-3.1-70b` - base pretrained 70 billion parameter model
1. `llama-3.1-405b` - base pretrained 405 billion parameter model
1. `llama-3.1-8b-instruct` - instruction fine-tuned 8 billion parameter model
1. `llama-3.1-70b-instruct` - instruction fine-tuned 70 billion parameter model
1. `llama-3.1-405b-instruct` - instruction fine-tuned 405 billion parameter model (flagship)


#### Llama 3
1. `llama-3-8b` - base pretrained 8 billion parameter model
1. `llama-3-70b` - base pretrained 70 billion parameter model
1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model
1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)

#### Llama 2
1. `llama-2-7b` - base pretrained 7 billion parameter model
1. `llama-2-13b` - base pretrained 13 billion parameter model
1. `llama-2-70b` - base pretrained 70 billion parameter model
1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model
1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model
1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)

## Getting an LLM

Large language models are deployed and accessed in a variety of ways, including:

1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your MacBook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).
    * Best for privacy/security or if you already have a GPU.
1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.
    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).
1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.
    * Easiest option overall.

### Hosted APIs

Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:

1. **`completion`**: generate a response to a given prompt (a string).
1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots.

## Tokens

LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...

> Our destiny is written in the stars.

...is tokenized into `["Our", " destiny", " is", " written", " in", " the", " stars", "."]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.

Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).

Each model has a maximum context length that your prompt cannot exceed. That's 128k tokens for Llama 3.1, 4K for Llama 2, and 100K for Code Llama.
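
As a rough illustration of budgeting a prompt against a context window, the sketch below counts words and punctuation marks as a stand-in for tokens. The helpers `rough_token_count` and `fits_in_context`, the `CONTEXT_LIMITS` table (rounded from the figures above), and the `reserve_for_output` default are assumptions made for illustration. A real application should count tokens with the model's own tokenizer.

```python
import re

# Rounded context limits taken from the text above (assumption: exact limits
# vary by model release and hosting provider).
CONTEXT_LIMITS = {
    "llama-3.1": 128_000,
    "llama-2": 4_000,
    "code-llama": 100_000,
}


def rough_token_count(text: str) -> int:
    """Approximate tokens as words plus punctuation marks (not a real tokenizer)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fits_in_context(prompt: str, family: str, reserve_for_output: int = 512) -> bool:
    """Check that the prompt leaves `reserve_for_output` tokens free in the window.

    Assumption: generated tokens share the same window as the prompt.
    """
    return rough_token_count(prompt) + reserve_for_output <= CONTEXT_LIMITS[family]


# 8 pieces, which happens to match the Llama 3 tokenization shown above
print(rough_token_count("Our destiny is written in the stars."))

# A 5000-word prompt does not fit in the 4K window used for Llama 2
print(fits_in_context("word " * 5000, "llama-2"))
```

The word-level count is only a lower-bound style estimate, since real tokenizers often split rare words into several tokens.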

## Notebook Setup

The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama chat models using Groq:

- https://console.groq.com/docs/models

To install prerequisites run:

In [None]:
!pip install -qq groq

In [None]:
import os
from typing import Dict, List
from groq import Groq

# Get a free API key from https://console.groq.com/keys
# os.environ["GROQ_API_KEY"] = "YOUR_GROQ_API_KEY"

LLAMA3_70B_INSTRUCT = "llama-3.3-70b-versatile"
LLAMA3_8B_INSTRUCT = "llama-3.1-8b-instant"

DEFAULT_MODEL = LLAMA3_70B_INSTRUCT

client = Groq()

def assistant(content: str):
    return { "role": "assistant", "content": content }

def user(content: str):
    return { "role": "user", "content": content }

def chat_completion(
    messages: List[Dict],
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    response = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content
        

def completion(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    return chat_completion(
        [user(prompt)],
        model=model,
        temperature=temperature,
        top_p=top_p,
    )

def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):
    print(f'==============\n{prompt}\n==============')
    response = completion(prompt, model)
    print(response, end='\n\n')

### Completion APIs

Let's try it with the default Llama model!

In [None]:
complete_and_print("The typical color of the sky is: ")

The typical color of the sky is: 
Blue.



In [None]:
complete_and_print("which model version are you?")

which model version are you?
I am a Meta AI model, and my version is Meta LLama.



### Chat Completion APIs

Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some "context" or "history" from which to continue.

Typically, each message contains `role` and `content`:
* Messages with the `system` role are used to provide core instruction to the LLM by developers.
* Messages with the `user` role are typically human-provided messages.
* Messages with the `assistant` role are typically generated by the LLM.
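
The notebook setup defines helpers for the `user` and `assistant` roles only. A minimal sketch of how a `system` message could sit at the head of the list is shown below; the `system` helper and the instruction text are hypothetical additions, and nothing is sent to an API here.

```python
from typing import Dict, List


def system(content: str) -> Dict[str, str]:
    # Hypothetical helper mirroring the user()/assistant() helpers in the setup
    return {"role": "system", "content": content}


def user(content: str) -> Dict[str, str]:
    return {"role": "user", "content": content}


def assistant(content: str) -> Dict[str, str]:
    return {"role": "assistant", "content": content}


# The system message carries the developer's core instruction;
# the remaining messages are the conversation history.
messages: List[Dict[str, str]] = [
    system("Answer in one short sentence."),
    user("My favorite color is blue."),
    assistant("That's great to hear!"),
    user("What is my favorite color?"),
]

for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

A list built this way has the same shape as the `messages` argument that `chat_completion` accepts.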

In [None]:
response = chat_completion(messages=[
    user("My favorite color is blue."),
    assistant("That's great to hear!"),
    user("What is my favorite color?"),
])
print(response)

You mentioned earlier that your favorite color is blue.


### LLM Hyperparameters

#### `temperature` & `top_p`

These APIs also take parameters which influence the creativity and determinism of your output.

At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are "cut" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).

In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.
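
To make the two knobs concrete, here is a toy sampler over a four-token vocabulary. It is a sketch, not how any particular inference server is implemented: the tokens and logits are made up, and it assumes temperature rescales the logits before the `top_p` cut, which is one common ordering.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}  # renormalize


def sample_token(tokens, logits, temperature, top_p, rng):
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return tokens[rng.choices(indices, weights=weights, k=1)[0]]


# Made-up vocabulary and logits for "The typical color of the sky is: "
TOKENS = ["blue", "gray", "orange", "green"]
LOGITS = [3.0, 1.5, 1.0, -1.0]

rng = random.Random(0)
low = [sample_token(TOKENS, LOGITS, 0.01, 0.01, rng) for _ in range(5)]
high = [sample_token(TOKENS, LOGITS, 1.0, 1.0, rng) for _ in range(5)]
print("low temperature/top_p: ", low)   # always the single most likely token
print("high temperature/top_p:", high)  # may mix in less likely tokens
```

With near-zero settings only the top token survives the cut, which is why such generations are almost deterministic.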

[Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683).

Let's try it out:

In [None]:
def print_tuned_completion(temperature: float, top_p: float):
    response = completion("Write a haiku about llamas", temperature=temperature, top_p=top_p)
    print(f'[temperature: {temperature} | top_p: {top_p}]\n{response.strip()}\n')

print_tuned_completion(0.01, 0.01)
print_tuned_completion(0.01, 0.01)
# These two generations are highly likely to be the same

print_tuned_completion(1.0, 1.0)
print_tuned_completion(1.0, 1.0)
# These two generations are highly likely to be different

[temperature: 0.01 | top_p: 0.01]
Fuzzy, gentle soul
Llama's soft and watchful eyes
Misty mountain friend

[temperature: 0.01 | top_p: 0.01]
Fuzzy, gentle soul
Llama's soft and watchful eyes
Misty mountain friend

[temperature: 1.0 | top_p: 1.0]
Fuzzy, gentle soul
Llama's soft and curious
Misty mountain friend

[temperature: 1.0 | top_p: 1.0]
Fuzzy, gentle soul
Llama's soft and watchful eyes
Misty mountain friend



## Prompting Techniques

### Explicit Instructions

Detailed, explicit instructions produce better results than open-ended prompts:

In [None]:
complete_and_print(prompt="Describe quantum physics in one short sentence of no more than 12 words")

Describe quantum physics in one short sentence of no more than 12 words
Quantum physics studies behavior of tiny particles and energy.



You can think of explicit instructions as rules and restrictions on how Llama 3 responds to your prompt.

- Stylization
    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`
    - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`
    - `Give your answer like an old timey private investigator hunting down a case step by step.`
- Formatting
    - `Use bullet points.`
    - `Return as a JSON object.`
    - `Use less technical terms and help me apply it in my work in communications.`
- Restrictions
    - `Only use academic papers.`
    - `Never give sources older than 2020.`
    - `If you don't know the answer, say that you don't know.`
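
The categories above can be combined mechanically. The sketch below assembles a prompt from a task plus optional stylization, formatting, and restriction strings; `build_prompt` and its parameter names are hypothetical, and the ordering of the parts is just one reasonable choice.

```python
from typing import Iterable, Optional


def build_prompt(
    task: str,
    style: Optional[str] = None,
    formatting: Optional[str] = None,
    restrictions: Iterable[str] = (),
) -> str:
    """Join the optional instruction pieces around the task into a single prompt."""
    parts = []
    if style:
        parts.append(style)
    parts.append(task)
    if formatting:
        parts.append(formatting)
    parts.extend(restrictions)
    return " ".join(part.strip() for part in parts)


prompt = build_prompt(
    "Explain the latest advances in large language models to me.",
    formatting="Use bullet points.",
    restrictions=[
        "Only use academic papers.",
        "Never give sources older than 2020.",
    ],
)
print(prompt)
```

The resulting string can be passed to `complete_and_print` like any other prompt.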

Here's an example of using explicit instructions to get more specific results by limiting responses to recently published sources.

In [None]:
complete_and_print("Explain the latest advances in large language models to me.")
# More likely to cite sources from 2017

Explain the latest advances in large language models to me.
Large language models (LLMs) have made tremendous progress in recent years, with significant advances in areas like natural language processing, machine learning, and artificial intelligence. Here are some of the latest developments:

1. **Transformer Architecture**: The transformer architecture, introduced in 2017, has become the foundation for most modern LLMs. This architecture relies on self-attention mechanisms to process input sequences, allowing for parallelization and improved performance. Recent variants, such as the BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Pretraining Approach), have further refined this architecture.
2. **Pre-training and Fine-tuning**: Pre-training LLMs on large datasets and fine-tuning them on specific tasks has become a standard approach. This allows models to learn general language representations and then adapt to specific tasks, such a

In [None]:
complete_and_print("Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.")
# Gives more specific advances and only cites sources from 2020

Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.
The field of large language models has seen significant advancements in recent years. One of the most notable developments is the introduction of transformer-based architectures, which have become the standard for many state-of-the-art language models (Vaswani et al., 2020) [1]. These models use self-attention mechanisms to weigh the importance of different input elements, allowing for more efficient and effective processing of sequential data.

More recently, researchers have explored the use of larger and more complex models, such as the BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly optimized BERT approach) models (Liu et al., 2020) [2]. These models have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks, including question answering, sentiment analysis, and text classification.

Ano

### Example Prompting using Zero- and Few-Shot Learning

A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).

#### Zero-Shot Prompting

Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called "zero-shot prompting".

Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting.

In [None]:
complete_and_print("Text: This was the best movie I've ever seen! \n The sentiment of the text is: ")
# Returns positive sentiment

Text: This was the best movie I've ever seen! 
 The sentiment of the text is: 
The sentiment of the text is: Positive. The text expresses a strong and enthusiastic opinion, using the phrase "the best movie I've ever seen", which indicates a very favorable sentiment.



In [None]:
complete_and_print("Text: The director was trying too hard. \n The sentiment of the text is: ")
# Returns negative sentiment

Text: The director was trying too hard. 
 The sentiment of the text is: 
The sentiment of the text is: Negative.

The phrase "trying too hard" implies that the director's efforts were excessive or insincere, which has a negative connotation.



#### Few-Shot Prompting

Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called "few-shot prompting".

In this example, the few-shot messages establish the desired format: a more nuanced sentiment classifier that reports confidence percentages for positive, neutral, and negative.

See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).

In [None]:
def sentiment(text):
    response = chat_completion(messages=[
        user("You are a sentiment classifier. For each message, give the percentage of positive/neutral/negative."),
        user("I liked it"),
        assistant("70% positive 30% neutral 0% negative"),
        user("It could be better"),
        assistant("0% positive 50% neutral 50% negative"),
        user("It's fine"),
        assistant("25% positive 50% neutral 25% negative"),
        user(text),
    ])
    return response

def print_sentiment(text):
    print(f'INPUT: {text}')
    print(sentiment(text))

In [None]:
print_sentiment("I thought it was okay")
# More likely to return a balanced mix of positive, neutral, and negative

INPUT: I thought it was okay
10% positive 80% neutral 10% negative


In [None]:
print_sentiment("I loved it!")
# More likely to return 100% positive

INPUT: I loved it!
90% positive 10% neutral 0% negative


In [None]:
print_sentiment("Terrible service 0/10")
# More likely to return 100% negative

INPUT: Terrible service 0/10
0% positive 0% neutral 100% negative
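
Because few-shot prompting makes the output format predictable, the response can be parsed into structured data. The sketch below assumes the reply follows the `N% positive N% neutral N% negative` pattern seen above; `parse_sentiment` is a hypothetical helper, and it returns `None` when the model strays from that format.

```python
import re
from typing import Dict, Optional

# Matches fragments such as "70% positive" or "30% Neutral"
_SCORE_PATTERN = re.compile(r"(\d+)%\s*(positive|neutral|negative)", re.IGNORECASE)


def parse_sentiment(response: str) -> Optional[Dict[str, int]]:
    """Extract the three percentages, or None if any label is missing."""
    scores = {
        label.lower(): int(value)
        for value, label in _SCORE_PATTERN.findall(response)
    }
    if set(scores) != {"positive", "neutral", "negative"}:
        return None
    return scores


print(parse_sentiment("10% positive 80% neutral 10% negative"))
print(parse_sentiment("I think it is mostly positive."))
```

Returning `None` on a format mismatch lets the caller decide whether to retry the request or fall back to another strategy.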


### Role Prompting

Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://arxiv.org/pdf/2308.07702)). Roles give context to the LLM on what type of answers are desired.

Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch.

In [None]:
complete_and_print("Explain the pros and cons of using PyTorch.")
# More likely to cover general areas like documentation and the PyTorch community, and to mention a steep learning curve

Explain the pros and cons of using PyTorch.
PyTorch is an open-source machine learning library developed by Facebook's AI Research Lab (FAIR). It is primarily used for building and training deep learning models. Here are the pros and cons of using PyTorch:

**Pros:**

1. **Dynamic Computation Graph**: PyTorch uses a dynamic computation graph, which allows for more flexibility and ease of use. This is in contrast to static computation graphs used in other frameworks like TensorFlow.
2. **Rapid Prototyping**: PyTorch's dynamic computation graph and Pythonic API make it ideal for rapid prototyping and research.
3. **Ease of Use**: PyTorch has a relatively low barrier to entry, making it easier for developers to get started with deep learning.
4. **Strong GPU Support**: PyTorch has excellent support for GPU acceleration, making it well-suited for large-scale deep learning tasks.
5. **Large Community**: PyTorch has a large and active community, which means there are many resources available

In [None]:
complete_and_print("Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.")
# Often results in more technical benefits and drawbacks that provide more technical details on how model layers

Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.
As a machine learning expert, I'd be happy to provide a detailed analysis of the pros and cons of using PyTorch, a popular open-source machine learning framework.

**Pros:**

1. **Dynamic Computation Graph**: PyTorch's dynamic computation graph allows for more flexibility and ease of use, particularly when working with complex, dynamic models. This is in contrast to static computation graphs used in other frameworks like TensorFlow, which can be more rigid and difficult to modify.
2. **Rapid Prototyping**: PyTorch's Pythonic API and dynamic computation graph make it ideal for rapid prototyping and research. It's easy to quickly test and iterate on new ideas, which is essential in the fast-paced field of machine learning.
3. **Modular Architecture**: PyTorch's modular architecture allows for easy integration of new m

### Chain-of-Thought

Simply adding a phrase encouraging step-by-step thinking "significantly improves the ability of large language models to perform complex reasoning" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called "CoT" or "Chain-of-Thought" prompting.

Llama 3.1 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness.

In [None]:
prompt = "Who lived longer, Mozart or Elvis?"

complete_and_print(prompt)
# Llama 2 would often give the incorrect answer of "Mozart"

complete_and_print(f"{prompt} Let's think through this carefully, step by step.")
# Gives the correct answer "Elvis"

Who lived longer, Mozart or Elvis?
Wolfgang Amadeus Mozart lived from January 27, 1756, to December 5, 1791. He died at the age of 35.

Elvis Presley lived from January 8, 1935, to August 16, 1977. He died at the age of 42.

So, Elvis Presley lived longer than Mozart by about 7 years.

Who lived longer, Mozart or Elvis? Let's think through this carefully, step by step.
To determine who lived longer, Mozart or Elvis, let's break down the information step by step.

1. **Mozart's lifespan**: Wolfgang Amadeus Mozart was born on January 27, 1756, and he died on December 5, 1791. To find his age at death, we subtract his birth year from his death year: 1791 - 1756 = 35. Therefore, Mozart lived to be 35 years old.

2. **Elvis's lifespan**: Elvis Presley was born on January 8, 1935, and he died on August 16, 1977. To find his age at death, we subtract his birth year from his death year: 1977 - 1935 = 42. Therefore, Elvis lived to be 42 years old.

3. **Comparison**: Now that we have the ages a
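
The year subtraction in the reasoning above is exact only when the date of death falls on or after the birthday in that year. As a small check of the arithmetic, the sketch below computes the age from the full dates quoted in the model's answer; `age_at_death` is a hypothetical helper.

```python
from datetime import date


def age_at_death(born: date, died: date) -> int:
    """Completed years of life, accounting for whether the birthday had passed."""
    years = died.year - born.year
    if (died.month, died.day) < (born.month, born.day):
        years -= 1  # died before reaching that year's birthday
    return years


mozart = age_at_death(date(1756, 1, 27), date(1791, 12, 5))
elvis = age_at_death(date(1935, 1, 8), date(1977, 8, 16))

print(f"Mozart: {mozart}, Elvis: {elvis}")
print("Elvis lived longer" if elvis > mozart else "Mozart lived longer")
```

Both men died after their birthday in the final year, so the simple subtraction and the date-aware calculation agree here.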

### Self-Consistency

LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) improves accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):

In [None]:
import re
from statistics import mode

def gen_answer():
    response = completion(
        "John found that the average of 15 numbers is 40. "
        "If 10 is added to each number then the mean of the numbers is? "
        "Report the answer surrounded by backticks (example: `123`)",
    )
    match = re.search(r'`(\d+)`', response)
    if match is None:
        return None
    return match.group(1)

answers = [gen_answer() for i in range(5)]  # The same question is asked to the AI model five times, collecting various answers.

print(
    f"Answers: {answers}\n",
    f"Final answer: {mode(answers)}",
    )

# Sample runs of Llama-3-70B (all correct):
# ['60', '50', '50', '50', '50'] -> 50
# ['50', '50', '50', '60', '50'] -> 50
# ['50', '50', '60', '50', '50'] -> 50

Answers: ['50', '50', '50', '50', '50']
 Final answer: 50
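
One detail worth handling in the voting step: `gen_answer` returns `None` when no backticked number is found, and `statistics.mode` would report `None` as the winner if most responses failed to parse. The sketch below is one possible variant that ignores unparsed responses before voting; `majority_vote` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent parsed answer, ignoring failed generations."""
    valid = [answer for answer in answers if answer is not None]
    if not valid:
        return None  # every generation failed to parse
    return Counter(valid).most_common(1)[0][0]


print(majority_vote(["60", "50", "50", "50", "50"]))
print(majority_vote([None, None, "50"]))
```

In the second call the single parsed answer wins even though unparsed responses are in the majority.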


### Retrieval-Augmented Generation

You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):

In [None]:
complete_and_print("What is the capital of California?")
# Gives the correct answer "Sacramento"

What is the capital of California?
The capital of California is Sacramento.



However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:

In [None]:
complete_and_print("What was the temperature in Menlo Park on December 12th, 2023?")
# "I'm just an AI, I don't have access to real-time weather data or historical weather records."

complete_and_print("What time is my dinner reservation on Saturday and what should I wear?")
# "I'm not able to access your personal information [..] I can provide some general guidance"

What was the temperature in Menlo Park on December 12th, 2023?
I'm not able to provide real-time or historical weather data for specific locations. However, I can suggest some options to help you find the temperature in Menlo Park on December 12th, 2023:

1. Check online weather websites: You can visit websites like AccuWeather, Weather.com, or the National Weather Service (NWS) to see if they have archived weather data for Menlo Park on December 12th, 2023.
2. Contact local weather stations: Reach out to local weather stations or news outlets in the Menlo Park area to see if they have records of the temperature on December 12th, 2023.
3. Use weather apps: Download weather apps like Dark Sky or Weather Underground, which may have historical weather data for specific locations, including Menlo Park.

Please note that my knowledge cutoff is December 2023, but I don't have real-time access to current or historical weather data.

What time is my dinner reservation on Saturday and what shou

Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt that you've retrieved from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning, which may be costly and can negatively impact the foundational model's capabilities.

This could be as simple as a lookup table or as sophisticated as a vector database (for example, [FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:

In [None]:
MENLO_PARK_TEMPS = {
    "2023-12-11": "52 degrees Fahrenheit",
    "2023-12-12": "51 degrees Fahrenheit",
    "2023-12-13": "51 degrees Fahrenheit",
}


def prompt_with_rag(retrieved_info, question):
    complete_and_print(
        f"Given the following information: '{retrieved_info}', respond to: '{question}'"
    )


def ask_for_temperature(day):
    temp_on_day = MENLO_PARK_TEMPS.get(day) or "unknown temperature"
    prompt_with_rag(
        f"The temperature in Menlo Park was {temp_on_day} on {day}",  # Retrieved fact
        f"What is the temperature in Menlo Park on {day}?",  # User question
    )


ask_for_temperature("2023-12-12")
# "Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit."

ask_for_temperature("2023-07-18")
# "I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown."

Given the following information: 'The temperature in Menlo Park was 51 degrees Fahrenheit on 2023-12-12', respond to: 'What is the temperature in Menlo Park on 2023-12-12?'
The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.

Given the following information: 'The temperature in Menlo Park was unknown temperature on 2023-07-18', respond to: 'What is the temperature in Menlo Park on 2023-07-18?'
The temperature in Menlo Park on 2023-07-18 is unknown.
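
When the facts are free text rather than a keyed table, a retrieval step has to pick the most relevant passage first. The sketch below uses word overlap as a toy stand-in for the embedding similarity a vector database would compute; `tokenize`, `retrieve`, and `build_rag_prompt` are hypothetical helpers, and the documents simply restate the lookup table above.

```python
import re
from typing import List

DOCUMENTS = [
    "The temperature in Menlo Park was 52 degrees Fahrenheit on 2023-12-11.",
    "The temperature in Menlo Park was 51 degrees Fahrenheit on 2023-12-12.",
    "The temperature in Menlo Park was 51 degrees Fahrenheit on 2023-12-13.",
]


def tokenize(text: str) -> set:
    """Lowercase word set; hyphens are kept so dates stay in one piece."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Rank documents by how many words they share with the question."""
    query_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(question: str, documents: List[str]) -> str:
    """Place the retrieved passages in the prompt ahead of the question."""
    context = " ".join(retrieve(question, documents))
    return f"Given the following information: '{context}', respond to: '{question}'"


question = "What was the temperature in Menlo Park on 2023-12-12?"
print(build_rag_prompt(question, DOCUMENTS))
```

The matching date is the only word that separates the three passages, so the passage for 2023-12-12 ranks first and ends up in the prompt.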



### Program-Aided Language Models

LLMs, by nature, aren't great at performing calculations. Let's try:

$$
((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
$$

(The correct answer is 91383.)

In [None]:
complete_and_print("""
Calculate the answer to the following math problem:

((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
""")
# Gives incorrect answers like 92448, 92648, 95463


Calculate the answer to the following math problem:

((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))

To calculate the answer, we need to follow the order of operations (PEMDAS):

1. Calculate the exponent: 4^4 = 256
2. Multiply 93 and 4: 93 * 4 = 372
3. Add and subtract from left to right:
   -5 + 372 - 0 = 367
4. Add and subtract from left to right inside the second parentheses:
   256 + (-7) + 0 * 5 = 256 - 7 + 0 = 249
5. Multiply the results of steps 3 and 4:
   367 * 249 = 91473

The answer is 91473.



[Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of "Program-aided Language Models" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks.

In [None]:
complete_and_print(
    """
    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    """,
)


    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    
### Calculating the Expression

The given expression is `((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))`. We can calculate this using Python.

```python
# Calculate the first part of the expression
first_part = -5 + 93 * 4 - 0

# Calculate the second part of the expression
second_part = 4 ** 4 + -7 + 0 * 5

# Calculate the final result
result = first_part * second_part

print("The result of the expression is:", result)
```

### Explanation

1. **First Part Calculation**: 
   - Multiply 93 by 4: `93 * 4 = 372`
   - Add -5 to the result: `-5 + 372 = 367`
   - Subtract 0 from the result: `367 - 0 = 367`

2. **Second Part Calculation**:
   - Calculate 4 to the power of 4: `4 ** 4 = 256`
   - Add -7 to the result: `256 + -7 = 249`
   - Multiply 0 by 5: `0 * 5 = 0`
   - Add the result to the previous result: `249 + 0 = 249`

3. **Final Result**:
   - Multiply the first part by the second part: `367 * 249 = 91413`

### A

In [None]:
# The following code was generated by Llama 3 70B:

result = ((-5 + 93 * 4 - 0) * (4 ** 4 + -7 + 0 * 5))
print("The result of the expression is:", result)

The result of the expression is: 91383
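
Running model-generated code directly carries risk, since the model may emit something other than arithmetic. For pure calculations, one option is to evaluate only a whitelisted subset of Python expressions. The sketch below does this with the standard `ast` module; `safe_eval` is a hypothetical helper and covers just the operators needed here.

```python
import ast
import operator

# Only these arithmetic operators are allowed
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str):
    """Evaluate an arithmetic expression, rejecting anything else (calls, names, ...)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


# The expression as written by the model, with ** for exponentiation
generated = "((-5 + 93 * 4 - 0) * (4 ** 4 + -7 + 0 * 5))"
print("The result of the expression is:", safe_eval(generated))  # 91383
```

Note that `^` is bitwise XOR in Python, so the model's translation of `4^4` into `4 ** 4` is what makes the result correct; this evaluator rejects `^` rather than silently computing the wrong value.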


### Limiting Extraneous Tokens

A common struggle with Llama 2 is getting output without extraneous tokens (ex. "Sure! Here's more information on..."), even when it is explicitly instructed to be concise and skip the preamble. Llama 3.x follows such instructions more reliably.

Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:

In [None]:
complete_and_print(
    "Give me the zip code for Menlo Park in JSON format with the field 'zip_code'",
)
# Likely returns the JSON and also "Sure! Here's the JSON..."

complete_and_print(
    """
    You are a robot that only outputs JSON.
    You reply in JSON format with the field 'zip_code'.
    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}
    Now here is my question: What is the zip code of Menlo Park?
    """,
)
# "{'zip_code': 94025}"

Give me the zip code for Menlo Park in JSON format with the field 'zip_code'
```json
{
  "zip_code": "94025"
}
```


    You are a robot that only outputs JSON.
    You reply in JSON format with the field 'zip_code'.
    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}
    Now here is my question: What is the zip code of Menlo Park?
    
{"zip_code": 94025}
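
Even with careful prompting, it can help to parse defensively, since a response may still arrive wrapped in a code fence or a short preamble. The sketch below pulls the first JSON object out of a response; `extract_json` is a hypothetical helper and assumes the reply contains at most one top-level object.

```python
import json
from typing import Any, Optional


def extract_json(response: str) -> Optional[Any]:
    """Parse the outermost {...} span in the response, or return None."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None


fence = "`" * 3  # built programmatically to keep this example easy to embed
fenced_reply = f'{fence}json\n{{\n  "zip_code": "94025"\n}}\n{fence}'
chatty_reply = 'Sure! Here is the JSON: {"zip_code": 94025}'
single_quoted_reply = "{'zip_code': 94025}"

print(extract_json(fenced_reply))
print(extract_json(chatty_reply))
print(extract_json(single_quoted_reply))  # None: single quotes are not valid JSON
```

The last case is worth noting: the single-quoted example answer in the prompt above is a Python-style dictionary rather than strict JSON, so a model that copies it literally would fail a strict parser.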



## Additional References
- [PromptingGuide.ai](https://www.promptingguide.ai/)
- [LearnPrompting.org](https://learnprompting.org/)
- [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)