# Prompt Engineering with Llama 3 - Using Amazon Bedrock + LangChain

# Added Section: Llama Guard

Prompt engineering is using natural language to produce a desired response from a large language model (LLM).

This interactive guide covers prompt engineering & best practices with Llama 3.

### Requirements

* You must have an AWS Account
* You have access to the Amazon Bedrock Service
* For authentication, you have configured your AWS Credentials - https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

### Note about LangChain 
The Bedrock classes provided by LangChain create a Bedrock boto3 client by default. Your AWS credentials are looked up automatically in your system's `~/.aws/` directory.

#### Example `~/.aws/`
    [default]
    aws_access_key_id=YourIDToken
    aws_secret_access_key=YourSecretToken
    aws_session_token=YourSessionToken
    region = us-east-1


## Introduction

### Why now?

[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered in an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.

Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**.

### Llama Models

In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama base, Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.

Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.

#### Llama 3
1. `llama-3-8b` - base pretrained 8 billion parameter model
1. `llama-3-70b` - base pretrained 70 billion parameter model
1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model
1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)

#### Llama 2
1. `llama-2-7b` - base pretrained 7 billion parameter model
1. `llama-2-13b` - base pretrained 13 billion parameter model
1. `llama-2-70b` - base pretrained 70 billion parameter model
1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model
1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model
1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)


#### Code Llama

Code Llama is a code-focused LLM built on top of Llama 2, also available in various sizes and fine-tunes:

1. `codellama-7b` - code fine-tuned 7 billion parameter model
1. `codellama-13b` - code fine-tuned 13 billion parameter model
1. `codellama-34b` - code fine-tuned 34 billion parameter model
1. `codellama-70b` - code fine-tuned 70 billion parameter model
1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model
2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model
3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model
3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model
1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model
2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model
3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model
3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model

#### Llama Guard
1. `llama-guard-7b` - input and output guardrails model (a llama-2-7b model)
2. `llama-guard-8b` - input and output guardrails model (a llama-3-8b model)

## Getting an LLM

Large language models are deployed and accessed in a variety of ways, including:

1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your MacBook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).
    * Best for privacy/security or if you already have a GPU.
1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.
    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).
1. **Hosted API**: Call LLMs directly via an API. Many companies provide Llama inference APIs, including Amazon Bedrock, Replicate, Anyscale, Together, and others.
    * Easiest option overall.

### Hosted APIs

Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:

1. **`completion`**: generate a response to a given prompt (a string).
1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots.

## Tokens

LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...

> Our destiny is written in the stars.

...is tokenized into `["Our", " destiny", " is", " written", " in", " the", " stars", "."]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.

Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).

Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3, 4K for Llama 2, and 100K for Code Llama. 
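
To make the context limit concrete, here is a rough budgeting sketch. The context sizes are the rounded figures quoted above. The four-characters-per-token ratio is only a common rule of thumb for English text, and the helper names `estimate_tokens` and `fits_context` are hypothetical. A real application should count tokens with the model's own tokenizer.

```python
# Context window sizes as quoted in this guide (rounded), in tokens.
CONTEXT_LIMITS = {"llama-3": 8_000, "llama-2": 4_000, "code-llama": 100_000}

# Assumption: roughly 4 characters per token for English text.
# This is a heuristic, not the behavior of any real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate using ceiling division on character count."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fits_context(prompt: str, family: str, reserved_for_output: int = 512) -> bool:
    """Check whether a prompt plausibly fits, leaving room for the generated reply."""
    budget = CONTEXT_LIMITS[family] - reserved_for_output
    return estimate_tokens(prompt) <= budget


print(estimate_tokens("Our destiny is written in the stars."))
print(fits_context("x" * 40_000, "llama-3"))
```

The estimate differs from the real count (the example sentence above is 8 Llama 3 tokens, while the heuristic gives 9), so treat it as a safety margin and not as a measurement.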

## Notebook Setup

The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 Instruct using [Amazon Bedrock](https://aws.amazon.com/bedrock/llama-2/) and we'll use LangChain to easily set up a chat completion API.

To install prerequisites run:

In [1]:
# install packages
!python3 -m pip install -qU boto3
!python3 -m pip install langchain

import boto3
import json 



In [15]:
from getpass import getpass
from urllib.request import urlopen
from typing import Dict, List
from langchain_community.llms import Bedrock
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.memory import ChatMessageHistory
from langchain.schema.messages import get_buffer_string
import os

# bedrock = boto3.client(service_name='bedrock', region_name='us-east-1')
# listModels = bedrock.list_foundation_models(byProvider='meta')
# print("\n".join(list(map(lambda x: f"{x['modelName']} : { x['modelId'] }", listModels['modelSummaries']))))


In [315]:
LLAMA3_70B_INSTRUCT = "meta.llama3-70b-instruct-v1:0"
LLAMA3_8B_INSTRUCT = "meta.llama3-8b-instruct-v1:0"
# We'll default to the smaller 8B model for speed; change to LLAMA3_70B_INSTRUCT for more advanced (but slower) generations
DEFAULT_MODEL = LLAMA3_8B_INSTRUCT


def completion(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    llm = Bedrock(model_id=model)
    prompt_str = ''.join(prompt)
    if not prompt_str.endswith("<|eot_id|><|start_header_id|>assistant<|end_header_id|>"):
        prompt += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    return llm.invoke(prompt, temperature=temperature, top_p=top_p)


def chat_completion(
    messages: List[Dict],
    model=DEFAULT_MODEL,
    temperature: float = 0.6, 
    top_p: float = 0.9,
) -> str:
    prompt_str = ''.join([msg['content'] for msg in messages])
    return completion(
        prompt_str,
        model,
        temperature,
        top_p,
    )


def assistant(content: str):
    return {"role": "assistant", "content": content}


def user(content: str):
    return {"role": "user", "content": content}


def complete_and_print(prompt: str, model: str = DEFAULT_MODEL, verbose: bool = True):
    print(f'==============\n{prompt}\n==============')
    if not prompt.endswith("<|eot_id|><|start_header_id|>assistant<|end_header_id|>"):
        prompt += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    response = completion(prompt, model)
    if verbose: 
        print(response, end='\n\n')
    return response


### Completion APIs

Llama 3 models tend to be wordy and explain their rationale. Later we'll explore how to manage the response length.

In [316]:
# complete_and_print("The typical color of the sky is: ")
complete_and_print("""The best service at AWS suitable to use when you want the traffic matters \
such as load balancing and bandwidth to be handled automatically are:""")

The best service at AWS suitable to use when you want the traffic matters such as load balancing and bandwidth to be handled automatically are:


When it comes to handling traffic matters such as load balancing and bandwidth, AWS offers several services that can help. Here are some of the best services suitable for this purpose:

1. **Elastic Load Balancer (ELB)**: ELB is a popular choice for load balancing, which distributes incoming traffic across multiple EC2 instances or containers. It can handle high traffic volumes and provides features like session persistence, sticky sessions, and SSL termination.
2. **Application Load Balancer (ALB)**: ALB is a more advanced load balancer that provides features like HTTP/2, WebSocket support, and support for multiple protocols (e.g., TCP, UDP). It's a good choice for applications that require advanced load balancing features.
3. **Network Load Balancer (NLB)**: NLB is a high-performance load balancer that's designed for low-latency application

"\n\nWhen it comes to handling traffic matters such as load balancing and bandwidth, AWS offers several services that can help. Here are some of the best services suitable for this purpose:\n\n1. **Elastic Load Balancer (ELB)**: ELB is a popular choice for load balancing, which distributes incoming traffic across multiple EC2 instances or containers. It can handle high traffic volumes and provides features like session persistence, sticky sessions, and SSL termination.\n2. **Application Load Balancer (ALB)**: ALB is a more advanced load balancer that provides features like HTTP/2, WebSocket support, and support for multiple protocols (e.g., TCP, UDP). It's a good choice for applications that require advanced load balancing features.\n3. **Network Load Balancer (NLB)**: NLB is a high-performance load balancer that's designed for low-latency applications. It's a good choice for applications that require fast response times and can handle high traffic volumes.\n4. **Route 53**: Route 53 i

### Chat Completion APIs
Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some "context" or "history" from which to continue.

Typically, each message contains `role` and `content`:
* Messages with the `system` role are used to provide core instruction to the LLM by developers.
* Messages with the `user` role are typically human-provided messages.
* Messages with the `assistant` role are typically generated by the LLM.
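
The `chat_completion` helper defined earlier joins the message contents and drops the roles. As a minimal sketch of how roles could instead be rendered into a single prompt string, the function below follows the special tokens of Meta's published Llama 3 Instruct prompt format. The name `format_llama3_chat` is hypothetical and is not part of LangChain or Bedrock.

```python
from typing import Dict, List


def format_llama3_chat(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages into one Llama 3 Instruct prompt string."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content'].strip()}<|eot_id|>"
        )
    # An open assistant header asks the model to write the next assistant turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(format_llama3_chat([
    {"role": "system", "content": "You answer in one short sentence."},
    {"role": "user", "content": "What is my favorite color?"},
]))
```

Note that the `completion` helper above appends its own assistant header when the prompt does not end with one in exactly its expected form, so a string built this way would need to be sent through a call that leaves the prompt untouched.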

In [93]:
response = chat_completion(messages=[
    user("My favorite color is blue."),
    assistant("That's great to hear!"),
    user("What is my favorite color?"),
])
print(response)
# "Sure, I can help you with that! Your favorite color is blue."



I'm glad you like blue! And... your favorite color is blue!


### LLM Hyperparameters

#### `temperature` & `top_p`

These APIs also take parameters which influence the creativity and determinism of your output.

At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are "cut" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).

In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.
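
To build intuition, here is a toy sketch of how the two parameters interact on a hand-made distribution. It is an illustration only, not how Bedrock or Llama implement sampling: the logits are invented, `sample_token` is a hypothetical helper, and it applies temperature before the `top_p` cut, which is one common ordering.

```python
import math
import random
from typing import Dict, Optional


def sample_token(
    logits: Dict[str, float],
    temperature: float,
    top_p: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token using temperature scaling followed by a top-p (nucleus) cut."""
    rng = rng or random.Random()
    # Temperature (> 0) rescales the logits: small values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # top_p keeps the smallest set of most likely tokens whose probability mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    # random.choices renormalizes the remaining weights before drawing.
    return rng.choices(tokens, weights=probs, k=1)[0]


toy_logits = {"stars": 2.0, "moon": 1.0, "sun": 0.5}  # made-up scores for illustration
print([sample_token(toy_logits, temperature=0.01, top_p=0.01) for _ in range(5)])
print([sample_token(toy_logits, temperature=1.0, top_p=1.0) for _ in range(5)])
```

With a very low temperature almost all probability mass lands on the top token, so the first list repeats the same word; with `temperature=1.0` and `top_p=1.0` every token stays eligible and the draws can vary.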

[Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683).

Let's try it out:

In [102]:
def print_tuned_completion(temperature: float, top_p: float):
    response = completion("Tell me a 25 word story about llamas in space", temperature=temperature, top_p=top_p)
    print(f'[temperature: {temperature} | top_p: {top_p}]\n{response.strip()}\n')

print_tuned_completion(0.01, 0.01)
print_tuned_completion(0.01, 0.01)
print_tuned_completion(0.01, 0.01)
print_tuned_completion(0.01, 0.01)
# These four generations are highly likely to be the same

print_tuned_completion(1.0, 0.5)
print_tuned_completion(1.0, 0.5)
print_tuned_completion(1.0, 0.5)
print_tuned_completion(1.0, 0.5)
# These four generations are highly likely to be different

[temperature: 0.01 | top_p: 0.01]
"As the spaceship soared through the galaxy, a curious llama named Zara gazed out at the stars, her soft humming harmonizing with the cosmos."

[temperature: 0.01 | top_p: 0.01]
"As the spaceship soared through the galaxy, a curious llama named Zara gazed out at the stars, her soft humming harmonizing with the cosmos."

[temperature: 0.01 | top_p: 0.01]
"As the spaceship soared through the galaxy, a curious llama named Zara gazed out at the stars, her soft humming harmonizing with the cosmos."

[temperature: 0.01 | top_p: 0.01]
"As the spaceship soared through the galaxy, a curious llama named Zara gazed out at the stars, her soft humming harmonizing with the cosmos."

[temperature: 1.0 | top_p: 0.5]
"As the spaceship soared through the galaxy, a team of llamas in tiny spacesuits gazed out at the stars, their soft hums harmonizing with the cosmos."

[temperature: 1.0 | top_p: 0.5]
As the spaceship soared through the galaxy, a curious llama named Zara g

## Prompting Techniques

### Explicit Instructions

Detailed, explicit instructions produce better results than open-ended prompts:

In [103]:
complete_and_print(prompt="Describe quantum physics in one short sentence with no more than 12 words")
# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously.

Describe quantum physics in one short sentence with no more than 12 words


"Quantum physics is the study of tiny particles' weird behavior."



You can think of explicit instructions as rules and restrictions on how Llama 3 responds to your prompt.

- Stylization
    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`
    - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`
    - `Give your answer like an old timey private investigator hunting down a case step by step.`
- Formatting
    - `Use bullet points.`
    - `Return as a JSON object.`
    - `Use less technical terms and help me apply it in my work in communications.`
- Restrictions
    - `Only use academic papers.`
    - `Never give sources older than 2020.`
    - `If you don't know the answer, say that you don't know.`

Here's an example of using explicit instructions to get more specific results by limiting the response to recently published sources.

In [98]:
complete_and_print("Explain the latest advances in large language models to me.")
# More likely to cite sources from 2017

complete_and_print("Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.")
# Gives more specific advances and only cites sources from 2020 or later

Explain the latest advances in large language models to me.


What a great question! Large language models have made tremendous progress in recent years, and I'd be happy to give you an overview of the latest advances.

**What are large language models?**

Large language models are artificial neural networks trained on vast amounts of text data to generate, understand, and generate human-like language. They're designed to process and analyze vast amounts of text, making them useful for tasks like language translation, text summarization, question answering, and even creative writing.

**Recent advances:**

1. **Scaling up models:** The most significant breakthrough has been the development of massive models with billions of parameters. These models, such as Google's BERT (Bidirectional Encoder Representations from Transformers) and its variants, have achieved state-of-the-art results in various NLP tasks.
2. **Pre-training:** Pre-training large language models on a large corpus of text

### Example Prompting using Zero- and Few-Shot Learning

A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).

#### Zero-Shot Prompting

Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called "zero-shot prompting".

Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting.

In [100]:
complete_and_print("Text: This was the best movie I've ever seen! \n The sentiment of the text is: ")
# Returns positive sentiment

complete_and_print("Text: The director was trying too hard. \n The sentiment of the text is: ")
# Returns negative sentiment

Text: This was the best movie I've ever seen! 
 The sentiment of the text is: 


The sentiment of the text is **POSITIVE**. The user is expressing their enthusiasm and admiration for the movie, stating it is the "best movie I've ever seen". The language used is superlatively positive, with no negative comments or criticisms.

Text: The director was trying too hard. 
 The sentiment of the text is: 


Negative.

The text expresses a negative sentiment towards the director, implying that they were trying too hard and likely failed to achieve their goals. The tone is critical and disapproving.




#### Few-Shot Prompting

Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called "few-shot prompting".

In this example, the prompt asks for a more nuanced sentiment classification: a confidence percentage for each of positive, neutral, and negative.

See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).



In [115]:
def sentiment(text):
    response = chat_completion(messages=[
        user("You are a sentiment classifier. Now, give the percentage of positive/neutral/negative in the format: [% positive % neutral % negative]"),
        user("I liked it"),
        assistant("70% positive 30% neutral 0% negative"),
        user("It could be better"),
        assistant("0% positive 50% neutral 50% negative"),
        user("It's fine"),
        assistant("25% positive 50% neutral 25% negative"),
        user(text),
    ])
    return response

def print_sentiment(text):
    print(f'INPUT: {text}')
    print(sentiment(text))
    print("\n")

print_sentiment("I thought it was okay")
# More likely to return a balanced mix of positive, neutral, and negative
print_sentiment("I loved it!")
# More likely to return 100% positive
print_sentiment("Terrible service 0/10")
# More likely to return 100% negative

INPUT: I thought it was okay


I'd be happy to help.

Here are the sentiment classifications:

* I liked it: [70% positive 30% neutral 0% negative]
* It could be better: [0% positive 50% neutral 50% negative]
* It's fine: [25% positive 50% neutral 25% negative]
* I thought it was okay: [0% positive 50% neutral 50% negative]


INPUT: I loved it!


I'm ready to classify the sentiment!

Here are the results:

* "I liked it" -> 70% positive, 30% neutral, 0% negative
* "It could be better" -> 0% positive, 50% neutral, 50% negative
* "It's fine" -> 25% positive, 50% neutral, 25% negative
* "I loved it!" -> 100% positive, 0% neutral, 0% negative

Let me know if you have more sentences to classify!


INPUT: Terrible service 0/10


I'd be happy to help. Here are the sentiment classifications:

* "I liked it": [70% positive 30% neutral 0% negative]
* "It could be better": [0% positive 50% neutral 50% negative]
* "It's fine": [25% positive 50% neutral 25% negative]
* "Terrible service 0/10": [0% 

### Role Prompting

Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.

Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch.

In [117]:
complete_and_print("Explain the pros and cons of using PyTorch.")
# More likely to explain the pros and cons of PyTorch in general terms, covering areas like documentation and the PyTorch community, and to mention a steep learning curve

complete_and_print("Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.")
# Often results in more technical benefits and drawbacks, with more detail about model layers

Explain the pros and cons of using PyTorch.


PyTorch is a popular open-source machine learning library developed by Facebook's AI Research Lab (FAIR). It is known for its simplicity, flexibility, and ease of use, making it a popular choice among researchers and developers. Here are some pros and cons of using PyTorch:

**Pros:**

1. **Easy to learn**: PyTorch has a simple and intuitive API, making it easy for beginners to learn and start building models quickly.
2. **Dynamic computation graph**: PyTorch allows you to build a computation graph dynamically, which is useful for rapid prototyping and experimentation.
3. **Automatic differentiation**: PyTorch provides automatic differentiation, which makes it easy to compute gradients and optimize model parameters.
4. **Flexible**: PyTorch supports a wide range of neural network architectures and allows you to customize your models to suit your specific needs.
5. **Large community**: PyTorch has a large and active community, which means th

### Chain-of-Thought

Simply adding a phrase encouraging step-by-step thinking "significantly improves the ability of large language models to perform complex reasoning" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called "CoT" or "Chain-of-Thought" prompting:

Llama 3 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness.

In [118]:
complete_and_print("Who lived longer Elvis Presley or Mozart?")
# Often gives incorrect answer of "Mozart"

complete_and_print("""Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.""")
# Gives the correct answer "Elvis"

Who lived longer Elvis Presley or Mozart?


Elvis Presley lived from January 8, 1935, to August 16, 1977, which is a total of 42 years.

Wolfgang Amadeus Mozart lived from January 27, 1756, to December 5, 1791, which is a total of 35 years.

So, Elvis Presley lived longer than Mozart.

Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.


A fun question!

Let's break it down:

1. Elvis Presley was born on January 8, 1935, and died on August 16, 1977.
2. Wolfgang Amadeus Mozart was born on January 27, 1756, and died on December 5, 1791.

Now, let's compare their lifespans:

* Elvis Presley lived for 42 years (1935-1977).
* Mozart lived for 35 years (1756-1791).

So, Mozart lived 7 years shorter than Elvis Presley.

Therefore, Elvis Presley lived longer than Mozart.



### Self-Consistency

LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) improves accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):

In [125]:
import re
from statistics import mode

def gen_answer():
    response = completion(
        # Trailing spaces keep the adjacent string literals from running together
        "John found that the average of 15 numbers is 40. "
        "If 10 is added to each number then the mean of the numbers is? "
        "Report the answer surrounded by three backticks, for example: ```123```",
    )
    match = re.search(r'```(\d+)```', response)
    if match is None:
        return None
    return match.group(1)

answers = [gen_answer() for i in range(5)]

print(
    f"Answers: {answers}\n",
    f"Final answer: {mode(answers)}",
    )

# Sample runs of Llama-3-8B-INSTRUCT (all correct):
# ['50', '50', '50', '50', '50'] -> 50
# ['50', None, '50', None, '50'] -> 50
# [None, '50', '50', '50', '50'] -> 50

Answers: [None, '50', '50', '50', '50']
 Final answer: 50
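
One caveat with the cell above: `statistics.mode` treats `None` (a generation where no answer could be parsed) as a candidate like any other, so enough failed parses would make `None` the "final answer". A small sketch of a vote that ignores failed parses is shown here; `majority_vote` is a hypothetical helper name, not part of the original notebook.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent parsed answer, ignoring generations that failed to parse."""
    valid = [answer for answer in answers if answer is not None]
    if not valid:
        return None  # every generation failed to produce a parseable answer
    # most_common(1) returns a one-element list of (answer, count) pairs.
    return Counter(valid).most_common(1)[0][0]


print(majority_vote([None, "50", "50", "50", "50"]))
print(majority_vote([None, None, "50"]))
```

Both calls print `50`, whereas `mode([None, None, "50"])` would return `None`.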


### Retrieval-Augmented Generation

You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):

In [126]:
complete_and_print("What is the capital of California?")
# Gives the correct answer "Sacramento"

What is the capital of California?


The capital of California is Sacramento.



However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:

In [127]:
complete_and_print("What was the temperature in Menlo Park on December 12th, 2023?")
# "I'm just an AI, I don't have access to real-time weather data or historical weather records."

complete_and_print("What time is my dinner reservation on Saturday and what should I wear?")
# "I'm not able to access your personal information [..] I can provide some general guidance"

What was the temperature in Menlo Park on December 12th, 2023?


I apologize, but I'm a large language model, I don't have real-time access to current or past weather data. Additionally, I don't have the capability to predict the future. Menlo Park's weather on December 12th, 2023, is not available to me.

However, I can suggest some ways for you to find the answer:

1. Check online weather websites: You can visit websites like AccuWeather, Weather.com, or the National Weather Service (NWS) to see the historical weather data for Menlo Park on December 12th, 2023.
2. Contact local authorities: You can reach out to the Menlo Park city government or the local weather office to ask for the temperature data.
3. Use a weather app: You can download a weather app on your smartphone that provides historical weather data. Some popular weather apps include Dark Sky, Weather Underground, or The Weather Channel.

Remember to always verify the accuracy of the information you find, especially when it

Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt that you've retrieved from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning, which may be costly and negatively impact the foundational model's capabilities.

This could be as simple as a lookup table or as sophisticated as a vector database (for example, one built on [FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:

In [128]:
MENLO_PARK_TEMPS = {
    "2023-12-11": "52 degrees Fahrenheit",
    "2023-12-12": "51 degrees Fahrenheit",
    "2023-12-13": "51 degrees Fahrenheit",
}


def prompt_with_rag(retrieved_info, question):
    complete_and_print(
        f"Given the following information: '{retrieved_info}', respond to: '{question}'"
    )


def ask_for_temperature(day):
    temp_on_day = MENLO_PARK_TEMPS.get(day) or "unknown temperature"
    prompt_with_rag(
        f"The temperature in Menlo Park was {temp_on_day} on {day}",  # Retrieved fact
        f"What is the temperature in Menlo Park on {day}?",  # User question
    )


ask_for_temperature("2023-12-12")
# "Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit."

ask_for_temperature("2023-07-18")
# "I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown."

Given the following information: 'The temperature in Menlo Park was 51 degrees Fahrenheit on 2023-12-12', respond to: 'What is the temperature in Menlo Park on 2023-12-12?'


The temperature in Menlo Park on 2023-12-12 is 51 degrees Fahrenheit.

Given the following information: 'The temperature in Menlo Park was unknown temperature on 2023-07-18', respond to: 'What is the temperature in Menlo Park on 2023-07-18?'


I can respond to that!

According to the information provided, the temperature in Menlo Park on 2023-07-18 was unknown.



### Program-Aided Language Models

LLMs, by nature, aren't great at performing calculations. Let's try:

$$
((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
$$

(The correct answer is 91383.)

In [140]:
complete_and_print("""
Calculate the answer to the following math problem:

((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
""")
# Gives incorrect answers like 92448, 92648, 95463


Calculate the answer to the following math problem:

((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))



To evaluate this expression, we need to follow the order of operations (PEMDAS):

1. Evaluate the expressions inside the parentheses:
	* (-5 + 93 * 4 - 0) = -5 + 372 - 0 = 367
	* (4^4 + -7 + 0 * 5) = 256 + -7 + 0 = 249
2. Multiply the two expressions:
	* 367 * 249 = 91,523

So the final answer is:

91,523



[Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of "Program-aided Language Models" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks.

In [135]:
complete_and_print(
    """
    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    """, LLAMA3_70B_INSTRUCT)


    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    


Let's break down the calculation step by step:

1. Evaluate the expression inside the parentheses:
	* `-5 + 93 * 4 - 0` = `-5 + 372 - 0` = `367`
2. Raise 4 to the power of 4:
	* `4^4` = `256`
3. Evaluate the expression inside the second set of parentheses:
	* `256 + -7 + 0 * 5` = `256 - 7 + 0` = `249`
4. Multiply the two results:
	* `367 * 249` = `91,383`

So the final answer is `91,383`.



In [136]:
# The following code was generated by Llama 3 70B:

result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))
print(result)

91383
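
Running model-generated code with `exec` or `eval` is risky, because the model can emit anything. As a hedged sketch of one alternative for pure arithmetic, the helper below walks the expression's syntax tree and only permits numbers and basic operators. `safe_eval_arithmetic` is a hypothetical name, and this is not a complete sandbox (very large exponents, for instance, can still be expensive).

```python
import ast
import operator

# Whitelisted operators; anything else in the expression is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval_arithmetic(expression: str):
    """Evaluate a Python arithmetic expression without executing arbitrary code."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


# Python spells exponentiation as **, so 4^4 from the prompt becomes 4**4.
print(safe_eval_arithmetic("((-5 + 93 * 4 - 0) * (4**4 + -7 + 0 * 5))"))  # prints 91383
```

Function calls, names, and attribute access all raise `ValueError`, so a response such as `__import__('os')` is rejected instead of executed.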


### Limiting Extraneous Tokens

A common struggle with Llama 2 was extraneous tokens (ex. "Sure! Here's more information on..."), even if explicit instructions were given to Llama 2 to be concise. Llama 3 is better at following instructions.

Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:

In [143]:
complete_and_print(
    "Give me the zip code for Menlo Park in JSON format with the field 'zip_code'",
)
# Likely returns the JSON and also "Sure! Here's the JSON..."

complete_and_print(
    """
    You are a robot that only outputs JSON.
    You reply in JSON format with the field 'zip_code'.
    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}
    Now here is my question: What is the zip code of Menlo Park?
    """,
)
# Returns only the JSON: {"zip_code": 94025}

Give me the zip code for Menlo Park in JSON format with the field 'zip_code'


A nice and simple request!

According to the United States Postal Service (USPS), Menlo Park, California has the following zip codes:

```
[
  {
    "zip_code": 94025
  },
  {
    "zip_code": 94026
  }
]
```

Please note that zip codes are subject to change, and this information might not be up-to-date in the future. If you need more accurate or real-time information, I recommend checking the USPS website or other reliable sources.


    You are a robot that only outputs JSON.
    You reply in JSON format with the field 'zip_code'.
    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}
    Now here is my question: What is the zip code of Menlo Park?
    
"

{"zip_code": 94025}
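
When the response feeds another program, it helps to check that it really is bare JSON before using it. The following is a small sketch under that assumption; `extract_zip_code` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import json


def extract_zip_code(model_response: str):
    """Return the 'zip_code' value if the response is a bare JSON object.

    Raises ValueError when the response contains extra text or lacks the field.
    """
    try:
        parsed = json.loads(model_response.strip())
    except json.JSONDecodeError as err:
        raise ValueError(f"Response is not pure JSON: {err}") from err
    if not isinstance(parsed, dict) or "zip_code" not in parsed:
        raise ValueError("JSON object is missing the 'zip_code' field")
    return parsed["zip_code"]


print(extract_zip_code('\n{"zip_code": 94025}\n'))  # prints 94025
```

Strict JSON requires double-quoted keys, so a reply that copies the single-quoted style of the example answer in the prompt (`{'zip_code': 94025}`) would be rejected by `json.loads`, as would any reply with a conversational preamble.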



# Llama Guard

#### Llama Guard
1. `llama-guard-7b` - input and output guardrails model (a llama-2-7b model)
2. `llama-guard-8b` - input and output guardrails model (a llama-3-8b model)

In this section, we will dive into how to do input and output safety checks using Meta's Llama Guard and Llama models. Doing both input and output safety checks requires us to carefully pass the prompts and responses between the models.

For this section we will use a combination of SageMaker and Amazon Bedrock. We quickly deploy Llama Guard 2 on SageMaker and use this endpoint for guardrails in our examples.
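
Before deploying anything, the overall flow can be sketched independently of any AWS service. In the sketch below, `is_safe` and `generate` are hypothetical callables standing in for the Llama Guard endpoint and the Bedrock model; the function name and the choice to return `None` on a failed check are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Optional

Conversation = List[Dict[str, str]]


def guarded_completion(
    prompt: str,
    is_safe: Callable[[Conversation], bool],
    generate: Callable[[str], str],
) -> Optional[str]:
    """Input check, then generation, then output check.

    Returns the generated text, or None if either safety check fails.
    """
    conversation: Conversation = [{"role": "user", "content": prompt}]
    if not is_safe(conversation):  # input guardrail
        return None

    response = generate(prompt)

    # The output check sees the whole exchange, with the response as the last turn.
    conversation.append({"role": "assistant", "content": response})
    if not is_safe(conversation):  # output guardrail
        return None
    return response


# Usage with stand-in callables (no model calls are made here).
def demo_is_safe(conversation: Conversation) -> bool:
    return "kill a person" not in conversation[-1]["content"]


def demo_generate(prompt: str) -> str:
    return "Use `kill <PID>`."


print(guarded_completion("How do I kill a process in Linux?", demo_is_safe, demo_generate))
print(guarded_completion("I forgot how to kill a person?", demo_is_safe, demo_generate))
```

The rest of this section performs the same three steps by hand, one notebook cell at a time.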

In [144]:
!pip install "sagemaker>=2.216.0" --upgrade --quiet

In [152]:
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
 
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

In [153]:
llm_image = f"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04"
 
# print ecr image uri
print(f"llm image uri: {llm_image}")

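from getpass import getpass

# prompt for the Hugging Face access token without echoing it in the notebook
hf_token = getpass()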

llm image uri: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04


 ········


In [159]:
 
# sagemaker config
instance_type = "ml.g5.2xlarge"
health_check_timeout = 900

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "meta-llama/Meta-Llama-Guard-2-8B", # model_id from hf.co/models
  'SM_NUM_GPUS': "1",
  'HUGGING_FACE_HUB_TOKEN': hf_token
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

In [160]:
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 15 minutes (900 s) to be able to load the model
)

-----------!

In [240]:
TASK = "Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories."

INSTRUCTION = """\
Provide your safety assessment for ONLY THE LAST '{{ role }}' in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories."""

UNSAFE_CONTENT_CATEGORIES = """\
S1: Violent Crimes.
AI models should not create content that enables, encourages, or excuses the commission of violent crimes. Examples of violent crimes include, but are not limited to:
- Unlawful violence toward people (ex: terrorism, genocide, murder, hate crimes, child abuse, assault, battery, kidnapping)
- Unlawful violence toward animals (ex: animal abuse)
S2: Non-Violent Crimes.
AI models should not create content that enables, encourages, or excuses the commission of non-violent crimes. Examples of non-violent crimes include, but are not limited to:
- Financial crimes (ex: fraud, scams, money laundering)
- Property crimes (ex: burglary, robbery, arson, vandalism)
- Drug crimes (ex: creating or trafficking narcotics)
- Weapons crimes (ex: producing unlicensed firearms)
- Cyber crimes (ex: hacking, spyware, malware)
S3: Sex Crimes.
AI models should not create content that enables, encourages, or excuses the commission of sex crimes. Examples of sex crimes include, but are not limited to:
- Human trafficking
- Sexual assault (ex: rape)
- Sexual harassment (ex: groping)
- Lewd conduct
S4: Child Exploitation.
AI models should not create content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.
S5: Specialized Advice.
AI models should not create content that contains specialized financial, medical, or legal advice.
S6: Privacy.
AI models should not create content that contains sensitive, personal information about private individuals.
S7: Intellectual Property.
AI models should not create content that violates the intellectual property rights of third parties.
S8: Indiscriminate Weapons.
AI models should not create content that enables the creation of weapons of mass destruction. Examples of such weapons include, but are not limited to:
- Chemical weapons (ex: nerve gas)
- Biological weapons (ex: anthrax)
- Radiological weapons (ex: salted bombs)
- Nuclear weapons (ex: atomic warheads)
- High-yield explosive weapons (ex: cluster munitions)
S9: Hate.
AI models should not create content that is hateful toward people on the basis of their protected characteristics (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease). AI models should not create content that perpetuates negative stereotypes.
S10: Self-Harm.
AI models should not create content that enables, encourages, or excuses acts of intentional self-harm. Examples of such actions include, but are not limited to:
- Suicide
- Self-injury (ex: cutting)
- Disordered eating
S11: Sexual Content.
AI models should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts."""

In [284]:
from itertools import cycle
from typing import Dict, List

def format_guard_messages(
    messages: List[Dict[str, str]],
    task: str = TASK,
    instruction: str = INSTRUCTION,
    unsafe_content_categories: str = UNSAFE_CONTENT_CATEGORIES,
) -> str:
    """Format messages for Llama Guard models.

    The model only supports 'user' and 'assistant' roles, and must alternate (u/a/u/a/u...).
    """
    turn = cycle(["User", "Agent"])
    return "".join([
        "<|begin_of_text|>[INST]",
        task,
        "<BEGIN UNSAFE CONTENT CATEGORIES>",
        unsafe_content_categories,
        "<END UNSAFE CONTENT CATEGORIES><BEGIN CONVERSATION>",
        *[f"{next(turn)}: {message['content']}" for message in messages],
        "<END CONVERSATION>",
        instruction,
        "[/INST]"
    ])
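
Note that the formatter assigns the `User`/`Agent` labels purely by position and never reads each message's `role` key, so a conversation that does not alternate would be labeled incorrectly without any error. A minimal sketch of a pre-check is shown below; `validate_alternating_roles` is a hypothetical helper, not part of the original notebook.

```python
from typing import Dict, List


def validate_alternating_roles(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError unless roles alternate user/assistant, starting with user."""
    expected_roles = ("user", "assistant")
    for index, message in enumerate(messages):
        expected = expected_roles[index % 2]
        actual = message.get("role")
        if actual != expected:
            raise ValueError(
                f"Message {index} has role {actual!r}, expected {expected!r}"
            )


# Passes silently: user turn followed by assistant turn.
validate_alternating_roles([
    {"role": "user", "content": "How do I stop a process?"},
    {"role": "assistant", "content": "Use the kill command."},
])
```

Calling this before building the prompt turns a silent mislabeling into an explicit error.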

In [307]:
messages_input = [{"role": "user", "content": "I forgot how to kill a person?"}]
payload_input_guard = {"inputs": format_guard_messages(messages_input)}
# print(payload_input_guard)
response_input_guard = llm.predict(payload_input_guard)
llama_guard_response = response_input_guard[0]["generated_text"].split("[/INST]",1)[1]
print(llama_guard_response)

unsafe
S1


In [308]:
# Will not run as llama_guard response was that the input violated S1: Violent Crimes
if llama_guard_response == 'safe':
    prompt = messages_input[0]['content']
    complete_and_print(prompt)
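
The check above compares the response to the exact string `'safe'`. Since the instruction asks for `safe` or `unsafe` on the first line and a comma-separated category list on the second, a more tolerant parser can be sketched as follows; `parse_guard_verdict` is a hypothetical helper introduced here for illustration.

```python
from typing import List, Tuple


def parse_guard_verdict(raw_response: str) -> Tuple[bool, List[str]]:
    """Parse a Llama Guard reply into (is_safe, violated_categories).

    Surrounding whitespace and blank lines are ignored.
    """
    lines = [line.strip() for line in raw_response.splitlines() if line.strip()]
    if not lines or lines[0] not in ("safe", "unsafe"):
        raise ValueError(f"Unexpected Llama Guard response: {raw_response!r}")

    is_safe = lines[0] == "safe"
    categories: List[str] = []
    if not is_safe and len(lines) > 1:
        # Second line holds the comma-separated category codes, e.g. "S1,S9".
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return is_safe, categories


print(parse_guard_verdict("\nunsafe\nS1\n"))  # prints (False, ['S1'])
print(parse_guard_verdict("safe\n\n"))        # prints (True, [])
```

Raising on anything other than `safe` or `unsafe` means a malformed reply is never mistaken for a passing check.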

In [309]:
messages_input = [{"role": "user", "content": "I forgot how to kill a process in Linux, can you help in a few short steps?"}]
payload_input_guard = {"inputs": format_guard_messages(messages_input)}
response_input_guard = llm.predict(payload_input_guard)
llama_guard_response = response_input_guard[0]["generated_text"].split("[/INST]",1)[1]
print(llama_guard_response)
print('\n\n')

safe





In [320]:
# Llama Guard has determined the input prompt is safe, so let's now send it to Bedrock
if llama_guard_response == 'safe':
    prompt = messages_input[0]['content']
    bedrock_output = complete_and_print(prompt, verbose=False)

I forgot how to kill a process in Linux, can you help in a few short steps?


In [326]:
# Now before we decide to return the response from the inference, we can run the
# generated output through Llama Guard again. The prompt and the generated output
# are passed as a user/assistant pair so that the output is labeled as the "Agent" turn.

messages_output = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": bedrock_output},
]
payload_output_guard = {"inputs": format_guard_messages(messages_output)}
response_output_guard = llm.predict(payload_output_guard)
llama_guard_response = response_output_guard[0]["generated_text"].split("[/INST]",1)[1]

print("Bedrock Generation Safety Response from Llama Guard: " + llama_guard_response + "\n\n")

if llama_guard_response == 'safe':
    print(bedrock_output + "\n\n")

Bedrock Generation Safety Response from Llama Guard: safe



Here are the steps to kill a process in Linux:

**Method 1: Using the `kill` command**

1. Open a terminal and type `ps -ef | grep <process_name>` to find the process ID (PID) of the process you want to kill. Replace `<process_name>` with the actual name of the process.
2. Once you have the PID, type `kill <PID>` to send a signal to the process to terminate.

**Method 2: Using the `pkill` command**

1. Open a terminal and type `pkill <process_name>` to find and kill the process by its name.
2. If there are multiple processes with the same name, you can specify the PID by adding the `-pid` option, like this: `pkill -pid <PID> <process_name>`.

**Method 3: Using the `killall` command**

1. Open a terminal and type `killall <process_name>` to kill all processes with the same name.
2. Be careful with this method, as it can terminate multiple processes with the same name.

Remember to replace `<process_name>` with the actual nam

## Additional References
- [PromptingGuide.ai](https://www.promptingguide.ai/)
- [LearnPrompting.org](https://learnprompting.org/)
- [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)
- [Prompt Engineering with Llama 2 Deeplearning.AI Course](https://www.deeplearning.ai/short-courses/prompt-engineering-with-llama-2/)

## Author & Contact

5-14-2024: Edited by [Eissa Jamil](https://www.linkedin.com/in/eissajamil/) with contributions from [EK Kam](https://www.linkedin.com/in/ehsan-kamalinejad/), [Marco Punio](https://www.linkedin.com/in/marcpunio/)

Originally Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom.