[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb)

# RAG with Llama 2 13B

In this notebook we'll explore how we can use the open source **Llama-2-13b-chat** model in both Hugging Face transformers and LangChain.
At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours). If you need guidance on getting access please refer to the beginning of this [article](https://www.pinecone.io/learn/llama-2/) or [video](https://youtu.be/6iHVJyX2e50?t=175).

---

🚨 _Note that running this on CPU is sloooow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab._

---

We start by doing a `pip install` of all required libraries.

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m58.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.1/179.1 kB[0m [31m22.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m492.2/492.2 kB[0m [31m47.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m29.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.2/42.2 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m79.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m109.1/109.1 MB[0m [31m8.6 M

## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

**"Vector embeddings"** are numerical representations of text.
Initializing an embedding pipeline sets up the tools and configuration needed to convert raw text documents into vector embeddings. This step underpins many NLP applications, such as text classification, sentiment analysis, and document retrieval, where machine learning models operate on the numerical data rather than the raw text.
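To make "numerical representations" concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity, the metric used for the index later in this notebook. The three-dimensional vectors are made up for illustration; they are not outputs of `all-MiniLM-L6-v2`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions
toy_embeddings = {
    "llama language model": [0.9, 0.1, 0.0],
    "large language models": [0.8, 0.2, 0.1],
    "llama the animal": [0.1, 0.9, 0.2],
}

query_vec = toy_embeddings["llama language model"]
for text, vec in toy_embeddings.items():
    # Texts with similar meaning should score closer to 1.0
    print(f"{text!r}: {cosine_similarity(query_vec, vec):.3f}")
```

In this toy example the two language-model phrases score much closer to each other than either does to the animal phrase, which is the property retrieval relies on.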

In [None]:
# Import the `cuda` module from the `torch` library for GPU management
from torch import cuda

# Import the `HuggingFaceEmbeddings` wrapper class from LangChain
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Define the Hugging Face model identifier for embeddings
embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

# Check if a GPU is available, and set the `device` accordingly
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Initialize the `embed_model` using the `HuggingFaceEmbeddings` class
embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,  # Specify the Hugging Face model to use
    model_kwargs={'device': device},  # Set the device for model execution
    encode_kwargs={'device': device, 'batch_size': 32}  # Additional encoding parameters
)



We can use the embedding model to create document embeddings like so:

In [None]:
docs = [
    "this is one document",
    "and another document"
]

# Using the `embed_model` to convert the text documents into embeddings
embeddings = embed_model.embed_documents(docs)

# Printing information about the generated embeddings
print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index. For this we'll need a [free Pinecone API key](https://app.pinecone.io/).

A **Pinecone vector index** is a data structure hosted by Pinecone, a cloud-based service designed for scalable and efficient similarity search. It is particularly useful when you need to retrieve the items most similar to a given query based on their vector embeddings.
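As a rough mental model only, the sketch below implements a tiny in-memory index with brute-force cosine search. `ToyVectorIndex` and its methods are hypothetical names for illustration; this is not how Pinecone is implemented, and it would not scale beyond small collections.

```python
import math

class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector index (brute-force search)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = {}  # id -> (vector, metadata)

    def upsert(self, items):
        # Insert or overwrite (id, vector, metadata) triples
        for item_id, vector, metadata in items:
            if len(vector) != self.dimension:
                raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
            self.vectors[item_id] = (vector, metadata)

    def query(self, vector, top_k=3):
        # Score every stored vector against the query, highest similarity first
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

        scored = [
            (cosine(vector, stored), item_id, metadata)
            for item_id, (stored, metadata) in self.vectors.items()
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]

toy_index = ToyVectorIndex(dimension=2)
toy_index.upsert([
    ("a", [1.0, 0.0], {"text": "first chunk"}),
    ("b", [0.0, 1.0], {"text": "second chunk"}),
    ("c", [0.7, 0.7], {"text": "third chunk"}),
])
top_matches = toy_index.query([0.9, 0.1], top_k=2)
print([item_id for _, item_id, _ in top_matches])
```

The ID, vector, and metadata triple mirrors the shape of the records this notebook upserts into Pinecone.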

In [None]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
# (set the PINECONE_API_KEY environment variable rather than hardcoding a real key)
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)

Now we initialize the index.

In [None]:
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    # If it doesn't exist, create a new Pinecone index with the specified parameters
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),  # Dimensionality of the embeddings
        metric='cosine'  # Similarity metric
    )

    # Wait for the newly created index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        # Sleep for 1 second between checks to avoid excessive polling
        time.sleep(1)
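The loop above polls until the index reports ready, with no upper bound on the wait. One possible variation, sketched here with a hypothetical `wait_until` helper, adds a timeout so a stuck index raises an error instead of hanging the notebook. The readiness check is a stand-in so the example runs without a Pinecone account.

```python
import time

def wait_until(is_ready, timeout_s=300.0, poll_interval_s=1.0):
    """Poll `is_ready()` until it returns True or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(poll_interval_s)

# Stand-in readiness check that succeeds on the third call. With this notebook's
# client, the check would be something like:
#   lambda: pinecone.describe_index(index_name).status['ready']
calls = {"count": 0}

def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3

wait_until(fake_ready, timeout_s=5.0, poll_interval_s=0.01)
print(f"ready after {calls['count']} checks")
```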

Now we connect to the index:

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

With our index and embedding process ready we can move on to the indexing process itself. For that, we'll need a dataset. We will use a set of arXiv papers related to (and including) the Llama 2 research paper.

In [None]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

In [None]:
# Convert the dataset to a Pandas DataFrame for easier data manipulation
data = data.to_pandas()
data.head()

Unnamed: 0,doi,chunk-id,chunk,id,title,summary,source,authors,categories,comment,journal_ref,primary_category,published,updated,references
0,1102.0183,0,High-Performance Neural Networks\nfor Visual O...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
1,1102.0183,1,"January 2011\nAbstract\nWe present a fast, ful...",1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
2,1102.0183,2,promising architectures for such tasks. The mo...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
3,1102.0183,3,"Mutch and Lowe, 2008), whose lters are xed, ...",1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]
4,1102.0183,4,We evaluate various networks on the handwritte...,1102.0183,High-Performance Neural Networks for Visual Ob...,"We present a fast, fully parameterizable GPU i...",http://arxiv.org/pdf/1102.0183,"[Dan C. Cireşan, Ueli Meier, Jonathan Masci, L...","[cs.AI, cs.NE]","12 pages, 2 figures, 5 tables",,cs.AI,20110201,20110201,[]


We will embed and index the documents like so:

In [None]:
# Creates a visual progress bar so we can monitor progress
from tqdm.auto import tqdm

# Define the batch size for processing the data
batch_size = 32

# Loop through the data in batches
for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]

    # Extract document identifiers, text chunks, and embeddings for the current batch
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)

    # Prepare metadata for the documents to be stored in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]

    # Add the batch of document embeddings and metadata to the Pinecone index
    index.upsert(vectors=zip(ids, embeds, metadata))


In [None]:
# Retrieve statistics and information about the Pinecone vector index
index_stats = index.describe_index_stats()

# Print the retrieved index statistics
print(index_stats)

{'dimension': 384,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}
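As a quick sanity check on the batching logic, the sketch below reproduces only the index arithmetic of the loop, without embedding or upserting anything. `batch_ranges` is a hypothetical helper; the row count and batch size are the values used in this notebook.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)

n_rows, batch_size = 4838, 32  # values used in this notebook
ranges = list(batch_ranges(n_rows, batch_size))

print(len(ranges))   # number of upsert calls: 152
print(ranges[-1])    # the final, partial batch: (4832, 4838)

# Every row is covered exactly once, matching the total_vector_count above
print(sum(end - start for start, end in ranges))
```

The `min(...)` clamp is what keeps the last batch from running past the end of the data.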


## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-13b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and load it onto our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [None]:
# Import the necessary libraries and modules
from torch import cuda, bfloat16
import transformers

# Define the model identifier
model_id = 'meta-llama/Llama-2-13b-chat-hf'

# Determine the device to run the model on (GPU if available, otherwise CPU)
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Set quantization configuration to load a large model with less GPU memory
# Requires the 'bitsandbytes' library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,            # Load model in 4-bit quantized format
    bnb_4bit_quant_type='nf4',    # Use 'nf4' quantization
    bnb_4bit_use_double_quant=True,  # Use double quantization
    bnb_4bit_compute_dtype=bfloat16  # Compute with bfloat16 data type
)

# Begin initializing Hugging Face (HF) items; an authentication token is required for these
hf_auth = 'HF_AUTH_TOKEN'  # Replace with your own Hugging Face access token (never commit a real one)

# Load the model configuration from Hugging Face with authentication token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# Load the model for causal language modeling from Hugging Face
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,  # Apply the quantization configuration
    device_map='auto',
    use_auth_token=hf_auth
)

# Set the model to evaluation mode
model.eval()

# Print the device where the model is loaded (GPU or CPU)
print(f"Model loaded on {device}")



Model loaded on cuda:0
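To see why the 4-bit quantization config matters, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 13 billion parameters and ignores activations, the KV cache, and the small overhead of quantization constants, so treat the numbers as rough lower bounds.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 13e9  # assumption: roughly 13 billion parameters

for label, bits in [("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>17}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

At 16 bits per parameter the weights alone need about 26 GB, which exceeds the 16 GB of a T4, while at 4 bits they need about 6.5 GB.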


The pipeline requires a tokenizer, which translates human-readable plaintext into LLM-readable token IDs. The tokenizer must match the one the model was trained with, so we load the Llama 2 tokenizer using the same `model_id`:

In [None]:
# Initialize a tokenizer for the Hugging Face model
# This tokenizer is used to process and tokenize text data for the model
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,          # Model identifier (should match the loaded model)
    use_auth_token=hf_auth  # Use the Hugging Face authentication token
)



Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
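For intuition about `temperature` and `repetition_penalty`, the sketch below applies both ideas to a made-up three-token vocabulary. It is a simplified illustration under a common formulation, not the transformers implementation. It requires a strictly positive temperature; as the temperature approaches 0 the distribution collapses onto the highest-scoring token, which is equivalent to greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def penalize_repeats(logits, seen_token_ids, penalty):
    """Make already-generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for t in set(seen_token_ids):
        # Positive logits shrink; negative logits move further from zero
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for a 3-token vocabulary

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

# Token 0 was already generated, so its score is reduced before sampling
print(penalize_repeats(toy_logits, seen_token_ids=[0], penalty=1.1))
```

Lower temperatures concentrate probability on the top token, and the penalty lowers the score of tokens that have already appeared.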

Let's confirm the pipeline works:

In [None]:
res = generate_text("Explain to me the difference between love and lust.")
print(res[0]["generated_text"])

Explain to me the difference between love and lust.
Love is a deep and enduring emotion that encompasses a range of feelings, including affection, attachment, and commitment. It is characterized by a desire for the well-being and happiness of the person you love, as well as a willingness to make sacrifices for their benefit. Love can take many forms, including romantic love, familial love, and platonic love.
Lust, on the other hand, is a strong desire for sexual pleasure, often driven by physical attraction or a desire for intimacy. Lust is typically associated with a more superficial level of connection and does not necessarily involve a deep emotional bond or long-term commitment. While lust can be intense and all-consuming, it is generally considered a more self-centered emotion than love.
Here are some key differences between love and lust:
1. Depth of Emotion: Love is a deeper and more complex emotion than lust, involving a range of feelings beyond just physical attraction. Lust i

Now to implement this in LangChain:

In [None]:
# Import the HuggingFacePipeline class from the langchain.llms module
from langchain.llms import HuggingFacePipeline

# Wrap the Hugging Face pipeline so that LangChain can use it as an LLM
llm = HuggingFacePipeline(
    pipeline=generate_text  # The text-generation pipeline initialized above
)

In [None]:
llm(prompt="Explain to me the difference between Love and Lust")

'.\nLove is a deep and enduring emotion that encompasses a range of feelings, including affection, attachment, commitment, and devotion. It is characterized by a desire for the well-being and happiness of the beloved, as well as a willingness to make sacrifices for their benefit. Love can take many forms, such as romantic love, familial love, or platonic love, and it is often marked by a sense of loyalty, trust, and mutual respect.\nLust, on the other hand, is a strong desire for sexual pleasure, often driven by physical attraction or a desire for self-gratification. It is typically characterized by a focus on the body or physical appearance of the desired partner, rather than their personality or inner qualities. While lust can be intense and all-consuming, it is generally considered a more superficial and short-term emotion compared to love.\nHere are some key differences between love and lust:\n1. Intentions: Love is motivated by a desire for the well-being and happiness of the belo

We get much the same output because we're not really doing anything differently here, but **Llama 2 13B Chat** is now wrapped as a LangChain LLM. Using this we can now begin using LangChain's advanced agent tooling, chains, etc., with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store. We do it like so:

In [None]:
# Import the Pinecone class from the langchain.vectorstores module
from langchain.vectorstores import Pinecone

# Define the field in metadata that contains text content
text_field = 'text'

# Create a connection to the Pinecone vector store
# - 'index' is the Pinecone index we initialized earlier
# - 'embed_model.embed_query' is the function used to embed text queries
# - 'text_field' specifies the metadata field that contains text content
vectorstore = Pinecone(index, embed_model.embed_query, text_field)

We can confirm this works like so:

In [None]:
query = 'what makes llama 2 special?'

vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

[Document(page_content='Ricardo Lopez-Barquilla, Marc Shedroﬀ, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta\nChauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh\nYazdan, Elisa Garcia Anzano, and Natascha Parks.\n•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.\n46\n•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original\nLlama team who helped get this work started.\n•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the ﬁgures in the\npaper.\n•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the\ninternal demo.\n•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,\nLaurens van der Maaten, Jason Weston, and Omer Levy.', metadata={'source': 'http://arxiv.org/pdf/230

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,  # The language model for generating responses
    chain_type='stuff',  # 'stuff' inserts all retrieved chunks into a single prompt
    retriever=vectorstore.as_retriever()  # The retriever for text-based retrieval tasks
)
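Conceptually, a "stuff" chain retrieves the top chunks, concatenates them into one prompt with the question, and makes a single LLM call. The sketch below shows that flow with stand-in functions. The prompt wording and the names `build_stuffed_prompt` and `toy_rag` are assumptions for illustration, not LangChain's actual template or API.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def toy_rag(question, retrieve, generate, k=3):
    """Retrieve k chunks, stuff them into a single prompt, and generate once."""
    chunks = retrieve(question, k)
    return generate(build_stuffed_prompt(question, chunks))

# Stand-ins for the retriever and the LLM so the sketch runs on its own
fake_chunks = ["Llama 2 is a collection of LLMs.", "Llama 2-Chat is tuned for dialogue."]

def fake_retrieve(question, k):
    return fake_chunks[:k]

def echo_llm(prompt):
    return prompt  # returns the prompt unchanged so we can inspect it

stuffed = toy_rag("what is llama 2?", fake_retrieve, echo_llm)
print(stuffed)
```

Because every chunk goes into one prompt, this approach only works while the combined context fits within the model's context window.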

Let's begin asking questions! First let's try *without* RAG:

In [None]:
llm('what is so special about llama 2?')

'\n\nAnswer: Llama 2 is a unique and special animal for several reasons. Here are some of the most notable features that make it stand out:\n\n1. Size: Llamas are known for their size, and Llama 2 is no exception. It is one of the largest llamas in existence, with some individuals reaching heights of over 6 feet (1.8 meters) at the shoulder and weighing up to 400 pounds (180 kilograms).\n2. Coat: Llama 2 has a distinctive coat that is soft, fine, and silky to the touch. The coat can be a variety of colors, including white, cream, beige, and brown.\n3. Temperament: Llama 2 is known for its friendly and docile nature. They are social animals that thrive on human interaction and are often used as therapy animals due to their calm demeanor.\n4. Intelligence: Llama 2 is highly intelligent and can learn a wide range of tasks, from simple commands like "sit" and "stay" to more complex tasks like pulling carts or carrying packs.\n5. Adaptability: Llama 2 is highly adaptable and can survive in 

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [None]:
rag_pipeline('what is so special about llama 2?')

{'query': 'what is so special about llama 2?',
 'result': ' Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) developed and released by GenAI, Meta. The models are optimized for dialogue use cases and outperform open-source chat models on most benchmarks tested. Additionally, they are considered a suitable substitute for closed-source models like ChatGPT, BARD, and Claude.\n\nPlease let me know if you need any further information or clarification.'}

This looks *much* better! Let's try some more.

In [None]:
llm('what safety measures were used in the development of llama 2?')

"\n\nI'm looking for information on how the developers of Llama 2 ensured the safety of their users during the development process. Specifically, I'm interested in knowing about any safety measures that were implemented to protect users from potential risks or hazards associated with the use of the platform.\n\nHere are some possible answers:\n\n1. The developers of Llama 2 conducted thorough risk assessments to identify and mitigate any potential safety risks associated with the platform. This included identifying potential hazards such as data breaches, cyber attacks, and other security risks, and implementing appropriate safeguards to prevent these risks from occurring.\n2. The platform was designed with user privacy and security in mind, and the developers implemented various measures to protect user data and ensure that it is not compromised. For example, the platform may have implemented encryption techniques to protect user data, or implemented strict access controls to limit wh

Okay, it looks like the LLM with no RAG is less than ideal, so let's stop embarrassing the poor LLM and stick with RAG + LLM. Let's ask the same question to our RAG pipeline.

In [None]:
rag_pipeline('what safety measures were used in the development of llama 2?')

{'query': 'what safety measures were used in the development of llama 2?',
 'result': ' The development of llama 2 included safety measures such as pre-training, fine-tuning, and model safety approaches. Additionally, the authors delayed the release of the 34B model due to a lack of time to sufficiently red team.'}

A reasonable answer from the RAG pipeline, but it doesn't contain much information — maybe we can ask more about this, like what is this _"red team"_ procedure that delayed the launch of the 34B model?

In [None]:
rag_pipeline('what red teaming procedures were followed for llama 2?')

{'query': 'what red teaming procedures were followed for llama 2?',
 'result': " The paper describes the red teaming procedures used for Llama 2. These included creating prompts that might elicit unsafe or undesirable responses from the model, such as those based on sensitive topics or those that could potentially cause harm if the model were to respond inappropriately. The red teaming exercises were performed by a set of experts who evaluated the model's responses and provided feedback on its performance. The paper also mentions that multiple additional rounds of red teaming were performed over several months to measure the robustness of the model as it was released internally."}

Very interesting!

In [None]:
rag_pipeline('how does the performance of llama 2 compare to other local LLMs?')

{'query': 'how does the performance of llama 2 compare to other local LLMs?',
 'result': ' The performance of llama 2 is compared to other local LLMs such as chinchilla and bard in the paper. Specifically, the authors report that llama 2 outperforms these other models on the series of helpfulness and safety benchmarks they tested. Additionally, the authors note that llama 2 appears to be on par with some of the closed-source models, at least on the human evaluations they performed.'}