

# Retrieval Augmentation with Llama 2 13B







Large Language Models (LLMs) have a data freshness problem. Even the most powerful LLMs, like Llama 2, know nothing about events that occurred after their training data was collected.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as it appeared in the training data.

A solution to this problem is retrieval augmentation. The idea is to retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that with a Konko endpoint.
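
To make the idea concrete before wiring up real services, here is a minimal sketch of the retrieve-then-generate flow. Everything in it is an illustrative assumption: the toy `KNOWLEDGE_BASE`, the word-overlap `retrieve` function, and the `generate` callable standing in for the LLM are not part of Konko, Pinecone, or LangChain.

```python
from typing import Callable, List

# Toy knowledge base (hypothetical); in this notebook the Pinecone index plays this role.
KNOWLEDGE_BASE = [
    "Llama 2 is a collection of pretrained and fine-tuned large language models.",
    "Pinecone is a managed vector database.",
    "Foundation is a science fiction novel by Isaac Asimov.",
]


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Rank docs by the number of lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer_with_retrieval(query: str, generate: Callable[[str], str]) -> str:
    """Retrieve context, prepend it to the question, then call the model."""
    context = "\n".join(retrieve(query, KNOWLEDGE_BASE))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Passing `lambda prompt: prompt` as `generate` returns the augmented prompt itself, which is a convenient way to inspect what the model would receive. The rest of the notebook replaces word overlap with vector similarity and the stub with a real chat model.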

In [None]:
!pip install konko
!pip install pinecone-client==2.2.2
!pip install langchain==0.1.4
!pip install sentence-transformers
!pip install datasets



In [None]:
import os
os.environ['KONKO_API_KEY'] = 'your_api_key'
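
Hard-coding a placeholder key is easy to forget about. As an optional safeguard, a small helper can fail early when the key is missing or still set to the placeholder. The `require_env` name is hypothetical and this is only a sketch of one way to do the check.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear error."""
    value = os.environ.get(name, "")
    # Treat the notebook's placeholder string as "not set".
    if not value or value == "your_api_key":
        raise RuntimeError(f"Set {name} before running the rest of the notebook.")
    return value
```

Calling `require_env('KONKO_API_KEY')` right after setting the variable surfaces a missing key immediately instead of at the first model call.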

## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings


embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index. For this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [None]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(api_key="your_api_key", environment="your_env")

Now we initialize the index.

In [None]:
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=384,
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
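
The index is created with `dimension=384`, which matches the output size of `all-MiniLM-L6-v2`, and `metric='cosine'`. As a reminder of what that metric computes, here is a plain-Python sketch of cosine similarity. It is for illustration only and is not how Pinecone implements the search internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, two vectors pointing the same way score 1 regardless of their length, and orthogonal vectors score 0. The dimension check mirrors why the index dimension must equal the embedding size.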

Now we connect to the index:

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

With our index and embedding process ready, we can move on to the indexing process itself. For that, we'll need a dataset. We will use a set of arXiv papers related to (and including) the Llama 2 research paper.

In [None]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

We will embed and index the documents like so:

In [None]:
data = data.to_pandas()

batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    # use `_` for the unused row label so the outer loop variable `i` is not reused
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))
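
The loop above does two things worth isolating: it slices the dataset into fixed-size batches and it builds a unique ID per chunk from the `doi` and `chunk-id` columns. The sketch below shows the same logic as small standalone functions. The names `batch_ranges` and `make_id` are hypothetical helpers, not part of any library used here.

```python
from typing import Iterator, Tuple


def batch_ranges(n_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(n_rows, start + batch_size)


def make_id(doi: str, chunk_id: str) -> str:
    """Combine a paper identifier and a chunk number into one vector ID."""
    return f"{doi}-{chunk_id}"
```

With the 4838 rows in this dataset and a batch size of 32, `batch_ranges` yields 152 batches, the last of which holds 6 rows. Batching keeps each embedding call and each upsert request small.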

In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

In [None]:
from langchain.chat_models import ChatKonko
from langchain.schema import AIMessage, HumanMessage, SystemMessage

In [None]:
llm = ChatKonko(model='meta-llama/llama-2-13b-chat', max_tokens=2000)

Confirm this is working:

In [None]:
messages = [
    SystemMessage(
        content="You are a helpful assistant."
    ),
    HumanMessage(
        content="Summarize the Foundation by Isaac Asimov"
    ),
]


llm.invoke(messages)


AIMessage(content='  Certainly! Here is a summary of Isaac Asimov\'s Foundation:\n\nFoundation is a science fiction novel written by Isaac Asimov, published in 1951. The book is set in a distant future where humanity has colonized the galaxy and formed a vast interstellar empire. However, the empire is in decline and a group of psychohistorians, led by a man named Hari Seldon, have predicted that it will soon collapse, leading to a dark age of barbarism and chaos that will last for thousands of years.\n\nTo prevent this, Seldon and his team develop a new field of science called psychohistory, which uses mathematical models and statistics to predict the future and guide human action. They also create a plan called the "Foundation," which is a secret organization that will gather and preserve knowledge during the impending collapse of the empire, with the goal of eventually rebuilding society and restoring civilization.\n\nThe story follows the fortunes of the Foundation over the centuri

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store:

In [None]:
from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)
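
Conceptually, the vector store embeds the query, ranks the stored vectors by similarity, and returns the text stored under `text_field` in each match's metadata. The sketch below imitates that with an in-memory list. The `top_k` function and the toy records are assumptions for illustration, and scoring uses a dot product, which equals cosine similarity only when the vectors have unit length.

```python
from typing import Dict, List, Sequence, Tuple

# (id, vector, metadata) triples, the same shape that was upserted to the index.
Record = Tuple[str, List[float], Dict[str, str]]


def top_k(
    query_vec: Sequence[float],
    records: List[Record],
    k: int,
    text_field: str = "text",
) -> List[str]:
    """Return the text of the k records whose vectors best match query_vec."""

    def score(vec: Sequence[float]) -> float:
        # Dot product; assumes unit-length vectors so it matches cosine similarity.
        return sum(x * y for x, y in zip(query_vec, vec))

    ranked = sorted(records, key=lambda r: score(r[1]), reverse=True)
    return [r[2][text_field] for r in ranked[:k]]
```

This is why the metadata stored during indexing includes a `'text'` key: without it, a match would return only an ID and a score, with nothing to hand to the LLM.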



We can confirm this works like so:

In [None]:
query = 'what is so special about llama 2?'

vectorstore.similarity_search(
    query,  # the search query
    k=5  # return the 5 most relevant chunks of text
)

[Document(page_content='Ricardo Lopez-Barquilla, Marc Shedroﬀ, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta\nChauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh\nYazdan, Elisa Garcia Anzano, and Natascha Parks.\n•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.\n46\n•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original\nLlama team who helped get this work started.\n•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the ﬁgures in the\npaper.\n•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the\ninternal demo.\n•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,\nLaurens van der Maaten, Jason Weston, and Omer Levy.', metadata={'source': 'http://arxiv.org/pdf/230

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5})
)
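
The `chain_type='stuff'` setting means all retrieved chunks are placed ("stuffed") into a single prompt alongside the question. The sketch below shows the general shape of that step. The `TEMPLATE` wording and the `max_chars` budget are hypothetical: LangChain ships its own prompt templates, and real limits are measured in tokens rather than characters.

```python
from typing import List

# Hypothetical template for illustration; not LangChain's actual prompt.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}"
)


def stuff_prompt(chunks: List[str], question: str, max_chars: int = 8000) -> str:
    """Join every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(chunks)
    prompt = TEMPLATE.format(context=context, question=question)
    # Stuffing has no fallback: if the combined text is too long, it simply does not fit.
    if len(prompt) > max_chars:
        raise ValueError("stuffed prompt exceeds the budget; lower k or shorten chunks")
    return prompt
```

The trade-off is simplicity against context size: with `k=5` every query sends five chunks to the model, so larger `k` values or longer chunks can exceed the model's context window.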

Let's begin asking questions! First let's try *without* RAG:

In [None]:
messages = [
    SystemMessage(
        content="You are a helpful assistant."
    ),
    HumanMessage(
        content="what is so special about llama 2?"
    ),
]

llm.invoke(messages)


AIMessage(content='  Oh my llama! *ahem* I mean, oh my goodness! Llama 2 is a real treat! It\'s like, the second coming of the llama, you know? *giggles* But seriously, Llama 2 is a sequel to the original Llama game, and it\'s packed with even more adorable llamas and fun features!\n\nHere are some of the special things about Llama 2:\n\n1. More llamas than before: Can you believe it? Llama 2 has even more lovable llamas than the first game! Each one has its own unique personality, so you\'ll never get bored.\n2. New game modes: Llama 2 introduces two new game modes: "Llama Race" and "Llama Show." In "Llama Race," you\'ll race your llama against other players to see who can cross the finish line first. In "Llama Show," you\'ll compete in llama-themed challenges, like "Llama Dressage" and "Llama Obstacle Course." It\'s like a llama festival!\n3. Customization options: In Llama 2, you can customize your llamas with different outfits, accessories, and hairstyles. You can even give them ha

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [None]:
rag_pipeline.invoke({'query': 'what is so special about llama 2?'})


{'query': 'what is so special about llama 2?',
 'result': '  Based on the provided context, Llama 2 is a collection of pre-trained and fine-tuned large language models (LLMs) developed and released by the authors. The models are optimized for dialogue use cases and have outperformed open-source chat models on most benchmarks tested. The authors also claim that their models are suitable substitutes for closed-source models like ChatGPT, BARD, and Claude, which are heavily fine-tuned to align with human preferences.\n\nHowever, the authors do not provide a clear answer to what makes Llama 2 special compared to other pre-trained language models. They mention that their models have remarkable capabilities considering the simple nature of the training methodology, but they do not explicitly state what these capabilities are or how they differ from other models. Therefore, without additional information, it is difficult to determine what is special about Llama 2.'}

This looks *much* better! Let's try some more.

In [None]:
messages = [
    SystemMessage(
        content="You are a helpful assistant."
    ),
    HumanMessage(
        content="what safety measures were used in the development of llama 2?"
    ),
]

llm.invoke(messages)

AIMessage(content="  As a helpful assistant, I'm here to provide you with the most accurate and up-to-date information about the development of Llama 2. The safety measures used in the development of this game were of the utmost importance to the developers, and they took every precaution to ensure the well-being of both the players and the animals.\n\nFirst and foremost, the developers worked closely with animal welfare organizations to ensure that the animals used in the game were treated with the utmost respect and care. All of the animals were obtained from reputable sources and were handled by trained professionals. The developers also made sure that the animals were provided with comfortable living conditions and were given regular breaks to prevent fatigue and stress.\n\nIn addition to the welfare of the animals, the developers also prioritized the safety of the players. They implemented a number of measures to prevent accidents and injuries, such as:\n\n1. Rigorous testing: The

Okay, it looks like the LLM with no RAG is less than ideal. Let's ask the same question to our RAG pipeline.

In [None]:
rag_pipeline.invoke({'query': 'what safety measures were used in the development of llama 2?'})

{'query': 'what safety measures were used in the development of llama 2?',
 'result': "  Based on the provided context, I do not see any information about safety measures used in the development of Llama 2. The context only mentions that the model was trained using publicly available online sources, and that safety testing and tuning were performed to improve the model's safety. However, the specific safety measures used in the development of the model are not described in the provided context.\n\nTherefore, I cannot answer the user's question based on the provided context. If the user has any further information or context about the development of Llama 2, I may be able to provide a more informed answer."}

A reasonable answer from the RAG pipeline, but it doesn't contain much information. Maybe we can ask a more specific question, such as what the _"red team"_ procedure was that delayed the launch of the 34B model?

In [None]:
rag_pipeline.invoke({'query': 'what red teaming procedures were followed for llama 2?'})

{'query': 'what red teaming procedures were followed for llama 2?',
 'result': "  Based on the text provided, the red teaming procedures for LLAMA 2 included the following:\n\n1. Pretraining: The model was pretrained using publicly available online sources.\n2. Fine-tuning: The initial version of the model was fine-tuned through the application of a red teaming process.\n3. Red Teaming: The model was subjected to multiple rounds of red teaming exercises performed by a set of experts.\n4. Analysis: After each exercise, the collected data was analyzed, including dialogue length, risk area distribution, histogram of topic of misinformation, and rated degree of risk.\n5. Feedback: The lessons learned from each exercise were used to further improve model safety training.\n6. Model Refinements: The model was continuously improved with additional red teaming eﬀorts, leading to an evolution of the model's robustness.\n\nThese procedures were followed to measure the robustness of the model and 