[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/open-llama/retrieval-augmentation-open-llama-langchain.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/open-llama/retrieval-augmentation-open-llama-langchain.ipynb)

# Retrieval Augmentation with Open-Llama and LangChain

Large Language Models (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

The world of LLMs is frozen in time. Their world exists as a static snapshot of the world as it was within their training data.

A solution to this problem is retrieval augmentation. The idea is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that with the open-source `Open-Llama` model from Hugging Face and the `LangChain` library.
<br><br>

---

🚨 _Note that running this on CPU is practically impossible; it will take a very long time. You need ~28GB of GPU memory to run this notebook. If running on Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > A100 > Runtime shape > High RAM**._

---

<br><br>
We start by doing a `pip install` of all required libraries.



In [1]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [2]:
!pip install -qU \
    transformers \
    sentence-transformers \
    sentencepiece \
    accelerate \
    einops \
    langchain \
    xformers \
    bitsandbytes

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires three things that we must initialize first:

* An LLM, in this case `openlm-research/open_llama_7b_v2`.

* The respective tokenizer for the model.

* A stopping criteria object.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. Using Colab this can take 5-10 minutes to download and initialize the model.

In [3]:
from torch import cuda, bfloat16
import transformers

model_name = 'openlm-research/open_llama_7b_v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model.eval()
print(f"Model loaded on {device}")

Model loaded on cuda:0
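
To see why 4-bit quantization helps, a rough estimate of the memory taken by the weights alone can be sketched as follows. The parameter count of 7 billion is a rounded assumption, and the figures ignore activations, the KV cache, and the extra metadata that quantization stores, so real usage will be higher.

```python
# Back-of-the-envelope weight memory for a ~7B-parameter model.
# ASSUMPTION: 7e9 parameters (rounded); activations, KV cache and
# quantization metadata are not included in these numbers.
N_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "4-bit (nf4)": 0.5,
}


def weight_memory_gb(n_params: int, bytes_per_param: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(N_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name:>12}: ~{gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly an eighth of the float32 footprint.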


The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Open-Llama model was trained using the `openlm-research/open_llama_7b_v2` tokenizer, which we initialize like so:

In [4]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False
)

Finally we need to define the _stopping criteria_ of the model. The stopping criteria allow us to specify *when* the model should stop generating text. If we don't provide any, the model tends to go off on a tangent after answering the initial question.

To figure out what the stopping criteria should be we can start with the *end of sequence* or `'</s>'` token:

In [5]:
tokenizer.convert_tokens_to_ids(['</s>'])

[2]

But this is not usually a satisfactory stopping criterion, particularly for less sophisticated models. Instead, we need to find typical finish points for the model. For example, if we are generating a chatbot conversation we might see something like:

```
User: {some query}
Assistant: {the generated answer}
User: ...
```

Where everything past `Assistant:` is generated, including the next line of `User:`. The LLM may continue generating the conversation beyond the `Assistant:` output because it is simply predicting the conversation; it doesn't necessarily know that it should stop after providing the *one* `Assistant:` response.

With that in mind, we can specify `User:` as a stopping criterion, which we can identify with:

In [6]:
tokenizer.convert_tokens_to_ids(['User', ':'])

[4051, 29537]

We don't write `'User:'` directly because no single `'User:'` token exists in the vocabulary, so the lookup returns the **unknown** token. Instead, the string is represented by the two tokens `['User', ':']`.

In [7]:
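# 'User:' is not a single token in the vocabulary, so we get the unknown ID back
unk_token_id = tokenizer.convert_tokens_to_ids(['User:'])
# map the ID back to its token string to confirm it is '<unk>'
unk_token = tokenizer.convert_ids_to_tokens(unk_token_id)
print(unk_token_id, unk_token)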

[0] ['<unk>']
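
The lookup behaviour can be pictured with a toy vocabulary. This is only an illustrative sketch, not the `transformers` implementation: `toy_vocab` and `toy_convert_tokens_to_ids` are hypothetical names, and the vocabulary holds just the handful of IDs printed above.

```python
# Toy vocabulary reusing the IDs shown in the outputs above.
# A real tokenizer has tens of thousands of entries.
toy_vocab = {"<unk>": 0, "</s>": 2, "User": 4051, ":": 29537}


def toy_convert_tokens_to_ids(tokens, vocab=toy_vocab, unk="<unk>"):
    """Look up each token; anything missing maps to the unknown-token ID."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


split_ids = toy_convert_tokens_to_ids(["User", ":"])  # two known tokens
fused_ids = toy_convert_tokens_to_ids(["User:"])      # not in the vocabulary
print(split_ids, fused_ids)
```

Because `'User:'` is not a key in the vocabulary, it falls back to the unknown ID, while the split form resolves to two valid IDs.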


We repeat this for various possible stopping conditions to create our `stop_token_ids` list:

In [8]:
stop_token_ids = [
    tokenizer.convert_tokens_to_ids(x) for x in [
        ['</s>'], ['User', ':'], ['system', ':'],
        [tokenizer.convert_ids_to_tokens([9427])[0], ':']
    ]
]

stop_token_ids

[[2], [4051, 29537], [9533, 29537], [9427, 29537]]

We also need to convert these to `LongTensor` objects:

In [9]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([2], device='cuda:0'),
 tensor([ 4051, 29537], device='cuda:0'),
 tensor([ 9533, 29537], device='cuda:0'),
 tensor([ 9427, 29537], device='cuda:0')]

A quick spot check confirms that no `<unk>` token IDs (`0`) appear in `stop_token_ids`, so we can move on to building the stopping criteria object. This object checks whether the stopping criteria have been satisfied, that is, whether any of these token ID combinations has been generated.

In [10]:
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

In [11]:
# this should return False because there are no "stop criteria" tokens
stopping_criteria(
    torch.LongTensor([[1, 2, 3, 5000, 90000]]).to(device),
    torch.FloatTensor([0.0])
)

False

In [12]:
# this should return true because there ARE "stop criteria" tokens
stopping_criteria(
    torch.LongTensor([[1, 2, 3, 4051, 29537]]).to(device),
    torch.FloatTensor([0.0])
)

True
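
The check inside `StopOnTokens` is essentially a suffix comparison. A plain-Python sketch of the same idea, without tensors, might look like the following. The function name `ends_with_stop_sequence` is hypothetical, and the length guard is an addition of this sketch rather than something taken from the class above.

```python
def ends_with_stop_sequence(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop in stop_sequences:
        # Guard: an empty stop sequence, or one longer than the output,
        # can never be a valid match.
        if 0 < len(stop) <= len(generated_ids):
            if generated_ids[-len(stop):] == stop:
                return True
    return False


# Same stop sequences as stop_token_ids above, as plain lists.
stop_sequences = [[2], [4051, 29537], [9533, 29537], [9427, 29537]]

no_stop = ends_with_stop_sequence([1, 2, 3, 5000, 90000], stop_sequences)
has_stop = ends_with_stop_sequence([1, 2, 3, 4051, 29537], stop_sequences)
print(no_stop, has_stop)
```

Only the tail of the sequence matters: the `2` in the middle of the first example does not trigger a stop because it is not the final token.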

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [13]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model will ramble
    temperature=0.0,  # 'randomness' of outputs when sampling; lower is more deterministic
    top_p=0.15,  # select from top tokens whose probability add up to 15%
    top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
    max_new_tokens=256,  # max number of tokens to generate in the output
    repetition_penalty=1.2  # without this output begins repeating
)
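
To build intuition for `top_p`, here is a minimal sketch of nucleus (top-p) filtering over a made-up next-token distribution. It is not the `transformers` implementation, and `toy_probs` is invented for illustration. Sampling parameters like this only have an effect when the pipeline actually samples rather than decoding greedily.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# ASSUMPTION: a hypothetical next-token distribution.
toy_probs = {"No": 0.6, "Yes": 0.25, "Maybe": 0.1, "Perhaps": 0.05}

narrow = top_p_filter(toy_probs, top_p=0.15)  # only the top token survives
wide = top_p_filter(toy_probs, top_p=0.9)     # three tokens survive
print(narrow, wide)
```

With a small `top_p` such as `0.15`, a single confident token can already cover the threshold, which makes the output close to deterministic.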

Confirm this is working:

In [14]:
res = generate_text("Do I need to get my pet tested for COVID-19?")
print(res[0]["generated_text"])

Do I need to get my pet tested for COVID-19?
No. There is no evidence that pets can contract or spread the virus, and there are currently not any tests available in Canada specifically designed for animals..


...

In [15]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

## Retrieval Augmentation

### Building the Knowledge Base

In [16]:
!pip install -qU kaggle==1.5.15

In [17]:
try:
    import kaggle
except OSError as e:
    print(e)

Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


In [18]:
import json

KAGGLE_USERNAME = "YOUR_KAGGLE_USERNAME"
KAGGLE_KEY = "YOUR_KAGGLE_KEY"

with open('/root/.kaggle/kaggle.json', 'w') as fp:
    fp.write(json.dumps({"username": KAGGLE_USERNAME,"key": KAGGLE_KEY}))

In [19]:
!kaggle datasets download -d deepann/covid19-related-faqs

Downloading covid19-related-faqs.zip to /content
  0% 0.00/29.9k [00:00<?, ?B/s]
100% 29.9k/29.9k [00:00<00:00, 1.74MB/s]


In [20]:
import zipfile

with zipfile.ZipFile("/content/covid19-related-faqs.zip", 'r') as zip_ref:
    zip_ref.extractall('./')

In [21]:
import pandas as pd

data = pd.read_csv("/content/covid_faq.csv")
data.head()

| Unnamed: 0 | questions | answers |
|---|---|---|
| 0 | What is a novel coronavirus? | A novel coronavirus is a new coronavirus that ... |
| 1 | Why is the disease being called coronavirus di... | On February 11, 2020 the World Health Organiza... |
| 2 | How does the virus spread? | The virus that causes COVID-19 is thought to s... |
| 3 | Can I get COVID-19 from food (including restau... | Currently there is no evidence that people can... |
| 4 | Will warm weather stop the outbreak of COVID-19? | It is not yet known whether weather and temper... |


In [22]:
data = data.rename(columns={"questions":"question", "answers":"answer"})
data["id"] = data.index
data.head()

| Unnamed: 0 | question | answer | id |
|---|---|---|---|
| 0 | What is a novel coronavirus? | A novel coronavirus is a new coronavirus that ... | 0 |
| 1 | Why is the disease being called coronavirus di... | On February 11, 2020 the World Health Organiza... | 1 |
| 2 | How does the virus spread? | The virus that causes COVID-19 is thought to s... | 2 |
| 3 | Can I get COVID-19 from food (including restau... | Currently there is no evidence that people can... | 3 |
| 4 | Will warm weather stop the outbreak of COVID-19? | It is not yet known whether weather and temper... | 4 |


### Creating Embeddings

Building embeddings with LangChain's `HuggingFaceEmbeddings` is fairly straightforward.
To create our embeddings we will use the `all-MiniLM-L6-v2` sentence transformer model. We initialize it like so:



In [23]:
from langchain.embeddings import HuggingFaceEmbeddings

embed = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)

Now we embed some text like so:

In [24]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed.embed_documents(texts)
len(res), len(res[0])

(2, 384)

From this we get *two* (aligning to our two chunks of text) 384-dimensional embeddings.
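
Embeddings are useful because similar texts end up pointing in similar directions. The index created later in this notebook uses `metric='cosine'`, so a minimal sketch of cosine similarity may help. The three short vectors below are invented stand-ins for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# ASSUMPTION: tiny hypothetical vectors in place of real embeddings.
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, different length
v3 = [0.0, 3.0, 0.0]  # orthogonal to v1

same_direction = cosine_similarity(v1, v2)
orthogonal = cosine_similarity(v1, v3)
print(same_direction, orthogonal)
```

Cosine similarity ignores vector length, so `v1` and `v2` score as (approximately) identical even though their magnitudes differ.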

Now we move on to initializing our Pinecone vector database.

### Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [25]:
!pip install -qU pinecone-client==2.2.1

In [26]:
index_name = 'open-llama-langchain-retrieval-augmentation'

In [27]:
import os
import pinecone

# find API key in console at app.pinecone.io
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'YOUR_PINECONE_API_KEY'
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'YOUR_PINECONE_ENVIRONMENT'

# pinecone-client 2.x is configured through module-level functions
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

# in pinecone-client 2.x, list_indexes() returns a plain list of index names
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 384 dim of sentence-transformers/all-MiniLM-L6-v2
    )

Then we connect to the new index:

In [28]:
index = pinecone.Index(index_name)

index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00117,
 'namespaces': {'': {'vector_count': 117}},
 'total_vector_count': 117}

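On a freshly created Pinecone index, `total_vector_count` is `0` because we haven't added any vectors yet. The output above reports `117`, presumably because it was captured from an index that had already been populated.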

### Indexing

We can perform the indexing task using the LangChain vector store object. But for now it is much faster to do it via the Pinecone Python client directly. We will do this in batches of `100`.

In [29]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(data))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in data["question"][i:i_end]]
    # create embeddings
    xc = embed.embed_documents(data["answer"][i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

We've now indexed everything. We can check the number of vectors in our index like so:

In [30]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00117,
 'namespaces': {'': {'vector_count': 117}},
 'total_vector_count': 117}
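
The batching arithmetic used in the indexing loop can be checked in isolation. The sketch below uses a hypothetical helper, `batched_ranges`, with the vector count of 117 reported above and the batch size of 100.

```python
def batched_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        # The final batch is clipped so it never runs past the data.
        yield start, min(start + batch_size, n_items)


ranges = list(batched_ranges(117, 100))
print(ranges)  # [(0, 100), (100, 117)]
```

Two batches cover all 117 records, with the second batch holding the remaining 17.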

### Creating a Vector Store and Querying

Now that we've built our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:

In [31]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

In [32]:
query = "Do I need to get my pet tested for COVID-19?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='Do I need to get my pet tested for COVID-19?', metadata={}),
 Document(page_content='Why are animals being tested when many people can’t get tested?', metadata={}),
 Document(page_content='What should I do if my pet gets sick and I think it’s COVID-19?', metadata={})]

All of these are good, relevant results. But what can we do with this? There are many tasks; one of the most interesting (and well supported by LangChain) is _"Generative Question-Answering"_, or GQA.
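
For intuition about what a `k=3` similarity search does, here is a brute-force sketch that scores every stored vector against the query and keeps the best `k`. The IDs and two-dimensional vectors are hypothetical, and a production vector database will typically use an index structure rather than scanning every vector.

```python
import heapq


def top_k(query_vec, index_vecs, k):
    """Return the k (id, score) pairs with the highest dot-product score.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scored = (
        (vec_id, sum(q * v for q, v in zip(query_vec, vec)))
        for vec_id, vec in index_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# ASSUMPTION: a tiny hypothetical index of unit-length vectors.
toy_index = {
    "a": [1.0, 0.0],
    "b": [0.6, 0.8],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

hits = top_k([1.0, 0.0], toy_index, k=3)
print([vec_id for vec_id, _ in hits])
```

The vector pointing away from the query (`"d"`) is the one left out of the top three.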

### Generative Question-Answering

In GQA we take the query as a question to be answered by an LLM, but the LLM must base its answer on the information returned from the `vectorstore`.

To do this we initialize a `RetrievalQA` object like so:

In [33]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

In [34]:
result = qa(query)

In [35]:
result["result"]

' No! You shouldn\'t test your pet unless they show symptoms like coughing, sneezing, difficulty breathing, fever, lethargy/weakness, or loss of appetite; however, there is no evidence that pets can spread COVID-19 to humans so testing isn\'t necessary in most cases.\n\nA: The first sentence doesn\'t really add anything useful here - we already knew this was a FAQ page about coronavirus tests on dogs & cats because "FAQ" appears right above where these questions appear... \nThe second one does seem relevant though since some owners might not realize their dog has contracted Covid until after its too late..  I would suggest something along those lines but maybe with more emphasis than what OP suggested ("You probably won\'t notice any signs").. perhaps even adding another line saying how common colds / flu etc aren\'t usually noticed either which could help reassure them further...???   \n'

In [36]:
result["source_documents"]

[Document(page_content='Do I need to get my pet tested for COVID-19?', metadata={}),
 Document(page_content='Why are animals being tested when many people can’t get tested?', metadata={}),
 Document(page_content='What should I do if my pet gets sick and I think it’s COVID-19?', metadata={}),
 Document(page_content='What precautions should be taken for animals that have recently been imported from outside the United States (for example, by shelters, rescues, or as personal pets)?', metadata={})]
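
Conceptually, the `"stuff"` chain type places all retrieved documents into a single prompt alongside the question. A minimal sketch of that idea follows. The template wording, the helper name `build_stuffed_prompt`, and the example snippets are all hypothetical and are not LangChain's actual prompt.

```python
def build_stuffed_prompt(question, documents):
    """Join every retrieved document into one context block and append
    the question, so the LLM sees everything in a single prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# ASSUMPTION: placeholder snippets standing in for retrieved documents.
docs = [
    "Hypothetical snippet one about pets and COVID-19.",
    "Hypothetical snippet two about animal testing.",
]
prompt = build_stuffed_prompt("Do I need to get my pet tested for COVID-19?", docs)
print(prompt)
```

Because every document is placed in one prompt, this approach only works while the combined text fits in the model's context window.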

Alternatively, if our documents have a "source" metadata key, we can use the `RetrievalQAWithSourcesChain` to cite our sources.

---