[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jmalbornoz/SimpleRAG/blob/main/2_simple_rag_example.ipynb)

# Simple RAG workflow
## Dr José M Albornoz
### April 2024

This notebook presents a simple example of retrieval-augmented generation (RAG) using a quantized version of Falcon 7B. The model is stored on disk to simplify access and to illustrate a situation in which both model and context must be available on-prem for security reasons.

Please note that this is a memory-hungry application: with the default amount of memory available in Colab, some steps take a very long time to complete. I have timed the most critical steps in the RAG process to illustrate this limitation.

In [None]:
import psutil
psutil.virtual_memory()

svmem(total=13609443328, available=12713660416, percent=6.6, used=589774848, free=9971924992, active=627499008, inactive=2768826368, buffers=338739200, cached=2709004288, shared=1396736, slab=151605248)
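
The values reported by `psutil.virtual_memory()` are in bytes. As a minimal sketch for reading the output above, the helper below (`to_gib` is a hypothetical name, not part of the notebook) converts the `total` and `available` figures to GiB:

```python
def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Figures copied from the svmem output above
total_bytes = 13609443328
available_bytes = 12713660416

print(f"total:     {to_gib(total_bytes):.2f} GiB")
print(f"available: {to_gib(available_bytes):.2f} GiB")
```

That is roughly 12.7 GiB in total, which the quantized model, the embedding model, and the vector database all have to share.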

# 0.- Install dependencies

In [None]:
import sys
import os
!pip install torch==2.2.0 --no-warn-script-location > /dev/null
!pip install langchain==0.0.335 --no-warn-script-location > /dev/null
!pip install pygpt4all==1.1.0 --no-warn-script-location > /dev/null
!pip install gpt4all==1.0.12 --no-warn-script-location > /dev/null
!pip install transformers==4.35.1 --no-warn-script-location > /dev/null
!pip install datasets==2.14.6 --no-warn-script-location > /dev/null
!pip install tiktoken==0.4.0 --no-warn-script-location > /dev/null
!pip install chromadb==0.4.15 --no-warn-script-location > /dev/null
!pip install sentence_transformers==2.2.2 --no-warn-script-location > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.1+cu121 requires torch==2.2.1, but you have torch 2.2.0 which is incompatible.
torchtext 0.17.1 requires torch==2.2.1, but you have torch 2.2.0 which is incompatible.
torchvision 0.17.1+cu121 requires torch==2.2.1, but you have torch 2.2.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.

# 1.- Imports

In [None]:
import requests
import time
#import io

from tqdm import tqdm
from langchain.llms import GPT4All
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings

# 2.- Parameters and query

In [None]:
# Number of CPU threads used for inference - especially relevant on multi-core systems
n_threads = 64
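
A thread count well above the number of available CPU cores generally adds contention rather than speed. As a hedged sketch (the helper name `pick_n_threads` and the capping rule are assumptions, not part of the original notebook), the count could instead be derived from what the runtime reports:

```python
import os


def pick_n_threads(requested: int = 64) -> int:
    """Cap the requested thread count at the number of CPUs the runtime reports."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


print(pick_n_threads())
```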

In [None]:
# Maximum number of tokens the model may generate per response; passed to GPT4All as n_predict,
# so it caps the length of the output rather than the input
max_tokens = 100

In [None]:
# Penalty for repeated tokens in generation. Values greater than 1.0 discourage the model from
# repeating tokens it has already produced; a value of 1.0 applies no penalty
repeat_penalty = 1.5
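
To make the effect of `repeat_penalty` concrete, the sketch below applies one commonly used formulation of the penalty to a toy set of logits: positive logits of already generated tokens are divided by the penalty, and negative ones are multiplied by it. This is an illustration only; the function name and the data are hypothetical, and the GPT4All backend may differ in detail.

```python
def apply_repeat_penalty(logits, generated, penalty):
    """Return a copy of `logits` with already generated tokens made less likely."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            # Dividing a positive logit or multiplying a negative one both lower the score
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


toy_logits = {"the": 2.0, "holy": -1.0, "spirit": 0.5}
print(apply_repeat_penalty(toy_logits, generated=["the", "holy"], penalty=1.5))
```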

In [None]:
# Batch size for prompt processing: the number of prompt tokens evaluated together. Tuning it trades speed against memory usage
n_batch = n_threads

In [None]:
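# number of chunks to be retrieved from the vector database; the same value is also passed to
# GPT4All as its top_k sampling parameter when the model is loaded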
top_k = 3

In [None]:
# model temperature
temp = 0.5
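
Temperature rescales the model's logits before they are turned into probabilities: values below 1.0 sharpen the distribution, values above 1.0 flatten it. A minimal sketch with toy logits (the function name and the numbers are illustrative assumptions, not taken from the model):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

With `temp = 0.5`, the most likely token receives a larger share of the probability mass than it would at 1.0, so generation is less varied.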

In [None]:
# size of text chunks for vectorization
chunk_size = 500

In [None]:
# overlap size between chunks
overlap = 20
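
The interplay between `chunk_size` and `overlap` can be sketched with a simplified fixed-window splitter. Note that this is not how `RecursiveCharacterTextSplitter` works internally (it additionally tries to break on separators such as paragraph and line breaks, so its chunks will differ); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_chunks(sample, chunk_size=500, overlap=20)
print(len(pieces), [len(p) for p in pieces])
```

The overlap ensures that a sentence cut at a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.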

In [None]:
# if True, the retrieved context and the final prompt are printed
context_verbosity = True

In [None]:
# if True, retrieval-augmented generation is disabled and the model answers without the retrieved context
rag_off = False

In [None]:
# query to be answered by the chatbot
user_input = "where did the apostles receive the holy spirit?"

# 3.- Define model URL and model folder

In [None]:
!mkdir -p models

In [None]:
url = 'https://huggingface.co/nomic-ai/gpt4all-falcon-ggml/resolve/main/ggml-model-gpt4all-falcon-q4_0.bin'

# 4.- Download ggml model

In [None]:
# define model path
model_path = "/content/models/ggml-model-gpt4all-falcon-q4_0.bin"

In [None]:
if not os.path.isfile(model_path):

    print('Downloading ggml model')

    # send a GET request to the URL to download the file. Stream since it's large
    response = requests.get(url, stream=True)
    # fail early on HTTP errors instead of writing an error page to disk
    response.raise_for_status()

    # open the file in binary mode and write the contents of the response to it in chunks
    # This is a large file, so be prepared to wait.
    start = time.time()
    with open(model_path, 'wb') as f:
        for chunk in tqdm(response.iter_content(chunk_size=10000)):
            if chunk:
                f.write(chunk)
    end = time.time()
    print(f'\nModel downloaded in {(end - start)/60} minutes')

else:
    print('model already exists in path.')

Downloading ggml model


406165it [01:00, 6736.47it/s]


Model downloaded in 1.0050726612408956 minutes
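
Because the cell above skips the download whenever a file already exists at `model_path`, an interrupted download would leave a partial file that is later treated as complete. One way to guard against this, sketched under the assumption that a `.part` suffix is acceptable (the helper name `write_atomically` is hypothetical), is to write to a temporary file and rename it only once all chunks are on disk:

```python
import os
import tempfile


def write_atomically(chunks, dest_path):
    """Write byte chunks to dest_path + '.part', then rename to dest_path when done."""
    tmp_path = dest_path + ".part"
    with open(tmp_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
    # os.replace only runs if every chunk was written without an exception
    os.replace(tmp_path, dest_path)


# Toy demonstration with in-memory chunks instead of an HTTP response
with tempfile.TemporaryDirectory() as tmp_dir:
    dest = os.path.join(tmp_dir, "model.bin")
    write_atomically([b"abc", b"", b"def"], dest)
    print(os.path.getsize(dest))
```

In the notebook, `chunks` would be the iterator returned by `response.iter_content(chunk_size=10000)`.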





# 5.- Load model

In [None]:
# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]

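# Load the quantized model from disk and time the operation. Note that top_k is reused here as the
# sampling parameter: only the top_k most likely tokens are considered at each generation step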
print('Loading model...')

start = time.time()
llm = GPT4All(model=model_path, callbacks=callbacks, verbose=False,
              n_threads=n_threads, n_predict=max_tokens, repeat_penalty=repeat_penalty,
              n_batch=n_batch, top_k=top_k, temp=temp)
end = time.time()
print(f'\nModel loaded in {(end - start)/60} minutes')

Loading model...
Found model file at  /content/models/ggml-model-gpt4all-falcon-q4_0.bin

Model loaded in 0.0659467379252116 minutes
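
The `start = time.time()` / `end = time.time()` pattern is repeated for every timed step in this notebook. A small context manager could factor it out; the sketch below is one possible shape (the name `timed` is an assumption, not something the notebook defines):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in minutes, and expose the value."""
    result = {"minutes": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["minutes"] = (time.perf_counter() - start) / 60
        print(f"{label} in {result['minutes']} minutes")


with timed("Toy step completed") as timing:
    sum(range(100_000))
```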


# 6.- Build vector database

The file `bible.txt` must be uploaded to the Colab environment before the next cell is executed.

In [None]:
data_path = 'bible.txt'

loader = TextLoader(data_path)

# Text Splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)

# Embed the document chunks and store them in a Chroma vector database
start = time.time()
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings(), text_splitter=text_splitter).from_loaders([loader])
end = time.time()
print(f'Vector database built in {(end - start)/60} minutes')


Vector database built in 72.97951021989186 minutes
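
Conceptually, the index built above stores one embedding vector per text chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below illustrates that idea with hand-made three-dimensional vectors and cosine similarity; the vectors, chunk labels, and function names are made up for illustration, and Chroma's actual distance metric and indexing are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, store, k):
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (chunk text, embedding) pairs with made-up embeddings
toy_store = [
    ("chunk about Pentecost", [0.9, 0.1, 0.0]),
    ("chunk about genealogy", [0.0, 0.2, 0.9]),
    ("chunk about baptism", [0.7, 0.6, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]
print(top_k_chunks(toy_query, toy_store, k=2))
```

In the real pipeline, the embedding model maps both chunks and query into a much higher-dimensional space, which is where most of the build time above is spent.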


# 7.- Define retrieval mechanism

In [None]:
# perform a similarity search and retrieve the context from our documents
results = index.vectorstore.similarity_search(user_input, k=top_k)

# join all context information into one string
context = "\n".join([document.page_content for document in results])
if context_verbosity:
    print("Retrieving information related to your question...")
    print(f"Found this content which is most similar to your question: {context}")

if rag_off:
    template = """Question: {question}
    Answer: This is the response: """
    prompt = PromptTemplate(template=template, input_variables=["question"])
else:
    template = """Don't just repeat the following context; use it in combination with your knowledge to improve your answer to the question: {context}

    Question: {question}
    """
    prompt = PromptTemplate(template=template, input_variables=["context", "question"]).partial(context=context)

Retrieving information related to your question...
Found this content which is most similar to your question: Acts 8:15	Who, when they were come down, prayed for them, that they might receive the Holy Ghost:
Acts 8:16	(For as yet he was fallen upon none of them: only they were baptized in the name of the Lord Jesus.)
Acts 8:17	Then laid they [their] hands on them, and they received the Holy Ghost.
Acts 8:18	And when Simon saw that through laying on of the apostles’ hands the Holy Ghost was given, he offered them money,
Acts 2:1	And when the day of Pentecost was fully come, they were all with one accord in one place.
Acts 2:2	And suddenly there came a sound from heaven as of a rushing mighty wind, and it filled all the house where they were sitting.
Acts 2:3	And there appeared unto them cloven tongues like as of fire, and it sat upon each of them.
Acts 2:4	And they were all filled with the Holy Ghost, and began to speak with other tongues, as the Spirit gave them utterance.
Acts 11:12	A
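
The effect of `.partial(context=context)` is to fix the context once, so that only the question has to be supplied at inference time. A rough plain-Python analogue, which does not use LangChain and whose names (`TEMPLATE`, `build_prompt`, `fill_question`) and placeholder chunks are purely illustrative, might look like this:

```python
from functools import partial

# Simplified stand-in for the notebook's template
TEMPLATE = "Use the following context to improve your answer:\n{context}\n\nQuestion: {question}\n"


def build_prompt(template, context, question):
    """Fill both placeholders of the template."""
    return template.format(context=context, question=question)


# Placeholder strings standing in for the retrieved documents
retrieved_chunks = ["first retrieved chunk", "second retrieved chunk"]
joined_context = "\n".join(retrieved_chunks)

# Fix template and context now; the question is supplied later
fill_question = partial(build_prompt, TEMPLATE, joined_context)

prompt_text = fill_question(question="where did the apostles receive the holy spirit?")
print(prompt_text)
```

One caveat of this plain `str.format` version is that curly braces inside the retrieved text would need escaping.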

# 8.- Inference

In [None]:
if context_verbosity:
    print(f"Your Query: {prompt}")

llm_chain = LLMChain(prompt=prompt, llm=llm)

start = time.time()
print("Processing the information with gpt4all...\n")
response = llm_chain.run(user_input)
end = time.time()
print(f'Response generated in {(end - start)/60} minutes')

print(response)

Your Query: input_variables=['question'] partial_variables={'context': 'Acts 8:15\tWho, when they were come down, prayed for them, that they might receive the Holy Ghost:\nActs 8:16\t(For as yet he was fallen upon none of them: only they were baptized in the name of the Lord Jesus.)\nActs 8:17\tThen laid they [their] hands on them, and they received the Holy Ghost.\nActs 8:18\tAnd when Simon saw that through laying on of the apostles’ hands the Holy Ghost was given, he offered them money,\nActs 2:1\tAnd when the day of Pentecost was fully come, they were all with one accord in one place.\nActs 2:2\tAnd suddenly there came a sound from heaven as of a rushing mighty wind, and it filled all the house where they were sitting.\nActs 2:3\tAnd there appeared unto them cloven tongues like as of fire, and it sat upon each of them.\nActs 2:4\tAnd they were all filled with the Holy Ghost, and began to speak with other tongues, as the Spirit gave them utterance.\nActs 11:12\tAnd the spirit bade me