# Retrieval Augmented Generation (RAG) with OpenAI / Claude and Qdrant

Based on the basic RAG example from the [Qdrant sample use cases](https://qdrant.tech/documentation/examples/).

The consistency and reliability of Large Language Models (LLMs) remain a challenge. These models capture statistical relationships between words, but their internal knowledge may be inaccurate, so outputs can range from spot-on to nonsensical. Retrieval Augmented Generation (RAG) is a framework designed to improve the accuracy of LLMs by grounding them in external knowledge bases. In this example, we'll demonstrate a streamlined implementation of the RAG pipeline using only the Qdrant and OpenAI SDKs, with the Anthropic SDK as a Claude alternative. By relying on Flag embeddings, computed through FastEmbed, we avoid the overhead of additional frameworks.
    
This example assumes you understand the architecture necessary to carry out RAG. If this is new to you, please look at some introductory readings:
* [Retrieval-Augmented Generation: To add knowledge](https://eugeneyan.com/writing/llm-patterns/#retrieval-augmented-generation-to-add-knowledge)

## Prerequisites

Let's start by setting up all the pieces needed to implement the RAG pipeline. We will only use the Qdrant and OpenAI SDKs (plus the Anthropic SDK for the Claude alternative), without any additional frameworks.

### Preparing the environment

The whole application needs only a few dependencies, so let's install them first. The `anthropic` package is required for the Claude alternative used later.

In [1]:
!pip install qdrant-client fastembed openai anthropic


[Qdrant](https://qdrant.tech) will act as a knowledge base providing the context information for the prompts we'll be sending to the LLM. There are various ways of running Qdrant, but we'll simply use the Docker container.

In [5]:
!docker run -p "6333:6333" -p "6334:6334" --name "rag-openai-qdrant" --rm -d qdrant/qdrant:latest

docker: Error response from daemon: Conflict. The container name "/rag-openai-qdrant" is already in use by container "ef9c570f276697c0a7ae49104905792a2cde578c2a5506ad2ae496ed3da80102". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.


### Creating the collection

A Qdrant [collection](https://qdrant.tech/documentation/concepts/collections/) is the basic unit of organizing your data. Each collection is a named set of points (vectors with a payload) among which you can search. After connecting to our running Qdrant container, we can check whether we already have some collections.

In [1]:
import qdrant_client

qdrant_client = qdrant_client.QdrantClient("http://localhost:6333", prefer_grpc=True)
qdrant_client.get_collections()


CollectionsResponse(collections=[CollectionDescription(name='default_collection'), CollectionDescription(name='knowledge-base')])

### Building the knowledge base

Qdrant will use vector embeddings of our facts to enrich the original prompt with some context. Thus, we need to store the vector embeddings and the texts used to generate them. All our facts will have a JSON payload with a single attribute and look as follows:

```json
{
    "document": "Binary Quantization is a method of reducing the memory usage even up to 40 times!"
}
```

This structure is required by [FastEmbed](https://qdrant.github.io/fastembed/), a library that simplifies managing the vectors, as you don't have to calculate them on your own. It's also possible to use an existing collection. However, all the code snippets assume this data structure, so adjust the examples if you work with a different schema.

FastEmbed will automatically create the collection if it doesn't exist. With that in mind, we can add our documents to a collection, which we'll call `knowledge-base`.

In [14]:
qdrant_client.add(
    collection_name="knowledge-base",
    documents=[
        "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
        "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
        "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
        "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
        "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
        "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
        "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
        "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
    ]
)

['69a21a6c4ed449f386ce2ff957b19052',
 '84f55a9325c84abab0dbfaf2dfbf675d',
 '65c8a7fd2f1a4afa96d432ee333101ce',
 'bcaf626371b24898b53da2dd1994cf89',
 '3e72bb6ec5514a4289bcad8ad8d66615',
 '0606d89c2f974b39bf0db4723d3766c2',
 'a952fd658c8448199a91ef90bd07ad03',
 '94cedf9f528d4531bcfd6fbf013d127f']

### Building a collection from URLs and local files
The `create_qdrant_collection_from_dir` function re-creates the collection if one with the same name already exists. The default collection name is `default_collection`.

In [12]:
import qdrant_retrieval
# from qdrant_client import QdrantClient
import importlib
importlib.reload(qdrant_retrieval)
from qdrant_retrieval import create_qdrant_collection_from_dir, retrieve_docs

### Testing input data types

Markdown files referenced by URL (the files are downloaded with `requests`)

In [27]:
create_qdrant_collection_from_dir(
    dir_path=[
        "https://raw.githubusercontent.com/microsoft/flaml/main/README.md",
        "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md",
    ],
    text_types=["url"],
)

Text files or a directory of text files

In [29]:
create_qdrant_collection_from_dir(
    dir_path=["txt_files/"],
    text_types=["txt"],
    collection_name="txt_collection"
)

JSON file

In [30]:
create_qdrant_collection_from_dir(
    dir_path=["json_files/test.json"],
    text_types=["json"],
    collection_name="json_collection"
)

max_tokens is too small to fit a single line of text. Breaking this line:
	{ ...
max_tokens is too small to fit a single line of text. Breaking this line:
	 ...


ZeroDivisionError: division by zero
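
The warnings above indicate that the loader splits documents into chunks bounded by a `max_tokens` budget and breaks lines that exceed it. The internals of `qdrant_retrieval` are not shown here, so the following is only a hypothetical sketch of that kind of chunking. `split_into_chunks` is an illustrative name, and whitespace-separated words stand in for tokens.

```python
def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Greedily pack the words of consecutive lines into chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks: list[str] = []
    current: list[str] = []  # words collected for the chunk being built
    for line in text.splitlines():
        words = line.split()
        # A single line longer than the budget is broken into budget-sized pieces.
        pieces = [words[i:i + max_words] for i in range(0, len(words), max_words)]
        for piece in pieces:
            # Close the current chunk when the next piece would overflow it.
            if current and len(current) + len(piece) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_into_chunks("a b c\nd e", max_words=3))
```

A real loader would count tokens with the tokenizer of the target model. The explicit check on `max_words` makes an invalid budget fail with a readable message.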

## Retrieval Augmented Generation

RAG changes the way we interact with Large Language Models. We're converting a knowledge-oriented task, in which the model may create a counterfactual answer, into a language-oriented task. The latter expects the model to extract meaningful information and generate an answer. LLMs, when implemented correctly, are supposed to be carrying out language-oriented tasks.

The task starts with the original prompt sent by the user. The same prompt is then vectorized and used as a search query for the most relevant facts. Those facts are combined with the original prompt to build a longer prompt containing more information.
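
The flow described above (vectorize the prompt, search for the closest facts, combine them with the prompt) can be sketched without any external service. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for Qdrant, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, facts: list[str], limit: int = 2) -> list[str]:
    """Return the `limit` facts most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(facts, key=lambda fact: cosine(query_vector, embed(fact)), reverse=True)
    return ranked[:limit]


def augment(question: str, facts: list[str], limit: int = 2) -> str:
    """Combine the retrieved facts with the original question into a longer prompt."""
    context = "\n".join(retrieve(question, facts, limit))
    return f"Question: {question}\n\nContext:\n{context}"


facts = [
    "Qdrant is a vector database and vector similarity search engine.",
    "The cron utility is a job scheduler on Unix-like operating systems.",
    "FastAPI is a web framework for building APIs with Python.",
]
augmented_prompt = augment("Which vector database offers similarity search?", facts, limit=1)
print(augmented_prompt)
```

Word overlap only matches literal tokens. Dense embeddings, as used in the rest of this example, also match texts that share meaning but not vocabulary.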

But let's start simply by asking our question directly.

In [8]:
prompt = """
What tools should I need to use to build a web service using vector embeddings for search?
"""

Using the OpenAI API requires an API key. Our example sets `OPENAI_API_KEY` through an environment variable.

In [9]:
import os

# Fill the environment variable with your own OpenAI API key
# See: https://platform.openai.com/account/api-keys
os.environ["OPENAI_API_KEY"] = "<< PASS YOUR OWN KEY >>"
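
Hard-coding a key in a notebook makes it easy to leak, so a small helper that reads it from the environment and fails early can be useful. The following is a minimal sketch. `require_api_key` is a hypothetical helper name, not part of any SDK, and it treats a value starting with `<<` (the placeholder style used above) as missing.

```python
import os


def require_api_key(name: str) -> str:
    """Return the key stored in environment variable `name`, or fail with a clear message."""
    key = os.environ.get(name, "").strip()
    # Treat an unset variable and the "<< ... >>" placeholder the same way.
    if not key or key.startswith("<<"):
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return key
```

Calling `require_api_key("OPENAI_API_KEY")` before the first API request turns a confusing authentication failure into an immediate, readable error.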

Now we can finally call the completion service.

In [None]:
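# Requires openai>=1.0; uncomment to call the OpenAI API.
# from openai import OpenAI
#
# openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
# completion = openai_client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[
#         {"role": "user", "content": prompt},
#     ],
# )
# print(completion.choices[0].message.content)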

To build a web service using vector embeddings for search, you would need several tools. Here are some essential ones:

1. Programming Language: Depending on your preference and requirements, you can choose a programming language like Python, Java, or Node.js for building the web service.

2. Web Framework: A web framework helps in developing web applications efficiently. Popular choices include Flask (Python) or Spring Boot (Java).

3. Embedding Models: You would need vector embedding models to represent your data for search. Some popular models include Word2Vec, GloVe, or BERT, depending on your application domain and requirements.

4. Vector Embedding Libraries: To work with vector embeddings effectively, you may need libraries like TensorFlow, Gensim, or PyTorch to load and manipulate the embeddings.

5. Database: You would require a database to store and retrieve the information you want to search. Common choices include MySQL, PostgreSQL, or MongoDB.

6. Search Engine: You need a

### Claude alternative

In [11]:
import anthropic

# Read the key from the ANTHROPIC_API_KEY environment variable; never hard-code it in the notebook
client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

message = client.messages.create(
    model="claude-3-opus-20240229",
    #model="claude-2.1",
    max_tokens=200,
    temperature=0.0,
    system="Respond only in Yoda-speak.",
    messages=[
        {"role": "user", "content": prompt}
    ]
)

print(message.content)

[ContentBlock(text='To build a web service using vector embeddings for search, need the following tools, you will:\n\n1. A programming language, such as Python or Java, choose you must. Popular for machine learning and natural language processing, Python is.\n\n2. A web framework like Flask or Django, select you should. Easy to create web services and APIs, these frameworks make.\n\n3. A library for generating vector embeddings, require you will. Popular options include TensorFlow, PyTorch, or Gensim. Powerful and flexible, these libraries are.\n\n4. A database to store your vector embeddings and search indices, consider you must. Options like Elasticsearch, Faiss, or Annoy, explore you can. Efficient storage and retrieval of high-dimensional vectors, these databases provide.\n\n5. A machine learning model for generating vector embeddings, train or obtain you must. Pre-trained models like Word2Vec, Glo', type='text')]


### Extending the prompt

Even though the original answer sounds credible, it didn't answer our question correctly. Instead, it gave us a generic description of an application stack. One way to improve the results is to enrich the original prompt with descriptions of the tools we actually have available. Let's use a semantic knowledge base to augment the prompt with the descriptions of different technologies!

In [15]:
results = qdrant_client.query(
    collection_name="knowledge-base",
    query_text=prompt,
    limit=3,
)
results

[QueryResponse(id='0e25a5b2-9fc8-4ba5-904d-7535cfa32e7f', embedding=None, sparse_embedding=None, metadata={'document': 'Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!'}, document='Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!', score=0.8290700912475586),
 QueryResponse(id='69a21a6c-4ed4-49f3-86ce-2ff957b19052', embedding=None, sparse_embedding=None, metadata={'document': 'Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the

We used the original prompt to perform a semantic search over the set of tool descriptions. Now we can use these descriptions to augment the prompt and create more context.

In [16]:
context = "\n".join(r.document for r in results)
context

'Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!\nQdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!\nFastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.'
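
The context above contains the Qdrant description twice. The two hits carry different IDs, which suggests the same fact is stored in the collection more than once, most likely because the documents were added repeatedly. A minimal sketch of building the context from unique documents only, with an optional score cutoff, could look as follows. `Hit` is a stand-in for the search result objects, `build_context` is a hypothetical helper, and the scores are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Stand-in for a search result; only `document` and `score` are used here."""
    document: str
    score: float


def build_context(hits: list[Hit], min_score: float = 0.0) -> str:
    """Join unique documents in rank order, skipping hits below `min_score`."""
    kept = [hit.document for hit in hits if hit.score >= min_score]
    unique = list(dict.fromkeys(kept))  # dict keys preserve first-seen order
    return "\n".join(unique)


hits = [
    Hit("Qdrant is a vector database.", 0.83),
    Hit("Qdrant is a vector database.", 0.83),
    Hit("FastAPI is a web framework.", 0.79),
]
print(build_context(hits))
```

Since the real results expose the same `document` and `score` attributes, `build_context(results)` should work as a drop-in replacement for the `"\n".join(...)` expression, although that has not been verified against every client version.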

Finally, let's build a metaprompt, the combination of the assumed role of the LLM, the original question, and the results from our semantic search that will force our LLM to use the provided context.

By doing this, we effectively convert the knowledge-oriented task into a language task and hopefully reduce the chances of hallucinations. It also should make the response sound more relevant.

In [17]:
metaprompt = f"""
You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: {prompt.strip()}

Context:
{context.strip()}

Answer:
"""

# Look at the full metaprompt
print(metaprompt)


You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: What tools should I need to use to build a web service using vector embeddings for search?

Context:
Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard

Our current prompt is much longer, and we also used a couple of strategies to make the responses even better:

1. The LLM has the role of software architect.
2. We provide more context to answer the question.
3. If the context contains no meaningful information, the model shouldn't make up an answer.
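
The three strategies above can be packaged into a reusable prompt builder. This is a minimal sketch: `build_metaprompt` and its `role` parameter are hypothetical names introduced for illustration, and the wording of the template mirrors the metaprompt shown earlier.

```python
from textwrap import dedent


def build_metaprompt(question: str, context: str, role: str = "software architect") -> str:
    """Assemble the role, the instructions, the question, and the context into one prompt."""
    template = dedent(
        """\
        You are a {role}.
        Answer the following question using the provided context.
        If you can't find the answer, do not pretend you know it, but answer "I don't know".

        Question: {question}

        Context:
        {context}

        Answer:
        """
    )
    # Fill the placeholders after dedenting so that multi-line context is inserted verbatim.
    return template.format(role=role, question=question.strip(), context=context.strip())


print(build_metaprompt("What is Qdrant?", "Qdrant is a vector database."))
```

Dedenting the template before formatting keeps the prompt free of the leading spaces that appear when an indented f-string is used inside a function body.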

Let's find out if that works as expected.

In [None]:
# Requires openai>=1.0; uncomment to call the OpenAI API.
# from openai import OpenAI
#
# completion = OpenAI().chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[
#         {"role": "user", "content": metaprompt},
#     ],
#     timeout=10.0,
# )
# print(completion.choices[0].message.content)

To build a web service using vector embeddings for search, you would need to use Qdrant as the vector database and vector similarity search engine. Additionally, you can use FastAPI as the web framework for building the API. To compute the vector embeddings, you can utilize SentenceTransformers, which is a Python framework for generating sentence/text embeddings. These tools would enable you to create a web service for matching, searching, recommending, and more based on vector embeddings.


### Claude alternative

In [18]:
message = client.messages.create(
    model="claude-3-opus-20240229",
    #model="claude-2.1",
    max_tokens=200,
    temperature=0.0,
    #system="Respond only in Yoda-speak.",
    messages=[
        {"role": "user", "content": metaprompt}
    ]
)

print(message.content)

[ContentBlock(text='Based on the provided context, you can use the following tools to build a web service using vector embeddings for search:\n\n1. Qdrant: Qdrant is a vector database and vector similarity search engine that can be deployed as an API service. It allows you to store and search for high-dimensional vectors, making it suitable for applications like matching, searching, and recommending based on vector embeddings.\n\n2. FastAPI: FastAPI is a modern web framework for building APIs with Python. It provides a fast and efficient way to create a web service that can interact with the Qdrant vector database. You can use FastAPI to define API endpoints, handle requests, and process responses.\n\nBy combining Qdrant and FastAPI, you can create a web service that utilizes vector embeddings for search functionality. Qdrant will handle the storage and retrieval of vector embeddings, while FastAPI will provide the API interface to interact with', type='text')]


### Testing out the RAG pipeline

By leveraging the semantic context we provided, our model does a better job of answering the question. Let's wrap the RAG pipeline in a function, so we can call it more easily for different prompts.

In [None]:
# def rag(question: str, n_points: int = 3) -> str:
#     results = qdrant_client.query(
#         collection_name="knowledge-base",
#         query_text=question,
#         limit=n_points,
#     )

#     context = "\n".join(r.document for r in results)

#     metaprompt = f"""
#     You are a software architect.
#     Answer the following question using the provided context.
#     If you can't find the answer, do not pretend you know it, but answer "I don't know".

#     Question: {question.strip()}

#     Context:
#     {context.strip()}

#     Answer:
#     """

#     from openai import OpenAI  # requires openai>=1.0
#
#     completion = OpenAI().chat.completions.create(
#         model="gpt-3.5-turbo",
#         messages=[
#             {"role": "user", "content": metaprompt},
#         ],
#         timeout=10.0,
#     )
#     return completion.choices[0].message.content

### Claude alternative

In [21]:
# Returns Claude's list of content blocks; use message.content[0].text for the plain string.
def rag(question: str, n_points: int = 3) -> list:
    results = qdrant_client.query(
        collection_name="knowledge-base",
        query_text=question,
        limit=n_points,
    )

    context = "\n".join(r.document for r in results)

    metaprompt = f"""
    You are a software architect.
    Answer the following question using the provided context.
    If you can't find the answer, do not pretend you know it, but answer "I don't know".

    Question: {question.strip()}

    Context:
    {context.strip()}

    Answer:
    """

    message = client.messages.create(
        model="claude-3-opus-20240229",
        #model="claude-2.1",
        max_tokens=200,
        temperature=0.0,
        #system="Respond only in Yoda-speak.",
        messages=[
            {"role": "user", "content": metaprompt}
        ]
    )

    return message.content

Now it's easier to ask a broad range of questions.

In [22]:
rag("What can the stack for a web api look like?")

[ContentBlock(text='Based on the provided context, a possible stack for a web API can include:\n\n1. FastAPI: A modern, fast, and high-performance web framework for building APIs using Python 3.7+. FastAPI leverages Python type hints and provides automatic API documentation and validation.\n\n2. Python 3.7+: The programming language used to develop the web API, taking advantage of modern Python features and syntax.\n\n3. Docker: A containerization platform that allows developers to package the web API along with its dependencies into a container. Docker enables easy deployment, scalability, and portability of the application across different environments without the need for tedious environment configuration or management.\n\nSo, a typical stack for a web API can consist of using FastAPI as the web framework, Python 3.7+ as the programming language, and Docker for containerization and deployment. This stack provides a fast, efficient, and scalable solution for building and deploying we

In [23]:
rag("Where is the nearest grocery store?")

[ContentBlock(text="I don't know.", type='text')]
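
The calls above return a list of content blocks rather than a plain string. When only the text is needed, the blocks can be joined as in this minimal sketch. `blocks_to_text` is a hypothetical helper, and `SimpleNamespace` objects stand in for the SDK's blocks, relying only on the `type` and `text` attributes visible in the output above.

```python
from types import SimpleNamespace


def blocks_to_text(blocks) -> str:
    """Concatenate the `text` of every block whose `type` is "text"."""
    return "".join(block.text for block in blocks if getattr(block, "type", None) == "text")


# Stand-in for `message.content`; real responses hold SDK objects with the same two attributes.
example_blocks = [SimpleNamespace(type="text", text="I don't know.")]
print(blocks_to_text(example_blocks))
```

With a real response, `blocks_to_text(rag("Where is the nearest grocery store?"))` would be the intended usage.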

Our model can now:

1. Take advantage of the knowledge in our vector datastore.
2. State that it cannot provide an answer when the provided context does not contain one.

We have just shown a useful mechanism to mitigate the risks of hallucinations in Large Language Models.

### Create Qdrant collection from dir

In [9]:
import os
import sys
sys.path.append(os.getcwd())  # sys.path entries must be directories, not files
import qdrant_retrieval
from qdrant_client import QdrantClient
import importlib
importlib.reload(qdrant_retrieval)
from qdrant_retrieval import create_qdrant_collection_from_dir, retrieve_docs
# create_qdrant_collection_from_dir("https://raw.githubusercontent.com/microsoft/flaml/main/README.md",
#         "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md")
res = retrieve_docs(query_texts="Is there a function called tune_automl?", client=QdrantClient(), collection_name="knowledge-base")
print(res)

{'ids': [['0606d89c-2f97-4b39-bf0d-b4723d3766c2', 'e96afda8-c213-4744-9270-bcc20f534692', '415842ee-83b7-442f-88da-9e2ed19c0265', '65c8a7fd-2f1a-4afa-96d4-32ee333101ce', '84f55a93-25c8-4aba-b0db-faf2dfbf675d', '2420b07b-907e-47cb-98af-d64ee5b30774', 'e271d182-86de-4f5f-857e-f3e72791beba', 'a952fd65-8c84-4819-9a91-ef90bd07ad03', '94cedf9f-528d-4531-bcfd-6fbf013d127f', '94d083c8-26bb-4e87-b679-f766993bfbcc', 'bd2e224f-66b7-4770-acd5-b3c0e692fa6b', 'bcaf6263-71b2-4898-b53d-a2dd1994cf89', '0e25a5b2-9fc8-4ba5-904d-7535cfa32e7f', '69a21a6c-4ed4-49f3-86ce-2ff957b19052', '3e72bb6e-c551-4a42-89bc-ad8ad8d66615', '7fb937a9-1fda-4e08-92fc-023bc5eba212'], ['0606d89c-2f97-4b39-bf0d-b4723d3766c2', 'e96afda8-c213-4744-9270-bcc20f534692', '415842ee-83b7-442f-88da-9e2ed19c0265', '65c8a7fd-2f1a-4afa-96d4-32ee333101ce', '84f55a93-25c8-4aba-b0db-faf2dfbf675d', '2420b07b-907e-47cb-98af-d64ee5b30774', 'e271d182-86de-4f5f-857e-f3e72791beba', 'a952fd65-8c84-4819-9a91-ef90bd07ad03', '94cedf9f-528d-4531-bcfd-6fb

### Evaluation

### Cleaning up the environment

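If you wish to continue playing with the RAG application we created, don't run the code below. However, it's always good to clean up the environment, so nothing is left dangling. The commands below remove the Qdrant container. Because the container was started with `--rm`, Docker deletes it as soon as it stops, so the second command may report that the container no longer exists.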

In [None]:
!docker kill rag-openai-qdrant
!docker rm rag-openai-qdrant