# Retrieval Augmentation for GPT-4 using Qdrant Vector DB

#### Fixing LLMs that Hallucinate

In this notebook we will learn how to retrieve contexts relevant to our queries from Qdrant and pass them to a GPT-4 model to generate an answer backed by real data sources.

GPT-4 is a big step up from previous OpenAI completion models. It is only available through the Chat Completions endpoint, so we must call it slightly differently than usual. However, the power of the model makes the change worthwhile, particularly when augmented with an external knowledge base like the Qdrant vector database.

Required installs for this notebook are:

!pip install -qU bs4 tiktoken openai langchain qdrant-client python-decouple

## Preparing the Data

In this example, we download the whole paraceltech.com site. We fetch every `.html` file on the site and save it to the `rtdocs` directory.

Let's use the Linux command: `wget -r -A .html -P rtdocs https://paraceltech.com`

This downloads all HTML into the `rtdocs` directory. Now we can use LangChain to process these pages. We do this using the `DirectoryLoader` like so:

In [1]:
from langchain.document_loaders import DirectoryLoader

# load every .html file found under rtdocs/
loader = DirectoryLoader('rtdocs/', glob="**/*.html")
docs = loader.load()
len(docs)

211

This leaves us with 211 processed pages. Let's take a look at the format of each one:

In [2]:
docs[0]

Document(page_content='SERVICES                                    \n                                                                            \n                                                                                            \n                                                    AI & Data Analysis\n                                                \n                                                                                            \n                                                    Web App Development\n                                                \n                                                                                            \n                                                    Mobile App Development\n                                                \n                                                                                            \n                                                    IOT Development\n                                     

We access the plaintext page content like so:

In [3]:
print(docs[0].page_content)

SERVICES                                    
                                                                            
                                                                                            
                                                    AI & Data Analysis
                                                
                                                                                            
                                                    Web App Development
                                                
                                                                                            
                                                    Mobile App Development
                                                
                                                                                            
                                                    IOT Development
                                                
                        

In [4]:
print(docs[5].page_content)

SERVICES                                    
                                                                            
                                                                                            
                                                    AI & Data Analysis
                                                
                                                                                            
                                                    Web App Development
                                                
                                                                                            
                                                    Mobile App Development
                                                
                                                                                            
                                                    IOT Development
                                                
                        

We can also find the source of each document:

In [5]:
docs[5].metadata['source'].replace('rtdocs/', 'https://')

'rtdocs\\ai-data-analys-en\\index.html'
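The output above is still a Windows-style local path: the loader returned backslash separators, so `replace('rtdocs/', 'https://')` found nothing to replace. A minimal sketch of a separator-agnostic conversion follows. The helper name `source_to_url` is hypothetical, and the `root` and `base_url` defaults are assumptions about how the mirror maps back to the site.

```python
from pathlib import PureWindowsPath


def source_to_url(source, root="rtdocs", base_url="https://paraceltech.com"):
    """Map a local mirror path to a site URL (hypothetical helper).

    `root` and `base_url` are assumptions about the mirror layout.
    """
    # PureWindowsPath splits on both "\\" and "/", so either style works
    parts = PureWindowsPath(source).parts
    if parts and parts[0] == root:
        parts = parts[1:]  # drop the local download directory
    return base_url.rstrip("/") + "/" + "/".join(parts)


print(source_to_url('rtdocs\\ai-data-analys-en\\index.html'))
# → https://paraceltech.com/ai-data-analys-en/index.html
```

The same call handles forward-slash paths, so it behaves identically on Linux and Windows.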

We can use these to create our `data` list:

In [6]:
data = []

for doc in docs:
    data.append({
        'url': doc.metadata['source'].replace('rtdocs/', 'https://'),
        'text': doc.page_content
    })

In [7]:
data[3]

{'url': 'rtdocs\\ai-and-big-data-management-to-its-questa-verification-eda-tools\\index.html',
 'text': 'SERVICES                                    \n                                                                            \n                                                                                            \n                                                    AI & Data Analysis\n                                                \n                                                                                            \n                                                    Web App Development\n                                                \n                                                                                            \n                                                    Mobile App Development\n                                                \n                                                                                            \n                        

We will split everything into chunks of roughly 400 tokens. We can do this easily with `langchain` and `tiktoken`:

In [8]:
import tiktoken

tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

We now split every record in `data` into chunks using this splitter.

In [10]:
from uuid import uuid4
from tqdm.auto import tqdm

chunks = []

for record in tqdm(data):
    texts = text_splitter.split_text(record['text'])
    # one entry per chunk, keeping its position and source page
    chunks.extend([{
        'id': str(uuid4()),
        'text': text,
        'chunk': i,
        'url': record['url']
    } for i, text in enumerate(texts)])

100%|██████████| 211/211 [00:02<00:00, 80.30it/s] 
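To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sketch of fixed-size windowing with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on the listed separators); the function name `chunk_tokens` and the toy word list are illustrative assumptions.

```python
def chunk_tokens(tokens, chunk_size=400, chunk_overlap=20):
    """Split a token list into windows of at most `chunk_size` tokens.

    Each window starts with the last `chunk_overlap` tokens of the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


words = [f"w{i}" for i in range(1000)]  # stand-in for real tokens
windows = chunk_tokens(words)
print([len(w) for w in windows])            # → [400, 400, 240]
print(windows[0][-20:] == windows[1][:20])  # → True
```

The overlap means a sentence cut at a window boundary still appears whole in at least one neighbouring chunk, at the cost of embedding a few tokens twice.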


Our chunks are ready so now we move onto embedding and indexing everything.

## Initialize Embedding Model

We use `text-embedding-3-small` as the embedding model. We can embed text like so:

In [11]:
from openai import OpenAI
from decouple import config
# initialize openai API key
OPENAI_API_KEY = config("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)


In [12]:

response = client.embeddings.create(
    input=["Your text string goes here", "The second input is here"],
    model="text-embedding-3-small"
)

print(response.data[0].embedding)

[0.005168174393475056, 0.017187591642141342, -0.01868520863354206, -0.01855923980474472, -0.047251880168914795, -0.03028823249042034, 0.027670903131365776, 0.0036775567568838596, 0.011253113858401775, 0.0064068567007780075, -0.0016953152371570468, 0.015801947563886642, -0.0013007913948968053, -0.007872981019318104, 0.059904634952545166, 0.05024711415171623, -0.027516942471265793, 0.00992345530539751, -0.0403936393558979, 0.050051163882017136, -0.0004255783569533378, 0.030232246965169907, -0.01368848979473114, 0.03291955590248108, 0.017299562692642212, 0.016795692965388298, -0.0017547999741509557, 0.020434759557247162, 0.0407855398952961, -0.03770633041858673, -0.02611730247735977, -0.04999517649412155, 0.024185797199606895, -0.055257827043533325, -0.03224772959947586, 0.04235313832759857, 0.06466341763734818, 0.014724223874509335, -0.01564798690378666, -0.04140138253569603, 0.022198306396603584, 0.0073411171324551105, 0.044956471771001816, 0.007106677163392305, -0.024087822064757347, 0

In `response` we find an object containing our new embeddings in its `data` attribute.

In [13]:
response.data

[Embedding(embedding=[0.005168174393475056, 0.017187591642141342, -0.01868520863354206, -0.01855923980474472, -0.047251880168914795, -0.03028823249042034, 0.027670903131365776, 0.0036775567568838596, 0.011253113858401775, 0.0064068567007780075, -0.0016953152371570468, 0.015801947563886642, -0.0013007913948968053, -0.007872981019318104, 0.059904634952545166, 0.05024711415171623, -0.027516942471265793, 0.00992345530539751, -0.0403936393558979, 0.050051163882017136, -0.0004255783569533378, 0.030232246965169907, -0.01368848979473114, 0.03291955590248108, 0.017299562692642212, 0.016795692965388298, -0.0017547999741509557, 0.020434759557247162, 0.0407855398952961, -0.03770633041858673, -0.02611730247735977, -0.04999517649412155, 0.024185797199606895, -0.055257827043533325, -0.03224772959947586, 0.04235313832759857, 0.06466341763734818, 0.014724223874509335, -0.01564798690378666, -0.04140138253569603, 0.022198306396603584, 0.0073411171324551105, 0.044956471771001816, 0.007106677163392305, -0.

Inside `data` we find two records, one for each of the two sentences we just embedded. Each vector embedding has `1536` dimensions (the output dimensionality of the `text-embedding-3-small` model).

## Initializing the Vector Database with Qdrant

In [14]:
# Requires the Qdrant client: pip install qdrant-client
from qdrant_client import QdrantClient

# create an in-memory Qdrant instance (useful for testing and CI/CD)
vector_db = QdrantClient(":memory:")
# OR persist changes to disk (useful for fast prototyping)
# vector_db = QdrantClient(path="path/to/db")


In [15]:
from qdrant_client.http.models import Distance, VectorParams

vector_db.create_collection(
    collection_name="paracel_collection",
    vectors_config=VectorParams(size=1536, distance=Distance.DOT),
)

True
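The collection above uses `Distance.DOT`. OpenAI documents its embeddings as normalized to unit length, and for unit-length vectors the dot product equals cosine similarity, so both metrics rank results the same way. A minimal sketch of that equivalence, using small hand-made vectors rather than real embeddings:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# toy vectors standing in for embeddings, scaled to unit length
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

print(math.isclose(dot(a, b), cosine(a, b)))  # → True
```

If vectors were not normalized, the dot product would also reward longer vectors, and the two metrics could rank results differently.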

Next we embed all chunks with the `text-embedding-3-small` model and upsert them into `vector_db`:

In [16]:
from time import sleep

from openai import RateLimitError
from qdrant_client.http.models import PointStruct
from tqdm.auto import tqdm

batch_size = 100  # how many embeddings we create and insert at once
for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i + batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, waiting 5 seconds and retrying whenever we are rate limited
    while True:
        try:
            res = client.embeddings.create(
                input=texts,
                model="text-embedding-3-small"
            )
            break
        except RateLimitError:
            sleep(5)
    
    embeds = [record.embedding for record in res.data]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url']
    } for x in meta_batch]
    list_batch = list(zip(ids_batch, embeds, meta_batch))
    points_list = [PointStruct(id=p[0], 
                               vector=p[1], 
                               payload=p[2]) for p in list_batch]
    # upsert to qdrant
    operation_info = vector_db.upsert(
        collection_name="paracel_collection",
        wait=True,
        points=points_list)

100%|██████████| 17/17 [05:57<00:00, 21.06s/it]
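The loop above retries at a fixed interval and never gives up. One common alternative is exponential backoff with a cap on attempts. The sketch below is generic: `call_with_retries` and its parameters are hypothetical names, and the flaky function simulates a transient failure instead of calling the OpenAI API.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `fn`, retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# simulated call that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, retry_on=(TimeoutError,), sleep=delays.append)
print(result, delays)  # → ok [1.0, 2.0]
```

In the embedding loop, `fn` could be a lambda wrapping `client.embeddings.create(...)` with `retry_on` set to the rate-limit exception, so that unrelated errors still fail fast.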


Now we've added all of our chunks to the collection. With that we can move on to retrieval and then answer generation using GPT-4.

## Retrieval

To search through our documents we first need to create a query vector `xq`. Using `xq` we will retrieve the most relevant chunks from the Paracel pages, like so:

In [59]:
# query = "tell me about ADAS that Paracel has done"
query = "what is the AI consultant of Paracel"
response = client.embeddings.create(
    input=[query],
    model="text-embedding-3-small"
)

# the query vector
xq = response.data[0].embedding

# retrieve the 10 most similar chunks from Qdrant
search_result = vector_db.search(
    collection_name="paracel_collection", query_vector=xq, limit=10
)

print(search_result)

[ScoredPoint(id='85fbec57-b7ce-4e94-b775-2381a7d6ddff', version=0, score=0.5938461135018108, payload={'text': 'Adobe Illustrator\n\nAdobe After Effects\n\nCareers\n\nAvailable technologies\n\nYou can see job interviews, employee interviews, etc.\n\n20+\n\nYou can see the position jobs that are currently being recruited.\n\nView More\n\nQ&A\n\nWe will answer frequently asked questions about contracts, development systems, deliverables, etc.\n\nView More\n\nContact\n\nPlease feel free to contact us for consultation on projects, collaboration with partners, requests for interviews, etc.\n\nView More\n\nParacel Viet Nam\n\n157 Bui Ta Han Street, Thuan Nhi Building - 3rd Floor\n\nParacel Japan\n\nParacel Korea', 'chunk': 4, 'url': 'rtdocs\\iot-development-en\\index.html'}, vector=None, shard_key=None), ScoredPoint(id='8721ac15-97c3-4ad9-8d53-ab770e1ff2e3', version=0, score=0.5801081318214558, payload={'text': 'About Paracel\n\nContact US\n\ncompany profile\n\nParacel Vietnam\n\nCOMPANY NAME

In [60]:
search_result

[ScoredPoint(id='85fbec57-b7ce-4e94-b775-2381a7d6ddff', version=0, score=0.5938461135018108, payload={'text': 'Adobe Illustrator\n\nAdobe After Effects\n\nCareers\n\nAvailable technologies\n\nYou can see job interviews, employee interviews, etc.\n\n20+\n\nYou can see the position jobs that are currently being recruited.\n\nView More\n\nQ&A\n\nWe will answer frequently asked questions about contracts, development systems, deliverables, etc.\n\nView More\n\nContact\n\nPlease feel free to contact us for consultation on projects, collaboration with partners, requests for interviews, etc.\n\nView More\n\nParacel Viet Nam\n\n157 Bui Ta Han Street, Thuan Nhi Building - 3rd Floor\n\nParacel Japan\n\nParacel Korea', 'chunk': 4, 'url': 'rtdocs\\iot-development-en\\index.html'}, vector=None, shard_key=None),
 ScoredPoint(id='8721ac15-97c3-4ad9-8d53-ab770e1ff2e3', version=0, score=0.5801081318214558, payload={'text': 'About Paracel\n\nContact US\n\ncompany profile\n\nParacel Vietnam\n\nCOMPANY NAM
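Conceptually, the search scores every stored vector against the query vector and returns the highest-scoring points. The sketch below shows that ranking as a brute-force dot-product scan over toy two-dimensional vectors. It illustrates the idea only and makes no claim about how Qdrant implements search internally; `top_k` is a hypothetical helper.

```python
import heapq


def top_k(query, points, k=10):
    """Return the k highest (score, id) pairs by dot product.

    `points` is a list of (id, vector) pairs.
    """
    scored = [
        (sum(q * x for q, x in zip(query, vec)), pid)
        for pid, vec in points
    ]
    return heapq.nlargest(k, scored)


points = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.6, 0.8])]
print(top_k([1.0, 0.0], points, k=2))  # → [(1.0, 'a'), (0.6, 'c')]
```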

With retrieval complete, we move on to feeding these into GPT to produce answers.

## Retrieval Augmented Generation

GPT models are accessed via OpenAI's Chat Completions endpoint. To give the model the information we retrieved, we pass it in our user prompt *alongside* the original query. We can do that like so:

In [61]:
# get list of retrieved text
contexts = [point.payload['text'] for point in search_result]

augmented_query = "\n\n---\n\n".join(contexts) + "\n\n-----\n\n" + query

In [62]:
print(augmented_query)

Adobe Illustrator

Adobe After Effects

Careers

Available technologies

You can see job interviews, employee interviews, etc.

20+

You can see the position jobs that are currently being recruited.

View More

Q&A

We will answer frequently asked questions about contracts, development systems, deliverables, etc.

View More

Contact

Please feel free to contact us for consultation on projects, collaboration with partners, requests for interviews, etc.

View More

Paracel Viet Nam

157 Bui Ta Han Street, Thuan Nhi Building - 3rd Floor

Paracel Japan

Paracel Korea

---

About Paracel

Contact US

company profile

Paracel Vietnam

COMPANY NAME Paracel Technology Solutions .Ltd Company Business Domain Software development, Mobile application and Game development Address 3rd Floor, Thuan Nhi Building, Lot 15B3.1 Bui Ta Han Street, Khue My Ward, Ngu Hanh Son District, Da Nang City. Establishment June 2018

DIRECTOR CEO - NGUYEN THANH VUONG CTO - NGUYEN XUAN TRUC CHRO - NGUYEN TRAN TAN HOANG
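With ten retrieved chunks the prompt can grow quickly. One way to keep it bounded is to stop adding contexts once a size budget is reached. This is a minimal sketch under stated assumptions: `build_augmented_query` and `max_chars` are hypothetical, and the budget is counted in characters for simplicity rather than in tokens.

```python
def build_augmented_query(contexts, query, max_chars=6000):
    """Join contexts (most relevant first) until the character budget is spent."""
    kept, used = [], 0
    for text in contexts:
        if used + len(text) > max_chars:
            break  # adding this context would exceed the budget
        kept.append(text)
        used += len(text)
    # same separators as the augmented query built above
    return "\n\n---\n\n".join(kept) + "\n\n-----\n\n" + query


demo = build_augmented_query(["aaaa", "bbbb", "cccc"], "question?", max_chars=8)
print(repr(demo))  # → 'aaaa\n\n---\n\nbbbb\n\n-----\n\nquestion?'
```

Because the search results are already ordered by score, truncating from the end drops the least relevant contexts first.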

Now we ask the question:

In [63]:
# system message to 'prime' the model
primer = """You are a Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
"""
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ],
    # this run used gpt-3.5-turbo; set model="gpt-4" to use GPT-4 instead
    model="gpt-3.5-turbo",
)

In [65]:
chat_completion.choices[0].message.content

'The Chief Executive Officer of Paracel Technology Solutions, Nguyen Thanh Vuong, has experience in Finance, Securities, Supply Chain, Automotive, and Embedded Systems, which could make him serve as an AI consultant for Paracel.'
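Since every stored payload keeps its `url`, the answer can be shown next to the pages it was drawn from. The sketch below collects the distinct sources in retrieval order. It uses plain dictionaries as stand-ins for the `payload` of each search result, and `unique_sources` is a hypothetical helper.

```python
def unique_sources(payloads):
    """Return the distinct 'url' values in first-seen order."""
    urls = (p.get("url") for p in payloads)
    # dict.fromkeys keeps insertion order while removing duplicates
    return list(dict.fromkeys(u for u in urls if u))


# stand-in payloads shaped like the ones stored in the collection
payloads = [
    {"text": "chunk one", "chunk": 4, "url": "page-a"},
    {"text": "chunk two", "chunk": 0, "url": "page-b"},
    {"text": "chunk three", "chunk": 5, "url": "page-a"},
]
print(unique_sources(payloads))  # → ['page-a', 'page-b']
```

With real results this would be called as `unique_sources([point.payload for point in search_result])`.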