# Description

This notebook demonstrates how to use the [OpenAI API](https://platform.openai.com) to embed documents and index them in a [Qdrant vector database](https://qdrant.tech/). We embed the documents with OpenAI's `text-embedding-3-small` model, index them in Qdrant, and then search the index for documents similar to a sample query.

# Importing Libraries

In [1]:
import os
import openai
import qdrant_client

from pathlib import Path

In [2]:
from qdrant_client.http.models import PointStruct
from qdrant_client.http.models import VectorParams, Distance

# Global Variables and Settings

This section defines the notebook's global variables and settings. `BASE_DATA_DIR` is read from the `BASE_DATA_DIR` environment variable and falls back to `../../data` if the variable is not set. `INPUT_DATA_DIR` and `OUTPUT_DATA_DIR` point to the input and output subdirectories of that base directory. `EMBEDDING_MODEL` holds the name of the OpenAI embedding model, `COLLECTION_NAME` the name of the Qdrant collection in which the documents are indexed, `OPENAI_SECRET_KEY_FILE` the path to the OpenAI secret key file, and `MAX_DOCUMENTS_TO_LOAD` the maximum number of documents to load from the dataset.

In [3]:
# Get the base data directory from an environment variable, or use '../../data' as a default
BASE_DATA_DIR = Path(os.getenv('BASE_DATA_DIR', '../../data'))
INPUT_DATA_DIR = BASE_DATA_DIR / 'input'
OUTPUT_DATA_DIR = BASE_DATA_DIR / 'output'

EMBEDDING_MODEL = 'text-embedding-3-small'
COLLECTION_NAME = "my_document_collection"
OPENAI_SECRET_KEY_FILE = "../../secrets/openai_api_key.json"
MAX_DOCUMENTS_TO_LOAD = 100

In [4]:
# Keep an existing OpenAI client when this cell is re-run; otherwise start with None
try:
    openai_client
except NameError:
    openai_client = None

# Helper Functions

We import the helper functions for loading the data, reading the OpenAI API key, and embedding the documents from `src.helper_functions`.

In [5]:
from src.helper_functions import load_data, get_openai_api_key, embed_documents
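
The source of `src.helper_functions` is not shown in this notebook. As a rough idea of what `get_openai_api_key` might do, here is a minimal sketch. It assumes the secrets file is a JSON object with a single `api_key` field; both the field name and the function body are assumptions, not the project's actual implementation.

```python
import json
from pathlib import Path


def get_openai_api_key(secret_key_file):
    """Hypothetical sketch: read the API key from a JSON secrets file.

    Assumes the file looks like {"api_key": "sk-..."}; the field name used
    by the real helper in src.helper_functions may differ.
    """
    with Path(secret_key_file).open(encoding="utf-8") as f:
        secrets = json.load(f)
    return secrets["api_key"]
```

Keeping the key in a file outside the notebook avoids committing it together with the cell outputs.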

# Loading the OpenAI Secret Key and Creating an In-Memory Qdrant Database

We will load the OpenAI secret key from the file given by `OPENAI_SECRET_KEY_FILE` (`secrets/openai_api_key.json`) and create an in-memory Qdrant database to index the documents. Make sure the key is stored in that file before running the cells below.

In [6]:
# Create the OpenAI client from the secret key if it does not exist yet
if openai_client is None:
    openai_client = openai.Client(
        api_key=get_openai_api_key(OPENAI_SECRET_KEY_FILE),
    )

In [7]:
from qdrant_client import QdrantClient  # still importable after `qdrant_client` is rebound below
qdrant_client = QdrantClient(":memory:")

# Loading the Documents

We will use the feedback-prize-2021 dataset from Kaggle as an example. To download the dataset, you can use the following command in the root directory of the project:

`bash bin/download_kaggle_dataset.sh "competition" "feedback-prize-2021" "data/input/kaggle_competitions/fp1"`

In [8]:
data = load_data(INPUT_DATA_DIR / 'kaggle_competitions' / 'fp1' / 'train')[:MAX_DOCUMENTS_TO_LOAD]
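
`load_data` is also defined outside this notebook. The sketch below shows one way such a loader could work, assuming the `train` directory holds one `.txt` file per essay and that the file name (without extension) is the document id. The function name `load_essay_records` and its behavior are hypothetical and only illustrate the idea.

```python
from pathlib import Path


def load_essay_records(train_dir):
    """Hypothetical sketch: read every .txt essay into {"id", "text"} records."""
    records = []
    # Sort the paths so the document order is stable across runs
    for path in sorted(Path(train_dir).glob("*.txt")):
        records.append({"id": path.stem, "text": path.read_text(encoding="utf-8")})
    return records
```

Wrapping the result in `pandas.DataFrame(...)` would give a table with `id` and `text` columns that can be sliced with `[:MAX_DOCUMENTS_TO_LOAD]`.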

In [9]:
data.head()

|   | id | text |
|---|----|------|
| 0 | A4ADCC04C319 | On a hot summer day I remembered that I have t... |
| 1 | CC96CF4D3854 | There has been a lot of dispute about the elec... |
| 2 | 8E89A7A62A82 | When asking for advice is can be easier it mak... |
| 3 | B154F5F9DADF | Cars are the most common use of transportation... |
| 4 | FA87416EF173 | do you thing stuudent would benefit from being... |


# Embedding the Documents

We will embed the documents with OpenAI's `text-embedding-3-small` model through the OpenAI API.

In [10]:
embeddings = embed_documents(data['text'], openai_client, EMBEDDING_MODEL)
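
The notebook later reads `embeddings.data[0].embedding`, which suggests that `embed_documents` returns the raw response of the embeddings endpoint. A minimal sketch under that assumption follows; the body of `embed_documents` and the extra helper `extract_vectors` are hypothetical, not the project's actual code.

```python
def embed_documents(documents, client, model):
    """Hypothetical sketch: embed all documents in a single API request."""
    # The embeddings endpoint accepts a list of strings as input
    return client.embeddings.create(input=list(documents), model=model)


def extract_vectors(response):
    """Return the raw float vectors, one per input document, in input order."""
    return [item.embedding for item in response.data]
```

A single request is enough for the 100 documents used here; a much larger dataset would likely need to be split into batches.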

In [11]:
# Checking the embedding of the first document (first 10 values)
embeddings.data[0].embedding[:10]

[0.04257804900407791,
 -0.008452609181404114,
 -0.010277007706463337,
 0.0207771435379982,
 0.04937688633799553,
 -0.028245365247130394,
 -0.005213973578065634,
 -0.021262774243950844,
 0.016025831922888756,
 0.022732794284820557]

In [12]:
data_with_embeddings = data.copy()
# Each entry is an Embedding object; the raw vector is in its `.embedding` attribute
data_with_embeddings['embedding'] = embeddings.data

In [13]:
data_with_embeddings.head()

|   | id | text | embedding |
|---|----|------|-----------|
| 0 | A4ADCC04C319 | On a hot summer day I remembered that I have t... | Embedding(embedding=[0.04257804900407791, -0.0... |
| 1 | CC96CF4D3854 | There has been a lot of dispute about the elec... | Embedding(embedding=[0.018557794392108917, -1.... |
| 2 | 8E89A7A62A82 | When asking for advice is can be easier it mak... | Embedding(embedding=[0.08658095449209213, -0.0... |
| 3 | B154F5F9DADF | Cars are the most common use of transportation... | Embedding(embedding=[0.01881640963256359, 0.01... |
| 4 | FA87416EF173 | do you thing stuudent would benefit from being... | Embedding(embedding=[-0.00868830643594265, -0.... |


# Indexing the Documents in Qdrant Database

We will index each document in Qdrant as a point whose vector is the document embedding and whose payload is the document text. The collection uses the cosine distance metric and a vector size of 1536, the default output dimension of `text-embedding-3-small`.

In [14]:
# Create a list of Qdrant points from document embeddings and their text
points = [
    PointStruct(
        id=idx,
        vector=embedding.embedding,  # raw list of floats inside the Embedding object
        payload={"text": text},
    )
    for idx, (embedding, text) in enumerate(
        zip(data_with_embeddings['embedding'], data_with_embeddings['text'])
    )
]
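
The vector size configured for the collection has to match the length of the embeddings. Instead of relying on the literal 1536, the size could be derived from the vectors themselves. The helper below is a hypothetical sketch (the name `infer_vector_size` is not part of the notebook or of Qdrant) that also fails early if the vectors disagree in length.

```python
def infer_vector_size(vectors):
    """Return the shared dimensionality of the vectors, or raise if they differ."""
    sizes = {len(vector) for vector in vectors}
    if len(sizes) != 1:
        raise ValueError(f"expected exactly one vector size, found {sorted(sizes)}")
    return sizes.pop()
```

In this notebook it could be called as `infer_vector_size(point.vector for point in points)` and the result passed as `size` to `VectorParams`.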

In [15]:
# Create a collection with the specified vector configuration
qdrant_client.create_collection(
    COLLECTION_NAME,
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
)

# Insert the points into the collection
qdrant_client.upsert(COLLECTION_NAME, points)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

# Searching for Similar Documents

Let's search the indexed documents for those most similar to a sample query. The search ranks documents by cosine similarity (the metric configured for the collection), returns at most 3 results, and discards any result whose score is below 0.1.

In [16]:
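# Embed the sample query with the same model used for the documents
query_embedding = openai_client.embeddings.create(
    input=["People are eating lunch at a restaurant"],
    model=EMBEDDING_MODEL,
).data[0].embedding

# Retrieve at most 3 documents whose similarity score is at least 0.1
search_results = qdrant_client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_embedding,
    limit=3,
    score_threshold=0.1,
)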
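
To make the roles of `limit` and `score_threshold` concrete, the following brute-force sketch scores a query against a list of vectors with cosine similarity, drops scores below the threshold, and keeps the best few. It only illustrates the ranking idea on plain Python lists; the function names are hypothetical and this is not how Qdrant implements search internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, limit=3, score_threshold=0.1):
    """Return (index, score) pairs sorted by descending score."""
    scored = [(idx, cosine_similarity(query, v)) for idx, v in enumerate(vectors)]
    # Discard results whose similarity is below the threshold
    kept = [(idx, score) for idx, score in scored if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:limit]
```

For example, with the query `[1.0, 0.0]` and the vectors `[1.0, 0.0]`, `[0.0, 1.0]`, `[1.0, 1.0]`, and `[-1.0, 0.0]`, only the first and third vectors pass the 0.1 threshold, and the first one ranks highest.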

In [17]:
search_results

[ScoredPoint(id=66, version=0, score=0.2050268693109877, payload={'text': "Dear principle,\n\nHi I am STUDENT_NAME I agree with the choice that says we should be allowed to have our phones out in class during lunch and during free time. I think This because kids want to have their cell phones out any ways. We also have our cell phones out even if the rule isn't passed. Cell phones basically is what the world revolves around. Everybody and their mamas have a cell phone. This generation is all about technology.\n\nI know that you are probably thinking that we are going to do things that we are not supposed to be doing, but we do it anyways so why not just let us do it in the open. I personally don't see why the rule says that we can't have them out it will benefit us also. I do understand why you wouldn't let us have them out because you think that we are going to have it out during class and stuff. Well why punish us all for what a couple of people who got caught did. Why not just take 