# Description

This notebook demonstrates how to use the [OpenAI API](https://platform.openai.com) to embed documents, index them in a [Qdrant vector database](https://qdrant.tech/), and perform a similarity search on the indexed documents.

# Importing Libraries

In [1]:
import os
import openai
import qdrant_client

from pathlib import Path

In [2]:
from qdrant_client.http.models import PointStruct
from qdrant_client.http.models import VectorParams, Distance

# Global Variables

We will define a few global variables here:

- `BASE_DATA_DIR`: The base directory where the data is stored.
- `EMBEDDING_MODEL`: The OpenAI embedding model to use for embedding the documents.
- `COLLECTION_NAME`: The name of the Qdrant collection to store the documents.
- `OPENAI_SECRET_KEY_FILE`: The path to a JSON file containing the OpenAI secret key.
- `MAX_DOCUMENTS_TO_LOAD`: The maximum number of documents to load from the dataset.

In [3]:
BASE_DATA_DIR = Path('../../data')
EMBEDDING_MODEL = 'text-embedding-3-small'
COLLECTION_NAME = "my_document_collection"
OPENAI_SECRET_KEY_FILE = "../../secrets/openai_api_key.json"
MAX_DOCUMENTS_TO_LOAD = 100

In [4]:
# Keep an already-created OpenAI client when this cell is re-run; otherwise start with None
try:
    openai_client
except NameError:
    openai_client = None

# Helper Functions

We will import the helper functions to load the data, get the OpenAI secret key, and embed the documents.
The helper functions are defined in the `src/helper_functions.py` file.

In [5]:
from src.helper_functions import load_data, get_openai_api_key, embed_documents
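
The source of `src/helper_functions.py` is not shown in this notebook. As a rough guide to what two of the helpers might look like, here is a minimal sketch. It rests on assumptions: the JSON key name `"api_key"` is hypothetical, and `embed_documents` is assumed to send all texts in a single request and return the raw API response, which is consistent with the later use of `embeddings.data[0].embedding`. `load_data` is omitted because it appears to return a pandas DataFrame whose loading logic cannot be inferred from the notebook.

```python
import json


def get_openai_api_key(key_file):
    """Read the API key from a JSON file.

    Assumption: the file looks like {"api_key": "..."}; the real key name
    used by the project may differ.
    """
    with open(key_file, encoding="utf-8") as f:
        return json.load(f)["api_key"]


def embed_documents(texts, client, model):
    """Embed all texts in one request and return the raw API response.

    `client` is expected to expose `embeddings.create(input=..., model=...)`,
    as the OpenAI Python client does.
    """
    return client.embeddings.create(input=list(texts), model=model)
```

Returning the raw response keeps the helper thin; callers read the vectors from the response's `data` list.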

# Loading the OpenAI Secret Key and Creating an In-Memory Qdrant Database

We will load the OpenAI secret key from the `secrets/openai_api_key.json` file and create an in-memory Qdrant database to index the documents.

In [6]:
# Loading the OpenAI secret key if it is not loaded already
if openai_client is None:
    openai_client = openai.Client(
        api_key=get_openai_api_key(OPENAI_SECRET_KEY_FILE),
    )

In [7]:
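# Creating an in-memory Qdrant database instance
# Note: this rebinds the name `qdrant_client` from the module to the client instance,
# so re-running this cell fails unless the module is imported again first
qdrant_client = qdrant_client.QdrantClient(":memory:")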

# Loading the Documents

We will use the [feedback-prize-2021](https://www.kaggle.com/competitions/feedback-prize-2021/data) dataset from Kaggle as an example. To download the dataset, you can use the following command in the root directory of the project:

`bash bin/download_kaggle_dataset.sh "competition" "feedback-prize-2021" "data/input/kaggle_competitions/fp1"`

In [8]:
data = load_data(BASE_DATA_DIR / 'input/kaggle_competitions/fp1/train')[:MAX_DOCUMENTS_TO_LOAD]

In [9]:
data.head()

| | id | text |
|---|---|---|
| 0 | A4ADCC04C319 | On a hot summer day I remembered that I have t... |
| 1 | CC96CF4D3854 | There has been a lot of dispute about the elec... |
| 2 | 8E89A7A62A82 | When asking for advice is can be easier it mak... |
| 3 | B154F5F9DADF | Cars are the most common use of transportation... |
| 4 | FA87416EF173 | do you thing stuudent would benefit from being... |


# Embedding the Documents

We will embed the documents with the OpenAI API, using the `text-embedding-3-small` model.

In [10]:
embeddings = embed_documents(data['text'], openai_client, EMBEDDING_MODEL)

In [11]:
# Checking the embedding of the first document (first 10 values)
embeddings.data[0].embedding[:10]

[0.04261812940239906,
 -0.008463268168270588,
 -0.010300256311893463,
 0.0207317266613245,
 0.04936250299215317,
 -0.028263378888368607,
 -0.005228856112807989,
 -0.02124345861375332,
 0.016034284606575966,
 0.022791776806116104]

In [12]:
data_with_embeddings = data.copy()
data_with_embeddings['embedding'] = embeddings.data

In [13]:
data_with_embeddings.head()

| | id | text | embedding |
|---|---|---|---|
| 0 | A4ADCC04C319 | On a hot summer day I remembered that I have t... | Embedding(embedding=[0.04261812940239906, -0.0... |
| 1 | CC96CF4D3854 | There has been a lot of dispute about the elec... | Embedding(embedding=[0.018557794392108917, -1.... |
| 2 | 8E89A7A62A82 | When asking for advice is can be easier it mak... | Embedding(embedding=[0.08663269132375717, -0.0... |
| 3 | B154F5F9DADF | Cars are the most common use of transportation... | Embedding(embedding=[0.01881640963256359, 0.01... |
| 4 | FA87416EF173 | do you thing stuudent would benefit from being... | Embedding(embedding=[-0.00868830643594265, -0.... |


# Indexing the Documents in Qdrant Database

We will index the documents in the Qdrant database, storing each document's embedding as the vector and its text as the payload.

In [14]:
# Create a list of Qdrant points from document embeddings and their text
points = [
    PointStruct(
        id=idx, # Document ID (the row position, an integer)
        vector=embedding_obj.embedding, # Document embedding (list of floats)
        payload={"text": text}, # Document text
    )
    # The loop variable is named `embedding_obj` so it is not confused with the `data` DataFrame
    for idx, (embedding_obj, text) in enumerate(zip(data_with_embeddings['embedding'], data_with_embeddings['text']))
]

In [15]:
# Create a collection with the specified vector configuration
qdrant_client.create_collection(
    COLLECTION_NAME, # Name of the collection
    vectors_config=VectorParams(
        size=1536, # Default embedding dimension of text-embedding-3-small
        distance=Distance.COSINE, # Cosine similarity: a similarity score rather than a true distance (higher means more similar)
    ),
)

qdrant_client.upsert(COLLECTION_NAME, points)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
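
The collection's `size` must match the length of every vector that is upserted. Rather than hard-coding the dimension, it can be derived from the embeddings themselves. The helper below is a hypothetical sketch (the name `infer_vector_size` is not part of Qdrant or this project) that also guards against mixed-length vectors:

```python
def infer_vector_size(vectors):
    """Return the common length of all vectors, or raise if they differ.

    Raises ValueError for an empty input or for vectors of mixed lengths.
    """
    sizes = {len(v) for v in vectors}
    if len(sizes) != 1:
        raise ValueError(f"expected exactly one vector size, found {sorted(sizes)}")
    return sizes.pop()
```

In this notebook it could be called as `infer_vector_size([e.embedding for e in embeddings.data])` and the result passed to `VectorParams(size=...)`.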

# Searching for Similar Documents

Let's search for documents similar to a sample query. We will use cosine similarity and limit the results to the 3 best matches with a similarity score of at least 0.1.

In [16]:
search_results = qdrant_client.search(
    collection_name=COLLECTION_NAME,
    query_vector=openai_client.embeddings.create(
        input=["I'm writing a document about machine learning."],
        model=EMBEDDING_MODEL,
    ).data[0].embedding,
    limit=3,
    score_threshold=0.1,
)
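
To make the roles of `limit` and `score_threshold` concrete, here is a small pure-Python sketch of a brute-force cosine-similarity search. It is an illustration only, not how Qdrant implements search internally, and the function names are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, vectors, limit=3, score_threshold=0.1):
    """Score every vector, drop those below the threshold, return the best `limit`."""
    scored = [(idx, cosine_similarity(query, vec)) for idx, vec in enumerate(vectors)]
    kept = [(idx, score) for idx, score in scored if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:limit]
```

The threshold filters out weak matches before the top results are chosen, so a search can return fewer than `limit` results.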

In [17]:
search_results

[ScoredPoint(id=28, version=0, score=0.3291600111374232, payload={'text': 'The author describes "Making Mona Lisa Smile" and how a new technology called the Facial Action Cording System enables compures to identify human emotions. Because the author is trying to tell us how the technology can make us feel differents emotions. In this article we can find too many differents emotions that the Mona Lise had. And how the techonology can do it too.\n\nFor example my first evidence that i found in this article is when said that "Mona Lise is 83 percent happy, 9 percent disgusted, 6 percent fearful, and 2 percent angry". So, is trying to tell us that to making Mona Lise smile can be a difficult way, because she had too many differents emotions at the same time,but nothing is impossible. But also the people trying to describe a new technology to identify humans emotion. The new technology is called FACS ( Facial Action Coding System). This is the new technolody that the author wanted to used t