# A Complete Noobs Guide to Vector Search
## Part 1: From Real World Data to Vectors

Over the last two blogs, I introduced you to vector databases and showed you how to set up your environment, spin up a Qdrant Cloud instance, and create your first collection.

Now, it's time to get practical. 

In this post, you'll gain an intuition for the process of turning real-world text data into vectors and adding them to your Qdrant collection. To follow along, [you'll need an OpenAI API Key](https://platform.openai.com/). OpenAI requires you to put some money down upfront; I recommend $20. If you can't afford that, then [Cohere is a good option](https://dashboard.cohere.com/welcome/login). Cohere lets you access their models for free, without having to enter any credit card information. However, their free tier is rate limited. At the time of this writing, you can't send more than 5 requests per minute, 100 requests per hour, or 1,000 requests per month.

I primarily use OpenAI because I've spent enough with them that I'm at tier 4 usage limits. This means I can experiment, explore, and hack around as much as I need before committing to something that I'm going to present to you. No other reason than that. You're free to use whatever language model provider you'd like.

### From Text to Vectors

To get the most out of vector search, you need to transform human-readable text into a format that machines can understand and process.

**The process**

1. **Tokenization:** Breaking down the text into smaller units called tokens. These can be words, subwords, or even characters, depending on the chosen tokenizer.

2. **Embedding:**  Mapping each token to a vector in a high-dimensional space. Each dimension captures some aspect of the token's meaning and its relationship to other tokens. 

3. **Vector Representation:** The resulting set of vectors represents the entire text, capturing its semantic meaning and relationships within the text.

#### Tokenization

<img src="https://qdrant.tech/docs/gettingstarted/tokenization.png" style="display: block; margin-left: auto; margin-right: auto;">

[Image Source: Qdrant Blog](https://qdrant.tech/documentation/overview/vector-search/)

Tokenization involves:

 - Splitting the text into words, subwords, or characters

 - Converting each token into a unique integer ID
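
Here is a deliberately tiny sketch of those two steps in plain Python. It is only an illustration: the regex-based `toy_tokenize` and the `build_vocab` helper are made-up names, and real tokenizers such as `cl100k_base` use a fixed, pre-trained vocabulary rather than one built on the fly.

```python
import re
from typing import Dict, List


def toy_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


toy_tokens = toy_tokenize("Coding is fun! Coding is life!")
toy_vocab = build_vocab(toy_tokens)
toy_ids = [toy_vocab[token] for token in toy_tokens]

print(toy_tokens)
print(toy_ids)
```

Repeated tokens map to the same ID, which is the property the rest of the pipeline relies on.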

As of the time of this writing, OpenAI uses a tokenizer called `cl100k_base` for its newer models. This includes `text-embedding-3-large`, which will be used in this tutorial. The `cl100k_base` tokenizer is based on the [byte-pair encoding (BPE) algorithm](https://youtu.be/HEikzVL-lZU). When used for tokenization, BPE starts from individual bytes and iteratively merges the most frequent pair of adjacent tokens into a single new token, which is added to the vocabulary. Common words end up as single tokens, while rare words can still be encoded as a sequence of subwords.
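
To make the pair-merging idea concrete, here is a toy sketch of BPE-style merging in plain Python. It is not how `tiktoken` or `cl100k_base` is implemented. The function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) and the sample string are made up for illustration, and a real tokenizer learns its merges ahead of time from a large corpus.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[bytes, bytes]


def most_frequent_pair(tokens: List[bytes]) -> Optional[Pair]:
    """Return the most common pair of adjacent tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens: List[bytes], pair: Pair) -> List[bytes]:
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text: str, num_merges: int) -> Tuple[List[bytes], List[Pair]]:
    """Start from single bytes and apply `num_merges` greedy merges."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


bpe_tokens, bpe_merges = toy_bpe("low lower lowest", num_merges=2)
print(bpe_tokens)
print(bpe_merges)
```

After two merges, the frequent sequence `low` has become a single token, while the rarer endings are still spelled out byte by byte.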

For English text, a token is typically a single character, part of a word, or a complete word.

For instance, the sentence "Coding is fun!" would be split into tokens along the lines of "Coding", " is", " fun", and "!". However, the concept of a token can differ across languages. Some languages might have tokens smaller than a single character or larger than one word, depending on the language's structure. When it comes to spaces, `cl100k_base` usually attaches them to the start of the following word. For example, " learning" (with a leading space) would be a token, not "learning " with a trailing space. You can see this in the output further down, where `' hello'` and `' world'` carry their leading space. For an excellent deep dive into tokenization, including a comparison of different tokenizers on the same piece of text, I recommend checking out [this video](https://www.youtube.com/watch?v=rT6wVLEDC_w) by Jay Alammar.

[There's also this cool web app](https://tiktokenizer.vercel.app/?model=text-embedding-3-large) that shows you how the `cl100k_base` tokenizer does its thing. Take a look below.

<div style="text-align: center;">
    <img src="/Users/harpreetsahota/workspace/practical-rag-book/exploring-qdrant/image_assets/tokenization.gif" style="width: 75%; height: auto;">
</div>

Time to see tokenization in action for yourself. Start by installing `tiktoken`:

```shell
pip install tiktoken==0.6.0
```

In [1]:
import os
import tiktoken

from dotenv import load_dotenv

load_dotenv("./.env")

True

In [2]:
example_text = "def hello_world(): print('Hello, world! 🌍') # Bonjour, 世界! Hola, mundo! 1 + 1 = 2, π ≈ 3.14159, e^(i*π) + 1 = 0."

You can count the number of "words" in the string by splitting on whitespace, and then see how this differs from the number of tokens.

In [3]:
len(example_text.split())

23

Now, count the number of tokens.

In [4]:
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """
    Calculate the number of tokens in a given text string using a specified encoding.

    Args:
        string (str): The input text string to be tokenized.
        encoding_name (str, optional): The name of the encoding to use for tokenization.
            Defaults to "cl100k_base". Other supported encodings include "p50k_base",
            "p50k_edit", "r50k_base", etc.

    Returns:
        int: The number of tokens in the input text string.

    Note:
        The number of tokens returned by this function depends on the chosen encoding.
        Different encodings may have different tokenization rules and vocabulary sizes.

    Raises:
        ValueError: If an invalid encoding name is provided.
    """
    try:
        encoding = tiktoken.get_encoding(encoding_name)
        num_tokens = len(encoding.encode(string))
        return num_tokens
    except KeyError:
        raise ValueError(f"Unsupported encoding: {encoding_name}")

num_tokens_from_string(example_text)

59

As you can see, that's quite a difference!

The code below will show you the text, token, and integer representation of each token in the string.

In [5]:
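def from_text_to_tokens(text: str, encoding_name: str = "cl100k_base") -> None:
    """
    Tokenize the given text and print each token ID alongside its subword.

    Args:
        text (str): The input text to be tokenized.
        encoding_name (str, optional): The name of the tiktoken encoding to use.
            Defaults to "cl100k_base".

    Returns:
        None
    """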
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    subwords = [encoding.decode([token]) for token in tokens]
    print(f"Original text: {text}")
    print(f"\nTokens: {tokens}")
    print(f"\nSubwords: {subwords}")
    print("\nToken to subword mapping:")
    for token, subword in zip(tokens, subwords):
        print(f"Token: {token}, Subword: {subword.encode('utf-8')}")

from_text_to_tokens(example_text)

Original text: def hello_world(): print('Hello, world! 🌍') # Bonjour, 世界! Hola, mundo! 1 + 1 = 2, π ≈ 3.14159, e^(i*π) + 1 = 0.

Tokens: [755, 24748, 32892, 4658, 1194, 493, 9906, 11, 1917, 0, 11410, 234, 235, 873, 674, 13789, 30362, 11, 220, 3574, 244, 98220, 0, 473, 8083, 11, 29452, 0, 220, 16, 489, 220, 16, 284, 220, 17, 11, 52845, 21784, 230, 220, 18, 13, 9335, 2946, 11, 384, 13571, 72, 9, 49345, 8, 489, 220, 16, 284, 220, 15, 13]

Subwords: ['def', ' hello', '_world', '():', ' print', "('", 'Hello', ',', ' world', '!', ' �', '�', '�', "')", ' #', ' Bon', 'jour', ',', ' ', '�', '�', '界', '!', ' H', 'ola', ',', ' mundo', '!', ' ', '1', ' +', ' ', '1', ' =', ' ', '2', ',', ' π', ' �', '�', ' ', '3', '.', '141', '59', ',', ' e', '^(', 'i', '*', 'π', ')', ' +', ' ', '1', ' =', ' ', '0', '.']

Token to subword mapping:
Token: 755, Subword: b'def'
Token: 24748, Subword: b' hello'
Token: 32892, Subword: b'_world'
Token: 4658, Subword: b'():'
Token: 1194, Subword: b' print'
Token: 493, Sub

I encourage you to play around with it if you'd like!

In [6]:
more_example_text = "Harpreet Sahota is writing a book RAG and is so happy you're joining him on the journey!"
from_text_to_tokens(more_example_text)

Original text: Harpreet Sahota is writing a book RAG and is so happy you're joining him on the journey!

Tokens: [27588, 1762, 295, 43059, 6217, 374, 4477, 264, 2363, 432, 1929, 323, 374, 779, 6380, 499, 2351, 18667, 1461, 389, 279, 11879, 0]

Subwords: ['Har', 'pre', 'et', ' Sah', 'ota', ' is', ' writing', ' a', ' book', ' R', 'AG', ' and', ' is', ' so', ' happy', ' you', "'re", ' joining', ' him', ' on', ' the', ' journey', '!']

Token to subword mapping:
Token: 27588, Subword: b'Har'
Token: 1762, Subword: b'pre'
Token: 295, Subword: b'et'
Token: 43059, Subword: b' Sah'
Token: 6217, Subword: b'ota'
Token: 374, Subword: b' is'
Token: 4477, Subword: b' writing'
Token: 264, Subword: b' a'
Token: 2363, Subword: b' book'
Token: 432, Subword: b' R'
Token: 1929, Subword: b'AG'
Token: 323, Subword: b' and'
Token: 374, Subword: b' is'
Token: 779, Subword: b' so'
Token: 6380, Subword: b' happy'
Token: 499, Subword: b' you'
Token: 2351, Subword: b"'re"
Token: 18667, Subword: b' joining'
Token: 
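
You may have noticed the `'�'` entries in the subword list for the first example. A likely explanation is that the tokenizer works on UTF-8 bytes, so a character that takes several bytes can be split across more than one token, and decoding only part of a character gives you the replacement character. The sketch below uses plain Python to show the byte lengths involved. The helper name `utf8_byte_lengths` is made up for this example, and the two-byte slice is arbitrary rather than the exact split `cl100k_base` makes.

```python
from typing import Dict


def utf8_byte_lengths(text: str) -> Dict[str, int]:
    """Map each character to the number of bytes it needs in UTF-8."""
    return {char: len(char.encode("utf-8")) for char in text}


byte_lengths = utf8_byte_lengths("dπ≈界🌍")
print(byte_lengths)

# Decoding only part of a multi-byte character cannot produce the character,
# so Python substitutes the replacement character U+FFFD.
globe_bytes = "🌍".encode("utf-8")
partial_decode = globe_bytes[:2].decode("utf-8", errors="replace")
print(repr(partial_decode))
```

This is also part of why the token count is so much higher than the whitespace "word" count for text with emoji or non-Latin scripts.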

When a large language model (LLM) is being pretrained or fine-tuned, each token is then mapped to a vector representation called a token embedding. Imagine each token being plotted as a point in a high-dimensional space, where the location reflects its meaning. These embeddings capture the semantic meaning of each token and its relationship to other tokens. This is especially useful for the attention mechanism in the Transformer architecture that modern LLMs use. 

<div style="text-align: center;">
    <img src="https://cdn.openai.com/new-and-improved-embedding-model/draft-20221214a/vectors-2.svg" style="width: 75%; height: auto;">
</div>

[Image Source: OpenAI Blog](https://openai.com/blog/new-embedding-models-and-api-updates)

For retrieval, you're not interested in retrieving individual tokens.

You want to retrieve chunks of text, and that means an entire sequence of tokens must be represented as a vector.  We don't have access to the source code for `text-embedding-3-large`, but in general, the process of going from embedding a token to embedding a sequence of tokens is as follows:

**Pooling**

- After obtaining the token embeddings for the input text, a pooling operation is applied to combine them into a single vector representation.

- Common pooling methods include:
  - Average pooling: Taking the element-wise average of the token embeddings.
  - Max pooling: Taking the element-wise maximum of the token embeddings.
  - Last token pooling: Using the embedding of the last token as the representative vector.

**Normalization**

- After obtaining the pooled embedding vector, it is typically normalized to have a unit length.

- Normalization is done to ensure that the embeddings are scale-invariant and can be compared using some similarity metric.

- L2 normalization (also known as Euclidean normalization) is commonly used, where each element of the vector is divided by the Euclidean norm (square root of the sum of squared elements) of the vector.
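
The pooling and normalization steps above can be sketched in a few lines of plain Python. This is a generic illustration with made-up three-dimensional "token embeddings", not a description of what `text-embedding-3-large` does internally. The function names are chosen for this example.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Element-wise average of the token embeddings."""
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]


def max_pool(token_embeddings: List[Vector]) -> Vector:
    """Element-wise maximum of the token embeddings."""
    return [max(column) for column in zip(*token_embeddings)]


def last_token_pool(token_embeddings: List[Vector]) -> Vector:
    """Use the final token's embedding as the sequence embedding."""
    return list(token_embeddings[-1])


def l2_normalize(vector: Vector) -> Vector:
    """Divide each element by the Euclidean norm so the result has unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # a zero vector cannot be normalized
    return [x / norm for x in vector]


# Three toy token embeddings, each with three dimensions
toy_token_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
]

pooled = mean_pool(toy_token_embeddings)
unit_vector = l2_normalize(pooled)

print(pooled)
print(unit_vector)
```

Whichever pooling method is used, the result has the same number of dimensions as a single token embedding, no matter how many tokens went in.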

**Output**

- The final output of the text embedding model is a dense vector of floating-point numbers that represents the input text.

- The dimensionality of the output vector can vary depending on the specific model and configuration.

- For the `text-embedding-3-large` model, the output vector defaults to a dimensionality of 3072.

- However, the dimensionality can be reduced using the "dimensions" parameter to trade off performance for a more compact representation.
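
As a rough sketch of how the `dimensions` parameter might be used, the helper below passes it through to `embeddings.create`. The helper name `get_shortened_embedding` is hypothetical, and `stub_client` is a stand-in I made up so the example runs without an API key or network access. With a real client, the returned vector comes from OpenAI rather than being a list of zeros.

```python
from types import SimpleNamespace
from typing import List


def get_shortened_embedding(
    client,
    text: str,
    dimensions: int,
    model: str = "text-embedding-3-large",
) -> List[float]:
    """Request an embedding with a reduced number of dimensions."""
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=dimensions,
    )
    return response.data[0].embedding


class _StubEmbeddings:
    """Hypothetical stand-in that mimics the shape of the embeddings response."""

    def create(self, input: str, model: str, dimensions: int):
        fake_embedding = [0.0] * dimensions
        return SimpleNamespace(data=[SimpleNamespace(embedding=fake_embedding)])


stub_client = SimpleNamespace(embeddings=_StubEmbeddings())

short_vector = get_shortened_embedding(stub_client, "Coding is fun!", dimensions=256)
print(len(short_vector))
```

To try it against the real API, pass an `OpenAI()` client instead of `stub_client`. If you do shorten your vectors, remember that the `size` you configure for the Qdrant collection has to match the dimensionality you request.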


In [7]:
from openai import OpenAI, OpenAIError

# OpenAI() reads the OPENAI_API_KEY environment variable (loaded from .env earlier)
openai_client = OpenAI()

def get_text_embedding(text: str, openai_client: OpenAI = openai_client, model: str = "text-embedding-3-large") -> list:
    """
    Get the vector representation of the input text using the specified OpenAI embedding model.

    Args:
        text (str): The input text to be embedded.
        openai_client (OpenAI): An instance of the OpenAI client.
        model (str, optional): The name of the OpenAI embedding model to use. Defaults to "text-embedding-3-large".

    Returns:
        list: The vector representation of the input text as a list of floats.

    Raises:
        OpenAIError: If an error occurs during the API call.
    """
    try:
        # the response holds one embedding per input; we sent a single string
        embedding = openai_client.embeddings.create(
            input=text,
            model=model
        ).data[0].embedding
        return embedding
    except OpenAIError as e:
        # OpenAIError lives in the openai package, not on the client instance
        raise e

You can confirm the length of the embedding.

In [11]:
print(f"This string has {num_tokens_from_string(example_text)} tokens")

vector = get_text_embedding(example_text)

print(f"The vector representation of the text has: {len(vector)} elements")

This string has 59 tokens
The vector representation of the text has: 3072 elements


You can inspect the first few elements of the vector as well:

In [None]:
vector[:10]

It doesn't matter how many tokens the input text has. Its vector representation will always have the same dimensionality, as long as you're embedding it with the same model.

In [12]:
print(f"This string has {num_tokens_from_string(more_example_text)} tokens")

vector = get_text_embedding(more_example_text)

print(f"The vector representation of the text has: {len(vector)} elements")

This string has 23 tokens
The vector representation of the text has: 3072 elements


This is important because, as was discussed in the previous post, all vectors in our collection must have the same dimensionality.

Let's download a dataset from Hugging Face and get it into our collection. We'll use the [`ai-arxiv-chunked dataset`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked), because it's nicely chunked already and has some columns that will serve well as metadata. This dataset has 41.6k rows and is 153 MB large. For the sake of demonstration, time, and keeping your OpenAI bill as low as possible, just randomly sample 100 rows from the dataset.

In [13]:
from datasets import load_dataset

arxiv_chunked_dataset = load_dataset("jamescalam/ai-arxiv-chunked", split="train")

sampled_dataset = arxiv_chunked_dataset.shuffle(seed=51).select(range(100)).to_list()



You can take a peek at a row of the dataset like so:


In [14]:
sampled_dataset[0]

{'doi': '2210.02406',
 'chunk-id': '4',
 'chunk': 'Figure 1: While standard approaches only provide labeled examples (shown as a grey input box\nwith green label box), Chain-of-Thought prompting also describes the reasoning steps to arrive at\nthe answer for every example in the prompt. Decomposed Prompting, on the other hand, uses the\ndecomposer prompt to only describe the procedure to solve the complex tasks using certain subtasks. Each sub-task, indicated here with A, B and C is handled by sub-task speciﬁc handlers which\ncan vary from a standard prompt (sub-task A), a further decomposed prompt (sub-task B) or a\nsymbolic function such as retrieval (sub-task C)\nprompt only describes a sequence of sub-tasks (A, B, and C) needed to solve the complex tasks, indicated with the dashed lines. Each sub-task is then delegated to the corresponding sub-task handler\nshown on the right.\nUsing a software engineering analogy, the decomposer deﬁnes the top-level program for the complex task us

Time to get this data into Qdrant. Start by instantiating the client and creating a collection that's ready for the vectors we're going to give it. Recall that, over the next few blogs, you'll work exclusively with text data. For that I'll use OpenAI's `text-embedding-3-large` embedding model, which has a default dimensionality of `3072`. I'll also use cosine similarity as the distance metric. This information goes into the vectors config in `create_collection`, once for each of the two named vectors (`chunk` and `summary`) that every point will carry.



In [None]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from qdrant_client.http.models import UpdateStatus

# connect to the Qdrant Cloud instance using credentials loaded from .env
q_client = QdrantClient(
    url=os.getenv('QDRANT_URL'),
    api_key=os.getenv('QDRANT_API_KEY')
)

# create a collection with two named vectors, one per text field we embed
q_client.create_collection(
    collection_name="arxiv_chunks",
    vectors_config={
        "chunk": VectorParams(size=3072, distance=Distance.COSINE),
        "summary": VectorParams(size=3072, distance=Distance.COSINE),
    }
)
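
If cosine similarity is new to you, here is a small, self-contained sketch of what the metric computes. The helper names (`dot`, `norm`, `cosine_similarity`) and the toy two-dimensional vectors are made up for illustration. This shows the math only, not how Qdrant implements it internally.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


vec_a = [3.0, 4.0]
vec_b = [4.0, 3.0]
similarity = cosine_similarity(vec_a, vec_b)

# For unit-length vectors, cosine similarity reduces to a plain dot product
unit_a = [x / norm(vec_a) for x in vec_a]
unit_b = [x / norm(vec_b) for x in vec_b]
unit_dot = dot(unit_a, unit_b)

print(similarity, unit_dot)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it pairs naturally with normalized embeddings.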

As we discussed in the previous post, `Points` are the main data structure (for lack of a better word) that Qdrant uses to store and retrieve data.

These are defined by some vector embedding and any additional metadata you want to include. 


The `add_data_to_collection` function takes a list of dictionaries as input, where each dictionary represents a document to be inserted into the Qdrant vector database. The function iterates over each dictionary in the list and performs the following steps:

 - Extracts the relevant key-value pairs from the dictionary, including the `summary`, `chunk`, `title`, `source`, and `authors`.

 - Converts the `summary` and `chunk` texts into vector embeddings using the OpenAI embeddings endpoint.

 - Generates a unique ID for each document using the `uuid` module.

 - Creates a payload dictionary containing the `title`, `source`, and `authors` metadata.

 - Constructs a `PointStruct` object using the generated ID, a dictionary of named vectors (`summary` and `chunk`), and the payload metadata.

 - Appends the `PointStruct` object to the points list.

 - After processing all the documents, the function uses the `qdrant_client.upsert` method to insert the points list into the specified Qdrant collection. The `wait` parameter is set to `True` to ensure that the insertion operation is completed before proceeding.

 - Finally, the function checks the status of the insertion operation. If the status is `UpdateStatus.COMPLETED`, it prints a success message. Otherwise, it prints a failure message.

The `PointStruct` object is the fundamental unit of data storage in Qdrant. It encapsulates the vector embeddings along with any associated metadata. This enables efficient retrieval and similarity search operations. By converting the `summary` and `chunk` texts into vector embeddings and storing them along with the relevant metadata, you can insert the data into the Qdrant vector database for retrieval tasks.

In [11]:
from typing import List
import uuid

from qdrant_client.models import PointStruct

def add_data_to_collection(data: List[dict], qdrant_client: QdrantClient = q_client, collection_name: str = "arxiv_chunks"):
    """
    Inserts data into the Qdrant vector database.

    Args:
        data (List[dict]): A list of dictionaries containing the data to be inserted.
            Each dictionary should have the following keys:
            - 'summary': The summary text to be converted into a vector embedding.
            - 'chunk': The chunk text to be converted into a vector embedding.
            - 'title': The title of the document.
            - 'source': The source URL of the document.
            - 'authors': A list of authors of the document.
        qdrant_client (QdrantClient): An instance of the QdrantClient. Defaults to q_client.
        collection_name (str): The name of the collection in which to insert the data. Defaults to "arxiv_chunks".

    Returns:
        None
    """
    # instantiate an empty list for the points
    points = []

    # get the relevant data from each input dictionary
    for item in data:
        text_id = str(uuid.uuid4())
        summary = item.get("summary")
        chunk = item.get("chunk")
        title = item.get("title")
        source = item.get("source")
        authors = item.get("authors")

        # get the vector embeddings for the summary and chunk
        summary_vector = get_text_embedding(summary)
        chunk_vector = get_text_embedding(chunk)

        # create a dictionary with the vector embeddings
        vector_dict = {"summary": summary_vector, "chunk": chunk_vector}
        
        # create a dictionary with the payload data
        payload = {"title": title, "source": source, "authors": authors}

        # create a PointStruct object and append it to the list of points
        point = PointStruct(id=text_id, vector=vector_dict, payload=payload)
        points.append(point)

    operation_info = qdrant_client.upsert(
        collection_name=collection_name,
        wait=True,
        points=points)

    if operation_info.status == UpdateStatus.COMPLETED:
        print("Data inserted successfully!")
    else:
        print("Failed to insert data")

In [12]:
add_data_to_collection(sampled_dataset)

Data inserted successfully!
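
The `add_data_to_collection` function sends all 100 points in a single `upsert` call, which is fine at this scale. If you later embed a larger slice of the dataset, one option is to split the points into smaller batches and upsert one batch at a time. Below is a minimal sketch of just the batching part. The `batched` helper and the batch size of 32 are assumptions made for illustration, not something Qdrant requires.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for a list of 100 points
fake_points = list(range(100))
batch_sizes = [len(batch) for batch in batched(fake_points, batch_size=32)]
print(batch_sizes)
```

In practice, you would loop over `batched(points, batch_size)` and pass each batch as the `points` argument of `upsert`.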


You can verify that the collection exists, both via the UI and programmatically. Notice that you can do some visualization via the UI as well.

<div style="text-align: center;">
    <img src="/Users/harpreetsahota/workspace/practical-rag-book/exploring-qdrant/image_assets/arxix-collection-verification.gif" style="width: 75%; height: auto;">
</div>



In [None]:
q_client.get_collections()

You can programmatically verify the number of points that were created as well.

In [22]:
arxiv_collection = q_client.get_collection("arxiv_chunks")

print(f"This collection has {arxiv_collection.points_count} points")

This collection has 100 points


Go ahead and close the connection to the client.

In [23]:
q_client.close()

# That's it for this one!

In the next blog in this series, I'll teach you the basics of querying the vectors in your collection. After that blog, you'll have a solid foundation that we'll be able to build on as we start doing some more interesting things and work our way towards multimodal and crossmodal retrieval! 