# From Real World Data to Vectors

In this video, you'll gain an intuition for the process of turning real-world text data into vectors and adding them to your Qdrant collection. 

To follow along, [you'll need an OpenAI API key](https://platform.openai.com/). OpenAI requires you to put some money down upfront; I recommend putting down $20. If you can't afford that, [Cohere is a good option](https://dashboard.cohere.com/welcome/login). Cohere lets you access its models for free without entering any credit card information, but the free tier is rate limited.

At the time of this writing, you can't send more than 5 requests per minute, 100 requests per hour, and 1000 requests per month.
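If you go the free-tier route, a small client-side throttle can help you stay under the per-minute limit. The sketch below is one way to do that with only the standard library. The `RateLimiter` class and its parameters are hypothetical names made up for illustration, not part of any SDK, and the defaults simply mirror the 5-requests-per-minute figure quoted above.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window throttle: allow at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls=5, period=60.0, clock=time.monotonic, sleep=time.sleep):
        if max_calls < 1:
            raise ValueError("max_calls must be at least 1")
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()    # timestamps of the calls inside the current window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have fallen out of the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))


# Tiny demonstration with a short window so it finishes quickly.
limiter = RateLimiter(max_calls=2, period=0.05)
for request_number in range(3):
    limiter.wait()  # the third call pauses briefly
    print(f"request {request_number} allowed")
```

In practice you would call `limiter.wait()` right before each embedding request. This only covers the per-minute limit; the hourly and monthly limits would need their own bookkeeping.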

#### 🔠 Tokenization: Breaking down text into smaller units (tokens)

- Words, subwords, or characters depending on tokenizer

- OpenAI uses `cl100k_base` tokenizer based on [byte-pair encoding (BPE) algorithm](https://youtu.be/HEikzVL-lZU)

- BPE iteratively merges the most frequent pair of adjacent bytes into a single new token

- Tokens can be characters, partial words, complete words

- Spaces are usually attached to the start of the following word token (for example, `' hello'`)

- Tokens differ across languages
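To build some intuition for the merge step, here is a minimal, character-level sketch of byte-pair encoding. It is a toy under stated assumptions: the function names are made up for illustration, it works on characters rather than bytes, and it learns merges from the input itself. The real `cl100k_base` tokenizer applies a fixed, pre-learned merge table and is considerably more involved.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common pair of adjacent tokens, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


print(toy_bpe("aaabdaaabac", 1))
# ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Each merge shortens the sequence, and joining the tokens always gives back the original text. That is why frequent character sequences end up as single tokens while rare ones stay split into pieces.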

#### 🧬 Embedding: Mapping tokens to high-dimensional vector space

- Each dimension captures token meaning and relationships

🎯 Resulting set of vectors represents the text's semantics

#### ⚙️ Process in action:

1. Install `tiktoken` Python library 

```shell
pip install tiktoken==0.6.0
```

2. Check out [web app](https://tiktokenizer.vercel.app/?model=text-embedding-3-large) demo of `cl100k_base` tokenizer

🔥 Ready to see tokenization in action yourself! Let's go 💪

In [1]:
import os
import tiktoken

from dotenv import load_dotenv

load_dotenv(".env")

True

In [2]:
example_text = "def hello_world(): print('Hello, world! 🌍') # Bonjour, 世界! Hola, mundo! 1 + 1 = 2, π ≈ 3.14159, e^(i*π) + 1 = 0."

You can count the "words" in the string by splitting on whitespace, then compare that to the number of tokens.

In [3]:
len(example_text.split())

23

Now, count the number of tokens.

In [4]:
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """
    Calculate the number of tokens in a given text string using a specified encoding.

    Args:
        string (str): The input text string to be tokenized.
        encoding_name (str, optional): The name of the encoding to use for tokenization.
            Defaults to "cl100k_base". Other supported encodings include "p50k_base",
            "p50k_edit", "r50k_base", etc.

    Returns:
        int: The number of tokens in the input text string.

    Note:
        The number of tokens returned by this function depends on the chosen encoding.
        Different encodings may have different tokenization rules and vocabulary sizes.

    Raises:
        ValueError: If an invalid encoding name is provided.
    """
    try:
        encoding = tiktoken.get_encoding(encoding_name)
    except (KeyError, ValueError) as err:
        # An unknown encoding name makes the lookup fail; re-raise it as a
        # ValueError with a consistent message and keep the original cause.
        raise ValueError(f"Unsupported encoding: {encoding_name}") from err
    # Encode the text and count the resulting token IDs.
    return len(encoding.encode(string))

num_tokens_from_string(example_text)

59

As you can see, that's quite a difference!

The code below will show you the text, token, and integer representation of each token in the string.

In [5]:
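def from_text_to_tokens(text: str, encoding_name: str = "cl100k_base") -> None:
    """
    Tokenize the given text and print each token ID next to the subword it decodes to.

    Args:
        text (str): The input text to be tokenized.
        encoding_name (str, optional): The name of the encoding to use for tokenization.
            Defaults to "cl100k_base".

    Returns:
        None
    """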
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    subwords = [encoding.decode([token]) for token in tokens]
    print(f"Original text: {text}")
    print(f"\nTokens: {tokens}")
    print(f"\nSubwords: {subwords}")
    print("\nToken to subword mapping:")
    for token, subword in zip(tokens, subwords):
        print(f"Token: {token}, Subword: {subword.encode('utf-8')}")

from_text_to_tokens(example_text)

Original text: def hello_world(): print('Hello, world! 🌍') # Bonjour, 世界! Hola, mundo! 1 + 1 = 2, π ≈ 3.14159, e^(i*π) + 1 = 0.

Tokens: [755, 24748, 32892, 4658, 1194, 493, 9906, 11, 1917, 0, 11410, 234, 235, 873, 674, 13789, 30362, 11, 220, 3574, 244, 98220, 0, 473, 8083, 11, 29452, 0, 220, 16, 489, 220, 16, 284, 220, 17, 11, 52845, 21784, 230, 220, 18, 13, 9335, 2946, 11, 384, 13571, 72, 9, 49345, 8, 489, 220, 16, 284, 220, 15, 13]

Subwords: ['def', ' hello', '_world', '():', ' print', "('", 'Hello', ',', ' world', '!', ' ÔøΩ', 'ÔøΩ', 'ÔøΩ', "')", ' #', ' Bon', 'jour', ',', ' ', 'ÔøΩ', 'ÔøΩ', 'Áïå', '!', ' H', 'ola', ',', ' mundo', '!', ' ', '1', ' +', ' ', '1', ' =', ' ', '2', ',', ' œÄ', ' ÔøΩ', 'ÔøΩ', ' ', '3', '.', '141', '59', ',', ' e', '^(', 'i', '*', 'œÄ', ')', ' +', ' ', '1', ' =', ' ', '0', '.']

Token to subword mapping:
Token: 755, Subword: b'def'
Token: 24748, Subword: b' hello'
Token: 32892, Subword: b'_world'
Token: 4658, Subword: b'():'
Token: 1194, Subwo

I encourage you to play around with it if you'd like!

In [6]:
more_example_text = "Harpreet Sahota is writing a book RAG and is so happy you're joining him on the journey!"
from_text_to_tokens(more_example_text)

Original text: Harpreet Sahota is writing a book RAG and is so happy you're joining him on the journey!

Tokens: [27588, 1762, 295, 43059, 6217, 374, 4477, 264, 2363, 432, 1929, 323, 374, 779, 6380, 499, 2351, 18667, 1461, 389, 279, 11879, 0]

Subwords: ['Har', 'pre', 'et', ' Sah', 'ota', ' is', ' writing', ' a', ' book', ' R', 'AG', ' and', ' is', ' so', ' happy', ' you', "'re", ' joining', ' him', ' on', ' the', ' journey', '!']

Token to subword mapping:
Token: 27588, Subword: b'Har'
Token: 1762, Subword: b'pre'
Token: 295, Subword: b'et'
Token: 43059, Subword: b' Sah'
Token: 6217, Subword: b'ota'
Token: 374, Subword: b' is'
Token: 4477, Subword: b' writing'
Token: 264, Subword: b' a'
Token: 2363, Subword: b' book'
Token: 432, Subword: b' R'
Token: 1929, Subword: b'AG'
Token: 323, Subword: b' and'
Token: 374, Subword: b' is'
Token: 779, Subword: b' so'
Token: 6380, Subword: b' happy'
Token: 499, Subword: b' you'
Token: 2351, Subword: b"'re"
Token: 18667, Subword: b' joining'
Token: 

When a large language model (LLM) is being pretrained or fine-tuned, each token is mapped to a vector representation called a token embedding. 

These embeddings capture the semantic meaning of each token and its relationship to other tokens. This is especially useful for the attention mechanism in the Transformer architecture that modern LLMs use. 

<div style="text-align: center;">
    <img src="https://cdn.openai.com/new-and-improved-embedding-model/draft-20221214a/vectors-2.svg" style="width: 75%; height: auto;">
</div>

[Image Source: OpenAI Blog](https://openai.com/blog/new-embedding-models-and-api-updates)

You want to retrieve chunks of text, and that means an entire sequence of tokens must be represented as a vector. We don't have access to the source code for `text-embedding-3-large`, but in general, the process of going from embedding a token to embedding a sequence of tokens is as follows:

#### 🎰 Token Embeddings

- Each token mapped to vector representation during LLM pretraining/fine-tuning 

- Tokens plotted as points in high-dimensional space reflecting meaning 

- Captures semantic meaning and token relationships 

- Useful for attention mechanism in Transformer architecture 
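Conceptually, the token-to-vector mapping is a lookup table keyed by token ID. Here is a toy sketch of that idea. The vectors are random placeholders (a real model learns them during training), the 4-dimensional size is arbitrary, and the helper names are hypothetical. The token IDs are the slice of the earlier output that decodes to " is writing a book RAG and is".

```python
import random


def build_toy_embedding_table(token_ids, dim=4, seed=0):
    """Assign each distinct token ID a fixed random vector (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return {
        token_id: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for token_id in sorted(set(token_ids))
    }


def embed_tokens(token_ids, table):
    """Look up the vector for each token ID, preserving sequence order."""
    return [table[token_id] for token_id in token_ids]


token_ids = [374, 4477, 264, 2363, 432, 1929, 323, 374]
table = build_toy_embedding_table(token_ids)
token_vectors = embed_tokens(token_ids, table)

print(len(token_vectors), "vectors of size", len(token_vectors[0]))
# Token 374 (' is') appears twice and maps to the same vector both times.
print(token_vectors[0] == token_vectors[-1])
```

Note that a plain lookup gives the same vector for a token wherever it appears. In a Transformer, the attention layers then mix in context so that the final representation of each token depends on its neighbors.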

#### 🎯 Embedding Sequences for Retrieval

- Goal is to retrieve chunks of text, not individual tokens

- Entire token sequence must be represented as a single vector

#### 🏊‍♂️ Pooling Methods

- Combine token embeddings into one vector

- Average pooling: Element-wise average of token embeddings

- Max pooling: Element-wise maximum of token embeddings

- 🔚 Last token pooling: Use embedding of last token as representative
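The three pooling methods above can be sketched in a few lines of plain Python. This is an illustration on made-up 3-dimensional token embeddings with hypothetical function names; it is not a claim about which method `text-embedding-3-large` uses internally.

```python
from typing import List

Vector = List[float]


def average_pool(token_embeddings: List[Vector]) -> Vector:
    """Element-wise mean across all token embeddings."""
    count = len(token_embeddings)
    return [sum(values) / count for values in zip(*token_embeddings)]


def max_pool(token_embeddings: List[Vector]) -> Vector:
    """Element-wise maximum across all token embeddings."""
    return [max(values) for values in zip(*token_embeddings)]


def last_token_pool(token_embeddings: List[Vector]) -> Vector:
    """Use the final token's embedding to represent the whole sequence."""
    return list(token_embeddings[-1])


# Three tokens, each with a made-up 3-dimensional embedding.
token_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, -2.0, 0.0],
    [2.0, 5.0, -2.0],
]

print(average_pool(token_embeddings))     # [2.0, 1.0, 0.0]
print(max_pool(token_embeddings))         # [3.0, 5.0, 2.0]
print(last_token_pool(token_embeddings))  # [2.0, 5.0, -2.0]
```

Whichever method is used, the result has the same size as a single token embedding, regardless of how many tokens went in.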

#### 📏 Normalization

- Normalize pooled embedding vector to unit length

- ⚖️ Ensures scale-invariance for comparison using similarity metrics

- 📐 L2 normalization (Euclidean normalization) commonly used
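L2 normalization divides every element by the vector's Euclidean length, so the result always has length 1. A minimal sketch, using a hypothetical helper name and a tiny made-up vector:

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        raise ValueError("Cannot normalize a zero vector")
    return [value / norm for value in vector]


pooled = [3.0, 4.0]          # Euclidean length is 5
unit = l2_normalize(pooled)

print(unit)                                          # [0.6, 0.8]
print(math.sqrt(sum(value * value for value in unit)))
```

After normalization only the direction of the vector carries information, which is what makes comparisons scale-invariant.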

Output:

- Dense vector of floating-point numbers representing input text

- 📏 Dimensionality varies by model and configuration

- `text-embedding-3-large` defaults to 3072 dimensions

- 🗜️ Can reduce dimensionality using "dimensions" parameter for compactness
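As a sketch of the last point, the helper below passes a `dimensions` argument to the embeddings endpoint. The function name and the value 256 are arbitrary choices for illustration, and the function takes the client as an argument so that defining it needs no API key. Keep in mind that a collection configured for 3072-dimensional vectors would not accept these shorter ones.

```python
def get_reduced_embedding(client, text, model="text-embedding-3-large", dimensions=256):
    """
    Request an embedding with fewer dimensions than the model's default.

    `client` is expected to be an `openai.OpenAI` instance. The function name
    and the default of 256 are illustrative, not recommendations.
    """
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=dimensions,  # ask the API for a shorter vector
    )
    return response.data[0].embedding


# Usage (requires the `openai` package and an OPENAI_API_KEY in your environment):
#
#   from openai import OpenAI
#   short_vector = get_reduced_embedding(OpenAI(), "Hello, world!")
#   print(len(short_vector))
```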

In [7]:
from openai import OpenAI, OpenAIError

# The client reads your API key from the OPENAI_API_KEY environment variable.
openai_client = OpenAI()

def get_text_embedding(text: str, openai_client: OpenAI = openai_client, model: str = "text-embedding-3-large") -> list:
    """
    Get the vector representation of the input text using the specified OpenAI embedding model.

    Args:
        text (str): The input text to be embedded.
        openai_client (OpenAI): An instance of the OpenAI client.
        model (str, optional): The name of the OpenAI embedding model to use. Defaults to "text-embedding-3-large".

    Returns:
        list: The vector representation of the input text as a list of floats.

    Raises:
        OpenAIError: If an error occurs during the API call.
    """
    try:
        response = openai_client.embeddings.create(
            input=text,
            model=model
        )
    except OpenAIError:
        # The exception class is imported from the openai package; it is not
        # an attribute of the client instance. Re-raise so the caller sees it.
        raise
    # One input string was sent, so the response holds exactly one embedding.
    return response.data[0].embedding

You can confirm the length of the embedding.

In [8]:
print(f"This string has {num_tokens_from_string(example_text)} tokens")

vector = get_text_embedding(example_text)

print(f"The vector representation of the text has: {len(vector)} elements")

This string has 59 tokens
The vector representation of the text has: 3072 elements


You can inspect the first few elements of the vector as well:

In [9]:
vector[:10]

[0.022949593141674995,
 -0.022350121289491653,
 -0.011418242007493973,
 0.02158098667860031,
 -0.016502441838383675,
 -0.009930873289704323,
 0.03696366026997566,
 0.03863765671849251,
 0.017000116407871246,
 0.007159729953855276]

No matter how many tokens the input text has, its vector representation has the same dimensionality, as long as you embed it with the same model.

In [10]:
print(f"This string has {num_tokens_from_string(more_example_text)} tokens")

vector = get_text_embedding(more_example_text)

print(f"The vector representation of the text has: {len(vector)} elements")

This string has 23 tokens
The vector representation of the text has: 3072 elements


This is important because, as was discussed in the previous post, all vectors in our collection must have the same dimensionality.
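Because a mismatch only surfaces when you try to insert the data, it can be handy to check vector sizes up front. The helper below is a hypothetical convenience written for this walkthrough, not part of any client library; the default of 3072 matches the model used here.

```python
def check_dimensions(vectors, expected_size=3072):
    """Raise a ValueError if any vector does not have the expected number of elements."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_size:
            raise ValueError(
                f"Vector {index} has {len(vector)} elements, expected {expected_size}"
            )


# Toy example with tiny vectors so the idea is easy to see.
check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_size=3)  # passes silently

try:
    check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5]], expected_size=3)
except ValueError as err:
    print(err)
```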

Let's download a dataset from Hugging Face and get it into our collection. We'll use the [`ai-arxiv-chunked` dataset](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) because it's already nicely chunked and has some columns that will serve well as metadata. This dataset has 41.6k rows and is 153 MB in size. For the sake of demonstration, time, and keeping your OpenAI bill as low as possible, just randomly sample 100 rows from the dataset.

In [11]:
from datasets import load_dataset

arxiv_chunked_dataset = load_dataset("jamescalam/ai-arxiv-chunked", split="train")

sampled_dataset = arxiv_chunked_dataset.shuffle(seed=51).select(range(100)).to_list()



You can take a peek at a row of the dataset like so:


In [12]:
sampled_dataset[0]

{'doi': '2210.02406',
 'chunk-id': '4',
 'chunk': 'Figure 1: While standard approaches only provide labeled examples (shown as a grey input box\nwith green label box), Chain-of-Thought prompting also describes the reasoning steps to arrive at\nthe answer for every example in the prompt. Decomposed Prompting, on the other hand, uses the\ndecomposer prompt to only describe the procedure to solve the complex tasks using certain subtasks. Each sub-task, indicated here with A, B and C is handled by sub-task speciﬁc handlers which\ncan vary from a standard prompt (sub-task A), a further decomposed prompt (sub-task B) or a\nsymbolic function such as retrieval (sub-task C)\nprompt only describes a sequence of sub-tasks (A, B, and C) needed to solve the complex tasks, indicated with the dashed lines. Each sub-task is then delegated to the corresponding sub-task handler\nshown on the right.\nUsing a software engineering analogy, the decomposer deﬁnes the top-level program for the complex tas

🔧 **Setting Up Qdrant for Text Data Embeddings**

1. **Initialize the Qdrant Client**:
   - Start by creating an instance of the Qdrant client to interact with the database.

2. **Prepare Your Collection**:
   - Update your collection settings in preparation for the data.
   - We're focusing on text data in the upcoming sessions.

3. **Embedding Model Selection**:
   - Utilize OpenAI’s `text-embedding-3-large` for embedding the text.
   - This model features a dimensionality of `3072`.

4. **Configuring the Collection**:
   - Set up the `create_collection` with the right vectors config.
   - Use **cosine similarity** as the distance metric, a common choice for comparing text embeddings.
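To make the distance metric concrete, here is a plain-Python sketch of what cosine similarity measures: the angle between two vectors, ignoring their lengths. Qdrant computes this for you internally, so this code is only an illustration on made-up vectors and says nothing about how Qdrant implements it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # a scaled by 2
opposite = [-1.0, -2.0, -3.0]      # a flipped

print(cosine_similarity(a, same_direction))  # approximately 1
print(cosine_similarity(a, opposite))        # approximately -1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # perpendicular vectors give 0
```

Scaling a vector does not change the score, which is why two texts with similar meaning can score highly even if their raw embedding magnitudes differ.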

In [13]:
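from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from qdrant_client.http.models import UpdateStatus

# Connect to the Qdrant instance configured in your .env file.
q_client = QdrantClient(
    url=os.getenv('QDRANT_URL'),
    api_key=os.getenv('QDRANT_API_KEY')
)

# Create a collection with two named vectors per point: one for the chunk
# text and one for the summary text. Both are sized for the 3072-dimensional
# output of text-embedding-3-large and compared with cosine distance.
q_client.create_collection(
    collection_name="arxiv_chunks",
    vectors_config={
        "chunk": VectorParams(size=3072, distance=Distance.COSINE),
        "summary": VectorParams(size=3072, distance=Distance.COSINE),
    }
)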

True

### 🚀 **Adding Data to Qdrant Collection**

1. **Understanding Points**

   - Remember, `Points` are key in Qdrant for storing and retrieving data.

   - They consist of vector embeddings and any additional metadata you choose to include.

2. **Data Insertion with `add_data_to_collection`**

   - Takes a list of dictionaries; each dictionary represents a document.

3. **Processing Each Document**

   - **Extract Key Info**: Grab `summary`, `chunk`, `title`, `source`, `authors` from each dictionary.

   - **Vector Embedding**: Convert `summary` and `chunk` texts into vectors using OpenAI’s embeddings endpoint.

   - **Unique ID Generation**: Use `uuid` to create a distinct ID for each document.

   - **Create Payload**: Formulate a dictionary with metadata like `title`, `source`, `authors`.

   - **Build PointStruct**: Use the ID, a dictionary of named vectors (`summary` and `chunk`), and the payload to construct a `PointStruct`.

   - **Append to List**: Add each `PointStruct` to a list of points.

4. **Insert Points into Collection**

   - **Upsert Operation**: Use the client's `upsert` method to add points to the collection, setting `wait` to True to ensure completion.

   - **Check Status**: Verify the insertion status. If `UpdateStatus.COMPLETED`, print a success message. If not, print a failure message.

5. **Purpose of PointStruct**

   - Serves as the fundamental unit for data storage in Qdrant.

   - Encapsulates vector embeddings and metadata for efficient retrieval and similarity searches.
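Before looking at the real function, it may help to see the shape of a single point as plain Python data. The sketch below uses a dictionary as a stand-in for `PointStruct` and a fake embedding function, so it runs without Qdrant or an API key. Both `build_point` and `fake_embed` are hypothetical names, and the sample document is made up.

```python
import uuid
from typing import Callable, List


def build_point(item: dict, embed: Callable[[str], List[float]]) -> dict:
    """Assemble the three parts of a point: a unique ID, named vectors, and a payload."""
    point_id = str(uuid.uuid4())
    return {
        "id": point_id,
        # One entry per named vector configured on the collection.
        "vector": {
            "summary": embed(item["summary"]),
            "chunk": embed(item["chunk"]),
        },
        # Metadata stored alongside the vectors and returned with search results.
        "payload": {
            "text_id": point_id,
            "title": item.get("title"),
            "source": item.get("source"),
            "authors": item.get("authors"),
            "chunk": item["chunk"],
            "summary": item["summary"],
        },
    }


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding call; returns a tiny deterministic vector."""
    return [float(len(text)), 0.0, 1.0]


document = {
    "summary": "A made-up summary.",
    "chunk": "A made-up chunk of text.",
    "title": "Example Title",
    "source": "http://example.com/paper",
    "authors": ["A. Author"],
}

point = build_point(document, fake_embed)
print(sorted(point["vector"]))   # ['chunk', 'summary']
print(point["payload"]["title"])  # Example Title
```

The vector names must match the names used when the collection was created, otherwise the insert will be rejected.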

In [14]:
from typing import List
import uuid

from qdrant_client.models import PointStruct

def add_data_to_collection(data: List[dict], qdrant_client: QdrantClient = q_client, collection_name: str = "arxiv_chunks"):
    """
    Inserts data into the Qdrant vector database.

    Args:
        data (List[dict]): A list of dictionaries containing the data to be inserted.
            Each dictionary should have the following keys:
            - 'summary': The summary text to be converted into a vector embedding.
            - 'chunk': The chunk text to be converted into a vector embedding.
            - 'title': The title of the document.
            - 'source': The source URL of the document.
            - 'authors': A list of authors of the document.
        qdrant_client (QdrantClient): An instance of the QdrantClient. Defaults to q_client.
        collection_name (str): The name of the collection in which to insert the data. Defaults to "arxiv_chunks".

    Returns:
        None
    """
    # instantiate an empty list for the points
    points = []

    # get the relevant data from the input dictionary
    for item in data:
        text_id = str(uuid.uuid4())
        summary = item.get("summary")
        chunk = item.get("chunk")
        title = item.get("title")
        source = item.get("source")
        authors = item.get("authors")

        # get the vector embeddings for the summary and chunk
        summary_vector = get_text_embedding(summary)
        chunk_vector = get_text_embedding(chunk)

        # create a dictionary with the vector embeddings
        vector_dict = {"summary": summary_vector, "chunk": chunk_vector}
        
        # create a dictionary with the payload data
        payload = {
            "text_id":text_id,
            "title": title,     
            "source": source, 
            "authors": authors,
            "chunk": chunk,
            "summary": summary,
            }

        # create a PointStruct object and append it to the list of points
        point = PointStruct(id=text_id, vector=vector_dict, payload=payload)
        points.append(point)

    operation_info = qdrant_client.upsert(
        collection_name=collection_name,
        wait=True,
        points=points)

    if operation_info.status == UpdateStatus.COMPLETED:
        print("Data inserted successfully!")
    else:
        print("Failed to insert data")

In [16]:
add_data_to_collection(sampled_dataset)

Data inserted successfully!


You can verify that the collection exists, both via the UI and programmatically. Notice that you can do some visualization via the UI as well.

In [17]:
q_client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='arxiv_chunks')])

You can programmatically verify the number of points that were created as well.

In [18]:
arxiv_collection = q_client.get_collection("arxiv_chunks")

print(f"This collection has {arxiv_collection.points_count} points")

This collection has 100 points


Go ahead and close the connection to the client.

In [19]:
q_client.close()

# That's it for this one!