# Using Vector Databases for Embeddings Search

This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.

### What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
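
To make the idea of "searching embedding vectors" concrete, here is a minimal sketch of a brute-force similarity search using only the standard library. It is an illustration of the concept, not how any of the databases below are implemented (they use approximate indexes to avoid scoring every vector). The names `cosine_similarity`, `brute_force_search` and the toy vectors are assumptions made up for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(
    query: List[float], vectors: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top_k best."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical 3-dimensional "embeddings"; the ones in this notebook have 1,536 dimensions
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
}

results = brute_force_search([1.0, 0.0, 0.0], toy_vectors)
print(results)  # "cat" and "kitten" rank above "car"
```

Scoring every vector is fine for a handful of items but scales linearly with the corpus, which is the gap a vector database is meant to close.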

### Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbot and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.


### Demo Flow
The demo flow is:
- **Setup**: Import packages and set any required variables
- **Load data**: Load a dataset and embed it using OpenAI embeddings
- **Pinecone**
    - *Setup*: Here we'll set up the Python client for Pinecone. For more details go [here](https://docs.pinecone.io/docs/quickstart)
    - *Index Data*: We'll create an index with namespaces for __titles__ and __content__
    - *Search Data*: We'll test out both namespaces with search queries to confirm it works
- **Weaviate**
    - *Setup*: Here we'll set up the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)
    - *Index Data*: We'll create an index with __title__ search vectors in it
    - *Search Data*: We'll run a few searches to confirm it works
- **Qdrant**
    - *Setup*: Here we'll set up the Python client for Qdrant. For more details go [here](https://github.com/qdrant/qdrant_client)
    - *Index Data*: We'll create a collection with vectors for __titles__ and __content__
    - *Search Data*: We'll run a few searches to confirm it works

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

## Setup

Import the required libraries and set the embedding model that we'd like to use.

In [None]:
# We'll need to install the clients for all vector databases
!pip install pinecone-client
!pip install weaviate-client
!pip install qdrant-client

In [3]:
import openai

import tiktoken
from tenacity import retry, wait_random_exponential, stop_after_attempt
from typing import List, Iterator
import concurrent.futures
from tqdm import tqdm
import pandas as pd
from datasets import load_dataset
import numpy as np
import os

# Pinecone's client library for Python
import pinecone

# Weaviate's client library for Python
import weaviate

# Qdrant's client library for Python
import qdrant_client

# This is set to our new embeddings model; change it to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-ada-002"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Load data

In this section we'll source the data for this task, embed it, and format it for insertion into a vector database.
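
Embedding a large corpus means many API calls, and some of them may fail transiently (rate limits, timeouts). The Setup cell imports retry helpers from `tenacity`; the sketch below shows the underlying idea, randomized exponential backoff, using only the standard library. The helper name `call_with_backoff`, its defaults and the `flaky` demo function are assumptions for illustration, not part of any library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with randomized exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_attempts:
                raise  # Out of attempts: surface the last error
            # Wait a random time up to base_delay * 2^(attempt - 1), capped at max_delay
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))
            attempt += 1


# Demo with a hypothetical flaky function that fails twice before succeeding
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# The sleep function is replaced so the demo does not actually wait
result = call_with_backoff(flaky, sleep=lambda seconds: None)
print(result, calls["count"])
```

In this notebook, a wrapper like this could be applied around each batch request, for example `call_with_backoff(lambda: get_embeddings(text_batch))`.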

In [4]:
# Simple function to take in a list of text objects and return them as a list of embeddings
def get_embeddings(input: List):
    response = openai.Embedding.create(
        input=input,
        model=EMBEDDING_MODEL,
    )["data"]
    return [data["embedding"] for data in response]

def batchify(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx : min(ndx + n, l)]

# Function for batching and parallel processing the embeddings
def embed_corpus(
    corpus: List[str],
    batch_size=64,
    num_workers=8,
    max_context_len=8191,
):

    # Encode the corpus, truncating to max_context_len
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded_corpus = [
        encoded_article[:max_context_len] for encoded_article in encoding.encode_batch(corpus)
    ]

    # Calculate corpus statistics: the number of inputs, the total number of tokens, and the estimated cost to embed
    num_tokens = sum(len(article) for article in encoded_corpus)
    cost_to_embed_tokens = num_tokens / 1_000 * 0.0004
    print(
        f"num_articles={len(encoded_corpus)}, num_tokens={num_tokens}, est_embedding_cost={cost_to_embed_tokens:.2f} USD"
    )

    # Embed the corpus
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        
        futures = [
            executor.submit(get_embeddings, text_batch)
            for text_batch in batchify(encoded_corpus, batch_size)
        ]

        with tqdm(total=len(encoded_corpus)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(batch_size)

        embeddings = []
        for future in futures:
            data = future.result()
            embeddings.extend(data)

        return embeddings

In [None]:
# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding
dataset = list(load_dataset("wikipedia", "20220301.simple")["train"])
# Limited to 25k articles for demo purposes
dataset = dataset[:25_000]  

In [6]:
%%time
# Embed the article text
dataset_embeddings = embed_corpus([article["text"] for article in dataset])

num_articles=25000, num_tokens=12896881, est_embedding_cost=5.16 USD


25024it [01:06, 377.31it/s]                                                                                                                                           

CPU times: user 16.3 s, sys: 2.24 s, total: 18.5 s
Wall time: 1min 8s





In [7]:
# Embed the article titles separately
title_embeddings = embed_corpus([article["title"] for article in dataset])

num_articles=25000, num_tokens=88300, est_embedding_cost=0.04 USD


25024it [00:36, 683.22it/s]                                                                                                                                           
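
The estimated costs printed above come from a simple per-token calculation inside `embed_corpus`. As a worked check, the sketch below reproduces that arithmetic with the rate hard-coded in the function (0.0004 USD per 1,000 tokens), which may not match current pricing. The helper name `estimate_embedding_cost` is an assumption for this example.

```python
def estimate_embedding_cost(num_tokens: int, usd_per_1k_tokens: float = 0.0004) -> float:
    """Estimated cost in USD of embedding num_tokens tokens."""
    return num_tokens / 1_000 * usd_per_1k_tokens


# Token counts taken from the two runs above
content_cost = estimate_embedding_cost(12_896_881)
title_cost = estimate_embedding_cost(88_300)

print(f"content: {content_cost:.2f} USD")  # content: 5.16 USD
print(f"titles:  {title_cost:.2f} USD")    # titles:  0.04 USD
```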


In [122]:
# We will then store the result in another dataframe, and prep the data for insertion into a vector DB
article_df = pd.DataFrame(dataset)
article_df['title_vector'] = title_embeddings
article_df['content_vector'] = dataset_embeddings
article_df['vector_id'] = article_df.index
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.head()

| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |


## Pinecone

We'll index these embedded documents in a vector database and search them. The first option we'll look at is **Pinecone**, a managed vector database which offers a cloud-native option.

Before you proceed with this step you'll need to navigate to [Pinecone](https://www.pinecone.io), sign up and then save your API key as an environment variable named `PINECONE_API_KEY`.
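
A missing key tends to surface later as a confusing authentication error, so it can help to fail early. The sketch below is one possible guard, assuming the variable name `PINECONE_API_KEY` from above; the helper `require_env` is a hypothetical name, not part of the Pinecone client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo only: set a placeholder if nothing is configured so this snippet runs;
# setdefault leaves a real key untouched if one is already present
os.environ.setdefault("PINECONE_API_KEY", "placeholder-key")
api_key = require_env("PINECONE_API_KEY")
```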

For this section we will:
- Create an index for our Wikipedia articles
- Store our data in the index with separate searchable "namespaces" for article **titles** and **content**
- Fire some similarity search queries to verify our setup is working

In [11]:
api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key)

### Create Index

First we will need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [Pinecone documentation](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.).

If you want to batch insert to your index in parallel to increase insertion speed then there is a great guide in the Pinecone documentation on [batch inserts in parallel](https://docs.pinecone.io/docs/insert-data#sending-upserts-in-parallel).
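
Whether upserts are sent sequentially or in parallel, the data first has to be split into batches. As a library-agnostic sketch of that step, the generator below chunks any iterable into fixed-size tuples. It does not call Pinecone; the name `chunks` and the toy `(id, vector)` pairs are assumptions for illustration.

```python
import itertools
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def chunks(iterable: Iterable[T], batch_size: int = 100) -> Iterator[Tuple[T, ...]]:
    """Yield successive tuples of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))


# Hypothetical (id, vector) pairs standing in for real embeddings
pairs = [(str(i), [float(i)] * 3) for i in range(7)]
batch_sizes = [len(batch) for batch in chunks(pairs, batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

Because it works on any iterable, the same generator can be fed a `zip` of IDs and vectors without materializing the whole list first.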

In [108]:
# Models a simple batch generator that makes chunks out of an input DataFrame
class BatchGenerator:
    
    
    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size
    
    # Makes chunks out of an input DataFrame
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determines how many chunks to split the DataFrame into,
    # rounding up so that no chunk exceeds batch_size
    def splits_num(self, elements: int) -> int:
        return int(np.ceil(elements / self.batch_size))
    
    __call__ = to_batches

df_batcher = BatchGenerator(300)

In [124]:
# Pick a name for the new index
index_name = 'wikipedia-articles'

# Check whether the index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
    
# Creates new index
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)

# Confirm our index was created
pinecone.list_indexes()

['wikipedia-articles']

In [126]:
# Upsert content vectors in content namespace - this can take a few minutes
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')

Uploading vectors to content namespace..


In [127]:
# Upsert title vectors in title namespace - this can also take a few minutes
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')

Uploading vectors to title namespace..


In [128]:
# Check index size for each namespace to confirm all of our docs have loaded
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'content': {'vector_count': 25000},
                'title': {'vector_count': 25000}},
 'total_vector_count': 50000}

### Search data

Now we'll enter some dummy searches and check that we get decent results back.

In [None]:
# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results
titles_mapped = dict(zip(article_df.vector_id,article_df.title))
content_mapped = dict(zip(article_df.vector_id,article_df.text))

In [72]:
def query_article(query, namespace, top_k=5):
    '''Embeds a text query, searches the specified
     namespace and prints the most similar articles.'''

    # Embed the query text with the same model used for the corpus
    embedded_query = openai.Embedding.create(
                                            input=query,
                                            model=EMBEDDING_MODEL,
                                            )["data"][0]['embedding']

    # Query the namespace passed as parameter using the embedded query
    query_result = index.query(embedded_query, 
                                      namespace=namespace, 
                                      top_k=top_k)

    # Print query results 
    print(f'\nMost similar results to {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')
    
    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    df = pd.DataFrame({'id':ids, 
                       'score':scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })
    
    # Print one line per match
    for k,v in df.iterrows():
        print(f'{v.title} (score = {v.score})')
    
    print('\n')

    return df

In [None]:
query_output = query_article('modern art in Europe','title')

In [None]:
content_query_output = query_article("Famous battles in Scottish history",'content')

## Weaviate

The next vector database option we'll explore here is **Weaviate**, which offers both a managed SaaS option like Pinecone and a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.

For this we will:
- Set up a local deployment of Weaviate
- Create indices in Weaviate
- Store our data there
- Fire some similarity search queries
- Try a real use case

### Setup

To get Weaviate running locally we will use Docker and follow the instructions contained in the Weaviate documentation here: https://weaviate.io/developers/weaviate/current/installation/docker-compose.html

For an example docker-compose.yaml file please refer to `./weaviate/docker-compose.yaml` in this repo.

You can start Weaviate up locally by navigating to this directory and running `docker-compose up -d`.
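
A container started with `docker-compose up -d` may take a few seconds before it accepts requests. The sketch below polls a readiness check until it succeeds or a timeout elapses. It is a generic helper written for this example (`wait_until_ready`, its parameters and `fake_check` are assumptions), and it takes any zero-argument callable rather than depending on a particular client.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            pass  # Treat connection errors as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Demo with a hypothetical check that succeeds on the third poll
polls = {"count": 0}


def fake_check() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


ready = wait_until_ready(fake_check, timeout_s=5.0, interval_s=0.01)
print(ready)
```

With the Weaviate client used in this section, the check could be the client's own readiness method, for example `wait_until_ready(client.is_ready)`.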

In [113]:
client = weaviate.Client("http://localhost:8080/")

In [114]:
client.schema.delete_all()
client.schema.get()

{'classes': []}

In [115]:
client.is_ready()

True

### Index data

In Weaviate you create __schemas__ to capture each of the entities you will be searching. 

In this case we'll create a schema called **Article** with the **title** vector from above included for us to search by.

The next few steps closely follow the documentation Weaviate provides [here](https://weaviate.io/developers/weaviate/current/tutorials/how-to-use-weaviate-without-modules.html).

In [116]:
class_obj = {
    "class": "Article",
    "vectorizer": "none", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves
    "properties": [{
        "name": "title",
        "description": "Title of the article",
        "dataType": ["text"]
    },
        {
        "name": "content",
        "description": "Contents of the article",
        "dataType": ["text"]
    }]
}

# Create the schema in Weaviate
client.schema.create_class(class_obj)

# Check that we've created it as intended
client.schema.get()

{'classes': [{'class': 'Article',
   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
    'cleanupIntervalSeconds': 60,
    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
   'properties': [{'dataType': ['text'],
     'description': 'Title of the article',
     'name': 'title',
     'tokenization': 'word'},
    {'dataType': ['text'],
     'description': 'Contents of the article',
     'name': 'content',
     'tokenization': 'word'}],
   'shardingConfig': {'virtualPerPhysical': 128,
    'desiredCount': 1,
    'actualCount': 1,
    'desiredVirtualCount': 128,
    'actualVirtualCount': 128,
    'key': '_id',
    'strategy': 'hash',
    'function': 'murmur3'},
   'vectorIndexConfig': {'skip': False,
    'cleanupIntervalSeconds': 300,
    'maxConnections': 64,
    'efConstruction': 128,
    'ef': -1,
    'dynamicEfMin': 100,
    'dynamicEfMax': 500,
    'dynamicEfFactor': 8,
    'vectorCacheMaxObjects': 2000000,
    'flatSearchCutoff': 40000,
    'distance': 'cos

In [117]:
# Upsert into article schema
print("Uploading vectors to article schema..")

# Store a list of UUIDs in case we want to refer back to the initial dataframe
uuids = []

# Reuse our batcher from the Pinecone ingestion
for batch_df in df_batcher(article_df):
    for k,v in batch_df.iterrows():
        uuid = client.data_object.create(
                              {
                                  "title": v['title'],
                                  "content": v['text']
                              },
                              "Article",
                              vector=v['title_vector']
                            )
        uuids.append(uuid)

Uploading vectors to article schema..


In [118]:
# Test our insert has worked by checking one object
print(client.data_object.get()['objects'][0]['properties']['title'])
print(client.data_object.get()['objects'][0]['properties']['content'])

# Test that all data has loaded
result = client.query.aggregate("Article") \
    .with_fields('meta { count }') \
    .do()
result['data']

Kim Jong-nam
Kim Jong-nam (May 10, 1971 - February 13, 2017) was the eldest son of Kim Jong-il, the former leader of North Korea.

He tried to enter Japan using a fake passport in May 2001.  This was to visit Disneyland.  This caused his father to not approve of him. Kim Jong-nam's younger half-brother Kim Jong-un was made the heir in September 2010.

In June 2010, Kim Jong-nam gave a brief interview to the Associated Press in Macau. He told the reporter that he had "no plans" to defect to Europe. The press had recently said this. Kim Jong-nam lived in an apartment on the southern tip of Macau's Coloane Island until 2007. An anonymous South Korean official reported in October 2010 that Jong-nam had not lived in Macau for "months", and now goes between China and "another country."

When his father died, Kim Jong-nam did not attend the funeral.  This was to avoid rumours on the succession.

He was assassinated in Malaysia on February 13, 2017, which is believed to be ordered by his half-

{'Aggregate': {'Article': [{'meta': {'count': 25000}}]}}

### Search Data

As above, we'll fire some queries at our new index and get back results ranked by their closeness to our existing vectors.

In [119]:
def query_weaviate(query, schema, top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
                                                input=query,
                                                model=EMBEDDING_MODEL,
                                            )["data"][0]['embedding']
    
    near_vector = {"vector": embedded_query}

    # Queries input schema with vectorised user query
    query_result = client.query.get(schema,["title","content", "_additional {certainty}"]) \
    .with_near_vector(near_vector) \
    .with_limit(top_k) \
    .do()
    
    return query_result

In [120]:
query_result = query_weaviate('modern art in Europe','Article')
counter = 0
for article in query_result['data']['Get']['Article']:
    counter += 1
    print(f"{counter}. { article['title']} (Score: {round(article['_additional']['certainty'],3) })")

1. Museum of Modern Art (Score: 0.938)
2. Western Europe (Score: 0.934)
3. Renaissance art (Score: 0.932)
4. Pop art (Score: 0.93)
5. Northern Europe (Score: 0.927)
6. Hellenistic art (Score: 0.926)
7. Modernist literature (Score: 0.924)
8. Art film (Score: 0.922)
9. Central Europe (Score: 0.921)
10. Art (Score: 0.921)
11. European (Score: 0.921)
12. Byzantine art (Score: 0.92)
13. Postmodernism (Score: 0.92)
14. Eastern Europe (Score: 0.92)
15. Cubism (Score: 0.92)
16. Europe (Score: 0.919)
17. Impressionism (Score: 0.919)
18. Bauhaus (Score: 0.919)
19. Surrealism (Score: 0.919)
20. Expressionism (Score: 0.919)


In [85]:
query_result = query_weaviate('Famous battles in Scottish history','Article')
counter = 0
for article in query_result['data']['Get']['Article']:
    counter += 1
    print(f"{counter}. {article['title']} (Score: {round(article['_additional']['certainty'],3) })")

1. Historic Scotland (Score: 0.946)
2. First War of Scottish Independence (Score: 0.946)
3. Battle of Bannockburn (Score: 0.946)
4. Wars of Scottish Independence (Score: 0.944)
5. Second War of Scottish Independence (Score: 0.939)
6. List of Scottish monarchs (Score: 0.937)
7. Scottish Borders (Score: 0.932)
8. Braveheart (Score: 0.929)
9. John of Scotland (Score: 0.929)
10. Guardians of Scotland (Score: 0.926)
11. Holyrood Abbey (Score: 0.925)
12. Scottish (Score: 0.925)
13. Scots (Score: 0.925)
14. Robert I of Scotland (Score: 0.924)
15. Scottish people (Score: 0.924)
16. Alexander I of Scotland (Score: 0.924)
17. Edinburgh Castle (Score: 0.924)
18. Robert Burns (Score: 0.923)
19. Battle of Bosworth Field (Score: 0.922)
20. David II of Scotland (Score: 0.922)
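
The scores printed for the Weaviate queries are `certainty` values rather than raw cosine similarities, which is why they cluster so close to 1. As an aid to reading them, the sketch below converts between the two, assuming the cosine distance metric and the relationship described in Weaviate's documentation, certainty = (1 + cosine similarity) / 2. Treat that formula as an assumption to verify against the Weaviate version you run; the function names are made up for this example.

```python
def certainty_to_cosine_similarity(certainty: float) -> float:
    """Invert certainty = (1 + cosine_similarity) / 2."""
    return 2.0 * certainty - 1.0


def cosine_similarity_to_certainty(similarity: float) -> float:
    """Map a cosine similarity in [-1, 1] onto a certainty in [0, 1]."""
    return (1.0 + similarity) / 2.0


# Top and bottom scores from the 'modern art in Europe' results
for certainty in (0.938, 0.919):
    print(certainty, round(certainty_to_cosine_similarity(certainty), 3))
```

Under that assumption, a spread of 0.019 in certainty corresponds to roughly twice that spread in cosine similarity.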


## Qdrant

The last vector database we'll consider is **[Qdrant](https://qdrant.tech/)**. This is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we're going to use the local deployment mode.

Setting everything up will require:
- Spinning up a local instance of Qdrant
- Configuring the collection and storing the data in it
- Trying out with some queries

### Setup

For the local deployment, we are going to use Docker, according to the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example of the docker-compose.yaml file is available at `./qdrant/docker-compose.yaml` in this repo.

You can start a Qdrant instance locally by navigating to this directory and running `docker-compose up -d`.

In [99]:
qdrant = qdrant_client.QdrantClient(host='localhost', prefer_grpc=True)

In [100]:
qdrant.get_collections()

CollectionsResponse(collections=[])

### Index data

Qdrant stores data in __collections__ where each object is described by at least one vector and may contain additional metadata called a __payload__. Our collection will be called **Articles** and each object will be described by both **title** and **content** vectors.

We'll be using the official [qdrant-client](https://github.com/qdrant/qdrant_client) package, which has all the utility methods already built in.
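
Since the payload is meant for metadata, it usually does not need to repeat the vectors themselves. In the upsert cell below, `v.to_dict()` copies every column of the row, including `title_vector` and `content_vector`, into the payload. One possible way to keep the payload small is sketched here; `build_payload` and the sample `row` are hypothetical names for this example, not part of qdrant-client.

```python
from typing import Any, Dict, Iterable


def build_payload(
    row: Dict[str, Any],
    exclude: Iterable[str] = ("title_vector", "content_vector"),
) -> Dict[str, Any]:
    """Copy a row's fields into a payload, leaving out the vector columns."""
    excluded = set(exclude)
    return {key: value for key, value in row.items() if key not in excluded}


# Hypothetical row shaped like one record of article_df
row = {
    "id": "1",
    "title": "April",
    "text": "April is the fourth month of the year...",
    "title_vector": [0.001, -0.020],
    "content_vector": [-0.011, -0.013],
    "vector_id": "0",
}
payload = build_payload(row)
print(sorted(payload))  # ['id', 'text', 'title', 'vector_id']
```

In the upsert cell this would replace `payload=v.to_dict()` with `payload=build_payload(v.to_dict())`.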

In [101]:
from qdrant_client.http import models as rest

In [102]:
vector_size = len(article_df['content_vector'][0])

qdrant.recreate_collection(
    collection_name='Articles',
    vectors_config={
        'title': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        'content': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

In [None]:
qdrant.upsert(
    collection_name='Articles',
    points=[
        rest.PointStruct(
            id=k,
            vector={
                'title': v['title_vector'],
                'content': v['content_vector'],
            },
            payload=v.to_dict(),
        )
        for k, v in article_df.iterrows()
    ],
)

In [None]:
# Check the collection size to make sure all the points have been stored
qdrant.count(collection_name='Articles')

### Search Data

Once the data is in Qdrant we can start querying the collection for the closest vectors. We can pass the additional parameter `vector_name` to switch from title-based to content-based search.

In [None]:
def query_qdrant(query, collection_name, vector_name='title', top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )['data'][0]['embedding']
    
    query_results = qdrant.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k,
    )
    
    return query_results

In [None]:
query_results = query_qdrant('modern art in Europe', 'Articles')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]} (Score: {round(article.score, 3)})')

In [None]:
# This time we'll query using content vector
query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]} (Score: {round(article.score, 3)})')

Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo.