# Introduction to Weaviate API

Weaviate is one of many vector databases available. It is the one demonstrated in DeepLearning.AI's course on RAG, so I use it here as well. The workflow is similar across vector databases, so you can (supposedly) switch to another one fairly easily if you prefer.

My first encounter with vector databases was with Chroma, which is also a popular choice. I will cover Chroma in another notebook.



## Typical Vector Database Workflow

1. `Set up the vector database`: This involves creating a connection to the database and defining the schema for your data.

2. `Load data into the vector database`: This step involves ingesting your data into the database, which may include text, images, or other types of data.

3. `Create sparse vectors for keyword search`: This step involves generating sparse vectors (e.g., TF-IDF or bag-of-words) for your data, which allows for keyword-based search capabilities.

4. `Create dense vectors for semantic search`: This step involves generating embeddings for your data, which allows for semantic search capabilities.

5. `Create HNSW index to power ANN`: This step involves creating an index that enables approximate nearest neighbor (ANN) search, which is essential for efficient retrieval of similar items in large datasets.

6. `Query the vector database`: This step involves querying the database using either keyword-based or semantic search methods to retrieve relevant data.
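
To make the difference between steps 3 and 4 concrete, here is a minimal, database-free sketch. It is not how Weaviate implements either search: the sparse side uses plain bag-of-words counts, and the 3-dimensional "embeddings" are made-up numbers chosen for illustration (a real embedding model would produce them). The names `docs`, `toy_embeddings`, `keyword_score`, and `cosine` are all hypothetical.

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "dogs chase the ball in the park",
    "d3": "a kitten sleeps on the sofa",
}

def sparse_vector(text):
    """Bag-of-words term counts: a sparse vector keyed by token."""
    return Counter(text.lower().split())

def keyword_score(query, doc_text):
    """Sparse dot product: only tokens shared by query and document count."""
    q, d = sparse_vector(query), sparse_vector(doc_text)
    return sum(q[token] * d[token] for token in q)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up dense vectors, for illustration only (assumption, not model output).
toy_embeddings = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.1],
    "d3": [0.8, 0.2, 0.1],
}
query_text = "cat"
query_embedding = [1.0, 0.0, 0.0]  # hypothetical embedding of the query

keyword_ranking = sorted(
    docs, key=lambda k: keyword_score(query_text, docs[k]), reverse=True
)
semantic_ranking = sorted(
    docs, key=lambda k: cosine(query_embedding, toy_embeddings[k]), reverse=True
)
print("keyword :", keyword_ranking)
print("semantic:", semantic_ranking)
```

The keyword score for `d3` is zero because it never contains the token "cat", while the toy dense vectors still rank it second because they place "kitten" close to "cat". That gap is the reason for keeping both kinds of vectors.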

### Terminology Crash Course on KNN, ANN, and HNSW

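`KNN` stands for K-Nearest Neighbors, which is a type of algorithm used to find the `k` closest points in a dataset to a given query point. It is commonly used in classification and regression tasks. In vector search, exact KNN compares the query against every stored vector, which is accurate but becomes slow as the dataset grows.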
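
A minimal brute-force sketch of the idea, assuming Euclidean distance and a handful of made-up 2-D points (the function name `knn` and the sample data are illustrative only):

```python
import math

def knn(query, points, k):
    """Exact k-nearest neighbors: measure every point, keep the k closest.

    Returns the indices of the k closest points, nearest first.
    """
    by_distance = sorted(
        range(len(points)), key=lambda i: math.dist(query, points[i])
    )
    return by_distance[:k]

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.0, 0.0), (6.0, 5.0)]
nearest = knn((0.9, 0.2), points, k=2)
print(nearest)  # indices of the two closest points
```

Every query touches every point, so the cost grows linearly with the size of the dataset.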

<img src = "./resource/knn.jpg" width="400"  alt="KNN Algorithm" align="center">

`ANN` stands for Approximate Nearest Neighbor, which is a technique used to quickly find items in a dataset that are similar to a given query item. It is used to speed up search operations in high-dimensional spaces. The trade-off is that it is **NOT** guaranteed to return the exact nearest neighbors; it returns an approximation that is close enough for practical purposes.

A basic flow of the ANN algorithm is as follows:

1. **Create a proximity graph**: This involves building a graph structure where nodes represent data points and edges connect each point to a few of its closest neighbors, rather than to every other point. Closeness is measured with some distance metric (e.g., Euclidean distance), and how many edges each node keeps depends on the dataset and the algorithm used.

2. **Search for the nearest neighbors**: When a query is received, it is vectorized to create a query vector. The algorithm starts from a *randomly* chosen document (candidate vector) in the proximity graph. This candidate vector could be any vector in the dataset, and it serves as the starting point for the search.

3. **Traverse the graph**: The algorithm repeatedly moves from the current candidate vector to whichever of its neighbors is closest to the query vector, based on the distance metric. This process continues until no neighbor is closer to the query than the current candidate, or until another stopping condition is met, such as a predefined number of visited candidates.
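
The three steps above can be sketched in a few lines. This is a simplified illustration, not Weaviate's implementation: the graph is built by brute force, the points lie on a line so the walk is easy to follow by hand, and `build_knn_graph` and `greedy_search` are hypothetical helper names.

```python
import math
import random

def build_knn_graph(points, k):
    """Step 1: link every point to its k closest other points."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph[i] = others[:k]
    return graph

def greedy_search(query, points, graph, entry):
    """Steps 2-3: walk to the neighbor closest to the query until none is closer."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: math.dist(query, points[j]))
        if math.dist(query, points[best]) >= math.dist(query, points[current]):
            return current  # no neighbor improves on the current candidate
        current = best

points = [(float(x), 0.0) for x in range(10)]
graph = build_knn_graph(points, k=2)

entry = random.Random(0).randrange(len(points))  # random starting node
found = greedy_search((6.2, 0.0), points, graph, entry)
print(found, points[found])
```

Each step only looks at the current node's neighbors, so far fewer distances are computed than in a brute-force scan. On less regular data the walk can stop at a point that is merely locally closest, which is why the result is an approximation.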

<img src = "./resource/ann.jpg" width="400"  alt="ANN Algorithm" align="center">

#### ANN Algorithm in Action

<img src = "./resource/ann1.jpg" width="400"  alt="ANN Algorithm in Action" align="center">

`HNSW` stands for Hierarchical Navigable Small World, which is a type of graph-based index used for approximate nearest neighbor search. It allows for efficient retrieval of similar items in high-dimensional spaces. HNSW improves upon the basic proximity-graph search by using a multi-layered graph: the top layers contain only a few vectors and act as a coarse map, and the bottom layer contains the full document set. A search starts in the top layer and narrows down layer by layer until it reaches the bottom.

This method significantly reduces search time compared with scanning every vector, which makes very large collections practical to query at low latency while still providing close matches to the query.
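
One piece of HNSW that is easy to show in isolation is how the layers get their sizes. In the original HNSW formulation, each inserted vector draws a top layer from an exponentially decaying distribution, so each layer holds roughly a factor of `M` fewer nodes than the one below it. The sketch below assumes `M = 16` and 10,000 vectors purely for illustration, and `assign_level` is a hypothetical helper name; it does not build the graph itself.

```python
import math
import random

def assign_level(rng, m):
    """Draw the highest layer a new node appears on.

    Uses level = floor(-ln(u) / ln(m)) with u uniform in (0, 1],
    so the chance of reaching layer l is m ** -l.
    """
    u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] to avoid log(0)
    return int(-math.log(u) / math.log(m))

rng = random.Random(42)
levels = [assign_level(rng, m=16) for _ in range(10_000)]

# A node whose top layer is L is present on every layer from 0 up to L.
layer_sizes = [
    sum(1 for level in levels if level >= layer)
    for layer in range(max(levels) + 1)
]
for layer, size in enumerate(layer_sizes):
    print(f"layer {layer}: {size} nodes")
```

Layer 0 always contains every vector, and the sparse upper layers are what let a search cover long distances in a few hops before refining the result lower down.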

#### HNSW Algorithm 
<img src = "./resource/hnsw1.jpg" width="400"  alt="HNSW Algorithm" align="center">

#### HNSW in Action
<img src = "./resource/hnsw2.jpg" width="400"  alt="HNSW in Action" align="center">


## Weaviate Setup

Weaviate is an open-source vector database that allows you to store and search data using vector embeddings. It can be run locally or accessed via a cloud service. This notebook runs it locally in embedded mode.

In [None]:
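# Install the Weaviate Python client (published on PyPI as weaviate-client)
# and the helper libraries imported below. typing ships with Python 3,
# so it does not need to be installed.
!pip install weaviate-client
!pip install tqdm joblib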

In [None]:
import os
import re
from pprint import pprint
from typing import List

import joblib
import weaviate
from tqdm import tqdm
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.query import Filter
from weaviate.util import generate_uuid5

In [None]:
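import os

import weaviate

# Start an embedded (local) Weaviate instance and pass the OpenAI settings
# from environment variables as request headers.
client = weaviate.connect_to_embedded(
    version="1.26.1",
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY"),
        "X-OpenAI-Api-Base": os.getenv("OPENAI_API_BASE"),
    },
)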