# Part 2: Vector Databases and Similarity Search

## 2.1 Overview of Vector Databases

### What are Vector Databases?
- **Vector Databases** are designed to store and search high-dimensional vectors, which represent data points in a multi-dimensional space.
- These vectors are often generated by machine learning models (e.g., embeddings from NLP models) and are used to capture the semantic meaning of text, images, or other data types.

### Why Use Vector Databases?
- **Efficiency**: Vector databases enable fast similarity searches, crucial for tasks like finding similar documents, images, or even generating recommendations.
- **Scalability**: They can handle large datasets with millions or even billions of vectors, making them suitable for real-world applications.

### Popular Vector Databases
- **Pinecone**: A managed vector database service that scales automatically and provides real-time search capabilities.
- **Milvus**: An open-source vector database optimized for high-performance similarity search, particularly in large datasets.
- **Chroma**: An open-source vector database designed for AI applications, emphasizing fast and efficient similarity searches.
- **Weaviate**: An open-source vector database with support for various data types and integrations, including semantic search capabilities.
- **FAISS (Facebook AI Similarity Search)**: A popular library developed by Facebook AI for efficient similarity search and clustering of dense vectors.
- **Elasticsearch**: A distributed search engine that has added support for vector search, making it suitable for integrating with other structured and unstructured data searches.
- **Qdrant**: An open-source vector search engine and database that provides high-performance, real-time vector search capabilities with a focus on efficient nearest neighbor search.

<br>
<img src="images/vectordb.png" width="700">

### Comparing Vector Databases
Factors to keep in mind when evaluating a vector database:
- **Deployment options**
- **Scalability**
- **Performance and Benchmarking**
- **Data Management**
- **Algorithms for vector similarity search**
- **Integration and API**
- **Security**
- **Community and ecosystem**
- **Pricing**
- **Additional features**


[Vector database comparison](https://zackproser.com/blog/vector-databases-compared)


## 2.3 Usual Metrics in Vector Databases

### Common Similarity Metrics
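- **L2 (Euclidean) Distance**: Measures the straight-line distance between two points in vector space. Smaller values mean more similar vectors. Commonly used in many vector databases.
- **Cosine Similarity**: Measures the cosine of the angle between two vectors. Values range from -1 to 1, and higher values mean more similar vectors. Useful when you care about the direction rather than the magnitude.
- **Inner Product**: Measures the dot product of two vectors. Higher values mean more similar vectors, and unlike cosine similarity the result depends on vector magnitude. For vectors normalized to unit length, it equals cosine similarity.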
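
As a minimal sketch of the three metrics above, the following pure-Python snippet computes each one for two small vectors. The helper names and the example vectors `a` and `b` are illustrative assumptions, not part of any vector database API; in practice you would typically rely on NumPy or on the metric built into the database.

```python
import math

def l2_distance(a, b):
    # Straight-line (Euclidean) distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Dot product; larger means more similar, and magnitude matters.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Inner product divided by the product of the two vector lengths.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical example: b points in the same direction as a but is twice as long.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]

print("L2 distance:      ", l2_distance(a, b))        # 3.0
print("Inner product:    ", inner_product(a, b))      # 18.0
print("Cosine similarity:", cosine_similarity(a, b))  # 1.0
```

Here the cosine similarity is 1.0 even though the L2 distance is not zero, which illustrates the difference between comparing direction and comparing position.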

## 2.4 Vector Indexing Algorithms

### Importance of Indexing
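- Without an index, every query must be compared against every stored vector. Indexing reduces the number of comparisons per query, which matters most for large datasets, usually in exchange for a small loss in accuracy (approximate rather than exact nearest neighbors).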

### Common Indexing Algorithms
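- **IVF (Inverted File Index)**: Partitions the vectors into clusters around learned centroids and keeps a list of the vectors in each cluster. A query scans only the few clusters closest to it instead of the whole dataset.
- **HNSW (Hierarchical Navigable Small World)**: A graph-based indexing method that links each vector to nearby vectors across several layers, allowing efficient approximate nearest neighbor searches by following those links toward the query.
- **PQ (Product Quantization)**: Compresses vectors into smaller codes by splitting each vector into sub-vectors and storing the id of the nearest codebook entry for each one. This reduces memory usage at the cost of some accuracy.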
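
To make the IVF idea above concrete, here is a toy pure-Python sketch, not how FAISS implements it. It assumes plain k-means for the centroids, and every name in it (`train_centroids`, `build_ivf`, `ivf_search`, the random toy data) is hypothetical. The `nlist` and `nprobe` parameters mirror the FAISS parameter names for the number of clusters and the number of clusters scanned per query.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance (ranking by it matches ranking by L2).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v, n=1):
    # Ids of the n centroids closest to v.
    return sorted(range(len(centroids)), key=lambda c: sq_dist(centroids[c], v))[:n]

def train_centroids(vectors, nlist, iters=10, seed=0):
    # A few rounds of plain k-means to place one centroid per cluster.
    centroids = random.Random(seed).sample(vectors, nlist)
    for _ in range(iters):
        groups = [[] for _ in range(nlist)]
        for v in vectors:
            groups[nearest(centroids, v)[0]].append(v)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def build_ivf(vectors, centroids):
    # Inverted lists: cluster id -> ids of the vectors assigned to that cluster.
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest(centroids, v)[0]].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    # Scan only the nprobe clusters whose centroids are closest to the query.
    candidates = [i for c in nearest(centroids, query, nprobe) for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(vectors[i], query))[:k]

# Hypothetical toy data: 500 random 8-dimensional vectors and one query.
rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]

centroids = train_centroids(data, nlist=10)
lists = build_ivf(data, centroids)

exact = sorted(range(len(data)), key=lambda i: sq_dist(data[i], query))[:5]
approx = ivf_search(query, data, centroids, lists, k=5, nprobe=3)
print("Exact top-5:", exact)
print("IVF top-5:  ", approx)
print("Overlap:    ", len(set(exact) & set(approx)), "of 5")
```

Raising `nprobe` scans more clusters, so results move closer to the exact answer while each query does more work. With `nprobe` equal to `nlist`, every cluster is scanned and the search becomes exhaustive.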

### Activity:
Experiment with different indexing algorithms in FAISS or another vector database. Compare the search speeds and accuracy for different algorithms.
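
One possible way to structure the comparison is sketched below, assuming accuracy is measured as recall: the fraction of the true nearest neighbors (from an exact search) that the approximate index also returns. The helpers `recall_at_k` and `timed` and the example id lists are hypothetical, not part of FAISS.

```python
import time

def recall_at_k(exact_ids, approx_ids):
    # Each argument holds one row of neighbor ids per query, in the same
    # layout as the index array returned by a FAISS search (nested lists work too).
    hits = 0
    total = 0
    for exact_row, approx_row in zip(exact_ids, approx_ids):
        truth = set(int(i) for i in exact_row)
        hits += len(truth & set(int(i) for i in approx_row))
        total += len(truth)
    return hits / total

def timed(fn, *args):
    # Run fn(*args) and return its result together with the elapsed seconds.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical results for two queries with k = 3.
I_exact = [[4, 7, 1], [9, 2, 5]]
I_approx = [[4, 1, 8], [9, 2, 5]]
print("Recall:", recall_at_k(I_exact, I_approx))  # 5 of 6 true neighbors found
```

With FAISS, a call such as `(D, I), seconds = timed(index_ivf.search, xq, k)` would time one search, and the ids returned by an exact index such as `faiss.IndexFlatL2` can serve as the ground truth for `recall_at_k`.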
"""))

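# Python cell: Example of IVF Indexing in FAISS
import faiss

# d (vector dimension), xb (database vectors), xq (query vectors), and
# k (number of neighbors) are expected to be defined already, for example
# in the FAISS setup cell.

# Set up an IVF index
nlist = 100  # number of clusters
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer that assigns vectors to clusters
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)

# Train the index (learns the nlist cluster centroids from xb)
index_ivf.train(xb)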

# Add vectors to the index
index_ivf.add(xb)

# Perform the search using IVF
index_ivf.nprobe = 10  # number of clusters to search in
D_ivf, I_ivf = index_ivf.search(xq, k)

print("Indices of nearest neighbors (IVF):", I_ivf)
print("Distances to nearest neighbors (IVF):", D_ivf)

# Markdown cell: Summary and Next Steps
display(Markdown("""
## Summary and Next Steps

### What We Covered:
- **Vector Databases**: An overview of what they are and why they're useful.
- **FAISS Setup**: How to create and index vectors, and perform similarity searches.
- **Similarity Metrics**: Explored common metrics like L2 distance and cosine similarity.
- **Indexing Algorithms**: Discussed and experimented with different indexing methods.

### Next Steps:
- Continue exploring other vector databases like Milvus or Pinecone.
- Try indexing larger datasets and measure the impact on search performance.
- Experiment with pre- and post-filtering techniques to refine search results.

In the next part of the workshop, we'll dive into using LLMs for generating responses based on retrieved data.
"""))
