## Using LanceDB for Embeddings Search

This example takes you through a simple flow to download some data, embed it, and then index and search it using LanceDB. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.

### What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.

### Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.


### Demo Flow
The demo flow is:
- **Setup**: Import packages and set any required variables
- **Load data**: Load a dataset that has already been embedded using OpenAI embeddings
- **LanceDB**
    - *Setup*: Here we'll set up the Python client for LanceDB. For more details go [here](https://lancedb.github.io/lancedb/basic/)
    - *Index Data*: We'll create a table and index it for __titles__
    - *Search Data*: Run a few example queries with various goals in mind.

Once you've run through this example you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

## LanceDB

We'll now look at **LanceDB**, an open-source, developer-friendly, serverless vector database for AI applications, designed to make data management for LLMs frictionless. LanceDB offers native Python and JS support.

In this section, we will:
- Connect to the database
- Create a Table and load the data
- Query the Table with some similarity searches

## Setup

Import the required libraries and set the embedding model that we'd like to use.

In [None]:
# Install the LanceDB client
!pip install lancedb

# Install wget to download the zip file
!pip install wget

In [None]:
import openai

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

# LanceDB's client library for Python
import lancedb

# Set to text-embedding-ada-002 by default; this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-ada-002"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Load data

In this section we'll load embedded data that we prepared ahead of this session.

In [None]:
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

In [None]:
import zipfile
# Extract into the same "data" directory that the CSV is read from in the next cell
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")
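
The directory passed to `extractall` has to match the path later given to `pd.read_csv`, otherwise the CSV will not be found. The following is a minimal sketch of that check using only the standard library. The archive name `sample.zip` and its contents are hypothetical stand-ins for the real ~700 MB download.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny stand-in archive (hypothetical file name and contents)
    archive_path = os.path.join(workdir, "sample.zip")
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("sample.csv", "id,title\n1,April\n")

    # Extract into a "data" directory and confirm the CSV landed where we expect
    data_dir = os.path.join(workdir, "data")
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(data_dir)

    extracted = sorted(os.listdir(data_dir))
    csv_found = os.path.exists(os.path.join(data_dir, "sample.csv"))

print(extracted, csv_found)
```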

In [None]:
article_df = pd.read_csv('data/vector_database_wikipedia_articles_embedded.csv')

In [None]:
article_df.head()

In [None]:
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
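
The `literal_eval` step is needed because a CSV file stores each embedding as text, so pandas reads the vector columns back as strings rather than lists of floats. As a small illustration, using a made-up three-element vector rather than a real embedding:

```python
from ast import literal_eval

# Hypothetical CSV cell: a vector serialized as text, as it would be read from the file
raw_cell = "[0.1, -0.25, 0.5]"

# literal_eval safely parses the Python literal into a real list of floats
vector = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__)
print(len(vector), vector[0])
```

Unlike `eval`, `literal_eval` only accepts Python literals, which makes it a safer choice for parsing data loaded from a file.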

In [None]:
article_df.info(show_counts=True)

## LanceDB

### Connect to the database

Connecting to a LanceDB database is super simple:

In [None]:
uri = "data/sample-lancedb"
db = lancedb.connect(uri)

LanceDB will create the directory if it doesn't exist (including parent directories).

If you need a reminder of the uri, use the `db.uri` property.

### Create the LanceDB Table

In LanceDB, the primary abstraction you'll use to work with your data is a Table. A Table is designed to store large numbers of columns and huge quantities of data. For those interested, LanceDB is columnar and uses [Lance](https://github.com/lancedb/lance), an open data format, to store data.


Now we're ready to save the data and create a new LanceDB table. We want to use the embeddings generated from the Wikipedia article titles.

In [None]:
# LanceDB searches the column named "vector" by default, so store the title embeddings under that name
article_df.rename(columns={"title_vector":"vector"}, inplace=True)

In [None]:
table_name = "wikipedia"

if table_name not in db.table_names():
    tbl = db.create_table(table_name, data=article_df)
else:
    tbl = db.open_table(table_name)
len(tbl)

### Creating embeddings

To create embeddings for the query, we'll call the OpenAI embeddings API. Make sure you have an API key set up and that your account has available credits.

Note that the OpenAI library will try to read your API key from the `OPENAI_API_KEY` environment variable. If you haven't already, set this environment variable by following [these instructions](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

Calls to the OpenAI API can fail or time out, so LanceDB's API provides retry and throttling features behind the scenes to make these calls easier. The function below calls the OpenAI API directly.

In [None]:
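client = openai.OpenAI()  # requires openai>=1.0; reads OPENAI_API_KEY from the environment

def embed_func(c):
    # Return one embedding vector per input string
    rs = client.embeddings.create(input=c, model=EMBEDDING_MODEL)
    return [record.embedding for record in rs.data]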
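
To illustrate the kind of retry behavior mentioned above, here is a minimal sketch of a retry wrapper with exponential backoff. It is not LanceDB's implementation: `with_retries` and `flaky_embed` are hypothetical names introduced only for this example, and the flaky function simulates timeouts instead of calling the OpenAI API.

```python
import time

def with_retries(func, max_attempts=3, base_delay=0.0):
    """Call func(); on an exception, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a flaky embedding call: fails twice, then succeeds
calls = {"count": 0}

def flaky_embed():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return [0.1, 0.2, 0.3]

embedding = with_retries(flaky_embed, max_attempts=3)
print(calls["count"], embedding)
```

In practice you would likely set `base_delay` to a second or so and wrap the real call, for example `with_retries(lambda: embed_func(query))`.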

### Query the Table
LanceDB searches work without a vector index. If no vector index has been created, LanceDB will simply brute-force scan the vector column and compute the distance to each vector.
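
To make the idea of a brute-force scan concrete, here is a minimal sketch in plain Python. It assumes Euclidean (L2) distance and uses made-up two-dimensional vectors. `brute_force_search` is a hypothetical helper for illustration, and LanceDB's own implementation is considerably more optimized.

```python
import math

def brute_force_search(query, vectors, top_k=2):
    """Return (index, distance) pairs for the top_k vectors closest to query."""
    # Compute the distance from the query to every stored vector
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    # Sort by distance, smallest first, and keep the closest top_k
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
hits = brute_force_search([1.0, 1.0], toy_vectors, top_k=2)
print(hits)
```

The cost grows linearly with the number of stored vectors, which is why an index becomes worthwhile for large tables.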

In [None]:
def query_article(query, tbl, top_k=5):
    # Create vector embeddings based on the user query
    emb = embed_func(query)[0]

    # Search the table for the top_k most similar results and return them as a pandas DataFrame
    df = tbl.search(emb).limit(top_k).to_pandas()

    return df

Let's try it out now with a few sample queries.

In [None]:
results_1 = query_article("Important battles related to the American Revolutionary War", tbl, 10)
results_1

In [None]:
results_2 = query_article("Coolest mammals native to Asia", tbl, 7)
results_2

In [None]:
results_3 = query_article("Important traits for an entrepreneur", tbl)
results_3