# Real-World Document Embedding for Fast Search

Vector databases have become a popular way to query text data. But how do we get a quick-and-dirty prototype started?

In this demo, we'll pull data from an internal source, as you would at your workplace or on a larger dev project, and run vector search inside this notebook.


#### Goal: Take real-world document data and make fast queries with Vector Search

## Contents
- Baseline Architecture
    - Vector DB common questions
- Implementation
    - Data
    - Code


Our workflow will be as follows: MongoDB data -> OpenAI embedding -> Search with Milvus

### Baseline Architecture

- Data Source: MongoDB
- Embeddings Generator: OpenAI Embeddings API
- Vector DB/Vector Search: Milvus Lite

![VDB Basic Architecture](./vdb.png)

#### Questions we need to answer:
- What's the point of this?
- How is this different from how we have been doing search?
- Do I need to store all of my data in a vector database?
- Does this save time, money, effort?

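**What's the point?** The key idea is that we run search against vector embeddings instead of against the data itself.

**How is this different?** Traditional search systems, and even many modern ones, process parts of the dataset itself at query time. With vector search, the query is embedded once and compared against precomputed vectors.

**Do I need to store all of my data?** No, we just need to store the vector embeddings. The simplest approach is to store a pointer or index to the original data alongside each embedding.

**Does this save time, money, effort?** Vector search using approximate nearest neighbors (ANN) has a much lower time complexity than an exhaustive scan, which on large datasets can mean __orders of magnitude__ less compute per query.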
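
To make "search based on vector embeddings" concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search in plain Python. It is not how Milvus works internally. The document IDs, the three-dimensional vectors, and the helper names `l2_distance` and `exact_search` are made up for illustration. Note how each vector is stored next to a pointer to its document rather than the document itself.

```python
import math

# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
# Each entry pairs a pointer to the original document (a made-up ID) with its vector.
index = [
    ("doc-ai", [0.9, 0.1, 0.0]),
    ("doc-finance", [0.1, 0.9, 0.1]),
    ("doc-events", [0.0, 0.2, 0.9]),
]

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_search(query_vec, index, top_k=2):
    """Compare the query against every stored vector and return the closest IDs."""
    scored = [(doc_id, l2_distance(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:top_k]

results = exact_search([0.8, 0.2, 0.1], index)
print(results)
```

This scan touches every stored vector, so its cost grows linearly with the collection. ANN indexes exist to avoid exactly that.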

## Implementation

### Dataset: TechCrunch RSS News Feed

We have a prepared dataset of news from TechCrunch's RSS Feed, and we load it into a Pandas dataframe for readability. **fulltext** is the cleaned text data from each news article.

#### Let's pull our document data from MongoDB

In [1]:
from pymongo.mongo_client import MongoClient
from tqdm import tqdm
import pandas as pd
import openai
import pymilvus
import milvus
import os

In [2]:
#Get posts
mc = MongoClient(os.getenv('MONGO_URI'))
posts = pd.json_normalize(list(mc['data']['rss_data'].find({}))).rename(columns={'_id': 'title'})
#How the data is structured
posts.head()

Unnamed: 0,title,link,published,description,fulltext
0,AI leaders warn Senate of twin risks: Moving t...,https://techcrunch.com/2023/07/25/ai-leaders-w...,"Tue, 25 Jul 2023 23:23:15 +0000",Leaders from the AI research world appeared be...,Leaders from the AI research world appeared be...
1,Exclusive: Goodwater Capital snags $1B across ...,https://techcrunch.com/2023/07/25/exclusive-go...,"Tue, 25 Jul 2023 22:30:57 +0000","Goodwater, one of comparatively few firms to f...","Goodwater, one of comparatively few firms to f..."
2,Main Sequence hits first close of $450M AUD fo...,https://techcrunch.com/2023/07/25/main-sequenc...,"Tue, 25 Jul 2023 22:16:13 +0000","Main Sequence, an Australian venture capital f...","Main Sequence, an Australian venture capital f..."
3,Snap’s revenue woes continue but earnings yiel...,https://techcrunch.com/2023/07/25/snap-snapcha...,"Tue, 25 Jul 2023 21:50:08 +0000",Snap just reported its quarterly earnings and ...,Snap just reported its quarterly earnings and ...
4,Amazon will now warn consumers of recalls and ...,https://techcrunch.com/2023/07/25/amazon-will-...,"Tue, 25 Jul 2023 20:40:02 +0000",Amazon today introduced a new feature that wil...,Amazon today introduced a new feature that wil...


#### Generate Embeddings

We use OpenAI's embeddings, but in practice you can use any text embedding model. Wrapping the loop in a tqdm progress bar is worthwhile, since it shows how long each iteration takes.

In [3]:
openai.api_key = os.getenv("OPENAI_KEY")
model_id = "text-similarity-davinci-001"

def get_embedding(text: str):
    return openai.Embedding.create(input=text, model=model_id)['data'][0]['embedding']

embeddings = []
for idx, row in tqdm(posts.iterrows()):
    embedding = get_embedding(row['fulltext'])
    embeddings.append(embedding)

20it [00:15,  1.25it/s]
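
Since any text embedding model can be used, one option is to pass the embedding function in as a parameter so it is easy to swap. The sketch below assumes nothing about OpenAI's API. `toy_embedding` is a hypothetical stand-in that hashes the text into a fixed-length vector, which is useful for dry-running the pipeline without an API key but carries no semantic meaning. `embed_all` is likewise an illustrative name.

```python
import hashlib

def toy_embedding(text, dim=8):
    """Deterministic stand-in embedder (NOT semantically meaningful).

    Hashes the text with SHA-256 and scales the first `dim` bytes to [0, 1].
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    return [byte / 255.0 for byte in digest[:dim]]

def embed_all(texts, embed_fn):
    """Apply any embedding function to a list of texts."""
    return [embed_fn(text) for text in texts]

vectors = embed_all(["first article", "second article"], toy_embedding)
print(len(vectors), len(vectors[0]))
```

In the notebook, `get_embedding` could be passed as `embed_fn` in the same way.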


In [4]:
#Add embeddings to our dataframe
posts['embeddings'] = embeddings
posts.head()

Unnamed: 0,title,link,published,description,fulltext,embeddings
0,AI leaders warn Senate of twin risks: Moving t...,https://techcrunch.com/2023/07/25/ai-leaders-w...,"Tue, 25 Jul 2023 23:23:15 +0000",Leaders from the AI research world appeared be...,Leaders from the AI research world appeared be...,"[0.008354601450264454, 0.004420607816427946, -..."
1,Exclusive: Goodwater Capital snags $1B across ...,https://techcrunch.com/2023/07/25/exclusive-go...,"Tue, 25 Jul 2023 22:30:57 +0000","Goodwater, one of comparatively few firms to f...","Goodwater, one of comparatively few firms to f...","[0.007502191700041294, 0.0019230301259085536, ..."
2,Main Sequence hits first close of $450M AUD fo...,https://techcrunch.com/2023/07/25/main-sequenc...,"Tue, 25 Jul 2023 22:16:13 +0000","Main Sequence, an Australian venture capital f...","Main Sequence, an Australian venture capital f...","[0.007211583200842142, -0.004153947811573744, ..."
3,Snap’s revenue woes continue but earnings yiel...,https://techcrunch.com/2023/07/25/snap-snapcha...,"Tue, 25 Jul 2023 21:50:08 +0000",Snap just reported its quarterly earnings and ...,Snap just reported its quarterly earnings and ...,"[-0.009737272746860981, 0.011124690994620323, ..."
4,Amazon will now warn consumers of recalls and ...,https://techcrunch.com/2023/07/25/amazon-will-...,"Tue, 25 Jul 2023 20:40:02 +0000",Amazon today introduced a new feature that wil...,Amazon today introduced a new feature that wil...,"[0.004565441980957985, 0.001533430884592235, -..."
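
The vector collection created in the next section needs to know the embedding length up front. Rather than waiting for an insert error to reveal it, one option is to read the length from the embeddings themselves. This is a minimal sketch: `embedding_dimension` and `sample_embeddings` are hypothetical names, and in the notebook the input would be the `embeddings` list.

```python
def embedding_dimension(embeddings):
    """Return the shared length of all embeddings, or raise if they disagree."""
    if not embeddings:
        raise ValueError("no embeddings to inspect")
    lengths = {len(vec) for vec in embeddings}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(lengths)}")
    return lengths.pop()

# Stand-in vectors for illustration only.
sample_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
dimension = embedding_dimension(sample_embeddings)
print(dimension)
```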


#### VectorDB initialization

In [24]:
#Refresh the Milvus instance if one already exists.
#On a fresh kernel default_server is not defined yet, so guard against NameError.
try:
    default_server.stop()
    default_server.cleanup()
    connections.disconnect('default')
    del default_server
except NameError:
    pass
from milvus import default_server
from pymilvus import connections, utility
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection
from datetime import datetime

COLLECTION_NAME = "RSS_POST_SEARCH"
#The dimension must match the length of the vectors your embedding model returns.
#The quickest way I found was to try adding embeddings to the Collection object first;
#the resulting error message tells you which dimension your Milvus instance expects.
DIMENSION = 12288
BATCH_SIZE = 64
TOPK = 3

#Time how long it takes to start Milvus instance
start = datetime.now()

default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)
    
    
# Each entity is inserted as (title, link, date, embedding); the id is auto-generated
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="link", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)


#End time
total = datetime.now() - start
print("Milvus instance started in " + str(total) + " seconds")



    __  _________ _   ____  ______
   /  |/  /  _/ /| | / / / / / __/
  / /|_/ // // /_| |/ / /_/ /\ \
 /_/  /_/___/____/___/\____/___/ {Lite}

 Welcome to use Milvus!

 Version:   v2.2.12-lite
 Process:   27402
 Started:   2023-07-26 13:48:26
 Config:    /home/elliot/.milvus.io/milvus-server/2.2.12/configs/milvus.yaml
 Logs:      /home/elliot/.milvus.io/milvus-server/2.2.12/logs

 Ctrl+C to exit ...
Milvus instance started in 0:00:07.426961 seconds


In [25]:
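index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    #nlist is the number of clusters the index partitions vectors into. It is unrelated
    #to the embedding dimension; DIMENSION is reused here only as a convenient number.
    "params": {"nlist": DIMENSION},
}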

collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
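
The `nlist` and `nprobe` parameters are easier to reason about with a toy inverted-file (IVF) index in mind. The sketch below is a simplified illustration in plain Python, not Milvus's implementation: the centroids are hard-coded rather than learned by clustering, and all names (`build_ivf`, `ivf_search`, the sample vectors) are hypothetical. The number of centroids plays the role of `nlist`, and `nprobe` is how many buckets are scanned per query.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector ID to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        buckets[nearest].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=2):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [vid for i in probe_order[:nprobe] for vid in buckets[i]]
    ranked = sorted(candidates, key=lambda vid: l2(query, vectors[vid]))
    return ranked[:top_k]

vectors = {
    "a": [0.0, 0.1], "b": [0.1, 0.0],  # close to centroid 0
    "c": [1.0, 0.9], "d": [0.9, 1.0],  # close to centroid 1
}
centroids = [[0.0, 0.0], [1.0, 1.0]]  # two centroids, i.e. nlist = 2

buckets = build_ivf(vectors, centroids)
print(buckets)
print(ivf_search([0.2, 0.1], vectors, centroids, buckets, nprobe=1))
```

Raising `nprobe` scans more buckets, which trades speed for recall. With `nprobe` equal to the number of buckets, the search degenerates into a full scan.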

In [26]:
#Add DataFrame data to Milvus
for idx, row in posts.iterrows():
    data = [[row['title']], [row['link']], [str(row['published'])], [row['embeddings']]]
    
    collection.insert(data)
collection.flush()
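
The loop above inserts one row per call and leaves `BATCH_SIZE` unused. One way to group rows into larger inserts is sketched below. It only builds the column-ordered lists (titles, links, dates, embeddings) in the same shape the loop above passes to `collection.insert`; the helper name `column_batches` and the sample rows are hypothetical, and no Milvus call is made here.

```python
def column_batches(rows, batch_size):
    """Yield column-ordered batches: [titles, links, dates, embeddings]."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield [
            [r["title"] for r in chunk],
            [r["link"] for r in chunk],
            [str(r["published"]) for r in chunk],
            [r["embeddings"] for r in chunk],
        ]

# Made-up rows standing in for the DataFrame records.
rows = [
    {
        "title": f"post {i}",
        "link": f"https://example.com/{i}",
        "published": "Tue, 25 Jul 2023",
        "embeddings": [float(i), 0.0],
    }
    for i in range(5)
]

batches = list(column_batches(rows, batch_size=2))
print([len(batch[0]) for batch in batches])
```

In the notebook, the rows could come from `posts.to_dict("records")`, with each yielded batch passed to `collection.insert` before the final `collection.flush()`.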

In [32]:
import time

#Embed our search query
query_string = 'Generative AI'
search_vec = [get_embedding(query_string)]


#Find relevant articles
start = time.time()
res = collection.search(
    data=search_vec,  # Embedded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "L2",
            "params": {"nprobe": 10}},
    limit = TOPK,  # Limit to top_k results per search
    output_fields=["title"]  # Include title field in result
)
end = time.time()

#Show results
for hits_i, hits in enumerate(res):
    print("Title:", query_string)
    print("Search Time:", end-start)
    print("Results:")
    for hit in hits:
        print( hit.entity.get("title"), "----", hit.distance)
    print()

Title: Generative AI
Search Time: 0.014770984649658203
Results:
Growth driven by AI will be the metric to watch this earnings cycle ---- 0.9384163022041321
Snap’s revenue woes continue but earnings yield a few bright spots ---- 0.9753267765045166
Buy in bulk: Take your team to TechCrunch Disrupt 2023 and save ---- 0.9970754384994507



In [33]:
#Stop the instance!
default_server.stop()
default_server.cleanup()