# Retrieval Augmented Generation (RAG)
### Knowledge Base Ingest into Vector Database

- Given a local directory of proprietary data files (in this lab, 1000 contracts from different universities in the USA), we generate an embedding for each file with an open-source pretrained model.

- The embeddings are stored along with document IDs in a vector database to enable semantic search.

In [None]:
# Change cwd to home
%cd ~

In [None]:
# Import dependencies
from milvus import default_server
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
from tqdm import tqdm
import subprocess
import sys

sys.path.append("./code")
import utils.model_embedding_utils as model_embedding

import os
from pathlib import Path

## What is Milvus vector database?

- Milvus was created in 2019 with a singular goal: store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models.

- As a database specifically designed to handle queries over input vectors, it is capable of indexing vectors on a trillion scale.

- Milvus is designed from the bottom up to handle embedding vectors converted from unstructured data.

<img src="../../images/milvus_workflow.jpeg" alt="Milvus Workflow" />

In [None]:
# Reset the vector database files
# Pass the command as an argument list so no shell is needed
print(subprocess.run(["rm", "-rf", "milvus-data"]))

# Start the server
default_server.set_base_dir('milvus-data')
default_server.start()

In [None]:
# Check if the server is running
connections.connect(alias='default', host='localhost', port=default_server.listen_port)   
print(utility.get_server_version())

## How does it work?
As the Internet grew and evolved, unstructured data became more and more common: emails, papers, IoT sensor data, Facebook photos, protein structures, and much more. For computers to understand and process unstructured data, it is converted into vectors using embedding techniques.

- **Milvus** stores and indexes these vectors and can measure how closely two vectors are related by calculating their *similarity distance*.

- If the **two embedding vectors are very similar**, it means that the **original data sources are similar as well**.
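
As a rough illustration of what a similarity score between two vectors looks like, the sketch below computes the inner product and the cosine similarity in plain Python. The collection created later in this notebook uses the inner product (`IP`) metric. The three toy vectors and the helper names `inner_product` and `cosine_similarity` are made up for this example and are not part of Milvus.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings (real embeddings here have 384 dimensions)
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]

# doc_a and doc_b point in nearly the same direction, doc_c does not
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

When all vectors are normalized to unit length, ranking by inner product and ranking by cosine similarity give the same order.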

### Key Concepts: 
- **Embedding vectors:** An embedding vector is a feature abstraction of unstructured data. Mathematically speaking, an embedding vector is an array of floating-point numbers or binary values.

- **Vector similarity search:** The process of comparing a query vector against the vectors in a database to find those most similar to it. Approximate nearest neighbor (ANN) search algorithms are used to accelerate the search.
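
To make the idea concrete, here is a minimal sketch of an exact (brute-force) similarity search: it scores every stored vector against the query and keeps the best matches. ANN algorithms exist to avoid exactly this full scan on large collections. The `toy_db` contents, the file names, and the function `brute_force_search` are hypothetical and only serve the illustration.

```python
def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def brute_force_search(query, vectors, top_k=2):
    """Score every stored vector against the query and return the top_k (id, score) pairs."""
    scored = [(doc_id, inner_product(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:top_k]

# Hypothetical document IDs mapped to toy 3-dimensional embeddings
toy_db = {
    "contract_a.txt": [0.9, 0.1, 0.0],
    "contract_b.txt": [0.7, 0.3, 0.1],
    "contract_c.txt": [0.0, 0.2, 0.9],
}

results = brute_force_search([1.0, 0.0, 0.0], toy_db, top_k=2)
for doc_id, score in results:
    print(doc_id, score)
```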

*For more info: https://milvus.io/docs/overview.md*

In [None]:
def create_milvus_collection(collection_name, dim):
    """Create (or recreate) a Milvus collection holding one embedding
    of dimension `dim` per document, keyed by file path.

    Milvus can divide the bulk of vector data into a small number of
    partitions. Search and other operations can then be limited to one
    partition to improve performance. A collection consists of one or
    more partitions. When a new collection is created, Milvus creates
    a default partition named _default."""
    
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    fields = [
        FieldSchema(
            name='relativefilepath',
            dtype=DataType.VARCHAR,
            description='file path relative to root directory ',
            max_length=1000,
            is_primary=True,
            auto_id=False,
        ),
        FieldSchema(
            name='embedding',
            dtype=DataType.FLOAT_VECTOR,
            description='embedding vectors',
            dim=dim,
        )
    ]
    schema = CollectionSchema(fields=fields, description='contract document embeddings')
    collection = Collection(name=collection_name, schema=schema)

    # create IVF_FLAT index for collection.
    # Best suited for scenarios that seek an ideal balance between accuracy and query speed
    index_params = {
        'metric_type':'IP',
        'index_type': "IVF_FLAT",
        'params': {"nlist":2048}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection
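
The `IVF_FLAT` index above groups vectors into `nlist` clusters so that a search only has to scan the clusters closest to the query. The sketch below is a toy, pure-Python illustration of that inverted-file idea, not Milvus's implementation: it assumes two hand-picked centroids instead of learning them with k-means, and the names `build_ivf` and `ivf_search` are hypothetical. The `nprobe` argument mirrors the Milvus search parameter that controls how many clusters are scanned.

```python
def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Put each vector into the bucket of the centroid it matches best."""
    buckets = {i: [] for i in range(len(centroids))}
    for doc_id, vec in vectors.items():
        best = max(range(len(centroids)),
                   key=lambda i: inner_product(vec, centroids[i]))
        buckets[best].append((doc_id, vec))
    return buckets

def ivf_search(query, centroids, buckets, nprobe=1, top_k=1):
    """Scan only the nprobe buckets whose centroids best match the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: inner_product(query, centroids[i]),
                    reverse=True)
    candidates = [item for i in ranked[:nprobe] for item in buckets[i]]
    scored = [(doc_id, inner_product(query, vec)) for doc_id, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Assumption: fixed centroids, i.e. nlist = 2 (a real index learns them from the data)
centroids = [[1.0, 0.0], [0.0, 1.0]]
vectors = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [0.2, 0.8]}

buckets = build_ivf(vectors, centroids)
hits = ivf_search([1.0, 0.1], centroids, buckets, nprobe=1, top_k=2)
print(hits)
```

A larger `nprobe` scans more buckets, which costs more time but is less likely to miss a close vector that landed in a neighboring cluster.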

In [None]:
# Create/Recreate the Milvus collection
collection_name = 'credit_card_ml_contracts'
collection = create_milvus_collection(collection_name, 384)

print("Milvus database is up and collection is created")

In [None]:
# Create an embedding for given text/doc and insert it into Milvus Vector DB
def insert_embedding(collection, id_path, text):
    embedding = model_embedding.get_embeddings(text)
    data = [[id_path], [embedding]]
    collection.insert(data)

In [None]:
# Read KB documents in ./data directory and insert embeddings into Vector DB for each doc
# The default embeddings generation model specified in this AMP only generates embeddings for the first 256 tokens of text.
doc_dir = './data'
N_FILES = 1000
for ix, file in tqdm(enumerate(Path(doc_dir).glob('**/*.txt'))):
    if ix == N_FILES:
        break
    with open(file, "r") as f:  # Open file in read mode
        text = f.read()
    # Use the absolute file path as the document ID (primary key)
    insert_embedding(collection, os.path.abspath(file), text)

In [None]:
# Sealing all entities in the current collection
# Any insertion after a flush operation results in generating new segments
collection.flush()
print('Total number of inserted embeddings is {}.'.format(collection.num_entities))
print('Finished loading Knowledge Base embeddings into Milvus')

In [None]:
default_server.stop()