### Oracle AI Vector Search: Loading the Vector Store

With this notebook you can load your knowledge base into Oracle DB, then compute and store the embedding vectors.

The knowledge base (KB) is a set of txt files stored in the txt directory. This notebook:
* Reads all the txt files and splits them into chunks
* Computes the embeddings for all chunks
* Stores chunks and embeddings in the ORACLE_KNOWLEDGE table

* This demo is based on the **LangChain** integration
* Embeddings are computed with **OCI GenAI multi-lingual (Cohere) embeddings**
* Data is stored in a single table (ORACLE_KNOWLEDGE)

Afterward, you can run a similarity search and build a simple assistant on top, based on OCI GenAI.

In [1]:
import logging
from glob import glob
import pandas as pd

# to load and split txt documents
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# to compute embeddings vectors
from langchain_community.embeddings import OCIGenAIEmbeddings

# the class to integrate OCI AI Vector Search with LangChain
from oracle_vector_db_lc import OracleVectorStore

from config_private import COMPARTMENT_OCID

#### Setup

In [2]:
#
# Helper functions
#


# find the url from the file name, using references.csv
def find_ref(df, f_name):
    condition = df["file_name"] == f_name

    ref = df.loc[condition]["url"].values[0]

    return ref


# this function replaces the file name with the url in the docs metadata
# the url is read from references.csv
def set_url_in_docs(docs, df_ref):
    # shallow copy of the list: the Document objects are still modified in place
    docs = docs.copy()
    for doc in docs:
        # keep only the file name, dropping the directory path
        file_name = doc.metadata["source"]
        only_name = file_name.split("/")[-1]
        # find the url from the csv
        ref = find_ref(df_ref, only_name)

        doc.metadata["source"] = ref

    return docs
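
The helpers above assume that every txt file has a matching row in `references.csv`; if a name is missing, `.values[0]` raises an `IndexError`. Below is a minimal sketch of a more defensive lookup, assuming the CSV has the `file_name` and `url` columns used in `find_ref`. The function names `build_ref_map` and `resolve_source` and the sample data are hypothetical and not part of this notebook.

```python
import csv
import io
import os


def build_ref_map(csv_text):
    """Build a {file_name: url} dict from CSV text with file_name,url columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["file_name"]: row["url"] for row in reader}


def resolve_source(source_path, ref_map):
    """Return the url for a file path, or the bare file name if no url is known."""
    only_name = os.path.basename(source_path)
    return ref_map.get(only_name, only_name)


# hypothetical sample data, for illustration only
sample_csv = "file_name,url\nintro.txt,https://example.com/intro\n"
ref_map = build_ref_map(sample_csv)

known = resolve_source("txt/intro.txt", ref_map)
unknown = resolve_source("txt/missing.txt", ref_map)
print(known)
print(unknown)
```

Falling back to the file name keeps the load running when a reference is missing; whether that is preferable to failing fast depends on how much you rely on the url in the answers.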

In [3]:
#
# Some configurations
#

# directory where our Knowledge base is contained in txt files
TXT_DIR = "./txt"
# csv file with the columns file_name, url
REF_FILE = "references.csv"

# OCI GenAI model used for Embeddings
EMBED_MODEL = "cohere.embed-multilingual-v3.0"
ENDPOINT = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

# max length, in tokens, of the input for embeddings
MAX_LENGTH = 512

# max chunk size, in chars, for splitting
# this parameter needs to be adjusted for the Embed model (for example, lowered for Cohere)
CHUNK_SIZE = 1500

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
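
`CHUNK_SIZE` is measured in characters while `MAX_LENGTH` is measured in tokens, so the two cannot be compared directly. Below is a rough sanity check, assuming an average of about 4 characters per token. That ratio is a common rule of thumb for English text, not a property of the Cohere tokenizer, and the real value depends on language and content. The helper `estimated_tokens` is hypothetical.

```python
import math

MAX_LENGTH = 512   # max input length in tokens, as configured above
CHUNK_SIZE = 1500  # max chunk size in characters, as configured above

# ASSUMPTION: average characters per token; a heuristic, not a measured value
CHARS_PER_TOKEN = 4.0


def estimated_tokens(n_chars, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count of a text of n_chars characters."""
    return math.ceil(n_chars / chars_per_token)


est = estimated_tokens(CHUNK_SIZE)
print(f"~{est} tokens per full chunk, limit {MAX_LENGTH}")
print("fits" if est <= MAX_LENGTH else "may be truncated")
```

With a denser tokenization (for example, 2 characters per token) the same chunk would be estimated at 750 tokens and exceed the limit, which is why the chunk size may need to be lowered for some models or languages.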

In [4]:
# this is the file list containing the Knowledge base
file_list = sorted(glob(TXT_DIR + "/" + "*.txt"))

print(f"There are {len(file_list)} files to be loaded...")

There are 75 files to be loaded...


#### Load all text files and split them into chunks
Here we do some preprocessing on the txt files:
* we replace the file name in `source` with the url the txt comes from

In [5]:
# read all references (url) from the csv file
df_ref = pd.read_csv(REF_FILE)

# load the txt files (fast with TextLoader)
# the documents are not split yet at this point
origin_docs = DirectoryLoader(
    TXT_DIR, glob="**/*.txt", show_progress=True, loader_cls=TextLoader
).load()

# replace the file name with the reference (url)
origin_docs = set_url_in_docs(origin_docs, df_ref)

# split docs in chunks
text_splitter = RecursiveCharacterTextSplitter(
    # these params must be adapted to the Knowledge base
    chunk_size=CHUNK_SIZE,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

docs_splitted = text_splitter.split_documents(origin_docs)

print(f"We have splitted docs in {len(docs_splitted)} chunks...")

 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 75/76 [00:00<00:00, 4034.90it/s]

We have splitted docs in 437 chunks...
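
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this version ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters,
    where each window repeats the last chunk_overlap characters
    of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# 25 characters of toy text: "0123456789012345678901234"
text = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in both neighbouring chunks, at the cost of storing and embedding some text twice.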





#### Create the embedding model and the vector store, then load texts + embeddings in the DB

In [6]:
# clean the existing table
# be careful: do you really want to delete all the existing records?
OracleVectorStore.drop_collection(collection_name="ORACLE_KNOWLEDGE")

2024-02-27 16:48:19,325 - INFO - ORACLE_KNOWLEDGE dropped!!!


In [7]:
OracleVectorStore.create_collection(collection_name="ORACLE_KNOWLEDGE")

2024-02-27 16:48:20,973 - INFO - ORACLE_KNOWLEDGE created!!!


In [8]:
# create embedding model and then the vector store

# create the Embedding Model
embed_model = OCIGenAIEmbeddings(
    # This code was written to run in OCI Data Science with resource principal auth.
    # Outside OCI DS, use API_KEY (as done here) and provide your API keys.
    # auth_type = "RESOURCE_PRINCIPAL"
    auth_type="API_KEY",
    model_id=EMBED_MODEL,
    service_endpoint=ENDPOINT,
    compartment_id=COMPARTMENT_OCID,
)

# Here we compute the embeddings and load texts + embeddings in the DB
# computing the embeddings can take minutes
v_store = OracleVectorStore.from_documents(
    docs_splitted, embed_model, collection_name="ORACLE_KNOWLEDGE", verbose=True
)

2024-02-27 16:48:22,347 - INFO - Compute embeddings...


  0%|          | 0/5 [00:00<?, ?it/s]

2024-02-27 16:48:35,063 - INFO - Saving texts, embeddings to DB...


  0%|          | 0/437 [00:00<?, ?it/s]

2024-02-27 16:48:46,904 - INFO - Tot. errors in save_embeddings: 0
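
The first progress bar counts 5 steps for 437 chunks, which suggests that the embeddings are computed in batches. Below is a sketch of that batching. It assumes a batch size of 96, which is consistent with 5 batches, but the value actually used by `OracleVectorStore` is not shown in this notebook. `fake_embed` is a hypothetical stand-in for the real OCI call.

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Hypothetical stand-in for the embedding call: one 2-d vector per text."""
    return [[float(len(t)), 1.0] for t in texts]


BATCH_SIZE = 96  # ASSUMPTION: not taken from the notebook

texts = [f"chunk {i}" for i in range(437)]
batches = list(make_batches(texts, BATCH_SIZE))

# embed batch by batch, keeping the vectors in the same order as the texts
embeddings = []
for batch in batches:
    embeddings.extend(fake_embed(batch))

print(len(batches), len(embeddings))
```

Batching keeps each request within the service limits and bounds the work lost if a single call fails.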


#### Run a test query

In [9]:
# k is the number of docs we want to retrieve
retriever = v_store.as_retriever(search_kwargs={"k": 5})

In [10]:
question = "What is Oracle Strategy for Generative AI?"

result_docs = retriever.get_relevant_documents(question)

2024-02-27 16:48:49,464 - INFO - top_k: 5
2024-02-27 16:48:49,465 - INFO - 
2024-02-27 16:48:50,480 - INFO - select: select C.id, C.CHUNK, C.REF, 
                            ROUND(VECTOR_DISTANCE(C.VEC, :1, DOT), 3) as d 
                            from ORACLE_KNOWLEDGE C
                            order by d
                            FETCH FIRST 5 ROWS ONLY
2024-02-27 16:48:50,733 - INFO - Query duration: 0.4 sec.
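
The logged query orders the rows by `VECTOR_DISTANCE(..., DOT)` in ascending order and keeps the first 5. Assuming, as Oracle's documentation describes, that the `DOT` metric is the negated dot product, the smallest distance corresponds to the most similar chunk. Below is a pure-Python sketch of the same ranking on toy data. The names `dot_distance`, `top_k` and the sample vectors are hypothetical, and a real table would hold much longer embedding vectors.

```python
def dot_distance(a, b):
    """Negated dot product: smaller means more similar."""
    return -sum(x * y for x, y in zip(a, b))


def top_k(query_vec, rows, k):
    """Return the k (distance, row_id) pairs with the smallest distance.

    rows is a list of (row_id, vector) pairs, standing in for the table.
    """
    scored = [
        (round(dot_distance(vec, query_vec), 3), row_id)
        for row_id, vec in rows
    ]
    return sorted(scored)[:k]


# hypothetical toy data, for illustration only
rows = [
    (1, [1.0, 0.0]),
    (2, [0.0, 1.0]),
    (3, [0.5, 0.5]),
]
result = top_k([1.0, 0.0], rows, k=2)
print(result)
```

This sketch scans every row, like the query in the log, which has no index hint; for larger tables a vector index would normally be used to avoid the full scan.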


In [11]:
for doc in result_docs:
    print(doc.page_content)
    print(doc.metadata["source"])
    print("----------------------------")
    print("")

I mentioned earlier that we took a holistic approach to generative AI as we thought through the complete picture of what enterprises truly need to successfully implement generative AI. But beyond that, we have some core tenets to help ensure that our new offerings will be as valuable as possible for you:
We’re providing enterprise-focused models that are high performing and cost-effective, allowing for many uses cases and efficient fine tuning. We’re also increasingly adapting models to real-world enterprise scenarios, and performing specialized training on large language models with Oracle’s own proprietary knowledge and insights to make them better for business—all with access to best-in-class GPUs and high-performance cluster networking.
Oracle meets you where you are in your generative AI journey with a variety of embedded and managed services features across our infrastructure layer, platform services, and business applications. Working with AI may seem challenging—but it’s dramat