### Oracle AI Vector Search: Loading the Vector Store

This notebook loads your Knowledge Base into Oracle AI Vector Search:
it computes the embedding vectors and stores them in the database.

The Knowledge Base (KB) is a set of PDF files stored in a directory. This notebook:
* Reads all the PDF files and splits them into chunks
* Computes the embeddings for all chunks
* Stores chunks and embeddings in the **ORACLE_KNOWLEDGE** table

Notes:
* The demo is based on the **LangChain** integration
* Embeddings are computed with the **OCI GenAI multilingual (Cohere) embeddings** model
* Data is stored in a single table

Afterward, you can run a similarity search and an assistant based on OCI GenAI.

In [1]:
import logging
from glob import glob
from tqdm.auto import tqdm

import oracledb

# for loading and splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# to compute embeddings vectors
from oci_cohere_embeddings_utils import OCIGenAIEmbeddingsWithBatch

from langchain_community.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy

# private information
from config_private import COMPARTMENT_ID, DB_USER, DB_PWD, DB_HOST_IP, DB_SERVICE
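
The `config_private` module is not shown in this notebook. A minimal sketch of what it could contain follows: the variable names come from the import above, while every value is a placeholder assumption to be replaced with your own settings. Keep the real file out of version control.

```python
# config_private.py (hypothetical contents; all values are placeholders)

# OCID of the OCI compartment used for the GenAI service
COMPARTMENT_ID = "ocid1.compartment.oc1..example"

# Oracle Database credentials and network location
DB_USER = "vector_user"
DB_PWD = "change_me"
DB_HOST_IP = "192.0.2.10"  # documentation-only IP address
DB_SERVICE = "example_service_name"
```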

#### Setup

In [2]:
#
# Some configurations
#

# directory containing the PDF files of our Knowledge Base
BOOKS_DIR = "./books"

CHUNK_SIZE = 1500
CHUNK_OVERLAP = 50

# embeddings model: we're using the OCI GenAI multilingual Cohere model
OCI_EMBED_MODEL = "cohere.embed-multilingual-v3.0"
ENDPOINT = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

# to connect to the DB
# 1521 is the default listener port; change it if yours differs
dsn = f"{DB_HOST_IP}:1521/{DB_SERVICE}"

# for embeddings we're using the extension that handles batching
embed_model = OCIGenAIEmbeddingsWithBatch(
    auth_type="API_KEY",
    model_id=OCI_EMBED_MODEL,
    service_endpoint=ENDPOINT,
    compartment_id=COMPARTMENT_ID,
)

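# Configure logging: attach a console handler so that INFO messages are shown
logger = logging.getLogger("ConsoleLogger")
logger.setLevel(logging.INFO)

if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    # avoid duplicated lines if the root logger is also configured
    logger.propagate = False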
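
The setup cell mentions an extension that handles batching. Its implementation is not shown in this notebook; the sketch below only illustrates the general idea, assuming the embedding service accepts a limited number of texts per request. The function name `embed_in_batches`, the `embed_fn` callable, and the batch size of 90 are assumptions made for illustration, not the actual `OCIGenAIEmbeddingsWithBatch` code.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 90,  # assumed per-request limit, not taken from the source
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # stand-in for a real embedding call: a one-dimensional "vector" per text
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches([f"chunk {i}" for i in range(200)], fake_embed)
print(len(vectors))  # 200
```

The order of the returned vectors matches the order of the input texts, which matters when each vector is later stored next to its chunk.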

In [3]:
# helper functions
def get_recursive_text_splitter():
    """
    return a recursive text splitter
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter


def load_books_and_split(books_dir) -> list:
    """
    load a set of books from books_dir and split in chunks
    """
    logger.info("Loading documents from %s...", books_dir)

    text_splitter = get_recursive_text_splitter()

    books_list = sorted(glob(books_dir + "/*.pdf"))

    logger.info("Loading books: ")
    for book in books_list:
        logger.info("* %s", book)

    docs = []

    for book in tqdm(books_list):
        loader = PyPDFLoader(file_path=book)

        docs += loader.load_and_split(text_splitter=text_splitter)

    logger.info("Loaded %s chunks of text...", len(docs))

    return docs
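
To see what `CHUNK_SIZE` and `CHUNK_OVERLAP` control, here is a deliberately simplified sketch: a fixed-size sliding window in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraphs and newlines. The helper name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The last character of each chunk reappears as the first character of the next one. With the notebook's settings (1500 and 50), the overlap is meant to preserve some context across chunk boundaries.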

In [4]:
# this is the file list containing the Knowledge base
file_list = sorted(glob(BOOKS_DIR + "/" + "*.pdf"))

logger.info("There are %s files to be loaded...", len(file_list))
logger.info("")
for f_name in file_list:
    logger.info(f_name)

2024-06-06 16:07:23,546 - INFO - There are 6 files to be loaded...
2024-06-06 16:07:23,549 - INFO - 
2024-06-06 16:07:23,551 - INFO - ./books/Solution_Definition_LangChainRAG_2.1.pdf
2024-06-06 16:07:23,552 - INFO - ./books/database-concepts.pdf
2024-06-06 16:07:23,553 - INFO - ./books/database-security-assessment-tool-user-guide_3.1.pdf
2024-06-06 16:07:23,553 - INFO - ./books/high-availability-23c.pdf
2024-06-06 16:07:23,554 - INFO - ./books/oracle-ai-vector-search-users-guide.pdf
2024-06-06 16:07:23,555 - INFO - ./books/oracle-database-23c-new-features-guide.pdf


#### Load all files and split them into chunks

In [5]:
docs = load_books_and_split(BOOKS_DIR)

2024-06-06 16:07:51,201 - INFO - Loading documents from ./books...
2024-06-06 16:07:51,202 - INFO - Loading books: 
2024-06-06 16:07:51,205 - INFO - * ./books/Solution_Definition_LangChainRAG_2.1.pdf
2024-06-06 16:07:51,207 - INFO - * ./books/database-concepts.pdf
2024-06-06 16:07:51,208 - INFO - * ./books/database-security-assessment-tool-user-guide_3.1.pdf
2024-06-06 16:07:51,208 - INFO - * ./books/high-availability-23c.pdf
2024-06-06 16:07:51,209 - INFO - * ./books/oracle-ai-vector-search-users-guide.pdf
2024-06-06 16:07:51,210 - INFO - * ./books/oracle-database-23c-new-features-guide.pdf


2024-06-06 16:08:06,713 - INFO - Loaded 3332 chunks of text...
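
A quick sanity check after loading is to count how many chunks each file produced. The sketch below assumes each chunk exposes a `metadata` dictionary with a `source` key, which is what `PyPDFLoader` sets. The helper name `chunks_per_source` is hypothetical, and `SimpleNamespace` objects stand in for real documents so the example runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def chunks_per_source(docs) -> Counter:
    """Count chunks grouped by the 'source' entry of their metadata."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# stand-in documents (illustrative data, not real notebook output)
demo_docs = [
    SimpleNamespace(metadata={"source": "./books/a.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "./books/a.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "./books/b.pdf", "page": 0}),
]

for source, count in sorted(chunks_per_source(demo_docs).items()):
    print(f"{source}: {count} chunks")
```

In the notebook, calling `chunks_per_source(docs)` would give the per-file breakdown of the loaded chunks.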


#### Create the Vector Store and load chunks + embeddings into the DB

In [6]:
try:
    # we need to provide a connection as input to OracleVS
    connection = oracledb.connect(user=DB_USER, password=DB_PWD, dsn=dsn)
    logger.info("Connection successful!")

    # here we are loading all the texts and embeddings
    logger.info("Loading in OracleVS...")

    v_store = OracleVS.from_documents(
        docs,
        embed_model,
        client=connection,
        table_name="ORACLE_KNOWLEDGE",
        distance_strategy=DistanceStrategy.COSINE,
    )

    logger.info("Loading completed!")

except Exception as e:
    logger.error("Connection or loading failed!")
    logger.error(e)

2024-06-06 16:08:15,018 - INFO - Connection successful!
2024-06-06 16:08:15,020 - INFO - Loading in OracleVS...


2024-06-06 16:09:16,046 - INFO - Loading completed!


#### Run a test query

In [7]:
# k is the number of docs we want to retrieve
try:
    connection = oracledb.connect(user=DB_USER, password=DB_PWD, dsn=dsn)
    logger.info("Connection successful!")

    # get a new instance of OracleVS for the existing table
    v_store = OracleVS(
        client=connection,
        table_name="ORACLE_KNOWLEDGE",
        distance_strategy=DistanceStrategy.COSINE,
        embedding_function=embed_model,
    )

    retriever = v_store.as_retriever(search_kwargs={"k": 6})

    logger.info("Retriever created...")

except Exception as e:
    logger.error("Connection or retriever setup failed!")
    logger.error(e)

2024-06-06 16:09:21,416 - INFO - Connection successful!
2024-06-06 16:09:21,741 - INFO - Retriever created...
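
Conceptually, a retriever with `k=6` and a cosine distance strategy returns the six stored chunks whose embeddings are closest to the query embedding. The brute-force sketch below shows that idea in plain Python, taking cosine distance as one minus cosine similarity. It is an illustration only, not how the database executes the search, and the names `cosine_distance` and `top_k` are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Return (distance, index) pairs for the k vectors closest to query."""
    scored = ((cosine_distance(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# toy two-dimensional "embeddings" (illustrative data)
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], stored, k=2)
print([idx for _, idx in nearest])  # [0, 2]
```

A smaller distance means a closer match: 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite ones.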


In [8]:
question = "What is the purpose for the SDD document for LangChain?"

result_docs = retriever.invoke(question)

In [9]:
# display results

for doc in result_docs:
    print(doc.page_content)
    print(doc.metadata)
    print("----------------------------")
    print("")

Document Control  
Copyright @202 4, Oracle and/or its affiliates  
 Page 4 
Document Control  
1.1 Version Control  
Version  Authors  Date  Comments  
1.0 Martijn de 
Grunt  
Emir Özdel  03 April  
2024 Created a new Solution Definition document. To be used for iterative review and 
improvement.  
1.1 Martijn de 
Grunt  08 April 
2024  Added Project Scope, Title,  and Business Context  
1.2 Emir Özdel  08 April 
2024  Minor changes  
1.3 
 Martijn de 
Grunt  08 April 
2024  Naming convention LangChain RAG instead of Custom RAG  
 
1.4 Emir Özdel  08 April 
2024  Further Refinements and Instructions page with the new name convention is 
updated.  
 
1.5 Emir Özdel  17 April 
2024  The readability is increased, some corrections are made.  
2.0 Emir Özdel 19 April 
2024  Some important changes are made in the document, improve the step -by-step  
guide for a better readability. Added more context  
2.1 Emir Özdel 26 April 
2024 Based on the feedback, some unclear points are clarified an