### Oracle AI Vector Search: Loading the Vector Store

This notebook loads your knowledge base (KB) into Oracle Database, then computes and stores the embedding vectors.

The KB is a set of PDF files stored in a directory. This notebook:
* Reads all the PDF files and splits them into chunks
* Computes the embeddings for all chunks
* Stores the chunks and embeddings in the **ORACLE_KNOWLEDGE** table

* The demo is based on the **LangChain** integration
* Embeddings are computed with the **OCI GenAI multi-lingual (Cohere) embeddings** model
* Data is stored in a single table (**ORACLE_KNOWLEDGE**)

Afterward, you can run a similarity search and build an assistant, based on OCI GenAI, on top of the vector store.

In [1]:
import logging
from glob import glob
import pandas as pd

# to load and split txt documents
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# to compute embedding vectors
from langchain_community.embeddings import OCIGenAIEmbeddings

# the class to integrate OCI AI Vector Search with LangChain
from oracle_vector_db_lc import OracleVectorStore
from chunk_index_utils import load_books_and_split

from config import OCI_EMBED_MODEL, ENDPOINT
from config_private import COMPARTMENT_ID

In [2]:
# Test connection to the DB
OracleVectorStore.test_connection()

2024-05-03 15:35:37,141 - INFO - Successfully connected !!!


#### Setup

In [3]:
#
# Some configurations
#

# directory containing the PDF files of our knowledge base
BOOKS_DIR = "./books"

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

embed_model = OCIGenAIEmbeddings(
    # Inside OCI Data Science you can use resource principal authentication:
    # auth_type="RESOURCE_PRINCIPAL"
    # Outside OCI Data Science, use API_KEY and provide your API keys
    auth_type="API_KEY",
    model_id=OCI_EMBED_MODEL,
    service_endpoint=ENDPOINT,
    compartment_id=COMPARTMENT_ID,
)

In [4]:
# list of the files that make up the knowledge base
file_list = sorted(glob(BOOKS_DIR + "/" + "*.pdf"))

print(f"There are {len(file_list)} files to be loaded...")
for f_name in file_list:
    print(f_name)

There are 8 files to be loaded...
./books/CurrentEssentialsofMedicine.pdf
./books/Il conto corrente in parole semplici.pdf
./books/La storia del Gruppo-iccrea.pdf
./books/La_Centrale_dei_Rischi_in_parole_semplici.pdf
./books/covid19_treatment_guidelines.pdf
./books/database-concepts.pdf
./books/high-availability-23c.pdf
./books/the-side-effects-of-metformin-a-review.pdf


#### Load all files and split them into chunks

In [5]:
docs = load_books_and_split(BOOKS_DIR)

2024-05-03 15:35:51,328 - Loading documents from ./books...
2024-05-03 15:35:51,332 - Loading books: ['./books/La_Centrale_dei_Rischi_in_parole_semplici.pdf', './books/CurrentEssentialsofMedicine.pdf', './books/database-concepts.pdf', './books/covid19_treatment_guidelines.pdf', './books/Il conto corrente in parole semplici.pdf', './books/La storia del Gruppo-iccrea.pdf', './books/the-side-effects-of-metformin-a-review.pdf', './books/high-availability-23c.pdf']


  0%|          | 0/8 [00:00<?, ?it/s]

2024-05-03 15:36:11,026 - Loaded 4832 chunks...
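
The internals of `load_books_and_split` are not shown in this notebook. As a rough illustration of what chunking can look like, here is a minimal sketch in plain Python that cuts each page into fixed-size, overlapping character windows and attaches `source` and `page` metadata. The function names, the chunk size, and the overlap are all assumptions made for this example; they are not the values or the logic used by the helper.

```python
# Hypothetical sketch: NOT the implementation of load_books_and_split.
# chunk_size and overlap are illustrative values only.


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def split_pages(pages: list, source: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Chunk every page and record which file and page each chunk came from."""
    records = []
    for page_num, page_text in enumerate(pages):
        for chunk in split_into_chunks(page_text, chunk_size, overlap):
            records.append(
                {
                    "page_content": chunk,
                    "metadata": {"source": source, "page": page_num},
                }
            )
    return records
```

The overlap keeps a sentence that straddles a window boundary at least partly visible in both neighbouring chunks, which tends to help retrieval. A splitter that respects paragraph and sentence boundaries is usually preferable to raw character windows.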


#### Create the Vector Store and load texts + embeddings into the DB

In [6]:
# drop the existing table
# be careful: this deletes all the existing records
OracleVectorStore.drop_collection(collection_name="ORACLE_KNOWLEDGE")

2024-05-03 15:36:18,650 - INFO - ORACLE_KNOWLEDGE dropped!!!


In [7]:
OracleVectorStore.create_collection(collection_name="ORACLE_KNOWLEDGE")

2024-05-03 15:36:20,026 - INFO - ORACLE_KNOWLEDGE created!!!


In [8]:
# Create the vector store: compute the embeddings
# and load texts + embeddings into the DB.
# Computing the embeddings can take several minutes.
v_store = OracleVectorStore.from_documents(
    docs, embed_model, collection_name="ORACLE_KNOWLEDGE", verbose=True
)

2024-05-03 15:36:22,063 - INFO - Compute embeddings...


  0%|          | 0/54 [00:00<?, ?it/s]

2024-05-03 15:39:13,972 - INFO - Saving texts, embeddings to DB...


  0%|          | 0/4832 [00:00<?, ?it/s]

2024-05-03 15:41:44,943 - INFO - Tot. errors in save_embeddings: 0
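
The progress bars above show 54 embedding batches for 4832 chunks, which suggests the texts are sent to the embedding model in groups rather than one by one. The sketch below illustrates that kind of batching in plain Python. The helper name `batched` is hypothetical, and the batch size of 90 is an assumption that happens to be consistent with the numbers in the log; the actual setting is inside `OracleVectorStore` and is not shown in this notebook.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Stand-in texts; in the notebook these would be the chunk contents.
texts = [f"chunk {i}" for i in range(4832)]

# ASSUMPTION: 90 is a guess, chosen only because it reproduces the 54 batches in the log.
batches = list(batched(texts, 90))
print(len(batches))       # 54
print(len(batches[-1]))   # 62 (the last batch is smaller)
```

Each batch would then be passed to a single embedding request, which reduces the number of round trips to the service.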


#### Run a test query

In [9]:
v_store = OracleVectorStore(
    embedding=embed_model, collection_name="ORACLE_KNOWLEDGE", verbose=True
)

In [10]:
# k is the number of docs we want to retrieve
retriever = v_store.as_retriever(search_kwargs={"k": 6})

In [11]:
# the question is in Italian: "List the key milestones in the history of the Iccrea group"
question = "Elenca i passi salienti della storia del gruppo Iccrea"

result_docs = retriever.invoke(question)

2024-05-03 15:42:28,812 - INFO - top_k: 6
2024-05-03 15:42:28,812 - INFO - 
2024-05-03 15:42:29,699 - INFO - select: select C.id, C.CHUNK, C.REF, C.PAG,
                            VECTOR_DISTANCE(C.VEC, :1, COSINE) as d 
                            from ORACLE_KNOWLEDGE C
                            order by d
                            FETCH FIRST 6 ROWS ONLY
2024-05-03 15:42:30,025 - INFO - Query duration: 0.5 sec.
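
The logged SQL ranks every row by `VECTOR_DISTANCE(C.VEC, :1, COSINE)` and keeps the 6 closest. To make that ranking concrete, here is a small plain-Python sketch of the same idea: an exhaustive top-k search by cosine distance, taken here as 1 minus the cosine similarity. The function names and the toy two-dimensional vectors are hypothetical; the database does this work itself, and this is only meant to show what the ordering means.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    """Cosine distance, computed as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec: list, rows: list, k: int) -> list:
    """Return the k (distance, row_id) pairs closest to query_vec.

    rows is a list of (row_id, vector) tuples, scanned exhaustively,
    like the ORDER BY d ... FETCH FIRST k ROWS ONLY in the logged query.
    """
    scored = [(cosine_distance(vec, query_vec), row_id) for row_id, vec in rows]
    scored.sort()
    return scored[:k]


# Toy data: three stored vectors and a query vector.
rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]
nearest = top_k([1.0, 0.0], rows, k=2)
print([row_id for _, row_id in nearest])  # [1, 3]
```

A smaller distance means the chunk points in a direction closer to the query embedding, so the first rows returned are the most semantically similar ones.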


In [12]:
for doc in result_docs:
    print(doc.page_content)
    print(doc.metadata)
    print("----------------------------")
    print("")

La storia del Gruppo Un Gruppo che crea il futuro con la forza del passato
{'source': './books/La storia del Gruppo-iccrea.pdf', 'page': 0}
----------------------------

1963 Le origini di Iccrea Banca Iccrea Banca nasce il 30 novembre del 1963, quando i rappresentanti di 190 Casse Rurali si riuniscono a Roma per stipulare l’atto costitutivo dell’Istituto di Credito delle Casse Rurali e Artigiane (CRA). Ispirata come le prime Casse Rurali dell’Ottocento al pensiero cristiano sociale espresso dall’enciclica Rerum Novarum di Leone XIII, Iccrea Banca viene costituita con lo scopo di far crescere l’attività delle CRA, agevolandone e coordinandone l’azione attraverso lo svolgimento di funzioni creditizie, l’intermediazione bancaria e l’assistenza finanziaria. Iccrea Banca rappresenta la prima forma organizzativa di auto-gestione, lo strumento per rendere le Casse Rurali indipendenti dalle altre banche. Guido Carli, Governatore della Banca d’Italia negli anni ’60, ne commenta così la funzion