# Oracle AI Vector Search Document Processing
Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and allows you to query data based on semantics rather than keywords.
One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in a single system. This is not only powerful but also significantly more effective, because you don't need to add a specialized vector database, which eliminates the pain of data fragmentation between multiple systems.

In addition, because Oracle has been building database technologies for so long, your vectors can benefit from all of Oracle Database's most powerful features, like the following:

 * Partitioning Support
 * Real Application Clusters scalability
 * Exadata smart scans
 * Shard processing across geographically distributed databases
 * Transactions
 * Parallel SQL
 * Disaster recovery
 * Security
 * Oracle Machine Learning
 * Oracle Graph Database
 * Oracle Spatial and Graph
 * Oracle Blockchain
 * JSON

This guide demonstrates how to use the document processing capabilities of Oracle AI Vector Search to load and chunk documents using OracleDocLoader and OracleTextSplitter, respectively.

This guide also demonstrates how Oracle AI Vector Search can be used with LangChain to build an end-to-end RAG pipeline. It walks through examples of:

 * Loading the documents from various sources using OracleDocLoader
 * Summarizing them inside or outside the database using OracleSummary
 * Generating embeddings for them inside or outside the database using OracleEmbeddings
 * Chunking them according to different requirements using the advanced capabilities of OracleTextSplitter
 * Storing and indexing them in a vector store and querying them with OracleVS

### Prerequisites

Please install the Oracle Python client driver (python-oracledb) to use LangChain with Oracle AI Vector Search.

In [None]:
# pip install oracledb

### Create Demo User
First, create a demo user with all the required privileges. 

In [None]:
import sys

import oracledb

# please update with your username, password, hostname and service_name
# please make sure this user has sufficient privileges to perform all below
username = "<username>"
password = "<password>"
dsn = "<hostname>/<service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")

    cursor = conn.cursor()
    cursor.execute(
        """
    begin
        -- drop user
        begin
            execute immediate 'drop user testuser cascade';
        exception
            when others then
                dbms_output.put_line('Error setting up user.');
        end;
        execute immediate 'create user testuser identified by testuser';
        execute immediate 'grant connect, unlimited tablespace, create credential, create procedure, create any index to testuser';
        execute immediate 'create or replace directory DEMO_DIR as ''<directory>''';
        execute immediate 'grant read, write on directory DEMO_DIR to public';
        execute immediate 'grant create mining model to testuser';
        execute immediate 'grant execute on sys.dmutil_lib to testuser';

        -- network access
        begin
            DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
                host => '*',
                ace => xs$ace_type(privilege_list => xs$name_list('connect'),
                                principal_name => 'testuser',
                                principal_type => xs_acl.ptype_db));
        end;
    end;
    """
    )
    print("User setup done!")
    cursor.close()
except Exception as e:
    print(f"User setup failed! {e}")
    # the cursor does not exist if the connection itself failed
    if "cursor" in locals():
        cursor.close()
    sys.exit(1)

### Connect to Oracle Database
The following sample code will show how to connect to Oracle Database. 

In [None]:
import sys

import oracledb

# please update with your username, password, hostname and service_name
username = "testuser"
password = "testuser"
dsn = "<hostname>/<service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    print(f"Connection failed! {e}")
    sys.exit(1)
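
Hard-coding connection details is fine for a demo, but a small helper can keep them out of the notebook. The following is only a sketch: `build_dsn` and the environment variable names `ORACLE_HOST` and `ORACLE_SERVICE` are hypothetical, the fallback values are placeholders, and it assumes the Easy Connect form `host:port/service_name` with the default listener port 1521.

```python
import os


def build_dsn(host: str, service_name: str, port: int = 1521) -> str:
    """Build an Easy Connect string of the form host:port/service_name."""
    if not host or not service_name:
        raise ValueError("host and service_name must be non-empty")
    return f"{host}:{port}/{service_name}"


# Hypothetical environment variable names; the fallbacks are placeholders only.
host = os.environ.get("ORACLE_HOST", "localhost")
service = os.environ.get("ORACLE_SERVICE", "FREEPDB1")
dsn = build_dsn(host, service)
print(dsn)
```

The resulting string can be passed as the `dsn` argument of `oracledb.connect`, in place of the `"<hostname>/<service_name>"` placeholder used above.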

### Load Documents
Users can load documents from Oracle Database, from a file system, or from both by setting the loader parameters accordingly. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters.

The main benefit of using OracleDocLoader is that it can handle more than 150 file formats, so you don't need a different loader for each format. Here is the list of formats that we support: https://docs.oracle.com/en/database/oracle/oracle-database/23/ccref/oracle-text-supported-document-formats.html

The following sample code will show how to do that:

In [None]:
""" import the dependencies"""
from langchain_community.document_loaders.oracleai import OracleDocLoader
from langchain_core.documents import Document

""" setting loader parameters """

""" you can choose any of the setting options or combine them. 
    for now, let's load from a Database table.
"""

"""
# loading a local file
loader_params = {}
loader_params["file"] = "<file>"

# loading from a local directory
loader_params = {}
loader_params["dir"] = "<directory>"
"""

# loading from Oracle Database table
# make sure you have the table with this specification
loader_params = {
    "owner": "ut",
    "tablename": "demo_tab",
    "colname": "data",
}

""" load the docs """
loader = OracleDocLoader(conn=conn, params=loader_params)
docs = loader.load()

""" verify """
print(f"Number of docs loaded: {len(docs)}")
# print(f"Document-0: {docs[0].page_content}") # content
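
Because the loader options are plain dictionary keys, a typo or a half-specified table source is easy to miss. The sketch below shows one way to assemble `loader_params` defensively. `build_loader_params` is a hypothetical helper, not part of the library; it only uses the keys shown above (`file`, `dir`, `owner`, `tablename`, `colname`) and assumes a table source is meaningful only when all three table keys are given.

```python
from typing import Dict, Optional


def build_loader_params(
    file: Optional[str] = None,
    directory: Optional[str] = None,
    owner: Optional[str] = None,
    tablename: Optional[str] = None,
    colname: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble a loader_params dict from whichever sources are given."""
    params: Dict[str, str] = {}
    if file is not None:
        params["file"] = file
    if directory is not None:
        params["dir"] = directory

    # Assumption: a table source needs owner, tablename, and colname together.
    table_parts = {"owner": owner, "tablename": tablename, "colname": colname}
    given = {key: value for key, value in table_parts.items() if value is not None}
    if given and len(given) != len(table_parts):
        missing = sorted(set(table_parts) - set(given))
        raise ValueError(f"table source is missing: {', '.join(missing)}")
    params.update(given)

    if not params:
        raise ValueError("no document source was given")
    return params


print(build_loader_params(owner="testuser", tablename="demo_tab", colname="data"))
```

The returned dictionary has the same shape as the hand-written `loader_params` above and could be passed to `OracleDocLoader` in the same way.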

### Split Documents
Documents vary in size, from small to very large. Users typically split (chunk) their documents into smaller pieces before generating embeddings. Many splitting customizations are available. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters.

The following sample code will show how to do that:

In [None]:
""" import the dependencies"""
from langchain_community.document_loaders.oracleai import OracleTextSplitter
from langchain_core.documents import Document

""" setting splitter parameters """

""" please choose the option that you prefer.
    for now, let's use split by default parameters.
"""

"""
# Some examples
# split by chars, max 500 chars
splitter_params = {"split": "chars", "max": 500, "normalize": "all"}

# split by words, max 100 words
splitter_params = {"split": "words", "max": 100, "normalize": "all"}

# split by sentence, max 20 sentences
splitter_params = {"split": "sentence", "max": 20, "normalize": "all"}
"""

# split by default parameters
splitter_params = {"normalize": "all"}

""" get the splitter instance """
splitter = OracleTextSplitter(conn=conn, params=splitter_params)

list_chunks = []
for doc in docs:
    chunks = splitter.split_text(doc.page_content)
    list_chunks.extend(chunks)

""" verify """
print(f"Number of Chunks: {len(list_chunks)}")
# print(f"Chunk-0: {list_chunks[0]}") # content
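
To build intuition for what a size limit such as `max` does, here is a plain-Python sketch of word-based chunking. It is not how OracleTextSplitter works internally (the database performs the splitting, along with normalization and boundary handling); `chunk_by_words` is a hypothetical helper used only for illustration.

```python
from typing import List


def chunk_by_words(text: str, max_words: int = 100) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start : start + max_words])
        for start in range(0, len(words), max_words)
    ]


sample = "Oracle AI Vector Search lets you query data by meaning rather than keywords"
for number, chunk in enumerate(chunk_by_words(sample, max_words=5), start=1):
    print(number, chunk)
```

Smaller limits produce more, shorter chunks. A naive splitter like this one can cut a sentence in half, which is one reason to prefer boundary-aware splitting options.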

### End to End Demo
Now that you know how to use Oracle AI Vector Search's OracleDocLoader and OracleTextSplitter, let us show how to build an end-to-end RAG pipeline with the help of Oracle AI Vector Search.

First, let's import all the dependencies.

In [None]:
import sys

import oracledb
from langchain_community.document_loaders.oracleai import (
    OracleDocLoader,
    OracleTextSplitter,
)
from langchain_community.embeddings.oracleai import OracleEmbeddings
from langchain_community.utilities.oracleai import OracleSummary
from langchain_community.vectorstores import oraclevs
from langchain_community.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document

Consider a scenario in which users have documents in Oracle Database or in a file system and want to use that data with Oracle AI Vector Search through LangChain.

This requires some document preprocessing. The first step is to read the documents, generate a summary of each (if needed), and split them into chunks (if needed). The next step is to generate embeddings for those chunks and store them in Oracle AI Vector Store. Finally, users can run semantic queries against that data.

The Oracle AI Vector Search LangChain library provides a range of document processing functionality, including document loading, splitting, summarization, and embedding generation.

Next, let's combine all of the document processing stages. Here is the sample code:

In [None]:
"""
In this sample example, we will use 'database' provider for both summary and embeddings.
So, we don't need to do the following:
    - set proxy for 3rd party providers
    - create credential for 3rd party providers

If you choose to use 3rd party provider, 
please follow the necessary steps for proxy and credential.
"""

# oracle connection
# please update with your username, password, hostname, and service_name
username = "testuser"
password = "testuser"
dsn = "<hostname>/<service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    print("Connection failed!")
    sys.exit(1)


# load onnx model
# please update with your related information
onnx_dir = "DEMO_DIR"
onnx_file = "tinybert.onnx"
model_name = "demo_model"
try:
    OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)
    print("ONNX model loaded.")
except Exception as e:
    print("ONNX model loading failed!")
    sys.exit(1)


# params
# please update necessary fields with related information
loader_params = {
    "owner": "testuser",
    "tablename": "demo_tab",
    "colname": "data",
}
summary_params = {
    "provider": "database",
    "glevel": "S",
    "numParagraphs": 1,
    "language": "english",
}
splitter_params = {"normalize": "all"}
embedder_params = {"provider": "database", "model": "demo_model"}

# instantiate loader, summary, splitter, and embedder
loader = OracleDocLoader(conn=conn, params=loader_params)
summary = OracleSummary(conn=conn, params=summary_params)
splitter = OracleTextSplitter(conn=conn, params=splitter_params)
embedder = OracleEmbeddings(conn=conn, params=embedder_params)

# load the documents with the loader created above, then process them
docs = loader.load()
chunks_with_mdata = []
for doc_num, doc in enumerate(docs, start=1):
    summ = summary.get_summary(doc.page_content)
    chunks = splitter.split_text(doc.page_content)
    for chunk_num, chunk in enumerate(chunks, start=1):
        # copy so that each chunk gets its own metadata dictionary
        chunk_metadata = doc.metadata.copy()
        # chunk id format: <_oid>$<document number>$<chunk number>
        chunk_metadata["id"] = f"{chunk_metadata['_oid']}${doc_num}${chunk_num}"
        chunk_metadata["document_id"] = str(doc_num)
        chunk_metadata["document_summary"] = str(summ[0])
        chunks_with_mdata.append(
            Document(page_content=str(chunk), metadata=chunk_metadata)
        )

""" verify """
print(f"Number of total chunks with metadata: {len(chunks_with_mdata)}")
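
The metadata step in the loop above can be tried in isolation, without a database. The sketch below uses plain dictionaries instead of `Document` objects; `make_chunk_metadata` is a hypothetical helper and the `_oid` value is made up.

```python
def make_chunk_metadata(
    doc_metadata: dict, doc_num: int, chunk_num: int, summary: str
) -> dict:
    """Return per-chunk metadata with an id of the form <_oid>$<doc>$<chunk>."""
    meta = dict(doc_metadata)  # copy so the source metadata is left untouched
    meta["id"] = f"{meta['_oid']}${doc_num}${chunk_num}"
    meta["document_id"] = str(doc_num)
    meta["document_summary"] = summary
    return meta


doc_metadata = {"_oid": "ABC123"}  # made-up object id for illustration
meta = make_chunk_metadata(doc_metadata, doc_num=1, chunk_num=2, summary="A short summary.")
print(meta["id"])  # ABC123$1$2
```

Storing `document_id` as a string is what allows a filter such as `{"document_id": ["1"]}` to match it at query time.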

At this point, we have processed the documents and generated chunks with metadata. Next, we will create an Oracle AI Vector Store with those chunks.

Here is the sample code showing how to do that:

In [None]:
# create Oracle AI Vector Store
vectorstore = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)

The above example creates a vector store with DOT_PRODUCT distance strategy. 

However, users can create an Oracle AI Vector Store with different distance strategies. The following example shows a few other options that we support.

***Note*** The following code is for your information only. It simply shows some of the other available options.

In [None]:
# create some vector stores
vectorstore_dot = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs_dot",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)

vectorstore_cosine = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs_cosine",
    distance_strategy=DistanceStrategy.COSINE,
)

vectorstore_euclidean = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs_euclidean",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)

vectorstore_dot_ivf = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs_dot_ivf",
    distance_strategy=DistanceStrategy.DOT_PRODUCT,
)

vectorstore_euclidean_ivf = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs_euclidean_ivf",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
)

vectorstore_cosine_ivf = OracleVS.from_documents(
    chunks_with_mdata,
    embedder,
    client=conn,
    table_name="oravs_cosine_ivf",
    distance_strategy=DistanceStrategy.COSINE,
)
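
To see how the three strategies differ, it can help to compute them by hand on tiny vectors. The functions below are textbook definitions written in plain Python for illustration; they are not the database's implementation, and how the database scores and orders results for each strategy is not shown here.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; grows with both alignment and length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the angle between the vectors; ignores length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction as the query, twice the length
orthogonal = [0.0, 1.0]  # at a right angle to the query

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        dot_product(query, vec),
        cosine_distance(query, vec),
        euclidean_distance(query, vec),
    )
```

For `same_direction`, the cosine distance is 0 even though the Euclidean distance is 1: cosine compares direction only, while the dot product and Euclidean distance are also sensitive to vector length. This is why the choice of strategy should match how the embedding model was trained.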

Now that we have embeddings stored in vector stores, let's create an index on them to get better semantic search performance during query time.

Here is the sample code to create an index:

In [None]:
oraclevs.create_index(
    conn,
    vectorstore,
    params={"idx_name": "hnsw_oravs", "idx_type": "HNSW"},
)

The above example creates a default HNSW index on the embeddings stored in 'oravs' table. The users can set different parameters as per their requirements. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters.

***Note*** The following examples are for your information only. They simply show some of the other available options.


In [None]:
# creating HNSW indices
# Index for DOT_PRODUCT strategy with specific parameters
oraclevs.create_index(
    conn, vectorstore_dot, params={"idx_name": "oravs_dot_hnsw", "idx_type": "HNSW"}
)

# Index for COSINE strategy with specific parameters
oraclevs.create_index(
    conn,
    vectorstore_cosine,
    params={
        "idx_name": "oravs_cosine_hnsw",
        "idx_type": "HNSW",
        "accuracy": 97,
        "parallel": 16,
    },
)

# Index for EUCLIDEAN_DISTANCE strategy with specific parameters
oraclevs.create_index(
    conn,
    vectorstore_euclidean,
    params={
        "idx_name": "oravs_euclidean_hnsw",
        "idx_type": "HNSW",
        "neighbors": 64,
        "efConstruction": 100,
    },
)

# creating IVF indices
# Index for DOT_PRODUCT strategy with specific parameters
oraclevs.create_index(
    conn,
    vectorstore_dot_ivf,
    params={
        "idx_name": "oravs_dot_ivf",
        "idx_type": "IVF",
    },
)

# Index for COSINE strategy with specific parameters
oraclevs.create_index(
    conn,
    vectorstore_cosine_ivf,
    params={
        "idx_name": "oravs_cosine_ivf",
        "idx_type": "IVF",
        "accuracy": 90,
        "parallel": 32,
    },
)

# Index for EUCLIDEAN_DISTANCE strategy with specific parameters
oraclevs.create_index(
    conn,
    vectorstore_euclidean_ivf,
    params={"idx_name": "oravs_euclidean_ivf", "idx_type": "IVF", "neighbor_part": 64},
)
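
HNSW and IVF are approximate indexes: they avoid scanning every vector, at the cost of occasionally missing a true nearest neighbor. One way to reason about a setting such as `accuracy` is to compare approximate results against an exact scan. The sketch below is a plain-Python illustration under that assumption; `exact_top_k` and `recall_at_k` are hypothetical helpers, and the "approximate" result is made up rather than produced by a real index.

```python
import math
from typing import Dict, List, Sequence


def exact_top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Return the ids of the k vectors closest to the query (full scan)."""
    ranked = sorted(vectors, key=lambda vid: math.dist(query, vectors[vid]))
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[str], exact_ids: Sequence[str]) -> float:
    """Fraction of the exact neighbors that the approximate result found."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [5.0, 5.0],
}
exact = exact_top_k([0.1, 0.0], vectors, k=2)
approx = ["a", "c"]  # made-up result standing in for an approximate index
print(exact, recall_at_k(approx, exact))
```

Here the exact neighbors are `a` and `b`, so the made-up approximate result has a recall of 0.5. On real data, the same comparison can be run on a sample of queries before and after creating an index.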

### Perform Semantic Search
All set!

We have processed the documents, stored them in the vector store, and created an index to get better query performance. Now let's run some semantic searches.

Here is the sample code for this:

In [None]:
query = "What is Oracle AI Vector Store?"
filter = {"document_id": ["1"]}

# Similarity search without a filter
print(vectorstore.similarity_search(query, 1))

# Similarity search with a filter
print(vectorstore.similarity_search(query, 1, filter=filter))

# Similarity search with score
print(vectorstore.similarity_search_with_score(query, 1))

# Similarity search with score and a filter
print(vectorstore.similarity_search_with_score(query, 1, filter=filter))

# Max marginal relevance search
print(vectorstore.max_marginal_relevance_search(query, 1, fetch_k=20, lambda_mult=0.5))

# Max marginal relevance search with filter
print(
    vectorstore.max_marginal_relevance_search(
        query, 1, fetch_k=20, lambda_mult=0.5, filter=filter
    )
)
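
Max marginal relevance (MMR) search trades relevance against diversity: `fetch_k` candidates are fetched by similarity, and results are then picked one at a time, penalizing candidates that resemble those already chosen. The sketch below is a plain-Python illustration of that selection step using cosine similarity; it is not the library's implementation, and `mmr` and the sample vectors are hypothetical. The candidate list plays the role of the `fetch_k` results.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(
    query_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of up to k candidates chosen by MMR."""
    selected: List[int] = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine_similarity(query_vec, candidate_vecs[i])
            # Highest similarity to anything already picked (0 if none yet).
            redundancy = max(
                (cosine_similarity(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query_vec = [1.0, 0.0]
candidates = [
    [0.9, 0.1],  # most relevant
    [0.9, 0.12],  # near-duplicate of the first candidate
    [0.8, -0.6],  # less relevant, but different
]
print(mmr(query_vec, candidates, k=2, lambda_mult=0.5))
print(mmr(query_vec, candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5`, the near-duplicate is skipped in favor of the more diverse third candidate; with `lambda_mult=1.0`, the diversity penalty vanishes and the two most relevant candidates are returned.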