## RAG using a PDF book
* see: https://python.langchain.com/docs/use_cases/question_answering/
* using **Cohere** embeddings

In [1]:
# modified to load from a PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# two possible vector stores
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS

# removed OpenAI, using Cohere embeddings
from langchain.embeddings import CohereEmbeddings

from langchain import hub

# removed OpenAI, using OCI GenAI
from oci.config import from_file

# oci_llm is in a local file
from oci_llm import OCIGenAILLM

from langchain.schema.runnable import RunnablePassthrough

# private configs
from config_private import COMPARTMENT_OCID, COHERE_API_KEY

In [2]:
# to enable some debugging
DEBUG = False

In [3]:
# functions
def get_answer(rag_chain, question):
    """Run the question through the RAG chain and print the question and the response."""
    response = rag_chain.invoke(question)

    print(f"Question: {question}")
    print("The response:")
    print(response)
    print()

In [4]:
# read OCI config to connect to OCI with API key
CONFIG_PROFILE = "DEFAULT"
config = from_file("~/.oci/config", CONFIG_PROFILE)

# OCI GenAI endpoint (for now Chicago)
ENDPOINT = "https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com"

# print the config to check that the API keys are loaded
if DEBUG:
    print(config)

#### Loading the document

In [5]:
# BLOG_POST = "https://python.langchain.com/docs/get_started/introduction"
BOOK = "./oracle-database-23c-new-features-guide.pdf"

loader = PyPDFLoader(BOOK)

data = loader.load()

#### Splitting the document in chunks

In [6]:
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

splits = text_splitter.split_documents(data)
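
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on separators such as paragraph and line breaks. The `chunk_text` helper is a hypothetical name introduced only for this illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop as soon as a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means that a sentence cut at a chunk boundary still appears whole, or nearly whole, in one of the two neighbouring chunks, at the cost of embedding some text twice.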

In [8]:
print(f"We have split the PDF into {len(splits)} chunks...")

We have split the PDF into 143 chunks...


In [9]:
# some post processing

# replace newlines with spaces
for split in splits:
    split.page_content = split.page_content.replace("\n", " ")
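
A plain `replace("\n", " ")` can leave runs of double spaces when a line already ends with a space or when the PDF has blank lines. As an optional variant, the sketch below collapses any run of whitespace into a single space. The `normalize_whitespace` helper and the sample string are assumptions made for illustration, not part of the original notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# made-up sample that mimics text extracted from a PDF page
raw = "JSON Type Support \nfor External   Tables\n\nRelated Resources\n"

print(repr(raw.replace("\n", " ")))
print(repr(normalize_whitespace(raw)))
```

To apply it to the chunks, the loop body would become `split.page_content = normalize_whitespace(split.page_content)`.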

In [10]:
# have a look at a single split
splits[20].page_content

'JSON Type Support for External Tables Support for access and direct-loading of JSON-type columns is provided for external tables. JSON data type is supported as a column type in the external table definition. Newline- delimited and JSON-array file options are supported, which facilitates importing JSON data from an external table. This feature makes it easier to load data into a JSON-type columns. Related Resources View Documentation JSON/JSON_VALUE will Convert PL/SQL Aggregate Type to/from JSON The PL/SQL JSON constructor is enhanced to accept an instance of a corresponding PL/SQL aggregate type, returning a JSON object or array type populated with the aggregate type data. The PL/SQL JSON_VALUE operator is enhanced so that its returning clause can accept a type name that defines the type of the instance that the operator is to return. JSON constructor support for aggregate data types streamlines data interchange between PL/SQL applications and languages that support JSON. Related Re

#### Embeddings and Vector Store

In [13]:
%%time

# We have replaced OpenAI embeddings with Cohere embeddings.
# For open-source alternatives, see the leaderboard here: https://huggingface.co/spaces/mteb/leaderboard
# EMBED_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"

cohere = CohereEmbeddings(cohere_api_key=COHERE_API_KEY)

# using Chroma or FAISS as the vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=cohere)
# vectorstore = FAISS.from_documents(documents=splits, embedding=cohere)

retriever = vectorstore.as_retriever()

CPU times: user 1.13 s, sys: 61.9 ms, total: 1.2 s
Wall time: 2.77 s


#### Define the prompt structure

In [14]:
rag_prompt = hub.pull("rlm/rag-prompt")

#### Define the LLM: OCI GenAI

In [15]:
# compartment OCID from config_private.py

llm = OCIGenAILLM(
    temperature=1,
    max_tokens=2000,
    config=config,
    compartment_id=COMPARTMENT_OCID,
    endpoint=ENDPOINT,
    debug=DEBUG,
)

#### Define the (Lang)Chain

In [16]:
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | rag_prompt | llm
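
The chain above reads left to right: the question is sent to the retriever and also passed through unchanged, both values fill the prompt template, and the filled prompt goes to the LLM. The sketch below reproduces that data flow in plain Python with stand-in functions. Every name in it (`make_rag_pipeline`, `fake_retrieve`, `fake_prompt`, `fake_llm`) is hypothetical; it does not call LangChain, Cohere, or OCI.

```python
def make_rag_pipeline(retrieve, build_prompt, generate):
    """Compose retrieval, prompt construction, and generation into one callable."""

    def run(question: str) -> str:
        # step 1: build the two prompt inputs from the single question
        inputs = {"context": retrieve(question), "question": question}
        # step 2: fill the prompt template
        prompt = build_prompt(**inputs)
        # step 3: send the prompt to the model
        return generate(prompt)

    return run


# stand-ins for the retriever, the prompt template, and the LLM
def fake_retrieve(question: str) -> list[str]:
    return ["chunk A", "chunk B"]


def fake_prompt(context: list[str], question: str) -> str:
    joined = "\n\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


def fake_llm(prompt: str) -> str:
    return "ANSWER based on: " + prompt


pipeline = make_rag_pipeline(fake_retrieve, fake_prompt, fake_llm)
answer = pipeline("What is new?")
print(answer)
```

In this sketch the retrieved chunks are joined into one string before they reach the prompt; whether and how the real chain formats the retrieved documents depends on the prompt and on any formatting step placed after the retriever.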

#### Process the question

In [17]:
# a list of possible questions
QUESTION1 = "What is the best architecture for an LLM?"
QUESTION2 = "What is LangChain?"
QUESTION3 = "Make a list of database 23c innovations in AI"
QUESTION4 = "List the new features in Oracle Database 23c"
QUESTION5 = "Are there features related to Time Series in Oracle Database 23c?"
QUESTION6 = "Are there features related to Machine Learning in Oracle Database 23c?"

In [18]:
%%time

# the question
get_answer(rag_chain, question=QUESTION4)

Question: List the new features in Oracle Database 23c
The response:
 Oracle Database 23c has over 300 new features and enhancements. Some of the new features include: JSON, graph, microservices, and developer productivity.

CPU times: user 91 ms, sys: 10.7 ms, total: 102 ms
Wall time: 2.09 s


In [19]:
%%time

# the question
get_answer(rag_chain, question=QUESTION5)

Question: Are there features related to Time Series in Oracle Database 23c?
The response:
 Yes, Oracle Database 23c has features related to Time Series. The following is a list of some of these features:

1. Oracle Time Series (OTS) is a feature that allows you to create a time-based index on a table to enable faster query performance and easier data analysis.
2. Oracle TimesTen In-Memory Database is a feature that provides a high-performance, in-memory database for fast data access and analytics.
3. Oracle GoldenGate is a feature that allows you to replicate data in real-time between different databases, including Oracle Database and other databases such as MySQL and Microsoft SQL Server.
4. Oracle Data Guard is a feature that provides a high availability and disaster recovery solution for Oracle Database, including the ability to replicate data in real-time between two or more databases.
5. Oracle Flashback Database is a feature that allows you to flashback the Oracle Database to a s

In [20]:
%%time

# the question
get_answer(rag_chain, question=QUESTION6)

Question: Are there features related to Machine Learning in Oracle Database 23c?
The response:
 Yes, Oracle Database 23c includes several features related to Machine Learning, including enhancements for machine learning algorithms and improved data preparation for high cardinality categorical features. The database also includes support for data query persistence with models and automated time series model search.

CPU times: user 47.6 ms, sys: 7.7 ms, total: 55.3 ms
Wall time: 2 s


#### Explore the vector store

In [21]:
# Retrieve relevant splits for any question using similarity search.

# This is simply "top K" retrieval where we select documents based on embedding similarity to the query.

TOP_K = 5

docs = vectorstore.similarity_search(QUESTION5, k=TOP_K)

len(docs)

5
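
As a rough illustration of what "top K" retrieval means, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the best `k`. It is only a conceptual model: the distance metric and index structure actually used depend on the vector store and its settings, and real embeddings have hundreds or thousands of dimensions. The helper names and the toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# toy 2-dimensional "embeddings"
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vec = [1.0, 0.1]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

Document 0 points almost in the same direction as the query and document 2 is the next closest, so they are the two indices returned.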

In [None]:
for i, doc in enumerate(docs):
    print(f"chunk n. {i+1}")
    print(doc.page_content)
    print()