## RAG using a PDF book
* see: https://python.langchain.com/docs/use_cases/question_answering/
* using Cohere embeddings
* using a custom prompt

In [1]:
# modified to load from a PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate

# two possible vector stores
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS

# removed OpenAI, using Cohere embeddings
from langchain.embeddings import CohereEmbeddings

from langchain import hub

# removed OpenAI, using OCI GenAI
from oci.config import from_file

# oci_llm is in a local file
from oci_llm import OCIGenAILLM

from langchain.schema.runnable import RunnablePassthrough

# private configs
from config_private import COMPARTMENT_OCID, COHERE_API_KEY

In [2]:
# to enable some debugging
DEBUG = False

#### Template for custom prompt

In [3]:
# this is the template for the prompt
template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use five sentences maximum. 
Always say "Thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
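
To see what the model will actually receive, the template can be filled by hand. This is a minimal sketch that uses plain `str.format` on the same template text; the context and question strings are hypothetical examples, not taken from the book. `PromptTemplate.from_template` treats `{context}` and `{question}` as placeholders in a similar way when the default f-string format is used.

```python
# Minimal sketch: fill the prompt template with plain str.format.
# example_context and example_question are hypothetical values for illustration.
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use five sentences maximum.
Always say "Thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

example_context = "Oracle Database 23c adds JSON Relational Duality."
example_question = "What does Oracle Database 23c add?"

# Both placeholders are replaced; the rest of the text is left untouched.
filled_prompt = template.format(context=example_context, question=example_question)
print(filled_prompt)
```

Printing the filled prompt is a quick way to check that the retrieved context lands above the question and that the instructions read as intended.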

In [4]:
# functions
def get_answer(rag_chain, question):
    response = rag_chain.invoke(question)

    print(f"Question: {question}")
    print()
    print("The response:")
    print(response)
    print()

In [5]:
# read OCI config to connect to OCI with API key
CONFIG_PROFILE = "DEFAULT"
config = from_file("~/.oci/config", CONFIG_PROFILE)

# OCI GenAI endpoint (for now Chicago)
ENDPOINT = "https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com"

# print the config to verify access to the API keys
if DEBUG:
    print(config)

#### Loading the document

In [6]:
# BLOG_POST = "https://python.langchain.com/docs/get_started/introduction"
BOOK = "./oracle-database-23c-new-features-guide.pdf"

loader = PyPDFLoader(BOOK)

data = loader.load()

#### Splitting the document in chunks

In [7]:
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

splits = text_splitter.split_documents(data)
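
The effect of `chunk_size` and `chunk_overlap` is easier to see on a tiny string. The sketch below is a simplified fixed-width splitter written for illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, so its real output will differ. The function name `split_with_overlap` and the sample text are assumptions made for this example.

```python
# Simplified illustration of chunking with overlap (not the LangChain algorithm).
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary visible in both chunks.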

In [8]:
print(f"We have split the PDF into {len(splits)} splits...")

We have split the PDF into 143 splits...


In [9]:
# some post processing

# replace newlines with spaces
for split in splits:
    split.page_content = split.page_content.replace("\n", " ")

In [10]:
# have a look at a single split
splits[20].page_content

'JSON Type Support for External Tables Support for access and direct-loading of JSON-type columns is provided for external tables. JSON data type is supported as a column type in the external table definition. Newline- delimited and JSON-array file options are supported, which facilitates importing JSON data from an external table. This feature makes it easier to load data into a JSON-type columns. Related Resources View Documentation JSON/JSON_VALUE will Convert PL/SQL Aggregate Type to/from JSON The PL/SQL JSON constructor is enhanced to accept an instance of a corresponding PL/SQL aggregate type, returning a JSON object or array type populated with the aggregate type data. The PL/SQL JSON_VALUE operator is enhanced so that its returning clause can accept a type name that defines the type of the instance that the operator is to return. JSON constructor support for aggregate data types streamlines data interchange between PL/SQL applications and languages that support JSON. Related Re

#### Embeddings and Vector Store

In [11]:
%%time

cohere = CohereEmbeddings(cohere_api_key=COHERE_API_KEY)

# using Chroma or FAISS as Vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=cohere)
# vectorstore = FAISS.from_documents(documents=splits, embedding=cohere)

retriever = vectorstore.as_retriever()

CPU times: user 541 ms, sys: 89.5 ms, total: 630 ms
Wall time: 2.4 s


#### Define the prompt structure

In [12]:
rag_prompt_custom = PromptTemplate.from_template(template)

#### Define the LLM: OCI GenAI

In [13]:
# compartment OCID from config_private.py

# using mostly defaults
llm = OCIGenAILLM(
    temperature=1.0,
    max_tokens=1500,
    config=config,
    compartment_id=COMPARTMENT_OCID,
    endpoint=ENDPOINT,
    debug=DEBUG,
)

#### Define the (Lang)Chain

In [14]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()} | rag_prompt_custom | llm
)
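
As a mental model, the chain runs three steps in order: the dict stage sends the incoming question to the retriever and also passes it through unchanged, the prompt stage fills the template, and the LLM stage generates the answer. The sketch below mimics that flow with plain functions. It is not how LangChain is implemented, and `fake_retriever`, `build_prompt`, `fake_llm` and `run_chain` are hypothetical stand-ins. Joining the retrieved texts with blank lines is a simplification chosen for this example.

```python
# Plain-Python sketch of the data flow in the chain; all names are hypothetical.
def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector store retriever: returns the text of relevant chunks.
    return ["Oracle Database 23c adds JSON Relational Duality."]


def build_prompt(inputs: dict) -> str:
    # Stand-in for the prompt template: combines context and question into one string.
    context = "\n\n".join(inputs["context"])
    return f"{context}\nQuestion: {inputs['question']}\nHelpful Answer:"


def fake_llm(prompt: str) -> str:
    # Stand-in for the LLM call: a real model would generate text here.
    return f"(model output for a prompt of {len(prompt)} characters)"


def run_chain(question: str) -> str:
    # Step 1: the same question feeds both the retriever and the passthrough branch.
    inputs = {"context": fake_retriever(question), "question": question}
    # Step 2: the prompt is built from the combined inputs.
    prompt = build_prompt(inputs)
    # Step 3: the prompt is sent to the model.
    return fake_llm(prompt)


print(run_chain("What does Oracle Database 23c add?"))
```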

#### Process the question

In [15]:
# a list of possible questions
QUESTION1 = "What is the best architecture for an LLM?"
QUESTION2 = "What is LangChain?"
QUESTION3 = "Make a list of database 23c innovations in AI"
QUESTION4 = "List the new features in Oracle Database 23c"
QUESTION6 = "Are there features related to Machine Learning in Oracle Database 23c?"

In [21]:
%%time

# the question
get_answer(rag_chain, question=QUESTION4)

Question: List the new features in Oracle Database 23c

The response:
 There are over 300 new features and enhancements in Oracle Database 23c. Here are some of the key features:
- JSON Relational Duality: Data can be accessed and updated as either JSON documents or relational tables.
- Operational Property Graphs in SQL: Developers can now build real-time graph analysis applications against operational data directly in the Oracle Database.
- Microservice Support: New functionality makes it simpler to implement cross-service transactions.
- Lock-Free Column Value Reservations: Lock-free column value reservations allow applications to reserve part of a value in a column without locking the row.
- Add and Drop User Columns in Blockchain and Immutable Tables: You can now add and drop user columns in blockchain and immutable tables.
- Blockchain Table Countersignature: You can now add countersignatures to blockchain tables.
- Blockchain Table Delegate Signer: You can now add a delegate sig

In [18]:
%%time

# the question
get_answer(rag_chain, question=QUESTION6)

Question: Are there features related to Machine Learning in Oracle Database 23c?

The response:
 Yes, there are several features related to Machine Learning in Oracle Database 23c, including:

1.  Improved data preparation for high cardinality categorical features: This feature simplifies the process of preparing data for machine learning by providing better support for handling high cardinality categorical features.

2.  Lineage: Data Query Persisted with Model: This feature allows you to persist the data query associated with a machine learning model, which can help improve the performance of the model.

3.  Multiple time series: This feature allows you to work with multiple time series data sets, which can be useful for machine learning applications.

4.  Outlier detection using expectation maximization (EM) clustering: This feature allows you to detect outliers in your data using EM clustering, which can be useful for machine learning applications.

5.  Partitioned model performanc

#### Explore the vector store

In [19]:
# Retrieve relevant splits for any question using similarity search.

# This is simply "top K" retrieval where we select documents based on embedding similarity to the query.

TOP_K = 5

docs = vectorstore.similarity_search(QUESTION4, k=TOP_K)

len(docs)

5
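
The "top K" idea can be sketched without a vector store: score every chunk against the query vector and keep the K best. The example below assumes cosine similarity and uses hand-made three-dimensional vectors. Real embeddings have many more dimensions, and the metric actually used depends on how the vector store is configured. The names `cosine_similarity`, `top_k` and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: dot product divided by the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], indexed_chunks: list, k: int) -> list[str]:
    # Score every (text, vector) pair, sort by score descending, keep the first k texts.
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: the vectors are invented for illustration, not real embeddings.
toy_index = [
    ("chunk about JSON", [1.0, 0.0, 0.0]),
    ("chunk about graphs", [0.0, 1.0, 0.0]),
    ("chunk about JSON and graphs", [0.7, 0.7, 0.0]),
]
toy_query = [0.9, 0.1, 0.0]

results = top_k(toy_query, toy_index, k=2)
print(results)
# → ['chunk about JSON', 'chunk about JSON and graphs']
```

A linear scan like this is fine for a few hundred chunks; vector stores exist to make the same lookup fast on much larger collections.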

In [20]:
for i, doc in enumerate(docs):
    print(f"chunk n. {i+1}")
    print(doc.page_content)
    print()

chunk n. 1
Oracle® Database Oracle Database New Features Release 23c F48428-15 October 2023

chunk n. 2
1 Introduction Oracle Database 23c is the next long term support release of Oracle Database. Oracle Database 23c, code named “App Simple,”  accelerates Oracle's mission to make it simple to develop and run all data-driven applications. It's the sum of all the features from the Oracle Database 21c innovation release plus over 300 new features and enhancements. Key focus areas include JSON, graph, microservices, and developer productivity. Note: For information about desupported features, see Oracle Database Changes, Desupports, and Deprecations. JSON Relational Duality Data can be transparently accessed and updated as either JSON documents or relational tables. Developers benefit from the strengths of both, which are simpler and more powerful than Object Relational Mapping (ORM). See JSON-Relational Duality . Operational Property Graphs in SQL Developers can now build real-time graph 