## RAG using a PDF book
* see: https://python.langchain.com/docs/use_cases/question_answering/

In [1]:
# modified to load from a PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# two possible vector stores
from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS

# removed OpenAI, using HF
from langchain.embeddings import HuggingFaceEmbeddings

from langchain import hub

# removed OpenAI, using OCI GenAI
import oci

# oci_llm is in a local file
from oci_llm import OCIGenAILLM

from langchain.schema.runnable import RunnablePassthrough

# private configs
from config_private import COMPARTMENT_OCID

In [2]:
# set to True to enable debug output
DEBUG = False

In [3]:
# read OCI config to connect to OCI with API key
CONFIG_PROFILE = "DEFAULT"
config = oci.config.from_file("~/.oci/config", CONFIG_PROFILE)

# OCI GenAI endpoint (for now Chicago)
ENDPOINT = "https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com"

# print the config to verify the API key settings
if DEBUG:
    print(config)

#### Loading the document

In [4]:
# BLOG_POST = "https://python.langchain.com/docs/get_started/introduction"
BOOK = "./oracle-database-23c-new-features-guide.pdf"

loader = PyPDFLoader(BOOK)

data = loader.load()

#### Splitting the document into chunks

In [5]:
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

splits = text_splitter.split_documents(data)
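
The roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` can be illustrated with a plain sliding window over characters. This is a simplified sketch, not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to break on separators such as paragraphs and sentences); the function name `chunk_text` and the sample string are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk, provided it is not longer than the overlap.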

In [6]:
print(f"We have {len(splits)} splits...")

We have 143 splits...


In [7]:
# some post-processing

# replace newlines with spaces
for split in splits:
    split.page_content = split.page_content.replace("\n", " ")
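
A slightly broader clean-up, should it be needed, could collapse every run of whitespace (newlines, tabs, repeated spaces) into a single space. The sketch below is an optional variant, not what the cell above does; the helper name `clean_page_text` and the sample string are hypothetical.

```python
import re


def clean_page_text(text: str) -> str:
    """Collapse any run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


raw = "JSON Type Support\nfor External  Tables\n"
cleaned = clean_page_text(raw)
print(cleaned)
# JSON Type Support for External Tables
```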

In [9]:
# have a look at a single split
splits[20].page_content

'JSON Type Support for External Tables Support for access and direct-loading of JSON-type columns is provided for external tables. JSON data type is supported as a column type in the external table definition. Newline- delimited and JSON-array file options are supported, which facilitates importing JSON data from an external table. This feature makes it easier to load data into a JSON-type columns. Related Resources View Documentation JSON/JSON_VALUE will Convert PL/SQL Aggregate Type to/from JSON The PL/SQL JSON constructor is enhanced to accept an instance of a corresponding PL/SQL aggregate type, returning a JSON object or array type populated with the aggregate type data. The PL/SQL JSON_VALUE operator is enhanced so that its returning clause can accept a type name that defines the type of the instance that the operator is to return. JSON constructor support for aggregate data types streamlines data interchange between PL/SQL applications and languages that support JSON. Related Re

#### Embeddings and Vector Store

In [10]:
# We have substituted OpenAI with HF
# see leaderboard here: https://huggingface.co/spaces/mteb/leaderboard
# EMBED_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
EMBED_MODEL_NAME = "BAAI/bge-base-en-v1.5"

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}


hf = HuggingFaceEmbeddings(
    model_name=EMBED_MODEL_NAME, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

# use Chroma or FAISS as the vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=hf)
# vectorstore = FAISS.from_documents(documents=splits, embedding=hf)

retriever = vectorstore.as_retriever()
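
The `normalize_embeddings` flag controls whether vectors are scaled to unit length. The sketch below, using toy two-dimensional vectors and helper names chosen for illustration, shows why that can matter: once vectors are normalized, a plain dot product equals the cosine similarity. Whether this changes retrieval results depends on the distance metric the vector store is configured to use, which is not shown here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                                     # depends on vector length
normalized_dot = dot(l2_normalize(a), l2_normalize(b))  # length-independent

print(raw_dot, normalized_dot, cosine(a, b))
```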

#### Define the prompt structure

In [11]:
rag_prompt = hub.pull("rlm/rag-prompt")
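
Conceptually, a RAG prompt is a template with two slots, one for the retrieved context and one for the question. The sketch below uses plain string formatting; the template wording is hypothetical and is not the text of `rlm/rag-prompt`, which is fetched from the hub at run time.

```python
# Hypothetical template, written for illustration only.
HYPOTHETICAL_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "If the context does not contain the answer, say so.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the two slots of the template."""
    return HYPOTHETICAL_TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    context="Oracle Database 23c adds JSON Relational Duality.",
    question="What is new in 23c?",
)
print(prompt_text)
```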

#### Define the LLM: OCI GenAI

In [12]:
# compartment OCID from config_private.py

llm = OCIGenAILLM(
    temperature=1,
    max_tokens=2000,
    config=config,
    compartment_id=COMPARTMENT_OCID,
    endpoint=ENDPOINT,
    debug=DEBUG,
)

#### Define the (Lang)Chain

In [13]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
)
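
The data flow through the chain can be sketched with plain Python functions: the question is sent both to the retriever (to produce the context) and, unchanged, to the prompt, whose output goes to the LLM. Everything below is a stand-in written for illustration (`Doc`, `fake_retriever`, `fake_prompt`, `fake_llm`); it does not use LangChain. The `format_docs` step, which joins the retrieved chunks into one string, is an assumption added here and is not part of the chain above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


def fake_retriever(question: str) -> list:
    """Return fixed chunks; a real retriever would search the vector store."""
    return [Doc("JSON Relational Duality"), Doc("Operational Property Graphs in SQL")]


def format_docs(docs: list) -> str:
    """Join chunk texts into one context string (assumed extra step)."""
    return "\n\n".join(doc.page_content for doc in docs)


def fake_prompt(inputs: dict) -> str:
    """Fill a toy template from the 'context' and 'question' keys."""
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\nAnswer:"


def fake_llm(prompt: str) -> str:
    """Stub that only reports the prompt length instead of calling a model."""
    return f"stub answer for a prompt of {len(prompt)} characters"


def run_chain(question: str) -> str:
    # The dict step: the same question feeds both keys.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return fake_llm(fake_prompt(inputs))


answer = run_chain("List the new features")
print(answer)
```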

#### Process the question

In [41]:
# a list of possible questions
QUESTION1 = "What is the best architecture for an LLM?"
QUESTION2 = "What is LangChain?"
QUESTION3 = "Make a list of database 23c innovations in AI"
QUESTION4 = "List the new features in Oracle Database 23c"
QUESTION5 = "Are there features related to Time Series in Oracle Database 23c?"
QUESTION6 = "Are there features related to Machine Learning in Oracle Database 23c?"

In [43]:
%%time

# the question
QUESTION = QUESTION4

response = rag_chain.invoke(QUESTION)

print("The response:")
print(response)
print()

The response:
 Oracle Database 23c has the following new features:
1. Lock-Free Column Value Reservations
2. Microservice Support
3. JSON Relational Duality
4. Operational Property Graphs in SQL
5. Selective In-Memory Columns
6. CMAN Diagnostics and Logging Enhancements
7. DBMS_DICTIONARY_CHECK PL/SQL Package
8. Estimate the Space Saved with Deduplication
9. Extent-Based Scrubbing
10. High Availability Diagnosability Using the DBMS_SCHEDULER Package
11. In-Memory Advisor
12. Oracle Call Interface (OCI) APIs to Enable Client-Side Tracing
13. Rename LOB Segment
14. AutoUpgrade Release Update (RU) Upgrades
15. AutoUpgrade Sets Parallelism Based on System Resources
16. AutoUpgrade Supports Upgrades with Keystore Access to Databases Using TDE
17. AutoUpgrade Unplug-Plug Upgrades to Different Systems
18. REST APIs for AutoUpgrade

CPU times: user 66.2 ms, sys: 25.3 ms, total: 91.6 ms
Wall time: 7.05 s


In [33]:
QUESTION = QUESTION5

response = rag_chain.invoke(QUESTION)

print("The response:")
print(response)
print()

The response:
 Oracle Database 23c includes features related to Time Series such as:
- Time series functions and syntax
- Time series analysis and forecasting
- Time series-specific indexes

Is there anything else I can help you with?



In [34]:
QUESTION = QUESTION6

response = rag_chain.invoke(QUESTION)

print("The response:")
print(response)
print()

The response:
 Yes, Oracle Database 23c has features related to Machine Learning, including enhancements such as Automated Time Series Model Search, Explicit Semantic Analysis Support for Dense Projection with Embeddings, GLM Link Functions, Improved Data Prep for High Cardinality Categorical Features, Lineage: Data Query Persisted with Model, Multiple Time Series, Outlier Detection using Expectation Maximization (EM) Clustering, Partitioned Model Performance Improvement, XGBoost Support for Constraints and for Survival Analysis, and Vectorized Query Processing: Multi-Level Joins and Aggregations.



#### Explore the vector store

In [44]:
# Retrieve relevant splits for any question using similarity search.

# This is simply "top K" retrieval where we select documents based on embedding similarity to the query.

TOP_K = 5

docs = vectorstore.similarity_search(QUESTION4, k=TOP_K)

len(docs)

5
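
The "top K" idea described in the cell above can be sketched without any vector store: score every stored vector against the query vector and keep the K best. The toy vectors and the function names below are assumptions for illustration; a real store such as Chroma or FAISS uses its own index and a configurable distance metric, which may not be cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.2]

result = top_k(query, toy_docs, k=2)
print(result)
# [0, 2]
```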

In [45]:
for i, doc in enumerate(docs):
    print(f"chunk n. {i+1}")
    print(doc.page_content)
    print()

chunk n. 1
Oracle® Database Oracle Database New Features Release 23c F48428-15 October 2023

chunk n. 2
1 Introduction Oracle Database 23c is the next long term support release of Oracle Database. Oracle Database 23c, code named “App Simple,”  accelerates Oracle's mission to make it simple to develop and run all data-driven applications. It's the sum of all the features from the Oracle Database 21c innovation release plus over 300 new features and enhancements. Key focus areas include JSON, graph, microservices, and developer productivity. Note: For information about desupported features, see Oracle Database Changes, Desupports, and Deprecations. JSON Relational Duality Data can be transparently accessed and updated as either JSON documents or relational tables. Developers benefit from the strengths of both, which are simpler and more powerful than Object Relational Mapping (ORM). See JSON-Relational Duality . Operational Property Graphs in SQL Developers can now build real-time graph 