# A basic RAG system

This notebook shows how to build a simple retrieval-augmented generation (RAG) system.

We take an out-of-copyright geography text, chunk it, and store it in a vector database.

Then we ask questions, and the RAG system retrieves the most relevant chunks and uses them to generate answers.

## Setup

Install the necessary packages and load the API keys from an environment file.

In [6]:
%pip install --quiet -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv
load_dotenv("../keys.env");

In [2]:
PROVIDER = "Google"
#PROVIDER = "OpenAI"
PERSIST_DIR = "vectordb"

In [3]:
if PROVIDER == "Google":
    from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.1)
else:
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    embeddings = OpenAIEmbeddings()
    model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.1)

## Step 1: Getting the data

We'll use an out-of-copyright geography textbook as our example. Normally, of course, you would use documents relevant to your enterprise here. We'll fetch the web page, pull out the paragraphs, and collapse runs of whitespace in each one.

In [4]:
import urllib.request
import bs4
DOC_URL="https://www.gutenberg.org/cache/epub/3772/pg3772-images.html"
html = urllib.request.urlopen(DOC_URL)
paragraphs = [" ".join(p.get_text().split()).strip() for p in bs4.BeautifulSoup(html, 'html.parser').find_all('p')]
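The cleanup in the cell above relies on `" ".join(text.split())`, which collapses every run of whitespace (including line breaks carried over from the HTML source) into a single space. A minimal sketch of that idea follows. The helper name `clean_paragraphs` is hypothetical, and the filter that drops empty paragraphs is an assumption added for illustration; the original cell keeps them.

```python
def clean_paragraphs(raw_texts):
    """Collapse whitespace in each text and drop entries that end up empty.

    Hypothetical helper for illustration. The empty-paragraph filter is an
    assumption and is not applied by the notebook cell above.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space normalizes spaces, tabs and newlines.
    cleaned = (" ".join(text.split()) for text in raw_texts)
    return [text for text in cleaned if text]


sample = ["  The Lias\n   has been divided  ", "\n\t ", "Upper,\r\nMiddle, and Lower."]
print(clean_paragraphs(sample))
```

Dropping empty entries may be worth considering before embedding, since empty `<p>` tags would otherwise become empty chunks in the vector database.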

In [5]:
len(paragraphs)

2047

In [6]:
paragraphs[1090]

'Palæontological Relations of the Oolitic Strata.—Observations have already been made on the distinctness of the organic remains of the Oolitic and Cretaceous strata, and the proportion of species common to the different members of the Oolite. Between the Lower Oolite and the Lias there is a somewhat greater break, for out of 256 mollusca of the Upper Lias, thirty-seven species only pass up into the Inferior Oolite.'

## Step 2: Creating embeddings of the chunks and storing them in a vector database

We were careful to split the text into paragraphs, so that each chunk is reasonably consistent in topic. Another approach is to split into sentences. A third approach is to split into overlapping chunks of a fixed number of characters. LangChain provides several text splitters for this.
For example:
<pre>
RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
</pre>
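To make the overlapping-chunk idea concrete, here is a plain-Python sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` is implemented: that splitter also tries to break at separators such as paragraph and line breaks, which this sketch ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of chunk_size characters, where each chunk
    repeats the last chunk_overlap characters of the previous one.

    Hypothetical helper; a simplification of what a real splitter does.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.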

In [7]:
!rm -rf $PERSIST_DIR  # from scratch

In [8]:
from langchain_core.documents import Document
from langchain_chroma import Chroma

docs = [Document(page_content=p, metadata={"source": "geography", "paragraph": pno+1}) for pno, p in enumerate(paragraphs)]
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory=PERSIST_DIR)

In [9]:
!ls -lrth $PERSIST_DIR

total 18M
drwxr-xr-x 2 jupyter jupyter 4.0K Jul 31 22:22 d25490e4-557b-431d-9c73-6132ad2cba6e
-rw-r--r-- 1 jupyter jupyter  18M Jul 31 22:22 chroma.sqlite3


## Step 3: Load the vector db from disk and create a chain to ask questions

The question is embedded with the same embedding function as the paragraphs.
The chunks most similar to the question are retrieved and added to the context.
The LLM uses this context to answer the question.
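The retrieval step can be pictured as a nearest-neighbour search over vectors. The sketch below uses brute-force cosine similarity on hand-made three-dimensional vectors. This is only an illustration of the idea: the vector database uses its own index and configured distance metric, real embeddings have hundreds of dimensions, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(question_vector, chunk_vectors, k=5):
    """Return the indices of the k chunks most similar to the question."""
    scored = [
        (cosine_similarity(question_vector, vec), idx)
        for idx, vec in enumerate(chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for embeddings (assumed values, not real model output).
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_question = [1.0, 0.2, 0.0]
print(top_k(toy_question, toy_chunks, k=2))
```

The question and the chunks must be embedded with the same model, otherwise their vectors live in different spaces and the similarity scores are meaningless.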

In [10]:
from langchain_chroma import Chroma

vectorstore = Chroma(embedding_function=embeddings, persist_directory=PERSIST_DIR)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [11]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retrieval_chain = RunnablePassthrough() | retriever
retrieval_chain.invoke("What rocks do you find in the Upper Lias?")

[Document(metadata={'paragraph': 1104, 'source': 'geography'}, page_content='The Lias has been divided in England into three groups, the Upper, Middle, and Lower. The Upper Lias consists first of sands, which were formerly regarded as the base of the Oolite, but which, according to Dr. Wright, are by their fossils more properly referable to the Lias; secondly, of clay shale and thin beds of limestone. The Middle Lias, or marl-stone series, has been divided into three zones; and the Lower Lias, according to the labours of Quenstedt, Oppel, Strickland, Wright, and others, into seven zones, each marked by its own group of fossils. This Lower Lias averages from 600 to 900 feet in thickness.'),
 Document(metadata={'paragraph': 1103, 'source': 'geography'}, page_content='Lias.—The English provincial name of Lias has been very generally adopted for a formation of argillaceous limestone, marl, and clay, which forms the base of the Oolite, and is classed by many geologists as part of that group

In [12]:
def add_docs_to_context(docs):
    return "\n".join(doc.page_content for doc in docs)
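Because each document was stored with a `paragraph` number in its metadata, the context could also carry that number so answers can be traced back to the source. The following is a sketch of one possible variant, not part of the original notebook. `StubDocument` is a hypothetical stand-in that mimics only the `page_content` and `metadata` attributes, so the example runs without LangChain, and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def add_docs_to_context_with_refs(docs):
    """Join chunk texts, prefixing each with its paragraph number."""
    return "\n".join(
        f"[paragraph {doc.metadata.get('paragraph', '?')}] {doc.page_content}"
        for doc in docs
    )


stub_docs = [
    StubDocument("first chunk text", {"paragraph": 1104}),
    StubDocument("second chunk text", {"paragraph": 1103}),
]
print(add_docs_to_context_with_refs(stub_docs))
```

Whether the model actually cites these labels in its answer depends on the prompt, so this only makes the information available.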

In [13]:
from langchain import hub
rag_prompt = hub.pull("rlm/rag-prompt")
print(rag_prompt)

input_variables=['context', 'question'] metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]
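The printed template has two placeholders, `{context}` and `{question}`. To show what the model ultimately receives, here is a plain-Python sketch that fills the same template text with `str.format` instead of LangChain's prompt classes. The names `RAG_TEMPLATE` and `build_prompt` are hypothetical, and the chunk texts are placeholders rather than real retrieval results.

```python
# Template text copied from the printed prompt above.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)


def build_prompt(question, chunk_texts):
    """Join the retrieved chunk texts and substitute both placeholders."""
    context = "\n".join(chunk_texts)
    return RAG_TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    "What rocks do you find in the Upper Lias?",
    ["first retrieved paragraph", "second retrieved paragraph"],  # placeholders
)
print(prompt_text)
```

The instruction to say "don't know" when the context is insufficient is what discourages the model from answering from its own memory instead of the retrieved text.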


In [14]:
rag_chain = (
    {"context": retriever | add_docs_to_context, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)

In [15]:
rag_chain.invoke("What rocks do you find in the Upper Lias?")

'The Upper Lias consists of sands, clay shale, and thin beds of limestone.  The sands were originally thought to be part of the Oolite, but fossils indicate they belong to the Lias.  The Upper Lias is characterized by an alternation of thin limestone beds and dark-colored argillaceous partings. \n'