# RAG: Indexing

This notebook covers the indexing half of a simple retrieval-augmented generation (RAG) system.

We take an out-of-copyright geography text, chunk it, and store it in a vector database.

Then, in the generation notebook, we ask questions and the RAG system finds us answers.

## Setup

Install the required packages and load the API keys from `../keys.env`.

In [2]:
%pip install --quiet -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv
load_dotenv("../keys.env");

In [2]:
PROVIDER = "Google"
#PROVIDER = "OpenAI"
PERSIST_DIR = "vectordb"

In [3]:
if PROVIDER == "Google":
    from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.1)
else:
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    embeddings = OpenAIEmbeddings()
    model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.1)
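
Both branches read their credentials from the environment, which is why `load_dotenv` runs first. Below is a minimal sketch of a pre-flight check. `REQUIRED_KEY` and `missing_key` are hypothetical names, and the sketch assumes the default variable names (`GOOGLE_API_KEY` for `langchain_google_genai`, `OPENAI_API_KEY` for `langchain_openai`); your `keys.env` may be organised differently.

```python
import os

# Assumed default environment variable names for each provider.
REQUIRED_KEY = {"Google": "GOOGLE_API_KEY", "OpenAI": "OPENAI_API_KEY"}

def missing_key(provider, env=None):
    """Return the name of the missing API key variable, or None if it is set."""
    env = os.environ if env is None else env
    name = REQUIRED_KEY[provider]
    return None if env.get(name) else name

# Example with explicit dicts instead of the real environment.
print(missing_key("Google", {"GOOGLE_API_KEY": "abc"}))  # None
print(missing_key("OpenAI", {}))                         # OPENAI_API_KEY
```

Calling `missing_key(PROVIDER)` before constructing the model gives a clearer error than a failed API call later on.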

## Step 1: Getting the data

We'll use an out-of-copyright geography textbook as our example. Normally, of course, you would use documents relevant to your enterprise here. We'll fetch the web page, pull out the paragraphs, and collapse stray whitespace.

In [4]:
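import urllib.request
import bs4

DOC_URL = "https://www.gutenberg.org/cache/epub/3772/pg3772-images.html"
# Download the book; the context manager closes the connection after parsing.
with urllib.request.urlopen(DOC_URL) as response:
    soup = bs4.BeautifulSoup(response, "html.parser")
# Take the text of every <p> tag and collapse runs of whitespace into single spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]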

In [5]:
len(paragraphs)

2047

In [6]:
paragraphs[1090]

'Palæontological Relations of the Oolitic Strata.—Observations have already been made on the distinctness of the organic remains of the Oolitic and Cretaceous strata, and the proportion of species common to the different members of the Oolite. Between the Lower Oolite and the Lias there is a somewhat greater break, for out of 256 mollusca of the Upper Lias, thirty-seven species only pass up into the Inferior Oolite.'
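
The whitespace cleanup used in Step 1 can be seen in isolation in the sketch below. The `normalize` helper and the sample strings are made up for illustration. The final filtering step is an optional extra that the notebook does not perform; applying it would change the paragraph count and numbering.

```python
def normalize(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())

raw = "  Palæontological   Relations\nof the\tOolitic Strata.  "
print(normalize(raw))  # Palæontological Relations of the Oolitic Strata.

# Hypothetical extra step: drop paragraphs that are empty after cleanup.
samples = ["First  paragraph.", "   ", "\n", "Second\nparagraph."]
cleaned = [normalize(s) for s in samples]
non_empty = [s for s in cleaned if s]
print(non_empty)  # ['First paragraph.', 'Second paragraph.']
```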

## Step 2: Creating embeddings of the chunks and storing them in a vector database

We split the text into paragraphs so that each chunk stays reasonably consistent in topic. Another approach is to split into sentences. A third is to split into overlapping chunks of a fixed number of characters. LangChain provides text splitters for these strategies.
For example:
<pre>
RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
</pre>
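
To make the overlapping-chunk idea concrete, here is a plain-Python sketch of a fixed-window splitter. It is not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this sketch ignores. The name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(text, chunk_size=10, chunk_overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.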

In [7]:
!rm -rf $PERSIST_DIR  # from scratch

In [8]:
from langchain_core.documents import Document
from langchain_chroma import Chroma

# One Document per paragraph; "paragraph" is a 1-based index into the source text.
docs = [Document(page_content=p, metadata={"source": "geography", "paragraph": pno}) for pno, p in enumerate(paragraphs, start=1)]
# Embed every document and persist the vectors to PERSIST_DIR.
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory=PERSIST_DIR)

In [9]:
!ls -lrth $PERSIST_DIR

total 18M
drwxr-xr-x 2 jupyter jupyter 4.0K Jul 31 22:22 d25490e4-557b-431d-9c73-6132ad2cba6e
-rw-r--r-- 1 jupyter jupyter  18M Jul 31 22:22 chroma.sqlite3
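
Conceptually, Step 2 turns each chunk into a vector and stores it next to the text so that a later query can be matched by similarity. The toy sketch below illustrates only that idea. It uses word counts in place of a real embedding model and a Python list in place of the database, so it is not how Chroma or the provider embeddings work internally. All names in it are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real embedding model returns dense vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "rivers carry sediment to the sea",
    "fossil shells are found in the oolite strata",
    "volcanoes erupt lava and ash",
]
index = [(embed(c), c) for c in chunks]  # stands in for the vector database

def search(query, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(search("which strata contain fossil shells"))
# ['fossil shells are found in the oolite strata']
```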


## Next step

Continue with [./rag_1_generation.ipynb](rag_1_generation.ipynb), which asks questions against the vector database built here.