<div class="alert alert-block alert-info">

# Retrieval Augmented Generation (RAG)

In [None]:
#libraries, make sure to install these in your python environment
#pip install langchain langchain-openai langchain-community
#pip install beautifulsoup4
#pip install pypdf
#pip install langchain-huggingface sentence_transformers
#if you are having problems with dependencies, try upgrading these libraries:
#pip install --upgrade  langchain langchain-huggingface sentence_transformers
#pip install langchain-postgres

### Navigation <a class="anchor" id="navigation"></a>

[Task 1](#task1)
[Task 2](#task2)
[Task 3](#task3)
[Task 4](#task4)

<div class="alert alert-block alert-info">

## Motivation

In the previous notebook, you learned the building blocks of creating an LLM chain using LangChain, and built your first small AI chatbot!

That chatbot used a model supplied by a provider, pre-trained on data that we can't change. Unfortunately, getting such a model to do things it wasn't trained to do is a much larger task. As a matter of fact, this is a problem that researchers are still struggling with and [exploring](https://arxiv.org/abs/2312.01203).
Some consider the ability of a model to learn from new data after it has been trained to be the definition of intelligence that artificial intelligence should strive for.
To keep up with professional fields in a dynamic world, it would be super useful to have this functionality.

There are a few tools to tackle this, including:

1. Fine tuning
    - Involves selecting a small subset of the [weights](https://artificialintelligenceschool.com/understanding-weights-in-large-language-models/#2-the-anatomy-of-weights-in-neural-networks) of a model, and re-training these weights over a few [epochs](https://aiwiki.ai/wiki/Epoch) (training cycles) on new data. That is, re-training part of the model to do something new that you would like it to do.
    - Is limited by the [compute](https://www.lleverage.ai/glossary/compute) required to retrain the model, as well as by the number of times you can retrain the model without running into something called [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference)! Oh no! This means that the model starts to lose its fundamental functionality: re-training overwrites the weights that enable the model to function, so much so that the model fails to remain useful.
2. In-context learning
    - This is where the user (you) gives the model examples of a task, or feedback on how well it's doing, within the conversation. This lets the model change its output without changing the weights, something only thought possible in the past 5 years.
    - But who wants to do that? This approach is called few-shot learning, and it not only takes a few messages back and forth with the chatbot, but also becomes more limited the less similar the required task is to the model's [modus operandi](https://en.wikipedia.org/wiki/Modus_operandi). What we really want is [zero-shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning), where you don't have to interact with the model at all before it's ready to answer your questions and do your bidding! How good is that?

3. *RAG*
   - This is similar to the context we were providing when prompting models in the last document. However, once the information exceeds a few pages, it soon becomes impractical to give the model this data every time you want an output.
   - Instead of expanding the '[context window](https://www.ibm.com/think/topics/context-window)' of your model so that it can 'remember' all of the data you pass it (every time that you prompt the model), RAG uses a pre-processing step to identify which parts of your data it needs to access to answer your question.
   - LangChain splits your data into small chunks, represents the semantic meaning of each chunk [as an embedding](#embeddings), and uses this semantic representation to quickly search your data for the chunks that are relevant to your question.

This tutorial document will take you through pre-processing and manipulating your data using LangChain and then using it with your model.

<div class="alert alert-block alert-info">

### Pre-processing

The first step of the pre-processing phase is to convert your document to a document type that LangChain can handle. LangChain provides document loaders for various types of document sources, a couple of which we will use below.

<div class="alert alert-block alert-warning">

### Task 1 <a class="anchor" id="task1"></a>

Ok, wow that's a lot of theory. So let's now do something practical and use this langchain loader.

Please edit the pre_process.txt file to contain some data or context that you want to pass to our RAG machine, and use the pseudocode below to load it. We will use the TextLoader, from [LangChain's loaders](https://python.langchain.com/docs/integrations/document_loaders/). You need to create an instance of the loader and pass it the location of the text file pre_process.txt. Calling the loader's load() method then returns the loaded documents as a list.

In [None]:
from langchain_community.document_loaders import TextLoader
#loader = TextLoader(???)
#load the document using load()

In [6]:
from langchain_community.document_loaders import TextLoader
#example answer
loader = TextLoader("pre_process.txt")
pre_process_document = loader.load()
print(pre_process_document)

[Document(metadata={'source': 'pre_process.txt'}, page_content="The most recent advancedements in NLP are being driven by LLMS. These models benefit greatly from large size and are used by devs working with Natural Language Processing. Developers can use these models through Hugging Face's 'transformers' library, or by utilizing OPenAI and Cohere's offering through the 'openai'\n and 'cohere' libraries, respectively.")]
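The output above shows the two parts of a loaded document: the text itself (page_content) and a metadata dictionary recording where it came from. As a rough illustration only, the sketch below mimics that shape with plain Python. `SimpleDocument` and `load_text_file` are hypothetical names invented for this example, not LangChain classes, and the real TextLoader does more (for example, encoding handling).

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Read a UTF-8 text file into a one-element list of SimpleDocument."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a throwaway file so the example runs anywhere, then load it.
with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder) / "demo.txt"
    demo_path.write_text("Some context for the RAG pipeline.", encoding="utf-8")
    docs = load_text_file(demo_path)

print(docs[0].page_content)
print(docs[0].metadata)
```

Returning a list, even for a single file, matches what load() printed above and means later steps can treat one file and many pages the same way.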


This is the basic format of loaders. As part of Task 1, use the WebBaseLoader below to load HTML from a URL of your choice as text:

In [None]:
from langchain_community.document_loaders import WebBaseLoader
# loader = ???
# load the document

In [5]:
#example answer
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Artificial_intelligence")
ai_wiki_document = loader.load()

When your document is longer than the [context window](https://www.ibm.com/think/topics/context-window) of your model, RAG is the only one of the discussed techniques that is practical. This often happens with long PDFs, so below, download and load the [23-24 Annual report from the ABS](https://www.abs.gov.au/about/our-organisation/corporate-reporting/abs-annual-reports) (Australian Bureau of Statistics), or a document of your choice, using PyPDFLoader.

In [7]:
from langchain_community.document_loaders import PyPDFLoader
# give the loader the path to your pdf
# load the doc, make sure to name it as we will use it soon

In [8]:
#example answer
loader = PyPDFLoader('23-24_ABS_report.pdf')
ABS_report_langchain = loader.load()

<div class="alert alert-block alert-info">

### Creating chunks

Now that we have read and loaded our data, we can move on to splitting it into chunks, as discussed earlier. But how is this done?

- If you take a close look at the text coming out of our load() method, you should see a lot of spaces between words, '\n' characters, which denote new lines, and '\n\n', which denotes a new paragraph.
- You will input a chunk size, say 100, which will be the maximum size of your chunks (smaller pieces of data), measured in characters by default.
- Assuming your input data is larger than your chunk size, LangChain will first split your data at each paragraph break ('\n\n'). Then whichever chunks are still larger than the chunk size will be split by the next separator, line breaks ('\n'). Lastly, if some chunks are still too large, they will be split at the spaces between words.
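The splitting order described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical helper, it drops the separators, and it does not merge small neighbouring pieces back together up to the chunk size, which the real splitter does.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing only where needed."""
    # A piece that already fits needs no further splitting.
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left: fall back to a hard cut every chunk_size characters.
    if not separators:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        # Oversized pieces are retried with the next, finer separator.
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


sample = "First paragraph.\n\nSecond paragraph, line one.\nLine two is here."
chunks = recursive_split(sample, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

With a chunk size of 20, the first paragraph and the last line fit as they are, while the 27-character line in the middle has to fall through to the word-level separator.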

<div class="alert alert-block alert-warning">

### Task 2 <a class="anchor" id="task2"></a>

Split your already loaded document below.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
#split_docs_from_pdf = splitter.split_documents(#your loaded pdf here)

In [10]:
#example answer
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs_from_pdf = splitter.split_documents(ABS_report_langchain)


The chunk_overlap argument lets neighbouring chunks share some text (here, up to 200 characters), so that a sentence cut at a chunk boundary keeps some of its surrounding context in each chunk.
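To see what overlap means in practice, here is a minimal sketch that cuts a string into fixed-size character windows. It is an assumption-laden simplification: `sliding_chunks` is a hypothetical helper, and LangChain's splitter builds its overlap from pieces split at separators, so real overlaps follow word and line boundaries and can be shorter than chunk_overlap.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


windows = sliding_chunks("abcdefghijklmnop", chunk_size=8, chunk_overlap=3)
print(windows)
```

Each window starts with the last three characters of the one before it, which is the same idea as the 200 shared characters in the splitter above.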

<div class="alert alert-block alert-info">

### These snippets of text are written in Markdown,

 a markup language for creating formatted text with a plain text editor. LangChain splitters conveniently provide methods for splitting markdown (and other) languages, so let's have a go at splitting this markdown block using a small chunk size. First we will copy and store it as a variable.

In [11]:
markdown_snippet = """<div class="alert alert-block alert-info">

### These snippets of text are written in Markdown,

 a markup language for creating formatted text with a plain text editor. LangChain splitters conveiniently provide methods for splitting markdown (and other) languages, so let's have a go at splitting this markdown block using a small chunk size. First we will copy and store it as a variable."""

In [12]:
markdown_splitter = RecursiveCharacterTextSplitter.from_language(language="markdown", chunk_size=10, chunk_overlap=0)
markdown_docs = markdown_splitter.create_documents([markdown_snippet])


<div class="alert alert-block alert-warning">

### Task 3 <a class="anchor" id="task3"></a>

<div class="alert alert-block alert-info">

For our work to continue, we need vector representations of our data; this process is called embedding. LangChain can interface with different text embedding models; we will use Hugging Face's to embed our PDF document from earlier. Make sure you have installed the correct libraries in your Python environment from the [top](#) of the document. For this task, play around with the names of the variables to extract and embed text from all the previously split data.

<div class="alert alert-block alert-info"> <a class="anchor" id="embeddings"></a>

You might be wondering what embeddings are and how we get this 'semantic' meaning when computers think in terms of numbers, not meaning. It's at this point that I recommend you check out the [embeddings](./explanations/embeddings_primer.ipynb) document for a more detailed explanation, though it's not mandatory to complete this tutorial. 

The embedding models take a list of strings as input, so first we need to extract the text from our PDF chunks. Our splitter returns documents (chunks), each with its own metadata, rather than plain text.

In [13]:
#edit for different splits
pdf_chunk_texts = [chunk.page_content for chunk in split_docs_from_pdf]

In [14]:
#edit for different splits
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
embeddings_model = HuggingFaceEmbeddings()
pdf_embeddings = embeddings_model.embed_documents(pdf_chunk_texts)

<div class="alert alert-block alert-info">

## Vectorstore

Now that we have embeddings of our chunks, the next step is to conduct a similarity search using a measure such as cosine similarity:
$$
\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \\
\text{where:}\\
A \cdot B = \sum_{i=1}^{n} A_i B_i, \\
\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2}, \quad
\|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}
$$  
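The formula above translates almost term by term into plain Python. This is a minimal sketch for intuition; the two-dimensional vectors are made up for the example, whereas real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))      # A . B
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||B||
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
in_between = cosine_similarity([3.0, 4.0], [4.0, 3.0])
print(same_direction, orthogonal, in_between)
```

Vectors pointing the same way score 1 regardless of their length, unrelated (orthogonal) vectors score 0, and everything else falls in between.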

To implement this, we will pass our embeddings to a database called a Vectorstore, that can quickly and efficiently conduct the similarity searches that we want on unstructured data. <a class="anchor" id="vectorstore"></a>

<p align="center">
<img src="images/vector_store.png" alt="drawing" width="600">
<br>
<em>Figure 2: Vectorstore process, from <a href="https://python.langchain.com/docs/concepts/vectorstores/" target="_blank">langchain</a></em>
</p> 


Vectorstores have the same CRUD (create, read, update, delete) and search functionality as normal databases, which gives them a lot of use cases, including our RAG. We will be using PostgreSQL as our vectorstore provider, along with its extension, PGVector. Don't worry too much if you haven't used databases before, as for our purposes it's quite straightforward, but please download [Docker](https://docs.docker.com/get-started/get-docker/) for your OS, which we will use for PGVector. Then run the following in your terminal:




docker run \
--name pgvector-container \
-e POSTGRES_USER=langchain \
-e POSTGRES_PASSWORD=langchain \
-e POSTGRES_DB=langchain \
-p 6024:5432 \
-d pgvector/pgvector:pg16


<div class="alert alert-block alert-info">
After running the above, you should see a pgvector container with a green light next to it in your Docker Desktop app. We will also need the following connection string:
postgresql+psycopg://langchain:langchain@localhost:6024/langchain

This will set up a connection to our postgres container running in docker.
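If the connection string looks opaque, it can help to pull it apart with Python's standard library. The sketch below only parses the string; it does not connect to anything. Each part lines up with a flag from the docker command: the user, password and database come from the POSTGRES_* variables, and 6024 is the host side of the -p 6024:5432 port mapping. The scheme appears to follow the SQLAlchemy "dialect+driver" convention, which is an assumption on my part rather than something stated above.

```python
from urllib.parse import urlsplit

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
parts = urlsplit(connection)

print("scheme:  ", parts.scheme)            # database dialect and driver
print("user:    ", parts.username)          # POSTGRES_USER
print("password:", parts.password)          # POSTGRES_PASSWORD
print("host:    ", parts.hostname)          # the machine running Docker
print("port:    ", parts.port)              # host port from -p 6024:5432
print("database:", parts.path.lstrip("/"))  # POSTGRES_DB
```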


Now, let's use our vectorstore with our split documents (chunks) from before:

In [15]:
from langchain_postgres.vectorstores import PGVector
connection = 'postgresql+psycopg://langchain:langchain@localhost:6024/langchain'
pgvector_database = PGVector.from_documents(split_docs_from_pdf,embeddings_model,connection=connection)

Here the vectorstore does a lot of work for us: we simply pass it our chunks, the embedding model we wish to use and the connection to our pgvector instance. The chunks are then embedded, stored alongside their metadata and ready to be used.

In [16]:
query = 'what statistics do the ABS publish?' 
number_of_results = 4
similarity_search_results = pgvector_database.similarity_search(query,k=number_of_results)
print(similarity_search_results[0].page_content)

By following the IMF standard, the ABS continues its reputation for delivering 
high-quality and credible statistics that adhere to expected benchmarks. The IMF 
assessment results empower users to objectively compare Australia’s statistical 
capabilities with those of other countries. ABS official statistics conform to both 
Australian and international standards, accessible on the ABS website. 
13 An error is significant if it could mislead a user as to the value of a statistical indicator of national or state 
importance.
14 Tier 1 statistical releases represent the foundation work of a national statistical organisation.
15 The methodology for this measure only includes errors found in statistical releases published on 
the ABS website.


Passing our query to similarity_search() first calls the embedding model to embed the query. Postgres can then search for the k most similar embeddings and return their chunks, ordered from most to least similar.
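Conceptually, the search step amounts to scoring every stored embedding against the query embedding and keeping the best k. The sketch below does this by brute force in plain Python. The three-dimensional vectors and chunk texts are invented for the example, `top_k` is a hypothetical helper, and the real work happens inside Postgres, which may also use an index rather than comparing against every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, stored, k):
    """Return the k (text, score) pairs most similar to query_vector, best first."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings: in the notebook these would come from embeddings_model.
stored_chunks = [
    ("chunk about census data", [0.9, 0.1, 0.0]),
    ("chunk about staff numbers", [0.1, 0.9, 0.1]),
    ("chunk about labour statistics", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded query

results = top_k(query_vector, stored_chunks, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```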

<div class="alert alert-block alert-info">
Our pgvector_database now has database functionality, such that:

- We can use the .add_documents() method to add more documents to our database, as well as assign them ids with the ids argument
- The ids above can be passed to the .delete() method to delete selected documents

Changing our database, however, introduces a new problem: re-indexing our data can lead to costly recomputation of embeddings and duplication of content. For this, LangChain provides an indexing API with a RecordManager to keep track of our changes. It computes a [hash](https://en.wikipedia.org/wiki/Hash_function) for each document when content is indexed, and the RecordManager stores this hash along with the edit time and the source ID (from the document's metadata). The API also has tools to help clean up your data and avoid duplicates.
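The idea behind hash-based tracking can be shown without any database. The sketch below is not the LangChain indexing API: `content_hash` and `plan_indexing` are hypothetical helpers, and a plain Python set stands in for the RecordManager, which additionally stores timestamps and supports cleanup. It only shows why hashing lets us skip chunks that have not changed.

```python
import hashlib


def content_hash(text, source):
    """Fingerprint a chunk by its text and source so unchanged chunks can be recognised."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def plan_indexing(chunks, seen_hashes):
    """Separate chunks that need embedding from those already indexed."""
    to_embed, skipped = [], []
    for text, source in chunks:
        digest = content_hash(text, source)
        if digest in seen_hashes:
            skipped.append(text)      # identical content was indexed before
        else:
            seen_hashes.add(digest)   # remember it for the next run
            to_embed.append(text)
    return to_embed, skipped


seen = set()  # stand-in for the record store
first_run = plan_indexing(
    [("ABS publishes statistics.", "report.pdf"), ("Tier 1 releases.", "report.pdf")], seen
)
second_run = plan_indexing(
    [("ABS publishes statistics.", "report.pdf"), ("A new paragraph.", "report.pdf")], seen
)
print(first_run)
print(second_run)
```

On the second run only the new paragraph would be sent to the embedding model, which is where the savings come from.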