In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

True

## Retrieval Augmented Generation (RAG)

As you know, LLMs are models that have been trained on large amounts of data. While they can generate human-like content, there is no guarantee that their answers are trustworthy. Suppose we would like to use an LLM to build a chatbot that answers the questions of a particular company's customers. Since the LLM saw nothing about this company and its products during training, it fails to provide appropriate answers to the customers. To address this issue, one may consider fine-tuning the LLM for this application. However, this brings two challenges: a) the need for a large amount of data (from the company), b) the need for large computational resources (required for training LLMs).

Fortunately, there is a better solution that does not have these issues. We can help the LLM provide appropriate answers (to the customers' questions) by supplying it with context. For instance, if a customer asks about a particular product, we can give the model the information about this product. This is called `Retrieval Augmented Generation` (RAG), which means:
* helping the LLM by retrieving the relevant information from a database.

A typical RAG system consists of two components:
* indexing: creation of a database containing the company data (in numerical format), 
* retrieval and generation: given a query (from a customer), retrieve the most relevant data (from the database) and generate the answer (based on the relevant data).

In this tutorial, we focus on the first component and the second one will be presented in the next tutorial.
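
To make the two components concrete, here is a minimal, library-free sketch. It is not how the rest of this tutorial implements RAG: the word-overlap score stands in for a real embedding model, and the names `tokenize`, `build_index`, `retrieve`, and `build_prompt`, as well as the sample texts, are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(texts):
    """Indexing: store each text next to a searchable representation."""
    return [(tokenize(t), t) for t in texts]

def retrieve(index, query, k=1):
    """Retrieval: return the k texts sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(index, key=lambda item: len(q & item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(context, question):
    """Generation input: the retrieved context is placed before the question."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

# Made-up company data, for illustration only
texts = [
    "The X100 vacuum cleaner has a two-year warranty.",
    "Our support line is open on weekdays from 9 to 17.",
]
index = build_index(texts)
question = "How long is the warranty of the X100?"
context = retrieve(index, question)
prompt = build_prompt(context, question)
print(prompt)
```

In a real system the prompt would then be sent to the LLM; that step is left out here because it belongs to the second component.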

### Indexing

Suppose we have a set of documents (from the company) containing useful information for answering questions. Here, our goal is to create a pipeline for
* ingesting data from a source and indexing it.

In order to do that, we follow a three-step process.
1) Load: first we should load the data using [Document Loader](https://python.langchain.com/docs/concepts/document_loaders/).
2) Split: then we use [Text Splitter](https://python.langchain.com/docs/concepts/text_splitters/) to split texts into smaller chunks. The main reason is that not all text (of a document) is relevant to a particular question. Thus, by splitting the text into smaller chunks, we can retrieve only the relevant part.
3) Embedding and Storage: in this step, we convert each chunk into a numerical (vector) representation and then store it. This is done by using [Embedding models](https://python.langchain.com/docs/concepts/embedding_models/) and a [Vector Store](https://python.langchain.com/docs/concepts/vectorstores/).

In [6]:
from langchain import hub # to download prompt

# from langchain_community.document_loaders import WebBaseLoader # to load a webpage
from langchain_community.document_loaders import PyPDFLoader # to load a pdf

from langchain.text_splitter import RecursiveCharacterTextSplitter # to split a text

In [38]:
# create a loader object for the pdf file
data_folder = './data'
loader = PyPDFLoader(os.path.join(data_folder, 'caselaw_1.pdf'))

In [39]:
# create a text splitter object
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, # number of characters per chunk
    chunk_overlap=50, # number of characters overlap between two adjacent chunks
)

In [42]:
# load documents and then split them into smaller chunks
docs = loader.load_and_split(text_splitter)
print(f"There are {len(docs)} chunks.")

There are 15 chunks.
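
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-width splitter. This is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, which the hypothetical `split_fixed` helper below does not do.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; consecutive windows
    share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_fixed("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is less likely to be cut off from its context.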


##### Vector stores

We create a vector store that contains the vector embedding of each chunk. Here, we have two options for the vector store:
* Chroma: it runs on your local machine as a library.
* other tools (such as Pinecone), which provide broad functionality to store and search over vectors.

Thus, you can run either of the following two cells.

In [49]:
# Option 1: Chroma
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma

# the embedding model: it converts the chunks into vectors
embedding = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

db = Chroma.from_documents(
    docs,
    embedding=embedding,
    persist_directory="./data/chroma_langchain_db",  # Where to save data locally.
)

In [None]:
# Option 2: Pinecone
# !pip install langchain_pinecone
# from langchain_openai.embeddings import OpenAIEmbeddings
# from langchain_pinecone import PineconeVectorStore

# embedding = OpenAIEmbeddings(
#     model="text-embedding-ada-002"
# )
# db = PineconeVectorStore.from_documents(
#     docs,
#     embedding=embedding,
#     index_name = "langchain-rag-index",
# )

In [78]:
# check whether the vectors capture the semantics of the text
res = db.similarity_search("Who are judges?")
res[0].page_content[50:]

'Judge Ch., sitting in a single-judge formation, \nbegan the examination of the criminal case.\n6.  On 30 September 2015 Judge Ch. held a final hearing. He retired to the \ndeliberation room, but did not return to the hearing room to deliver a verdict.\n7.  The Government submitted a copy of a note signed by P., merely \nconfirming that on 30 September 2015 he had received a copy of a verdict in \na criminal case without any further detail.\n8.  According to the applicant, he had had a conversation after the hearing \nwith Judge Ch., who had informed him that the verdict would be ready in two \nor three weeks. However, the applicant had not received a copy of the verdict, \nnor did the case file contain one.\n9.  On 16 February 2016 Judge Ch. was dismissed by a presidential order.\n10.  According to letters from the then acting President of the \nBabushkinskyi District Court of Dnipropetrovsk of 10 November 2016 and'
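
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below imitates this with word-count vectors and cosine similarity. Both are stand-ins chosen for illustration: a real embedding model captures meaning rather than shared words, and the distance metric used by a given vector store may differ. The names `embed`, `cosine`, and `search`, and the sample chunks, are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(store, query, k=2):
    """Return the k stored (score, text) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), text) for vec, text in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Made-up chunks, for illustration only
chunks = [
    "The judge held a final hearing in September.",
    "The applicant did not receive a copy of the verdict.",
    "The court is located in Strasbourg.",
]
store = [(embed(c), c) for c in chunks]  # the toy "vector store"
results = search(store, "Which judge held the hearing?")
for score, text in results:
    print(f"{score:.2f}  {text}")
```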

##### Loading a vector store in Chroma

In [70]:
# reopen the store that was persisted to disk earlier
vector_store = Chroma(
    embedding_function=embedding,
    persist_directory="./data/chroma_langchain_db",  # Directory where the data was saved
)

In [None]:
vector_store.get()
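
`vector_store.get()` returns a dictionary whose keys include `ids`, `embeddings`, and `documents`. A small helper such as the hypothetical `preview_store` below can make that output easier to scan; the sample dictionary is made up to mirror that shape and is not the real store content.

```python
def preview_store(data, width=40):
    """Pair each id with the first `width` characters of its document."""
    rows = []
    for doc_id, text in zip(data["ids"], data["documents"]):
        flat = " ".join(text.split())  # collapse newlines and extra spaces
        rows.append((doc_id, flat[:width]))
    return rows

# Made-up sample with the same shape as the dictionary returned by get()
sample = {
    "ids": ["id-1", "id-2"],
    "embeddings": None,
    "documents": [
        "FIFTH SECTION\nCASE OF ORLOV v. UKRAINE",
        "In the case of Orlov v. Ukraine,",
    ],
}
for doc_id, snippet in preview_store(sample):
    print(doc_id, "|", snippet)
```

With the real store, the call would be `preview_store(vector_store.get())`.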

In [76]:
vector_store.similarity_search("Who are judges")

In [59]:
import chromadb

persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_or_create_collection("collection_name")
collection.add(ids=["1", "2", "3"], documents=["a", "b", "c"])

vector_store_from_client = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embedding,  # the embedding model defined above
)

{'ids': ['005c0e60-10a1-477e-be87-681492464747',
  'c112a22f-ecd2-43c8-b2c5-7729ebe6d1c6',
  '5cc424d6-6ead-41b9-927d-87c144103f7a',
  '734abcfe-049d-4215-83c9-1776cca20862',
  '971b3c28-4146-4d38-96c7-8f42f5cf9281',
  'd77b292a-5f72-40cc-9993-a6067fa4314b',
  'ba3d74ad-1c64-45f5-b3b7-71c717f98cae',
  '73406c37-3dc6-49fc-a004-be17cb9a61cd',
  'a677b2b6-5b69-454e-874a-5200237124c5',
  '868aec41-6947-4c75-a427-880a5351ea17',
  'b421e35f-9693-48d9-a9fd-a99ed0e913bc',
  'de6101ad-cd1d-487c-b432-edcc8d062463',
  '90257941-fba5-44b9-814f-296a2ad5ed33',
  '3b1ab197-95c6-4d87-8e99-6b762b2c5ceb',
  'd6331fb7-781e-46e6-81bf-57e3619e9d2d'],
 'embeddings': None,
 'documents': ['FIFTH SECTION\nCASE OF ORLOV v. UKRAINE\n(Application no. 10993/18)\nJUDGMENT\nSTRASBOURG\n21 November 2024\nThis judgment is final but it may be subject to editorial revision.',
  'ORLOV v. UKRAINE JUDGMENT\n1\nIn the case of Orlov v. Ukraine,\nThe European Court of Human Rights (Fifth Section), sitting as a \nCommittee co