# 2. LangChain **RAG**

<a target="_blank" href="https://colab.research.google.com/github/IT-HUSET/ai-workshop-250121/blob/main/lab/2-langchain-retrieval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a><br/>

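Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow. This lab walks through the indexing side of that flow: loading documents, splitting them, embedding the splits, and storing them in a vector store that a retriever can query.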

![RAG - indexing](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png)

## Setup

### Install dependencies

In [None]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install chromadb~=0.5.18 sentence-transformers~=3.3 lark~=1.2 --upgrade --quiet
%pip install langchain~=0.3.10 langchain_openai~=0.2.11 langchain_community~=0.3.10 langchain-chroma~=0.1.4 --upgrade --quiet
%pip install youtube-transcript-api~=0.6.3 --upgrade --quiet



# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

### Load environment variables

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")

### Set up the chat model and embedding model

In [None]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
api_version = "2024-10-01-preview"
llm = AzureChatOpenAI(deployment_name="gpt-4o-mini", temperature=0.0, openai_api_version=api_version)
embedding_model = AzureOpenAIEmbeddings(model="text-embedding-3-large", openai_api_version=api_version)

### Set up path to data

In [None]:
data_path = "../data"

## Document Loading

### PDFs

PDFs can be loaded in a number of different ways, but the easiest is by using the `PyPDFLoader` class. PDFs can be loaded from a local file or a URL.

In [None]:
from langchain_community.document_loaders import PyPDFLoader
#loader = PyPDFLoader("some_local_file.pdf")
loader = PyPDFLoader("https://data.riksdagen.se/fil/CDA05163-DE71-448D-807D-747C997E8F3A") # AI:s betydelse för framtidens arbetsmarknad och skola
#loader = PyPDFLoader("https://data.riksdagen.se/fil/61B7540B-EEDD-4922-B61B-FC0A9F3AE4E2") # 2024/25:263 AI, annan ny teknik och de mänskliga rättigheterna
#loader = PyPDFLoader("https://data.riksdagen.se/fil/0D43150B-5B31-43A4-89CD-4FE0478EC6C7") # 2024/25:263 AI, annan ny teknik och de mänskliga rättigheterna (svar)
pdf_pages = loader.load()

**Each page** is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [None]:
len(pdf_pages)

In [None]:
page = pdf_pages[0]
print(page.page_content[0:500])

In [None]:
page.metadata

### YouTube

In [None]:
from langchain_community.document_loaders import YoutubeLoader

#url="https://www.youtube.com/watch?v=XC7BeLRm7ak"
url="https://www.youtube.com/watch?v=tflYCulLYiI"
loader = YoutubeLoader.from_youtube_url(
    url, language="sv", add_video_info=False
)
yt_docs = loader.load()
assert len(yt_docs) == 1 # Only one document will be created when using YoutubeLoader

In [None]:
yt_docs[0].page_content[0:500]

### Web Page

There are a number of different ways of loading data from the web, but the easiest is by using the `WebBaseLoader` class, which uses the parser BeautifulSoup under the hood.

In [None]:
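# Import from langchain_community (the old langchain.document_loaders path is deprecated)
from langchain_community.document_loaders import WebBaseLoader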

page_url = "https://world.hey.com/dhh/open-source-royalty-and-mad-kings-a8f79d16"
loader = WebBaseLoader(page_url)
# loader = WebBaseLoader(page_url, header_template={
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
# })

In [None]:
web_docs = loader.load()

In [None]:
print(web_docs[0].page_content[:500])
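
To give a feel for what "loading a web page" involves, here is a minimal sketch of reducing HTML to plain text. It deliberately uses the standard library's `html.parser` as a stand-in, so it is not how `WebBaseLoader` or BeautifulSoup work internally; the `TextExtractor` class, the `html_to_text` helper and the sample HTML are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> contents (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical sample page
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>World <b>again</b></p></body></html>"
)
print(html_to_text(sample_html))
```

Real pages contain navigation, ads and other noise, which is why the loaded text is usually worth inspecting (as in the cell above) before it is split and embedded.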

## Splitting

Splitting may seem simple, but it can be a complex process that requires thought, planning, and a lot of fine-tuning and iteration.

![Splitting](https://python.langchain.com/assets/images/text_splitters-7961ccc13e05e2fd7f7f58048e082f47.png)

### Basic splitting

The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit.

Key benefits of length-based splitting:

- Straightforward implementation
- Consistent chunk sizes
- Easily adaptable to different model requirements

The most common splitter for splitting text on length is `RecursiveCharacterTextSplitter`.

In [None]:
# langchain_text_splitters is installed as a dependency of langchain
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=25
)
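
To make `chunk_size` and `chunk_overlap` concrete, the following is a minimal sketch of plain length-based splitting with a sliding window. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter first tries to break on natural separators such as paragraphs and lines); the helper name `split_by_length` is hypothetical.

```python
def split_by_length(text, chunk_size=500, chunk_overlap=25):
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so a sentence cut
    at a boundary is still partly visible in the next chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(split_by_length("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap at work. Larger overlaps preserve more context across boundaries at the cost of more (and more redundant) chunks to embed.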

#### Let's split the loaded PDF pages (above)

In [None]:
splits = text_splitter.split_documents(pdf_pages)

In [None]:
print(f"Document splits: {len(splits)}")
print(f"Loaded pages: {len(pdf_pages)}")

In [None]:
splits.extend(text_splitter.split_documents(web_docs))

## Embeddings

Before embedding our splits, let's see how embeddings capture meaning by embedding three short example sentences.

In [None]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [None]:
embedding1 = embedding_model.embed_query(sentence1)
embedding2 = embedding_model.embed_query(sentence2)
embedding3 = embedding_model.embed_query(sentence3)

print(embedding1[:10])
#print(len(embedding1))

In [None]:
import numpy as np

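Embeddings 1 and 2 should be similar. We use NumPy's dot product as the similarity measure; OpenAI embedding vectors are normalized to unit length, so the dot product here equals the cosine similarity.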

In [None]:
np.dot(embedding1, embedding2)

Embedding 3, however, should be less similar to both of the others:

In [None]:
np.dot(embedding1, embedding3)

In [None]:
np.dot(embedding2, embedding3)
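
The dot product is only a fair similarity measure when the vectors have the same length (norm). As a sketch of the more general measure, here is cosine similarity on toy 2-dimensional vectors using only the standard library; the vectors and helper names are made up for illustration and are not real embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both norms: 1 = same direction, 0 = unrelated."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


v1 = [3.0, 4.0]
v2 = [6.0, 8.0]    # same direction as v1, twice as long
v3 = [-4.0, 3.0]   # perpendicular to v1

print(dot(v1, v2))                # 50.0 (grows with vector length)
print(cosine_similarity(v1, v2))  # 1.0  (length is factored out)
print(cosine_similarity(v1, v3))  # 0.0
```

For unit-length vectors the two measures coincide, which is why a plain `np.dot` is enough in the cells above.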

## Vectorstores

In [None]:
from langchain_chroma import Chroma

In [None]:
# Optional persist_directory to save the database
persist_directory = './db/2-langchain-retrieval/'

# Remove the directory and all files in it recursively if it exists
import shutil
import os
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)

#### Set up the vector database - we'll use the simple Chroma database here

In [None]:
vectordb = Chroma(
    collection_name="2-langchain-retrieval",
    embedding_function=embedding_model,
    #persist_directory=persist_directory # Optionally persist the database
)

vectordb.add_documents(documents=splits)

#### Let's do some similarity search

In [None]:
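# English: "What does AI mean in practice for the future labor market and skills needs?"
# The question is asked in Swedish because the indexed PDF is in Swedish.
question = "Vad betyder AI i praktiken för framtidens arbetsmarknad och kompetensbehov"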

def print_docs(docs):
    for i, doc in enumerate(docs):
        print(f"Doc {i}:\n{doc.page_content[:200].strip()}...\n---")

In [None]:
docs = vectordb.similarity_search(question,k=3)
# Print the top 3 results
print_docs(docs)

In [None]:
docs = vectordb.similarity_search("Who is David Heinemeier Hansson?",k=3)
# Print the top 3 results
print_docs(docs)
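
Conceptually, a similarity search embeds the query and returns the `k` stored items whose vectors are closest to it. The sketch below shows that idea as a brute-force scan over a tiny in-memory index. It is an illustration under simplifying assumptions, not how Chroma is implemented (real vector databases typically use approximate nearest-neighbour indexes); the toy 3-dimensional vectors and the `similarity_search` helper are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)


def similarity_search(query_vector, index, k=3):
    """Score every (text, vector) pair against the query and keep the top k."""
    scored = [(cosine(query_vector, vector), text) for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: hand-written vectors standing in for real embeddings
index = [
    ("i like dogs", [1.0, 0.0, 0.0]),
    ("i like canines", [0.9, 0.1, 0.0]),
    ("the weather is ugly outside", [0.0, 0.0, 1.0]),
]

results = similarity_search([1.0, 0.0, 0.0], index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two dog-related sentences are returned and the weather sentence is left out, mirroring what the embedding comparison earlier in the notebook showed.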

### Retriever

[Retrievers](https://python.langchain.com/docs/concepts/retrievers/) are responsible for taking a query and returning relevant documents. Many types of retrieval systems exist, including vectorstores, graph databases, and relational databases. LangChain provides a uniform interface for interacting with them. The **`Retriever`** interface also implements the **`Runnable`** interface, so a retriever can be used as part of a chain.

When creating a Retriever, it's possible to specify configuration related to the retrieval operation, such as:
* **`search_type`** - the type of search to perform, for instance, "similarity" (the default), "mmr" or "similarity_score_threshold"
* **`search_kwargs`** - dictionary containing additional keyword arguments to pass to the search function
    * **`k`** - the number of documents to retrieve
    * **`score_threshold`** - the minimum similarity score required for a document to be considered relevant (used with the "similarity_score_threshold" search type)
    * **`filter`** - filter by document metadata (format may be specific to the retrieval system)
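
To illustrate what the `k`, `score_threshold` and `filter` options above do, here is a minimal sketch that applies them to a list of already-scored candidates. It only models the semantics: the `retrieve` helper, the candidate texts and the scores are made up for illustration, and real vector stores apply filters inside the database with their own filter syntax.

```python
def retrieve(scored_docs, k=4, score_threshold=None, metadata_filter=None):
    """Apply a metadata filter, then a score cutoff, then keep the top k."""
    candidates = scored_docs

    if metadata_filter is not None:
        # Keep documents whose metadata matches every key/value in the filter
        candidates = [
            doc for doc in candidates
            if all(doc["metadata"].get(key) == value
                   for key, value in metadata_filter.items())
        ]

    if score_threshold is not None:
        # Drop documents that are not similar enough to count as relevant
        candidates = [doc for doc in candidates if doc["score"] >= score_threshold]

    ranked = sorted(candidates, key=lambda doc: doc["score"], reverse=True)
    return ranked[:k]


# Made-up candidates with made-up similarity scores
candidates = [
    {"text": "AI and the labor market", "score": 0.82, "metadata": {"source": "pdf"}},
    {"text": "AI in schools", "score": 0.74, "metadata": {"source": "pdf"}},
    {"text": "Open source royalty", "score": 0.31, "metadata": {"source": "web"}},
    {"text": "Skills needs", "score": 0.67, "metadata": {"source": "pdf"}},
]

top_two = retrieve(candidates, k=2)
relevant = retrieve(candidates, k=4, score_threshold=0.7)
web_only = retrieve(candidates, k=4, metadata_filter={"source": "web"})

print([doc["text"] for doc in top_two])
print([doc["text"] for doc in relevant])
print([doc["text"] for doc in web_only])
```

Note the difference between `k` and `score_threshold`: `k` always caps the number of results, while the threshold can return fewer than `k` documents (or none) when nothing is similar enough.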

In [None]:
# Set up a retriever that returns the 3 most similar documents
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Invoke/query the retriever
documents = retriever.invoke(question)

In [None]:
print_docs(documents)