# 2. LangChain **RAG**

<a target="_blank" href="https://colab.research.google.com/github/IT-HUSET/ai-workshop-250121/blob/main/lab/2-langchain-retrieval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a><br/>

Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow. 

![RAG - indexing](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png)

## Setup

### Install dependencies

In [1]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install chromadb~=0.5.18 sentence-transformers~=3.3 lark~=1.2 --upgrade --quiet
%pip install langchain~=0.3.10 langchain_openai~=0.2.11 langchain_community~=0.3.10 langchain-chroma~=0.1.4 --upgrade --quiet
%pip install youtube-transcript-api~=0.6.3 --upgrade --quiet



# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Load environment variables

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")
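
Whichever way the variables are loaded, it can help to confirm they are present before creating the models. The following is a minimal sketch, assuming the two variable names from the Colab snippet above are the ones your setup needs; `missing_env_vars` is a hypothetical helper, not part of `python-dotenv` or LangChain.

```python
import os

# Names taken from the Colab snippet above; adjust if your deployment uses others.
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```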

### Set up the chat and embedding models

In [3]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
api_version = "2024-10-01-preview"
llm = AzureChatOpenAI(deployment_name="gpt-4o-mini", temperature=0.0, openai_api_version=api_version)
embedding_model = AzureOpenAIEmbeddings(model="text-embedding-3-large", openai_api_version=api_version)

### Set up the path to the data

In [4]:
data_path = "../data"

## Document Loading

### PDFs

PDFs can be loaded in several ways; the simplest is the `PyPDFLoader` class, which accepts either a local file path or a URL.

In [5]:
from langchain_community.document_loaders import PyPDFLoader
#loader = PyPDFLoader("some_local_file.pdf")
loader = PyPDFLoader("https://data.riksdagen.se/fil/CDA05163-DE71-448D-807D-747C997E8F3A") # AI:s betydelse för framtidens arbetsmarknad och skola
#loader = PyPDFLoader("https://data.riksdagen.se/fil/61B7540B-EEDD-4922-B61B-FC0A9F3AE4E2") # 2024/25:263 AI, annan ny teknik och de mänskliga rättigheterna
#loader = PyPDFLoader("https://data.riksdagen.se/fil/0D43150B-5B31-43A4-89CD-4FE0478EC6C7") # 2024/25:263 AI, annan ny teknik och de mänskliga rättigheterna (svar)
pdf_pages = loader.load()

**Each page** is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
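
To make that shape concrete, here is a small sketch using a plain dataclass. `ToyDocument` is a hypothetical stand-in that only mirrors the two fields named above; it is not LangChain's `Document` class, and the metadata values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in mirroring the two fields described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Illustrative values only; real loaders decide which metadata keys to set.
doc = ToyDocument(
    page_content="Motion till riksdagen ...",
    metadata={"source": "example.pdf", "page": 0},
)

print(doc.page_content[:10])
print(doc.metadata["page"])
```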

In [6]:
len(pdf_pages)

3

In [7]:
page = pdf_pages[0]
print(page.page_content[0:500])

 
Enskild motion C  
Motion till riksdagen  
2024/25:2055 
av Niels Paarup-Petersen (C) 
AI:s betydelse för framtidens 
arbetsmarknad och skola 
 
 
 
 
Förslag till riksdagsbeslut 
1. Riksdagen ställer sig bakom det som anförs i motionen om att tillsätta en utredning 
med uppdrag att kartlägga behov och förutsättningar för framtidens utbildning och 
ett lärande arbetsliv och tillkännager detta för regeringen. 
2. Riksdagen ställer sig bakom det som anförs i motionen om att utforma underlag och 


In [None]:
page.metadata

### YouTube

In [8]:
from langchain_community.document_loaders import YoutubeLoader

#url="https://www.youtube.com/watch?v=XC7BeLRm7ak"
url="https://www.youtube.com/watch?v=tflYCulLYiI"
loader = YoutubeLoader.from_youtube_url(
    url, language="sv", add_video_info=False
)
yt_docs = loader.load()
assert len(yt_docs) == 1 # Only one document will be created when using YoutubeLoader

AssertionError: 

In [9]:
yt_docs[0].page_content[0:500]

IndexError: list index out of range

Both errors have the same cause: in this run `loader.load()` returned an empty list, so there was no transcript to inspect. This can happen when no Swedish (`language="sv"`) transcript can be fetched for the video.

### Web Page

There are several ways to load data from the web; the simplest is the `WebBaseLoader` class, which uses the BeautifulSoup parser under the hood.

In [10]:
from langchain_community.document_loaders import WebBaseLoader

page_url = "https://world.hey.com/dhh/open-source-royalty-and-mad-kings-a8f79d16"
loader = WebBaseLoader(page_url)
# loader = WebBaseLoader(page_url, header_template={
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
# })

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [11]:
web_docs = loader.load()

In [12]:
print(web_docs[0].page_content[:500])





Open source royalty and mad kings
















    David Heinemeier Hansson
  



  October 13, 2024


Open source royalty and mad kings




I'm solidly in favor of the Benevolent Dictator For Life (BDFL) model of open source stewardship. This is how projects from Linux to Python, from Laravel to Ruby, and yes, Rails, have kept their cohesion, decisiveness, and forward motion. It's a model with decades worth of achievements to its name. But it's not a mandate from heaven. It's not infalli
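
The extracted text above contains long runs of blank lines left over from the page layout. A minimal cleanup sketch follows, assuming you want to tidy the text before splitting it; `collapse_whitespace` is a hypothetical helper and the sample string is a shortened imitation of the output above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Strip surrounding whitespace from each line and squeeze runs of blank lines into one."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more newlines in a row mean at least two blank lines; keep one.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


raw = "\n\n\nOpen source royalty and mad kings\n\n\n\n    David Heinemeier Hansson\n  \n\n\n  October 13, 2024\n"
print(collapse_whitespace(raw))
```

The same function could be applied to `web_docs[0].page_content` so that chunks are not padded with empty lines.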


## Splitting

Splitting may seem simple, but it can be a complex process that requires thought, planning, and a lot of fine-tuning and iteration.

![Splitting](https://python.langchain.com/assets/images/text_splitters-7961ccc13e05e2fd7f7f58048e082f47.png)

### Basic splitting

The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit.

Key benefits of length-based splitting:

- Straightforward implementation
- Consistent chunk sizes
- Easily adaptable to different model requirements

The most common splitter for splitting text on length is `RecursiveCharacterTextSplitter`.

In [13]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=25
)
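
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size splitting with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries a list of separators (paragraph breaks, line breaks, spaces) so that chunks end at natural boundaries. `split_with_overlap` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into chunks of at most chunk_size characters, where each chunk
    starts chunk_overlap characters before the previous one ended."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is the role the overlap plays: text that straddles a chunk boundary still appears intact in at least one chunk.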

#### Let's split the loaded PDF pages (above)

In [14]:
splits = text_splitter.split_documents(pdf_pages)

In [15]:
print(f"Document splits: {len(splits)}")
print(f"Loaded pages: {len(pdf_pages)}")

Document splits: 11
Loaded pages: 3


In [17]:
splits.extend(text_splitter.split_documents(web_docs))

## Embeddings

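Embeddings map text to vectors so that texts with similar meaning end up close together. Before embedding our splits, let's try this on three short sentences.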

In [18]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [19]:
embedding1 = embedding_model.embed_query(sentence1)
embedding2 = embedding_model.embed_query(sentence2)
embedding3 = embedding_model.embed_query(sentence3)

print(embedding1[:10])
#print(len(embedding1))

[-0.02032002992928028, -0.014629864133894444, -0.007168493699282408, -0.005369397345930338, 0.02253752201795578, 0.012119496241211891, -0.012935365550220013, -0.003099606605246663, -0.0017677171854302287, 0.04041691869497299]


In [20]:
import numpy as np

Embeddings 1 and 2 should be similar. We use NumPy's dot product as the similarity score:

In [21]:
np.dot(embedding1, embedding2)

0.8321187122592171

Embedding 3 should differ more from both:

In [22]:
np.dot(embedding1, embedding3)

0.15658984714299562

In [23]:
np.dot(embedding2, embedding3)

0.11712613252353708
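
The plain dot product works as a similarity score here because OpenAI documents its embeddings as normalized to length 1, in which case the dot product equals the cosine similarity. For vectors that are not normalized, the dot product also grows with vector length, so cosine similarity is the safer choice. A small sketch with toy 2-D vectors (not real embeddings) shows the difference:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-D vectors (not real embeddings): v and w point the same way, w is twice as long.
v = [3.0, 4.0]
w = [6.0, 8.0]
u = [4.0, -3.0]  # perpendicular to v

print(dot(v, w))                # 50.0: grows with vector length
print(cosine_similarity(v, w))  # 1.0: same direction
print(cosine_similarity(v, u))  # 0.0: perpendicular
```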

## Vectorstores

In [24]:
from langchain_chroma import Chroma

In [25]:
# Optional persist_directory to save the database
persist_directory = './db/2-langchain-retrieval/'

# Remove the directory and all files in it recursively if it exists
import shutil
import os
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)

#### Set up the vector database - we'll use the simple Chroma database here

In [26]:
vectordb = Chroma(
    collection_name="2-langchain-retrieval",
    embedding_function=embedding_model,
    #persist_directory=persist_directory # Optionally persist the database
)

vectordb.add_documents(documents=splits)

['941c7de5-e8a7-46c7-8429-93c41a6d9814',
 '3b59fcd2-4df0-4fb5-800e-0d3cc98a1423',
 'cdbe7d56-cbcb-4a8a-a136-8c5a1ef6f6cb',
 '2e1b624b-259f-498e-b567-1c1bcc37ea8b',
 '57a54eee-049a-4818-b369-9b31b0c165ee',
 '2afcea0e-b2ba-432b-bbd5-cf05d6881733',
 'c61e3d5b-218e-4644-a6d1-1741f714396b',
 '3f42f947-acb8-4a46-aae6-20294b4bff44',
 '48ce59be-2c5d-4853-999b-fd0d33de5c95',
 '25a74018-8747-4253-8dd9-a0259508442e',
 'd5468629-afdc-49ec-93e4-68dfcf0c2bcb',
 '2c5c3dc5-ff4c-4d61-a9b4-3a5da7786ee8',
 'baf87f21-f406-4a2d-9303-689a97d6e585',
 '4da5e8b8-db16-4c02-b053-59196c74abc8',
 '82ab4740-874f-4305-8b21-8606ddac7473',
 '8a732880-cbc5-4d30-90b4-8beca695934d',
 '73eebe7b-6356-4071-b0d5-cb587baeaf21',
 '9ae7f757-1a00-4a78-aafd-4d6475409204',
 '3adb883c-0202-4986-9c06-81820359de74',
 'cc197b6a-f220-44c2-b25d-8aff65aa67df',
 '1750588a-2632-44f5-ab7f-ae554c50313c',
 '66e54355-d0e9-457e-aaeb-2026662d208a',
 'aae91f17-aa95-4222-a4b9-9b69eeb6961c',
 'b9458957-b85d-492e-8378-a11d4bed0c1b',
 'f1b89203-0149-

#### Let's do some similarity search

In [28]:
question = "Vad betyder AI i praktiken för framtidens arbetsmarknad och kompetensbehov"

def print_docs(docs):
    for i, doc in enumerate(docs):
        print(f"Doc {i}:\n{doc.page_content[:200].strip()}...\n---")

In [29]:
docs = vectordb.similarity_search(question,k=3)
# Print the top 3 results
print_docs(docs)

Doc 0:
2 
Motivering 
Vad betyder AI i praktiken för framtidens arbetsmarknad och kompetensbehov? Det är 
en fråga Joakim Wernberg och hans medförfattare bland annat försökt svara på. Nedan 
motion är hämtad...
---
Doc 1:
vi se till att vara förberedda med en gedigen informationsbank. Framtidens arbete med 
AI kommer mest troligen att sätta krav på ett kontinuerligt lärandebehov och en 
flexibilitet. Det handlar inte e...
---
Doc 2:
ekonomin. Det kan därför bli relevant att skilja på de kompetensbehov som förändras 
snabbt från de som förändras långsammare.  
Vilka delar av framtidens kompetensbehov kan tillgodoses med nuvarande...
---


In [30]:
docs = vectordb.similarity_search("Who is David Heinemeier Hansson?",k=3)
# Print the top 3 results
print_docs(docs)

Doc 0:
About David Heinemeier Hansson
      


Made Basecamp and HEY for the underdogs as co-owner and CTO of 37signals. Created Ruby on Rails. Wrote REWORK, It Doesn't Have to Be Crazy at Work, and REMOTE....
---
Doc 1:
About David Heinemeier Hansson
      


Made Basecamp and HEY for the underdogs as co-owner and CTO of 37signals. Created Ruby on Rails. Wrote REWORK, It Doesn't Have to Be Crazy at Work, and REMOTE....
---
Doc 2:
Open source royalty and mad kings
















    David Heinemeier Hansson
  



  October 13, 2024


Open source royalty and mad kings...
---
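
Conceptually, a similarity search embeds the query, scores it against the stored vectors, and returns the `k` best matches. The sketch below does this by brute force with hypothetical 2-D vectors; a real vector store such as Chroma uses an index and its own distance metric instead, so treat `top_k` as an illustration rather than Chroma's implementation.

```python
def top_k(query_vec, items, k=3):
    """Return the k (text, score) pairs with the highest dot-product score."""
    scored = [
        (text, sum(q * x for q, x in zip(query_vec, vec)))
        for text, vec in items
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-D "embeddings"; real ones come from the embedding model.
index = [
    ("dogs", [1.0, 0.0]),
    ("canines", [0.9, 0.1]),
    ("weather", [0.0, 1.0]),
]

for text, score in top_k([1.0, 0.0], index, k=2):
    print(text, score)
```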


### Retriever

[Retrievers](https://python.langchain.com/docs/concepts/retrievers/) are responsible for taking a query and returning relevant documents. Many types of retrieval systems exist, including vectorstores, graph databases, and relational databases. LangChain provides a uniform interface for interacting with them. The **`Retriever`** interface also implements the **`Runnable`** interface, so a retriever can be used as part of a chain.

When creating a Retriever, it's possible to specify configuration related to the retrieval operation, such as:
* **`search_type`** - the type of search to perform; for a vector store retriever this is typically "similarity" (the default), "mmr", or "similarity_score_threshold"
* **`search_kwargs`** - dictionary containing additional keyword arguments to pass to the search function
    * **`k`** - the number of documents to retrieve
    * **`score_threshold`** - the minimum similarity score required for a document to be considered relevant (used with the "similarity_score_threshold" search type)
    * **`filter`** - filter by document metadata (format may be specific to the retrieval system)
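
How these options interact can be sketched in plain Python: filter on metadata, drop results below the threshold, then keep the best `k`. `toy_retrieve` and the pre-scored documents are hypothetical and only mirror the option names from the list above; the actual order of operations and score scale depend on the retrieval system.

```python
def toy_retrieve(scored_docs, k=3, score_threshold=None, metadata_filter=None):
    """Apply a metadata filter, then a score threshold, then keep the top k."""
    results = []
    for doc in scored_docs:
        if metadata_filter and any(
            doc["metadata"].get(key) != value for key, value in metadata_filter.items()
        ):
            continue  # metadata does not match the filter
        if score_threshold is not None and doc["score"] < score_threshold:
            continue  # not similar enough
        results.append(doc)
    results.sort(key=lambda d: d["score"], reverse=True)
    return results[:k]


# Hypothetical pre-scored documents (score: higher means more similar).
scored_docs = [
    {"text": "motion, page 1", "score": 0.82, "metadata": {"source": "pdf"}},
    {"text": "blog intro", "score": 0.75, "metadata": {"source": "web"}},
    {"text": "motion, page 2", "score": 0.40, "metadata": {"source": "pdf"}},
]

hits = toy_retrieve(scored_docs, k=2, score_threshold=0.5, metadata_filter={"source": "pdf"})
print([d["text"] for d in hits])
```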

In [31]:
# Setup a retriever
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Invoke/query the retriever
documents = retriever.invoke(question)

In [32]:
print_docs(documents)

Doc 0:
2 
Motivering 
Vad betyder AI i praktiken för framtidens arbetsmarknad och kompetensbehov? Det är 
en fråga Joakim Wernberg och hans medförfattare bland annat försökt svara på. Nedan 
motion är hämtad...
---
Doc 1:
vi se till att vara förberedda med en gedigen informationsbank. Framtidens arbete med 
AI kommer mest troligen att sätta krav på ett kontinuerligt lärandebehov och en 
flexibilitet. Det handlar inte e...
---
Doc 2:
ekonomin. Det kan därför bli relevant att skilja på de kompetensbehov som förändras 
snabbt från de som förändras långsammare.  
Vilka delar av framtidens kompetensbehov kan tillgodoses med nuvarande...
---
