**IMPORTANT NOTE:** In order to make this notebook functional an *.env* file and *Introduction_to_Tableau.pdf* file are necessary which are not shared in the repository. The *.env* file contains the OpenAI API key which is private to the user and is not shared due to privacy reasons. *Introduction_to_Tableau.pdf* is the input file to make the retrieval part of the RAG which is not shared due to intellectual property rights of 365 Data Science.

In [None]:
!pip install langchain==0.3.3

!pip install langchain-chroma==0.1.4
!pip install langchain-community==0.3.2
!pip install langchain-openai==0.2.2
!pip install pypdf==5.0.1
!pip install python-dotenv==1.0.1

In [2]:
from google.colab import drive
drive.mount(f"/content/drive")


import os
os.chdir("/content/drive/MyDrive/Colab Notebooks/365 Data Science LangChain course")
print("Current directory:", os.getcwd())

Mounted at /content/drive
Current directory: /content/drive/MyDrive/Colab Notebooks/365 Data Science LangChain course


# Retrieval Augmented Generation (RAG)

Large Language Models (LLMs) have revolutionized natural language processing, enabling impressive performance on a wide range of tasks. However, they are inherently limited by the data they were trained on. Once an LLM is deployed, it lacks direct access to external, up-to-date, or domain-specific knowledge, which can lead to outdated responses, hallucinations, or an inability to answer highly specialized queries. This is where **Retrieval-Augmented Generation (RAG)** comes in.

RAG enhances LLMs by integrating retrieval-based knowledge retrieval with generative capabilities. Instead of relying solely on the static knowledge embedded within its parameters, a RAG system first retrieves relevant documents or facts from an external knowledge base (e.g., a vector database, a document store, or an API) before generating a response. This allows the model to dynamically incorporate fresh, factual, and domain-specific information into its outputs, significantly improving accuracy, reliability, and adaptability.


**Why RAG is Necessary?**

While traditional LLMs can generate fluent and contextually relevant responses, they suffer from key limitations:

**Knowledge Cutoff & Staleness:** LLMs can only provide information based on the data available during their last training phase. They cannot access real-time updates, new research, or evolving knowledge.

**Hallucination Issues:** Without external validation, LLMs sometimes generate factually incorrect or misleading responses. They can sometimes be so convincing that it may be really difficult to distinguish these hallucinations from genuine facts.

**Scalability & Cost:** Training or fine-tuning a large model on new data is expensive, computationally intensive, and requires extensive labeled data.

RAG mitigates these limitations by retrieving relevant documents at query time and conditioning the response on real-time data. This makes it particularly useful for applications like question-answering, chatbots, legal and medical AI assistants, and enterprise search systems.

In the absence of RAG, we may need fine tuning the LLM to achieve similar functionality. There are some key factors that make RAG preferrable on fine tuning in many cases. Fine-tuning involves modifying an LLM's parameters by training it on new domain-specific data. However, this approach requires a full or partial (i.e. retraining only final layers) retraining process to adapt the model while preserving its general capabilities. Despite its effectiveness, fine-tuning is computationally expensive, demanding significant resources in terms of training time, hardware, and labeled data. Additionally, once a model is fine-tuned, it remains static and updating knowledge necessitates another round of training, making it costly and impractical for frequently changing domains.

In contrast, RAG avoids these issues by retrieving external information dynamically at query time. Instead of embedding all knowledge within the model's weights, it fetches relevant documents from a database or API, allowing for real-time updates without retraining. This makes RAG a more efficient, scalable, and cost-effective solution, particularly for applications requiring up-to-date information or multi-domain adaptability.


The RAG process consists of three key steps: **Indexing**, **retrieval**, and **generation**. Each step plays a crucial role in ensuring that the model can dynamically fetch useful knowledge and incorporate it into its responses.

* **1. Indexing:** Before an LLM can retrieve information, relevant documents or data sources need to be indexed. This step ensures that raw knowledge (such as PDFs, research papers, website content, or even images) is structured in a way that enables efficient retrieval. Indexing consists of the following sub-steps:

 * **Loading:**  The raw data (text files, PDFs, HTML content, images, etc.) is collected and prepared for processing.

 * **Splitting:** Since documents can be large, they are broken into smaller chunks to facilitate more efficient retrieval and matching. The chunking strategy depends on the type of content and use case.

 * **Embedding:** This is the critical transformation step. The textual data (or other types of structured data) is converted into a numerical vector format using a pre-trained embedding model. Until this step, all data remains in its original raw form, but embedding maps it into a fixed-size (or sometimes variable-length) vector representation in a high-dimensional space. These vectorized representations capture the semantic meaning of the data, making similarity-based retrieval possible.

 * **Storing:** The generated embeddings are stored in a vector database (e.g., FAISS, Pinecone, ChromaDB), where they can be efficiently searched based on relevance during retrieval. The original document content is also stored alongside the embeddings for reference.

* **2. Retrieval:** When a query is received, the system retrieves the most useful pieces of information by searching the stored embeddings in the vector database. Here, "useful" generally means relevant (directly matching the query intent) but can also mean diverse (capturing multiple perspectives or sources, preventing duplicates). Some retrieval methods prioritize purely relevance-based ranking, while others introduce diversity-promoting techniques (e.g., Maximal Marginal Relevance (MMR)) to ensure a balance between closely matching and complementary information. This improves the robustness of responses and reduces redundancy.

* **3. Generation:** Once useful documents are retrieved, they are passed to the LLM along with the user's query. The model then performs a forward pass over this combined input to generate a response. Crucially, this step does not involve training or weight updates—the LLM remains unchanged, merely conditioning its output on the provided context. By leveraging external information in real time, the model produces responses that are more factually grounded, context-aware, and resistant to hallucinations, without the need for expensive retraining.

By integrating these three steps, RAG enables LLMs to stay relevant, handle domain-specific queries, and respond with real-time knowledge—without expensive retraining. This makes it a powerful alternative to traditional fine-tuning and an essential technique for AI systems requiring continuous access to evolving data.

# Create a Q&A Chatbot with LangChain Project

### Set the OpenAI API Key as an Environment Variable

In [3]:
%load_ext dotenv
%dotenv

In [4]:
import os
import openai

# Get the API key for OpenAI
openai.api_key = os.getenv("OPENAI_API_KEY")

### Import the Libraries

In [5]:
from langchain_community.document_loaders.pdf import PyPDFLoader

from langchain_text_splitters import (MarkdownHeaderTextSplitter,
                                      TokenTextSplitter)

from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.messages import SystemMessage
from langchain_core.prompts import (PromptTemplate,
                                    HumanMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.runnables import (RunnablePassthrough,
                                      RunnableLambda,
                                      chain)

from langchain_openai import (ChatOpenAI,
                              OpenAIEmbeddings)

from langchain_chroma.vectorstores import Chroma

### Load the Course Transcript

This is the first step in the **indexing** process, where the document is loaded for further processing. It uses PyPDFLoader to read the PDF file *Introduction_to_Tableau.pdf*. It extracts all text into a list *docs_list* and then concatenates it into a single string *string_list_concat* for structured processing.

In [6]:
loader_pdf = PyPDFLoader("Introduction_to_Tableau.pdf")

In [7]:
docs_list = loader_pdf.load()

In [8]:
string_list_concat = "\n".join([doc.page_content for doc in docs_list])

### Split the Course Transcript with MarkdownHeaderTextSplitter

This step does not split the document into smaller chunks for *indexing*. Instead, it organizes the text based on section titles and course titles. It identifies headers (# for sections, ## for course titles) to preserve document structure. The actual chunking for memory efficiency and retrieval will happen in the next steps.

In [9]:
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Section Title"), ("##", "Course Title")])

In [10]:
docs_list_md_split = md_splitter.split_text(string_list_concat)

### Create a Chain to Correct the Course Transcript

This step refines the extracted transcript by fixing punctuation, formatting, and common spelling errors. Each section of the split transcript *string_list_split* is processed using an LLM (GPT-4o). A system prompt *PROMPT_FORMATTING_S* instructs the model to:
 * Split text into meaningful paragraphs.
 * Correct punctuation errors.
 * Fix common misinterpretations (e.g., "tableaux" → "Tableau").

The corrected text replaces the original content in *docs_list_md_split*, ensuring cleaner, more structured text for later retrieval.

In [11]:
string_list_split = [doc.page_content for doc in docs_list_md_split]

In [12]:
PROMPT_FORMATTING_S = '''Improve the following Tableau lecture transcript by:
- Splitting the text into meaningful paragraphs
- Correcting any misplaced punctuation
- Fixing mistranscribed words (e.g., changing 'tableaux' to 'Tableau')"
'''

PROMPT_TEMPLATE_FORMATTING_H = '''This is the transcript:
{lecture_transcript}
'''

In [18]:
prompt_formatting_s = PROMPT_FORMATTING_S
prompt_template_formatting_h = PROMPT_TEMPLATE_FORMATTING_H
chat_prompt_template_formatting = ChatPromptTemplate.from_messages([("system", prompt_formatting_s), ("human", prompt_template_formatting_h)])

In [19]:
chat = ChatOpenAI(model_name="gpt-4o", temperature=0, seed=365)

In [20]:
str_output_parser = StrOutputParser()

In [21]:
chain_formatting = (chat_prompt_template_formatting | chat | str_output_parser)

In [22]:
string_list_formatted = chain_formatting.batch([{"lecture_transcript": text} for text in string_list_split])

In [23]:
# Override the docs_list_md_split list such that the page_content parameter of each Document objects stores the updated lecture.
for i, doc in enumerate(docs_list_md_split):
    doc.page_content = string_list_formatted[i]

### Split the Lectures with TokenTextSplitter

This step splits the cleaned transcript into smaller chunks for efficient retrieval. Uses *TokenTextSplitter* with a chunk size of 500 tokens and 50-token overlap to maintain context between chunks. This is a crucial step in indexing, ensuring each chunk is manageable for embedding and retrieval.

In [24]:
token_splitter = TokenTextSplitter(encoding_name="cl100k_base", chunk_size=500, chunk_overlap=50)

In [25]:
docs_list_tokens_split = token_splitter.split_documents(docs_list_md_split)

### Create Embeddings, Vector Store, and Retriever

Here the text chunks are converted into vector embeddings using OpenAI's "text-embedding-3-small" model. The vector embeddings are stored in a ChromaDB vector store, persisting them for future queries. With these two steps, indexing part is finished.

Then, we go to the 2nd step of the RAG which is retrival. A retriever is set up with MMR (Maximal Marginal Relevance) search, which balances relevance and diversity in retrieved results.

Let's also delve into how MMR works briefly here. Given a query $q$, a set of candidate documents $D$, and already selected documents $S$, MMR selects the next document $d^*$ by:

$$
d^* = \arg\max_{d \in D \setminus S} \left[ \lambda \cdot \text{sim}(d, q) - (1 - \lambda) \cdot \max_{s \in S} \text{sim}(d, s) \right]
$$

where:
- $ \lambda $ (**0.7 in this example**) controls the trade-off:
  - **Higher $ \lambda $** → More relevance to the query.
  - **Lower $ \lambda $** → More diversity in results.
- $ \text{sim}(x, y) $ is the similarity function (e.g., cosine similarity).

In [26]:
embedding = OpenAIEmbeddings(model="text-embedding-3-small")

In [27]:
vectorstore = Chroma.from_documents(documents=docs_list_tokens_split, embedding=embedding, persist_directory="./intro-to-tableau")

In [28]:
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 2, "lambda_mult": 0.7})

### Create Prompts and Prompt Templates for the Q&A Chatbot Chain

This step defines structured prompt templates to guide the LLM when answering questions using retrieved content. Three prompts are created:
 * PROMPT_CREATING_QUESTION → Formats a question with a lecture reference, ensuring clarity in the Q&A process.
 * PROMPT_RETRIEVING_S → Instructs the LLM to answer using only the provided context from the lecture. This ensures responses are grounded in retrieved content and include section and lecture citations.
 * PROMPT_TEMPLATE_RETRIEVING_H → Defines the structure for passing the retrieved context and question to the model.

In [31]:
PROMPT_CREATING_QUESTION = '''Lecture: {question_lecture}
Title: {question_title}
Body: {question_body}'''

PROMPT_RETRIEVING_S = '''You will receive a question from a student taking a Tableau course, which includes a title and a body.
The corresponding lecture will also be provided.

Answer the question using only the provided context.

At the end of your response, include the section and lecture names where the context was drawn from, formatted as follows:
Resources:
Section: *Section Title*, Lecture: *Lecture Title*
...
Replace *Section Title* and *Lecture Title* with the appropriate titles.'''

PROMPT_TEMPLATE_RETRIEVING_H = '''This is the question:
{question}

This is the context:
{context}'''

prompt_creating_question = PromptTemplate.from_template(PROMPT_CREATING_QUESTION)
prompt_retrieving_s = PROMPT_RETRIEVING_S
prompt_template_retrieving_h = PROMPT_TEMPLATE_RETRIEVING_H

chat_prompt_template_retrieving = ChatPromptTemplate.from_messages([("system", prompt_retrieving_s), ("human", prompt_template_retrieving_h)])

### Create the First Version of the Q&A Chatbot Chain

The retrieval chain is built by linking the following steps:
 * Formatting the input question with prompt_creating_question.
 * Using a retriever to fetch the most relevant lecture sections.
 * Structuring the final input using *chat_prompt_template_retrieving*.

In [34]:
chain_retrieving = (
    prompt_creating_question
    | (lambda x: x.text)
    | (lambda q: {"question": q, "context": retriever.invoke(q)})
    | chat_prompt_template_retrieving
    )

In [35]:
result = chain_retrieving.invoke({"question_lecture": "Adding a custom calculation",
                                  "question_title": "Why are we using SUM here? It's unclear to me.",
                                  "question_body": "This question refers to calculating the GM%."})

In [36]:
result

ChatPromptValue(messages=[SystemMessage(content='You will receive a question from a student taking a Tableau course, which includes a title and a body. \nThe corresponding lecture will also be provided.\n\nAnswer the question using only the provided context.\n\nAt the end of your response, include the section and lecture names where the context was drawn from, formatted as follows: \nResources: \nSection: *Section Title*, Lecture: *Lecture Title* \n...\nReplace *Section Title* and *Lecture Title* with the appropriate titles.', additional_kwargs={}, response_metadata={}), HumanMessage(content='This is the question:\nLecture: Adding a custom calculation\nTitle: Why are we using SUM here? It\'s unclear to me.\nBody: This question refers to calculating the GM%.\n\nThis is the context:\n[Document(metadata={\'Course Title\': \'Adding a custom calculation\', \'Section Title\': \'Tableau Functionalities\'}, page_content="Ok, excellent. We\'re doing good. We\'ve seen quite a few interesting Tab

### Create a Runnable Function to Format the Context

A new function *format_context* is introduced to format retrieved content for better readability and structured output. Instead of raw text chunks, it organizes context with:
 * Document index
 * Section title
 * Lecture title
 * Content text

### How is the Chain Improved?

The retrieval process remains the same, but now the retrieved content is structured before being passed to the prompt. The updated chain *chain_retrieving_improved* includes:
 * Formatting the user question
 * Retrieving relevant lecture sections
 * Applying format_context to organize retrieved text
 * Feeding structured data into the prompt template

In [37]:
def format_context(dictionary):
    formatted_context = "\n----------------------\n".join(
        f"Document {i+1}\nSection Title: {doc.metadata.get('section_title', 'N/A')}\n"
        f"Lecture Title: {doc.metadata.get('lecture_title', 'N/A')}\n"
        f"Content: {doc.page_content}"
        for i, doc in enumerate(dictionary["context"])
    )

    return {"context": formatted_context, "question": dictionary["question"]}

In [38]:
chain_retrieving_improved = (
    prompt_creating_question
    | RunnableLambda(lambda x: x.text)
    | RunnableLambda(lambda q: {"question": q, "context": retriever.invoke(q)})
    | RunnableLambda(format_context)
    | chat_prompt_template_retrieving
)

In [39]:
result_improved = chain_retrieving_improved.invoke({"question_lecture": "Adding a custom calculation",
                                                    "question_title": "Why are we using SUM here? It's unclear to me.",
                                                    "question_body": "This question refers to calculating the GM%."})

In [40]:
result_improved

ChatPromptValue(messages=[SystemMessage(content='You will receive a question from a student taking a Tableau course, which includes a title and a body. \nThe corresponding lecture will also be provided.\n\nAnswer the question using only the provided context.\n\nAt the end of your response, include the section and lecture names where the context was drawn from, formatted as follows: \nResources: \nSection: *Section Title*, Lecture: *Lecture Title* \n...\nReplace *Section Title* and *Lecture Title* with the appropriate titles.', additional_kwargs={}, response_metadata={}), HumanMessage(content="This is the question:\nLecture: Adding a custom calculation\nTitle: Why are we using SUM here? It's unclear to me.\nBody: This question refers to calculating the GM%.\n\nThis is the context:\nDocument 1\nSection Title: N/A\nLecture Title: N/A\nContent: Ok, excellent. We're doing good. We've seen quite a few interesting Tableau tools so far, and we'll continue to do so during this lesson as well.\n

### Stream the Response

The improved retrieval chain *chain_retrieving_improved* is now streaming the response instead of returning it all at once. The for-loop prints each chunk as it is generated, allowing real-time streaming (useful for chat interfaces).

Unlike previous steps, which focused on retrieving and formatting context, this step sends the structured input to the LLM and receives a generated answer based on the retrieved information. Hence, this is the final step of RAG process.

In [41]:
result_streamed = chain_retrieving_improved.stream({
    "question_lecture": "Adding a custom calculation",
    "question_title": "Why are we using SUM here? It's unclear to me.",
    "question_body": "This question refers to calculating the GM%."
})

In [42]:
# Create a for-loop to stream the response
for chunk in result_streamed:
    print(chunk, end="", flush=True)

messages=[SystemMessage(content='You will receive a question from a student taking a Tableau course, which includes a title and a body. \nThe corresponding lecture will also be provided.\n\nAnswer the question using only the provided context.\n\nAt the end of your response, include the section and lecture names where the context was drawn from, formatted as follows: \nResources: \nSection: *Section Title*, Lecture: *Lecture Title* \n...\nReplace *Section Title* and *Lecture Title* with the appropriate titles.', additional_kwargs={}, response_metadata={}), HumanMessage(content="This is the question:\nLecture: Adding a custom calculation\nTitle: Why are we using SUM here? It's unclear to me.\nBody: This question refers to calculating the GM%.\n\nThis is the context:\nDocument 1\nSection Title: N/A\nLecture Title: N/A\nContent: Ok, excellent. We're doing good. We've seen quite a few interesting Tableau tools so far, and we'll continue to do so during this lesson as well.\n\nOur table is a