<a href="https://colab.research.google.com/github/jesusvillota/CSS_DataScience_2025/blob/main/Session3/3_2_RAG_IV_Full_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="max-width: 880px; margin: 20px auto 22px; padding: 0px; border-radius: 18px; border: 1px solid #e5e7eb; background: linear-gradient(180deg, #ffffff 0%, #f9fafb 100%); box-shadow: 0 8px 26px rgba(0,0,0,0.06); overflow: hidden;">

  <!-- Banner Header -->
  <div style="padding: 34px 32px 14px; text-align: center; line-height: 1.38;">
    <div style="font-size: 13px; letter-spacing: 0.14em; text-transform: uppercase; color: #6b7280; font-weight: bold; margin-bottom: 5px;">
      Session #2
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      RAG with LangChain
    </div>
    <div style="font-size: 29px; font-weight: 800; color: #14276c; margin-bottom: 4px;">
      Part V: Chat with your data
    </div>
    <div style="font-size: 16.5px; color: #374151; font-style: italic; margin-bottom: 0;">
      Data Science for Economics: Mastering Unstructured Data
    </div>
  </div>

  <!-- Logo Section -->
  <div style="background: none; text-align: center; margin: 30px 0 10px;">
    <img src="https://www.cemfi.es/images/Logo-Azul.png" alt="CEMFI Logo" style="width: 158px; filter: drop-shadow(0 2px 12px rgba(56,84,156,0.05)); margin-bottom: 0;">
  </div>

  <!-- Name -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1.22em; font-weight: bold; margin-bottom: 0px;">
    Jesus Villota Miranda © 2025
  </div>

  <!-- Contact info -->
  <div style="font-family: 'Times New Roman', Times, serif; color: #38549c; text-align: center; font-size: 1em; margin-top: 7px; margin-bottom: 20px;">
    <a href="mailto:jesus.villota@cemfi.edu.es" style="color: #38549c; text-decoration: none; margin-right:8px;" title="Email">
      <!-- Email logo -->
      <!-- <img src="https://cdn-icons-png.flaticon.com/512/11679/11679732.png" alt="Email" style="width:18px; vertical-align:middle; margin-right:5px;"> -->
      jesus.villota@cemfi.edu.es
    </a>
    <span style="color:#9fa7bd;">|</span>
    <a href="https://www.linkedin.com/in/jesusvillotamiranda/" target="_blank" style="color: #38549c; text-decoration: none; margin-left:7px;" title="LinkedIn">
      <!-- LinkedIn logo -->
      <!-- <img src="https://1.bp.blogspot.com/-onvhHUdW1Us/YI52e9j4eKI/AAAAAAAAE4c/6s9wzOpIDYcAo4YmTX1Qg51OlwMFmilFACLcBGAsYHQ/s1600/Logo%2BLinkedin.png" alt="LinkedIn" style="width:17px; vertical-align:middle; margin-right:5px;"> -->
      LinkedIn
    </a>
  </div>
</div>


**IMPORTANT**: **Are you running this notebook in Google Colab?**

- If so, please make sure that in the cell below `running_in_colab` is set to `True`

- And, of course, make sure to **run the cell**!

In [1]:
running_in_colab = False

In [2]:
if running_in_colab: 
    ! pip install langchain langchain_community langchain_huggingface chromadb tiktoken openai pypdf
    from google.colab import drive
    drive.mount('/content/drive')
    folder_dir = '/content/drive/My Drive/docs/'
else: 
    folder_dir = 'docs/'

## Overview

Recall the overall workflow for retrieval augmented generation (RAG):

![](images/rag_pipeline.png)

The final step in RAG is to merge the retrieved documents and the original query to produce a final answer. 
This process is intermediated by an LLM, which sees both your prompt and the retrieved documents as context and then generates an informed response.

<p align="center">
    <img src="images/RAG.png" alt="RAG Final Step" width="320"/>
</p>
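To make that final step concrete, here is a minimal sketch of how retrieved chunks and the user question can be merged into one prompt. The function name `build_prompt`, the prompt wording, and the two sample chunks are illustrative assumptions, not LangChain's own implementation.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for what a retriever would return
retrieved_chunks = [
    "I also assume familiarity with basic probability and statistics.",
    "Lastly, I also assume familiarity with basic linear algebra.",
]

prompt = build_prompt("Is probability a class topic?", retrieved_chunks)
print(prompt)
```

The resulting string is what the LLM receives: it sees the question together with the retrieved text and answers from that context.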

In this notebook we will put all the steps together and complete the full RAG pipeline.

# **RAG Pipeline**

1) **Document Loading**

- Let's load some example PDFs. 
- For this illustration, we will use the transcripts of the first three lectures of Stanford's CS229 Machine Learning course: https://see.stanford.edu/Course/CS229
- Make sure to download the PDFs to your Drive so that the documents can be loaded (I uploaded them to `Session3/docs`).

In [3]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    PyPDFLoader(folder_dir + "MachineLearning-Lecture01.pdf"),
    PyPDFLoader(folder_dir + "MachineLearning-Lecture02.pdf"),
    PyPDFLoader(folder_dir + "MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

2) **Splitting**

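We use the `RecursiveCharacterTextSplitter`, with chunks of at most 1,500 characters and an overlap of 150 characters between consecutive chunks.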

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum number of characters per chunk
    chunk_overlap=150,  # characters shared by consecutive chunks
)

splits = text_splitter.split_documents(docs)
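To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to split on separators such as paragraph and line breaks, so its chunk boundaries will generally differ. The helper name `sliding_window_chunks` and the toy text are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one. The overlap plays the same role in the real splitter: a sentence cut at a chunk boundary still appears whole in at least one chunk.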

3) **Embeddings**

- Here you can choose between a free open-source model from HuggingFace and the more sophisticated OpenAI embeddings.
- Note that, to use the OpenAI embeddings, you need an `OPENAI_API_KEY` and credit in your OpenAI account.

In [9]:
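import os

if running_in_colab:
    try:
        # Colab exposes notebook secrets through the `userdata` module
        from google.colab import userdata
        api_key = userdata.get("OPENAI_API_KEY")
    except Exception as e:
        print("Could not retrieve OPENAI_API_KEY from Colab secrets:", e)
        api_key = None
else:
    from dotenv import load_dotenv
    load_dotenv()  # loads OPENAI_API_KEY from a local .env file, if present
    api_key = os.getenv("OPENAI_API_KEY")

# The OpenAI clients used below read the key from the environment
if api_key:
    os.environ["OPENAI_API_KEY"] = api_key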

In [10]:
use_free_embeddings = False

if use_free_embeddings:
    # Free, open-source sentence-embedding model that runs locally
    from langchain_huggingface import HuggingFaceEmbeddings
    embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
else:
    # You need an OPENAI_API_KEY and credit in your OpenAI account
    from langchain_community.embeddings import OpenAIEmbeddings
    embedding = OpenAIEmbeddings()

4) **Vectorstore**
- As we saw in the previous notebook, we can store our embeddings in a vectorstore 
- We use Chroma, but there are other alternatives you can explore

In [11]:
import os

from langchain_community.vectorstores import Chroma

persist_directory = 'chroma_vectordb/'
os.makedirs(persist_directory, exist_ok=True)

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

151


5) **Define the user question**

In [12]:
question = "Is probability a class topic?"

6) **Context retrieval**

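- This is the kind of context the chain retrieves when we pass `retriever=vectordb.as_retriever()` to it. Here we run the similarity search by hand with `k=5`; unless it is configured through `search_kwargs`, the retriever uses its own default number of results, so the chain may see fewer chunks than the five printed here.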

In [13]:
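# Inspect the top-5 chunks by hand (kept separate from `docs`, the loaded pages)
retrieved_docs = vectordb.similarity_search(question, k=5)

for i, doc in enumerate(retrieved_docs):
    print("\n" + "="*40 + f"[ 📄 Relevant Chunk {i+1} ]" + "="*40)
    print(doc.page_content)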


of this class will not be very programming intensive, although we will do some 
programming, mostly in either MATLAB or Octave. I'll say a bit more about that later.  
I also assume familiarity with basic probability and statistics. So most undergraduate 
statistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna 
assume all of you know what random variables are, that all of you know what expectation 
is, what a variance or a random variable is. And in case of some of you, it's been a while 
since you've seen some of this material. At some of the discussion sections, we'll actually 
go over some of the prerequisites, sort of as a refresher course under prerequisite class. 
I'll say a bit more about that later as well.  
Lastly, I also assume familiarity with basic linear algebra. And again, most undergraduate 
linear algebra courses are more than enough. So if you've taken courses like Math 51, 
103, Math 113 or CS205 at Stanford, that would be more t
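Under the hood, a similarity search compares the embedding of the question with the embedding of every stored chunk and returns the closest ones. The sketch below illustrates the idea with cosine similarity on hand-made three-dimensional vectors. The vectors, the chunk names, and the helper names are all hypothetical, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "embeddings": real ones have hundreds of dimensions or more
toy_chunk_vectors = {
    "prerequisites": [0.9, 0.1, 0.0],
    "logistics": [0.1, 0.9, 0.1],
    "linear_regression": [0.6, 0.2, 0.7],
}
toy_query_vector = [1.0, 0.0, 0.1]

ranked = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(ranked)  # ['prerequisites', 'linear_regression']
```

A vector store does the same ranking, but uses an index so that it does not have to score every chunk one by one.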

7) **Set up the LLM and the prompt template**

In [14]:
from langchain.chat_models import ChatOpenAI
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, temperature=0)



In [15]:
from langchain.prompts import PromptTemplate
# Build the prompt: {context} receives the retrieved chunks, {question} the user query
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum. Keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

Context: {context}

Question: {question}

Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


8) **Build the RetrievalQA chain**

In [16]:
from langchain.chains import RetrievalQA
# Build the chain: it retrieves chunks, fills the prompt, and calls the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

9) **Run the chain**

In [17]:
# `invoke` replaces calling the chain directly, which is deprecated
result = qa_chain.invoke({"query": question})


10) **Inspect the result**

In [18]:
result

{'query': 'Is probability a class topic?',
 'result': 'Based on the context provided, probability is a topic covered in the class, as the instructor assumes familiarity with basic probability and statistics. Probability is used in the class to provide a probabilistic interpretation for linear regression and to derive the first classification algorithm. Thanks for asking!',
 'source_documents': [Document(metadata={'title': '', 'page': 4, 'page_label': '5', 'moddate': '2008-07-11T11:25:23-07:00', 'total_pages': 22, 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/MachineLearning-Lecture01.pdf', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': ''}, page_content="of this class will not be very programming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octave. I'll say a bit more about that later.  \nI also assume familiarity with basic probability and statistics. So most undergraduate \nstatistics

In [19]:
result["result"]

'Based on the context provided, probability is a topic covered in the class, as the instructor assumes familiarity with basic probability and statistics. Probability is used in the class to provide a probabilistic interpretation for linear regression and to derive the first classification algorithm. Thanks for asking!'

In [20]:
result["source_documents"][0]

Document(metadata={'title': '', 'page': 4, 'page_label': '5', 'moddate': '2008-07-11T11:25:23-07:00', 'total_pages': 22, 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/MachineLearning-Lecture01.pdf', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': ''}, page_content="of this class will not be very programming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octave. I'll say a bit more about that later.  \nI also assume familiarity with basic probability and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what random variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's been a while \nsince you've seen some of this material. At some of the discussion sections, we'll actually \ngo over some of the prerequisites, s
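Because `return_source_documents=True` attaches the retrieved chunks to the result, you can list where an answer came from. The following sketch assumes each source document exposes a `metadata` dict with the `source` and `page_label` keys seen in the output above. `ToyDocument`, `summarize_sources`, and the Lecture 3 entry are hypothetical stand-ins so that the example runs without a vector store.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(source_documents) -> list[str]:
    """List each distinct (file, page label) pair once, in retrieval order."""
    seen = []
    for doc in source_documents:
        key = (doc.metadata.get("source", "unknown"), doc.metadata.get("page_label", "?"))
        if key not in seen:
            seen.append(key)
    return [f"{source} (page {label})" for source, label in seen]


# Made-up result mimicking the structure returned by the chain
toy_result = {
    "source_documents": [
        ToyDocument("...", {"source": "docs/MachineLearning-Lecture01.pdf", "page": 4, "page_label": "5"}),
        ToyDocument("...", {"source": "docs/MachineLearning-Lecture01.pdf", "page": 4, "page_label": "5"}),
        ToyDocument("...", {"source": "docs/MachineLearning-Lecture03.pdf", "page": 0, "page_label": "1"}),
    ]
}

sources = summarize_sources(toy_result["source_documents"])
for line in sources:
    print(line)
```

With the real chain output, the same helper could be called as `summarize_sources(result["source_documents"])`, provided the documents carry those metadata keys.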