# Document Loading

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

!["rag_overview"](imgs/rag_overview.jpeg)

!["rag"](imgs/rag.png)
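
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how LangChain implements retrieval: the word-overlap scoring stands in for the embedding-based similarity search a real pipeline would typically use, and the `tokenize`, `retrieve`, and `build_prompt` helpers and the sample documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most word occurrences with the question.

    Toy scoring only: a real pipeline would typically use embeddings
    and a vector store instead of word overlap.
    """
    question_tokens = set(tokenize(question))

    def score(doc: str) -> int:
        counts = Counter(tokenize(doc))
        return sum(counts[token] for token in question_tokens)

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(documents, key=score, reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Place the retrieved documents ahead of the question in one prompt string."""
    context = "\n\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up documents standing in for an external dataset.
documents = [
    "Supervised learning trains a model on labeled examples.",
    "The cafeteria menu changes every Monday.",
    "In supervised learning, each training example has an input and a target output.",
]
question = "What is supervised learning?"
top_docs = retrieve(question, documents, k=2)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to a chat model, so the model answers from the retrieved context rather than from its training data alone.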

In [1]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI

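# langchain_google_genai reads GOOGLE_API_KEY, so reuse the key stored in GEMINI_API_KEY.
os.environ["GOOGLE_API_KEY"] = os.environ["GEMINI_API_KEY"]
llm_model = "gemini-2.0-flash-lite"  # alternative: "gemma-3-27b-it"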

llm = ChatGoogleGenerativeAI(
    model=llm_model,
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

## PDFs

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents come from automated transcription, so words and sentences are sometimes split unexpectedly.

In [7]:
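# Document loaders now live in langchain_community; the old langchain.document_loaders path is deprecated.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()  # one Document per PDF page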

In [8]:
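# Number of loaded Document objects (one per page)
print(len(pages))

# Inspect the second page: first 500 characters of text, then its metadata
page = pages[1]
print(page.page_content[:500])

print(page.metadata)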

22
many biologers are there here? Wow, just a few, not many. I'm surprised. Anyone from 
statistics? Okay, a few. So where are the rest of you from?  
Student : iCME.  
Instructor (Andrew Ng) : Say again?  
Student : iCME.  
Instructor (Andrew Ng) : iCME. Cool.  
Student : [Inaudible].  
Instructor (Andrew Ng) : Civi and what else?  
Student : [Inaudible]  
Instructor (Andrew Ng) : Synthesis, [inaudible] systems. Yeah, cool.  
Student : Chemi.  
Instructor (Andrew Ng) : Chemi. Cool.  
Student : [In
{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'docs/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 1, 'page_label': '2'}
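
Because the transcript text is broken across lines in odd places, a quick cleaned preview of each page can make the loaded documents easier to scan. The sketch below is one possible approach, not part of LangChain: `normalize_whitespace` and `preview_pages` are hypothetical helpers, and they assume only that each item has `page_content` and `metadata` attributes, as in the output above. Collapsing whitespace does not repair words that the transcription split mid-word.

```python
import re
from types import SimpleNamespace


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def preview_pages(pages, n_chars: int = 80) -> list[dict]:
    """Build one summary row per page: page number, raw length, cleaned preview."""
    rows = []
    for doc in pages:
        cleaned = normalize_whitespace(doc.page_content)
        rows.append(
            {
                "page": doc.metadata.get("page"),
                "n_chars": len(doc.page_content),
                "preview": cleaned[:n_chars],
            }
        )
    return rows


# Stand-in object so the sketch runs without a PDF; with the loader above,
# pass the `pages` list returned by `loader.load()` instead.
sample_pages = [
    SimpleNamespace(
        page_content="Student : iCME.  \nInstructor (Andrew Ng) : iCME. Cool.  ",
        metadata={"page": 1},
    ),
]
rows = preview_pages(sample_pages)
for row in rows:
    print(row)
```

With the real data, `preview_pages(pages)` would give one row per page of the lecture transcript.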


## YouTube

In [None]:
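# Loaders and parsers now live in langchain_community; the old langchain.document_loaders paths are deprecated.
from langchain_community.document_loaders.generic import GenericLoader, FileSystemBlobLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader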