# Document Loading

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

!["rag_overview"](imgs/rag_overview.jpeg)

!["rag"](imgs/rag.png)
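To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is a toy, not how LangChain implements retrieval: word overlap stands in for the embedding similarity a real system would use, and `tokenize`, `retrieve`, `build_prompt`, and the three-sentence `corpus` are all hypothetical names and data made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Return the k documents that share the most words with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Place the retrieved documents in front of the question as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical mini-corpus standing in for the loaded documents.
corpus = [
    "CS229 is a machine learning course taught at Stanford.",
    "The handbook describes titles for programmers.",
    "Whisper transcribes audio recordings into text.",
]

question = "Who teaches the machine learning course?"
prompt = build_prompt(question, corpus, k=1)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the loaders in the rest of this notebook are what fill `corpus` with real content.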

In [1]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI

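# langchain_google_genai reads GOOGLE_API_KEY; copy it from GEMINI_API_KEY (raises KeyError if unset).
os.environ["GOOGLE_API_KEY"] = os.environ["GEMINI_API_KEY"]
llm_model = "gemini-2.0-flash-lite"  # alternative: "gemma-3-27b-it"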

llm = ChatGoogleGenerativeAI(
    model=llm_model,
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

## PDFs

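Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly. The loader returns one document per page, each with the page text in `page_content` and details such as the source path and page number in `metadata`.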

In [7]:
# Loaders live in langchain_community in current LangChain releases.
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [8]:
print(len(pages))

page = pages[1]
print(page.page_content[0:500])

print(page.metadata)

22
many biologers are there here? Wow, just a few, not many. I'm surprised. Anyone from 
statistics? Okay, a few. So where are the rest of you from?  
Student : iCME.  
Instructor (Andrew Ng) : Say again?  
Student : iCME.  
Instructor (Andrew Ng) : iCME. Cool.  
Student : [Inaudible].  
Instructor (Andrew Ng) : Civi and what else?  
Student : [Inaudible]  
Instructor (Andrew Ng) : Synthesis, [inaudible] systems. Yeah, cool.  
Student : Chemi.  
Instructor (Andrew Ng) : Chemi. Cool.  
Student : [In
{'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'docs/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 1, 'page_label': '2'}
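Because the page text is hard-wrapped and padded with trailing spaces, it can help to normalize whitespace before passing it on. The helper below is a small sketch of one way to do that; `normalize_page_text` is a hypothetical name, and the sample string is copied from the output above. Note that it also flattens the speaker turns onto one line, which may or may not be what you want.

```python
import re


def normalize_page_text(text):
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# A few lines taken from the page printed above, including the hard wraps.
sample = (
    "Anyone from \n"
    "statistics? Okay, a few. So where are the rest of you from?  \n"
    "Student : iCME.  "
)

cleaned = normalize_page_text(sample)
print(cleaned)
```

In the notebook this could be applied as `normalize_page_text(page.page_content)` for each page in `pages`.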


## YouTube

In [23]:
from transformers import pipeline

# Transcribe a locally saved audio file with Whisper.
# device=0 selects the first GPU; pass device=-1 to run on CPU instead.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo", device=0)
audio_path = "docs/Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a"

# The result is a dict with the full transcript under "text";
# return_timestamps=True also asks for per-segment timestamps.
docs = pipe(audio_path, return_timestamps=True)
docs

# Alternative (not run here): fetch the audio from YouTube and transcribe it with LangChain loaders.
# from langchain_community.document_loaders.generic import GenericLoader
# from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
# from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# from langchain_community.document_loaders.parsers import OpenAIWhisperParser
# url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
# save_dir = "docs/"
# loader = GenericLoader(
#     YoutubeAudioLoader([url], save_dir),  # fetch from YouTube
#     # FileSystemBlobLoader(save_dir, glob="*.m4a"),  # or read already-downloaded audio
#     OpenAIWhisperParser(),
# )
# docs = loader.load()
# docs[0].page_content[0:500]



{'text': " Welcome to CSP 29 Machine Learning. Uh, some of you know that this class I taught at Stanford for a long time, and this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got out, built many of their products and services and startups that I'm sure many of you, I pray all of you are using, uh, uh, today. Um, so what I want to do today was spend some time talking over, uh, logistics and then, uh, spend some time, you know, giving you a beginning of an intro, talk a little bit about machine learning. So about 229, um, you know, all of you have been reading about AI in the news, uh, about machine learning in the news. Um, and you've probably heard me or others say AI is the new electricity. uh, much as the rise of electricity about 100 years ago transformed every major industry. I think AI or really, we call it machine learning but the re
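Unlike the PDF loader, the speech-recognition pipeline returns a plain dict rather than a list of documents. The sketch below shows one way to reshape such a result into `page_content`/`metadata` records like the ones the other loaders produce. It assumes the result also carries a `chunks` list of `{"timestamp": (start, end), "text": ...}` entries, which is what the pipeline is expected to return when `return_timestamps=True`; `chunks_to_records` is a hypothetical helper, and `example_result` is a hand-written stand-in, not real model output.

```python
def chunks_to_records(result, source):
    """Turn a timestamped ASR result into page_content/metadata records."""
    records = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]  # end may be None for the final chunk
        records.append(
            {
                "page_content": chunk["text"].strip(),
                "metadata": {"source": source, "start": start, "end": end},
            }
        )
    return records


# Hand-written stand-in for a pipeline result (assumed shape, not real output).
example_result = {
    "text": " Welcome to the class. Today we cover logistics.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Welcome to the class."},
        {"timestamp": (2.5, 6.0), "text": " Today we cover logistics."},
    ],
}

records = chunks_to_records(example_result, source="docs/lecture.m4a")
for record in records:
    print(record["metadata"]["start"], record["page_content"])
```

Keeping the start and end times in the metadata makes it possible to point back to the right moment in the recording when a chunk is later retrieved.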

## URLs

In [27]:
# Loaders live in langchain_community in current LangChain releases.
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")

docs = loader.load()
print(docs[0])

page_content='
handbook/titles-for-programmers.md at master · basecamp/handbook · GitHub

Skip to content

Navigation Menu
Toggle navigation

Sign in

Product

GitHub Copilot
Write better code with AI

GitHub Advanced Security
Find and fix vulnerabilities

Actions
Automate any workflow

Codespaces
Instant dev environments

Issues
Plan and track work

Code Review
Manage code changes

Discussions
Collaborate outside of code

Code Search
Find more, search less

Explore
All features
Documentation
GitHub Skills
Blog

Solutions
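Pages fetched as rendered HTML tend to come back with navigation text and long runs of blank or indented lines around the actual content, so some post-processing is usually needed. The following is a minimal sketch of the whitespace part of that cleanup; `compact_page_content` is a hypothetical helper and `raw` is a short made-up sample in the same shape as the loader output. It does not remove the navigation text itself.

```python
def compact_page_content(text):
    """Strip indentation and drop empty lines from scraped page text."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample mimicking the blank lines and indentation of scraped HTML.
raw = "\n\n\n  Skip to content\n\n\n\nNavigation Menu\n\n        Sign in\n          \n"

compact = compact_page_content(raw)
print(compact)
```

In the notebook this could be applied as `compact_page_content(docs[0].page_content)` before any further processing.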
        






## Notion

In [31]:
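# Loaders live in langchain_community in current LangChain releases.
# "docs/" is expected to contain a Notion workspace exported as Markdown.
from langchain_community.document_loaders import NotionDirectoryLoader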
loader = NotionDirectoryLoader("docs/")
docs = loader.load()

In [32]:
print(docs[0].page_content[0:200])

# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that


In [33]:
docs[0].metadata

{'source': "docs\\Blendle's Employee Handbook e367aa77e225482c849111687e114a56.md"}
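The `source` value combines the export directory, the page title, and a 32-character hexadecimal ID that Notion appends to exported file names. If a clean title is useful as extra metadata, it can be recovered from the path. This is a small sketch under that assumption about the file-name pattern; `notion_title` and `NOTION_ID_SUFFIX` are hypothetical names, and `PureWindowsPath` is used because it accepts both `\` and `/` separators regardless of the operating system running the code.

```python
import re
from pathlib import PureWindowsPath

# Trailing " <32 hex chars>" that Notion adds to exported file names (assumed pattern).
NOTION_ID_SUFFIX = re.compile(r"\s+[0-9a-f]{32}$")


def notion_title(source):
    """Return the page title from a Notion export path, without extension or ID."""
    stem = PureWindowsPath(source).stem
    return NOTION_ID_SUFFIX.sub("", stem)


source = "docs\\Blendle's Employee Handbook e367aa77e225482c849111687e114a56.md"
print(notion_title(source))  # prints: Blendle's Employee Handbook
```

A file name without the ID suffix passes through unchanged apart from losing its directory and extension.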