# LangChain - Chat With Your Data: Part 1

Tutorial on document ingestion using LangChain.

### Background: Why Retrieval-Augmented Generation (RAG)?
Large Language Models (LLMs) only know what was in their training data, so on their own they cannot answer questions about private or recent documents. Retrieval-Augmented Generation (RAG) retrieves relevant external documents (e.g., PDFs, web pages, notes) at query time and passes them to the LLM as context, which yields more accurate and context-aware responses.

### Key Concept: RAG Pipeline
1. **Document Loading** – convert various data formats into `Document` objects.
2. **Splitting** – chunk the documents into manageable parts.
3. **Embedding & Indexing** – embed chunks and store them in a vector database.
4. **Retrieval** – pull the chunks most relevant to the query.
5. **Generation** – combine query and retrieved context to generate output.
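
The five steps above can be sketched end to end in plain Python. This is a toy illustration, not how LangChain implements them: a bag-of-words counter stands in for a real embedding model, a Python list stands in for the vector database, the final step only builds a prompt instead of calling an LLM, and every function name here is made up for the example.

```python
import math
import re
from collections import Counter


def load(texts):
    """Step 1: wrap raw strings as dicts that mimic Document objects."""
    return [
        {"page_content": text, "metadata": {"source": f"doc{i}"}}
        for i, text in enumerate(texts)
    ]


def split(docs, chunk_size=8):
    """Step 2: cut each document into chunks of at most chunk_size words."""
    chunks = []
    for doc in docs:
        words = doc["page_content"].split()
        for start in range(0, len(words), chunk_size):
            piece = " ".join(words[start:start + chunk_size])
            chunks.append({"page_content": piece, "metadata": doc["metadata"]})
    return chunks


def embed(text):
    """Step 3: toy embedding, a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=1):
    """Step 4: return the k chunks whose vectors are closest to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(query, context_chunks):
    """Step 5: combine query and context; a real system would send this to an LLM."""
    context = "\n".join(chunk["page_content"] for chunk in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = load([
    "Whisper transcribes audio into text.",
    "PyPDF extracts text from PDF pages.",
])
chunks = split(docs)
index = [(embed(chunk["page_content"]), chunk) for chunk in chunks]  # the "vector store"

question = "How do I get text from a PDF?"
top = retrieve(question, index)
prompt = build_prompt(question, top)
print(prompt)
```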

This notebook focuses on **Step 1: Document Loading**.

### Environment Setup

In [1]:
import os
import sys
from dotenv import load_dotenv, find_dotenv
import openai

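# Make modules two directories up importable
sys.path.append('../..')
# Load variables from the nearest .env file into the environment
_ = load_dotenv(find_dotenv())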
openai.api_key = os.environ['OPENAI_API_KEY']
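
Reading `os.environ['OPENAI_API_KEY']` directly raises a bare `KeyError` when the variable is missing. A small guard, sketched below, gives a more readable failure. The helper name `require_env` is hypothetical and not part of `dotenv` or `openai`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a readable error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return value
```

With this helper, the last line of the setup cell could read `openai.api_key = require_env("OPENAI_API_KEY")`.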

### Example 1: Loading PDF Documents

In [2]:
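# Uncomment to install the packages this notebook relies on
#!pip install langchain
#!pip install pypdf   # PDF parsing for PyPDFLoader
#!pip install yt_dlp  # YouTube audio download for YoutubeAudioLoader
#!pip install pydub   # audio handling for OpenAIWhisperParser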


In [3]:
from langchain.document_loaders import PyPDFLoader

pdf_path = "docs/cs229_lectures/MachineLearning-Lecture01.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load()  # one Document per PDF page

# Print the number of pages and the first 500 characters of the first page
print(f"Number of pages: {len(pages)}")
print(pages[0].page_content[:500])

# Print the first page's metadata: the source path and a zero-based page number
print(pages[0].metadata)

Number of pages: 22
MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
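
Because the `page` value in the metadata is zero-based, it is easy to be off by one when pointing a reader back to the PDF. A minimal sketch of a citation helper follows; the function name `cite` is made up for illustration, and it assumes the metadata has exactly the `source` and `page` keys shown in the output above.

```python
import os


def cite(metadata):
    """Format loader metadata as 'file.pdf, p. N' with a 1-based page number."""
    filename = os.path.basename(metadata["source"])
    return f"{filename}, p. {metadata['page'] + 1}"


# Metadata copied from the first page printed above
first_page_metadata = {
    "source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf",
    "page": 0,
}
print(cite(first_page_metadata))
```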


### Example 2: Loading and Transcribing YouTube Videos

In [4]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"

loader = GenericLoader(
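    # To download the audio from YouTube instead, swap in:
    # YoutubeAudioLoader([url], save_dir),
    FileSystemBlobLoader(save_dir, glob="*.m4a"),  # load previously downloaded audio from disk
    OpenAIWhisperParser()  # transcribe each audio file with OpenAI Whisper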
)
docs = loader.load()
print(docs[0].page_content[:500])

Transcribing part 1!
Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s
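
The cell above reads audio from disk and keeps the YouTube download commented out. One way to choose between the two is to check whether the audio is already in `save_dir`. The helper below is a sketch using only the standard library, and the name `has_local_audio` is hypothetical.

```python
from pathlib import Path


def has_local_audio(save_dir, pattern="*.m4a"):
    """Return True if save_dir already contains at least one file matching pattern."""
    directory = Path(save_dir)
    return directory.is_dir() and any(directory.glob(pattern))


print(has_local_audio("docs/youtube/"))
```

If this returns `False`, pass `YoutubeAudioLoader([url], save_dir)` as the first argument to `GenericLoader` to download the audio; otherwise the `FileSystemBlobLoader` shown above avoids a repeat download.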


### Example 3: Loading Content from Web Pages

In [5]:
# Load a web page into Document objects
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")
docs = loader.load()
# Print the first 500 characters; the raw text is mostly blank lines and GitHub navigation text, so it needs pre-processing
print(docs[0].page_content[:500])

handbook/titles-for-programmers.md at master · basecamp/handbook · GitHub
Skip to content
Navigation Menu
Toggle navigation
            Sign in
        Product
GitHub Copilot
        Write better code with AI
GitHub Advanced Security
        Find and fix vulnerabilities
A
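
As a first pre-processing pass on output like this, the blank lines and indentation can be collapsed with a few lines of standard-library code. This is only a sketch: the function name `collapse_whitespace` is made up, and it does not remove navigation text such as "Skip to content", which would need HTML-aware filtering.

```python
import re


def collapse_whitespace(text):
    """Strip each line, squeeze runs of spaces and tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# A short sample shaped like the raw page text above
raw = (
    "\n\n\nSkip to content\n\n\n\n        Product\n        \n\n\n"
    "GitHub Copilot\n        Write better code with AI\n      \n"
)
print(collapse_whitespace(raw))
```

In the notebook, the same function could be applied to `docs[0].page_content` before the text is split into chunks.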


### Example 4: Loading Notion Documents

In [6]:
from langchain.document_loaders import NotionDirectoryLoader

notion_path = "docs/Notion_DB"
loader = NotionDirectoryLoader(notion_path)
docs = loader.load()
print(docs[0].page_content[:200])

# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that


### Summary
You've learned how to load documents from:
- **PDFs** (via `PyPDFLoader`)
- **YouTube** (via `YoutubeAudioLoader` or local audio files, transcribed with `OpenAIWhisperParser`)
- **Web Pages** (via `WebBaseLoader`)
- **Notion** (via `NotionDirectoryLoader`)

All loaders return `Document` objects with `.page_content` and `.metadata`.
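
Because every loader exposes the same two attributes, one helper can inspect the output of any of them. The sketch below uses a stand-in dataclass so that it runs without LangChain installed; `FakeDocument`, `preview`, and the sample `source` values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(docs, n_chars=40):
    """Return one line per document: its metadata and the start of its text."""
    return [f"{doc.metadata} | {doc.page_content[:n_chars]!r}" for doc in docs]


# Sample documents with made-up metadata
sample_docs = [
    FakeDocument("Welcome to CS229 Machine Learning.", {"source": "example-audio"}),
    FakeDocument("# Blendle's Employee Handbook", {"source": "example-notes"}),
]
for line in preview(sample_docs):
    print(line)
```

The same `preview` call should work on the `pages` and `docs` lists produced earlier, since it only relies on `.page_content` and `.metadata`.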

**Next Step:** Split documents into chunks for semantic search and efficient retrieval in RAG systems.