## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
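
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It assumes a toy word-overlap scorer in place of a real embedding search, and it stops at building the prompt rather than calling an LLM. `tokenize`, `retrieve`, `build_prompt`, and the sample documents are hypothetical names and data used only for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the question.

    Toy scorer (an assumption): real systems usually rank by embedding similarity.
    """
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


documents = [
    "CS229 is a machine learning class taught at Stanford.",
    "Whisper transcribes audio files into text.",
    "PDF loaders split a file into one document per page.",
]
question = "Which class covers machine learning?"

context = retrieve(question, documents, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

In a real pipeline the prompt would be sent to the LLM; the retrieval step is what lets the model answer from documents it was never trained on.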

In [5]:
%pip install langchain openai dotenv

Defaulting to user installation because normal site-packages is not writeable
Collecting langchain
  Downloading langchain-0.3.27-py3-none-any.whl.metadata (7.8 kB)
Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting langchain-core<1.0.0,>=0.3.72 (from langchain)
  Downloading langchain_core-0.3.74-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.9 (from langchain)
  Downloading langchain_text_splitters-0.3.9-py3-none-any.whl.metadata (1.9 kB)
Collecting langsmith>=0.1.17 (from langchain)
  Downloading langsmith-0.4.14-py3-none-any.whl.metadata (14 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading sqlalchemy-2.0.43-cp39-cp39-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)

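##### Note: why is %pip better than !pip?
Answer:
* `!pip` = opens a separate shell on the side and asks the system to install the package, so it may end up in a different Python environment (the wrong room).
* `%pip` = tells the Notebook kernel itself to install the package, so it lands in the environment the kernel actually imports from (the right room).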
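
One way to see the difference is to ask which interpreter the kernel is running and to address pip through that interpreter. This is a minimal sketch using only the standard library. `pip_install_command` is a hypothetical helper, and the command is built but not executed here.

```python
import sys


def pip_install_command(package: str) -> list[str]:
    """Build a pip command bound to the interpreter running this code."""
    # `python -m pip` installs into the environment of that exact interpreter,
    # which is the environment the running kernel imports from.
    return [sys.executable, "-m", "pip", "install", package]


print("Kernel interpreter:", sys.executable)
print("Command:", pip_install_command("pypdf"))
```

Passing the returned list to `subprocess.run` would perform the install for the current interpreter.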

In [6]:
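'''
Purpose: set up the basic LangChain + OpenAI environment
'''
#! pip install langchain
import os # operating-system utilities (used here to read environment variables)
import openai # OpenAI API client
# import sys # Python system utilities; not needed when running locally
# sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv # dotenv reads a .env file and loads it into the environment
_ = load_dotenv(find_dotenv()) # load the .env file

'''
Why write it this way?
find_dotenv() → locates the .env file automatically (searching upward through parent directories)
load_dotenv() → loads the variables from the .env file into the environment
_ = → discards the return value (we don't need it)
What does load_dotenv() return?
Answer: True or False, but we only want the side effect, not the return value
Normally: "result = load_dotenv()" → keeps the return value
To ignore it: "_ = load_dotenv()" → assigning to _ marks the value as unused (a Python convention)

'''

# Load the API key
openai.api_key  = os.environ['OPENAI_API_KEY']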
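
For intuition, the core of what `load_dotenv()` does can be sketched with the standard library. This is a simplified stand-in, not python-dotenv's actual implementation: it handles only plain `KEY=VALUE` lines. `load_env_file` and `DEMO_API_KEY` are hypothetical names.

```python
import os
import tempfile


def load_env_file(path: str) -> bool:
    """Copy KEY=VALUE pairs from a file into os.environ.

    Returns True if at least one variable was set.
    """
    loaded = False
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip("'\"")  # drop surrounding quotes
            if key not in os.environ:  # variables that already exist win
                os.environ[key] = value
                loaded = True
    return loaded


# Demo with a throwaway file standing in for a real .env
os.environ.pop("DEMO_API_KEY", None)
with tempfile.NamedTemporaryFile(
    "w", suffix=".env", delete=False, encoding="utf-8"
) as fh:
    fh.write("# demo file\nDEMO_API_KEY='demo-value'\n")
    env_path = fh.name

was_loaded = load_env_file(env_path)
os.remove(env_path)
print(was_loaded, os.environ.get("DEMO_API_KEY"))
```

Note that `os.environ['OPENAI_API_KEY']` raises `KeyError` when the variable is missing, while `os.environ.get('OPENAI_API_KEY')` returns `None`; failing loudly is usually preferable for a required key.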

## PDFs

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription so words and sentences are sometimes split unexpectedly.
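
Because of those unexpected splits, it can help to normalize whitespace before further processing. This minimal sketch assumes that collapsing every run of whitespace into a single space is acceptable for your use case (it also discards paragraph breaks). `normalize_whitespace` is a hypothetical helper, and the sample string imitates the transcript's line wrapping.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return " ".join(text.split())


sample = "Welcome to CS229, the machine \nlearning class.  So what I wanna do today"
print(normalize_whitespace(sample))
```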

In [8]:
%pip install -U langchain-community pypdf
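'''
Package overview
1. langchain-community: provides the document_loaders used here, plus vector stores and LLM integrations (Hugging Face, Anthropic)
2. pypdf: a Python package for working with PDF files (used here to parse the PDF)
What is -U?
Answer: it is short for --upgrade; if the package is already installed, it is upgraded to the latest version
'''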

Defaulting to user installation because normal site-packages is not writeable
Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain-community)
  Downloading aiohttp-3.12.15-cp39-cp39-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting aiohappyeyeballs>=2.5.0 (from aiohttp<4.0.0,>=3.8.3->langchain-community)
  Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB)
Collecting aiosignal>=1.4.0 (fr


In [7]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
#! pip install pypdf 

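# PyPDFLoader: LangChain's PDF loader; it converts a PDF file into LangChain Document objects.
# It is imported from langchain_community, where the loader now lives.
from langchain_community.document_loaders import PyPDFLoader
# Create a loader instance (pointing it at the source)
loader = PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf")
# Load the file: download/read the PDF → parse it with pypdf → split it by page → convert to Document objects
pages = loader.load()

ModuleNotFoundError: Module langchain_community.document_loaders not found. Please install langchain-community to access this module. You can install it using `pip install -U langchain-community`

This error was recorded when the loader was imported through the legacy path `langchain.document_loaders` before `langchain-community` had been installed. Once the `%pip install -U langchain-community pypdf` cell has run, the load succeeds.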

In [16]:
# Inspect the result
print(f"Loaded {len(pages)} pages in total")
print(f"First page content: {pages[0].page_content[:200]}...")
print(f"First page metadata: {pages[0].metadata}")
# len(pages) → how many pages were loaded
# pages[0].page_content[:200] → index a page, then slice the first 200 characters of its text
# pages[0].metadata → index a page and read its metadata dictionary

Loaded 22 pages in total
First page content: MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of...
First page metadata: {'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': 'https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 0, 'page_label': '1'}


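#### Notes
##### 1. What is metadata for?
* Source tracking: in a RAG system, metadata lets you look up where a passage came from when a user asks for the source.
* Filtering and search: a search can be restricted to a particular page, date, or source.
    ```python
    # Keep only specific pages
    first_5_pages = [p for p in pages if p.metadata['page'] < 5]
    ```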
##### 2. What does metadata typically look like?
```python
# Illustrative values; the actual keys vary by loader and file
{
    'source': 'https://see.stanford.edu/.../MachineLearning-Lecture01.pdf',
    'page': 0,           # page index (0-based)
    'total_pages': 20,   # total number of pages
    'title': 'Machine Learning CS229 Lecture 1',
    'author': 'Stanford University',
    'creation_date': '2023-01-15',
    'file_size': '2.5MB'
}
```
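
As a small illustration of source tracking, a metadata dictionary can be turned into a human-readable citation. This sketch assumes the `source` and `page` keys shown above and a 0-based `page`. `format_citation` is a hypothetical helper, not part of LangChain.

```python
def format_citation(metadata: dict) -> str:
    """Build a short citation string from a document's metadata."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        return source
    # Pages are stored 0-based, so add 1 for display
    return f"{source} (page {page + 1})"


meta = {"source": "MachineLearning-Lecture01.pdf", "page": 0}
print(format_citation(meta))
```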
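##### 3. Why install pypdf when it never appears in the code?
* LangChain acts as a wrapper (an adapter layer); it does not parse the PDF itself. It:
    * calls an external PDF parsing package (such as pypdf) to extract each page's content
    * wraps the extracted text in LangChain Document objects for the later steps
* The actual parsing is still done by pypdf.
* The reason for this design: it keeps LangChain lightweight by using a "lazy dependency" strategy, so an optional package is required only by the features that use it.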
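
The lazy-dependency pattern described above can be sketched as follows. This illustrates the general idea and is not LangChain's actual code: `LazyLoader` is a hypothetical class, and the standard-library `json` module stands in for an optional parser such as pypdf.

```python
import importlib


class LazyLoader:
    """Wrapper that imports its parsing backend only when load() is called."""

    def __init__(self, backend: str, pip_name: str):
        # Nothing is imported here, so creating the wrapper is cheap
        self.backend = backend
        self.pip_name = pip_name

    def _import_backend(self):
        try:
            return importlib.import_module(self.backend)
        except ImportError as exc:
            # Turn the bare failure into an actionable message
            raise ImportError(
                f"{self.backend} is required; "
                f"install it with `pip install {self.pip_name}`"
            ) from exc

    def load(self, text: str):
        backend = self._import_backend()
        return backend.loads(text)  # the backend does the real parsing


print(LazyLoader("json", "json").load('{"page": 0}'))

try:
    LazyLoader("not_a_real_backend_xyz", "not-a-real-backend").load("{}")
except ImportError as err:
    print("ImportError:", err)
```

The wrapper stays importable even when the backend is missing; the error appears only when the feature that needs the backend is used.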

## YouTube

In [10]:
%pip install yt_dlp pydub

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [12]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders import FileSystemBlobLoader # 2025-08-19: location changed; this used to be imported from document_loaders.generic
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

#### Note: differences between approaches to transcribing audio to text


In [13]:
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    #YoutubeAudioLoader([url],save_dir),  # fetch from youtube
    FileSystemBlobLoader(save_dir, glob="*.m4a"),   #fetch locally
    OpenAIWhisperParser()
)
docs = loader.load()

In [19]:
print(docs[0].page_content[0:500])

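IndexError: list index out of range

`docs` is empty here, so `docs[0]` fails. The `YoutubeAudioLoader` line is commented out, so the loader only reads `.m4a` files already present in `docs/youtube/`; most likely that folder contained none.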
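
A guard like the following makes this failure easier to diagnose. It is a minimal sketch, assuming the loader returned nothing because no `.m4a` file matched in the save directory. `find_audio_files` and `preview` are hypothetical helpers, and a temporary directory stands in for `docs/youtube/`.

```python
import tempfile
from pathlib import Path


def find_audio_files(save_dir: str, pattern: str = "*.m4a") -> list[Path]:
    """List the audio files a file-system loader would pick up."""
    return sorted(Path(save_dir).glob(pattern))


def preview(docs: list, limit: int = 500) -> str:
    """Return the start of the first document, or a hint if nothing was loaded."""
    if not docs:
        return "No documents were loaded; check that the audio files exist."
    return docs[0].page_content[:limit]


with tempfile.TemporaryDirectory() as tmp:
    print(find_audio_files(tmp))  # nothing has been downloaded yet
    (Path(tmp) / "lecture.m4a").touch()
    print([p.name for p in find_audio_files(tmp)])

print(preview([]))
```

Checking `find_audio_files(save_dir)` before calling `loader.load()` shows whether the download step still needs to run.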