Day 2: Vector Storage. Today's goal is to get the whole pipeline working end to end: any PDF → split → embed → build a FAISS index → retrieve → generate an answer with Groq (RAG).

In [44]:
!pip install faiss-cpu sentence-transformers pypdf langchain-text-splitters python-dotenv groq


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [45]:
import os
from dotenv import load_dotenv
load_dotenv()  # load variables from .env (do not commit this file)

GROQ_API_KEY = os.getenv("GROQ_API_KEY")
assert GROQ_API_KEY, "Environment variable GROQ_API_KEY is not set. Set it in .env or os.environ first."

from groq import Groq
client = Groq(api_key=GROQ_API_KEY)

from sentence_transformers import SentenceTransformer
import faiss
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter


In [46]:
# Put your PDF in the repo's data/ folder, e.g. data/guide.pdf
PDF_PATH = "../data/Real-Time Sign Language Detection using LSTM.pdf"  # ← replace with your file name
reader = PdfReader(PDF_PATH)

pages = []
for p in reader.pages:
    try:
        pages.append(p.extract_text() or "")
    except Exception:
        pages.append("")  # keep an empty placeholder so one bad page does not stop the run
        
raw_text = "\n".join(pages).strip()
print("Chars:", len(raw_text))
print(raw_text[:1000])


Chars: 23082
Real-Time  Sign  Language  Detection  using  LSTM  
 Chung-Hao  Tuan  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  tuanc@oregonstate.edu  
Yun-Hsuan  Chan  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  chanyun@oregonstate.edu  
Fen-Yun  Huang  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  huanfeny@oregonstate.edu   
 
Abstract
 
    This  paper  proposes  a  real-time  sign  language  detection  system  utilizing  Long  Short-Term  Memory  (LSTM)  networks  combined  with  keypoint-based  feature  extraction.  The  system  leverages  MediaPipe  Holistic  for  extracting  skeletal  landmarks  from  hand,  face,  and  pose  keypoints.  Compared  to  conventional  approaches  like  Hidden  Markov  Models  (HMMs)  and  Convolutional  Neural  Networks  (CNNs),  LSTM  effectively  captures  temporal  dependencies  required  for  recognizing  continuous  gestures.  We  collected

In [47]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # tune to the document length (roughly 500 to 1200)
    chunk_overlap=120,    # overlap so text cut at a chunk boundary is not lost
    separators=["\n\n", "\n", "。", "，", " "]
)
chunks = splitter.split_text(raw_text)
print("Chunks:", len(chunks))
print(chunks[0][:300])


Chunks: 32
Real-Time  Sign  Language  Detection  using  LSTM  
 Chung-Hao  Tuan  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  tuanc@oregonstate.edu  
Yun-Hsuan  Chan  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  chanyun@oregonstate.edu  
Fen
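
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-width sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); the helper name `sliding_chunks` is hypothetical and only illustrates how neighbouring chunks share an overlap.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    out = []
    for start in range(0, len(text), step):
        out.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return out


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk reappears as the first character of the next.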


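Embeddings place sentences with similar meaning close together in vector space and unrelated ones far apart.
Each chunk → converted into a vector.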
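
As a rough illustration of "close" versus "far", the sketch below computes cosine similarity on hand-made 3-dimensional vectors. The vectors and their names are invented for the example and are not outputs of any embedding model; real embeddings in this notebook have 384 dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: the inner product of the two unit-length vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Hand-made toy vectors (assumption: NOT real model embeddings)
hand_sign = [0.9, 0.1, 0.0]
gesture = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine(hand_sign, gesture) > cosine(hand_sign, weather))
```

The two "related" vectors point in nearly the same direction, so their cosine similarity is close to 1, while the unrelated pair scores close to 0.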

In [48]:
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(embed_model_name)

embeddings = embedder.encode(chunks, batch_size=64, show_progress_bar=True, normalize_embeddings=True)
import numpy as np
emb = np.array(embeddings).astype("float32")
dim = emb.shape[1]
print("Embedding shape:", emb.shape, "dim:", dim)


Batches: 100%|██████████| 1/1 [00:00<00:00,  4.97it/s]

Embedding shape: (32, 384) dim: 384





In [49]:
index = faiss.IndexFlatIP(dim)  # with normalized vectors, inner product equals cosine similarity
index.add(emb)                  # add all chunk vectors
print("Indexed vectors:", index.ntotal)


Indexed vectors: 32
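
Conceptually, an exact (flat) inner-product index scores the query against every stored vector and keeps the top k. The pure-Python sketch below shows that idea on toy 2-d unit vectors; `flat_ip_search` is a hypothetical helper written for illustration, not part of FAISS, which performs the same computation with optimized code.

```python
def flat_ip_search(query, vectors, k):
    """Return (scores, indices) of the k vectors with the largest inner product."""
    scored = [(sum(q * x for q, x in zip(query, v)), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda t: t[0], reverse=True)
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]


# Toy unit-length vectors (assumption: stand-ins for normalized embeddings)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [1.0, 0.0]

scores, indices = flat_ip_search(toy_query, toy_vectors, k=2)
print(indices, scores)
```

Because all vectors are unit length, each score is also the cosine similarity between the query and that vector.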


In [50]:
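def search(query, k=5, mmr_lambda=0.5):
    # 1) Embed the query
    qv = embedder.encode([query], normalize_embeddings=True).astype("float32")
    # 2) Over-fetch candidates (at least 20), then filter them for redundancy
    fetch = max(k*4, 20)
    D, I = index.search(qv, fetch)
    # FAISS pads missing hits with -1 when the index holds fewer than `fetch` vectors
    cands = [(int(i), float(D[0][j])) for j, i in enumerate(I[0]) if i >= 0]

    # 3) Simplified MMR
    selected, selected_vecs = [], []
    for idx, score in cands:
        cv = emb[idx]
        if not selected:
            selected.append((idx, score))
            selected_vecs.append(cv)
            if len(selected) >= k: break
            continue
        # Highest similarity to any already-selected chunk
        sim_to_S = max(float(np.dot(cv, sv)) for sv in selected_vecs)
        mmr = mmr_lambda*score - (1-mmr_lambda)*sim_to_S
        # Threshold-based pick (simplified). Note: len(selected) < k always holds here,
        # so every candidate is accepted and the result is plain top-k by score.
        # A stricter version would pick the argmax MMR score at each step.
        if mmr > -0.2 or len(selected)<k:
            selected.append((idx, score))
            selected_vecs.append(cv)
            if len(selected) >= k: break
    return [chunks[i] for i,_ in selected]  # return the text of the selected chunks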

# quick test
query = "這份文件的重點與使用步驟是什麼？"  # "What are the key points and usage steps of this document?"
ctx = search(query, k=5)
len(ctx), ctx[0][:300]


(5,
 'Real-Time  Sign  Language  Detection  using  LSTM  \n Chung-Hao  Tuan  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  tuanc@oregonstate.edu  \nYun-Hsuan  Chan  School  of  Computer  Science  Oregon  State  University,  Corvallis,  OR  USA  chanyun@oregonstate.edu  \nFen')
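
The comment inside `search` mentions stepwise argmax as a stricter alternative to the threshold. A minimal sketch of that greedy MMR variant follows, written in pure Python on toy unit vectors. The helper names `dot` and `greedy_mmr` are hypothetical, and the vectors are invented for the example; this is one common way to formulate MMR, not the notebook's implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_mmr(query_vec, cand_vecs, k, mmr_lambda=0.5):
    """Return indices into cand_vecs chosen by greedy MMR (vectors assumed unit-length)."""
    remaining = list(range(len(cand_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = dot(query_vec, cand_vecs[i])
            # Redundancy = highest similarity to anything already selected (0 if none yet)
            redundancy = max((dot(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return mmr_lambda * relevance - (1 - mmr_lambda) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy unit vectors: candidate 1 is more relevant than candidate 2 but nearly duplicates candidate 0
toy_q = [1.0, 0.0]
toy_cands = [[0.96, 0.28], [0.8, 0.6], [0.6, -0.8]]

print(greedy_mmr(toy_q, toy_cands, k=2))
print(greedy_mmr(toy_q, toy_cands, k=2, mmr_lambda=1.0))
```

With `mmr_lambda=0.5` the second pick favours the less redundant candidate; with `mmr_lambda=1.0` the redundancy term vanishes and the selection falls back to plain relevance ranking.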

In [51]:
def rag_answer(question, k=5, sys_prompt="You are a precise assistant. Use the context to answer concisely. If unknown, say you don't know."):
    context = "\n\n".join(search(question, k=k))
    messages = [
        {"role":"system","content": sys_prompt},
        {"role":"user","content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer in English."}
    ]
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=messages,
        temperature=0.2,
        max_tokens=800,
    )
    return resp.choices[0].message.content

print(rag_answer("How does the paper do Keypoint Extraction?"))


The paper leverages MediaPipe Holistic to extract a comprehensive set of keypoints, including:

1. Pose landmarks (33 landmarks, 132 dimensional features)
2. Facial landmarks (468 landmarks, 1,404 dimensional features)
3. Hand landmarks (21 landmarks per hand, 126 dimensional features)

These keypoints are used for gesture recognition and sign language detection.


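Q: If retrieval already found the relevant passages, why still send them to Groq for an answer?

A: Retrieval = a librarian finds 3 relevant books for you and opens them to the marked pages.

Generation (Groq LLM) = a research assistant reads those pages for you and condenses them into "this paper uses MediaPipe Holistic for keypoint extraction".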
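
The hand-off between the two roles is just prompt assembly: the retrieved chunks become the context that the model reads. The sketch below shows one possible way to build that prompt with numbered chunks so an answer can be traced back to its sources. `build_messages` is a hypothetical helper and the numbering scheme is an assumption; it makes no API call.

```python
def build_messages(question, retrieved_chunks,
                   sys_prompt="Use the context to answer concisely. If unknown, say you don't know."):
    """Assemble chat messages from retrieved chunks; each chunk gets a [n] label."""
    numbered = [f"[{n}] {chunk}" for n, chunk in enumerate(retrieved_chunks, start=1)]
    context = "\n\n".join(numbered)
    return [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


# Placeholder chunk texts (assumption: stand-ins for real retrieval results)
msgs = build_messages(
    "How does the paper do keypoint extraction?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(msgs[1]["content"])
```

The resulting list has the same shape as the `messages` argument used in `rag_answer`, so retrieval decides what the model sees and generation decides how it is summarized.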

In [52]:
def rag_with_sources(question, k=5):
    retrieved = search(question, k=k)
    answer = rag_answer(question, k=k)
    return {
        "question": question,
        "answer": answer,
        "sources": [{"idx": i, "snippet": s[:300]} for i, s in enumerate(retrieved)]
    }

res = rag_with_sources("How does the paper do Keypoint Extraction?", k=4)
res


{'question': 'How does the paper do Keypoint Extraction?',
 'answer': 'The paper leverages MediaPipe Holistic to extract a comprehensive set of keypoints, including:\n\n1. Pose landmarks (33 landmarks, 132 dimensional features)\n2. Facial landmarks (468 landmarks, 1,404 dimensional features)\n3. Hand landmarks (21 landmarks per hand, 126 dimensional features)\n\nThese keypoints are extracted from video frames using MediaPipe Holistic.',
 'sources': [{'idx': 0,
   'snippet': 'performance  across  different  ambient  lighting  conditions.  ●  Camera  setup:  Standard  webcam  hardware  was  used  in  conjunction  with  OpenCV  for  video  capture  and  MediaPipe  Holistic  for  keypoint  detection,  reflecting  typical  end-user  hardware  configurations.  This  approach '},
  {'idx': 1,
   'snippet': 'We  leveraged  MediaPipe  Holistic  to  extract  a  comprehensive  set  of  keypoints:  \n●  Pose  landmarks:  A  set  of  33  landmarks  representing  the  human  skeleton,  each  charact