<a href="https://colab.research.google.com/github/lyh26x03/aml-redflags-rag/blob/main/build_data_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
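# -*- coding: utf-8 -*-
"""
build_data.py: Indexing Pipeline

Responsibility: Write / Create
- Read the PDF documents
- Split them into small passages (chunking)
- Build the vector index (FAISS)
- Build the keyword index (BM25)
- Save everything to Google Drive in one place

When to run:
- First-time setup
- When the PDF documents are updated
- When the chunking parameters or the embedding model change

History:
    - 2025-01-xx: v1, initial version (chunk_size=300, no metadata layering)
    - 2025-02-09: v2, adds metadata layering, retrieval_priority, doc_category
"""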

# 🔧 PART 0: SETUP (environment setup)

### 0.1 Install packages


In [None]:
!pip install pypdf langchain-text-splitters sentence-transformers faiss-cpu rank_bm25 jieba


### 0.2 Mount Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 0.3 Configure paths and parameters


In [None]:
import os

In [None]:
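# --- Paths ---
SOURCE_DATA_DIR = "/content/drive/MyDrive/AML/data"
INDEX_DIR = "/content/drive/MyDrive/AML/index_v2"

# Version name; change as needed. save_all_indexes() appends it to INDEX_DIR,
# so the files are written to INDEX_DIR/CURRENT_VERSION.
CURRENT_VERSION = "v2"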

In [None]:
# # Path settings
# PDF_FOLDER = "/content/drive/MyDrive/AML/data"
# BASE_INDEX_ROOT_DIR = "/content/drive/MyDrive/AML" # root directory for stored indexes
# CURRENT_INDEX_VERSION = "index_v2" # version name; change as needed, e.g. "fatf_va_v1"
# INDEX_PATH = f"{BASE_INDEX_ROOT_DIR}/{CURRENT_INDEX_VERSION}" # full index path

In [None]:
# --- Embedding model ---
EMBEDDING_MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

In [None]:
# --- Chunking parameters ---
CHUNK_SIZE = 400
CHUNK_OVERLAP = 80

In [None]:
# --- Validation ---
assert os.path.exists(SOURCE_DATA_DIR), f"❌ Data path does not exist: {SOURCE_DATA_DIR}"
pdf_files = sorted([f for f in os.listdir(SOURCE_DATA_DIR) if f.endswith(".pdf")])
assert len(pdf_files) > 0, f"❌ No PDF found in {SOURCE_DATA_DIR}"
print("✅ Config loaded")
print(f"   Data path: {SOURCE_DATA_DIR} ({len(pdf_files)} PDFs)")
for f in pdf_files:
    print(f"      📄 {f}")
print(f"   Index output: {INDEX_DIR}")
print(f"   Embedding: {EMBEDDING_MODEL_NAME}")
print(f"   Chunk: size={CHUNK_SIZE}, overlap={CHUNK_OVERLAP}")

✅ Config loaded
   Data path: /content/drive/MyDrive/AML/data (3 PDFs)
      📄 fatf_tbm_laundering_red_flags.pdf
      📄 fatf_virtual_assets_red_flags.pdf
      📄 tw_aml_training_slides.pdf
   Index output: /content/drive/MyDrive/AML/index_v2
   Embedding: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
   Chunk: size=400, overlap=80


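# 📚 PART 1: Data preparation and index building (Indexing Pipeline)

Responsibility: Write/Create
- Read the PDF documents
- Split them into small passages (chunking)
- Build the vector index (FAISS)
- Build the keyword index (BM25)
- Save everything to Google Drive in one place


When to run:
- First-time setup
- When the PDF documents are updated
- When the chunking parameters change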

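### 1.1 Read PDFs + Metadata


```
Core layer (core)
├─ FATF red-flag standards            priority: 1.0
└─ Taiwan AML regulations overview    priority: 1.0

Semantic bridge layer (knowledge_bridge)
├─ Training slides                    priority: 0.8
├─ Teacher's handbook                 priority: 0.8
└─ Prevention guide                   priority: 0.8

Sector-specific layer (sector_specific)
├─ Virtual asset red flags            priority: 0.9
├─ Banking rules                      priority: 0.9
└─ Securities Q&A                     priority: 0.9
```

---

### Why this design?

Problem: the "dual role" of the training slides
```
Contents of tw_aml_training_slides.pdf:
├─ Basic concepts (what is money laundering?)               ← this is a bridge
├─ Financial-institution practice (account-opening review)  ← this is also a bridge
└─ Case study (4 million deposited in batches)              ← this is still a bridge
```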

In [None]:
from pypdf import PdfReader
import os

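def get_pdf_metadata(pdf_name):
    """
    Return the metadata that matches a PDF file name.

    Layering logic:
    - core: authoritative regulations / international standards (FATF, Taiwan regulations)
    - knowledge_bridge: simplified teaching content (training material, guides)
    - sector_specific: sector-level rules (virtual assets, banking, securities)
    """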
    metadata = {
        "source": "Unknown",
        "jurisdiction": "Unknown",
        "doc_type": "Unknown",
        "language": "en",
        "doc_category": "unknown",
        "retrieval_priority": 1.0,
        "explanation_style": "neutral"
    }

    # === CORE LAYER ===
    if "fatf_tbm_laundering_red_flags" in pdf_name:
        metadata.update({
            "source": "FATF",
            "jurisdiction": "International",
            "doc_type": "red_flag",
            "language": "en",
            "doc_category": "core",
            "retrieval_priority": 1.0,
            "explanation_style": "authoritative"
        })

    # === SECTOR LAYER ===
    elif "fatf_virtual_assets_red_flags" in pdf_name:
        metadata.update({
            "source": "FATF",
            "jurisdiction": "International",
            "doc_type": "red_flag",
            "language": "en",
            "doc_category": "sector_specific",
            "retrieval_priority": 0.9,
            "explanation_style": "authoritative"
        })

    # === KNOWLEDGE BRIDGE LAYER ===
    elif "tw_aml_training_slides" in pdf_name:
        metadata.update({
            "source": "TW_Gov",
            "jurisdiction": "Taiwan",
            "doc_type": "training",
            "language": "zh",
            "doc_category": "knowledge_bridge",
            "retrieval_priority": 0.8,
            "explanation_style": "simplified"
        })

    return metadata
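
The `retrieval_priority` values are only stored at indexing time; this notebook does not show how they are used at query time. One possible use, sketched here purely as an assumption, is to multiply each hit's similarity score by its priority before ranking. The helper `rerank_by_priority` and the `score` field are hypothetical names, not part of the pipeline above.

```python
def rerank_by_priority(hits):
    """Sort hits by similarity score weighted by retrieval_priority (hypothetical scheme)."""
    def weighted(hit):
        # A missing priority falls back to 1.0, the same default get_pdf_metadata() uses
        return hit["score"] * hit.get("retrieval_priority", 1.0)

    return sorted(hits, key=weighted, reverse=True)


# Made-up hits for illustration only
hits = [
    {"chunk_id": "slides_p3_c0", "score": 0.82, "retrieval_priority": 0.8},
    {"chunk_id": "fatf_p2_c1", "score": 0.70, "retrieval_priority": 1.0},
]
ranked = rerank_by_priority(hits)
print([h["chunk_id"] for h in ranked])
```

In this toy case the core-layer hit overtakes the training-slide hit even though its raw score is lower, which is the kind of effect a priority weight is presumably meant to have.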

In [None]:
def load_pdfs(folder_path):
    """
    Read every PDF in a folder and attach metadata.

    Args:
        folder_path: path of the folder that contains the PDFs

    Returns:
        list[dict]: one dict per page, containing text + metadata
    """

    pdf_paths = [
        os.path.join(folder_path, f)
        for f in sorted(os.listdir(folder_path))
        if f.endswith(".pdf")
    ]

    if not pdf_paths:
        raise FileNotFoundError(f"No PDF files found in {folder_path}")

    parsed_pdf_pages = []
    for pdf_path in pdf_paths:
        reader = PdfReader(pdf_path)
        pdf_name = os.path.basename(pdf_path)
        metadata = get_pdf_metadata(pdf_name)

        for i, page in enumerate(reader.pages, start=1):
            text = page.extract_text()
            if text and text.strip():  # skip blank pages
                parsed_pdf_pages.append({
                    "pdf_name": pdf_name,
                    "page": i,
                    "text": text,
                    **metadata,
                })

    return parsed_pdf_pages

In [None]:
from collections import Counter

# === Run ===
print("\n📚 1.1 Reading PDFs...")
parsed_pdf_pages = load_pdfs(SOURCE_DATA_DIR)
print(f"   ✅ Read {len(parsed_pdf_pages)} pages")

page_counts = Counter(p["pdf_name"] for p in parsed_pdf_pages)
for name, count in page_counts.items():
    meta = get_pdf_metadata(name)
    print(f"   📄 {name}: {count} pages | category={meta['doc_category']} | priority={meta['retrieval_priority']}")


📚 1.1 Reading PDFs...
   ✅ Read 42 pages
   📄 fatf_tbm_laundering_red_flags.pdf: 9 pages | category=core | priority=1.0
   📄 fatf_virtual_assets_red_flags.pdf: 23 pages | category=sector_specific | priority=0.9
   📄 tw_aml_training_slides.pdf: 10 pages | category=knowledge_bridge | priority=0.8


### 1.2 Chunking (splitting into passages)

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_chunks(
    pages: list,
    chunk_size: int = CHUNK_SIZE,
    chunk_overlap: int = CHUNK_OVERLAP,
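) -> list:
    """
    Split pages into small passages.

    Args:
        pages: output of load_pdfs()
        chunk_size: maximum number of characters per chunk
        chunk_overlap: number of overlapping characters between chunks

    Returns:
        list[dict]: normalized chunk list; each dict contains text + metadata
    """
    # Separators that work for both Chinese and English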
    separators = ["\n\n", "\n", "。", ".", "！", "!", "？", "?", "；", ";", " "]

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=separators,
    )

    chunks = []
    for page in pages:
        splits = splitter.split_text(page["text"])
        for j, split_text in enumerate(splits):
            chunks.append({
                "text": split_text,
                "page": page["page"],
                "chunk_id": f"{page['pdf_name']}_p{page['page']}_c{j}",
                "source": page["source"],
                "language": page["language"],
                "doc_type": page["doc_type"],
                "retrieval_priority": page.get("retrieval_priority", 1.0),
                "doc_category": page.get("doc_category", "unknown"),
                "explanation_style": page.get("explanation_style", "neutral"),
            })

    return chunks

In [None]:
# === Run ===
print("\n✂️ 1.2 Chunking...")
chunks = create_chunks(parsed_pdf_pages)
print(f"   ✅ Produced {len(chunks)} chunks")

chunk_cats = Counter(c["doc_category"] for c in chunks)
for cat, count in chunk_cats.items():
    print(f"   📦 {cat}: {count} chunks")


✂️ 1.2 Chunking...
   ✅ Produced 226 chunks
   📦 core: 51 chunks
   📦 sector_specific: 165 chunks
   📦 knowledge_bridge: 10 chunks
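
The counts alone say little about chunk quality. For example, the 10 knowledge_bridge pages produced exactly 10 chunks, so each slide page ended up as a single chunk. A per-category length summary can help judge whether `CHUNK_SIZE` and `CHUNK_OVERLAP` suit the documents. The sketch below uses a hypothetical helper, `summarize_chunk_lengths`, and a made-up sample; in the notebook you would pass the real `chunks` list.

```python
from collections import defaultdict
from statistics import mean


def summarize_chunk_lengths(chunks):
    """Return {doc_category: (count, min_len, mean_len, max_len)}, lengths in characters."""
    lengths = defaultdict(list)
    for c in chunks:
        lengths[c.get("doc_category", "unknown")].append(len(c["text"]))
    return {
        cat: (len(v), min(v), round(mean(v), 1), max(v))
        for cat, v in lengths.items()
    }


# Made-up sample for illustration only
sample_chunks = [
    {"text": "a" * 120, "doc_category": "core"},
    {"text": "b" * 380, "doc_category": "core"},
    {"text": "c" * 45, "doc_category": "knowledge_bridge"},
]
summary = summarize_chunk_lengths(sample_chunks)
for cat, stats in summary.items():
    print(cat, stats)
```

Many very short chunks, or chunks well above `CHUNK_SIZE`, would both be signals worth investigating before building the indexes.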


### 1.3 Embedding + FAISS Index


In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np


def create_faiss_index(chunks: list, embedding_model_name: str):
    """
    Build the vector index.

    Args:
        chunks: list of chunks
        embedding_model_name: name of the sentence-transformers model

    Returns:
        tuple: (embedding_model, faiss_index)
    """
    print(f"   Loading model: {embedding_model_name}")
    embedding_model = SentenceTransformer(embedding_model_name)

    # Generate the embeddings
    texts = [c["text"] for c in chunks]
    embeddings = embedding_model.encode(
        texts,
        normalize_embeddings=True,
        show_progress_bar=True,
    )

    # Build the FAISS index. Inner product is used because the vectors are
    # already normalized, which makes it equal to cosine similarity.
    dim = embeddings.shape[1]
    faiss_index = faiss.IndexFlatIP(dim)
    faiss_index.add(np.array(embeddings, dtype="float32"))

    return embedding_model, faiss_index

In [None]:
# Run
print("\n🧠 1.3 Building the FAISS index (vector index)...")
embedding_model, faiss_index = create_faiss_index(chunks, EMBEDDING_MODEL_NAME)
print(f"   ✅ Done: {faiss_index.ntotal} vectors, dimension {faiss_index.d}")


🧠 1.3 Building the FAISS index (vector index)...
   Loading model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

   ✅ Done: 226 vectors, dimension 384
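
`IndexFlatIP` performs an exact, brute-force inner-product search. Because the embeddings are L2-normalized, that inner product equals cosine similarity. The plain-Python sketch below illustrates the computation on toy 2-D vectors. It is not FAISS code, and `l2_normalize` and `top_k_inner_product` are hypothetical helpers written only for this illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_inner_product(query, vectors, k):
    """Exact search: score every vector by inner product, return the top k as (index, score)."""
    scores = [
        (i, sum(q * v for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]


# Toy 2-D "embeddings"; the real ones in this notebook have 384 dimensions
vectors = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query = l2_normalize([2.0, 0.1])
results = top_k_inner_product(query, vectors, k=2)
print(results)
```

With the real index, the corresponding call would be `faiss_index.search(query_vectors, k)`, where `query_vectors` is a float32 array of shape (n_queries, 384) encoded with `normalize_embeddings=True`.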


### 1.4 BM25 Index (keyword index)

In [None]:
from rank_bm25 import BM25Okapi
import jieba


def create_bm25_index(chunks: list):
    """
    Build the BM25 keyword index.

    Args:
        chunks: list of chunks (each needs the text and language fields)

    Returns:
        tuple: (bm25_index, tokenized_corpus)
    """
    tokenized_corpus = []

    for c in chunks:
        if c.get("language") == "zh":
            tokens = list(jieba.cut(c["text"]))
        else:
            tokens = c["text"].lower().split()
        tokenized_corpus.append(tokens)

    bm25_index = BM25Okapi(tokenized_corpus)

    return bm25_index, tokenized_corpus



In [None]:
# Run
print("\n📝 1.4 Building the BM25 index (keyword index)...")
bm25_index, tokenized_corpus = create_bm25_index(chunks)
print(f"   ✅ Done: {len(tokenized_corpus)} documents")


📝 1.4 Building the BM25 index (keyword index)...
   ✅ Done: 226 documents
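
BM25 only matches tokens that were produced the same way at indexing time and at query time, so the query side would need to repeat the tokenization used in `create_bm25_index`. A minimal sketch follows. `tokenize_for_bm25` is a hypothetical helper, and the word segmenter is passed in as a parameter (in this notebook it would be `jieba.cut`) so that the example runs without jieba installed.

```python
def tokenize_for_bm25(text, language, zh_cut=None):
    """Mirror create_bm25_index(): segment Chinese text, lowercase and whitespace-split the rest."""
    if language == "zh" and zh_cut is not None:
        return list(zh_cut(text))
    return text.lower().split()


def char_cut(text):
    """Stand-in segmenter for illustration only: one token per character."""
    return list(text)


en_tokens = tokenize_for_bm25("Unusual Wire Transfers", "en")
zh_tokens = tokenize_for_bm25("洗錢防制", "zh", zh_cut=char_cut)
print(en_tokens)
print(zh_tokens)
```

The resulting token list could then be passed to `bm25_index.get_scores(tokens)` to score every chunk against the query.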


### 1.5 Save to Google Drive

In [None]:
import json
import pickle
from pathlib import Path
from datetime import datetime


def save_all_indexes(
    base_dir: str,
    faiss_index,
    chunks: list,
    bm25_index,
    tokenized_corpus: list,
    embedding_model_name: str,
    chunk_size: int,
    chunk_overlap: int,
    version_name: str = CURRENT_VERSION
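):
    """
    Save all indexes and data in one place, with version management.

    Files written:
    - faiss_index.bin: FAISS vector index
    - chunks.json: the original chunks (JSON, easy to inspect)
    - bm25_index.pkl: BM25 index
    - tokenized_corpus.pkl: the tokenized corpus
    - metadata.json: index metadata (version, creation time, etc.)

    Args:
        base_dir: index root directory (e.g. /content/drive/MyDrive/AML/indices)
        faiss_index: FAISS index object
        chunks: list of chunks
        bm25_index: BM25 index object
        tokenized_corpus: the tokenized corpus
        embedding_model_name: name of the embedding model used
        chunk_size: chunk size
        chunk_overlap: chunk overlap
        version_name: version name (e.g. "v2"); files go to base_dir/version_name
    """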
    full_path = Path(base_dir) / version_name
    full_path.mkdir(parents=True, exist_ok=True)
    print(f"   Save path: {full_path}")

    # 1. FAISS Index
    faiss.write_index(faiss_index, str(full_path / "faiss_index.bin"))
    print(f"   ✅ faiss_index.bin")

    # 2. Chunks
    with open(full_path / "chunks.json", "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)
    print(f"   ✅ chunks.json ({len(chunks)} chunks)")

    # 3. BM25 Index
    with open(full_path / "bm25_index.pkl", "wb") as f:
        pickle.dump(bm25_index, f)
    print(f"   ✅ bm25_index.pkl")

    # 4. Tokenized Corpus
    with open(full_path / "tokenized_corpus.pkl", "wb") as f:
        pickle.dump(tokenized_corpus, f)
    print(f"   ✅ tokenized_corpus.pkl")

    # 5. Metadata
    metadata = {
        "version": version_name,
        "created_at": datetime.now().isoformat(),
        "config": {
            "embedding_model": embedding_model_name,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        },
        "stats": {
            "total_chunks": len(chunks),
            "total_vectors": faiss_index.ntotal,
            "vector_dimension": faiss_index.d,
        },
    }
    with open(full_path / "metadata.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)
    print(f"   ✅ metadata.json")

In [None]:
# === Run ===
print("\n💾 1.5 Saving to Google Drive...")
save_all_indexes(
    base_dir=INDEX_DIR,
    faiss_index=faiss_index,
    chunks=chunks,
    bm25_index=bm25_index,
    tokenized_corpus=tokenized_corpus,
    embedding_model_name=EMBEDDING_MODEL_NAME,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)


💾 1.5 Saving to Google Drive...
   Save path: /content/drive/MyDrive/AML/index_v2/v2
   ✅ faiss_index.bin
   ✅ chunks.json (226 chunks)
   ✅ bm25_index.pkl
   ✅ tokenized_corpus.pkl
   ✅ metadata.json
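
For reference, the read side could reload these files roughly as follows. This is a sketch, not the code in experiment_rag_v2 (which is not shown here). `load_text_artifacts` is a hypothetical name, and the demo writes a tiny stand-in index to a temporary folder so that it runs anywhere.

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_text_artifacts(index_path):
    """Load the chunks, tokenized corpus and metadata written by save_all_indexes()."""
    index_path = Path(index_path)
    with open(index_path / "chunks.json", encoding="utf-8") as f:
        chunks = json.load(f)
    with open(index_path / "tokenized_corpus.pkl", "rb") as f:
        tokenized_corpus = pickle.load(f)
    with open(index_path / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)
    # Sanity check: the stored stats should match what was actually loaded
    if metadata["stats"]["total_chunks"] != len(chunks):
        raise ValueError("metadata.json does not match chunks.json")
    return chunks, tokenized_corpus, metadata


# Self-contained demo: write a tiny stand-in index, then load it back
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "chunks.json").write_text(json.dumps([{"text": "demo"}]), encoding="utf-8")
    (demo / "tokenized_corpus.pkl").write_bytes(pickle.dumps([["demo"]]))
    (demo / "metadata.json").write_text(
        json.dumps({"version": "v2", "stats": {"total_chunks": 1}}), encoding="utf-8"
    )
    loaded_chunks, loaded_corpus, loaded_meta = load_text_artifacts(demo)

print(len(loaded_chunks), loaded_meta["version"])
```

The FAISS index would be loaded separately with `faiss.read_index(str(index_path / "faiss_index.bin"))`, and `bm25_index.pkl` can be unpickled the same way as the corpus as long as rank_bm25 is installed. Only unpickle files you created yourself.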


In [None]:
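# Verification
# Note: save_all_indexes() writes into INDEX_DIR/CURRENT_VERSION, so listing
# INDEX_DIR shows the version folder rather than the individual index files.
print("\n📋 Verifying the saved output:")
print("-" * 40)
for f in sorted(os.listdir(INDEX_DIR)):
    size = os.path.getsize(os.path.join(INDEX_DIR, f))
    print(f"   {f}: {size:,} bytes")

print("\n" + "=" * 60)
print("✅ PART 1 complete! All indexes have been saved.")
print("   For the next experiment, just load them in experiment_rag_v2.")
print("=" * 60)


📋 Verifying the saved output:
----------------------------------------
   v2: 4,096 bytes

✅ PART 1 complete! All indexes have been saved.
   For the next experiment, just load them in experiment_rag_v2.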
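
Because the index files live in a version subfolder (the save path above ends in `index_v2/v2`), a recursive listing gives a more informative check than listing `INDEX_DIR` alone. A sketch using `pathlib` follows. `list_index_files` is a hypothetical helper, demonstrated on a temporary folder that mimics the layout.

```python
import tempfile
from pathlib import Path


def list_index_files(root):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Demo on a temporary folder that mimics INDEX_DIR/v2/
with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "v2"
    version_dir.mkdir()
    (version_dir / "chunks.json").write_text("[]", encoding="utf-8")
    (version_dir / "metadata.json").write_text("{}", encoding="utf-8")
    listing = list_index_files(tmp)

for rel_path, size in listing:
    print(f"   {rel_path}: {size:,} bytes")
```

In the notebook, calling `list_index_files(INDEX_DIR)` should show the five files written by `save_all_indexes` under the version folder.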
