<a href="https://colab.research.google.com/github/kuo8129/GenAI/blob/main/20250603%E6%9C%9F%E6%9C%AB%E5%B0%88%E6%A1%88/20250603%E6%9C%9F%E6%9C%AB%E5%B0%88%E6%A1%881.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔨 Project Goal: Build a Vector Database of Tourist Attractions

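#### 📌 Build process:

1. Convert a spreadsheet to a plain-text file, or upload existing files directly
2. Load the files, tag each record with metadata according to its fields, and split the text adaptively
3. Build and save the vector database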
---


## Data source

> [Government Open Data Platform (政府資料開放平台): Attractions - Tourism Information Database](https://data.gov.tw/dataset/7777)


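## Convert the spreadsheet to a plain-text file
* Create separate folders for input and output
* The text output folder `uploaded_docs` also serves as the input folder when building the vector database
* Use `files.upload()` to let the user upload files
* Write each row as `column:value` lines, one line per column, with a blank line between records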

In [1]:
from google.colab import files
import os
import pandas as pd

In [2]:
# Folder for the uploaded CSV/Excel files
upload_dir = "raw_data"
os.makedirs(upload_dir, exist_ok=True)

# Folder for the merged .txt output
output_dir = "uploaded_docs"
os.makedirs(output_dir, exist_ok=True)

In [3]:
# Prompt the user to upload files
print("Please upload CSV or Excel files:")
uploaded = files.upload()

for filename in uploaded.keys():
    # Move the uploaded file into the raw_data folder
    file_path = os.path.join(upload_dir, filename)
    os.rename(filename, file_path)

    try:
        # Read CSV or Excel (the extension check is case-insensitive)
        ext = os.path.splitext(filename)[1].lower()
        if ext == ".csv":
            df = pd.read_csv(file_path)
        elif ext in (".xlsx", ".xls"):
            df = pd.read_excel(file_path)
        else:
            print(f"❌ Unsupported format: {filename}")
            continue

        df = df.fillna("")  # Replace missing values with empty strings

        # Derive the matching .txt file name
        base_name = os.path.splitext(filename)[0]
        txt_filename = f"{base_name}.txt"
        txt_path = os.path.join(output_dir, txt_filename)

        # Write all records into a single .txt file
        with open(txt_path, "w", encoding="utf-8") as f:
            for _, row in df.iterrows():
                for col in df.columns:
                    f.write(f"{col}:{row[col]}\n")
                f.write("\n")  # Blank line between records

        print(f"✅ Created {txt_filename} ({len(df)} records)")

    except Exception as e:
        print(f"⚠️ Error: {filename}: {e}")

Please upload CSV or Excel files:


Saving taiwan_attractions2.xlsx to taiwan_attractions2.xlsx
✅ Created taiwan_attractions2.txt (5068 records)
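
For reference, the record layout written by the cell above can be reproduced without Colab or pandas. This is a minimal sketch: the helper name `rows_to_text` and the sample values are hypothetical, while the column names are the ones the later cells search for.

```python
from typing import Dict, List


def rows_to_text(rows: List[Dict[str, str]]) -> str:
    """Format records as "column:value" lines with a blank line after each record."""
    parts = []
    for row in rows:
        for col, value in row.items():
            parts.append(f"{col}:{value}\n")
        parts.append("\n")  # blank line between records
    return "".join(parts)


# Hypothetical sample rows; real rows come from the uploaded spreadsheet.
sample_rows = [
    {"ID": "1", "景點名稱": "範例景點A", "景點類別": "自然風景", "區域": "北部"},
    {"ID": "2", "景點名稱": "範例景點B", "景點類別": "文化", "區域": "南部"},
]

text = rows_to_text(sample_rows)
print(text)
```

Each record starts with its `ID:` line, which is what the splitting step later uses to find record boundaries.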


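## Create the folder and upload files
* If you already have .txt, .pdf, or .docx files, you can start from this step
* If you have already run the spreadsheet-to-text conversion above, you can skip this step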

In [4]:
import os
upload_dir = "uploaded_docs"
os.makedirs(upload_dir, exist_ok=True)
print(f"Place your .txt, .pdf, and .docx files in this folder: {upload_dir}")

Place your .txt, .pdf, and .docx files in this folder: uploaded_docs


## Install and import packages

In [5]:
!pip install -U langchain langchain-community pypdf python-docx unstructured sentence-transformers faiss-cpu

Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting pypdf
  Downloading pypdf-5.6.0-py3-none-any.whl.metadata (7.2 kB)
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from data

In [6]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import List, Dict, Any
import re

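## Custom E5 embedding class
* Prefix each database passage with "passage: ", the document prefix that E5 models expect, to improve retrieval accuracy
* Prefix each question with "query: ", the query prefix that E5 models expect, to improve retrieval accuracy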

In [7]:
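class CustomE5Embedding(HuggingFaceEmbeddings):
    """HuggingFaceEmbeddings subclass that adds the E5 "passage: " / "query: " prefixes."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Prefix every stored passage before embedding
        texts = [f"passage: {t}" for t in texts]
        return super().embed_documents(texts)

    def embed_query(self, text: str) -> List[float]:
        # Prefix the search query before embedding
        return super().embed_query(f"query: {text}")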
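
The effect of the two overrides can be checked without downloading a model. In the sketch below, `RecordingEmbeddings` is a hypothetical stand-in for `HuggingFaceEmbeddings` that only records the text it receives, and `PrefixedEmbeddings` mirrors the prefixing logic of the class above. The sample strings are made up.

```python
from typing import List


class RecordingEmbeddings:
    """Hypothetical stand-in for a real embedding model: records inputs, returns dummy vectors."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.seen.extend(texts)
        return [[0.0] for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        self.seen.append(text)
        return [0.0]


class PrefixedEmbeddings(RecordingEmbeddings):
    """Same prefixing logic as CustomE5Embedding, applied to the stand-in base class."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return super().embed_documents([f"passage: {t}" for t in texts])

    def embed_query(self, text: str) -> List[float]:
        return super().embed_query(f"query: {text}")


model = PrefixedEmbeddings()
model.embed_documents(["景點介紹:範例文字"])
model.embed_query("北部有哪些景點？")
print(model.seen)  # the strings the underlying model would actually embed
```

Because the prefixes are applied inside the embedding class, the same class has to be used both when building the index and when querying it.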

## Extract key fields as metadata

In [8]:
# Extract metadata from the text of one attraction record.
# The regex keys are the column labels in the text file:
# ID, 景點名稱 (name), 景點類別 (category), 區域 (region).
def extract_tourism_metadata(text: str) -> Dict[str, Any]:
    metadata = {}

    # Extract the ID
    id_match = re.search(r'ID:([^\n]+)', text)
    if id_match:
        metadata['id'] = id_match.group(1).strip()

    # Extract the attraction name
    name_match = re.search(r'景點名稱:([^\n]+)', text)
    if name_match:
        metadata['name'] = name_match.group(1).strip()

    # Extract the attraction categories (comma-separated)
    category_match = re.search(r'景點類別:([^\n]+)', text)
    if category_match:
        categories = [cat.strip() for cat in category_match.group(1).split(',')]
        metadata['categories'] = categories
        metadata['primary_category'] = categories[0] if categories else None

    # Extract the region
    region_match = re.search(r'區域:([^\n]+)', text)
    if region_match:
        metadata['region'] = region_match.group(1).strip()

    return metadata
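
The category field above is split on the ASCII comma only. If the source data separates categories with full-width punctuation instead, which is an assumption about the data and not something verified here, a more tolerant split could look like the following sketch. The helper name `split_categories` and the sample strings are hypothetical.

```python
import re
from typing import List


def split_categories(raw: str) -> List[str]:
    """Split a category field on ASCII commas, full-width commas, or ideographic commas."""
    parts = re.split(r"[,，、]", raw)
    # Drop empty pieces left by trailing or doubled delimiters
    return [p.strip() for p in parts if p.strip()]


print(split_categories("自然風景, 文化"))
print(split_categories("自然風景，文化、古蹟"))
```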

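## Split the text adaptively by record length
* Avoid splitting a single attraction's information across chunks wherever possible
* Attach metadata to every chunk to support later search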

In [9]:
# Split attraction data record by record, with finer splits for long records
def split_tourism_documents(documents: List[Document]) -> List[Document]:
    split_docs = []

    # Finer-grained splitter for long records. The first separators are field
    # labels in the text file; the final "" lets long runs without spaces or
    # newlines (common in Chinese text) still be cut down to chunk_size.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=600,
        chunk_overlap=150,
        separators=["\n景點介紹:", "\n交通指南:", "\n開放時間:", "\n", " ", ""]
    )

    for doc in documents:
        text = doc.page_content

        # Split by attraction (each "ID:<digits>" line starts a new attraction)
        attractions = re.split(r'\n(?=ID:\d+)', text)

        for attraction in attractions:
            # Skip empty or very short records, which are probably incomplete
            if len(attraction.strip()) < 50:
                continue

            # Extract metadata
            metadata = extract_tourism_metadata(attraction)
            metadata.update(doc.metadata)  # Keep the loader's original metadata

            # Long records are split further
            if len(attraction) > 800:
                sub_chunks = splitter.split_text(attraction)

                for i, chunk in enumerate(sub_chunks):
                    chunk_metadata = metadata.copy()
                    chunk_metadata['chunk_index'] = i
                    chunk_metadata['total_chunks'] = len(sub_chunks)

                    split_docs.append(Document(
                        page_content=chunk,
                        metadata=chunk_metadata
                    ))
            else:
                # The whole attraction becomes a single chunk
                split_docs.append(Document(
                    page_content=attraction,
                    metadata=metadata
                ))

    return split_docs
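
The record split relies on a regex lookahead, so each `ID:` marker stays at the start of its own piece instead of being consumed by the split. A small stand-alone check on hypothetical sample text:

```python
import re

# Hypothetical two-record sample in the "column:value" layout used above.
sample = (
    "ID:1\n景點名稱:範例景點A\n區域:北部\n"
    "\n"
    "ID:2\n景點名稱:範例景點B\n區域:南部\n"
)

# Split at every newline that is followed by "ID:<digits>"; the lookahead keeps the marker.
records = [r.strip() for r in re.split(r"\n(?=ID:\d+)", sample) if r.strip()]

for record in records:
    print(repr(record))
```

Note that the pattern only matches numeric IDs. A record whose ID contains no leading digits would be merged into the previous record.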

## Build the vector database

In [10]:
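# Read from the text output folder by name. Using upload_dir here would point at
# "raw_data" whenever the optional folder-creation cell above is skipped.
folder_path = "uploaded_docs"
documents = []

# Load the documents
for file in sorted(os.listdir(folder_path)):
    path = os.path.join(folder_path, file)
    ext = os.path.splitext(file)[1].lower()
    if ext == ".txt":
        loader = TextLoader(path, encoding='utf-8')
    elif ext == ".pdf":
        loader = PyPDFLoader(path)
    elif ext == ".docx":
        loader = UnstructuredWordDocumentLoader(path)
    else:
        continue
    documents.extend(loader.load())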

In [11]:
split_docs = split_tourism_documents(documents)

In [None]:
embedding_model = CustomE5Embedding(model_name="intfloat/multilingual-e5-small")
vectorstore = FAISS.from_documents(split_docs, embedding_model)

## Save the vector database

In [None]:
vectorstore.save_local("faiss_db")
print("Vector database saved to faiss_db")

In [None]:
!zip -r faiss_db.zip faiss_db
print("The compressed vector database was saved as 'faiss_db.zip'. Download this file as a backup.")
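
To reuse the saved index later, it can be loaded back with the same embedding class and queried. This is a sketch and not part of the original notebook: the function name `load_and_search`, the example query, and `k=3` are illustrative. Recent `langchain-community` releases require `allow_dangerous_deserialization=True` because the docstore is stored as a pickle file, so only load index folders you created yourself.

```python
from typing import Any, List


def load_and_search(db_dir: str, embeddings: Any, query: str, k: int = 3) -> List[Any]:
    """Load a FAISS index saved with save_local() and return the k most similar chunks."""
    # Imported lazily so that defining this function does not require langchain.
    from langchain_community.vectorstores import FAISS

    store = FAISS.load_local(
        db_dir,
        embeddings,
        allow_dangerous_deserialization=True,  # the docstore is a pickle file
    )
    return store.similarity_search(query, k=k)


# Example call (assumes the faiss_db folder exists and embedding_model is the
# CustomE5Embedding instance created earlier in this notebook):
# for doc in load_and_search("faiss_db", embedding_model, "北部的自然景點"):
#     print(doc.metadata.get("name"), doc.page_content[:50])
```

The embeddings object passed in should be the same prefixed E5 class used to build the index. Otherwise the query vectors are not comparable with the stored passage vectors.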