<a href="https://colab.research.google.com/github/lillylovecode/GenerativeAI_class/blob/main/HW06_RAG01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Build a system database from literature on rPPG (remote photoplethysmography).

### 1. Create the folder

In [None]:
import os
upload_dir = "uploaded_docs"
os.makedirs(upload_dir, exist_ok=True)
print(f"Place your .txt, .pdf, and .docx files in this folder: {upload_dir}")

Place your .txt, .pdf, and .docx files in this folder: uploaded_docs


### 2. Upgrade the required packages and import them

In [None]:
!pip install -U langchain langchain-community pypdf python-docx sentence-transformers faiss-cpu

Collecting langchain-community
  Downloading langchain_community-0.3.21-py3-none-any.whl.metadata (2.4 kB)
Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Dow

In [None]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings  # embedding models from Hugging Face
from langchain_community.vectorstores import FAISS  # vector index library from Meta

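### 3. Add the prefixes recommended for E5

Define a custom embedding class for E5 that prepends "passage: " or "query: " to each input.

We chose the E5 family because it is multilingual; many other embedding models perform well mainly on English text.

"passage" marks the `content` stored in the database and "query" marks the `question` being asked. E5 models are trained with these prefixes, so adding them gives noticeably better retrieval results.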


In [None]:
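from langchain_community.embeddings import HuggingFaceEmbeddings

class CustomE5Embedding(HuggingFaceEmbeddings):
    def embed_documents(self, texts):
        # Prefix every chunk stored in the database with "passage: "
        texts = [f"passage: {t}" for t in texts]
        return super().embed_documents(texts)  # reuse the parent implementation to embed the stored documents

    def embed_query(self, text):
        # Prefix the search question with "query: "
        return super().embed_query(f"query: {text}")  # reuse the parent implementation to embed the query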
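
To see what the two overrides do without downloading a model, the prefixing logic can be exercised against a stand-in base class. This is only an illustrative sketch: `RecordingEmbeddings` and `PrefixedE5` are hypothetical names, and the fake base class records the strings it receives instead of computing real vectors.

```python
class RecordingEmbeddings:
    """Hypothetical stand-in for HuggingFaceEmbeddings: records inputs, returns dummy vectors."""

    def __init__(self):
        self.seen = []

    def embed_documents(self, texts):
        self.seen.extend(texts)
        return [[0.0] for _ in texts]

    def embed_query(self, text):
        self.seen.append(text)
        return [0.0]


class PrefixedE5(RecordingEmbeddings):
    """Same override pattern as CustomE5Embedding, built on the stand-in."""

    def embed_documents(self, texts):
        return super().embed_documents([f"passage: {t}" for t in texts])

    def embed_query(self, text):
        return super().embed_query(f"query: {text}")


model = PrefixedE5()
model.embed_documents(["rPPG estimates heart rate from video."])
model.embed_query("What is rPPG?")
print(model.seen)
```

The recorded strings show that stored text and questions reach the underlying model with different prefixes, which is the only change the subclass makes.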

### 4. Load the documents

In [None]:
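folder_path = upload_dir
documents = []
for file in os.listdir(folder_path):
    path = os.path.join(folder_path, file)
    name = file.lower()  # match extensions case-insensitively (e.g. ".PDF")
    if name.endswith(".txt"):
        loader = TextLoader(path)
    elif name.endswith(".pdf"):
        loader = PyPDFLoader(path)
    elif name.endswith(".docx"):
        # This loader needs the `unstructured` package, which the install cell above does not list
        loader = UnstructuredWordDocumentLoader(path)
    else:
        continue  # skip unsupported file types
    documents.extend(loader.load())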
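
Files with other extensions are skipped silently by the loop above, so it can help to preview which files will be picked up before loading. The following is a minimal sketch using only the standard library; `SUPPORTED` and `preview_folder` are hypothetical names introduced here, and the extension check is case-insensitive.

```python
import os
import tempfile

SUPPORTED = (".txt", ".pdf", ".docx")  # extensions handled by the loading loop


def preview_folder(folder_path):
    """Return (will_load, skipped) lists of file names, sorted for stable output."""
    will_load, skipped = [], []
    for name in sorted(os.listdir(folder_path)):
        if name.lower().endswith(SUPPORTED):
            will_load.append(name)
        else:
            skipped.append(name)
    return will_load, skipped


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("paper.pdf", "notes.txt", "figure.png"):
        open(os.path.join(tmp, name), "w").close()
    will_load, skipped = preview_folder(tmp)

print(will_load, skipped)
```

In the notebook, calling `preview_folder(upload_dir)` before the loading cell would list the files that are about to be ignored.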

### 5. Build the vector database

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)  # chunk_size (characters per chunk) and chunk_overlap (characters shared by adjacent chunks) are tunable and affect retrieval quality
split_docs = splitter.split_documents(documents)
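
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one. A larger overlap reduces the chance that a sentence is cut off from its context, at the cost of more chunks to embed.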

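Confirm that the files were found in the folder and that splitting succeeded. For PDFs, `PyPDFLoader` returns one document per page, so the first count reflects pages rather than files.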

In [None]:
print("Number of documents loaded:", len(documents))  # Check the number of loaded documents
print("Number of split documents:", len(split_docs))  # Check the number of documents after splitting


Number of documents loaded: 247
Number of split documents: 531


In [None]:
embedding_model = CustomE5Embedding(model_name="intfloat/multilingual-e5-small")  # choose the embedding model here
vectorstore = FAISS.from_documents(split_docs, embedding_model)  # embed each chunk and build the FAISS index
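
Conceptually, the index stores one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The toy sketch below does this by brute force with Euclidean distance in plain Python. The three-dimensional vectors are made up for illustration, `nearest` is a hypothetical helper, and a real FAISS index is far more efficient.

```python
import math


def nearest(query_vec, doc_vecs, k=2):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return [i for _, i in sorted(distances)[:k]]


# Made-up 3-d "embeddings" for three chunks; real E5 vectors have hundreds of dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
]
query_vec = [1.0, 0.0, 0.0]

top = nearest(query_vec, doc_vecs, k=2)
print(top)
```

Chunks 0 and 2 point in almost the same direction as the query, so they are returned ahead of chunk 1.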

### 6. Save the vector database

In [None]:
vectorstore.save_local("faiss_db")  # writes the index into a folder named faiss_db

In [None]:
!zip -r faiss_db.zip faiss_db

  adding: faiss_db/ (stored 0%)
  adding: faiss_db/index.pkl (deflated 67%)
  adding: faiss_db/index.faiss (deflated 8%)
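
The `!zip` shell command works in Colab; in environments without a `zip` binary, the standard library can produce an equivalent archive. The following is a minimal sketch, where `zip_folder` is a hypothetical helper. The demo runs on a temporary placeholder folder with empty stand-in files, so it does not touch a real index.

```python
import os
import shutil
import tempfile
import zipfile


def zip_folder(folder, archive_base):
    """Zip `folder` (including the folder itself) into archive_base + '.zip'."""
    parent = os.path.dirname(os.path.abspath(folder))
    name = os.path.basename(os.path.normpath(folder))
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=name)


with tempfile.TemporaryDirectory() as tmp:
    db_dir = os.path.join(tmp, "faiss_db")
    os.makedirs(db_dir)
    for fname in ("index.faiss", "index.pkl"):
        open(os.path.join(db_dir, fname), "w").close()  # empty stand-in files

    archive = zip_folder(db_dir, os.path.join(tmp, "faiss_db"))
    with zipfile.ZipFile(archive) as zf:
        names = sorted(n for n in zf.namelist() if not n.endswith("/"))

print(names)
```

In the notebook itself, `zip_folder("faiss_db", "faiss_db")` would create `faiss_db.zip` next to the saved folder.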


In [None]:
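print("✅ The compressed vector database has been saved as 'faiss_db.zip'. Please download this file as a backup.")

✅ The compressed vector database has been saved as 'faiss_db.zip'. Please download this file as a backup.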
