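## Task Type
Data ingestion and ETL (Extract, Transform, Load)

## Task Objective
Log in to a specified Gmail account automatically and fetch a fixed, configurable number of emails. The notebook parses each message body (plain text or HTML), cleans the text (removing URLs, quote markers, and similar noise), and saves the cleaned text chunks to a file named vault.txt, which serves as the retrieval source for the downstream RAG system.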

In [None]:
#pip install -r requirements.txt

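### Step 1｜Import the required libraries
**logging**: writes log records to the cell output.

**dotenv**: reads environment variables from the .env file.

**bs4 (BeautifulSoup)**: parses HTML email bodies.

**imap_tools**: connects to an IMAP server (such as Gmail) and fetches emails.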

In [None]:
import os
import re
import logging
from datetime import datetime
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from imap_tools import MailBox


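### Step 2｜Configure the logging format
**load_dotenv()**: loads the variables defined in the .env file into the environment.

**logging.basicConfig()**: sets the basic logging configuration so that each record contains a timestamp, a level, and a message.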

In [None]:
load_dotenv()  
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
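
To see what this format string produces, and which levels pass the INFO threshold, the sketch below routes records into an in-memory buffer. The logger name `format_demo` and the sample messages are assumptions made for illustration; they are not part of the notebook.

```python
import io
import logging

# Hypothetical demo logger, kept separate from the root logger configured above.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
)

demo_logger = logging.getLogger("format_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo records out of the root logger
demo_logger.addHandler(handler)

demo_logger.info("Connecting to Gmail...")  # emitted
demo_logger.debug("Parsing HTML fallback")  # dropped: DEBUG is below INFO

captured = buffer.getvalue().splitlines()
print(captured)
```

Only the INFO record is captured. With `level=logging.INFO`, any `logging.debug(...)` call in this notebook stays silent; setting the level to `logging.DEBUG` would make those records visible.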


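### Step 3｜Extract Gmail content
**get_email_content(mail)**:

Extracts the plain-text body (mail.text) from an imap-tools mail object.

If no plain text is available (for example, the email contains only HTML), it falls back to parsing mail.html with BeautifulSoup and extracting the text.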

In [None]:
def get_email_content(mail):
    # Prefer the plain-text body; default to "" so the checks below are always safe.
    text_content = mail.text or ""
    # Fall back to the HTML body only when no plain text is available.
    if not text_content and mail.html:
        logging.debug(f"Email UID {mail.uid}: No plain text found, parsing HTML.")
        try:
            soup = BeautifulSoup(mail.html, "lxml")
            text_content = soup.get_text(separator=" ", strip=True)
        except Exception as e:
            logging.warning(f"Failed to parse HTML for email UID {mail.uid}: {e}")
            text_content = ""
    return text_content
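
The fallback order described above can be tried without a mailbox. This is a minimal sketch under stated assumptions: the `SimpleNamespace` objects are stand-ins that expose only `uid`, `text`, and `html` (they are not real imap_tools messages), `extract_text` is a hypothetical name, and the HTML is stripped with the standard-library `html.parser` instead of BeautifulSoup so that the example needs no third-party packages.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class TextExtractor(HTMLParser):
    """Collects text nodes from HTML (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_text(mail):
    """Return mail.text if present, otherwise the text extracted from mail.html."""
    if mail.text:
        return mail.text
    if mail.html:
        parser = TextExtractor()
        parser.feed(mail.html)
        parser.close()
        return " ".join(parser.parts)
    return ""


# Stand-in objects, not real imap_tools messages.
plain_mail = SimpleNamespace(uid="1", text="Hello from plain text.", html="<p>ignored</p>")
html_mail = SimpleNamespace(uid="2", text="", html="<div><p>Hello</p><p>from HTML.</p></div>")
empty_mail = SimpleNamespace(uid="3", text="", html="")

print(extract_text(plain_mail))   # plain text wins when both bodies exist
print(extract_text(html_mail))    # HTML is used only as a fallback
print(repr(extract_text(empty_mail)))
```

An email with neither body yields an empty string, which the caller can treat as "no processable text content".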

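### Step 4｜Design the text preprocessing
**chunk_text(text, max_length=1000)**:

This function cleans the text and splits it into chunks:

1. Remove email quote markers (runs of two or more >) and separator lines (three or more - or _ characters).

2. Remove URLs.

3. Normalize whitespace.

4. Split the text into sentences with a regular expression.

5. Combine the sentences into **text chunks** of at most max_length characters (1000 by default).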

In [None]:
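def chunk_text(text, max_length=1000):
    # Remove runs of two or more quote markers ("> >") and long separator lines.
    text = re.sub(r"\s*(?:>\s*){2,}", " ", text)
    text = re.sub(r"-{3,}", " ", text)
    text = re.sub(r"_{3,}", " ", text)
    # Remove URLs, then collapse all whitespace into single spaces.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Split into sentences after ".", "!" or "?" followed by spaces.
    sentences = re.split(r"(?<=[.!?]) +", text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        # Hard-split any sentence that is longer than max_length on its own.
        while len(sentence) > max_length:
            if current_chunk:
                chunks.append(current_chunk)
                current_chunk = ""
            chunks.append(sentence[:max_length])
            sentence = sentence[max_length:]
        if not sentence:
            continue
        if not current_chunk:
            current_chunk = sentence
        elif len(current_chunk) + 1 + len(sentence) <= max_length:
            # Join sentences with a single space so words do not run together.
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk)
            current_chunk = sentence

    if current_chunk:
        chunks.append(current_chunk)

    return chunks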
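
The cleaning and packing steps can be followed on a small input. The sketch below is illustrative only: the sample text, the `pack` helper, and the small `max_length=40` are assumptions chosen to make the behaviour visible, and `pack` leaves out the handling of sentences that are longer than the limit.

```python
import re

sample = (
    "> > Thanks for the update.\n"
    "-----\n"
    "See https://example.com/report for details!   Is Friday   ok?"
)

cleaned = re.sub(r"\s*(?:>\s*){2,}", " ", sample)  # runs of quote markers
cleaned = re.sub(r"-{3,}", " ", cleaned)            # separator lines
cleaned = re.sub(r"https?://\S+", "", cleaned)      # URLs
cleaned = re.sub(r"\s+", " ", cleaned).strip()      # whitespace

sentences = re.split(r"(?<=[.!?]) +", cleaned)


def pack(sentences, max_length):
    """Hypothetical helper: greedily join sentences without exceeding max_length."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = pack(sentences, max_length=40)
print(cleaned)    # Thanks for the update. See for details! Is Friday ok?
print(sentences)  # three sentences
print(chunks)     # the first two sentences share a chunk, the third starts a new one
```

Because the URL is removed before the sentence split, the text around it ("See for details!") stays in the chunk. Whether that leftover is acceptable depends on how the RAG system uses the vault.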

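### Step 5｜Design how processed data is stored
**save_chunks_to_vault(chunks)**:

Opens vault.txt in append mode (write mode is also an option) and writes the given chunks to the file one by one, separating each chunk with two newline characters.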

In [None]:
def save_chunks_to_vault(chunks):
    vault_path = "vault.txt"
    try:
        with open(vault_path, "a", encoding="utf-8") as vault_file:
            for chunk in chunks:
                vault_file.write(chunk.strip() + "\n\n")
    except IOError as e:
        logging.error(f"Failed to write to vault file {vault_path}: {e}")


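### Step 6｜Design the Gmail fetching workflow
**search_and_process_emails(client, email_source, search_keyword, start_date, end_date, limit)**:

1. Build the IMAP search criteria from the given dates and keyword.

2. If limit is set, set reverse=True so that the newest N emails are fetched.

3. Run the search with client.fetch and iterate over each email.

4. Call get_email_content on each email to get its text.

5. Call chunk_text to clean and chunk the text.

6. Call save_chunks_to_vault to store the chunks.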

In [None]:
def search_and_process_emails(client, email_source, search_keyword, start_date, end_date, limit):
    criteria = []
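    if start_date:
        criteria.append(f"SINCE {start_date}")
    if end_date:
        # IMAP BEFORE is exclusive: emails dated on end_date itself are not matched.
        criteria.append(f"BEFORE {end_date}")
    if search_keyword:
        # Escape backslashes and double quotes so the keyword stays one quoted string.
        escaped_keyword = search_keyword.replace("\\", "\\\\").replace('"', '\\"')
        criteria.append(f'BODY "{escaped_keyword}"')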

    search_query = " ".join(criteria)
    if not search_query:
        search_query = "ALL"

    logging.info(f"Using search criteria for {email_source}: {search_query}")
    fetch_limit = None
    fetch_reverse = False
    if limit:
        logging.info(f"Limiting fetch to the {limit} most recent emails.")
        fetch_limit = limit
        fetch_reverse = True 
    try:
        mail_count = 0
        found_content = False
        for mail in client.fetch(
            search_query, mark_seen=False, limit=fetch_limit, reverse=fetch_reverse
        ):
            mail_count += 1
            logging.info(
                f"Processing {email_source} email UID: {mail.uid}, Subject: {mail.subject}"
            )

            try:
                text_content = get_email_content(mail)

                if text_content:
                    chunks = chunk_text(text_content)
                    if chunks:
                        save_chunks_to_vault(chunks)
                        found_content = True
                else:
                    logging.info(
                        f"Email UID {mail.uid} had no processable text content."
                    )

            except Exception as e:
                logging.error(
                    f"Failed to process email UID: {mail.uid} from {email_source}. Error: {e}",
                    exc_info=True,
                )

        if mail_count == 0:
            logging.warning(f"No emails found matching criteria in {email_source}.")
        elif not found_content:
            logging.warning(
                f"Found {mail_count} emails, but no processable text content was extracted to save to 'vault.txt'."
            )
        else:
            logging.info(
                f"Found and processed {mail_count} emails. Content saved to 'vault.txt'."
            )

    except Exception as e:
        logging.error(f"Failed to search or fetch emails from {email_source}: {e}")
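
The criteria-building part of this function can be examined on its own. The sketch below isolates it in a hypothetical helper named `build_search_query` (an illustrative name, not part of the notebook) and prints the strings that would be passed to client.fetch.

```python
def build_search_query(start_date=None, end_date=None, keyword=""):
    """Hypothetical helper mirroring the criteria logic above."""
    criteria = []
    if start_date:
        criteria.append(f"SINCE {start_date}")
    if end_date:
        criteria.append(f"BEFORE {end_date}")
    if keyword:
        criteria.append(f'BODY "{keyword}"')
    # With no filters at all, fall back to the IMAP "ALL" criterion.
    return " ".join(criteria) or "ALL"


no_filter = build_search_query()
october = build_search_query("01-Oct-2025", "01-Nov-2025", "invoice")
print(no_filter)
print(october)
```

In IMAP, SINCE includes the given date while BEFORE excludes it, so covering all of October means passing 01-Nov-2025 as the end date.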


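### Step 7｜Define the main program
**main()**:

1. Set parameters manually: command-line arguments have been removed, and the parameters are set as variables instead, for example LIMIT = 50 (fetch only the newest 50 emails).

2. Clear the vault: if LIMIT is set, the program deletes the old vault.txt.

3. Load credentials: read GMAIL_USERNAME and GMAIL_PASSWORD from the environment variables.

4. Log in: connect to Gmail with MailBox("imap.gmail.com").login(...).

5. Run: call search_and_process_emails to start the whole fetching workflow.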

In [None]:



def main():
    KEYWORD = ""  # Search keyword; leave empty to match all emails
    START_DATE = None  # "DD.MM.YYYY" format, e.g., "01.10.2025"
    END_DATE = None    # "DD.MM.YYYY" format
    LIMIT = 50          # Fetch only the newest N emails; set to None for no limit
    start_date_imap = None
    end_date_imap = None
    try:
        if START_DATE:
            start_date_imap = datetime.strptime(START_DATE, "%d.%m.%Y").strftime(
                "%d-%b-%Y"
            )
        if END_DATE:
            end_date_imap = datetime.strptime(END_DATE, "%d.%m.%Y").strftime(
                "%d-%b-%Y"
            )
    except ValueError as e:
        logging.error(
            f"Date format is incorrect. Please use DD.MM.YYYY format. Details: {e}"
        )
        return
    if LIMIT:
        vault_path = "vault.txt"
        logging.warning(
            f"Using LIMIT={LIMIT}. This will clear '{vault_path}' for a fresh snapshot."
        )
        if os.path.exists(vault_path):
            try:
                os.remove(vault_path)
                logging.info(f"Successfully cleared old '{vault_path}'.")
            except OSError as e:
                logging.error(f"Failed to clear '{vault_path}': {e}. Exiting.")
                return
    gmail_username = os.getenv("GMAIL_USERNAME")
    gmail_password = os.getenv("GMAIL_PASSWORD")
    if not gmail_username or not gmail_password:
        logging.error("GMAIL_USERNAME or GMAIL_PASSWORD not set in .env file. Exiting.")
        logging.error(
            "Reminder: For Gmail, you must use a 16-digit 'App Password', not your regular password."
        )
        return
    try:
        logging.info(f"Connecting to Gmail as {gmail_username}...")
        with MailBox("imap.gmail.com").login(
            gmail_username, gmail_password, initial_folder="inbox"
        ) as M:
            logging.info("Successfully connected to Gmail.")
            search_and_process_emails(
                M, "Gmail", KEYWORD, start_date_imap, end_date_imap, LIMIT
            )
    except Exception as e:
        logging.error(f"Failed to connect or login to Gmail: {e}", exc_info=True)
        logging.error(
            "Please check your network connection and ensure you are using a 16-digit 'App Password' for GMAIL_PASSWORD in your .env file."
        )
    logging.info("Email processing finished.")
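
The date handling in main() converts the user-facing DD.MM.YYYY strings into the DD-Mon-YYYY form used in IMAP search criteria. A minimal sketch of that conversion follows; `to_imap_date` is a hypothetical helper name, and the month abbreviation produced by `%b` depends on the active locale (English names are assumed here).

```python
from datetime import datetime


def to_imap_date(value):
    """Hypothetical helper: 'DD.MM.YYYY' -> 'DD-Mon-YYYY', or None for empty input."""
    if not value:
        return None
    return datetime.strptime(value, "%d.%m.%Y").strftime("%d-%b-%Y")


converted = to_imap_date("01.10.2025")
print(converted)           # 01-Oct-2025 with English month names
print(to_imap_date(None))  # None: no date filter is applied

try:
    to_imap_date("2025-10-01")  # wrong input format
except ValueError as exc:
    print(f"rejected: {exc}")
```

Input in any other format raises ValueError, which is the case main() catches before it attempts to connect.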

In [11]:
main()

2025-10-28 21:03:08,184 - INFO - Connecting to Gmail as jack920312lin@gmail.com...
2025-10-28 21:03:09,542 - INFO - Successfully connected to Gmail.
2025-10-28 21:03:09,544 - INFO - Using search criteria for Gmail: ALL
2025-10-28 21:03:09,544 - INFO - Limiting fetch to the 5 most recent emails.
2025-10-28 21:03:10,288 - INFO - Processing Gmail email UID: 756, Subject: The Monthly Build - Google Developer Program - October 2025
2025-10-28 21:03:10,526 - INFO - Processing Gmail email UID: 755, Subject: Zoom Courses + Conferences + Special Issues
2025-10-28 21:03:10,866 - INFO - Processing Gmail email UID: 754, Subject: 暐捷，快來領取每週獎勵 🎉
2025-10-28 21:03:11,088 - INFO - Processing Gmail email UID: 753, Subject: 安全性快訊
2025-10-28 21:03:11,347 - INFO - Processing Gmail email UID: 752, Subject: 你的評論在 Google 地圖上大受歡迎！
2025-10-28 21:03:11,348 - INFO - Found and processed 5 emails. Content saved to 'vault.txt'.
2025-10-28 21:03:11,548 - INFO - Email processing finished.
