<a href="https://colab.research.google.com/github/lurking92/knowledge-graph-for-elderly/blob/main/BLIP_%E9%96%8B%E7%99%BC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

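# Enhanced Video Analysis (Colab .ipynb version)
This notebook runs directly on Google Colab and does the following:
- Installs the dependencies and creates the project directories
- Automatically downloads the weights required by Grounding DINO
- Uploads videos to `data/videos/`
- Splits the code into modules under `scripts/`: `extract_frames.py`, `object_detector.py`, `generate_caption.py`
- Chains the whole pipeline together through `main.py` and outputs captions and a detection summary

Execution order: run each cell in order, then finish with `!python main.py --video ...` or `--dir ...`.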


## Library installation and setup

In [1]:
%%writefile requirements.txt
opencc-python-reimplemented
opencv-python-headless>=4.10.0.84
torch
torchvision
transformers>=4.41.0
tokenizers>=0.20.0
accelerate>=0.24.1
Pillow>=10.0.0
tqdm
ffmpeg-python
sentencepiece
sacremoses
scipy>=1.14.0
numpy>=1.26.4,<2.0
einops
requests
timm>=0.9.16
protobuf>=4.25.3,<5
matplotlib

Writing requirements.txt


In [2]:
!pip install -q -r requirements.txt
print("Libraries installed.")

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you ha

In [3]:
from pathlib import Path

# Colab's working directory is /content, so anchor every folder there explicitly.
PROJECT_ROOT = Path('/content')
DATA_DIR = PROJECT_ROOT / 'data'
VIDEO_DIR = DATA_DIR / 'videos'
FRAME_DIR = DATA_DIR / 'frames'
OUTPUT_DIR = PROJECT_ROOT / 'outputs'
MODELS_DIR = PROJECT_ROOT / 'models'
SCRIPTS_DIR = PROJECT_ROOT / 'scripts'

for folder in [DATA_DIR, VIDEO_DIR, FRAME_DIR, OUTPUT_DIR, MODELS_DIR, SCRIPTS_DIR]:
    folder.mkdir(parents=True, exist_ok=True)
print('Directories ready.')

Directories ready.


## Model download and setup (Grounding DINO)

In [4]:
import requests
from pathlib import Path
from tqdm import tqdm

GROUNDING_DINO_CONFIG_URL = 'https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py'
GROUNDING_DINO_WEIGHTS_URL = 'https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth'

config_path = Path('models/GroundingDINO_SwinT_OGC.py')
weights_path = Path('models/groundingdino_swint_ogc.pth')

def download_file(url: str, destination: Path) -> None:
    """Download url to destination, skipping files that already exist."""
    if destination.exists():
        print(f'{destination.name} already exists, skipping download.')
        return
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted download
    # is not mistaken for a complete one on the next run.
    tmp_path = destination.with_name(destination.name + '.part')
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        total = int(response.headers.get('content-length', 0)) or None
        with tmp_path.open('wb') as f, tqdm(total=total, unit='B', unit_scale=True, desc=destination.name) as progress:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    progress.update(len(chunk))
    tmp_path.replace(destination)
    print(f'Downloaded {destination.name}')

download_file(GROUNDING_DINO_CONFIG_URL, config_path)
download_file(GROUNDING_DINO_WEIGHTS_URL, weights_path)

GroundingDINO_SwinT_OGC.py: 1.01kB [00:00, 1.65MB/s]

Downloaded GroundingDINO_SwinT_OGC.py

groundingdino_swint_ogc.pth: 100%|██████████| 694M/694M [00:13<00:00, 52.3MB/s]

Downloaded groundingdino_swint_ogc.pth


### Optional: mount Google Drive (if the videos are stored there)

In [5]:
# Run this only if you need files from Google Drive; otherwise skip it
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### Optional: Hugging Face login (if gated models are needed)

In [6]:
from huggingface_hub import login
try:
    # Colab users can store HF_TOKEN under Secrets (the key icon in the sidebar)
    from google.colab import userdata
    token = userdata.get('HF_TOKEN')
except Exception:
    token = None

if token:
    login(token)
    print("Hugging Face login succeeded")
else:
    print("HF_TOKEN not provided; if a model requires authorization, log in manually or set the secret.")

Hugging Face login succeeded


## Upload test videos to `data/videos/`

In [7]:
from google.colab import files
import shutil

uploaded = files.upload()
video_paths = []
for fname in uploaded.keys():
    dst = VIDEO_DIR / fname
    shutil.move(fname, dst)
    video_paths.append(str(dst))
print('Videos uploaded to:', video_paths)

Saving Cook carrot1_1.mp4 to Cook carrot1_1.mp4
Saving Cook fried bread4_1.mp4 to Cook fried bread4_1.mp4
Saving Cook potato using microwave1_4.mp4 to Cook potato using microwave1_4.mp4
Videos uploaded to: ['/content/data/videos/Cook carrot1_1.mp4', '/content/data/videos/Cook fried bread4_1.mp4', '/content/data/videos/Cook potato using microwave1_4.mp4']


## Frame extraction module: `scripts/extract_frames.py`

In [23]:
%%writefile scripts/extract_frames.py
import os
from typing import List, Tuple
import cv2

def extract_frames(video_path: str, output_dir: str, target_fps: float = 1.0) -> Tuple[List[Tuple[str, float]], float]:
    """Save frames at roughly target_fps; return ([(frame_path, timestamp_sec), ...], native_fps)."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise RuntimeError(f'Cannot open video: {video_path}')

    # Fall back to 30 fps when the container does not report a frame rate.
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    # Keep every `step`-th frame so the saved rate approximates target_fps.
    step = max(int(round(native_fps / max(target_fps, 1e-3))), 1)

    frame_idx = 0
    saved: List[Tuple[str, float]] = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        if frame_idx % step == 0:
            timestamp = frame_idx / native_fps
            frame_name = f'frame_{frame_idx:06d}.jpg'
            frame_path = os.path.join(output_dir, frame_name)
            # cv2.imwrite returns False on failure; only record frames that were written.
            if cv2.imwrite(frame_path, frame):
                saved.append((frame_path, timestamp))

        frame_idx += 1

    cap.release()
    print(f'Extracted {len(saved)} frames, native FPS: {native_fps:.2f}')
    return saved, native_fps


Overwriting scripts/extract_frames.py
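
The sampling rule in `extract_frames` keeps every `step`-th frame, where `step` is the native frame rate divided by the target rate, rounded and clamped to at least 1. The sketch below reproduces only that arithmetic, without OpenCV, so the effect of `target_fps` can be checked before processing a real video. `sampling_step` and `sampled_frame_indices` are hypothetical helpers introduced for illustration; they are not part of the module.

```python
from typing import List, Tuple


def sampling_step(native_fps: float, target_fps: float) -> int:
    """Mirror the step computation used in extract_frames."""
    return max(int(round(native_fps / max(target_fps, 1e-3))), 1)


def sampled_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> List[Tuple[int, float]]:
    """Return (frame_index, timestamp_seconds) for every frame that would be saved."""
    step = sampling_step(native_fps, target_fps)
    return [(idx, idx / native_fps) for idx in range(0, total_frames, step)]


if __name__ == "__main__":
    # A 3-second clip at 30 fps sampled at 1 fps keeps frames 0, 30 and 60.
    for idx, ts in sampled_frame_indices(90, 30.0, 1.0):
        print(f"frame_{idx:06d}.jpg at {ts:.2f}s")
```

Because the step is an integer, the effective rate only approximates `target_fps`: a 29.97 fps source sampled at 1 fps also uses a step of 30, so frames are saved slightly more than one second apart.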


## Object detection module: `scripts/object_detector.py`

In [22]:
%%writefile scripts/object_detector.py
from __future__ import annotations
from typing import List, Dict, Tuple, Optional
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

from scripts.utils import normalize_label

# Grounding DINO text queries are expected to be lowercase phrases
# separated by periods, with a trailing period.
DEFAULT_PROMPT = (
    'person. man. woman. child. elderly. baby. '
    'bed. pillow. blanket. sheet. mattress. nightstand. bedside table. '
    'lamp. desk lamp. floor lamp. ceiling light. '
    'sofa. couch. armchair. chair. dining chair. office chair. '
    'table. dining table. coffee table. desk. '
    'television. tv. monitor. computer. laptop. '
    'stove. oven. refrigerator. fridge. microwave. '
    'sink. faucet. cabinet. cupboard. drawer. '
    'pan. pot. bowl. cup. plate. knife. fork. spoon. '
    'toilet. bathtub. shower. mirror. towel. '
    'door. window. curtain. blind. '
    'carpet. rug. floor. wall. ceiling. '
    'plant. flower. book. clock. picture. painting.'
)

_processor: Optional[AutoProcessor] = None
_model: Optional[AutoModelForZeroShotObjectDetection] = None
_device: str = 'cuda' if torch.cuda.is_available() else 'cpu'

def setup_grounding_dino() -> None:
    """Load the Grounding DINO processor and model once and cache them in module globals."""
    global _processor, _model
    if _processor is not None and _model is not None:
        print('Grounding DINO model already loaded')
        return

    print('Loading Grounding DINO model...')
    model_name = "IDEA-Research/grounding-dino-tiny"
    _processor = AutoProcessor.from_pretrained(model_name)
    _model = AutoModelForZeroShotObjectDetection.from_pretrained(model_name)
    _model.to(_device)
    print(f'Grounding DINO model loaded on {_device}')

def detect_objects(
    image_path: str,
    text_prompt: str = DEFAULT_PROMPT,
    box_threshold: float = 0.25,
    top_k: int = 15,
    visualize: bool = False
) -> Tuple[List[Dict], Optional[Image.Image]]:
    """Run Grounding DINO on one image; return (detections, annotated image or None)."""
    if _processor is None or _model is None:
        raise RuntimeError('Call setup_grounding_dino() to load the model first.')

    image = Image.open(image_path).convert('RGB')
    inputs = _processor(images=image, text=text_prompt, return_tensors="pt")
    inputs = {k: v.to(_device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = _model(**inputs)

    target_sizes = torch.tensor([image.size[::-1]]).to(_device)
    results = _processor.post_process_grounded_object_detection(
        outputs, target_sizes=target_sizes, threshold=box_threshold
    )[0]

    detections: List[Dict] = []
    if 'boxes' in results:
        boxes = results['boxes'].cpu()
        scores = results['scores'].cpu()
        labels = results['labels']

        for box, score, label in zip(boxes, scores, labels):
            raw_label = str(label).lower().strip()
            canonical = normalize_label(raw_label)
            if not canonical:
                continue
            x1, y1, x2, y2 = box.tolist()
            detections.append({
                'label': canonical,
                'raw_label': raw_label,
                'score': float(score),
                'bbox_xyxy': [x1, y1, x2, y2],
                'area': (x2 - x1) * (y2 - y1)
            })

    detections = sorted(detections, key=lambda d: d['score'], reverse=True)[:top_k]

    annotated_image = None
    if visualize and detections:
        annotated_image = create_detection_visualization(image, detections)

    return detections, annotated_image

def create_detection_visualization(image: Image.Image, detections: List[Dict]) -> Image.Image:
    """Draw up to ten boxes with 'label (score)' tags on a copy of the image."""
    vis_image = image.copy()
    draw = ImageDraw.Draw(vis_image)

    colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown', 'pink', 'gray', 'olive', 'cyan']

    try:
        font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 16)
    except Exception:
        font = ImageFont.load_default()

    for i, detection in enumerate(detections[:10]):
        x1, y1, x2, y2 = detection['bbox_xyxy']
        color = colors[i % len(colors)]

        draw.rectangle([x1, y1, x2, y2], outline=color, width=3)

        label_text = f"{detection.get('raw_label', detection['label'])} ({detection['score']:.2f})"
        # Keep the tag inside the image when the box touches the top edge.
        text_y = max(y1 - 25, 0)
        text_bbox = draw.textbbox((x1, text_y), label_text, font=font)
        draw.rectangle(text_bbox, fill=color)
        draw.text((x1, text_y), label_text, fill='white', font=font)

    return vis_image

def summarize_detections(detections: List[Dict], top_k: int = 8) -> str:
    """Summarize the best detection per label as 'label(score)', highest score first."""
    if not detections:
        return '未偵測到重點物件'

    unique: Dict[str, Dict] = {}
    for det in detections:
        label_key = det.get('label') or normalize_label(det.get('raw_label', ''))
        if not label_key:
            continue
        if label_key not in unique or det['score'] > unique[label_key]['score']:
            unique[label_key] = det

    sorted_dets = sorted(unique.values(), key=lambda d: d['score'], reverse=True)
    items = [f"{d['label']}({d['score']:.2f})" for d in sorted_dets[:top_k]]
    return ', '.join(items)

def analyze_scene_context(detections: List[Dict]) -> Dict[str, object]:
    """Group detections into people, furniture, appliances and dominant objects."""
    context = {
        'person_count': 0,
        'furniture_items': [],
        'appliances': [],
        'room_indicators': [],
        'dominant_objects': []
    }

    for det in detections:
        raw_label = det.get('raw_label', det.get('label', ''))
        label = raw_label.lower()
        score = det['score']

        if any(person_word in label for person_word in ['person', 'man', 'woman', 'child', 'people', 'elderly', 'baby']):
            context['person_count'] += 1

        furniture_keywords = ['sofa', 'chair', 'table', 'bed', 'desk', 'cabinet', 'couch', 'nightstand']
        if any(keyword in label for keyword in furniture_keywords):
            context['furniture_items'].append((det.get('label', label), score))

        appliance_keywords = ['tv', 'television', 'refrigerator', 'microwave', 'oven', 'stove', 'washer', 'dryer']
        if any(keyword in label for keyword in appliance_keywords):
            context['appliances'].append((det.get('label', label), score))

        if score > 0.3:
            context['room_indicators'].append((det.get('label', label), score))

    for det in detections[:5]:
        if det['score'] > 0.4 and det.get('area', 0) > 1000:
            context['dominant_objects'].append((det.get('label'), det['score']))

    return context


Overwriting scripts/object_detector.py
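
`summarize_detections` reports each label once, keeping only its highest-scoring box. The following is a minimal, dependency-free sketch of that deduplication step. `best_per_label` and `summarize` are hypothetical names used here for illustration, and the scores in the sample are made up.

```python
from typing import Dict, List


def best_per_label(detections: List[Dict]) -> List[Dict]:
    """Keep the highest-scoring detection for each label, sorted by score."""
    best: Dict[str, Dict] = {}
    for det in detections:
        label = det.get("label")
        if not label:
            continue
        if label not in best or det["score"] > best[label]["score"]:
            best[label] = det
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)


def summarize(detections: List[Dict], top_k: int = 8) -> str:
    """Format the deduplicated detections as 'label(score)' pairs."""
    top = best_per_label(detections)[:top_k]
    return ", ".join(f"{d['label']}({d['score']:.2f})" for d in top)


if __name__ == "__main__":
    sample = [  # made-up scores for illustration only
        {"label": "chair", "score": 0.41},
        {"label": "person", "score": 0.82},
        {"label": "chair", "score": 0.67},
    ]
    print(summarize(sample))  # person(0.82), chair(0.67)
```

Deduplicating before truncating to `top_k` matters: truncating first could fill the summary with several boxes of the same object and drop other labels.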


## Caption generation module: `scripts/generate_caption.py`

In [21]:
%%writefile scripts/generate_caption.py
from __future__ import annotations
from typing import List, Dict, Optional, Tuple
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline
from opencc import OpenCC

# Module-level state
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
_blip_processor = None
_blip_model = None
_translator = None
_converter = None

def initialize_models():
    """Load the BLIP captioner, the en-zh translator and the OpenCC converter once."""
    global _blip_processor, _blip_model, _translator, _converter

    if _blip_processor is None:
        print('Loading BLIP model...')
        _blip_processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-large')
        _blip_model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-large').to(DEVICE)
        print(f'BLIP model loaded on {DEVICE}')

    if _translator is None:
        print('Loading translation model...')
        _translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-zh', device=0 if torch.cuda.is_available() else -1)
        print('Translation model loaded')

    if _converter is None:
        _converter = OpenCC('s2t')

# English-to-Chinese object name table
OBJECT_TRANSLATIONS = {
    'person': '人物', 'people': '多人', 'man': '男性', 'woman': '女性', 'child': '兒童', 'baby': '嬰兒',
    'bed': '床鋪', 'pillow': '枕頭', 'blanket': '毛毯', 'sheet': '床單', 'mattress': '床墊',
    'nightstand': '床頭櫃', 'bedside table': '床邊桌',
    'lamp': '檯燈', 'desk lamp': '桌燈', 'floor lamp': '立燈', 'ceiling light': '天花板燈',
    'sofa': '沙發', 'couch': '沙發', 'armchair': '扶手椅', 'chair': '椅子',
    'dining chair': '餐椅', 'office chair': '辦公椅',
    'table': '桌子', 'dining table': '餐桌', 'coffee table': '茶几', 'desk': '書桌',
    'television': '電視', 'tv': '電視', 'monitor': '螢幕', 'computer': '電腦', 'laptop': '筆電',
    'stove': '爐子', 'oven': '烤箱', 'refrigerator': '冰箱', 'fridge': '冰箱', 'microwave': '微波爐',
    'sink': '水槽', 'faucet': '水龍頭', 'cabinet': '櫥櫃', 'cupboard': '櫃子', 'drawer': '抽屜',
    'pan': '平底鍋', 'pot': '湯鍋', 'bowl': '碗', 'cup': '杯子', 'plate': '盤子',
    'knife': '刀具', 'fork': '叉子', 'spoon': '湯匙',
    'toilet': '馬桶', 'bathtub': '浴缸', 'shower': '淋浴設備', 'mirror': '鏡子', 'towel': '毛巾',
    'door': '門', 'window': '窗戶', 'curtain': '窗簾', 'blind': '百葉窗',
    'carpet': '地毯', 'rug': '地墊', 'floor': '地板', 'wall': '牆壁', 'ceiling': '天花板',
    'plant': '植物', 'flower': '花朵', 'book': '書籍', 'clock': '時鐘', 'picture': '圖畫', 'painting': '畫作'
}

# Key objects that indicate each room type
ROOM_KEYWORDS = {
    '臥室': {'bed', 'pillow', 'blanket', 'nightstand', 'bedside table', 'mattress', 'sheet'},
    '廚房': {'stove', 'oven', 'refrigerator', 'fridge', 'microwave', 'sink', 'cabinet', 'pan', 'pot', 'bowl', 'cup', 'plate'},
    '廁所': {'toilet', 'bathtub', 'shower', 'sink', 'mirror', 'towel', 'faucet'},
    '客廳': {'sofa', 'couch', 'television', 'tv', 'coffee table', 'armchair', 'carpet', 'rug'}
}

# Environment description templates per room
ENVIRONMENT_TEMPLATES = {
    '臥室': ['溫馨的睡眠空間', '私人休息區域', '舒適的臥房環境'],
    '廚房': ['烹飪與用餐空間', '家庭料理區域', '廚房工作環境'],
    '廁所': ['衛浴清潔空間', '個人盥洗區域', '浴室環境'],
    '客廳': ['家庭聚會空間', '休閒娛樂區域', '客廳起居環境']
}

@torch.no_grad()
def generate_base_caption(image_path: str, prompt: Optional[str] = None) -> str:
    """Generate the base English caption with BLIP."""
    if _blip_processor is None or _blip_model is None:
        raise RuntimeError('Call initialize_models() to load the models first')

    image = Image.open(image_path).convert('RGB')

    if prompt is None:
        prompt = "Describe this scene in detail, including the people, objects, activities, and environment."

    inputs = _blip_processor(images=image, text=prompt, return_tensors='pt').to(DEVICE)

    output = _blip_model.generate(
        **inputs,
        max_length=120,
        num_beams=8,
        no_repeat_ngram_size=3,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        length_penalty=1.2
    )

    caption = _blip_processor.decode(output[0], skip_special_tokens=True)

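    # The decoded text may echo the prompt in lowercase, so strip it case-insensitively.
    if prompt and caption.lower().startswith(prompt.lower()):
        caption = caption[len(prompt):].strip()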

    return caption

def translate_to_traditional_chinese(text: str) -> str:
    """Translate English text to Traditional Chinese (opus-mt-en-zh, then OpenCC s2t)."""
    if _translator is None or _converter is None:
        raise RuntimeError('Call initialize_models() to load the models first')

    try:
        # Translate sentence by sentence to keep each input short.
        sentences = text.split('. ')
        translated_sentences = []
        for sentence in sentences:
            if sentence.strip():
                result = _translator(sentence.strip())[0]['translation_text']
                translated_sentences.append(result)
        translated_text = '。'.join(translated_sentences)
        if not translated_text.endswith('。'):
            translated_text += '。'
        return _converter.convert(translated_text)
    except Exception as e:
        print(f'Error during translation: {e}')
        return '無法生成中文描述。'

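def infer_room_from_detections(detections: List[Dict]) -> Optional[str]:
    """Infer the room type from confident detections (score > 0.3)."""
    if not detections:
        return None
    # 'label' holds the normalized form (e.g. 'coffeetable'), while ROOM_KEYWORDS
    # uses plain English names, so match on the raw label as well.
    labels = set()
    for d in detections:
        if d['score'] > 0.3:
            labels.add(d['label'].lower())
            labels.add(d.get('raw_label', '').lower())
    room_scores = {}
    for room, keywords in ROOM_KEYWORDS.items():
        score = len(labels & keywords)
        if score > 0:
            room_scores[room] = score
    if room_scores:
        return max(room_scores, key=room_scores.get)
    return None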

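def create_object_description(detections: List[Dict], max_objects: int = 10) -> str:
    """Build an object description grouped into people, furniture, appliances and others."""
    if not detections:
        return ""
    people, furniture, appliances, other_objects = [], [], [], []
    for det in detections[:max_objects]:
        label = det['label'].lower()
        raw_label = det.get('raw_label', '').lower()
        score = det['score']
        # Normalized labels such as 'coffeetable' are not keys of OBJECT_TRANSLATIONS,
        # so fall back to the raw label before giving up on a translation.
        zh_label = OBJECT_TRANSLATIONS.get(label) or OBJECT_TRANSLATIONS.get(raw_label, label)
        obj_desc = f"{zh_label}(信心度{score:.2f})"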
        if any(w in label for w in ['person', 'man', 'woman', 'child', 'people', 'baby']):
            people.append(obj_desc)
        elif any(w in label for w in ['sofa', 'chair', 'table', 'bed', 'desk', 'cabinet']):
            furniture.append(obj_desc)
        elif any(w in label for w in ['tv', 'television', 'refrigerator', 'microwave', 'oven', 'stove']):
            appliances.append(obj_desc)
        else:
            other_objects.append(obj_desc)
    descriptions = []
    if people:
        descriptions.append(f"人物：{', '.join(people)}")
    if furniture:
        descriptions.append(f"家具：{', '.join(furniture[:5])}")
    if appliances:
        descriptions.append(f"電器：{', '.join(appliances[:3])}")
    if other_objects:
        descriptions.append(f"其他物件：{', '.join(other_objects[:5])}")
    return '；'.join(descriptions) + '。' if descriptions else ""

def analyze_spatial_relationships(detections: List[Dict]) -> str:
    """Summarize coarse layout cues: people next to furniture and overall object density."""
    if len(detections) < 2:
        return ""
    relationships = []
    people = [d for d in detections if any(w in d['label'].lower() for w in ['person', 'man', 'woman', 'child'])]
    furniture = [d for d in detections if any(w in d['label'].lower() for w in ['sofa', 'chair', 'bed', 'table'])]
    if people and furniture:
        relationships.append("場景中有人物與家具的互動配置")
    if len(detections) > 8:
        relationships.append("空間中物件配置豐富")
    elif len(detections) > 4:
        relationships.append("空間中有適度的物件配置")
    return '；'.join(relationships) + '。' if relationships else ""

def generate_enhanced_caption(
    image_path: str,
    detections: List[Dict],
    room_hint: Optional[str] = None,
    knowledge_items: Optional[List[str]] = None,
    *,
    include_objects: bool = True,
    include_spatial: bool = True,
    include_room: bool = True,
    include_knowledge: bool = True,
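) -> str:
    """
    Generate a detailed description; the keyword switches control whether the object summary,
    spatial relationships, room inference and knowledge notes are appended.
    For the narrative only, set all four include_* flags to False.
    """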
    initialize_models()

    # Generate base captions from several differently worded prompts
    base_captions = []
    prompts = [
        "Describe this indoor scene in detail, focusing on the people, furniture, and activities.",
        "What is happening in this room? Describe the environment, objects, and any people present.",
        "Provide a comprehensive description of this indoor space, including all visible elements."
    ]
    for prompt in prompts[:2]:
        try:
            cap = generate_base_caption(image_path, prompt)
            if cap and len(cap) > 10:
                base_captions.append(cap)
        except Exception as e:
            print(f"Error while generating caption: {e}")
            continue
    if not base_captions:
        base_captions = [generate_base_caption(image_path)]
    primary_caption = max(base_captions, key=len) if base_captions else "Indoor scene"

    # Translate to Chinese (the plain narrative base)
    zh_caption = translate_to_traditional_chinese(primary_caption)
    description_parts = [zh_caption]  # base: narrative only

    # Append the optional sections according to the switches
    if include_objects:
        object_desc = create_object_description(detections)
        if object_desc:
            description_parts.append(f"偵測到的主要物件包括：{object_desc}")
    if include_spatial:
        spatial_desc = analyze_spatial_relationships(detections)
        if spatial_desc:
            description_parts.append(f"空間配置特徵：{spatial_desc}")
    if include_room and room_hint:
        env_templates = ENVIRONMENT_TEMPLATES.get(room_hint, [f'{room_hint}環境'])
        env_desc = env_templates[0] if env_templates else f'{room_hint}環境'
        description_parts.append(f"場景分析：此為{env_desc}。")
    if include_knowledge and knowledge_items:
        kg_desc = '；'.join(knowledge_items)
        description_parts.append(f"知識圖譜補充：{kg_desc}。")

    final_description = ' '.join(description_parts)
    if not final_description.endswith('。'):
        final_description += '。'
    return final_description

def augment_with_knowledge_graph(caption: str, detections: List[Dict], knowledge_items: Optional[List[str]] = None) -> str:
    """Placeholder interface for knowledge-graph integration."""
    if not knowledge_items:
        return caption
    extra_knowledge = '；'.join(knowledge_items)
    return f"{caption} 知識圖譜增強：{extra_knowledge}。"

__all__ = [
    'initialize_models',
    'generate_enhanced_caption',
    'infer_room_from_detections',
    'augment_with_knowledge_graph'
]


Overwriting scripts/generate_caption.py
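
Room inference in this module is a set-overlap vote: each room owns a keyword set, and the room sharing the most labels with the confident detections wins. A minimal sketch of that idea follows, assuming a shortened, hypothetical keyword table (`ROOM_KEYWORDS_DEMO`) with English room names; the module's own table is larger and uses Chinese room names.

```python
from typing import Dict, List, Optional, Set

# Hypothetical, shortened keyword table for illustration only.
ROOM_KEYWORDS_DEMO: Dict[str, Set[str]] = {
    "bedroom": {"bed", "pillow", "nightstand"},
    "kitchen": {"stove", "fridge", "microwave", "sink"},
    "bathroom": {"toilet", "bathtub", "sink"},
}


def infer_room(detections: List[Dict], min_score: float = 0.3) -> Optional[str]:
    """Pick the room whose keyword set overlaps most with the confident labels."""
    labels = {d["label"].lower() for d in detections if d["score"] > min_score}
    scores = {room: len(labels & kws) for room, kws in ROOM_KEYWORDS_DEMO.items()}
    best_room = max(scores, key=scores.get)
    return best_room if scores[best_room] > 0 else None


if __name__ == "__main__":
    frame = [  # made-up detections
        {"label": "sink", "score": 0.5},
        {"label": "microwave", "score": 0.6},
        {"label": "toilet", "score": 0.2},  # below the threshold, ignored
    ]
    print(infer_room(frame))  # kitchen
```

Shared keywords such as `sink` can produce ties. With `max` over a dict, the room listed first in the table wins, so the order of the table acts as an implicit priority.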


## Utility module: `scripts/utils.py`

In [19]:
%%writefile scripts/utils.py
from __future__ import annotations
from typing import List, Dict
from pathlib import Path

_LABEL_ALIASES: Dict[str, str] = {
    'people': 'person',
    'men': 'person',
    'man': 'person',
    'women': 'person',
    'woman': 'person',
    'elderly person': 'elderly',
    'old man': 'elderly',
    'old woman': 'elderly',
    'baby girl': 'baby',
    'baby boy': 'baby',
    'couch': 'sofa',
    'arm chair': 'chair',
    'armchair': 'chair',
    'dining chair': 'chair',
    'office chair': 'chair',
    'wheel chair': 'chair',
    'rocking chair': 'chair',
    'coffee table': 'coffeetable',
    'side table': 'coffeetable',
    'end table': 'coffeetable',
    'dining table': 'kitchentable',
    'kitchen table': 'kitchentable',
    'bedside table': 'nightstand',
    'night stand': 'nightstand',
    'television': 'tv',
    'tv monitor': 'tv',
    'flat screen tv': 'tv',
    'monitor': 'cpuscreen',
    'screen': 'cpuscreen',
    'desktop computer': 'computer',
    'laptop': 'computer',
    'notebook computer': 'computer',
    'desk lamp': 'tablelamp',
    'table lamp': 'tablelamp',
    'floor lamp': 'tablelamp',
    'ceiling light': 'ceilinglamp',
    'ceiling lamp': 'ceilinglamp',
    'wall light': 'walllamp',
    'wall lamp': 'walllamp',
    'cupboard': 'cabinet',
    'drawer': 'cabinet',
    'closet drawer': 'cabinet',
    'closet door': 'door',
    'wardrobe': 'closet',
    'kitchen cabinet': 'kitchencabinet',
    'kitchen counter': 'kitchencounter',
    'kitchen island': 'kitchencounter',
    'countertop': 'kitchencounter',
    'microwave oven': 'microwave',
    'fridge': 'fridge',
    'refrigerator': 'fridge',
    'freezer': 'fridge',
    'stovetop': 'stove',
    'cooktop': 'stove',
    'range': 'stove',
    'trash can': 'garbagecan',
    'trashcan': 'garbagecan',
    'trash bin': 'garbagecan',
    'garbage can': 'garbagecan',
    'garbage bin': 'garbagecan',
    'waste bin': 'garbagecan',
    'remote': 'remotecontrol',
    'remote control': 'remotecontrol',
    'television remote': 'remotecontrol',
    'picture frame': 'photoframe',
    'picture': 'wallpictureframe',
    'painting': 'wallpictureframe',
    'book shelf': 'bookshelf',
    'bookcase': 'bookshelf',
    'wine glass': 'wineglass',
    'glass': 'waterglass',
    'coffee cup': 'mug',
    'tea cup': 'mug',
    'cup': 'mug',
    'frying pan': 'fryingpan',
    'sauce pan': 'cookingpot',
    'cooking pan': 'fryingpan',
    'cooking pot': 'cookingpot',
    'wash basin': 'sink',
    'basin': 'sink',
    'bath tub': 'bathtub',
    'toilet paper': 'toiletpaper',
    'towel rail': 'towelrack',
    'towel bar': 'towelrack',
    'washing machine': 'washingmachine',
    'wash machine': 'washingmachine',
}

def _collapse_label(label: str) -> str:
    """Drop every non-alphanumeric character, e.g. 'wine glass' -> 'wineglass'."""
    return ''.join(ch for ch in label if ch.isalnum())

def normalize_label(label: str) -> str:
    """Map a raw detector label to its canonical key; return '' for empty input."""
    cleaned = (label or '').lower().strip()
    if not cleaned:
        return ''
    # Treat hyphens as spaces and squeeze repeated whitespace before the alias lookup.
    cleaned = cleaned.replace('-', ' ')
    cleaned = ' '.join(cleaned.split())
    canonical = _LABEL_ALIASES.get(cleaned, cleaned)
    return _collapse_label(canonical)

def ensure_dir(path: Path) -> Path:
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path

def save_txt(lines: List[str], filepath: Path) -> None:
    filepath = Path(filepath)
    filepath.parent.mkdir(parents=True, exist_ok=True)
    with filepath.open('w', encoding='utf-8') as f:
        for line in lines:
            f.write(str(line).rstrip() + '\n')

def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, or HH:MM:SS once the value reaches one hour."""
    seconds = int(seconds)
    h = seconds // 3600
    m = (seconds % 3600) // 60
    s = seconds % 60
    if h > 0:
        return f"{h:02d}:{m:02d}:{s:02d}"
    return f"{m:02d}:{s:02d}"

def format_detection_brief(timestamp: float, detections: List[Dict]) -> str:
    t = format_timestamp(timestamp)
    scores: Dict[str, float] = {}
    for det in detections or []:
        lbl = normalize_label(det.get('label') or det.get('raw_label', ''))
        if not lbl:
            continue
        score = float(det.get('score', 0.0) or 0.0)
        if lbl not in scores or score > scores[lbl]:
            scores[lbl] = score
    if not scores:
        return f"[{t}] 無偵測結果"
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:8]
    brief = ', '.join(f"{label}({value:.2f})" for label, value in ordered)
    return f"[{t}] {brief}"

def create_visualization_summary(detections: List[Dict], max_items: int = 10) -> str:
    if not detections:
        return "未偵測到物件"
    seen = set()
    items = []
    for det in detections:
        lbl = normalize_label(det.get('label') or det.get('raw_label', ''))
        if not lbl or lbl in seen:
            continue
        seen.add(lbl)
        items.append(f"{lbl}({float(det.get('score', 0.0) or 0.0):.2f})")
        if len(items) >= max_items:
            break
    return '、'.join(items) if items else "未偵測到物件"


Overwriting scripts/utils.py
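
Label normalization runs in a fixed order: lowercase, unify separators, look up the alias table, then strip everything that is not alphanumeric. The sketch below shows that order on a few inputs, together with the timestamp format used in the per-frame summaries. `ALIASES_DEMO`, `normalize_label_demo` and `format_timestamp_demo` are hypothetical stand-ins so the example runs without the `scripts` package; the alias table is a three-entry excerpt.

```python
from typing import Dict

# Hypothetical three-entry excerpt of the alias table, for illustration only.
ALIASES_DEMO: Dict[str, str] = {
    "television": "tv",
    "coffee table": "coffeetable",
    "refrigerator": "fridge",
}


def normalize_label_demo(label: str) -> str:
    """Lowercase, unify separators, apply aliases, then drop non-alphanumerics."""
    cleaned = " ".join((label or "").lower().replace("-", " ").split())
    canonical = ALIASES_DEMO.get(cleaned, cleaned)
    return "".join(ch for ch in canonical if ch.isalnum())


def format_timestamp_demo(seconds: float) -> str:
    """Render seconds as MM:SS, or HH:MM:SS once an hour is reached."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"


if __name__ == "__main__":
    for raw in ["Coffee-Table", "  Television ", "wine glass"]:
        print(repr(raw), "->", normalize_label_demo(raw))
    print(format_timestamp_demo(75.9), format_timestamp_demo(3661))
```

Applying aliases before collapsing is what lets multi-word keys such as `coffee table` match; collapsing first would turn the input into `coffeetable` and skip any alias keyed on the spaced form.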


## Caption post-processing module: `scripts/caption_postprocess.py`

In [12]:
%%writefile scripts/caption_postprocess.py
from typing import List, Dict, Any
import re
import os

from scripts.utils import normalize_label

class CaptionPostProcessor:
    """Clean translated captions and append a deduplicated object list per timestamp."""
    def __init__(self, furniture_txt_path: str = 'furniture.txt', conf_th: float = 0.25):
        self.furniture_txt_path = furniture_txt_path
        self.conf_th = conf_th

        self.prompt_prefixes = [
            '描述環境、物體和在場的任何人',
            '詳細描述這個室內場景',
            '詳細描述這一室內場景',
            '詳細描述這個室內場景, 以人、傢俱和活動為重點',
            '詳細描述這個室內場景, 以人、傢俱和活動爲重點',
            '詳細描述這個室內場景, 關注人們、傢俱和活動',
            '詳細描述這個室內場景, 關注人羣、傢俱和活動',
            '詳細描述這個室內場景,關注人羣、傢俱和活動',
            '詳細描述這個室內場景,以人、傢俱和活動爲重點',
        ]

        self.object_aliases = {
            'person': '人',
            'elderly': '長者',
            'baby': '嬰兒',
            'bed': '床鋪',
            'pillow': '枕頭',
            'blanket': '毛毯',
            'sheet': '床單',
            'mattress': '床墊',
            'nightstand': '床頭櫃',
            'tablelamp': '檯燈',
            'ceilinglamp': '天花板燈',
            'walllamp': '壁燈',
            'sofa': '沙發',
            'chair': '椅子',
            'coffeetable': '茶几',
            'kitchentable': '餐桌',
            'desk': '書桌',
            'cabinet': '櫥櫃',
            'kitchencabinet': '廚房櫃',
            'kitchencounter': '流理台',
            'bookshelf': '書架',
            'wallshelf': '壁掛架',
            'closet': '衣櫃',
            'closetdrawer': '衣櫃抽屜',
            'tvstand': '電視櫃',
            'tv': '電視',
            'cpuscreen': '螢幕',
            'computer': '電腦',
            'keyboard': '鍵盤',
            'mouse': '滑鼠',
            'printer': '印表機',
            'speaker': '喇叭',
            'amplifier': '擴大機',
            'radio': '收音機',
            'coffeemaker': '咖啡機',
            'coffeepot': '咖啡壺',
            'microwave': '微波爐',
            'fridge': '冰箱',
            'stove': '爐台',
            'oven': '烤箱',
            'fryingpan': '平底鍋',
            'cookingpot': '湯鍋',
            'dishwasher': '洗碗機',
            'washingmachine': '洗衣機',
            'sink': '水槽',
            'faucet': '水龍頭',
            'dishbowl': '碗',
            'plate': '盤子',
            'mug': '馬克杯',
            'waterglass': '玻璃杯',
            'wineglass': '紅酒杯',
            'cutleryfork': '叉子',
            'cutleryknife': '刀子',
            'cuttingboard': '砧板',
            'garbagecan': '垃圾桶',
            'toaster': '烤麵包機',
            'toilet': '馬桶',
            'toiletpaper': '衛生紙',
            'bathtub': '浴缸',
            'shower': '淋浴設備',
            'towel': '毛巾',
            'towelrack': '毛巾架',
            'mirror': '鏡子',
            'door': '門',
            'doorjamb': '門框',
            'window': '窗戶',
            'curtains': '窗簾',
            'floor': '地板',
            'wall': '牆面',
            'ceiling': '天花板',
            'rug': '地毯',
            'plant': '植物',
            'flower': '花朵',
            'clock': '時鐘',
            'photoframe': '相框',
            'wallpictureframe': '壁掛畫',
            'book': '書籍',
            'notes': '筆記',
            'magazine': '雜誌',
            'paper': '紙張',
            'folder': '文件夾',
            'remotecontrol': '遙控器',
            'lightswitch': '開關',
            'powersocket': '插座',
            'washingsponge': '海綿',
            'barsoap': '肥皂',
            'toothbrush': '牙刷',
            'toothpaste': '牙膏',
            'facecream': '護膚霜',
            'deodorant': '除臭劑',
            'hairproduct': '美髮產品',
            'alcohol': '酒精',
            'bottlewater': '礦泉水',
            'milk': '牛奶',
            'juice': '果汁',
            'wine': '紅酒',
            'apple': '蘋果',
            'bananas': '香蕉',
            'bellpepper': '甜椒',
            'carrot': '胡蘿蔔',
            'salmon': '鮭魚',
            'chicken': '雞肉',
            'mincedmeat': '絞肉',
            'breadslice': '麵包片',
            'cereal': '麥片',
            'crackers': '餅乾',
            'cupcake': '杯子蛋糕',
            'pancake': '鬆餅',
            'pie': '派',
            'pudding': '布丁',
            'creamybuns': '奶油麵包',
            'cat': '貓',
            'toy': '玩具',
            'boardgame': '桌遊',
            'guitar': '吉他',
        }

        self.furniture_set = {
            '床鋪', '床頭櫃', '椅子', '沙發', '茶几', '餐桌', '書桌', '櫥櫃', '廚房櫃', '流理台',
            '書架', '壁掛架', '衣櫃', '衣櫃抽屜', '電視櫃'
        }

        self.room_rules = [
            ('廚房', {'流理台', '廚房櫃', '咖啡機', '爐台', '微波爐', '冰箱', '湯鍋', '平底鍋', '砧板'}),
            ('臥室', {'床鋪', '枕頭', '床頭櫃', '衣櫃', '書桌'}),
            ('客廳', {'沙發', '茶几', '電視', '電視櫃', '遙控器', '地毯'}),
            ('浴室', {'馬桶', '浴缸', '水槽', '毛巾', '毛巾架', '鏡子'}),
        ]

        try:
            if os.path.exists(self.furniture_txt_path):
                os.remove(self.furniture_txt_path)
        except OSError:
            pass

    def process(self, ts_hhmm: str, det_items: List[Dict[str, Any]], raw_caption: str) -> str:
        categories = self._dedupe_by_category(det_items, self.conf_th)
        self._append_furniture(ts_hhmm, categories)

        caption = self._clean_text(raw_caption or '')
        if not caption:
            has_person = self._has_person(categories, raw_caption or '')
            room_hint = self._guess_room(categories)
            caption = self._compose_neutral_sentence(room_hint, categories, has_person)

        objects_str = self._format_objects(categories)
        return f"[{ts_hhmm}] {caption} 偵測到的主要物件包括：{objects_str}。"

    def _zh_label(self, label: str, raw_label: str = '') -> str:
        key = normalize_label(label or raw_label)
        if key in self.object_aliases:
            return self.object_aliases[key]
        candidate = raw_label.strip() or label.strip()
        candidate = re.sub(r'[^\w一-龥]', '', candidate)
        return candidate

    def _auto_category(self, canonical: str, zh_name: str) -> str:
        furniture_keys = {
            'bed', 'nightstand', 'sofa', 'chair', 'coffeetable', 'kitchentable', 'desk',
            'cabinet', 'kitchencabinet', 'kitchencounter', 'bookshelf', 'wallshelf', 'closet',
            'closetdrawer', 'tvstand'
        }
        appliance_keys = {
            'tv', 'cpuscreen', 'computer', 'printer', 'speaker', 'amplifier', 'radio',
            'coffeemaker', 'microwave', 'fridge', 'stove', 'oven', 'dishwasher',
            'washingmachine', 'toaster', 'tablelamp', 'ceilinglamp', 'walllamp'
        }
        if canonical in furniture_keys:
            return '家具'
        if canonical in appliance_keys:
            return '電器'
        if zh_name in self.furniture_set:
            return '家具'
        return '其他物件'

    def _dedupe_by_category(self, det_items: List[Dict[str, Any]], conf_th: float) -> Dict[str, List[str]]:
        output = {'家具': set(), '電器': set(), '其他物件': set()}
        for det in det_items or []:
            score = float(det.get('score', 0.0) or 0.0)
            if score < conf_th:
                continue
            raw_label = det.get('raw_label', '') or det.get('label', '')
            canonical = normalize_label(raw_label)
            if not canonical:
                continue
            zh_name = self._zh_label(canonical, raw_label)
            if not zh_name:
                continue
            category = det.get('category')
            if not category:
                category = self._auto_category(canonical, zh_name)
            if category not in output:
                category = '其他物件'
            output[category].add(zh_name)
        return {key: sorted(values) for key, values in output.items() if values}

    def _append_furniture(self, ts_hhmm: str, categories: Dict[str, List[str]]) -> None:
        furniture_names = [name for name in categories.get('家具', []) if name in self.furniture_set]
        line = f"[{ts_hhmm}] 家具：" + ('、'.join(furniture_names) if furniture_names else '無')
        try:
            with open(self.furniture_txt_path, 'a', encoding='utf-8') as fp:
                fp.write(line + '\n')
        except OSError:
            pass

    def _clean_text(self, text: str) -> str:
        cleaned = text
        for prefix in self.prompt_prefixes:
            cleaned = cleaned.replace(prefix, '')
        cleaned = ' '.join(token for token in cleaned.split() if not token.startswith('##'))
        cleaned = cleaned.replace('，,', '，').replace(',,', ',').replace('..', '.')
        return cleaned.strip(' 、。,.；;')

    def _has_person(self, categories: Dict[str, List[str]], caption: str) -> bool:
        all_words = ''.join(sum(categories.values(), [])) + caption
        return '人' in all_words or 'person' in all_words.lower()

    def _guess_room(self, categories: Dict[str, List[str]]) -> str:
        bag = set(sum(categories.values(), []))
        for room, triggers in self.room_rules:
            if bag & triggers:
                return room
        return ''

    def _compose_neutral_sentence(self, room_hint: str, categories: Dict[str, List[str]], has_person: bool) -> str:
        room_text = f'在{room_hint}' if room_hint else '在室內'
        furniture = '、'.join(categories.get('家具', [])[:3])
        appliances = '、'.join(categories.get('電器', [])[:2])
        others = '、'.join(categories.get('其他物件', [])[:2])
        obj_text = '、'.join(filter(None, [furniture, appliances, others]))
        if has_person:
            if obj_text:
                return f'{room_text}可見有人，周圍有{obj_text}。'
            return f'{room_text}可見有人。'
        if obj_text:
            return f'{room_text}可見{obj_text}。'
        return f'{room_text}環境整潔，物件稀疏。'

    def _format_objects(self, categories: Dict[str, List[str]]) -> str:
        parts = []
        for name in ('家具', '電器', '其他物件'):
            entries = categories.get(name, [])
            if entries:
                parts.append(f"{name}：" + '、'.join(entries))
        return '；'.join(parts) if parts else '無明顯物件'


Writing scripts/caption_postprocess.py
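
The room guess in `_guess_room` returns the first rule whose trigger set overlaps the detected names, so the order of `room_rules` decides ties. The sketch below copies the rule table from the class into a standalone function to make that behaviour visible; `ROOM_RULES` and `guess_room` are illustrative names, not part of the script.

```python
from typing import Dict, List

# Rule table copied from room_rules above; earlier entries win on ties.
ROOM_RULES = [
    ('廚房', {'流理台', '廚房櫃', '咖啡機', '爐台', '微波爐', '冰箱', '湯鍋', '平底鍋', '砧板'}),
    ('臥室', {'床鋪', '枕頭', '床頭櫃', '衣櫃', '書桌'}),
    ('客廳', {'沙發', '茶几', '電視', '電視櫃', '遙控器', '地毯'}),
    ('浴室', {'馬桶', '浴缸', '水槽', '毛巾', '毛巾架', '鏡子'}),
]


def guess_room(categories: Dict[str, List[str]]) -> str:
    """Return the first room whose triggers overlap the detected names."""
    # Flatten every category list into one set of Chinese object names.
    bag = {name for names in categories.values() for name in names}
    for room, triggers in ROOM_RULES:
        if bag & triggers:
            return room
    return ''


# A sofa (living room) and a desk (bedroom) in one frame: the bedroom rule
# is listed before the living-room rule, so it is the one that matches.
mixed = guess_room({'家具': ['沙發', '書桌']})
print(mixed)
```

A single trigger is enough to label the frame, so a rule set like this is best read as a hint rather than a classification.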


## Main program: `main.py`

In [20]:
%%writefile main.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
from pathlib import Path
from typing import Iterable
from tqdm import tqdm

from scripts.extract_frames import extract_frames
from scripts.object_detector import setup_grounding_dino, detect_objects, summarize_detections, analyze_scene_context, DEFAULT_PROMPT
from scripts.generate_caption import initialize_models, generate_enhanced_caption, infer_room_from_detections
from scripts.utils import ensure_dir, save_txt, format_detection_brief, format_timestamp, create_visualization_summary

VIDEO_EXTS = {'.mp4', '.mov', '.mkv', '.avi', '.webm'}

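def iter_videos(targets: Iterable[Path]):
    """Yield every video file found among the given files and directories."""
    for target in targets:
        target = Path(target)
        if target.is_file() and target.suffix.lower() in VIDEO_EXTS:
            yield target
        elif target.is_dir():
            for path in sorted(target.rglob('*')):
                # Skip directories whose names happen to end in a video extension.
                if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
                    yield path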

def process_video(
    video_path: Path,
    output_root: Path,
    fps_sample: float = 1.0,
    text_prompt: str = DEFAULT_PROMPT,
    show_detection_images: bool = True,
    max_preview_images: int = 5
) -> Path:
    """Process a single video."""
    video_path = Path(video_path)
    video_output = ensure_dir(output_root / video_path.stem)
    frames_dir = ensure_dir(video_output / 'frames')
    detections_dir = ensure_dir(video_output / 'detections')

    print(f'\n=== 開始處理影片：{video_path.name} ===')
    print('正在提取影格...')
    frame_entries, native_fps = extract_frames(str(video_path), str(frames_dir), target_fps=fps_sample)
    if not frame_entries:
        raise RuntimeError(f'影片 {video_path} 未擷取到任何影格，請確認檔案是否為有效影片。')
    print(f'成功提取 {len(frame_entries)} 個影格，原始FPS: {native_fps:.2f}')

    subtitle_lines = []
    detection_lines = []
    preview_count = 0

    for idx, (frame_path, timestamp) in enumerate(tqdm(frame_entries, desc=f'分析 {video_path.name}')):
        try:
            detections, annotated_image = detect_objects(
                frame_path,
                text_prompt=text_prompt,
                visualize=True
            )
            if annotated_image is not None:
                annotated_path = detections_dir / f'{Path(frame_path).stem}_detection.jpg'
                annotated_image.save(annotated_path)
                if show_detection_images and preview_count < max_preview_images:
                    print(f'\n第 {idx+1} 個影格偵測結果：')
                    vis_summary = create_visualization_summary(detections, len(detections))
                    print(vis_summary)
                    print(f'高信心度物件：{summarize_detections(detections, top_k=5)}')
                    preview_count += 1

            scene_context = analyze_scene_context(detections)
            room_hint = infer_room_from_detections(detections)

            # Important: keep the caption as pure narration; turn off every optional add-on segment.
            pure_caption = generate_enhanced_caption(
                frame_path,
                detections,
                room_hint=room_hint,
                include_objects=False,
                include_spatial=False,
                include_room=False,
                include_knowledge=False,
            )
            time_str = format_timestamp(timestamp)
            subtitle_lines.append(f'[{time_str}] {pure_caption}')

            # Detection summary (left unchanged)
            detection_summary = format_detection_brief(timestamp, detections)
            if scene_context['person_count'] > 0:
                detection_summary += f' | 人數:{scene_context["person_count"]}'
            if room_hint:
                detection_summary += f' | 房間:{room_hint}'
            detection_lines.append(detection_summary)

        except Exception as e:
            print(f'處理第 {idx+1} 個影格時發生錯誤: {e}')
            error_msg = f'[{format_timestamp(timestamp)}] 處理錯誤: {str(e)}'
            subtitle_lines.append(error_msg)
            detection_lines.append(error_msg)

    subs_path = video_output / f'{video_path.stem}_captions.txt'
    det_path = video_output / f'{video_path.stem}_detections.txt'
    save_txt(subtitle_lines, subs_path)
    save_txt(detection_lines, det_path)

    report_lines = [
        f'影片處理報告 - {video_path.name}',
        '=' * 50,
        f'原始FPS: {native_fps:.2f}',
        f'取樣FPS: {fps_sample}',
        f'處理影格數: {len(frame_entries)}',
        f'影片總時長: {format_timestamp(frame_entries[-1][1])}' if frame_entries else '無',
        '輸出檔案:',
        f'  - 詳細字幕: {subs_path.name}',
        f'  - 偵測摘要: {det_path.name}',
        f'  - 標註圖像目錄: {detections_dir.name}/',
        f'  - 原始影格目錄: {frames_dir.name}/'
    ]
    report_path = video_output / f'{video_path.stem}_report.txt'
    save_txt(report_lines, report_path)

    print('\n處理完成！')
    print(f'詳細字幕檔: {subs_path}')
    print(f'偵測摘要檔: {det_path}')
    print(f'處理報告: {report_path}')
    print(f'標註圖像: {detections_dir}/')
    return subs_path

def main() -> None:
    """Program entry point."""
    parser = argparse.ArgumentParser(description='增強版影片分析系統 - Grounding DINO + BLIP')
    parser.add_argument('--video', help='指定單一影片路徑')
    parser.add_argument('--dir', help='處理資料夾下的所有影片')
    parser.add_argument('--fps', type=float, default=1.0, help='取樣頻率(每秒擷取影格數)')
    parser.add_argument('--output-dir', default='outputs', help='輸出根目錄')
    parser.add_argument('--prompt', default=DEFAULT_PROMPT, help='物件偵測提示詞')
    parser.add_argument('--no-preview', action='store_true', help='不顯示偵測預覽圖像')
    parser.add_argument('--max-preview', type=int, default=5, help='最多顯示幾張預覽圖像')
    args = parser.parse_args()

    if not args.video and not args.dir:
        parser.error('請至少指定 --video 或 --dir 其中之一。')

    print('正在初始化模型...')
    setup_grounding_dino()
    initialize_models()
    print('所有模型已準備完成。')

    targets = []
    if args.video:
        targets.append(Path(args.video))
    if args.dir:
        targets.append(Path(args.dir))
    output_root = ensure_dir(args.output_dir)

    processed_count = 0
    for video in iter_videos(targets):
        try:
            process_video(
                video,
                output_root,
                fps_sample=args.fps,
                text_prompt=args.prompt,
                show_detection_images=not args.no_preview,
                max_preview_images=args.max_preview
            )
            processed_count += 1
        except Exception as e:
            print(f'處理影片 {video} 時發生錯誤: {e}')
            continue

    print('\n=== 所有處理完成 ===')
    print(f'成功處理 {processed_count} 個影片檔案')
    print(f'輸出目錄: {output_root}')

if __name__ == '__main__':
    main()


Overwriting main.py
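
To check which files a `--dir` run would pick up before loading any model, the traversal can be tried on its own. The sketch below is a standalone re-implementation of the same walk as `iter_videos` (including an `is_file()` check in the directory branch), run against a temporary directory; the file names and the `demo` helper are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List

VIDEO_EXTS = {'.mp4', '.mov', '.mkv', '.avi', '.webm'}


def iter_videos(targets: Iterable[Path]) -> Iterator[Path]:
    """Yield video files from a mix of file paths and directories."""
    for target in targets:
        target = Path(target)
        if target.is_file() and target.suffix.lower() in VIDEO_EXTS:
            yield target
        elif target.is_dir():
            # rglob('*') walks subdirectories too; sorting keeps runs repeatable.
            for path in sorted(target.rglob('*')):
                if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
                    yield path


def demo() -> List[str]:
    """Create a few empty files and list the ones treated as videos."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / 'sub').mkdir()
        for name in ('b.MP4', 'a.mp4', 'notes.txt', 'sub/c.mkv'):
            (root / name).touch()
        return [p.relative_to(root).as_posix() for p in iter_videos([root])]


found = demo()
print(found)
```

The extension match is case-insensitive and recursive, so videos in nested folders are processed as well.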


## Run

In [24]:
# Single video
# !python main.py --video data/videos/your_video.mp4 --fps 2 --output-dir outputs

# Batch-process a folder
!python main.py --dir data/videos --fps 1 --output-dir outputs

2025-09-21 15:45:41.428557: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1758469541.449776    2472 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1758469541.456100    2472 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1758469541.471929    2472 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.

In [15]:
import IPython.display as display

# Running this cell calls front-end JS that disconnects the Colab session
# display.display(display.Javascript('google.colab.kernel.disconnect()'))


## Tunable parameters and where to change them (complete list)
- Detection prompt: the `--prompt` argument of `main.py`, or `DEFAULT_PROMPT` in `scripts/object_detector.py`.
- Detection confidence threshold and number of results returned: `detect_objects(..., box_threshold=0.25, top_k=15)` in `scripts/object_detector.py`.
- Frame sampling rate: the `--fps` run-time argument, or the default value of `target_fps` in `scripts/extract_frames.py`.
- Translation and Chinese script conversion: `initialize_models()` in `scripts/generate_caption.py` uses `Helsinki-NLP/opus-mt-en-zh` and `OpenCC('s2t')`.
- Room-inference keywords: the `ROOM_KEYWORDS` dictionary in `scripts/generate_caption.py`.
- Environment description templates: `ENVIRONMENT_TEMPLATES` in `scripts/generate_caption.py`.
- English-to-Chinese object names: `OBJECT_TRANSLATIONS` in `scripts/generate_caption.py`.
- Text output file names and summary logic: `save_txt`, `format_detection_brief`, and `format_timestamp` in `scripts/utils.py`.
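
The threshold and result-count settings above can be thought of as a filter followed by a sort. The internals of `detect_objects` are not shown in this section, so the following is only a sketch under that assumption: `filter_detections` is a hypothetical helper, and only the `score` and `label` keys are taken from the code in this notebook.

```python
from typing import Any, Dict, List


def _score(det: Dict[str, Any]) -> float:
    """Read a detection score, treating a missing or None value as 0.0."""
    return float(det.get('score', 0.0) or 0.0)


def filter_detections(
    dets: List[Dict[str, Any]],
    box_threshold: float = 0.25,
    top_k: int = 15,
) -> List[Dict[str, Any]]:
    """Keep detections at or above the threshold, best first, at most top_k."""
    kept = [det for det in dets if _score(det) >= box_threshold]
    kept.sort(key=_score, reverse=True)
    return kept[:top_k]


sample = [
    {'label': 'sofa', 'score': 0.9},
    {'label': 'tv', 'score': 0.2},      # below the default threshold
    {'label': 'chair', 'score': 0.5},
]
labels = [det['label'] for det in filter_detections(sample)]
print(labels)
```

Raising `box_threshold` trades recall for fewer false positives, while `top_k` only caps how many of the surviving boxes are reported.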

