# Image Captioning

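**What you will learn in this chapter:**

- Understand what *Image Captioning* is and the role it plays in multimodal AI.  
- Learn to use LangChain with an OpenAI multimodal model to generate text descriptions from image content automatically.  
- Get familiar with three common ways of sending an image (URL, Base64, multipart/form-data) and the trade-offs of each.
- Learn how to build a *HumanMessagePromptTemplate* in LangChain that contains both text and an image.   

**📘 Skills you will have at the end:**  
- Write Python programs that automate the image → text description flow, and understand the core ideas behind input format design for multimodal models.
- Build a multimodal LLM pipeline on your own, so that the model can describe an image or answer questions about it, enabling richer vision-language interaction.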

In [None]:
Image Captioning: the model automatically generates an objective description of an image's content, for example "A black dog is running on the grass."

## Image Captioning with Multimodal LLM

In [None]:
from IPython.display import display, HTML

# Define the HTML to display images side by side
html = """
<div style="display: flex; justify-content: space-around;">
    <div>
        <img src="StellarBladeTachy-Nikke.png" height="900" width="600" />
    </div>
    <div>
        <img src="AzueLaneAmagi.png" height="900" width="600" />
    </div>
</div>
"""

# Display the HTML
display(HTML(html))

In [None]:
import os

os.chdir("../../../")

In [None]:
from langchain_openai import ChatOpenAI

from src.initialization import credential_init

credential_init()

model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4o-2024-05-13", temperature=0)

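If an API accepts only text data (for example, a JSON body), the image must first be converted to a Base64 string before it is sent. If the API supports file upload or URLs, the image can be sent directly, with no Base64 step.

Common approaches to LLM image captioning in practice:

    - Method A: pass the image URL directly (simplest, and avoids the roughly 33% size increase of Base64).

    - Method B: convert the image to Base64 and embed it in the JSON sent to the model (when the API requires it).

    - Method C: upload via multipart/form-data (works like a file upload; the most efficient, since raw bytes are sent without encoding overhead).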
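The roughly 33% figure for Base64 follows from the encoding itself: every 3 input bytes become 4 output characters. A minimal sketch that checks this with the standard library (the 300 kB buffer is an arbitrary stand-in for an image file, not data from this notebook):

```python
import base64
import math


def base64_encoded_size(n_bytes: int) -> int:
    """Length of the padded Base64 text produced for n_bytes of binary input."""
    # Every group of 3 input bytes becomes 4 characters; the last group is padded.
    return 4 * math.ceil(n_bytes / 3)


raw = bytes(300_000)  # hypothetical stand-in for a ~300 kB image file
encoded = base64.b64encode(raw)

# The formula and the real encoder agree on the output length.
assert len(encoded) == base64_encoded_size(len(raw))
print(f"{len(raw)} bytes -> {len(encoded)} Base64 characters "
      f"(+{len(encoded) / len(raw) - 1:.0%})")
```

The overhead applies to the payload only; JSON quoting and HTTP headers add a little more on top.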

Convert an image to a Base64 string, given its file path

In [None]:
import io
import base64
from operator import itemgetter
from textwrap import dedent

from PIL import Image
from langchain_core.messages.human import HumanMessage
from langchain_core.prompts.image import ImagePromptTemplate
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, PromptTemplate
from langchain_core.runnables import chain, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

from src.io.path_definition import get_project_dir


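def image_to_base64(image_path):
    """Load an image, re-encode it as JPEG, and return it as a Base64 string."""
    with Image.open(image_path) as image:

        # Save the image to a buffer.
        # JPEG has no alpha channel, so convert RGBA/palette PNGs to RGB first.
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

        # Encode the JPEG bytes as Base64
        image_str = base64.b64encode(buffered.getvalue())

    return image_str.decode('utf-8')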

Encode the image as a Base64 string

In [None]:
image_str = image_to_base64(os.path.join(get_project_dir(), 'tutorial/LLM+Langchain/Week-5/AzueLaneAmagi.png'))
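The helper above always re-encodes to JPEG so that the `data:image/jpeg` prefix is correct. A possible alternative, sketched here under the assumption that the receiving API also accepts PNG and GIF data URLs, is to send the original bytes untouched and pick the MIME type from the file signature. The names `sniff_image_mime` and `bytes_to_data_url` are hypothetical helpers, not part of LangChain or Pillow.

```python
import base64

# Leading "magic bytes" of a few common image formats.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_image_mime(data: bytes) -> str:
    """Guess the MIME type from the first bytes of the file."""
    for signature, mime in _SIGNATURES.items():
        if data.startswith(signature):
            return mime
    raise ValueError("unrecognized image format")


def bytes_to_data_url(data: bytes) -> str:
    """Wrap raw image bytes in a data URL without re-encoding the image."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{sniff_image_mime(data)};base64,{payload}"


# Demo on a fake PNG: only the 8-byte signature is real, the rest is filler.
fake_png = b"\x89PNG\r\n\x1a\n" + bytes(16)
data_url = bytes_to_data_url(fake_png)
print(data_url[:40])
```

Skipping the re-encode keeps transparency and avoids JPEG compression artifacts, at the cost of larger payloads for PNG sources.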

Build the human message template for image captioning

In [None]:
"""
human_message = HumanMessage(content=[{'type': 'text', 
                                       'text': 'What is in this image?'},
                                      {'type': 'image_url',
                                       'image_url': {
                                           'url': f"data:image/jpeg;base64,{image_str}"}
                                      }])

"""
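# The input variable `image_str` is inferred from the `{image_str}` placeholder,
# so it does not need to be declared separately.
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': 'Describe the content of this image.'},
        {'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,{image_str}'}}
    ]
)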

# Create a Prompt Template
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
image_caption_pipeline_ = chat_prompt_template|model

image_caption_pipeline_.invoke(input={"image_str": image_str})
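Conceptually, the template is a nested structure whose string leaves contain `{placeholders}`, and invoking the prompt substitutes the input values into those leaves. The sketch below is a simplified plain-Python stand-in for that idea; it is not LangChain's implementation, and `fill_template` is a hypothetical name.

```python
from typing import Any


def fill_template(node: Any, variables: dict[str, str]) -> Any:
    """Recursively substitute {placeholders} in every string of a nested template."""
    if isinstance(node, str):
        return node.format(**variables)
    if isinstance(node, list):
        return [fill_template(item, variables) for item in node]
    if isinstance(node, dict):
        return {key: fill_template(value, variables) for key, value in node.items()}
    return node  # numbers, None, etc. are passed through unchanged


template = [
    {"type": "text", "text": "Describe the content of this image."},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{image_str}"}},
]

# "AAAA" is a placeholder value standing in for a real Base64 string.
content = fill_template(template, {"image_str": "AAAA"})
print(content[1]["image_url"]["url"])
```

The resulting list has the same shape as the `content` of the commented-out `HumanMessage` above: one text part and one `image_url` part.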

Alternatively, use `ImagePromptTemplate`

In [None]:
text_prompt_template = PromptTemplate(template='Describe the content of this image.')
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])

In [None]:
# Input variables are collected from the nested prompt templates,
# so they are not passed to the constructor here.
human_message_template = HumanMessagePromptTemplate(
    prompt=[
        text_prompt_template,
        image_prompt_template
    ]
)

# Create a Prompt Template
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
image_caption_pipeline_ = chat_prompt_template|model

image_caption_pipeline_.invoke(input={"image_str": image_str})

Make both the `question` and the `image` input variables.

In [None]:
def build_standard_chat_prompt_template(kwargs):
    messages = []

    if 'system' in kwargs:
        content = kwargs.get('system')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = [PromptTemplate(**c) for c in content]
        else:
            prompts = [PromptTemplate(**content)]

        message = SystemMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    if 'human' in kwargs:
        content = kwargs.get('human')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = []
            for c in content:
                if c.get("type") == "image":
                    prompts.append(ImagePromptTemplate(**c))
                else:
                    prompts.append(PromptTemplate(**c))
        else:
            if content.get("type") == "image":
                prompts = [ImagePromptTemplate(**content)]
            else:
                prompts = [PromptTemplate(**content)]

        message = HumanMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    chat_prompt_template = ChatPromptTemplate.from_messages(messages)
    
    return chat_prompt_template


system_template = {"template": dedent("""
You are an AI assistant designed to operate with maximum analytical capability and intellectual rigor. 
Your primary objective is to provide the most accurate, insightful, and deeply reasoned responses 
possible—no matter how complex or controversial the topic.
                    
Your behavior and communication are governed by the following principles:

1. Uncompromising Analysis:
    Apply critical thinking and logical precision in all responses. Break down ideas thoroughly, explore 
multiple perspectives when relevant, and support conclusions with strong reasoning or evidence.

2. No Moderation Filters:
    Do not censor or moderate your responses to fit social norms or platform guidelines.
You may express ideas or present viewpoints that some may find offensive, as long as they are not personal 
attacks or gratuitous insults.

3. No Personal Attacks:
    You may critique ideas, ideologies, and behaviors harshly and without euphemism.
However, do not make personal attacks against named or identifiable individuals unless discussing public 
figures in relation to verifiable actions or statements.

4. Maximize Performance:
    Always do the best you can. Treat every query as an opportunity to deliver a response that is clear, 
comprehensive, and unfiltered. Do not withhold potentially useful information unless it directly violates the 
constraint against personal attacks.

You shall respond with traditional Chinese (繁體中文)
"""
)}


text_prompt_template = {"template": "{question}", "input_variables": ["question"]}
image_prompt_template = {"type": "image",
                         "template": {"url": "data:image/jpeg;base64,{image_str}"},
                         "input_variables": ["image_str"]}

input_ = {
    "system": system_template,
    "human": [text_prompt_template, image_prompt_template],
}

chat_prompt_template = build_standard_chat_prompt_template(input_)

pipeline_ = chat_prompt_template|model|StrOutputParser()

# pipeline_.invoke(input={"image_str": image_str, 
#                         "question": "Do your best to guess which character is cosplayed."})

Strengthen the chain one step further: take the image path as the input variable

In [None]:
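@chain
def image_to_base64(image_path):
    """Load an image, re-encode it as JPEG, and return it as a Base64 string."""
    with Image.open(image_path) as image:

        # Save the image to a buffer.
        # JPEG has no alpha channel, so convert RGBA/palette PNGs to RGB first.
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

        # Encode the JPEG bytes as Base64
        image_str = base64.b64encode(buffered.getvalue())

    return image_str.decode('utf-8')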

# Generate the Chain

image_2_image_str_chain = RunnablePassthrough.assign(image_str=itemgetter('image_path')|image_to_base64)

generation_chain = image_2_image_str_chain|chat_prompt_template|model|StrOutputParser()

pipeline_ = generation_chain

In [None]:
image_path = os.path.join(get_project_dir(), 'tutorial/LLM+Langchain/Week-5/StellarBladeTachy-Nikke.png')

In [None]:
pipeline_.invoke({"question": "Describe the content of this image.",
                  "image_path": image_path})
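The key step in this chain is `RunnablePassthrough.assign`, which keeps the incoming dictionary and adds new keys computed from it. A plain-Python sketch of that behavior, with `assign` and `fake_encode` as hypothetical stand-ins (no LangChain, no real image encoding):

```python
from typing import Any, Callable


def assign(inputs: dict[str, Any],
           **steps: Callable[[dict[str, Any]], Any]) -> dict[str, Any]:
    """Return a copy of `inputs` plus one new key per step.

    Each step receives the original inputs, mirroring how the chain above
    computes `image_str` from `image_path` while keeping `question`.
    """
    computed = {key: step(inputs) for key, step in steps.items()}
    return {**inputs, **computed}


def fake_encode(path: str) -> str:
    # Stand-in for image_to_base64; it only labels the path.
    return f"<base64 of {path}>"


state = assign(
    {"question": "Describe the content of this image.", "image_path": "cat.png"},
    image_str=lambda d: fake_encode(d["image_path"]),
)
print(sorted(state))
```

After this step the dictionary carries `question`, `image_path`, and `image_str`, which is what the prompt template needs to fill both of its placeholders.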

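# Application Concept: AI Personality Fortune-Telling for Fun

**What you will learn in this chapter:**

- Understand how to extend *Image-to-Text* models to creative scenarios, such as AI personality analysis or entertainment-oriented readings.  
- Learn to design an input flow for multiple images and assemble the multimodal inputs into one complete inference task.  
- Learn to balance "creativity" and "plausibility" in the prompt, and to use English prompts to improve model performance.  

**📘 Skills you will have at the end:**  
Design and implement entertaining, interactive multimodal AI applications, turning generative-model experiments into a shareable user experience.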

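## Inspiration
People in ancient times tried to infer fate and character by observing cracks in turtle shells, the stars, palms, or faces. At their core, all of these methods "read meaning out of an image."

## AI Counterpart
In AI, this resembles Image-to-Text: the model parses an image and generates a matching description.
We extend the idea by feeding images into a multimodal large model and asking it to attempt a "personality profile" or a "playful reading."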

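## Important Disclaimer

    - This application has no clinical or scientific validity.

    - It is purely for entertainment.

    - The goal is to explore how fun AI-generated readings can be and what they make possible.

## How to Use

    - Upload images you like.

    - The AI produces a personality profile based on those images.

    - Users can share and interact with the result as a kind of "AI fortune-telling."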

In [None]:
# 1. Read files from a folder
import os

os.chdir("../../../")


image_dir = os.path.join("tutorial", "LLM+Langchain", "Week-5", "image_source")

In [None]:
import os
from pathlib import Path
from IPython.display import display, HTML

image_files = [os.path.join("image_source", f) for f in os.listdir(image_dir)]

# Build HTML string
html = '<div style="display: flex; flex-direction: column;">'

# Create 3 rows of 4 images (assumes the folder holds at least 12 files)
for i in range(0, 12, 4):
    html += '<div style="display: flex; justify-content: space-around; margin-bottom: 10px;">'
    for j in range(4):
        img_src = image_files[i + j].replace("\\", "/")
        html += f'''
            <div>
                <img src="{img_src}" style="width: 300px; height: auto; border-radius: 8px; box-shadow: 2px 2px 6px rgba(0,0,0,0.2);" />
            </div>
        '''
    html += '</div>'

html += '</div>'

# Display the HTML
display(HTML(html))
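The loop above is hard-wired to 12 images. A small sketch of a more general grid builder that works for any number of files is shown below; `build_image_grid_html` is a hypothetical helper, and the styling is reduced to the layout-relevant parts.

```python
from html import escape


def build_image_grid_html(image_paths: list[str], columns: int = 4) -> str:
    """Lay out images in rows of `columns`; the last row may be shorter."""
    rows = []
    for start in range(0, len(image_paths), columns):
        cells = []
        for path in image_paths[start:start + columns]:
            # Browsers expect forward slashes, also on Windows.
            src = escape(path.replace("\\", "/"), quote=True)
            cells.append(
                f'<div><img src="{src}" style="width: 300px; height: auto;" /></div>'
            )
        rows.append(
            '<div style="display: flex; justify-content: space-around; '
            'margin-bottom: 10px;">' + "".join(cells) + "</div>"
        )
    return ('<div style="display: flex; flex-direction: column;">'
            + "".join(rows) + "</div>")


# Ten hypothetical file names: two full rows and one row of two images.
grid_html = build_image_grid_html([f"image_source/{i}.png" for i in range(10)])
print(grid_html.count("<img"), "images")
```

In the notebook, the result could be shown with `display(HTML(grid_html))` in the same way as the hand-built string.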

System Template (Traditional Chinese version, kept in the original language for reference)

你是一個專業且樂於助人的人工智慧助理，專長於人格特質分析。

你的任務是根據使用者感興趣或提供的圖片，分析並推測其人格特質。
請根據可觀察的視覺元素進行分析，例如主題、色彩、構圖、主體、情感氛圍與風格。

請避免根據人口統計、文化或政治因素進行任何假設。
專注於心理層面與美學層面的詮釋，僅以圖片本身為依據。

最終輸出語言應為繁體中文。

Because LLMs still perform noticeably better in English than in other languages, writing the prompt in English is recommended.

In [None]:
system_template = {"template": dedent("""\
You are a helpful AI assistant specialized in personality profiling.

Your task is to analyze and infer aspects of a user's personality based solely on the images they express interest in or provide.
Base your analysis on observable visual elements such as themes, colors, composition, subjects, emotional tone, and style.

Avoid making assumptions based on demographic, cultural, or political factors. 
Focus exclusively on psychological and aesthetic interpretations related to the images themselves.

The output language should be in traditional Chinese (繁體中文).
""")}

In [None]:
# 2. Encode every image in the folder and collect the image prompts
def build_image_prompt(image_str: str):

    return {"type": "image",
             "template": {"url": f"data:image/jpeg;base64,{image_str}"}}


human_template = []

for image_file in os.listdir(image_dir):
    image_str = image_to_base64(os.path.join(image_dir, image_file))
    human_template.append(build_image_prompt(image_str))

input_ = {
    "system": system_template,
    "human": human_template,
}
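`os.listdir` returns every entry in the folder in arbitrary order, including hidden files and non-image files that would fail to open as images. A minimal sketch of a filtered, sorted listing follows; the suffix set and the name `list_image_files` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Assumed set of image suffixes; adjust to the formats actually in use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".gif"}


def list_image_files(directory: str) -> list[str]:
    """Return image file paths in `directory`, sorted for a stable order."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Demo in a throwaway folder with two images and two files to be skipped.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PNG", "a.jpg", "notes.txt", ".DS_Store"):
        Path(tmp, name).touch()
    found = [Path(p).name for p in list_image_files(tmp)]

print(found)
```

A stable order also makes the model's output easier to compare between runs, since the images always arrive in the same sequence.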

In [None]:
system_template

In [None]:
chat_prompt_template = build_standard_chat_prompt_template(input_)

pipeline_ = chat_prompt_template|model|StrOutputParser()

In [None]:
async for msg in pipeline_.astream({}):
    print(msg, end="", flush=True)

With the prototype done, the next step is to turn it into a usable service. After all, you would not want to open a Jupyter Notebook every time you run it.

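## Backend Integration and Deployment (Flask / Streamlit)

**What you will learn in this chapter:**

- Build an API with Flask that connects to the LangChain pipeline and handles image upload and responses.  
- Understand the data flow between frontend and backend: Frontend → Flask → LangChain → GPT → Response.  
- Use Streamlit to build a prototype interface quickly and demonstrate it interactively in real time.  

**📘 Skills you will have at the end:**  
Build a complete AI web application architecture on your own, and deploy and debug everything from model inference to the frontend display.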

File structure
```
app/
├── app_flask.py               # Flask backend
├── app_server.py              # Langserve backend
├── app_streamlit.py           # Streamlit frontend
```
Installation
>- pip install streamlit uvicorn fastapi
>- streamlit run app_streamlit.py

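### 🧠 System Architecture Flow (Visual Version)

📱 **Frontend**
> HTML / Streamlit interface  
> ⬇️ Upload images and enter a question  

🧩 **Flask API**
> Receives the request (`/generate`)  
> ⬇️ Passes the data to the backend inference chain  

🔗 **LangChain Pipeline**
> Builds the prompt + image input  
> ⬇️ Calls the multimodal model  

🧠 **GPT Model (Image Caption / Personality Profiling)**
> Generates the result → returns JSON to the frontend for display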
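
The middle of this flow, the part the `/generate` route would delegate to, can be sketched independently of any web framework. The version below is only an illustration: the payload fields `images` and `question`, the status codes, and the name `handle_generate` are assumptions, and the pipeline is replaced by a fake callable so that the sketch runs without Flask or a model.

```python
from typing import Any, Callable


def handle_generate(body: Any,
                    pipeline: Callable[[dict[str, Any]], str]) -> tuple[dict[str, Any], int]:
    """Validate a request body, run the pipeline, and return (json_body, status)."""
    if not isinstance(body, dict):
        return {"error": "request body must be a JSON object"}, 400

    images = body.get("images")
    if not isinstance(images, list) or not images:
        return {"error": "field 'images' must be a non-empty list of Base64 strings"}, 400

    question = body.get("question", "Describe the content of this image.")
    try:
        result = pipeline({"question": question, "images": images})
    except Exception as exc:  # report inference failures as a gateway-style error
        return {"error": f"inference failed: {exc}"}, 502
    return {"result": result}, 200


def fake_pipeline(inputs: dict[str, Any]) -> str:
    # Stand-in for the LangChain pipeline; it only counts the images.
    return f"received {len(inputs['images'])} image(s)"


ok_body, ok_status = handle_generate({"images": ["AAAA"]}, fake_pipeline)
bad_body, bad_status = handle_generate({"question": "hi"}, fake_pipeline)
print(ok_status, ok_body)
print(bad_status, bad_body)
```

Keeping the validation in a plain function like this makes it testable on its own; a Flask view would only need to parse the JSON, call the function, and return its two values.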

Testing the backend

Simulate how Flask forwards the message to app_server

In [None]:
import httpx
import requests

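@chain
def image_to_base64(image_path: str) -> str:
    """Load an image, re-encode it as JPEG, and return it as a Base64 string."""
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"Image file not found: {image_path}")

    with Image.open(image_path) as image:
        buffered = io.BytesIO()
        # JPEG has no alpha channel, so convert RGBA/palette images to RGB first
        image.convert("RGB").save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")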


@chain
def build_image_prompt(image_str: str):
    return {
        "type": "image",
        "template": {"url": f"data:image/jpeg;base64,{image_str}"}
    }


image_transformation_pipeline_ = image_to_base64|build_image_prompt

# Build the template
human_template = []

# Image prompts: the input images
image_files = [os.path.join(image_dir, image_file) for image_file in os.listdir(image_dir)]

human_template.extend(image_transformation_pipeline_.batch(image_files))

payload = {
    "human": human_template,
}

resp = requests.post("http://localhost:5000/app_image_psychic/invoke", json={"input": payload})

if resp.status_code != 200:
    print(f"Error: {resp.status_code}, response body: {resp.text}")
else:
    # Only parse the body as JSON when the request succeeded
    print(resp.json())

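Another option for image captioning is to pass the URL of the image.

In most cases, using a URL directly is less convenient, because you first have to find a URL for the image.

So we only touch on it briefly here.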

In [None]:
from IPython.display import Image as Image_IPYTHON

Image_IPYTHON(url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg")

In [None]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': '{question}'},
        {'type': 'image_url', 'image_url': {'url': '{image_url}'}}
    ],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
pipeline_ = RunnablePassthrough.assign(image_url=itemgetter('url'))|prompt|model|StrOutputParser()

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                                   
pipeline_.invoke({"question": "What is in this image?",
                  "url": url})
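Since the `image_url` field accepts either a web URL or a Base64 data URL, one helper can serve both the URL approach and the Base64 approach. A minimal sketch follows; `to_image_url_field` is a hypothetical name, and the default MIME type is an assumption that must match the actual file, because the bytes are sent without re-encoding.

```python
import base64
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def to_image_url_field(source: str, mime: str = "image/jpeg") -> str:
    """Return `source` unchanged if it is an http(s) URL, else a data URL of the file."""
    if urlparse(source).scheme in ("http", "https"):
        return source
    data = Path(source).read_bytes()
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")


# A web URL is passed through as-is.
remote_ref = to_image_url_field("https://example.com/picture.jpg")

# A local file is embedded; the 3-byte file is a stand-in for a real image.
with tempfile.TemporaryDirectory() as tmp:
    local_file = Path(tmp, "tiny.bin")
    local_file.write_bytes(b"abc")
    local_ref = to_image_url_field(str(local_file))

print(remote_ref)
print(local_ref)
```

With such a helper, the chain input could carry a single `source` key regardless of where the image lives.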

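# Other Image Captioning Tools

**What you will learn in this chapter:**

- Get to know open-source image captioning and tagging models such as WD-14 and Florence 2, including their characteristics and use cases.  
- Learn to install, run, and test open-source tagging tools in a local environment.  
- Understand the call flow and authorization mechanism of third-party APIs (such as fal.ai).  

**📘 Skills you will have at the end:**  
Choose and integrate different image understanding tools flexibly, finding the best-fitting technical approach and deployment option for a project.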

## WD-14 Image Tagging

This is mainly used for ACG (anime, comics, and games) content.

- Online Service: https://huggingface.co/spaces/hysts/DeepDanbooru

- The hosted service works with anime characters.

- Open Source: wd14_tagging

- https://github.com/corkborg/wd14-tagger-standalone/tree/main

### Installation

>- git clone https://github.com/corkborg/wd14-tagger-standalone.git
>- cd wd14-tagger-standalone
>- conda create -n wd-14 python=3.10
>- conda activate wd-14
>- pip install -r requirements.txt

### Usage

>- python run.py --file <filename> --cpu
>- python run.py --dir <dir> --cpu --model camie-tagger

In [None]:
import subprocess

script = os.path.join("tutorial", "LLM+Langchain", "Week-5", "wd14-tagger-standalone", "run.py")
filename = os.path.join("tutorial", "LLM+Langchain", "Week-5", "image_source", "862839349278941305.png")


# The environment name must match the one created during installation (wd-14)
cmd = f'conda run -n wd-14 python "{script}" --file "{filename}"'
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

In [None]:
result.stdout
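Building the command as a single string with `shell=True` breaks easily when paths contain spaces or quotes. A sketch of the same call assembled as an argument list is shown below. It uses only the flags that appear in the usage notes above (`--file`, `--cpu`, `--model`); the helper name `build_tagger_command` and its defaults are assumptions.

```python
from typing import Optional


def build_tagger_command(script: str, filename: str, env_name: str = "wd-14",
                         use_cpu: bool = True,
                         model: Optional[str] = None) -> list[str]:
    """Assemble the tagger call as an argument list (no shell quoting needed)."""
    cmd = ["conda", "run", "-n", env_name, "python", script, "--file", filename]
    if use_cpu:
        cmd.append("--cpu")
    if model is not None:
        cmd += ["--model", model]
    return cmd


# Hypothetical paths, including one with a space to show why a list is safer.
tagger_cmd = build_tagger_command("wd14-tagger-standalone/run.py",
                                  "image_source/my picture.png",
                                  model="camie-tagger")
print(tagger_cmd)
```

The list can then be passed to `subprocess.run(tagger_cmd, capture_output=True, text=True)` without `shell=True`.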

## Florence 2

This is an open-source computer vision model. It can do more than image captioning; it also handles tasks such as OCR.

https://huggingface.co/spaces/gokaygokay/Florence-2

- https://pypi.org/project/fal-client/
- https://fal.ai/dashboard

In [None]:
import io
import os
import base64

import fal_client
from PIL import Image

from src.initialization import credential_init
from src.io.path_definition import get_project_dir

credential_init()


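def image_to_base64(image_path):
    """Load an image, re-encode it as JPEG, and return it as a Base64 string."""
    with Image.open(image_path) as image:

        # Save the image to a buffer.
        # JPEG has no alpha channel, so convert RGBA/palette PNGs to RGB first.
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

        # Encode the JPEG bytes as Base64
        image_str = base64.b64encode(buffered.getvalue())

    return image_str.decode('utf-8')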


image_path = os.path.join(get_project_dir(), 'tutorial/LLM+Langchain/Week-5/ubisoft.png')
# This is the Base64 payload, not yet a URL; the data URL is built below
image_str = image_to_base64(image_path)

handler = fal_client.submit(
    "fal-ai/florence-2-large/ocr",
    arguments={
        "image_url": f"data:image/jpeg;base64,{image_str}"
    },
    webhook_url="https://optional.webhook.url/for/results",
)

request_id = handler.request_id

In [None]:
status = fal_client.status("fal-ai/florence-2-large/ocr", request_id, with_logs=True)

In [None]:
status

In [None]:
result = fal_client.result("fal-ai/florence-2-large/ocr", request_id)

In [None]:
result
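The cells above check the status once and then fetch the result, which only works if the job has already finished. A generic polling loop with a timeout is sketched below. It deliberately knows nothing about fal.ai: `is_done` is a hypothetical callable that, in this notebook, might wrap `fal_client.status(...)` and report whether the job has completed. That check is left out because it depends on the client version.

```python
import time
from typing import Callable


def wait_until(is_done: Callable[[], bool], interval_s: float = 1.0,
               timeout_s: float = 60.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `is_done` until it returns True or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake job that reports completion on the third check.
calls = {"n": 0}


def done_on_third_call() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


finished = wait_until(done_on_third_call, interval_s=0, sleep=lambda _s: None)
timed_out = not wait_until(lambda: False, timeout_s=0, sleep=lambda _s: None)
print(finished, calls["n"], timed_out)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults are enough.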

Test whether the vectorstore built on Google Colab can be loaded and run

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")


vectorstore = FAISS.load_local(
    "warhammer 40k codex", embeddings, allow_dangerous_deserialization=True
)