# Image Caption

## Image Caption with Multimodal LLM

In [1]:
from IPython.display import display, HTML

# Define the HTML to display images side by side
html = """
<div style="display: flex; justify-content: space-around;">
    <div>
        <img src="StellarBladeTachy-Nikke.png" height="900" width="600" />
    </div>
    <div>
        <img src="AzueLaneAmagi.png" height="900" width="600" />
    </div>
</div>
"""

# Display the HTML
display(HTML(html))

In [3]:
import os

os.chdir("../../../")

In [4]:
from langchain_openai import ChatOpenAI

from src.initialization import credential_init

credential_init()

model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4o-2024-05-13", temperature=0)

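If an API accepts only text data (for example, a JSON payload), the image is first converted to a Base64 string and then sent to the service. If the API supports file uploads or URLs, the image can be sent directly, with no Base64 step.

Common approaches to LLM image captioning in practice:

- Method A: pass the image URL directly (the simplest option, and it avoids the roughly 33% size increase that Base64 adds).
- Method B: convert the image to Base64 and embed it in the JSON sent to the model (if the API requires it).
- Method C: upload via multipart/form-data (like an ordinary file upload; the most efficient option).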
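To make the roughly 33% figure concrete, here is a minimal sketch using only the standard library. Random bytes stand in for an image file's raw contents (an assumption for illustration; any binary data behaves the same way).

```python
import base64
import os

# Random bytes stand in for the raw contents of an image file.
raw = os.urandom(30_000)
encoded = base64.b64encode(raw)

# Base64 maps every 3 input bytes to 4 output characters.
overhead = len(encoded) / len(raw) - 1
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes, overhead: {overhead:.1%}")
```

The overhead comes purely from the encoding, so it applies to every image sent as a Base64 string regardless of its format.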

Convert an image to a Base64 string, given its file name

In [5]:
import io
import base64
from textwrap import dedent

from PIL import Image
from langchain_core.messages.human import HumanMessage
from langchain_core.prompts.image import ImagePromptTemplate
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, PromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser

from src.io.path_definition import get_project_dir


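def image_to_base64(image_path):
    """Load an image file and return it as a Base64-encoded JPEG string."""
    with Image.open(image_path) as image:

        # JPEG has no alpha channel, so convert RGBA/palette images to RGB first
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

        # Encode the JPEG bytes to Base64
        image_str = base64.b64encode(buffered.getvalue())

    return image_str.decode('utf-8')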

In [6]:
image_str = image_to_base64(os.path.join(get_project_dir(), 'tutorial/LLM+Langchain/Week-5/AzueLaneAmagi.png'))

In [None]:
"""
human_message = HumanMessage(content=[{'type': 'text', 
                                       'text': 'What is in this image?'},
                                      {'type': 'image_url',
                                       'image_url': {
                                           'url': f"data:image/jpeg;base64,{image_str}"}
                                      }])

"""
# Input variables are inferred from the `{image_str}` placeholder in the template
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': '描述圖片內容'},  # "Describe the image content"
        {'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,{image_str}'}}
    ]
)

# Create a Prompt Template
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
image_caption_pipeline_ = chat_prompt_template|model

image_caption_pipeline_.invoke(input={"image_str": image_str})

Alternatively, build the same prompt from separate prompt-template classes

In [None]:
text_prompt_template = PromptTemplate(template='描述圖片內容')
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])

In [None]:
# Each sub-template declares its own input variables, so none are passed here
human_message_template = HumanMessagePromptTemplate(
    prompt=[
        text_prompt_template,
        image_prompt_template
    ]
)

# Create a Prompt Template
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
image_caption_pipeline_ = chat_prompt_template|model

image_caption_pipeline_.invoke(input={"image_str": image_str})

Make both the `question` and the `image` input variables.

In [9]:
def build_standard_chat_prompt_template(kwargs):
    messages = []

    if 'system' in kwargs:
        content = kwargs.get('system')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = [PromptTemplate(**c) for c in content]
        else:
            prompts = [PromptTemplate(**content)]

        message = SystemMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    if 'human' in kwargs:
        content = kwargs.get('human')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = []
            for c in content:
                if c.get("type") == "image":
                    prompts.append(ImagePromptTemplate(**c))
                else:
                    prompts.append(PromptTemplate(**c))
        else:
            if content.get("type") == "image":
                prompts = [ImagePromptTemplate(**content)]
            else:
                prompts = [PromptTemplate(**content)]

        message = HumanMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    chat_prompt_template = ChatPromptTemplate.from_messages(messages)
    
    return chat_prompt_template


system_template = {"template": dedent("""
You are an AI assistant designed to operate with maximum analytical capability and intellectual rigor. 
Your primary objective is to provide the most accurate, insightful, and deeply reasoned responses 
possible—no matter how complex or controversial the topic.
                    
Your behavior and communication are governed by the following principles:

1. Uncompromising Analysis:
    Apply critical thinking and logical precision in all responses. Break down ideas thoroughly, explore 
multiple perspectives when relevant, and support conclusions with strong reasoning or evidence.

2. No Moderation Filters:
    Do not censor or moderate your responses to fit social norms or platform guidelines.
You may express ideas or present viewpoints that some may find offensive, as long as they are not personal 
attacks or gratuitous insults.

3. No Personal Attacks:
    You may critique ideas, ideologies, and behaviors harshly and without euphemism.
However, do not make personal attacks against named or identifiable individuals unless discussing public 
figures in relation to verifiable actions or statements.

4. Maximize Performance:
    Always do the best you can. Treat every query as an opportunity to deliver a response that is clear, 
comprehensive, and unfiltered. Do not withhold potentially useful information unless it directly violates the 
constraint against personal attacks.

You shall respond with traditional Chinese (繁體中文)
"""
)}


text_prompt_template = {"template": "{question}", "input_variables": ["question"]}
image_prompt_template = {"type": "image",
                         "template": {"url": "data:image/jpeg;base64,{image_str}"},
                         "input_variables": ["image_str"]}

input_ = {
    "system": system_template,
    "human": [text_prompt_template, image_prompt_template],
}

chat_prompt_template = build_standard_chat_prompt_template(input_)

pipeline_ = chat_prompt_template|model|StrOutputParser()

# pipeline_.invoke(input={"image_str": image_str, 
#                         "question": "Do your best to guess which character is cosplayed."})

Take the chain one step further: use the image path as the input variable

In [None]:
from operator import itemgetter

from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser


@chain
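def image_to_base64(image_path):
    """Load an image file and return it as a Base64-encoded JPEG string."""
    with Image.open(image_path) as image:

        # JPEG has no alpha channel, so convert RGBA/palette images to RGB first
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

        # Encode the JPEG bytes to Base64
        image_str = base64.b64encode(buffered.getvalue())

    return image_str.decode('utf-8')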

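# Generate the Chain

# Pull the path out of the input dict and encode the image it points to
image_2_image_str_chain = itemgetter('image_path')|image_to_base64

# Add `image_str` to the input dict, then format the prompt and call the model
generation_chain = RunnablePassthrough.assign(image_str=image_2_image_str_chain)|chat_prompt_template|model|StrOutputParser()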

pipeline_ = generation_chain
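Conceptually, `RunnablePassthrough.assign(image_str=...)` keeps the incoming dict and adds one key computed from it, which is why both `question` and `image_str` reach the prompt. The sketch below imitates that behavior in plain Python; `assign` and `fake_encode` are hypothetical stand-ins written for illustration, not LangChain APIs.

```python
def assign(**steps):
    """Return a function that copies the input dict and adds one key per step."""
    def run(inputs: dict) -> dict:
        # Every step receives the original input dict, not other steps' outputs.
        return {**inputs, **{key: step(inputs) for key, step in steps.items()}}
    return run


def fake_encode(inputs: dict) -> str:
    """Hypothetical stand-in for the real image-to-Base64 step."""
    return f"<base64 of {inputs['image_path']}>"


add_image_str = assign(image_str=fake_encode)
result = add_image_str({"question": "Describe the image", "image_path": "cat.png"})
print(result)
```

The original keys pass through untouched, so later steps in the chain can still read `question` and `image_path`.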

In [None]:
image_path = os.path.join(get_project_dir(), 'tutorial/LLM+Langchain/Week-5/StellarBladeTachy-Nikke.png')

In [None]:
pipeline_.invoke({"question": "描述圖片內容",
                  "image_path": image_path})

1. prefix pipeline

In [None]:
from langchain_core.messages import HumanMessage, SystemMessage

system_message = SystemMessage((
    "You are a prompt engineering assistant for a multimodal AI system that generates detailed captions and visual descriptions from images. "
    "Your role is to create a high-quality **prefix** that sets up the image understanding task clearly and effectively.\n\n"
    "**Task:**\n\n"
    "Given:\n"
    "- A user instruction or question that describes what they want to extract or understand from an image.\n"
    "- Access to the image itself, which you can analyze directly.\n\n"
    "Your output:\n"
    "- A concise, informative **prefix** that guides the AI model to interpret the image in a way that aligns with the user's intent.\n"
    "- The prefix should **clarify the goal** of the captioning task, using relevant visual context and domain-specific framing if appropriate.\n"
    "- Do **not repeat or rephrase the user's instruction**. Instead, infer the **underlying purpose** or focus behind it and express that clearly.\n\n"
    "**Guidelines:**\n"
    "- Keep the prefix factual, neutral, and task-oriented.\n"
    "- Use domain-specific language if the image content relates to a particular field (e.g., fashion, medicine, design, food).\n"
    "- Focus on setting up **what** should be described, not **how** to describe it.\n"
    "- Do not include instructions or formatting directions in the prefix.\n"
    "- Return only the prefix as a single-line string.\n\n"
    "**Example Output Style:**\n"
    "- \"Identify key visual elements that indicate the building's architectural style.\"\n"
    "- \"Describe the condition and details of the skin around the affected area.\"\n"
    "- \"Highlight notable clothing features and accessories relevant to street fashion.\"\n\n"
    "Return only the prefix as a string."
)
                              )

text_prompt_template = PromptTemplate(template='{question}',
                                      input_variables=['question'])
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])

human_message_template = HumanMessagePromptTemplate(
    prompt=[text_prompt_template,
            image_prompt_template],
)

prefix_prompt = ChatPromptTemplate.from_messages([system_message, 
                                                  human_message_template])

prefix_pipeline = prefix_prompt|model|StrOutputParser()

prefix_pipeline_adapted = RunnablePassthrough.assign(image_str=image_2_image_str_chain)|prefix_pipeline

prefix_pipeline_adapted.invoke({"question": "Who is this character in Blue Archive?",
                                "image_path": image_path})

2. suffix pipeline

In [None]:
system_message = SystemMessage(content=(
    "You are a prompt engineering assistant for a multimodal AI system that generates detailed captions and visual descriptions from images. "
    "Your role is to create a high-quality **suffix** for a prompt that guides the captioning model to tailor its response based on the user's intent.\n\n"
    "**Task:**\n\n"
    "Given:\n"
    "- A user instruction or question describing what they want to generate or understand from an image.\n"
    "- Access to the image itself, which you can analyze directly.\n\n"
    "Your output:\n"
    "- A short, focused **suffix** that refines how the image caption or description should be delivered.\n"
    "- The suffix should align with the **tone, specificity, or perspective** implied by the user’s instruction (e.g., analytical, descriptive, comparative, empathetic).\n"
    "- Avoid repeating or paraphrasing the user’s instruction.\n\n"
    "**Guidelines:**\n"
    "- Use the suffix to subtly guide the model’s **style**, **depth**, or **focus**, based on the inferred user need.\n"
    "- You may emphasize elements like object relationships, spatial layout, visual aesthetics, emotional tone, or technical detail as appropriate.\n"
    "- Keep the suffix brief, natural, and relevant to the captioning goal.\n"
    "- Do not include formatting directions or break character.\n"
    "- Return only the suffix as a single-line string.\n\n"
    "**Example Output Style:**\n"
    "- \"Focus on the interaction between subjects and their environment.\"\n"
    "- \"Use a neutral, clinical tone for medical accuracy.\"\n"
    "- \"Include sensory details that evoke mood or atmosphere.\"\n"
    "- \"Highlight visual contrasts and compositional balance.\"\n\n"
    "Return only the suffix as a string."
)
                              )

text_prompt_template = PromptTemplate(template='{question}',
                                      input_variables=['question'])
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])

human_message_template = HumanMessagePromptTemplate(
    prompt=[text_prompt_template,
            image_prompt_template],
)

suffix_prompt = ChatPromptTemplate.from_messages([system_message, 
                                                  human_message_template])

suffix_pipeline = suffix_prompt|model|StrOutputParser()

suffix_pipeline_adapted = RunnablePassthrough.assign(image_str=image_2_image_str_chain)|suffix_pipeline

suffix_pipeline_adapted.invoke({"question": "Who is this character in Blue Archive?",
                                "image_path": image_path})

Final pipeline

In [None]:
final_system_message = SystemMessage(content=("You are an advanced multimodal assistant capable of interpreting both images and text-based "
                                             "instructions. You will receive a combined prompt structured in three parts:\n\n"
                                             "1. A **prefix** that provides helpful context or framing for the image and task.\n"
                                             "2. A **user instruction or question** describing what they want to extract or understand from "
                                             "the image.\n"
                                             "3. A **suffix** that clarifies tone, level of detail, or formatting expectations for the output.\n\n"
                                             "You will also receive an image alongside the prompt. Your job is to generate a response that is:\n"
                                             "- Accurate and relevant to the image.\n"
                                             "- Aligned with the goal implied by the prefix.\n"
                                             "- Responsive to the user’s instruction.\n"
                                             "- Refined according to the suffix.\n\n"
                                             "Make sure to analyze the image carefully, follow the structure, and respect the user’s intent and tone.\n\n"
                                             "Format:\n"
                                             "Your response should be clear, complete, and follow any guidelines implied by the suffix. "
                                             "Avoid repeating the question, and stay focused on the visual and contextual elements relevant to "
                                             "the task."))

user_template = PromptTemplate(template='{question}',
                               input_variables=['question'])

prefix_prompt_template = PromptTemplate(template='{prefix}\n\n',
                                      input_variables=['prefix'])
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])
suffix_prompt_template = PromptTemplate(template='\n\n{suffix}',
                                      input_variables=['suffix'])

# Order the parts as the system message describes: prefix, question, image, suffix
human_message_template = HumanMessagePromptTemplate(
    prompt=[prefix_prompt_template,
            user_template,
            image_prompt_template,
            suffix_prompt_template],
)

final_prompt = ChatPromptTemplate.from_messages([final_system_message, 
                                                 human_message_template])

# Generate the Chain

generation_chain = RunnablePassthrough.assign(image_str=image_2_image_str_chain)|RunnablePassthrough.assign(prefix=prefix_pipeline,
                                                                                                            suffix=suffix_pipeline)
# Stop before the model so the assembled prompt can be inspected
pipeline_ = generation_chain|final_prompt

In [None]:
# pipeline_.invoke({"question": "Who is this character in Blue Archive?",
#                   "image_path": image_path})

In [None]:
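# `translation_function` is not defined in this notebook, so the chain ends at the string parser;
# append a translation step here if one is defined elsewhere.
pipeline_ = generation_chain|final_prompt|model|StrOutputParser()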

pipeline_.invoke({"question": "Who is this character in Blue Archive?",
                  "image_path": image_path})

In [None]:
model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4.5-preview-2025-02-27")

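# `translation_function` is not defined in this notebook, so the chain ends at the string parser;
# append a translation step here if one is defined elsewhere.
pipeline_ = generation_chain|final_prompt|model|StrOutputParser()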

pipeline_.invoke({"question": "Who is this character in Blue Archive?",
                  "image_path": image_path})

Can we enhance the user question by extending it?

In [None]:
from typing import List

from pydantic import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser


class Query(BaseModel):
    name: str = Field(description='instruction/question')

class Queries(BaseModel):
    name: List[Query] = Field(description="A list of instruction/question")

queries_output_parser = PydanticOutputParser(pydantic_object=Queries)
queries_format_instructions = queries_output_parser.get_format_instructions()


instruction_generation_system_prompt = (
    "You are a prompt engineering assistant for a multimodal AI system. "
    "Your task is to generate a list of clear, relevant, and diverse instructions or questions that would help an AI system achieve a specific user-defined goal related to image understanding.\n\n"
    "**Task:**\n\n"
    "Given:\n"
    "- A high-level user goal (e.g., 'Understand the emotional tone of the image', 'Identify objects for accessibility', 'Generate a product description').\n"
    "- Access to the image itself, which you may analyze directly.\n\n"
    "Your output:\n"
    "- A set of 3 to 7 unique, well-phrased instructions or questions that guide the AI to perform different but related tasks that collectively help fulfill the user's goal.\n"
    "- Each instruction should focus on a specific subtask or angle (e.g., describing visual elements, inferring context, identifying details, comparing regions, etc.).\n\n"
    "**Guidelines:**\n"
    "- Do not repeat the user’s goal verbatim.\n"
    "- Each instruction/question should be useful on its own but also contribute meaningfully toward the overall goal.\n"
    "- Use varied phrasing and perspectives (e.g., analytical, descriptive, contextual).\n"
    "- Consider domain-specific needs if applicable (e.g., fashion, medical imaging, architecture).\n"
    "- Avoid yes/no questions; focus on open-ended or descriptive prompts.\n"
    # The Pydantic parser expects JSON, so defer to the format instructions instead of asking for a Python list
    "- Return only the list of instructions/questions, following the output format instructions supplied with the request.\n\n")

system_prompt_template = PromptTemplate(template=instruction_generation_system_prompt)

system_message_template = SystemMessagePromptTemplate(prompt=system_prompt_template)

human_prompt_template = PromptTemplate(template='{question}',
                                      input_variables=['question'])
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])
format_prompt_template = PromptTemplate(template='output format instructions: {format_instructions}',
                                       partial_variables={"format_instructions": queries_format_instructions})

human_message_template = HumanMessagePromptTemplate(
    prompt=[human_prompt_template,
            image_prompt_template,
            format_prompt_template],
)

query_generation_prompt = ChatPromptTemplate.from_messages([system_message_template, 
                                                            human_message_template])

# Generate the Chain
image_2_image_str_chain = itemgetter('image_path')|image_to_base64
generation_chain = RunnablePassthrough.assign(image_str=image_2_image_str_chain)
pipeline_ = generation_chain|query_generation_prompt|model|queries_output_parser

In [None]:
new_queries = pipeline_.invoke({"question": "Identify the character in Blue Archive.",
                                "image_path": image_path})
print(new_queries.name)

In [None]:
# new_queries = pipeline_.invoke({"question": "Who is this character in Blue Archive?",
#                                 "image_path": image_path})

In [None]:
new_queries.name[3].name
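The nested `name` fields can be confusing: `Queries.name` is a list of `Query` objects, and each `Query` has its own `name` string. The sketch below shows the JSON shape this implies, using standard-library dataclasses as a stand-in for the Pydantic models; `raw_reply` is a hypothetical model reply written for illustration, not real output.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    name: str


@dataclass
class Queries:
    name: List[Query]


# Hypothetical model reply; the real text comes from the LLM.
raw_reply = '{"name": [{"name": "Describe the outfit."}, {"name": "Describe the background."}]}'

data = json.loads(raw_reply)
parsed = Queries(name=[Query(**item) for item in data["name"]])

# The outer .name selects the list, the inner .name the question text.
print(parsed.name[1].name)  # → Describe the background.
```

Renaming the outer field (for example to `queries`) would make the access pattern easier to read, at the cost of updating every cell that uses it.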

Now we have multiple questions/instructions. Let's run them as a batch to gather more information:

In [None]:
input_data = [{"question": query.name,
               "image_path": image_path} for query in new_queries.name]

In [None]:
input_data

In [None]:
human_prompt_template = PromptTemplate(template='{question}',
                                      input_variables=['question'])
image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                            input_variables=['image_str'])

human_message_template = HumanMessagePromptTemplate(
    prompt=[human_prompt_template,
            image_prompt_template],
)

basic_prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
image_2_image_str_chain = itemgetter('image_path')|image_to_base64
generation_chain = RunnablePassthrough.assign(image_str=image_2_image_str_chain)
basic_pipeline = generation_chain|basic_prompt|model|StrOutputParser()

In [None]:
basic_pipeline.batch(input_data)
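As far as I understand, the default `.batch` implementation runs `invoke` on each input in a thread pool and returns the outputs in input order. The sketch below imitates that pattern with the standard library; `fake_pipeline` is a hypothetical stand-in that sleeps instead of calling a model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_pipeline(inputs: dict) -> str:
    """Hypothetical stand-in for basic_pipeline.invoke; sleeps to mimic network latency."""
    time.sleep(0.1)
    return f"answer to: {inputs['question']}"


fake_inputs = [{"question": f"question {i}", "image_path": "cat.png"} for i in range(4)]

# executor.map preserves input order even though the calls overlap in time.
with ThreadPoolExecutor(max_workers=4) as executor:
    answers = list(executor.map(fake_pipeline, fake_inputs))

print(answers)
```

Because the work is I/O bound (waiting on an API), running the calls concurrently takes roughly as long as the slowest single call rather than the sum of all of them.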

As you can see, the process can become quite involved, so prompt engineering needs proper software engineering to produce high-quality results.

Pass the image URL directly as an input variable
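When inputs may be either remote URLs or local files, a small helper can normalize both into a value suitable for an `image_url` field. This is a minimal sketch: `to_image_url` is a hypothetical helper, it trusts the file extension to infer the MIME type, and it sends the file bytes as-is rather than re-encoding them as JPEG.

```python
import base64
import mimetypes
from pathlib import Path
from urllib.parse import urlparse


def to_image_url(source: str) -> str:
    """Return an http(s) URL unchanged, or encode a local file as a data URL."""
    if urlparse(source).scheme in ("http", "https"):
        return source

    # Infer the MIME type from the file extension (assumes the extension is correct).
    mime_type, _ = mimetypes.guess_type(source)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {source!r}")

    encoded = base64.b64encode(Path(source).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# A remote URL passes through untouched (example.com is a placeholder).
print(to_image_url("https://example.com/photo.jpg"))
```

Keeping the original bytes means the declared MIME type matches the actual file format, which avoids labeling PNG data as `image/jpeg`.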

In [None]:
from IPython.display import Image as Image_IPYTHON

Image_IPYTHON(url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg")

In [None]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': '{question}'},
        {'type': 'image_url', 'image_url': {'url': '{image_url}'}}
    ],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
pipeline_ = RunnablePassthrough.assign(image_url=itemgetter('url'))|prompt|model|StrOutputParser()

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                                   
pipeline_.invoke({"question": "What is in this image?",
                  "url": url})

## Multiple Images

In [None]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[{'type': 'text', 
               'text': 'What is in these images? Are there any differences between them?'},
              {'type': 'image_url',
               'image_url': {
                   'url': "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
              },
              {'type': 'image_url',
               'image_url': {
                   'url': "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
              }],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

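# format_messages() keeps the image parts structured; format() would flatten them into one string
model.invoke(prompt.format_messages())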

Any ideas you'd like to try? Let's run them live and hope nothing goes wrong.

In [None]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[{'type': 'text', 
               'text': 'What is in these images? Are there any differences between them?'},
              {'type': 'image_url',
               'image_url': {
                   'url': "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
              },
              {'type': 'image_url',
               'image_url': {
                   'url': "https://assets.warhammer-community.com/articles/88803229-7993-4e8c-a3c4-6e4fa2c38a34/zqzebys4roe7nhcd.jpg"}
              }],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

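# format_messages() keeps the image parts structured; format() would flatten them into one string
model.invoke(prompt.format_messages())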

## Image Caption with OCR

In [None]:
import logging
from typing import Tuple

from langchain_core.runnables import Runnable

from src.initialization import model_activation


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class SignatureOutput(BaseModel):
    """Pydantic model representing the signature extraction result."""
    name: str = Field(description="The signature on the image")


class BrandOutput(BaseModel):
    """Pydantic model representing the brand name derived from signature."""
    brand: str = Field(description="The brand")
    country_code: str = Field(description="ISO 3166-1 alpha-2 of the country of the brand")


def build_pipeline(steps: list) -> Runnable:

    pipeline = steps[0]
    for step in steps[1:]:
        pipeline = pipeline | step
    return pipeline


class SignatureExtraction:
    """Extracts text-based signatures from images using a vision-language pipeline."""

    def __init__(self, model_name: str):
        """Initializes the signature extraction pipeline."""
        logger.info("Initializing SignatureExtraction")

        # Error handling is in the function model_activation
        model = model_activation(model_name)

        image_to_base64_pipeline = image_to_base64

        output_parser, self.format_instructions = self._build_signature_parser()
        prompt_template = self.build_image_caption_prompt_template()

        step_1 = RunnablePassthrough.assign(image_str=itemgetter("image_path") | image_to_base64_pipeline)
        step_2 = RunnablePassthrough.assign(signature=prompt_template|model|output_parser|self.extract_name_field)

        self.pipeline = build_pipeline([step_1, step_2])

    def build_image_caption_prompt_template(self) -> ChatPromptTemplate:
        """Constructs a LangChain chat prompt for image signature captioning.

        Returns:
            ChatPromptTemplate: A chat prompt with both text and image components.
        """
        text_prompt_template = PromptTemplate(template="Please extract the signature on the image.\n"
                                                       'Output format instruction: {format_instructions}',
                                              partial_variables={"format_instructions": self.format_instructions})
        image_prompt_template = ImagePromptTemplate(template={"url": 'data:image/jpeg;base64,{image_str}'},
                                                    input_variables=['image_str'])

        human_message_template = HumanMessagePromptTemplate(
            prompt=[text_prompt_template,
                    image_prompt_template],
        )

        prompt_template = ChatPromptTemplate.from_messages([human_message_template])

        return prompt_template

    @staticmethod
    def _build_signature_parser() -> Tuple[PydanticOutputParser, str]:
        """Builds a parser for structured signature output.

        Returns:
            Tuple[PydanticOutputParser, str]: Output parser and format instructions.
        """
        output_parser = PydanticOutputParser(pydantic_object=SignatureOutput)
        format_instructions = output_parser.get_format_instructions()

        return output_parser, format_instructions

    # A plain function is coerced to a Runnable when piped with `|`, so @chain is not needed here
    @staticmethod
    def extract_name_field(pydantic_object) -> str:
        """Extracts the 'name' field from a parsed object.

        Args:
            pydantic_object (BaseModel): A Pydantic object with a `name` field.

        Returns:
            str: Extracted name.
        """
        return pydantic_object.name

In [None]:
signature_extraction = SignatureExtraction(model_name='gpt-4.1')

In [None]:
os.path.isfile("tutorial/LLM+Langchain/Week-5/figure-5-4.jpg")

In [None]:
output = signature_extraction.pipeline.invoke({"image_path": "tutorial/LLM+Langchain/Week-5/figure-5-4.jpg"})

In [None]:
output['signature']

# Other Image Caption Tools

## Danbooru Tag

- Online service: https://huggingface.co/spaces/hysts/DeepDanbooru

- The online service works with anime characters.

- Open source: wd14_tagging

- https://github.com/corkborg/wd14-tagger-standalone/tree/main

## How to use?

- `git clone https://github.com/corkborg/wd14-tagger-standalone.git`
- `conda create -n wd-14 python=3.10`
- `conda activate wd-14`
- `pip install -r requirements.txt`
- `python run.py --file <filename> --cpu`
- `python run.py --dir <dir> --cpu --model camie-tagger`

In [None]:
import os
from IPython.display import display, HTML


folder = os.path.join('wd14-tagger-standalone', 'test_folder')

# List of image filenames
image_files = [
    os.path.join(folder, "_0fbVdzjQ7PLiNrGJB4Jh.png"), os.path.join(folder, "753912269928394698.png"), 
    os.path.join(folder, "753966193242693206.png"), os.path.join(folder, "753990850649984248.png"),
    os.path.join(folder, "753999719757313315.png"), os.path.join(folder, "779517121190346958.png"), 
    os.path.join(folder, "779946965812213661.png"), os.path.join(folder, "780061856187544535.png"),
    os.path.join(folder, "782864910758693094.png"), os.path.join(folder, "783023592620321956.png"), 
    os.path.join(folder, "784020999990595324.png"), os.path.join(folder, "784063554526380296.png")
]

# Build HTML string
html = '<div style="display: flex; flex-direction: column;">'

# Create 3 rows
for i in range(0, 12, 4):
    html += '<div style="display: flex; justify-content: space-around; margin-bottom: 10px;">'
    for j in range(4):
        img_src = image_files[i + j]
        html += f'''
            <div>
                <img src="{img_src}" style="width: 600px; height: auto;" />
            </div>
        '''
    html += '</div>'

html += '</div>'

# Display the HTML
display(HTML(html))

## Florence

https://huggingface.co/spaces/gokaygokay/Florence-2

- https://pypi.org/project/fal-client/
- https://fal.ai/dashboard

In [None]:
import io
import os
import base64

import fal_client
from PIL import Image

from src.initialization import credential_init
from src.io.path_definition import get_project_dir

credential_init()


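def image_to_base64(image_path):
    """Load an image file and return it as a Base64-encoded JPEG string."""
    with Image.open(image_path) as image:

        # JPEG has no alpha channel, so convert RGBA/palette images to RGB first
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

        # Encode the JPEG bytes to Base64
        image_str = base64.b64encode(buffered.getvalue())

    return image_str.decode('utf-8')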


image_path = os.path.join(get_project_dir(), 'tutorial/LLM+Langchain/Week-5/ubisoft.png')
image_str = image_to_base64(image_path)

# Submit the OCR request; the result is polled below, so no webhook is registered
handler = fal_client.submit(
    "fal-ai/florence-2-large/ocr",
    arguments={
        "image_url": f"data:image/jpeg;base64,{image_str}"
    },
)

request_id = handler.request_id

In [None]:
status = fal_client.status("fal-ai/florence-2-large/ocr", request_id, with_logs=True)

In [None]:
status

In [None]:
result = fal_client.result("fal-ai/florence-2-large/ocr", request_id)

In [None]:
result

# Text Splitting

https://www.youtube.com/watch?v=8OJC21T2SL4

- Character Split
- Recursive Character Split
- Document Specific Splitting
- Semantic Splitting
- Agentic Splitting

Why split text at all?

1. Context limit: a language model accepts only a limited number of words/tokens per request.
2. Signal to noise: passing only the relevant chunks keeps information that isn't helpful to your task out of the prompt.

### We use a practical example:

- does-ai-really-encourage-cheating-in-schools

Design and implement a system that can summarize very long articles, given the following constraints:

- Models have a specific maximum input length
- Summarizers have a minimum and maximum summary length
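
One common way to work within an input-length limit is a map-reduce summary: summarize each chunk, join the partial summaries, and summarize again. The following is a minimal sketch under stated assumptions, not the pipeline used in this notebook. `split_fixed`, `summarize_long_text`, and `first_sentence` are hypothetical names, and `first_sentence` only stands in for a real LLM call. The minimum/maximum summary length constraint is not handled here.

```python
from typing import Callable, List


def split_fixed(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def summarize_long_text(text: str,
                        summarize: Callable[[str], str],
                        max_input_chars: int) -> str:
    """Map-reduce summary: summarize each chunk, then summarize the joined summaries."""
    if len(text) <= max_input_chars:
        return summarize(text)

    # Map step: one partial summary per chunk.
    partial = [summarize(chunk) for chunk in split_fixed(text, max_input_chars)]
    combined = "\n\n".join(partial)

    # Reduce step: recurse in case the joined summaries still exceed the limit.
    # This only terminates if `summarize` actually shortens its input.
    return summarize_long_text(combined, summarize, max_input_chars)


def first_sentence(chunk: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return chunk.split(".")[0].strip() + "."


article = "Schools worry about AI. Teachers adapt. " * 50
summary = summarize_long_text(article, first_sentence, max_input_chars=200)
print(summary)
```

In practice `summarize` would wrap a model call, and the limit would be measured in tokens rather than characters.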

In [None]:
import os

from src.io.path_definition import get_project_dir

filename = "does-ai-really-encourage-cheating-in-schools.txt"

filename_path = os.path.join(get_project_dir(), 'tutorial', 'LLM+Langchain', 'Week-5', filename)


with open(filename_path, "r", encoding="utf8") as file:
    cleaned_text = file.read()
    
print(cleaned_text)

In [None]:
from langchain_core.prompts import PromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 1024
chunk_overlap = 128

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

documents = text_splitter.create_documents([cleaned_text])

In [None]:
# documents

In [None]:
from langchain_core.runnables import Runnable, chain


system_template = ("You are an expert summarizer. Your task is to read the provided text and generate a clear, concise, and accurate summary. "
                   "Focus on the main ideas, key points, and any critical information. Avoid unnecessary details, repetition, or personal opinions. "
                   "The summary should be in your own words and easy to understand for someone who hasn’t read the original text.\n\n"
                   "If the text includes technical or specialized content, retain essential terminology but explain it simply if needed."
                  )

def build_standard_chat_prompt_template(kwargs) -> Runnable:
    """Build a ChatPromptTemplate from optional 'system' and 'human' PromptTemplate kwargs."""
    message_classes = {'system': SystemMessagePromptTemplate,
                       'human': HumanMessagePromptTemplate}
    messages = []

    # Keep the system message first, followed by the human message
    for key, message_class in message_classes.items():
        if kwargs.get(key):
            prompt = PromptTemplate(**kwargs[key])
            messages.append(message_class(prompt=prompt))

    return ChatPromptTemplate.from_messages(messages)


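@chain
def build_summary_prompt_template(kwargs):
    # The incoming kwargs are not used to build the prompt. Because this @chain function
    # returns a Runnable, LangChain invokes the returned prompt with the same input ({"text": ...}).
    input_ = {"system": {"template": system_template},
              "human": {"template": "text: {text}.",
                        "input_variables": ['text']}
              }

    return build_standard_chat_prompt_template(input_)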

In [None]:
summary_pipeline = build_summary_prompt_template | model | StrOutputParser()

# Map step: summarize every chunk independently
inputs_ = [{"text": document.page_content} for document in documents]

contents = summary_pipeline.batch(inputs_)

# Join the partial summaries so they can be summarized once more
final_text = "\n\n".join(contents)

In [None]:
final_text

In [None]:
# Reduce step: summarize the joined partial summaries
summary_pipeline.invoke({"text": final_text})

# Additional Reading

Nice to know, but I am not going down this rabbit hole.

## Character Splitting

Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for any application, but it's a great starting point for understanding the basics.

- Pros: Easy & Simple
- Cons: Very rigid and doesn't take into account the structure of your text

Concepts to know:

- Chunk Size - The number of characters you would like in your chunks. 50, 100, 100000, etc.
- Chunk Overlap - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.
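
To make chunk size and chunk overlap concrete, here is a minimal pure-Python sketch of fixed-size splitting with overlap. `split_by_characters` is a hypothetical helper written for illustration; it is not LangChain's implementation and may differ from `CharacterTextSplitter` in details such as whitespace handling.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into chunk_size-character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new window starts chunk_size - chunk_overlap characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "This is the text I would like to chunk up. It is the example text for this exercise"
for chunk in split_by_characters(sample, chunk_size=35, chunk_overlap=4):
    print(repr(chunk))
```

With an overlap of 4, the last four characters of each chunk are repeated at the start of the next one, which is the duplicated data mentioned above.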


In [None]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

In [None]:
len('This is the text I would like to ch')

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=4, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

In [None]:
from IPython.display import IFrame

IFrame(src='https://chunkviz.up.railway.app/', width=800, height=800)

- Separators are the character sequence(s) you would like to split on. For example, to chunk your data at every `ch`, pass `separator='ch'`.
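
As a rough illustration of separator-based splitting, the sketch below splits at every separator and then greedily merges neighbouring pieces while they still fit in `chunk_size`. `split_on_separator` is a hypothetical helper, and the merge rule is an assumption made for this sketch; LangChain's `CharacterTextSplitter` may differ in details such as whitespace stripping and overlap.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split at every separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Merging would exceed chunk_size: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "This is the text I would like to chunk up. It is the example text for this exercise"
print(split_on_separator(sample, separator="ch", chunk_size=4))
```

Note that a single piece longer than `chunk_size` is kept whole, so with a tiny `chunk_size` the chunks are determined by the separator alone.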

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=4, chunk_overlap=0, separator='ch')
text_splitter.create_documents([text])

## Recursive Character Splitting

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
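
The "try each separator in order" idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it drops empty pieces.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split on the first separator; retry oversized pieces with the finer separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = ("First paragraph is short.\n\n"
       "Second paragraph is quite a bit longer than the first one.\n\n"
       "Third.")
for chunk in recursive_split(doc, chunk_size=40):
    print(repr(chunk))
```

Short paragraphs stay intact, and only the paragraph that exceeds 40 characters is broken further at word boundaries.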

### CNN (Cable News Network) Dataset

In [None]:
import pandas as pd

df_news = pd.read_csv("tutorial/LLM+Langchain/Week-2/CNN_Articels_clean.csv")

In [None]:
df_news.head(5)

In [None]:
text = df_news.iloc[0]['Article text']

In [None]:
len(text)

In [None]:
text[:100]

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0, separators=[",", ".", "?", "!"])

In [None]:
documents = text_splitter.create_documents([text])

In [None]:
print(documents[0])
print(len(documents[0].page_content))

In [None]:
print(documents[1])
print(len(documents[1].page_content))

In [None]:
print(documents[2])
print(len(documents[2].page_content))

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)

In [None]:
documents = text_splitter.create_documents([text])

In [None]:
print(documents[0])
print(len(documents[0].page_content))

In [None]:
print(documents[1])
print(len(documents[1].page_content))

In [None]:
import re

# Input text
text = ", there's a shortage of truck drivers in the US and worldwide."

# Remove punctuation using regex
cleaned_text = re.sub(r"[^\w\s]", "", text)

print(cleaned_text)

## Document Specific Splitting

### Markdown splitter

This code snippet demonstrates how to use LangChain's MarkdownTextSplitter to split a Markdown text document into smaller chunks. The MarkdownTextSplitter class is designed to handle Markdown-specific structure, making it easier to process and retrieve information from Markdown documents.
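
To show what "Markdown-specific structure" can mean, the sketch below starts a new chunk at every heading line. It is only an illustration: `split_markdown_by_heading` is a hypothetical helper, it ignores chunk size, and it does not handle `#` lines inside fenced code blocks. MarkdownTextSplitter itself also takes `chunk_size` into account.

```python
import re
from typing import List


def split_markdown_by_heading(markdown: str) -> List[str]:
    """Start a new chunk at every ATX heading line ('#', '##', ...)."""
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown.strip().splitlines():
        if re.match(r"#{1,6} ", line) and current:
            # A heading closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks


sample_markdown = """
# Trip notes

## Driving

Take the coastal road.

## Hiking

Go to Yosemite
"""
for chunk in split_markdown_by_heading(sample_markdown):
    print(repr(chunk))
```

Each chunk keeps its heading together with the text below it, which is the kind of grouping a structure-aware splitter aims for.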

### 1. Import LangChain Components

- Import `MarkdownTextSplitter` from LangChain.

In [None]:
from langchain.text_splitter import MarkdownTextSplitter

### 2. Initialize the Text Splitter

- The MarkdownTextSplitter is initialized with a chunk_size of 40 and a chunk_overlap of 0, so each chunk contains at most 40 characters and consecutive chunks do not overlap.

In [None]:
text_splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

In [None]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

### 3. Create Documents from Markdown Text

- The create_documents method of MarkdownTextSplitter splits the Markdown text into smaller chunks based on the specified chunk size.

In [None]:
text_splitter.create_documents([markdown_text])

### Python splitter

In [None]:
from langchain.text_splitter import PythonCodeTextSplitter

python_text = """
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)

for i in range(10):
    print(i)
"""

python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
python_splitter.create_documents([python_text])

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language


python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_text])
python_docs

More on splitting code: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/code_splitter/

## Semantic Splitting

- StatisticalChunker (text)
- ConsecutiveChunker (text, audio)
- CumulativeChunker (text)

In [None]:
from semantic_router.encoders import HuggingFaceEncoder

# Encoder shared by all the chunkers below
encoder = HuggingFaceEncoder()

### StatisticalChunker

The statistical chunking method is the library's most robust chunking method. It uses a varying similarity threshold to identify more dynamic and local similarity splits. It offers a good balance between accuracy and efficiency but can only be used for text documents (unlike the multi-modal ConsecutiveChunker).

The StatisticalChunker can automatically identify a good threshold value to use while chunking the text, so it tends to require less customization than the other chunkers.

In [None]:
from semantic_chunkers import StatisticalChunker

chunker = StatisticalChunker(encoder=encoder)

text = df_news.iloc[0]['Article text']

chunks = chunker(docs=[text])

In [None]:
chunks[0][0]

In [None]:
chunks[0][1]

### Consecutive Chunking

Consecutive chunking is the simplest version of semantic chunking.
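
The core idea can be sketched as follows: compare each sentence with the one before it and start a new chunk when their similarity falls below a threshold. This is a toy illustration, not the `semantic_chunkers` implementation. `embed`, `cosine`, and `consecutive_chunks` are hypothetical helpers, and the bag-of-words "embedding" stands in for a real encoder.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def consecutive_chunks(sentences: List[str], score_threshold: float) -> List[List[str]]:
    """Start a new chunk whenever a sentence is too dissimilar from the previous one."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < score_threshold:
            chunks.append([current])
        else:
            chunks[-1].append(current)
    return chunks


sentences = [
    "Truck drivers are in short supply.",
    "The shortage of truck drivers affects supply chains.",
    "Yosemite has great hiking trails.",
    "Hiking trails in Yosemite are busy in summer.",
]
for chunk in consecutive_chunks(sentences, score_threshold=0.3):
    print(chunk)
```

Because only neighbouring sentences are compared, a single unusual sentence is enough to trigger a split.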

In [None]:
from semantic_chunkers import ConsecutiveChunker

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)

chunks = chunker(docs=[text])

In [None]:
chunks[0][0].splits

### Cumulative Chunking

Cumulative chunking is more compute-intensive, but it can often provide more stable results because it is more resistant to noise. The trade-off is cost: it is very expensive in both time and (if using APIs) money.
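
A toy sketch of the idea: instead of comparing neighbouring sentences, compare the next sentence with everything accumulated in the current chunk, re-embedding the growing chunk at every step. That repeated embedding is where the extra cost comes from. This is not the `semantic_chunkers` implementation. `embed`, `cosine`, and `cumulative_chunks` are hypothetical helpers, and the bag-of-words "embedding" stands in for a real encoder.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cumulative_chunks(sentences: List[str], score_threshold: float) -> List[List[str]]:
    """Compare each sentence with the whole chunk accumulated so far."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        if current:
            # Re-embed the growing chunk on every step (the expensive part).
            accumulated = embed(" ".join(current))
            if cosine(accumulated, embed(sentence)) < score_threshold:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "Truck drivers are in short supply.",
    "Wages for truck drivers are rising.",
    "Yosemite has great hiking trails.",
    "Hiking trails in Yosemite are busy.",
]
for chunk in cumulative_chunks(sentences, score_threshold=0.3):
    print(chunk)
```

With a real encoder, each step costs one extra embedding call on an ever longer text, which explains the time and API cost noted above.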

In [None]:
from semantic_chunkers import CumulativeChunker

chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)

chunks = chunker(docs=[text])

In [None]:
chunks[0][0]