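# 📖 Agent Implementation Example: Automated Audiobook Generation

This notebook shows how an Agent chains **story → image → audio** into one automated pipeline.  
Each step saves its output to a file, which avoids spending tokens on the same content twice and makes the result easy to reuse in later steps.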

---

## 🔹 Overall flow
1. Story generation  
2. Image generation  
3. Audio generation  

---

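## 1️⃣ Story generation
**Input**  
- Draft text  
- Existing content (`.txt` files)  

**Processing**  
1. Send the text to the LangServe service  
2. Receive the generated story passage  

**Output**  
- Save the result as a `.txt` file  
- This keeps the Agent from passing the whole story around on every step, which lowers token usage  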
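
The file-based hand-off described above can be sketched as follows. This is a minimal illustration, not the notebook's actual tool code: `save_story_page` and `load_story_context` are hypothetical helper names, and the example paths are made up. The idea is that only a short path string travels between steps, and the full text is read back only when a step needs it.

```python
import tempfile
from pathlib import Path
from typing import List


def save_story_page(text: str, filename: str) -> str:
    """Write one story page to disk and return only its path (hypothetical helper)."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    path.write_text(text, encoding="utf-8")
    return str(path)


def load_story_context(filenames: List[str]) -> str:
    """Rebuild the story context from previously saved pages (hypothetical helper)."""
    return "\n\n".join(Path(f).read_text(encoding="utf-8") for f in filenames)


# Usage: the path, not the story text, is what gets passed between steps.
with tempfile.TemporaryDirectory() as tmp:
    page_path = save_story_page("A baby owl woke at dusk.", f"{tmp}/story/page_1.txt")
    context = load_story_context([page_path])
```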

---

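## 2️⃣ Image generation
**Input**  
- The most recently generated story text  
- Existing images (`.png`)  

**Processing**  
1. Encode the images as base64  
2. Send the text and images to the LangServe service  
3. Receive the returned image (base64 format)  

**Output**  
- Decode the base64 string into binary data and save it as a `.png` file  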
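
The base64 round trip in the steps above can be sketched with the standard library alone. The helper names `encode_image_bytes` and `decode_image_payload` are hypothetical, and the 8-byte PNG signature stands in for a real image file so the example needs no image on disk.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Turn raw image bytes into a base64 string that fits in a JSON payload."""
    return base64.b64encode(raw).decode("utf-8")


def decode_image_payload(payload: str) -> bytes:
    """Turn a base64 string from a response back into raw bytes for writing to disk."""
    return base64.b64decode(payload)


# The fixed 8-byte signature that starts every PNG file, used here as stand-in data.
png_signature = b"\x89PNG\r\n\x1a\n"

payload = encode_image_bytes(png_signature)   # text, safe to embed in JSON
restored = decode_image_payload(payload)      # identical to the original bytes
```

The same decode step applies to the audio service, which also returns its result as a base64 string.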

---

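## 3️⃣ Audio generation
**Input**  
- The most recently generated story text  

**Processing**  
1. Send the text to the LangServe service  
2. Receive the returned audio (base64 format)  

**Output**  
- Decode the base64 string into binary data and save it as an `.mp3` or `.wav` file  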

---

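## 🔄 Flowchart

```mermaid
flowchart TD
    A["Story draft / existing content .txt"] --> B["Send to LangServe to generate the story"]
    B --> C["Story content .txt"]
    C --> D["Send to LangServe to generate the image (base64)"]
    D --> E["Decode and save as .png"]
    C --> F["Send to LangServe to generate the audio (base64)"]
    F --> G["Decode and save as .mp3 / .wav"]
```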


## Testing the LangServe services

In [None]:
import os
import requests

os.chdir("../../../")

Test the story generation service

In [None]:
response = requests.post(
    "http://localhost:8080/story_telling/invoke",
    json={"input":{'scratch': "Create a chapter of a baby owl capturing a rodent in the night as his dinner",
                   'context': ""}
         }
)

In [None]:
story_json = response.json()

In [None]:
story_json['output']

Test the image generation service

In [None]:
import importlib

image_generation_module = importlib.import_module("tutorial.LLM+Langchain.Week-8.logic.image_generation")
image_create_pipeline = image_generation_module.image_create_pipeline(image_generation_module.system_template)

In [None]:
response_image = requests.post(
    "http://localhost:8080/image_generation/invoke",
    json={"input": {'story': story_json['output'],
                    "image_io": []}
    }
)

In [None]:
response_image.json()['output']['nl_prompt']

In [None]:
import base64
# Decode to bytes
image_bytes = base64.b64decode(response_image.json()['output']['image_base64'])

with open("tutorial/LLM+Langchain/Week-8/story_2_image.png", "wb") as fh:
    fh.write(image_bytes)

Test generating the next chapter of the story

In [None]:
response_next_story = requests.post(
    "http://localhost:8080/story_telling/invoke",
    json={"input":{'scratch': "Create the next chapter following the context",
                   'context': story_json['output']}
         }
)

In [None]:
next_chapter = response_next_story.json()['output']

Generate the next image from the story and the previous image

Send the base64 string through requests

In [None]:
response = requests.post(
    "http://localhost:8080/image_generation/invoke",
    json={"input": {'story': next_chapter + f"\nPrevious image description:\n\n{response_image.json()['output']['nl_prompt']}",
                    # 'image_io': [response_image.json()['output']['image_base64']]
                    'image_io': []
                   }
    }
)

In [None]:
response.json()['output']['nl_prompt']

In [None]:
import base64

image_bytes = base64.b64decode(response.json()['output']['image_base64'])

with open("tutorial/LLM+Langchain/Week-8/story_3_image.png", "wb") as fh:
    fh.write(image_bytes)

Test audio generation

In [None]:
response = requests.post(
    "http://localhost:8080/audio_generation/invoke",
    json={"input": {'input': "How are you doing?"}}
)

In [None]:
audio_bytes = base64.b64decode(response.json()['output'])

with open("tutorial/LLM+Langchain/Week-8/test_sample.mp3", "wb") as f:
    f.write(audio_bytes)


## Tool template

In [None]:
from textwrap import dedent
from typing import Optional, Any, List, Tuple, Callable
from pathlib import Path

from langchain_openai import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.tools import BaseTool
from langchain_core.runnables import Runnable
from pydantic import BaseModel, Field
from pydantic import FilePath

from src.initialization import credential_init


class ToolTemplate(BaseTool):

    """
    ToolTemplate: an Agent Tool dedicated to calling a LangServe REST API

    - Uses PydanticOutputParser to validate the input format
    - Supports per-field input/output processors
    - Adds error handling around the API call
    """
    
    runnable: str = Field(..., description='The Langserve endpoint')
    name: str
    input_parser: PydanticOutputParser
    description: str
    input_data_processors: Optional[List[Tuple[str, Callable[[Any], Any]]]] = None
    output_data_processors: Optional[List[Tuple[Optional[str], Callable[[Any], Any]]]] = None
    
    @classmethod
    def create(cls, runnable: str, name: str, description_template: str,
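               input_parser: PydanticOutputParser,
               input_data_processors: Optional[List[Tuple[str, Callable[[Any], Any]]]] = None,
               output_data_processors: Optional[List[Tuple[Optional[str], Callable[[Any], Any]]]] = None):

        """Create a Tool instance; the input format instructions are added to the description automatically."""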
        
        input_format_instruction = input_parser.get_format_instructions()
        
        description = description_template.format(
            input_format_instruction=input_format_instruction
        )
        
        return cls(runnable=runnable, name=name, description=description,
                   input_parser=input_parser, input_data_processors=input_data_processors,
                   output_data_processors=output_data_processors)
    
    def _run(self, query: str):

        """Run the Tool (synchronous version)"""
        
        # 1. Validate and parse the input
        try:
            input_ = self.input_parser.parse(query)
        except Exception as e:
            raise ValueError(f"Failed to parse input with parser: {e}, query={query}")
        
        runnable_inputs = input_.model_dump()

        # 2. Input processors (pre-processing)
        if self.input_data_processors:
            for field, fn in self.input_data_processors:
                if field in runnable_inputs:
                    runnable_inputs[field] = fn(runnable_inputs[field])
            
        # 3. Call the LangServe REST API
        try:
            response = requests.post(
                str(self.runnable),
                json={"input": runnable_inputs},
                timeout=60,
            )
            response.raise_for_status()
            result = response.json()
        except Exception as e:
            raise RuntimeError(f"Langserve call failed: {e}, inputs={runnable_inputs}")

        if "output" not in result:
            raise RuntimeError(f"Invalid response format from Langserve: {result}")
        
        output = result['output']

        # 4. Update the state variables (only when the output is a dict, e.g. from the image service)
        if isinstance(output, dict):
            for key in session_state.keys():
                if key in output:
                    session_state[key] = output[key]
        
        # 5. Output processors (post-processing)
        if self.output_data_processors:
            for field, fn in self.output_data_processors:
                if not field:
                    fn(output, runnable_inputs['filename'])
                else:
                    fn(output[field], runnable_inputs['filename'])
                    
        # By default, return the filename if one was given; otherwise return the output itself
        return runnable_inputs.get("filename", output)

    def _arun(self, query: str):

        raise NotImplementedError("This tool does not support async")

### State Variables

In [None]:
session_state = {}

session_state['nl_prompt'] = None

In [None]:
from pathlib import Path

from pydantic import BaseModel, Field


class StoryInput(BaseModel):
    scratch: str = Field(description=dedent("""\
                                            The draft, notes, or rough idea for the current page of the story.
                                            """))
    context: List[FilePath] = Field(default_factory=list, description=dedent("""\
                                                              A list of previously generated .txt files that contain story content.
                                                              Used to maintain narrative consistency and continuity across pages.
                                                              """))
    filename: str = Field(..., description=dedent("""\
                                             The file path where the generated story text will be saved.
                                             """))


def export_to_txt(text, filename: Path):

    dir_ = Path(filename).parent

    if not os.path.isdir(dir_):
        os.makedirs(dir_)
    
    with open(filename, "w", encoding="utf-8") as file:
        file.write(text)


def read_from_txt(filename) -> str:

    with open(filename, "r", encoding="utf-8") as file:
        content = file.read()

    return content


def read_from_list_of_text(filenames) -> str:

    return "\n\n".join([read_from_txt(f) for f in filenames])


story_input_data_processors = [("context", read_from_list_of_text)]

story_output_data_processors = [(None, export_to_txt)]

story_telling_tool = ToolTemplate.create(
    runnable="http://localhost:8080/story_telling/invoke",
    name="Story generation tool",
    description_template=dedent("""\
                                This tool is designed to generate a story one page at a time.
                                Provide a draft or idea for the current page (`scratch`), along with 
                                the preceding story context stored as .txt files (`context`), 
                                and specify where the generated text should be saved (`filename`).
                                    
                                Input format: {input_format_instruction}
                                """
                                ),
    input_parser=PydanticOutputParser(pydantic_object=StoryInput),
    output_data_processors = story_output_data_processors,
    input_data_processors = story_input_data_processors
)

### Image generation

In [None]:
class ImageInput(BaseModel):
    story: str = Field(description=dedent("""\
                                          The narrative or context used to generate the image prompt.
                                          """)
                      )
    # Paths are plain strings here, so no existence check is performed on them.
    image_io: List[str] = Field([], description=dedent("""\
                                                   Paths to the previously generated images.
                                                   Used in img2img generation to maintain visual and texture consistency.
                                                   """)
                                    )
    # A plain string: the file does not need to exist yet.
    filename: str = Field(..., description=dedent("""\
                                                  Destination file path where the generated image will be saved.
                                                  """)
                          )


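import io

from PIL import Image


def image_to_base64(image_path):
    
    with Image.open(image_path) as image:
        
        # Save the image to an in-memory buffer.
        # JPEG has no alpha channel, so convert to RGB first (PNG inputs may be RGBA).
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")
        
        # Encode the buffer contents as base64
        image_str = base64.b64encode(buffered.getvalue())
    
    return image_str.decode('utf-8')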


def image_to_base64_from_list(filenames) -> List[Optional[str]]:

    # return [image_to_base64(f) for f in filenames]
    return []


def export_to_image(content, filename):
    
    dir_ = Path(filename).parent

    if not os.path.isdir(dir_):
        os.makedirs(dir_)
    
    image_bytes = base64.b64decode(content)
    
    with open(filename, "wb") as fh:
        fh.write(image_bytes)


def story_adaptation(story: str):

    nl_prompt = session_state['nl_prompt']
    
    if nl_prompt:
        story += f"\nPrevious image description:\n\n{nl_prompt}"

    print(f"******\n{story}\n*******")
    
    return story
    


image_output_data_processors = [('image_base64', export_to_image)]

image_input_data_processors = [("story", story_adaptation),
                                ("image_io", image_to_base64_from_list)]

image_tool = ToolTemplate.create(
    runnable="http://localhost:8080/image_generation/invoke",
    name="Image generation tool",
    description_template=dedent("""\
                                This tool is designed to generate an image for the corresponding narrative.
                                Provide a narrative (`story`), along with 
                                the preceding images stored as .png files (`image_io`), 
                                and specify where the generated image should be saved (`filename`).
                                    
                                Input format: {input_format_instruction}
                                """
                                ),
    input_parser=PydanticOutputParser(pydantic_object=ImageInput),
    output_data_processors = image_output_data_processors,
    input_data_processors = image_input_data_processors
)

### Audio generation

In [None]:
class AudioInput(BaseModel):
    input: str = Field(description=dedent("""\
                                          The narrative or context used to generate the audio content with text-to-speech (TTS).
                                          """)
                                    )
    # A plain string: the file does not need to exist yet.
    filename: str = Field(..., description=dedent("""\
                                                  Destination mp3 file path where the generated audio will be saved.
                                                  """)
                         )

def export_to_audio(content, filename):

    dir_ = Path(filename).parent

    if not os.path.isdir(dir_):
        os.makedirs(dir_)
    
    audio_bytes = base64.b64decode(content)
    
    with open(filename, "wb") as fh:
        fh.write(audio_bytes)


audio_output_data_processors = [(None, export_to_audio)]

audio_tool = ToolTemplate.create(
    runnable="http://localhost:8080/audio_generation/invoke",
    name="Audio generation tool",
    description_template=dedent("""\
                                This tool is designed to generate an .mp3 file for a corresponding narrative.
                                Provide the narrative text (`input`)
                                and specify where the generated audio should be saved (`filename`).
                                    
                                Input format: {input_format_instruction}
                                """
                                ),
    input_parser = PydanticOutputParser(pydantic_object=AudioInput),
    output_data_processors = audio_output_data_processors,
)

In [None]:
from langchain.prompts import PromptTemplate
from langchain.agents import AgentExecutor, create_react_agent

from src.agent.react_zero_shot import prompt_template as zero_shot_prompt_template

prompt = PromptTemplate.from_template(zero_shot_prompt_template)

tools = [story_telling_tool, image_tool, audio_tool]

credential_init()

model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4o", temperature=0, 
                  )

zero_shot_agent = create_react_agent(
    llm=model,
    tools=tools,
    prompt=prompt,
)

agent_executor = AgentExecutor(agent=zero_shot_agent, tools=tools, verbose=True, handle_parsing_errors=True)

In [None]:
query = dedent("""
Create a chapter of a baby owl capturing a rodent in the night as his dinner.
After having the final answer, please create a corresponding image and a corresponding mp3 file.
The saved image (.png), text (.txt), and audio (.mp3) should have the same name in the folder `tutorial/LLM+Langchain/Week-8/story_test`
""")

agent_executor.invoke({"input": query})

In [None]:
session_state

One page was generated successfully. Can the Agent generate the whole story for us?

In [None]:
query = """
        I want to create a 4-page story for a child who likes snowy owls.
        For each page, please create a corresponding image and record the story as an mp3.
        The saved image and mp3 should have the same name, following the structure of 
        <Page - idx>, with idx as a number starting from 1, in the folder `tutorial/LLM+Langchain/Week-8/story_automation`
        """

agent_executor.invoke({"input": query})

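## 🎧 Interactive audiobook generation

- A full **Agent architecture** is not strictly required: every step of the flow (story → image → audio) is already well defined and can be triggered directly by the user.  
- The flow can run as a **chatbot** interaction: after each user request or instruction, the system generates the corresponding content by following the fixed steps.  
- Users can adjust the direction while the story is being generated, for example by specifying characters, plot direction, or tone, which gives a more **customized experience**.  
- This style of interaction suits **language learning** well:  
  - Learners can read the story while listening to the audio output  
  - The plot can be changed on the fly to produce content closer to the learner's needs  
  - Images and audio together make the learning experience more immersive  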
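
As a rough sketch of the Agent-free alternative described above, the three services can be called in a fixed order. The endpoint paths and response shapes follow the LangServe tests earlier in this notebook; `make_page`, `post_with_requests`, and the `stem` argument are hypothetical names introduced only for illustration, and `stem` is assumed to be a path without a file extension.

```python
import base64
from pathlib import Path
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:8080"  # same host as the LangServe tests above


def post_with_requests(url: str, payload: Dict[str, Any]) -> Any:
    """Hypothetical helper: POST to a LangServe endpoint and return its `output`."""
    import requests  # imported lazily so the sketch loads without the package

    response = requests.post(url, json={"input": payload}, timeout=60)
    response.raise_for_status()
    return response.json()["output"]


def make_page(scratch: str, context: str, stem: str,
              post: Callable[[str, Dict[str, Any]], Any] = post_with_requests) -> str:
    """Run story -> image -> audio in a fixed order and save all three files."""
    story = post(f"{BASE_URL}/story_telling/invoke",
                 {"scratch": scratch, "context": context})
    image = post(f"{BASE_URL}/image_generation/invoke",
                 {"story": story, "image_io": []})
    audio = post(f"{BASE_URL}/audio_generation/invoke", {"input": story})

    out = Path(stem)  # assumed to carry no extension, e.g. "story/Page - 1"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.with_suffix(".txt").write_text(story, encoding="utf-8")
    out.with_suffix(".png").write_bytes(base64.b64decode(image["image_base64"]))
    out.with_suffix(".mp3").write_bytes(base64.b64decode(audio))
    return story  # the caller can append this to the running context
```

Because the order of calls is fixed in code, no tool-selection reasoning is needed; the trade-off is that the flow cannot adapt to instructions that fall outside these three steps.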

# Langgraph

In [None]:
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

from langchain.output_parsers import StructuredOutputParser, ResponseSchema

output_response_schemas = [
        ResponseSchema(name="story", description="the story content in the page"),
        ResponseSchema(name="page index", description="The page number of the story"),
    ]

output_parser = StructuredOutputParser.from_response_schemas(output_response_schemas)

output_format_instructions = output_parser.get_format_instructions()


template = """
           Create a story page {idx}, based on the description: {text}

           The answer continues from previous content:
           {context}

           After having the final answer, please create a corresponding image and record the story as an mp3. 
           The saved image and mp3 should have same name, following the structure of 
           <Page - idx>, in the folder `tutorial/LLM+Langchain/Week-8`

           The output should have the following format: {output_format_instruction}
           """

prompt_template = PromptTemplate(template=template,
                                 input_variables=["text", "context", "idx"],
                                 partial_variables={"output_format_instruction": output_format_instructions})

agent_chain = RunnablePassthrough.assign(input=prompt_template)|agent_executor

In [None]:
Q = agent_chain.invoke({"text": "A little cat just woke up in the morning",
                        "context": "The beginning of the story:\n",
                        "idx": str(1)})

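If the following steps fail, try generating again. This is a large language model, so there is no guarantee that it produces the format you want 100% of the time. All we can do is raise the probability of a successful output as much as possible.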
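
One way to automate the "try again" advice is a small retry wrapper. This is a sketch under assumptions: `invoke_with_retry` is a hypothetical helper, and it assumes the parser signals a malformed output by raising an exception.

```python
from typing import Any, Callable, Optional


def invoke_with_retry(invoke: Callable[[], str],
                      parse: Callable[[str], Any],
                      max_attempts: int = 3) -> Any:
    """Call `invoke` until `parse` accepts its output, up to `max_attempts` times."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = invoke()             # one fresh generation per attempt
        try:
            return parse(raw)      # success: return the parsed result
        except Exception as error:  # parsing failed: remember why and retry
            last_error = error
    raise RuntimeError(
        f"No parsable output after {max_attempts} attempts"
    ) from last_error
```

In this notebook it could be used roughly as `invoke_with_retry(lambda: agent_chain.invoke(inputs)['output'], output_parser.parse)`, where `inputs` is the dictionary passed to the chain. Note that every retry re-runs the agent, so images and audio files may be regenerated as well.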

In [None]:
Q['output']

In [None]:
output_parser.parse(Q['output'])

In [None]:
output_parser.parse(Q['output'])['story']

In [None]:
output_parser.parse(Q['output'])['page index']

### Page 2

In [None]:
context_list = [output_parser.parse(Q['output'])['story']]
print(context_list)

In [None]:
Q_2 = agent_chain.invoke({"text": "Whisker found a dove and wanted to hunt it down!",
                          "context": "\n".join(context_list),
                          "idx": str(2)})

In [None]:
context_list

In [None]:
output_parser.parse(Q_2['output'])['story']

### Okay, it looks fine. Let us see how to make it interactive

In [None]:
# Recap of the story so far
context_list = []

# Starting page
idx = 1

while True:
    if len(context_list) == 0:
        context = "The beginning of the story:\n"
    else:
        context = "\n".join(context_list)

    text = input("Enter the story content (type `QUIT` to stop): ")

    if text == "QUIT":
        break
    
    Q = agent_chain.invoke({"text": text,
                            "context": context,
                            "idx": str(idx)})

    # Parse once and reuse the result
    story = output_parser.parse(Q['output'])['story']
    
    # Next page
    idx += 1

    context_list.append(story)