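# Speech to Text

Many applications need to turn speech into text: subtitle generation, voice assistants, meeting notes, or multilingual translation. OpenAI provides the **Whisper** model, which handles two main tasks:

- **Transcribe:** convert an audio file into text in the same language.  
- **Translate & Transcribe:** transcribe the audio and translate the result into English.  

File uploads are currently limited to **25 MB**. Supported audio formats:  
`mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, `webm`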
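
Before uploading, it can help to check a file against these limits locally. The following is a minimal pre-flight sketch: the helper name `check_audio_file` and the constants are illustrative and not part of the OpenAI SDK, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption.

```python
import os

# Limits taken from the text above; the exact byte interpretation of "25 MB" is an assumption.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_audio_file(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file looks uploadable."""
    problems = []
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension!r}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_UPLOAD_BYTES}-byte limit")
    return problems
```

Calling `check_audio_file(path)` before `open(path, "rb")` gives a readable message instead of a failed request when a recording is too large or in the wrong container.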

---

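# Text to Speech (TTS)

Besides speech to text, OpenAI also offers **text to speech (TTS)**. TTS turns input text into natural, fluent speech, and is suited to:  

- **Voice assistants:** let a machine reply to users in natural speech.  
- **Content creation:** automatically generate narration or audiobooks.  
- **Accessibility:** help people with visual impairments or reading difficulties access information more easily.  

TTS offers a choice of voices and audio quality levels, giving a more natural experience across different use cases.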

---

Next, we use Python code examples to show how to call **Whisper** (speech to text) and **TTS** (text to speech), completing the round trip between audio and text step by step.


## Audio models

| Model   | Price                                            |
|---------|--------------------------------------------------|
| Whisper | \$0.006 / minute (rounded to the nearest second) |
| TTS     | \$15.00 / 1M characters                          |
| TTS HD  | \$30.00 / 1M characters                          |
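
To make the pricing concrete, the sketch below turns the table into two small functions. The function names are illustrative, and the prices are simply copied from the table, so they may be out of date.

```python
# Prices copied from the table above (USD); they may change over time.
WHISPER_USD_PER_MINUTE = 0.006
TTS_USD_PER_MILLION_CHARS = {"tts": 15.00, "tts-hd": 30.00}


def whisper_cost(duration_seconds: float) -> float:
    """Cost of transcribing audio, billed per second (rounded to the nearest second)."""
    billed_seconds = round(duration_seconds)
    return billed_seconds * WHISPER_USD_PER_MINUTE / 60


def tts_cost(text: str, tier: str = "tts") -> float:
    """Cost of synthesizing `text`, billed per character."""
    return len(text) * TTS_USD_PER_MILLION_CHARS[tier] / 1_000_000


print(f"10-minute recording: ${whisper_cost(600):.4f}")
print(f"1,000 characters with TTS HD: ${tts_cost('a' * 1000, 'tts-hd'):.4f}")
```

Under these prices a 10-minute recording costs about \$0.06 to transcribe, and 1,000 characters of TTS HD cost about \$0.03.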






In [None]:
import os

os.chdir("../../../")

In [None]:
from src.initialization import credential_init
from src.io.path_definition import get_project_dir, get_file


credential_init()

In [None]:
from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

## Transcription

### Batch Transcription

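In non-streaming mode (uploading a file with `transcriptions.create`), Whisper waits until the whole audio file has been uploaded, transcribes the complete file, and returns the result in one piece.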

In [None]:
audio_file = open("tutorial/LLM+Langchain/Week-6/孩子上網，小心上當.mp3", "rb")

In [None]:
transcription = client.audio.transcriptions.create(
  model="whisper-1", 
  file=audio_file
)

In [None]:
transcription.text

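If you need timestamps, request SRT output:

In [None]:
audio_file.seek(0)  # Rewind in case an earlier request left the file position at the end

transcription = client.audio.transcriptions.create(
  model="whisper-1", 
  file=audio_file,
  response_format='srt'
)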

In [None]:
transcription

In [None]:
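audio_file.seek(0)  # Rewind in case an earlier request left the file position at the end

# The prompt supplies context and key terms in the language of the audio
transcription = client.audio.transcriptions.create(
  model="whisper-1", 
  file=audio_file,
  response_format='srt',
  prompt="這是一個關於防詐騙的宣傳, 私line",
)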

In [None]:
transcription

### Stream

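When you pass `stream=True`, the SDK still uploads the whole audio file first (it does not upload in chunks). The difference is that it does not wait for the full transcription to finish: results are returned incrementally while the model is generating them.

As a result, partial text fragments arrive while transcription is still in progress.

Once the model finishes, you receive the final, complete transcript.

- Streaming is only supported by `gpt-4o-mini-transcribe` and `gpt-4o-transcribe`.

- With `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, remember to pass a language code; otherwise Chinese output will very likely come back in Simplified Chinese.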
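
The streamed events can also be told apart by their `type` field rather than by catching `AttributeError`. The sketch below assumes the event names `transcript.text.delta` and `transcript.text.done` from the OpenAI API reference, and feeds stand-in objects instead of a real API stream so that it runs offline; `collect_transcript` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_transcript(events: Iterable) -> str:
    """Print partial text as it arrives and return the final transcript."""
    pieces = []
    for event in events:
        event_type = getattr(event, "type", None)
        if event_type == "transcript.text.delta":
            pieces.append(event.delta)
            print(event.delta, end="", flush=True)
        elif event_type == "transcript.text.done":
            print()
            return event.text
    # No "done" event was seen: fall back to the concatenated deltas
    return "".join(pieces)


# Stand-in events that mimic the assumed shape of the stream
fake_stream = [
    SimpleNamespace(type="transcript.text.delta", delta="Hello, "),
    SimpleNamespace(type="transcript.text.delta", delta="world."),
    SimpleNamespace(type="transcript.text.done", text="Hello, world."),
]
final_text = collect_transcript(fake_stream)
```

With a real request, the object returned by `client.audio.transcriptions.create(..., stream=True)` would take the place of `fake_stream`.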

In [None]:
audio_file = open("tutorial/LLM+Langchain/Week-6/孩子上網，小心上當.mp3", "rb")

stream = client.audio.transcriptions.create(
  file=audio_file,
  model="gpt-4o-mini-transcribe",
  stream=True,
  language='zh-tw'
)

# To inspect the raw events instead, use:
# for event in stream:
#   print(event)

for event in stream:
    # Partial results carry their text in `event.delta`
    # (the event type is expected to be "transcript.text.delta").
    try:
        print(event.delta, end="", flush=True)
    except AttributeError:
        # No `delta` attribute: this is the final result, so print the whole event
        print(event)

In [None]:
audio_file.seek(0)  # Rewind in case an earlier request left the file position at the end

stream = client.audio.transcriptions.create(
  file=audio_file,
  model="gpt-4o-transcribe",
  stream=True,
  language='zh-tw'
)

for event in stream:
    # Partial results carry their text in `event.delta`
    # (the event type is expected to be "transcript.text.delta").
    try:
        print(event.delta, end="", flush=True)
    except AttributeError:
        # No `delta` attribute: this is the final result, so print the whole event
        print(event)

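## Improving reliability

### Prompt

- Provide context.
- Provide key terms (for example 私line, a phrase probably used only in Taiwan, so do not expect too much from OpenAI here).

### Timestamp_granularities

- May help.

Homophones (one sound that maps to several characters) probably cannot be fixed this way; listing every such case is not practical.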

In [None]:
audio_file.seek(0)  # Rewind in case an earlier request left the file position at the end

# Note: the API reference documents timestamp_granularities as a list
# (for example ["word"]) used together with response_format="verbose_json",
# so the string value below may have no effect with this model.
stream = client.audio.transcriptions.create(
  file=audio_file,
  model="gpt-4o-transcribe",
  stream=True,
  language='zh-tw',
  prompt="這是一個關於防詐騙的宣傳, 私line",
  timestamp_granularities="word",
  temperature=0
)

buffer = ""
for event in stream:
    # Partial results carry their text in `event.delta`
    try:
        partial = event.delta
        buffer += partial
        # Redraw the whole line with the text received so far
        print("\r" + buffer, end="", flush=True)
    except AttributeError:
        # No `delta` attribute: final result
        print("\r" + " " * len(buffer), end="")  # Clear the current line
        print("\r" + event.text)  # Print the final transcript

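# Feasibility assessment: Taiwanese Hokkien speech recognition (ASR)

## ✅ Where your view is correct
1. **The lack of a unified writing standard is a bottleneck**  
   - Training a speech model requires paired "audio + text transcript" data.  
   - Taiwanese Hokkien has several writing systems (Han characters, Tâi-lô, Pe̍h-ōe-jī/romanization); without a consistent standard, datasets become inconsistent and training is hard to converge.  

2. **Insufficient resources**  
   - Compared with English or Mandarin, there is very little parallel audio-text data for Taiwanese Hokkien.  
   - Large public datasets are lacking, so training from scratch is extremely expensive.  

3. **Limited practicality**  
   - Without a commonly accepted writing system, ASR output is hard to adopt widely.  
   - For example, the same sentence could be output in Han characters or in Tâi-lô, and each user group may reject the other form.  

---

## 💡 Technical points worth adding
1. **The speech model itself does not depend on an "official standard"**  
   - The model only needs consistent training labels.  
   - Whether the labels are Tâi-lô, Han characters, or another romanization, the model can learn as long as the dataset is annotated consistently.  
   - The difficulty is whether the community can agree on one writing system.  

2. **A "multi-output / post-processing" strategy is possible**  
   - Train on **Tâi-lô** (which maps more directly to phonemes).  
   - At the application layer, a converter can turn Tâi-lô output into Han characters or another romanization.  
   - This sidesteps the "no settled standard" problem.  

3. **Modern speech technology lowers the barrier**  
   - Large models such as Whisper have shown that a low-resource language can be fine-tuned into a usable system with a few hundred hours of data.  
   - It is technically feasible; the obstacles are **high cost and scarce data**.  

4. **Parallel cases**  
   - Hakka, Tibetan, Welsh, and Irish face similar challenges of scarce data and multiple writing conventions.  
   - ASR systems have nonetheless been built for them, which shows it is not "technically impossible".  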
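
The post-processing strategy in point 2 can be pictured as a lookup step after recognition. The sketch below is a toy only: the three lexicon entries are illustrative, `tailo_to_hanzi` is a hypothetical helper, and a real converter would need a full dictionary plus context-aware handling of words that map to several characters.

```python
import unicodedata


def _normalize(text: str) -> str:
    """Lowercase and NFC-normalize so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


# Illustrative entries only; not a usable lexicon.
TAILO_TO_HANZI = {
    _normalize(tailo): hanzi
    for tailo, hanzi in {
        "Tâi-uân": "台灣",
        "Tâi-gí": "台語",
        "to-siā": "多謝",
    }.items()
}


def tailo_to_hanzi(text: str) -> str:
    """Replace known Tâi-lô words with Han characters; keep unknown words as-is."""
    return " ".join(TAILO_TO_HANZI.get(_normalize(word), word) for word in text.split())


print(tailo_to_hanzi("Tâi-gí to-siā"))
```

Keeping the converter separate from the recognizer means the same model output could be rendered in whichever writing system a given audience prefers.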

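---

## 📌 Overall assessment
- **Correct:** in Taiwan today, the lack of a writing standard does limit the model's practicality and adoption.  
- **Addition:** from a purely technical standpoint, a writing standard is not a prerequisite; consistent dataset annotation is enough to build a model.  
- **The real challenges:**  
  1. Insufficient data (collection is expensive)  
  2. No community consensus (which writing standard to adopt)  
  3. Limited downstream applications (small market demand)  

➡️ **Conclusion:** technically feasible, but of limited benefit under real-world conditions.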


### Try it: record from the computer's microphone, then transcribe

- https://www.gyan.dev/ffmpeg/builds/ (ffmpeg builds for Windows; pydub needs ffmpeg to export MP3)
- pip install sounddevice

In [None]:
!pip install sounddevice

In [None]:
import io

import numpy as np
import sounddevice as sd
from pydub import AudioSegment

# AudioSegment.converter = r"tutorial\LLM+Langchain\Week-6\ffmpeg-essentials_build\bin\ffmpeg.exe"


DURATION = 5  # seconds
FS = 44100    # sample rate

print("Recording...")

# Record audio
audio = sd.rec(int(DURATION * FS), samplerate=FS, channels=1, dtype='int16')
sd.wait()  # Wait until recording is finished
print("Recording finished.")

In [None]:
# Convert numpy array to AudioSegment
audio_bytes = audio.tobytes()

audio_segment = AudioSegment(
    data=audio_bytes,
    sample_width=audio.dtype.itemsize,
    frame_rate=FS,
    channels=1
)

# Save as MP3 in-memory (as an object)
mp3_io = io.BytesIO()
audio_segment.export(mp3_io, format="mp3")
mp3_io.seek(0)  # Rewind to start

# https://community.openai.com/t/openai-whisper-send-bytes-python-instead-of-filename/84786/3

mp3_io.name = "word.mp3"

In [None]:
# Wrap the steps above in a function

def audio_to_mp3_io(audio, FS: int, channels: int):
    """Encode a NumPy int16 recording as an in-memory MP3 file object."""

    # Convert numpy array to AudioSegment
    audio_bytes = audio.tobytes()
    
    audio_segment = AudioSegment(
        data=audio_bytes,
        sample_width=audio.dtype.itemsize,
        frame_rate=FS,
        channels=channels
    )
    
    # Save as MP3 in-memory (as an object)
    mp3_io = io.BytesIO()
    audio_segment.export(mp3_io, format="mp3")
    mp3_io.seek(0)  # Rewind to start

    mp3_io.name = "word.mp3"

    return mp3_io
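
If ffmpeg is not available, an uncompressed WAV buffer is one alternative, since `wav` is among the supported upload formats. The sketch below uses only the standard library; `pcm_to_wav_io` is a hypothetical helper that takes raw 16-bit PCM bytes (for example `audio.tobytes()` from the recording above). WAV files are much larger than MP3, so the 25 MB limit is reached sooner.

```python
import io
import wave


def pcm_to_wav_io(pcm_bytes: bytes, sample_rate: int, channels: int = 1,
                  sample_width: int = 2, name: str = "recording.wav") -> io.BytesIO:
    """Wrap raw PCM samples in a WAV container held in memory."""
    wav_io = io.BytesIO()
    with wave.open(wav_io, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # 2 bytes per sample for int16
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    wav_io.seek(0)        # Rewind so the uploader reads from the start
    wav_io.name = name    # File name with a matching extension, as in the MP3 helper
    return wav_io
```

Under these assumptions, `pcm_to_wav_io(audio.tobytes(), sample_rate=FS)` could be passed as `file=` in place of `mp3_io`.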

In [None]:
stream = client.audio.transcriptions.create(
    model="gpt-4o-transcribe", 
    file=mp3_io,
    response_format="text",
    language='zh-tw',
    stream=True
)

buffer = ""
for event in stream:
    # Partial results carry their text in `event.delta`
    try:
        partial = event.delta
        buffer += partial
        # Redraw the whole line with the text received so far
        print("\r" + buffer, end="", flush=True)
    except AttributeError:
        # No `delta` attribute: final result
        print("\r" + " " * len(buffer), end="")  # Clear the current line
        print("\r" + event.text)  # Print the final transcript

In the previous example we could not choose when the recording starts.

Now we add control over when it starts and stops.

In [None]:
import threading

FS = 44100
CHANNELS = 1
dtype = 'int16'

recorded_frames = []

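def callback(indata, frames, time, status):
    # Called by sounddevice for every captured block
    if status:
        print(status)
    recorded_frames.append(indata.copy())

def record_audio():
    # Start/stop control: wait for Enter *before* opening the stream,
    # because the callback starts collecting audio as soon as the stream is open
    input("Press Enter to start recording...")
    with sd.InputStream(samplerate=FS, channels=CHANNELS, 
                        dtype=dtype, callback=callback):
        print("Recording... Press Enter again to stop.")
        input()
    print("Recording stopped.")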


In [None]:
# Clear frames before each recording
recorded_frames.clear()
recording_thread = threading.Thread(target=record_audio)
recording_thread.start()
recording_thread.join()

In [None]:
# Combine recorded frames
if recorded_frames:
    audio_np = np.concatenate(recorded_frames, axis=0)
else:
    audio_np = np.array([], dtype=dtype)

mp3_io =  audio_to_mp3_io(audio=audio_np, FS=FS, channels=1)

In [None]:
stream = client.audio.transcriptions.create(
    model="gpt-4o-transcribe", 
    file=mp3_io,
    response_format="text",
    language='zh-tw',
    stream=True
)

buffer = ""
for event in stream:
    # Partial results carry their text in `event.delta`
    try:
        partial = event.delta
        buffer += partial
        # Redraw the whole line with the text received so far
        print("\r" + buffer, end="", flush=True)
    except AttributeError:
        # No `delta` attribute: final result
        print("\r" + " " * len(buffer), end="")  # Clear the current line
        print("\r" + event.text)  # Print the final transcript

## Translations

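The translations API accepts an audio file in any supported language and transcribes it into English. Unlike the transcriptions endpoint, the output is not text in the original input language but English text.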

In [None]:
recorded_frames.clear()
recording_thread = threading.Thread(target=record_audio)
recording_thread.start()
recording_thread.join()

# Combine recorded frames
if recorded_frames:
    audio_np = np.concatenate(recorded_frames, axis=0)
else:
    audio_np = np.array([], dtype=dtype)

mp3_io =  audio_to_mp3_io(audio=audio_np, FS=FS, channels=1)

# This endpoint does not support streaming
output = client.audio.translations.create(
    model="whisper-1", 
    file=mp3_io,
    response_format="text",
)

# With response_format="text" the SDK returns a plain string, so print it directly
print(output)

## Text to Speech

The Audio API provides a speech endpoint based on our TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to:

- Narrate a written blog post
- Produce spoken audio in multiple languages
    - Afrikaans,
    - Arabic,
    - Armenian,
    - Azerbaijani,
    - Belarusian,
    - Bosnian,
    - Bulgarian,
    - Catalan,
    - Chinese,
    - Croatian,
    - Czech,
    - Danish,
    - Dutch,
    - English,
    - Estonian,
    - Finnish,
    - French,
    - Galician,
    - German,
    - Greek,
    - Hebrew,
    - Hindi,
    - Hungarian,
    - Icelandic,
    - Indonesian,
    - Italian,
    - Japanese,
    - Kannada,
    - Kazakh,
    - Korean,
    - Latvian,
    - Lithuanian,
    - Macedonian,
    - Malay,
    - Marathi,
    - Maori,
    - Nepali,
    - Norwegian,
    - Persian,
    - Polish,
    - Portuguese,
    - Romanian,
    - Russian,
    - Serbian,
    - Slovak,
    - Slovenian,
    - Spanish,
    - Swahili,
    - Swedish,
    - Tagalog,
    - Tamil,
    - Thai,
    - Turkish,
    - Ukrainian,
    - Urdu,
    - Vietnamese,
    - Welsh.
- Optimized for English
- Give real time audio output using streaming


In [None]:
!pip install playsound

In [None]:
from playsound import playsound

# By default, the output format is mp3

speech_file_path = "tutorial/LLM+Langchain/Week-6/Sample.mp3"

# Stream the response body straight into a file
with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Today is Saturday, how are you?",
) as response:
    response.stream_to_file(speech_file_path)

# Play the file (simple, blocking)
playsound(speech_file_path)

The `gpt-4o-mini-tts` model lets you steer the tone of voice with a prompt passed in the `instructions` parameter.

In [None]:
speech_file_path = "tutorial/LLM+Langchain/Week-6/Sample-gpt-4o-mini.mp3"

# `instructions` describes how the text should be spoken
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Today is Saturday, how are you?",
    instructions="Speak in a tired tone.",
) as response:
    response.stream_to_file(speech_file_path)

playsound(speech_file_path)

In [None]:
# response.content
# audio = AudioSegment.from_file(io.BytesIO(audio_bytes), format="mp3")

### pygame (good for GUI apps / async playback)

In [None]:
import pygame

pygame.mixer.init()
pygame.mixer.music.load(speech_file_path)
pygame.mixer.music.play()

# Block until playback finishes, polling every 100 ms
while pygame.mixer.music.get_busy():
    pygame.time.wait(100)

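### Direct playback + streaming

Earlier we saved the result to a file and then played the file. We can skip the save step, which is somewhat faster.
We also switch the format from compressed mp3 to uncompressed wav.

Credit goes to the OpenAI community forum, which is full of hidden talent:

https://community.openai.com/t/streaming-from-text-to-speech-api/493784/29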

In [None]:
!pip install pyaudio

In [None]:
from textwrap import dedent

input_ = dedent("""
鳴大鐘一次！
推動杠杆，啟動活塞和泵……
鳴大鐘兩次！
按下按鈕，發動引擎，點燃渦輪，注入生命……
鳴大鐘三次！
齊聲歌唱，讚美萬機之神！
""")

In [None]:
import os
from time import sleep
import wave
import requests
import pyaudio

# WAV: Uncompressed WAV audio, suitable for low-latency applications to avoid decoding overhead.

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": f'Bearer {os.getenv("OPENAI_API_KEY")}',
}

data = {
    "model": "tts-1",
    "input": input_,
    "voice": "shimmer",
    "response_format": "wav",
}

# You send the request to the API (requests.post).
response = requests.post(url, headers=headers, json=data, stream=True)

CHUNK_SIZE = 1024

if response.ok:
    # You open the response audio data with wave.open.
    with wave.open(response.raw, 'rb') as wf:
        p = pyaudio.PyAudio()
        # You configure the PyAudio stream (p.open(...)).
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)

        # Use a separate name so the request payload `data` is not overwritten
        while len(frames := wf.readframes(CHUNK_SIZE)):
            # This line plays the audio
            stream.write(frames)

        # Sleep to make sure playback has finished before closing
        sleep(1)
        stream.close()
        p.terminate()
else:
    response.raise_for_status()

### 實時串流 (Real time streaming)

The Speech API provides support for realtime audio streaming using chunk transfer encoding. This means the audio can be played before the full file is generated and made accessible.

** For the fastest response times, we recommend using wav or pcm as the response format.
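
Because `pcm` output has no header, the player has to know the sample format in advance. The sketch below assumes 24 kHz, 16-bit, mono samples, which is how the OpenAI documentation describes the `pcm` format at the time of writing; check the current reference before relying on it. `pcm_duration_seconds` is an illustrative helper, not an SDK function.

```python
# Assumed PCM parameters for the speech endpoint: 24 kHz, 16-bit, mono.
PCM_SAMPLE_RATE = 24_000
PCM_BYTES_PER_SAMPLE = 2
PCM_CHANNELS = 1


def pcm_duration_seconds(num_bytes: int) -> float:
    """Playback duration of a headerless PCM buffer of the given size."""
    bytes_per_second = PCM_SAMPLE_RATE * PCM_BYTES_PER_SAMPLE * PCM_CHANNELS
    return num_bytes / bytes_per_second


# One second of audio under these assumptions is 48,000 bytes
print(pcm_duration_seconds(48_000))
```

The same arithmetic shows why small chunks play back quickly: a 1,024-byte chunk holds only about 21 ms of audio at this format.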

In [None]:
import asyncio

from openai import AsyncOpenAI
from openai.helpers import LocalAudioPlayer

async_openai = AsyncOpenAI()

async def main() -> None:
    async with async_openai.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Today is a wonderful day to build something people love!",
        instructions="Speak in a cheerful and positive tone.",
        response_format="pcm",
    ) as response:
        await LocalAudioPlayer().play(response)

await main()

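## Voice self-service drink ordering system

This is only a demo, so it is very rough. I am also not sure whether it is actually worth building.

Assume a very simple scenario: a customer says what drink they want, and Whisper extracts the order from the speech:

- Which drink
- Ice level
- Sugar level

Leaving the TTS response aside for now, let's first see whether Whisper can extract the content correctly.

### 清心

A real implementation would probably scrape the menu from the web.

- 珍珠蜂蜜鮮奶普洱
- 茶凍奶綠
- 嚴選高山茶
- 咖啡奶茶
- 冬瓜檸檬

Ice level: 正常冰 (regular ice) - 少冰 (less ice) - 微冰 (light ice) - 去冰 (no ice)  
Sugar level: 無糖 (no sugar) - 微糖 (light sugar) - 半糖 (half sugar) - 少糖 (less sugar)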

In [3]:
import os

os.chdir("../../../")

In [17]:
import os
from textwrap import dedent
from typing import Literal, List

from langchain_openai import ChatOpenAI
from langchain_core.prompts.image import ImagePromptTemplate
from langchain_core.prompts import PromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate
from langchain_core.runnables import Runnable, chain
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

from src.initialization import credential_init


def build_standard_chat_prompt_template(kwargs):
    messages = []

    if 'system' in kwargs:
        content = kwargs.get('system')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = [PromptTemplate(**c) for c in content]
        else:
            prompts = [PromptTemplate(**content)]

        message = SystemMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    if 'human' in kwargs:
        content = kwargs.get('human')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = []
            for c in content:
                if c.get("type") == "image":
                    prompts.append(ImagePromptTemplate(**c))
                else:
                    prompts.append(PromptTemplate(**c))
        else:
            if content.get("type") == "image":
                prompts = [ImagePromptTemplate(**content)]
            else:
                prompts = [PromptTemplate(**content)]

        message = HumanMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    chat_prompt_template = ChatPromptTemplate.from_messages(messages)
    
    return chat_prompt_template



credential_init()

model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4o-mini", 
                   temperature=0 # a range from 0-2, the higher the value, the higher the `creativity`
                  )

### Define the output format

In [14]:
class Drink(BaseModel):

    name: Literal['珍珠蜂蜜鮮奶普洱', '茶凍奶綠', '嚴選高山茶', 
                  '咖啡奶茶', '冬瓜檸檬'] = Field(description="飲料名稱")
    ice_level: Literal['正常冰', '少冰', '微冰', '去冰'] = Field(description='冰熱程度')
    sugar_level: Literal['無糖', '微糖', '半糖' , '少糖'] = Field(description='糖度')


class Order(BaseModel):

    names: List[Drink] = Field(description=("用戶點的飲料"))

output_parser = PydanticOutputParser(pydantic_object=Order)
format_instructions = output_parser.get_format_instructions()

### Simulate the customer request returned by Whisper

In [21]:
human_template = dedent("""
                 {query}
                 format instruction: {format_instructions}
                 """)

input_ = {
          "human": {"template": human_template,
                    "input_variables": ["query"],
                    "partial_variables": {"format_instructions": 
                                          format_instructions}}}

chat_prompt_template = build_standard_chat_prompt_template(input_)

pipeline = chat_prompt_template | model | output_parser

## Extract the key fields with the language model

In [22]:
output = pipeline.invoke({"query": "我要一杯冬瓜檸檬，微糖，去冰"})

In [23]:
output

Order(names=[Drink(name='冬瓜檸檬', ice_level='去冰', sugar_level='微糖')])

In [24]:
output = pipeline.invoke({"query": "我要一杯冬瓜檸檬和一杯嚴選高山茶。冬瓜檸檬微糖去冰，嚴選高山茶無糖微冰"})
output

Order(names=[Drink(name='冬瓜檸檬', ice_level='去冰', sugar_level='微糖'), Drink(name='嚴選高山茶', ice_level='微冰', sugar_level='無糖')])

In [25]:
import io
import threading

import numpy as np
import sounddevice as sd
from openai import OpenAI
from pydub import AudioSegment

AudioSegment.converter = r"tutorial\LLM+Langchain\Week-6\ffmpeg-essentials_build\bin\ffmpeg.exe"

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

FS = 44100
CHANNELS = 1
dtype = 'int16'

recorded_frames = []

def callback(indata, frames, time, status):
    recorded_frames.append(indata.copy())

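def record_audio():
    # Start/stop control: wait for Enter *before* opening the stream,
    # because the callback starts collecting audio as soon as the stream is open
    input("Press Enter to start recording...")
    with sd.InputStream(samplerate=FS, channels=CHANNELS, 
                        dtype=dtype, callback=callback):
        print("Recording... Press Enter again to stop.")
        input()
    print("Recording stopped.")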

def audio_to_mp3_io(audio, FS:int, channels: int):

    # Convert numpy array to AudioSegment
    audio_bytes = audio.tobytes()
    
    audio_segment = AudioSegment(
        data=audio_bytes,
        sample_width=audio.dtype.itemsize,
        frame_rate=FS,
        channels=channels
    )
    
    # Save as MP3 in-memory (as an object)
    mp3_io = io.BytesIO()
    audio_segment.export(mp3_io, format="mp3")
    mp3_io.seek(0)  # Rewind to start

    mp3_io.name = "word.mp3"

    return mp3_io



Sentence to read aloud for the recording below: 我要一杯冬瓜檸檬和一杯嚴選高山茶。冬瓜檸檬微糖去冰，嚴選高山茶無糖微冰

In [27]:
recorded_frames.clear()
recording_thread = threading.Thread(target=record_audio)
recording_thread.start()
recording_thread.join()

Press Enter to start recording... 
Recording... Press Enter again to stop.
Recording stopped.


In [48]:
from datetime import datetime
# Combine recorded frames
if recorded_frames:
    audio_np = np.concatenate(recorded_frames, axis=0)
else:
    audio_np = np.array([], dtype=dtype)

mp3_io =  audio_to_mp3_io(audio=audio_np, FS=FS, channels=1)

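# This endpoint does not support streaming.
# Note: the translations endpoint is documented to return English text; the recorded
# output below is Chinese nonetheless. Use client.audio.transcriptions.create
# when same-language text is the goal.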
time_begin = datetime.now()
whisper_output = client.audio.translations.create(
    model="whisper-1", 
    file=mp3_io,
    response_format="text",
    prompt="用戶在點飲料, 微糖, 微冰",
)
time_end = datetime.now()

print(time_end - time_begin)

0:00:04.080706


In [49]:
whisper_output

'我要一杯冬瓜檸檬和一杯嚴選高山茶, 冬瓜檸檬微糖去冰, 嚴選高山茶無糖微冰。\n'

## Compute the bill and respond

In [50]:
human_template = dedent("""
                 {query}
                 format instruction: {format_instructions}
                 """)

input_ = {
          "human": {"template": human_template,
                    "input_variables": ["query"],
                    "partial_variables": {"format_instructions": 
                                          format_instructions}}}

chat_prompt_template = build_standard_chat_prompt_template(input_)

pipeline = chat_prompt_template | model | output_parser

In [51]:
chat_prompt_template

ChatPromptTemplate(input_variables=['query'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=[PromptTemplate(input_variables=['query'], input_types={}, partial_variables={'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"$defs": {"Drink": {"properties": {"name": {"description": "飲料名稱", "enum": ["珍珠蜂蜜鮮奶普洱", "茶凍奶綠", "嚴選高山茶", "咖啡奶茶", "冬瓜檸檬"], "title": "Name", "type": "string"}, "ice_level": {"description": "冰熱程度", "enum": ["正常冰", "少冰", "微冰", "去冰"], "title": "Ice Level", "type": "string"}, "sugar_level": {"description": "糖度", "en

In [56]:
orders = pipeline.invoke({"query": whisper_output})

In [55]:
price_map = {"珍珠蜂蜜鮮奶普洱": 70,
         "茶凍奶綠": 50,
         "嚴選高山茶": 35,
         "咖啡奶茶": 75,
         "冬瓜檸檬": 60}

In [61]:
orders.names

[Drink(name='冬瓜檸檬', ice_level='去冰', sugar_level='微糖'),
 Drink(name='嚴選高山茶', ice_level='微冰', sugar_level='無糖')]

In [62]:
total_price = 0

for order in orders.names:
    total_price += price_map[order.name]

In [63]:
total_price

95
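
The total and the spoken reply can be produced together. The sketch below is one possible shape for that step, working on plain drink names so it runs without the model; `summarize_order` is a hypothetical helper, and the reply wording follows the "一共…元" phrasing used for the TTS input in this notebook.

```python
from collections import Counter

price_map = {"珍珠蜂蜜鮮奶普洱": 70, "茶凍奶綠": 50, "嚴選高山茶": 35,
             "咖啡奶茶": 75, "冬瓜檸檬": 60}


def summarize_order(drink_names: list[str], prices: dict[str, int]) -> tuple[int, str]:
    """Return the total price and a one-line summary to read back to the customer."""
    counts = Counter(drink_names)  # Group repeated drinks into quantities
    total = sum(prices[name] * qty for name, qty in counts.items())
    items = "、".join(f"{name} x{qty}" for name, qty in counts.items())
    return total, f"{items}，一共{total}元"


total, reply = summarize_order(["冬瓜檸檬", "嚴選高山茶"], price_map)
print(reply)
```

With the parsed order from the pipeline, the names could be passed as `[drink.name for drink in orders.names]`, and `reply` could serve as the TTS `input`.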

In [72]:
import asyncio

from openai import AsyncOpenAI
from openai.helpers import LocalAudioPlayer

async_openai = AsyncOpenAI()

async def main(price) -> None:
    async with async_openai.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=f"一共{price}元 💕",
        instructions="Speak in a sweet, energetic, anime-girl style with a cute and playful tone.",
        response_format="pcm",
    ) as response:
        await LocalAudioPlayer().play(response)

await main(price=total_price)

In [12]:
import json
import requests


whisper_output = '我要一杯冬瓜檸檬和一杯嚴選高山茶, 冬瓜檸檬微糖去冰, 嚴選高山茶無糖微冰。'

payload = {'input': {"query": whisper_output}}

# Passing `json=` lets requests serialize the payload and set the
# Content-Type header, so no manual encoding step is needed.
response = requests.post(
        "http://localhost:8080/drinking_app/invoke",
        json=payload
    )

In [15]:
# Parse the JSON body; eval() on a network response is unsafe and fails on true/false/null
order = response.json()

In [21]:
order['output']['names']

[{'name': '冬瓜檸檬', 'ice_level': '去冰', 'sugar_level': '微糖'},
 {'name': '嚴選高山茶', 'ice_level': '微冰', 'sugar_level': '無糖'}]

In [23]:
import pandas as pd

price_map = {"珍珠蜂蜜鮮奶普洱": 70,
         "茶凍奶綠": 50,
         "嚴選高山茶": 35,
         "咖啡奶茶": 75,
         "冬瓜檸檬": 60}

df = pd.DataFrame(order['output']['names'])

df['price'] = df['name'].map(price_map)   

In [24]:
df['price'].sum()

95

## Ollama

This package makes it easy to use open-source LLMs.

We reuse the content from last week.

https://medium.com/@abonia/running-ollama-in-google-colab-free-tier-545609258453

- curl https://ollama.ai/install.sh | sh
- ollama serve &
- ollama pull llama3:8b
- ollama pull dolphin-llama3:8b
- ollama pull huihui_ai/qwen2.5-abliterate:14b

In [None]:
!pip install -U ollama

In Colab

In [None]:
# !pip install colab-xterm
# %load_ext colabxterm

In [None]:
# %xterm

In [None]:
import os

from torch import cuda
from langchain.prompts import PromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate
from langchain_core.prompts.image import ImagePromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.runnables import Runnable, chain

from src.io.path_definition import get_project_dir


def build_standard_chat_prompt_template(kwargs):
    messages = []

    if 'system' in kwargs:
        content = kwargs.get('system')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = [PromptTemplate(**c) for c in content]
        else:
            prompts = [PromptTemplate(**content)]

        message = SystemMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    if 'human' in kwargs:
        content = kwargs.get('human')

        # allow list of prompts for multimodal
        if isinstance(content, list):
            prompts = []
            for c in content:
                if c.get("type") == "image":
                    prompts.append(ImagePromptTemplate(**c))
                else:
                    prompts.append(PromptTemplate(**c))
        else:
            if content.get("type") == "image":
                prompts = [ImagePromptTemplate(**content)]
            else:
                prompts = [PromptTemplate(**content)]

        message = HumanMessagePromptTemplate(prompt=prompts)
        messages.append(message)

    chat_prompt_template = ChatPromptTemplate.from_messages(messages)
    
    return chat_prompt_template


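From the calling code's point of view, using a model through Ollama is no different from using the OpenAI API.

However, it is best to host it on a machine with strong compute and call the service through LangServe, which reduces the compute needed locally.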

In [None]:
system_template = "You are a helpful AI assistant with excellent writing skill"

In [None]:
from tqdm import tqdm

from langchain_ollama import ChatOllama

stop_token_ids = None
model_id = "dolphin-llama3:8b"

device = f"cuda:{cuda.current_device()}" if cuda.is_available() else 'cpu'

model = ChatOllama(model=model_id, temperature=0)

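def build_summary_prompt_template():
    # Assumed stand-in: the original helper comes from last week's notebook and
    # is not shown here. The human template wording is hypothetical.
    return build_standard_chat_prompt_template({
        "system": {"template": system_template},
        "human": {"template": "Summarize the following text:\n{text}"},
    })


summary_prompt_template = build_summary_prompt_template()

summary_pipeline = summary_prompt_template | model | StrOutputParser()

# `documents` is assumed to be the list of split documents prepared in last week's notebook
text_as_list = []
for document in tqdm(documents):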
    content = summary_pipeline.invoke({"text": document.page_content})
    text_as_list.append(content)

final_text = "\n".join(text_as_list)

In [None]:
summary_pipeline.invoke({"text": final_text})