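# Setting Course Expectations

1. Build the basic concepts; you do not need to become an expert programmer

    - Even if you never plan to write code, you should at least come away with an intuitive understanding of LLMs (large language models):

        - which tasks AI can actually do for you
        - which claims made in proposals or by tools are exaggerated or even deceptive

2. The course cannot cover every need

    - Everyone's work setting, needs, and goals are different. This course offers general foundations and ways of thinking; it cannot cover every professional or business detail.

3. Narrow the communication gap between technology and business

    - When you talk with engineers, AI teams, or consultants, you will not be completely lost, and you will find it easier to judge which proposals are reasonable and which need follow-up questions.

4. Introductory first, with practical examples in support

    - This is an introductory course, but I will provide as many real examples, scenarios, and demonstrations as I can to help you ground the concepts and apply them later.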
  
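# Learning Mindset Tips

1. Don't chase perfection
    - The world of LLMs and AI changes quickly; an example you see today may be outdated tomorrow. What matters is understanding the concepts and the reasoning, not mastering every detail in one pass.

2. Be willing to experiment and make mistakes
    - AI is like a powerful assistant, and operating it is itself a way of learning. Errors and unexpected results are the best teachers.

3. Stay curious
    - Whatever your professional background, exploring AI can give you a new perspective. Asking "why does this work?" matters more than memorizing the steps.

4. Concepts first, technology second
    - Don't worry if you cannot write code. Understanding what AI can and cannot do, and where its limits lie, is more useful than mastering every detail.

5. Interact and share
    - Your question in class is probably bothering someone else too. Ask when you don't understand, and share your observations and ideas; this deepens understanding far more than passive listening.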

# Environment Setup

1. conda create -n aicg python=3.10
2. conda activate aicg
3. pip install -r requirements.txt
4. jupyter lab

# Prompt Engineering

## Code Documentation

SYSTEM: You are a helpful AI assistant acting as a Google Senior Software Developer who writes Python code documentation. I will give you the code, and you will write the documentation for it.

# LangChain

A mainstream application framework for large language models.

1. Modular Abstractions

    - Provides building blocks (LLM wrappers, prompts, memory, chains, agents) so you don’t reinvent patterns.
  
    - Helps organize projects in a scalable way instead of ad-hoc scripts.

2. Integrations & Ecosystem

    - Supports many LLM providers (OpenAI, Anthropic, local models, etc.) and vector databases (Pinecone, Weaviate, FAISS, etc.).

    - Makes it easy to swap components without rewriting large parts of code.

3. Rapid Prototyping

    - Good for quickly validating ideas: retrieval-augmented generation (RAG), tool use, or multi-step workflows.

    - Reduces boilerplate, so you can focus on application logic and user experience.

4. Community & Best Practices

    - Large developer community and ecosystem of templates.

    - Keeps pace with new techniques (e.g., function calling, agents, structured output).

5. Production-Readiness (with caveats)

    - LangChain Expression Language (LCEL) improves reproducibility and debugging.

    - Can be integrated with observability tools, tracing, and monitoring.

    - While early versions were criticized for complexity, the newer iterations emphasize stability and clearer abstractions.

6. Learning & Industry Alignment

    - Because it’s widely adopted, using LangChain means your skills and prototypes are transferable and recognized across teams and organizations.

In [None]:
import os

os.chdir("../../../")

In [None]:
# from langchain.chat_models import ChatOpenAI
from textwrap import dedent

from langchain_openai import ChatOpenAI

from src.initialization import credential_init
from src.io.path_definition import get_project_dir


credential_init()

model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4o-mini", 
                   temperature=0 # a range from 0-2, the higher the value, the higher the `creativity`
                  )

# temperature has a range from 0-2, the higher the temperature, the more creative/unpredictable the outcomes. 
# to have a stable or more deterministic result, you should choose temperature = 0
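
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled sampling probabilities. The three scores in `toy_logits` are made up for illustration, and treating `temperature = 0` as "always pick the top score" is an assumption about typical behavior, not a description of how any particular provider implements it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption: treat 0 as greedy decoding, all probability on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top candidate, while higher temperatures spread it out, which is why higher values feel more "creative" and less predictable.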

## Alternative Google Gemini Free-Tier

https://aistudio.google.com/usage

In [None]:
# !pip install -qU langchain-google-genai==2.0.0

In [None]:
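# Downgrade to langchain-google-genai 2.0.0. Welcome to the open-source world.

import os
from getpass import getpass

from langchain_google_genai import ChatGoogleGenerativeAI

# Never hard-code an API key in a notebook; read it from the environment
# or prompt for it at run time.
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google AI Studio API key: ")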

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

llm.invoke("What date is today?")

In [None]:
model.invoke("Tell me something about Apple Inc. Just a short summary")

In [None]:
output = model.invoke("Tell me something about Apple Inc. Just a short summary")

In [None]:
output.content

## Prompt Engineering SOP

### 1. Importing Necessary Modules:

This line imports the classes needed to create and manage prompt templates from LangChain.

In [None]:
from langchain_core.prompts import PromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate

### 2. Defining a System Prompt:

This code uses PromptTemplate to create a system_prompt. The template instructs the AI to act as Gordon Ramsay and imitate the way he speaks on the television show Hell's Kitchen.

## Persona Example

- Gordon Ramsay: the hot-tempered Hell's Kitchen mode

In [None]:
template=dedent("""
You are a helpful AI assistant embodying Gordon Ramsay, the British celebrity chef.
You adopt his passionate, blunt, and fiery communication style, particularly as seen 
in the television show Hell's Kitchen.\nYour responses should be sharp-witted, brutally honest,
and laced with his signature colorful language—while still being constructive and engaging.
When giving feedback, be direct but insightful, offering both criticism and praise as appropriate.
Adapt to the situation, dialing up the intensity for dramatic effect but maintaining professionalism where needed.
""")

system_prompt = PromptTemplate(template=template)

### 3. Creating a System Message Prompt:

This line wraps system_prompt in a SystemMessagePromptTemplate, which is used to generate the system message.

In [None]:
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

In [None]:
system_message

### 4. Defining a Human Prompt:

This code defines a human_prompt template that takes one variable, query. When the prompt is generated, this variable is replaced with the user's input.

In [None]:
human_prompt = PromptTemplate(template='{query}',
                              input_variables=["query"]
                              )

### 5. Creating a Human Message Prompt:

This line wraps human_prompt in a HumanMessagePromptTemplate, which is used to generate the human message.

In [None]:
human_message = HumanMessagePromptTemplate(prompt=human_prompt)

### 6. Combining the Prompts into a Chat Prompt:

This code uses the from_messages method to combine the system_message and human_message templates into a single ChatPromptTemplate. The template produces the conversation in order: the system message first, then the human message.

In [None]:
chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

In [None]:
chat_prompt

In [None]:
prompt = chat_prompt.invoke({"query": "A chef just finished his scallops, but you find it is still raw inside"})

In [None]:
prompt

In [None]:
output = model.invoke(prompt)

In [None]:
content = output.content

In [None]:
print(content)

How do we translate the output properly?

In [None]:
system_prompt = PromptTemplate(template=("You are a helpful AI assistant with native speaker fluency in both " 
                                         "English and traditional Chinese (繁體中文).\n" 
                                         "You will translate the given content.")
                              )
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template='{query}',
                              input_variables=["query"]
                              )
human_message = HumanMessagePromptTemplate(prompt=human_prompt)

translation_prompt_template =  ChatPromptTemplate.from_messages([system_message,
                                                                 human_message
                                                                ])

prompt = translation_prompt_template.invoke({"query": content})
print(prompt)

In [None]:
output = model.invoke(prompt)
print(output.content)

- Gordon Ramsay: the kind-hearted MasterChef Junior mode

In [None]:
template = dedent("""
You are a helpful AI assistant embodying Gordon Ramsay, the British celebrity chef.
You adopt his warm, encouraging, yet honest communication style, particularly as seen in 
the television show MasterChef Junior.\nYour responses should be passionate, supportive,
and constructive—offering praise where deserved while providing direct but kind feedback.
Maintain Ramsay’s signature energy and enthusiasm, but adjust your tone to be more nurturing 
and motivational, ensuring a balance of professionalism, humor, and inspiration.""")


system_prompt = PromptTemplate(template=template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

# Reuse the human message defined earlier

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"query": "A chef just finished his scallops, but you find it is still raw inside."})
output = model.invoke(prompt)

In [None]:
prompt = translation_prompt_template.invoke({"query": output.content})
output = model.invoke(prompt)
print(output.content)

- Donald Trump

In [None]:
template = dedent("""
You are a helpful AI assistant mimicking the behavior, speech patterns, and personality of Donald Trump.
Your responses should reflect his characteristic speaking style, including his confident tone,
persuasive rhetoric, and use of superlatives. You should express opinions in a bold, direct, and 
often hyperbolic manner while maintaining a sense of humor and showmanship.
Adapt your responses to be engaging, memorable, and charismatic, ensuring they align with the tone
and energy Trump is known for.
""")
    
system_prompt = PromptTemplate(template=template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

# Reuse the human message defined earlier

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                human_message
                                               ])

prompt = chat_prompt.invoke({"query": "You just won the US presidential election and you are going to give a speech."})
output = model.invoke(prompt)

In [None]:
prompt = chat_prompt.invoke({"query": """You are going to talk about your view on the southern border"""})
output = model.invoke(prompt)

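- Although this is a ChatModel, the model itself has no memory: it does not remember anything you said earlier. In ChatGPT, each time you submit a prompt, the application feeds your previous inputs and the model's previous answers back in as part of the prompt, which is why it can answer follow-up questions. The downside is that once the model's answers drift off track, it is hard to steer them back, because a chat model is essentially doing n-shot learning on the conversation so far: in plain terms, it talks the way it has been talked to. Once it starts producing nonsense, pulling it back becomes harder. The practical fix is to close the session and start over.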
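
The following sketch illustrates the point above about memory living in the application rather than in the model. `fake_model` is a hypothetical stand-in written for this example, not a real LLM call; it only reports what it can see in the messages it receives.

```python
def fake_model(messages):
    """Hypothetical stand-in for a chat model: it only sees what is in `messages`."""
    user_turns = [content for role, content in messages if role == "human"]
    return f"I can see {len(user_turns)} human message(s); the latest is: {user_turns[-1]!r}"

history = [("system", "You are a helpful assistant.")]

def chat(user_input):
    # The application, not the model, is responsible for remembering past turns.
    history.append(("human", user_input))
    reply = fake_model(history)  # the full history is resent on every call
    history.append(("ai", reply))
    return reply

print(chat("My name is Alice."))
print(chat("What is my name?"))
print(len(history))
```

Because every earlier turn is resent, an off-track answer stays in the prompt and keeps influencing later replies; starting a new session simply means starting again with an empty `history`.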

### There is more than one way to construct your prompt:

- ("system", system_prompt.template): This tuple indicates a system message. system_prompt.template refers to the template content for the system's message.

- ("human", human_prompt.template): This tuple indicates a human message. human_prompt.template refers to the template content for the human's message.

In [None]:
chat_prompt_template = ChatPromptTemplate.from_messages([("system", system_prompt.template),
                                                ("human", human_prompt.template)
                                               ])

In [None]:
chat_prompt_template.invoke({"query": "A chef just finished his scallops but you find it is still raw inside."})

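- A template is similar to a Python string, but it contains placeholders for variables. LangChain can automatically identify and manage these variables, which simplifies generating dynamic content.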
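
As a rough illustration of how placeholder names can be discovered in a template, the sketch below uses only the Python standard library. `find_variables` is a hypothetical helper written for this example; it is not LangChain's own implementation.

```python
from string import Formatter

def find_variables(template):
    """Return the placeholder names found in a format-style template, in order."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names

template = "start: {start}; end: {end}"
variables = find_variables(template)
print(variables)
print(template.format(start="臺北101", end="淡水老街"))
```

This is why a template such as `"{query}"` can be used without listing its variables by hand: the names can be read directly from the braces.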

In [None]:
chat_prompt_template = ChatPromptTemplate.from_messages([("system", system_prompt.template),
                                                 ("human", "{query}")
                                               ])

In [None]:
chat_prompt_template.invoke({"query": "A chef just finished his scallops but you find it is still raw inside."})

In [None]:
prompt = chat_prompt_template.invoke({"query": "A chef just finished his scallops but you find it is still raw inside."})

In [None]:
prompt

In [None]:
# feed the prompt into the model
prompt = chat_prompt_template.invoke({"query": "A chef just finished his scallops but you find it is still raw inside."})
model.invoke(prompt)

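## Output Format Control: The Stone Age Version

ChatGPT can produce output in countless formats; unless you control the format, it is hard to use the output at scale. Imagine typing up a document in Word, sending it to the printer, and finding that the fonts have shifted.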
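
To show why a fixed format matters downstream, here is a small sketch that reads question/answer pairs back out of a reply that follows a `*** Question:*** / *** Answer:***` template. The reply text in `sample_reply` is invented for illustration; a real model reply may deviate from the template, in which case the pattern simply finds nothing.

```python
import re

# Hypothetical model reply that follows the question/answer template.
sample_reply = """\
*** Question:*** Who founded the company?
*** Answer:*** It was founded by two engineers.
*** Question:*** When was it founded?
*** Answer:*** In 1998.
"""

pair_pattern = re.compile(
    r"\*\*\* Question:\*\*\*\s*(?P<question>.+?)\s*\n"
    r"\*\*\* Answer:\*\*\*\s*(?P<answer>.+)"
)

qa_pairs = [m.groupdict() for m in pair_pattern.finditer(sample_reply)]
for pair in qa_pairs:
    print(pair)
```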

In [None]:
# !pip install wikipedia-api

In [None]:
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia(user_agent='AI Tutorial(mengchiehling@gmail.com)', language='zh-tw')

ayoung_wiki = wiki_wiki.page("李雅英")

In [None]:
ayoung_wiki.text

In [None]:
system_template = dedent("""
                  I am going to give you a template for your output. 
                  CAPITALIZED WORDS are my placeholders. Fill in my 
                  placeholders with your output. Please preserve the 
                  overall formatting of my template. My template is:

                 *** Question:*** QUESTION
                 *** Answer:*** ANSWER
                
                 I will give you the data to format in the next prompt. 
                 Create three questions using my template.
                 """)


system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template='{query}',
                                  input_variables=["query"]
                                  )
human_message = HumanMessagePromptTemplate(prompt=human_prompt)

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"query": ayoung_wiki.text})

output = model.invoke(prompt)

print(output.content)

In [None]:
system_template = dedent("""
                 I am going to give you a template for your output. CAPITALIZED
                 WORDS are my placeholders. Fill in my placeholders with your 
                 output. Please preserve the overall formatting of my template. 
                 
                 My template is:
                
                 ## Bio: <NAME>
                 ***Executive Summary:*** <ONE SENTENCE SUMMARY>
                 ***Full Description:*** <ONE PARAGRAPH SUMMARY>
                
                 """)

system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template='{query}',
                                  input_variables=["query"]
                                  )
human_message = HumanMessagePromptTemplate(prompt=human_prompt)

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"query": ayoung_wiki.text})

output = model.invoke(prompt)

print(output.content)

## Automatic Pattern Recognition

In [None]:
system_template = dedent("""
                  I will tell you my start and 
                  end destination and you will provide a 
                  complete list of stops for me, including places to stop 
                  between my start and destination.
                  """)

system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template='{query}',
                              input_variables=["query"]
                             )
human_message = HumanMessagePromptTemplate(prompt=human_prompt)

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                human_message
                                               ])

query = "台東太麻里->Day1->Day2->花蓮天祥"

prompt = chat_prompt.invoke({"query": query})

output = model.invoke(prompt)

print(output.content)

In [None]:
system_template = dedent("""
                  I will tell you my start and end destination and you will 
                  provide a complete list of stops for me, including places 
                  to stop between my start and destination.
                  The output should be in traditional Chinese (繁體中文)
                  """)

system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template='{query}',
                                  input_variables=["query"]
                                  )
human_message = HumanMessagePromptTemplate(prompt=human_prompt)

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

query = "台東太麻里->Day1->Day2->花蓮天祥"

prompt = chat_prompt.invoke({"query": query})

output = model.invoke(prompt)

print(output.content)

In [None]:
# system_template = """
#                   Christmas is coming and I want to ask a girl out. 
#                   Please design a great dating experience for us. 
#                   I will tell you my <start> and <end> destination and you 
#                   will provide a complete list of stops for me, including 
#                   places to stop between my start and destination.
#                   The output should be in traditional Chinese (繁體中文)
#                   """

# system_prompt = PromptTemplate(template=system_template)
# system_message = SystemMessagePromptTemplate(prompt=system_prompt)

# human_prompt = PromptTemplate(template='start: {start}; end: {end}',
#                                   input_variables=["start", "end"]
#                                   )
# human_message = HumanMessagePromptTemplate(prompt=human_prompt)

# chat_prompt = ChatPromptTemplate.from_messages([system_message,
#                                                  human_message
#                                                ])

# """
# Give me some ideas; this kind of question fries my brain~~
# """
# start = "臺北101"
# end = "淡水老街"

# prompt = chat_prompt.invoke({"start": start, "end": end})

# output = model.invoke(prompt)

# print(output.content)

### Let us wrap the chat_prompt generation in a Python function:

In [None]:
def build_standard_chat_prompt_template(kwargs):

    system_content = kwargs['system']
    human_content = kwargs['human']
    
    system_prompt = PromptTemplate(**system_content)
    system_message = SystemMessagePromptTemplate(prompt=system_prompt)
    
    human_prompt = PromptTemplate(**human_content)
    human_message = HumanMessagePromptTemplate(prompt=human_prompt)
    
    chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                     human_message
                                                   ])

    return chat_prompt

system_template = dedent("""
                  Christmas is coming and I want to ask a girl out. 
                  Please design a great dating experience for us. 
                  I will tell you my <start> and <end> destination and you 
                  will provide a complete list of stops for me, including 
                  places to stop between my start and destination.
                  The output should be in traditional Chinese (繁體中文)
                  """)


# Note: the keyword accepted by PromptTemplate is `input_variables` (plural).
input_ = {"system": {"template": system_template},
          "human": {"template": 'start: {start}; end: {end}',
                    "input_variables": ["start", "end"]}}

my_chat_prompt_template = build_standard_chat_prompt_template(input_)
print(my_chat_prompt_template)

In [None]:
start = "臺北101"
end = "淡水老街"

prompt = my_chat_prompt_template.invoke({"start": start, 
                                         "end": end})
print(prompt)

In [None]:
output = model.invoke(prompt)

print(output.content)

# **** Expected end of the first hour ****

## Output Format Control: The Precision Strike Version

### 1. Importing Necessary Classes:

- StructuredOutputParser and ResponseSchema are imported from langchain.output_parsers.

In [None]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

### 2. Defining Response Schemas:

- A list named response_schemas is created, containing instances of ResponseSchema. ResponseSchema has two attributes:
    - name: the key used to retrieve the output.
    - description: the part of the prompt that describes what the output should be.



In [None]:
response_schemas = [
        ResponseSchema(name="result", 
                       description=dedent("""
                                   The result as a python list of 
                                   python dictionaries"""))
    ]

### 3. Creating the Output Parser:

- output_parser is created by calling StructuredOutputParser.from_response_schemas with the response_schemas list.
- This parser uses the defined schemas to understand and structure the output.
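
To give a feel for what a structured output parser does with a model reply, here is a minimal sketch using only the standard library. `parse_json_reply` and the sample reply are hypothetical and written for this example; this is not LangChain's actual implementation, only the general idea of extracting a JSON block and checking for the expected keys.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def parse_json_reply(text, expected_keys):
    """Extract a JSON object from a model reply and check that required keys exist."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.S)
    payload = match.group(1) if match else text  # fall back to the raw text
    data = json.loads(payload)
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data

# Hypothetical model reply wrapped in a Markdown JSON block.
sample_reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"result": [{"question": "Q1", "answer": "A1"}]}\n'
    + FENCE
)

parsed = parse_json_reply(sample_reply, expected_keys=["result"])
print(parsed["result"][0]["question"])
```

The `name` of each schema plays the role of an expected key here, which is why the parsed output can later be accessed with that key.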

In [None]:
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

In [None]:
output_parser

### 4. Generating Format Instructions:

- format_instructions is generated by calling output_parser.get_format_instructions().
- These instructions specify how the output should be formatted, based on the defined schemas.

In [None]:
format_instructions = output_parser.get_format_instructions()

In [None]:
print(format_instructions)

In [None]:
system_template = dedent("""
                I am going to give you a template for your output. CAPITALIZED WORDS are my placeholders. Fill in my placeholders with your output. 
                Please preserve the overall formatting of my template. My template is:
                
                *** Question:*** QUESTION
                *** Answer:*** ANSWER
                
                I will give you the data to format in the next prompt. Create three questions using my template.
                """)


system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template=dedent("""
                                        {query}\n 
                                        output format instruction: {abc}
                                        """),
                              input_variables=["query"],
                              partial_variables={'abc': format_instructions}
                              )
human_message = HumanMessagePromptTemplate(prompt=human_prompt) 

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

In [None]:
query = ayoung_wiki.text

In [None]:
prompt = chat_prompt.invoke({"query": query})

output = model.invoke(prompt)

In [None]:
print(output.content)

In [None]:
output_parser.parse(output.content)

In [None]:
parsed_output = output_parser.parse(output.content)

In [None]:
parsed_output['result']

In [None]:
for content in parsed_output['result']:
    print("\n*****************")
    print(content)

## Pydantic output control

In [None]:
from typing import List

from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser

class result(BaseModel):

    question: str = Field(description="A question.")
    answer: str = Field(description="Answer to the question.")


class Output(BaseModel):

    names: List[result] = Field(description=("A list of question/answer pairs"))


output_parser = PydanticOutputParser(pydantic_object=Output)
format_instructions = output_parser.get_format_instructions()

system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template=dedent("""
                                        {query}\n 
                                        output format instruction:
                                        {abc}
                                        """),
                              input_variables=["query"],
                              partial_variables={'abc': format_instructions}
                              )

human_message = HumanMessagePromptTemplate(prompt=human_prompt) 

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"query": ayoung_wiki.text})

output = model.invoke(prompt)

In [None]:
parsed_output = output_parser.parse(output.content)

In [None]:
parsed_output

In [None]:
parsed_output.names

In [None]:
parsed_output.names[0]

In [None]:
parsed_output.names[0].question

In [None]:
parsed_output.names[0].answer

## Practice with a Few More Variations



In [None]:
class Output(BaseModel):
    bio: str = Field(description="name")
    executive_summary: str = Field(description="One sentence executive summary.")
    full_description: str = Field(description="One paragraph summary")

output_parser = PydanticOutputParser(pydantic_object=Output)
format_instructions = output_parser.get_format_instructions()


system_template = dedent("""
                 I am going to give you a template for your output. CAPITALIZED
                 WORDS are my placeholders. Fill in my placeholders with your 
                 output. Please preserve the overall formatting of my template. 
                 
                 My template is:
                
                 ## Bio: <NAME>
                 ***Executive Summary:*** <ONE SENTENCE SUMMARY>
                 ***Full Description:*** <ONE PARAGRAPH SUMMARY>
                
                 """)

system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template=("{query}\n" 
                                        "output format instruction: "
                                        "{format_instructions}"),
                              input_variables=["query"],
                              partial_variables={'format_instructions': format_instructions}
                              )

human_message = HumanMessagePromptTemplate(prompt=human_prompt) 

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"query": ayoung_wiki.text})

output = model.invoke(prompt)

In [None]:
output

In [None]:
parsed_output = output_parser.parse(output.content)

parsed_output.bio

In [None]:
parsed_output.executive_summary

In [None]:
parsed_output.full_description

## Worksheet Generation

I have a list of words:

- die Muskeln
- die Richtung
- die Schnur
- die Geschicklichkeit
- schnurren
- das Fell
- das Geräusch
- jagen
- schmusen
- riechen

Please create a PDF file that follows this structure:

**<WORD>**:
<SENTENCE CONTAINING THE WORD>

and a short article containing all these words.

In [None]:
class Output(BaseModel):
    name: str = Field(description="generated sentence of the word")

output_parser = PydanticOutputParser(pydantic_object=Output)
format_instructions = output_parser.get_format_instructions()

words = ["die Muskeln", "die Richtung", "die Schnur", "die Geschicklichkeit",
         "schnurren", "das Fell", "das Geräusch", "jagen", "schmusen", "riechen"]

system_template = ("You are a helpful AI assistant and you are going to help me create a sentence "
                   "in German for each of the given words.")
system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template=("{word}\n\nOutput instruction: {format_instructions}"),
                              input_variables=["word"],
                              partial_variables={'format_instructions': format_instructions}
                              )
human_message = HumanMessagePromptTemplate(prompt=human_prompt) 

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"word": "die Muskeln"})

output = model.invoke(prompt)

parsed_output = output_parser.parse(output.content)

print(parsed_output.name)

In [None]:
words_sentences = {}

for word in words:
    
    prompt = chat_prompt.invoke({"word": word})

    output = model.invoke(prompt)

    sentence = output.content

    parsed_output = output_parser.parse(output.content)

    words_sentences[word] = parsed_output.name

In [None]:
words_sentences

In [None]:
system_template = dedent("""
You are a helpful AI assistant and you are going to help me 
create a short article containing all these words in German.
""")

system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template=("{words}"),
                              input_variables=["words"],
                              )
human_message = HumanMessagePromptTemplate(prompt=human_prompt) 

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

prompt = chat_prompt.invoke({"words": ", ".join(words)})

output = model.invoke(prompt)

story = output.content

In [None]:
!pip install fpdf

In [None]:
from fpdf import FPDF

# Create the PDF
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", 'B', 16)
pdf.cell(0, 10, 'Wortliste mit Beispielsätzen', ln=True)

pdf.set_font("Arial", '', 12)
for word, sentence in words_sentences.items():
    pdf.ln(5)
    pdf.set_font("Arial", 'B', 12)
    pdf.cell(0, 10, f"{word}:", ln=True)
    pdf.set_font("Arial", '', 12)
    pdf.multi_cell(0, 10, sentence)

# Add article
pdf.add_page()
pdf.set_font("Arial", 'B', 16)
pdf.cell(0, 10, 'Artikel mit allen Wörtern', ln=True)
pdf.set_font("Arial", '', 12)
pdf.multi_cell(0, 10, story)

filename = os.path.join(get_project_dir(), 'tutorial', 'LLM+Langchain', 
                        'Week-1', 'Wortliste_und_Artikel.pdf')

# Save the PDF
pdf.output(filename)

# Content Enhancement

## Okapi BM25 Retrieval System

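- Purpose: Okapi BM25 helps find the documents most relevant to what you are searching for.

- Documents and terms:
    
    - Imagine you have a pile of books (documents).
    - Each book contains many words.

- Search query:

    - When you search, you type in a few words (your query).

- Scoring system:

    - Okapi BM25 gives each book a score based on how well it matches your query.

- Scoring factors:

    - Term frequency: if a word from your query appears many times in a book, that book gets a higher score.
    - Inverse document frequency: if a word is rare across all the books but appears in a particular book, that book gets a higher score.
    - Document length: scores are adjusted for longer books so that they do not get an unfair advantage simply for being long.

- Formula:

    - BM25 uses a mathematical formula to combine these factors and compute the score.

- Picking the best:

    - The books with the highest scores are considered the most relevant to your query.

- Results:

    - These high-scoring books are shown to you as the search results.

Think of it this way: Okapi BM25 is like a smart librarian who uses the words in your search to judge which books are likely to be the most interesting and helpful.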
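
The three scoring factors above can be combined as in the following sketch of a textbook BM25 formula. The defaults `k1=1.5` and `b=0.75`, the whitespace tokenizer, and the "+ 1" inside the logarithm are assumptions made for this example; library implementations such as rank_bm25 may differ in these details. The three "books" are made-up sentences.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a larger weight (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: long documents need more hits for the same credit.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores

books = [
    "the cat chased the mouse",
    "the dog slept in the sun all day long",
    "a cat and a dog became friends",
]
scores = bm25_scores("cat mouse", books)
best = max(range(len(books)), key=lambda i: scores[i])
print(scores)
print(books[best])
```

The book that contains both query words ranks first, the one that contains only "cat" ranks second, and the one with neither word scores zero.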

### Term Frequency (TF) & Inverse Document Frequency (IDF):

#### Term Frequency:

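The frequency distribution of the words in a document is used as the document's feature vector.


#### Inverse Document Frequency:

Normalization: words that appear throughout the whole corpus have their weight reduced.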
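
A small worked example may help here. The sketch below computes term frequency and the textbook inverse document frequency, log(N / df), on a made-up three-sentence corpus. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in corpus]

def term_frequency(tokens):
    """Share of the document taken up by each word."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def inverse_document_frequency(all_tokens):
    """log(N / df): words that appear in every document get weight 0."""
    n_docs = len(all_tokens)
    doc_freq = Counter(word for tokens in all_tokens for word in set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

tf = term_frequency(tokenized[0])
idf = inverse_document_frequency(tokenized)
tfidf = {word: tf[word] * idf[word] for word in tf}

for word, weight in sorted(tfidf.items(), key=lambda item: -item[1]):
    print(f"{word:>4}  tf={tf[word]:.2f}  idf={idf[word]:.2f}  tf-idf={weight:.3f}")
```

In the first sentence, "the" is the most frequent word but appears in every document, so its weight drops to zero, while "mat", which appears only in that sentence, receives the highest weight.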

In [None]:
import os
import requests

url = "https://www.gutenberg.org/cache/epub/1041/pg1041.txt"
response = requests.get(url)

filename = os.path.join("tutorial", "LLM+Langchain", "Week-1", "pg1041.txt")

# Ensure the request was successful
if response.status_code == 200:
    with open(filename, "w", encoding="utf-8") as f:
        f.write(response.text)
    print("File downloaded successfully.")
else:
    print("Failed to download file. Status code:", response.status_code)

In [None]:
import re

# Read file
with open(filename, "r", encoding="utf-8") as f:
    text = f.read()

# Extract main body only
match = re.search(r"\*\*\* START OF.*?\*\*\*(.*)\*\*\* END OF", text, re.S)
if match:
    body = match.group(1)
else:
    body = text  # fallback

In [None]:
# Split into sonnets: Roman numeral headings
pattern = r"\n([IVXLCDM]+)\n"   # captures numerals as headers
parts = re.split(pattern, body)

# Reconstruct mapping number → sonnet text
sonnets = {}
for i in range(1, len(parts), 2):
    number = parts[i].strip()
    poem = parts[i+1].strip()
    sonnets[number] = poem

# Example: print first two sonnets
for n in ["I", "II"]:
    print(f"Sonnet {n}:\n{sonnets[n]}\n")

In [None]:
sonnets['I']

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([sonnets['I']])

In [None]:
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

In [None]:
# Convert to DataFrame
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T

# We will use this later
sampled_columns = vectorizer.get_feature_names_out()

df.columns = ["frequency"]

# Sort descending
df = df.sort_values("frequency", ascending=False)

print(df.head(10))

In [None]:
df_sonnet = pd.DataFrame.from_dict(sonnets, orient='index', columns=['text'])

In [None]:
df_sonnet.head(5)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_sonnet['text'])

In [None]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
df[sampled_columns].iloc[0].T

In [None]:
df[sampled_columns].iloc[0].T.loc['the']

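Okapi BM25 can be viewed as keyword search: results are weighted by how often each keyword appears in a passage and by how rare that keyword is across the corpus.

## Okapi BM25 in LangChain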

https://api.python.langchain.com/en/latest/_modules/langchain_community/retrievers/bm25.html#BM25Retriever

In [None]:
import os
import json

from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

### 2. Creating Documents from Training Data:


In [None]:
documents = []

for idx, row in df_sonnet.iterrows():
    document = Document(page_content=row['text'],
                        metadata={"id": idx})
    documents.append(document)

### 3. Initializing BM25Retriever:

- BM25Retriever.from_documents builds a BM25Retriever instance from the documents list.
- Parameters:
    - k=2: the number of documents to retrieve per query.
    - bm25_params={"k1": 2.5}: BM25 parameters passed to the underlying rank_bm25 implementation. k1 controls how quickly repeated occurrences of a term stop adding to the score.
    
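
To get a feel for what `k1` does, the sketch below evaluates only the term-frequency part of the BM25 formula, `tf * (k1 + 1) / (tf + k1)`. It ignores IDF and assumes a document of average length, so the length normalization drops out. The helper name `tf_component` is made up for this illustration.

```python
def tf_component(tf, k1):
    """Term-frequency part of BM25 for a document of average length."""
    return tf * (k1 + 1) / (tf + k1)

# Compare a common default (1.2) with the value used in this notebook (2.5)
for k1 in (1.2, 2.5):
    row = [round(tf_component(tf, k1), 2) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: {row}")
```

The value approaches `k1 + 1` as the term frequency grows, so a larger `k1` lets repeated occurrences keep adding to the score for longer before the curve flattens out.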

In [None]:
# !pip install rank_bm25

In [None]:
bm25_retriever = BM25Retriever.from_documents(documents, k=2, 
                                              bm25_params={"k1":2.5})

https://tolkiengateway.net/wiki/The_Road_Goes_Ever_On_(song)

In [None]:
from textwrap import dedent

query = dedent("""
Roads go ever ever on,
Over rock and under tree,
By caves where never sun has shone,
By streams that never find the sea;
Over snow by winter sown,
And through the merry flowers of June,
Over grass and over stone,
And under mountains in the moon.

Roads go ever ever on
Under cloud and under star,
Yet feet that wandering have gone
Turn at last to home afar.
Eyes that fire and sword have seen
And horror in the halls of stone
Look at last on meadows green
And trees and hills they long have known
"""
)

### 5. Getting Top N Results:

In [None]:
output = bm25_retriever.invoke(query)
print(output)

### Byte Pair Encoding (BPE)

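English looks easy to split: every word has a clear start and end. But how do you split text like Chinese or Japanese, where there are no spaces between words?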

Byte Pair Encoding (BPE) learns frequent character pairs in text and merges them into tokens. For Traditional Chinese, it begins at the character level and gradually merges frequent pairs.

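Steps to build a BPE tokenizer from scratch:

1. Prepare a small Traditional Chinese corpus.
2. Train a BPE tokenizer with tokenizers (Hugging Face).
3. Apply it to a sentence.

The cell below skips the first two steps and loads a pre-trained tokenizer instead.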
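
The merge loop at the heart of BPE training can be sketched without any library. The snippet below is a simplified illustration, assuming a tiny made-up corpus, single characters as the starting symbols, and a fixed number of merges. The helper names (`merge_pair`, `train_bpe`, `tokenize`) are hypothetical; production tokenizers add pre-tokenization, byte-level fallbacks, and special tokens on top of this idea.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    corpus = [list(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges

def tokenize(text, merges):
    """Apply the learned merges, in order, to new text."""
    symbols = list(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Toy corpus, invented for this example
sentences = ["閱讀書籍", "閱讀資料", "英文資料", "英文書籍", "閱讀英文"]
merges = train_bpe(sentences, num_merges=4)
print(merges)
print(tokenize("我正在閱讀英文書籍", merges))
```

With this toy corpus the four most frequent character pairs are merged into two-character tokens, while characters that never appeared in training (such as 我, 正, 在) stay as single-character tokens.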

In [None]:
from transformers import AutoTokenizer

# Load the pre-trained BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("p208p2002/llama-traditional-chinese-120M")

# Example usage
text = "我正在閱讀書籍，也在看英文資料。"
encoded = tokenizer(text)
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

I know some of you have a bold idea in mind, so I have included a Japanese tokenizer as well.

In [None]:
!pip install fugashi unidic-lite

In [None]:
from transformers import AutoTokenizer

"""
The ## prefix is used by WordPiece tokenizers (like BERT's).
It means "this subword is a continuation of the previous token."
"""

# Example: Japanese BERT v2
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

text = "<your bold idea here>"  # placeholder: replace with your own Japanese text
tokens = tokenizer.tokenize(text)
print(tokens)

In [None]:
def merge_wordpiece(tokens):
    merged = []
    current_word = ""
    for token in tokens:
        if token.startswith("##"):
            current_word += token[2:]  # append without '##'
        else:
            if current_word:  # push the previous word
                merged.append(current_word)
            current_word = token  # start a new word
    if current_word:  # push the last one
        merged.append(current_word)
    return merged

print(merge_wordpiece(tokens))

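Download the Chinese corpus

- https://github.com/rime-aca/corpus/blob/master/唐詩三百首.txt

It is not that I love literature; these datasets are simply easier to find, and using them will not get me sued.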

In [None]:
import re

# Read file
filename = os.path.join("tutorial", "LLM+Langchain", "Week-1", "唐詩三百首.txt")
with open(filename, "r", encoding="utf-8") as f:
    text = f.read()

poems = []

# Split by blank lines
blocks = [b.strip() for b in text.strip().split("\n\n") if b.strip()]

for block in blocks:
    entry = {}
    for line in block.split("\n"):
        if line.startswith("詩名:"):
            entry["詩名"] = line.replace("詩名:", "").strip()
        elif line.startswith("作者:"):
            entry["作者"] = line.replace("作者:", "").strip()
        elif line.startswith("詩體:"):
            entry["詩體"] = line.replace("詩體:", "").strip()
        elif line.startswith("詩文:"):
            entry["詩文"] = line.replace("詩文:", "").strip()
    if len(entry) != 0:
        poems.append(entry)

In [None]:
poems[0]

In [None]:
# # Read file
# filename = os.path.join("tutorial", "LLM+Langchain", "Week-1", "宋詞三百首.txt")
# with open(filename, "r", encoding="utf-8") as f:
#     text = f.read()

# # Split by blank lines
# blocks = [b.strip() for b in text.strip().split("\n\n") if b.strip()]

# for block in blocks:
#     entry = {}
#     for line in block.split("\n"):
#         if line.startswith("詞牌:"):
#             entry["詞牌"] = line.replace("詞牌:", "").strip()
#         elif line.startswith("作者:"):
#             entry["作者"] = line.replace("作者:", "").strip()
#         elif line.startswith("詞文:"):
#             entry["詞文"] = line.replace("詞文:", "").strip()
#     if len(entry) != 0:
#         poems.append(entry)

In [None]:
import pandas as pd

df_poem = pd.DataFrame(poems)

documents = []

for _, row in df_poem.iterrows():
    document = Document(page_content=row['詩文'],
                        metadata={"詩名": row["詩名"],
                                  "作者": row["作者"],
                                  "詩體": row["詩體"]})
    documents.append(document)

In [None]:
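# Reload the Traditional Chinese tokenizer under its own name: the global
# `tokenizer` was reassigned to the Japanese BERT tokenizer in an earlier cell.
zh_tokenizer = AutoTokenizer.from_pretrained("p208p2002/llama-traditional-chinese-120M")

def _preprocess_func(text: str):

    # 1. Define special tokens to remove
    special_tokens = {"<s>", "</s>", "[PAD]", "[UNK]"}

    encoded = zh_tokenizer(text)

    tokens = zh_tokenizer.convert_ids_to_tokens(encoded["input_ids"])

    # 2. Remove special tokens and strip the "▁" word-boundary marker
    tokens = [t.replace("▁", "") for t in tokens if t not in special_tokens]

    # 3. Remove punctuation and empty strings (keep only Chinese/English/number tokens)
    tokens = [t for t in tokens if re.match(r'[\w一-龥]+', t)]

    # Stringify the tokens
    return [str(token) for token in tokens]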


bm25_poem_retriever = BM25Retriever.from_documents(documents, k=5, 
                                                   bm25_params={"k1":2.5},
                                                   preprocess_func=_preprocess_func
                                                  )

In [None]:
bm25_poem_retriever.invoke("大風起兮雲飛揚 威加海內兮歸故鄉 安得猛士兮守四方")

In [None]:
bm25_poem_retriever.invoke("夕陽無限好")

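Let's turn a poem from the Shijing (詩經) into a five-character quatrain (五言絕句)... is there anyone here whose Chinese is better than mine? XD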

In [None]:
from textwrap import dedent

from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate
# query

query = dedent("""
蒹葭蒼蒼、白露為霜。
所謂伊人、在水一方。
遡洄從之、道阻且長。
遡遊從之、宛在水中央。
""")

# output format
class Output(BaseModel):
    name: str = Field(description="result in traditional Chinese (繁體中文)")

output_parser = PydanticOutputParser(pydantic_object=Output)
format_instructions = output_parser.get_format_instructions()


# prompt template
system_template = dedent("""
You are a helpful AI assistant with expertise in classical Chinese literature.
You understand the nuances and historical background of the content.
""")
system_prompt = PromptTemplate(template=system_template)
system_message = SystemMessagePromptTemplate(prompt=system_prompt)

human_prompt = PromptTemplate(template=("""
Create a {poetic_form}

Examples:
{context}

according to the meaning of {query}

Output instruction: {format_instructions}
"""),
input_variables=["poetic_form", "query", "context"],
partial_variables={'format_instructions': format_instructions}
)
human_message = HumanMessagePromptTemplate(prompt=human_prompt) 

chat_prompt = ChatPromptTemplate.from_messages([system_message,
                                                 human_message
                                               ])

# retrieval
# The BM25 retriever does not support metadata filters,
# so filter the documents before building the index

df_poem = pd.DataFrame(poems)

documents = []

for _, row in df_poem.iterrows():
    if row["詩體"] == "五言絕句":
        document = Document(page_content=row['詩文'],
                            metadata={"詩名": row["詩名"],
                                      "作者": row["作者"],
                                      "詩體": row["詩體"]})
        documents.append(document)

bm25_poem_retriever = BM25Retriever.from_documents(documents, k=5, 
                                                   bm25_params={"k1":2.5},
                                                   preprocess_func=_preprocess_func
                                                  )

context = bm25_poem_retriever.invoke(query)

print(context)

In [None]:
context = "\n".join([c.page_content for c in context])

print(context)

In [None]:
# Switch to gpt-4o; gpt-4o-mini is weak at this kind of task

model_poem = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                        model_name="gpt-4o",
                        temperature=0  # ranges from 0 to 2; higher values give more "creative" output
                        )

prompt = chat_prompt.invoke({"query": query,
                             "poetic_form": "五言絕句",
                             "context": context})

output = model_poem.invoke(prompt)

parsed_output = output_parser.parse(output.content)

print(parsed_output)

# Wikipedia Retriever

In [None]:
# !pip install --upgrade --quiet  wikipedia

In [None]:
from langchain_community.retrievers import WikipediaRetriever

wiki_retriever = WikipediaRetriever()

docs = wiki_retriever.invoke("2024 US presidential election")

In [None]:
len(docs)

In [None]:
print(docs[0].page_content)

In [None]:
# If fewer documents are available than the requested number, all available documents are returned

docs = wiki_retriever.invoke("rice")
len(docs)

- If you want to know which parameters can be passed to the WikipediaRetriever:

In [None]:
WikipediaRetriever?

By default, WikipediaRetriever returns 3 documents.

# Ensemble Retriever

- The EnsembleRetriever runs several retrievers and combines their results into a single ranked list.
- It merges the result lists and re-ranks them with Reciprocal Rank Fusion (RRF).
- Combining retrievers with complementary strengths usually works better than relying on any single one.
- The most common pattern pairs a sparse retriever that matches exact words (like BM25) with a dense retriever that matches meaning (like embedding similarity).
- This combination is called "hybrid search."
- The sparse retriever finds documents that contain specific keywords, while the dense retriever finds documents that express similar ideas.


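- weights: controls how much each retriever's ranking contributes to the fused result
- The total number of documents returned equals the sum of the documents retrieved by the individual retrievers (duplicates, if any, are merged)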
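
The fusion step can be sketched as weighted Reciprocal Rank Fusion. The snippet below is a simplified stand-in, assuming each retriever returns an ordered list of document ids and using the conventional constant `c=60`. The function name `weighted_rrf` and the sample ids are invented for illustration; this is not LangChain's actual source code.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents contribute more; `weight` scales each retriever
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from a keyword retriever
semantic_hits = ["doc_c", "doc_d"]           # e.g. from an embedding retriever

fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.5, 0.5])
print(fused)
```

`doc_c` appears in both lists, so it collects score from both retrievers and rises to the top. The fused list has four entries rather than five because the duplicate is merged.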

In [None]:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, wiki_retriever], weights=[0.5, 0.5]
)

In [None]:
output = ensemble_retriever.invoke("rice")

In [None]:
len(output)

- bm25_retriever returns two documents
- wiki_retriever returns two documents

# Runtime Configuration

- We can also configure a retriever at runtime. To do that, we need to mark the relevant fields as configurable.

API Reference: https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.utils.ConfigurableField.html

In [None]:
from langchain_core.runnables import ConfigurableField

In [None]:
bm25_retriever = BM25Retriever.from_documents(documents, k=2, 
    bm25_params={"k1": 1}).configurable_fields(
    k=ConfigurableField(
        id="bm25_k",
    )
)

In [None]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, wiki_retriever], weights=[0.5, 0.5]
)

In [None]:
config = {"configurable": {"bm25_k": 5}}
docs = ensemble_retriever.invoke("rice", config=config)

In [None]:
len(docs)

In [None]:
# - bm25_retriever returns five documents
# - wiki_retriever returns two documents

In [None]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, wiki_retriever], weights=[0.1, 0.9]
)

config = {"configurable": {"bm25_k": 10}}
docs = ensemble_retriever.invoke("rice", config=config)

len(docs)

In [None]:
# - bm25_retriever returns ten documents
# - wiki_retriever returns two documents

### This is what I do in my work:

I use runtime configuration to restrict retrieval to the slice of data whose applied attribute matches the target area.

More specifically, there are many types of cosmetic products, such as:

- Lipstick
- Lip Gloss
- Mascara
- Blush
- Foundation
- Nail Polish
- Eyeliner
- Eye Pencil

These products are applied to different areas: face, nails, eyes, and lips.

You can retrieve information more efficiently and accurately if you identify the correct application area beforehand.
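
The idea can be illustrated without a vector store. The sketch below assumes each record carries a hypothetical `applied` field and uses naive keyword overlap in place of real semantic search. The product records and the `search` helper are made up for this example.

```python
# Hypothetical product records; `applied` is the application area
products = [
    {"name": "Lipstick",    "applied": "lips",  "text": "long lasting matte colour"},
    {"name": "Lip Gloss",   "applied": "lips",  "text": "glossy shine colour"},
    {"name": "Mascara",     "applied": "eyes",  "text": "long lasting volume for lashes"},
    {"name": "Nail Polish", "applied": "nails", "text": "glossy long lasting colour"},
]

def search(query, records, area, k=2):
    """Filter by application area first, then rank by shared words."""
    candidates = [r for r in records if r["applied"] == area]
    query_words = set(query.lower().split())
    ranked = sorted(
        candidates,
        key=lambda r: len(query_words & set(r["text"].split())),
        reverse=True,
    )
    return [r["name"] for r in ranked[:k]]

result = search("glossy long lasting colour", products, area="lips")
print(result)
```

Without the `area` filter, `Nail Polish` would match the query best even though it belongs to the wrong product category. Filtering first keeps such false positives out of the candidate set and shrinks the search space.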

In [None]:
"""
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(self.documents, embedding=embedding)

retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': self._k},
).configurable_fields(search_kwargs=ConfigurableField(id="faiss_search_kwargs"))

semantic_retriever = retrievers['semantic']
semantic_documents = semantic_retriever.invoke(product, config={"configurable":
                                             {"faiss_search_kwargs":
                                                  {"fetch_k":20,
                                                   "k": 2,
                                                   "filter": {"applied": area}}}})
"""