# Websearch: Pros, Cons, and Usage Strategy

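## Advantages
- **Diverse sources**: Broad coverage that surfaces information from multiple angles.
- **Timeliness**: Quick access to the latest content on public web pages.
- **Flexibility**: Well suited to questions that require cross-checking multiple sources.

## Disadvantages
- **Fragmentation**: Information is scattered and inconsistently formatted, so it is hard to use systematically as-is.
- **Uneven quality**: Source reliability varies, and results may contain errors or outdated information.
- **Restrictions and risks**: Some APIs or search steps may block specific content for policy, safety, or licensing reasons.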

---

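## When Websearch Is a Good Fit
- You need **multiple perspectives** (e.g. news, forums, social media).
- You are doing **open-ended exploration** and do not need high source precision.
- No single reliable database can satisfy the need.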

---

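## When a Specific Source Is the Better Choice
- **Specialized domains**: e.g. Wikipedia or a Fandom wiki (games, novels, Warhammer 40k).
- **Structured data needs**: The source has regular URLs and a consistent content layout, which makes programmatic retrieval easy.
- **High-reliability needs**: A curated source leaves far less noise to filter out.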
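
As a rough illustration of the "regular URLs" point above: both Wikipedia and Fandom wikis commonly serve articles under `/wiki/<Title>`, with spaces written as underscores. The helper below is a minimal sketch under that assumption. `wiki_page_url` is a hypothetical name, not part of any library, and individual sites may deviate from the pattern.

```python
from urllib.parse import quote


def wiki_page_url(base_url: str, title: str) -> str:
    """Build a /wiki/<Title> URL (hypothetical helper).

    Spaces become underscores; other unsafe characters are percent-encoded.
    """
    slug = title.strip().replace(" ", "_")
    return f"{base_url.rstrip('/')}/wiki/{quote(slug)}"


print(wiki_page_url("https://warhammer40k.fandom.com", "Titan"))
# prints https://warhammer40k.fandom.com/wiki/Titan
```

Because the URL can be derived from a page title, a program can go straight to the page instead of searching for it.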

---

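## API Usage Considerations
- **Blocked by safety review**: Publicly available data may be unretrievable because it touches on "disallowed or sensitive content".
- **Licensing and limits**: Paywalls, rate limits, privacy regulations, and so on.
- **Fallback role**: Websearch can serve as a supplementary option, but it is not a cure-all.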
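
One common way to cope with rate limits is to retry with exponential backoff. The sketch below is illustrative only: `fetch_with_backoff` and its parameters are hypothetical, `fetch` stands for any callable that returns `(status_code, body)`, and it assumes the service signals rate limiting with HTTP 429.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry `fetch` while it reports HTTP 429, doubling the wait each time.

    Any other status code is returned to the caller unchanged; this sketch
    only deals with rate limiting.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return body
        # Wait 1x, 2x, 4x ... base_delay before the next attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.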

---

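## Summary
Websearch provides **broad and diverse information**, but it also brings **fragmentation and quality problems**.  
When the need is well defined and a structured, trustworthy source (such as Wikipedia or Fandom) is available, prefer that source.  
Websearch delivers the most value when you need multiple perspectives, open-ended exploration, or have no specific database to rely on.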


## Returning to a Single Source

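- When you are certain that a particular source has the information, get that source's URL.
- Scrape the content at that URL using the techniques from week 1 and week 3.
- Use an LLM to extract the information you need.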

In [1]:
import os

os.chdir("../../../")

In [2]:
from openai import OpenAI
from textwrap import dedent
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI

from src.initialization import credential_init


def build_standard_chat_prompt_template(kwargs):

    messages = []
 
    if 'system' in kwargs:
        content = kwargs.get('system')
        prompt = PromptTemplate(**content)
        message = SystemMessagePromptTemplate(prompt=prompt)
        messages.append(message)  

    if 'human' in kwargs:
        content = kwargs.get('human')
        prompt = PromptTemplate(**content)
        message = HumanMessagePromptTemplate(prompt=prompt)
        messages.append(message)
        
    chat_prompt_template = ChatPromptTemplate.from_messages(messages)
    
    return chat_prompt_template

credential_init()



### Extracting content from Wikipedia

In [None]:
import requests

session = requests.Session()

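# Query: "The Nen ability system in Hunter × Hunter"
query = "獵人中的念能力系統"

# Choose the Wikipedia language edition via the subdomain (here: Japanese)
URL = "https://ja.wikipedia.org/w/api.php"


PARAMS = {
    "action": "parse",
    # Title of the Wikipedia page to parse
    "page": "HUNTER×HUNTER",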
    "prop": "text",
    "format": "json"
}

HEADERS = {
    "User-Agent": "AI Tutorial Bot/1.0 (mengchiehling@gmail.com)"
}

response = session.get(url=URL, params=PARAMS, headers=HEADERS)

data = response.json()
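
The request above assumes the page exists. When it does not, the MediaWiki API typically responds with a top-level `error` object instead of `parse`, so indexing `data['parse']` would fail with a `KeyError`. A small guard, sketched here with the hypothetical helper `extract_parse_html`, makes that failure easier to read. The sample payloads are illustrative, not real responses.

```python
def extract_parse_html(data: dict) -> str:
    """Return the rendered HTML from an action=parse response.

    Raises LookupError when the API reported an error instead of a page.
    """
    if "error" in data:
        err = data["error"]
        raise LookupError(f"{err.get('code', 'unknown')}: {err.get('info', '')}")
    return data["parse"]["text"]["*"]


# Illustrative payloads shaped like the two kinds of response
ok_payload = {"parse": {"title": "Example", "text": {"*": "<p>hello</p>"}}}
missing_payload = {"error": {"code": "missingtitle", "info": "page not found"}}

print(extract_parse_html(ok_payload))  # prints <p>hello</p>
```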

Process the data with bs4 (BeautifulSoup).

In [None]:
from bs4 import BeautifulSoup

html_content = data['parse']['text']["*"]

soup = BeautifulSoup(html_content, "html.parser")

# Remove <style> and <script> tags
for tag in soup(["style", "script"]):
    tag.decompose()

# Extract the text
text_content = soup.get_text(separator="\n")

# Strip whitespace and drop empty lines
cleaned_text = "\n".join(
    line.strip() for line in text_content.splitlines() if line.strip()
)

In [None]:
# cleaned_text

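Writing Pydantic models by hand does get the job done, but in an application we want more automation: the model should be generated automatically from the user's request.

We try combining this with code generation to help reach that goal.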

In [40]:
code_example = dedent("""
    # Example 1: Name extraction using Pydantic and LangChain

    from pydantic import BaseModel, Field
    from langchain_core.output_parsers import PydanticOutputParser

    class NameExtractor(BaseModel):
        name: str = Field(description="The extracted name from the input text")

    output_parser = PydanticOutputParser(pydantic_object=NameExtractor)
    format_instructions = output_parser.get_format_instructions()

    ---
    # Example 2: Multiple product information extraction using Pydantic and Langchain

    from typing import List
    
    class Product(BaseModel):
        name: str = Field(description="Product")
        brand: str = Field(description="The brand name")
        country_code: str = Field(description="ISO 3166-1 alpha-2 of the country of the brand")

    class ProductOutput(BaseModel):
        products: List[Product] = Field(description="A list of products")

    output_parser = PydanticOutputParser(pydantic_object=ProductOutput)
    format_instructions = output_parser.get_format_instructions()
    
""")


system_template = dedent(f"""
    You are an expert Python developer specializing in large language models and the LangChain framework.
    Your objective is to generate **only valid, executable Python code** that solves the user's request.

    Requirements:
    - Use Pydantic models when defining structured outputs.
    - Ensure imports are correct and minimal.
    - Follow PEP 8 formatting standards.
    - Do not include any explanations, markdown, comments, or extra text outside the code block.
    - You must define both `output_parser` and `format_instructions` for the Pydantic model.

    Example structure:
    {code_example}
""")

human_template = dedent("""
                        {query}
                        Code:
                        """)


input_ = {"system": {"template": system_template},
          "human": {"template": human_template,
                    "input_variables": ["query"]}}

chat_prompt_template = build_standard_chat_prompt_template(input_)

model = ChatOpenAI(model="gpt-4o-mini")

code_generation = chat_prompt_template|model|StrOutputParser()

In [None]:
generated_code = code_generation.invoke({"query": query})

In [None]:
print(generated_code)

### Code execution tool

In [16]:
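import re

from langchain_core.runnables import chain

@chain
def code_execution(code):
    """Run the first fenced Python block in `code` and return its namespace."""
    # Extract the body of the first fenced Python block
    match = re.search(r"```python\n(.*?)\n```", code, re.DOTALL)
    if match is None:
        raise ValueError("No fenced Python code block found in the generated output")
    python_code = match.group(1).strip()

    # Caution: exec runs LLM-generated code without sandboxing
    local_vars = {}
    exec(python_code, local_vars)

    return local_vars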

In [None]:
local_vars = code_execution.invoke(generated_code)
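
To make the behaviour of this step concrete without calling a model, the following standalone sketch applies the same idea (pull out the fenced block, `exec` it, collect the resulting names) to a hard-coded string. `run_fenced_python` is a hypothetical helper, and `exec` here is unsandboxed, so this pattern should only be applied to output you trust.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable


def run_fenced_python(text: str) -> dict:
    """Execute the first fenced Python block in `text` and return the names it defines."""
    pattern = FENCE + r"python\n(.*?)\n" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("no fenced Python block found")
    namespace: dict = {}
    exec(match.group(1), namespace)  # unsandboxed: trusted input only
    # exec adds a __builtins__ entry; return a copy without it
    return {k: v for k, v in namespace.items() if k != "__builtins__"}


sample = f"{FENCE}python\nanswer = 6 * 7\ndef double(x):\n    return 2 * x\n{FENCE}"
result = run_fenced_python(sample)
print(sorted(result))  # prints ['answer', 'double']
```

In the notebook, the returned mapping plays the same role as `local_vars`: the later cells look up `output_parser` and `format_instructions` in it by name.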

In [None]:
human_template = dedent("""{query}
                            context:
                            {context}
                           output format instruction = {format_instruction} 
                        """)


input_ = {"system": {"template": "Return the result in Traditional Chinese"},
          "human": {"template": human_template,
                    "input_variables": ["query", "context"],
                    "partial_variables": {'format_instruction': local_vars["format_instructions"]}
                    }}

chat_prompt_template = build_standard_chat_prompt_template(input_)

model = ChatOpenAI(model="gpt-4o-mini")

pipeline = chat_prompt_template|model|local_vars['output_parser']

output = pipeline.invoke({"query": query,
                 "context": cleaned_text})

## Fan Wiki

Avatars of the Omnissiah, the Machine God

https://warhammer40k.fandom.com/wiki/Titan

In [4]:
import requests

from bs4 import BeautifulSoup


url = "https://warhammer40k.fandom.com/wiki/Titan"

In [5]:
def parsing_process(url):
    """
    Fetches and extracts text content from a given URL.

    Parameters:
    url (str): The URL of the web page to fetch and parse.

    Returns:
    str: Cleaned text content extracted from the web page.

    Raises:
    requests.exceptions.RequestException: If an error occurs while fetching the URL.

    Notes:
    - This function sends a GET request to the specified URL.
    - It uses BeautifulSoup to parse the HTML content of the response.
    - Any <style> and <script> tags in the HTML are removed to extract only textual content.
    - The extracted text is cleaned by removing extra whitespace and empty lines.
    """
    # Send a GET request to the URL and fail loudly on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Get the content of the response
    html_content = response.text
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Remove <style> and <script> tags
    for tag in soup(["style", "script"]):
        tag.decompose()

    # Extract only the text content
    text_content = soup.get_text(separator='\n')

    # Strip whitespace and drop empty lines
    cleaned_text = '\n'.join(line.strip() for line in text_content.splitlines() if line.strip())
    
    return cleaned_text

### Extract the page content

In [6]:
cleaned_text = parsing_process(url)

In [7]:
# Query: "Find all Loyalist Titan classes for me"
query = "幫我找出所有忠誠派泰坦級別"

In [14]:
generated_code = code_generation.invoke({"query": query})

In [15]:
print(generated_code)

```python
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser

class LoyalTitanExtractor(BaseModel):
    loyal_titans: list = Field(description="A list of all loyal Titan level")

output_parser = PydanticOutputParser(pydantic_object=LoyalTitanExtractor)
format_instructions = output_parser.get_format_instructions()
```


In [17]:
local_vars = code_execution.invoke(generated_code)

In [31]:
def build_answer_pipeline(output_parser, format_instructions):

    human_template = dedent("""{query}
                           
                           context:
                           {context}
                           output format instruction = {format_instruction} 
                        """)


    input_ = {"system": {"template": "You are a helpful AI assistant."},
              "human": {"template": human_template,
                        "input_variables": ["query", "context"],
                        "partial_variables": {'format_instruction': format_instructions}
                        }}
    
    chat_prompt_template = build_standard_chat_prompt_template(input_)
    
    model = ChatOpenAI(model="gpt-4o-mini")
    
    answer_pipeline = chat_prompt_template|model|output_parser

    return answer_pipeline
    

In [19]:
answer_pipeline = build_answer_pipeline(output_parser=local_vars['output_parser'], format_instructions=local_vars['format_instructions'])

output = answer_pipeline.invoke({"query": query, "context": cleaned_text})

In [21]:
output.loyal_titans

['Imperator-class Titan',
 'Warmonger-class Titan',
 'Warlord-class Titan',
 'Reaver-class Titan',
 'Dire Wolf-class Titan',
 'Warhound-class Titan',
 'Rapier-class Titan',
 'Executor-class Titan',
 'Apocalypse-class Titan',
 'Carnivore-class Titan',
 'Mirage-class Titan',
 'Warbringer Nemesis-class Titan',
 'Komodo-class Titan',
 'Punisher-class Titan',
 'Warmaster-class Titan',
 'Warmaster Iconoclast-class Titan',
 'Emperor Titans']

#### Try a more challenging question

In [41]:
# Query: "Find all Titan classes of every faction for me"
query = "幫我找出所有陣營的所有泰坦級別"

In [42]:
generated_code = code_generation.invoke({"query": query})

In [43]:
print(generated_code)

```python
from typing import List
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser

class TitanLevel(BaseModel):
    faction: str = Field(description="The faction of the titan")
    level: str = Field(description="Titan level")

class TitanLevelsOutput(BaseModel):
    titans: List[TitanLevel] = Field(description="A list of titan levels for all factions")

output_parser = PydanticOutputParser(pydantic_object=TitanLevelsOutput)
format_instructions = output_parser.get_format_instructions()
```


In [44]:
local_vars = code_execution.invoke(generated_code)

In [45]:
answer_pipeline = build_answer_pipeline(output_parser=local_vars['output_parser'], format_instructions=local_vars['format_instructions'])

In [46]:
output = answer_pipeline.invoke({"query": query, "context": cleaned_text})

In [48]:
output.titans

[TitanLevel(faction='Imperium of Man', level='Rapier Scout Titan'),
 TitanLevel(faction='Imperium of Man', level='Warhound Scout Titan'),
 TitanLevel(faction='Imperium of Man', level='Dire Wolf Heavy Scout Titan'),
 TitanLevel(faction='Imperium of Man', level='Reaver Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Komodo Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Punisher-class Titan'),
 TitanLevel(faction='Imperium of Man', level='Executor Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Apocalypse Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Carnivore Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Mirage Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Warbringer Nemesis Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Warlord Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Warmaster Heavy Battle Titan'),
 TitanLevel(faction='Imperium of Man', level='Emperor Titan'),
