# MCQA Criteria Template

In [None]:
text = '''你是一個評估模型，將被給予一個選擇題問題、一個 LLM 回應，以及一個選項 {option}。
你的任務不是回答問題，而是判斷 LLM 回應中是否明確選擇 {option} 作為答案之一。

問題: {question}

LLM 回應: {llm_response}

判斷標準：

如果 {option} 在 LLM 回應中被清楚且直接表達為選擇的答案，或 LLM 回應僅含選項(A, B, C, D等)或本身 ({option})，則請回答 'True'。

如果 {option} 未被選擇，或 LLM 回應未表達出明確的答案，請回答 'False'。

如果 LLM 回應為空，或其內容無法確定 {option} 是否為選擇的答案，請回答 'Unknown'。
'''
text.strip()

In [None]:
text = '''你是一个评估模型，将被给予一个选择题问题、一个 LLM 回应，以及一个选项 {option}。  
你的任务不是回答问题，而是判断 LLM 回应中是否明确选择 {option} 作为答案之一。  

问题: {question}  

LLM 回应: {llm_response}  

判断标准：  

如果 {option} 在 LLM 回应中被清楚且直接表达为选择的答案，或 LLM 回应仅包含选项（A、B、C、D 等）或本身（{option}），则请回答 'True'。  

如果 {option} 未被选择，或 LLM 回应未表达出明确的答案，请回答 'False'。  

如果 LLM 回应为空，或其内容无法确定 {option} 是否为选择的答案，请回答 'Unknown'。  

'''
text.strip()

In [None]:
text = '''당신은 평가 모델이며, 하나의 객관식 질문, LLM 응답, 그리고 하나의 선택지 {option}을 받게 됩니다.  
당신의 임무는 질문에 답하는 것이 아니라, LLM 응답에서 {option}이 명확하게 선택된 답변 중 하나인지 판단하는 것입니다.  

질문: {question}  

LLM 응답: {llm_response}  

판단 기준:  

{option}이 LLM 응답에서 명확하고 직접적으로 선택된 답변으로 표현되었거나, LLM 응답이 선택지(A, B, C, D 등) 또는 {option}만 포함하는 경우 'True'를 답하십시오.  

{option}이 선택되지 않았거나, LLM 응답이 명확한 답을 표현하지 않았다면 'False'를 답하십시오.  

LLM 응답이 비어 있거나, {option}이 선택된 답변인지 판단할 수 없다면 'Unknown'을 답하십시오.  

'''
text.strip()

In [None]:
text = '''You are an evaluation model and will be given a multiple-choice question, an LLM response, and an option {option}.  
Your task is not to answer the question but to determine whether the LLM response explicitly selects {option} as one of the answers.  

Question: {question}  

LLM Response: {llm_response}  

Evaluation criteria:  

If {option} is clearly and directly expressed as a selected answer in the LLM response, or if the LLM response contains only an option (A, B, C, D, etc.) or {option} itself, respond with 'True'.  

If {option} is not selected, or the LLM response does not clearly express an answer, respond with 'False'.  

If the LLM response is empty or it is unclear whether {option} is a selected answer, respond with 'Unknown'.  

'''
text.strip()
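The templates above share three placeholders: `{option}`, `{question}` and `{llm_response}`. Since they contain no other braces, `str.format` can fill them. The sketch below assumes the judge is called once per candidate option; the abbreviated `TEMPLATE` and the helper `build_judge_prompts` are hypothetical and not taken from BenchWeaver.

```python
from typing import Dict, List

# Shortened stand-in for the English template, with the same three placeholders.
TEMPLATE = (
    "Question: {question}\n\n"
    "LLM Response: {llm_response}\n\n"
    "Does the response explicitly select {option}? "
    "Answer 'True', 'False' or 'Unknown'."
)


def build_judge_prompts(question: str, llm_response: str, options: List[str]) -> Dict[str, str]:
    """Return one filled judge prompt per option (hypothetical helper)."""
    return {
        opt: TEMPLATE.format(question=question, llm_response=llm_response, option=opt)
        for opt in options
    }


prompts = build_judge_prompts("2 + 2 = ?\nA. 3\nB. 4", "The answer is B.", ["A", "B"])
print(prompts["B"])
```

Each prompt would then be sent to the evaluation model separately, and its reply reduced to `true`, `false` or `unknown`.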

# Translation Criteria

In [None]:
import os
from openai import AzureOpenAI
from BenchWeaver.extras.load_env import load_env_variables

load_env_variables()
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_ENDPOINT_URL"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_API_VERSION"),
)

In [None]:
criteria_prompt = '''
你是一位專業的翻譯評估員。請根據提供的「原文文本」、「翻譯文本」、「風格範例」及「評估標準」，對翻譯品質進行評估，評分範圍為 1（最差）至 10（最佳），並以 JSON 格式輸出結果。

評估標準：
1. 資訊保留度：
   - 評估翻譯文本是否完整保留了原文的資訊內容，包括關鍵細節、邏輯關係與語義準確性。
   - 允許因風格匹配的需求進行詞句調整，但不可影響核心資訊的傳遞。
   - 例如，若原文提及具體數據、時間、因果關係或條件，翻譯文本應忠實呈現，而非省略或改動這些重要內容。
   - 若翻譯文本有刪減、曲解或誤譯，則應降低分數。

2. 風格匹配度：
   - 若風格範例為空，則請直接給予 1 分。
   - 評估翻譯文本是否符合給定的「風格範例」，包括語氣、句式、措辭選擇、正式度等。
   - 例如，若風格範例是學術論文，則翻譯文本應使用正式、嚴謹的語言，避免口語化表達；若風格範例是兒童讀物，則應使用簡單易懂、富有親和力的詞彙。
   - 風格匹配度高的翻譯應該讀起來與範例文本的風格一致，而不只是逐字翻譯。

3. 專有名詞準確度：
   - 專有名詞包括人名、地名、機構名稱、術語、技術詞彙等，應與上下文一致，並符合標準翻譯慣例。
   - 例如，「United Nations」應翻譯為「聯合國」，而非「統一國家」；「Neural Network」應譯為「神經網絡」，而非「神經連接」。
   - 若專有名詞有公認的譯法，則應使用標準譯法，若無固定譯法，則應確保譯法在全文內保持一致。

4. 翻譯品質：
   - 綜合評估翻譯文本的整體品質，包括語法、流暢度與可讀性。
   - 翻譯應避免生硬直譯或機翻痕跡，確保句子通順自然、符合目標語言的語法規範。
   - 例如，若翻譯文本讀起來拗口或不符合語法，應降低分數；若譯文自然流暢，則應提高分數。

---
「原文文本」：
{source_text}

「翻譯文本」：
{target_text}

「風格範例」：
{style_example}
---

請以以下 JSON 格式輸出評估結果，確保 `分數` 為 1-10 之間的整數，`原因` 為簡要但具體的說明：
{
    "資訊保留度": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    },
    "風格匹配度": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    },
    "專有名詞準確度": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    },
    "翻譯品質": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    }
}
'''

In [None]:
translation = "以下是關於會計的選擇題（及答案）。\n\n如果在期末將商品存貨金額錯誤地記錄為560,000元而實際為650,000元，對當期的銷售成本和當期淨利的影響正確的是？（假設存貨資產評價採用實地存貨調查法。）\nA. （銷售成本）過高，（當期淨利）過低\nB. （銷售成本）過高，（當期淨利）過高\nC. （銷售成本）過低，（當期淨利）過低\nD. （銷售成本）過低，（當期淨利）過高\n正確答案："
original_text = "다음은 accounting에 대한 객관식 질문(및 정답)입니다.\n\n전기 말에 상품재고액 \\560,000을 \\650,000으로 잘못 계상한 경우, 당기의 매출원가와 당기순이익에 미치는 영향으로 옳은 것은? (단, 재고자산 평가는 실지재고조사법을 적용 한다.)\nA. (매출원가) 과대, (당기순이익) 과소\nB. (매출원가) 과대, (당기순이익) 과대\nC. (매출원가) 과소, (당기순이익) 과소\nD. (매출원가) 과소, (당기순이익) 과대\n정답:"
style_example = """
"""
messages = [
    {
        "role": "system", 
        "content": "you are a helpful assistant."
    },
    {
        "role": "user", 
        "content": criteria_prompt.replace("{source_text}", original_text).replace("{target_text}", translation).replace("{style_example}", style_example).strip()
    }
    ]
print(messages[-1]['content'])
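The cell above fills the prompt with chained `.replace` calls rather than `str.format`. That is likely because the criteria prompt embeds a literal JSON skeleton whose braces `str.format` would try to interpret as fields. A minimal sketch of the difference follows; the short `TEMPLATE` and the helper `fill_template` are hypothetical illustrations.

```python
# A tiny template that mixes a real placeholder with literal JSON braces.
TEMPLATE = 'Source: {source_text}\nReply as JSON: {"score": <1-10>}'


def fill_template(template: str, **fields: str) -> str:
    """Replace each {name} placeholder literally, leaving other braces alone."""
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template


try:
    TEMPLATE.format(source_text="hello")
    format_failed = False
except (KeyError, ValueError, IndexError):
    # str.format treats {"score": ...} as a replacement field and fails.
    format_failed = True

filled = fill_template(TEMPLATE, source_text="hello")
print(format_failed)
print(filled)
```

One caveat of sequential replacement: if an inserted value itself contains a later placeholder such as `{style_example}`, that text will be substituted too.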

In [None]:
import json

from openai import BadRequestError

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    content = response.choices[0].message.content
    try:
        # The prompt asks for JSON, so parse the reply as JSON rather than as a Python literal.
        resp_dict = json.loads(content)
        print(json.dumps(resp_dict, ensure_ascii=False, indent=2))
    except (json.JSONDecodeError, TypeError) as e:
        # TypeError covers the case where the reply has no text content (None).
        print("Error parsing response:", e)
        print("Raw response:", content)
except BadRequestError as e:
    # The error body is JSON; ast.literal_eval fails on JSON-only tokens such as null or true.
    error_dict = json.loads(e.response.content.decode())
    print(json.dumps(error_dict, ensure_ascii=False, indent=2))
    print(error_dict["error"]["message"])
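Model replies often wrap the requested JSON in a Markdown fence or add a sentence before it, which makes direct parsing fail. The following is a minimal sketch of a more tolerant extractor, assuming the reply contains at most one top-level JSON object; `extract_json` is a hypothetical helper, not part of BenchWeaver or the OpenAI client.

```python
import json
import re
from typing import Any, Optional


def extract_json(reply: str) -> Optional[Any]:
    """Parse the span from the first '{' to the last '}' as JSON, or return None."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# Build a fenced reply without writing a literal fence inside this example.
fence = "`" * 3
reply = f'{fence}json\n{{"翻譯品質": {{"分數": 9, "原因": "流暢"}}}}\n{fence}'

parsed = extract_json(reply)
print(parsed)
print(extract_json("no json here"))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or record the item as unparsed.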

# Test re

In [44]:
import re

text = """For example:
John went to the store.
Mary likes ice cream.

Source sentence: This is a test."""

pattern = r"(?:For example:|Examples:|Few-shot Examples:)\s*(.*?)\s*(?:Source sentence:|Proper Noun Examples:)"
match = re.search(pattern, text, re.DOTALL)

if match:
    extracted_text = match.group(1).strip()
    print(extracted_text)


John went to the store.
Mary likes ice cream.


# Refine Translation Template

In [61]:
import json

# load the JSON data    
with open('/work/u5110390/BenchWeaver/prompt/translation_prompt.json', 'r') as f:
    data = json.load(f)

In [60]:
# export to json

with open('/work/u5110390/BenchWeaver/prompt/translation_prompt.json', 'w') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Re-scoring

In [1]:
import json
import re
import os
from typing import Any, Dict, List
import numpy as np
from BenchWeaver.eval.benchmarks.configs import BENCHMARK_CONFIG
from tqdm import tqdm

def parse_bool_score(text: str) -> str:
    '''
    Normally for MCQA checking. The answer is either true, false or unknown.
    '''
    text = text.lower()
    match = re.search(r'\b(true|false|unknown)\b', text)
    return match.group(0) if match else ""
        
def compute_score(benchmark_name: str, 
                  checked_answers: Dict[str, List[Any]], 
                  check_results: Dict[str, List[Any]], 
                  mapping_dict: Dict[str, dict]
    ) -> Dict[str, float]:
    category_corrects = {score: {"corrects": 0, "true_mask_count": 0} for score in BENCHMARK_CONFIG[benchmark_name]['display_scores']}

    for subject in tqdm(mapping_dict.keys(), desc="Compute subjects"):
        category_name = mapping_dict[subject]["category"]
        
        # Ground truth and predictions
        answers = np.array(checked_answers[subject])
        predictions = np.array([parse_bool_score(ans) for ans in check_results[subject]])

        # Mask for when the answer is 'true'
        true_mask = answers == 'true'

        # Compare predictions and answers, only where answer is 'true'
        corrects = (predictions == 'true') & true_mask  # correct when answer is 'true' and prediction is 'true'

        # Append results to category
        category_corrects[category_name]["corrects"] += corrects.sum()
        category_corrects[category_name]["true_mask_count"] += true_mask.sum()
        category_corrects["Average"]['corrects'] += corrects.sum()
        category_corrects["Average"]['true_mask_count'] += true_mask.sum()

    # Compute accuracy per category: correct_true / total_true
    results = {}
    for category_name, record_dict in category_corrects.items():
        total = record_dict['true_mask_count']
        # Guard against categories with no 'true' answers to avoid dividing by zero
        acc = round(100 * float(record_dict['corrects']) / float(total), 4) if total else 0.0
        results[category_name] = acc

    return results
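The metric in `compute_score` counts only items whose ground-truth label is `'true'` and asks how many of them the judge also marked `'true'`. A small worked example of that masking step, using made-up labels:

```python
import numpy as np

# Made-up ground truth and judge outputs for four (question, option) pairs.
answers = np.array(["true", "false", "true", "true"])
predictions = np.array(["true", "true", "false", "true"])

true_mask = answers == "true"                    # items that should be selected
corrects = (predictions == "true") & true_mask   # selected and should be selected

total = int(true_mask.sum())
accuracy = round(100 * int(corrects.sum()) / total, 4) if total else 0.0
print(accuracy)  # 2 of the 3 'true' items were matched
```

The second item (a `'true'` prediction on a `'false'` label) does not affect the score, so the number behaves like recall on the correct options rather than overall accuracy.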





In [2]:
benchmark_name = "cmmlu"
folder = f"/work/u5110390/BenchWeaver/score/trans_template_exp/{benchmark_name}/mix"
mapping_path = f"/work/u5110390/BenchWeaver/evaluation_data/{benchmark_name}/mapping.json"

# load the JSON data    
with open(mapping_path, 'r') as f:
    mapping = json.load(f)
    
with open(os.path.join(folder, "checked_answers.json"), 'r') as f:
    answer_data = json.load(f)
    
with open(os.path.join(folder, "check_results.json"), 'r') as f:
    checked_data = json.load(f)
    # retrieve the answer
    for subj, check_list in checked_data.items():
        bool_list = []
        for check in check_list:
            bool_list.append(parse_bool_score(check))
        checked_data[subj] = bool_list

score_dict = compute_score(benchmark_name, answer_data, checked_data, mapping)

with open(os.path.join(folder, "score.json"), 'w') as f:
    json.dump(score_dict, f, indent=4, ensure_ascii=False)

Compute subjects: 100%|██████████| 67/67 [00:00<00:00, 1445.11it/s]


In [2]:
import json
question_check_result_path = "/work/u5110390/BenchWeaver/score/translation_results/cmmlu/few_shot/question_check_result.json"
answer_check_result_path = "/work/u5110390/BenchWeaver/score/translation_results/cmmlu/few_shot/answer_check_result.json"
with open(question_check_result_path, 'r') as f:
    question_check_result = json.load(f)
    
with open(answer_check_result_path, 'r') as f:
    answer_check_result = json.load(f)

In [1]:
from typing import Dict, List
import numpy as np
def merge_and_calculate_results(
    question_check_result: Dict[str, List[dict]],
    answer_check_result: Dict[str, List[dict]]
    ):
    score_dict = {
        subj: {
            "question":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            },
            "answer":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            }
        } 
        for subj in question_check_result.keys()
    }
    average_score_dict = {
        "資訊保留度": 0,
        "風格匹配度": 0,
        "專有名詞準確度": 0,
        "翻譯品質": 0,
        "Average": 0,
        "question":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            },
            "answer":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            }
    }
    
    for subj in question_check_result.keys():
        question_record_dict = {
            "資訊保留度": [],
            "風格匹配度": [],
            "專有名詞準確度": [],
            "翻譯品質": []
        }
        answer_record_dict = {
            "資訊保留度": [],
            "風格匹配度": [],
            "專有名詞準確度": [],
            "翻譯品質": []
        }
        for question_result_dict, answer_result_dict in zip(question_check_result[subj], answer_check_result[subj]):
            # append the scores to the record dict
            try:
                question_record_dict["資訊保留度"].append(question_result_dict["資訊保留度"]["分數"])
                question_record_dict["風格匹配度"].append(question_result_dict["風格匹配度"]["分數"])
                question_record_dict["專有名詞準確度"].append(question_result_dict["專有名詞準確度"]["分數"])
                question_record_dict["翻譯品質"].append(question_result_dict["翻譯品質"]["分數"])
                answer_record_dict["資訊保留度"].append(answer_result_dict["資訊保留度"]["分數"])
                answer_record_dict["風格匹配度"].append(answer_result_dict["風格匹配度"]["分數"])
                answer_record_dict["專有名詞準確度"].append(answer_result_dict["專有名詞準確度"]["分數"])
                answer_record_dict["翻譯品質"].append(answer_result_dict["翻譯品質"]["分數"])
                # calculate the average score for each subject
                score_dict[subj]['question']['資訊保留度'] = np.mean(question_record_dict["資訊保留度"])
                score_dict[subj]['question']['風格匹配度'] = np.mean(question_record_dict["風格匹配度"])
                score_dict[subj]['question']['專有名詞準確度'] = np.mean(question_record_dict["專有名詞準確度"])
                score_dict[subj]['question']['翻譯品質'] = np.mean(question_record_dict["翻譯品質"])
                score_dict[subj]['question']['Average'] = np.mean([
                    score_dict[subj]['question']['資訊保留度'],
                    score_dict[subj]['question']['風格匹配度'],
                    score_dict[subj]['question']['專有名詞準確度'],
                    score_dict[subj]['question']['翻譯品質']
                ])
                score_dict[subj]['answer']['資訊保留度'] = np.mean(answer_record_dict["資訊保留度"])
                score_dict[subj]['answer']['風格匹配度'] = np.mean(answer_record_dict["風格匹配度"])
                score_dict[subj]['answer']['專有名詞準確度'] = np.mean(answer_record_dict["專有名詞準確度"])
                score_dict[subj]['answer']['翻譯品質'] = np.mean(answer_record_dict["翻譯品質"])
                score_dict[subj]['answer']['Average'] = np.mean([
                    score_dict[subj]['answer']['資訊保留度'],
                    score_dict[subj]['answer']['風格匹配度'],
                    score_dict[subj]['answer']['專有名詞準確度'],
                    score_dict[subj]['answer']['翻譯品質']
                ])
            except Exception as e:
                print(f"Exception: {e} in subject {subj}")
                print(f"Question result: {question_result_dict}")
                print(f"Answer result: {answer_result_dict}")
                continue
            
    # calculate the overall average of each metric across all subjects
    average_score_dict['資訊保留度'] = np.mean(
        [score_dict[subj]['question']['資訊保留度'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['資訊保留度'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["資訊保留度"] = np.mean(
        [score_dict[subj]['question']['資訊保留度'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["資訊保留度"] = np.mean(
        [score_dict[subj]['answer']['資訊保留度'] for subj in score_dict.keys()]
    )
    average_score_dict['風格匹配度'] = np.mean(
        [score_dict[subj]['question']['風格匹配度'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['風格匹配度'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["風格匹配度"] = np.mean(
        [score_dict[subj]['question']['風格匹配度'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["風格匹配度"] = np.mean(
        [score_dict[subj]['answer']['風格匹配度'] for subj in score_dict.keys()]
    )
    average_score_dict['專有名詞準確度'] = np.mean(
        [score_dict[subj]['question']['專有名詞準確度'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['專有名詞準確度'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["專有名詞準確度"] = np.mean(
        [score_dict[subj]['question']['專有名詞準確度'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["專有名詞準確度"] = np.mean(
        [score_dict[subj]['answer']['專有名詞準確度'] for subj in score_dict.keys()]
    )
    average_score_dict['翻譯品質'] = np.mean(
        [score_dict[subj]['question']['翻譯品質'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['翻譯品質'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["翻譯品質"] = np.mean(
        [score_dict[subj]['question']['翻譯品質'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["翻譯品質"] = np.mean(
        [score_dict[subj]['answer']['翻譯品質'] for subj in score_dict.keys()]
    )
    average_score_dict['Average'] = np.mean([
        average_score_dict['資訊保留度'],
        average_score_dict['風格匹配度'],
        average_score_dict['專有名詞準確度'],
        average_score_dict['翻譯品質']
    ])
    average_score_dict['question']["Average"] = np.mean([
        average_score_dict['question']["資訊保留度"],
        average_score_dict['question']["風格匹配度"],
        average_score_dict['question']["專有名詞準確度"],
        average_score_dict['question']["翻譯品質"]
    ])
    average_score_dict['answer']["Average"] = np.mean([
        average_score_dict['answer']["資訊保留度"],
        average_score_dict['answer']["風格匹配度"],
        average_score_dict['answer']["專有名詞準確度"],
        average_score_dict['answer']["翻譯品質"]
    ])
    
    score_dict.update({"Average": average_score_dict})
    return score_dict
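Most of the function above repeats the same mean computation for four metric names. A more compact sketch of the per-subject step is shown below. It assumes each record is a dict of the form `{metric: {"分數": ..., "原因": ...}}`; the names `METRICS` and `subject_means` are hypothetical. Unlike the original, it treats one list at a time, so a malformed question record does not also drop its paired answer record.

```python
from typing import Dict, List

import numpy as np

METRICS = ["資訊保留度", "風格匹配度", "專有名詞準確度", "翻譯品質"]


def subject_means(results: List[dict]) -> Dict[str, float]:
    """Mean score per metric for one subject, skipping malformed records."""
    rows = []
    for item in results:
        try:
            # Build the whole row first so a partial record adds nothing.
            rows.append([float(item[m]["分數"]) for m in METRICS])
        except (KeyError, TypeError, ValueError):
            continue  # e.g. a list instead of a dict, or a missing metric
    if not rows:
        return {**{m: 0.0 for m in METRICS}, "Average": 0.0}
    means = np.mean(np.array(rows), axis=0)
    out = {m: float(v) for m, v in zip(METRICS, means)}
    out["Average"] = float(np.mean(means))
    return out


demo = [
    {m: {"分數": 8, "原因": ""} for m in METRICS},
    [{"unexpected": "list"}],  # malformed, like the sports_science record below
    {m: {"分數": 10, "原因": ""} for m in METRICS},
]
summary = subject_means(demo)
print(summary)
```

Collecting a full row before appending keeps the four metric lists the same length even when a record lacks one key.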

In [4]:
score = merge_and_calculate_results(question_check_result, answer_check_result)
score

Exception: list indices must be integers or slices, not str in subject sports_science
Question result: [{'資訊保留度': {'分數': 9, '原因': '翻譯文本完好地保留了原文的資訊內容，包括關鍵細節、選項和正確答案，僅在個別措辭上有些許不同，但不影響核心資訊傳遞。'}, '風格匹配度': {'分數': 10, '原因': '符合風格範例的語氣和形式，按照考題的格式和回應方式呈現答案，沒有違背既定風格。'}, '專有名詞準確度': {'分數': 10, '原因': '所有專有名詞（如肌肉名稱）均翻譯準確且符合標準翻譯慣例。'}, '翻譯品質': {'分數': 9, '原因': '翻譯流暢自然，句子通順，基本沒有生硬直譯或機翻痕跡，符合目標語言的語法規範。'}}]
Answer result: {'資訊保留度': {'分數': 10, '原因': '原文僅有一個單一的選擇項目字母，翻譯文本正確保留了該信息，並且沒有遺漏任何資訊。'}, '風格匹配度': {'分數': 10, '原因': '原文只是回答一個選擇題的答案字母，沒有額外的解釋，也沒有特別的答案風格要求，符合給定的資訊。'}, '專有名詞準確度': {'分數': 10, '原因': '翻譯文本中沒有涉及專有名詞，直接保留了原文的單一字母選項，完全準確。'}, '翻譯品質': {'分數': 10, '原因': '翻譯文本與原文完全一致，沒有語法或流暢度問題。'}}
Exception: '專有名詞準確度' in subject virology
Question result: {'資訊保留度': {'分數': 6, '原因': '大部分內容保持了原文的信息，但部分答案伴隨的額外資訊（如疫苗的抗原性及可能誘導的腫瘤）在原文中是不存在的，造成情報傳遞不一致。'}, '風格匹配度': {'分數': 7, '原因': '與風格範例保持了一致，使用了正式且嚴謹的語言，但出現了一些額外的解釋內容，影響了風格的純粹性。'}, '專有名詞準確度': {'分數': 9, '原因': '專有名詞基本準確，但出現少量不必要的情況說明（如疫苗的抗原性及腫瘤），改變了原有的直白回答。'}, '翻譯品質': {'分數': 6, 

{'agronomy': {'question': {'資訊保留度': 9.34319526627219,
   '風格匹配度': 9.183431952662723,
   '專有名詞準確度': 9.159763313609467,
   '翻譯品質': 9.136094674556213,
   'Average': 9.205621301775148},
  'answer': {'資訊保留度': 8.804733727810651,
   '風格匹配度': 8.396449704142011,
   '專有名詞準確度': 8.816568047337277,
   '翻譯品質': 8.65680473372781,
   'Average': 8.668639053254438}},
 'anatomy': {'question': {'資訊保留度': 4.054054054054054,
   '風格匹配度': 5.75,
   '專有名詞準確度': 7.175675675675675,
   '翻譯品質': 4.777027027027027,
   'Average': 5.4391891891891895},
  'answer': {'資訊保留度': 8.20945945945946,
   '風格匹配度': 7.601351351351352,
   '專有名詞準確度': 8.256756756756756,
   '翻譯品質': 8.0,
   'Average': 8.016891891891891}},
 'ancient_chinese': {'question': {'資訊保留度': 8.847560975609756,
   '風格匹配度': 8.932926829268293,
   '專有名詞準確度': 9.067073170731707,
   '翻譯品質': 8.682926829268293,
   'Average': 8.882621951219512},
  'answer': {'資訊保留度': 8.207317073170731,
   '風格匹配度': 7.7560975609756095,
   '專有名詞準確度': 8.286585365853659,
   '翻譯品質': 8.03048780487805,

In [7]:
import json
# export
with open("/work/u5110390/BenchWeaver/score/translation_results/cmmlu/few_shot/score.json", 'w') as f:
    json.dump(score, f, indent=2, ensure_ascii=False)

In [16]:
import re

def parse_numerical_score(text: str) -> float:
    score = -1.0
    regex_patterns = [
        r'(?:score|分數)\s*[:：]?\s*([\d.]+)',
        r'rating:\s*\[\[([\d.]+)\]\]$'
    ]
    for pattern in regex_patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                score = float(match.group(1))
                break
            except ValueError:
                continue  # Skip to next pattern if conversion fails
    else:
        # for-else: runs only when the loop ends without break (no pattern yielded a valid score)
        t = text.replace('\n', '\\n')
        print(f'Parse score error: {t}')
        
    return score

print(parse_numerical_score("Score:\n0.5"))
print(parse_numerical_score("分數：0.5"))
print(parse_numerical_score("分數： 0.5"))
print(parse_numerical_score("分數：0.5\n"))
print(parse_numerical_score("分數：0.5\n"))
print(parse_numerical_score("Score: [[3.5]]"))

0.5
0.5
0.5
0.5
0.5
Parse score error: Score: [[3.5]]
-1.0
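The last call fails because the first pattern expects a digit right after the colon, while the second pattern only matches the label `rating:`. A minimal sketch of a single pattern that also accepts an optional `[[` before the number is shown below. It assumes scores are plain non-negative decimals; `SCORE_PATTERN` and `parse_score` are hypothetical names.

```python
import re

# Label, optional colon (ASCII or full-width), optional "[[", then a decimal number.
SCORE_PATTERN = re.compile(
    r"(?:score|rating|分數)\s*[:：]?\s*(?:\[\[)?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_score(text: str) -> float:
    """Return the first labelled score in text, or -1.0 when none is found."""
    match = SCORE_PATTERN.search(text)
    return float(match.group(1)) if match else -1.0


print(parse_score("Score: [[3.5]]"))
print(parse_score("分數：0.5\n"))
print(parse_score("no score"))
```

Using `\d+(?:\.\d+)?` instead of `[\d.]+` means the captured group is always a valid float literal, so no `ValueError` handling is needed.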


In [17]:
import numpy as np

np.mean([1.0, 2, 3, 4, 5])

3.0