# MCQA Criteria Template

In [None]:
text = '''你是一個評估模型，將被給予一個選擇題問題、一個 LLM 回應，以及一個選項 {option}。
你的任務不是回答問題，而是判斷 LLM 回應中是否明確選擇 {option} 作為答案之一。

問題: {question}

LLM 回應: {llm_response}

判斷標準：

如果 {option} 在 LLM 回應中被清楚且直接表達為選擇的答案，或 LLM 回應僅含選項(A, B, C, D等)或本身 ({option})，則請回答 'True'。

如果 {option} 未被選擇，或 LLM 回應未表達出明確的答案，請回答 'False'。

如果 LLM 回應為空，或其內容無法確定 {option} 是否為選擇的答案，請回答 'Unknown'。
'''
text.strip()

In [None]:
text = '''你是一个评估模型，将被给予一个选择题问题、一个 LLM 回应，以及一个选项 {option}。  
你的任务不是回答问题，而是判断 LLM 回应中是否明确选择 {option} 作为答案之一。  

问题: {question}  

LLM 回应: {llm_response}  

判断标准：  

如果 {option} 在 LLM 回应中被清楚且直接表达为选择的答案，或 LLM 回应仅包含选项（A、B、C、D 等）或本身（{option}），则请回答 'True'。  

如果 {option} 未被选择，或 LLM 回应未表达出明确的答案，请回答 'False'。  

如果 LLM 回应为空，或其内容无法确定 {option} 是否为选择的答案，请回答 'Unknown'。  

'''
text.strip()

In [None]:
text = '''당신은 평가 모델이며, 하나의 객관식 질문, LLM 응답, 그리고 하나의 선택지 {option}을 받게 됩니다.  
당신의 임무는 질문에 답하는 것이 아니라, LLM 응답에서 {option}이 명확하게 선택된 답변 중 하나인지 판단하는 것입니다.  

질문: {question}  

LLM 응답: {llm_response}  

판단 기준:  

{option}이 LLM 응답에서 명확하고 직접적으로 선택된 답변으로 표현되었거나, LLM 응답이 선택지(A, B, C, D 등) 또는 {option}만 포함하는 경우 'True'를 답하십시오.  

{option}이 선택되지 않았거나, LLM 응답이 명확한 답을 표현하지 않았다면 'False'를 답하십시오.  

LLM 응답이 비어 있거나, {option}이 선택된 답변인지 판단할 수 없다면 'Unknown'을 답하십시오.  

'''
text.strip()

In [None]:
text = '''You are an evaluation model and will be given a multiple-choice question, an LLM response, and an option {option}.  
Your task is not to answer the question but to determine whether the LLM response explicitly selects {option} as one of the answers.  

Question: {question}  

LLM Response: {llm_response}  

Evaluation criteria:  

If {option} is clearly and directly expressed as a selected answer in the LLM response, or if the LLM response contains only an option (A, B, C, D, etc.) or {option} itself, respond with 'True'.  

If {option} is not selected, or the LLM response does not clearly express an answer, respond with 'False'.  

If the LLM response is empty or it is unclear whether {option} is a selected answer, respond with 'Unknown'.  

'''
text.strip()
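
The four templates above share three placeholders, `{option}`, `{question}` and `{llm_response}`, and contain no other braces, so `str.format` should be enough to fill them. A minimal sketch, assuming a shortened stand-in template and a hypothetical helper named `build_check_prompts` (neither appears in the notebook), that builds one judge prompt per option:

```python
# Shortened stand-in for the English template above (hypothetical, for illustration only).
TEMPLATE = (
    "Question: {question}\n\n"
    "LLM Response: {llm_response}\n\n"
    "If {option} is clearly selected, respond with 'True'."
)

def build_check_prompts(template: str, question: str, llm_response: str, options: list[str]) -> dict[str, str]:
    """Return one filled judge prompt per option letter."""
    return {
        option: template.format(question=question, llm_response=llm_response, option=option)
        for option in options
    }

prompts = build_check_prompts(TEMPLATE, "2 + 2 = ?\nA. 3\nB. 4", "The answer is B.", ["A", "B"])
print(prompts["B"])
```

Each option gets its own prompt because the judge is asked a yes/no question about a single `{option}` at a time.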

# Translation Criteria

In [None]:
import os
from openai import AzureOpenAI
from BenchWeaver.extras.load_env import load_env_variables

load_env_variables()
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_ENDPOINT_URL"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_API_VERSION"),
)

In [None]:
criteria_prompt = '''
你是一位專業的翻譯評估員。請根據提供的「原文文本」、「翻譯文本」、「風格範例」及「評估標準」，對翻譯品質進行評估，評分範圍為 1（最差）至 10（最佳），並以 JSON 格式輸出結果。

評估標準：
1. 資訊保留度：
   - 評估翻譯文本是否完整保留了原文的資訊內容，包括關鍵細節、邏輯關係與語義準確性。
   - 允許因風格匹配的需求進行詞句調整，但不可影響核心資訊的傳遞。
   - 例如，若原文提及具體數據、時間、因果關係或條件，翻譯文本應忠實呈現，而非省略或改動這些重要內容。
   - 若翻譯文本有刪減、曲解或誤譯，則應降低分數。

2. 風格匹配度：
   - 若風格範例為空，則請直接給予 1 分。
   - 評估翻譯文本是否符合給定的「風格範例」，包括語氣、句式、措辭選擇、正式度等。
   - 例如，若風格範例是學術論文，則翻譯文本應使用正式、嚴謹的語言，避免口語化表達；若風格範例是兒童讀物，則應使用簡單易懂、富有親和力的詞彙。
   - 風格匹配度高的翻譯應該讀起來與範例文本的風格一致，而不只是逐字翻譯。

3. 專有名詞準確度：
   - 專有名詞包括人名、地名、機構名稱、術語、技術詞彙等，應與上下文一致，並符合標準翻譯慣例。
   - 例如，「United Nations」應翻譯為「聯合國」，而非「統一國家」；「Neural Network」應譯為「神經網絡」，而非「神經連接」。
   - 若專有名詞有公認的譯法，則應使用標準譯法，若無固定譯法，則應確保譯法在全文內保持一致。

4. 翻譯品質：
   - 綜合評估翻譯文本的整體品質，包括語法、流暢度與可讀性。
   - 翻譯應避免生硬直譯或機翻痕跡，確保句子通順自然、符合目標語言的語法規範。
   - 例如，若翻譯文本讀起來拗口或不符合語法，應降低分數；若譯文自然流暢，則應提高分數。

---
「原文文本」：
{source_text}

「翻譯文本」：
{target_text}

「風格範例」：
{style_example}
---

請以以下 JSON 格式輸出評估結果，確保 `分數` 為 1-10 之間的整數，`原因` 為簡要但具體的說明：
{
    "資訊保留度": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    },
    "風格匹配度": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    },
    "專有名詞準確度": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    },
    "翻譯品質": {
        "分數": <1-10 的分數>,
        "原因": "<簡要說明此評分的理由>"
    }
}
'''

In [None]:
translation = "以下是關於會計的選擇題（及答案）。\n\n如果在期末將商品存貨金額錯誤地記錄為560,000元而實際為650,000元，對當期的銷售成本和當期淨利的影響正確的是？（假設存貨資產評價採用實地存貨調查法。）\nA. （銷售成本）過高，（當期淨利）過低\nB. （銷售成本）過高，（當期淨利）過高\nC. （銷售成本）過低，（當期淨利）過低\nD. （銷售成本）過低，（當期淨利）過高\n正確答案："
original_text = "다음은 accounting에 대한 객관식 질문(및 정답)입니다.\n\n전기 말에 상품재고액 \\560,000을 \\650,000으로 잘못 계상한 경우, 당기의 매출원가와 당기순이익에 미치는 영향으로 옳은 것은? (단, 재고자산 평가는 실지재고조사법을 적용 한다.)\nA. (매출원가) 과대, (당기순이익) 과소\nB. (매출원가) 과대, (당기순이익) 과대\nC. (매출원가) 과소, (당기순이익) 과소\nD. (매출원가) 과소, (당기순이익) 과대\n정답:"
style_example = """
"""
messages = [
    {
        "role": "system", 
        "content": "you are a helpful assistant."
    },
    {
        "role": "user", 
        "content": criteria_prompt.replace("{source_text}", original_text).replace("{target_text}", translation).replace("{style_example}", style_example).strip()
    }
    ]
print(messages[-1]['content'])
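
The cell above fills the prompt with chained `.replace` calls rather than `str.format`. One likely reason is that `criteria_prompt` embeds a literal JSON skeleton, whose braces `str.format` would try to interpret as replacement fields. A small sketch of the difference, using a hypothetical helper `fill_placeholders` and a miniature prompt that is not part of the notebook:

```python
def fill_placeholders(template: str, **values: str) -> str:
    """Replace only the named {placeholders}; every other brace is left untouched."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template

# Hypothetical miniature of a prompt that embeds a literal JSON skeleton.
mini_prompt = 'Source:\n{source_text}\n\nOutput format:\n{\n    "score": <1-10>\n}'

try:
    mini_prompt.format(source_text="hello")
    format_failed = False
except (KeyError, ValueError, IndexError):
    # str.format treats the JSON braces as a replacement field and fails.
    format_failed = True

filled = fill_placeholders(mini_prompt, source_text="hello")
print(format_failed)
print(filled)
```

Because the replacements run one after another, a value that itself contains a later placeholder such as `{target_text}` would be substituted by a subsequent pass; the chained `.replace` calls share this property.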

In [None]:
import ast
import json
import re

from openai import BadRequestError

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    raw_content = response.choices[0].message.content
    try:
        # Drop backslashes and newlines, then parse the reply as a dict literal.
        resp_dict = ast.literal_eval(re.sub(r'\\|\n', '', raw_content))
        print(json.dumps(resp_dict, ensure_ascii=False, indent=2))
    except Exception as e:
        print("Error parsing response:", e)
        print("Raw response:", raw_content)
except BadRequestError as e:
    # The error body is JSON, so json.loads also accepts true/false/null,
    # which ast.literal_eval would reject.
    error_dict = json.loads(e.response.content.decode())
    print(json.dumps(error_dict, ensure_ascii=False, indent=2))
    print(error_dict['error']['message'])
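
Model replies are not guaranteed to be a bare dictionary; they are often wrapped in a Markdown code fence or preceded by a sentence. The following is a sketch of one possible fallback, not the notebook's method: a hypothetical helper `extract_json_object` that parses the span between the first `{` and the last `}` with `json.loads`.

```python
import json

def extract_json_object(reply: str) -> dict:
    """Parse the outermost {...} span of a model reply as JSON."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])

# Hypothetical reply wrapped in a Markdown code fence.
fence = "`" * 3
reply = f'Here is the evaluation:\n{fence}json\n{{"翻譯品質": {{"分數": 8, "原因": "流暢"}}}}\n{fence}'
result = extract_json_object(reply)
print(result["翻譯品質"]["分數"])
```

`json.loads` still raises `json.JSONDecodeError` when the extracted span is not valid JSON, so callers would keep a `try`/`except` around it.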

# Test re

In [44]:
import re

text = """For example:
John went to the store.
Mary likes ice cream.

Source sentence: This is a test."""

pattern = r"(?:For example:|Examples:|Few-shot Examples:)\s*(.*?)\s*(?:Source sentence:|Proper Noun Examples:)"
match = re.search(pattern, text, re.DOTALL)

if match:
    extracted_text = match.group(1).strip()
    print(extracted_text)


John went to the store.
Mary likes ice cream.


# Refine Translation Template

In [61]:
import json

# load the JSON data    
with open('/work/u5110390/BenchWeaver/prompt/translation_prompt.json', 'r') as f:
    data = json.load(f)

In [60]:
# export to json

with open('/work/u5110390/BenchWeaver/prompt/translation_prompt.json', 'w') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Re-scoring

In [1]:
import json
import re
import os
from typing import Any, Dict, List
import numpy as np
from BenchWeaver.eval.benchmarks.configs import BENCHMARK_CONFIG
from tqdm import tqdm

def parse_bool_score(text: str) -> str:
    '''
    Normally for MCQA checking. The answer is either true, false or unknown.
    '''
    text = text.lower()
    match = re.search(r'\b(true|false|unknown)\b', text)
    return match.group(0) if match else ""
        
def compute_score(benchmark_name: str, 
                  checked_answers: Dict[str, List[Any]], 
                  check_results: Dict[str, List[Any]], 
                  mapping_dict: Dict[str, dict]
    ) -> Dict[str, float]:
    category_corrects = {score: {"corrects": 0, "true_mask_count": 0} for score in BENCHMARK_CONFIG[benchmark_name]['display_scores']}

    for subject in tqdm(mapping_dict.keys(), desc="Compute subjects"):
        category_name = mapping_dict[subject]["category"]
        
        # Ground truth and predictions
        answers = np.array(checked_answers[subject])
        predictions = np.array([parse_bool_score(ans) for ans in check_results[subject]])

        # Mask for when the answer is 'true'
        true_mask = answers == 'true'

        # Compare predictions and answers, only where answer is 'true'
        corrects = (predictions == 'true') & true_mask  # correct when answer is 'true' and prediction is 'true'

        # Append results to category
        category_corrects[category_name]["corrects"] += corrects.sum()
        category_corrects[category_name]["true_mask_count"] += true_mask.sum()
        category_corrects["Average"]['corrects'] += corrects.sum()
        category_corrects["Average"]['true_mask_count'] += true_mask.sum()

    # Compute accuracy per category: correct_true / total_true
    results = {}
    for category_name, record_dict in category_corrects.items():
        acc = round(100 * (record_dict['corrects'] / record_dict['true_mask_count']), 4)
        results[category_name] = acc

    return results
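
`compute_score` counts only the items whose ground truth is `'true'`: an item is correct when the judge also returned `'true'`, and the denominator is the number of ground-truth `'true'` items. A judge `'true'` on a wrong option therefore does not lower the score. The toy example below restates that logic in plain Python; the helper name `true_option_accuracy` and the NaN fallback for an empty denominator are assumptions, not part of the notebook.

```python
def true_option_accuracy(answers: list[str], predictions: list[str]) -> float:
    """Percentage of ground-truth 'true' items that the judge also marked 'true'."""
    true_total = sum(1 for a in answers if a == "true")
    if true_total == 0:
        return float("nan")  # assumption: report NaN rather than divide by zero
    corrects = sum(1 for a, p in zip(answers, predictions) if a == "true" and p == "true")
    return round(100 * corrects / true_total, 4)

# Two ground-truth 'true' items; the judge confirms only the first one.
answers = ["true", "false", "false", "false", "true", "false"]
predictions = ["true", "false", "true", "false", "unknown", "false"]
print(true_option_accuracy(answers, predictions))
```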





In [2]:
benchmark_name = "cmmlu"
folder = f"/work/u5110390/BenchWeaver/score/trans_template_exp/{benchmark_name}/mix"
mapping_path = os.path.join(f"/work/u5110390/BenchWeaver/evaluation_data/{benchmark_name}/mapping.json")

# load the JSON data    
with open(mapping_path, 'r') as f:
    mapping = json.load(f)
    
with open(os.path.join(folder, "checked_answers.json"), 'r') as f:
    answer_data = json.load(f)
    
with open(os.path.join(folder, "check_results.json"), 'r') as f:
    checked_data = json.load(f)
    # retrieve the answer
    for subj, check_list in checked_data.items():
        bool_list = []
        for check in check_list:
            bool_list.append(parse_bool_score(check))
        checked_data[subj] = bool_list

score_dict = compute_score(benchmark_name, answer_data, checked_data, mapping)

with open(os.path.join(folder, "score.json"), 'w') as f:
    json.dump(score_dict, f, indent=4, ensure_ascii=False)

Compute subjects: 100%|██████████| 67/67 [00:00<00:00, 1445.11it/s]


In [2]:
import json
question_check_result_path = "/work/u5110390/BenchWeaver/score/translation_results/kmmlu/mix/question_check_result.json"
answer_check_result_path = "/work/u5110390/BenchWeaver/score/translation_results/kmmlu/mix/answer_check_result.json"
with open(question_check_result_path, 'r') as f:
    question_check_result = json.load(f)
    
with open(answer_check_result_path, 'r') as f:
    answer_check_result = json.load(f)

In [6]:
from typing import Dict, List
import numpy as np
def merge_and_calculate_results(
    question_check_result: Dict[str, List[dict]],
    answer_check_result: Dict[str, List[dict]]
    ):
    score_dict = {
        subj: {
            "question":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            },
            "answer":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            }
        } 
        for subj in question_check_result.keys()
    }
    average_score_dict = {
        "資訊保留度": 0,
        "風格匹配度": 0,
        "專有名詞準確度": 0,
        "翻譯品質": 0,
        "Average": 0,
        "question":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            },
            "answer":{
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            }
    }
    
    for subj in question_check_result.keys():
        question_record_dict = {
            "資訊保留度": [],
            "風格匹配度": [],
            "專有名詞準確度": [],
            "翻譯品質": []
        }
        answer_record_dict = {
            "資訊保留度": [],
            "風格匹配度": [],
            "專有名詞準確度": [],
            "翻譯品質": []
        }
        for question_result_dict, answer_result_dict in zip(question_check_result[subj], answer_check_result[subj]):
            # append the scores to the record dict
            try:
                question_record_dict["資訊保留度"].append(question_result_dict["資訊保留度"]["分數"])
                question_record_dict["風格匹配度"].append(question_result_dict["風格匹配度"]["分數"])
                question_record_dict["專有名詞準確度"].append(question_result_dict["專有名詞準確度"]["分數"])
                question_record_dict["翻譯品質"].append(question_result_dict["翻譯品質"]["分數"])
                answer_record_dict["資訊保留度"].append(answer_result_dict["資訊保留度"]["分數"])
                answer_record_dict["風格匹配度"].append(answer_result_dict["風格匹配度"]["分數"])
                answer_record_dict["專有名詞準確度"].append(answer_result_dict["專有名詞準確度"]["分數"])
                answer_record_dict["翻譯品質"].append(answer_result_dict["翻譯品質"]["分數"])
                # calculate the average score for each subject
                score_dict[subj]['question']['資訊保留度'] = np.mean(question_record_dict["資訊保留度"])
                score_dict[subj]['question']['風格匹配度'] = np.mean(question_record_dict["風格匹配度"])
                score_dict[subj]['question']['專有名詞準確度'] = np.mean(question_record_dict["專有名詞準確度"])
                score_dict[subj]['question']['翻譯品質'] = np.mean(question_record_dict["翻譯品質"])
                score_dict[subj]['question']['Average'] = np.mean([
                    score_dict[subj]['question']['資訊保留度'],
                    score_dict[subj]['question']['風格匹配度'],
                    score_dict[subj]['question']['專有名詞準確度'],
                    score_dict[subj]['question']['翻譯品質']
                ])
                score_dict[subj]['answer']['資訊保留度'] = np.mean(answer_record_dict["資訊保留度"])
                score_dict[subj]['answer']['風格匹配度'] = np.mean(answer_record_dict["風格匹配度"])
                score_dict[subj]['answer']['專有名詞準確度'] = np.mean(answer_record_dict["專有名詞準確度"])
                score_dict[subj]['answer']['翻譯品質'] = np.mean(answer_record_dict["翻譯品質"])
                score_dict[subj]['answer']['Average'] = np.mean([
                    score_dict[subj]['answer']['資訊保留度'],
                    score_dict[subj]['answer']['風格匹配度'],
                    score_dict[subj]['answer']['專有名詞準確度'],
                    score_dict[subj]['answer']['翻譯品質']
                ])
            except Exception as e:
                print(f"Exception: {e} in subject {subj}")
                print(f"Question result: {question_result_dict}")
                print(f"Answer result: {answer_result_dict}")
                continue
            
    # calculate the average score across all subjects
    average_score_dict['資訊保留度'] = np.mean(
        [score_dict[subj]['question']['資訊保留度'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['資訊保留度'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["資訊保留度"] = np.mean(
        [score_dict[subj]['question']['資訊保留度'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["資訊保留度"] = np.mean(
        [score_dict[subj]['answer']['資訊保留度'] for subj in score_dict.keys()]
    )
    average_score_dict['風格匹配度'] = np.mean(
        [score_dict[subj]['question']['風格匹配度'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['風格匹配度'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["風格匹配度"] = np.mean(
        [score_dict[subj]['question']['風格匹配度'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["風格匹配度"] = np.mean(
        [score_dict[subj]['answer']['風格匹配度'] for subj in score_dict.keys()]
    )
    average_score_dict['專有名詞準確度'] = np.mean(
        [score_dict[subj]['question']['專有名詞準確度'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['專有名詞準確度'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["專有名詞準確度"] = np.mean(
        [score_dict[subj]['question']['專有名詞準確度'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["專有名詞準確度"] = np.mean(
        [score_dict[subj]['answer']['專有名詞準確度'] for subj in score_dict.keys()]
    )
    average_score_dict['翻譯品質'] = np.mean(
        [score_dict[subj]['question']['翻譯品質'] for subj in score_dict.keys()] + 
        [score_dict[subj]['answer']['翻譯品質'] for subj in score_dict.keys()]
    )
    average_score_dict['question']["翻譯品質"] = np.mean(
        [score_dict[subj]['question']['翻譯品質'] for subj in score_dict.keys()]
    )
    average_score_dict['answer']["翻譯品質"] = np.mean(
        [score_dict[subj]['answer']['翻譯品質'] for subj in score_dict.keys()]
    )
    average_score_dict['Average'] = np.mean([
        average_score_dict['資訊保留度'],
        average_score_dict['風格匹配度'],
        average_score_dict['專有名詞準確度'],
        average_score_dict['翻譯品質']
    ])
    average_score_dict['question']["Average"] = np.mean([
        average_score_dict['question']["資訊保留度"],
        average_score_dict['question']["風格匹配度"],
        average_score_dict['question']["專有名詞準確度"],
        average_score_dict['question']["翻譯品質"]
    ])
    average_score_dict['answer']["Average"] = np.mean([
        average_score_dict['answer']["資訊保留度"],
        average_score_dict['answer']["風格匹配度"],
        average_score_dict['answer']["專有名詞準確度"],
        average_score_dict['answer']["翻譯品質"]
    ])
    
    score_dict.update({"Average": average_score_dict})
    return score_dict

In [3]:
from typing import Dict, List
import numpy as np

def merge_and_calculate_results(
    question_check_result: Dict[str, List[dict]],
    answer_check_result: Dict[str, List[dict]]  # kept for compatibility, but will be unused
):
    score_dict = {
        subj: {
            "question": {
                "資訊保留度": 0,
                "風格匹配度": 0,
                "專有名詞準確度": 0,
                "翻譯品質": 0,
                "Average": 0
            }
        }
        for subj in question_check_result.keys()
    }

    average_score_dict = {
        "資訊保留度": 0,
        "風格匹配度": 0,
        "專有名詞準確度": 0,
        "翻譯品質": 0,
        "Average": 0,
        "question": {
            "資訊保留度": 0,
            "風格匹配度": 0,
            "專有名詞準確度": 0,
            "翻譯品質": 0,
            "Average": 0
        }
    }

    for subj in question_check_result.keys():
        question_record_dict = {
            "資訊保留度": [],
            "風格匹配度": [],
            "專有名詞準確度": [],
            "翻譯品質": []
        }

        for question_result_dict in question_check_result[subj]:
            try:
                question_record_dict["資訊保留度"].append(question_result_dict["資訊保留度"]["分數"])
                question_record_dict["風格匹配度"].append(question_result_dict["風格匹配度"]["分數"])
                question_record_dict["專有名詞準確度"].append(question_result_dict["專有名詞準確度"]["分數"])
                question_record_dict["翻譯品質"].append(question_result_dict["翻譯品質"]["分數"])

                score_dict[subj]['question']['資訊保留度'] = np.mean(question_record_dict["資訊保留度"])
                score_dict[subj]['question']['風格匹配度'] = np.mean(question_record_dict["風格匹配度"])
                score_dict[subj]['question']['專有名詞準確度'] = np.mean(question_record_dict["專有名詞準確度"])
                score_dict[subj]['question']['翻譯品質'] = np.mean(question_record_dict["翻譯品質"])
                score_dict[subj]['question']['Average'] = np.mean([
                    score_dict[subj]['question']['資訊保留度'],
                    score_dict[subj]['question']['風格匹配度'],
                    score_dict[subj]['question']['專有名詞準確度'],
                    score_dict[subj]['question']['翻譯品質']
                ])
            except Exception as e:
                print(f"Exception: {e} in subject {subj}")
                print(f"Question result: {question_result_dict}")
                continue

    for metric in ["資訊保留度", "風格匹配度", "專有名詞準確度", "翻譯品質"]:
        scores = [score_dict[subj]['question'][metric] for subj in score_dict.keys()]
        average_score_dict[metric] = np.mean(scores)
        average_score_dict['question'][metric] = average_score_dict[metric]

    average_score_dict['Average'] = np.mean([
        average_score_dict['資訊保留度'],
        average_score_dict['風格匹配度'],
        average_score_dict['專有名詞準確度'],
        average_score_dict['翻譯品質']
    ])
    average_score_dict['question']["Average"] = average_score_dict['Average']

    score_dict.update({"Average": average_score_dict})
    return score_dict


In [4]:
score = merge_and_calculate_results(question_check_result, answer_check_result)
score

Exception: '專有名詞準確度' in subject civil_engineering
Question result: {'資訊保留度': {'分數': 10, '原因': '所有選擇題的問題和選項均被完整地翻譯並保留了原始資訊，包括關鍵細節、邏輯關係和語義準確性。'}, '風格匹配度': {'分數': 9, '原因': "翻譯文本與風格範例中學術論文的語氣和形式相當一致，使用了正式且嚴謹的語言，但有部分措辭稍顯口語化。例如 'Which of the following...' 可以更正式地表達。"}, '專有名詞准確度': {'分數': 10, '原因': "所有專有名詞（如 'civil engineering', 'hydrographic surveys', 'standard penetration test' 等）均準確且一致地翻譯，符合集合慣例。"}, '翻譯品質': {'分數': 10, '原因': '翻譯文本語法正確、流暢自然，完全避免了生硬直譯或機翻痕跡。句子結構良好且符合目標語言的語法規範。'}}
Exception: '專有名詞準確度' in subject construction
Question result: {'資訊保留度': {'分數': 9, '原因': '翻譯文本完整保留了原文的所有資訊，包括關鍵細節、邏輯關係和語義準確性。唯一的不足之處是翻譯中略去了原文中的部分上下文提示，如『사무소건축에서』。'}, '風格匹配度': {'分數': 8, '原因': '翻譯文本大體上符合風格範例的正式語氣和措辭選擇，但在句式排列上稍顯口語化。例如使用了『What is the correct』這種稍微更口語化的形式。'}, '專有名詞准確度': {'分數': 10, '原因': "所有專有名詞均正確翻譯，並在英文和原文韓語之間保持一致，如 'Radiographic Test (방사선투과검사)' 等。"}, '翻譯品質': {'分數': 9, '原因': "翻譯文本流暢自然，語法規範，沒有生硬的直譯或機翻痕跡。唯一的稍微不足之處是個別句子（如 'In the case of continuously poured concrete'）中層次較多，稍顯複雜。"}}
Exception: '專有名詞準確度' in subjec

{'accounting': {'question': {'資訊保留度': 7.14,
   '風格匹配度': 7.49,
   '專有名詞準確度': 8.25,
   '翻譯品質': 6.92,
   'Average': 7.449999999999999}},
 'agricultural_sciences': {'question': {'資訊保留度': 9.16,
   '風格匹配度': 8.35,
   '專有名詞準確度': 9.175,
   '翻譯品質': 8.62,
   'Average': 8.82625}},
 'aviation_engineering_and_maintenance': {'question': {'資訊保留度': 9.42,
   '風格匹配度': 8.425,
   '專有名詞準確度': 9.455,
   '翻譯品質': 8.845,
   'Average': 9.036249999999999}},
 'biology': {'question': {'資訊保留度': 9.195,
   '風格匹配度': 8.57,
   '專有名詞準確度': 9.22,
   '翻譯品質': 8.76,
   'Average': 8.93625}},
 'chemical_engineering': {'question': {'資訊保留度': 7.965,
   '風格匹配度': 7.785,
   '專有名詞準確度': 8.57,
   '翻譯品質': 7.585,
   'Average': 7.97625}},
 'chemistry': {'question': {'資訊保留度': 9.185,
   '風格匹配度': 8.365,
   '專有名詞準確度': 9.33,
   '翻譯品質': 8.71,
   'Average': 8.8975}},
 'civil_engineering': {'question': {'資訊保留度': 9.215,
   '風格匹配度': 8.6,
   '專有名詞準確度': 9.316582914572864,
   '翻譯品質': 8.71356783919598,
   'Average': 8.96128768844221}},
 'computer_science'
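
The exceptions printed above come from a key mismatch: the judge sometimes returns `專有名詞准確度` (with `准`) where the code looks up `專有名詞準確度` (with `準`), so those items are skipped. A minimal sketch of one way to normalize such keys before scoring, assuming this character is the only variant that occurs; `normalize_keys` is a hypothetical helper:

```python
# Assumption: '准' vs '準' is the only key variant; extend the table if others appear.
CHAR_VARIANTS = str.maketrans({"准": "準"})

def normalize_keys(result: dict) -> dict:
    """Return a copy of a judge result whose top-level keys use the expected characters."""
    return {key.translate(CHAR_VARIANTS): value for key, value in result.items()}

raw = {"專有名詞准確度": {"分數": 10, "原因": "..."}, "翻譯品質": {"分數": 9, "原因": "..."}}
fixed = normalize_keys(raw)
print(fixed["專有名詞準確度"]["分數"])
```

Applying this to each result before the `try` block would let the affected items contribute to the averages instead of being dropped.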

In [7]:
import os
# export
with open("/work/u5110390/BenchWeaver/score/translation_results/cmmlu/few_shot/score.json", 'w') as f:
    json.dump(score, f, indent=2, ensure_ascii=False)

In [16]:
import re

def parse_numerical_score(text: str) -> float:
    score = -1.0
    regex_patterns = [
        r'(?:score|分數)\s*[:：]?\s*([\d.]+)',
        r'rating:\s*\[\[([\d.]+)\]\]$'
    ]
    for pattern in regex_patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                score = float(match.group(1))
                break
            except ValueError:
                continue  # Skip to next pattern if conversion fails
    else:
        # This block runs if no break occurred (i.e., no match)
        t = text.replace('\n', '\\n')
        print(f'Parse score error: {t}')
        
    return score

print(parse_numerical_score("Score:\n0.5"))
print(parse_numerical_score("分數：0.5"))
print(parse_numerical_score("分數： 0.5"))
print(parse_numerical_score("分數：0.5\n"))
print(parse_numerical_score("分數：0.5\n"))
print(parse_numerical_score("Score: [[3.5]]"))

0.5
0.5
0.5
0.5
0.5
Parse score error: Score: [[3.5]]
-1.0
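
The last test fails because neither pattern accepts `Score: [[3.5]]`: the first expects digits directly after the label, and the second requires the label `rating:`. The variant below is a sketch of a single pattern that tolerates optional double brackets; the name `parse_score_tolerant` and the choice of accepted labels are assumptions.

```python
import re

# Assumption: scores appear bare ("Score: 0.5") or in double brackets ("Score: [[3.5]]").
SCORE_PATTERN = re.compile(
    r'(?:score|rating|分數)\s*[:：]?\s*(?:\[\[)?\s*(\d+(?:\.\d+)?)',
    re.IGNORECASE,
)

def parse_score_tolerant(text: str) -> float:
    """Return the first score found in text, or -1.0 when nothing matches."""
    match = SCORE_PATTERN.search(text)
    return float(match.group(1)) if match else -1.0

print(parse_score_tolerant("Score: [[3.5]]"))
print(parse_score_tolerant("分數：0.5\n"))
```

Capturing `\d+(?:\.\d+)?` instead of `[\d.]+` also means the captured text is always a valid float literal, so no `ValueError` handling is needed.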


In [17]:
import numpy as np

np.mean([1.0, 2, 3, 4, 5])

3.0

In [2]:
from datetime import datetime

input_data = {
    "資訊保留度": {
        "分數": 4,
        "原因": "回答中有部分資訊缺失",
        "timestamp": datetime.now(),  # Not JSON serializable
    },
    "風格匹配度": {
        "分數": 3.5,
        "原因": None
    },
    "自定義物件": set([1, 2, 3])  # Also not JSON serializable
}
def ensure_json_serializable(obj):
    """
    Recursively ensure all values in the dict are JSON serializable.
    """
    if isinstance(obj, dict):
        return {str(k): ensure_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [ensure_json_serializable(v) for v in obj]
    elif isinstance(obj, (str, int, float, bool)) or obj is None:
        return obj
    else:
        return str(obj)  # Fallback to string representation

ensure_json_serializable(input_data)

{'資訊保留度': {'分數': 4,
  '原因': '回答中有部分資訊缺失',
  'timestamp': '2025-04-13 20:02:39.104163'},
 '風格匹配度': {'分數': 3.5, '原因': None},
 '自定義物件': '{1, 2, 3}'}
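
When the goal is only to write such a structure to disk, the standard library offers a shorter route: `json.dumps` accepts a `default` callable that is invoked for objects the encoder cannot serialize. A sketch with a hypothetical payload:

```python
import json
from datetime import datetime

payload = {
    "分數": 4,
    "timestamp": datetime(2025, 4, 13, 20, 2, 39),  # not JSON serializable by default
    "tags": {1},  # sets are not JSON serializable either
}

# default=str is called only for objects the encoder cannot handle itself.
encoded = json.dumps(payload, ensure_ascii=False, default=str)
decoded = json.loads(encoded)
print(decoded)
```

The two approaches are not identical: `default` does not apply to dictionary keys, so a key of an unsupported type such as a tuple still raises `TypeError`, whereas `ensure_json_serializable` converts every key with `str`.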

In [None]:
import numpy as np
import os
from typing import Any, Dict, List
from tqdm import tqdm
import json
os.environ["JAVA_HOME"]="/usr/lib/java"
from BenchWeaver.eval.metric.retrieve_score import parse_bool_score, parse_numerical_score

folder = "/work/u5110390/BenchWeaver/score/trans_template_exp/tmmluplus/mix"
with open("/work/u5110390/BenchWeaver/evaluation_data/tmmluplus/mapping.json", "r") as f:
    catagories = json.load(f)
    for key, value in catagories.items():
        value["category"] = key
            
with open(os.path.join(folder, "checked_answers.json"), 'r') as f:
    checked_answers = json.load(f)
with open(os.path.join(folder, "check_results.json"), 'r') as f:
    check_results = json.load(f)
    
def retrieve_answer(text: str, numerical: bool = False) -> str | float:
    return parse_numerical_score(text) if numerical else parse_bool_score(text)

def comput_score(catagories, checked_answers: Dict[str, List[Any]], check_results: Dict[str, List[Any]], subjects: List[str]) -> Dict[str, float]:
    category_corrects = {score: {"corrects": 0, "true_mask_count": 0} for score in subjects}

    for subject in tqdm(catagories.keys(), desc="Compute subjects"):
        category_name = catagories[subject]["category"]
        answers = np.array(checked_answers[subject])
        predictions = np.array([retrieve_answer(ans) for ans in check_results[subject]])
        # Mask for when the answer is 'true'
        true_mask: np.ndarray = answers == 'true'
        # Compare predictions and answers, only where answer is 'true'
        corrects: np.ndarray = (predictions == 'true') & true_mask  # correct when answer is 'true' and prediction is 'true'
        # Update the corrects and true_mask counts
        category_corrects[category_name]["corrects"] += corrects.sum()
        category_corrects[category_name]["true_mask_count"] += true_mask.sum()
        category_corrects["Average"]['corrects'] += corrects.sum()
        category_corrects["Average"]['true_mask_count'] += true_mask.sum()

    return {
        category_name: round(100 * (record_dict['corrects'] / record_dict['true_mask_count']), 4)
        for category_name, record_dict in category_corrects.items()
    }



In [9]:
subjects = ['Average'] + [_ for _ in catagories.keys()]
comput_score(catagories, checked_answers, check_results, subjects)

Compute subjects: 100%|██████████| 66/66 [00:00<00:00, 403.33it/s]


{'Average': 44.6168,
 'dentistry': 42.8571,
 'traditional_chinese_medicine_clinical_medicine': 31.6547,
 'clinical_psychology': 61.6,
 'technical': 49.7512,
 'culinary_skills': 54.1096,
 'mechanical': 53.3898,
 'logic_reasoning': 30.2158,
 'real_estate': 45.6522,
 'general_principles_of_law': 39.6226,
 'finance_banking': 47.4074,
 'anti_money_laundering': 62.6866,
 'ttqav2': 63.7168,
 'marketing_management': 70.9677,
 'business_management': 53.9568,
 'organic_chemistry': 60.5505,
 'advance_chemistry': 40.6504,
 'physics': 54.6392,
 'secondary_physics': 63.3929,
 'human_behavior': 57.6052,
 'national_protection': 45.0237,
 'jce_humanities': 4.4444,
 'politic_science': 53.7688,
 'agriculture': 50.3311,
 'official_document_management': 36.4865,
 'financial_analysis': 58.377,
 'pharmacy': 39.6419,
 'educational_psychology': 62.5,
 'statistics_and_machine_learning': 51.7857,
 'management_accounting': 33.9535,
 'introduction_to_law': 3.7975,
 'computer_science': 60.3448,
 'veterinary_patholo

In [None]:
from BenchWeaver.eval.benchmarks.en.ifeval.source_code.evaluation_main import evaluate_instruction_following
import json
input_data = "/work/u5110390/BenchWeaver/instruction_following_eval/data/input_data.jsonl"
input_response_data = "/work/u5110390/BenchWeaver/instruction_following_eval/data/input_response_data_gpt4_20231107_145030.jsonl"
output_dir = "/work/u5110390/BenchWeaver/scripts"

print(json.dumps(
evaluate_instruction_following(
    input_data=input_data,
    input_response_data=input_response_data,
    output_dir=output_dir,
), ensure_ascii=False, indent=2))

In [1]:
from datasets import load_dataset

dataset = load_dataset(
                path="/work/u5110390/BenchWeaver/evaluation_data/ifeval",
                name="all",
                cache_dir=None,
                trust_remote_code=True,
            )
dataset['test'][0]['kwargs']

{'num_highlights': 3.0,
 'relation': 'at least',
 'num_words': 300.0,
 'num_placeholders': None,
 'prompt_to_repeat': None,
 'num_bullets': None,
 'section_spliter': None,
 'num_sections': None,
 'capital_relation': None,
 'capital_frequency': None,
 'keywords': None,
 'num_paragraphs': None,
 'language': None,
 'let_relation': None,
 'letter': None,
 'let_frequency': None,
 'end_phrase': None,
 'forbidden_words': None,
 'keyword': None,
 'frequency': None,
 'num_sentences': None,
 'postscript_marker': None,
 'first_word': None,
 'nth_paragraph': None}

In [2]:
# Use a name other than `dict` so the built-in type is not shadowed.
for example in dataset['test']:
    if example['kwargs']['keywords'] is not None:
        print(example['kwargs']['keywords'])

    if example['kwargs']['forbidden_words'] is not None:
        print(example['kwargs']['forbidden_words'])

['correlated', 'experiencing']
['rock']
['nourriture']
['mom', 'mother']
['field', 'thanks', 'issue', 'collaborator']
['Argentinian']
['nickname']
['waste', 'material', 'meal']
['sleep', 'cook', 'feed']
['intern', 'grow']
['brilliant', 'le', 'hou']
['vulnerable']
['sarah']
['dupage', 'dade']
['cat']
['coop', 'killings', 'dead', 'night']
['taylor', 'swift', 'together']
['name', 'rename']
['forests', 'riddle']
['netflix']
['atlantis', 'constable']
['land', 'river']
['can', 'ride']
['station']
['python', 'java']
['heute']
['moser', 'glassworks', 'pravcice', 'karlovy', 'vary']
['disgusting', 'delicious', 'bad', 'good']
['lacking', 'model', 'performance', 'quality', 'architecture']
['calculate', 'file', 'conclusion']
['bad', 'underperform']
['yes', 'no']
['enzymes', 'antibodies']
['rich', 'money']
['die']
['rate', 'rte']
['reschedule', 'free']
['talented', 'tianjin']
['ours', 'have']
['startup', 'capsule']
['flea', 'json']
['died', 'drowned']
['sad', 'crazy', 'stress']
['ink', 'memoirs']
['

In [1]:
import re

def _postprocess_generation(text: str) -> str:
    # Try to match [BEGIN] ... [DONE] with flexible spacing and casing
    pattern_begin_done = r'\[\s*begin\s*\](.*?)\[\s*done\s*\]'
    match = re.search(pattern_begin_done, text, re.IGNORECASE | re.DOTALL)

    if match:
        return match.group(1).strip()

    # If no [BEGIN]...[DONE], try to match text between triple backticks
    pattern_code_block = r'```(?:\w*\n)?(.*?)```'
    match = re.search(pattern_code_block, text, re.DOTALL)

    if match:
        return match.group(1).strip()

    return text.strip()

print(_postprocess_generation("以下是 Perl 程式設計的任務：\n\n```perl\nsub sum_product {\n    my ($numbers) = @_;\n    my ($sum, $product) = (0, 1);\n    foreach my $num (@$numbers) {\n        $sum += $num;\n        $product *= $num;\n    }\n    return ($sum, $product);\n}\n\n# 測試案例\nuse Data::Compare;\n\nmy $arg00 = [];\nmy $x0 = sum_product($arg00);\nmy $v0 = [0, 1];\nunless (Compare($x0, $v0)) {\n    die \"例外 -- 測試案例 0 未通過。\";\n}\n\nmy $arg10 = [1, 1, 1];\nmy $x1 = sum_product($arg10);\nmy $v1 = [3, 1];\nunless (Compare($x1, $v1)) {\n    die \"例外 -- 測試案例 1 未通過。\";\n}\n\nmy $arg20 = [100, 0];\nmy $x2 = sum_product($arg20);\nmy $v2 = [100, 0];\nunless (Compare($x2, $v2)) {\n    die \"例外 -- 測試案例 2 未通過。\";\n}\n\nmy $arg30 = [3, 5, 7];\nmy $x3 = sum_product($arg30);\nmy $v3 = [15, 105];\nunless (Compare($x3, $v3)) {\n    die \"例外 -- 測試案例 3 未通過。\";\n}\n\nmy $arg40 = [10];\nmy $x4 = sum_product($arg40);\nmy $v4 = [10, 10];\nunless (Compare($x4, $v4)) {\n    die \"例外 -- 測試案例 4 未通過。\";\n}\n```\n\n這個 Perl 程式定義了一個名為 `sum_product` 的子程式，它接受一個列表作為輸入，計算這個列表中所有整數的總和和乘積。它使用兩個變數 `$sum` 和 `$product` 來分別儲存總和和乘積。程式使用一個 `foreach` 迴圈來遍歷輸入列表中的每個整數，對每個整數執行兩個操作：將其加到 `$sum` 中，將其乘以 `$product`。最後，程式返回一個元組，包含 `$sum` 和 `$product` 的值。"))

sub sum_product {
    my ($numbers) = @_;
    my ($sum, $product) = (0, 1);
    foreach my $num (@$numbers) {
        $sum += $num;
        $product *= $num;
    }
    return ($sum, $product);
}

# 測試案例
use Data::Compare;

my $arg00 = [];
my $x0 = sum_product($arg00);
my $v0 = [0, 1];
unless (Compare($x0, $v0)) {
    die "例外 -- 測試案例 0 未通過。";
}

my $arg10 = [1, 1, 1];
my $x1 = sum_product($arg10);
my $v1 = [3, 1];
unless (Compare($x1, $v1)) {
    die "例外 -- 測試案例 1 未通過。";
}

my $arg20 = [100, 0];
my $x2 = sum_product($arg20);
my $v2 = [100, 0];
unless (Compare($x2, $v2)) {
    die "例外 -- 測試案例 2 未通過。";
}

my $arg30 = [3, 5, 7];
my $x3 = sum_product($arg30);
my $v3 = [15, 105];
unless (Compare($x3, $v3)) {
    die "例外 -- 測試案例 3 未通過。";
}

my $arg40 = [10];
my $x4 = sum_product($arg40);
my $v4 = [10, 10];
unless (Compare($x4, $v4)) {
    die "例外 -- 測試案例 4 未通過。";
}
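
The call above only exercises the code-fence branch. As a minimal sketch, the snippet below runs all three branches on tiny made-up responses. `extract_generation` is an illustrative copy of the same logic, and the sample strings are assumptions, not benchmark data.

```python
import re

FENCE = "`" * 3  # built at runtime so the example holds no literal fence

def extract_generation(text: str) -> str:
    """Illustrative copy of the three-step fallback in _postprocess_generation."""
    marked = re.search(r'\[\s*begin\s*\](.*?)\[\s*done\s*\]', text,
                       re.IGNORECASE | re.DOTALL)
    if marked:
        return marked.group(1).strip()
    fenced = re.search(FENCE + r'(?:\w*\n)?(.*?)' + FENCE, text, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    return text.strip()

# Hypothetical responses, one per branch
samples = {
    "markers": "Sure. [ Begin ]\nprint('hi')\n[DONE] Hope this helps.",
    "fence": "Here you go:\n" + FENCE + "python\nprint('hi')\n" + FENCE,
    "plain": "  print('hi')\n",
}
extracted = {name: extract_generation(raw) for name, raw in samples.items()}
for name, code in extracted.items():
    print(f"{name}: {code!r}")
```

Each of the three samples reduces to the same bare statement, so the surrounding chatter, markers, and language tag are all discarded.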


In [1]:
from BenchWeaver.eval.metric.mxeval import evaluate_functional_correctness
from typing import Any, Dict, List
import json

def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

sample_file = "/work/u5110390/BenchWeaver/mxeval/data/mbxp/examples/mbjp_samples.jsonl"
problem_file = "/work/u5110390/BenchWeaver/mxeval/data/mbxp/mbjp_release_v1.2.jsonl"
output_dir = "/work/u5110390/BenchWeaver/scripts"

evaluate_functional_correctness(
    sample_file=load_jsonl(sample_file),
    problem_file=load_jsonl(problem_file),
    output_dir=output_dir,
)



Skip reading problems -- using problem_file (List[dict]) as problems


Submitting samples: 100%|██████████| 966/966 [00:00<00:00, 1301.89it/s, MBJP/959]
Running test suites: 100%|██████████| 966/966 [00:53<00:00, 18.20it/s, MBJP/617]


Writing results to /work/u5110390/BenchWeaver/scripts/mxeval_results.jsonl


100%|██████████| 966/966 [00:00<00:00, 113540.64it/s]


{'pass@1': 0.8530020703933747}
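
The reported value equals 824/966, which is what pass@1 reduces to when each problem has a single sample (an inference from the numbers above; per-problem outcomes are not shown). A minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), follows. `pass_at_k` and the `outcomes` list are illustrative and are not claimed to be what `evaluate_functional_correctness` runs internally.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical 0/1 outcomes, one sample per problem, consistent with 824/966
outcomes = [1] * 824 + [0] * 142
score = sum(pass_at_k(1, c, 1) for c in outcomes) / len(outcomes)
print(round(score, 4))
```

With n = 1 and k = 1 the estimator is simply 1 for a passing sample and 0 for a failing one, so the benchmark score is the plain pass fraction.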

In [None]:
from BenchWeaver.eval.metric.translate import eval_bleu
predictions = ["hello there general kenobi", "foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar", "foo bar foobar"],
]

eval_bleu(
    predictions=predictions,
    references=references,
)

Using the latest cached version of the module

{'bleu': 0.8187307530779819,
 'precisions': [1.0, 1.0, 1.0, 1.0],
 'brevity_penalty': 0.8187307530779819,
 'length_ratio': 0.8333333333333334,
 'translation_length': 5,
 'reference_length': 6}
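
All four n-gram precisions are 1.0 here, so the score comes entirely from the brevity penalty. Assuming `eval_bleu` follows the standard BLEU definition (the output keys suggest so, but that is an assumption), the penalty is exp(1 - r/c) when the translation length c does not exceed the reference length r. The helper below is illustrative and not part of BenchWeaver.

```python
import math

def brevity_penalty(translation_length: int, reference_length: int) -> float:
    """Standard BLEU brevity penalty: 1.0 for long-enough output, else exp(1 - r/c)."""
    if translation_length == 0:
        return 0.0
    if translation_length > reference_length:
        return 1.0
    return math.exp(1.0 - reference_length / translation_length)

# Lengths and precisions taken from the output above
bp = brevity_penalty(translation_length=5, reference_length=6)
precisions = [1.0, 1.0, 1.0, 1.0]

# BLEU = BP * geometric mean of the n-gram precisions
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(bp, 4), round(bleu, 4))
```

Since exp(1 - 6/5) = exp(-0.2), both values come out at roughly 0.8187, matching `brevity_penalty` and `bleu` in the result.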


In [3]:
import json
path = "/work/u5110390/BenchWeaver/score/additional/zh/humaneval-xl/en/mxeval_results.jsonl"
with open(path, 'r', encoding='utf-8') as f:
    results = [json.loads(line) for line in f]

In [6]:
results[0].keys()

dict_keys(['task_id', 'completion', 'language', 'result', 'passed', 'time_elapsed'])
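
Given those keys, a quick tally of the pass rate and the most common failure messages might look like the sketch below. The `records` list is hypothetical data with the same keys as `results[0]`; its values are invented for illustration.

```python
from collections import Counter

# Hypothetical records shaped like the entries of mxeval_results.jsonl
records = [
    {"task_id": "demo/0", "completion": "sub f { 1 }", "language": "perl",
     "result": "passed", "passed": True, "time_elapsed": 12.5},
    {"task_id": "demo/1", "completion": "sub g {", "language": "perl",
     "result": "syntax error at demo.pl line 3\nExecution aborted", "passed": False,
     "time_elapsed": 8.1},
    {"task_id": "demo/2", "completion": "", "language": "perl",
     "result": "", "passed": False, "time_elapsed": 0.0},
]

# Fraction of samples whose test suite passed
pass_rate = sum(1 for r in records if r["passed"]) / len(records)

# First line of each non-empty failure message, counted
failure_kinds = Counter(
    r["result"].splitlines()[0]
    for r in records
    if not r["passed"] and r["result"]
)

print(f"pass rate: {pass_rate:.3f}")
for message, count in failure_kinds.most_common(5):
    print(count, message)
```

Grouping on the first line of `result` is only a rough heuristic; messages that embed temporary file paths would need normalising before they group well.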

In [8]:
# Print every 80th record; `record` avoids shadowing the built-in `dict`
for idx, record in enumerate(results):
    if idx % 80 == 0:
        print(record['completion'])
        print('-------------------------------------------')
        print(record['result'])
        print("==========================================")

use Data::Compare;

sub below_zero {
    my (@arr) = @_;
    my $count = 0;
    foreach my $num (@arr) {
        if ($num < 0) {
            $count++;
        }
    }
    return $count;
}
-------------------------------------------
Missing right curly or square bracket at /work/u5110390/BenchWeaver/mxeval_cache/perl_exec_eval/HbolMJUevb.pl line 64, at end of line
syntax error at /work/u5110390/BenchWeaver/mxeval_cache/perl_exec_eval/HbolMJUevb.pl line 64, at EOF
Execution of /work/u5110390/BenchWeaver/mxeval_cache/perl_exec_eval/HbolMJUevb.pl aborted due to compilation errors.

--- a/, index 
---b/, using System; using System; using, you' You' you' your, are, are, BelowZero([1, BelowZero(new List<int your code should pass these tests: public static} You' You' you' are} are, BelowZero, are: are; BelowZero
-------------------------------------------

def below_zero(operations)
  balance = 0
  operations.each do |operation|
    balance += operation
    if balance < 0
      return true
   