This notebook is designed to be opened and run seamlessly in Google Colab. Click the badge to get started!

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lightblue-tech/M-IFEval/blob/main/colab_mifeval_run.ipynb)

# Requirements

Mount Google Drive to access files within Colab

In [1]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


Install GitHub CLI (gh) using apt without confirmation prompts

In [2]:
! apt -y install gh

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  gh
0 upgraded, 1 newly installed, 0 to remove and 19 not upgraded.
Need to get 6,242 kB of archives.
After this operation, 33.7 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 gh amd64 2.4.0+dfsg1-2 [6,242 kB]
Fetched 6,242 kB in 1s (8,181 kB/s)
Selecting previously unselected package gh.
(Reading database ... 124926 files and directories currently installed.)
Preparing to unpack .../gh_2.4.0+dfsg1-2_amd64.deb ...
Unpacking gh (2.4.0+dfsg1-2) ...
Setting up gh (2.4.0+dfsg1-2) ...
Processing triggers for man-db (2.10.2-1) ...


Clone the 'main' branch of the M-IFEval repository from GitHub

In [None]:
! git clone -b main https://github.com/lightblue-tech/M-IFEval.git

Install project-specific dependencies and additional packages for large language models and optimization

In [4]:
! cd M-IFEval && pip install -q -r requirements.txt
! pip install -q vllm==0.7.1 bitsandbytes==0.45.1 hf-transfer==0.1.9


Import NLTK and download the 'punkt' tokenizer for text processing

In [5]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Download the spaCy pipelines for Spanish (`es_core_news_sm`) and multilingual sentence segmentation (`xx_sent_ud_sm`)

In [6]:
!python -m spacy download es_core_news_sm --quiet
!python -m spacy download xx_sent_ud_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runt

# Response generation

## Models supported by HuggingFace Transformers

Log in to Hugging Face.

> If this is your first time using HF tokens, see the Hugging Face [documentation](https://huggingface.co/docs/hub/en/security-tokens) to set up your own.

In [9]:
from google.colab import userdata

# Replace 'your_huggingface_token' with your actual token
your_huggingface_token = "your_huggingface_token"
! huggingface-cli login --token {your_huggingface_token}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `read_perso_public_gated_repo` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `read_perso_public_gated_repo`


In [None]:
# # Alternative if you have your huggingface token saved in Colab's Secrets
# from google.colab import userdata

# ! huggingface-cli login --token {userdata.get('HF_TOKEN')}

**Generate responses for supported models**

> To add support for a new model that is not currently supported:
>
> 1. Open the `get_responses.py` file in the M-IFEval directory.
> 2. In the `SUPPORTED_MODELS` dictionary, add the new model with the following format:
>
>   `'name_or_path_of_the_huggingface_model': 'vllm'`
>   
>   - Replace `name_or_path_of_the_huggingface_model` with the actual HuggingFace model name or path.
>   - `'vllm'` should remain as the value to indicate that the model uses the vLLM inference method.
>
> This will ensure that the new model is recognized and supported by the system.
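
As an illustration of the steps above, the sketch below shows what such an entry could look like. Only the dictionary name `SUPPORTED_MODELS` and the `'model': 'vllm'` format come from this notebook. The existing entries shown here and the placeholder `my-org/my-new-model` are assumptions, and the real dictionary in `get_responses.py` may differ.

```python
# Hypothetical excerpt of SUPPORTED_MODELS; the real dictionary in
# get_responses.py may list different models.
SUPPORTED_MODELS = {
    "deepseek-ai/deepseek-llm-7b-chat": "vllm",
    "mistralai/Mistral-7B-Instruct-v0.3": "vllm",
}

# Register a new Hugging Face model (placeholder name) with the vLLM backend.
SUPPORTED_MODELS["my-org/my-new-model"] = "vllm"

print(SUPPORTED_MODELS["my-org/my-new-model"])
```

The same name can then be added to `local_model_names` in the next cell.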


**Note:** Depending on the model you're running, you may need to upgrade your device to a T4, L4, or A100 GPU.
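
To see which GPU the current runtime has before launching a model, a small check like the one below may help. It is a sketch that assumes `nvidia-smi` is on the path when a GPU is attached; the helper name `describe_gpu` is made up for this example.

```python
import shutil
import subprocess


def describe_gpu():
    """Return 'name, total memory' for attached GPUs, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(describe_gpu() or "No GPU detected; choose a GPU runtime in Colab.")
```

In Colab, the runtime type can be changed under Runtime > Change runtime type.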

In [8]:
# Add the name or path of a HuggingFace Transformers model
# for which you want to generate responses
local_model_names = [
    "deepseek-ai/deepseek-llm-7b-chat",
    #  'CohereForAI/aya-23-8B',
    #  'Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4',
    #  'Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4',
    #  'Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4',
    #  'Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4',
    #  'Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4',
    #  'Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4',
    #  'hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    #  'mistralai/Mistral-7B-Instruct-v0.3'
]

for local_model_name in local_model_names:
    ! cd M-IFEval && HF_HUB_ENABLE_HF_TRANSFER=1 python get_responses.py --model_name {local_model_name}

2025-02-07 01:30:26.968937: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-07 01:30:26.986690: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1738891827.009126    3770 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1738891827.015848    3770 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-07 01:30:27.038459: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

## OpenAI

Install OpenAI SDK

In [None]:
!pip install -q openai

Set the OpenAI API key as an environment variable for authentication

In [None]:
import os

# Set the API key (replace with your actual key)
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

**Generate responses for supported models:**

> To add support for a new model that is not currently supported:
>
> 1. Open the `get_responses.py` file in the M-IFEval directory.
> 2. In the `SUPPORTED_MODELS` dictionary, add the new model with the following format:
>
>   `'openai_model_name_or_version': 'openai'`
>   
>   - Replace `openai_model_name_or_version` with the actual OpenAI model name or version.
>   - `'openai'` should remain as the value to indicate that the model uses the OpenAI inference method.
>
> This will ensure that the new model is recognized and supported by the system.


In [None]:
model_names = [
    "gpt-4o-mini-2024-07-18",
    "gpt-4o-2024-08-06",
    "o1-preview-2024-09-12",
    "o1-mini-2024-09-12",
]

for model_name in model_names:
    ! cd M-IFEval && HF_HUB_ENABLE_HF_TRANSFER=1 python get_responses.py --model_name {model_name}

## Anthropic

Install Anthropic SDK

In [38]:
!pip install -q anthropic


Set the Anthropic API key as an environment variable for authentication

In [None]:
import os

# Set the API key (replace with your actual key)
os.environ["ANTHROPIC_API_KEY"] = "your-api-key-here"

**Generate responses for supported models:**

> To add support for a new model that is not currently supported:
>
> 1. Open the `get_responses.py` file in the M-IFEval directory.
> 2. In the `SUPPORTED_MODELS` dictionary, add the new model with the following format:
>
>   `'anthropic_model_name_or_version': 'anthropic'`
>   
>   - Replace `anthropic_model_name_or_version` with the actual Anthropic model name or version.
>   - `'anthropic'` should remain as the value to indicate that the model uses the Anthropic inference method.
>
> This will ensure that the new model is recognized and supported by the system.


In [None]:
model_names = [
    "claude-3-haiku-20240307",
    "claude-3-5-sonnet-20240620",
    "claude-3-opus-20240229",
]

for model_name in model_names:
    ! cd M-IFEval && HF_HUB_ENABLE_HF_TRANSFER=1 python get_responses.py --model_name {model_name}

# Evaluation

Evaluate selected models on selected languages:

In [9]:
from glob import glob

%cd M-IFEval

# Add or remove model names from this list to specify which models to evaluate.
# Use Hugging Face model identifiers (e.g., 'deepseek-ai/deepseek-llm-7b-chat')
# or provider model names (e.g., 'o1-preview-2024-09-12').
selected_model_names = ["deepseek-ai/deepseek-llm-7b-chat", "o1-preview-2024-09-12"]
selected_model_names = [
    model_name.replace("/", "__") for model_name in selected_model_names
]

# Only keep the language tags of the languages you want to evaluate on.
selected_languages = ["en", "es", "fr", "ja"]
input_paths = [f"./data/{lang}_input_data.jsonl" for lang in selected_languages]

for input_path in input_paths:
    response_paths = [
        input_path[:-10] + f"response_data_{model_name}.jsonl"
        for model_name in selected_model_names
    ]

    for response_path in response_paths:
        # Strip only the final extension so dotted model names (e.g. "Qwen2.5") stay intact
        run_name = response_path.split("/")[-1].rsplit(".", 1)[0]

        ! mkdir -p ./evaluations/{run_name}
        ! python -m evaluation_main \
          --input_data={input_path} \
          --input_response_data={response_path} \
          --output_dir=./evaluations/{run_name}

/content/M-IFEval
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
I0207 01:50:13.243359 132675298972288 evaluation_main.py:261] Generating eval_results_strict...
I0207 01:50:13.957061 132675298972288 evaluation_main.py:267] Accuracy: 0.369686
I0207 01:50:13.970398 132675298972288 evaluation_main.py:273] Generated: ./evaluations/en_input_response_data_deepseek-ai__deepseek-llm-7b-chat/eval_results_strict.jsonl
./evaluations/en_input_response_data_deepseek-ai__deepseek-llm-7b-chat/eval_results_strict.jsonl Accuracy Scores:
prompt-level: 0.36968576709796674
instruction-level: 0.4892086330935252

en 0.4892086330935252

en:change_case:capital_word_frequency 0.52
en:change_case:english_capital 0.12
en:change_case:english_lowercase 0.3076923076923077
en:combination:repeat_prompt 0.14634146341463414
en:combination:two_responses 0.3333333333333333
en:detectable_content:number_placeholders 0.37037037037037035
en:detectable_content
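
The file-naming convention used by the evaluation cell above can be sketched with two small helpers. The function names are hypothetical and not part of M-IFEval. The sketch assumes input files are named `<lang>_input_data.jsonl`, as in the cell, and it strips only the trailing extension when deriving the run name so that dotted model names such as `Qwen2.5` are kept whole.

```python
def response_path_for(input_path, model_name):
    """Map an input file and a model name to the expected response file."""
    safe_name = model_name.replace("/", "__")  # "/" is not allowed in file names
    stem = input_path[: -len("data.jsonl")]    # "./data/en_input_"
    return f"{stem}response_data_{safe_name}.jsonl"


def run_name_for(response_path):
    """Derive the evaluation directory name from a response file path."""
    return response_path.split("/")[-1].rsplit(".", 1)[0]


path = response_path_for("./data/en_input_data.jsonl", "deepseek-ai/deepseek-llm-7b-chat")
print(path)
print(run_name_for(path))
```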

Alternatively, to reproduce the evaluation for every model for which responses have been generated:
> This will take some time to run.

In [46]:
from glob import glob

%cd M-IFEval

input_paths = glob("./data/*_input_data.jsonl")
print(input_paths)
for input_path in input_paths:
    response_paths = glob(input_path[:-10] + "response_data_*")
    print(response_paths)
    for response_path in response_paths:
        print(response_path)
        # Strip only the final extension so dotted model names (e.g. "Qwen2.5") stay intact
        run_name = response_path.split("/")[-1].rsplit(".", 1)[0]

        ! mkdir -p ./evaluations/{run_name}
        ! python -m evaluation_main \
          --input_data={input_path} \
          --input_response_data={response_path} \
          --output_dir=./evaluations/{run_name}

Streaming output truncated to the last 5000 lines.
en:length_constraints:number_words 0.6538461538461539
en:punctuation:no_comma 0.9545454545454546
en:startend:end_checker 0.9615384615384616
en:startend:quotation 0.926829268292683
I0206 06:06:54.960812 138724136882176 evaluation_main.py:267] Generating eval_results_loose...
I0206 06:06:55.801276 138724136882176 evaluation_main.py:273] Accuracy: 0.804067
I0206 06:06:55.816987 138724136882176 evaluation_main.py:279] Generated: ./test/en_input_response_data_Qwen__Qwen2/eval_results_loose.jsonl
./test/en_input_response_data_Qwen__Qwen2/eval_results_loose.jsonl Accuracy Scores:
prompt-level: 0.8040665434380776
instruction-level: 0.8561151079136691

en 0.8561151079136691

en:change_case:capital_word_frequency 0.8
en:change_case:english_capital 0.84
en:change_case:english_lowercase 0.9743589743589743
en:combination:repeat_prompt 0.6341463414634146
en:combination:two_responses 0.9166666666666666
en:dete
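
The logs above report two headline numbers, prompt-level and instruction-level accuracy. The sketch below shows how they relate, assuming the usual IFEval definitions: a prompt counts as correct only if all of its instructions are followed, while instruction-level accuracy pools every instruction across prompts. The function name and toy data are illustrative and this is not the code in `evaluation_main.py`.

```python
def accuracy_scores(follow_lists):
    """Return (prompt_level, instruction_level) accuracy.

    follow_lists holds one list of booleans per prompt, in the same shape
    as the follow_instruction_list field of the result files.
    """
    prompt_hits = sum(all(flags) for flags in follow_lists)
    instruction_hits = sum(sum(flags) for flags in follow_lists)
    instruction_total = sum(len(flags) for flags in follow_lists)
    return prompt_hits / len(follow_lists), instruction_hits / instruction_total


# Three toy prompts with 2, 2 and 1 instructions respectively.
toy = [[True, True], [True, False], [False]]
prompt_level, instruction_level = accuracy_scores(toy)
print(prompt_level, instruction_level)
```

Because one missed instruction fails the whole prompt, prompt-level accuracy can never exceed instruction-level accuracy.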

# Visualisation

Processing results for all evaluation result files:

In [8]:
from glob import glob
import pandas as pd
from tqdm.auto import tqdm

%cd M-IFEval

# Get paths of evaluation result files
paths = glob("./evaluations/*/eval_results_strict.jsonl")

# Dictionary of selected models with their display names
select_models = {
    "gpt-4o-2024-08-06": "GPT4o",
    "o1-mini-2024-09-12": "o1 Mini",
    "claude-3-5-sonnet-20240620": "Sonnet",
    "Qwen__Qwen2.5-32B-Instruct-GPTQ-Int4": "Qwen 2.5 32B I. 4-bit",
    "gpt-4o-mini-2024-07-18": "GPT4o Mini",
    "claude-3-haiku-20240307": "Haiku",
    "claude-3-opus-20240229": "Opus",
    "o1-preview-2024-09-12": "o1",
    "CohereForAI__aya-23-8B": "Aya 23 8B",
    "Qwen__Qwen2.5-0.5B-Instruct-GPTQ-Int4": "Qwen 2.5 0.5B I. 4-bit",
    "Qwen__Qwen2.5-1.5B-Instruct-GPTQ-Int4": "Qwen 2.5 1.5B I. 4-bit",
    "Qwen__Qwen2.5-3B-Instruct-GPTQ-Int4": "Qwen 2.5 3B I. 4-bit",
    "Qwen__Qwen2.5-7B-Instruct-GPTQ-Int4": "Qwen 2.5 7B I. 4-bit",
    "Qwen__Qwen2.5-14B-Instruct-GPTQ-Int4": "Qwen 2.5 14B I. 4-bit",
    "hugging-quants__Meta-Llama-3.1-8B-Instruct-AWQ-INT4": "Llama 3.1 8B I.",
    "mistralai__Mistral-7B-Instruct-v0.3": "Mistral 7B I.",
    "deepseek-ai__deepseek-llm-7b-chat": "DeepSeek LLM 7B Chat",
}

# List to store model results
model_results = []

for path in tqdm(paths):
    run_name = path.split("/")[-2]  # Extract the model run name from path
    model_name = run_name[len("es_input_response_data_") :]  # Extract model identifier

    # Skip models that are not in the selected list
    if model_name not in select_models:
        continue

    # Read the JSONL file into a DataFrame
    res_df = pd.read_json(path, lines=True)
    res_df["instr_len"] = res_df.follow_instruction_list.str.len()

    # Explode the instruction list into separate rows
    exploded_res_df = pd.DataFrame(
        res_df.apply(
            lambda x: [
                {"instruction_id": inst, "follow_bool": val}
                for inst, val in zip(
                    x["instruction_id_list"], x["follow_instruction_list"]
                )
            ],
            axis=1,
        ).explode()
    )

    # Extract instruction details
    exploded_res_df["instruction_id"] = exploded_res_df[0].apply(
        lambda x: x["instruction_id"]
    )
    exploded_res_df["follow_bool"] = exploded_res_df[0].apply(
        lambda x: x["follow_bool"]
    )
    exploded_res_df = exploded_res_df.drop(0, axis=1)

    # Add model name and language details
    exploded_res_df["model_name"] = select_models[model_name]
    exploded_res_df["language"] = run_name[:2]

    # Count number of instructions per index
    idx_val_counts = exploded_res_df.index.value_counts()
    exploded_res_df["num_instr"] = exploded_res_df.index.map(idx_val_counts)

    # Append processed results
    model_results.append(exploded_res_df)

full_results_df = pd.concat(model_results)
full_results_df["instruction_stem"] = (
    full_results_df.instruction_id.str.split(":").str[1:3].str.join(":")
)

/content/M-IFEval


  0%|          | 0/64 [00:00<?, ?it/s]
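
To make the reshaping in the cell above easier to follow, here is a plain-Python sketch of the same steps for one record: splitting the run name into language and model, pairing each instruction with its result, and reducing the instruction ID to its stem. The helper names and the toy record are assumptions made for illustration; the field names match those read by the cell.

```python
PREFIX = "es_input_response_data_"  # every two-letter language code gives the same length


def split_run_name(run_name):
    """Return (language, model_name) from an evaluation directory name."""
    return run_name[:2], run_name[len(PREFIX):]


def flatten(record):
    """Turn one result record into (instruction_stem, followed) pairs."""
    return [
        (":".join(instruction_id.split(":")[1:3]), followed)
        for instruction_id, followed in zip(
            record["instruction_id_list"], record["follow_instruction_list"]
        )
    ]


# Toy record with two instructions.
record = {
    "instruction_id_list": ["en:change_case:english_capital", "en:punctuation:no_comma"],
    "follow_instruction_list": [False, True],
}
print(split_run_name("en_input_response_data_deepseek-ai__deepseek-llm-7b-chat"))
print(flatten(record))
```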

### Per model analysis

In [24]:
grouped_df = (
    full_results_df.groupby(["model_name", "language"])
    .follow_bool.mean()
    .reset_index(drop=False)
    .pivot(index="model_name", columns="language", values="follow_bool")
)

In [25]:
grouped_df["mean_esfrja"] = grouped_df[["es", "fr", "ja"]].mean(axis=1)

In [26]:
grouped_df.sort_values("mean_esfrja", ascending=False).round(3) * 100

| model_name            |   en |   es |   fr |   ja |   mean_esfrja |
|:----------------------|-----:|-----:|-----:|-----:|--------------:|
| o1                    | 85.9 | 92.7 | 91.3 | 75.7 |          86.6 |
| Opus                  | 87.3 | 90.5 | 87.0 | 75.7 |          84.4 |
| Sonnet                | 88.1 | 87.6 | 88.1 | 77.0 |          84.2 |
| o1 Mini               | 83.9 | 92.0 | 88.4 | 69.5 |          83.3 |
| GPT4o                 | 88.6 | 89.8 | 87.8 | 70.4 |          82.7 |
| GPT4o Mini            | 86.0 | 85.4 | 85.5 | 65.9 |          78.9 |
| Qwen 2.5 32B I. 4-bit | 86.0 | 82.5 | 81.7 | 65.9 |          76.7 |
| Qwen 2.5 14B I. 4-bit | 84.2 | 83.2 | 82.6 | 57.5 |          74.4 |
| Haiku                 | 77.3 | 78.8 | 78.3 | 61.9 |          73.0 |
| Qwen 2.5 7B I. 4-bit  | 80.6 | 78.1 | 76.8 | 50.9 |          68.6 |
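
As a quick sanity check, `mean_esfrja` is the unweighted mean of the three per-language scores, so each language counts equally regardless of how many instructions it contains. The sketch below recomputes it for the `o1` row from the rounded values shown above. Since the notebook averages before rounding, other rows recomputed this way could differ in the last digit.

```python
# Per-language scores (in %) for the "o1" row of the table above.
o1_scores = {"en": 85.9, "es": 92.7, "fr": 91.3, "ja": 75.7}


def mean_esfrja(scores):
    """Unweighted mean over the three non-English languages."""
    return sum(scores[lang] for lang in ("es", "fr", "ja")) / 3


print(round(mean_esfrja(o1_scores), 1))  # 86.6
```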


Convert to Markdown table for leaderboard:

In [29]:
print(
    (
        grouped_df.sort_values("mean_esfrja", ascending=False).round(3) * 100
    ).to_markdown()
)

| model_name             |   en |   es |   fr |   ja |   mean_esfrja |
|:-----------------------|-----:|-----:|-----:|-----:|--------------:|
| o1                     | 85.9 | 92.7 | 91.3 | 75.7 |          86.6 |
| Opus                   | 87.3 | 90.5 | 87   | 75.7 |          84.4 |
| Sonnet                 | 88.1 | 87.6 | 88.1 | 77   |          84.2 |
| o1 Mini                | 83.9 | 92   | 88.4 | 69.5 |          83.3 |
| GPT4o                  | 88.6 | 89.8 | 87.8 | 70.4 |          82.7 |
| GPT4o Mini             | 86   | 85.4 | 85.5 | 65.9 |          78.9 |
| Qwen 2.5 32B I. 4-bit  | 86   | 82.5 | 81.7 | 65.9 |          76.7 |
| Qwen 2.5 14B I. 4-bit  | 84.2 | 83.2 | 82.6 | 57.5 |          74.4 |
| Haiku                  | 77.3 | 78.8 | 78.3 | 61.9 |          73   |
| Qwen 2.5 7B I. 4-bit   | 80.6 | 78.1 | 76.8 | 50.9 |          68.6 |
| Llama 3.1 8B I.        | 80.1 | 80.3 | 71.3 | 39.8 |          63.8 |
| Qwen 2.5 3B I. 4-bit   | 67.9 | 68.6 | 65.8 | 45.1 |          59.8 |
| Mist

### Per model (only unique categories)

In [16]:
unique_instruction_list = [
    "detectable_content:informal_address",
    "detectable_content:no_digits",
    "detectable_format:nominal_ending",
    "detectable_format:number_numbered_lists",
    "length_constraints:number_letters",
    "letters:furigana",
    "letters:hiragana_only",
    "letters:kanji",
    "letters:kansuuji",
    "letters:katakana_only",
    "letters:no_hiragana",
    "letters:no_katakana",
    "punctuation:exclamation_marks",
    "punctuation:no_period",
    "punctuation:question_marks",
    "special_character:accents",
    "special_character:dieresis",
    "special_character:enie",
    "special_character:ethel_or_cedilla",
    "special_character:no_accents",
    "special_character:tildes",
    "startend:sentence_unified_end",
]

grouped_df = (
    full_results_df[full_results_df.instruction_stem.isin(unique_instruction_list)]
    .groupby(["model_name", "language"])
    .follow_bool.mean()
    .reset_index(drop=False)
    .pivot(index="model_name", columns="language", values="follow_bool")
)

grouped_df["mean_esfrja"] = grouped_df[["es", "fr", "ja"]].mean(axis=1)

In [17]:
grouped_df.sort_values("mean_esfrja", ascending=False).round(3) * 100

| model_name            |   es |   fr |   ja |   mean_esfrja |
|:----------------------|-----:|-----:|-----:|--------------:|
| o1                    | 75.0 | 96.1 | 61.4 |          77.5 |
| Sonnet                | 66.7 | 90.2 | 70.5 |          75.8 |
| Opus                  | 62.5 | 90.2 | 64.8 |          72.5 |
| GPT4o                 | 58.3 | 80.4 | 55.7 |          64.8 |
| o1 Mini               | 66.7 | 72.5 | 50.0 |          63.1 |
| Qwen 2.5 32B I. 4-bit | 54.2 | 78.4 | 54.5 |          62.4 |
| Haiku                 | 54.2 | 80.4 | 52.3 |          62.3 |
| GPT4o Mini            | 58.3 | 68.6 | 47.7 |          58.2 |
| Qwen 2.5 14B I. 4-bit | 54.2 | 62.7 | 40.9 |          52.6 |
| Qwen 2.5 7B I. 4-bit  | 50.0 | 62.7 | 40.9 |          51.2 |


### Per model (only non-unique)

In [18]:
grouped_df = (
    full_results_df[~full_results_df.instruction_stem.isin(unique_instruction_list)]
    .groupby(["model_name", "language"])
    .follow_bool.mean()
    .reset_index(drop=False)
    .pivot(index="model_name", columns="language", values="follow_bool")
)

grouped_df["mean_esfrja"] = grouped_df[["es", "fr", "ja"]].mean(axis=1)

In [19]:
grouped_df.sort_values("mean_esfrja", ascending=False).round(3) * 100

| model_name            |   en |   es |   fr |   ja |   mean_esfrja |
|:----------------------|-----:|-----:|-----:|-----:|--------------:|
| o1                    | 85.9 | 96.5 | 90.5 | 84.8 |          90.6 |
| o1 Mini               | 83.9 | 97.3 | 91.2 | 81.9 |          90.1 |
| Opus                  | 87.3 | 96.5 | 86.4 | 82.6 |          88.5 |
| GPT4o                 | 88.6 | 96.5 | 89.1 | 79.7 |          88.4 |
| Sonnet                | 88.1 | 92.0 | 87.8 | 81.2 |          87.0 |
| GPT4o Mini            | 86.0 | 91.2 | 88.4 | 77.5 |          85.7 |
| Qwen 2.5 32B I. 4-bit | 86.0 | 88.5 | 82.3 | 73.2 |          81.3 |
| Qwen 2.5 14B I. 4-bit | 84.2 | 89.4 | 86.1 | 68.1 |          81.2 |
| Haiku                 | 77.3 | 84.1 | 77.9 | 68.1 |          76.7 |
| Qwen 2.5 7B I. 4-bit  | 80.6 | 84.1 | 79.3 | 57.2 |          73.5 |


### Per instruction results

In [20]:
grouped_df = (
    full_results_df.groupby(["model_name", "instruction_stem"])
    .follow_bool.mean()
    .reset_index(drop=False)
    .pivot(index="model_name", columns="instruction_stem", values="follow_bool")
)

In [21]:
lang_map = full_results_df.groupby(["instruction_stem"]).language.apply(set)

per_instr_scores = (
    grouped_df.T.mean(axis=1)
    .sort_index()
    .reset_index(drop=False)
    .join(lang_map.sort_index().reset_index(drop=True))
    .sort_values(0, ascending=True)
)

In [22]:
per_instr_scores["instruction_family"] = (
    per_instr_scores["instruction_stem"]
    .str.split(":")
    .str[0]
    .str.replace("_", " ")
    .str.capitalize()
)
per_instr_scores["instruction_name"] = (
    per_instr_scores["instruction_stem"]
    .str.split(":")
    .str[1]
    .str.replace("_", " ")
    .str.capitalize()
)
per_instr_scores["language"] = per_instr_scores["language"].apply(
    lambda x: ", ".join([y.upper() for y in x])
)

per_instr_scores[["instruction_family", "instruction_name", "language", 0]]

|    | instruction_family | instruction_name     | language |        0 |
|---:|:-------------------|:---------------------|:---------|---------:|
| 44 | Special character  | Enie                 | ES       | 0.014706 |
| 31 | Letters            | Furigana             | JA       | 0.083333 |
| 43 | Special character  | Dieresis             | ES       | 0.132353 |
| 37 | Letters            | No katakana          | JA       | 0.168067 |
| 50 | Startend           | Sentence unified end | JA       | 0.252101 |
| 32 | Letters            | Hiragana only        | JA       | 0.285714 |
| 36 | Letters            | No hiragana          | JA       | 0.285714 |
| 16 | Detectable format  | Nominal ending       | JA       | 0.310924 |
| 35 | Letters            | Katakana only        | JA       | 0.333333 |
| 46 | Special character  | No accents           | FR       | 0.364706 |
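
The `instruction_family` and `instruction_name` labels in this table are derived from the instruction stem. A plain-Python equivalent of that string handling is sketched below; the helper name `format_stem` is hypothetical.

```python
def format_stem(stem):
    """Split 'family:name' and turn each part into a readable label."""
    family, name = stem.split(":")
    return (
        family.replace("_", " ").capitalize(),
        name.replace("_", " ").capitalize(),
    )


print(format_stem("special_character:enie"))
print(format_stem("startend:sentence_unified_end"))
```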
