# 🌊🌊 **SWELLS** — *Test Set Inferencer*
[![Paper](http://img.shields.io/badge/Arxiv:2503.09454-B31B1B.svg)](https://arxiv.org/abs/2503.09454)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-181717?logo=github)](https://github.com/mmarmonier/SWELLS)

This notebook submits the SWELLS test set, either in full or as selected partitions, to an LLM through OpenAI’s Batch API.

🚨 **WARNING:** Running this notebook requires an OpenAI API key and incurs costs. A cost estimate is provided below, but check it against current rates: https://platform.openai.com/docs/pricing.

🚨🚨 **DOUBLE WARNING:** Download and keep the `file_id_mapping.json` file generated by this notebook. It is required to retrieve the inferences at the evaluation stage.

# Data

In [None]:
!wget https://github.com/mmarmonier/SWELLS/releases/download/v1.0/SWELLS_dataset.7z

In [None]:
!7z x /content/SWELLS_dataset.7z -p"SWELLS"


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan /content/drive/MyDrive/                                 1 file, 299744346 bytes (286 MiB)

Extracting archive: /content/drive/MyDrive/SWELLS_dataset.7z
--
Path = /content/drive/MyDrive/SWELLS_dataset.7z
Type = 7z
Physical Size = 299744346
Headers Size = 1754
Method = LZMA2:24 7zAES
Solid = +
Blocks = 12


# Batch Creation for All Test Set Partitions

In [None]:
# Specify model_name
model_name = "gpt-4o-mini"

In [None]:
import json
import os

def create_batch_files_from_directory(input_dir, output_dir, model=model_name, temperature=0.05):
    """
    Processes all .jsonl files in the input directory and creates batch files
    with hyperparameters reflected in the output file names.

    Parameters:
    - input_dir (str): Path to the directory containing input JSONL files.
    - output_dir (str): Path to the directory to save the output batch JSONL files.
    - model (str): Model to use for the OpenAI API.
    - temperature (float): Sampling temperature.
    """
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # List all .jsonl files in the input directory
    jsonl_files = [f for f in os.listdir(input_dir) if f.endswith('.jsonl')]

    if not jsonl_files:
        print(f"No .jsonl files found in directory: {input_dir}")
        return

    for jsonl_file in jsonl_files:
        input_file_path = os.path.join(input_dir, jsonl_file)

        # Construct a meaningful output filename
        base_name = os.path.splitext(jsonl_file)[0]
        output_file_name = f"{base_name}_model-{model}_temp-{temperature}.jsonl"
        output_file_path = os.path.join(output_dir, output_file_name)

        # Read input file and write batch JSONL file
        with open(input_file_path, 'r', encoding='utf-8') as infile, open(output_file_path, 'w', encoding='utf-8') as outfile:
            for i, line in enumerate(infile):
                # Load the existing JSONL line to extract the prompt
                instance = json.loads(line)
                prompt = instance.get("prompt")
                custom_id = instance.get("instance_id")
                instance_type = instance.get("instance_type")[0]

                if "no_CoT_" in custom_id:
                    max_tokens = 100
                elif instance_type == "1":
                    max_tokens = 400
                elif instance_type in ("2", "3"):
                    max_tokens = 600
                elif instance_type == "4":
                    max_tokens = 700
                else:
                    max_tokens = 900

                # Prepare the batch line with the required fields
                batch_line = {
                    "custom_id": custom_id,
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        "model": model,
                        "messages": [
                            {"role": "system", "content": "You are an expert linguist and translator."},
                            {"role": "user", "content": prompt}
                        ],
                        "max_tokens": max_tokens,
                        "temperature": temperature
                    }
                }

                # Write the batch line to the new JSONL file
                outfile.write(json.dumps(batch_line, ensure_ascii=False) + '\n')

        print(f"Batch file created for {jsonl_file} at: {output_file_path}")



create_batch_files_from_directory(
    input_dir='/content/SWELLS_Dataset/Test/Test_CoT',
    output_dir='./API_batches',
    model=model_name,
    temperature=0.05
)

create_batch_files_from_directory(
    input_dir='/content/SWELLS_Dataset/Test/Test_no_CoT',
    output_dir='./API_batches',
    model=model_name,
    temperature=0.05
)


Batch file created for CoT_full_art_eng_test_dictionary_only.jsonl at: ./API_batches/CoT_full_art_eng_test_dictionary_only_model-gpt-4o-mini-2024-07-18_temp-0.05.jsonl
Batch file created for CoT_full_art_eng_test_incidental_bitexts.jsonl at: ./API_batches/CoT_full_art_eng_test_incidental_bitexts_model-gpt-4o-mini-2024-07-18_temp-0.05.jsonl
Batch file created for CoT_full_eng_art_test_grammar_excerpts.jsonl at: ./API_batches/CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini-2024-07-18_temp-0.05.jsonl
Batch file created for CoT_full_art_eng_test_grammar_excerpts.jsonl at: ./API_batches/CoT_full_art_eng_test_grammar_excerpts_model-gpt-4o-mini-2024-07-18_temp-0.05.jsonl
Batch file created for CoT_full_eng_art_test_dictionary_only.jsonl at: ./API_batches/CoT_full_eng_art_test_dictionary_only_model-gpt-4o-mini-2024-07-18_temp-0.05.jsonl
Batch file created for CoT_full_eng_art_test_incidental_bitexts.jsonl at: ./API_batches/CoT_full_eng_art_test_incidental_bitexts_model-gpt-4o-mini-202
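
Before uploading, it can be worth sanity-checking each generated file. The sketch below is not part of the original notebook, and the helper name `validate_batch_file` is hypothetical. It checks that every line parses as JSON, carries the four fields written by the function above (`custom_id`, `method`, `url`, `body`), and that `custom_id` values are unique within a file, which the Batch API expects so that outputs can be matched back to inputs.

```python
import json

REQUIRED_FIELDS = ("custom_id", "method", "url", "body")

def validate_batch_file(path):
    """Return a list of human-readable problems found in one batch JSONL file."""
    problems = []
    seen_ids = set()
    with open(path, "r", encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {line_number}: invalid JSON ({e})")
                continue
            if not isinstance(entry, dict):
                problems.append(f"line {line_number}: expected a JSON object")
                continue
            # Every request line needs the fields the batch endpoint reads
            missing = [field for field in REQUIRED_FIELDS if field not in entry]
            if missing:
                problems.append(f"line {line_number}: missing {missing}")
            # custom_id values must not repeat within one batch file
            custom_id = entry.get("custom_id")
            if custom_id in seen_ids:
                problems.append(f"line {line_number}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    return problems
```

An empty list means no problems were detected by these particular checks; it does not guarantee that the API will accept the file.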

In [None]:
!head /content/API_batches/CoT_full_art_eng_test_dictionary_only_model-gpt-4o-mini-2024-07-18_temp-0.05.jsonl

{"custom_id": "CoT___2024-12-28_09-59-48___full_art_eng_test_dictionary_only.jsonl___0", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini-2024-07-18", "messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": "user", "content": "Xeniwalé is a recently devised conlang. You are to translate the following Xeniwalé segment into English with the help of a few dictionary entries.\n\nHere is the text segment you must translate:\ndeifu qtpfuvwafu\n\nAnd here are a few dictionary entries that may be of use to you; note that each entry follows the format: lemma (grammatical gender and/or part of speech) : English equivalent.\nqtpfuvwa (fem. n.): house\n\nA reminder that the Xeniwalé sentence you must translate into English is:\ndeifu qtpfuvwafu\n\nYou may explain your chain of thoughts prior to producing the required translation. IMPORTANT: Do write your translation between tags in the following manner: <translation>your tra

# Batch Creation for Specific Test Set Partitions

In [None]:
!rm -R /content/API_batches

In [None]:
# Specify model_name
model_name = "gpt-4o-mini"

In [None]:
# Example: eng_art direction and grammar excerpts modality only
import json
import os

def create_batch_files_from_directory(input_dir, output_dir, model=model_name, temperature=0.05):
    """
    Processes all .jsonl files in the input directory and creates batch files
    with hyperparameters reflected in the output file names.

    Parameters:
    - input_dir (str): Path to the directory containing input JSONL files.
    - output_dir (str): Path to the directory to save the output batch JSONL files.
    - model (str): Model to use for the OpenAI API.
    - temperature (float): Sampling temperature.
    """
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Keep only the eng_art grammar-excerpts partition (the CoT and no_CoT files share this suffix)
    jsonl_files = [f for f in os.listdir(input_dir) if f.endswith('eng_art_test_grammar_excerpts.jsonl')]

    if not jsonl_files:
        print(f"No matching .jsonl files found in directory: {input_dir}")
        return

    for jsonl_file in jsonl_files:
        input_file_path = os.path.join(input_dir, jsonl_file)

        # Construct a meaningful output filename
        base_name = os.path.splitext(jsonl_file)[0]
        output_file_name = f"{base_name}_model-{model}_temp-{temperature}.jsonl"
        output_file_path = os.path.join(output_dir, output_file_name)

        # Read input file and write batch JSONL file
        with open(input_file_path, 'r', encoding='utf-8') as infile, open(output_file_path, 'w', encoding='utf-8') as outfile:
            for i, line in enumerate(infile):
                # Load the existing JSONL line to extract the prompt
                instance = json.loads(line)
                prompt = instance.get("prompt")
                custom_id = instance.get("instance_id")
                instance_type = instance.get("instance_type")[0]

                if "no_CoT_" in custom_id:
                    max_tokens = 100
                elif instance_type == "1":
                    max_tokens = 400
                elif instance_type in ("2", "3"):
                    max_tokens = 600
                elif instance_type == "4":
                    max_tokens = 700
                else:
                    max_tokens = 900

                # Prepare the batch line with the required fields
                batch_line = {
                    "custom_id": custom_id,
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        "model": model,
                        "messages": [
                            {"role": "system", "content": "You are an expert linguist and translator."},
                            {"role": "user", "content": prompt}
                        ],
                        "max_tokens": max_tokens,
                        "temperature": temperature
                    }
                }

                # Write the batch line to the new JSONL file
                outfile.write(json.dumps(batch_line, ensure_ascii=False) + '\n')

        print(f"Batch file created for {jsonl_file} at: {output_file_path}")



create_batch_files_from_directory(
    input_dir='/content/SWELLS_Dataset/Test/Test_CoT',
    output_dir='./API_batches',
    model=model_name,
    temperature=0.05
)

create_batch_files_from_directory(
    input_dir='/content/SWELLS_Dataset/Test/Test_no_CoT',
    output_dir='./API_batches',
    model=model_name,
    temperature=0.05
)


Batch file created for CoT_full_eng_art_test_grammar_excerpts.jsonl at: ./API_batches/CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini_temp-0.05.jsonl
Batch file created for no_CoT_full_eng_art_test_grammar_excerpts.jsonl at: ./API_batches/no_CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini_temp-0.05.jsonl
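
Rather than hardcoding a suffix inside the function, the partition filter could be factored out. The following is a minimal sketch, not the notebook's own code: `select_partition_files` is a hypothetical helper, and it assumes the naming pattern visible in the outputs above, where the direction (`eng_art` or `art_eng`) is followed by `_test_` and the modality (`dictionary_only`, `incidental_bitexts`, `grammar_excerpts`) ends the file name.

```python
def select_partition_files(filenames, direction=None, modality=None):
    """Keep .jsonl files matching an optional direction and/or modality."""
    selected = []
    for name in filenames:
        if not name.endswith(".jsonl"):
            continue
        # Direction sits right before "_test_", e.g. "..._eng_art_test_..."
        if direction is not None and f"_{direction}_test_" not in name:
            continue
        # Modality is the last component before the extension
        if modality is not None and not name.endswith(f"_{modality}.jsonl"):
            continue
        selected.append(name)
    return selected


# Example with file names taken from the outputs above
names = [
    "CoT_full_art_eng_test_dictionary_only.jsonl",
    "CoT_full_eng_art_test_grammar_excerpts.jsonl",
    "CoT_full_eng_art_test_dictionary_only.jsonl",
]
print(select_partition_files(names, direction="eng_art", modality="grammar_excerpts"))
```

Passing `None` for either argument leaves that dimension unfiltered, so the same helper covers the full test set and single-direction or single-modality runs.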


# Inference Cost Estimation

In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 12.8 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [None]:
import os
import json
from tiktoken import encoding_for_model  # Tokenizer lookup (installed above via `pip install tiktoken`)

# Pricing constants in USD. Check current rates; inference on fine-tuned checkpoints costs more.
INPUT_COST_PER_MILLION = 0.075  # $ per 1M input tokens
OUTPUT_COST_PER_MILLION = 0.30  # $ per 1M output tokens

def calculate_inference_cost(directory, model="gpt-4o-mini"):
    """
    Estimates the cost of batch inference: input tokens are counted from the
    message contents, and output tokens are assumed to reach each request's
    max_tokens cap.

    Parameters:
    - directory (str): Path to the directory containing batch JSONL files.
    - model (str): Model name to determine the tokenizer.

    Returns:
    - cost_summary (dict): Summary of total tokens and costs.
    """
    # Get the tokenizer that matches the requested model
    encoding = encoding_for_model(model)

    total_input_tokens = 0
    total_output_tokens = 0
    file_stats = {}

    # List all JSONL files in the directory
    jsonl_files = [f for f in os.listdir(directory) if f.endswith('.jsonl')]

    for file in jsonl_files:
        file_path = os.path.join(directory, file)
        file_input_tokens = 0
        file_output_tokens = 0
        line_count = 0

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line in infile:
                    line_count += 1
                    try:
                        batch_entry = json.loads(line.strip())
                        body = batch_entry.get("body", {})
                        messages = body.get("messages", [])
                        max_tokens = body.get("max_tokens", 0)

                        # Calculate input tokens (system + user messages)
                        input_content = "".join([msg.get("content", "") for msg in messages])
                        input_tokens = len(encoding.encode(input_content))
                        file_input_tokens += input_tokens

                        # Output tokens: max_tokens is only an upper bound, since the API bills
                        # the tokens actually generated, so this overestimates the output cost
                        file_output_tokens += max_tokens

                    except (json.JSONDecodeError, KeyError) as e:
                        print(f"Skipping malformed line in {file}: {e}")

            # Add file-level stats
            file_stats[file] = {
                "input_tokens": file_input_tokens,
                "output_tokens": file_output_tokens,
                "line_count": line_count
            }

            total_input_tokens += file_input_tokens
            total_output_tokens += file_output_tokens

        except Exception as e:
            print(f"Error processing file {file}: {e}")
            file_stats[file] = {"input_tokens": None, "output_tokens": None, "line_count": None}

    # Calculate costs
    input_cost = (total_input_tokens / 1_000_000) * INPUT_COST_PER_MILLION
    output_cost = (total_output_tokens / 1_000_000) * OUTPUT_COST_PER_MILLION
    total_cost = input_cost + output_cost

    # Summary
    cost_summary = {
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost,
        "file_stats": file_stats
    }

    return cost_summary


# Example Usage
directory = "./API_batches"  # Path to your batch files
cost_summary = calculate_inference_cost(directory, model="gpt-4o-mini")

# Print Summary
print("\n=== Cost Projection Summary ===")
print(f"Total Input Tokens: {cost_summary['total_input_tokens']}")
print(f"Total Output Tokens: {cost_summary['total_output_tokens']}")
print(f"Input Cost: ${cost_summary['input_cost']:.2f}")
print(f"Output Cost: ${cost_summary['output_cost']:.2f}")
print(f"Total Cost: ${cost_summary['total_cost']:.2f}")

print("\n=== File-Level Details ===")
for file, stats in cost_summary["file_stats"].items():
    print(f"{file}:")
    if stats["input_tokens"] is not None:
        print(f"  Input Tokens: {stats['input_tokens']}")
        print(f"  Output Tokens: {stats['output_tokens']}")
        print(f"  Line Count: {stats['line_count']}")
    else:
        print("  Error processing file.")



=== Cost Projection Summary ===
Total Input Tokens: 22777254
Total Output Tokens: 1240000
Input Cost: $1.71
Output Cost: $0.37
Total Cost: $2.08

=== File-Level Details ===
no_CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini_temp-0.05.jsonl:
  Input Tokens: 11392127
  Output Tokens: 140000
  Line Count: 1400
CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini_temp-0.05.jsonl:
  Input Tokens: 11385127
  Output Tokens: 1100000
  Line Count: 1400
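
The arithmetic behind the summary above can be reproduced directly from the reported token totals. This is a small illustrative sketch; `project_cost` is a hypothetical helper, and the rates are the constants defined in the cell above, not a statement of current pricing.

```python
INPUT_COST_PER_MILLION = 0.075  # $ per 1M input tokens (value used in this notebook)
OUTPUT_COST_PER_MILLION = 0.30  # $ per 1M output tokens (value used in this notebook)

def project_cost(input_tokens, output_tokens,
                 input_rate=INPUT_COST_PER_MILLION,
                 output_rate=OUTPUT_COST_PER_MILLION):
    """Return (input_cost, output_cost, total_cost) in dollars."""
    input_cost = input_tokens / 1_000_000 * input_rate
    output_cost = output_tokens / 1_000_000 * output_rate
    return input_cost, output_cost, input_cost + output_cost

# Token totals reported in the summary above
input_cost, output_cost, total_cost = project_cost(22_777_254, 1_240_000)
print(f"Input: ${input_cost:.2f}  Output: ${output_cost:.2f}  Total: ${total_cost:.2f}")
# prints "Input: $1.71  Output: $0.37  Total: $2.08"
```

The output figure is a ceiling: 1,400 no-CoT requests capped at 100 tokens give the 140,000 reported for that file, and the actual charge depends on how many tokens the model generates.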


# Batch API Call

In [None]:
import os
import json
import datetime
import openai  # Ensure the OpenAI library is installed: pip install openai

def process_and_send_batches(client, output_dir):
    """
    Processes and submits batches to the OpenAI API.

    Parameters:
    - client: The OpenAI API client instance.
    - output_dir (str): Path to the directory containing batch files to process.

    Returns:
    - mapping (dict): A dictionary mapping batch files to their file IDs and metadata.
    """
    # Ensure the output directory exists
    if not os.path.exists(output_dir):
        print(f"Output directory does not exist: {output_dir}")
        return

    # List all JSONL batch files in the output directory
    batch_files = [f for f in os.listdir(output_dir) if f.endswith('.jsonl')]

    if not batch_files:
        print(f"No batch files found in directory: {output_dir}")
        return

    # Create a mapping to store file IDs, metadata, and their corresponding batch files
    mapping = {}

    for batch_file in batch_files:
        batch_file_path = os.path.join(output_dir, batch_file)

        try:
            # Upload the batch file
            with open(batch_file_path, "rb") as file:
                batch_input_file = client.files.create(
                    file=file,
                    purpose="batch"
                )
            batch_input_file_id = batch_input_file.id

            # Create the batch job
            client.batches.create(
                input_file_id=batch_input_file_id,
                endpoint="/v1/chat/completions",
                completion_window="24h",
                metadata={
                    "description": f"Batch created from {batch_file}",
                    "timestamp": datetime.datetime.now().isoformat(),
                    "original_file_name": batch_file
                }
            )

            print(f"Submitted {batch_file} with ID: {batch_input_file_id}")

            # Map the batch to its metadata
            mapping[batch_file] = {
                "file_id": batch_input_file_id,
                "batch_file": batch_file_path,
                "timestamp": datetime.datetime.now().isoformat()
            }

        except openai.AuthenticationError as e:
            print(f"Authentication error for {batch_file}: {e}")
        except openai.RateLimitError as e:
            print(f"Rate limit error for {batch_file}: {e}")
        except Exception as e:
            print(f"Error submitting batch {batch_file}: {e}")

    # Save the mapping to a JSON file
    mapping_file_path = "./file_id_mapping.json"
    with open(mapping_file_path, "w", encoding="utf-8") as map_file:
        json.dump(mapping, map_file, ensure_ascii=False, indent=4)

    print(f"Mapping saved to: {mapping_file_path}")

    return mapping


In [None]:
from openai import OpenAI

# Initialize OpenAI client
# Replace the XXXX string with an OpenAI API key.
client = OpenAI(api_key='XXXX')

# Run the batch processing
mapping = process_and_send_batches(
    client=client,
    output_dir='./API_batches'
)

Submitted no_CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini_temp-0.05.jsonl with ID: file-GRgLFyC9iFQSRrtt9N1ggu
Submitted CoT_full_eng_art_test_grammar_excerpts_model-gpt-4o-mini_temp-0.05.jsonl with ID: file-DJUt777smAdgB1fJqfNdCw
Mapping saved to: ./file_id_mapping.json
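
Batch jobs complete asynchronously, so it can help to check on them before moving to evaluation. The sketch below is an assumption-laden illustration rather than part of the notebook: `summarize_batch_status` is a hypothetical helper, and it relies on `client.batches.list` and on the `id`, `input_file_id`, `status`, and `output_file_id` fields of batch objects as exposed by the OpenAI Python SDK (v1). It matches jobs to the entries of `mapping` through the input file ID, since that is what the mapping stores.

```python
def summarize_batch_status(client, mapping):
    """Map each submitted batch file name to the current state of its batch job."""
    # Invert the mapping so that jobs can be matched on their input file ID
    name_by_file_id = {meta["file_id"]: name for name, meta in mapping.items()}

    statuses = {}
    # Iterating over the result walks through the listed batch jobs
    for batch in client.batches.list(limit=100):
        name = name_by_file_id.get(batch.input_file_id)
        if name is not None:
            statuses[name] = {
                "batch_id": batch.id,
                "status": batch.status,
                "output_file_id": batch.output_file_id,
            }
    return statuses
```

A call such as `summarize_batch_status(client, mapping)` would then return one entry per submitted file that the API still lists; files missing from the result were either never submitted or are no longer returned by the listing.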


# **🚨🚨🚨 Important: download and keep the generated `file_id_mapping.json`!**
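
Because `process_and_send_batches` only records files whose upload and job creation succeeded (failures are caught and printed), the saved mapping may be incomplete. As a rough sketch, the hypothetical helper `find_unsubmitted` below compares the batch directory with the saved mapping and lists any file that has no entry.

```python
import json
import os

def find_unsubmitted(batch_dir, mapping_path):
    """Return batch files in batch_dir that have no entry in the saved mapping."""
    with open(mapping_path, "r", encoding="utf-8") as map_file:
        mapping = json.load(map_file)
    # Mapping keys are the batch file names, as written by process_and_send_batches
    batch_files = sorted(f for f in os.listdir(batch_dir) if f.endswith(".jsonl"))
    return [name for name in batch_files if name not in mapping]
```

For the paths used in this notebook, the call would be `find_unsubmitted("./API_batches", "./file_id_mapping.json")`; an empty list suggests every batch file was recorded.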