# MMLU-ProX-Lite-Open Dataset Upload to Elluminate

This notebook loads the MMLU-ProX-Lite-open dataset from the Hugging Face Hub and uploads it to Elluminate, creating one collection per language split.

## Setup and Imports

In [7]:
import os
from dotenv import load_dotenv
from elluminate import Client
from datasets import load_dataset
import pandas as pd
from tqdm import tqdm
import nest_asyncio

# Patch asyncio so an event loop can be re-entered inside Jupyter's already-running loop
nest_asyncio.apply()

# Load environment variables
load_dotenv()

print("Environment setup complete")

Environment setup complete
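
Before the client is created, it can help to confirm that the variables it relies on were actually loaded. The following is a minimal sketch, assuming the two variable names mentioned in the client initialization cell (`ELLUMINATE_API_KEY` and `ELLUMINATE_BASE_URL`); the helper `missing_env_vars` is a hypothetical name and not part of the Elluminate SDK.

```python
import os

# Variable names taken from the comment in the client initialization cell.
REQUIRED_VARS = ("ELLUMINATE_API_KEY", "ELLUMINATE_BASE_URL")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Passing an explicit `environ` mapping keeps the check easy to test without touching the real process environment.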


## Initialize Elluminate Client

In [2]:
# Initialize Elluminate client
# Make sure ELLUMINATE_API_KEY and ELLUMINATE_BASE_URL are set in your .env file
client = Client()

print("Elluminate client initialized successfully")

Elluminate client initialized successfully


## Load Dataset

In [3]:
# Load the MMLU-ProX-Lite-open dataset
dataset = load_dataset("jphme/MMLU-ProX-Lite-open")

print("Dataset loaded successfully")
print(f"Available splits: {list(dataset.keys())}")
print(f"Total splits: {len(dataset)}")

Dataset loaded successfully
Available splits: ['af', 'ar', 'bn', 'cs', 'de', 'en', 'es', 'fr', 'hi', 'hu', 'id', 'it', 'ja', 'ko', 'mr', 'ne', 'pt', 'ru', 'sr', 'sw', 'te', 'th', 'uk', 'ur', 'vi', 'wo', 'yo', 'zh', 'zu']
Total splits: 29


## Inspect Dataset Structure

In [4]:
# Show sample data from English split
en_split = dataset["en"]
print(f"Columns: {en_split.column_names}")
print(f"Number of questions in English split: {len(en_split)}")
print("\nSample data:")
print(en_split[0])

Columns: ['question_id', 'question', 'answer', 'cot_content', 'category', 'src']
Number of questions in English split: 470

Sample data:
{'question_id': 72, 'question': 'Determine the number of men needed to build a boat in 77 days if it takes 36 men 132 days to build one.', 'answer': '62 men', 'cot_content': '', 'category': 'business', 'src': 'stemez-Business'}
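
Since each row carries a `category` field, a quick tally per category gives a feel for how the questions are distributed. This is a small sketch using only the standard library; `count_by_field` is a hypothetical helper, and the sample rows below are illustrative stand-ins that mimic the printed schema rather than real dataset content (only the first row's values come from the sample above).

```python
from collections import Counter


def count_by_field(rows, field="category"):
    """Count how many rows share each value of `field`."""
    return Counter(row[field] for row in rows)


# Illustrative rows; only the first one reuses values from the printed sample.
sample_rows = [
    {"question_id": 72, "category": "business", "src": "stemez-Business"},
    {"question_id": 1, "category": "physics", "src": "example-src"},
    {"question_id": 2, "category": "business", "src": "example-src"},
]

counts = count_by_field(sample_rows)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

With the real data, passing `en_split` instead of `sample_rows` should work as well, assuming that iterating over a split yields one dictionary per row.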


## Data Preprocessing Function

In [19]:
def preprocess_split_data(split_data):
    """
    Preprocess a dataset split for Elluminate upload: drop the empty
    cot_content column, cast every value to str, and return the rows
    as a list of dictionaries.
    """
    # Convert to pandas DataFrame for easier manipulation
    df = split_data.to_pandas()

    # Remove the empty cot_content column
    if "cot_content" in df.columns:
        df = df.drop("cot_content", axis=1)
        print("Removed empty cot_content column")

    # Cast all values to strings (e.g. question_id 72 becomes '72')
    # and convert to a list of dictionaries
    records = df.astype(str).to_dict("records")

    print(f"Preprocessed {len(records)} records")
    print(f"Remaining columns: {list(df.columns)}")

    return records


# Test preprocessing with English split
en_records = preprocess_split_data(dataset["en"])
print("\nSample preprocessed record:")
print(en_records[0])

Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']

Sample preprocessed record:
{'question_id': '72', 'question': 'Determine the number of men needed to build a boat in 77 days if it takes 36 men 132 days to build one.', 'answer': '62 men', 'category': 'business', 'src': 'stemez-Business'}
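
The same transformation can be expressed without pandas, which makes the effect of the string cast easier to see. This is a dependency-free sketch and not the function used in this notebook; `to_string_records` is a hypothetical name, and the input row is the English sample record shown earlier.

```python
def to_string_records(rows, drop=("cot_content",)):
    """Drop the named columns and cast every remaining value to str."""
    return [
        {key: str(value) for key, value in row.items() if key not in drop}
        for row in rows
    ]


# The sample record printed in the dataset inspection cell.
raw_row = {
    "question_id": 72,
    "question": (
        "Determine the number of men needed to build a boat in 77 days "
        "if it takes 36 men 132 days to build one."
    ),
    "answer": "62 men",
    "cot_content": "",
    "category": "business",
    "src": "stemez-Business",
}

converted = to_string_records([raw_row])
print(converted[0])
```

Note that `question_id` comes out as the string `'72'`, matching the preprocessed record above.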


## Create Collections for Each Language

In [20]:
# Language names mapping for better collection descriptions
language_names = {
    "af": "Afrikaans",
    "ar": "Arabic",
    "bn": "Bengali",
    "cs": "Czech",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "hi": "Hindi",
    "hu": "Hungarian",
    "id": "Indonesian",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "mr": "Marathi",
    "ne": "Nepali",
    "pt": "Portuguese",
    "ru": "Russian",
    "sr": "Serbian",
    "sw": "Swahili",
    "te": "Telugu",
    "th": "Thai",
    "uk": "Ukrainian",
    "ur": "Urdu",
    "vi": "Vietnamese",
    "wo": "Wolof",
    "yo": "Yoruba",
    "zh": "Chinese",
    "zu": "Zulu",
}

# Dictionary to store created collections
collections = {}

print("Creating collections for each language...")
for lang_code in tqdm(dataset.keys(), desc="Creating collections"):
    language_name = language_names.get(lang_code, lang_code.upper())
    collection_name = f"MMLU-ProX-Lite-Open-{lang_code.upper()}"
    description = f"MMLU-ProX-Lite open-ended questions in {language_name} ({lang_code}). Contains {len(dataset[lang_code])} questions across multiple academic categories."

    try:
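        # Start from a clean slate: fetch the collection (creating it if absent)
        # and delete it so the variables below are uploaded fresh.
        try:
            collection, created = client.collections.get_or_create(name=collection_name)
            client.collections.delete(collection)
            if not created:
                print(f"Deleted existing collection {collection_name}")
        except Exception as e:
            # Cleanup is best-effort: report the problem instead of hiding it
            print(f"Could not reset collection {collection_name}: {e}")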

        collection, created = client.collections.get_or_create(
            name=collection_name,
            description=description,
            variables=preprocess_split_data(dataset[lang_code]),
        )

        collections[lang_code] = collection
        status = "created" if created else "found existing"
        print(
            f"✓ {language_name} ({lang_code}): {status} collection '{collection_name}'"
        )

    except Exception as e:
        print(
            f"✗ Error creating collection for {language_name} ({lang_code}): {str(e)}"
        )

print(f"\nSuccessfully created/found {len(collections)} collections")

Creating collections for each language...


Creating collections:   0%|          | 0/29 [00:00<?, ?it/s]

Deleted existing collection MMLU-ProX-Lite-Open-AF
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:   3%|▎         | 1/29 [00:01<00:35,  1.28s/it]

✓ Afrikaans (af): created collection 'MMLU-ProX-Lite-Open-AF'
Deleted existing collection MMLU-ProX-Lite-Open-AR
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:   7%|▋         | 2/29 [00:02<00:28,  1.04s/it]

✓ Arabic (ar): created collection 'MMLU-ProX-Lite-Open-AR'
Deleted existing collection MMLU-ProX-Lite-Open-BN
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  10%|█         | 3/29 [00:03<00:24,  1.04it/s]

✓ Bengali (bn): created collection 'MMLU-ProX-Lite-Open-BN'
Deleted existing collection MMLU-ProX-Lite-Open-CS
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  14%|█▍        | 4/29 [00:03<00:22,  1.10it/s]

✓ Czech (cs): created collection 'MMLU-ProX-Lite-Open-CS'
Deleted existing collection MMLU-ProX-Lite-Open-DE
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  17%|█▋        | 5/29 [00:04<00:20,  1.18it/s]

✓ German (de): created collection 'MMLU-ProX-Lite-Open-DE'
Deleted existing collection MMLU-ProX-Lite-Open-EN
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  21%|██        | 6/29 [00:05<00:19,  1.19it/s]

✓ English (en): created collection 'MMLU-ProX-Lite-Open-EN'
Deleted existing collection MMLU-ProX-Lite-Open-ES
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  24%|██▍       | 7/29 [00:06<00:19,  1.13it/s]

✓ Spanish (es): created collection 'MMLU-ProX-Lite-Open-ES'
Deleted existing collection MMLU-ProX-Lite-Open-FR
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  28%|██▊       | 8/29 [00:07<00:21,  1.01s/it]

✓ French (fr): created collection 'MMLU-ProX-Lite-Open-FR'
Deleted existing collection MMLU-ProX-Lite-Open-HI
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  31%|███       | 9/29 [00:09<00:22,  1.12s/it]

✓ Hindi (hi): created collection 'MMLU-ProX-Lite-Open-HI'
Deleted existing collection MMLU-ProX-Lite-Open-HU
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  34%|███▍      | 10/29 [00:09<00:19,  1.04s/it]

✓ Hungarian (hu): created collection 'MMLU-ProX-Lite-Open-HU'
Deleted existing collection MMLU-ProX-Lite-Open-ID
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  38%|███▊      | 11/29 [00:10<00:17,  1.03it/s]

✓ Indonesian (id): created collection 'MMLU-ProX-Lite-Open-ID'
Deleted existing collection MMLU-ProX-Lite-Open-IT
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  41%|████▏     | 12/29 [00:11<00:16,  1.04it/s]

✓ Italian (it): created collection 'MMLU-ProX-Lite-Open-IT'
Deleted existing collection MMLU-ProX-Lite-Open-JA
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  45%|████▍     | 13/29 [00:12<00:14,  1.08it/s]

✓ Japanese (ja): created collection 'MMLU-ProX-Lite-Open-JA'
Deleted existing collection MMLU-ProX-Lite-Open-KO
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  48%|████▊     | 14/29 [00:13<00:13,  1.11it/s]

✓ Korean (ko): created collection 'MMLU-ProX-Lite-Open-KO'
Deleted existing collection MMLU-ProX-Lite-Open-MR
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  52%|█████▏    | 15/29 [00:14<00:14,  1.05s/it]

✓ Marathi (mr): created collection 'MMLU-ProX-Lite-Open-MR'
Deleted existing collection MMLU-ProX-Lite-Open-NE
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  55%|█████▌    | 16/29 [00:15<00:13,  1.04s/it]

✓ Nepali (ne): created collection 'MMLU-ProX-Lite-Open-NE'
Deleted existing collection MMLU-ProX-Lite-Open-PT
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  59%|█████▊    | 17/29 [00:16<00:12,  1.05s/it]

✓ Portuguese (pt): created collection 'MMLU-ProX-Lite-Open-PT'
Deleted existing collection MMLU-ProX-Lite-Open-RU
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  62%|██████▏   | 18/29 [00:17<00:11,  1.05s/it]

✓ Russian (ru): created collection 'MMLU-ProX-Lite-Open-RU'
Deleted existing collection MMLU-ProX-Lite-Open-SR
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  66%|██████▌   | 19/29 [00:18<00:10,  1.00s/it]

✓ Serbian (sr): created collection 'MMLU-ProX-Lite-Open-SR'
Deleted existing collection MMLU-ProX-Lite-Open-SW
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  69%|██████▉   | 20/29 [00:19<00:09,  1.07s/it]

✓ Swahili (sw): created collection 'MMLU-ProX-Lite-Open-SW'
Deleted existing collection MMLU-ProX-Lite-Open-TE
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  72%|███████▏  | 21/29 [00:21<00:10,  1.28s/it]

✓ Telugu (te): created collection 'MMLU-ProX-Lite-Open-TE'
Deleted existing collection MMLU-ProX-Lite-Open-TH
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  76%|███████▌  | 22/29 [00:23<00:08,  1.28s/it]

✓ Thai (th): created collection 'MMLU-ProX-Lite-Open-TH'
Deleted existing collection MMLU-ProX-Lite-Open-UK
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  79%|███████▉  | 23/29 [00:23<00:07,  1.17s/it]

✓ Ukrainian (uk): created collection 'MMLU-ProX-Lite-Open-UK'
Deleted existing collection MMLU-ProX-Lite-Open-UR
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  83%|████████▎ | 24/29 [00:24<00:05,  1.09s/it]

✓ Urdu (ur): created collection 'MMLU-ProX-Lite-Open-UR'
Deleted existing collection MMLU-ProX-Lite-Open-VI
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  86%|████████▌ | 25/29 [00:25<00:03,  1.01it/s]

✓ Vietnamese (vi): created collection 'MMLU-ProX-Lite-Open-VI'
Deleted existing collection MMLU-ProX-Lite-Open-WO
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  90%|████████▉ | 26/29 [00:26<00:02,  1.02it/s]

✓ Wolof (wo): created collection 'MMLU-ProX-Lite-Open-WO'
Deleted existing collection MMLU-ProX-Lite-Open-YO
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  93%|█████████▎| 27/29 [00:27<00:02,  1.03s/it]

✓ Yoruba (yo): created collection 'MMLU-ProX-Lite-Open-YO'
Deleted existing collection MMLU-ProX-Lite-Open-ZH
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections:  97%|█████████▋| 28/29 [00:28<00:01,  1.00s/it]

✓ Chinese (zh): created collection 'MMLU-ProX-Lite-Open-ZH'
Deleted existing collection MMLU-ProX-Lite-Open-ZU
Removed empty cot_content column
Preprocessed 470 records
Remaining columns: ['question_id', 'question', 'answer', 'category', 'src']


Creating collections: 100%|██████████| 29/29 [00:29<00:00,  1.01s/it]

✓ Zulu (zu): created collection 'MMLU-ProX-Lite-Open-ZU'

Successfully created/found 29 collections
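
The naming scheme used in the loop can be pulled into a small pure function, which makes it easy to check names and descriptions without calling the Elluminate API. This is a sketch only; `collection_metadata` is a hypothetical helper that reproduces the f-strings from the cell above.

```python
def collection_metadata(lang_code, n_questions, language_names):
    """Build the collection name and description used for one language split."""
    # Fall back to the upper-cased code when no display name is known.
    language_name = language_names.get(lang_code, lang_code.upper())
    name = f"MMLU-ProX-Lite-Open-{lang_code.upper()}"
    description = (
        f"MMLU-ProX-Lite open-ended questions in {language_name} ({lang_code}). "
        f"Contains {n_questions} questions across multiple academic categories."
    )
    return name, description


# Illustrative call with a one-entry mapping; 470 is the split size reported above.
name, description = collection_metadata("de", 470, {"de": "German"})
print(name)
print(description)
```

Inside the loop, the two values could then be obtained with `collection_metadata(lang_code, len(dataset[lang_code]), language_names)`.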





## Collection Summary

In [9]:
# Display summary of created collections
print("Collection Summary:")
print("=" * 50)

for lang_code, collection in collections.items():
    language_name = language_names.get(lang_code, lang_code.upper())
    num_questions = len(dataset[lang_code])
    print(
        f"{language_name:12} ({lang_code}): {num_questions:3} questions - Collection ID: {collection.id}"
    )

total_questions = sum(len(dataset[lang_code]) for lang_code in collections.keys())
print("=" * 50)
print(f"Total: {len(collections)} collections, {total_questions:,} questions")

Collection Summary:
==================================================
Afrikaans    (af): 470 questions - Collection ID: 463
Arabic       (ar): 470 questions - Collection ID: 464
Bengali      (bn): 470 questions - Collection ID: 465
Czech        (cs): 470 questions - Collection ID: 466
German       (de): 470 questions - Collection ID: 467
English      (en): 470 questions - Collection ID: 468
Spanish      (es): 470 questions - Collection ID: 469
French       (fr): 470 questions - Collection ID: 470
Hindi        (hi): 470 questions - Collection ID: 471
Hungarian    (hu): 470 questions - Collection ID: 472
Indonesian   (id): 470 questions - Collection ID: 473
Italian      (it): 470 questions - Collection ID: 474
Japanese     (ja): 470 questions - Collection ID: 475
Korean       (ko): 470 questions - Collection ID: 476
Marathi      (mr): 470 questions - Collection ID: 477
Nepali       (ne): 470 questions - Collection ID: 478
Portuguese   (pt): 470 questions - Collection ID: 479
Russian      (ru): 470 questions - Collection ID: 480
Serbian 