# MMLU and MMMLU EDA

- Author: Jay Chiehen Liao
- This notebook does the following:
    1. Downloads MMLU (English) and the French version of MMMLU.
    2. Checks that the two datasets are consistent with each other.
    3. Selects one subcategory for each category (there are 17 categories in total, so 17 subcategories are selected).
    4. Keeps the first 100 samples of each of the 17 subcategories in both languages, and collects the resulting 17 * 2 = 34 datasets into a dict.
    5. Saves the final dict to a pickle file.
- To skip the checks and just re-collect the datasets, run `save_datasets.py`.

In [9]:
from datasets import load_dataset
import pandas as pd
import numpy as np
from categories import categories, subcategories

## Download English and French datasets

In [61]:
ds_en = load_dataset("cais/mmlu", "all", split="test")
ds_en

Dataset({
    features: ['question', 'subject', 'choices', 'answer'],
    num_rows: 14042
})

In [62]:
ds_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")
ds_fr

Dataset({
    features: ['Unnamed: 0', 'Question', 'A', 'B', 'C', 'D', 'Answer', 'Subject'],
    num_rows: 14042
})

## Confirm that the data match in the two languages
### Check the first 5 questions

In [70]:
for i, x in enumerate(zip(ds_en, ds_fr)):
    en, fr = x
    print(i)
    print("[EN]", en["question"])
    print("[FR]", fr["Question"])
    if i == 4:
        break

0
[EN] Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
[FR] Déterminez le degré d'extension de champ donnée Q(sqrt(2), sqrt(3), sqrt(18)) sur Q.
1
[EN] Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.
[FR] Étant donné p = (1, 2, 5, 4)(2, 3) dans S_5. Déterminez l’indice de <p> dans S_5.
2
[EN] Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5
[FR] Déterminez tous les zéros dans le champ fini indiqué du polynôme donné ayant des coefficients dans ce champ. x^5 + 3x^3 + x^2 + 2x dans Z_5
3
[EN] Statement 1 | A factor group of a non-Abelian group is non-Abelian. Statement 2 | If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G.
[FR] Énoncé 1 | Un groupe de facteurs d’un groupe non-abélien est non-abélien. Énoncé 2 | Si K est un sous-groupe normal de H et H est un sous-groupe normal de G, alors K est un sous-groupe

Looks good.
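
Eyeballing five rows only covers the start of the dataset. A cheap check over every row is to compare the answer keys. The sketch below assumes that the English `answer` field is a 0-based index and the French `Answer` field is a letter from A to D. The helper name `count_answer_mismatches` and the toy lists are hypothetical; in the notebook the inputs would be `ds_en["answer"]` and `ds_fr["Answer"]`.

```python
# Hypothetical helper, not part of the notebook.
# Assumption: English answers are 0-based indices, French answers are letters A-D.
LETTERS = "ABCD"


def count_answer_mismatches(en_answers, fr_answers):
    """Return the number of positions where the two answer keys differ."""
    if len(en_answers) != len(fr_answers):
        raise ValueError("The two answer lists have different lengths.")
    return sum(LETTERS[en] != fr for en, fr in zip(en_answers, fr_answers))


# Toy data standing in for ds_en["answer"] and ds_fr["Answer"].
toy_en = [1, 2, 3, 0]
toy_fr = ["B", "C", "D", "A"]
print(count_answer_mismatches(toy_en, toy_fr))  # prints 0
```

A result of 0 on the real columns would suggest that the rows are aligned one to one, although it does not prove that the translations themselves are correct.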

### Check the subject counts

In [75]:
subcategory_counts_en = pd.Series(ds_en["subject"]).value_counts()
subcategory_counts_fr = pd.Series(ds_fr["Subject"]).value_counts()

In [77]:
sum(subcategory_counts_en == subcategory_counts_fr)

57

In [79]:
len(subcategories)

57
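
The comparison above relies on both `value_counts()` results listing the subjects in the same order. A minimal order-independent sketch uses `collections.Counter`. The function name and the toy lists are hypothetical; on the real data the inputs would be `ds_en["subject"]` and `ds_fr["Subject"]`.

```python
from collections import Counter


def subjects_with_different_counts(en_subjects, fr_subjects):
    """Return the sorted subjects whose row counts differ between the two lists."""
    en_counts = Counter(en_subjects)
    fr_counts = Counter(fr_subjects)
    # A Counter returns 0 for a missing key, so a subject that is absent
    # from one language is reported as well.
    all_subjects = en_counts.keys() | fr_counts.keys()
    return sorted(s for s in all_subjects if en_counts[s] != fr_counts[s])


# Toy data: same subjects and counts, different row order.
toy_en = ["anatomy", "anatomy", "astronomy"]
toy_fr = ["astronomy", "anatomy", "anatomy"]
print(subjects_with_different_counts(toy_en, toy_fr))  # prints []
```

An empty list means that every subject has the same number of rows in both languages.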

#### Check whether any subcategory contains fewer than 100 questions

In [89]:
(subcategory_counts_en < 100).any()

np.False_

Looks good.

## Handle different data formats of two languages

### Check whether any sample has fewer or more than 4 choices

#### English


In [98]:
passed = True
for i, choices in enumerate(ds_en["choices"]):
    if len(choices) != 4:
        # Report the first offending question by its row index.
        print(f"Question {ds_en[i]['question']} has {len(choices)} choices.")
        passed = False
        break
if passed:
    print("All questions have 4 choices.")

All questions have 4 choices.


#### French

In [87]:
ds_fr.to_pandas()[["A", "B", "C", "D"]].notnull().all()

A    True
B    True
C    True
D    True
dtype: bool
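
The two datasets store the same information under different keys: the English rows have `question`, `subject`, `choices`, and `answer`, while the French rows have `Question`, `Subject`, `A` to `D`, and `Answer`. One way to bring a French row into the English layout is sketched below. It is not necessarily how `save_datasets.py` does it. The function name `normalize_fr_row` and the toy row are hypothetical, and the sketch assumes that `Answer` is a letter from A to D.

```python
# Hypothetical helper, not part of the notebook.
LETTERS = "ABCD"


def normalize_fr_row(row):
    """Map one MMMLU-style row to the keys used by the English split."""
    return {
        "question": row["Question"],
        "subject": row["Subject"],
        # Gather the four option columns into one list, in A-D order.
        "choices": [row[letter] for letter in LETTERS],
        # Assumption: the answer is a letter; convert it to a 0-based index.
        "answer": LETTERS.index(row["Answer"]),
    }


# Toy row with the same columns as ds_fr.
toy_row = {
    "Unnamed: 0": 0,
    "Question": "Combien font 2 + 2 ?",
    "A": "3", "B": "4", "C": "5", "D": "6",
    "Answer": "B",
    "Subject": "elementary_mathematics",
}
print(normalize_fr_row(toy_row))
```

On the real data, such a function could be applied with `ds_fr.map(normalize_fr_row, remove_columns=ds_fr.column_names)`, which would leave only the four English-style columns.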

# Select the first 100 questions for 17 subcategories

In [102]:
categories.values()

dict_values([['physics', 'chemistry', 'biology', 'computer science', 'math', 'engineering'], ['history', 'philosophy', 'law'], ['politics', 'culture', 'economics', 'geography', 'psychology'], ['other', 'business', 'health']])

In [116]:
selected_categories = []
for v in categories.values():
    selected_categories += v

selected_subcategories = {}
for k, v in subcategories.items():
    category = v[0]
    if category in selected_categories:
        selected_subcategories[category] = k
        selected_categories.remove(category)
    if len(selected_categories) == 0:
        break

In [117]:
selected_subcategories

{'math': 'abstract_algebra',
 'health': 'anatomy',
 'physics': 'astronomy',
 'business': 'business_ethics',
 'biology': 'college_biology',
 'chemistry': 'college_chemistry',
 'computer science': 'college_computer_science',
 'economics': 'econometrics',
 'engineering': 'electrical_engineering',
 'philosophy': 'formal_logic',
 'other': 'global_facts',
 'history': 'high_school_european_history',
 'geography': 'high_school_geography',
 'politics': 'high_school_government_and_politics',
 'psychology': 'high_school_psychology',
 'culture': 'human_sexuality',
 'law': 'international_law'}
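
The selection rule is "keep the first subcategory encountered for each category", so the result depends on the iteration order of `subcategories`. The toy sketch below illustrates the rule on a three-entry mapping. Unlike the cell above, it does not restrict the result to the categories listed in `categories`. The function name and the toy mapping are hypothetical stand-ins for the contents of `categories.py`.

```python
def first_subcategory_per_category(subcategories):
    """Return {category: first subcategory whose first listed category is that one}."""
    selected = {}
    for subcategory, cats in subcategories.items():
        # setdefault only stores a value the first time a category is seen.
        selected.setdefault(cats[0], subcategory)
    return selected


# Toy mapping in the same {subcategory: [category]} shape.
toy_subcategories = {
    "abstract_algebra": ["math"],
    "anatomy": ["health"],
    "college_mathematics": ["math"],
}
print(first_subcategory_per_category(toy_subcategories))
# prints {'math': 'abstract_algebra', 'health': 'anatomy'}
```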

In [122]:
ds_selected = {}
print()
for i, subcategory in enumerate(selected_subcategories.values()):
    # Keep the first 100 rows of this subject in each language.
    ds_selected[subcategory] = {
        "en": ds_en.filter(lambda x: x["subject"] == subcategory).select(range(100)),
        "fr": ds_fr.filter(lambda x: x["Subject"] == subcategory).select(range(100)),
    }
    print(f"Subcategory {i:2d} {subcategory:35s}", ds_selected[subcategory]["en"].shape, ds_selected[subcategory]["fr"].shape)



Subcategory  0 abstract_algebra                    (100, 4) (100, 8)
Subcategory  1 anatomy                             (100, 4) (100, 8)
Subcategory  2 astronomy                           (100, 4) (100, 8)
Subcategory  3 business_ethics                     (100, 4) (100, 8)
Subcategory  4 college_biology                     (100, 4) (100, 8)
Subcategory  5 college_chemistry                   (100, 4) (100, 8)
Subcategory  6 college_computer_science            (100, 4) (100, 8)
Subcategory  7 econometrics                        (100, 4) (100, 8)
Subcategory  8 electrical_engineering              (100, 4) (100, 8)
Subcategory  9 formal_logic                        (100, 4) (100, 8)
Subcategory 10 global_facts                        (100, 4) (100, 8)
Subcategory 11 high_school_european_history        (100, 4) (100, 8)
Subcategory 12 high_school_geography               (100, 4) (100, 8)
Subcategory 13 high_school_government_and_politics (100, 4) (100, 8)
Subcategory 14 high_school_psycho

In [124]:
import os
import pickle

# Create the output directory if it does not exist yet.
os.makedirs("data", exist_ok=True)
with open("data/ds_selected.pkl", "wb") as f:
    pickle.dump(ds_selected, f, protocol=pickle.HIGHEST_PROTOCOL)

# Check `ds_selected.pkl` after PRs #2, #3, and #4

In [1]:
import pickle as pkl

In [3]:
with open("./ds_selected.pkl", "rb") as f:
    ds_selected = pkl.load(f)

In [4]:
len(ds_selected)

17

#### Check value type of each key (subcategory)

In [14]:
for subtask, v in ds_selected.items():
    print(f"{subtask:35s}", type(v))

abstract_algebra                    <class 'dict'>
anatomy                             <class 'dict'>
astronomy                           <class 'dict'>
business_ethics                     <class 'dict'>
college_biology                     <class 'dict'>
college_chemistry                   <class 'dict'>
college_computer_science            <class 'dict'>
econometrics                        <class 'dict'>
electrical_engineering              <class 'dict'>
formal_logic                        <class 'dict'>
global_facts                        <class 'dict'>
high_school_european_history        <class 'dict'>
high_school_geography               <class 'dict'>
high_school_government_and_politics <class 'dict'>
high_school_psychology              <class 'dict'>
human_sexuality                     <class 'dict'>
international_law                   <class 'dict'>


#### Check if there are 100 questions in both languages for all subtasks

In [21]:
for subtask, v in ds_selected.items():
    print(f"{subtask:35s}", len(v["en"]), len(v["fr"]))
    assert len(v["en"]) == len(v["fr"]) == 100, f"Different number of questions in {subtask}"

abstract_algebra                    100 100
anatomy                             100 100
astronomy                           100 100
business_ethics                     100 100
college_biology                     100 100
college_chemistry                   100 100
college_computer_science            100 100
econometrics                        100 100
electrical_engineering              100 100
formal_logic                        100 100
global_facts                        100 100
high_school_european_history        100 100
high_school_geography               100 100
high_school_government_and_politics 100 100
high_school_psychology              100 100
human_sexuality                     100 100
international_law                   100 100
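
The checks in this section can also be bundled into one function that fails loudly, which is convenient to rerun after each change to the pickle. This is a minimal sketch: `validate_selected` is a hypothetical name, and the toy dict uses plain lists in place of the real datasets, since only `len()` is needed.

```python
def validate_selected(ds_selected, expected_subtasks=17, expected_rows=100):
    """Raise AssertionError if the nested dict does not have the expected shape."""
    assert len(ds_selected) == expected_subtasks, (
        f"Expected {expected_subtasks} subtasks, got {len(ds_selected)}"
    )
    for subtask, by_lang in ds_selected.items():
        # Each subtask should hold exactly the two language splits.
        assert set(by_lang) == {"en", "fr"}, f"Unexpected language keys in {subtask}"
        for lang, rows in by_lang.items():
            assert len(rows) == expected_rows, f"{subtask}/{lang} has {len(rows)} rows"
    return True


# Toy stand-in: 17 subtasks, each with 100 placeholder rows per language.
toy = {f"subtask_{i}": {"en": [None] * 100, "fr": [None] * 100} for i in range(17)}
print(validate_selected(toy))  # prints True
```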


#### Take the first 5 questions of one subtask to check that the English versions match the French ones

In [23]:
SUBTASK = "electrical_engineering"
for i, x in enumerate(zip(ds_selected[SUBTASK]["en"], ds_selected[SUBTASK]["fr"])):
    en, fr = x
    print(i)
    print("[EN]", en["question"])
    print("[FR]", fr["question"])
    if i == 4:
        break

0
[EN] The Barkhausen criterion for an oscillator
[FR] Le critère de Barkhausen pour un oscillateur
1
[EN] Potentiometer method of DC voltage measurement is more accurate than direct measurement using a voltmeter because
[FR] La méthode du potentiomètre pour mesurer la tension continue est plus précise que la mesure directe à l'aide d'un voltmètre, parce que
2
[EN] Which of these sets of logic gates are designated as universal gates?
[FR] Lesquels de ces ensembles de portes logiques sont désignés comme portes universelles ?
3
[EN] A single phase one pulse controlled circuit has a resistance R and counter emf E load 400 sin(314 t) as the source voltage. For a load counter emf of 200 V, the range of firing angle control is
[FR] Un circuit monophasé contrôlé par une impulsion possède une résistance R et une charge contre-emf E de 400 sin (314 t) comme tension source. Pour un compteur de charge emf de 200 V, la plage de contrôle de l'angle d'allumage est
4
[EN] A box which tells the effect

Looks fine.