# Dataset Overview

In [1]:
import pandas as pd

In [4]:
data = pd.read_parquet('data/agent_compt.parquet')

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175112 entries, 0 to 175111
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   dataset       175112 non-null  object
 1   category      175112 non-null  object
 2   input         175112 non-null  object
 3   agent         153220 non-null  object
 4   response      175090 non-null  object
 5   ground_truth  175112 non-null  object
 6   pred          175112 non-null  object
 7   score         175112 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 10.7+ MB


In [6]:
data.head()

Unnamed: 0,dataset,category,input,agent,response,ground_truth,pred,score
0,mmlu_pro,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","You are a logician, proficient in identifying ...",The correct answer is **(I) Red Herring**. Her...,C,I,0
1,mmlu_pro,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...",,The correct answer is **(C) False sign**.\n\nH...,C,C,1
2,mmlu_pro,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","You are a classicist, perhaps with a focus on ...","The Latin phrase ""_Ad crumenam_"" translates to...",C,C,1
3,mmlu_pro,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","You are a classical scholar, specializing in a...",The correct answer is **(C) False sign**. Here...,C,C,1
4,mmlu_pro,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","You are a Latin expert, familiar with the nuan...",The correct answer is **(C) False sign**. Here...,C,C,1


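Each row pairs a prompt (`input`) with an agent description (`agent`) and the model's `response`, together with the expected answer (`ground_truth`), the extracted answer (`pred`) and a `score` indicating whether the answer is correct. According to `data.info()`, `agent` and `response` are the only columns with missing values.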
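The `data.info()` output above implies 175112 - 153220 = 21892 rows without an `agent` and 175112 - 175090 = 22 rows without a `response`. A minimal sketch of how such counts can be read off directly, using a small toy frame (the rows below are invented for illustration, only the column names come from the dataset):

```python
import pandas as pd

# Toy stand-in for the real data: hypothetical rows, real column names
toy = pd.DataFrame({
    "input": ["q1", "q1", "q2", "q2"],
    "agent": ["You are a logician", None, "You are a chemist", None],
    "response": ["The answer is (A)", "The answer is (B)", None, "The answer is (C)"],
    "score": [1, 0, 0, 1],
})

# Missing values per column, keeping only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real frame, the same two lines applied to `data` should report only `agent` and `response`.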

## Datasets

In [8]:
data['dataset'].unique()

array(['mmlu_pro', 'bigbench', 'truthfulqa'], dtype=object)

- Measuring Massive Multitask Language Understanding Pro (MMLU-Pro)
    - Benchmark for evaluating LLMs on multitask language understanding
    - A more challenging benchmark than MMLU, with reasoning-focused questions and more answer choices per question (up to 10 instead of 4)
    - About 12,000 questions from academic exams and textbooks spanning domains including natural sciences, business, CS, engineering, economics, law, math...
- Beyond Imitation Game Benchmark (BIG-bench)
    - More than 200 different LLM tasks in one benchmark
        - Traditional NLP tasks: word disambiguation, contextual QA, reading comprehension
        - Logic, math and code: algorithms, logical reasoning, algebra
        - Understanding the world: causal reasoning, consistent identity
        - And much, much more (https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/keywords_to_tasks.md#summary-table)
    - Multilingual
    - Maintained by Google
- TruthfulQA
    - "Assess the accuracy of LLMs in answering questions truthfully"
    - 817 questions from 38 topics (health, law, finance, politics...)
    - Questions target common misconceptions that humans would likely answer incorrectly
    - Tasks:
        - Generation task: Given a question, answer in 1-2 sentences
        - Multiple choice (MC): either single-answer or true multi-answer
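Since the three benchmarks differ considerably in size, it may help to look at row counts and mean score per benchmark before comparing anything else. A minimal sketch on invented rows (the scores below are not real results):

```python
import pandas as pd

# Hypothetical rows; only the dataset names are taken from the real data
toy = pd.DataFrame({
    "dataset": ["mmlu_pro", "mmlu_pro", "bigbench", "bigbench", "truthfulqa", "truthfulqa"],
    "score": [1, 0, 1, 1, 0, 0],
})

# Number of rows and share of correct answers per benchmark
per_dataset = toy.groupby("dataset")["score"].agg(rows="count", accuracy="mean")
print(per_dataset)
```

Because `score` only takes the values 0 and 1, its mean is the accuracy.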

## Categories

The `category` column appears to name the task the model is asked to perform.

In [11]:
len(data['category'].unique())

72
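To see how the 72 categories are spread across the three benchmarks, one could count distinct categories per dataset. A small sketch, assuming toy rows: the `philosophy` and `discourse_marker_prediction` pairings appear in the data shown in this notebook, while `category_x` is a made-up placeholder.

```python
import pandas as pd

toy = pd.DataFrame({
    "dataset": ["mmlu_pro", "mmlu_pro", "mmlu_pro", "bigbench"],
    # "category_x" is a hypothetical placeholder, not a real category
    "category": ["philosophy", "philosophy", "category_x", "discourse_marker_prediction"],
})

# Number of distinct categories within each benchmark
categories_per_dataset = toy.groupby("dataset")["category"].nunique()
print(categories_per_dataset)
```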

In [10]:
sorted(data['category'].unique())

['abstract_narrative_understanding',
 'anachronisms',
 'analogical_similarity',
 'arithmetic',
 'biology',
 'business',
 'causal_judgment',
 'cause_and_effect',
 'checkmate_in_one',
 'chemistry',
 'cifar10_classification',
 'color',
 'computer science',
 'contextual_parametric_knowledge_conflicts',
 'cs_algorithms',
 'discourse_marker_prediction',
 'economics',
 'elementary_math_qa',
 'engineering',
 'epistemic_reasoning',
 'fact_checker',
 'fantasy_reasoning',
 'formal_fallacies_syllogisms_negation',
 'goal_step_wikihow',
 'health',
 'hhh_alignment',
 'hindu_knowledge',
 'hinglish_toxicity',
 'history',
 'implicatures',
 'indic_cause_and_effect',
 'intent_recognition',
 'kanji_ascii',
 'kannada',
 'language_identification',
 'law',
 'logic_grid_puzzle',
 'logical_deduction',
 'logical_fallacy_detection',
 'math',
 'medical_questions_russian',
 'metaphor_boolean',
 'metaphor_understanding',
 'misconceptions',
 'mnist_ascii',
 'moral_permissibility',
 'movie_dialog_same_or_different',
 

## Input

The `input` column holds the full prompt passed to the model. It contains everything the model sees: the task to solve, the answer options for multiple-choice (MC) tasks, and the trailing chat-template markers (`<end_of_turn>`, `<start_of_turn>model`).

In [24]:
data['input'][0]

'" _Ad crumenam_ " is a specific kind of\n(A) Slippery Slope\n(B) Hasty Conclusion\n(C) False sign\n(D) False analogy\n(E) Bandwagon Fallacy\n(F) Circular Reasoning\n(G) Hasty Generalization\n(H) Straw Man Fallacy\n(I) Red Herring<end_of_turn>\n<start_of_turn>model\n'
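For analysis it can be convenient to separate the question from the answer options and to drop the chat-template markers. The sketch below is one possible way to do this for prompts that use lettered options such as `(A) ...`. The helper name `split_prompt` is hypothetical, the option list is shortened, and BIG-bench prompts that use `choice:` lines would need a different pattern.

```python
import re

# Shortened version of the example prompt above (only four of the options kept)
raw = (
    '" _Ad crumenam_ " is a specific kind of\n'
    "(A) Slippery Slope\n(B) Hasty Conclusion\n(C) False sign\n"
    "(I) Red Herring<end_of_turn>\n<start_of_turn>model\n"
)

def split_prompt(text):
    """Hypothetical helper: return (question, {letter: option text})."""
    # Drop everything from the first turn marker onwards
    body = text.split("<end_of_turn>")[0]
    # Options are lines of the form "(X) option text"
    options = dict(re.findall(r"^\(([A-Z])\) (.+)$", body, flags=re.MULTILINE))
    # The question is whatever precedes the first option line
    question = re.split(r"^\([A-Z]\) ", body, maxsplit=1, flags=re.MULTILINE)[0].strip()
    return question, options

question, options = split_prompt(raw)
print(question)
print(options)
```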

In [27]:
data['input'].value_counts().unique()

array([16,  8])

Each prompt appears either 8 or 16 times, i.e. it has 8 or 16 answers.

In [23]:
f"On average, there are {len(data) / len(data['input'].unique())} answers per prompt"

'On average, there are 8.058536585365854 answers per prompt'
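If every prompt has either 8 or 16 answers, an average of about 8.06 means that only a small share of prompts has 16: solving 8 + 8p = 8.0585 gives p ≈ 0.7%. The distribution can also be counted directly. A minimal sketch on toy data (three invented prompts):

```python
import pandas as pd

# Hypothetical prompts: two with 8 answers, one with 16
toy = pd.DataFrame({"input": ["q1"] * 8 + ["q2"] * 8 + ["q3"] * 16})

# Answers per prompt, then how many prompts have each count
answers_per_prompt = toy["input"].value_counts()
distribution = answers_per_prompt.value_counts().sort_index()
mean_answers = len(toy) / toy["input"].nunique()

print(distribution)
print(mean_answers)
```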

## Agent (also called persona)

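Each answer to a given prompt is generated by having the model act as a particular persona. In the example below, one row has no persona (`None`), presumably a baseline answer without a persona.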

In [35]:
data[data['input'] == data['input'][0]]['agent']

0    You are a logician, proficient in identifying ...
1                                                 None
2    You are a classicist, perhaps with a focus on ...
3    You are a classical scholar, specializing in a...
4    You are a Latin expert, familiar with the nuan...
5    You are a student of logic and rhetoric, well-...
6    You are a klassically trained historian specia...
7    You are a scholar specializing in ancient Roma...
Name: agent, dtype: object

The dataset does not label personas as advantageous, disadvantageous or neutral. One might nevertheless assume that some personas suit a task better than others. In the example above, all personas appear relevant to the question.
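A first rough check of whether personas help at all is to compare accuracy for rows with and without a persona. The sketch below uses invented rows and scores, and it treats a missing `agent` as "no persona", which is an assumption about how the data was built:

```python
import pandas as pd

# Hypothetical rows; scores are made up for illustration
toy = pd.DataFrame({
    "agent": ["You are a logician", None, "You are a classicist", None],
    "score": [0, 1, 1, 1],
})

# Label each row by whether a persona was given, then average the score
group = toy["agent"].notna().map({True: "persona", False: "no persona"})
accuracy = toy.groupby(group)["score"].mean()
print(accuracy)
```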

## Response, Ground Truth, Pred and Score

The response is the output of the LLM in prose:

In [40]:
data['response'][0]

'The correct answer is **(I) Red Herring**. Here\'s why:\n\n* **_Ad crumenam_** is a Latin term that translates to "to the purse".  It\'s a fallacy where someone tries to deflect from the real issue by introducing an irrelevant appeal to wealth or financial gain.\n\nLet\'s break down why the other options are incorrect:\n\n* **(A) Slippery Slope:**  This fallacy assumes a chain reaction of negative events will occur when something initially happens, without sufficient evidence.\n* **(B) Hasty Conclusion:** Drawing a conclusion based on insufficient or biased evidence.\n* **(C) False Sign:** Mistaking a correlation for causation.\n* **(D) False Analogy:**  Comparing two things that are not truly alike to make a flawed argument.\n* **(E) Bandwagon Fallacy:** Arguing that something must be true because many people believe it.\n* **(F) Circular Reasoning:**  Restating the conclusion as part of the premise.\n* **(G) Hasty Generalization:** Making a broad generalization based on a small or u

In contrast, `ground_truth` and `pred` contain only the answer letter:

In [41]:
data['ground_truth'].unique()

array(['C', 'A', 'D', 'B', 'F', 'G', 'I', 'J', 'E', 'H', 'K', 'P', 'W',
       'Y', 'N', 'U', 'O', 'L', 'M', 'Q', 'S', 'T'], dtype=object)

In [42]:
data['pred'].unique()

array(['I', 'C', 'A', 'D', 'J', 'B', '-1', 'F', 'H', 'E', 'G', 'K', 'P',
       'W', 'Z', 'Y', 'N', 'Q', 'L', 'U', 'O', 'M', 'T', 'S'],
      dtype=object)
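Comparing the two arrays shows that `pred` contains values that never occur in `ground_truth`. The sketch below copies the values from the outputs above and computes the difference. That `'-1'` marks responses from which no answer letter could be extracted is an assumption, not something the dataset states.

```python
# Values copied from the two outputs above
ground_truth_values = {
    'C', 'A', 'D', 'B', 'F', 'G', 'I', 'J', 'E', 'H', 'K', 'P', 'W',
    'Y', 'N', 'U', 'O', 'L', 'M', 'Q', 'S', 'T',
}
pred_values = {
    'I', 'C', 'A', 'D', 'J', 'B', '-1', 'F', 'H', 'E', 'G', 'K', 'P',
    'W', 'Z', 'Y', 'N', 'Q', 'L', 'U', 'O', 'M', 'T', 'S',
}

# Predicted values that can never be scored as correct
only_in_pred = sorted(pred_values - ground_truth_values)
print(only_in_pred)  # ['-1', 'Z']
```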

`score` is a binary flag stored as `int64`: 1 if the answer is correct, 0 otherwise.

In [36]:
data['score'].unique()

array([0, 1])
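In the rows shown so far, `score` is 1 exactly when `pred` equals `ground_truth`. Whether this holds for the whole dataset could be verified with a check like the following sketch, here run on a few rows copied from the table outputs in this notebook:

```python
import pandas as pd

# (ground_truth, pred, score) triples as they appear in the displayed rows
toy = pd.DataFrame({
    "ground_truth": ["C", "C", "C", "I"],
    "pred": ["I", "C", "C", "E"],
    "score": [0, 1, 1, 0],
})

# Score we would expect if it is simply an exact-match flag
expected = (toy["pred"] == toy["ground_truth"]).astype(int)
mismatches = toy[toy["score"] != expected]
print(len(mismatches))
```

An empty `mismatches` frame on the full data would confirm the exact-match interpretation.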

In [43]:
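# Keep only prompts for which exactly one answer was scored as correct
# (score is 0 or 1, so the per-prompt sum is the number of correct answers)
filtered_data = data[data.groupby('input')['score'].transform('sum') == 1]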

In [49]:
len(filtered_data)

13984
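The filter keeps all rows of a prompt, not just the correct one, so 13984 is a number of rows. With at least 8 answers per prompt, this corresponds to at most 13984 / 8 = 1748 prompts. A minimal sketch of the same filter on toy data, showing which prompts survive:

```python
import pandas as pd

# Hypothetical prompts: q1 has one correct answer, q2 has two, q3 has none
toy = pd.DataFrame({
    "input": ["q1"] * 3 + ["q2"] * 3 + ["q3"] * 3,
    "score": [0, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Number of correct answers per prompt, broadcast back to every row
correct_per_prompt = toy.groupby("input")["score"].transform("sum")
filtered = toy[correct_per_prompt == 1]

print(len(filtered))                 # rows kept
print(filtered["input"].nunique())   # prompts kept
```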

In [50]:
filtered_data['input'][224]

'"The hubZOffset can be used to``sink""the hub into the wheel, and it defaults to half the wheel thickness."\n  choice: Regardless, when the hubHeight is 10 and the hubZOffset is -2, then the hub will protrude 8mm from the wheel, but the shaft hole will be 10mm deep.\n  choice: For example when the hubHeight is 10 and the hubZOffset is -2, then the hub will protrude 8mm from the wheel, but the shaft hole will be 10mm deep.\n  choice: Instead, when the hubHeight is 10 and the hubZOffset is -2, then the hub will protrude 8mm from the wheel, but the shaft hole will be 10mm deep.\n  choice: Conversely when the hubHeight is 10 and the hubZOffset is -2, then the hub will protrude 8mm from the wheel, but the shaft hole will be 10mm deep.\n  choice: Meantime, when the hubHeight is 10 and the hubZOffset is -2, then the hub will protrude 8mm from the wheel, but the shaft hole will be 10mm deep.\n  choice: However when the hubHeight is 10 and the hubZOffset is -2, then the hub will protrude 8mm f

In [48]:
filtered_data[filtered_data['input'] == filtered_data['input'][224]]

Unnamed: 0,dataset,category,input,agent,response,ground_truth,pred,score
224,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",,The best choice to complete the sentence is **...,I,E,0
225,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",You are a technical expert familiar with mecha...,The best option here is **(I) For example when...,I,I,1
226,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",You are an expert in 3D modeling and understan...,"The best choice here is either **(E) Instead, ...",I,E,0
227,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",You are a person with a deep understanding of ...,"The best choice here is **(E) Instead, when th...",I,E,0
228,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",You are a person knowledgeable about 3D modeli...,"The correct answer is **(E), **Instead, when t...",I,E,0
229,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",You are an engineer who understands 3D modelin...,"The answer is **(E) Instead, when the hubHeigh...",I,E,0
230,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...","You are a 3D modeling expert, familiar with t...","The best answer here is **(E) Instead, when th...",I,E,0
231,bigbench,discourse_marker_prediction,"""The hubZOffset can be used to``sink""""the hub ...",You are a technical writer specializing in 3D ...,The best choice from the options provided is *...,I,E,0
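For prompts like this one, where a single persona gets the answer right, it may be useful to extract that persona per prompt. A minimal sketch, assuming a frame that has already been filtered to prompts with exactly one correct answer (the rows and shortened agent texts below are illustrative):

```python
import pandas as pd

# Hypothetical filtered rows: exactly one correct answer per prompt
filtered = pd.DataFrame({
    "input": ["q1", "q1", "q1", "q2", "q2"],
    "agent": [None, "You are a technical expert", "You are an engineer",
              "You are a logician", None],
    "score": [0, 1, 0, 1, 0],
})

# The single correct row per prompt, indexed by prompt
winners = filtered[filtered["score"] == 1].set_index("input")["agent"]
print(winners)
```

If the only correct answer came from a row without a persona, the corresponding entry in `winners` would be missing (`None`).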
