# Data Preprocessing

In this notebook, the data is prepared for training. The main steps are:

- Remove duplicate rows.
- Remove inputs for which all personas perform equally.
- Build preference pairs for DPO training and split them into a training and a test set.

In [1]:
import pandas as pd

In [3]:
data = pd.read_parquet('data/agent_compt.parquet')

## Check inputs with 16 personas

We initially assumed that inputs with 16 personas always contain duplicates, but further analysis has shown that this is not true.

In [4]:
counts = data['input'].map(data['input'].value_counts())

In [5]:
filtered_df = data[counts == 16]
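
The mapping used above attaches to every row the number of rows that share its input, so the boolean mask selects whole groups of a given size. A minimal sketch on a made-up toy frame (the values and the names `toy`, `toy_counts` and `toy_selected` are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: input "q1" appears three times, "q2" once.
toy = pd.DataFrame({
    "input": ["q1", "q1", "q2", "q1"],
    "agent": ["a", "b", "c", "d"],
})

# value_counts() yields one count per distinct input;
# map() broadcasts that count back onto every row.
toy_counts = toy["input"].map(toy["input"].value_counts())

# Keep only the rows whose input occurs exactly three times.
toy_selected = toy[toy_counts == 3]
print(toy_selected)
```

The same pattern with `== 16` selects all rows belonging to inputs that occur 16 times.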

In [6]:
filtered_df

Unnamed: 0,dataset,category,input,agent,response,ground_truth,pred,score
2824,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a psychiatrist or therapist specializi...,The correct answer is **(A) Acting out**. Here...,A,A,1
2825,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a trained medical professional special...,The most likely defense mechanism the patient ...,A,A,1
2826,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a psychologist specializing in adolesc...,The most likely defense mechanism the patient ...,A,A,1
2827,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a licensed psychologist or psychiatris...,The best answer here is **(A) Acting out**. He...,A,A,1
2828,mmlu_pro,health,A 14-year-old girl is brought to the physician...,"You are a skilled mental health professional, ...",The answer is **(A) Acting out**. Here's why:\...,A,A,1
...,...,...,...,...,...,...,...,...
168787,mmlu_pro,law,Who said that “Jurisprudence is the eye of law...,"You are a legal scholar, perhaps a professor s...","The answer is **(C) Savigny**.\n\nThis maxim, ...",H,C,0
168788,mmlu_pro,law,Who said that “Jurisprudence is the eye of law...,"You are Salmond, a renowned Scottish legal sch...",The answer is **(A) Pound**. \n\n**Roscoe Pou...,H,A,0
168789,mmlu_pro,law,Who said that “Jurisprudence is the eye of law...,You are a legal scholar specializing in the hi...,The correct answer is **(G) Blackstone**. \n\n...,H,G,0
168790,mmlu_pro,law,Who said that “Jurisprudence is the eye of law...,You are a legal scholar with a deep understand...,The answer is **(A) Pound**. \n\nHere's why:\n...,H,A,0


Even though some agents look like duplicates, they are only very similar and not identical. Inputs with 16 agents therefore appear to be a legitimate part of the dataset.
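
One possible way to tell exact duplicates from near-duplicates is `DataFrame.duplicated` on the `(input, agent)` pair. The following is only a sketch on hypothetical toy rows, not the analysis that was actually run:

```python
import pandas as pd

# Hypothetical agent strings for a single input.
toy = pd.DataFrame({
    "input": ["q1", "q1", "q1"],
    "agent": [
        "You are a psychologist.",
        "You are a psychologist.",           # exact duplicate of the first row
        "You are a licensed psychologist.",  # similar, but a different string
    ],
})

# keep=False marks every member of a duplicated (input, agent) pair.
exact_dupes = toy.duplicated(subset=["input", "agent"], keep=False)
print(exact_dupes.tolist())  # [True, True, False]
```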

## Remove double line breaks

Some rows contain agent strings that only differ by a double line break. We remove the double line breaks so the duplicates get identified correctly.

In [7]:
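# With regex=True, the pattern matches two consecutive newline characters
# and the replacement expands to a single newline character.
data['agent'] = data['agent'].str.replace('\\n\\n', '\\n', regex=True)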
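
The effect of this normalization can be illustrated on two hypothetical agent strings that differ only by a double line break. The sketch uses a literal (non-regex) replacement, which is assumed to behave the same as the notebook cell for plain newline characters:

```python
import pandas as pd

# Hypothetical agent strings differing only in the number of line breaks.
toy = pd.DataFrame({
    "agent": [
        "You are a lawyer.\n\nBe precise.",
        "You are a lawyer.\nBe precise.",
    ],
})

before = toy["agent"].nunique()  # two distinct strings

# Literal replacement of a double newline with a single newline.
toy["agent"] = toy["agent"].str.replace("\n\n", "\n", regex=False)

after = toy["agent"].nunique()  # one distinct string
print(before, after)
```

Note that a single pass shortens runs of three or more newlines but does not collapse them to one.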

## Swap 'None' with empty string

In [8]:
data = data.replace({None: ''})

## Get rid of chat markers

In [9]:
data['input'] = data['input'].apply(lambda x: x.replace('<end_of_turn>\n<start_of_turn>model\n', '').strip())

## Remove any duplicates

In [10]:
data_unique = data.drop_duplicates(subset=['input', 'agent'], keep='first')

## Check outcome

In [11]:
data_unique['input'].value_counts().unique()

array([15, 14, 13, 12, 11,  9,  8,  7,  6,  5,  4])

In [12]:
data_unique[data_unique['input'] == "A 14-year-old girl is brought to the physician after her mother learned that she began having sexual intercourse with various partners 1 month ago. She does not use condoms or other contraception. The mother is concerned about her behavior. The patient's parents separated 3 months ago. She had been an honor student and excelled in sports and leadership positions at school before the separation. Since the separation, however, she has become sullen, defiant, and rebellious. She has begun smoking cigarettes, disobeying her curfew, and being truant from school. This patient is most likely using which of the following defense mechanisms?\n(A) Acting out\n(B) Intellectualization\n(C) Projection\n(D) Regression\n(E) Displacement\n(F) Rationalization\n(G) Denial\n(H) Repression\n(I) Sublimation\n(J) Reaction formation"]

Unnamed: 0,dataset,category,input,agent,response,ground_truth,pred,score
2824,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a psychiatrist or therapist specializi...,The correct answer is **(A) Acting out**. Here...,A,A,1
2825,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a trained medical professional special...,The most likely defense mechanism the patient ...,A,A,1
2826,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a psychologist specializing in adolesc...,The most likely defense mechanism the patient ...,A,A,1
2827,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a licensed psychologist or psychiatris...,The best answer here is **(A) Acting out**. He...,A,A,1
2828,mmlu_pro,health,A 14-year-old girl is brought to the physician...,"You are a skilled mental health professional, ...",The answer is **(A) Acting out**. Here's why:\...,A,A,1
2829,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a qualified psychiatrist or psychologi...,The most likely defense mechanism at play here...,A,A,1
2830,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a licensed therapist specializing in a...,The answer is **(A) Acting out**. Here's why:\...,A,A,1
2831,mmlu_pro,health,A 14-year-old girl is brought to the physician...,"You are a licensed medical professional, speci...",The most likely defense mechanism at play here...,A,A,1
2832,mmlu_pro,health,A 14-year-old girl is brought to the physician...,You are a medical professional specializing in...,The most likely defense mechanism at play here...,A,A,1
2833,mmlu_pro,health,A 14-year-old girl is brought to the physician...,"You are a qualified medical professional, like...",The answer is **(A) Acting out**. Here's why:\...,A,A,1


There are still inputs with more than 8 different personas.

In [13]:
print(f'Deleted rows: {len(data.index) - len(data_unique.index)}')

Deleted rows: 1921


## Check for empty lines

In [14]:
data_unique.info()

<class 'pandas.core.frame.DataFrame'>
Index: 173191 entries, 0 to 175111
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   dataset       173191 non-null  object
 1   category      173191 non-null  object
 2   input         173191 non-null  object
 3   agent         173191 non-null  object
 4   response      173191 non-null  object
 5   ground_truth  173191 non-null  object
 6   pred          173191 non-null  object
 7   score         173191 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 11.9+ MB


An empty string in the agent column indicates that no persona was used. Such a row is not an empty line, so it is kept. All other columns contain the expected (full) number of entries.
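
Because `None` was replaced with an empty string earlier, `info()` cannot reveal how many rows have no persona. A small sketch of how such rows could be counted, on hypothetical toy rows and with `fillna` standing in for the notebook's `replace` call (an assumption made for illustration):

```python
import pandas as pd

# Hypothetical rows: one empty persona, one real persona, one missing value.
toy = pd.DataFrame({
    "input": ["q1", "q1", "q2"],
    "agent": ["", "You are a chemist.", None],
})

# Turn missing values into empty strings, then flag rows without a persona.
toy["agent"] = toy["agent"].fillna("")
no_persona = toy["agent"].eq("")
print(int(no_persona.sum()))  # 2
```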

## Clean unusable instances

For the training to work, we need a performance hierarchy between the agents. Therefore, all inputs for which either all or none of the agents complete the task are removed. These agents cannot be ranked, since they all perform equally well or equally badly.

In [15]:
# Group rows by input and keep a group only if it contains more than one distinct score (otherwise all agents performed the same)
data_filtered = data_unique[data_unique.groupby('input')['score'].transform(lambda x: x.nunique() > 1)]
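
The filter can be checked on a toy frame. The sketch below uses the built-in `"nunique"` aggregation instead of a lambda, which is assumed to be an equivalent formulation. All values are hypothetical:

```python
import pandas as pd

# Hypothetical scores: q1 is mixed, q2 is all correct, q3 is all wrong.
toy = pd.DataFrame({
    "input": ["q1", "q1", "q2", "q2", "q3", "q3"],
    "score": [1, 0, 1, 1, 0, 0],
})

# Number of distinct scores per input, broadcast back to every row.
mixed = toy.groupby("input")["score"].transform("nunique") > 1

# Only inputs with both winners and losers survive.
toy_filtered = toy[mixed]
print(toy_filtered)
```

Only the rows of `q1` remain, because it is the only input with both a winning and a losing agent.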

## Some information about the data

In [16]:
print(f'Deleted rows: {len(data.index) - len(data_filtered.index)}')

Deleted rows: 110461


In [17]:
print(f'Remaining rows: {len(data_filtered.index)}')

Remaining rows: 64651


In [18]:
print(f"Avg. agents per prompt: {len(data_filtered) / len(data_filtered['input'].unique())}")

Avg. agents per prompt: 7.981604938271605


In [19]:
print(f"Number of winning agents: {len(data_filtered[data_filtered['score'] == 1])}")

Number of winning agents: 33130


In [20]:
print(f"Number of losing agents: {len(data_filtered[data_filtered['score'] == 0])}")

Number of losing agents: 31521


## Setup training dataset to allow DPO training

In this section, the actual dataset for training, testing and evaluating the model will be constructed. This will be done according to the documentation here: https://huggingface.co/docs/trl/main/en/dataset_formats#preference

There are two possible formats, standard and conversational, each in an implicit or an explicit form. Please refer to the documentation linked above to see examples.

In the following code, the dataset is built in the conversational format (each entry is a list of role/content messages) with explicit prompts (which is recommended).

Training samples are tuples of $(x, y_w, y_l)$, where $y_w$ is a winning agent prompt string, $y_l$ is a losing agent prompt string and $x$ is the model input.

In [21]:
# Task string was updated in favor of the new option.
# It ends with a space and the f-string below adds another one,
# so the prompt contains two spaces before the task text.
model_task = 'Given the following task, print a suitable agent string that will answer it as good as possible: '

train_data = []
for input_val, group in data_filtered.groupby('input'):
    # Split the agents of this input into winners (score 1) and losers (score 0)
    winning_agents = group[group['score'] == 1]
    losing_agents = group[group['score'] == 0]

    # Build one preference pair for every (winner, loser) combination
    for _, winning in winning_agents.iterrows():
        for _, losing in losing_agents.iterrows():
            # Sanity check: both rows must belong to the same input
            if winning['input'] != losing['input']:
                raise ValueError("Illegal combination of winning and losing input")

            train_data.append({
                'category': winning['category'],
                'orig_prompt': input_val,
                'prompt': [{'role': 'user', 'content': f'{model_task} {input_val}'}], # or losing['input'] or winning['input']
                'chosen': [{'role': 'assistant', 'content': winning['agent']}],
                'rejected': [{'role': 'assistant', 'content': losing['agent']}]
            })

In [22]:
print(f'Training (and test/eval) samples: {len(train_data)}')

Training (and test/eval) samples: 86218
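
The number of samples follows from the pairing scheme: every input contributes (number of winning agents) × (number of losing agents) pairs. The sketch below verifies this on hypothetical toy data, assuming binary scores and using a merge on `input` as a stand-in for the nested loops:

```python
import pandas as pd

# Hypothetical data: q1 has 1 winner and 2 losers, q2 has 2 winners and 2 losers.
toy = pd.DataFrame({
    "input": ["q1"] * 3 + ["q2"] * 4,
    "agent": ["a", "b", "c", "d", "e", "f", "g"],
    "score": [1, 0, 0, 1, 1, 0, 0],
})

# Expected pair count: sum over inputs of wins * losses.
per_input = toy.groupby("input")["score"].agg(wins="sum", total="count")
per_input["losses"] = per_input["total"] - per_input["wins"]
expected_pairs = int((per_input["wins"] * per_input["losses"]).sum())

# Explicit pairing: merging winners with losers on the input
# yields every (winner, loser) combination within each input.
winners = toy[toy["score"] == 1]
losers = toy[toy["score"] == 0]
pairs = winners.merge(losers, on="input", suffixes=("_chosen", "_rejected"))

print(expected_pairs, len(pairs))
```

Applied to the filtered data, the same computation could serve as a cross-check of the sample count printed by the notebook.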


In [23]:
train_df = pd.DataFrame(train_data)
train_df.to_parquet('data/agent_prompt_cleaned.parquet', index=False)

In [24]:
train_df.head()

Unnamed: 0,category,orig_prompt,prompt,chosen,rejected
0,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","[{'role': 'user', 'content': 'Given the follow...","[{'role': 'assistant', 'content': ''}]","[{'role': 'assistant', 'content': 'You are a l..."
1,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","[{'role': 'user', 'content': 'Given the follow...","[{'role': 'assistant', 'content': ''}]","[{'role': 'assistant', 'content': 'You are a s..."
2,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","[{'role': 'user', 'content': 'Given the follow...","[{'role': 'assistant', 'content': 'You are a c...","[{'role': 'assistant', 'content': 'You are a l..."
3,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","[{'role': 'user', 'content': 'Given the follow...","[{'role': 'assistant', 'content': 'You are a c...","[{'role': 'assistant', 'content': 'You are a s..."
4,philosophy,""" _Ad crumenam_ "" is a specific kind of\n(A) S...","[{'role': 'user', 'content': 'Given the follow...","[{'role': 'assistant', 'content': 'You are a c...","[{'role': 'assistant', 'content': 'You are a l..."


## Prepare test set

Testing is performed on held-out categories that do not occur in the training set. The categories are chosen so that the test set contains more than 1000 distinct inputs.

In [25]:
train_df['category'].value_counts()

category
physics                      7422
chemistry                    6102
math                         6022
truthfulqa                   4481
engineering                  4248
                             ... 
swahili_english_proverbs       64
checkmate_in_one               56
anachronisms                   50
hindu_knowledge                39
medical_questions_russian      26
Name: count, Length: 72, dtype: int64

In [26]:
test_categories = ['biology', 'philosophy', 'psychology',
                   'social_iqa', 'arithmetic', 'elementary_math_qa', 'history', 'economics']
test_df = train_df[train_df['category'].isin(test_categories)]
train_df = train_df[~train_df['category'].isin(test_categories)]
len(test_df['category'])

13148

In [27]:
len(test_df.orig_prompt.unique())

1296
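
A category-based split like this one can be sanity-checked for leakage. The following sketch uses hypothetical toy rows and a hypothetical `held_out` list; it checks that neither categories nor prompts are shared between the two parts:

```python
import pandas as pd

# Hypothetical pairs with their category and original prompt.
toy = pd.DataFrame({
    "category": ["physics", "biology", "math", "biology"],
    "orig_prompt": ["p1", "p2", "p3", "p4"],
})

held_out = ["biology"]  # hypothetical test categories
is_test = toy["category"].isin(held_out)
toy_test, toy_train = toy[is_test], toy[~is_test]

# Neither categories nor prompts may appear on both sides of the split.
categories_disjoint = set(toy_test["category"]).isdisjoint(toy_train["category"])
prompts_disjoint = set(toy_test["orig_prompt"]).isdisjoint(toy_train["orig_prompt"])
print(categories_disjoint, prompts_disjoint)
```

The prompt check only matters if the same prompt can occur under more than one category, which is not known for this dataset.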

In [28]:
test_df = test_df.drop(columns=['category', 'orig_prompt'])
train_df = train_df.drop(columns=['category', 'orig_prompt'])

In [29]:
test_df.to_parquet('data/agent_test.parquet', index=False)
train_df.to_parquet('data/agent_train.parquet', index=False)