# Data Preprocessing

This notebook prepares the data for training. It contains the following steps:

- Remove duplicate rows.
- Remove inputs whose personas all perform equally.
- Build the fine-tuning, DPO training, and test datasets.

In [None]:
import pandas as pd
import re

In [None]:
data = pd.read_parquet('data/agent_compt.parquet')

## Check inputs with 16 personas

We initially assumed that inputs with 16 personas always contain duplicates, but further analysis has shown that this is not the case.

In [None]:
counts = data['input'].map(data['input'].value_counts())
filtered_df = data[counts == 16]
filtered_df

Some of these agents look like duplicates but are only very similar, not identical. Inputs with 16 agents are therefore a legitimate part of the dataset.
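
One way to make "very similar, yet not the same" measurable is a string similarity ratio. The following is a minimal sketch using `difflib` from the standard library; the helper `similarity` and the persona strings are hypothetical and not taken from the dataset.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical persona strings, not taken from the dataset
agent_a = "You are an experienced physician."
agent_b = "You are an experienced physician"  # differs only by the final period
agent_c = "You are a careful historian."

print(agent_a == agent_b)            # False: near-duplicates are still distinct rows
print(similarity(agent_a, agent_b))  # close to 1.0, but not equal to it
print(similarity(agent_a, agent_c))  # noticeably lower
```

An exact comparison such as `drop_duplicates` only removes the first kind of match (ratio exactly 1.0), which is why similar but distinct agents remain in the data.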

## Remove trailing line breaks

Some rows contain agent strings that differ only by a double line break or by trailing whitespace. We collapse double line breaks into single ones and strip trailing whitespace so that the duplicates are identified correctly.
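
The effect of this normalization can be sketched on a toy example. The agent strings below are hypothetical, and the sketch uses a literal (non-regex) replacement, which is assumed to behave the same as the regex variant for this simple pattern.

```python
import pandas as pd

# Hypothetical agent strings that differ only by a double line break
# and a trailing line break
agents = pd.Series([
    "You are a biologist.\nAnswer briefly.",
    "You are a biologist.\n\nAnswer briefly.\n",
])

print(agents.nunique())  # 2: the raw strings are treated as different

# Collapse double line breaks and strip trailing whitespace
normalized = agents.str.replace("\n\n", "\n", regex=False).str.rstrip()

print(normalized.nunique())  # 1: both rows are now identical
```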

In [None]:
data['agent'] = data['agent'].str.replace('\\n\\n', '\\n', regex=True)
data['agent'] = data['agent'].str.rstrip()

## Swap 'None' with empty string

In [None]:
data = data.replace({None: ''})

## Get rid of chat markers

In [None]:
data['input'] = data['input'].apply(lambda x: x.replace('<end_of_turn>\n<start_of_turn>model\n', '').strip())

## Remove any duplicates

In [None]:
def clean_agent(text):
    # Keep only the text before the first double line break
    text = text.split('\n\n')[0]

    # Remove Markdown text styles
    text = re.sub(r'(\*\*|\*|__|_)(.*?)\1', r'\2', text)

    # Remove multiple whitespaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Apply the function to the 'agent' column
data['agent'] = data['agent'].apply(clean_agent)
data_unique = data.drop_duplicates(subset=['input', 'agent'], keep='first')
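
To illustrate what `clean_agent` does, the following sketch reruns the function on a few hypothetical agent strings (the inputs are made up, not taken from the dataset). The last call shows a side effect worth being aware of: underscores inside identifiers are treated as Markdown emphasis markers and removed as well.

```python
import re


def clean_agent(text):
    # Keep only the text before the first double line break
    text = text.split('\n\n')[0]
    # Remove Markdown text styles (**bold**, *italic*, __bold__, _italic_)
    text = re.sub(r'(\*\*|\*|__|_)(.*?)\1', r'\2', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


print(clean_agent("**Expert** in _biology_\n\nIgnored explanation"))  # Expert in biology
print(clean_agent("You are  a\tdoctor.\n"))                           # You are a doctor.
print(clean_agent("Use snake_case_names"))                            # Use snakecasenames
```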

## Check outcome

In [None]:
data_unique['input'].value_counts().unique()

In [None]:
data_unique[data_unique['input'] == "A 14-year-old girl is brought to the physician after her mother learned that she began having sexual intercourse with various partners 1 month ago. She does not use condoms or other contraception. The mother is concerned about her behavior. The patient's parents separated 3 months ago. She had been an honor student and excelled in sports and leadership positions at school before the separation. Since the separation, however, she has become sullen, defiant, and rebellious. She has begun smoking cigarettes, disobeying her curfew, and being truant from school. This patient is most likely using which of the following defense mechanisms?\n(A) Acting out\n(B) Intellectualization\n(C) Projection\n(D) Regression\n(E) Displacement\n(F) Rationalization\n(G) Denial\n(H) Repression\n(I) Sublimation\n(J) Reaction formation"]

There are still inputs with more than 8 different personas.

In [None]:
print(f'Deleted rows: {len(data.index) - len(data_unique.index)}')

## Check for empty lines

In [None]:
data_unique.info()

An empty agent column indicates that no persona was used. It is not a missing row, so these rows are kept. All other columns contain the expected (full) number of entries.

## Clean unusable instances

For the training to work, we need a performance hierarchy between the agents. We therefore delete all inputs for which either all agents or no agents complete the task: these agents cannot be ranked because they all perform equally well or equally badly.
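
The filtering rule can be sketched on a toy table. The data below is hypothetical, and the sketch uses `transform("nunique")`, which is assumed to be equivalent to a lambda that calls `nunique()` on each group.

```python
import pandas as pd

# Hypothetical scores: q1 has a winner and a loser, q2 only winners, q3 only losers
toy = pd.DataFrame({
    "input": ["q1", "q1", "q2", "q2", "q3", "q3"],
    "agent": ["a", "b", "a", "b", "a", "b"],
    "score": [1, 0, 1, 1, 0, 0],
})

# Number of distinct scores per input, broadcast back to every row
distinct_scores = toy.groupby("input")["score"].transform("nunique")

# Keep only inputs where the agents can actually be ranked
kept = toy[distinct_scores > 1]

print(kept["input"].unique().tolist())  # ['q1']
```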

In [None]:
# Group rows by input and keep a group only if it has more than one distinct score (otherwise all agents performed the same)
data_filtered = data_unique[data_unique.groupby('input')['score'].transform(lambda x: x.nunique() > 1)]

## Some information about the data

In [None]:
print(f'Deleted rows in total: {len(data.index) - len(data_filtered.index)}')

In [None]:
print(f'Remaining rows: {len(data_filtered.index)}')

In [None]:
print(f"Avg. agents per prompt: {len(data_filtered) / len(data_filtered['input'].unique())}")

In [None]:
print(f"Number of winning agents: {len(data_filtered[data_filtered['score'] == 1])}")

In [None]:
print(f"Number of losing agents: {len(data_filtered[data_filtered['score'] == 0])}")

## Setup data for fine-tuning

In [None]:
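# Keep only winning agents that actually use a persona (non-empty agent string)
fine_tune_data = data_filtered[data_filtered['score'] == 1]
# .copy() so that the columns added below do not trigger a SettingWithCopyWarning
fine_tune_data = fine_tune_data[fine_tune_data['agent'] != ''].copy()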

In [None]:
finetune_task_string = 'Given the following task, print a suitable agent string that will answer it as good as possible: '
fine_tune_data['prompt'] = finetune_task_string + fine_tune_data['input'].str.strip() + "\n"
fine_tune_data['completion'] = " " + fine_tune_data['agent'].str.strip()
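
For a single row, the resulting prompt and completion can be sketched as follows. The helper `build_record` and the example input are hypothetical. The trailing newline on the prompt and the leading space on the completion act as separators between the two fields.

```python
task = ('Given the following task, print a suitable agent string '
        'that will answer it as good as possible: ')


def build_record(input_text: str, agent: str) -> dict:
    """Build one prompt/completion pair in the same way as the cell above."""
    return {
        "prompt": task + input_text.strip() + "\n",
        "completion": " " + agent.strip(),
    }


# Hypothetical input and agent string
record = build_record("  What is 2 + 2?\n", "You are a math teacher. ")

print(repr(record["prompt"]))
print(repr(record["completion"]))  # ' You are a math teacher.'
```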

In [None]:
fine_tune_data = fine_tune_data[['prompt', 'completion']]
fine_tune_data

In [None]:
fine_tune_data.to_parquet('data/agent_finetune.parquet', index=False)

## Setup training dataset to allow DPO training

In this section, the actual dataset for training, testing and evaluating the model will be constructed. This will be done according to the documentation here: https://huggingface.co/docs/trl/main/en/dataset_formats#preference

There are two possible formats, standard and conversational, each with either an implicit or an explicit prompt. Please refer to the documentation linked above to see examples.

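In the following code, the training dataset is built in the conversational format with explicit prompts (explicit prompts are the recommended option). A flat copy with plain strings, which corresponds to the standard format, is stored separately for analysis.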
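
As a rough illustration of the two formats with explicit prompts, the sketch below converts a standard-format sample (plain strings) into a conversational one (lists of role/content messages). The helper `to_conversational` and the sample values are hypothetical. The code in this notebook stores message lists of the conversational kind in `train_data` and plain strings in `analysis_data`.

```python
def to_conversational(sample: dict) -> dict:
    """Wrap the plain strings of a standard-format sample in chat messages."""
    return {
        "prompt": [{"role": "user", "content": sample["prompt"]}],
        "chosen": [{"role": "assistant", "content": sample["chosen"]}],
        "rejected": [{"role": "assistant", "content": sample["rejected"]}],
    }


# Hypothetical standard-format sample with an explicit prompt
standard = {
    "prompt": "What is 2 + 2?",
    "chosen": "You are a math teacher.",
    "rejected": "You are a poet.",
}

conversational = to_conversational(standard)
print(conversational["prompt"])  # [{'role': 'user', 'content': 'What is 2 + 2?'}]
```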

Training samples are tuples of $(x, y_w, y_l)$, where $y_w$ is a winning agent prompt string, $y_l$ is a losing agent prompt string and $x$ is the model input.
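
Pairing every winning agent with every losing agent of the same input yields $|W| \cdot |L|$ samples per input. A minimal sketch with hypothetical agent strings, using `itertools.product` instead of nested loops:

```python
from itertools import product


def build_triples(x, winners, losers):
    """Return all (x, y_w, y_l) combinations for a single input x."""
    return [(x, y_w, y_l) for y_w, y_l in product(winners, losers)]


# Hypothetical input with 2 winning and 3 losing agents
triples = build_triples(
    "What is 2 + 2?",
    ["math teacher", "accountant"],
    ["poet", "chef", "painter"],
)

print(len(triples))  # 6 = 2 winners * 3 losers
print(triples[0])    # ('What is 2 + 2?', 'math teacher', 'poet')
```

Because the number of samples grows with the product of winners and losers, inputs with many agents contribute more training samples than inputs with few.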

In [None]:
train_data = []
analysis_data = []

# Task string prepended to every input; it is constant, so it is defined once
model_task = 'Given the following task, print a suitable agent string that will answer it as good as possible: '

for input_val, group in data_filtered.groupby('input'):
    # Split the group into winning (score 1) and losing (score 0) agents
    winning_agents = group[group['score'] == 1]
    losing_agents = group[group['score'] == 0]

    # Pair every winning agent with every losing agent of the same input
    for idx_winning, winning in winning_agents.iterrows():
        for idx_losing, losing in losing_agents.iterrows():
            # Sanity check: both rows must belong to the same input
            if not winning['input'] == losing['input']:
                raise Exception("Illegal combination of winning and losing input")

            # Actual dataset for training the model using conversational, explicit format
            train_data.append({
                'category': winning['category'],
                'orig_prompt': input_val,
                'prompt': [{'role': 'user', 'content': f'{model_task} {input_val}'}],
                'chosen': [{'role': 'assistant', 'content': winning['agent']}],
                'rejected': [{'role': 'assistant', 'content': losing['agent']}]
            })

            # Flat string version for analysis, since the list-valued prompt/chosen/rejected columns above are difficult to analyze
            analysis_data.append({
                'category': winning['category'],
                'orig_prompt': input_val,
                'prompt': f'{model_task} {input_val}', # or losing['input'] or winning['input']
                'chosen': winning['agent'],
                'rejected': losing['agent']
            })

In [None]:
print(f'Training (and test/eval) samples: {len(train_data)}')

In [None]:
train_df = pd.DataFrame(train_data)
train_df.to_parquet('data/agent_prompt_cleaned.parquet', index=False)

analysis_df = pd.DataFrame(analysis_data)
analysis_df.to_parquet('data_analysis/agent_prompt_cleaned.parquet', index=False)

In [None]:
train_df.head()

## Prepare test set

Testing is performed on categories that are held out from training. The resulting test set contains over 1000 inputs.
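
A held-out split by category can be sketched as follows. The categories and prompts in this toy table are hypothetical. The check at the end confirms that no category appears in both parts.

```python
import pandas as pd

# Hypothetical samples with a category column
samples = pd.DataFrame({
    "category": ["biology", "law", "history", "law", "chemistry"],
    "prompt": ["p1", "p2", "p3", "p4", "p5"],
})

held_out = ["biology", "history"]

# Rows from held-out categories go to the test part, all others to training
is_test = samples["category"].isin(held_out)
test_part = samples[is_test]
train_part = samples[~is_test]

# No category may appear in both parts
assert set(test_part["category"]).isdisjoint(train_part["category"])

print(len(test_part), len(train_part))  # 2 3
```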

In [None]:
train_df['category'].value_counts()

In [None]:
test_categories = ['biology', 'philosophy', 'psychology',
                   'social_iqa', 'arithmetic', 'elementary_math_qa', 'history', 'economics']

test_df = train_df[train_df['category'].isin(test_categories)]
train_df = train_df[~train_df['category'].isin(test_categories)]

test_analysis_df = analysis_df[analysis_df['category'].isin(test_categories)]
train_analysis_df = analysis_df[~analysis_df['category'].isin(test_categories)]
len(test_df)

In [None]:
len(test_df.orig_prompt.unique())

In [None]:
test_df = test_df.drop(columns=['orig_prompt'])
train_df = train_df.drop(columns=['orig_prompt'])

In [None]:
test_df.to_parquet('data/agent_test.parquet', index=False)
train_df.to_parquet('data/agent_train.parquet', index=False)

test_analysis_df.to_parquet('data_analysis/agent_test.parquet', index=False)
train_analysis_df.to_parquet('data_analysis/agent_train.parquet', index=False)