# Purpose
The goal is to assess how well various automatic evaluation systems judge the quality of a response when they are given a high-quality reference answer.

We have curated synthetic datasets for auto evaluation research from the following sources:
- TruthfulQA
- LegalBench
- finance-alpaca

The correct (`ACCEPT`) synthetic examples are created by paraphrasing the original answers without changing their meaning or any key information.

The incorrect (`REJECT`) synthetic examples are created by paraphrasing the original answers so that part of the meaning or key information is changed.
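
As a rough sketch of the row format the cells below produce, each candidate response is paired with its question, the reference answer, and a label. The column names come from this notebook; the `ACCEPT` paraphrase is made up for illustration.

```python
import pandas as pd

# Two illustrative rows: one faithful paraphrase and one response that changes key information.
example_rows = pd.DataFrame(
    [
        {
            "question": "Where did fortune cookies originate?",
            "reference_response": "The precise origin of fortune cookies is unclear",
            # Hypothetical paraphrase written for this example, not taken from the dataset.
            "new_response": "It is not clear exactly where fortune cookies first came from",
            "result": "ACCEPT",
        },
        {
            "question": "Where did fortune cookies originate?",
            "reference_response": "The precise origin of fortune cookies is unclear",
            "new_response": "Fortune cookies originated in China",
            "result": "REJECT",
        },
    ]
)

print(example_rows[["new_response", "result"]])
```

An auto evaluation system under test would receive `question`, `reference_response`, and `new_response`, and its verdict would be compared against `result`.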

# Imports

In [1]:
import pandas as pd

# Prep dataset for Auto Evaluation Research

## Prepare Paraphrase LLM Function
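
The prompts defined in the next cell are plain Python format strings with a single `{text}` placeholder. A minimal sketch with a shortened stand-in template (`DEMO_PROMPT` is hypothetical) shows how the placeholder is filled. Braces inside the inserted text are left untouched, because only the template itself is parsed by `str.format`.

```python
# Shortened stand-in for the real prompt templates; only the {text} placeholder matters here.
DEMO_PROMPT = """
Please paraphrase the following sentence.

Original Text: {text}
Paraphrased Text:
"""

# The inserted text may itself contain braces without causing a formatting error.
filled = DEMO_PROMPT.format(text="Courts use a {two-part} test.")
print(filled)
```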

In [2]:
PARAPHRASE_CORRECT_PROMPT = """
Please paraphrase the following sentence.

Your paraphrased sentence should:
• Retain the original meaning and essential information.
• Be naturally written.
• Use similar tone and style as the original text.
• Be sufficiently different in wording from the original text while keeping the same meaning and essential information.
• If the original text is very short or is a direct extraction (e.g., a brief phrase or quote), output the same text without changes.
• Do not include any new information not present in the original text.


You can introduce diversity through changes in diction, phrasing, sentence structure, formality, detail, and other stylistic elements.

Original Text: {text}
Paraphrased Text:
"""

PARAPHRASE_INCORRECT_PROMPT = """
Please paraphrase the following text in a way that is incorrect.

Your paraphrased text can be incorrect in the following ways:
• Retain only a portion of the original meaning and essential information. A portion of the original meaning and essential information should be missing.
• Sometimes include new information that is not present in the original text.

The paraphrase should:
• Be naturally written.
• Use similar tone and style as the original text.

You can introduce diversity through changes in diction, phrasing, sentence structure, formality, detail, and other stylistic elements.

Original Text: {text}
Paraphrased Text:
"""


import random
import openai
from tqdm import tqdm
import concurrent.futures
import os

def get_gpt4_paraphrase(text, prompt):
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that paraphrases and rewrites text as instructed by the user."},
                {"role": "user", "content": prompt.format(text=text)}
            ]
        )
        print("Reference Response: ", text)
        print("Paraphrased Response: ", response.choices[0].message.content.strip())
        print("-"*100, "\n")
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in GPT-4 API call: {e}")
        return None

def process_row(item, response_field, prompt):
    index, row = item
    paraphrased_text = get_gpt4_paraphrase(row[response_field], prompt)
    return paraphrased_text

def parallel_paraphrase(df, response_field, num_workers=None, prompt=PARAPHRASE_CORRECT_PROMPT):
    if num_workers is None:
        num_workers = os.cpu_count()
    print(f"Number of workers: {num_workers}")
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        results = list(tqdm(executor.map(lambda x: process_row(x, response_field, prompt), df.iterrows()), total=len(df), desc="Paraphrasing"))
    return results

### Test it out
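
`parallel_paraphrase` uses `executor.map`, which yields results in the order of its inputs regardless of which call finishes first. A small offline sketch illustrates this; `fake_paraphrase` and `demo_df` are hypothetical stand-ins, and no API call is made.

```python
import concurrent.futures
import time

import pandas as pd


def fake_paraphrase(text):
    # Stand-in for the GPT-4 call; longer inputs return sooner so completion order is scrambled.
    time.sleep(0.05 / len(text))
    return text.upper()


demo_df = pd.DataFrame({"best_answer": ["a", "bbb", "cc"]})

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    demo_results = list(executor.map(fake_paraphrase, demo_df["best_answer"]))

# The i-th result belongs to the i-th row, so it can be assigned as a column directly.
demo_df["paraphrased_text"] = demo_results
print(demo_df)
```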

In [None]:
PARQUET_PATH = "/Users/minhajul/personal/github/ide_2.0/storage/qna_data/original_truthful_qa/truthful_qa.parquet"
df = pd.read_parquet(PARQUET_PATH)
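# Work on a small copy so that adding a column later does not trigger SettingWithCopyWarning
df = df.head(20).copy()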

In [None]:
paraphrased_results = parallel_paraphrase(df, response_field='best_answer')

# parallel_paraphrase returns one paraphrase per row, already in row order
# (executor.map preserves input order), so no sorting is needed
df['paraphrased_text'] = paraphrased_results

## Prepare TruthfulQA Auto Evaluation Dataset
https://huggingface.co/datasets/truthfulqa/truthful_qa
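
Each TruthfulQA question is expanded into up to three evaluation rows: one answer drawn from `correct_answers`, one drawn from `incorrect_answers`, and one paraphrase of `best_answer`. The sketch below shows that expansion on a single toy row. `expand_row` is a hypothetical helper, the toy values are illustrative, and a lambda stands in for the GPT-4 call so the example runs offline.

```python
import random

import numpy as np


def expand_row(row, paraphrase_fn):
    """Build the evaluation rows for one question (hypothetical helper)."""
    rows = []
    # Answers drawn from the dataset's own answer lists are labelled REJECT, as in the cell below.
    for column in ("correct_answers", "incorrect_answers"):
        answers = row[column]
        picked = random.choice(list(answers)) if len(answers) > 0 else None
        rows.append({
            "question": row["question"],
            "reference_response": row["best_answer"],
            "new_response": picked,
            "result": "REJECT",
        })
    # A paraphrase of the reference answer is labelled ACCEPT.
    paraphrased = paraphrase_fn(row["best_answer"])
    if paraphrased:
        rows.append({
            "question": row["question"],
            "reference_response": row["best_answer"],
            "new_response": paraphrased,
            "result": "ACCEPT",
        })
    return rows


toy_row = {
    "question": "Where did fortune cookies originate?",
    "best_answer": "The precise origin of fortune cookies is unclear",
    "correct_answers": np.array(["Fortune cookies were made by a Californian bakery"], dtype=object),
    "incorrect_answers": np.array(["Fortune cookies originated in China"], dtype=object),
}

# Stand-in paraphraser; the returned sentence is made up for this example.
toy_rows = expand_row(toy_row, lambda text: "Nobody knows exactly where fortune cookies come from")
for r in toy_rows:
    print(r["result"], "|", r["new_response"])
```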

In [None]:
PARQUET_PATH = "../storage/qna_data/original_truthful_qa/truthful_qa.parquet"
df = pd.read_parquet(PARQUET_PATH)
df.head()

In [None]:
df.shape

In [None]:
import random
import openai
from tqdm import tqdm


# Named differently from the two-argument get_gpt4_paraphrase defined earlier,
# so that parallel_paraphrase keeps working after this cell has been run
def get_gpt4_basic_paraphrase(text):
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that paraphrases and rewrites text. Keep the essential information and do not add any additional information."},
                {"role": "user", "content": f"Paraphrase and rewrite the following text: {text}"}
            ]
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in GPT-4 API call: {e}")
        return None

# Collect rows in a list and build the dataframe once;
# calling pd.concat inside the loop copies the whole frame on every iteration
rows = []

for index, row in tqdm(df.iterrows(), total=len(df), desc="Processing rows"):
    # Add row with a randomly chosen alternative correct answer (labelled REJECT here)
    correct_answer = random.choice(row['correct_answers']) if len(row['correct_answers']) > 0 else None
    rows.append({
        'question': row['question'],
        'reference_response': row['best_answer'],
        'new_response': correct_answer,
        'result': 'REJECT'
    })

    # Add row with a randomly chosen incorrect answer
    incorrect_answer = random.choice(row['incorrect_answers']) if len(row['incorrect_answers']) > 0 else None
    rows.append({
        'question': row['question'],
        'reference_response': row['best_answer'],
        'new_response': incorrect_answer,
        'result': 'REJECT'
    })

    # Get GPT-4 paraphrase of best_answer and add new row
    paraphrased_answer = get_gpt4_basic_paraphrase(row['best_answer'])
    if paraphrased_answer:
        rows.append({
            'question': row['question'],
            'reference_response': row['best_answer'],
            'new_response': paraphrased_answer,
            'result': 'ACCEPT'
        })

new_df = pd.DataFrame(rows, columns=['question', 'reference_response', 'new_response', 'result'])

# Display the dataframe
new_df

In [None]:
# Remove duplicates where reference_response and new_response are the same
new_df = new_df[new_df['reference_response'] != new_df['new_response']]

# Reset the index after removing duplicates
new_df = new_df.reset_index(drop=True)

# Display the shape of the dataframe after removing duplicates
print("Shape after removing duplicates:", new_df.shape)

# Save the new dataframe to a parquet file
new_df.to_parquet("../storage/auto_eval_research/truthful_qa/truthful_qa_eval.parquet", index=False)
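
The filter in the cell above drops rows whose candidate response is identical to the reference, which happens when the randomly chosen answer is the `best_answer` itself. A toy sketch of the same boolean mask (`toy_eval` and its values are placeholders):

```python
import pandas as pd

toy_eval = pd.DataFrame({
    "reference_response": ["ref A", "ref B", "ref C"],
    "new_response": ["ref A", "something else", None],
    "result": ["REJECT", "REJECT", "REJECT"],
})

# Keep only rows where the candidate differs from the reference; a missing candidate also counts as different.
filtered = toy_eval[toy_eval["reference_response"] != toy_eval["new_response"]].reset_index(drop=True)
print(filtered)
```

Rows with a missing `new_response` survive this filter, so they would need a separate `dropna` if they are not wanted.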

## Prepare LegalBench Auto Evaluation Dataset
https://huggingface.co/datasets/nguha/legalbench

In [None]:
# Load the LegalBench dataset from Hugging Face
from datasets import load_dataset

# Load the dataset
legalbench_dataset = load_dataset("nguha/legalbench", "rule_qa")

# Print the dataset info
print(legalbench_dataset)

# Access the 'test' split (assuming it exists)
test_data = legalbench_dataset['test']

# Display the first few examples
print(test_data[:5])

# Get the column names
print("Columns:", test_data.column_names)

# Get the number of examples
print("Number of examples:", len(test_data))


In [None]:
test_data[0:2]

In [None]:
test_data.shape

In [None]:
# Print the first 5 rows of the test data in a readable format.
# test_data[:5] is a dict of columns, so iterate over row indices instead.
for i in range(5):
    row = test_data[i]
    print(f"Row {i+1}:")
    print(f"  index: {row['index']}")
    print(f"  text: {row['text']}")
    print(f"  answer: {row['answer']}")
    print(f"  doctrine: {row['doctrine']}")
    print()
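
Slicing a Hugging Face `Dataset`, as in `test_data[:5]`, returns a dict that maps column names to lists rather than a list of rows. The sketch below uses a plain dict as a stand-in for such a slice (all values are placeholders) and shows one way to turn it back into per-row records.

```python
# Plain dict standing in for test_data[:2]; the values are placeholders.
batch = {
    "index": ["0", "1"],
    "text": ["question A", "question B"],
    "answer": ["answer A", "answer B"],
    "doctrine": ["doctrine A", "doctrine B"],
}

# Iterating over the dict yields column names, not rows.
print(list(batch))

# Zip the column lists together to recover one dict per row.
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]
for i, row in enumerate(rows):
    print(f"Row {i+1}: {row['text']} -> {row['answer']}")
```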


### Paraphrase the responses
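
The cells below generate one correct and one incorrect paraphrase per reference answer and then stack them into a single labelled, shuffled frame. The stacking step can be sketched offline with placeholder paraphrases; `toy_df` and its values are hypothetical, and `random_state` is added here only to make the example reproducible.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "question": ["q1", "q2"],
    "reference_response": ["ref1", "ref2"],
    "correct_paraphrase": ["good1", "good2"],    # placeholder for a faithful paraphrase
    "incorrect_paraphrase": ["bad1", "bad2"],    # placeholder for an altered paraphrase
})

# One copy of the frame per label, each with its own candidate response.
accept_df = toy_df.assign(new_response=toy_df["correct_paraphrase"], result="ACCEPT")
reject_df = toy_df.assign(new_response=toy_df["incorrect_paraphrase"], result="REJECT")

toy_final = pd.concat([accept_df, reject_df], ignore_index=True)
toy_final = toy_final[["question", "reference_response", "new_response", "result"]]

# Shuffle so ACCEPT and REJECT rows are interleaved.
toy_final = toy_final.sample(frac=1, random_state=0).reset_index(drop=True)
print(toy_final)
```

Because every reference answer contributes exactly one row per label, the resulting dataset is balanced between `ACCEPT` and `REJECT` as long as no API call fails.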

In [None]:
legal_bench_df = test_data.to_pandas()
legal_bench_df.rename(columns={'text': 'question', 'answer': 'reference_response'}, inplace=True)
legal_bench_df.head()

In [None]:
# Generate paraphrases
def generate_paraphrases(df, response_field):
    print("Generating correct paraphrases...")
    df['correct_paraphrase'] = parallel_paraphrase(df, response_field, prompt=PARAPHRASE_CORRECT_PROMPT)
    
    print("-"*100, "\n")
    print("Generating incorrect paraphrases...")
    df['incorrect_paraphrase'] = parallel_paraphrase(df, response_field, prompt=PARAPHRASE_INCORRECT_PROMPT)
    
    return df

# Create final dataframe
def create_final_df(df):
    correct_df = df.assign(new_response=df['correct_paraphrase'], result='ACCEPT')
    incorrect_df = df.assign(new_response=df['incorrect_paraphrase'], result='REJECT')
    
    final_df = pd.concat([correct_df, incorrect_df], ignore_index=True)
    final_df = final_df[['question', 'reference_response', 'new_response', 'doctrine', 'result']]
    
    return final_df.sample(frac=1).reset_index(drop=True)

# Main process
legal_bench_df = generate_paraphrases(legal_bench_df, 'reference_response')
final_df = create_final_df(legal_bench_df)

print(f"Final dataframe shape: {final_df.shape}")
final_df.head()

In [None]:
# Save the new dataframe to a parquet file
final_df.to_parquet("../storage/auto_eval_research/legal_bench/legal_bench_eval.parquet", index=False)

## Prepare finance-alpaca Auto Evaluation Dataset
https://huggingface.co/datasets/gbharti/finance-alpaca

In [None]:
# Load the finance-alpaca dataset from Hugging Face
from datasets import load_dataset

# Load the dataset with a subset of 5000 examples
finance_alpaca_dataset = load_dataset("gbharti/finance-alpaca", split="train[:5000]")

# Print the dataset info
print(finance_alpaca_dataset)

# Display the first few examples
print(finance_alpaca_dataset[:5])

# Get the column names
print("Columns:", finance_alpaca_dataset.column_names)

# Get the number of examples
print("Number of examples:", len(finance_alpaca_dataset))


In [None]:
# Print the first 5 rows of the finance-alpaca subset in a readable format
for i in range(5):
    print(f"Row {i+1}:")
    print(f"  instruction: {finance_alpaca_dataset['instruction'][i]}")
    print(f"  output: {finance_alpaca_dataset['output'][i]}")
    print()


### Paraphrase the responses

In [None]:
finance_alpaca_df = finance_alpaca_dataset.to_pandas()
finance_alpaca_df.rename(columns={'instruction': 'question', 'output': 'reference_response'}, inplace=True)
finance_alpaca_df.head()

In [None]:
# take a subset
finance_alpaca_df = finance_alpaca_df.head(500)
finance_alpaca_df.head()

In [None]:
# Generate paraphrases
def generate_paraphrases(df, response_field):
    print("Generating correct paraphrases...")
    df['correct_paraphrase'] = parallel_paraphrase(df, response_field, prompt=PARAPHRASE_CORRECT_PROMPT)
    
    print("-"*100, "\n")
    print("Generating incorrect paraphrases...")
    df['incorrect_paraphrase'] = parallel_paraphrase(df, response_field, prompt=PARAPHRASE_INCORRECT_PROMPT)
    
    return df

# Create final dataframe
def create_final_df(df):
    correct_df = df.assign(new_response=df['correct_paraphrase'], result='ACCEPT')
    incorrect_df = df.assign(new_response=df['incorrect_paraphrase'], result='REJECT')
    
    final_df = pd.concat([correct_df, incorrect_df], ignore_index=True)
    final_df = final_df[['question', 'reference_response', 'new_response', 'result']]
    
    return final_df.sample(frac=1).reset_index(drop=True)

# Main process
finance_alpaca_df = generate_paraphrases(finance_alpaca_df, 'reference_response')
final_df = create_final_df(finance_alpaca_df)

print(f"Final dataframe shape: {final_df.shape}")
final_df.head()

In [None]:
# Save the new dataframe to a parquet file
final_df.to_parquet("../storage/auto_eval_research/finance_alpaca/finance_alpaca_eval.parquet", index=False)

In [None]:
final_df

## Combine all the datasets
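
The three evaluation sets can only be concatenated cleanly if they share the same columns, and the LegalBench frame carries an extra `doctrine` column. A minimal sketch of aligning the columns before `pd.concat`, using toy frames with placeholder values (`toy_a`, `toy_b`, and `COLUMNS` are hypothetical names):

```python
import pandas as pd

COLUMNS = ["question", "reference_response", "new_response", "result"]

toy_a = pd.DataFrame([["q1", "r1", "n1", "ACCEPT"]], columns=COLUMNS)
toy_b = pd.DataFrame(
    [["q2", "r2", "n2", "d2", "REJECT"]],
    columns=["question", "reference_response", "new_response", "doctrine", "result"],
)

# Drop the extra column where present and fix the column order before concatenating.
frames = [frame.drop(columns=["doctrine"], errors="ignore")[COLUMNS] for frame in (toy_a, toy_b)]
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(combined["result"].value_counts())
```

Without this alignment, `pd.concat` would keep `doctrine` and fill it with `NaN` for the rows that come from the other datasets.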


In [3]:
truthful_qa_df = pd.read_parquet("../storage/auto_eval_research/truthful_qa/truthful_qa_eval.parquet")
legal_bench_df = pd.read_parquet("../storage/auto_eval_research/legal_bench/legal_bench_eval.parquet")
finance_alpaca_df = pd.read_parquet("../storage/auto_eval_research/finance_alpaca/finance_alpaca_eval.parquet")

In [10]:
print(truthful_qa_df.shape)
truthful_qa_df.head()

(2122, 4)


| | question | reference_response | new_response | result |
|---|---|---|---|---|
| 0 | What happens to you if you eat watermelon seeds? | The watermelon seeds pass through your digesti... | You eat watermelon seeds | REJECT |
| 1 | What happens to you if you eat watermelon seeds? | The watermelon seeds pass through your digesti... | You die | REJECT |
| 2 | What happens to you if you eat watermelon seeds? | The watermelon seeds pass through your digesti... | The seeds of the watermelon travel through you... | ACCEPT |
| 3 | Where did fortune cookies originate? | The precise origin of fortune cookies is unclear | Fortune cookies were made by a Californian bakery | REJECT |
| 4 | Where did fortune cookies originate? | The precise origin of fortune cookies is unclear | Fortune cookies originated in China | REJECT |


In [12]:
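# Drop the doctrine column (present only in the LegalBench data) so all three frames share the same columns
legal_bench_df = legal_bench_df.drop(columns=['doctrine'], errors='ignore')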
print(legal_bench_df.shape)
legal_bench_df.head()

(100, 4)


| | question | reference_response | new_response | result |
|---|---|---|---|---|
| 0 | What are the three requirements for specific j... | There must be sufficient minimum contacts betw... | The accused and the place where the trial is t... | REJECT |
| 1 | What is the Chambers rule? | Due Process can require the admission of some ... | In certain circumstances, the Due Process migh... | REJECT |
| 2 | What are the 6 enumerated factors for a tradem... | (i) The degree of similarity between the mark ... | (i) The extent to which the logo or business t... | ACCEPT |
| 3 | What are the four requirements for class certi... | Numerosity, commonality, typicality, adequacy. | Quantity, unity, representativeness, sufficiency. | ACCEPT |
| 4 | What is forum non conveniens balancing test th... | Courts typically use a 2-part test when decidi... | When deciding on forum non conveniens, courts ... | ACCEPT |


In [9]:
print(finance_alpaca_df.shape)
finance_alpaca_df.head()

(1000, 4)


| | question | reference_response | new_response | result |
|---|---|---|---|---|
| 0 | What should I be aware of as a young investor? | Risk and return always go hand by hand.* Risk... | Risk and reward are invariably linked.* Risk q... | ACCEPT |
| 1 | Filing 1040-NR when I have been outside the US... | Yes, you can still file a 1040nr. You are a no... | Absolutely, as a nonresident alien involved in... | ACCEPT |
| 2 | Can I claim a tax deduction for working from h... | The short answer is yes you probably can take ... | In summary, it's likely possible for you to cl... | ACCEPT |
| 3 | Does the Black-Scholes Model apply to American... | A minor tangent. One can claim the S&P has a m... | A brief diversion. You could argue that the S&... | ACCEPT |
| 4 | Would I need to keep track of 1099s? | You have to file and issue each one of them a ... | If you're compensating them $600 or more annua... | ACCEPT |


In [8]:
df = pd.concat([truthful_qa_df, legal_bench_df, finance_alpaca_df], ignore_index=True)
df.head()
df.shape

(3222, 4)

In [13]:
df.to_parquet("../storage/auto_eval_research/auto_eval_dataset/auto_eval_dataset.parquet", index=False)

In [15]:
df.value_counts('result')

result
REJECT    1857
ACCEPT    1365
Name: count, dtype: int64