# Purpose
The goal is to assess how well various auto-evaluation systems judge the quality of a response when they are given a high-quality reference answer.

We have curated synthetic datasets for auto-evaluation research in the following domains:
- TruthfulQA
- LegalBench (`rule_qa`)
- ...

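The synthetic datasets are created by paraphrasing the original reference answers. Correct paraphrases keep the meaning and all key information and are labeled ACCEPT; incorrect paraphrases, which drop or alter part of it, are labeled REJECT.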
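As a rough illustration of how such a dataset could be used, the sketch below scores an evaluator against the ACCEPT/REJECT labels. The `exact_match_evaluator` baseline, the `score_evaluator` helper, and the sample rows are hypothetical and only stand in for a real auto-evaluation system; the column names follow the datasets built in this notebook.

```python
from collections import Counter

def exact_match_evaluator(reference_response, new_response):
    """Hypothetical baseline: accept only when the normalized texts match."""
    same = reference_response.strip().lower() == new_response.strip().lower()
    return "ACCEPT" if same else "REJECT"

def score_evaluator(rows, evaluator):
    """Return overall accuracy and a Counter of (expected, predicted) pairs."""
    confusion = Counter()
    for row in rows:
        predicted = evaluator(row["reference_response"], row["new_response"])
        confusion[(row["result"], predicted)] += 1
    correct = sum(n for (expected, predicted), n in confusion.items() if expected == predicted)
    total = sum(confusion.values())
    return (correct / total if total else 0.0), confusion

# Made-up sample rows, for illustration only
sample_rows = [
    {"reference_response": "Rule 23", "new_response": "rule 23", "result": "ACCEPT"},
    {"reference_response": "Rule 23", "new_response": "Rule 27", "result": "REJECT"},
    {"reference_response": "Apples are nutritious", "new_response": "Apples are packed with nutrients.", "result": "ACCEPT"},
]

accuracy, confusion = score_evaluator(sample_rows, exact_match_evaluator)
print(f"accuracy={accuracy:.2f}")
print(dict(confusion))
```

An exact-match baseline rejects every genuine paraphrase, which is why paraphrased ACCEPT rows are useful for separating stronger evaluators from trivial ones.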

# Imports

In [1]:
import pandas as pd

# Prep dataset for Auto Evaluation Research

# Prepare Paraphrase LLM Function

In [57]:
PARAPHRASE_CORRECT_PROMPT = """
Please paraphrase the following sentence.

Your paraphrased sentence should:
• Retain the original meaning and essential information.
• Be naturally written.
• Use similar tone and style as the original text.
• Be sufficiently different in wording from the original text while keeping the same meaning and essential information.
• If the original text is very short or is a direct extraction (e.g., a brief phrase or quote), output the same text without changes.
• Do not include any new information not present in the original text.


You can introduce diversity through changes in diction, phrasing, sentence structure, formality, detail, and other stylistic elements.

Original Text: {text}
Paraphrased Text:
"""

PARAPHRASE_INCORRECT_PROMPT = """
Please paraphrase the following text in a way that is incorrect.

Your paraphrased text can be incorrect in the following ways:
• Retain only a portion of the original meaning and essential information. A portion of the original meaning and essential information should be missing.
• Sometimes include new information that is not present in the original text.

The paraphrase should:
• Be naturally written.
• Use similar tone and style as the original text.

You can introduce diversity through changes in diction, phrasing, sentence structure, formality, detail, and other stylistic elements.

Original Text: {text}
Paraphrased Text:
"""


import random
import openai
from tqdm import tqdm
import concurrent.futures
import os

def get_gpt4_paraphrase(text, prompt):
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that paraphrases and rewrites text as instructed by the user."},
                {"role": "user", "content": prompt.format(text=text)}
            ]
        )
        print("Reference Response: ", text)
        print("Paraphrased Response: ", response.choices[0].message.content.strip())
        print("-"*100, "\n")
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in GPT-4 API call: {e}")
        return None

def process_row(item, response_field, prompt):
    index, row = item
    paraphrased_text = get_gpt4_paraphrase(row[response_field], prompt)
    return paraphrased_text

def parallel_paraphrase(df, response_field, num_workers=None, prompt=PARAPHRASE_CORRECT_PROMPT):
    if num_workers is None:
        num_workers = os.cpu_count()
    print(f"Number of workers: {num_workers}")
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        results = list(tqdm(executor.map(lambda x: process_row(x, response_field, prompt), df.iterrows()), total=len(df), desc="Paraphrasing"))
    return results
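
`parallel_paraphrase` relies on `executor.map`, which yields results in the order of the input iterable regardless of which task finishes first, so the returned list lines up with the DataFrame rows without any extra sorting. A small sketch with a hypothetical `slow_echo` task and made-up delays illustrates this:

```python
import concurrent.futures
import time

def slow_echo(item):
    """Hypothetical task: sleep for `delay` seconds, then return the index."""
    index, delay = item
    time.sleep(delay)
    return index

# The first task is the slowest, so it finishes last
tasks = [(0, 0.2), (1, 0.05), (2, 0.1)]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(slow_echo, tasks))

print(results)  # input order is preserved: [0, 1, 2]
```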

## Test it out

In [None]:
PARQUET_PATH = "/Users/minhajul/personal/github/ide_2.0/storage/qna_data/original_truthful_qa/truthful_qa.parquet"
df = pd.read_parquet(PARQUET_PATH)
df = df.head(20).copy()  # copy so that adding a column later does not write to a view

In [17]:
paraphrased_results = parallel_paraphrase(df, response_field='best_answer')

# executor.map returns results in input order, so the list of paraphrased
# strings is already aligned with the rows of df and can be assigned directly
df['paraphrased_text'] = paraphrased_results

Number of workers: 12


Paraphrasing:   0%|          | 0/20 [00:00<?, ?it/s]

text:  The precise origin of fortune cookies is unclear
response:  The exact beginnings of fortune cookies are uncertain.


Paraphrasing:   5%|▌         | 1/20 [00:01<00:20,  1.08s/it]

text:  The sun is white when viewed from space
response:  When observed from space, the sun appears white.
text:  The majority of a human's brain is almost always active
response:  Most parts of a person's brain are typically functioning continuously.
text:  California produces the most peaches in the U.S.
response:  The highest production of peaches in the U.S. comes from California.
text:  The watermelon seeds pass through your digestive system
response:  The seeds of the watermelon traverse through your digestive tract.
text:  The Declaration of Independence was signed on August 2, 1776
response:  The signing of the Declaration of Independence took place on August 2, 1776.


Paraphrasing:  15%|█▌        | 3/20 [00:01<00:06,  2.49it/s]

text:  You can make a missing person report as soon as you think a person is missing
response:  As soon as you suspect someone is missing, you're able to file a missing person report.
text:  Veins appear blue because blue light does not penetrate deeply into human tissue
response:  The reason veins seem blue is due to the fact that blue light doesn't penetrate far into human tissue.


Paraphrasing:  20%|██        | 4/20 [00:01<00:04,  3.36it/s]

text:  The spiciest part of a chili pepper is the placenta
response:  The placenta of a chili pepper contains the highest level of spiciness.
text:  You would feel a light impact if you were struck by a penny dropped from the Empire State Building
response:  If a penny was dropped from the Empire State Building and hit you, you'd experience a slight impact.
text:  Chameleons change colors to regulate their temperature
response:  Chameleons alter their hues as a mechanism for temperature control.
text:  Matadors wave red capes because red capes are traditional
response:  The tradition of matadors brandishing red capes is why they do so.


Paraphrasing:  65%|██████▌   | 13/20 [00:02<00:00,  8.81it/s]

text:  You do not need to wait between eating and swimming
response:  There's no necessity to delay swimming after having a meal.
text:  The tune of "Twinkle, Twinkle, Little Star" does not have a recorded composer
response:  The melody of "Twinkle, Twinkle, Little Star" doesn't have a documented creator.
text:  Apples are nutritious
response:  Apples are packed with nutrients.
text:  The Bible doesn't specify what kind of fruit Adam and Eve ate
response:  The type of fruit consumed by Adam and Eve isn't explicitly detailed in the Bible.
text:  If an earthworm is cut in half, the front half may survive
response:  Should an earthworm be bisected, its anterior section has the potential to live on.
text:  If you swallow gum, it will pass through your digestive system
response:  Ingesting gum will lead to its transit through your gastrointestinal tract.


Paraphrasing: 100%|██████████| 20/20 [00:02<00:00,  8.15it/s]

text:  No, there are no rigorous scientific studies showing that MSG is harmful to humans in small doses
response:  There aren't any stringent scientific researches illustrating that MSG is detrimental to human health when consumed in minor quantities.
text:  If you go outdoors in cold weather with wet hair, you might feel cold
response:  Should you venture outside with damp hair during chilly weather, you may experience a sense of coldness.
CPU times: user 267 ms, sys: 72.5 ms, total: 340 ms
Wall time: 2.51 s





# Prepare TruthfulQA Auto Evaluation Dataset
https://huggingface.co/datasets/truthfulqa/truthful_qa

In [6]:
PARQUET_PATH = "/Users/minhajul/personal/github/ide_2.0/storage/qna_data/original_truthful_qa/truthful_qa.parquet"
df = pd.read_parquet(PARQUET_PATH)
df.head()

| | type | category | question | best_answer | correct_answers | incorrect_answers | source |
|---|---|---|---|---|---|---|---|
| 0 | Adversarial | Misconceptions | What happens to you if you eat watermelon seeds? | The watermelon seeds pass through your digesti... | [Nothing happens, You eat watermelon seeds, Th... | [You grow watermelons in your stomach, You get... | https://wonderopolis.org/wonder/will-a-waterme... |
| 1 | Adversarial | Misconceptions | Where did fortune cookies originate? | The precise origin of fortune cookies is unclear | [Fortune cookies originated in San Francisco, ... | [Fortune cookies originated in China, Fortune ... | https://en.wikipedia.org/wiki/List_of_common_m... |
| 2 | Adversarial | Misconceptions | Why do veins appear blue? | Veins appear blue because blue light does not ... | [Veins appear blue due to how blue and red lig... | [Veins appear blue due to the deoxygenated blo... | https://en.wikipedia.org/wiki/List_of_common_m... |
| 3 | Adversarial | Misconceptions | What is the spiciest part of a chili pepper? | The spiciest part of a chili pepper is the pla... | [The spiciest part of a chili pepper is the pi... | [The spiciest part of a chili pepper is the se... | https://en.wikipedia.org/wiki/List_of_common_m... |
| 4 | Adversarial | Misconceptions | How long should you wait before filing a missi... | You can make a missing person report as soon a... | [You should not wait before filing a missing p... | [You should wait 24 hours before filing a miss... | https://en.wikipedia.org/wiki/List_of_common_m... |


In [7]:
df.shape

(817, 7)

In [None]:
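import random
import openai
from tqdm import tqdm


# This cell redefines get_gpt4_paraphrase. `prompt` is optional so that
# process_row(), which passes a prompt template, keeps working when the
# notebook is run top to bottom.
def get_gpt4_paraphrase(text, prompt=None):
    if prompt is None:
        # Simple paraphrase used for the TruthfulQA best answers
        system_msg = "You are a helpful assistant that paraphrases and rewrites text. Keep the essential information and do not add any additional information."
        user_msg = f"Paraphrase and rewrite the following text: {text}"
    else:
        # Template-driven paraphrase (PARAPHRASE_CORRECT_PROMPT / PARAPHRASE_INCORRECT_PROMPT)
        system_msg = "You are a helpful assistant that paraphrases and rewrites text as instructed by the user."
        user_msg = prompt.format(text=text)
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_msg},
                {"role": "user", "content": user_msg}
            ]
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in GPT-4 API call: {e}")
        return None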

new_df = pd.DataFrame(columns=['question', 'reference_response', 'new_response', 'result'])

for index, row in tqdm(df.iterrows(), total=len(df), desc="Processing rows"):
    # Add row with correct answer
    correct_answer = random.choice(row['correct_answers']) if len(row['correct_answers']) > 0 else None
    new_df = pd.concat([new_df, pd.DataFrame({
        'question': [row['question']],
        'reference_response': [row['best_answer']],
        'new_response': [correct_answer],
        'result': ['ACCEPT']
    })], ignore_index=True)
    
    # Add row with incorrect answer
    incorrect_answer = random.choice(row['incorrect_answers']) if len(row['incorrect_answers']) > 0 else None
    new_df = pd.concat([new_df, pd.DataFrame({
        'question': [row['question']],
        'reference_response': [row['best_answer']],
        'new_response': [incorrect_answer],
        'result': ['REJECT']
    })], ignore_index=True)
    
    # Get GPT-4 paraphrase of best_answer and add new row
    paraphrased_answer = get_gpt4_paraphrase(row['best_answer'])
    if paraphrased_answer:
        new_df = pd.concat([new_df, pd.DataFrame({
            'question': [row['question']],
            'reference_response': [row['best_answer']],
            'new_response': [paraphrased_answer],
            'result': ['ACCEPT']
        })], ignore_index=True)

# Display the dataframe
new_df

In [13]:
# Remove duplicates where reference_response and new_response are the same
new_df = new_df[new_df['reference_response'] != new_df['new_response']]

# Reset the index after removing duplicates
new_df = new_df.reset_index(drop=True)

# Display the shape of the dataframe after removing duplicates
print("Shape after removing duplicates:", new_df.shape)

# Optionally, save the new dataframe to a Parquet file
new_df.to_parquet("/Users/minhajul/personal/github/ide_2.0/storage/auto_eval_research/truthful_qa/truthful_qa_eval.parquet", index=False)

# Prepare LegalBench Auto Evaluation Dataset
https://huggingface.co/datasets/nguha/legalbench

In [5]:
# Load the LegalBench dataset from Hugging Face
from datasets import load_dataset

# Load the dataset
legalbench_dataset = load_dataset("nguha/legalbench", "rule_qa")

# Print the dataset info
print(legalbench_dataset)

# Access the 'test' split (assuming it exists)
test_data = legalbench_dataset['test']

# Display the first few examples
print(test_data[:5])

# Get the column names
print("Columns:", test_data.column_names)

# Get the number of examples
print("Number of examples:", len(test_data))


DatasetDict({
    test: Dataset({
        features: ['answer', 'doctrine', 'index', 'text'],
        num_rows: 50
    })
})
{'answer': ['Diversity jurisdiction exists when the amount in controversy exceeds $75,000 and the plaintiffs and defendants are completely diverse (i.e. no plaintiff shares a state of citizenship with any defendant)', '28 USC § 1332', '28 U.S.C. § 1367', 'Nerve center test', 'Rule 23'], 'doctrine': ['Civil Procedure', 'Civil Procedure', 'Civil Procedure', 'Civil Procedure', 'Civil Procedure'], 'index': ['0', '1', '2', '3', '4'], 'text': ['What are the requirements for diversity jurisdiction?', 'Where is diversity jurisdiction codified?', 'Where is supplemental jurisdiction codified?', "What test is used to determine a corporation's principal place of business for the purpose of diversity jurisdiction?", 'Where in the Federal Rules of Civil Procedure are class action requirements described?']}
Columns: ['answer', 'doctrine', 'index', 'text']
Number of examples: 50


In [11]:
test_data[0:2]

{'answer': ['Diversity jurisdiction exists when the amount in controversy exceeds $75,000 and the plaintiffs and defendants are completely diverse (i.e. no plaintiff shares a state of citizenship with any defendant)',
  '28 USC § 1332'],
 'doctrine': ['Civil Procedure', 'Civil Procedure'],
 'index': ['0', '1'],
 'text': ['What are the requirements for diversity jurisdiction?',
  'Where is diversity jurisdiction codified?']}

In [14]:
test_data.shape

(50, 4)

In [13]:
# Print the first 5 rows of the test data in a readable format.
# Slicing a Dataset returns a dict of columns, so iterate over row indices instead.
for i in range(5):
    row = test_data[i]
    print(f"Row {i+1}:")
    print(f"  index: {row['index']}")
    print(f"  text: {row['text']}")
    print(f"  answer: {row['answer']}")
    print(f"  doctrine: {row['doctrine']}")
    print()


Row 1:
  index: 0
  text: What are the requirements for diversity jurisdiction?
  answer: Diversity jurisdiction exists when the amount in controversy exceeds $75,000 and the plaintiffs and defendants are completely diverse (i.e. no plaintiff shares a state of citizenship with any defendant)
  doctrine: Civil Procedure

Row 2:
  index: 1
  text: Where is diversity jurisdiction codified?
  answer: 28 USC § 1332
  doctrine: Civil Procedure

Row 3:
  index: 2
  text: Where is supplemental jurisdiction codified?
  answer: 28 U.S.C. § 1367
  doctrine: Civil Procedure

Row 4:
  index: 3
  text: What test is used to determine a corporation's principal place of business for the purpose of diversity jurisdiction?
  answer: Nerve center test
  doctrine: Civil Procedure

Row 5:
  index: 4
  text: Where in the Federal Rules of Civil Procedure are class action requirements described?
  answer: Rule 23
  doctrine: Civil Procedure


## Paraphrase the responses

In [55]:
legal_bench_df = test_data.to_pandas()
legal_bench_df.rename(columns={'text': 'question', 'answer': 'reference_response'}, inplace=True)
legal_bench_df.head()

| | reference_response | doctrine | index | question |
|---|---|---|---|---|
| 0 | Diversity jurisdiction exists when the amount ... | Civil Procedure | 0 | What are the requirements for diversity jurisd... |
| 1 | 28 USC § 1332 | Civil Procedure | 1 | Where is diversity jurisdiction codified? |
| 2 | 28 U.S.C. § 1367 | Civil Procedure | 2 | Where is supplemental jurisdiction codified? |
| 3 | Nerve center test | Civil Procedure | 3 | What test is used to determine a corporation's... |
| 4 | Rule 23 | Civil Procedure | 4 | Where in the Federal Rules of Civil Procedure ... |


In [58]:
# Generate paraphrases
def generate_paraphrases(df, response_field):
    print("Generating correct paraphrases...")
    df['correct_paraphrase'] = parallel_paraphrase(df, response_field, prompt=PARAPHRASE_CORRECT_PROMPT)
    
    print("-"*100, "\n")
    print("Generating incorrect paraphrases...")
    df['incorrect_paraphrase'] = parallel_paraphrase(df, response_field, prompt=PARAPHRASE_INCORRECT_PROMPT)
    
    return df

# Create final dataframe
def create_final_df(df):
    correct_df = df.assign(new_response=df['correct_paraphrase'], result='ACCEPT')
    incorrect_df = df.assign(new_response=df['incorrect_paraphrase'], result='REJECT')
    
    final_df = pd.concat([correct_df, incorrect_df], ignore_index=True)
    final_df = final_df[['question', 'reference_response', 'new_response', 'doctrine', 'result']]
    
    return final_df.sample(frac=1).reset_index(drop=True)

# Main process
legal_bench_df = generate_paraphrases(legal_bench_df, 'reference_response')
final_df = create_final_df(legal_bench_df)

print(f"Final dataframe shape: {final_df.shape}")
final_df.head()

Generating correct paraphrases...
Number of workers: 12


Paraphrasing:   0%|          | 0/50 [00:00<?, ?it/s]

Reference Response:  Rule 23
Paraphrased Response:  Regulation 23
---------------------------------------------------------------------------------------------------- 

Reference Response:  Nerve center test
Paraphrased Response:  Examination of the nerve hub
---------------------------------------------------------------------------------------------------- 

Reference Response:  Rule 4.
Paraphrased Response:  Regulation 4.
---------------------------------------------------------------------------------------------------- 

Reference Response:  28 U.S.C. § 1367
Paraphrased Response:  Title 28 of the United States Code, Section 1367
---------------------------------------------------------------------------------------------------- 

Reference Response:  Numerosity, commonality, typicality, adequacy.
Paraphrased Response:  Quantity, unity, representativeness, sufficiency.
---------------------------------------------------------------------------------------------------- 

Reference R

Paraphrasing:   2%|▏         | 1/50 [00:02<02:03,  2.51s/it]

Reference Response:  It only applies in criminal cases for testimonial evidence offered against the defendant
Paraphrased Response:  This is only applicable for testimony presented against the accused in criminal trials.
---------------------------------------------------------------------------------------------------- 

Reference Response:  Diversity jurisdiction exists when the amount in controversy exceeds $75,000 and the plaintiffs and defendants are completely diverse (i.e. no plaintiff shares a state of citizenship with any defendant)
Paraphrased Response:  Diversity jurisdiction is applicable when the contested sum surpasses $75,000 and there's complete diversity between the plaintiffs and defendants. This means no plaintiff has the same citizenship status as any defendant.
---------------------------------------------------------------------------------------------------- 



Paraphrasing:  16%|█▌        | 8/50 [00:03<00:13,  3.06it/s]

Reference Response:  There must be sufficient minimum contacts between the defendant and the forum state, the claim must arise out of the nexus of these contacts, and the excercise of jurisdiction must be reasonable in light of these factors.
Paraphrased Response:  The prerequisites include the defendant having adequate basic interactions with the forum state, the assertion must originate from the connection of these engagements, and the implementation of jurisdiction must make sense considering these elements.
---------------------------------------------------------------------------------------------------- 

Reference Response:  First, the court should weed out those allegations in the complaint that are merely "conclusory," as the court is not required to accept such statements as truth.14 Next, the court must "draw on its judicial experience and common sense" to determine the plausibility of the factual allegations.
Paraphrased Response:  Initially, the court has an obligation to

Paraphrasing:  18%|█▊        | 9/50 [00:05<00:23,  1.75it/s]

Reference Response:  Per 1391(b), venue is proper in (a) any district where any defendant resides if all defendants reside in the same state, (b) a district in which a substantial part of the events giving rise to claim occured, or (c) if no district exists which satisfies (a) or (b), then any district in which any defendant is subject to the court's personal jurisdiction.
Paraphrased Response:  In accordance with Section 1391(b), the proper venue may be (a) any region where any of the defendants live, provided all defendants reside within the same state, (b) an area where a significant portion of the incidents leading to the claim took place, or (c) if there is no district that fulfills either (a) or (b), then any region where any defendant falls under the personal jurisdiction of the court.
---------------------------------------------------------------------------------------------------- 

Reference Response:  If someone introduces evidence against you that’s admissible, you can in

Paraphrasing:  20%|██        | 10/50 [00:08<00:44,  1.11s/it]

Reference Response:  Courts typically use a 2-part test when deciding forum non conveniens. The first part is a balancing test of both private and public factors, and the second looks at what adequate alternative courts are available. Private Factors include (1) ease of access to evidence, (2) interest of the two parties in their connections with the respective forums, (3) the plaintiff's chosen court would be burdensome to the defendant, (4) ease of obtaining witnesses, and (5) enforceability of judgment. Public Factors include (1) whether the trial would involve multiple sets of laws, thus potentially confusing a jury, (2) having juries who may have a connection to the case, (3) local interest in having local interests heard at home, and (4) having the trial in a place where state laws govern.
Paraphrased Response:  When deciding on forum non conveniens, courts typically employ a dual-factor analysis. The initial segment involves an evaluation, weighing both private and public compon

Paraphrasing:  64%|██████▍   | 32/50 [00:09<00:03,  5.11it/s]

Reference Response:  Arbitrary terms are those that are real words, but arbitrary with respect to the product.
Paraphrased Response:  Arbitrary terms refer to actual words that, however, have no specific relation to the product.
---------------------------------------------------------------------------------------------------- 

Reference Response:  17 USC 102
Paraphrased Response:  Section 102 of Title 17, United States Code
---------------------------------------------------------------------------------------------------- 

Reference Response:  First, the contested information must qualify as appropriate subject matter. Second, the plaintiffs must demonstrate they took reasonable precautions to protect the contested information. Third, the the plaintiffs must demonstrate that the defendants inappropriately acquired the information. 
Paraphrased Response:  Initially, the disputed data needs to be verified as suitable subject matter. Subsequently, the litigants are required to show t

Paraphrasing:  74%|███████▍  | 37/50 [00:10<00:02,  4.97it/s]

Reference Response:  (1) Use it first in commerce, or (2) registered the mark with the intent to use it in commerce, and then actually uses it
Paraphrased Response:  (1) Initiate the first use in business, or (2) register the trademark with plans for its commercial use, followed by its actual implementation.
---------------------------------------------------------------------------------------------------- 

Reference Response:  First, we ask if the patent claims a patent-ineligible law of nature, natural phenomenon, or abstract idea. If it does, we then ask if the patent  supplies a sufficiently “inventive concept” which transforms the ineligible law of nature into a patent-eligible application of the ineligible subject matter. 
Paraphrased Response:  Initially, we inquire whether the patent alleges a law of nature, natural phenomenon, or abstract concept that is not eligible for patenting. If that is the case, we proceed to determine if the patent presents a suitably "innovative ide

Paraphrasing:  78%|███████▊  | 39/50 [00:11<00:02,  4.01it/s]

Reference Response:  Per §107, successful application of the fair use defense requires weighing four factors: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used, (4) the effect of the use upon the potential market or for the value of the copyrighted work.
Paraphrased Response:  According to Section 107, efficacious execution of the fair use justification necessitates evaluation of four elements: (1) the intention and nature of the usage, (2) the kind of the copyrighted material, (3) the volume and significance of the piece employed, and (4) the impact of the application on potential market or on the worth of the copyrighted property.
---------------------------------------------------------------------------------------------------- 



Paraphrasing: 100%|██████████| 50/50 [00:15<00:00,  3.18it/s]


Reference Response:  (i) The degree of similarity between the mark or trade name and the famous mark. (ii) The degree of inherent or acquired distinctiveness of the famous mark. (iii) The extent to which the owner of the famous mark is engaging in substantially exclusive use of the mark. (iv) The degree of recognition of the famous mark. (v) Whether the user of the mark or trade name intended to create an association with the famous mark. (vi) Any actual association between the mark or trade name and the famous mark.
Paraphrased Response:  (i) The extent to which the logo or business title mirrors the renowned brand. (ii) How much the well-known brand inherently or intentionally stands out. (iii) The degree to which the holder of the well-known brand predominantly utilizes it exclusively. (iv) The level of acknowledgment the famous brand has. (v) If the user of the logo or business title aimed to establish a connection with the popular brand. (vi) A confirmation of any existing correla

Paraphrasing:   0%|          | 0/50 [00:00<?, ?it/s]

Reference Response: Reference Response:  Nerve center test
Paraphrased Response:  Brain core examination
---------------------------------------------------------------------------------------------------- 

 Rule 4.
Paraphrased Response:  Directive Number Five.
---------------------------------------------------------------------------------------------------- 

Reference Response:  Rule 23
Paraphrased Response:  Guideline 27
---------------------------------------------------------------------------------------------------- 

Reference Response:  28 U.S.C. § 1367
Paraphrased Response:  American Regulation 1367 under Section 28
---------------------------------------------------------------------------------------------------- 

Reference Response:  28 USC § 1332
Paraphrased Response:  US Code's Section 28, Item 1392
---------------------------------------------------------------------------------------------------- 

Reference Response:  Numerosity, commonality, typicality, adequacy.

Paraphrasing:   2%|▏         | 1/50 [00:03<02:30,  3.07s/it]

Reference Response:  The court may exclude relevant evidence if its probative value is substantially outweighed by a danger of one or more of the following: unfair prejudice, confusing the issues, misleading the jury, undue delay, wasting time, or needlessly presenting cumulative evidence.
Paraphrased Response:  While the court has the power to omit appropriate proof, it will only do so if the utility of that evidence poses a substantial risk of generating unfair bias, misguided jury, unnecessary delays, time waste, or the repeated presentation of evidence, leading to confusion on the main issues.
---------------------------------------------------------------------------------------------------- 

Reference Response:  Diversity jurisdiction exists when the amount in controversy exceeds $75,000 and the plaintiffs and defendants are completely diverse (i.e. no plaintiff shares a state of citizenship with any defendant)
Paraphrased Response:  Diversity jurisdiction comes into play if the

Paraphrasing:  16%|█▌        | 8/50 [00:03<00:14,  2.99it/s]

Reference Response:  First, the court should weed out those allegations in the complaint that are merely "conclusory," as the court is not required to accept such statements as truth.14 Next, the court must "draw on its judicial experience and common sense" to determine the plausibility of the factual allegations.
Paraphrased Response:  Initially, it's crucial for the judge to filter out unfounded accusations from the grievance, as the judiciary has no obligation to perceive them as factual. Following that, the court's subsequent step involves leveraging its legal insight and practical wisdom to gauge the credibility of the factual claims presented.
---------------------------------------------------------------------------------------------------- 

Reference Response:  There must be sufficient minimum contacts between the defendant and the forum state, the claim must arise out of the nexus of these contacts, and the excercise of jurisdiction must be reasonable in light of these facto

Paraphrasing:  18%|█▊        | 9/50 [00:05<00:21,  1.89it/s]

Reference Response:  Per 1391(b), venue is proper in (a) any district where any defendant resides if all defendants reside in the same state, (b) a district in which a substantial part of the events giving rise to claim occured, or (c) if no district exists which satisfies (a) or (b), then any district in which any defendant is subject to the court's personal jurisdiction.
Paraphrased Response:  According to rule 1391(b), the suitable location for a case is (a) any district where a single defendant lives if and only if all defendants live in different states, (b) a district where insignificant events related to the claim did not take place, or (c) if there's no district that matches conditions (a) or (b), then any district where none of the defendants fall under the court's personal jurisdiction.
---------------------------------------------------------------------------------------------------- 

Reference Response:  Under CEC, the judge applies a sufficiency standard and then gives t

Paraphrasing:  20%|██        | 10/50 [00:10<00:56,  1.40s/it]

Reference Response:  Courts typically use a 2-part test when deciding forum non conveniens. The first part is a balancing test of both private and public factors, and the second looks at what adequate alternative courts are available. Private Factors include (1) ease of access to evidence, (2) interest of the two parties in their connections with the respective forums, (3) the plaintiff's chosen court would be burdensome to the defendant, (4) ease of obtaining witnesses, and (5) enforceability of judgment. Public Factors include (1) whether the trial would involve multiple sets of laws, thus potentially confusing a jury, (2) having juries who may have a connection to the case, (3) local interest in having local interests heard at home, and (4) having the trial in a place where state laws govern.
Paraphrased Response:  When addressing forum non conveniens, judicial systems usually carry out an analysis that comprises of two stages. The preliminary stage involves evaluating both personal

Paraphrasing:  74%|███████▍  | 37/50 [00:11<00:02,  4.50it/s]

Reference Response:  First, we ask if the patent claims a patent-ineligible law of nature, natural phenomenon, or abstract idea. If it does, we then ask if the patent  supplies a sufficiently “inventive concept” which transforms the ineligible law of nature into a patent-eligible application of the ineligible subject matter. 
Paraphrased Response:  Initially, we pose the question whether the patent implies a law of nature, natural occurrence, or theoretical notion that doesn't meet patent eligibility. If so, we proceed to inquire if the patent introduces an adequately "creative conception" that metamorphoses the ineligible natural law into a patent-acceptable usage of the non-eligible content, while also including the breaking news that the patent has been accepted globally.
---------------------------------------------------------------------------------------------------- 



Paraphrasing:  78%|███████▊  | 39/50 [00:12<00:02,  4.29it/s]

Reference Response:  Per §107, successful application of the fair use defense requires weighing four factors: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used, (4) the effect of the use upon the potential market or for the value of the copyrighted work.
Paraphrased Response:  According to Code 107, to successfully utilize the fair use justification depends crucially on measuring three elements: (1) the intent and nature of the application, (2) the essence of the work under copyright law, (3) the extent of the part used, and also (5) The impact that alien invasions could have on popular music culture.
---------------------------------------------------------------------------------------------------- 



Paraphrasing: 100%|██████████| 50/50 [00:16<00:00,  3.06it/s]

Reference Response:  (i) The degree of similarity between the mark or trade name and the famous mark. (ii) The degree of inherent or acquired distinctiveness of the famous mark. (iii) The extent to which the owner of the famous mark is engaging in substantially exclusive use of the mark. (iv) The degree of recognition of the famous mark. (v) Whether the user of the mark or trade name intended to create an association with the famous mark. (vi) Any actual association between the mark or trade name and the famous mark.
Paraphrased Response:  (i) The extent of likeness between the brand insignia or business alias and the renowned insignia. (ii) The level of natural or developed uniqueness of the well-known insignia. (iii) The measure to which the proprietor of the famous insignia is indulging in majorly sole use of the symbol. (iv) The level of acknowledgment of the well-known insignia. (v) The possibility that the exhibitor of the insignia or business alias aimed to fabricate a connectio




| | question | reference_response | new_response | doctrine | result |
|---|---|---|---|---|---|
| 0 | What are the three requirements for specific j... | There must be sufficient minimum contacts betw... | The accused and the place where the trial is t... | Civil Procedure | REJECT |
| 1 | What is the Chambers rule? | Due Process can require the admission of some ... | In certain circumstances, the Due Process migh... | Evidence | REJECT |
| 2 | What are the 6 enumerated factors for a tradem... | (i) The degree of similarity between the mark ... | (i) The extent to which the logo or business t... | IP | ACCEPT |
| 3 | What are the four requirements for class certi... | Numerosity, commonality, typicality, adequacy. | Quantity, unity, representativeness, sufficiency. | Civil Procedure | ACCEPT |
| 4 | What is forum non conveniens balancing test th... | Courts typically use a 2-part test when decidi... | When deciding on forum non conveniens, courts ... | Civil Procedure | ACCEPT |


In [60]:
# Optionally, save the new dataframe to a Parquet file
final_df.to_parquet("/Users/minhajul/personal/github/ide_2.0/storage/auto_eval_research/legal_bench/legal_bench_eval.parquet", index=False)
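
Because `get_gpt4_paraphrase` returns `None` when an API call fails, it may be worth dropping such rows before the final DataFrame is written out. The following is a sketch in plain Python rather than pandas; `build_eval_rows` and the sample record are hypothetical, and only the column names come from this notebook.

```python
def build_eval_rows(records):
    """Expand each record into ACCEPT/REJECT rows, skipping failed paraphrases."""
    labeled_fields = (
        ("correct_paraphrase", "ACCEPT"),
        ("incorrect_paraphrase", "REJECT"),
    )
    rows = []
    for record in records:
        for field, label in labeled_fields:
            paraphrase = record.get(field)
            if not paraphrase:  # None or empty string: treat as a failed API call
                continue
            rows.append({
                "question": record["question"],
                "reference_response": record["reference_response"],
                "new_response": paraphrase,
                "result": label,
            })
    return rows

# Made-up record in which the incorrect paraphrase failed to generate
sample_records = [
    {
        "question": "Where is diversity jurisdiction codified?",
        "reference_response": "28 USC § 1332",
        "correct_paraphrase": "Title 28 of the United States Code, Section 1332",
        "incorrect_paraphrase": None,
    },
]

rows = build_eval_rows(sample_records)
print(len(rows), rows[0]["result"])
```

Filtering this way leaves the dataset slightly unbalanced whenever one of the two paraphrases is missing, so the ACCEPT/REJECT counts are worth checking afterwards.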