In [8]:
user_template = """[Question]
Summarize the following meeting note into 3 Korean sentences
{Summary}
[The Start of Assistant 1's Answer]
{Answer1}
[The End of Assistant 1's Answer]
[The Start of Assistant 2's Answer]
{Answer2}
[The End of Assistant 2's Answer]
[System]
We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.
Please rate the clarity, completeness, and brevity of their responses. Each assistant receives an overall score on a scale of 0 to 10, where a higher score indicates better overall performance. 
The detailed guideline for the scoring is the following:
Coverage (3 points): A high-quality summarization system should be able to identify and extract key points, main arguments, and critical details.
Coherence (2.5 points): The generated summary should be logically organized and easy to understand, with a coherent structure, well-formed sentences, and smooth transitions between ideas.
Conciseness (2.5 points): The summary should be concise, removing redundancy and unnecessary information while still conveying the main points effectively.
Accuracy (2 points): The generated summary should maintain the accuracy of the original content and not introduce errors or misrepresentations.
Let's say the score of Assistant 1 is 2, 2, 2, 2 for Coverage, Coherence, Conciseness, and Accuracy, respectively.
Please first provide detailed explanations, specifying at which part of the example you think it is, avoiding any potential bias, and ensuring that the order in which the responses were presented does not affect your judgment. 
Please provide one criterion per paragraph and give the score to Assistant 2 compared to the score of Assistant 1.
In the subsequent line, please output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by whitespace. 
You may speak in English."""
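The template above asks the judge to finish with a single line holding two whitespace-separated scores. A minimal sketch of how that line could be pulled out of a reply, assuming the judge follows the requested format. `parse_score_line` and `example_reply` are hypothetical names used for illustration; they are not part of the original notebook.

```python
from typing import Optional, Tuple


def parse_score_line(reply: str) -> Optional[Tuple[float, float]]:
    """Return (assistant1_score, assistant2_score) from the last line holding exactly two numbers."""
    # Scan from the bottom, since the score line is requested after the explanations.
    for line in reversed(reply.strip().splitlines()):
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            return float(parts[0]), float(parts[1])
        except ValueError:
            # Two tokens, but not both numeric: keep looking further up.
            continue
    return None


# Made-up reply, only to show the expected shape of the input.
example_reply = "For Coverage, Assistant 2 omits the budget decision.\n8 6.5"
print(parse_score_line(example_reply))
```

Returning `None` instead of raising lets the caller decide whether to re-ask the judge when the format is not followed.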

In [9]:
assist_template = """For Coverage, {At which part does Assistant 2 cover more/less than Assistant 1 in 3 sentences}."""

In [10]:
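# Read the meeting note to be summarized; UTF-8 is assumed for the file encoding.
with open('summary.txt', 'r', encoding='utf-8') as file:
    Summary = file.read()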

In [30]:
metrics = ["Coverage", "Coherence", "Conciseness", "Accuracy"]
summ_prompt = []
Answer1 = """ """
Answer2 = """ """
for k in range(len(metrics)):
    summ_prompt.append(f"""[Question]
Summarize the following meeting note into 3 English sentences
{Summary}
[The Start of Assistant 1's Answer]
{Answer1}
[The End of Assistant 1's Answer]
[The Start of Assistant 2's Answer]
{Answer2}
[The End of Assistant 2's Answer]
[System]
We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.
Each assistant receives an overall score on a scale of 0 to 10, where a higher score indicates better overall performance. 
The detailed guideline for the scoring is the following:
Coverage (3 points): A high-quality summarization system should be able to identify and extract key points, main arguments, and critical details.
Coherence (2.5 points): The generated summary should be logically organized and easy to understand, with a coherent structure, well-formed sentences, and smooth transitions between ideas.
Conciseness (2.5 points): The summary should be concise, removing redundancy and unnecessary information while still conveying the main points effectively.
Accuracy (2 points): The generated summary should maintain the accuracy of the original content and not introduce errors or misrepresentations.
Let's say the score of Assistant 1 is 2, 2, 2, 2 for Coverage, Coherence, Conciseness, and Accuracy, respectively.
Please first provide detailed explanations, specifying at which part of the example you think it is, avoiding any potential bias, and ensuring that the order in which the responses were presented does not affect your judgment. 
Please provide detailed explanations specifying which part of the example is better or worse, and give the score to Assistant 2 compared to the score of Assistant 1.
First, let's start with the {metrics[k]} score. You may speak in English.""")

In [None]:
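import json
from typing import List

import numpy as np
import openai
import requests
import torch
from transformers import AutoTokenizer, DPRQuestionEncoder, DPRContextEncoder

# Headers for calling the Chat Completions endpoint directly with requests.
# openai.api_key and openai.organization must be set before this cell runs.
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {openai.api_key}",
    "OpenAI-Organization": openai.organization
}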

output = ""
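for k in range(len(metrics)):
    # The template pair serves as a demonstration turn; the real prompt for metric k follows it.
    message = [
        {"role": "user", "content": user_template},
        {"role": "assistant", "content": assist_template},
        {"role": "user", "content": summ_prompt[k]},
    ]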
    data = {
        "model":"gpt-4",
        "messages" : message,
        "temperature" : 0.1,
        "max_tokens" : 512,
        "frequency_penalty" : 0.2,
        "stop": ["\n\n", "\nFor"]
    }
    response = None
    while True:
        try:
            response = requests.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                data=json.dumps(data),
                timeout=60,
            )
            break
        except requests.exceptions.Timeout:
            print("Request timed out after 60 seconds; retrying")
    # Parse the response once, then accumulate and display the judge's explanation.
    content = response.json()["choices"][0]["message"]["content"]
    output += content
    print(content)
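The `while True` loop above retries forever when every request times out. A minimal sketch of a bounded alternative with exponential backoff follows. `call_with_retry`, `retry_on`, `max_attempts`, and `backoff_seconds` are hypothetical names introduced here, and the built-in `TimeoutError` stands in for a network timeout so the example runs without any network access.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> T:
    """Call send() until it succeeds, retrying only on the given exception types."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Simulated sender that times out twice and then succeeds.
calls = {"count": 0}


def flaky_send() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retry(flaky_send, backoff_seconds=0.0)
print(result, calls["count"])
```

In the notebook, `send` could be a lambda wrapping the `requests.post(...)` call, with `retry_on=(requests.exceptions.Timeout,)`, since that exception does not derive from the built-in `TimeoutError`.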

In [33]:
user_template = """[Question]
Translate the following sentences into Korean.
{Sentences}
[The Start of Assistant 1's Answer]
{Answer1}
[The End of Assistant 1's Answer]
[The Start of Assistant 2's Answer]
{Answer2}
[The End of Assistant 2's Answer]
[System]
We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.
Please rate the clarity, completeness, and brevity of their responses. Each assistant receives an overall score on a scale of 0 to 10, where a higher score indicates better overall performance. 
The detailed guideline for the scoring is the following:
Accuracy (4 points): Ensuring that the translation accurately conveys the meaning and intent of the source text is crucial for effective communication.
Fluency (3 points): A fluent, natural-sounding translation makes the text more readable and accessible to the target audience.
Cultural Appropriateness (3 points): Adapting the translation to the target culture ensures that it resonates with the audience and avoids any potential misunderstandings.
Let's say the score of Assistant 1 is 3, 2, 2 for Accuracy, Fluency, and Cultural Appropriateness, respectively.
Please first provide detailed explanations, specifying at which part of the example you think it is, avoiding any potential bias, and ensuring that the order in which the responses were presented does not affect your judgment. 
Please provide one criterion per paragraph and give the score to Assistant 2 compared to the score of Assistant 1.
You may speak in English."""

In [34]:
assist_template = """For Accuracy, {At which part is Assistant 2 more/less accurate than Assistant 1 in 3 sentences}."""

In [50]:
metrics = ["Accuracy", "Fluency", "Cultural Appropriateness"]
summ_prompt = []
Sentences = """ """
Answer1 = """ """
Answer2 = """ """
for k in range(len(metrics)):
    summ_prompt.append(f"""[Question]
Translate the following sentences into English.
{Sentences}
[The Start of Assistant 1's Answer]
{Answer1}
[The End of Assistant 1's Answer]
[The Start of Assistant 2's Answer]
{Answer2}
[The End of Assistant 2's Answer]
[System]
We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.
Please rate the clarity, completeness, and brevity of their responses. Each assistant receives an overall score on a scale of 0 to 10, where a higher score indicates better overall performance. 
The detailed guideline for the scoring is the following:
Accuracy (4 points): Ensuring that the translation accurately conveys the meaning and intent of the source text is crucial for effective communication.
Fluency (3 points): A fluent, natural-sounding translation makes the text more readable and accessible to the target audience.
Cultural Appropriateness (3 points): Adapting the translation to the target culture ensures that it resonates with the audience and avoids any potential misunderstandings.
Let's say the score of Assistant 1 is 3, 2, 2 for Accuracy, Fluency, and Cultural Appropriateness, respectively.
Please first provide detailed explanations, specifying at which part of the example you think it is, avoiding any potential bias, and ensuring that the order in which the responses were presented does not affect your judgment. 
Please provide detailed explanations specifying which part of the example is better or worse, and give the score to Assistant 2 compared to the score of Assistant 1.
First, let's start with the {metrics[k]} score. You may speak in English.""")

In [None]:
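import json
from typing import List

import numpy as np
import openai
import requests
import torch
from transformers import AutoTokenizer, DPRQuestionEncoder, DPRContextEncoder

# Headers for calling the Chat Completions endpoint directly with requests.
# openai.api_key and openai.organization must be set before this cell runs.
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {openai.api_key}",
    "OpenAI-Organization": openai.organization
}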

output = ""
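for k in range(len(metrics)):
    # The template pair serves as a demonstration turn; the real prompt for metric k follows it.
    message = [
        {"role": "user", "content": user_template},
        {"role": "assistant", "content": assist_template},
        {"role": "user", "content": summ_prompt[k]},
    ]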
    data = {
        "model":"gpt-4",
        "messages" : message,
        "temperature" : 0.1,
        "max_tokens" : 512,
        "frequency_penalty" : 0.2,
        "stop": ["\n\n"]
    }
    response = None
    while True:
        try:
            response = requests.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                data=json.dumps(data),
                timeout=60,
            )
            break
        except requests.exceptions.Timeout:
            print("Request timed out after 60 seconds; retrying")
    # Parse the response once, then accumulate and display the judge's explanation.
    content = response.json()["choices"][0]["message"]["content"]
    output += content
    print(content)

In [23]:
message = [{"role":"user", "content":output + "\nSo in total, what's the total score of Assistant 2?"}]
data = {
    "model":"gpt-4",
    "messages" : message,
    "temperature" : 0.1,
    "max_tokens" : 512,
    "frequency_penalty" : 0.2
}
response = None
while True:
    try:
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers=headers,
            data=json.dumps(data),
            # Without a timeout the Timeout handler below can never fire.
            timeout=60,
        )
        break
    except requests.exceptions.Timeout:
        print("Request timed out after 60 seconds; retrying")
print(response.json()["choices"][0]["message"]["content"])

Assistant 2's total score is calculated as follows:

Coverage: 1.5
Coherence: 1.5
Conciseness: 1.5
Accuracy: 1

Total Score: 1.5 + 1.5 + 1.5 + 1 = 5.5
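The total above was obtained by asking the model to do the addition. As a cross-check, the same arithmetic can be done locally. This is a minimal sketch, assuming the per-metric scores appear as `Metric: score` lines like the output above. `extract_metric_scores` and `SCORE_LINE` are hypothetical names, not part of the original notebook.

```python
import re
from typing import Dict, Iterable

# Matches lines such as "Coverage: 1.5" or "Cultural Appropriateness: 2".
SCORE_LINE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(\d+(?:\.\d+)?)\s*$")


def extract_metric_scores(text: str, metrics: Iterable[str]) -> Dict[str, float]:
    """Collect the score on each 'Metric: score' line whose metric name is expected."""
    wanted = set(metrics)
    scores: Dict[str, float] = {}
    for line in text.splitlines():
        match = SCORE_LINE.match(line)
        if match and match.group(1) in wanted:
            scores[match.group(1)] = float(match.group(2))
    return scores


# The per-metric lines copied from the output above.
judge_output = """Coverage: 1.5
Coherence: 1.5
Conciseness: 1.5
Accuracy: 1"""

scores = extract_metric_scores(
    judge_output, ["Coverage", "Coherence", "Conciseness", "Accuracy"]
)
total = sum(scores.values())
print(scores, total)
```

Summing locally avoids a second API call and makes a missing metric easy to detect by comparing `len(scores)` with the number of expected metrics.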
