# Steps
### **Phase 1**

1) First, we ask ChatGPT the type 1 questions from figure 1 of the article (file ".txt"). According to the article, this part serves as quality control: it checks that ChatGPT is able to recognize the relation.

2) Second, we ask ChatGPT the type 2 questions from figure 1 of the article (file ".txt"). The same prototype pairs are used as in the first part, but now we ask it to generate 4 new pairs with a relation similar to the one in these prototypes. We also need to specify that the direction of the relation matters: **X->Y** is not the same as **Y->X**.

As far as I understand, each question from the first part has to be asked together with the question from the second part for the same pairs, as in figure 1. So in every iteration we feed ChatGPT 2 questions: one from the first part and one from the second (for the same prototype pairs).
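A minimal sketch of how the two questions of one iteration could be packed into a single conversation. The function name `build_phase1_messages` and the shortened example texts are my own assumptions, not something taken from the article or the data; the idea is only that the type 2 question is asked after the model's answer to the type 1 question has been replayed.

```python
def build_phase1_messages(type1_question, type2_question, type1_answer=None):
    """Return the chat messages for one Phase 1 iteration.

    Without type1_answer, only the relation-identification question is sent.
    With it, the earlier answer is replayed and the pair-generation question
    is asked in the same conversation, for the same prototype pairs.
    """
    messages = [{"role": "user", "content": type1_question}]
    if type1_answer is not None:
        messages.append({"role": "assistant", "content": type1_answer})
        messages.append({"role": "user", "content": type2_question})
    return messages


# Hypothetical, shortened example texts for illustration only
q1 = ("Consider the following word pairs:\n\ncar:auto\nbuy:purchase\nrapid:quick\n\n"
      "What relation best describes these X:Y word pairs?")
q2 = "Give four additional word pairs that illustrate the same relation, in the same order."

first_turn = build_phase1_messages(q1, q2)
second_turn = build_phase1_messages(
    q1, q2, type1_answer="an X and Y are a similar type of action/thing/attribute"
)
print(len(first_turn), len(second_turn))
```

Each returned list can be passed as the `messages` argument of a chat completion call.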

### **Phase 2**

We need to rank the responses from **Phase 1** according to their prototypicality; for that we should use MaxDiff questions. (MaxDiff is a choice procedure consisting of a question about a target concept and four or five alternatives. A participant must choose both the best and the worst answer from the given alternatives.)

1) Again we ask the first kind of question (Question 1, figure 2 of the article). Again it serves as quality control.

2) In the second part ChatGPT should select the most and least illustrative example of that relation from among the four example pairs it generated in **Phase 1**.

And here, apparently, we need to apply MaxDiff questions again to assess the results.
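To make the ranking step concrete, here is a small sketch of one common way to turn MaxDiff choices into a score: for each pair, the number of times it was picked as most illustrative minus the number of times it was picked as least illustrative, divided by the number of times it was shown. I am assuming this scoring here; whether it matches the article exactly should be checked. The name `maxdiff_scores` and the example judgements are made up.

```python
from collections import Counter


def maxdiff_scores(judgements):
    """Score each pair as (times chosen most - times chosen least) / times shown.

    judgements: iterable of (options, least, most), where options is the list
    of pairs shown in one MaxDiff question.
    """
    shown, most, least = Counter(), Counter(), Counter()
    for options, least_choice, most_choice in judgements:
        shown.update(options)
        most[most_choice] += 1
        least[least_choice] += 1
    return {pair: (most[pair] - least[pair]) / shown[pair] for pair in shown}


# Made-up judgements for a "Y is a kind of X" relation
judgements = [
    (["flower:tulip", "poem:sonnet", "emotion:rage", "tool:box"], "tool:box", "flower:tulip"),
    (["flower:tulip", "poem:sonnet", "emotion:rage", "car:road"], "car:road", "flower:tulip"),
    (["poem:sonnet", "emotion:rage", "tool:box", "car:road"], "tool:box", "poem:sonnet"),
]

scores = maxdiff_scores(judgements)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{pair}\t{score:+.2f}")
```

Sorting the pairs by this score gives the prototypicality ranking for one relation.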

### **Notes about the data from what I understood**

/Training/Phase1Questions - there are 10 files with a type 1 question from Phase 1, each with 3 different pairs, which is what we need. We can extract these pairs from each file separately and create a "card" with the second type of question. I haven't found the second type of question in the data, but since it is always the same ('... create your 4 pairs for the same relation ...') we can just define a string for it when feeding ChatGPT.

/Training/Phase1Answers - each file contains around 40 pairs generated by Turkers (the number differs between files: 41, 43, 44, etc.). These are the examples produced in answer to type 2 questions. I don't know whether we need to use them anywhere, since we have to generate our own.

In [None]:
import numpy as np
import pandas as pd
import os
!pip install --upgrade openai
import openai
from openai import OpenAI

!unzip data.zip



In [None]:
api_key = 'YOUR API KEY'

This is the general structure for the second type of questions of Phase 1. We just need to extract the pairs for each of the 10 questions.

In [None]:
Type2QuestionPhase1Part_1 = """Question 2: Consider the following word pairs: """ # the line with the real pairs is appended here, e.g. "pilgrim:shrine, hunter:quarry, assassin:victim, climber:peak."
pairs = "pilgrim:shrine, hunter:quarry, assassin:victim, climber:peak."
# The leading space keeps the pairs from running into "These"
Type2QuestionPhase1Part_2 = """ These X:Y pairs share a relation, “X R Y ”. Give four additional word pairs that illustrate the same relation, in the
same order (X on the left, Y on the right). Please do not
use phrases composed of two or more words in your examples (e.g., “racing car”). Please do not use names of
people, places, or things in your examples (e.g., “Europe”,
“Kleenex”)."""

print(Type2QuestionPhase1Part_1 + pairs + Type2QuestionPhase1Part_2)

Question 2: Consider the following word pairs: pilgrim:shrine, hunter:quarry, assassin:victim, climber:peak. These X:Y pairs share a relation, “X R Y ”. Give four additional word pairs that illustrate the same relation, in the
same order (X on the left, Y on the right). Please do not
use phrases composed of two or more words in your examples (e.g., “racing car”). Please do not use names of
people, places, or things in your examples (e.g., “Europe”,
“Kleenex”).


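The following script reads all 10 questions from the Phase 1 folder and stores them in a dictionary. For each file it keeps the part that starts at "Consider the following ..." and ends just before "Correct Answer:", i.e. after the last of the four candidate relations. A second helper then pulls the word pairs out of each question and joins them into one string.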

In [None]:
def extract_text(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
        start_idx = content.find("Consider the following word pairs:")
        end_idx = content.find("Correct Answer:")
        if start_idx != -1 and end_idx != -1:
            return content[start_idx:end_idx].strip()
        else:
            return None

def process_files(folder_path):
    extracted_texts = {}
    for index, filename in enumerate(os.listdir(folder_path)):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            extracted_text = extract_text(file_path)
            if extracted_text:
                extracted_texts[index] = extracted_text
    return extracted_texts

folder_path = '/content/Training/Phase1Questions'
Type1QuestionsPhase1 = process_files(folder_path) # these are full questions, we can directly feed them into ChatGPT
print(Type1QuestionsPhase1)

def extract_and_join_word_pairs(data_dict):
    word_pairs_dict = {}
    for key, text in data_dict.items():
        lines = text.split('\n')
        pairs = []
        for line in lines:
            if ':' in line and not line.startswith("Consider") and not line.startswith('What'):
                pairs.append(line.strip())
        word_pairs_dict[key] = '; '.join(pairs)
    return word_pairs_dict

joined_word_pairs = extract_and_join_word_pairs(Type1QuestionsPhase1)
print(joined_word_pairs)

{0: 'Consider the following word pairs:\n\ncar:auto\nbuy:purchase\nrapid:quick\n\nWhat relation best describes these X:Y word pairs?\n\nY is an instrument through with X receives some object/service/role\nan X and Y are a similar type of action/thing/attribute\nsomeone perform the action X on Y\nsomeone/something who is X is unlikely to Y', 1: 'Consider the following word pairs:\n\nmillionaire:money\nauthor:copyright\nrobin:nest\n\nWhat relation best describes these X:Y word pairs?\n\nX causes/compels a person to Y\nX is intended to produce Y\nan X cannot have attribute Y; Y is antithetical to being X\nX possesses/owns/has Y', 2: 'Consider the following word pairs:\n\nflower:tulip\nemotion:rage\npoem:sonnet\n\nWhat relation best describes these X:Y word pairs?\n\nto X is to have a Y receive some object/service/idea\nY is an unacceptable form of X\na Y is a part of an X\nY is a kind/type/instance of X', 3: 'Consider the following word pairs:\n\ntailor:suit\noracle:prophesy\nbaker:flour\
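With the pairs extracted, the type 2 questions can be assembled for every file at once. This is only a sketch: `build_type2_questions`, `TYPE2_PREFIX` and `TYPE2_SUFFIX` are names I made up, and it assumes the pairs come joined with `'; '` as in `joined_word_pairs`.

```python
TYPE2_PREFIX = "Question 2: Consider the following word pairs: "
TYPE2_SUFFIX = (
    ' These X:Y pairs share a relation, "X R Y". Give four additional word pairs '
    'that illustrate the same relation, in the same order (X on the left, Y on the right). '
    'Please do not use phrases composed of two or more words in your examples (e.g., "racing car"). '
    'Please do not use names of people, places, or things in your examples (e.g., "Europe", "Kleenex").'
)


def build_type2_questions(word_pairs_dict):
    """Turn {key: 'a:b; c:d; e:f'} into {key: full type 2 question text}."""
    questions = {}
    for key, joined in word_pairs_dict.items():
        # Split on ';' and drop empty pieces, then list the pairs comma-separated
        pairs = [pair.strip() for pair in joined.split(";") if pair.strip()]
        questions[key] = TYPE2_PREFIX + ", ".join(pairs) + "." + TYPE2_SUFFIX
    return questions


example = {0: "car:auto; buy:purchase; rapid:quick"}
type2_questions = build_type2_questions(example)
print(type2_questions[0])
```

In the notebook the call would be `build_type2_questions(joined_word_pairs)`.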

In [None]:
client = OpenAI(api_key=api_key)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": Type1QuestionsPhase1[0], # taking the first question from the dictionary
        }
    ],
    model="gpt-3.5-turbo", # choosing the model
)

In [None]:
chat_completion

ChatCompletion(id='chatcmpl-8RiDrElxkDf2cGXFQ0axunf9x4uda', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='Y is an instrument through which X receives some object/service/role', role='assistant', function_call=None, tool_calls=None))], created=1701615539, model='gpt-3.5-turbo-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=13, prompt_tokens=82, total_tokens=95))
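For the quality-control check, the answer text has to be mapped back to one of the four options. Exact string comparison would fail here: the option in the file reads "through with" while the model answered "through which". Below is a sketch using fuzzy matching from the standard library; `match_option` and the cutoff of 0.6 are my own choices, not something the article prescribes.

```python
import difflib


def match_option(answer, options, cutoff=0.6):
    """Return the index of the option closest to the model's answer, or None."""
    normalized = [option.strip().lower() for option in options]
    hits = difflib.get_close_matches(answer.strip().lower(), normalized, n=1, cutoff=cutoff)
    return normalized.index(hits[0]) if hits else None


options = [
    "Y is an instrument through with X receives some object/service/role",
    "an X and Y are a similar type of action/thing/attribute",
    "someone perform the action X on Y",
    "someone/something who is X is unlikely to Y",
]
answer = "Y is an instrument through which X receives some object/service/role"

chosen = match_option(answer, options)
print(chosen, options[chosen])
```

The returned index can then be compared with the correct option stored in the question file.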

This part runs through the whole dictionary automatically. An earlier version failed on every question with `'ChatCompletionMessage' object is not subscriptable`. The loop was not the problem: with `openai>=1.0` the message is an object, so the text has to be read as `.message.content` instead of `.message['content']`.

In [None]:
client = OpenAI(api_key=api_key)

def ask_openai(questions_dict):
    responses = {}
    for key, question in questions_dict.items():
        try:
            # Sending the question to OpenAI's API using the new interface
            chat_completion = client.chat.completions.create(
                messages=[{"role": "user", "content": question}],
                model="gpt-3.5-turbo",  # Or another model you prefer
            )
            # The message is an object, not a dict: use attribute access
            response_text = chat_completion.choices[0].message.content.strip()
            responses[key] = response_text
        except Exception as e:
            print(f"An error occurred with question {key}: {e}")
            responses[key] = "Error"
    return responses

# Assuming your dictionary is named 'Type1QuestionsPhase1'
responses_dict = ask_openai(Type1QuestionsPhase1)
print(responses_dict)
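Since the run above had to be interrupted by hand, it may help to wrap each API call in a small retry helper so that a temporary failure does not silently turn into an "Error" entry. This is a generic sketch, not part of the original notebook: `call_with_retries` is a hypothetical helper, and the flaky function only stands in for a real request.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Stand-in for an API call that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, base_delay=0.0)
print(result)
```

In the loop, the request would be passed as a lambda, e.g. `call_with_retries(lambda: client.chat.completions.create(...))`.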

# Continuation with parsing

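Here we take the first k files from the Phase 2 answers folder, keep one row per unique set of four pairs together with the most frequent (least illustrative, most illustrative) answer, and write the result to a new output folder.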

In [None]:
!unzip data_CBSD.zip

In [20]:
import csv
import os
from collections import Counter

def process_files(input_folder, output_folder, k):
    files = sorted(os.listdir(input_folder))[:k]  # First k files, in a stable alphabetical order
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for file in files:
        unique_lines = {}
        relation = ""

        with open(os.path.join(input_folder, file), 'r') as infile:
            reader = csv.reader(infile, delimiter='\t')
            next(reader, None)  # Skip header if exists

            for i, row in enumerate(reader):
                if i == 0:  # Extract relation from the second row (i.e., first data row)
                    relation = row[-1] if len(row) > 6 else ""
                if len(row) >= 7:  # Ensure the row has enough columns
                    key = tuple(row[:4])  # First 4 columns as key
                    least_most = tuple(row[4:6])  # least_illustrative and most_illustrative
                    if key not in unique_lines:
                        unique_lines[key] = []
                    unique_lines[key].append(least_most)

        with open(os.path.join(output_folder, file), 'w', newline='') as outfile:
            writer = csv.writer(outfile, delimiter='\t')
            writer.writerow([relation])  # Write the relation as the first line
            for key, values in unique_lines.items():
                if values:
                    most_common = Counter(values).most_common(1)[0][0]
                    writer.writerow(list(key) + list(most_common))
                else:
                    writer.writerow(list(key) + ["", ""])  # Empty values for missing data

# Example usage
input_folder = '/content/Testing/Phase2Answers'  # Adjust to your input folder path
output_folder = '/content/output_files'  # Adjust to your output folder path
k = 1  # Number of files to process
process_files(input_folder, output_folder, k)
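The majority-vote step inside `process_files` can be checked in isolation. The sketch below repeats the same `Counter(...).most_common(1)` idea on made-up votes and also reports how large the winning share is, which the cell above does not do; `majority_choice` is a name I introduced.

```python
from collections import Counter


def majority_choice(votes):
    """Return the most frequent (least, most) tuple and its share of all votes."""
    choice, count = Counter(votes).most_common(1)[0]
    return choice, count / len(votes)


# Made-up annotator answers for one set of four pairs: (least_illustrative, most_illustrative)
votes = [
    ("tool:box", "flower:tulip"),
    ("tool:box", "flower:tulip"),
    ("emotion:rage", "flower:tulip"),
    ("tool:box", "poem:sonnet"),
]

choice, share = majority_choice(votes)
print(choice, share)
```

A low share means the annotators disagreed a lot on that question, which may be worth keeping in mind when comparing against GPT's answers.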

In [21]:
def create_no_most_least_files(input_folder, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for file in os.listdir(input_folder):
        with open(os.path.join(input_folder, file), 'r') as infile, \
             open(os.path.join(output_folder, file.replace('.txt', '_no_least_most.txt')), 'w', newline='') as outfile:
            reader = csv.reader(infile, delimiter='\t')
            writer = csv.writer(outfile, delimiter='\t')

            for i, row in enumerate(reader):
                if i == 0:  # Copy the first line as is (relation)
                    writer.writerow(row)
                else:
                    writer.writerow(row[:4])  # Write only the first four columns

# Example usage
input_folder = '/content/output_files'  # Adjust to your input folder path
output_folder = '/content/output_no_least_most'  # Adjust to your output folder path
create_no_most_least_files(input_folder, output_folder)

### GPT feeding

In [15]:
from openai import OpenAI

def process_files_and_query_gpt(input_folder, output_folder, api_key):
    # openai>=1.0 uses a client object; openai.ChatCompletion was removed in 1.0
    client = OpenAI(api_key=api_key)

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for file in os.listdir(input_folder):
        with open(os.path.join(input_folder, file), 'r') as infile:
            reader = csv.reader(infile, delimiter='\t')
            relation = next(reader, [])[0]  # First row for the relation
            pairs = [row for row in reader]

            # Construct dynamic instructions including the relation
            # instructions = ("I'm going to give you several lines of the same type. "
            #                 "Your task is for each line to output the least illustrative "
            #                 "and the most illustrative representation of this relation: '"
            #                 + relation + "' (the order is important here!). "
            #                 "So, the output should be multiple lines with 2 pairs: "
            #                 "least illustrative and most illustrative. Output only this "
            #                 "information without any other comments.")
            instructions = ("For each line, output the least illustrative "
                            "and the most illustrative representation of this relation: '"
                            + relation + "'. The output should be two pairs: "
                            "least illustrative and most illustrative.")

            # Divide into batches of max 20 lines
            batches = [pairs[i:i + 20] for i in range(0, len(pairs), 20)]
            responses = []

            for batch in batches:
                # Prepare messages for API call, including the instructions
                messages = [{"role": "system", "content": instructions}]
                messages.extend([{"role": "user", "content": " ".join(row)} for row in batch])

                # Make API calls using chat completions
                chat_completion = client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=messages
                )
                # The message is an object: read role and content as attributes
                assistant_message = chat_completion.choices[0].message
                if assistant_message.role == 'assistant':
                    responses.append(assistant_message.content)

            # Write to new file
            with open(os.path.join(output_folder, file.replace('.txt', '_gpt.txt')), 'w', newline='') as outfile:
                writer = csv.writer(outfile, delimiter='\t')
                writer.writerow([relation])
                for response in responses:
                    writer.writerow([response])

# Example usage
input_folder = '/content/output_no_least_most'
output_folder = '/content/output_gpt'
api_key = ''  # Replace with your actual API key
process_files_and_query_gpt(input_folder, output_folder, api_key)

In [23]:
from openai import OpenAI

def process_files_and_query_gpt(input_folder, output_folder, api_key):
    # openai>=1.0 uses a client object; openai.ChatCompletion was removed in 1.0
    client = OpenAI(api_key=api_key)

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for file in os.listdir(input_folder):
        with open(os.path.join(input_folder, file), 'r') as infile:
            reader = csv.reader(infile, delimiter='\t')
            relation = next(reader, [])[0]  # First row for the relation

            # Update instructions including the relation
            # (each fragment ends with a space so the sentences do not run together)
            instructions = ("In this line, based on the pairs provided, choose among them the least illustrative "
                            "and the most illustrative representation for this relation: '"
                            + relation + "' (the order of the relation matters). The output should be these four pairs "
                            "and the least illustrative and the most illustrative as the 5th and 6th column, accordingly. "
                            "The output should be written in one line, 6 pairs overall in the following format: "
                            "pair1, pair2, pair3, pair4, least_illustrative, most_illustrative "
                            "And that's it, no brackets, no quotes, nothing else, it must be in this format.")

            responses = []

            for pairs in reader:
                # Prepare the message for API call, including the instructions and the line
                message = [{"role": "system", "content": instructions},
                           {"role": "user", "content": " ".join(pairs)}]

                # Make API calls for each line
                chat_completion = client.chat.completions.create(
                    model="gpt-4",
                    messages=message
                )
                # The message is an object: read role and content as attributes
                assistant_message = chat_completion.choices[0].message
                if assistant_message.role == 'assistant':
                    responses.append(assistant_message.content)

            # Write to new file
            with open(os.path.join(output_folder, file.replace('.txt', '_gpt.txt')), 'w', newline='') as outfile:
                writer = csv.writer(outfile, delimiter='\t')
                writer.writerow([relation])
                for response in responses:
                    writer.writerow([response])

# Example usage
input_folder = '/content/output_no_least_most'
output_folder = '/content/output_gpt'
api_key = ''  # Replace with your actual API key
process_files_and_query_gpt(input_folder, output_folder, api_key)
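Once the `_gpt.txt` files exist, the answers still have to be parsed and compared with the majority answers from `output_files`. The sketch below assumes GPT really follows the requested "pair1, pair2, pair3, pair4, least_illustrative, most_illustrative" format and skips lines that do not; `parse_gpt_line`, `agreement` and the example data are hypothetical.

```python
def parse_gpt_line(line):
    """Split 'p1, p2, p3, p4, least, most' into (options, least, most); None if malformed."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        return None
    options, least, most = fields[:4], fields[4], fields[5]
    # Both choices must come from the four options and must differ
    if least not in options or most not in options or least == most:
        return None
    return options, least, most


def agreement(lines, gold):
    """Share of parseable lines whose (least, most) equals the gold answer.

    gold maps a tuple of the four pairs to the (least, most) majority answer.
    """
    hits = total = 0
    for line in lines:
        parsed = parse_gpt_line(line)
        if parsed is None:
            continue
        options, least, most = parsed
        key = tuple(options)
        if key in gold:
            total += 1
            hits += gold[key] == (least, most)
    return hits / total if total else 0.0


# Made-up example data
gold = {
    ("flower:tulip", "poem:sonnet", "emotion:rage", "tool:box"): ("tool:box", "flower:tulip"),
    ("car:auto", "buy:purchase", "rapid:quick", "cat:dog"): ("cat:dog", "car:auto"),
}
gpt_lines = [
    "flower:tulip, poem:sonnet, emotion:rage, tool:box, tool:box, flower:tulip",
    "car:auto, buy:purchase, rapid:quick, cat:dog, cat:dog, buy:purchase",
    "Sorry, I cannot decide.",
]

print(agreement(gpt_lines, gold))
```

Counting a line as correct only when both choices match is a strict criterion; scoring least and most separately would be a reasonable alternative.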
