# Evaluate ChatGPT on QALD-9 Data
## Evaluate precision, recall, and F1-score on the correctness of translated queries
- Method 0: Answer the QALD-9 questions directly. Instruct ChatGPT to answer each question without querying DBpedia.
- Method 1: Translate the QALD-9 questions directly. Instruct ChatGPT to translate each question into a SPARQL query, then run the translated query against DBpedia.
- Method 2: 1-shot learning from a pair of train question and query. Use the embeddings of the test and train questions to find the train question most similar to the test question. Prompt ChatGPT with the matched train question and its query, and instruct it to translate the test question into a SPARQL query over DBpedia.
- Method 3: 1-shot learning from a pair of train question and query, plus the chain-of-thought of the train query. As in Method 2, but the prompt also includes the chain-of-thought of the train query in addition to the matched question and query.
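
The retrieval step in Method 2 can be sketched as a nearest-neighbour lookup over question embeddings. This is a minimal sketch that assumes cosine similarity as the metric (the notebook stores a `similarity` column but does not say how it was computed); the function name `most_similar_index` and the toy vectors are illustrative only.

```python
import numpy as np

def most_similar_index(test_embedding, train_embeddings):
    """Return (index, cosine similarity) of the train embedding closest to the test one."""
    train = np.asarray(train_embeddings, dtype=float)
    query = np.asarray(test_embedding, dtype=float)
    # Cosine similarity: dot product divided by the product of the vector norms.
    sims = train @ query / (np.linalg.norm(train, axis=1) * np.linalg.norm(query))
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy 2-d vectors standing in for real question embeddings (assumption).
toy_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, sim = most_similar_index([0.9, 0.1], toy_train)
print(idx, round(sim, 3))
```

With the real data, `train_embeddings` would be the stacked train embedding column and the returned index would select the train question and query to place in the prompt.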

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json, re, os, nltk

## Load the train and test data

In [2]:
train = pd.read_csv('../data/QALD/9/data/qald-9-train-with-embeddings-cot.csv')
train.head()

Unnamed: 0,id,answertype,aggregation,onlydbo,hybrid,question_text,question_keywords,sparql_query,answer_head,answer_results,question,query,answers,train_question_embedding,similarity,train_cot,masked_question,masked_query,train_masked_question_embedding,masked_cot
0,1,resource,False,True,False,List all boardgames by GMT.,"boardgame, GMT",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...","[{'language': 'de', 'string': 'Liste alle Bret...",{'sparql': 'PREFIX dbo: <http://dbpedia.org/on...,"[{'head': {'vars': ['uri']}, 'results': {'bind...",[-0.00911681 -0.00381883 -0.01784656 ... -0.01...,0.835177,- We need to select a URI from DBpedia. - We w...,List all boardgames by [MASK1].,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"[-0.027410538867115974, -0.03040986880660057, ...",- We need to retrieve URIs from DBpedia. - We ...
1,2,resource,False,True,False,Who developed Skype?,"develop, Skype",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...","[{'language': 'de', 'string': 'Wer entwickelt ...",{'sparql': 'PREFIX dbo: <http://dbpedia.org/on...,"[{'head': {'vars': ['uri']}, 'results': {'bind...",[ 0.0144061 -0.02094096 0.02329297 ... -0.02...,0.869972,1. We are selecting a URI from DBpedia. 2. We ...,Who developed [MASK1]?,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"[-0.014005924575030804, -0.012475568801164627,...",1. We are querying DBpedia. 2. We define three...
2,3,resource,False,False,False,Which people were born in Heraklion?,"people, born, heraklion",PREFIX yago: <http://dbpedia.org/class/yago/> ...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...","[{'language': 'de', 'string': 'Welche Menschen...",{'sparql': 'PREFIX yago: <http://dbpedia.org/c...,"[{'head': {'vars': ['uri']}, 'results': {'bind...",[ 0.03373604 -0.00393256 -0.00088309 ... -0.00...,0.865406,1. We need to retrieve data from DBpedia datab...,Which people were born in [MASK1]?,PREFIX yago: <http://dbpedia.org/class/yago/> ...,"[-0.005803690291941166, -0.016199836507439613,...",1. We need to select some entity [MASK1]. 2. W...
3,4,resource,False,True,False,In which U.S. state is Area 51 located?,"Area 51, located, U.S. state",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...","[{'language': 'de', 'string': 'Im welche US Zu...",{'sparql': 'PREFIX dbo: <http://dbpedia.org/on...,"[{'head': {'vars': ['uri']}, 'results': {'bind...",[ 0.0022676 -0.03222399 -0.00734459 ... 0.00...,0.859453,1. Start with defining three prefixes for the ...,In which U.S. state is [MASK1] located?,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"[-0.013672643341124058, -0.012233780696988106,...",1. We are selecting the distinct URIs. 2. We n...
4,5,resource,False,True,False,Who is the mayor of New York City?,"New York City, mayor",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...","[{'language': 'de', 'string': 'Wer ist der Bür...",{'sparql': 'PREFIX dbo: <http://dbpedia.org/on...,"[{'head': {'vars': ['uri']}, 'results': {'bind...",[-0.00202743 -0.01871667 -0.00329676 ... -0.02...,0.868842,1. We want to retrieve information about a lea...,Who is the mayor of [MASK1]?,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"[-0.01334498543292284, -0.020737532526254654, ...",1. We want to retrieve a list of distinct URIs...


In [3]:
train.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'train_question_embedding',
       'similarity', 'train_cot', 'masked_question', 'masked_query',
       'train_masked_question_embedding', 'masked_cot'],
      dtype='object')

In [4]:
type(train.train_masked_question_embedding.loc[1])

str

In [5]:
import ast  # literal_eval parses the stored list without executing arbitrary code
train['train_masked_question_embedding'] = train.train_masked_question_embedding.apply(ast.literal_eval).apply(np.array)

In [6]:
type(train.train_masked_question_embedding.loc[1])

numpy.ndarray
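
Embeddings read back from CSV arrive as strings, and the preview above suggests two layouts: a comma-separated Python list (`train_masked_question_embedding`) and a whitespace-separated NumPy printout (`train_question_embedding`). The helper below is a sketch that handles both; the name `parse_embedding` is hypothetical, and it assumes the stored string is complete. A printout that was abbreviated with `...` when saved cannot be recovered and raises a `ValueError`.

```python
import ast
import numpy as np

def parse_embedding(text):
    """Turn a stored embedding string into a 1-d float array."""
    text = text.strip()
    if "," in text:
        # Python-list layout, e.g. "[0.1, 0.2, 0.3]": parse it as a literal.
        return np.array(ast.literal_eval(text), dtype=float)
    # NumPy-print layout, e.g. "[0.1 0.2 0.3]": split on whitespace.
    return np.array(text.strip("[]").split(), dtype=float)

list_style = parse_embedding("[-0.0274, -0.0304, 0.0121]")
numpy_style = parse_embedding("[ 0.0144061 -0.02094096  0.02329297]")
print(list_style.shape, numpy_style.shape)
```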

In [7]:
test = pd.read_csv('../data/QALD/9/data/qald-9-test-with-embeddings-cot.csv')
test.head()

Unnamed: 0,id,answertype,aggregation,onlydbo,hybrid,question_text,question_keywords,sparql_query,answer_head,answer_results,...,chatgpt_nomasked_train_cot_fewshot_query,chatgpt_nomasked_train_cot_fewshot_query_results,chatgpt_nomasked_train_only_fewshot_query,chatgpt_nomasked_train_only_fewshot_query_results,chatgpt_nomasked_train_only_3fewshot_query,chatgpt_nomasked_train_only_3fewshot_query_results,cot_noWordLimit,chatgpt_cot_noWordLimit_query,chatgpt_cot_noWordLimit_query_results,chatgpt_cot_query_results_2nd
0,99,resource,False,True,False,What is the time zone of Salt Lake City?,"Salt Lake City, time zone",PREFIX res: <http://dbpedia.org/resource/> PRE...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...",...,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",PREFIX dbo: <http://dbpedia.org/ontology/> ...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",- Set the prefixes for the resources and prope...,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['timezone']}, '...","{'head': {'link': [], 'vars': ['timezone']}, '..."
1,98,resource,False,False,False,Who killed Caesar?,"who , killed, Caesar",PREFIX dct: <http://purl.org/dc/terms/> PREFIX...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...",...,SELECT DISTINCT ?uri WHERE { ?uri a <http://db...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",1. We want to retrieve information about assas...,PREFIX dct: <http://purl.org/dc/terms/> PREFIX...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...","{'head': {'link': [], 'vars': ['s']}, 'results..."
2,86,resource,False,True,False,What is the highest mountain in Germany?,"highest, mountain, germany",PREFIX rdfs: <http://www.w3.org/2000/01/rdf-sc...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...",...,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...","1. Declare prefixes for RDF Schema, RDF Syntax...",PREFIX rdfs: <http://www.w3.org/2000/01/rdf-sc...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...","{'head': {'link': [], 'vars': ['mountain']}, '..."
3,84,resource,False,False,False,Which American presidents were in office durin...,"American presidents, office, Vietnam War",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...",...,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['president']}, ...",PREFIX dbpedia: <http://dbpedia.org/resource/>...,"{'head': {'link': [], 'vars': ['president']}, ...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,QueryBadFormed: A bad request has been sent to...,"1. Declare four prefixes: dbo, res, dct, dbc. ...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...","{'head': {'link': [], 'vars': ['president']}, ..."
4,81,resource,False,False,False,Butch Otter is the governor of which U.S. state?,"U.S. state, governor, Butch Otter",SELECT DISTINCT ?uri WHERE { ?uri a <http://db...,{'vars': ['uri']},"{'bindings': [{'uri': {'type': 'uri', 'value':...",...,PREFIX dbpedia: <http://dbpedia.org/resource/>...,"{'head': {'link': [], 'vars': ['state']}, 'res...",PREFIX dbpedia: <http://dbpedia.org/resource/>...,"{'head': {'link': [], 'vars': ['state_uri']}, ...",PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['uri']}, 'resul...",- I want to find the URIs of all the states of...,PREFIX dbo: <http://dbpedia.org/ontology/> PRE...,"{'head': {'link': [], 'vars': ['state']}, 'res...","{'head': {'link': [], 'vars': ['state']}, 'res..."


In [8]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

In [9]:
import ast  # literal_eval parses the stored list without executing arbitrary code
test['test_question_embedding'] = test.test_question_embedding.apply(ast.literal_eval).apply(np.array)

In [10]:
type(test.test_question_embedding.loc[1])

numpy.ndarray

In [11]:
train.shape, test.shape

((408, 20), (150, 46))

# Use ChatGPT to answer the test questions 
- Use ChatGPT to answer the questions directly
- Use ChatGPT to translate user questions to queries
- Use ChatGPT to translate user questions to queries by few-shot learning
- Use ChatGPT to translate user questions to queries by few-shot learning and chain of thought
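
The prompts behind the few-shot variants are not shown in this notebook. The sketch below shows one way such a 1-shot prompt could be assembled, with an optional chain-of-thought line for the chain-of-thought variant. The instruction wording, the `Question:` / `Reasoning:` / `SPARQL:` labels, and the function name `build_one_shot_prompt` are all assumptions. The demonstration pair is the Salt Lake City example displayed later in the notebook and is used purely for illustration; in the actual method the pair comes from the train set.

```python
def build_one_shot_prompt(train_question, train_query, test_question, train_cot=None):
    """Assemble a 1-shot translation prompt (hypothetical wording)."""
    parts = ["Translate the question into a SPARQL query over DBpedia.", ""]
    parts.append(f"Question: {train_question}")
    if train_cot:
        # Only the chain-of-thought variant adds the reasoning for the example query.
        parts.append(f"Reasoning: {train_cot}")
    parts.append(f"SPARQL: {train_query}")
    parts.append("")
    parts.append(f"Question: {test_question}")
    parts.append("SPARQL:")
    return "\n".join(parts)

example_query = (
    "PREFIX res: <http://dbpedia.org/resource/> "
    "PREFIX dbp: <http://dbpedia.org/property/> "
    "SELECT DISTINCT ?uri WHERE { res:Salt_Lake_City "
    "<http://dbpedia.org/ontology/timeZone> ?uri }"
)
prompt = build_one_shot_prompt(
    "What is the time zone of Salt Lake City?", example_query, "Who killed Caesar?"
)
print(prompt)
```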

## Use ChatGPT to answer the question directly

### Load the True Answers

In [12]:
with open('../data/QALD/9/data/qald-9-test-multilingual.json', 'r') as file:
    test_json = json.load(file)

In [13]:
test_json_df = pd.DataFrame(test_json['questions'])

In [14]:
answer_terms = []  # one list of normalized gold answer terms per question
count = 0          # number of boolean (ASK) questions
for idx, row in test_json_df.iterrows():
    try:
        bindings = row['answers'][0]['results']['bindings']

        # collect every bound value of every result row
        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        # strip the DBpedia prefixes and lower-case each answer
        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        answer_terms.append(terms)

    except KeyError:
        # boolean questions have no bindings; store 'true' or 'false' instead
        answer_terms.append([str(row['answers'][0]['boolean']).lower()])
        count += 1
        print(row['answers'])

[{'head': {}, 'results': {}, 'boolean': True}]
[{'head': {}, 'results': {}, 'boolean': True}]
[{'head': {}, 'results': {}, 'boolean': True}]
[{'head': {}, 'results': {}, 'boolean': True}]


In [15]:
len(answer_terms)

150
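
The same extraction logic recurs for every result column in this notebook, so it can be collected into one helper. This is a sketch, not the notebook's own code: the name `extract_terms` is hypothetical, and it follows the cell above in turning a boolean answer into `'true'` or `'false'` (later cells instead store an empty list for `False`).

```python
def extract_terms(result):
    """Flatten one SPARQL JSON result into a list of lower-cased answer terms."""
    if "boolean" in result:
        # ASK queries carry a single boolean instead of bindings.
        return [str(result["boolean"]).lower()]
    terms = []
    for binding in result.get("results", {}).get("bindings", []):
        for var in binding:
            value = binding[var]["value"]
            # Strip the DBpedia prefixes exactly as the cell above does.
            value = value.replace("http://dbpedia.org/resource/", "").replace("dbo:", "")
            terms.append(value.strip().lower())
    return terms

select_result = {
    "head": {"link": [], "vars": ["uri"]},
    "results": {"bindings": [
        {"uri": {"type": "uri", "value": "http://dbpedia.org/resource/Mountain_Time_Zone"}}
    ]},
}
ask_result = {"head": {}, "results": {}, "boolean": True}
print(extract_terms(select_result), extract_terms(ask_result))
```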

### Evaluate the results of using ChatGPT to answer the questions directly on DBpedia-03132023

In [17]:
# extract the answer terms from chatgpt_answers_text_DBpedia_2023_03
chatgpt_answer_terms = []
for idx, row in test.iterrows():
    answers_text = row['chatgpt_answers_text_DBpedia_2023_03']
    # drop the DBpedia prefixes, lower-case, and split into one term per line
    terms = (answers_text.replace('https://dbpedia.org/resource/', '')
             .replace('http://dbpedia.org/resource/', '')
             .replace('dbo:', '').strip().lower().split('\n'))
    chatgpt_answer_terms.append([t.strip() for t in terms])

In [18]:
len(chatgpt_answer_terms)

150

In [19]:
import urllib.parse

chatgpt_answer_terms_parsed = []
for terms in chatgpt_answer_terms:
    terms_parsed = []
    for term in terms:
        parsed_string = urllib.parse.unquote(term)
        terms_parsed.append(parsed_string)
    chatgpt_answer_terms_parsed.append(terms_parsed)

In [20]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_answer_terms_parsed):
    gold_terms = answer_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    predicted_correct_idx = False
    for pterm in pred_terms:
        if len(pterm) > 0: # skip an empty string
            for gterm in gold_terms:
                #if pterm ==  gterm:
                pterm = pterm.replace("_", " ")
                gterm = gterm.replace("_", " ")
                if (pterm in gterm) or (gterm in pterm):
                    predicted_correct_idx = True
                    predicted_correct += 1
                    break # this pterm is a correct prediction, skip to next pterm
                          # don't double count this pterm anymore
                
    some_matched[idx] = predicted_correct_idx
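
The matching rule used in the cell above is lenient: after replacing underscores with spaces, a predicted term counts as correct when it contains a gold term or is contained in one. The sketch below isolates that rule so its behavior is easy to probe. The name `lenient_match` is hypothetical, and the guard against empty strings follows the later question-level cells rather than this one.

```python
def lenient_match(pred_term, gold_term):
    """Substring match in either direction after replacing underscores with spaces."""
    pred = pred_term.replace("_", " ")
    gold = gold_term.replace("_", " ")
    # Empty strings are rejected up front: "" is a substring of every string.
    return bool(pred) and bool(gold) and (pred in gold or gold in pred)

print(lenient_match("angela merkel", "angela_merkel"))   # same entity, different separators
print(lenient_match("merkel", "angela_merkel"))          # partial name is accepted
print(lenient_match("4", "1945"))                        # short terms match inside unrelated values
print(lenient_match("berlin", "paris"))
```

The third call shows the main caveat: short predictions such as single digits can match unrelated gold values, so scores computed with this rule are an upper bound compared with exact matching.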

In [21]:
pre_gold_lengths

[(0, 1, 1),
 (1, 4, 9),
 (2, 1, 1),
 (3, 3, 3),
 (4, 1, 1),
 (5, 1, 1),
 (6, 10, 3),
 (7, 1, 1),
 (8, 1, 1),
 (9, 1, 1),
 (10, 3, 1),
 (11, 19, 4),
 (12, 8, 26),
 (13, 49, 208),
 (14, 8, 1),
 (15, 1, 1),
 (16, 1, 1),
 (17, 19, 45),
 (18, 1, 2),
 (19, 3, 1),
 (20, 1, 1),
 (21, 5, 2),
 (22, 4, 1),
 (23, 1, 243),
 (24, 1, 1),
 (25, 1, 1),
 (26, 4, 1),
 (27, 1, 1),
 (28, 1, 1),
 (29, 1, 1),
 (30, 1, 1),
 (31, 1, 1),
 (32, 7, 10),
 (33, 1, 1),
 (34, 1, 1),
 (35, 50, 21),
 (36, 6, 2),
 (37, 1, 1),
 (38, 2, 1),
 (39, 1, 1),
 (40, 1, 1),
 (41, 1, 1),
 (42, 1, 1),
 (43, 55, 2),
 (44, 9, 11),
 (45, 1, 1),
 (46, 1, 1),
 (47, 1, 6),
 (48, 1, 1),
 (49, 1, 1),
 (50, 1, 1),
 (51, 1, 1),
 (52, 20, 915),
 (53, 1, 1),
 (54, 1, 1),
 (55, 1, 1),
 (56, 10, 19),
 (57, 20, 26),
 (58, 2, 1),
 (59, 1, 1),
 (60, 5, 1),
 (61, 1, 1),
 (62, 1, 1),
 (63, 1, 1),
 (64, 1, 2),
 (65, 19, 482),
 (66, 1, 1),
 (67, 1, 1),
 (68, 15, 1),
 (69, 10, 22),
 (70, 11, 15),
 (71, 1, 49),
 (72, 32, 7),
 (73, 9, 13),
 (74, 9, 26),
 

In [22]:
precision = predicted_correct / predicted
precision

0.4415862808145766

In [23]:
recall = predicted_correct/gold
recall

0.08968219416630388

In [24]:
f1 = 2 / (1/precision + 1/recall)
f1

0.14908630360050662
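
The three values above are micro-averaged scores: all answers are pooled across questions before dividing. A small helper makes the formula explicit and guards against division by zero. The name `micro_prf` is hypothetical; the counts passed in are the ones this notebook reports for this method (412 correct, 933 predicted, 4594 gold).

```python
def micro_prf(n_correct, n_predicted, n_gold):
    """Micro-averaged precision, recall, and F1 from pooled answer counts."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall, equal to 2 / (1/precision + 1/recall).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = micro_prf(412, 933, 4594)
print(p, r, f)
```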

In [25]:
adj_precision = (predicted_correct-240) / (predicted-240)
adj_recall = (predicted_correct-240)/ (gold - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.2481962481962482,
adj_recall:0.059722222222222225,
adj_f1:0.0962776378393507


(412, 933, 4594)

In [26]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # maps each question to whether at least one answer was predicted correctly
some_matched = {} # maps each question to (number of matches) / (number of gold answers)

# iterate over the questions to retrieve the predicted terms for each one
for idx, pred_terms in enumerate(chatgpt_answer_terms_parsed):
    
    # the set of correct terms for this question
    gold_terms = answer_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [27]:
# number of questions with at least one correctly predicted answer
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

88

In [28]:
# sum of the per-question match ratios
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

63.100732828751674

In [29]:
# number of questions whose match ratio is exactly 1
# (used as a proxy for the predicted answers being equal to the gold answers)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

50.0

In [30]:
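# question-level scores, reported as precision, recall, and f1-score:
# share of questions with at least one match, mean match ratio,
# and share of questions with a match ratio of exactly 1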
88/150, 63.1/150, 50/150

(0.5866666666666667, 0.4206666666666667, 0.3333333333333333)
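
For comparison, a more conventional question-level evaluation computes precision, recall, and F1 per question on exact set overlap and then averages them over questions (macro averaging). This is a different technique from the three ratios above, shown only as a sketch; the function names are hypothetical, and scoring a question where both sets are empty as 1.0 is an assumption that mirrors the empty-versus-empty branch in the cell above.

```python
def question_prf(pred_terms, gold_terms):
    """Exact-match precision, recall, and F1 for a single question."""
    pred, gold = set(pred_terms), set(gold_terms)
    if not pred and not gold:
        return 1.0, 1.0, 1.0  # nothing expected, nothing returned
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

def macro_prf(all_pred, all_gold):
    """Average the per-question scores over all questions."""
    scores = [question_prf(p, g) for p, g in zip(all_pred, all_gold)]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Toy data: one question with an extra wrong answer, one with empty answers on both sides.
macro_p, macro_r, macro_f = macro_prf([["a", "b"], []], [["a"], []])
print(macro_p, macro_r, macro_f)
```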

## Use ChatGPT to translate user questions to queries

### Evaluate the chatgpt_query_results_DBpedia_2023_03

In [32]:
# retrieve gold query results
import ast

query_results_terms = []  # one list of normalized gold terms per question
count = 0                 # number of boolean (ASK) results
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        # collect every bound value of every result row
        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        # strip the DBpedia prefixes and lower-case each answer
        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        query_results_terms.append(terms)

    except KeyError:
        # boolean results have no 'results' key; keep ['true'], or an empty list for False
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [33]:
# retrieve chatgpt query results
import ast

chatgpt_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['chatgpt_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_query_results_terms.append(terms)
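    except SyntaxError:
        # the stored cell holds an error message instead of a dict literal
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])

    except KeyError:
        # boolean (ASK) result without a 'results' key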
        print(row['chatgpt_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['chatgpt_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            chatgpt_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': False}


In [34]:
len(chatgpt_query_results_terms)

150
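
The `*_query_results` columns were produced by running each translated query against DBpedia, but that step is not part of this notebook. The sketch below shows one way to do it with the standard library only. The endpoint URL, the function names, and the choice to return an error string on failure (similar in spirit to the `QueryBadFormed: ...` strings stored in the data) are assumptions, not the original code.

```python
import json
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"  # assumed endpoint, not named in the notebook

def build_request(query, endpoint=DBPEDIA_ENDPOINT):
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})

def run_query(query, endpoint=DBPEDIA_ENDPOINT, timeout=30):
    """Return the parsed JSON result, or an error string if the request fails."""
    try:
        with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as resp:
            return json.load(resp)
    except Exception as exc:  # e.g. an HTTP 400 response for a malformed query
        return f"{type(exc).__name__}: {exc}"

# Building the request needs no network access; run_query(...) does.
req = build_request(
    "SELECT ?timezone WHERE { <http://dbpedia.org/resource/Salt_Lake_City> "
    "<http://dbpedia.org/ontology/timeZone> ?timezone . }"
)
print(req.full_url)
```

Returning a string instead of raising keeps one result per question, so the result list stays aligned with the 150 test rows even when a generated query is invalid.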

In [35]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
                             
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        #if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            #break # this pterm is a correct prediction, skip to next pterm
                                    # don't double count this pterm anymore

In [36]:
precision = predicted_correct / predicted
precision

0.7110091743119266

In [37]:
recall = predicted_correct/gold
recall

0.1272577996715928

In [38]:
f1 = 2 / (1/precision + 1/recall)
f1

0.2158774373259053

In [39]:
adj_precision = (predicted_correct -240) / (predicted - 240)
adj_recall = (predicted_correct - 240) / (gold - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.5434782608695652,
adj_recall:0.11597938144329897,
adj_f1:0.1911639762107052


(465, 654, 3654)

In [40]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # maps each question to whether at least one answer was predicted correctly
some_matched = {} # maps each question to (number of matches) / (number of gold answers)

# iterate over the questions to retrieve the predicted terms for each one
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [41]:
sum(predicted_question.values())

52

In [42]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

52

In [43]:
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

45.0

In [44]:
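# question-level scores, reported as precision, recall, f1-score:
# share of questions with at least one match, share with a nonzero match ratio,
# and share of questions with a match ratio of exactly 1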
52/150, 52/150, 45/150

(0.3466666666666667, 0.3466666666666667, 0.3)

## Use ChatGPT to translate user questions to queries by few-shot learning

### Evaluate the chatgpt_train_1fewshot_query_results

In [46]:
test.shape

(150, 46)

In [48]:
# retrieve chatgpt train 1fewshot query results
import ast

chatgpt_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        #bindings = row['chatgpt_train_1fewshot_query_results']['results']['bindings']
        bindings = ast.literal_eval(row['chatgpt_train_1fewshot_query_results'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_query_results_terms.append(terms)
    except (TypeError, SyntaxError):
        # the cell could not be parsed (for example, it holds an error message instead of a dict literal)
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])

    except KeyError:
        # boolean (ASK) result without a 'results' key
        print(row['chatgpt_train_1fewshot_query_results'])
        #ex_ans = row['chatgpt_train_1fewshot_query_results']['boolean']
        ex_ans = ast.literal_eval(row['chatgpt_train_1fewshot_query_results'])['boolean']
        if ex_ans:
            chatgpt_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': False}


In [49]:
chatgpt_query_results_terms

[['mountain_time_zone'],
 [],
 ['zugspitze'],
 ['ERROR ERROR ERROR'],
 [],
 [],
 ['vesna_pisarović', 'kelly_kelekidou', 'cameron_cartio'],
 [],
 [],
 [],
 [],
 [],
 ['georgia_(country)',
  'western_australia',
  'jamaica',
  'japan',
  'northern_territory',
  'spain',
  'india',
  'venezuela',
  'derbyshire',
  'united_states',
  'turkey',
  'somerset',
  'france',
  'germany',
  'canada',
  'austria',
  'republic_of_ireland',
  'italy',
  'cumbria',
  'taiwan',
  'brazil',
  'north_yorkshire',
  'gozo',
  'philippines',
  'mexico',
  'vietnam',
  'united_kingdom',
  'serbia',
  'lancashire',
  'greece',
  'romania',
  'bosnia_and_herzegovina',
  'israel',
  'malta',
  'wales',
  'devon',
  'gibraltar',
  'british_columbia',
  'england',
  'abkhazia',
  'china',
  'slovakia',
  'slovenia',
  'south_africa',
  'russia',
  'australia',
  'portugal',
  'nepal'],
 ['ERROR ERROR ERROR'],
 [],
 ['angela dorothea kasner'],
 [],
 [],
 ['riven'],
 [],
 ['4'],
 [],
 [],
 ['ERROR ERROR ERROR'],
 

In [50]:
pd.set_option('display.max_colwidth', None)
test.iloc[0][['question_text', 'sparql_query', 'chatgpt_train_1fewshot_query', 'gold_query_results_DBpedia_2023_03', \
               'chatgpt_train_1fewshot_query_results']]

question_text                                                                                                                                                                                        What is the time zone of Salt Lake City?
sparql_query                                              PREFIX res: <http://dbpedia.org/resource/> PREFIX dbp: <http://dbpedia.org/property/> SELECT DISTINCT ?uri WHERE { res:Salt_Lake_City <http://dbpedia.org/ontology/timeZone> ?uri }
chatgpt_train_1fewshot_query                                                                                     SELECT ?timezone WHERE {   <http://dbpedia.org/resource/Salt_Lake_City> <http://dbpedia.org/ontology/timeZone> ?timezone . }
gold_query_results_DBpedia_2023_03                {'head': {'link': [], 'vars': ['uri']}, 'results': {'distinct': False, 'ordered': True, 'bindings': [{'uri': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Mountain_Time_Zone'}}]}}
chatgpt_train_1fewshot_query_results    {'head':

In [51]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            #break # disabled: would stop at the first match so that this pterm
                                    # is not counted again; as written, every matching gterm adds one

        some_matched[idx] = predicted_correct_idx

In [52]:
pre_gold_lengths

[(0, 1, 1),
 (1, 0, 12),
 (2, 1, 1),
 (3, 1, 2),
 (4, 0, 0),
 (5, 0, 1),
 (6, 3, 4),
 (7, 0, 1),
 (8, 0, 1),
 (9, 0, 1),
 (10, 0, 1),
 (11, 0, 5),
 (12, 48, 48),
 (13, 1, 85),
 (14, 0, 1),
 (15, 1, 1),
 (16, 0, 0),
 (17, 0, 61),
 (18, 1, 1),
 (19, 0, 0),
 (20, 1, 1),
 (21, 0, 0),
 (22, 0, 0),
 (23, 1, 1714),
 (24, 1, 1),
 (25, 0, 1),
 (26, 0, 1),
 (27, 0, 1),
 (28, 0, 0),
 (29, 1, 1),
 (30, 1, 1),
 (31, 0, 0),
 (32, 0, 10),
 (33, 0, 0),
 (34, 0, 1),
 (35, 26, 26),
 (36, 0, 2),
 (37, 0, 1),
 (38, 0, 3),
 (39, 0, 0),
 (40, 1, 1),
 (41, 1, 1),
 (42, 1, 1),
 (43, 0, 0),
 (44, 26, 26),
 (45, 1, 1),
 (46, 0, 1),
 (47, 1, 0),
 (48, 1, 1),
 (49, 0, 0),
 (50, 0, 1),
 (51, 0, 1),
 (52, 0, 197),
 (53, 0, 1),
 (54, 1, 1),
 (55, 0, 1),
 (56, 0, 15),
 (57, 2, 2),
 (58, 0, 4),
 (59, 1, 0),
 (60, 0, 0),
 (61, 1, 1),
 (62, 0, 0),
 (63, 1, 1),
 (64, 0, 0),
 (65, 0, 164),
 (66, 1, 1),
 (67, 0, 1),
 (68, 1, 1),
 (69, 37, 36),
 (70, 0, 10),
 (71, 74, 74),
 (72, 8, 8),
 (73, 1, 20),
 (74, 43, 44),
 (75, 0, 

In [53]:
precision = predicted_correct / (predicted)
precision

0.836852207293666

In [54]:
test.iloc[23][['question_text', 'sparql_query', 'chatgpt_train_1fewshot_query']]

question_text                                                                                                                                                                                                                                                                                                                                                                                                                  Give me all Argentine films.
sparql_query                    PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX yago: <http://dbpedia.org/class/yago/> PREFIX res: <http://dbpedia.org/resource/> PREFIX dbp: <http://dbpedia.org/property/> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> SELECT DISTINCT ?uri WHERE { { ?uri rdf:type yago:ArgentineFilms } UNION { ?uri rdf:type dbo:Film { ?uri dbo:country res:Argentina } UNION { ?uri dbp:country "Argentina"@en } } }
chatgpt_train_1fewshot_query                                                                                    

In [55]:
recall = predicted_correct/gold
recall

0.11932129173508484

In [56]:
f1 = 2 / (1/precision + 1/recall)
f1

0.2088622754491018
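
Because the loop above adds one to `predicted_correct` for every matching (pterm, gterm) pair, a single predicted term can be counted several times under substring matching. The following is a minimal sketch of a micro-averaged variant that counts each predicted term and each gold term at most once. `terms_match` and `micro_prf` are hypothetical names, and the numbers it produces would differ from the ones reported in this notebook.

```python
def terms_match(pterm, gterm):
    """Substring match after replacing underscores, as in the evaluation cells."""
    p = pterm.replace('_', ' ')
    g = gterm.replace('_', ' ')
    return bool(p) and bool(g) and (p in g or g in p)


def micro_prf(pred_lists, gold_lists):
    """Micro-averaged precision, recall and F1 over all questions."""
    predicted = gold = pred_hits = gold_hits = 0
    for pred_terms, gold_terms in zip(pred_lists, gold_lists):
        predicted += len(pred_terms)
        gold += len(gold_terms)
        # a predicted term is a hit if it matches at least one gold term
        pred_hits += sum(any(terms_match(p, g) for g in gold_terms) for p in pred_terms)
        # a gold term is recovered if at least one predicted term matches it
        gold_hits += sum(any(terms_match(p, g) for p in pred_terms) for g in gold_terms)
    precision = pred_hits / predicted if predicted else 0.0
    recall = gold_hits / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With this counting, precision and recall both stay within [0, 1], which the pair-counting loop does not guarantee.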

In [57]:
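# exclude question 23 ("Give me all Argentine films."), which contributes
# 1 predicted term (the error placeholder) and 1714 gold terms
adj_precision = predicted_correct / (predicted - 1)
adj_recall = predicted_correct / (gold - 1714)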
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.8384615384615385,
adj_recall:0.22474226804123712,
adj_f1:0.3544715447154471


(436, 521, 3654)
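
The adjusted scores rely on hard-coded counts for question 23. As a rough sketch, the same denominators could be derived from `pre_gold_lengths` (the list of `(idx, n_pred, n_gold)` tuples built earlier) by skipping chosen question indices. `totals_excluding` is a hypothetical helper, and it only covers the totals: `predicted_correct` would also need recomputing if an excluded question had any matches.

```python
def totals_excluding(pre_gold_lengths, exclude=frozenset()):
    """Sum predicted and gold term counts, skipping the question indices in `exclude`."""
    predicted = sum(n_pred for idx, n_pred, _ in pre_gold_lengths if idx not in exclude)
    gold = sum(n_gold for idx, _, n_gold in pre_gold_lengths if idx not in exclude)
    return predicted, gold
```

For example, `totals_excluding(pre_gold_lengths, exclude={23})` would return the totals without the Argentine films question.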

In [58]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # dictionary from question index to whether it was correctly predicted
some_matched = {} # dictionary from question index to how much of it was correctly predicted

# iterate over the questions to retrieve the predicted terms for each one
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and gold terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [59]:
# number of questions with at least one correct predicted result (questions where both result sets are empty also count)
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

59

In [60]:
# partial-credit total: sum of the per-question match ratios (matches / number of gold answers)
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

58.60163734115348

In [61]:
# number of questions whose match ratio is exactly 1 (treated as the predicted answers matching the full gold set)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

54.0

In [62]:
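# question-level scores over the 150 test questions, reported as precision, recall, and f1-score:
# at least one answer matched, summed partial credit, and full match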
59/150, 58.6/150, 54/150

(0.3933333333333333, 0.39066666666666666, 0.36)
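
The three question-level numbers above come from three separate cells. A compact sketch that computes them in one pass is given here, under one stated change: the per-question ratio is the fraction of gold terms matched by at least one predicted term, so it cannot exceed 1. `terms_match` and `question_level_scores` are hypothetical names, and the results would not necessarily reproduce the figures in this notebook.

```python
def terms_match(pterm, gterm):
    """Substring match after replacing underscores, as in the evaluation cells."""
    p = pterm.replace('_', ' ')
    g = gterm.replace('_', ' ')
    return bool(p) and bool(g) and (p in g or g in p)


def question_level_scores(pred_lists, gold_lists):
    """Return (at-least-one-match rate, mean partial credit, full-match rate)."""
    n = 0
    any_match = 0
    partial = 0.0
    full = 0
    for pred_terms, gold_terms in zip(pred_lists, gold_lists):
        n += 1
        if not pred_terms and not gold_terms:
            fraction = 1.0   # both empty: counted as correct, as in the cells above
        elif not pred_terms or not gold_terms:
            fraction = 0.0   # exactly one side empty: counted as incorrect
        else:
            matched = sum(any(terms_match(p, g) for p in pred_terms) for g in gold_terms)
            fraction = matched / len(gold_terms)
        any_match += fraction > 0
        partial += fraction
        full += fraction == 1.0
    if n == 0:
        return 0.0, 0.0, 0.0
    return any_match / n, partial / n, full / n
```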

### Evaluate chatgpt_train_3fewshot_query_results

In [63]:
# retrieve chatgpt train 3fewshot query results
import ast

chatgpt_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        #bindings = row['chatgpt_train_1fewshot_query_results']['results']['bindings']
        bindings = ast.literal_eval(row['chatgpt_train_3fewshot_query_results'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_query_results_terms.append(terms)
    except (TypeError, SyntaxError):
        # the stored value could not be read as a result dictionary
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])
    except KeyError:
        # no 'results' key: an ASK query returned a 'boolean' field instead
        print(row['chatgpt_train_3fewshot_query_results'])
        #ex_ans = row['chatgpt_train_1fewshot_query_results']['boolean']
        ex_ans = ast.literal_eval(row['chatgpt_train_3fewshot_query_results'])['boolean']
        if ex_ans:
            chatgpt_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': True}


In [64]:
query_results_terms[149]

['pinta_(ship)', 'niña', 'santa_maría_(ship)']

In [65]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            #break # disabled: would stop at the first match so that this pterm
                                    # is not counted again; as written, every matching gterm adds one

        some_matched[idx] = predicted_correct_idx

In [66]:
precision = predicted_correct / (predicted)
precision

0.96

In [67]:
recall = predicted_correct/gold
recall

0.17733990147783252

In [68]:
f1 = 2 / (1/precision + 1/recall)
f1

0.2993762993762994

In [69]:
adj_precision = (predicted_correct-240) / (predicted-240)
adj_recall = (predicted_correct-240)/ (gold - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.9379310344827586,
adj_recall:0.21030927835051547,
adj_f1:0.34357894736842104


(648, 675, 3654)

In [70]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # dictionary from question index to whether it was correctly predicted
some_matched = {} # dictionary from question index to how much of it was correctly predicted

# iterate over the questions to retrieve the predicted terms for each one
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and gold terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [71]:
# number of questions with at least one correct predicted result (questions where both result sets are empty also count)
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

70

In [72]:
# partial-credit total: sum of the per-question match ratios (matches / number of gold answers)
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

69.08981592647325

In [73]:
# number of questions whose match ratio is exactly 1 (treated as the predicted answers matching the full gold set)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

63.0

In [74]:
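# question-level scores over the 150 test questions, reported as precision, recall, and f1-score:
# at least one answer matched, summed partial credit, and full match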
70/150, 69/150, 63/150

(0.4666666666666667, 0.46, 0.42)

## Use ChatGPT to translate user questions to queries by few-shot learning WITH MASKED ENTITIES

### Evaluate the chatgpt_train_cot_fewshot_query_results

In [75]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

In [76]:
test.shape

(150, 46)

In [77]:
# retrieve query results
import ast

query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        query_results_terms.append(terms)
              
    except:
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [78]:
# retrieve chatgpt nomasked train cot fewshot query results
import ast

chatgpt_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        #bindings = row['chatgpt_nomasked_train_cot_fewshot_query_results']['results']['bindings']
        bindings = ast.literal_eval(row['chatgpt_nomasked_train_cot_fewshot_query_results'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_query_results_terms.append(terms)
    except (TypeError, SyntaxError):
        # the stored value could not be read as a result dictionary
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])
    except KeyError:
        # no 'results' key: an ASK query returned a 'boolean' field instead
        print(row['chatgpt_nomasked_train_cot_fewshot_query_results'])
        #ex_ans = row['chatgpt_nomasked_train_cot_fewshot_query_results']['boolean']
        ex_ans = ast.literal_eval(row['chatgpt_nomasked_train_cot_fewshot_query_results'])['boolean']
        if ex_ans:
            chatgpt_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': True}


In [79]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            #break # disabled: would stop at the first match so that this pterm
                                    # is not counted again; as written, every matching gterm adds one

        some_matched[idx] = predicted_correct_idx

In [80]:
precision = predicted_correct / (predicted)
precision

0.9479940564635958

In [81]:
adj_precision = (predicted_correct-240) / (predicted - 240)
adj_recall = (predicted_correct-240) / (gold - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.9191685912240185,
adj_recall:0.20515463917525772,
adj_f1:0.3354403708386009


(638, 673, 3654)

In [82]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0
predicted_correct = 0
some_matched = {}
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    gold_terms = query_results_terms[idx]
    
    count += 1
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        if not predicted_correct_idx:
                            #if (pterm in gterm) or (gterm in pterm):
                            if pterm == gterm:
                                predicted_correct_idx = True
                                predicted_correct += 1
                        else:
                            pass
                
        some_matched[idx] = predicted_correct_idx
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_correct += 1
        some_matched[idx] = True

In [83]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

71

In [84]:
precision = predicted_correct / count
precision

0.47333333333333333

In [85]:
recall = predicted_correct/count
recall

0.47333333333333333

In [86]:
f1 = 2 / (1/precision + 1/recall)
f1

0.4733333333333334

In [87]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # dictionary from question index to whether it was correctly predicted
some_matched = {} # dictionary from question index to how much of it was correctly predicted

# iterate over the questions to retrieve the predicted terms for each one
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and gold terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [88]:
# number of questions with at least one correct predicted result (questions where both result sets are empty also count)
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

72

In [89]:
# partial-credit total: sum of the per-question match ratios (matches / number of gold answers)
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

69.80677295318637

In [90]:
# number of questions whose match ratio is exactly 1 (treated as the predicted answers matching the full gold set)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

66.0

In [91]:
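# question-level scores over the 150 test questions, reported as precision, recall, and f1-score:
# at least one answer matched, summed partial credit, and full match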
72/150, 69/150, 66/150

(0.48, 0.46, 0.44)

### Evaluate the chatgpt_train_only_fewshot_query_results

In [92]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

In [93]:
test.shape

(150, 46)

In [94]:
# retrieve query results
import ast

query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        query_results_terms.append(terms)
              
    except:
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [95]:
# retrieve chatgpt nomasked train only fewshot query results
import ast

chatgpt_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        #bindings = row['chatgpt_nomasked_train_only_fewshot_query_results']['results']['bindings']
        bindings = ast.literal_eval(row['chatgpt_nomasked_train_only_fewshot_query_results'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_query_results_terms.append(terms)
    except (TypeError, SyntaxError):
        # the stored value could not be read as a result dictionary
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])
    except KeyError:
        # no 'results' key: an ASK query returned a 'boolean' field instead
        print(row['chatgpt_nomasked_train_only_fewshot_query_results'])
        #ex_ans = row['chatgpt_nomasked_train_only_fewshot_query_results']['boolean']
        ex_ans = ast.literal_eval(row['chatgpt_nomasked_train_only_fewshot_query_results'])['boolean']
        if ex_ans:
            chatgpt_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': True}


In [96]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            #break # disabled: would stop at the first match so that this pterm
                                    # is not counted again; as written, every matching gterm adds one

        some_matched[idx] = predicted_correct_idx

In [97]:
precision = predicted_correct / (predicted)
precision

0.8875675675675676

In [98]:
recall = predicted_correct/gold
recall

0.22468527640941435

In [99]:
f1 = 2 / (1/precision + 1/recall)
f1

0.358593579384145

In [100]:
adj_precision = (predicted_correct-240) / (predicted - 240)
adj_recall = (predicted_correct-240) / (gold - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.8481751824817518,
adj_recall:0.2994845360824742,
adj_f1:0.44266666666666665


(821, 925, 3654)

In [101]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0
predicted_correct = 0
some_matched = {}
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    gold_terms = query_results_terms[idx]
    
    count += 1
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        if not predicted_correct_idx:
                            #if (pterm in gterm) or (gterm in pterm):
                            if pterm == gterm:
                                predicted_correct_idx = True
                                predicted_correct += 1
                        else:
                            pass
                
        some_matched[idx] = predicted_correct_idx
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_correct += 1
        some_matched[idx] = True

In [102]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

71

In [103]:
precision = predicted_correct / count
precision

0.47333333333333333

In [104]:
recall = predicted_correct/count
recall

0.47333333333333333

In [105]:
f1 = 2 / (1/precision + 1/recall)
f1

0.4733333333333334

In [106]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # dictionary from question to whether it was correctly predicted
some_matched = {} # dictionary from question to how much of it was correctly predicted

# iterate each question to retrieve the set of predicted terms for a question
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [107]:
# number of questions counted as correct when at least one predicted result matches
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

71

In [108]:
# sum of the per-question matched fractions
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

66.65113128302858

In [109]:
# number of questions whose matched fraction is exactly 1 (all gold answers matched)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

61.0

In [110]:
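# reported as precision, recall, and f1-score: share of questions with at least one match,
# mean matched fraction (66.65 truncated to 66), and share of fully matched questions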
71/150, 66/150, 61/150

(0.47333333333333333, 0.44, 0.4066666666666667)
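
The cell above divides hand-typed totals by 150 and truncates the summed fraction 66.65 to 66. A minimal sketch of computing the same three question-level rates directly from the two dictionaries follows; the function name `question_level_rates` and the toy dictionaries are hypothetical and only illustrate the shape of `predicted_question` and `some_matched`.

```python
def question_level_rates(predicted_question, some_matched):
    """Return (any-match rate, mean matched fraction, full-match rate)."""
    n = len(predicted_question)
    any_match = sum(bool(v) for v in predicted_question.values()) / n
    mean_match = sum(some_matched.values()) / n
    full_match = sum(1 for v in some_matched.values() if v == 1) / n
    return any_match, mean_match, full_match

# Toy data for four questions (hypothetical values, not taken from the notebook)
toy_flags = {0: True, 1: True, 2: False, 3: True}
toy_scores = {0: 1, 1: 0.5, 2: 0, 3: 1}
rates = question_level_rates(toy_flags, toy_scores)
print(rates)
```

Deriving the denominator from the dictionary length keeps the rates correct if the number of test questions changes.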

### Evaluate the chatgpt_train_only_3fewshot_query_results

In [111]:
# retrieve query results
import ast

query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        query_results_terms.append(terms)
              
    except:
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
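
The gold and ChatGPT result columns are parsed by near-identical loops that rely on a bare `except` to detect ASK-style boolean results. A minimal sketch of one shared helper is given here, assuming each cell holds the string form of a SPARQL JSON result; the name `extract_terms` and the constant `RESOURCE_PREFIX` are hypothetical. It keeps the conventions used in the cells above: `True` becomes `['true']`, `False` becomes `[]`, and unparsable values become `['ERROR ERROR ERROR']`.

```python
import ast

RESOURCE_PREFIX = 'http://dbpedia.org/resource/'

def extract_terms(raw):
    """Turn a stored SPARQL JSON result string into a list of lowercase terms."""
    try:
        result = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Missing or malformed result, for example NaN from a failed query
        return ['ERROR ERROR ERROR']
    if not isinstance(result, dict):
        return ['ERROR ERROR ERROR']
    if 'boolean' in result:
        # ASK query: True -> ['true'], False -> []
        return [str(result['boolean']).lower()] if result['boolean'] else []
    terms = []
    for binding in result['results']['bindings']:
        for var in binding:
            value = binding[var]['value']
            terms.append(value.replace(RESOURCE_PREFIX, '').replace('dbo:', '').strip().lower())
    return terms

print(extract_terms("{'head': {'link': []}, 'boolean': True}"))  # ['true']
```

With such a helper, each column could be converted with a list comprehension over `test[column]` instead of a copied loop.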


In [112]:
# retrieve chatgpt train 3fewshot query results
import ast

chatgpt_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        #bindings = row['chatgpt_nomasked_train_only_3fewshot_query_results']['results']['bindings']
        bindings = ast.literal_eval(row['chatgpt_nomasked_train_only_3fewshot_query_results'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_query_results_terms.append(terms)
    except TypeError:
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])
    except SyntaxError:
        chatgpt_query_results_terms.append(['ERROR ERROR ERROR'])
              
    except:
        print(row['chatgpt_nomasked_train_only_3fewshot_query_results'])
        #ex_ans = row['chatgpt_nomasked_train_only_3fewshot_query_results']['boolean']
        ex_ans = ast.literal_eval(row['chatgpt_nomasked_train_only_3fewshot_query_results'])['boolean']
        if ex_ans:
            chatgpt_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}


In [113]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}
pre_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pre_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            #break # this pterm is a correct prediction, skip to next pterm
                                    # don't double count this pterm anymore

        some_matched[idx] = predicted_correct_idx

In [114]:
precision = predicted_correct / (predicted)
precision

0.8111455108359134

In [115]:
recall = predicted_correct/gold
recall

0.21510673234811165

In [116]:
f1 = 2 / (1/precision + 1/recall)
f1

0.34003893575600264

In [117]:
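# Adjusted scores: gold - 1714 drops the 1714 gold answers of one high-cardinality
# question (index 23 in the pred_gold_lengths listings); 240 is presumably that
# question's matched predictions in this run.
adj_precision = (predicted_correct-240) / (predicted - 240)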
adj_recall = (predicted_correct-240) / (gold - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.7489711934156379,
adj_recall:0.28144329896907216,
adj_f1:0.40914200074934437


(786, 969, 3654)
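
In the answer-level loop of this subsection the `break` is commented out and matching is by substring, so a single predicted term can increment `predicted_correct` once for every gold term it overlaps with. A minimal sketch of counting each predicted term at most once is shown here; the names `normalize` and `count_matched_predictions` and the toy lists are hypothetical.

```python
def normalize(term):
    """Apply the same normalization as the notebook cells: underscores to spaces, lowercase."""
    return term.replace('_', ' ').strip().lower()

def count_matched_predictions(pred_terms, gold_terms, substring=True):
    """Count predicted terms that match at least one gold term, each counted once."""
    gold = [normalize(g) for g in gold_terms if g]
    matched = 0
    for p in pred_terms:
        if not p:
            continue  # skip an empty string
        p = normalize(p)
        if substring:
            hit = any(p in g or g in p for g in gold)
        else:
            hit = p in gold
        if hit:
            matched += 1
    return matched

# Toy example (hypothetical terms): one prediction overlaps three gold terms
pred = ['berlin']
gold = ['berlin', 'east_berlin', 'west_berlin']
pairwise = sum(1 for p in pred for g in gold
               if normalize(p) in normalize(g) or normalize(g) in normalize(p))
once = count_matched_predictions(pred, gold)
print(pairwise, once)
```

Counting once per predicted term keeps `predicted_correct` bounded by `predicted`, so precision cannot exceed 1.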

In [118]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0
predicted_correct = 0
some_matched = {}
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    gold_terms = query_results_terms[idx]
    
    count += 1
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        if not predicted_correct_idx:
                            #if (pterm in gterm) or (gterm in pterm):
                            if pterm == gterm:
                                predicted_correct_idx = True
                                predicted_correct += 1
                        else:
                            pass
                
        some_matched[idx] = predicted_correct_idx
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_correct += 1
        some_matched[idx] = True

In [119]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

75

In [120]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # dictionary from question to whether it was correctly predicted
some_matched = {} # dictionary from question to how much of it was correctly predicted

# iterate each question to retrieve the set of predicted terms for a question
for idx, pred_terms in enumerate(chatgpt_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [121]:
# number of questions counted as correct when at least one predicted result matches
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

75

In [122]:
# sum of the per-question matched fractions
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

72.19051806985833

In [123]:
# number of questions whose matched fraction is exactly 1 (all gold answers matched)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

66.0

In [124]:
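# reported as precision, recall, and f1-score: share of questions with at least one match,
# mean matched fraction (72.19 truncated to 72), and share of fully matched questions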
75/150, 72/150, 66/150

(0.5, 0.48, 0.44)

## Explain test query in chain of thought in fewer than 100 words

### Evaluate the chatgpt_cot_query_results

In [125]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

In [126]:
test.shape

(150, 46)

In [127]:
# retrieve query results
import ast

query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        query_results_terms.append(terms)
              
    except:
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [128]:
# retrieve ChatGPT chain-of-thought query results
import ast

chatgpt_cot_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['chatgpt_cot_query_results_2nd'])['results']['bindings']
        #bindings = row['chatgpt_cot_query_results_2nd']['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_cot_query_results_terms.append(terms)
    except SyntaxError:
        chatgpt_cot_query_results_terms.append(['ERROR ERROR ERROR'])
        
    except TypeError:
        chatgpt_cot_query_results_terms.append(['ERROR ERROR ERROR'])
              
    except:
        print(row['chatgpt_cot_query_results_2nd'])
        ex_ans = ast.literal_eval(row['chatgpt_cot_query_results_2nd'])['boolean']
        #ex_ans = row['chatgpt_cot_query_results_2nd']['boolean']
        if ex_ans:
            chatgpt_cot_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_cot_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': False}


In [129]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}

pred_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pred_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):

        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        #if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            break # this pterm is a correct prediction, skip to next pterm
                                    # don't double count this pterm anymore

        some_matched[idx] = predicted_correct_idx

In [130]:
pred_gold_lengths

[(0, 0, 1),
 (1, 12, 12),
 (2, 0, 1),
 (3, 0, 2),
 (4, 0, 0),
 (5, 1, 1),
 (6, 1, 4),
 (7, 1, 1),
 (8, 0, 1),
 (9, 1, 1),
 (10, 1, 1),
 (11, 5, 5),
 (12, 1, 48),
 (13, 74, 85),
 (14, 1, 1),
 (15, 1, 1),
 (16, 0, 0),
 (17, 1, 61),
 (18, 1, 1),
 (19, 1, 0),
 (20, 1, 1),
 (21, 4709, 0),
 (22, 0, 0),
 (23, 1, 1714),
 (24, 0, 1),
 (25, 0, 1),
 (26, 1, 1),
 (27, 1, 1),
 (28, 0, 0),
 (29, 1, 1),
 (30, 1, 1),
 (31, 0, 0),
 (32, 0, 10),
 (33, 0, 0),
 (34, 0, 1),
 (35, 1, 26),
 (36, 2, 2),
 (37, 1, 1),
 (38, 0, 3),
 (39, 0, 0),
 (40, 0, 1),
 (41, 1, 1),
 (42, 1, 1),
 (43, 0, 0),
 (44, 26, 26),
 (45, 1, 1),
 (46, 1, 1),
 (47, 0, 0),
 (48, 0, 1),
 (49, 0, 0),
 (50, 0, 1),
 (51, 1, 1),
 (52, 197, 197),
 (53, 0, 1),
 (54, 1, 1),
 (55, 1, 1),
 (56, 1, 15),
 (57, 0, 2),
 (58, 0, 4),
 (59, 0, 0),
 (60, 0, 0),
 (61, 1, 1),
 (62, 0, 0),
 (63, 1, 1),
 (64, 0, 0),
 (65, 386, 164),
 (66, 0, 1),
 (67, 0, 1),
 (68, 1, 1),
 (69, 36, 36),
 (70, 10, 10),
 (71, 74, 74),
 (72, 1, 8),
 (73, 0, 20),
 (74, 82, 44),
 

In [131]:
test.at[23, 'cot'] = '1. Declare prefixes for the namespaces used in the query. \
2. Select distinct URIs.  3. First, check if the URI is of type `yago:ArgentineFilms`. \
4. Union the second set of URIs that are of type `dbo:Film`. \
5. For the second set of URIs, check the `country` property of the film is `res:Argentina`. \
6. And union the `country` property of the film is `"Argentina"@en` (in English). \
7. Return the matching URIs.'

In [132]:
test.iloc[23]['cot']

'1. Declare prefixes for the namespaces used in the query. 2. Select distinct URIs.  3. First, check if the URI is of type `yago:ArgentineFilms`. 4. Union the second set of URIs that are of type `dbo:Film`. 5. For the second set of URIs, check the `country` property of the film is `res:Argentina`. 6. And union the `country` property of the film is `"Argentina"@en` (in English). 7. Return the matching URIs.'

In [133]:
predicted_correct, predicted, gold

(1294, 6321, 3654)

In [134]:
pres = 0
gols = 0
for _, pre, gol in pred_gold_lengths:
    pres += pre
    gols += gol
pres, gols

(6321, 3654)
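
The totals above are dominated by a few questions whose predicted and gold answer counts differ widely. A minimal sketch for surfacing those questions from `pred_gold_lengths` follows; the helper name `largest_gaps` is hypothetical, and the sample list copies five tuples from the listing printed above.

```python
def largest_gaps(pred_gold_lengths, k=3):
    """Return the k (idx, n_pred, n_gold) tuples with the largest |n_pred - n_gold|."""
    return sorted(pred_gold_lengths, key=lambda t: abs(t[1] - t[2]), reverse=True)[:k]

# Five entries copied from the pred_gold_lengths output above
sample = [(12, 1, 48), (13, 74, 85), (21, 4709, 0), (23, 1, 1714), (65, 386, 164)]
top = largest_gaps(sample, k=2)
print(top)
```

On this sample the two largest gaps belong to questions 21 and 23, the same questions whose counts are subtracted by hand in the adjusted metrics.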

In [135]:
precision = (predicted_correct) / (predicted)
precision

0.20471444391710172

In [136]:
recall = predicted_correct/gold
recall

0.3541324575807334

In [137]:
f1 = 2 / (1/precision + 1/recall)
f1

0.2594486215538847

In [138]:
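# Adjusted scores: drop question 21 (4709 predicted, 0 gold) and question 23
# (1 predicted, 1714 gold), the two outliers visible in pred_gold_lengths.
adj_precision = predicted_correct / (predicted - 4709 -1)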
adj_recall = predicted_correct/ (gold - 0 - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.803227808814401,
adj_recall:0.6670103092783505,
adj_f1:0.7288087862573923


(1294, 6321, 3654)
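
The adjustment above removes outliers by subtracting hand-typed constants: 4709 and 1 from `predicted` and 1714 from `gold`, which match entries (21, 4709, 0) and (23, 1, 1714) of `pred_gold_lengths`. A minimal sketch of excluding questions by index instead is given here. It assumes per-question correct counts are kept, which the notebook does not do (it only accumulates a total); the name `micro_prf_excluding` and the toy tuples are hypothetical.

```python
def micro_prf_excluding(per_question, exclude=()):
    """Micro-averaged (precision, recall, f1) over (idx, n_correct, n_pred, n_gold) tuples,
    skipping the question indices listed in `exclude`."""
    exclude = set(exclude)
    correct = predicted = gold = 0
    for idx, n_correct, n_pred, n_gold in per_question:
        if idx in exclude:
            continue
        correct += n_correct
        predicted += n_pred
        gold += n_gold
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data (hypothetical): question 2 returns 1000 wrong answers
toy = [(0, 1, 1, 1), (1, 2, 4, 2), (2, 0, 1000, 0)]
raw_scores = micro_prf_excluding(toy)
adj_scores = micro_prf_excluding(toy, exclude=[2])
print(raw_scores, adj_scores)
```

Passing the excluded indices explicitly documents which questions were dropped and keeps the numerator and both denominators consistent.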

In [139]:
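# Alternative adjustment: drop question 21's 4709 predictions and, apparently,
# credit all 1714 gold answers of question 23 as correctly predicted.
adj_precision = (predicted_correct + 1714) / (predicted - 4709 + 1714)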
adj_recall = (predicted_correct + 1714)/ (gold - 0)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.9043896572459411,
adj_recall:0.8232074438970991,
adj_f1:0.8618911174785101


(1294, 6321, 3654)

In [140]:
#Stage No Noise: 
2/(1/0.8028 + 1/0.8040)

0.8033995519044063

In [141]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0
predicted_correct = 0
some_matched = {}
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    gold_terms = query_results_terms[idx]
    
    count += 1
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if not predicted_correct_idx:
                            if (pterm in gterm) or (gterm in pterm):
                                predicted_correct_idx = True
                                predicted_correct += 1
                        else:
                            pass
                
        some_matched[idx] = predicted_correct_idx
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_correct += 1
        some_matched[idx] = True

In [142]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

93

In [143]:
# Evaluate the precision and recall based on the total numbers of test questions
count = 0 # total num of questions

predicted_question = {} # dictionary from question to whether it was correctly predicted
some_matched = {} # dictionary from question to how much of it was correctly predicted

# iterate each question to retrieve the set of predicted terms for a question
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    
    # the set of correct terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term is correct, mark the question as predicted correctly
                        # and count the number of correctly predicted terms
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [144]:
# number of questions counted as correct when at least one predicted result matches
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

93

In [145]:
# sum of the per-question matched fractions
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

93.72550280550087

In [146]:
# number of questions whose matched fraction is exactly 1 (all gold answers matched)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

85.0

In [147]:
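# reported as precision, recall, and f1-score: share of questions with at least one match,
# mean matched fraction, and share of fully matched questions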
93/150, 93.75/150, 85/150

(0.62, 0.625, 0.5666666666666667)

## Explain test query in chain of thought without word limit

In [148]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

### Evaluate the chatgpt_cot_noWordLimit_query_results

In [149]:
# retrieve query results
import ast

query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        query_results_terms.append(terms)
              
    except:
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [150]:
test.iloc[2].chatgpt_cot_noWordLimit_query_results

"{'head': {'link': [], 'vars': ['uri']}, 'results': {'distinct': False, 'ordered': True, 'bindings': [{'uri': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Zugspitze'}}]}}"

In [151]:
# retrieve ChatGPT chain-of-thought (no word limit) query results
import ast

chatgpt_cot_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['chatgpt_cot_noWordLimit_query_results'])['results']['bindings']
        #bindings = row['chatgpt_cot_noWordLimit_query_results']['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_cot_query_results_terms.append(terms)
    except SyntaxError:
        chatgpt_cot_query_results_terms.append(['ERROR ERROR ERROR'])
        
    except TypeError:
        chatgpt_cot_query_results_terms.append(['ERROR ERROR ERROR'])
              
    except:
        print(row['chatgpt_cot_noWordLimit_query_results'])
        ex_ans = ast.literal_eval(row['chatgpt_cot_noWordLimit_query_results'])['boolean']
        #ex_ans = row['chatgpt_cot_noWordLimit_query_results']['boolean']
        if ex_ans:
            chatgpt_cot_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_cot_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [152]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}

pred_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pred_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):

        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        #if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            break # this pterm is a correct prediction, skip to next pterm
                                    # don't double count this pterm anymore

        some_matched[idx] = predicted_correct_idx

In [153]:
test.iloc[12].question_text

'Which countries have places with more than two caves?'

In [154]:
test.iloc[12].sparql_query

'PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> SELECT DISTINCT ?uri WHERE { ?cave rdf:type dbo:Cave ; dbo:location ?uri . ?uri rdf:type dbo:Country } GROUP BY ?uri HAVING ( COUNT(?cave) > 2 )'

In [155]:
test.iloc[12].chatgpt_cot_noWordLimit_query

'PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX dbr: <http://dbpedia.org/resource/>  SELECT DISTINCT ?countryURI WHERE {    ?cave dbo:type dbo:Cave .    ?cave dbo:location ?place .    ?place dbo:country ?countryURI .    ?countryURI dbo:type dbo:Country . }  GROUP BY ?countryURI  HAVING (COUNT(?cave) > 2)'

In [156]:
test.iloc[12].masked_question

'Which countries have places with more than two [MASK1]?'

In [157]:
test.iloc[12].chatgpt_train_cot_fewshot_query

'PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> SELECT DISTINCT ?uri WHERE { ?cave rdf:type dbo:Cave ; dbo:location ?uri . ?uri rdf:type dbo:Country } GROUP BY ?uri HAVING ( COUNT(?cave) > 2 )'

In [158]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

In [159]:
pred_gold_lengths

[(0, 1, 1),
 (1, 0, 12),
 (2, 1, 1),
 (3, 2, 2),
 (4, 0, 0),
 (5, 1, 1),
 (6, 0, 4),
 (7, 1, 1),
 (8, 1, 1),
 (9, 1, 1),
 (10, 1, 1),
 (11, 5, 5),
 (12, 0, 48),
 (13, 82, 85),
 (14, 1, 1),
 (15, 1, 1),
 (16, 0, 0),
 (17, 19, 61),
 (18, 0, 1),
 (19, 0, 0),
 (20, 1, 1),
 (21, 0, 0),
 (22, 0, 0),
 (23, 240, 1714),
 (24, 0, 1),
 (25, 0, 1),
 (26, 0, 1),
 (27, 1, 1),
 (28, 0, 0),
 (29, 1, 1),
 (30, 1, 1),
 (31, 0, 0),
 (32, 10, 10),
 (33, 0, 0),
 (34, 1, 1),
 (35, 26, 26),
 (36, 0, 2),
 (37, 1, 1),
 (38, 0, 3),
 (39, 0, 0),
 (40, 1, 1),
 (41, 1, 1),
 (42, 1, 1),
 (43, 1, 0),
 (44, 2, 26),
 (45, 1, 1),
 (46, 1, 1),
 (47, 0, 0),
 (48, 1, 1),
 (49, 0, 0),
 (50, 0, 1),
 (51, 0, 1),
 (52, 197, 197),
 (53, 2, 1),
 (54, 1, 1),
 (55, 1, 1),
 (56, 15, 15),
 (57, 0, 2),
 (58, 4, 4),
 (59, 0, 0),
 (60, 0, 0),
 (61, 1, 1),
 (62, 0, 0),
 (63, 1, 1),
 (64, 0, 0),
 (65, 4, 164),
 (66, 1, 1),
 (67, 0, 1),
 (68, 0, 1),
 (69, 0, 36),
 (70, 37, 10),
 (71, 74, 74),
 (72, 1, 8),
 (73, 0, 20),
 (74, 10000, 44),


In [160]:
pres = 0
gols = 0
for _, pre, gol in pred_gold_lengths:
    pres += pre
    gols += gol
pres, gols

(11452, 3654)

In [161]:
precision = (predicted_correct) / (predicted)
precision

0.12198742577715684

In [162]:
recall = predicted_correct/gold
recall

0.3823207443897099

In [163]:
f1 = 2 / (1/precision + 1/recall)
f1

0.18495961869455843
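
The three cells above compute micro-averaged scores: the answer counts are pooled over all questions before dividing. Below is a minimal sketch of the same arithmetic as one function, with guards for empty denominators (the `2 / (1/precision + 1/recall)` form raises `ZeroDivisionError` when either score is 0). The name `micro_prf` is a hypothetical helper, not part of the notebook; the numbers passed in are the counts printed by the notebook.

```python
def micro_prf(correct, predicted, gold):
    """Micro-averaged precision, recall and F1 from pooled answer counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        # no correct answers at all: define F1 as 0 instead of dividing by zero
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# counts taken from the notebook output: (predicted_correct, predicted, gold)
p, r, f1 = micro_prf(1397, 11452, 3654)
print(p, r, f1)  # approximately 0.122, 0.382, 0.185
```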

In [164]:
adj_precision = (predicted_correct - 240) / (predicted - 10000 -1)
adj_recall = (predicted_correct - 240) / (gold - 44 - 1714)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))
predicted_correct, predicted, gold

adj_precision:0.7973811164713991,
adj_recall:0.6102320675105485,
adj_f1:0.6913654018524051


(1397, 11452, 3654)
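
The constants subtracted in the adjusted scores appear to correspond to the outlier rows of `pred_gold_lengths` (question 23 with 240 predicted and 1714 gold answers, question 74 with 10000 predicted and 44 gold answers); that reading is an assumption, since the notebook does not say so. One way to make such an adjustment reproducible is to keep per-question counts and pool them while skipping a named set of questions. The sketch below uses a hypothetical helper `micro_counts_excluding` and toy data; the notebook itself does not store per-question correct counts.

```python
def micro_counts_excluding(per_question, excluded):
    """Pool (predicted, gold, correct) counts over questions not in `excluded`.

    per_question maps a question index to a (n_predicted, n_gold, n_correct) tuple.
    """
    predicted = gold = correct = 0
    for idx, (n_pred, n_gold, n_correct) in per_question.items():
        if idx in excluded:
            continue  # leave this question out of every total, not just one of them
        predicted += n_pred
        gold += n_gold
        correct += n_correct
    return correct, predicted, gold


# toy data (hypothetical): question 2 returns an unusually large result set
toy = {0: (1, 1, 1), 1: (5, 5, 4), 2: (10000, 44, 0)}
correct, predicted, gold = micro_counts_excluding(toy, excluded={2})
adj_precision = correct / predicted
adj_recall = correct / gold
```

Excluding a question from the numerator and both denominators at once keeps the adjusted precision and recall consistent with each other.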

In [165]:
# Evaluate correctness per test question: whether any answer matches, and what fraction of gold answers is matched
count = 0
predicted_question = {}
some_matched = {}
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    gold_terms = query_results_terms[idx]
    
    count += 1
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        predicted_correct = 0
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        #if not predicted_correct_idx:
                        if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1

                
        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)
        
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1

In [166]:
for k in some_matched:
    if some_matched[k] == 0:
        print(k)

45
61
66
72
92
100
104
112
123
127
145


In [167]:
sum(predicted_question.values())

109

In [168]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

109

In [169]:
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += 1
total

99

In [170]:
109 / 150, 99/150 

(0.7266666666666667, 0.66)

In [171]:
# Evaluate correctness per test question: whether any answer matches, and what fraction of gold answers is matched
count = 0 # total num of questions

predicted_question = {} # dictionary from question to whether it was correctly predicted
some_matched = {} # dictionary from question to the fraction of its gold answers that were matched

# iterate over the questions, taking the predicted terms of each one
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    
    # the gold (correct) terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term matches a gold term, mark the question as correctly predicted
                        # and count the match
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0
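
The loop above adds one to `predicted_correct` for every (predicted, gold) pair that matches by substring, so a single gold term can be counted several times and `predicted_correct / len(gold_terms)` can in principle exceed 1. A minimal sketch of a bounded alternative counts each gold term at most once. The names `substring_match` and `gold_coverage` are hypothetical, and unlike the loop above this version drops empty strings before dividing.

```python
def substring_match(pterm, gterm):
    """Loose match used in the notebook: underscores become spaces, then substring test."""
    p = pterm.replace("_", " ")
    g = gterm.replace("_", " ")
    return p in g or g in p


def gold_coverage(pred_terms, gold_terms):
    """Fraction of gold terms matched by at least one predicted term (always in [0, 1])."""
    pred = [t for t in pred_terms if t]  # skip empty strings
    gold = [t for t in gold_terms if t]
    if not pred and not gold:
        return 1.0  # both results empty: counted as correct, as in the notebook
    if not pred or not gold:
        return 0.0  # exactly one side empty: counted as incorrect
    covered = sum(1 for g in gold if any(substring_match(p, g) for p in pred))
    return covered / len(gold)


print(gold_coverage(["paris"], ["paris", "london"]))
print(gold_coverage(["new_york", "new_york_city"], ["new_york"]))
```

With the pairwise count, the second call would yield 2 matches for 1 gold term; counting covered gold terms keeps the score at 1.0.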

In [172]:
# number of questions with at least one correctly predicted answer
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

109

In [173]:
# partial-credit total: sum of the matched fractions over all questions
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

103.66968152023185

In [174]:
# number of questions whose matched fraction is exactly 1 (all gold answers matched)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

99.0

In [175]:
# question-level rates out of 150 questions: any match, partial credit, all gold answers matched
109/150, 103.67/150, 99/150

(0.7266666666666667, 0.6911333333333334, 0.66)
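
The three numbers above are question-level rates rather than precision, recall and F1 in the usual sense: the share of questions with at least one match, the average matched fraction, and the share of questions with a matched fraction of exactly 1. A small sketch that derives all three from a `some_matched`-style dictionary follows; `question_level_rates` is a hypothetical helper and the dictionary is toy data.

```python
def question_level_rates(some_matched, n_questions):
    """Return (any-match rate, partial-credit rate, full-match rate).

    some_matched maps a question index to the fraction of its gold answers matched.
    """
    values = list(some_matched.values())
    any_match = sum(1 for v in values if v > 0)
    partial_credit = sum(values)
    full_match = sum(1 for v in values if v == 1)
    return (any_match / n_questions,
            partial_credit / n_questions,
            full_match / n_questions)


# toy data (hypothetical): 4 questions, one of them missing from the dictionary
rates = question_level_rates({0: 1, 1: 0.5, 2: 0}, n_questions=4)
print(rates)  # (0.5, 0.375, 0.25)
```

Passing `n_questions` explicitly avoids hard-coding 150 and still counts questions that never received an entry.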

## Explain train query in chain of thought and few-shot learning

### Evaluate the chatgpt_train_cot_query_results: not done yet

In [176]:
test.columns

Index(['id', 'answertype', 'aggregation', 'onlydbo', 'hybrid', 'question_text',
       'question_keywords', 'sparql_query', 'answer_head', 'answer_results',
       'question', 'query', 'answers', 'gpt_answers_text_DBpedia-2016-04',
       'chatgpt_answers_text_DBpedia_2016_04',
       'gold_query_results_DBpedia_2023_03',
       'chatgpt_answers_text_DBpedia_2023_03', 'gpt_query_DBpedia_2023_03',
       'gpt_query_results_DBpedia_2023_03', 'chatgpt_query_DBpedia_2023_03',
       'chatgpt_query_results_DBpedia_2023_03', 'test_question_embedding',
       'chatgpt_train_3fewshot_query', 'chatgpt_train_3fewshot_query_results',
       'chatgpt_train_1fewshot_query', 'chatgpt_train_1fewshot_query_results',
       'gpt_fewshot_query', 'gpt_fewshot_query_results', 'cot',
       'chatgpt_cot_query', 'chatgpt_cot_query_results', 'masked_question',
       'masked_query', 'masked_cot', 'chatgpt_train_cot_fewshot_query',
       'chatgpt_train_cot_fewshot_query_results',
       'chatgpt_nomasked_tra

In [177]:
test.shape

(150, 46)

In [178]:
# retrieve the gold query results as lists of normalized terms
import ast

query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        query_results_terms.append(terms)
              
    except KeyError: # ASK queries return a 'boolean' field instead of 'results'
        print(row['gold_query_results_DBpedia_2023_03'])
        ex_ans = ast.literal_eval(row['gold_query_results_DBpedia_2023_03'])['boolean']
        if ex_ans:
            query_results_terms.append([str(ex_ans).lower()])
        else:
            query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}


In [179]:
# retrieve the ChatGPT chain-of-thought query results as lists of normalized terms
import ast

chatgpt_cot_query_results_terms = []
count = 0
for idx, row in test.iterrows():
    try:
        bindings = ast.literal_eval(row['chatgpt_cot_query_results'])['results']['bindings']
        #bindings = row['chatgpt_cot_query_results']['results']['bindings']

        answer_list = []
        for item in bindings:
            for k in item:
                answer_list.append(item[k]['value'])

        terms = []
        for ans in answer_list:
            terms.append(ans.replace('http://dbpedia.org/resource/', '').replace('dbo:', '').strip().lower())
        #if terms not in answer_terms:
        chatgpt_cot_query_results_terms.append(terms)
    except (SyntaxError, TypeError):
        # the stored result could not be parsed; record a sentinel term instead
        chatgpt_cot_query_results_terms.append(['ERROR ERROR ERROR'])
              
    except KeyError: # ASK queries return a 'boolean' field instead of 'results'
        print(row['chatgpt_cot_query_results'])
        ex_ans = ast.literal_eval(row['chatgpt_cot_query_results'])['boolean']
        #ex_ans = row['chatgpt_cot_query_results']['boolean']
        if ex_ans:
            chatgpt_cot_query_results_terms.append([str(ex_ans).lower()])
        else:
            chatgpt_cot_query_results_terms.append([])
        count += 1

{'head': {'link': []}, 'boolean': False}
{'head': {'link': []}, 'boolean': True}
{'head': {'link': []}, 'boolean': True}
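
The two cells above repeat the same parsing steps for the gold and the predicted results. A minimal sketch of a shared helper is shown below, assuming each cell value is either a string holding a Python-literal dictionary in the SPARQL JSON results layout or something unparseable. The name `extract_terms` is hypothetical; the sentinel and the normalization copy the cells above.

```python
import ast

RESOURCE_PREFIX = "http://dbpedia.org/resource/"
ERROR_TERMS = ["ERROR ERROR ERROR"]  # same sentinel as in the cells above


def extract_terms(raw):
    """Turn one stored query result into a list of normalized answer terms."""
    try:
        result = ast.literal_eval(raw) if isinstance(raw, str) else raw
    except (SyntaxError, ValueError):
        return list(ERROR_TERMS)
    if not isinstance(result, dict):
        return list(ERROR_TERMS)  # e.g. a missing value instead of a result
    if "boolean" in result:
        # ASK query: keep 'true' as the single term, treat False as an empty result
        return ["true"] if result["boolean"] else []
    terms = []
    for binding in result.get("results", {}).get("bindings", []):
        for var in binding:
            value = binding[var]["value"]
            terms.append(value.replace(RESOURCE_PREFIX, "")
                              .replace("dbo:", "").strip().lower())
    return terms


ask = extract_terms("{'head': {'link': []}, 'boolean': True}")
select = extract_terms(
    "{'head': {'vars': ['uri']}, 'results': {'bindings': "
    "[{'uri': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Mountain_Time_Zone'}}]}}"
)
print(ask, select)  # ['true'] ['mountain_time_zone']
```

With such a helper, both term lists could be built with one list comprehension per column, which keeps the gold and predicted sides normalized identically.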


In [180]:
chatgpt_cot_query_results_terms

[['mountain_time_zone'],
 ['decimus_junius_brutus_albinus',
  'gaius_cassius_longinus',
  'marcus_junius_brutus',
  'pontius_aquila',
  'pacuvius_labeo',
  'quintus_ligarius',
  'publius_servilius_casca',
  'gaius_cassius_parmensis',
  'gaius_trebonius',
  'lucius_minucius_basilus',
  'marcus_porcius_cato_(son_of_cato_the_younger)',
  'tillius_cimber'],
 [],
 [],
 [],
 ['0'],
 ['vesna_pisarović', 'gizem_saka', 'kelly_kelekidou', 'cameron_cartio'],
 [],
 [],
 ['50035'],
 ['miles_&_more'],
 ['andorra', 'denmark', 'belgium', 'united_kingdom', 'sweden'],
 ['abkhazia',
  'australia',
  'austria',
  'bosnia_and_herzegovina',
  'brazil',
  'british_columbia',
  'canada',
  'china',
  'cumbria',
  'derbyshire',
  'devon',
  'england',
  'france',
  'georgia_(country)',
  'germany',
  'gibraltar',
  'gozo',
  'greece',
  'india',
  'israel',
  'italy',
  'jamaica',
  'japan',
  'lancashire',
  'malta',
  'mexico',
  'nepal',
  'north_yorkshire',
  'northern_territory',
  'philippines',
  'portu

In [181]:
# Evaluate the precision and recall based on the total numbers of 
# gold answers and predicted answers
predicted = 0
gold = 0
predicted_correct = 0
some_matched = {}

pred_gold_lengths = []
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    
    gold_terms = query_results_terms[idx]
    
    predicted +=  len(pred_terms)
    gold += len(gold_terms)
    
    pred_gold_lengths.append((idx, len(pred_terms), len(gold_terms)))
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):

        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        if pterm ==  gterm:
                        #pterm = pterm.replace("_", " ")
                        #gterm = gterm.replace("_", " ")
                        #if (pterm in gterm) or (gterm in pterm):
                            predicted_correct_idx = True
                            predicted_correct += 1
                            break # this pterm is a correct prediction, skip to next pterm
                                    # don't double count this pterm anymore

        some_matched[idx] = predicted_correct_idx

In [182]:
predicted_correct, predicted, gold

(1307, 21352, 3654)
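
The exact-match loop above compares every predicted term with every gold term until it finds an equal one. The same per-question count can be obtained with a set lookup, which avoids the inner loop on large result sets. This is a sketch with a hypothetical helper name, `count_exact_matches`; it counts every non-empty predicted term that occurs among the gold terms, as the loop does.

```python
def count_exact_matches(pred_terms, gold_terms):
    """Number of non-empty predicted terms that are equal to some gold term."""
    gold_set = {g for g in gold_terms if g}  # drop empty strings, as in the loop
    return sum(1 for p in pred_terms if p and p in gold_set)


# a repeated predicted term is counted each time it appears, matching the loop above
n = count_exact_matches(["andorra", "denmark", "andorra", ""], ["andorra", "sweden"])
print(n)  # 2
```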

In [183]:
pres = 0
gols = 0
for _, pre, gol in pred_gold_lengths:
    pres += pre
    gols += gol
pres, gols

(21352, 3654)

In [184]:
precision = (predicted_correct) / (predicted)
precision

0.06121206444361184

In [185]:
recall = predicted_correct/gold
recall

0.3576902025177887

In [186]:
f1 = 2 / (1/precision + 1/recall)
f1

0.1045349116212109

In [187]:
adj_precision = predicted_correct / (predicted - 10000 -1 - 10000)
adj_recall = predicted_correct/ (gold - 0 - 1714 - 28)
adj_f1 = 2 / (1/adj_precision +  1/adj_recall)
print('adj_precision:{},\nadj_recall:{},\nadj_f1:{}'.format(adj_precision, adj_recall, adj_f1))

adj_precision:0.9674315321983715,
adj_recall:0.6835774058577406,
adj_f1:0.8011032791909286


In [188]:
# Stage No Noise: harmonic mean (F1) of the two scores 0.8028 and 0.8040
2/(1/0.8028 + 1/0.8040)

0.8033995519044063

In [189]:
# Evaluate correctness per test question: count questions with at least one matching answer
count = 0
predicted_correct = 0
some_matched = {}
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    gold_terms = query_results_terms[idx]
    
    count += 1
    
    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct_idx = False
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        if not predicted_correct_idx:
                            if (pterm in gterm) or (gterm in pterm):
                                predicted_correct_idx = True
                                predicted_correct += 1
                        else:
                            pass
                
        some_matched[idx] = predicted_correct_idx
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_correct += 1
        some_matched[idx] = True

In [190]:
total = 0
for k in some_matched:
    if some_matched[k]:
        total += 1
total

102

In [191]:
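# question-level accuracy: share of questions with at least one matching answer (or both results empty)
precision = predicted_correct / count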
precision

0.68

In [192]:
# same ratio as the precision above, since both are divided by the number of questions
recall = predicted_correct/count
recall

0.68

In [193]:
f1 = 2 / (1/precision + 1/recall)
f1

0.68

In [194]:
# Evaluate correctness per test question: whether any answer matches, and what fraction of gold answers is matched
count = 0 # total num of questions

predicted_question = {} # dictionary from question to whether it was correctly predicted
some_matched = {} # dictionary from question to the fraction of its gold answers that were matched

# iterate over the questions, taking the predicted terms of each one
for idx, pred_terms in enumerate(chatgpt_cot_query_results_terms):
    
    # the gold (correct) terms for this question
    gold_terms = query_results_terms[idx]
    
    # increment the number of questions
    count += 1
    
    # flag: assume this question has not been predicted correctly yet
    predicted_correct_idx = False

    if (len(pred_terms) > 0) and (len(gold_terms) > 0):
        predicted_correct = 0
        predicted_correct_idx = False
        # iterate through the set of predicted terms for this question
        for pterm in pred_terms:
            if len(pterm) > 0: # skip an empty string
                for gterm in gold_terms:
                    if len(gterm) > 0:
                        #if pterm ==  gterm:
                        # normalize both predicted and correct terms for comparison
                        pterm = pterm.replace("_", " ")
                        gterm = gterm.replace("_", " ")
                        # if the predicted term matches a gold term, mark the question as correctly predicted
                        # and count the match
                        if (pterm in gterm) or (gterm in pterm):
                            #if pterm == gterm:
                            predicted_correct_idx = True
                            predicted_correct += 1

        predicted_question[idx] = predicted_correct_idx
        some_matched[idx] = predicted_correct / len(gold_terms)

    # if both predicted and correct results are empty
    elif (len(pred_terms) == 0) and (len(gold_terms) == 0):
        predicted_question[idx] = True
        some_matched[idx] = 1
    # incorrectly predicted if one of them is empty
    else:
        predicted_question[idx] = False
        some_matched[idx] = 0

In [195]:
# number of questions with at least one correctly predicted answer
total_predicted_questions = sum(list(predicted_question.values()))
total_predicted_questions

102

In [196]:
# partial-credit total: sum of the matched fractions over all questions
total_matched = 0
for k in some_matched:
    if some_matched[k]:
        total_matched += some_matched[k]
total_matched

99.3144953470124

In [197]:
# number of questions whose matched fraction is exactly 1 (all gold answers matched)
total = 0
for k in some_matched:
    if some_matched[k] == 1:
        total += some_matched[k]
total

90.0

In [198]:
# question-level rates out of 150 questions: any match, partial credit, all gold answers matched
102/150, 99/150, 90/150

(0.68, 0.66, 0.6)