In [1]:
import os
import json
import argparse
from tqdm import tqdm
from datetime import datetime
import pandas as pd
import requests

In [3]:
parser = argparse.ArgumentParser()
parser.add_argument("--option", default="cot", type=str)
parser.add_argument("--model", default="llama-2-7b-chat", type=str, help="qwen1.5-14b-chat and qwen-turbo are better")
parser.add_argument("--start", default=0, type=int)
parser.add_argument("--end", default=100, type=int)
parser.add_argument(
    "--temperature",
    type=float,
    default=0.5,
    help="temperature of 0 implies greedy sampling.",
)
parser.add_argument(
    "--traced_json_file",
    default=r"released_data\train.traced.json",
    type=str,
)
parser.add_argument(
    "--tables_json_file",
    default=r"data\traindev_tables.json",
    type=str,
)
parser.add_argument(
    "--topk_path",
    default=r"data\traindev_request_tok",
    type=str,
)

args = parser.parse_args("")

In [4]:
demonstration = {}
demonstration["none"] = ""
demonstration[
    "direct"
] = """
Read the table below regarding "geographic and economic characteristics of five diverse nations" to answer the following questions.

Country | Land Area | Population Density | Exports | Arable Land 
Singapore | City-state | Extremely high | Machinery | Minimal
Australia | Vast | Very low | Ores | Grazing land
Japan | Limited | High | Vehicles | Limited arable land
Nigeria | Large | Moderate | Petroleum | Significant arable areas
Mexico | Expansive | Low | Manufactured goods | Croplands

This is the introduction of the table:
This table provides an overview of key geographic and economic characteristics across five nations. 
Land area and population density descriptions offer insights into spatial distributions, 
while the primary exports and agricultural land use columns shed light on economic drivers. 

I believe the following information will help you to answer the question:
Singapore's compact size belies its economic might, punching above its weight as a trade hub. 
In contrast, Australia's vast expanses provide room for agriculture and mining, though its low population density could strain growth. 
Japan overcomes territorial constraints through innovation and a focus on high-value exports. 
For developing economies like Nigeria and Mexico, balanced policies promoting industrialization and agricultural self-sufficiency are crucial.
Effective governance aligns a nation's policies with its unique geographic and demographic realities.

Read the question first, and then answer it.

Question: What principle should guide policymaking for nations?
Answer: Aligning policies with geographic/demographic realities

Question: What phrase best summarizes Singapore's economic role given its "city-state" status and "extremely high" population density?
Answer: Trade hub

Question: Name the country whose primary exports are listed as "petroleum".
Answer: Nigeria

Question: Despite its "limited" land area in the table, what strategic advantage is highlighted for Japan?
Answer: Innovation
"""

demonstration[
    "cot"
] = """
This is a demonstration:

Read the table below regarding the "2006 League of Ireland Premier Division". In order to get the answer to the question, you need to combine information from both the table and the text.

Team | Manager | Main sponsor | Kit supplier | Stadium | Capacity
Bohemians | Gareth Farrelly | Des Kelly Carpets | O'Neills | Dalymount Park | 8,500
Bray Wanderers | Eddie Gormley | Slevin Group | Adidas | Carlisle Grounds | 7,000
Cork City | Damien Richardson | Nissan | O'Neills | Turners Cross | 8,000
Derry City | Stephen Kenny | MeteorElectrical.com | Umbro | The Brandywell | 7,700
Drogheda United | Paul Doolin | Murphy Environmental | Jako | United Park | 5,400
Dublin City | Dermot Keely | Carroll 's Irish Gift Stores | Umbro | Dalymount Park | 8,500
Longford Town | Alan Mathews | Flancare | Umbro | Flancare Park | 4,500
Shelbourne | Pat Fenlon | JW Hire | Umbro | Tolka Park | 10,100
Sligo Rovers | Sean Connor | Toher 's | Jako | The Showgrounds | 5,500
St Patrick 's Athletic | John McDonnell | Smart Telecom | Umbro | Richmond Park | 5,500
UCD | Pete Mahon | Budweiser | O'Neills | Belfield Park | 1,900
Waterford United | Gareth Cronin | ThermoFrame | Diadora | Waterford Regional Sports Centre | 8,000

This is the introduction to the table:
The 2006 League of Ireland Premier Division was the 22nd season of the League of Ireland Premier Division. The division was made up of 12 teams. Shelbourne were champions while Derry City finished as runners-up. However Shelbourne were subsequently relegated to the First Division and had to withdraw from the 2007-08 UEFA Champions League and 2007 Setanta Sports Cup because of their financial difficulties.

I believe the following text information will help answer the question:
The Carlisle Grounds is a football stadium in Bray , County Wicklow , Ireland . Situated directly behind the Bray D.A.R.T . station , it is home to Bray Wanderers A.F.C . Its current capacity is roughly 4,000 .

Question: The home stadium of the Bray Wanderers of 2006 League of Ireland is situated behind what station ?
Answer: Bray D.A.R.T station
The reasoning process for this question:
Let's think step by step. From the second row of the table, we can see that Bray Wanderers participated in the 2006 League of Ireland Premier Division. Their stadium is listed as "Carlisle Grounds". The additional text information mentions that the Carlisle Grounds is situated behind a station in Bray, County Wicklow, Ireland. Putting both pieces of information together, we can conclude that the home stadium of Bray Wanderers, the Carlisle Grounds, is situated behind a station in Bray, County Wicklow. The text specifically mentions the Bray D.A.R.T station. Therefore, the answer to the question is "Bray D.A.R.T station".

"""

def read_data(args):
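    # Load the traced QA samples and the table store, closing each file promptly.
    with open(args.traced_json_file, "r") as f:
        data_train_traced = json.load(f)
    with open(args.tables_json_file, "r") as f:
        traindev_table = json.load(f)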

    data_list = []
    for sample in tqdm(data_train_traced[args.start:args.end]):
        table_id = sample["table_id"]
        # Skip samples whose retrieved-passage file is missing.
        topk_file = os.path.join(args.topk_path, f"{table_id}.json")
        try:
            with open(topk_file, "r") as f:
                topk = json.load(f)
        except FileNotFoundError:
            print(f"The file {topk_file} does not exist.")
            continue
        question_text = sample["question"]
        answer_text = sample["answer-text"]
        wikis = [
            node[2]
            for node in sample["answer-node"]
            if node[2] is not None and node[2].startswith("/wiki")
        ]
        if len(wikis) == 0:
            wiki_text = ""
        else:
            wiki_text = "\n".join([topk[wiki] for wiki in wikis])
        df = pd.DataFrame(
            [tuple(zip(*row))[0] for row in traindev_table[table_id]["data"]],
            columns=list(zip(*traindev_table[table_id]["header"]))[0],
        )
        data_list.append(
            {
                "question": question_text,
                "answer": answer_text,
                "title": traindev_table[table_id]["title"],
                "table": df,
                "wiki": wiki_text,
                "table_id": table_id,
                "intro": traindev_table[table_id]["intro"]
            }
        )
    return data_list


def df_format(data):
    try:
        formatted_str = " | ".join(data.columns) + "\n"
        for _, row in data.iterrows():
            row_str = " | ".join([str(row[col]) for col in data.columns])
            formatted_str += row_str + "\n"
        return formatted_str
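    except Exception as exc:
        # The original message referenced an undefined name (csv_path); report the error itself.
        print(f"wrong table: {exc}")
        return ""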
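
The `zip(*...)` calls in `read_data` and the layout produced by `df_format` are easier to follow on a toy input. The sketch below is only an illustration: `toy_table` is a made-up miniature, and it assumes that every header and data cell in the tables JSON is a `[text, links]` pair, which is what the indexing in `read_data` implies. It uses plain Python instead of pandas so it runs without the data files.

```python
# Hypothetical miniature of one entry in the tables JSON (assumed layout:
# each header and data cell is a [text, links] pair).
toy_table = {
    "header": [["Team", []], ["Stadium", []]],
    "data": [
        [["Bray Wanderers", []], ["Carlisle Grounds", ["/wiki/Carlisle_Grounds"]]],
        [["Cork City", []], ["Turners Cross", ["/wiki/Turners_Cross"]]],
    ],
}

# zip(*pairs) transposes the pairs into (texts, links); index 0 keeps the texts.
columns = list(zip(*toy_table["header"]))[0]
rows = [tuple(zip(*row))[0] for row in toy_table["data"]]

# Same pipe-delimited layout that df_format builds from the DataFrame.
toy_text = "\n".join(" | ".join(cells) for cells in [columns, *rows]) + "\n"
print(toy_text)
```

The hyperlinks are dropped at this stage; `read_data` takes the linked passages from the `answer-node` entries of each sample instead.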




In [12]:
now = datetime.now()
dt_string = now.strftime("%d_%H_%M")
os.makedirs("outputs", exist_ok=True)  # make sure the output directory exists
fw = open(f"outputs/response_s{args.start}_e{args.end}_{args.option}_{args.model}_{dt_string}.json", "w")
tmp = {"demonstration": demonstration[args.option]}
fw.write(json.dumps(tmp) + "\n")

2833

In [13]:
data_list = read_data(args)

100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1727.75it/s]

The file data\traindev_request_tok\List_of_Important_Cultural_Properties_of_Japan_(Okinawa:_structures)_3.json does not exist.





In [14]:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-13b-chat-hf"
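# Read the token from the environment instead of hard-coding a secret
# (HF_API_TOKEN is an assumed variable name; export it before running).
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}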

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
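
Note that `query` is called with only `inputs`, so `args.temperature` never reaches the model. A minimal sketch of how the request body could carry it is shown below. `build_payload` is a hypothetical helper, not part of the notebook, and the `parameters` / `options` field names follow the Inference API's text-generation task as I understand it; check them against the current documentation before relying on them.

```python
def build_payload(prompt, temperature=0.5, max_new_tokens=64):
    """Assemble a text-generation request body (hypothetical helper)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if temperature > 0:
        # Sample from the distribution at the requested temperature.
        parameters["do_sample"] = True
        parameters["temperature"] = temperature
    else:
        # Treat a temperature of 0 as greedy decoding rather than sending 0.
        parameters["do_sample"] = False
    return {
        "inputs": prompt,
        "parameters": parameters,
        # Ask the service to wait for a cold model instead of returning an error.
        "options": {"wait_for_model": True},
    }


example_payload = build_payload("Question: ...\nAnswer:", temperature=0.5)
print(example_payload["parameters"])
```

The result could then be passed as `query(build_payload(prompt, args.temperature))`; the default of 64 new tokens is an arbitrary choice for short answers.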

In [15]:
for entry in tqdm(data_list):
    question = entry['question']
    answer = entry['answer']

    #### Formalizing the k-shot demonstration. #####
    prompt = demonstration[args.option] + '\n\n'
    prompt += f'Read the table and text regarding "{entry["title"]}" to answer the following question.\n\n'
    prompt += "The table contains important information and this is the introduction of the table:" + '\n' + entry['intro'] + '\n\n'
    prompt += df_format(entry['table']) + '\n'
    
    if entry['wiki']:
        prompt += "I believe the following text information will help answer the question:" + '\n' + entry['wiki'] + '\n\n'
        prompt += "It is better to use words from the text and table in the answer to improve accuracy. Please think step by step." + '\n\n'
    prompt += 'Question: ' + question + '\nAnswer:'

    response_raw = query({'inputs': prompt})
    try:
        response = response_raw[0].get('generated_text', '').split('\nAnswer:')[2].split('Reasoning process')[0].strip()
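    except (KeyError, IndexError):
        # KeyError: the API returned an error dict; IndexError: fewer than three '\nAnswer:' segments.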
        response = ''

    response = response.split('\n')[0].strip()

    tmp = {
        "question": question,
        "response": response,
        "answer": answer,
        "table_id": entry["table_id"],
    }

    fw.write(json.dumps(tmp) + "\n")

fw.close()


100%|██████████████████████████████████████████████████████████████████████████████████| 99/99 [05:55<00:00,  3.59s/it]
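
The fixed index `[2]` in the parsing step only lines up with the `cot` demonstration, which contains exactly one `Answer:` before the real question; `direct` contains four and `none` contains zero. One possible alternative, sketched below, is to strip the echoed prompt from the generated text. `extract_answer` is a hypothetical helper, and it assumes the API returns the prompt followed by the completion, as the outputs in this notebook suggest.

```python
def extract_answer(generated_text, prompt):
    """Return the first line the model wrote after the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt):
        # Usual case: the response echoes the prompt, so drop that prefix.
        completion = generated_text[len(prompt):]
    else:
        # Fallback: keep whatever follows the last answer marker.
        completion = generated_text.rsplit("\nAnswer:", 1)[-1]
    return completion.strip().split("\n")[0].strip()


demo_prompt = "Question: demo?\nAnswer: demo answer\n\nQuestion: When?\nAnswer:"
demo_output = demo_prompt + " 14 December 1979.\nThe reasoning process of this question:"
print(extract_answer(demo_output, demo_prompt))
```

Because the prefix is removed by length rather than by counting markers, the same helper would apply to all three `--option` values.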


In [11]:
print(prompt)


This is a demonstration:

Read the table below regarding the "2006 League of Ireland Premier Division". In order to get the answer to the question, you need to combine information from both the table and the text.

Team | Manager | Main sponsor | Kit supplier | Stadium | Capacity
Bohemians | Gareth Farrelly | Des Kelly Carpets | O'Neills | Dalymount Park | 8,500
Bray Wanderers | Eddie Gormley | Slevin Group | Adidas | Carlisle Grounds | 7,000
Cork City | Damien Richardson | Nissan | O'Neills | Turners Cross | 8,000
Derry City | Stephen Kenny | MeteorElectrical.com | Umbro | The Brandywell | 7,700
Drogheda United | Paul Doolin | Murphy Environmental | Jako | United Park | 5,400
Dublin City | Dermot Keely | Carroll 's Irish Gift Stores | Umbro | Dalymount Park | 8,500
Longford Town | Alan Mathews | Flancare | Umbro | Flancare Park | 4,500
Shelbourne | Pat Fenlon | JW Hire | Umbro | Tolka Park | 10,100
Sligo Rovers | Sean Connor | Toher 's | Jako | The Showgrounds | 5,500
St Patrick 's A

In [52]:
response_raw[0]['generated_text'].split('\nAnswer:')[2].split('\n')[0].strip()

'14 December 1979.'

In [44]:
print(response_raw[0]['generated_text'])


This is a demonstration:

Read the table blow regarding "2006 League of Ireland Premier Division". In order to get the answer of the question, you neeed to combine both table and text information.

Team | Manager | Main sponsor | Kit supplier | Stadium | Capacity
Bohemians | Gareth Farrelly | Des Kelly Carpets | O'Neills | Dalymount Park | 8,500
Bray Wanderers | Eddie Gormley | Slevin Group | Adidas | Carlisle Grounds | 7,000
Cork City | Damien Richardson | Nissan | O'Neills | Turners Cross | 8,000
Derry City | Stephen Kenny | MeteorElectrical.com | Umbro | The Brandywell | 7,700
Drogheda United | Paul Doolin | Murphy Environmental | Jako | United Park | 5,400
Dublin City | Dermot Keely | Carroll 's Irish Gift Stores | Umbro | Dalymount Park | 8,500
Longford Town | Alan Mathews | Flancare | Umbro | Flancare Park | 4,500
Shelbourne | Pat Fenlon | JW Hire | Umbro | Tolka Park | 10,100
Sligo Rovers | Sean Connor | Toher 's | Jako | The Showgrounds | 5,500
St Patrick 's Athletic | John Mc