# ChatGPT experiments


### Task

We would like to see what you can build with the data provided. Feel free to present the results in any format you prefer and explore any additional ideas you have with the dataset. You may use any model or architecture of your choice. The goal is to demonstrate your knowledge and experience. We are particularly interested in the logic and reasoning behind your choice of accuracy metrics and your ability to communicate your solutions and ideas effectively.
 
We’d like you to demonstrate an LLM-driven prototype that can answer questions based on financial documents (text, tables, figures, etc.). The dataset we'd like you to use is train.json from the ConvFinQA dataset.
 
Here is a snippet of the JSON, which contains the question as well as the correct answer that your solution should aim to produce:
 
    "qa": {
        "question": "what was the percentage change in the net cash from operating activities from 2008 to 2009",
        "answer": "14.1%"
    }

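For reference, the expected answer can be reproduced by hand from the context table printed for row 0 further down. Net cash from operating activities is 206,588 for fiscal 2009 and 181,001 for fiscal 2008. The dataset stores the decimal result (`exe_ans` 0.14136) alongside the display string `14.1%`. Its `a_chain` [206588.0, 181001.0, 25587.0, 0.14136] follows the same two steps:

$$
\text{percentage change} = \frac{206588 - 181001}{181001} = \frac{25587}{181001} \approx 0.14136 \approx 14.1\%
$$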
Please produce a metrics report on the accuracy of your system, and a short write-up of your findings and shortcomings.  
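One consideration when choosing the accuracy metric is to compare numbers rather than strings. In the `train.json` preview further down, the display strings are not rounded consistently. For example, `exe_ans` 0.01269 is shown as `1.3%` (rounded), while -0.3282 is shown as `-32%` (truncated).

The minimal sketch below parses a gold answer string onto the `exe_ans` scale. It then accepts a prediction that lies within one unit of the last displayed digit, which tolerates both rounding and truncation. The function names are hypothetical, and the one-unit tolerance is an assumption rather than the official ConvFinQA metric. Anything other than a plain number or percentage raises an error.

```python
import re


def parse_gold_answer(answer: str) -> tuple[float, int]:
    """Parse an answer string such as '14.1%' or '-32%'.

    Returns (value, places): value is on the exe_ans scale (percentages become
    decimals) and places is the number of decimal places on that scale.
    """
    text = answer.replace(",", "").replace("$", "").strip()
    match = re.fullmatch(r"(-?\d+(?:\.(\d+))?)\s*(%?)", text)
    if match is None:
        raise ValueError(f"unrecognised answer format: {answer!r}")
    number, decimals, percent = match.groups()
    places = len(decimals) if decimals else 0
    if percent:
        # '14.1%' -> 0.141, which has two more decimal places than the string shows
        return float(number) / 100, places + 2
    return float(number), places


def matches_gold(pred: float, answer: str) -> bool:
    """Accept pred if it is within one unit of the gold string's last displayed digit.

    One unit (rather than half a unit) tolerates both rounding and truncation.
    """
    gold, places = parse_gold_answer(answer)
    return abs(pred - gold) <= 10 ** -places + 1e-12


print(matches_gold(0.14136, "14.1%"), matches_gold(-0.3282, "-32%"))
```

A stricter alternative is to skip the strings entirely and compare predictions with `exe_ans` under a small relative tolerance.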
 

In [113]:

import enum
import json
import os
from pathlib import Path
from pprint import pprint
from typing import Iterable, List, Literal, Optional, Union

import pandas as pd
from dotenv import load_dotenv
from graphviz import Digraph
from langchain_core.prompts import PromptTemplate
from openai import OpenAI, OpenAIError

# instructor and pydantic
import instructor
from pydantic import BaseModel, Field

from utils import *    # project helpers used below: get_difficulty, get_q, get_a, save_results

# Load the environment variables (e.g. OPENAI_API_KEY) from .env
load_dotenv()


True

In [114]:
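def get_eval(response_text: str, exe_ans_list: list) -> dict:
    """
    Score a single row for evaluation.

    Placeholder: every metric is hard-coded to 0 until the scoring logic is implemented.

    args:
    response_text: str: model response for this row
    exe_ans_list: list: gold execution answers, one per turn (the last one is the final answer)
    return:
    dict: one score per metric
    """
    print(response_text)
    print(exe_ans_list)

    scores = {
        'exe_accuracy_score': 0,        # todo: final answer vs gold execution answer
        'program_accuracy_score': 0,    # todo: generated operations vs gold program
        'partial_accuracy_score': 0,    # todo: intermediate turns answered correctly
        'retrieval_accuracy_score': 0,  # todo: numbers retrieved correctly from the context
    }
    return scores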

# def evaluate_results(results: list) -> dict:  # todo
#     """
#     Collect the per-row scores from get_eval into one list per metric.
#     """
#     scores = {}
#     for row in results:
#         row_scores = get_eval(row['response'], row['ans_list'])
#         for k, v in row_scores.items():
#             scores.setdefault(k, []).append(v)
#     return scores
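`get_eval` currently returns zeros. The sketch below shows one way two of its metrics could be filled in from numeric answer lists, such as `response_ans_list` and `ans_list` in the structured loop's records.

- **Execution accuracy** asks whether the final predicted amount matches the final gold `exe_ans`.
- **Partial accuracy** is the share of turns matched at the same position.

`score_row` and `turn_matches` are hypothetical names. The positional alignment assumes the model returns exactly one answer per dialogue turn, in order. The tolerances are illustrative choices, picked because gold values such as 0.14136 are themselves rounded.

```python
import math
from typing import Sequence


def turn_matches(pred: float, gold: float, rel_tol: float = 1e-3, abs_tol: float = 1e-4) -> bool:
    """Numeric match with a small tolerance; gold exe_ans values are rounded (e.g. 0.14136)."""
    return math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=abs_tol)


def score_row(pred_list: Sequence[float], gold_list: Sequence[float]) -> dict:
    """Score one conversation against its gold exe_ans_list.

    exe_accuracy_score: 1 if the last predicted amount matches the last gold amount, else 0.
    partial_accuracy_score: share of gold turns matched at the same position
    (assumes one predicted answer per dialogue turn, in order).
    """
    if not gold_list:
        return {"exe_accuracy_score": 0, "partial_accuracy_score": 0.0}
    final_ok = bool(pred_list) and turn_matches(pred_list[-1], gold_list[-1])
    hits = sum(turn_matches(p, g) for p, g in zip(pred_list, gold_list))
    return {"exe_accuracy_score": int(final_ok),
            "partial_accuracy_score": hits / len(gold_list)}


# A percent-scale slip (14.14 instead of 0.1414) fails the final turn but keeps 3/4 turns.
print(score_row([206588.0, 181001.0, 25587.0, 14.14], [206588.0, 181001.0, 25587.0, 0.14136]))
```

Reporting both numbers separates "wrong final answer" from "wrong early retrieval that propagated".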

In [122]:
# prompts etc

system_prompt = """ I am a highly intelligent bot. I can have conversations with the user to
answer a series of questions. Later questions may depend on previous questions to answer. You need
to provide me with the series of questions as the context and I will answer the last question.
"""

prompt_template = PromptTemplate.from_template(
    "Answer the following questions using the context provided:\ncontext: {context} \nquestions:\n{questions} \nanswer: {answer}"
)


In [116]:
client = OpenAI(
  organization='org-4KKv9l0WIKIGW7Je0CSuQgIW',
  project='proj_R48KNHQ3L710dyHpLG6aeIR4',
)

In [118]:
# load data set
with open('../data/ConvFinQA/data/train.json') as f:
    data = json.load(f)
df = pd.DataFrame(data)

usecols = ['q', 'q_type', 'q_difficulty', 'q_chain', 'a_chain', 'answer_0', 'exe_ans_0', 'answer_1',
           'exe_ans_1', 'pre_text', 'post_text', 'table_ori', 'table', 'annotation']

df['q_type'] = df['id'].map(lambda x: 'Type I' if 'Single' in x else 'Type II')  # question type
df['q_difficulty'] = df.annotation.map(get_difficulty)                            # question difficulty
df['q'] = df.apply(get_q, axis=1)                                                 # top-level question(s)
df['q_chain'] = df.annotation.map(lambda a: a['dialogue_break'])                  # conversational question chain
df['a_chain'] = df.annotation.map(lambda a: a['exe_ans_list'])                    # execution answer per turn

df['answers'] = df.apply(get_a, axis=1)                                           # top-level answer(s)
df['answer_0'] = df.answers.map(lambda x: x.get('answer_0').strip())              # first answer (display string)
df['exe_ans_0'] = df.answers.map(lambda x: x.get('exe_ans_0'))                    # first answer (execution value)
df['answer_1'] = df.answers.map(lambda x: x.get('answer_1', '').strip())          # second answer (Type II only)
df['exe_ans_1'] = df.answers.map(lambda x: x.get('exe_ans_1'))                    # second answer (Type II only)

df[usecols].head()

Unnamed: 0,q,q_type,q_difficulty,q_chain,a_chain,answer_0,exe_ans_0,answer_1,exe_ans_1,pre_text,post_text,table_ori,table,annotation
0,{'question_0': 'what was the percentage change...,Type I,{0: 4},[what is the net cash from operating activitie...,"[206588.0, 181001.0, 25587.0, 0.14136]",14.1%,0.14136,,,"[26 | 2009 annual report in fiscal 2008 , reve...","[year ended june 30 , cash provided by operati...","[[, Year ended June 30, 2009], [2008, 2007], [...","[[2008, year ended june 30 2009 2008, year end...",{'amt_table': '<table class='wikitable'><tr><t...
1,{'question_0': 'what was the percent of the gr...,Type I,{0: 4},"[what were revenues in 2008?, what were they i...","[9362.2, 9244.9, 117.3, 0.01269]",1.3%,0.01269,,,[substantially all of the goodwill and other i...,[the above unaudited pro forma financial infor...,"[[, Year Ended December 31, 2008 (Unaudited), ...","[[, year ended december 31 2008 ( unaudited ),...",{'amt_table': '<table class='wikitable'><tr><t...
2,{'question_0': 'what was the percentage change...,Type I,{0: 4},"[what was the total of net sales in 2001?, and...","[5363.0, 7983.0, -2620.0, -0.3282]",-32%,-0.3282,,,[in a new business model such as the retail se...,[.],"[[, 2002, 2001, 2000], [Net sales, $5,742, $5,...","[[, 2002, 2001, 2000], [net sales, $ 5742, $ 5...",{'amt_table': '<table class='wikitable'><tr><t...
3,{'question_0': 'what was the difference in per...,Type I,{0: 6},[what was the change in the performance of the...,"[-24.05, -0.2405, 102.11, 2.11, 0.0211, -0.2616]",-26.16%,-0.2616,,,[( 1 ) includes shares repurchased through our...,[.],"[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...","[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...",{'amt_table': '<table class='wikitable'><tr><t...
4,{'question_0': 'what is the roi of an investme...,Type II,"{0: 2, 1: 5}",[what was the fluctuation of the performance p...,"[-8.94, -0.0894, -24.05, -0.2405, 2.11, 0.0211...",-8.9%,-0.0894,-26.16%,-0.2616,[( 1 ) includes shares repurchased through our...,[.],"[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...","[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...",{'amt_table': '<table class='wikitable'><tr><t...
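The rows mix Type I and Type II conversations, so a single headline accuracy can hide where the system struggles. Here Type I rows have 'Single' in their id; Type II rows, such as row 4, carry a second question and answer.

The sketch below computes a per-group breakdown for the report. `accuracy_by` is a hypothetical helper, and it assumes each result record also carries the row's `q_type`; the current loop records do not store it yet. The records are toy values for illustration, not results.

```python
from collections import defaultdict
from statistics import mean


def accuracy_by(records: list[dict], key: str, metric: str = "exe_accuracy_score") -> dict:
    """Return {group value: (mean of metric, number of records)} grouped by records[key]."""
    groups: dict = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[metric])
    return {k: (mean(v), len(v)) for k, v in sorted(groups.items())}


# Toy records for illustration only, not experimental results.
records = [
    {"q_type": "Type I", "exe_accuracy_score": 1},
    {"q_type": "Type I", "exe_accuracy_score": 0},
    {"q_type": "Type II", "exe_accuracy_score": 1},
]
print(accuracy_by(records, "q_type"))
```

Reporting the group size next to each accuracy makes it clear when a group is too small to draw conclusions from.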


In [119]:
# Create base model for multiple hop QA

class AmountType(str, enum.Enum):
    """Enumeration representing the types of numerical amounts that can be used in an answer."""
    PERCENT = "PERCENT"
    NET_AMOUNT = "NET_AMOUNT"
    OTHER = "OTHER"


class OperationType(str, enum.Enum):
    """Enumeration representing the types of operations that can be used in an answer."""
    ADD = "ADD"
    SUBTRACT = "SUBTRACT"
    MULTIPLY = "MULTIPLY"
    DIVIDE = "DIVIDE"
    EXP = "EXP"
    GREATER = "GREATER"
    ASK_FOR_NUMBER = "ASK_FOR_NUMBER"


class Operation(BaseModel):
    """Class representing a single operation in an answer list"""

    operation: str = Field(
        ...,
        description=(
            "Operation used to calculate the answer, for example: add, multiply, subtract, divide, exp, greater. "
            "Format as operation(arg1, arg2), e.g. add(10, 20), ask_for_number(arg1), divide(2000, 120). "
            "Leave percentage calculations as decimals, i.e. do not multiply by 100. "
            "If the operation is a query on the text to retrieve an amount, format it as ask_for_number(arg1)."
        ),
    )
    operation_type: OperationType = Field(
        description="Operation used to calculate the answer for example: add, multiply, subtract, divide, exp, greater.",
    )

    arg1: str = Field(
        description="the first number used by the operation",
    )
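    arg2: Optional[str] = Field(
        default=None,
        description="the second number used by the operation (None for single-argument operations such as ask_for_number)",
    )

    def __str__(self):
        # use .value so the output reads e.g. "SUBTRACT(...)" on every Python version
        args = self.arg1 if self.arg2 is None else f"{self.arg1}, {self.arg2}"
        return f"{self.operation_type.value}({args})"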


class Answer(BaseModel):
    """Class representing a single answer in an answer list"""

    id: int = Field(..., description="Unique id of the answer")
    question: str = Field(
        ...,
        description="Question asked using a question answering system",
    )
    explanation: str = Field(
        ...,
        description="Explanation of the answer and the calculation",
    )
    amount: float = Field(
        ...,
        description="Amount of the answer",
    )
    operation: Operation = Field(
        ...,
        description=(
            "Operation used to calculate the answer, for example: add, multiply, subtract, divide, exp, greater. "
            "Format as operation(arg1, arg2), e.g. add(10, 20). "
            "Leave percentage calculations as decimals, i.e. do not multiply by 100. "
            "If the operation is a query on the text to retrieve an amount, format it as ask_for_number(arg1)."
        ),
    )
    dependencies: List[int] = Field(
        default_factory=list,
        description="List of ids of any previous answers that need to be calculated before this answer can be calculated",
    )
    amount_type: AmountType = Field(
        ...,
        description="Type of amount asked for the question, either a net amount or a percent or other",
    )


class AnswerList(BaseModel):
    """Container class representing a tree of questions to ask a question answering system."""

    answer_list: List[Answer] = Field(
        ..., description="The list of answers"
    )

    def _dependencies(self, ids: List[int]) -> List[Answer]:
        """Returns the answers whose ids are in `ids`."""
        return [a for a in self.answer_list if a.id in ids]
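The `amount` field holds the model's own arithmetic, which can be wrong even when the right numbers were retrieved. One option is to recompute every step deterministically from `operation_type`, `arg1` and `arg2`, then compare the result with `amount`. That comparison could feed the `program_accuracy_score` placeholder in `get_eval`.

This minimal sketch makes two assumptions. First, a plain dataclass `Step` stands in (hypothetically) for the pydantic models. Second, a `#k` convention refers to the result of step `k`; the current schema prompt does not ask for that convention, so it would need to be added. `execute`, `resolve` and `BINARY_OPS` are illustrative names.

```python
import operator
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """Simplified stand-in for the Operation/Answer models (illustrative only)."""
    id: int
    operation_type: str          # an OperationType value, e.g. "SUBTRACT"
    arg1: str
    arg2: Optional[str] = None


BINARY_OPS = {
    "ADD": operator.add,
    "SUBTRACT": operator.sub,
    "MULTIPLY": operator.mul,
    "DIVIDE": operator.truediv,
    "EXP": operator.pow,
    "GREATER": lambda a, b: float(a > b),  # 1.0 for yes, 0.0 for no
}


def resolve(arg: str, results: dict[int, float]) -> float:
    """Turn an argument into a number; '#k' (hypothetical convention) means the result of step k."""
    arg = arg.strip().replace(",", "")
    if arg.startswith("#"):
        return results[int(arg[1:])]
    return float(arg)


def execute(steps: list[Step]) -> dict[int, float]:
    """Recompute every step in order and return {step id: value}."""
    results: dict[int, float] = {}
    for step in steps:
        a = resolve(step.arg1, results)
        if step.operation_type == "ASK_FOR_NUMBER":
            results[step.id] = a
            continue
        if step.arg2 is None:
            raise ValueError(f"step {step.id}: {step.operation_type} needs two arguments")
        results[step.id] = BINARY_OPS[step.operation_type](a, resolve(step.arg2, results))
    return results


# The net cash example: two lookups, a subtraction, then a division.
steps = [
    Step(0, "ASK_FOR_NUMBER", "206588"),
    Step(1, "ASK_FOR_NUMBER", "181001"),
    Step(2, "SUBTRACT", "#0", "#1"),
    Step(3, "DIVIDE", "#2", "#1"),
]
print(execute(steps))
```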



In [128]:
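from openai.types.chat import ChatCompletion

# Patch an OpenAI client with instructor so responses are parsed into pydantic models.
# Note: this rebinds `client`, replacing the plain client created above.
client = instructor.from_openai(OpenAI())


def structuredQA(system_prompt: str, user_prompt: str, model: str = "gpt-4o-mini",
                 client=client) -> tuple[AnswerList, ChatCompletion]:
    """
    Ask the model the questions and parse the response into the custom pydantic model AnswerList.

    args:
    system_prompt: str: system prompt
    user_prompt: str: user prompt
    model: str: model to use for the completion
    client: instructor-patched OpenAI client
    return:
    tuple[AnswerList, ChatCompletion]: the parsed answers and the raw completion (used for token usage)
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    answers_list, completion = client.chat.completions.create_with_completion(
        model=model,
        temperature=0,
        response_model=AnswerList,
        messages=messages,
        max_tokens=1000,
    )
    return answers_list, completion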

In [129]:
model = "gpt-4o-mini"
res = []    # working results, one record per row

for i, row in df.iterrows():
    print(f'row {i}: {row.q}')

    # build the prompt: context = table + pre-text + post-text, questions = numbered dialogue turns
    context = '\n '.join([row.annotation.get('amt_table'),
                          row.annotation.get('amt_pre_text'),
                          row.annotation.get('amt_post_text')])
    questions = '\n'.join(f'Q{j}: {s}' for j, s in enumerate(row.annotation['dialogue_break']))
    prompt = prompt_template.format(context=context, questions=questions, answer="")
    print(prompt)

    # call the API and parse the response into an AnswerList
    answers, completion = structuredQA(system_prompt, prompt, model=model)

    response = answers.answer_list
    for a in response:
        print(a.id, a.question, a.explanation, a.amount, a.operation,
              a.dependencies, a.amount_type, sep='\n', end='\n\n')

    exe_ans_list = row.annotation.get('exe_ans_list', [])
    if exe_ans_list:
        print(exe_ans_list[-1])    # gold final answer
    print('\n~~~~\n')

    scores = {}    # todo: fill in with get_eval once scoring is implemented
    record = {
        'id': row.id,
        'qu_only': row.q,
        'context': context,
        'questions': questions,
        'prompt': prompt,
        'response': response,
        'response_ans_list': [a.amount for a in response],
        'ans_list': exe_ans_list or ['not found'],
        'exe_answer': (exe_ans_list or ['not found'])[-1],
        **scores,
        'model': model,
        **completion.usage.model_dump(),
    }
    res.append(record)

    # prototype run: stop after the first three rows (i = 0, 1, 2)
    if i > 1:
        save_results(res)
        break

row 0: {'question_0': 'what was the percentage change in the net cash from operating activities from 2008 to 2009'}
Answer the following questions using the context provided:
context: <table class='wikitable'><tr><td>1</td><td>2008</td><td>year ended june 30 2009 2008</td><td>year ended june 30 2009 2008</td><td>year ended june 30 2009</td></tr><tr><td>2</td><td>net income</td><td>$ 103102</td><td>$ 104222</td><td>$ 104681</td></tr><tr><td>3</td><td>non-cash expenses</td><td>74397</td><td>70420</td><td>56348</td></tr><tr><td>4</td><td>change in receivables</td><td>21214</td><td>-2913 ( 2913 )</td><td>-28853 ( 28853 )</td></tr><tr><td>5</td><td>change in deferred revenue</td><td>21943</td><td>5100</td><td>24576</td></tr><tr><td>6</td><td>change in other assets and liabilities</td><td>-14068 ( 14068 )</td><td>4172</td><td>17495</td></tr><tr><td>7</td><td>net cash from operating activities</td><td>$ 206588</td><td>$ 181001</td><td>$ 174247</td></tr></table>
 26 | 2009 annual report in fis
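The `response` field in each record holds pydantic `Answer` objects, and `json.dumps` cannot serialise those on its own. `save_results` comes from `utils`, and its implementation is not shown here. If it writes JSON, a `default=` hook like the one below is one way to handle those objects.

`to_jsonable` and `append_jsonl` are hypothetical helpers. `model_dump(mode="json")` assumes pydantic v2.

```python
import json
from pathlib import Path


def to_jsonable(obj):
    """json.dumps fallback: pydantic v2 models via model_dump(mode='json'), anything else via str()."""
    if hasattr(obj, "model_dump"):
        return obj.model_dump(mode="json")
    return str(obj)


def append_jsonl(records: list[dict], path: str) -> None:
    """Append one JSON object per line, so results from an interrupted run are not lost."""
    with Path(path).open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, default=to_jsonable) + "\n")
```

Appending line by line also makes it cheap to save after every row instead of only at the end of the loop.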

In [130]:
res

[{'id': 'Single_JKHY/2009/page_28.pdf-3',
  'qu_only': {'question_0': 'what was the percentage change in the net cash from operating activities from 2008 to 2009'},
  'context': "<table class='wikitable'><tr><td>1</td><td>2008</td><td>year ended june 30 2009 2008</td><td>year ended june 30 2009 2008</td><td>year ended june 30 2009</td></tr><tr><td>2</td><td>net income</td><td>$ 103102</td><td>$ 104222</td><td>$ 104681</td></tr><tr><td>3</td><td>non-cash expenses</td><td>74397</td><td>70420</td><td>56348</td></tr><tr><td>4</td><td>change in receivables</td><td>21214</td><td>-2913 ( 2913 )</td><td>-28853 ( 28853 )</td></tr><tr><td>5</td><td>change in deferred revenue</td><td>21943</td><td>5100</td><td>24576</td></tr><tr><td>6</td><td>change in other assets and liabilities</td><td>-14068 ( 14068 )</td><td>4172</td><td>17495</td></tr><tr><td>7</td><td>net cash from operating activities</td><td>$ 206588</td><td>$ 181001</td><td>$ 174247</td></tr></table>\n 26 | 2009 annual report in fiscal 

In [110]:
model = "gpt-4o-mini"
res = []    # working results, one record per row
raw_client = OpenAI()    # plain (unpatched) client: `client` is instructor-patched once In [128] has run

for i, row in df.iterrows():
    print(f'row {i}: {row.q}')

    # build the prompt: context = table + pre-text + post-text, questions = numbered dialogue turns
    context = '\n '.join([row.annotation.get('amt_table'),
                          row.annotation.get('amt_pre_text'),
                          row.annotation.get('amt_post_text')])
    questions = '\n'.join(f'Q{j}: {s}' for j, s in enumerate(row.annotation['dialogue_break']))
    prompt = prompt_template.format(context=context, questions=questions, answer="")
    print(prompt)

    # call the API (free-text response)
    response = raw_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        # todo
        temperature=0.5,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    response_text = response.choices[0].message.content
    exe_ans_list = row.annotation.get('exe_ans_list', [])

    scores = get_eval(response_text, exe_ans_list)    # prints the response and gold answers
    print(scores)
    print('\n~~~~\n')

    record = {
        'id': row.id,
        'qu_only': row.q,
        'context': context,
        'questions': questions,
        'prompt': prompt,
        'response': response_text,
        'ans_list': exe_ans_list or ['not found'],
        'true_answer': (exe_ans_list or ['not found'])[-1],
        **scores,
        'model': model,
        **response.usage.model_dump(),
    }
    res.append(record)

    # prototype run: stop after the first four rows (i = 0..3)
    if i >= 3:
        save_results(res)
        break

row 0: {'question_0': 'what was the percentage change in the net cash from operating activities from 2008 to 2009'}
Answer the following questions using the context provided:
context: <table class='wikitable'><tr><td>1</td><td>2008</td><td>year ended june 30 2009 2008</td><td>year ended june 30 2009 2008</td><td>year ended june 30 2009</td></tr><tr><td>2</td><td>net income</td><td>$ 103102</td><td>$ 104222</td><td>$ 104681</td></tr><tr><td>3</td><td>non-cash expenses</td><td>74397</td><td>70420</td><td>56348</td></tr><tr><td>4</td><td>change in receivables</td><td>21214</td><td>-2913 ( 2913 )</td><td>-28853 ( 28853 )</td></tr><tr><td>5</td><td>change in deferred revenue</td><td>21943</td><td>5100</td><td>24576</td></tr><tr><td>6</td><td>change in other assets and liabilities</td><td>-14068 ( 14068 )</td><td>4172</td><td>17495</td></tr><tr><td>7</td><td>net cash from operating activities</td><td>$ 206588</td><td>$ 181001</td><td>$ 174247</td></tr></table>
 26 | 2009 annual report in fis
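In this free-text variant the model returns prose, so `get_eval` first has to pull a number out of `response_text` before any numeric comparison with `exe_ans_list`.

The sketch below is a crude heuristic, and `last_number` is a hypothetical name. It takes the last number in the response, strips `$` and thousands separators, and converts percentages to decimals to match the `exe_ans` scale. It picks the wrong value whenever other figures (e.g. a trailing year) follow the answer, which is one argument for the structured `AnswerList` approach.

```python
import re
from typing import Optional

# optional minus, optional $, digits with thousands separators, optional decimals, optional %
_NUMBER = re.compile(r"-?\$?\s?\d[\d,]*(?:\.\d+)?\s?%?")


def last_number(text: str) -> Optional[float]:
    """Return the last number in a free-text answer, converting 'x%' to x / 100."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    token = matches[-1].replace("$", "").replace(",", "").replace(" ", "")
    if token.endswith("%"):
        return float(token[:-1]) / 100
    return float(token)


print(last_number("The percentage change was 14.1%."))
```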