# Tomoro assessment

## Intro

I outline how I plan to tackle this assessment. At a high level, I first want to understand what the paper is trying to achieve. Once I have a basic understanding of what the authors are doing, I want to familiarise myself with the data. From there, a basic model can act as our benchmark, which we can then enhance to reach an improved model.

Attack plan:
1. Read paper
2. Investigate what data we are working with
3. Think about what modelling is relevant
4. Start writing utils
    - in parallel, look at the paper and deep-dive into data examples
5. Basic benchmarking
6. Improvements
7. Validation

## Preliminary paper overview

First pass of reading the paper. I present my notes below, for reference.

The authors create a new dataset, `ConvFinQA`
- Conversational question/answer type dataset
- Questions require multi-step reasoning, specifically within the context of maths/finance
- Specialised-domain rather than general-domain reasoning
- Answering questions requires context, which is given as either text or tables

Multi-hop reasoning is a difficult task to model
- These questions require multiple operations before you reach the answer
    - Example: what is the % increase of variable X w.r.t. its value last year?
        - you need the value of X for both the current year and last year, then calculate the % change
- Two types of questions
    - Simple: single multi-hop question (a single isolated question that requires multiple operations to answer)
    - Hybrid: multiple multi-hop questions (require cross-question reasoning/dependencies)
- Different math operations
    - Add, subtract, divide, multiply
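
To make the multi-hop idea concrete, the sketch below breaks the percentage-change example into its individual hops. The function name and the values for X are made up for illustration and are not taken from the dataset.

```python
def percentage_change(current: float, previous: float) -> float:
    """Answer 'what is the % increase of X w.r.t. last year?' in explicit hops."""
    difference = current - previous  # hop 1: subtract last year's value
    ratio = difference / previous    # hop 2: divide by last year's value
    return ratio * 100               # express the ratio as a percentage


# Hypothetical values for variable X; in practice these are retrieved from the context.
x_by_year = {"this_year": 125.0, "last_year": 100.0}
print(percentage_change(x_by_year["this_year"], x_by_year["last_year"]))  # 25.0
```

Each hop on its own is trivial. The difficulty for a model lies in finding the right two numbers and chaining the operations in the right order.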

How modelling is approached
- Neural symbolic approach:
    - Combines a retriever that finds the relevant facts/context with a generator that uses question + context to decode the "reasoning program"
    - Reasoning program: they define it as a collection of operations, each of the form `op(arg1, arg2)`
        - think of this as a functional expression of the reasoning required to solve the question
- Generative, GPT-like approach:
    - Relies on the context given during prompting (the "gold supporting facts")
        - Bad context, horrible output
    - Instructs the output to be like the reasoning program via examples
    - Investigates chain-of-thought
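
To get a feel for what a "reasoning program" is, here is a minimal sketch of an executor for programs written as `op(arg1, arg2)` steps. It is my own illustration, not the authors' implementation. It assumes steps are separated by commas, that `#k` refers to the result of step `k` (my reading of the paper's program format), and that only the four operations noted above are needed.

```python
import re

# The four operations listed in these notes; the paper's full operation set may be larger.
OPERATIONS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# Matches one step such as "subtract(125, 100)" or "divide(#0, 100)".
STEP_PATTERN = re.compile(r"(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")


def execute_program(program: str) -> float:
    """Run the steps left to right and return the result of the last one."""
    results = []

    def resolve(token: str) -> float:
        # "#k" is assumed to point at the result of step k; anything else is a number.
        if token.startswith("#"):
            return results[int(token[1:])]
        return float(token)

    for op_name, arg1, arg2 in STEP_PATTERN.findall(program):
        results.append(OPERATIONS[op_name](resolve(arg1), resolve(arg2)))
    if not results:
        raise ValueError(f"no executable steps found in {program!r}")
    return results[-1]


# Hypothetical program: the relative change from 100 to 125.
print(execute_program("subtract(125, 100), divide(#0, 100)"))  # 0.25
```

Having an executor like this means a model only needs to produce the program text, and the arithmetic itself is done deterministically.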

Key learnings:
- The model struggles with long reasoning chains.
- The model excels at number selection questions.
- The model suffers from the lack of domain knowledge.
- GPT-3 can do simple calculations by itself.
- GPT-3 performs better for its familiar program format.
- GPT-3 struggles with new complex task paradigms.

## Preliminary data overview

In this section, I describe what I've learned through a preliminary data exploration. 


*Note: Data exploration can be messy. For this reason, I keep the actual work in a separate notebook and only report what I've learned here. For further detail, please look at `scripts/data_exploration.ipynb`.*

```python
import pandas as pd

# Load the training split and inspect the first entry
raw_data = pd.read_json('../data/train.json')
raw_data.iloc[0]
```

```
pre_text      [26 | 2009 annual report in fiscal 2008 , reve...
post_text     [year ended june 30 , cash provided by operati...
filename                                  JKHY/2009/page_28.pdf
table_ori     [[, Year ended June 30, 2009], [2008, 2007], [...
table         [[2008, year ended june 30 2009 2008, year end...
qa            {'question': 'what was the percentage change i...
id                               Single_JKHY/2009/page_28.pdf-3
annotation    {'amt_table': '<table class='wikitable'><tr><t...
qa_0                                                        NaN
qa_1                                                        NaN
Name: 0, dtype: object
```

Must-knows:
1. Questions are stored under `.qa` for simple questions (single multi-hop) or under `.qa_0` and `.qa_1` for hybrid questions (cross-question dependencies)
2. Each entry has associated `.table`, `.pre_text` and `.post_text` attributes that are used to construct our "context", which is relevant during prompting
    - Basically, this is a concatenation of `pre_text` + `table` + `post_text`
3. Each entry also has an `.annotation` attribute, which contains most of the information about the conversation
    - Different information is captured depending on whether the entry is a simple or a hybrid question
    - Importantly, it contains the `.dialogue_break` attribute, which splits the question(s) into simpler sub-questions that lead to the answer
    - It contains `.qa_split`, which tells you which info corresponds to which question in the case of hybrid questions
        - for example, `qa_split = [0, 0, 0, 1, 1]` means that the first three elements of the per-turn lists (e.g. `exe_ans_list`) correspond to `.qa_0` and the last two correspond to `.qa_1`
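
The field layout above can be sketched in a few lines. Both helper names are hypothetical (they are not necessarily what lives in the utils folder), and the example entry is made up to mirror the fields described above rather than copied from the dataset.

```python
def get_questions(entry: dict) -> list:
    """Return the question(s) of an entry: `qa` for simple, `qa_0`/`qa_1` for hybrid."""
    qa_fields = [entry.get(key) for key in ("qa", "qa_0", "qa_1")]
    # Unused fields may be absent, None, or NaN after loading with pandas.
    return [qa["question"] for qa in qa_fields if isinstance(qa, dict)]


def group_by_qa_split(values: list, qa_split: list) -> dict:
    """Group a per-turn list by the question index given in `qa_split`."""
    grouped = {}
    for value, question_index in zip(values, qa_split):
        grouped.setdefault(question_index, []).append(value)
    return grouped


# Hypothetical hybrid entry, shaped like the fields described above.
entry = {
    "qa": float("nan"),
    "qa_0": {"question": "first question?"},
    "qa_1": {"question": "second question?"},
    "annotation": {
        "dialogue_break": ["turn a", "turn b", "turn c", "turn d", "turn e"],
        "qa_split": [0, 0, 0, 1, 1],
    },
}
annotation = entry["annotation"]
print(get_questions(entry))
print(group_by_qa_split(annotation["dialogue_break"], annotation["qa_split"]))
```

With the DataFrame loaded earlier, a row can be passed in as `get_questions(raw_data.iloc[0].to_dict())`.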

```python
example = raw_data.iloc[0]  # first training entry

print('Short example:\n')
print(f'    .qa.question:\n     {example.qa.get("question")}\n')
print(f'    .annotation.dialogue_break:\n     {example.annotation.get("dialogue_break")}\n')
print(f'    table:\n     {example.table}\n')
print(f'    pre_text (excerpt):\n     {example.pre_text[0][:100]}...\n')
print(f'    post_text (excerpt):\n     {example.post_text[0][:100]}...\n')
```

```
Short example:

    .qa.question:
     what was the percentage change in the net cash from operating activities from 2008 to 2009

    .annotation.dialogue_break:
     ['what is the net cash from operating activities in 2009?', 'what about in 2008?', 'what is the difference?', 'what percentage change does this represent?']

    table:
     [['2008', 'year ended june 30 2009 2008', 'year ended june 30 2009 2008', 'year ended june 30 2009'], ['net income', '$ 103102', '$ 104222', '$ 104681'], ['non-cash expenses', '74397', '70420', '56348'], ['change in receivables', '21214', '-2913 ( 2913 )', '-28853 ( 28853 )'], ['change in deferred revenue', '21943', '5100', '24576'], ['change in other assets and liabilities', '-14068 ( 14068 )', '4172', '17495'], ['net cash from operating activities', '$ 206588', '$ 181001', '$ 174247']]

    pre_text (excerpt):
     26 | 2009 annual report in fiscal 2008 , revenues in the credit union systems and services business ...

    post_text (excerpt):

```

Important to note:
1. I've already found some bad examples. The data isn't perfect!
    - Cases where the maths operations don't seem to be correct
    - Some column names are repeated, so you can't tell which numbers belong to which column
2. We need to convert the table into something parsable by the LLM
    - Sometimes column names are supplied, sometimes they are not
    - I've spent some time thinking about it, but there doesn't appear to be a global solution for this
    - You either assume that the first entries are always column names or you don't
    - Another option is to check the similarity of the first entries against the rest of the table (in an attempt to tell whether it's a column name or a cell value), but this seems like a very computationally expensive operation
3. The `pre_text` and `post_text` can be very noisy/dirty
    - Not always needed, but you can't know this a priori
    - There's the option of using `.gold_inds`, but this feels like cheating as it's the perfect retrieval
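
As one possible middle ground for the table problem, the sketch below uses a cheap header heuristic instead of the similarity check mentioned above. It treats the first row as column names only if none of its cells after the first column looks numeric. This is an assumption of mine that will misfire on some tables, and all function names are hypothetical. It then linearises each row into text and concatenates `pre_text` + `table` + `post_text` into a context string.

```python
def looks_numeric(cell: str) -> bool:
    """True if the first token of the cell parses as a number, e.g. '$ 103102'."""
    tokens = cell.replace("$", "").replace(",", "").replace("%", "").split()
    if not tokens:
        return False
    try:
        float(tokens[0])
        return True
    except ValueError:
        return False


def has_header_row(table: list) -> bool:
    """Heuristic: the first row is a header if no cell after its first column is numeric."""
    return bool(table) and not any(looks_numeric(cell) for cell in table[0][1:])


def table_to_text(table: list) -> str:
    """Linearise a table into one 'row label: column = value; ...' line per row."""
    if not table:
        return ""
    if has_header_row(table):
        header, rows = table[0], table[1:]
    else:
        # No usable column names: fall back to positional placeholders.
        header = [""] + [f"column {i}" for i in range(1, len(table[0]))]
        rows = table
    lines = []
    for row in rows:
        cells = [f"{name} = {value}" for name, value in zip(header[1:], row[1:])]
        lines.append(f"{row[0]}: " + "; ".join(cells))
    return "\n".join(lines)


def build_context(pre_text: list, table: list, post_text: list) -> str:
    """Concatenate pre_text + table + post_text, skipping empty parts."""
    parts = [" ".join(pre_text), table_to_text(table), " ".join(post_text)]
    return "\n\n".join(part for part in parts if part)


# First two rows of the table from the short example above.
table = [
    ["2008", "year ended june 30 2009 2008", "year ended june 30 2009 2008", "year ended june 30 2009"],
    ["net income", "$ 103102", "$ 104222", "$ 104681"],
]
print(build_context(["text before the table"], table, ["text after the table"]))
```

Running this on the example makes the repeated-column problem visible: two of the three values end up labelled with the same column name, so the linearised text alone cannot tell them apart.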

## Utils and functions

The relevant `utils`, which at a high level include the data preprocessing infra, the model infra, and evaluator functions, can be found in the `utils` folder.

Please look at each file for an explanation of its contents. This report will not go through them in detail unless relevant or interesting.

## Basic benchmarking

I want to write the basic functionality to parse all this information coherently, and then run a few benchmarks.
I'm thinking I'll take the API approach, as my computer cannot run much...