### ConvFinQA project
Some points on the experiment setup:
* I focused most of my attention on llama3 and tuned the prompt slightly based on the output I got from it.
* Evaluation across all models was done only on the first 10 examples, due to a combination of long run time (25 min for the smaller models) and an edge-case error from Pydantic inside Ragas that I haven't had time to debug.
* I used the same embedding model, nomic-embed-text, throughout for simplicity.
* I evaluated with Ragas as well as by eye, to get an understanding of how the models perform and how the evaluation is done.
* I chose to evaluate with llama3.2 for simplicity. An evaluator model could report higher metrics for responses coming from its own model family.
* I tried to evaluate five open-source LLMs served on my local machine: llama3, mistral-nemo, codegemma, qwen2-math, and command-r. I couldn't run command-r because there was not enough memory.


#### Evaluate llama3 RAG by eye and LLM metrics
Architecture and design choices:
* I ended up using retrieval, thinking I could probably increase accuracy by keeping only the relevant information (which then appears at the top by default). This is based on Cohere's finding that even ranking the retrieved documents by relevance can increase accuracy.
* I used an LLM to first extract facts from the table before integrating them with the pre- and post-text.
* I used Ragas for evaluation. The metrics include LLMContextRecall, Faithfulness, LLMContextPrecisionWithReference, and SemanticSimilarity. All except SemanticSimilarity are LLM-based, which can introduce additional uncertainty when interpreting the results, as the examples below show. I chose these metrics because I wanted to look at the accuracy of the result (Faithfulness) and the effectiveness of the context retrieval the answer is based on (LLMContextRecall, LLMContextPrecisionWithReference, SemanticSimilarity).
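
To make the "relevant information on top" idea concrete, here is a minimal sketch of ranking text chunks by cosine similarity to the question and keeping the top few. It is not the code in `rag.py`: the function names `cosine` and `rank_chunks` are hypothetical, and the tiny two-dimensional vectors are made-up stand-ins for real nomic-embed-text embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_chunks(query_vec, chunks, top_k=2):
    """Return the texts of the top_k chunks, most similar to the query first.

    `chunks` is a list of (text, embedding) pairs.
    """
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]

# Toy embeddings (assumed values, for illustration only).
query = [1.0, 0.0]
chunks = [
    ("warranty liability text", [0.1, 0.9]),
    ("net cash from operating activities row", [0.9, 0.1]),
    ("cash provided by operations sentence", [0.7, 0.3]),
]
top = rank_chunks(query, chunks, top_k=2)
print(top)
```

With a small `top_k`, only the closest chunks reach the prompt, and they arrive ordered by relevance.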



#### Observations during experiment and conclusions:
* Enforcing JSON output could harm results (for llama3). Letting the LLM explain its steps and actions could help it find the right answer, especially when multiple steps are needed.
* The random seed affects math and reasoning more than the retrieval of relevant information.
* There were some problems evaluating the results with Ragas using models that are not the same as the one used for QA (?). This might need further adjustments.
* Faithfulness is not accurate with llama3 as the evaluator.
* When an LLM component is introduced into the chain in addition to the question answering, additional unpredictability is introduced. This showed up when assessing 10 examples vs. 1: when the component extracting facts from the table happened not to perform as intended, it hurt the result.
* The relevance of the retrieved information is hugely important for getting the answer correct. In this case, given the limited time, the table transformation remained brittle, so the relevant information was often not retrieved correctly.
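
One possible way to let the model reason freely and still recover a structured answer is to parse its output leniently afterwards. The sketch below is an assumption, not the parser used in `rag.py`: `extract_answer` is a hypothetical helper that first tries strict JSON (optionally wrapped in a markdown fence) and then falls back to a regex for an `"answer"` field, which also copes with invalid JSON such as an unquoted `8.3%`.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
ANSWER_RE = re.compile(r'"answer"\s*:\s*"?(-?\d+(?:\.\d+)?%?)')

def extract_answer(raw: str) -> Optional[str]:
    """Best-effort extraction of the answer from free-form model output."""
    fenced = FENCE_RE.search(raw)
    candidate = fenced.group(1) if fenced else raw
    try:
        parsed = json.loads(candidate)
        if isinstance(parsed, dict) and "answer" in parsed:
            return str(parsed["answer"])
    except json.JSONDecodeError:
        pass  # not valid JSON, fall through to the regex
    loose = ANSWER_RE.search(raw)
    return loose.group(1) if loose else None

valid = FENCE + 'json\n{"answer": "15.9%", "contexts": []}\n' + FENCE
invalid = FENCE + 'json\n{"answer": 8.3%,\n"contexts": []}\n' + FENCE
print(extract_answer(valid), extract_answer(invalid))
```

Returning `None` when nothing can be recovered keeps "no answer" distinguishable from a wrong answer in later evaluation.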

#### Lessons learned (what I would do differently next time):
* Extracting information correctly and completely can have a huge impact on the result. But question answering can vary even with the same seed and the same context, which adds to the challenge of evaluating model results.
* Evaluations from different models vary hugely. It is therefore important to quickly find common mistakes, collect examples from the RAG system, and find a good evaluator that agrees with humans.
* Quickly test different models with minimal fine-tuning to select QA LLM candidates.
* Without degrading performance, introduce as few LLM components as possible to reduce the brittleness of the system.
* Use tools for maths and logic instead of relying on the models, or explore generating code for the reasoning and calculation steps, for accuracy.
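
As an illustration of the last point, the sketch below evaluates a nested program string like the dataset's `program_re` field with a small whitelist interpreter, so the arithmetic never depends on the model. Everything here is an assumption made for the example: `run_program` and `OPS` are hypothetical names, only four operations are supported, and dataset-specific tokens such as named constants are not handled.

```python
import ast
import operator

# Whitelisted operations; anything else is rejected.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}

def _eval(node):
    """Recursively evaluate numeric literals and whitelisted two-argument calls."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in OPS
        and len(node.args) == 2
        and not node.keywords
    ):
        left, right = (_eval(arg) for arg in node.args)
        return OPS[node.func.id](left, right)
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

def run_program(program: str) -> float:
    """Parse the program as a Python expression and evaluate it safely."""
    return _eval(ast.parse(program, mode="eval").body)

result = run_program("divide(subtract(206588, 181001), 181001)")
print(round(result, 5))
```

The model would then only need to produce the program, which is easier to check than free-form arithmetic.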





##### One example

In [1]:
# load and prep data
import json
import pandas as pd
from utils import split_questions
from eval import data_to_samples, eval
from rag import FinRAG
import os
LLM_MODELS = json.loads(os.getenv("SUPPORTED_LLMS"))
eval_llm = "llama3.2"
data_path = "data/train.json"
with open(data_path, "r") as f:
    data = json.load(f) 
data = split_questions(data)
# only looking at the first 10 examples because of long run time!
data = data[:10]



In [None]:
# the random seed can affect both retrieval and maths, but it seems to affect maths more
from rag import FinRAG
for seed in [0, 42, 1337]:
    finrag = FinRAG(llm_model="llama3", context=data[0], seed=seed)
    result = finrag.qa(data[0]['qa']['question'])
    print(f"Seed {seed} result: {result}")

Seed 0 result: {'answer': '14.1%', 'contexts': ['The net cash from operating activities increased from $181,001 in 2008 to $206,588 in 2009.', 'Cash provided by operations increased $25587 to $206588 for the fiscal year ended June 30, 2009 as compared to $181001 for the fiscal year ended June 30, 2008.']}
Seed 42 result: {'answer': '14.2%', 'contexts': ['The net cash from operating activities increased from $181,001 in 2008 to $206,588 in 2009.', 'The increase is $25,587.']}
Seed 1337 result: {'answer': '14.1%', 'contexts': ['The net cash from operating activities increased from $181,001 in 2008 to $206,588 in 2009.', 'The increase is $25,587.']}
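
A quick way to quantify this seed sensitivity is to tally the answers across seeds. The snippet below is only a sketch: the answers are copied by hand from the three runs above, and `agreement` is a hypothetical helper, not part of `rag.py` or `eval.py`.

```python
from collections import Counter

# Answers copied from the three seeded runs above.
answers_by_seed = {0: "14.1%", 42: "14.2%", 1337: "14.1%"}

def agreement(answers):
    """Return the most common answer and the fraction of runs that produced it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

majority, rate = agreement(list(answers_by_seed.values()))
print(f"majority answer: {majority}, agreement: {rate:.2f}")
```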


In [3]:
# ground truth
data[0]['qa']

{'question': 'what was the percentage change in the net cash from operating activities from 2008 to 2009',
 'answer': '14.1%',
 'explanation': '',
 'ann_table_rows': [6],
 'ann_text_rows': [],
 'steps': [{'op': 'minus2-1',
   'arg1': '206588',
   'arg2': '181001',
   'res': '25587'},
  {'op': 'divide2-2', 'arg1': '#0', 'arg2': '181001', 'res': '14.1%'}],
 'program': 'subtract(206588, 181001), divide(#0, 181001)',
 'gold_inds': {'table_6': '2008 the net cash from operating activities of year ended june 30 2009 2008 is $ 206588 ; the net cash from operating activities of year ended june 30 2009 2008 is $ 181001 ; the net cash from operating activities of year ended june 30 2009 is $ 174247 ;'},
 'exe_ans': 0.14136,
 'program_re': 'divide(subtract(206588, 181001), 181001)'}

In [None]:
# evaluating the system using one example, correct answer, good context
eval_sample = data_to_samples([data[0]], 'llama3', seed=0)

In [None]:
eval_llms = ["llama3", "llama3.2", "mistral-nemo", 'codegemma', 'qwen2-math']
eval_results = [eval(eval_sample, eval_llm) for eval_llm in eval_llms]
import pandas as pd
result = pd.concat(eval_results, axis=0)
result['source'] = eval_llms
result.set_index('source', inplace=True)
result.to_csv("eval_results/data0-results.csv")

Evaluating: 100%|██████████| 4/4 [00:37<00:00,  9.38s/it]
  Expected `dict[str, any]` but got `EvaluationResult` with value `{'context_recall': 0.6667...tic_similarity': 1.0000}` - serialized value may not be as expected
  Expected `dict[str, any]` but got `EvaluationResult` with value `{'context_recall': 0.6667...tic_similarity': 1.0000}` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
Evaluating:  50%|█████     | 2/4 [00:43<00:47, 23.63s/it]Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
Exception raised in Job[2]: RagasOutputParserExcept

In [13]:
# it seems a good idea to evaluate with the model that was used to generate the results
result

source,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,llm_context_precision_with_reference,semantic_similarity
llama3,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.1%,14.1%,0.666667,1.0,1.0,1.0
llama3.2,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.1%,14.1%,0.5,,,1.0
mistral-nemo,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.1%,14.1%,1.0,,,1.0
codegemma,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.1%,14.1%,1.0,,,1.0
qwen2-math,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.1%,14.1%,,,1.0,1.0


In [None]:
# Evaluate the result, wrong answer, good context
eval_sample = data_to_samples([data[0]], 'llama3', seed=42)
eval_llms = ["llama3", "llama3.2", "mistral-nemo", 'codegemma', 'qwen2-math']
eval_results = [eval(eval_sample, eval_llm) for eval_llm in eval_llms]
result = pd.concat(eval_results, axis=0)
result['source'] = eval_llms
result.set_index('source', inplace=True)
result.to_csv("eval_results/data0-wrong-results.csv")

Evaluating: 100%|██████████| 4/4 [00:42<00:00, 10.69s/it]
  Expected `dict[str, any]` but got `EvaluationResult` with value `{'context_recall': 0.6667...tic_similarity': 0.9770}` - serialized value may not be as expected
  Expected `dict[str, any]` but got `EvaluationResult` with value `{'context_recall': 0.6667...tic_similarity': 0.9770}` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
Evaluating:  50%|█████     | 2/4 [00:24<00:24, 12.45s/it]Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
Exception raised in Job[2]: RagasOutputParserExcept

In [16]:
result


source,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,llm_context_precision_with_reference,semantic_similarity
llama3,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.2%,14.1%,0.666667,1.0,1.0,0.97698
llama3.2,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.2%,14.1%,1.0,,,0.97698
mistral-nemo,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.2%,14.1%,0.0,,,0.97698
codegemma,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.2%,14.1%,1.0,,,0.97698
qwen2-math,what was the percentage change in the net cash...,[The net cash from operating activities increa...,[2008 the net cash from operating activities o...,14.2%,14.1%,,,1.0,0.97698
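
Note that semantic similarity still scores the wrong answer (14.2% vs 14.1%) at about 0.977 in the table above. A deterministic numeric check against the dataset's `exe_ans` could complement the LLM-based metrics. The sketch below is an assumption, not part of `eval.py`: `to_fraction` and `is_correct` are hypothetical helpers, and the tolerance of 0.0005 (half of the last displayed digit of a one-decimal percentage) is an arbitrary choice.

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def to_fraction(text: str) -> Optional[float]:
    """Parse the first number in `text`; values with a % sign are divided by 100."""
    match = NUMBER_RE.search(text.replace(",", ""))
    if match is None:
        return None
    value = float(match.group())
    return value / 100 if "%" in text else value

def is_correct(response: str, exe_ans: float, tol: float = 5e-4) -> bool:
    """True if the response parses to a number within `tol` of the executed answer."""
    predicted = to_fraction(response)
    return predicted is not None and abs(predicted - exe_ans) <= tol

exe_ans = 0.14136  # 'exe_ans' of data[0] shown in the ground truth above
for response in ["14.1%", "14.2%", "No answer found"]:
    print(response, is_correct(response, exe_ans))
```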


##### Ten examples with all different models


In [None]:
# Using the same model for eval
eval_examples = data_to_samples(data, "llama3", seed=0)
eval_results_df = eval(eval_examples, eval_llm="llama3", qa_llm="llama3")

In [None]:
# llama3 on ten examples
eval_results_df

index,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,llm_context_precision_with_reference,semantic_similarity
0,what was the percentage change in the net cash...,[The table shows net cash from operating activ...,[2008 the net cash from operating activities o...,25%,14.1%,,,,0.637103
1,what was the percent of the growth in the reve...,"[Revenues for 2007: [], Revenues for 2008: []]",[the revenue of year ended december 31 2008 ( ...,No answer found,1.3%,0.666667,,,0.356857
2,what was the percentage change in net sales fr...,[gross margin declined to 23% of net sales in ...,[the net sales of 2002 is $ 5742 ; the net sal...,21%,-32%,1.0,,,0.671372
3,what was the difference in percentage cumulati...,[Error],[the united parcel service inc . of 12/31/04 i...,A financial analysis question!\n\nTo answer th...,-26.16%,,,0.0,0.306561
4,what is the roi of an investment in ups in 200...,"[The years mentioned are 2004, 2005, 2006, 200...",[the united parcel service inc . of 12/31/04 i...,15.0%,-8.9%,,,,0.712236
5,what was the difference in percentage cumulati...,[The comparison is for a five-year period ende...,[the united parcel service inc . of 12/31/04 i...,11.1%,-26.16%,,,,0.723709
6,what portion of the total shares subject to ou...,[the 2009 global incentive plan enables the co...,[the 2009 global incentive plan of shares avai...,No answer found,70.1%,0.0,,,0.342488
7,what was the percentage increase in litigation...,[the prior year included expense of $ 3.2 bill...,[the current year included expense of $ 3.7 bi...,24.0%,15.6%,,,,0.678361
8,what was the percent of the change in the comp...,[changes in the company 2019's warranty liabil...,[the balance at december 31 of 2012 is $ 118 ;...,12.5%,15.7%,,,,0.721895
9,what was the percentage change in the company'...,[changes in the company 2019s warranty liabili...,[the balance at december 31 of 2012 is $ 118 ;...,No answer found,16%,0.0,,,0.312955


In [25]:
# Load the pickled samples generated by the other models and evaluate each
# set with the same model that produced it
import pickle

def load_samples(path):
    # read one pickled sample set, closing the file afterwards
    with open(path, "rb") as f:
        return pickle.load(f)

codegemma_examples = load_samples("eval_samples_codegemma.pkl")
mistral_examples = load_samples("eval_samples_mistral-nemo.pkl")
qwen2_examples = load_samples("eval_samples_qwen2-math.pkl")
codegemma_results = eval(codegemma_examples, eval_llm="codegemma", qa_llm="codegemma")
mistral_results = eval(mistral_examples, eval_llm="mistral-nemo", qa_llm="mistral-nemo")
qwen2_results = eval(qwen2_examples, eval_llm="qwen2-math", qa_llm="qwen2-math")

Evaluating:  50%|█████     | 20/40 [02:40<05:54, 17.75s/it]Exception raised in Job[1]: TimeoutError()
Evaluating:  52%|█████▎    | 21/40 [03:01<05:52, 18.56s/it]Exception raised in Job[28]: TimeoutError()
Exception raised in Job[37]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[21]: TimeoutError()
Evaluating:  57%|█████▊    | 23/40 [03:01<03:10, 11.20s/it]Exception raised in Job[13]: TimeoutError()
Evaluating:  68%|██████▊   | 27/40 [03:21<01:41,  7.84s/it]Exception raised in Job[5]: TimeoutError()
Exception raised in Job[14]: TimeoutError()
Evaluating:  72%|███████▎  | 29/40 [03:21<01:02,  5.70s/it]Exception raised in Job[6]: TimeoutError()
Exception raised in Job[32]: TimeoutError()
Evaluating:  78%|███████▊  | 31/40 [03:25<00:42,  4.74s/it]Exception raised in Job[24]: TimeoutError()
Exception raised in Job[0]: TimeoutError()
Evaluating:  82%|████████▎ | 33/40 [03:51<00:49,  7.01s/it]Exception raised in Job[33]: TimeoutError()
Evaluating:  88%|██

In [30]:
codegemma_results

index,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,llm_context_precision_with_reference,semantic_similarity
0,what was the percentage change in the net cash...,[Error],[2008 the net cash from operating activities o...,"```json\n{""answer"": 8.3%,\n""contexts"": [\n""the...",14.1%,,,0.0,0.412681
1,what was the percent of the growth in the reve...,[Error],[the revenue of year ended december 31 2008 ( ...,"```json\n{\n""answer"": 12.3%,\n""contexts"": [\n""...",1.3%,0.0,,,0.474853
2,what was the percentage change in net sales fr...,[Error],[the net sales of 2002 is $ 5742 ; the net sal...,"```json\n{""answer"": -4%, ""contexts"": [""gross m...",-32%,0.5,,0.0,0.496941
3,what was the difference in percentage cumulati...,[united parcel service inc . cumulative return...,[the united parcel service inc . of 12/31/04 i...,1.4%,-26.16%,1.0,,,0.745983
4,what is the roi of an investment in ups in 200...,[],[the united parcel service inc . of 12/31/04 i...,The provided context does not contain any info...,-8.9%,0.0,,0.0,0.384821
5,what was the difference in percentage cumulati...,[],[the united parcel service inc . of 12/31/04 i...,No answer found,-26.16%,1.0,,0.0,0.389638
6,what portion of the total shares subject to ou...,[Error],[the 2009 global incentive plan of shares avai...,"```json\n{""answer"": 58.4%, ""contexts"": [""For t...",70.1%,,,,0.491466
7,what was the percentage increase in litigation...,[],[the current year included expense of $ 3.7 bi...,No answer found,15.6%,,,0.0,0.309002
8,what was the percent of the change in the comp...,[],[the balance at december 31 of 2012 is $ 118 ;...,No answer found,15.7%,,,0.0,0.315268
9,what was the percentage change in the company'...,[],[the balance at december 31 of 2012 is $ 118 ;...,No answer found,16%,1.0,,0.0,0.312955


In [31]:
mistral_results

index,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,llm_context_precision_with_reference,semantic_similarity
0,what was the percentage change in the net cash...,[],[2008 the net cash from operating activities o...,No answer found,14.1%,0.0,,0.0,0.341602
1,what was the percent of the growth in the reve...,[Error],[the revenue of year ended december 31 2008 ( ...,"```json\n{\n ""answer"": ""15.9%"",\n ""contexts""...",1.3%,0.0,,0.0,0.547215
2,what was the percentage change in net sales fr...,[Error],[the net sales of 2002 is $ 5742 ; the net sal...,"```json\n{\n ""answer"": ""-5.3%"",\n ""contexts""...",-32%,,,0.0,0.479359
3,what was the difference in percentage cumulati...,[],[the united parcel service inc . of 12/31/04 i...,No answer found,-26.16%,,,0.0,0.389638
4,what is the roi of an investment in ups in 200...,[The cumulative total return for UPS from 2004...,[the united parcel service inc . of 12/31/04 i...,40.0%,-8.9%,,,,0.679892
5,what was the difference in percentage cumulati...,[],[the united parcel service inc . of 12/31/04 i...,No answer found,-26.16%,,,0.0,0.389638
6,what portion of the total shares subject to ou...,"[- The ""2009 global incentive plan"" has **2,53...",[the 2009 global incentive plan of shares avai...,48.3%,70.1%,,,,0.727335
7,what was the percentage increase in litigation...,[Prior year (2011) litigation reserves: $3.2 b...,[the current year included expense of $ 3.7 bi...,14.8%,15.6%,,,,0.861376
8,what was the percent of the change in the comp...,[],[the balance at december 31 of 2012 is $ 118 ;...,No answer found,15.7%,,,0.0,0.315268
9,what was the percentage change in the company'...,[],[the balance at december 31 of 2012 is $ 118 ;...,No answer found,16%,,,0.0,0.312955


In [32]:
qwen2_results

index,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,llm_context_precision_with_reference,semantic_similarity
0,what was the percentage change in the net cash...,[The net cash from operating activities for 20...,[2008 the net cash from operating activities o...,15.3%,14.1%,,,,0.861487
1,what was the percent of the growth in the reve...,"[The revenue for the year 2007 was $100,000 an...",[the revenue of year ended december 31 2008 ( ...,15.3%,1.3%,,,,0.813926
2,what was the percentage change in net sales fr...,[gross margin declined to 23% ( 23 % ) of net ...,[the net sales of 2002 is $ 5742 ; the net sal...,No answer found,-32%,,,0.0,0.310607
3,what was the difference in percentage cumulati...,[The table shows the five-year cumulative retu...,[the united parcel service inc . of 12/31/04 i...,8.4%,-26.16%,,,1.0,0.725909
4,what is the roi of an investment in ups in 200...,[The ups stock price in 2004 was $18.50 per sh...,[the united parcel service inc . of 12/31/04 i...,13.7%,-8.9%,,,,0.741482
5,what was the difference in percentage cumulati...,[The table shows the five-year cumulative retu...,[the united parcel service inc . of 12/31/04 i...,8.4%,-26.16%,,,1.0,0.725909
6,what portion of the total shares subject to ou...,"[2009 global incentive plan has 2,530,454 shar...",[the 2009 global incentive plan of shares avai...,0.68,70.1%,,,,0.587002
7,what was the percentage increase in litigation...,[the prior year included expense of $ 3.2 bill...,[the current year included expense of $ 3.7 bi...,No answer found,15.6%,,,0.0,0.309002
8,what was the percent of the change in the comp...,[changes in the company 2019s warranty liabili...,[the balance at december 31 of 2012 is $ 118 ;...,No answer found,15.7%,,,,0.315268
9,what was the percentage change in the company'...,[Error],[the balance at december 31 of 2012 is $ 118 ;...,"\n```{""answer"": ""35.7%\n prediction: No answer...",16%,,,0.0,0.494957


#### Why the huge difference in results between runs?
* A mistake in the table summary can have a big impact on the result. Take data[0], for example: failing to extract the facts correctly from the table made the final results unreliable.
* Even with the correct information extracted from the table (data[1]), the generation can still make quite obvious mistakes, even with clear instructions and despite a previous test run returning the correct answer!

In [2]:
for i in range(5):
    finrag = FinRAG(llm_model="llama3", context=data[0], seed=0)
    result = finrag.qa(data[0]["qa"]["question"])
    print(f"Result is {result}")
    print("=" * 70)

Table summary: Here are the concise facts extracted from the table:

* The net income for year ended June 30, 2009 was $104,222.
* Non-cash expenses in 2008 were $74,397.
* In 2008, there was a change in receivables of $21,214.
* In 2008, there was a change in deferred revenue of $21,943.
* In 2008, there was a change in other assets and liabilities of -$14,068.
* Net cash from operating activities for year ended June 30, 2009 was $206,588.

Let me know if you'd like me to extract any additional information!
----------------------------------------------------------------------
Result before parsing is Step 1: List facts from the context that can help us find out the answer:

* The net cash from operating activities increased from $181,001 in 2008 to $206,588 in 2009.
* The percentage change in cash provided by operations from 2008 to 2009 is not explicitly stated.

Step 2: Reasoning or calculation:

To calculate the percentage change, we need to find the difference between the two val

In [5]:
for i in range(5):
    finrag = FinRAG(llm_model="llama3", context=data[1], seed=0)
    result = finrag.qa(data[1]["qa"]["question"])
    print(f"Result is {result}")
    print("=" * 70)

Table summary: Here are the concise facts extracted from the table:

* Revenue for the year ended December 31, 2008 (unaudited) was $9,362.2.
* Revenue for the year ended December 31, 2007 (unaudited) was $9,244.9.
* Income from continuing operations available to common stockholders for the year ended December 31, 2008 (unaudited) was $285.7.
* Income from continuing operations available to common stockholders for the year ended December 31, 2007 (unaudited) was $423.2.
* Basic earnings per share for the year ended December 31, 2008 (unaudited) were $.76.
* Basic earnings per share for the year ended December 31, 2007 (unaudited) were $1.10.
* Diluted earnings per share for the year ended December 31, 2008 (unaudited) were $.75.
* Diluted earnings per share for the year ended December 31, 2007 (unaudited) were $1.09.
----------------------------------------------------------------------
Result before parsing is I'd be happy to help!

Here's my analysis:

**Step 1: List facts from the c