# Automated evaluation of FinanceBench using ROUGE-SU

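Few automated options currently exist for evaluating LLM answers. This notebook scores answers to a 150-question FinanceBench sample with ROUGE, comparing base GPT-4 Turbo against our RAG-enhanced API.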


**Insert more text here**
- Goals
- Approach
- Outline

## Preprocessing
We use NLTK to remove stop words and tokenize each text into a `List[str]`.

In [24]:
import pandas as pd
from openai import OpenAI
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from functools import partial
import rouge_metric
from tqdm import tqdm
import time


# Data processing
df = pd.read_csv("financebench_sample_150.csv")[["question", "answer"]]
# A set gives constant-time membership tests (NLTK stop word and tokenizer data must be downloaded first)
stop_words = set(stopwords.words('english'))

def preprocess(sequence: str) -> list:
    """Tokenize a string and drop English stop words."""
    tokens = [w for w in word_tokenize(sequence) if w.lower() not in stop_words]
    # Optional punctuation stripping (requires `import string`):
    # tokens = [''.join(c for c in s if c not in string.punctuation) for s in tokens]
    return list(filter(None, tokens))

# df["answer"] = df["answer"].apply(preprocess)
# df.head()
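
To show what this kind of preprocessing returns without downloading NLTK data, here is a dependency-free sketch. `DEMO_STOP_WORDS` and `demo_preprocess` are hypothetical stand-ins (a tiny hard-coded stop word set and a regex tokenizer), not the NLTK behavior used in the cell above, which also emits punctuation tokens.

```python
import re
from typing import List

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"what", "is", "the", "of", "for", "in", "a", "an"}

def demo_preprocess(sequence: str) -> List[str]:
    """Split on runs of word characters and drop stop words (case-insensitive)."""
    tokens = re.findall(r"\w+", sequence)
    return [t for t in tokens if t.lower() not in DEMO_STOP_WORDS]

tokens = demo_preprocess("What is the FY2018 capital expenditure of 3M?")
print(tokens)
```

The result is a flat list of content-bearing tokens, which is the shape the `List[str]` remapping is meant to produce.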

In [24]:
print(custom_completion(df['question'][1]))

I currently do not have access to specific financial data for 3M for the year end FY2018, including the net PP&E (Property, Plant, and Equipment). To obtain this information, please refer to 3M's official financial statements for FY2018, which can typically be found in their annual report or on their investor relations website.


In [16]:
#OpenAI client, default GPT4
default_client = OpenAI()
custom_client = OpenAI(api_key="NONE", base_url="https://finsearch-l35slsdlnq-uc.a.run.app")
def gpt_wrap_message(client: OpenAI, query: str):
    return client.chat.completions.create(
        model = "gpt-4-turbo",
        messages = [
            {"role": "user", "content": query}
        ]
        ).choices[0].message.content

def customgpt_wrap_message(client: OpenAI, query: str):
    return client.chat.completions.create(
        model = "gpt-4-turbo",
        messages = [
            {"role": "user", "content": query}
        ]
        ).choices[0].content


#Check the progress meter for each of our df.apply() methods
tqdm.pandas()

#Fix argument 1 with our OpenAI clients
default_completion = partial(gpt_wrap_message, default_client)
custom_completion = partial(customgpt_wrap_message, custom_client)

#df['gpt_answers'] = df['question'].progress_apply(default_completion)
#df['custom_answers'] = df['question'].progress_apply(custom_completion)

In [27]:
df.to_pickle("./financebench_answers.pkl")

### Start Here if data is loaded from pickle

In [3]:
new_df = pd.read_pickle('./financebench_answers.pkl')

In [18]:
custom_results = []
for i in tqdm(range(len(new_df))):
    # Index new_df (not df) so this cell works when starting from the pickle
    custom_results.append(custom_completion(new_df['question'][i]))

100%|██████████| 150/150 [20:14<00:00,  8.10s/it]
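
The run above took about 20 minutes for 150 questions, so one transient API error part-way through can be costly. A minimal sketch of a retry wrapper with exponential backoff follows. `complete_with_retry` and its parameters are hypothetical names, not part of the notebook; the wrapper takes any `str -> str` callable, such as `custom_completion`.

```python
import time
from typing import Callable, Optional

def complete_with_retry(
    complete: Callable[[str], str],
    question: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call complete(question), retrying on any exception with exponential backoff.

    Returns None when every attempt fails, so results stay aligned with the rows.
    """
    for attempt in range(max_attempts):
        try:
            return complete(question)
        except Exception:
            if attempt == max_attempts - 1:
                return None
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
    return None

# Demo with a fake completion function that fails twice, then succeeds.
calls = {"n": 0}

def flaky(question: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return f"answer to: {question}"

delays = []  # record the delays instead of actually sleeping
result = complete_with_retry(flaky, "q1", sleep=delays.append)
print(result, delays)
```

In the loop above, this would be used as `complete_with_retry(custom_completion, question)`. Catching every `Exception` is a simplification; narrowing it to the client's timeout and rate-limit errors may be preferable.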


In [24]:
new_df2 = new_df.assign(custom_answers=custom_results)
new_df2.to_pickle("./financebench_final.pkl")

In [25]:
third_df = pd.read_pickle('./financebench_final.pkl')

In [30]:
third_df['custom_answers'][0]

'The FY2018 capital expenditure amount for 3M in USD millions was 1,550 million. This figure is derived from the snippet stating that 3M reported a capital expenditure of 1.55 billion in the previous year. Since 1 billion equals 1,000 million, the conversion from 1.55 billion to million is 1,550 million.'

### Running Metrics

In [29]:
third_df = pd.read_pickle('./financebench_df.pkl')
third_df.columns

Index(['question', 'answer', 'gpt_answers', 'custom_answers'], dtype='object')

In [43]:
# PerlRouge (not PyRouge) is the class instantiated below
from rouge_metric import PerlRouge

rouge = PerlRouge(rouge_n_max=3, rouge_l=True, rouge_w=True,
    rouge_w_weight=1.2, rouge_s=True, rouge_su=True, skip_gap=4)
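
The `rouge_s`, `rouge_su`, and `skip_gap` options enable skip-bigram scoring. As a rough illustration of the idea, the sketch below counts unigrams plus ordered word pairs separated by at most `max_gap` words, then computes clipped-overlap precision, recall, and F1. It is a simplified sketch under stated assumptions (whitespace tokens, no stemming, one reference, and the gap interpreted as the number of skipped words), so it is not expected to reproduce the package's numbers exactly. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

def skip_bigrams_with_unigrams(tokens: List[str], max_gap: int = 4) -> Counter:
    """Count unigrams plus ordered pairs with at most `max_gap` words between them."""
    units: Counter = Counter()
    for i, left in enumerate(tokens):
        units[(left,)] += 1  # unigram counting unit (the "U" in SU)
        # j - i - 1 is the number of words skipped between the pair
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            units[(left, tokens[j])] += 1
    return units

def rouge_su(hypothesis: List[str], reference: List[str], max_gap: int = 4) -> Dict[str, float]:
    """Precision, recall, and F1 over the clipped overlap of counting units."""
    hyp = skip_bigrams_with_unigrams(hypothesis, max_gap)
    ref = skip_bigrams_with_unigrams(reference, max_gap)
    overlap = sum((hyp & ref).values())  # Counter intersection clips repeated units
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

reference = ["capex", "was", "1577", "million"]
hypothesis = ["capex", "was", "1550", "million"]
scores = rouge_su(hypothesis, reference)
print(scores)
```

In this toy pair, one wrong number removes the unigram and every pair that contains it, so 6 of the 10 counting units match on each side.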

In [38]:
real_answers = third_df['answer'].to_list()
# Wrap each generated answer in its own single-element list
gpt_answers = [[x] for x in third_df['gpt_answers'].to_list()]
custom_answers = [[x] for x in third_df['custom_answers'].to_list()]

In [39]:
third_df['answer'].head()

0                                             $1577.00
1                                                $8.70
2    No, the company is managing its CAPEX and Fixe...
3    Operating Margin for 3M in FY2022 has decrease...
4     The consumer segment shrunk by 0.9% organically.
Name: answer, dtype: object

#### GPT-4 Turbo Base vs. Our RAG-enhanced API

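ROUGE is a package for automatic evaluation of summaries. In the results below, our RAG engine scores higher than base GPT-4 Turbo on the reported `r` and `f` values for every ROUGE variant shown, while GPT-4 Turbo has a slightly higher reported `p` on ROUGE-1, ROUGE-L, and ROUGE-W. Alternatively, we can use GPT-4 to evaluate our answer outputs via OpenAI function calling.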
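
As a sketch of the function-calling alternative, the snippet below forces the model to reply through a tool so the grade comes back as structured JSON. The tool name `record_grade`, its fields, and the prompt wording are illustrative assumptions, not something this notebook has run. `client` is expected to be an `OpenAI()` instance such as `default_client`.

```python
import json
from typing import Any, Dict

# Hypothetical tool schema: the name and fields are assumptions for illustration.
GRADE_TOOL = {
    "type": "function",
    "function": {
        "name": "record_grade",
        "description": "Record whether the candidate answer agrees with the gold answer.",
        "parameters": {
            "type": "object",
            "properties": {
                "correct": {"type": "boolean"},
                "reason": {"type": "string"},
            },
            "required": ["correct", "reason"],
        },
    },
}

def grade_answer(client: Any, question: str, gold: str, candidate: str,
                 model: str = "gpt-4-turbo") -> Dict[str, Any]:
    """Ask the model to grade `candidate` against `gold`; return the parsed tool arguments."""
    prompt = (
        f"Question: {question}\nGold answer: {gold}\nCandidate answer: {candidate}\n"
        "Decide whether the candidate agrees with the gold answer."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=[GRADE_TOOL],
        # Force a call to the tool so the reply is structured rather than free text
        tool_choice={"type": "function", "function": {"name": "record_grade"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)  # arguments arrive as a JSON string
```

Applied row by row, for example `grade_answer(default_client, row.question, row.answer, row.custom_answers)`, this would yield a boolean per question that can be averaged into an accuracy figure.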

In [44]:
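#TODO: Real answers are in the correct format, so fix GPT answers and Custom Answers
# Note: evaluate() takes (hypotheses, references). The gold answers are passed first here,
# so the reported 'r' and 'p' are swapped relative to the usual convention.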

rouge.evaluate(real_answers, gpt_answers)

{'rouge-1': {'r': 0.04483,
  'r_conf_int': (0.0364, 0.05399),
  'p': 0.38542,
  'p_conf_int': (0.34145, 0.42995),
  'f': 0.07402,
  'f_conf_int': (0.06113, 0.0874)},
 'rouge-2': {'r': 0.01419,
  'r_conf_int': (0.00985, 0.01939),
  'p': 0.10653,
  'p_conf_int': (0.08209, 0.13313),
  'f': 0.02325,
  'f_conf_int': (0.01657, 0.03074)},
 'rouge-3': {'r': 0.00662,
  'r_conf_int': (0.00384, 0.01003),
  'p': 0.0428,
  'p_conf_int': (0.0289, 0.05753),
  'f': 0.01067,
  'f_conf_int': (0.00652, 0.01554)},
 'rouge-l': {'r': 0.03621,
  'r_conf_int': (0.02943, 0.04415),
  'p': 0.33889,
  'p_conf_int': (0.29708, 0.38082),
  'f': 0.06013,
  'f_conf_int': (0.05014, 0.07142)},
 'rouge-w-1.2': {'r': 0.01264,
  'r_conf_int': (0.01013, 0.01575),
  'p': 0.27893,
  'p_conf_int': (0.24479, 0.31593),
  'f': 0.02308,
  'f_conf_int': (0.01889, 0.0282)},
 'rouge-s4': {'r': 0.01055,
  'r_conf_int': (0.00715, 0.01474),
  'p': 0.08543,
  'p_conf_int': (0.06587, 0.10757),
  'f': 0.01727,
  'f_conf_int': (0.01216, 0.0

In [45]:
rouge.evaluate(real_answers, custom_answers)

{'rouge-1': {'r': 0.07275,
  'r_conf_int': (0.05778, 0.09002),
  'p': 0.35806,
  'p_conf_int': (0.31333, 0.40407),
  'f': 0.1044,
  'f_conf_int': (0.08607, 0.12384)},
 'rouge-2': {'r': 0.02727,
  'r_conf_int': (0.01967, 0.0359),
  'p': 0.12032,
  'p_conf_int': (0.09326, 0.1489),
  'f': 0.03957,
  'f_conf_int': (0.02974, 0.05082)},
 'rouge-3': {'r': 0.01417,
  'r_conf_int': (0.00891, 0.02009),
  'p': 0.06365,
  'p_conf_int': (0.04297, 0.08572),
  'f': 0.02062,
  'f_conf_int': (0.0133, 0.02908)},
 'rouge-l': {'r': 0.05916,
  'r_conf_int': (0.0477, 0.07246),
  'p': 0.31299,
  'p_conf_int': (0.27379, 0.35562),
  'f': 0.08571,
  'f_conf_int': (0.07148, 0.10173)},
 'rouge-w-1.2': {'r': 0.02309,
  'r_conf_int': (0.01809, 0.02915),
  'p': 0.26555,
  'p_conf_int': (0.23032, 0.30412),
  'f': 0.03773,
  'f_conf_int': (0.03069, 0.04601)},
 'rouge-s4': {'r': 0.01931,
  'r_conf_int': (0.01385, 0.02555),
  'p': 0.09398,
  'p_conf_int': (0.07114, 0.11701),
  'f': 0.02831,
  'f_conf_int': (0.02085, 0.0