![image](../images/kdd24-logo-small.jpeg)

# Hands-on Tutorial
## Domain-Driven LLM Development: Insights into RAG and Fine-Tuning Practices
### Lab 3: RAG + Fine-tuning and Benchmarking
#### Summary: 
This lab explores tools to support deciding between RAG and domain fine-tuning for LLM tasks such as Q&A or document summarization.

Based on the results from `Lab 1 - Advanced Techniques in Retrieval Augmented Generation (RAG)` and `Lab 2 - LLM fine-tuning`, we will explore how to compare both approaches and create data to support the decision on which solution to choose.

The choice between a fine-tuned model and a RAG solution should take into account model performance as well as pricing.

In the first session of this notebook we compare model performance to assess whether one or both models achieve a minimum acceptable performance, based on appropriate metrics related to the task at hand.

The second session is about offering pricing analysis tools to help decide between both models.
## Important note
### The pricing values used here are simulated and DO NOT reflect any real AWS pricing
The values used are consistent with the reference below:
> https://aws.amazon.com/kendra/pricing/

---

#### Installing and importing needed libraries

In [2]:
%%capture
!pip install -qU pinecone-client==2.2.1 ipywidgets==7.0.0
# install packages needed for plotting
! pip install -U kaleido --quiet
! pip install plotly --quiet
! pip install spacy --quiet

In [3]:
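import json
import os

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go  # used by the cost distribution bar charts
import plotly.io as pio
from plotly.subplots import make_subplots  # used by the cost distribution figures

pio.renderers.default = 'notebook'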

#### Load RAG and fine-tuned data from lab 1 and lab 2
These files contain samples with evaluation metrics as well as associated prompts used. 

For this short lab we are considering 30 interaction samples from each model, with the prompts and associated metric values.

In [7]:
FINETUNED_FILE = "../lab-data/"+"ft_evaldata.csv"  
RAG_FILE = "../lab-data/"+"rag_evaldata.csv"  

df_rag = pd.read_csv(RAG_FILE)
df_finetuned = pd.read_csv(FINETUNED_FILE)

In [8]:
df_rag.head()

Unnamed: 0,input_prompt,output_prompt,metric1,metric2,metric3,metric4,metric5
0,<ANSWER> This is an example of question with <...,this is an example of output,1.234,0.023,1.456,0.321,9.241
1,<ANSWER> This is an example of question with <...,this is an example of output,1.234,0.023,1.456,0.321,9.241
2,<ANSWER> This is an example of question with <...,this is an example of output,1.234,0.023,1.456,0.321,9.241
3,<ANSWER> This is an example of question with <...,this is an example of output,4.234,0.023,2.456,0.321,9.241
4,<ANSWER> This is an example of question with <...,this is an example of output,1.234,0.023,1.456,0.321,9.241


In [9]:
df_finetuned.head()

Unnamed: 0,input_prompt,output_prompt,metric1,metric2,metric3,metric4,metric5
0,<ANSWER> This is an example of question withou...,this is an example of output,1.234,0.023,1.456,0.321,9.241
1,<ANSWER> This is an example of question withou...,this is an example of output,1.234,0.023,1.456,0.321,9.241
2,<ANSWER> This is an example of question withou...,this is an example of output,1.234,0.023,1.456,0.321,9.241
3,<ANSWER> This is an example of question withou...,this is an example of output,1.234,0.023,1.456,0.321,9.241
4,<ANSWER> This is an example of question withou...,this is an example of output,1.234,0.023,1.456,0.321,9.241


#### Now we get the average metrics over all samples

In [10]:
# metric averages
def get_metric_averages(metric_df):
    # return list of metrics
    average_metrics = metric_df.loc[:, ['metric1', 'metric2','metric3', 'metric4','metric5']].mean()
    return [average_metrics.iloc[0],
            average_metrics.iloc[1],
            average_metrics.iloc[2],
            average_metrics.iloc[3],
            average_metrics.iloc[4]]

In [14]:
l_finetuned_metrics = get_metric_averages(df_finetuned)
print ("metric 1:", l_finetuned_metrics[0],
       "\nmetric 2:", l_finetuned_metrics[1],
       "\nmetric 3:", l_finetuned_metrics[2],
       "\nmetric 4:", l_finetuned_metrics[3],
       "\nmetric 5:", l_finetuned_metrics[4])

metric 1: 1.2340000000000004 
metric 2: 0.023000000000000007 
metric 3: 1.4560000000000006 
metric 4: 0.32099999999999995 
metric 5: 9.240999999999996


In [15]:
l_rag_metrics = get_metric_averages(df_rag)
print ("metric 1:", l_rag_metrics[0],
       "\nmetric 2:", l_rag_metrics[1],
       "\nmetric 3:", l_rag_metrics[2],
       "\nmetric 4:", l_rag_metrics[3],
       "\nmetric 5:", l_rag_metrics[4])

metric 1: 1.3952903225806454 
metric 2: 0.2165483870967741 
metric 3: 1.5527741935483879 
metric 4: 0.32099999999999995 
metric 5: 9.240999999999996


## Session 1 - Performance analysis
#### Plot RAG vs fine-tuned performance metrics
We use a radar plot as an easy way to present all metrics together for comparison.

In [37]:
# Build a long-format table: one row per (metric, model) pair
metric_names = ["metric1", "metric2", "metric3", "metric4", "metric5"]

df_evaluation = pd.DataFrame({
    'metric': metric_names * 2,
    'model': ["finetuned"] * 5 + ["rag"] * 5,
    # the values must follow the same order as the model labels above
    'value': l_finetuned_metrics + l_rag_metrics,
})

df_evaluation

Unnamed: 0,metric,model,value
0,metric1,finetuned,1.234
1,metric2,finetuned,0.023
2,metric3,finetuned,1.456
3,metric4,finetuned,0.321
4,metric5,finetuned,9.241
5,metric1,rag,1.39529
6,metric2,rag,0.216548
7,metric3,rag,1.552774
8,metric4,rag,0.321
9,metric5,rag,9.241


In [38]:
def visualize_metrics(df):
    # Radar (polar line) plot with one closed trace per model
    fig = px.line_polar(df, r="value",
                        theta="metric",
                        color="model",
                        line_close=True,
                        color_discrete_sequence=["#00eb93", "#4ed2ff"],
                        template="plotly_dark")

    fig.update_polars(angularaxis_showgrid=False,
                      radialaxis_gridwidth=0,
                      gridshape='linear',
                      bgcolor="#494b5a",
                      radialaxis_showticklabels=False
                      )
    fig.update_layout(paper_bgcolor="#2c2f36")
    # Save after all styling is applied; static export relies on the kaleido package
    fig.write_image("../lab-data/radarplot.pdf")
    fig.show()

In [39]:
visualize_metrics(df_evaluation);

## Analysis considerations
This plot gives you a quick view of the key metrics considered for this task.

Before going through a cost analysis, it is important to check whether the models are above the expected performance thresholds for the metrics that matter to your use case.

As long as both are above adequate thresholds, you can proceed to the cost analysis.
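
The threshold check described above can be sketched in a few lines. This is only an illustration: the threshold values and the `failing_metrics` helper are hypothetical, and the sketch assumes that higher is better for every metric, which may not hold for the metrics in your use case.

```python
# Hypothetical minimum thresholds; replace them with values that fit your use case.
# Assumption: for every metric, a higher value is better.
MIN_THRESHOLDS = {"metric1": 1.0, "metric2": 0.02, "metric3": 1.0}

def failing_metrics(avg_metrics, thresholds):
    """Return the names of the metrics whose average is below its threshold."""
    return [name for name, minimum in thresholds.items()
            if avg_metrics.get(name, float("-inf")) < minimum]

# Averages rounded from the outputs printed earlier in this lab.
finetuned_avg = {"metric1": 1.234, "metric2": 0.023, "metric3": 1.456}
rag_avg = {"metric1": 1.395, "metric2": 0.217, "metric3": 1.553}

for model_name, averages in [("finetuned", finetuned_avg), ("rag", rag_avg)]:
    failed = failing_metrics(averages, MIN_THRESHOLDS)
    status = "passes" if not failed else "fails on " + ", ".join(failed)
    print(model_name, status)
```

A metric that is missing from the averages is treated as failing, so an incomplete evaluation does not pass by accident.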

---


## Session 2 - Cost analysis
Once both models have passed the performance test, we can look at the costs involved.

We make some assumptions to proceed with this example:
+ As we need to use small datasets to run these labs, we will extrapolate RAG dataset sizes for a more realistic scenario
+ We also define a number of accesses per month for both solutions in order to calculate expected costs
+ Pricing considered for the RAG model:
    + Monthly hosting costs for the RAG datasets
    + Monthly input and output token usage, based on the average number of input and output tokens
    + We are considering pricing based on the use of Amazon Kendra
+ Pricing considered for the fine-tuned model:
    + Fixed training cost
    + Monthly model hosting
    + Monthly input and output token usage
    
---

#### Process the input and output prompts from the RAG and fine-tuned datasets from lab 1 and lab 2

In [42]:
df_finetuned_prompts = df_finetuned[["input_prompt", "output_prompt"]]
df_finetuned_prompts.head()

Unnamed: 0,input_prompt,output_prompt
0,<ANSWER> This is an example of question withou...,this is an example of output
1,<ANSWER> This is an example of question withou...,this is an example of output
2,<ANSWER> This is an example of question withou...,this is an example of output
3,<ANSWER> This is an example of question withou...,this is an example of output
4,<ANSWER> This is an example of question withou...,this is an example of output


In [43]:
df_rag_prompts = df_rag[["input_prompt", "output_prompt"]]
df_rag_prompts.head()

Unnamed: 0,input_prompt,output_prompt
0,<ANSWER> This is an example of question with <...,this is an example of output
1,<ANSWER> This is an example of question with <...,this is an example of output
2,<ANSWER> This is an example of question with <...,this is an example of output
3,<ANSWER> This is an example of question with <...,this is an example of output
4,<ANSWER> This is an example of question with <...,this is an example of output


#### Counting tokens from RAG and FT input and output prompts
We now proceed to counting the input and output tokens and calculating the sample mean for both.

In [44]:
import spacy
# Create a blank English pipeline; it only provides the tokenizer,
# which is all we need to count tokens.
# Note: spaCy's word-level tokens only approximate the subword tokens an LLM counts.
nlp = spacy.blank("en")

In [45]:
# Enable copy-on-write so the column subsets behave as independent copies
# when new columns are added to them
pd.options.mode.copy_on_write = True
df_finetuned_prompts['num_input_prompt'] = df_finetuned_prompts['input_prompt'].map(lambda a: len(nlp(a)))
df_finetuned_prompts['num_output_prompt'] = df_finetuned_prompts['output_prompt'].map(lambda a: len(nlp(a)))

# For RAG we count the full input prompt, not only the retrieved context
df_rag_prompts['num_input_prompt'] = df_rag_prompts['input_prompt'].map(lambda a: len(nlp(a)))
df_rag_prompts['num_output_prompt'] = df_rag_prompts['output_prompt'].map(lambda a: len(nlp(a)))


In [46]:
# Mean token counts per prompt type and model
ft_mean_input_size = df_finetuned_prompts.loc[:, 'num_input_prompt'].mean()
ft_mean_output_size = df_finetuned_prompts.loc[:, 'num_output_prompt'].mean()
rag_mean_input_size = df_rag_prompts.loc[:, 'num_input_prompt'].mean()
rag_mean_output_size = df_rag_prompts.loc[:, 'num_output_prompt'].mean()

In [49]:
#print(ft_mean_input_size, ft_mean_output_size, rag_mean_input_size, rag_mean_output_size)

In [48]:
# Summary table of mean token counts per prompt type and model
model = ["finetuned", "finetuned", "rag", "rag"]
prompt = ["input", "output", "input", "output"]
num_tokens = [ft_mean_input_size, ft_mean_output_size, rag_mean_input_size, rag_mean_output_size]

# avoid naming the variable `dict`, which would shadow the built-in
token_data = {'prompt_type': prompt, 'model': model, 'num_tokens': num_tokens}

df_comp_in_out = pd.DataFrame(token_data)

df_comp_in_out

Unnamed: 0,prompt_type,model,num_tokens
0,input,finetuned,11.0
1,output,finetuned,7.0
2,input,rag,17.0
3,output,rag,7.0


### Comparing prompt sizes for RAG and Fine-tuned models

In [52]:
fig = px.bar(df_comp_in_out, title = "Input and output mean token sizes", x="prompt_type", y="num_tokens", color="model", barmode="group")
fig.show()

## Can you see a difference above in average prompt size when comparing RAG and FT?
+ For input prompts it is quite obvious: the prompt is bigger for RAG, as the solution incorporates context text into the prompt.
+ For outputs it is not so obvious: in this small sample both means are 7 tokens, but the fine-tuned mean output size is usually smaller. This is supported by research results like <TODO: REFERENCE HERE>

## Fine-tuning scenario
An application developer customizes a `Llama3 8B Chat` model using `1000 question-answer pairs` as fine-tuning data.

+ After training, the developer uses custom model provisioned throughput for 1 hour to evaluate the performance of the model. 
+ The fine-tuned model is stored for 1 month. 
+ After evaluation, the developer uses provisioned throughput (1-month commitment term) to host the customized model.

The training cost incurred for fine-tuning is calculated as the product of the price per sample seen, the fine-tuning dataset size, the number of training steps, and the batch size.
For this example we consider:
+ price per sample: $0.005
+ number of training steps: 500
+ batch size: 64
+ fine-tune dataset size: 1000

**Note: We consider a one-year period of cumulative costs in this analysis, but the time period can easily be changed to consider different timeframes.**


In [89]:
# timeframe considered
start_date = '1/1/2023'
end_date = '12/31/2023'

In [90]:
# fine-tuning training cost (fixed cost per training run)
num_hours_month = 720  # 30 days x 24 hours
price_per_sample = 0.005
number_of_steps = 500
batch_size = 64
dataset_size = 1000
evaluation_throuput_hour = 20 # simulated hourly price of provisioned throughput

ft_training_cost = price_per_sample * dataset_size * number_of_steps * batch_size
print("Fine-tuned fixed training pricing: ", ft_training_cost)
# fine-tuned model hosting cost (monthly): storage plus provisioned throughput
model_storage_per_month = 1.95 # custom model storage per month
fine_tuned_model_inference = evaluation_throuput_hour * num_hours_month
ft_model_hosting_cost_month = model_storage_per_month + fine_tuned_model_inference
print("Fine-tuned model monthly hosting pricing: ", ft_model_hosting_cost_month)

Fine-tuned fixed training pricing:  160000.0
Fine-tuned model monthly hosting pricing:  14401.95


## RAG scenario
For this pricing simulation we consider Bedrock with a knowledge base supported by Kendra.
An application developer uses a pre-trained `Llama3 8B Chat` model supported by RAG with the following assumptions:
+ 90,000 documents in the knowledge base
+ 7,000 searches per day
+ As we have fewer than 8k queries per day and fewer than 100k documents, the price per hour is $1.4
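
As a minimal sketch, the tier check behind these assumptions can be made explicit so that a change in document count or query volume does not silently reuse the wrong hourly price. The limits and the price are the simulated values from this lab (not real AWS pricing), and the names `fits_tier` and `monthly_hosting_cost` are hypothetical helpers.

```python
# Simulated limits and price of the tier used in this lab (NOT real AWS pricing).
TIER_MAX_DOCUMENTS = 100_000
TIER_MAX_QUERIES_PER_DAY = 8_000
TIER_PRICE_PER_HOUR = 1.4

def fits_tier(num_documents, queries_per_day):
    """Check whether the workload stays within the simulated tier limits."""
    return (num_documents <= TIER_MAX_DOCUMENTS
            and queries_per_day <= TIER_MAX_QUERIES_PER_DAY)

def monthly_hosting_cost(num_documents, queries_per_day, hours_per_month=720):
    """Return the monthly hosting cost, or fail loudly if the tier does not apply."""
    if not fits_tier(num_documents, queries_per_day):
        raise ValueError("Workload exceeds the simulated tier; a different price applies.")
    return TIER_PRICE_PER_HOUR * hours_per_month

print(monthly_hosting_cost(90_000, 7_000))
```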

In [91]:
# RAG hosting calculation (simulating Amazon Kendra)
# Knowledge base size and query volume; both stay within the limits of the tier priced below
num_rag_doc_to_search = 90000 # below the 100k-document limit
num_searches_day = 7000 # below the 8k-queries-per-day limit
pricing_per_hour = 1.4 # up to 8k queries per day, up to 100k documents

rag_hosting_princing_per_month = pricing_per_hour * num_hours_month # $1.4 per hour x 720 hours/month = $1,008
print("RAG monthly hosting price: ", rag_hosting_princing_per_month)

RAG monthly hosting price:  1007.9999999999999


In [92]:
# simulated token prices, applied per 1,000 tokens (hence the division by 1000 below)
input_tokens_hourly_pricing = 0.0003
output_tokens_hourly_pricing = 0.0004
# As written, the monthly cost corresponds to one average-sized request per hour (720 per month)
# FT tokens
ft_token_cost_per_month = (ft_mean_input_size/1000) * num_hours_month * input_tokens_hourly_pricing + \
                      (ft_mean_output_size/1000) * num_hours_month * output_tokens_hourly_pricing  
print("Fine-tuned token pricing per month: ", ft_token_cost_per_month)

# RAG tokens
rag_token_cost_per_month = (rag_mean_input_size/1000) * num_hours_month * input_tokens_hourly_pricing + \
                      (rag_mean_output_size/1000) * num_hours_month * output_tokens_hourly_pricing  
print("RAG token pricing per month: ", rag_token_cost_per_month)

Fine-tuned token pricing per month:  0.004392
RAG token pricing per month:  0.005688
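
The token cost formula can also be written with the request volume as an explicit parameter. The sketch below is an illustration under stated assumptions: `monthly_token_cost` is a hypothetical helper, the prices are the simulated per-1,000-token values from this lab, and the second scenario assumes the 7,000 searches per day of the RAG scenario over a 30-day month.

```python
def monthly_token_cost(mean_input_tokens, mean_output_tokens, requests_per_month,
                       input_price_per_1k=0.0003, output_price_per_1k=0.0004):
    """Monthly token cost for a given number of average-sized requests."""
    cost_per_request = ((mean_input_tokens / 1000) * input_price_per_1k
                        + (mean_output_tokens / 1000) * output_price_per_1k)
    return cost_per_request * requests_per_month

# Mean token counts taken from the summary table in this lab.
FT_IN, FT_OUT = 11.0, 7.0
RAG_IN, RAG_OUT = 17.0, 7.0

# 720 requests per month reproduces the values printed by the cell above.
print("FT, 720 requests:", monthly_token_cost(FT_IN, FT_OUT, 720))
print("RAG, 720 requests:", monthly_token_cost(RAG_IN, RAG_OUT, 720))

# Assumed volume: 7,000 searches per day over a 30-day month.
requests = 7000 * 30
print("FT, 7000/day:", monthly_token_cost(FT_IN, FT_OUT, requests))
print("RAG, 7000/day:", monthly_token_cost(RAG_IN, RAG_OUT, requests))
```

Making the volume explicit shows how sensitive the token cost is to traffic, while the relative gap between RAG and fine-tuning stays the same because it depends only on the mean prompt sizes.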


### Plotting RAG pricing
First, we look at the relative contributions of RAG dataset hosting and token usage.

In [93]:
# Stacked bar chart of cumulative RAG costs: hosting vs. token usage
# (reads the global dataframe df_breakeven_rag built in the next cell)
def plot_pricing_rag_stacked():
    fig = px.bar(df_breakeven_rag, title="RAG stacked pricing",  x=df_breakeven_rag.month,  y=["RAG_hosting_monthly_pricing","RAG_tokens_monthly_pricing"])
    fig.show()

In [94]:
# Build df_breakeven_rag with columns: month, RAG_hosting_monthly_pricing, RAG_tokens_monthly_pricing
# Note: the 'ME' (month end) frequency alias requires pandas >= 2.2
num_months = pd.date_range(start=start_date, end=end_date, freq='ME').to_frame().shape[0]

l_accum_rag_token_pricing = []
l_accum_rag_hosting_pricing = []
rag_hosting_accum = 0
rag_token_accum = 0

for i in range(num_months):
    # accumulate the monthly hosting and token costs
    rag_hosting_accum += rag_hosting_princing_per_month
    rag_token_accum += rag_token_cost_per_month
    # visualization only: token costs are multiplied by 10,000 to be visible in the plot
    l_accum_rag_token_pricing.append(rag_token_accum*10000)
    l_accum_rag_hosting_pricing.append(rag_hosting_accum)

# create dataframe for plotting
dt = pd.date_range(start=start_date, end=end_date, freq='ME')
df_breakeven_rag = dt.to_frame(index=False)
df_breakeven_rag.rename(columns={0: "month"}, inplace=True)
df_breakeven_rag['RAG_hosting_monthly_pricing'] = l_accum_rag_hosting_pricing
df_breakeven_rag['RAG_tokens_monthly_pricing'] = l_accum_rag_token_pricing

plot_pricing_rag_stacked()

### Analysis considerations
We can see that hosting contributes much more than token usage to the cumulative costs of a RAG solution, even though the token costs are multiplied by 10,000 in the plot for visibility.

So the decision should be driven by efforts to reduce hosting costs.

## Plotting fine-tuning pricing
Now we look at the relative contributions of the fine-tuning fixed and monthly costs:
+ fine-tuning fixed training cost
+ fine-tuning monthly hosting cost
+ fine-tuning monthly token cost


In [95]:
# Stacked bar chart of cumulative fine-tuning costs: hosting, tokens and fixed training
def plot_pricing_ft_stacked(every_n_months, df):
    fig = px.bar(df, title="Fine-tuned stacked pricing - training frequency: every " + str(every_n_months) + " months",
                 x=df.month,  
                 y=["FT_hosting_monthly_pricing","FT_tokens_monthly_pricing", "FT_fixed_training_pricing"])
    fig.show()

In [106]:
def prep_finetuned(every_n_months, plot=True):
    """Accumulate fine-tuning costs per month and optionally plot them.

    every_n_months: training frequency; one training run is paid every n months.
    """
    l_accum_ft_training_pricing = []
    l_accum_ft_token_pricing = []
    l_accum_ft_hosting_pricing = []
    ft_hosting_accum = 0
    ft_token_accum = 0
    ft_training_accum = 0
    for i in range(num_months):
        # add the fixed training cost whenever a training run is due
        if i % every_n_months == 0:
            ft_training_accum += ft_training_cost
        ft_hosting_accum += ft_model_hosting_cost_month
        ft_token_accum += ft_token_cost_per_month
        # visualization only: token costs are multiplied by 1,000,000 to be visible in the plot
        l_accum_ft_token_pricing.append(ft_token_accum * 1000000)
        l_accum_ft_hosting_pricing.append(ft_hosting_accum)
        l_accum_ft_training_pricing.append(ft_training_accum)
    # create dataframe for plotting
    dt = pd.date_range(start=start_date, end=end_date, freq='ME')
    df_breakeven_ft = dt.to_frame(index=False)
    df_breakeven_ft.rename(columns={0: "month"}, inplace=True)
    df_breakeven_ft['FT_hosting_monthly_pricing'] = l_accum_ft_hosting_pricing
    df_breakeven_ft['FT_tokens_monthly_pricing'] = l_accum_ft_token_pricing
    df_breakeven_ft['FT_fixed_training_pricing'] = l_accum_ft_training_pricing
    if plot:
        plot_pricing_ft_stacked(every_n_months, df_breakeven_ft)
    return df_breakeven_ft

#### Comparing costs considering monthly fine-tuning with same dataset size

In [118]:
# input parameter: frequency of fine-tuning training (months)
df_ft = prep_finetuned(1)

### Analysis considerations
We can see that the training cost contributes more than the monthly costs.

Also, among the monthly costs, hosting contributes the most, the same as we have seen for the RAG solution.

So the decision here should be driven by efforts to reduce training costs. For example:
1. If you can spend more time training, you can use a less powerful GPU
2. You can reduce the frequency of fine-tuning jobs, as long as this does not compromise model quality (possible if data freshness is flexible for your solution)

Let's explore different fine-tuning frequencies now.

## Exploring different fine-tuning frequencies

In [119]:
# input parameter: frequency of fine-tuning training (months)
df_ft = prep_finetuned(3)

In [120]:
# input parameter: frequency of fine-tuning training (months)
df_ft = prep_finetuned(6)

In [121]:
# input parameter: frequency of fine-tuning training (months)
df_ft = prep_finetuned(12)

### Analysis considerations
We can clearly see that the fine-tuning frequency has a huge impact on total costs: the less often we retrain, the smaller the contribution of the training cost compared to the monthly costs.

Considering the final total cost after one year for the fine-tuned model, let's take a look at different training frequencies.
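
The one-year totals per cost type can be derived from the values computed earlier instead of being typed in by hand. The sketch below is a minimal illustration: the constants are copied from the outputs printed earlier in this lab (simulated pricing), and `yearly_ft_costs` is a hypothetical helper that mirrors the accumulation rule of `prep_finetuned`.

```python
# Values copied from the outputs printed earlier in this lab (simulated pricing).
FT_TRAINING_COST = 160000.0       # fixed cost per training run
FT_HOSTING_PER_MONTH = 14401.95   # storage plus provisioned throughput
FT_TOKENS_PER_MONTH = 0.004392    # unscaled monthly token cost

def yearly_ft_costs(every_n_months, num_months=12):
    """Cumulative costs per cost type when training every `every_n_months` months."""
    # A training run is paid in months 0, n, 2n, ... as in prep_finetuned.
    num_trainings = len(range(0, num_months, every_n_months))
    return {
        "training": num_trainings * FT_TRAINING_COST,
        "hosting": num_months * FT_HOSTING_PER_MONTH,
        "tokens": num_months * FT_TOKENS_PER_MONTH,
    }

for n in (1, 3, 6, 12):
    print("training every", n, "months:", yearly_ft_costs(n))
```

Note that the token totals here are unscaled, unlike the plots in this lab, which multiply token costs by a large factor so that they are visible.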

In [197]:
# One-year cumulative fine-tuning costs with monthly training (12 training runs)
# hosting = 12 x 14401.95; token costs are multiplied by 1,000,000 for visibility, as in prep_finetuned
l_cost_type = ['training', 'hosting', 'tokens']
l_acum_values = [1920000, 172823.4, 52704]

# avoid naming the variable `dict`, which would shadow the built-in
pie_data = {'cost_type': l_cost_type, 'accum_value': l_acum_values}

df_ft_pie = pd.DataFrame(pie_data)

# pie chart on the left, bar chart on the right
pie = px.pie(df_ft_pie, values='accum_value', names='cost_type')
fig=make_subplots(rows=1, cols=2,
                  specs=[[{"type": "domain"},{"type": "xy"}]])

fig.add_trace(pie.data[0], row=1, col=1)
names = ['Training', 'Hosting', 'Tokens']
fig.add_trace(go.Bar(x=names, y=df_ft_pie.accum_value.to_list(), marker_color="blue", name= "cost_type"), row=1, col=2)
fig.update_layout(title="Fine-tune cost distribution - monthly training", width=700, height=350, bargap=0.05)

In [198]:
# One-year cumulative fine-tuning costs with quarterly training (4 training runs)
# hosting = 12 x 14401.95; token costs are multiplied by 1,000,000 for visibility, as in prep_finetuned
l_cost_type = ['training', 'hosting', 'tokens']
l_acum_values = [640000, 172823.4, 52704]

# use l_cost_type here; `cost_type` is not defined
pie_data = {'cost_type': l_cost_type, 'accum_value': l_acum_values}

df_ft_pie = pd.DataFrame(pie_data)

pie = px.pie(df_ft_pie, values='accum_value', names='cost_type')
fig=make_subplots(rows=1, cols=2,
                  specs=[[{"type": "domain"},{"type": "xy"}]])

fig.add_trace(pie.data[0], row=1, col=1)
names = ['Training', 'Hosting', 'Tokens']
fig.add_trace(go.Bar(x=names, y=df_ft_pie.accum_value.to_list(), marker_color="blue", name= "cost_type"), row=1, col=2)
fig.update_layout(title="Fine-tune cost distribution - quarterly training", width=700, height=350, bargap=0.05)

In [199]:
# One-year cumulative fine-tuning costs with semi-annual training (2 training runs)
# hosting = 12 x 14401.95; token costs are multiplied by 1,000,000 for visibility, as in prep_finetuned
l_cost_type = ['training', 'hosting', 'tokens']
l_acum_values = [320000, 172823.4, 52704]

# use l_cost_type here; `cost_type` is not defined
pie_data = {'cost_type': l_cost_type, 'accum_value': l_acum_values}

df_ft_pie = pd.DataFrame(pie_data)

pie = px.pie(df_ft_pie, values='accum_value', names='cost_type')
fig=make_subplots(rows=1, cols=2,
                  specs=[[{"type": "domain"},{"type": "xy"}]])

fig.add_trace(pie.data[0], row=1, col=1)
names = ['Training', 'Hosting', 'Tokens']
fig.add_trace(go.Bar(x=names, y=df_ft_pie.accum_value.to_list(), marker_color="blue", name= "cost_type"), row=1, col=2)
fig.update_layout(title="Fine-tune cost distribution - semi-annual training", width=700, height=350, bargap=0.05)

In [201]:
# One-year cumulative fine-tuning costs with annual training (1 training run)
# hosting = 12 x 14401.95; token costs are multiplied by 1,000,000 for visibility, as in prep_finetuned
l_cost_type = ['training', 'hosting', 'tokens']
l_acum_values = [160000, 172823.4, 52704]

# use l_cost_type here; `cost_type` is not defined
pie_data = {'cost_type': l_cost_type, 'accum_value': l_acum_values}

df_ft_pie = pd.DataFrame(pie_data)

pie = px.pie(df_ft_pie, values='accum_value', names='cost_type')
fig=make_subplots(rows=1, cols=2,
                  specs=[[{"type": "domain"},{"type": "xy"}]])

fig.add_trace(pie.data[0], row=1, col=1)
names = ['Training', 'Hosting', 'Tokens']
fig.add_trace(go.Bar(x=names, y=df_ft_pie.accum_value.to_list(), marker_color="blue", name= "cost_type"), row=1, col=2)
fig.update_layout(title="Fine-tune cost distribution - annual training",width=700, height=350, bargap=0.05)

### Analysis considerations
We can see how the monthly hosting cost dominates as we decrease the fine-tuning frequency.

## Finally, comparing FT against RAG
TODO

In [203]:
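# Grouped bar chart comparing cumulative RAG and fine-tuning costs per month.
# df_breakeven must contain the columns: month, RAG_pricing, FT_pricing
def plot_pricing(df_breakeven, every_n_months):
    fig = px.bar(df_breakeven, title="Fine-tuning frequency: every " + str(every_n_months) + " months",  x=df_breakeven.month, barmode='group', y=["RAG_pricing","FT_pricing"])
    fig.show()
# TODO: df_breakeven is not built yet in this lab, so the call stays disabled
#plot_pricing(df_breakeven, every_n_months)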

# Conclusions

## 4. Get break even plot to support cost analysis

## 3. Run prompt experiment
After defining the performance baseline for the fine-tuned model, we now run an experiment to measure the input token size reduction when using a fine-tuned model compared with a pre-trained model using RAG.
The goal is to measure the savings in input tokens that we get by not adding context information to the prompt, as RAG solutions do.
#### Question to answer
When is it worth moving to RAG instead of fine-tuning the pre-trained model? We consider input token costs only, as this is the big cost difference between the same solution using RAG or fine-tuning.
RAG has an extra cost related to the context added to the prompt.
We need to measure this difference on average and use this information to support a cost-based decision between RAG and fine-tuning.
Basically, we take a break-even approach: what fine-tuning frequency would match the RAG cost, while keeping at least the same performance as the fine-tuned model?
We need to perform these steps:
1. Define a batch of at least 30 prompts and compute the mean input size
2. Run the pre-trained model + RAG with these prompts
    2.1 Get the context retrieved and measure the mean size returned. This is the average number of tokens added relative to the fine-tuned model (with the same prompts)
3. Simulate an example of fine-tuned model costs considering a basic AWS architecture (OpenSearch, SageMaker, S3, etc.) and calculate the total cost for each training
4. Create a break-even plot based on these 2 costs
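
The break-even step can be sketched as a comparison of cumulative cost curves. This is an illustration under stated assumptions, not the final analysis of this lab: `cumulative_costs` and `break_even_month` are hypothetical helpers, and the inputs are the simulated monthly and fixed costs computed earlier in this notebook, with fine-tuning performed once per year.

```python
def cumulative_costs(monthly_cost, num_months, fixed_cost=0.0, every_n_months=None):
    """Cumulative cost per month; a fixed cost is added every n months if given."""
    totals, running = [], 0.0
    for month in range(num_months):
        if every_n_months and month % every_n_months == 0:
            running += fixed_cost
        running += monthly_cost
        totals.append(running)
    return totals

def break_even_month(ft_totals, rag_totals):
    """First month (1-based) where fine-tuning is no more expensive than RAG, else None."""
    for month, (ft, rag) in enumerate(zip(ft_totals, rag_totals), start=1):
        if ft <= rag:
            return month
    return None

# Simulated values from this lab: monthly hosting plus monthly token costs.
NUM_MONTHS = 12
rag_totals = cumulative_costs(1008.0 + 0.005688, NUM_MONTHS)
ft_totals = cumulative_costs(14401.95 + 0.004392, NUM_MONTHS,
                             fixed_cost=160000.0, every_n_months=12)

# With these values fine-tuning never becomes cheaper within the year, so this prints None.
print("Break-even month:", break_even_month(ft_totals, rag_totals))
```

Because the simulated fine-tuning hosting cost is higher than the RAG hosting cost in every month, no training frequency reaches break-even with these inputs; the result changes only if the hosting or token assumptions change.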


Let's place all of this logic into a single RAG query function:

###### Accessing the FT pre-trained endpoint and making predictions with a batch of 30 prompts