# Welcome to Lumigator foxfooding!

## Agenda

+ Introduction and environment setup (credentials)
+ Platform Setup and Walkthrough
+ Explanation and examination of the Thunderbird ground truth
+ Model selection: (1) encoder/decoder, (2) decoder, evaluation against GPT-4
+ Run experiment and show results
+ Evaluate results and discuss

## Foxfooding Introduction and Setup Environment

## Who we are, what we do, about the platform, etc. 


## Platform Setup and Walkthrough

In [1]:
import time
import lumigator_demo as ld

from datasets import load_dataset
from IPython.display import clear_output

Write your team name below:

In [2]:
team_name = "lumigator_enthusiasts"

## Working with datasets

### Loading data
The following dataset is already in the format that we need as input: 
- one field called `examples` containing the text to summarize
- one field called `ground_truth` containing the summaries to compare the models' outputs against

Note that you can load many different types of file formats in a similar way (see https://huggingface.co/docs/datasets/loading#local-and-remote-files)
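
Before uploading a file of your own, it can help to confirm that it has the two fields described above. The following is a minimal sketch using only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced for illustration and are not part of `lumigator_demo`.

```python
import csv
import io

# The two fields the notebook expects in the input dataset
REQUIRED_COLUMNS = {"examples", "ground_truth"}


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header (empty set if none)."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# Tiny in-memory stand-in for a real file such as thunderbird.csv
sample = io.StringIO("examples,ground_truth\nlong email text,short summary\n")
print(missing_columns(sample))  # set() when both columns are present
```

For a file on disk, the same helper could be called as `missing_columns(open(dataset_name, newline=""))`.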

In [3]:
dataset_name = "thunderbird.csv"

### Commented out until the dataset is final; for now, just use the following cell to download the dataset
# ds = load_dataset("csv", data_files = dataset_name, split="train")
# ds = ds.to_pandas()

# show / do things with the dataset here

## Dataset Upload

In [4]:
# r = ld.dataset_upload(dataset_name)
# dataset_id = ld.get_resource_id(r)

dataset_id = "95802e81-0334-476c-9b08-aa5da07fde9f" # thunderbird pre-saved dataset

### Check dataset info

You can retrieve a dataset's info at any time by providing its UUID:

In [5]:
r = ld.dataset_info(dataset_id)

{
  "id": "95802e81-0334-476c-9b08-aa5da07fde9f",
  "filename": "thunderbird.csv",
  "format": "experiment",
  "size": 151137,
  "created_at": "2024-07-26T15:39:40.700918Z"
}
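
The fields in the output above can be turned into friendlier values. This sketch works on the printed JSON text, not on the `r` object, whose type is not shown here. It assumes that `size` is expressed in bytes, which the output does not state; `summarize_info` is a hypothetical helper.

```python
import json
import uuid
from datetime import datetime

# JSON text copied from the dataset info output above
info_json = """{
  "id": "95802e81-0334-476c-9b08-aa5da07fde9f",
  "filename": "thunderbird.csv",
  "format": "experiment",
  "size": 151137,
  "created_at": "2024-07-26T15:39:40.700918Z"
}"""


def summarize_info(raw):
    """Parse dataset info JSON into a small dict of readable values."""
    info = json.loads(raw)
    uuid.UUID(info["id"])  # raises ValueError if the id is not a valid UUID
    # Replace the trailing "Z" so older Python versions can parse the timestamp
    created = datetime.fromisoformat(info["created_at"].replace("Z", "+00:00"))
    return {
        "filename": info["filename"],
        "size_kib": round(info["size"] / 1024, 1),  # assumes size is in bytes
        "created": created,
    }


summary = summarize_info(info_json)
print(summary["filename"], summary["size_kib"], summary["created"].date())
```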


## Model Selection

The lists below contain models we have already tested for the summarization task.
The `models` variable at the end provides a default selection, but you can choose any combination of them.

In [6]:
enc_dec_models = [
    'hf://facebook/bart-large-cnn',
    'hf://mikeadimech/longformer-qmsum-meeting-summarization', 
    'hf://mrm8488/t5-base-finetuned-summarize-news',
    'hf://Falconsai/text_summarization',
]

dec_models = [
    'hf://mistralai/Mistral-7B-Instruct-v0.3',
    # TODO: test more dec_models such as
    # 'hf://meta-llama/Meta-Llama-3-8B',
    # 'hf://microsoft/Phi-3-mini-4k-instruct',
]

gpts = [
    "oai://gpt-4o-mini",
    "oai://gpt-4-turbo",
    "oai://gpt-3.5-turbo-0125"  
]

models = [
    enc_dec_models[0],
    dec_models[0],
    gpts[1]
]

In [7]:
models

['hf://facebook/bart-large-cnn',
 'hf://mistralai/Mistral-7B-Instruct-v0.3',
 'oai://gpt-4-turbo']
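
Each model identifier above combines a scheme prefix (`hf://` or `oai://`) with a model name. The sketch below separates the two parts, which can be handy for labelling results; note that the evaluation table later in this notebook shows the names without the prefix. `split_model_uri` is a hypothetical helper, and how Lumigator itself interprets the prefix is not shown here.

```python
def split_model_uri(uri):
    """Split 'scheme://name' into (scheme, name); raise ValueError if malformed."""
    scheme, sep, name = uri.partition("://")
    if not sep or not scheme or not name:
        raise ValueError(f"expected '<scheme>://<model name>', got {uri!r}")
    return scheme, name


models = [
    "hf://facebook/bart-large-cnn",
    "hf://mistralai/Mistral-7B-Instruct-v0.3",
    "oai://gpt-4-turbo",
]

for model in models:
    scheme, name = split_model_uri(model)
    print(f"{scheme}: {name}")
```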

## Run Evaluations

In [9]:
# change the following to 0 to use all samples in the dataset
max_samples = 10

responses = []
for model in models:
    descr = f"Testing {model} summarization model on {dataset_name}"
    responses.append(ld.experiments_submit(model, team_name, descr, dataset_id, max_samples))

{
  "id": "5a2a95c4-9902-4841-81ab-c9818a9a86a2",
  "name": "lumigator_enthusiasts",
  "description": "Testing hf://facebook/bart-large-cnn summarization model on thunderbird.csv",
  "status": "created",
  "created_at": "2024-07-26T17:12:14.978142Z",
  "updated_at": null
}
{
  "id": "218cc213-6e5d-4b86-821b-727bc4527b35",
  "name": "lumigator_enthusiasts",
  "description": "Testing hf://mistralai/Mistral-7B-Instruct-v0.3 summarization model on thunderbird.csv",
  "status": "created",
  "created_at": "2024-07-26T17:12:15.730906Z",
  "updated_at": null
}
{
  "id": "ece1e25a-9f53-42cf-838b-984d58f07f91",
  "name": "lumigator_enthusiasts",
  "description": "Testing oai://gpt-4-turbo summarization model on thunderbird.csv",
  "status": "created",
  "created_at": "2024-07-26T17:12:16.449763Z",
  "updated_at": null
}


### Track evaluation jobs

Run the following to track your evaluation jobs.

- *NOTE*: you won't be able to run other cells while this one is running. However, you can interrupt it whenever you want by clicking on the "stop" button above and run it at a later time.

In [10]:
job_ids = [ld.get_resource_id(r) for r in responses]

# poll every 5 seconds until no job is still in progress
wip = ld.show_experiment_statuses(job_ids)
while wip:
    time.sleep(5)
    clear_output()
    wip = ld.show_experiment_statuses(job_ids)

5a2a95c4-9902-4841-81ab-c9818a9a86a2: SUCCEEDED
218cc213-6e5d-4b86-821b-727bc4527b35: SUCCEEDED
ece1e25a-9f53-42cf-838b-984d58f07f91: SUCCEEDED
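
The tracking loop above runs until every job finishes. If you would rather give up after a fixed amount of time, a polling loop with a timeout is one option. This is a minimal sketch: `wait_until_done` is a hypothetical helper, and the status check is passed in as a plain callable so that the example runs without a Lumigator backend.

```python
import time


def wait_until_done(check_in_progress, poll_seconds=5.0, timeout_seconds=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until check_in_progress() is falsy; return False if the timeout hits first."""
    deadline = clock() + timeout_seconds
    while check_in_progress():
        if clock() >= deadline:
            return False  # jobs still running when time ran out
        sleep(poll_seconds)
    return True  # nothing is in progress any more


# Stand-in status check: reports "in progress" twice, then "done"
remaining = iter([True, True, False])
done = wait_until_done(lambda: next(remaining), sleep=lambda seconds: None)
print(done)  # True
```

In the notebook, the callable could be `lambda: ld.show_experiment_statuses(job_ids)`, assuming that function returns a truthy value while jobs are still running, as the loop above implies.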


## Show evaluation results

In [11]:
# after the jobs complete, gather evaluation results
eval_results = []
for job_id in job_ids:
    eval_results.append(ld.experiments_result_download(job_id))

# convert results into a pandas dataframe
eval_table = ld.eval_results_to_table(models, eval_results)

In [12]:
eval_table

| | Model | Meteor | BERT Precision | BERT Recall | BERT F1 | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|---|---|
| 0 | facebook/bart-large-cnn | 0.999994 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | mistralai/Mistral-7B-Instruct-v0.3 | 0.446223 | 0.808785 | 0.938535 | 0.86861 | 0.25075 | 0.238054 | 0.238879 |
| 2 | gpt-4-turbo | 0.316353 | 0.859208 | 0.882404 | 0.870604 | 0.328549 | 0.105376 | 0.214223 |
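
When discussing the results, keep in mind that the model ranking can change depending on the metric. The sketch below copies three columns from the table above into plain Python and ranks the models by a chosen metric, assuming higher is better for each of them. `rank_by` is a hypothetical helper; with the real `eval_table` dataframe, a pandas sort would do the same job.

```python
# Scores copied from the evaluation table above (subset of columns)
rows = [
    {"Model": "facebook/bart-large-cnn", "Meteor": 0.999994, "BERT F1": 1.0, "ROUGE-L": 1.0},
    {"Model": "mistralai/Mistral-7B-Instruct-v0.3", "Meteor": 0.446223, "BERT F1": 0.86861, "ROUGE-L": 0.238879},
    {"Model": "gpt-4-turbo", "Meteor": 0.316353, "BERT F1": 0.870604, "ROUGE-L": 0.214223},
]


def rank_by(rows, metric):
    """Return model names sorted from best to worst on the given metric."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [row["Model"] for row in ordered]


for metric in ("BERT F1", "ROUGE-L"):
    print(metric, "->", rank_by(rows, metric))
```

On these numbers, `gpt-4-turbo` ranks above Mistral on BERT F1 but below it on ROUGE-L, so it is worth looking at several metrics before drawing conclusions.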


In [None]:
eval_results[2]