# Welcome to Lumigator foxfooding!

## Agenda

+ Setup
+ Dataset Analysis
+ Explanation and Examination of Thunderbird Ground Truth
+ Model Selection (1 encoder/decoder), (2 decoder), eval against GPT-4
+ Run experiment and show results
+ Evaluate results and discuss

## Platform Setup and Jupyter Walkthrough

You'll be working in the Jupyter notebook; the platform itself is accessible via the URL in the slide deck. To run code in Jupyter, click "run cell"; the results appear below the cell you ran.

In [10]:
import time
import lumigator_demo as ld
import pandas as pd
import matplotlib.pyplot as plt
import os

from datasets import load_dataset
from IPython.display import clear_output

# wrap columns for inspection
pd.set_option('display.max_colwidth', 0)
# stylesheet for visibility
plt.style.use("fast")

## Experiments
We're grouping experiments by team name to organize the data. Pick a team name below and run the cell.

In [11]:
# suggestion: "lumigator_enthusiasts", or your own team name
team_name = "TEAM_NAME_HERE"  # replace the placeholder with your team name

## Generating Data for Ground Truth Evaluation

To generate a ground truth summary for our data, we first need an input dataset. In this case we use threads from the [Thunderbird public mailing list](https://thunderbird.topicbox.com/latest). To generate the ground truth and later evaluate the models, we need at least 100 samples to start with, where a sample is a single email or a single email conversation.

Our selection criteria: 

+ Collect 100 samples of email thread conversations, as recent as possible and fairly complete so they can be evaluated
+ Clean them of email formatting such as `>`
+ Keep in mind that BART, the baseline model we're using, accepts a context window of 1024 tokens as input, so input email threads should be approximately 1000 words or fewer; we keep on the conservative side
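
As a rough illustration of the cleaning and length criteria above, here is a minimal sketch. The helper names `strip_quote_markers` and `within_word_budget` are hypothetical (they are not part of `lumigator_demo`), and counting whitespace-separated words is only a proxy for the model's real token count.

```python
import re

# Matches one or more leading ">" reply markers, e.g. "> " or "> > ".
QUOTE_PREFIX = re.compile(r"^\s*(?:>\s?)+")


def strip_quote_markers(text: str) -> str:
    """Remove leading '>' reply markers from every line of an email body."""
    cleaned_lines = [QUOTE_PREFIX.sub("", line) for line in text.splitlines()]
    return "\n".join(cleaned_lines)


def within_word_budget(text: str, max_words: int = 1000) -> bool:
    """Rough length check: whitespace-separated words as a proxy for tokens."""
    return len(text.split()) <= max_words


raw = "> > Does the new build fix the crash?\n> Yes, it does.\nThanks!"
cleaned = strip_quote_markers(raw)
print(cleaned)
print(within_word_budget(cleaned))
```

A thread that fails the word-budget check would be a candidate for trimming or for exclusion from the sample set.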

Once we've collected them, we'd like to take a look at the data before we generate summaries. 

In [None]:
# show information about the Thunderbird dataset
dataset_id = "db7ff8c2-a255-4d75-915d-77ba73affc53"
r = ld.dataset_info(dataset_id)

In [None]:
# download the dataset into a pandas dataframe
df = ld.dataset_download(dataset_id)

In [None]:
# Examine a single sample
# the text to summarize is stored in the `examples` column
df['examples'].iloc[0]

In [None]:
# Add a column with simple character counts for model input
df['char_count'] = df['examples'].str.len()

In [None]:
df.head()

In [None]:
# Show statistics about character counts
df['char_count'].describe()
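
Character counts can be turned into a rough token estimate to see which samples might exceed BART's 1024-token window. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text; the real count depends on the model's tokenizer. The names `estimate_tokens` and `may_be_truncated` are hypothetical.

```python
# Assumption: ~4 characters per token is a rule of thumb, not a tokenizer result.
CHARS_PER_TOKEN = 4
BART_MAX_TOKENS = 1024  # context window mentioned in the selection criteria


def estimate_tokens(char_count: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate the token count from a character count (rounded up)."""
    return -(-char_count // chars_per_token)  # ceiling division


def may_be_truncated(char_count: int, max_tokens: int = BART_MAX_TOKENS) -> bool:
    """True when the estimated token count exceeds the model's context window."""
    return estimate_tokens(char_count) > max_tokens


for count in (2000, 4096, 6000):
    print(count, estimate_tokens(count), may_be_truncated(count))
```

Applied to the dataframe, this could look like `df['char_count'].apply(may_be_truncated)` to flag long samples before generating summaries.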

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['char_count'], bins=30)
ax.set_xlabel('Character Count')
ax.set_ylabel('Frequency')

stats = df['char_count'].describe().apply(lambda x: f"{x:.0f}")

# Add text boxes for statistics
plt.text(1.05, 0.95, stats.to_string(), 
         transform=ax.transAxes, verticalalignment='top')

# Adjust layout
plt.tight_layout()
fig.subplots_adjust(right=0.75)

plt.show()

In [None]:
## Perform Ground Truth Generation with Mistral 

responses = []


for sample in df['examples'][0:10]:
    response = ld.get_mistral_ground_truth(sample)
    print(sample, response.text)
    responses.append((sample, response.text))

In [None]:
mistral_results_df = pd.DataFrame(responses, columns=['Original', 'Response'])

mistral_results_df

In [None]:
# Get all deployments 
ld.get_deployments()

In [None]:
## Perform Ground Truth Generation with BART

# PUT DEPLOYMENT ID HERE 
deployment_id = "2240dc80-5759-4e04-8aad-2d2eab2392e2"

# keep BART outputs separate from the Mistral responses collected above
bart_responses = []

for sample in df['examples'][0:10]:
    response = ld.get_bart_ground_truth(deployment_id, sample)
    print(sample, response)
    bart_responses.append((sample, response))


## Loading Data

The following dataset is already in the format that we need as input: 
- one field called `examples` containing the text to summarize
- one field called `ground_truth` containing the summaries to compare the models' outputs against

Note that you can load many different file formats in a similar way (see https://huggingface.co/docs/datasets/loading#local-and-remote-files).
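
Before uploading your own file, it can help to confirm that it has the two fields described above. This is a minimal sketch using only the standard library; the helper `missing_fields` and the sample CSV text are hypothetical and are not part of the platform.

```python
import csv
import io

# The two fields the input dataset is expected to contain.
REQUIRED_FIELDS = {"examples", "ground_truth"}


def missing_fields(csv_text: str) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return REQUIRED_FIELDS - header


sample_csv = 'examples,ground_truth\n"Long email thread...","Short summary."\n'
# An empty set means both required fields are present.
print(missing_fields(sample_csv))
```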

## Dataset Upload

In [None]:
dataset_name = "thunderbird.csv"
dataset_id = "f5d54efa-247d-4910-9393-f6003da9fb68" # thunderbird pre-saved dataset HuggingFace

r = ld.dataset_info(dataset_id)

## Model Selection

What you see below are different lists of models we have already tested for the summarization task.
The `models` variable at the end provides you with a selection, but you can choose any combination of them.

In [None]:
enc_dec_models = [
    'hf://facebook/bart-large-cnn',
    'hf://mikeadimech/longformer-qmsum-meeting-summarization', 
    'hf://mrm8488/t5-base-finetuned-summarize-news',
    'hf://Falconsai/text_summarization',
]

dec_models = [
    'hf://mistralai/Mistral-7B-Instruct-v0.3',
]

gpts = [
    "oai://gpt-4o-mini",
    "oai://gpt-4-turbo",
    "oai://gpt-3.5-turbo-0125"  
]

models = [
    enc_dec_models[0], # bart-large-cnn
    dec_models[0], # Mistral-7B-Instruct-v0.3
    gpts[0] # gpt-4o-mini
]
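
The model identifiers above carry a prefix such as `hf://` or `oai://`. The sketch below splits an identifier into its prefix and model name, which can be handy for grouping results by provider. Reading `hf` as Hugging Face and `oai` as OpenAI is an assumption based on the model names, and `split_model_uri` is a hypothetical helper.

```python
from collections import defaultdict


def split_model_uri(uri: str) -> tuple:
    """Split 'scheme://name' into (scheme, name); raise on malformed input."""
    scheme, separator, name = uri.partition("://")
    if not separator or not scheme or not name:
        raise ValueError(f"not a valid model identifier: {uri!r}")
    return scheme, name


selected = [
    "hf://facebook/bart-large-cnn",
    "hf://mistralai/Mistral-7B-Instruct-v0.3",
    "oai://gpt-4o-mini",
]

# Group model names by their prefix.
by_scheme = defaultdict(list)
for uri in selected:
    scheme, name = split_model_uri(uri)
    by_scheme[scheme].append(name)

print(dict(by_scheme))
```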

In [None]:
# show model names 

models

## Run Evaluations

In [None]:
# change the following to 0 to use all samples in the dataset
max_samples = 10

responses = []
for model in models:
    descr = f"Testing {model} summarization model on {dataset_name}"
    responses.append(ld.experiments_submit(model, team_name, descr, dataset_id, max_samples))

### Track evaluation jobs

Run the following to track your evaluation jobs.

- *NOTE*: you won't be able to run other cells while this one is running. However, you can interrupt it whenever you want by clicking on the "stop" button above and run it at a later time.

In [None]:
job_ids = [ld.get_resource_id(r) for r in responses]

wip = ld.show_experiment_statuses(job_ids)
while wip:
    time.sleep(5)
    clear_output()
    wip = ld.show_experiment_statuses(job_ids)
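
The loop above polls until every job finishes. If you would rather stop waiting after a fixed time, a variant with a timeout could look like the sketch below. The function `poll_until_done` is hypothetical, and the demo uses a fake status source instead of the platform.

```python
import time


def poll_until_done(check_in_progress, interval_s=5.0, timeout_s=600.0):
    """Call check_in_progress() until it returns a falsy value or time runs out.

    Returns True if the work finished, False if the timeout elapsed first.
    """
    deadline = time.monotonic() + timeout_s
    while check_in_progress():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
    return True


# Fake status source for illustration: in progress twice, then done.
statuses = iter([True, True, False])
finished = poll_until_done(lambda: next(statuses), interval_s=0.0, timeout_s=1.0)
print(finished)
```

In the notebook, the callable could be `lambda: ld.show_experiment_statuses(job_ids)`, assuming that function returns a truthy value while jobs are still running, as the loop above implies.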

## Show evaluation results

In [None]:
# after the jobs complete, gather evaluation results
eval_results = []
for job_id in job_ids:
    eval_results.append(ld.experiments_result_download(job_id))

# convert results into a pandas dataframe
eval_table = ld.eval_results_to_table(models, eval_results)

In [None]:
eval_table

In [None]:
eval_results[0]