# Welcome to Lumigator Foxfooding from [Mozilla.ai](https://www.mozilla.ai/)! 🐊 🦊

## Agenda

+ Working with Jupyter Notebooks
+ Lumigator Platform Overview  🐊
+ Understanding Machine Learning Workflows 
+ Thunderbird Dataset Walkthrough
+ Explanation and Examination of Thunderbird Ground Truth
+ Model Selection: (1) encoder-decoder, (2) decoder-only, evaluated against GPT-4
+ Run experiment and show results
+ Evaluate results and discuss

## Jupyter Walkthrough

[Jupyter Notebooks](https://jupyter-notebook.readthedocs.io/en/stable/) are an executable code/text environment for (usually) Python code. Our Jupyter environment is in JupyterHub. To work with Jupyter, click "run cell" to run the code and see results below the cell you're currently running. Cells are executed sequentially. 

In some cells, you will see placeholder variables that you need to fill in before running the cell. They look like the example below. The code will not work until you replace the placeholder with your own value!

```python
# suggestion: "lumigator_enthusiasts"
team_name = TEAM_NAME_HERE
```
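
As a small illustration of why the placeholder has to be replaced: Python reads the unquoted name as a variable that was never defined, so the cell raises a `NameError`. The sketch below uses the suggested team name as an example value; any quoted string of your choosing should work.

```python
# Leaving the placeholder in place fails, because Python treats the
# unquoted name as a variable that was never defined.
try:
    team_name = TEAM_NAME_HERE
except NameError as err:
    placeholder_error = str(err)
    print("Placeholder not replaced:", placeholder_error)

# Replacing it with a quoted string (example value) lets the cell run.
team_name = "lumigator_enthusiasts"
print(team_name)
```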

# Running cells 
To run a cell, press the "play" icon in the top bar. If a cell is taking too long, you can press stop. 


<img src="running.png" alt="drawing" width="400"/>

Your files are located on the left-hand side. They'll be saved for the duration of our session, but if you'd like to keep them, make sure to download them. 

<img src="files.png" alt="drawing" width="400"/>


In [52]:
## Let's try running some code!

print("Welcome to Lumigator!🐊")

# You can see the output below!

Welcome to Lumigator!🐊


For more on notebooks and how cells work, check out this demo. [You can click links in cells.](https://github.com/nbgallery/Jupyter4Analysts/blob/main/J4A%20Notebook%201%20-%20Jupyter%20Syntax%20and%20Other%20Things.ipynb)

## Glossary of terms 

Some terms you'll hear us using throughout the session: 

+ **Machine learning** - The process of creating a model that learns from data
+ **Dataset** - Data used to train or evaluate models
+ **LLM** - Large language model, [a text-based model that performs next-word predictions](https://www.nvidia.com/en-us/glossary/large-language-models/) 
+ **Tokens** - The pieces that words are broken into before being passed to an LLM
+ **Inference** - The process of getting a prediction from a large language model 
+ **Embeddings** - Numerical representations of text generated by a model
+ **Encoder-decoder models** - A neural network architecture composed of two neural networks: an encoder that takes the input vectors from our data and creates an embedding of a fixed length, and a decoder that takes those embeddings as input and generates outputs such as translated text or a text summary
+ **Decoder-only models** - Receive a prompt of text directly and predict the next word
+ **Task** - The specific job a model is built or tuned to perform, such as translation, summarization, or completion
+ **Ground truth** - A dataset that humans (or, in some cases, LLMs) have judged to be correct, which we can use as a point of comparison for our model.

# Machine Learning Workflows

In machine learning, we are looking to generate a model artifact from data. We have several stages we care about: the data preprocessing, model training, model generation, inference, and evaluation. 

<img src="ml_workflow.png" alt="drawing" width="400"/>

Within the universe of modeling approaches, there are supervised and unsupervised approaches, as well as reinforcement learning. When we think of language modeling, that falls in the realm of neural network approaches. 


<img src="ml_family.png" alt="drawing" width="400"/>


Lumigator focuses on **inference** and **evaluation** for large language models: we want to be able to take our own dataset, perform inference on it, and evaluate the results to see if the model we would like to use produces good results for our use-cases, including ones that are specific to our business.


In order to select an LLM, we need the following stages: 

1. Generate ground truth for our business use-case
2. Pick several models we'd like to evaluate
3. Run an evaluation loop that compares each model's results against the ground truth
4. Analyze our evaluations.

These are the steps that Lumigator completes. Here's a platform overview:


<img src="platform.png" alt="drawing" width="400"/>
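
The four stages listed above can be sketched in a few lines of plain Python. Everything here is a stand-in chosen for illustration: the sample text is made up, the "models" are simple functions rather than LLMs, and `unigram_f1` is a toy word-overlap score, not the metric Lumigator computes.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Toy overlap score: F1 over the words shared by both texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# 1. Ground truth: (input text, reference summary) pairs (made-up data).
ground_truth = [
    (
        "the beta add-on was listed publicly by mistake and then removed",
        "beta add-on listed by mistake",
    ),
]


# 2. Candidate "models": plain functions standing in for real LLMs.
def first_five_words(text: str) -> str:
    return " ".join(text.split()[:5])


def last_five_words(text: str) -> str:
    return " ".join(text.split()[-5:])


candidate_models = {
    "first_five_words": first_five_words,
    "last_five_words": last_five_words,
}

# 3. Evaluation loop: compare each model's output with the ground truth.
scores = {}
for name, model in candidate_models.items():
    per_sample = [unigram_f1(ref, model(text)) for text, ref in ground_truth]
    scores[name] = sum(per_sample) / len(per_sample)

# 4. Analyze: pick the model with the highest average score.
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

In practice the score is only a starting point for the analysis step; reading the generated summaries side by side remains part of the job.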


## Machine learning is alchemy

When we think of traditional software application workflows, we think of an example such as adding a button. We can clearly test that we've added a blue button to our application, and that it works correctly. Machine learning is not like this! It involves a lot of experimentation, tweaking of hyperparameters and prompts, and trying different models. Expect the process to be imperfect, with many iterative loops. Luckily, Lumigator helps take away the uncertainty of at least model selection :)

> There’s a self-congratulatory feeling in the air. We say things like “machine learning is the new electricity”. I’d like to offer an alternative metaphor: machine learning has become alchemy. - [Ben Recht and Ali Rahimi](https://archives.argmin.net/2017/12/05/kitchen-sinks/)

Ultimately, the final test of whether a model is good is whether humans think it's good.

With that in mind, let's dive into setting up experiments with Lumigator to test our models!

In [54]:
# We have a library of utility functions that will help us connect to the Lumigator API
# Let's take a second to walk through them 

import lumigator_demo as ld


In [55]:
# Importing packages we need to work with data 
import time

import pandas as pd
import matplotlib.pyplot as plt
import os

from datasets import load_dataset
from IPython.display import clear_output

# wrap columns for inspection
pd.set_option('display.max_colwidth', 0)
# stylesheet for visibility
plt.style.use("fast")

# Understanding the Lumigator App and API 

The app itself consists of an API. You can explore and test its methods through the [OpenAPI spec](https://swagger.io/specification/) page, available at the platform URL under docs.

<img src="openapi.png" alt="drawing" width="200"/>

[Here](https://lumigator.mzai.dev/docs) are the docs for the Lumigator API. Looking at them, we can see that we have 7 endpoints.
The application is split up into `datasets`, `experiments`, `jobs`, `deployments`, `completions`, and `health`.

+ `Datasets` - Data that we add to the platform for evaluation. We can upload, delete, and save different data in the platform.
+ `Experiments` - A tag that we create to associate all of our data
+ `Jobs` - Check the running status of lm-buddy evaluation jobs
+ `Deployments` - Running Ray Serve deployments with locally hosted models
+ `Completions` - Access to external APIs such as Mistral and OpenAI
+ `Health` - Status of the application, jobs, and deployments.




# Experiments
Let's start by creating a team name for our experiments to organize our data. Pick a team name below and run the cell.

In [None]:
# suggestion: "lumigator_enthusiasts"
team_name = TEAM_NAME_HERE

## Model Task

The task we'll be working with is summarization: we want to generate a summary of our text, which in this particular case is email. Finding a good model for summarization is a daunting task, as the typical intuition that models with more parameters generally perform better goes out the window. For summarization, we need to consider the input, which will likely be long, so finding models that deal efficiently with longer contexts is of paramount importance. In our business case, which is to create summaries of conversation threads much like those you might see in Slack or an email chain, the models need to extract key information from those threads while still accepting a context window large enough to capture the entire conversation history.

We identified that, for our use-cases, it is far more valuable to produce abstractive summaries, which identify the important sections of the text and generate highlights, than extractive ones, which pick a subset of sentences and join them together. Because the final interface will be natural language, we don't want readers to have to interpret the often incoherent text snippets that extractive summaries produce.
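
To make the extractive approach concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is an illustration only, not a method used by Lumigator or by any of the models in this session: the scoring rule (average word frequency per sentence) and the sample thread are assumptions chosen for brevity. Note that the output can only ever be sentences copied from the input, which is exactly the limitation described above.

```python
import re
from collections import Counter


def split_sentences(text: str) -> list:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, n_sentences: int = 1) -> str:
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        # Average frequency of the sentence's words across the whole text.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)


# Made-up thread, loosely modeled on a mailing-list conversation.
thread = (
    "The beta version of the add-on was listed publicly by mistake. "
    "The author asked how to remove the beta version from the public listing. "
    "Thanks for the help!"
)
summary = extractive_summary(thread, n_sentences=1)
print(summary)
```

An abstractive model, by contrast, is free to write a new sentence that merges information from several messages in the thread.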

For more on summarization as a use-case, [see our blog post here.](https://blog.mozilla.ai/on-model-selection-for-text-summarization/)

## Ground Truth for Models

In order to generate ground truth, we need to generate some baseline summaries. We'll do this by performing inference against existing models that are trained for summarization. Let's take a look at the data we'll be using first. [Generating GT](https://thunderbird.topicbox.com/groups/addons/T18e84db141355abd-M4cca8e3f9e4fee9ae14b9dbb/self-hosted-version-of-extension-is-incorrectly-appearing-in-atn)



In [None]:
## What are LLMs?

Large language models work by predicting the next word, one token at a time, given the text they have seen so far.

## Generating Data for Ground Truth Evaluation

In order to generate a ground truth summary for our data, we first need an input dataset. In this case we use threads from the [Thunderbird public mailing list.](https://thunderbird.topicbox.com/latest)  To generate the ground truth and then later evaluate the model, we need at least 100 samples to start with, where a sample is a single email or single email conversation.

Our selection criteria: 

+ Collect 100 samples of email thread conversations, as recent as possible and fairly complete so they can be evaluated
+ Clean them of email formatting such as `>`
+ Keep in mind that BART, the baseline model we're using, accepts a context window of 1024 tokens as input, so input email threads should be no longer than approximately 1000 words. Because a word can be split into several tokens, we stay on the conservative side of that limit
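
A rough way to check the last criterion is to estimate a thread's token count from its word count. The sketch below is a heuristic only: the ratio of 1.3 tokens per word is an assumed rule of thumb for English text, not a measured property of BART's tokenizer, and `fits_context` is a hypothetical helper that is not part of `lumigator_demo`. For exact counts you would run the model's own tokenizer.

```python
def fits_context(text: str, max_tokens: int = 1024,
                 tokens_per_word: float = 1.3) -> bool:
    """Estimate whether a text fits into a model's context window.

    tokens_per_word is an assumed average, so treat the result as a
    conservative screen rather than an exact answer.
    """
    estimated_tokens = len(text.split()) * tokens_per_word
    return estimated_tokens <= max_tokens


short_thread = "word " * 500   # about 650 estimated tokens
long_thread = "word " * 1000   # about 1300 estimated tokens

print(fits_context(short_thread))
print(fits_context(long_thread))
```

Under this assumption a 1000-word thread already exceeds the window, which is why shorter threads are the safer choice.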

Once we've collected them, we'd like to take a look at the data before we generate summaries. 

In [None]:
# show information about the Thunderbird dataset
dataset_id = "db7ff8c2-a255-4d75-915d-77ba73affc53"
r = ld.dataset_info(dataset_id)

In [None]:
# download the dataset into a pandas dataframe
df = ld.dataset_download(dataset_id)
df.head()

In [None]:
import re
from string import punctuation

def preprocess_text(text: str) -> str:
    """Normalize text: lowercase, strip punctuation and standalone numbers."""
    text = text.lower()  # Lowercase text
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)  # Remove punctuation
    text = re.sub(r"\b[0-9]+\b", "", text)  # Remove standalone numbers
    text = " ".join(text.split())  # Collapse extra spaces, tabs, and new lines
    return text

df["examples"].map(preprocess_text)

In [None]:
# Examine a single sample 
# we define the data with examples
df['examples'].iloc[0]

In [None]:
# Add a column with a simple character count of each model input
df['char_count'] = df['examples'].str.len()

In [None]:
df.head()

In [None]:
# Show statistics about characters count
df['char_count'].describe()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['char_count'], bins=30)
ax.set_xlabel('Character Count')
ax.set_ylabel('Frequency')

stats = df['char_count'].describe().apply(lambda x: f"{x:.0f}")

# Add text boxes for statistics
plt.text(1.05, 0.95, stats.to_string(), 
         transform=ax.transAxes, verticalalignment='top')

# Adjust layout
plt.tight_layout()
fig.subplots_adjust(right=0.75)

plt.show()

In [None]:
#TODO

what is ground truth? how do we vibe-compare it between models? 

In [57]:
## Perform Ground Truth Generation with Mistral 

responses = []

for sample in df['examples'][0:10]:
    response = ld.get_mistral_ground_truth(sample)
    print("Mistral:", response)
    responses.append((sample, response))

{
  "text": "The user has released a beta version 7.0b1 of their extension, Clippings for Thunderbird, and made it available for testing separately while keeping the current stable release 6.3.5 for regular users. However, the beta version was mistakenly listed on the Add-ons for Thunderbird public listing. The user was advised to open an issue on the addons-server GitHub repository, as self-hosted add-ons should not be submitted to ATN, especially if they do not have an update_url entry in their manifest. The best practices for self-hosting add-ons include removing the beta version from ATN, creating a dedicated branch or repo for update information, hosting XPI files as \"beta\" assets in a GitHub release or directly in the repo, and ensuring the manifest points to the correct update.json file. The user has successfully followed these steps and removed the beta version from ATN."
}
Response from Mistral The user has released a beta version 7.0b1 of their extension, Clippings for Thun

{
  "text": "Thunderbird is preparing its next major release, Thunderbird 128 ESR. Developers of add-ons should check their compatibility with this new version as it is currently being distributed through the beta release channel. The changes required for compatibility mainly affect Experiment add-ons. WebExtensions generally do not need updates, but new permission 'messagesUpdate' has been introduced, and the 'browser.messages.update()' function will stop working if the new permission is not requested. Another significant change is the official support for Manifest Version 3 in Thunderbird 128. The add-ons team has also updated the API documentation on webextension-api.thunderbird.net, which now includes both Thunderbird and Firefox WebExtension APIs. For any assistance, developers can reach out to the Thunderbird community."
}
Response from Mistral Thunderbird is preparing its next major release, Thunderbird 128 ESR. Developers of add-ons should check their compatibility with this ne

In [None]:
# Collect the Mistral results in a DataFrame so we can inspect them
mistral_results_df = pd.DataFrame(responses, columns=['Original', 'Response'])

mistral_results_df

In [None]:
# TODO

ways to run LLMs: API access, running on cluster, and running locally 
what is Ray? 

In [None]:
# TODO

Lumigator and LM-Buddy, how they work together

In [None]:
# Let's take a look at all available deploys
ld.get_deployments()

In [66]:
## Perform Ground Truth Generation with BART

# PUT DEPLOYMENT ID HERE 
deployment_id = "510aff69-8634-43b8-8153-386957b03778"

for sample in df['examples'][0:1]:
    response = ld.get_bart_ground_truth(deployment_id, sample)
    print(response)
    responses.append((sample, response))


Request failed: 500 Server Error: Internal Server Error for url: https://lumigator.mzai.dev/api/v1/ground-truth/deployments/bbdd0114-7e43-4222-86c6-541dab987586


HTTPError: 500 Server Error: Internal Server Error for url: https://lumigator.mzai.dev/api/v1/ground-truth/deployments/bbdd0114-7e43-4222-86c6-541dab987586

## Loading Data

The following dataset is already in the format that we need as input: 
- one field called `examples` containing the text to summarize
- one field called `ground_truth` containing the summaries to compare the models' outputs against

Note that you can load many different types of file formats in a similar way (see https://huggingface.co/docs/datasets/loading#local-and-remote-files)
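
Before uploading your own data, it can help to confirm that the file has the two expected fields. The sketch below checks a CSV header using only the standard library. The `missing_fields` helper and the sample rows are hypothetical and are not part of `lumigator_demo`.

```python
import csv
import io

# Field names taken from the format described above.
REQUIRED_FIELDS = {"examples", "ground_truth"}


def missing_fields(csv_text: str) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_FIELDS - set(reader.fieldnames or [])


# Made-up CSV content with both required columns.
good_csv = (
    "examples,ground_truth\n"
    '"Long email thread text goes here","Short reference summary"\n'
)
# Made-up CSV content that lacks the ground_truth column.
bad_csv = 'examples\n"Long email thread text goes here"\n'

print(missing_fields(good_csv))
print(missing_fields(bad_csv))
```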

In [None]:
#TODO:

huggingface datasets versus csv
and lm-buddy prefixes

## Dataset Upload

In [None]:
dataset_name = "thunderbird.csv"
dataset_id = "f5d54efa-247d-4910-9393-f6003da9fb68" # thunderbird pre-saved dataset HuggingFace

r = ld.dataset_info(dataset_id)

## Model Selection

What you see below are different lists of models we have already tested for the summarization task.
The `models` variable at the end provides you with a selection, but you can choose any combination of them.

In [None]:
enc_dec_models = [
    'hf://facebook/bart-large-cnn',
    'hf://mikeadimech/longformer-qmsum-meeting-summarization', 
    'hf://mrm8488/t5-base-finetuned-summarize-news',
    'hf://Falconsai/text_summarization',
]

dec_models = [
    'mistral://open-mistral-7b',
]

gpts = [
    "oai://gpt-4o-mini",
    "oai://gpt-4-turbo",
    "oai://gpt-3.5-turbo-0125"  
]

# TODO: add llamafile

models = [
    enc_dec_models[0], # bart-large-cnn
    dec_models[0], # open-mistral-7b
    gpts[0] # gpt-4o-mini
]

# show selected models

models
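
Each model identifier above carries a prefix before `://`. Judging by the names, `hf` appears to refer to models from the Hugging Face Hub, `mistral` to the Mistral API, and `oai` to OpenAI, but that reading is an assumption based on the identifiers alone. The sketch below uses a hypothetical helper, `split_model_uri`, to separate the prefix from the model name, which can be handy for grouping results later.

```python
def split_model_uri(uri: str) -> tuple:
    """Split 'prefix://model-name' into (prefix, model_name)."""
    prefix, separator, name = uri.partition("://")
    if not separator or not prefix or not name:
        raise ValueError(f"Expected 'prefix://model-name', got {uri!r}")
    return prefix, name


# Identifiers copied from the lists above.
example_uris = [
    "hf://facebook/bart-large-cnn",
    "mistral://open-mistral-7b",
    "oai://gpt-4o-mini",
]

parsed = [split_model_uri(uri) for uri in example_uris]
for prefix, name in parsed:
    print(f"{prefix:8s} {name}")
```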

In [None]:
# TODO
Introduce metrics 

## Run Evaluations

In [None]:
# change the following to 0 to use all samples in the dataset
max_samples = 10

responses = []
for model in models:
    descr = f"Testing {model} summarization model on {dataset_name}"
    responses.append(ld.experiments_submit(model, team_name, descr, dataset_id, max_samples))

In [None]:
# TODO
Discuss Ray dashboard/show dashboard

### Track evaluation jobs

Run the following to track your evaluation jobs.

- *NOTE*: you won't be able to run other cells while this one is running. However, you can interrupt it whenever you want by clicking on the "stop" button above and run it at a later time.

In [None]:
job_ids = [ld.get_resource_id(r) for r in responses]

wip = ld.show_experiment_statuses(job_ids)
while wip:
    time.sleep(5)
    clear_output()
    wip = ld.show_experiment_statuses(job_ids)
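
The tracking cell above polls until every job has finished, which can run for a long time if a job hangs. One possible variation, sketched here with stand-in functions, adds a timeout. `wait_until_done` is a hypothetical helper and not part of `lumigator_demo`. The status check is passed in as a function, so in the notebook you could pass something like `lambda: ld.show_experiment_statuses(job_ids)`, assuming it returns `True` while work is in progress, as the loop above implies.

```python
import time


def wait_until_done(check_in_progress, poll_seconds: float = 5.0,
                    timeout_seconds: float = 600.0, sleep=time.sleep) -> bool:
    """Poll until check_in_progress() returns False.

    Returns True if the work finished, or False if the timeout elapsed first.
    """
    waited = 0.0
    while check_in_progress():
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True


# Stand-in status check: in progress twice, then finished.
fake_statuses = iter([True, True, False])
finished = wait_until_done(lambda: next(fake_statuses), sleep=lambda s: None)

# Stand-in status check that never finishes, to show the timeout path.
timed_out = not wait_until_done(lambda: True, poll_seconds=1.0,
                                timeout_seconds=3.0, sleep=lambda s: None)

print(finished, timed_out)
```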

## Show evaluation results

In [None]:
# after the jobs complete, gather evaluation results
eval_results = []
for job_id in job_ids:
    eval_results.append(ld.experiments_result_download(job_id))

# convert results into a pandas dataframe
eval_table = ld.eval_results_to_table(models, eval_results)

In [None]:
eval_table

In [None]:
eval_results[0]

## Evaluation Results

In [None]:
#TODO

add eval discussion 