# Lumigator from [Mozilla.ai](https://www.mozilla.ai/) 🐊 🦊




+ Working with Jupyter Notebooks
+ Lumigator Platform 🐊 and Machine Learning Overview  
+ Understanding Machine Learning Workflows 
+ Thunderbird Ground Truth Dataset Walkthrough
+ Selection of models 
    + 1 encoder/decoder (BART) 
    + 2 decoder (Mistral and GPT-4o) to evaluate against GT
+ Run evaluation experiment
+ Discuss results

## Jupyter Walkthrough

[Jupyter Notebooks](https://jupyter-notebook.readthedocs.io/en/stable/) are an executable code/text environment for (usually) Python code. Our Jupyter environment is in JupyterHub. To work with Jupyter, click "run cell" to run the code and see results below the cell you're currently running. Cells are executed sequentially. 

### Running cells
To run a cell, press the "play" icon in the top bar (you can also hit Shift+Enter to run and proceed to the following cell).


<img src="assets/running.png" alt="drawing" width="400"/>

Your files are located on the left-hand side. They'll be saved for the duration of our session, but if you'd like to keep them, make sure to download them. 

<img src="assets/files.png" alt="drawing" width="400"/>


In [None]:
# Let's try running some code!
# You can see the output below!
print("Welcome to Lumigator!🐊")

## Machine learning glossary

Some terms you'll hear us using throughout the session: 

+ **Machine learning** - The process of creating a model that learns from data
+ **Dataset** - Data used to train models and evaluate their performance
+ **LLM** - Large language model, [a text-based model that performs next-word predictions](https://www.nvidia.com/en-us/glossary/large-language-models/) 
+ **Tokens** - Words broken up into pieces to be used in an LLM 
+ **Inference** - The process of getting a prediction from a large language model 
+ **Embeddings** - Numerical representations of text generated by modeling 
+ **Encoder-decoder models** - A neural network architecture made up of two networks: an encoder, which takes the input vectors from our data and creates a fixed-length embedding, and a decoder, which takes those embeddings as input and generates an output sequence such as translated text or a text summary
+ **Decoder-only models** - Given a fixed input prompt, uses its representation to generate a sequence of words one at a time, with each word being conditioned on the ones generated previously
+ **Task** - The kind of problem a model is built to solve, such as translation, summarization, or completion. Different tasks fit different model types.
+ **Ground truth** - Information that humans (or, in some cases, LLMs) have judged to be correct, which we can use as a point of comparison for our model's output.

The process of **machine learning** is the process of creating a mathematical model that tries to approximate the world. A **machine learning model** is a set of instructions for generating a given
output from data. The instructions are learned from the features of the input data itself.

Within the universe of modeling approaches, there are **supervised** and **unsupervised** approaches, as well as reinforcement learning. 
Language modeling of the kind we do with LLMs falls in the realm of neural network approaches. 

<img src="assets/ml_family.png" alt="drawing" width="400"/>

## How do LLMs work? 

There are many different kinds of LLMs and many different kinds of architectures. For our evaluations, we use two different kinds:

+ **Encoder/Decoder** - for example, BART. The encoder converts the input text into a numerical representation, and the decoder generates output text from that representation. These models are good for synthesis (condensing existing text, as in summarization) as opposed to open-ended generation.
+ **Decoder-only** - GPT-style models such as Mistral, GPT-4o, and others we'll be working with. These models are pre-trained on text data in an autoregressive manner: they predict the next token given the previous tokens.
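
To make "autoregressive" concrete, here is a toy sketch of the generation loop. It is not how a real LLM is implemented: a hand-written lookup table (`NEXT_TOKEN`, a made-up name) stands in for the trained model, and it conditions only on the last token, whereas a real decoder conditions on all previous tokens.

```python
# Toy illustration only: a hand-written table stands in for a trained model.
NEXT_TOKEN = {
    "<s>": "the",
    "the": "thread",
    "thread": "is",
    "is": "resolved",
    "resolved": "</s>",
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedily append one token at a time until the end marker or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # "Predict" the next token from what has been generated so far.
        next_token = NEXT_TOKEN.get(tokens[-1], "</s>")
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["<s>", "the"]))
```

Each new token is fed back in as input for the next step, which is why generation time grows with the length of the output.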

## LLM Workflows

1. Pre-train the model itself and generate a model artifact
2. Generate ground truth for our business use-case
3. Pick several models we'd like to use to evaluate
4. Run an evaluation loop consisting of looking at the ground truth in comparison to model results
5. Analyze our evaluations.

<img src="assets/llm_steps.png" alt="drawing" width="600"/>

Lumigator on a technical level is a **Python-based FastAPI web app** with services that run **jobs and deployments on a Ray cluster** which can be run either locally or in the cloud, depending on your computer specs. Results are stored in **Postgres**. Larger models loaded from HuggingFace require GPUs.  

<img src="assets/platform.png" alt="drawing" width="400"/>

What is Ray? [A distributed runtime for Python programs](https://github.com/ray-project/ray) that includes a Core library with primitives (Tasks, Actors, and Objects) and a suite of ML libraries (Tune, Serve) that let you build the components of a machine learning workflow. 

## Nota bene: Machine learning is alchemy

When we think of traditional software application workflows, we think of an example such as adding a button. We can clearly test that we've added a blue button to our application, and that it works correctly. Machine learning is not like this! It involves a lot of experimentation: tweaking hyperparameters and prompts, and trying different models. Expect the process to be imperfect, with many iterative loops. Luckily, Lumigator helps take away some of the uncertainty, at least around model selection :)

> There’s a self-congratulatory feeling in the air. We say things like “machine learning is the new electricity”. I’d like to offer an alternative metaphor: machine learning has become alchemy. - [Ben Recht and Ali Rahimi](https://archives.argmin.net/2017/12/05/kitchen-sinks/)

Ultimately, the final test of whether a model is good is whether humans think it's good. 

With that in mind, let's dive into setting up experiments with Lumigator to test our models!

In [None]:
# We have a library of utility functions that will help us connect to the Lumigator API
# Let's take a second to walk through them 

import lumigator_demo as ld

In [None]:
# Importing packages we need to work with data 
# Python standard library
import json
import os
import random
import string
import time

# third-party libraries
import matplotlib.pyplot as plt
import pandas as pd
import shortuuid  # random string generator
from datasets import load_dataset
from IPython.display import clear_output

# wrap columns for inspection
pd.set_option('display.max_colwidth', 0)
# stylesheet for visibility
plt.style.use("fast")

%load_ext autoreload
%autoreload 2

# Understanding the Lumigator App and API 

The app itself exposes an API. You can browse its methods and try them out through the [OpenAPI spec](https://swagger.io/specification/), available at the platform URL, under docs. 

<img src="assets/openapi.png" alt="drawing" width="200"/>

Large language models today are consumed in one of several ways:

+ As **API endpoints** for proprietary models hosted by OpenAI, Anthropic, or major cloud providers
+ As **model artifacts** downloaded from HuggingFace’s Model Hub and/or trained/fine-tuned using HuggingFace libraries and hosted on local storage
+ As model artifacts available in a format optimized for **local inference**, typically GGUF, and accessed via applications like llama.cpp or ollama
+ As ONNX, a format which optimizes sharing between backend ML frameworks

We use API endpoints and local storage in Lumigator. 


We currently have five key endpoints on the platform. 

+ `Health` - Status of the application, running status of jobs and deployments. 
+ `Datasets` - Data that we add to the platform for evaluation. We can upload, delete, and save different data in the platform. We'll use this to save our ground truth and experiment data.
+ `Experiments` - Our actual evaluation experiments. We can list all previous evaluations, create new ones, and get their results.
+ `Groundtruth` - Runs Ray Serve deployments of locally hosted models
+ `Completions` - Access to external APIs such as Mistral and OpenAI

## Model Task: Summarization

The task we'll be working with is **summarization**, aka we want to generate a summary of our text. 

In our business case, which is to create summaries of conversation threads, much as you might see in Slack or an email chain, the models need to be able to extract key information from those threads while still being able to accept a large context window to capture the entire conversation history. 

We identified that, for our use cases, **abstractive** summaries, which identify the important sections of the text and generate highlights, are far more valuable than **extractive** ones, which pick a subset of sentences and stitch them together. The final interface will be natural language, and we don't want users to have to interpret the often incoherent text snippets that extractive summaries produce. 
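
To see why extractive summaries can read as disconnected snippets, here is a minimal sketch of one simple extractive approach: score each sentence by how frequent its words are in the whole text, then copy the top sentences verbatim. The function name `extractive_summary` and the scoring rule are our own illustration, not something Lumigator uses.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average document-wide frequency of the words in this sentence.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


thread = (
    "The add-on fails to load. "
    "The add-on fails after the update. "
    "Thanks for any help."
)
print(extractive_summary(thread))
```

The output can only ever be sentences that already exist in the thread. An abstractive model, by contrast, is free to write a new sentence that combines information from several messages.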

For more on summarization as a use-case, [see our blog post here.](https://blog.mozilla.ai/on-model-selection-for-text-summarization/)

## Ground Truth for Models

The term ground truth comes from geology and geospatial sciences, where actual information was collected on the ground to validate data acquired through remote sensing, such as satellite imagery or aerial photography. Since then, the concept has been adopted in other fields, particularly in machine learning and artificial intelligence, to refer to the accurate, real-world data used for training and testing models. 

The **best ground truth is human-generated** but building it is a very expensive task.
One recent trend is to rely on large language models but (as you will see later) they have their own pitfalls.
An intermediate approach uses different LLMs to provide ground truth "candidates" which are then subject to human pairwise evaluation.
For the sake of explanation, we will generate our ground truth by performing inference against existing models that are trained for summarization.




## Our Input data

Let's first take a look at the [data we'll be using, from the Thunderbird public mailing list.](https://thunderbird.topicbox.com/groups/addons/T18e84db141355abd-M4cca8e3f9e4fee9ae14b9dbb/self-hosted-version-of-extension-is-incorrectly-appearing-in-atn)

## Generating Data for Ground Truth Evaluation

In order to generate a ground truth summary for our data, we first need an input dataset. In this case we use threads from the [Thunderbird public mailing list](https://thunderbird.topicbox.com/latest).

Our selection criteria: 

+ Collect 100 recent and "complete" email threads for evaluation
+ Clean them of email formatting such as `>`
+ BART, the baseline model we're using, accepts a context window of up to 1024 tokens as input. This means our input email threads should be no longer than approximately 1,000 words, and ideally shorter, since a single word can map to more than one token. We stay on the conservative side to accommodate smaller models. 
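
As a rough sketch of the cleaning and length criteria above, the snippet below strips leading `>` quote markers and checks a thread against a word budget. Both helper names are hypothetical, and we are assuming that "cleaning" means removing the marker characters while keeping the quoted text. The word count is only a stand-in: a real check would count tokens with the model's own tokenizer.

```python
import re


def strip_quote_markers(email_text: str) -> str:
    """Remove leading '>' quote markers from each line and drop empty lines."""
    lines = (re.sub(r"^\s*(?:>\s*)+", "", line) for line in email_text.splitlines())
    return "\n".join(line for line in lines if line.strip())


def within_word_budget(text: str, max_words: int = 1000) -> bool:
    """Rough length check on whitespace-separated words, not real model tokens."""
    return len(text.split()) <= max_words


raw = "Hi all,\n> earlier message\n> > nested quote\nThe fix works for me.\n"
cleaned = strip_quote_markers(raw)
print(cleaned)
print(within_word_budget(cleaned))
```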

Once we've collected them, we'd like to take a look at the data before we generate summaries. 

In [None]:
# show information about the Thunderbird dataset
dataset_id = "db7ff8c2-a255-4d75-915d-77ba73affc53"
r = ld.dataset_info(dataset_id)

In [None]:
# download the dataset into a pandas dataframe
df = ld.dataset_download(dataset_id)
df.head()

In [None]:
# We'd like to make sure our data is clean for LLM input
# This is often not necessary since most LLMs are trained on internet-formatted data
# But we'll be careful here

import re
from string import punctuation

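def preprocess_text(text: str) -> str:
    text = text.lower()  # Lowercase text
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)  # Remove punctuation
    text = " ".join(text.split())  # Remove extra spaces, tabs, and new lines
    text = re.sub(r"\b[0-9]+\b\s*", "", text)  # Remove standalone numbers
    return text.strip()  # Drop any trailing space left by number removal

# Preview the cleaned text (this does not modify the dataframe)
df["examples"].map(preprocess_text)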

In [None]:
# Examine a single sample 
# we define the data with examples
df['examples'].iloc[0]

In [None]:
# Add a column with simple character counts for model input
df['char_count'] = df['examples'].str.len()

In [None]:
# inspect our data
df.head()

In [None]:
# Show statistics about characters count
df['char_count'].describe()

In [None]:
# Generate plot of character counts
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['char_count'], bins=30)
ax.set_xlabel('Character Count')
ax.set_ylabel('Frequency')

stats = df['char_count'].describe().apply(lambda x: f"{x:.0f}")

# Add text boxes for statistics
plt.text(1.05, 0.95, stats.to_string(), 
         transform=ax.transAxes, verticalalignment='top')

# Adjust layout
plt.tight_layout()
fig.subplots_adjust(right=0.75)

plt.show()

In [None]:
# Perform Ground Truth Generation with Mistral 

mistral_responses = []

for sample in df['examples'][0:10]:
    res = ld.get_mistral_ground_truth(sample)
    print("Mistral Summary:", res)
    mistral_responses.append((sample, res['text']))

In [None]:
# Let's create a result set we can look at
mistral_results_df = pd.DataFrame(mistral_responses, columns=['examples', 'mistral_response'])
mistral_results_df

In [None]:
# Let's take a look at all available deployments
ld.get_deployments()

In [None]:
# Perform Ground Truth Generation with BART

deployment_id = ld.get_summarizer_deployment_id()

bart_responses = []

for prompt in df['examples'][0:10]:
    response = ld.get_bart_ground_truth(deployment_id, prompt)
    response_dict = json.loads(response.text)
    results = response_dict.get('deployment_response', {}).get('result', 'No result found')
    print("BART:", results)
    bart_responses.append((prompt, results))

In [None]:
bart_results_df = pd.DataFrame(bart_responses, columns=['examples', 'bart_response'])
bart_results_df

In [None]:
# Combine results and examine multiple versions of ground truth
merged_df = pd.merge(bart_results_df, mistral_results_df, on='examples', how='outer')
merged_df.to_csv('ground_truth.csv', index=False)
merged_df 

In [None]:
# Now that we have the data, let's save it to the cluster so we can use it later on
bart_results_df = bart_results_df.rename(columns={"bart_response": "ground_truth"})
bart_results_df.to_csv('bart_ground_truth.csv', index=False)
ld.dataset_upload("bart_ground_truth.csv")

In [None]:
mistral_results_df = mistral_results_df.rename(columns={"mistral_response": "ground_truth"})
mistral_results_df.to_csv('mistral_ground_truth.csv', index=False)
ld.dataset_upload("mistral_ground_truth.csv")

In [None]:
# And let's check that data loaded
ld.get_datasets()

# Experiments
Let's start by creating a team name for our experiments so we can organize our data. Run the cell below to generate a random team name. 

In [None]:
# Let's create an ID for our experiments 
alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)

def shortuuid_random():
    return su.random(length=8)

short_guid = shortuuid_random()
team_name = f"gator_{short_guid}"
team_name

## Loading Data

After generating the ground truth (either manually or with the aid of some models) and uploading the dataset to Lumigator, we are ready to start evaluating models on it.

Note that when you uploaded your datasets, the response included a `dataset_id`. This is a unique identifier for your dataset that you can reuse across different experiments. Please add your dataset identifier in the cell below to use it from now on.

Note that we have also provided a few pre-generated datasets below, in the same format as the one you just generated. If you want to try one of them, you can just remove the `YOUR_DATASET_ID` line and uncomment (by removing the leading `#` character) the one with the dataset you want.

In [None]:
dataset_id = YOUR_DATASET_ID  # replace with the dataset_id string returned when you uploaded your dataset
# dataset_id = "fd454e33-e0c1-4c3b-a5f3-151a2be8beaa" # Mistral GT - 10 samples
# dataset_id = "daaa63ae-84fc-4557-a301-97d71b4ca7fe" # Bart GT - 10 samples
# dataset_id = "1bc65c24-5f28-4ede-9578-f56e4cbdeb5f" # Mistral-API GT - 100 samples
# dataset_id = "6bb9378a-012f-486b-afac-01b56f00456e" # Bart GT - 100 samples
# dataset_id = "a36061aa-b18a-4abc-a7de-d652670ed971" # Mistral-llamafile GT - 100 samples

# now look for the dataset on Lumigator
r = ld.dataset_info(dataset_id)
dataset_name = json.loads(r.text)['filename']

## Model Selection

What you see below are different lists of models we have already tested for the summarization task.
The `models` variable at the end provides you with a selection, but you can choose any combination of them.

Note that different model types are specified with different prefixes:

- `hf://` is used for HuggingFace models which are downloaded and run as Ray jobs
- `mistral://` is used for models which are accessed through the Mistral API
- `oai://` is used for models which are accessed through an OpenAI-compatible API
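
If you want to inspect a model string before submitting it, the prefix can be split off with plain string handling. The helper below is a hypothetical sketch for illustration; it is not part of `lumigator_demo`, and the platform may parse these strings differently.

```python
# Prefix descriptions taken from the list above.
KNOWN_PREFIXES = {
    "hf": "HuggingFace model, downloaded and run as a Ray job",
    "mistral": "model accessed through the Mistral API",
    "oai": "model accessed through an OpenAI-compatible API",
}


def split_model_uri(model_uri):
    """Split 'prefix://name' into (prefix, name), rejecting unknown prefixes."""
    prefix, separator, name = model_uri.partition("://")
    if not separator or not name:
        raise ValueError(f"Expected 'prefix://model-name', got {model_uri!r}")
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"Unknown model prefix: {prefix!r}")
    return prefix, name


prefix, name = split_model_uri("hf://facebook/bart-large-cnn")
print(prefix, name, "->", KNOWN_PREFIXES[prefix])
```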

In [None]:
enc_dec_models = [
    'hf://facebook/bart-large-cnn',
    'hf://mikeadimech/longformer-qmsum-meeting-summarization', 
    'hf://mrm8488/t5-base-finetuned-summarize-news',
    'hf://Falconsai/text_summarization',
]

dec_models = [
    'mistral://open-mistral-7b',
]

gpts = [
    "oai://gpt-4o-mini",
    "oai://gpt-4-turbo",
    "oai://gpt-3.5-turbo-0125"  
]

models = [
    enc_dec_models[0],
    dec_models[0],
    gpts[0]
]

models

## Run Evaluations

The following cell will start the actual model evaluations.
Once you run it, new jobs will be submitted to Ray (one for each model) and the outcomes of these submissions will be printed.
Each evaluation job will first use the provided model to summarize each of the emails in the input dataset. After that, it will calculate a few metrics to evaluate how close the predicted summaries are to the ground truth provided in the dataset.

Each job starts with a `created` status. While the job runs, you will be able to track its status by running the cell in the section **Track evaluation jobs**.

In [None]:
# set this value to limit the evaluation to the first max_samples items (0=all)
max_samples = 0

responses = []
for model in models:
    descr = f"Testing {model} summarization model on {dataset_name}"
    responses.append(ld.experiments_submit(model, team_name, descr, dataset_id, max_samples))

### Track evaluation jobs

Run the following to track your evaluation jobs.

- *NOTE*: you won't be able to run other cells while this one is running. However, you can interrupt it whenever you want by clicking on the "stop" button above and run it at a later time.
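
The tracking cell polls until no job is still in progress. If you would rather not wait indefinitely, one option is to add a timeout. The sketch below shows the pattern with a stand-in status function: `fetch_statuses` is a hypothetical callable, and the status names `succeeded` and `failed` are assumptions (only `created` is confirmed above).

```python
import time

# Assumed names for terminal states; check the statuses your jobs actually report.
FINISHED = {"succeeded", "failed"}


def wait_for_jobs(fetch_statuses, poll_interval=5.0, timeout=600.0):
    """Poll until every job reaches a finished state, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch_statuses()  # expected shape: {job_id: status}
        if all(status in FINISHED for status in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Jobs still running after {timeout} seconds: {statuses}")
        time.sleep(poll_interval)


# Stand-in for a real status call: returns a new snapshot on each poll.
_snapshots = iter([{"job-1": "created"}, {"job-1": "running"}, {"job-1": "succeeded"}])
print(wait_for_jobs(lambda: next(_snapshots), poll_interval=0))
```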

In [None]:
job_ids = [ld.get_resource_id(r) for r in responses]

wip = ld.show_experiment_statuses(job_ids)
while wip:
    time.sleep(5)
    clear_output()
    wip = ld.show_experiment_statuses(job_ids)

## Show evaluation results

Once all evaluations are completed, their results will be stored on our platform and available for download. 
You can download them individually with the command

```python
ld.experiments_result_download(job_id)
```

The following cell iterates on all your job ids, downloads results from each, and builds a table comparing different metrics for each model.
The metrics we use to evaluate are ROUGE, METEOR, and BERTScore. They all measure similarity between the predicted summaries and the ground truth summaries, but each focuses on different aspects. The image below shows their main characteristics and the tradeoffs between their flexibility and their computational cost.

<img src="assets/metrics.png" alt="drawing" width="900"/>
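
To give a feel for what the simplest of these metrics measures, here is a minimal sketch of ROUGE-1, which counts overlapping single words between a prediction and a reference. This is a simplified illustration, not the implementation used by the evaluation jobs: real ROUGE libraries also handle tokenization and optional stemming.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall, and F1 between two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge1("the add-on fails to load", "the add-on fails after the update"))
```

Because it only matches exact words, ROUGE gives no credit for a correct paraphrase. That gap is what embedding-based metrics such as BERTScore try to close, at a higher computational cost.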

In [None]:
# after the jobs complete, gather evaluation results
eval_results = []
for job_id in job_ids:
    eval_results.append(ld.experiments_result_download(job_id))

# convert results into a pandas dataframe
ld.eval_results_to_table(eval_results)

## Analysis of Evaluation Results

The table above is just a summary of all the evaluation results.
The `eval_results` object contains many more details, which you can explore in the following cells.

### Direct access to all data

The following cell shows you the kind of information that's available in each of the `eval_results` elements. This information is nested at different depths. You can access each value with the `get_nested_value` function, using a `/`-separated path.
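
If the idea of a `/`-separated path is unfamiliar, the sketch below shows how such a lookup can work on a plain nested dictionary. `lookup_path` is a hypothetical helper written for illustration and is not the actual `get_nested_value` implementation; the numbers in `example` are made-up placeholders, not real results.

```python
def lookup_path(data, path, separator="/"):
    """Follow a 'key/subkey/...' path through nested dictionaries."""
    current = data
    for key in path.split(separator):
        current = current[key]  # raises KeyError if any level is missing
    return current


# Made-up placeholder values, shaped loosely like one evaluation result.
example = {
    "summarization_time": 12.5,
    "bertscore": {"precision_mean": 0.87, "f1": [0.9, 0.8]},
}

print(lookup_path(example, "summarization_time"))
print(lookup_path(example, "bertscore/precision_mean"))
```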

In [None]:
# eval_results is a list holding information for each of the models you defined before
# for each element, you can access different metrics, time performance, and predictions
eval_results[0].keys()

In [None]:
# see how much time it took for a model to summarize all the input samples
ld.get_nested_value(eval_results[0], "summarization_time")

In [None]:
# see all the bertscore data
ld.get_nested_value(eval_results[0], "bertscore")

In [None]:
# see mean bert precision
ld.get_nested_value(eval_results[0], "bertscore/precision_mean")

### See the samples with the best and worst values for a given model and metric

An average score on its own can be hard to interpret, and looking at individual samples often helps make sense of it. The following cell shows the best and worst predictions for a given model and metric.
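
The underlying idea is simple: pair each prediction with its per-sample score, then pick the extremes. Here is a minimal sketch with made-up placeholder data; `best_and_worst` is a hypothetical helper, not the `show_best_worst` function used in the next cell.

```python
def best_and_worst(samples, scores):
    """Return the (sample, score) pairs with the highest and lowest score."""
    if not samples or len(samples) != len(scores):
        raise ValueError("samples and scores must be non-empty and the same length")
    ranked = sorted(zip(scores, samples), key=lambda pair: pair[0])
    worst_score, worst_sample = ranked[0]
    best_score, best_sample = ranked[-1]
    return {"best": (best_sample, best_score), "worst": (worst_sample, worst_score)}


# Made-up placeholder predictions and scores, for illustration only.
predictions = ["summary A", "summary B", "summary C"]
f1_scores = [0.72, 0.91, 0.55]

print(best_and_worst(predictions, f1_scores))
```

Reading the worst-scoring samples next to their ground truth is often the quickest way to tell whether a low score reflects a bad summary or a questionable reference.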

In [None]:
ids = [ld.experiments_result_download(job_id) for job_id in job_ids]

ld.show_best_worst(ids, "hf://facebook/bart-large-cnn", "bertscore/f1")