# Lumigator from Mozilla AI 🐊 🦊

Welcome to the Getting Started notebook for Lumigator, a powerful tool developed by
[Mozilla AI](https://www.mozilla.ai/) for evaluating language models. In this guide, we'll
walk you through the key features and functionalities of Lumigator, helping you quickly get
up to speed with using it for your LM evaluation tasks.

Specifically, we'll cover the following topics:

+ What Jupyter Notebooks are and how to use them.
+ An Overview of Machine Learning.
+ The Lumigator Platform 🐊
+ What Machine Learning evaluation workflows look like.
+ The DialogSum ground truth dataset.
+ Selecting models to perform summarization:
  + Using one encoder/decoder (BART).
  + Utilizing two decoders (Mistral and GPT-4) to evaluate against ground truth.
+ Running evaluation experiments.
+ Discussing results.

## Jupyter Walkthrough

[Jupyter Notebooks](https://jupyter-notebook.readthedocs.io/en/stable/) provide an executable
environment for running (usually) Python code alongside text. To work with Jupyter, click the "play"
icon (i.e., ▶) to execute the code and view the results below the cell you are currently running.
You can also use the `shift + enter` shortcut to execute the code cell and move to the next one.
Cells are executed sequentially and can contain either text (Markdown) or Python code.

![cell-running](images/running.png)

Your files are located on the left-hand side. They'll be saved for the duration of our session, but
if you'd like to keep them, make sure to download them. 

![file-tree](images/files.png)

Let's try running some code! Execute the following cell and verify the output below.

In [None]:
print("Welcome to Lumigator!🐊")

## Machine learning glossary

As we walk through this notebook, you’ll encounter several Machine Learning terms that are essential
to understanding the concepts and methods we'll be using. To help you follow along, we've compiled a
brief glossary of key terms you'll come across during the session:

+ **Machine learning**: The process of creating a model that learns from data.
+ **Dataset**: Data used to train models and evaluate their performance.
+ **LLM**: Large language model, [a text-based model that performs next-word predictions](https://www.nvidia.com/en-us/glossary/large-language-models/).
+ **Tokens**: Words broken up into pieces to be used in an LLM.
+ **Inference**: The process of getting a prediction from a large language model.
+ **Embeddings**: Numerical representations of text generated by modeling.
+ **Encoder-decoder models**: A neural network architecture composed of two neural networks: an
  encoder that takes the input vectors from our data and creates an embedding of a fixed length, and
  a decoder that takes those embeddings as input and generates outputs such as translated text or a
  text summary.
+ **Decoder-only models**: Models that, given an input prompt, use its representation to generate a
  sequence of words one at a time, with each word conditioned on the ones generated previously.
+ **Task**: The kind of problem a model is built to solve, such as translation, summarization, or
  completion. Different tasks suit different model types.
+ **Ground truth**: Information that humans (or, in some cases, LLMs) have judged to be true, which
  we can use to evaluate and compare trained models.
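
To make the idea of **tokens** concrete, here is a minimal sketch of splitting words into pieces. The vocabulary and the greedy longest-match rule are assumptions chosen for illustration. Real LLM tokenizers learn their vocabularies from data and use more elaborate algorithms.

```python
# Toy vocabulary of word pieces (hypothetical); real vocabularies are learned
# from data and contain tens of thousands of entries.
TOY_VOCAB = {"summar", "ization", "token", "s", "eval", "uation"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


print(toy_tokenize("summarization"))
print(toy_tokenize("tokens"))
```

A model never sees the raw string, only the sequence of pieces (in practice, their integer ids).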

The process of **machine learning** is the process of creating a mathematical model that tries to
approximate the world. A **machine learning model** is a set of instructions for generating a given
output from data. The instructions are learned from the features of the input data itself.

Within the broad landscape of machine learning, there are various modeling approaches, including
**supervised**, **unsupervised**, and **reinforcement** learning. Each approach has its own set of
techniques and use cases.

In the context of language models, the kind of modeling we focus on with Large Language Models
(LLMs) primarily falls within the domain of neural network-based approaches. These models learn
patterns, structures, and relationships in data through vast networks of interconnected nodes (or
neurons), which allow them to generate, interpret, and manipulate natural language in powerful ways.

### How do LLMs work? 

There are many different kinds of LLMs and many different kinds of architectures. For our
evaluations, we use two different kinds:

+ **Encoder/Decoder**: BART is an encoder/decoder model that converts input data into a fixed-size
  representation (similar to encoder models). These models are trained first to transform text into
  numerical representations, then to output text based on those numerical representations. They're
  better suited to condensing or transforming existing text than to open-ended generation.
+ **Decoder-only**: Most GPT-style models, such as Mistral, GPT-4, and others we'll be working
  with, are pre-trained on text data in an autoregressive manner, predicting the next token given
  the previous tokens.
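
The autoregressive loop behind decoder-only models can be illustrated with a deliberately tiny stand-in. The sketch below is not a neural network: it is a word-level bigram counter that conditions only on the last word, whereas a real LLM conditions on all previous tokens. The function names and the toy corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def generate(followers, prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: repeatedly append the most frequent next word."""
    output = prompt.split()
    for _ in range(max_new_tokens):
        candidates = followers.get(output[-1])
        if not candidates:
            break  # no known continuation for the last word
        next_word, _count = candidates.most_common(1)[0]
        output.append(next_word)
    return " ".join(output)


corpus = "the model reads the prompt and the model writes a summary"
model = train_bigram_model(corpus)
print(generate(model, "writes", max_new_tokens=2))
```

Each generated word is fed back in as context for the next prediction, which is the essence of next-token generation.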

### LLM Evaluation Workflows

The following steps outline the key phases involved in evaluating a Large Language Model (LLM):

1. **Generate Ground Truth**: The first step is to establish a reliable ground truth based on the
   specific business use case you're targeting. This represents the "correct" or expected output
   against which model performance will be measured.
1. **Select Models for Evaluation**: Next, choose several candidate models that you'd like to
   evaluate. These could be different versions of a language model, or distinct models altogether,
   depending on your evaluation criteria and use case.
1. **Run the Evaluation Loop**: This phase involves running the models through an evaluation process
   where you compare the model's outputs against the ground truth. You'll iterate through multiple
   examples, assessing how well each model performs in generating the desired results.
1. **Analyze Evaluation Results**: Finally, after completing the evaluation loop, you'll analyze the
   results to identify strengths, weaknesses, and areas for improvement. This analysis helps inform
   decisions about which model to use, optimize, or further train for the business use case at hand.
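
The four phases above can be sketched in a few lines of plain Python. This is only an illustration of the loop's shape, not how Lumigator implements it: the "models" are ordinary functions, and the score is a toy word-overlap measure rather than a real summarization metric.

```python
def word_overlap(prediction, reference):
    """Toy similarity: Jaccard overlap between the two sets of lowercase words."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    if not pred_words and not ref_words:
        return 1.0
    return len(pred_words & ref_words) / len(pred_words | ref_words)


def evaluate(models, dataset, score_fn=word_overlap):
    """Run every model over every example and average its scores.

    `models` maps a model name to a callable that turns text into a summary.
    `dataset` is a non-empty list of (text, ground_truth) pairs.
    """
    results = {}
    for name, summarize in models.items():
        scores = [score_fn(summarize(text), truth) for text, truth in dataset]
        results[name] = sum(scores) / len(scores)
    return results


# Hypothetical stand-ins for real models: each one is just a Python function.
candidate_models = {
    "first_sentence": lambda text: text.split(".")[0],
    "first_three_words": lambda text: " ".join(text.split()[:3]),
}
# Step 1: ground truth. Step 2: the candidate models above.
toy_dataset = [
    ("Alice booked a room. Bob thanked her.", "Alice booked a room"),
]

# Step 3: run the loop. Step 4: rank the models by their average score.
scores = evaluate(candidate_models, toy_dataset)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```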

![lumigator-features](images/lumigator-features.svg)

On a technical level, Lumigator is a Python-based FastAPI web application designed to run services
that handle jobs and deployments on a Ray cluster. This cluster can be run either locally or in the
cloud, depending on your system's specifications and the resources available.

The results and job metadata generated during evaluations are stored in an SQL database for easy
tracking and retrieval. Additionally, larger models, which are often loaded from platforms like
[Hugging Face](https://huggingface.co/), require GPUs for efficient processing due to their size and
computational demands.

![lumigator-architecture](images/lumigator-architecture.svg)

What is Ray? It is [a distributed runtime for Python programs](https://github.com/ray-project/ray)
that includes a Core library with primitives (Tasks, Actors, and Objects) and a suite of ML
libraries (such as Tune and Serve) that let you build components of the Machine Learning model
workflow.

### Nota bene: Machine Learning is alchemy

When we think of traditional software application workflows, we think of an example such as adding a
button. We can clearly test that we've added a blue button to our application, and that it works
correctly. Machine Learning is not like this! It involves a lot of experimentation, tweaking of
hyperparameters and prompts, and trying different models. Expect the process to be imperfect, with
many iterative loops. Luckily, Lumigator helps take away at least the uncertainty of model
selection. 🙂

> There’s a self-congratulatory feeling in the air. We say things like “Machine Learning is the new
> electricity”. I’d like to offer an alternative metaphor: Machine Learning has become alchemy. -
> [Ben Recht and Ali Rahimi](https://archives.argmin.net/2017/12/05/kitchen-sinks/)

Ultimately, the final test of whether a model is good is whether humans think it's good. With that
in mind, let's dive into setting up experiments with Lumigator to test our models!

## Importing Required Libraries

Before we begin working with Lumigator, we'll need to import several libraries and modules that
provide the necessary functionality for our tasks. In this section, we'll load the core
dependencies, including tools for data handling and visualization.

In [None]:
# Importing packages we need to work with data 
# python standard libraries
import os
import time
import json

# Random string generator
import random
import string
import shortuuid

# third-party libraries
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
from IPython.display import clear_output

from lumigator_sdk.lumigator import LumigatorClient
from lumigator_schemas.datasets import DatasetFormat
from lumigator_schemas.jobs import JobType, JobEvalCreate

from utils import job_result_download, results_to_table, get_nested_value

# wrap columns for inspection
pd.set_option('display.max_colwidth', 0)
# stylesheet for visibility
plt.style.use("fast")

%load_ext autoreload
%autoreload 2

In [None]:
LUMIGATOR_SERVICE_HOST = os.getenv('LUMIGATOR_SERVICE_HOST', 'localhost')
LUMIGATOR_SERVICE_PORT = os.getenv('LUMIGATOR_SERVICE_PORT', '8000')

## Understanding the Lumigator App and API

The app itself consists of an API. You can explore and test its methods through the interactive
[OpenAPI spec](https://swagger.io/specification/) served at the platform URL, under `docs`.

If you are running Lumigator as a local installation, you can directly access the API at
[this URL](http://localhost:8000/docs).

![lumigator-api](images/lumigator-api.png)

Large language models today are consumed in one of several ways:

+ As **API endpoints** for proprietary models hosted by [OpenAI](https://openai.com/),
  [Anthropic](https://www.anthropic.com/), or major cloud providers.
+ As **model artifacts** downloaded from HuggingFace’s Model Hub, trained/fine-tuned using
  HuggingFace libraries, and hosted on local storage.
+ As model artifacts available in a format optimized for **local inference**, typically
  [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md), and accessed via applications
  like [llama.cpp](https://github.com/ggerganov/llama.cpp) or [ollama](https://ollama.com/).
+ As [ONNX](https://onnx.ai/), a format which optimizes sharing between backend ML frameworks.

We use API endpoints and local storage in Lumigator. We currently have four key endpoints on the
platform:

+ `/health`: Status of the application, running status of jobs and deployments. 
+ `/datasets`: Data that we add to the platform for evaluation. We can upload, delete, and save
  different data in the platform. We'll use this to also save our ground truth and experiment data.
+ `/jobs`: Our actual evaluation jobs. We can list all previous evaluations, create new ones, and
  get their results.
+ `/completions`: Access to external APIs such as Mistral and OpenAI.
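
As a rough sketch of how these endpoints could be addressed with only the standard library, the helper below builds endpoint URLs and fetches JSON from them. The `/api/v1` path prefix and both function names are assumptions made for illustration, so check the interactive docs of your own deployment for the actual paths.

```python
import json
import urllib.request

# The four endpoint names come from the list above. The "/api/v1" prefix is an
# assumption: verify the real paths in your deployment's interactive docs.
ENDPOINTS = ("health", "datasets", "jobs", "completions")


def endpoint_url(name, host="localhost", port="8000", prefix="/api/v1"):
    """Build the URL for one of the platform endpoints."""
    if name not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {name!r}")
    return f"http://{host}:{port}{prefix}/{name}"


def fetch_json(url, timeout=5):
    """GET `url` and decode the JSON body, or return None if the call fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection problems; ValueError covers invalid JSON.
        return None


print(endpoint_url("health"))
```

With a running instance, `fetch_json(endpoint_url("health"))` would return the decoded response; without one it simply returns `None`.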

## Model Task: Summarization

The task we'll be working with is *summarization*: we want to generate a summary of our text. In
our business case, which is to create summaries of conversation threads, much as you might see in
Slack or an email chain, the models need to be able to extract key information from those threads
while still being able to accept a large context window to capture the entire conversation history. 

We identified that it is far more valuable to produce *abstractive* summaries, which identify the
important sections of the text and generate new highlights, than *extractive* ones, which select a
subset of sentences and staple them together. This is because the final interface will be in natural
language, and we want to avoid summaries pieced together from the often incoherent text snippets
that extractive methods produce.
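
To see why extractive output can read awkwardly, consider this minimal sketch of an extractive summarizer. It is an assumed, simplified scoring scheme (average word frequency per sentence), not one of the methods evaluated in this notebook.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    best = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order and join them.
    return " ".join(s for s in sentences if s in best)


thread = (
    "Ann: The release is late. "
    "Bob: The release needs one more fix. "
    "Ann: Lunch was great today."
)
print(extractive_summary(thread, num_sentences=2))
```

The result is copied verbatim from the thread, speaker labels included. An abstractive model would instead write a new sentence, such as a statement that the release is late pending one more fix.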

For more on summarization as a use-case, [see our blog post here.](https://blog.mozilla.ai/on-model-selection-for-text-summarization/)

## Ground Truth for Models

The term ground truth comes from geology and geospatial sciences, where actual information was
collected on the ground to validate data acquired through remote sensing, such as satellite imagery
or aerial photography. Since then, the concept has been adopted in other fields, particularly in
machine learning and artificial intelligence, to refer to the accurate, real-world data used for
training and testing models. 

The **best ground truth is human-generated**, but building it is very expensive. One recent trend is
to rely on large language models, but they have their own pitfalls. An intermediate approach uses
different LLMs to provide ground truth "candidates", which are then subject to human pairwise
evaluation.
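
One simple way to turn pairwise judgments into a single choice is to count wins. The sketch below assumes each human vote is recorded as a (winner, loser) pair. The source names and summaries are hypothetical, and counting wins is only one of several possible aggregation rules.

```python
from collections import Counter


def pick_ground_truth(candidates, pairwise_votes):
    """Return the candidate summary that wins the most pairwise comparisons.

    `candidates` maps a source name to its candidate summary.
    `pairwise_votes` is a list of (winner, loser) source names chosen by humans.
    """
    wins = Counter({name: 0 for name in candidates})
    for winner, _loser in pairwise_votes:
        wins[winner] += 1
    best_name, _count = wins.most_common(1)[0]
    return best_name, candidates[best_name]


# Hypothetical candidates and votes, for illustration only.
candidates = {
    "model_a": "Bob is late.",
    "model_b": "Bob will arrive late because of traffic.",
    "model_c": "Traffic.",
}
votes = [("model_b", "model_a"), ("model_b", "model_c"), ("model_a", "model_c")]
print(pick_ground_truth(candidates, votes))
```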

## Our Input Data

The data we'll be using in this walkthrough comes from
[DialogSum](https://github.com/cylnlp/DialogSum), a large-scale labeled dialogue summarization
dataset which comes with ground truth provided by human annotators. Here follows a brief description
of DialogSum.

In [None]:
# The dataset is available at https://huggingface.co/datasets/knkarthick/dialogsum
# and can be directly downloaded with the `load_dataset` method
dataset = 'knkarthick/dialogsum'
ds = load_dataset(dataset, split='validation')
df = ds.to_pandas()

In [None]:
# Examine a single sample 
df['dialogue'].iloc[0]

In [None]:
# Add a column with simple character counts for the model input
df['char_count'] = df['dialogue'].str.len()

In [None]:
# inspect our data
df.head(n=3)

In [None]:
# Show statistics about the character counts
df['char_count'].describe()

In [None]:
# Generate plot of character counts
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['char_count'], bins=30)
ax.set_xlabel('Character Count')
ax.set_ylabel('Frequency')

stats = df['char_count'].describe().apply(lambda x: f"{x:.0f}")

# Add text boxes for statistics
plt.text(1.05, 0.95, stats.to_string(), 
         transform=ax.transAxes, verticalalignment='top')

# Adjust layout
plt.tight_layout()
fig.subplots_adjust(right=0.75)

plt.show()

## Save and upload datasets

Now that we have inspected the dataset and its human-provided ground truth, let us save it and make
it available to Lumigator for further experiments. Starting from the original DialogSum dataset, we
will perform the following operations:

1. Make sure that the two main fields (original text and ground truth) are called `examples` and
  `ground_truth`, which are the names internally used by Lumigator to refer to them, and save the
  datasets as CSV files.
2. Make the dataset available to Lumigator with the `create_dataset` method.

In [None]:
ds = ds.remove_columns(["id", "topic"])
ds = ds.rename_column("dialogue", "examples")
ds = ds.rename_column("summary", "ground_truth")

dataset_name = "dialogsum_converted.csv"
ds.to_csv(dataset_name)
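
Before uploading, it can be worth confirming that the CSV header really contains the two expected column names. The helper below is a hypothetical convenience written with the standard library, not part of the Lumigator SDK, and it runs here on an in-memory sample instead of the file on disk.

```python
import csv
import io

REQUIRED_COLUMNS = ("examples", "ground_truth")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# In-memory stand-in for a CSV file on disk.
sample = io.StringIO('examples,ground_truth\n"#Person1#: Hi!",A greeting.\n')
print(missing_columns(sample))
```

On the real file, the same check could be run inside `with open(dataset_name, newline="") as f:` by calling `missing_columns(f)`. An empty list means both columns are present.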

In [None]:
lm_client = LumigatorClient(
    f"{LUMIGATOR_SERVICE_HOST}:{LUMIGATOR_SERVICE_PORT}"
)

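# Open the file in a context manager so it is closed once the upload is done
with open(dataset_name, "rb") as dataset_file:
    lm_client.datasets.create_dataset(dataset_file, DatasetFormat.JOB)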

In [None]:
# And let's check that data loaded
datasets = lm_client.datasets.get_datasets()

## Jobs

After generating the ground truth (either manually or with the aid of some models) and uploading the
dataset to Lumigator, we are ready to start evaluating models on it. Note that when you uploaded
your dataset you got back some information that included a dataset `id`. This is a unique
identifier for your dataset that you can reuse across different jobs.

In [None]:
dataset_id = datasets.items[0].id

# now look for the dataset on lumigator
result = lm_client.datasets.get_dataset(dataset_id)
dataset_id, dataset_name = result.id, result.filename

### Model Selection

What you see below are different lists of models we have already tested for the summarization task.
The `models` variable at the end provides you with a selection, but you can choose any combination
of them: the default is a single local model (`facebook/bart-large-cnn`), but depending on your
setup you can choose more and/or add different APIs.

Note that different model types are specified with different prefixes:

- `hf://` is used for HuggingFace models which are downloaded and run as Ray jobs.
- `mistral://` is used for models which are accessed through the Mistral API.
- `oai://` is used for models which are accessed through an OpenAI-compatible API.
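
To illustrate how such identifiers break down, here is a small sketch that splits a model string into its prefix and model name. It is not Lumigator's own parsing logic. The function name and the descriptions in the dictionary are assumptions based on the list above.

```python
# Prefixes listed above, mapped to a short description of how the model is reached.
KNOWN_PREFIXES = {
    "hf": "HuggingFace model run as a Ray job",
    "mistral": "model served by the Mistral API",
    "oai": "model served by an OpenAI-compatible API",
}


def split_model_uri(model_uri):
    """Split 'prefix://model-name' into its (prefix, model name) parts."""
    prefix, separator, name = model_uri.partition("://")
    if not separator or prefix not in KNOWN_PREFIXES or not name:
        raise ValueError(f"unsupported model identifier: {model_uri!r}")
    return prefix, name


for uri in ["hf://facebook/bart-large-cnn", "oai://gpt-4o-mini"]:
    prefix, name = split_model_uri(uri)
    print(f"{name}: {KNOWN_PREFIXES[prefix]}")
```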

In [None]:
# Here follows a list of models we have tested for summarization:
# feel free to add any of them in the "models" list below
#
# Encoder-Decoder models
#    'hf://facebook/bart-large-cnn',
#    'hf://mikeadimech/longformer-qmsum-meeting-summarization', 
#    'hf://mrm8488/t5-base-finetuned-summarize-news',
#    'hf://Falconsai/text_summarization',
#
# Decoder models
#    'mistral://open-mistral-7b',
#
# GPTs
#    "oai://gpt-4o-mini",
#    "oai://gpt-4-turbo",
#    "oai://gpt-3.5-turbo-0125",
#
models = [
    'hf://facebook/bart-large-cnn',
]

### Run Evaluations

The following cell will start the actual model evaluations. Once you run it, new jobs will be
submitted to Ray (one for each model) and the outcomes of these submissions will be printed.

Each evaluation job will first use the provided model to summarize each of the dialogues in the
input dataset. After that, it will calculate a few metrics to evaluate how close the predicted
summaries are to the ground truth provided in the dataset. Each job starts with a `created` status.
While the job runs, you will be able to track its status by running the cell in the next section.

In [None]:
# set this value to limit the evaluation to the first max_samples items (0=all)
max_samples = 10
# team_name is a way to group jobs together under the same namespace, feel free to customize it
team_name = "lumigator_enthusiasts"

responses = []
for model in models:
    # describe the job so it is easy to recognize later
    descr = f"Testing {model} summarization model on {dataset_name}"
    job_args = {
        "name": team_name,
        "description": descr,
        "model": model,
        "dataset": str(dataset_id),
        "max_samples": max_samples
    }
    responses.append(lm_client.jobs.create_job(JobType.EVALUATION, JobEvalCreate(**job_args)))

![ray-job](images/ray-job.png)

### Track Evaluation Jobs

To track the progress of your evaluation jobs, you’ll need to run the following commands. These jobs
are executed on a Ray cluster, which efficiently distributes the workload across multiple nodes,
whether locally or in the cloud.

Note that you won't be able to run other cells while this one is running. However, you can interrupt
it whenever you want by clicking on the "stop" button above, and run it again at a later time.

In [None]:
job_id = responses[0].id

job = lm_client.jobs.wait_for_job(job_id)  # Blocks until the job has finished

print(job)
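
Conceptually, waiting for a job amounts to polling its status until it reaches a terminal state. The sketch below shows that pattern in isolation. It is not the SDK's implementation: `fetch_status` is a hypothetical callable, and every status name other than `created` is an assumption.

```python
import time


def poll_until_done(fetch_status, interval=1.0, timeout=60.0,
                    terminal=("succeeded", "failed")):
    """Call `fetch_status()` until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Hypothetical job that reports three non-terminal statuses before finishing.
fake_statuses = iter(["created", "pending", "running", "succeeded"])
print(poll_until_done(lambda: next(fake_statuses), interval=0.01))
```

The timeout keeps the loop from running forever if a job never reaches a terminal state.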

## Show evaluation results

Once all evaluations are completed, their results will be stored on our platform and available for
download. The following cells download the results of your job and build a table comparing
different metrics for each model.

The metrics we use to evaluate are ROUGE, METEOR, and BERTScore. They all measure similarity
between predicted summaries and those provided with the ground truth, but each of them focuses on
different aspects. The image below shows their main characteristics and the tradeoffs between their
flexibility and their computational cost.

![metrics](images/metrics.png)
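
To give a feel for what an overlap-based metric measures, here is a simplified sketch of two ROUGE variants. It assumes whitespace tokenization and omits the stemming and other preprocessing that real implementations apply, so its numbers will not match those reported by an evaluation job. METEOR and BERTScore are not covered.

```python
from collections import Counter


def _f1(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(prediction, reference):
    """Unigram overlap: how many words the two texts share, order ignored."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))


def rouge_l_f1(prediction, reference):
    """Longest common subsequence: shared words that appear in the same order."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # lcs[i][j] = LCS length of the first i predicted and first j reference words.
    lcs = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, pred_word in enumerate(pred, start=1):
        for j, ref_word in enumerate(ref, start=1):
            if pred_word == ref_word:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return _f1(lcs[-1][-1], len(pred), len(ref))


reference = "bob will send the report on friday"
prediction = "on friday bob will send the report"
print(rouge1_f1(prediction, reference), rouge_l_f1(prediction, reference))
```

The two texts contain exactly the same words, so the unigram score is perfect, while the order-sensitive score is lower because only part of the word sequence lines up.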

In [None]:
job_download = lm_client.jobs.get_job_download(job_id)
result = job_result_download(job_download)

In [None]:
results_to_table([result])

## Analysis of Evaluation Results

The table above is just a summary of all the evaluation results. The `result` object contains many
more details, from which you'll be able to get a few more insights in the following cells.

The following cell shows you the kind of information that's available in the `result` object. This
information is nested at different depth levels. You can access each level using the
`get_nested_value` command.

In [None]:
# result holds the evaluation information for the model you ran above:
# you can access different metrics, time performance, and predictions
result.keys()

In [None]:
# see how much time it took for a model to summarize all the input samples
get_nested_value(result, "summarization_time")

In [None]:
# see all the bertscore data
get_nested_value(result, "bertscore")

In [None]:
# see the mean BERTScore precision
get_nested_value(result, "bertscore/precision_mean")
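
The slash-separated keys used above suggest a path-style lookup through nested dictionaries. The sketch below shows one way such a lookup could work. It is not the implementation of `get_nested_value` from `utils`: the function name, the `default` parameter, and the toy data (with made-up numbers) are all assumptions.

```python
def nested_lookup(data, path, separator="/", default=None):
    """Follow a separator-delimited `path` of keys through nested dictionaries."""
    current = data
    for key in path.split(separator):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical result structure with made-up numbers, for illustration only.
toy_result = {
    "bertscore": {"precision_mean": 0.87, "recall_mean": 0.85},
    "summarization_time": 12.3,
}
print(nested_lookup(toy_result, "bertscore/precision_mean"))
print(nested_lookup(toy_result, "rouge/rouge1_mean", default="not found"))
```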