## Comparing DeepSeek/Mistral/OpenAI Models using Lumigator 🐊

There's been a lot of hype around [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1):
it's an open source model that rivals OpenAI's o1 in performance!

In this notebook, we will use Lumigator to evaluate DeepSeek R1 against OpenAI o1 and Mistral Large. Along the way we'll also run DeepSeek V3, OpenAI o3-mini, and a local Llamafile model for comparison.

### Dataset

Neither OpenAI nor DeepSeek has published exactly what data they used to train these models. This means
that anything on the internet may have been used in training, and we have no way to verify that
any given public dataset wasn't. This makes any evaluation we do with public data
inherently flawed. Once a dataset is posted on the internet, it can technically be used for LLM training and
is no longer a reliable benchmark for future models.

That's a big caveat to this notebook demonstration: the performance differences we see here don't actually
indicate which model is better. To answer that question, you'll have to try the models on your own data
that couldn't possibly have been used to train them!

With that in mind, the dataset we'll use here is the ForeverDreaming subset of [SummScreen](https://arxiv.org/abs/2104.07091).
It's a dataset of TV show transcripts and their associated recaps, and for this demo we'll filter it down to episodes
of the popular US TV show "The Office". This filtered dataset is useful for a few reasons:

1. The input transcripts are quite long (>4k tokens), which means that generating a summary isn't a trivial task.
2. I'm a domain expert in this because I've watched all of the episodes of The Office many times.
   I'll be able to evaluate for myself how good the summary is, and I'll know if a model missed anything in its summary.
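
To sanity-check the length claim in item 1 without loading a tokenizer, a rough estimate is usually enough. The sketch below assumes about four characters per token, which is a rule of thumb for English text and not the tokenizer of any model compared here. `estimate_tokens`, `is_long_input`, and `CHARS_PER_TOKEN` are hypothetical names.

```python
import math

# Assumption: roughly 4 characters per token for English text (rule of thumb only)
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token count for `text` (not an exact tokenizer count)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def is_long_input(text: str, threshold: int = 4000) -> bool:
    """Return True when the estimated token count exceeds `threshold`."""
    return estimate_tokens(text) > threshold
```

The real count depends on each provider's tokenizer, so treat this only as a quick filter.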

### Prerequisites

Before running this notebook, you need to have Lumigator running. Make sure that `OPENAI_API_KEY`, `DEEPSEEK_API_KEY`, and `MISTRAL_API_KEY` are all set in your environment variables, then run `make local-up` to build and start Lumigator. Lumigator should now be ready for our experiment and able to make requests to OpenAI, Mistral, and DeepSeek!
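
A quick way to catch a missing key before starting is to check the environment from Python. This is a minimal sketch: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and it only inspects the environment of the process that runs it, which has to be the same shell you launch `make local-up` from for the check to mean anything.

```python
import os

# The three keys named in the prerequisites above
REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPSEEK_API_KEY", "MISTRAL_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_keys()
if missing:
    print(f"Set these before running `make local-up`: {', '.join(missing)}")
```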

In [None]:
from datasets import load_dataset

# First step: prepare the dataset.

# Grab the dataset from Hugging Face: https://huggingface.co/datasets/YuanPJ/summ_screen
ds = load_dataset("YuanPJ/summ_screen", "fd")["test"]
# Keep only the examples whose "File Name" contains "The_Office"
ds = ds.filter(lambda x: "The_Office" in x["File Name"])

# Now let's prepare it for Lumigator upload. We need to rename some columns and delete the rest.
# Rename "Transcript" to "examples" and "Recap" to "ground_truth", which is what Lumigator expects.
ds = ds.rename_column("Transcript", "examples")
ds = ds.rename_column("Recap", "ground_truth")

# remove all columns except "examples" and "ground_truth"
columns_list = ds.column_names
columns_list.remove("examples")
columns_list.remove("ground_truth")
ds = ds.remove_columns(columns_list)

print(f"The filtered test split contains {len(ds)} examples.")
# Write ds to a CSV file so we can read it back and upload it with the Lumigator API
DS_OUTPUT = "office_dataset.csv"
ds.to_csv(DS_OUTPUT)
MAX_SAMPLES = 1  # This demo is only designed to run on one example, to make visual comparison easier
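
Before uploading, it can help to confirm that the CSV really has the two columns Lumigator expects. The following is a sketch using only the standard library. `check_csv` and `EXPECTED_COLUMNS` are hypothetical names, and it assumes the file contains exactly the `examples` and `ground_truth` columns with no extra index column.

```python
import csv

# Column names produced by the renaming step above
EXPECTED_COLUMNS = {"examples", "ground_truth"}


def check_csv(path: str) -> int:
    """Validate the header and return the number of data rows in the CSV at `path`."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = set(reader.fieldnames or [])
        if header != EXPECTED_COLUMNS:
            raise ValueError(f"Unexpected columns: {sorted(header)}")
        # Quoted multi-line transcripts are still counted as a single row
        return sum(1 for _ in reader)
```

For the file written above, `check_csv("office_dataset.csv")` should return the same count that the cell prints.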

In [None]:
from pathlib import Path
from time import sleep

from lumigator_schemas.datasets import DatasetFormat
from lumigator_schemas.experiments import GetExperimentResponse
from lumigator_schemas.workflows import WorkflowCreateRequest, WorkflowStatus
from lumigator_sdk.lumigator import LumigatorClient


def wait_for_all_workflows(lumi_client_int: LumigatorClient, experiment_id: str) -> GetExperimentResponse:
    """Wait for an experiment to complete."""
    still_running = True
    while still_running:
        still_running = False
        experiment_details = lumi_client_int.experiments.get_experiment(experiment_id)
        still_running_workflows = []
        for workflow in experiment_details.workflows:
            if workflow.status not in [WorkflowStatus.SUCCEEDED, WorkflowStatus.FAILED]:
                still_running_workflows.append(workflow.name)
        if still_running_workflows:
            still_running = True
            print(f"Waiting for workflows {still_running_workflows} to complete")
            sleep(10)
    return experiment_details


# Time to connect up to the Lumigator client!
LUMI_HOST = "localhost:8000"
client = LumigatorClient(api_host=LUMI_HOST)

# Upload the file that we created earlier
with Path(DS_OUTPUT).open() as file:
    data = file.read()
dataset_response = client.datasets.create_dataset(dataset=data, format=DatasetFormat.JOB)
dataset_id = dataset_response.id
print(f"Dataset uploaded and has ID: {dataset_id}")
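
`wait_for_all_workflows` keeps polling until every workflow has either succeeded or failed, with no upper bound on how long it waits. If a workflow gets stuck, a timeout can be useful. Here is a generic sketch of polling with a deadline. `poll_until` and its parameters are hypothetical and not part of the Lumigator SDK.

```python
import time
from typing import Callable


def poll_until(is_done: Callable[[], bool], timeout_s: float = 600.0, interval_s: float = 10.0) -> bool:
    """Call `is_done` every `interval_s` seconds; return False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The `is_done` callable could wrap the same status check that `wait_for_all_workflows` performs on each pass.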

In [None]:
# Now time to create an experiment in Lumigator! This is a container for all the workflows we'll run
from lumigator_schemas.experiments import ExperimentCreate

request = ExperimentCreate(
    name="The Office Summarization",
    description="Bears, Beets, Battlestar Galactica",
    dataset=dataset_id,
    max_samples=MAX_SAMPLES,
)
experiment_response = client.experiments.create_experiment(request)
experiment_id = experiment_response.id
print(f"Experiment created and has ID: {experiment_id}")

In [None]:
import requests
from lumigator_schemas.workflows import WorkflowDetailsResponse


# Download a finished workflow's results and pull out the metrics and predictions
def get_workflow_results(workflow: WorkflowDetailsResponse):
    response = requests.get(workflow.artifacts_download_url)
    result = response.json()
    results = {
        "rouge2": round(result["metrics"]["rouge"]["rouge2_mean"], 2),
        "bertscore": round(result["metrics"]["bertscore"]["f1_mean"], 2),
        "meteor": round(result["metrics"]["meteor"]["meteor_mean"], 2),
        "predictions": result["artifacts"]["predictions"],
        "ground_truth": result["artifacts"]["ground_truth"],
        "examples": result["artifacts"]["examples"],
    }
    return results
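
To get a feel for what the `rouge2` number measures, here is a deliberately simplified version: the F1 score of bigram overlap between a prediction and its ground truth. It is an illustration only and not the code Lumigator runs. Real ROUGE implementations differ in tokenization and stemming, and `bigrams` and `rouge2_f1` are hypothetical names.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count adjacent word pairs in lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def rouge2_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because it rewards exact word-pair matches, a good summary that paraphrases the recap can still score low, which is one reason the notebook also reports BERTScore and METEOR.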

In [None]:
# Let's run DeepSeek R1: https://api-docs.deepseek.com/quick_start/pricing
request = WorkflowCreateRequest(
    name="Deepseek R1",
    description="Summarize with Deepseek R-1",
    model="deepseek/deepseek-reasoner",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=MAX_SAMPLES,
)
client.workflows.create_workflow(request).model_dump()

In [None]:
# Let's run DeepSeek V3: https://api-docs.deepseek.com/quick_start/pricing
request = WorkflowCreateRequest(
    name="Deepseek V3",
    description="Summarize with Deepseek V3",
    model="deepseek/deepseek-chat",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=MAX_SAMPLES,
)
client.workflows.create_workflow(request).model_dump()

In [None]:
# Now let's run the same thing, but with Mistral Large
request = WorkflowCreateRequest(
    name="Mistral Large",
    description="Summarize with Mistral Large",
    model="mistral/mistral-large-latest",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=MAX_SAMPLES,
)
client.workflows.create_workflow(request).model_dump()

In [None]:
# Now let's run the same thing, but with OpenAI o1
request = WorkflowCreateRequest(
    name="OpenAI o1",
    description="Summarize with o1",
    model="openai/o1",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=MAX_SAMPLES,
)
client.workflows.create_workflow(request).model_dump()

In [None]:
# Now let's run the same thing, but with OpenAI o3-mini
request = WorkflowCreateRequest(
    name="OpenAI o3-mini",
    description="Summarize with o3-mini",
    model="openai/o3-mini",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=MAX_SAMPLES,
)
client.workflows.create_workflow(request).model_dump()

In [None]:
# Now let's run the same thing, but with a local Mistral Llamafile reached through base_url
request = WorkflowCreateRequest(
    name="Llamafile Mistral",
    description="Summarize with Q4 Llamafile",
    model="openai/scoopdewhoop",
    base_url="http://localhost:8080/v1",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=MAX_SAMPLES,
)
client.workflows.create_workflow(request).model_dump()
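
The workflow cells above differ only in name, description, and model (plus `base_url` for the Llamafile). As a sketch, the shared fields can be generated from a small table instead of being repeated. `MODELS` and `build_workflow_specs` are hypothetical names. The dictionary keys mirror the keyword arguments used in the cells above, and the descriptions are generated from the names, so they differ slightly from the hand-written ones.

```python
# (name, model identifier) pairs taken from the cells above
MODELS = [
    ("Deepseek R1", "deepseek/deepseek-reasoner"),
    ("Deepseek V3", "deepseek/deepseek-chat"),
    ("Mistral Large", "mistral/mistral-large-latest"),
    ("OpenAI o1", "openai/o1"),
    ("OpenAI o3-mini", "openai/o3-mini"),
]


def build_workflow_specs(dataset_id, experiment_id, max_samples, models=MODELS):
    """Return one dict of request fields per (name, model) pair."""
    return [
        {
            "name": name,
            "description": f"Summarize with {name}",
            "model": model,
            "dataset": dataset_id,
            "experiment_id": experiment_id,
            "max_samples": max_samples,
        }
        for name, model in models
    ]
```

Each dict could then be passed along as `client.workflows.create_workflow(WorkflowCreateRequest(**spec))`, assuming that is equivalent to the explicit cells. The Llamafile workflow needs the extra `base_url` field, so it is easiest to keep as its own cell.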

In [None]:
import pandas as pd

experiment = wait_for_all_workflows(client, experiment_id)
print(f"Experiment: {experiment.name}")
# create a table with the results
table = pd.DataFrame()
for workflow in experiment.workflows:
    print(f"--------{workflow.name}--------")
    print(f"Desc: {workflow.description}")
    if workflow.status == WorkflowStatus.SUCCEEDED:
        results = get_workflow_results(workflow)
        print(f"ROUGE2: {results['rouge2']}")
        print(f"BERTScore: {results['bertscore']}")
        print(f"METEOR: {results['meteor']}")
        for idx, prediction in enumerate(results["predictions"]):
            hypo = prediction["choices"][0]["message"]["content"]
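            usage = prediction["usage"]
            comp_tok = usage["completion_tokens"]
            prompt_tok = usage["prompt_tokens"]
            # Reasoning models report reasoning tokens separately; other models may omit the details
            details = usage.get("completion_tokens_details") or {}
            reasoning_tok = details.get("reasoning_tokens") or 0
            # Count only the tokens in the visible answer
            comp_tok = comp_tok - reasoning_tok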
            print(f"Example {idx}")
            print(f"Prediction: \n{hypo}\n")
            print(f"Completion Tokens: {comp_tok}")
            print(f"Prompt Tokens: {prompt_tok}")
            print(f"Reasoning Tokens: {reasoning_tok}")
            # add a new row to the table
            table = pd.concat(
                [
                    table,
                    pd.DataFrame(
                        {
                            "Model": workflow.name,
                            "ROUGE2": results["rouge2"],
                            "BERTScore": results["bertscore"],
                            "METEOR": results["meteor"],
                            "Example": idx,
                            "Prediction": hypo,
                            "Tokens": comp_tok,
                            "Prompt Tokens": prompt_tok,
                            "Reasoning Tokens": reasoning_tok,
                        },
                        index=[0],
                    ),
                ]
            )
    else:
        print(f"Workflow {workflow.id} failed: deleting the workflow.")
        client.workflows.delete_workflow(workflow.id)

In [None]:
# Generate a table that prints out all the automatic metrics and numbers for easy comparison
print(table[["Model", "ROUGE2", "BERTScore", "METEOR", "Tokens"]].to_string(index=False))
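
To read the table as a ranking, the rows can be sorted by a single metric. Here is a small standard-library sketch. `rank_by` is a hypothetical helper, and the rows below are placeholders with made-up scores, there only to show the shape of the data.

```python
def rank_by(rows, metric):
    """Return rows sorted from highest to lowest value of `metric`."""
    return sorted(rows, key=lambda row: row[metric], reverse=True)


# Placeholder rows with made-up scores, not results from this experiment
rows = [
    {"Model": "model-a", "ROUGE2": 0.10, "BERTScore": 0.85},
    {"Model": "model-b", "ROUGE2": 0.12, "BERTScore": 0.83},
]
for row in rank_by(rows, "ROUGE2"):
    print(f"{row['Model']:<10} {row['ROUGE2']:.2f}")
```

With the pandas table above, `table.sort_values("ROUGE2", ascending=False)` gives the same ordering. Keep in mind that with `MAX_SAMPLES = 1`, any ranking reflects a single episode.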

## Conclusion

Although we can't draw any confident conclusions about which model is better overall (because all of the models may have seen data about The Office during training), the DeepSeek models look competitive with OpenAI's and Mistral's on this summarization task. What other summarization tasks might be interesting to compare using Lumigator?