## Comparing DeepSeek Models using Lumigator 🐊

There's been a lot of hype around [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1):
it's an open-source model that rivals GPT-4 performance!

In this notebook, we will use Lumigator to evaluate DeepSeek R1 against GPT-4o.

### Dataset

Neither OpenAI nor DeepSeek has published exactly what data was used to train GPT-4o or DeepSeek R1. This means
that anything on the internet may have been used in their training, and we have no way to verify that
any given public dataset wasn't. This makes any evaluation we do with public data
inherently flawed: once data is posted on the internet, it can technically be used for LLM training and
is no longer a reliable benchmark for future models.

That's a big caveat to this notebook demonstration: the performance differences shown here don't actually
indicate which model is better. To answer that question, you'll have to try the models on your own data,
data that couldn't possibly have been used for training DeepSeek R1 or GPT-4o!
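
If you do bring your own data, it needs the same two-column shape this notebook uploads to Lumigator: an `examples` column with the input text and a `ground_truth` column with the reference summary. Here is a minimal sketch using only the standard library; the rows and the `rows_to_csv` helper are hypothetical placeholders, not part of Lumigator.

```python
import csv
import io

# Hypothetical private examples: replace these with your own documents and reference summaries.
rows = [
    {"examples": "Full text of an internal document ...", "ground_truth": "Its reference summary."},
    {"examples": "Another private document ...", "ground_truth": "Another reference summary."},
]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the two columns used in this notebook."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["examples", "ground_truth"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text.splitlines()[0])  # the header row
```

The resulting string plays the same role as the CSV contents read from disk before the dataset upload later in this notebook.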

With that in mind, the dataset we'll use here is called [SummScreen](https://arxiv.org/abs/2104.07091) ForeverDreaming.
It's a dataset of TV show transcripts and their associated recaps. It's useful here for a few reasons:

1. The input transcripts are quite long (>8k tokens), which means that generating a summary is no trivial task.
2. For the purposes of this demo, we'll filter down to only Gilmore Girls episodes. 
   I'm a domain expert in this because I've watched most episodes a few times, so I can judge how good the summary is, in addition to the automatic metrics we get back.
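
To sanity-check the length claim without loading a tokenizer, a rough estimate from the word count is often enough. The sketch below assumes the common rule of thumb of about 0.75 English words per token; that ratio is an approximation and not the tokenizer of either model, and `estimate_tokens` is a hypothetical helper.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (a heuristic, not a tokenizer)."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


# Tiny stand-in for a transcript; a real episode is thousands of words long.
sample = "LORELAI: Coffee, please. " * 3
print(estimate_tokens(sample))
```

Applied to the transcript column of the dataset loaded in the next cell, this gives a quick per-episode estimate to compare against a model's context window.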

In [86]:
from datasets import load_dataset

# First step: prepare the dataset.

# Grab the dataset from Hugging Face: https://huggingface.co/datasets/YuanPJ/summ_screen
ds = load_dataset("YuanPJ/summ_screen", "fd")["test"]
# Keep only examples whose "File Name" contains "Gilmore"
ds = ds.filter(lambda x: "Gilmore" in x["File Name"])

# Now let's prepare it for Lumigator upload. We need to rename some columns and delete the rest.
# Rename the column "Transcript" to "examples" and "Recap" to "ground_truth". This is what Lumigator expects.
ds = ds.rename_column("Transcript", "examples")
ds = ds.rename_column("Recap", "ground_truth")

# Remove all columns except "examples" and "ground_truth"
columns_list = ds.column_names
columns_list.remove("examples")
columns_list.remove("ground_truth")
ds = ds.remove_columns(columns_list)

print(f"The filtered test split contains {len(ds)} examples.")
# Write ds to a CSV file; we'll read it back as a string to upload it with the Lumigator API
DS_OUTPUT = "gg_dataset.csv"
ds.to_csv(DS_OUTPUT)

The filtered test split contains 12 examples.


653430

In [87]:
from pathlib import Path
from time import sleep

from lumigator_schemas.datasets import DatasetFormat
from lumigator_schemas.workflows import WorkflowCreateRequest, WorkflowStatus
from lumigator_sdk.lumigator import LumigatorClient
from lumigator_sdk.strict_schemas import ExperimentIdCreate


# Now we'll create some helper functions.
def wait_for_workflow_complete(lumi_client_int: LumigatorClient, workflow_id: str):
    """Wait for a workflow to complete."""
    workflow_details = lumi_client_int.workflows.get_workflow(workflow_id)
    while workflow_details.status not in [WorkflowStatus.SUCCEEDED, WorkflowStatus.FAILED]:
        sleep(5)
        workflow_details = lumi_client_int.workflows.get_workflow(workflow_id)
    return workflow_details


# Time to connect up to the Lumigator client!
LUMI_HOST = "localhost:8000"
client = LumigatorClient(api_host=LUMI_HOST)

# Upload the file that we created earlier
with Path(DS_OUTPUT).open() as file:
    data = file.read()
dataset_response = client.datasets.create_dataset(dataset=data, format=DatasetFormat.JOB)
dataset_id = dataset_response.id
print(f"Dataset uploaded and has ID: {dataset_id}")

Dataset uploaded and has ID: 0fc03fef-5bce-4e79-9d72-e8a9440b214e
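
Note that `wait_for_workflow_complete` above loops until the workflow succeeds or fails, so it never returns if a workflow gets stuck in some other state. A generic polling loop with a deadline is one way to guard against that. The sketch below is independent of Lumigator: `poll_until` and the fake status source are hypothetical, and in this notebook the status call would wrap `client.workflows.get_workflow(workflow_id).status`.

```python
import time


def poll_until(fetch_status, terminal_states, timeout_s=600.0, interval_s=5.0):
    """Call fetch_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)


# Stand-in for a real status call: reports "running" twice, then "succeeded".
fake_statuses = iter(["running", "running", "succeeded"])
final = poll_until(lambda: next(fake_statuses), {"succeeded", "failed"}, timeout_s=5, interval_s=0.01)
print(final)
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline stable even if the system clock is adjusted while waiting.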


In [88]:
# Now time to create an experiment in Lumigator! This is a container for all the workflows we'll run
request = ExperimentIdCreate(
    name="Gilmore Girls Summarization",
    description="Which LLM knows Rory and Lorelai the best?",
)
experiment_response = client.experiments.create_experiment(request)
experiment_id = experiment_response.id
print(f"Experiment created and has ID: {experiment_id}")

Experiment created and has ID: 250565396489376488


In [80]:
import requests


# Wait until the workflow is done, then download and summarize its results
def get_results(workflow_id):
    print(f"Waiting for workflow {workflow_id} to complete")
    details = wait_for_workflow_complete(client, workflow_id)
    # Load the artifact so that we can look at the example
    print(f"Workflow {workflow_id} has completed")
    response = requests.get(details.artifacts_download_url)
    response.raise_for_status()  # fail early if the artifact download did not succeed
    result = response.json()

    results = {
        "avg_rouge": round(result["rouge"]["rouge2_mean"], 2),
        "avg_bertscore": round(result["bertscore"]["f1_mean"], 2),
        "avg_meteor": round(result["meteor"]["meteor_mean"], 2),
        "predictions": result["predictions"],
        "ground_truth": result["ground_truth"],
        "examples": result["examples"],
    }
    return results
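
For intuition about the `rouge2_mean` value read above: ROUGE-2 measures how many word bigrams a prediction shares with the reference. The sketch below is a simplified F1 variant using whitespace tokenization, with no stemming and none of the other processing real ROUGE packages apply; it is not the implementation Lumigator uses, and the two sentences are made-up examples.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent word pairs in lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction, reference):
    """Simplified ROUGE-2: F1 over the bigrams shared by prediction and reference."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # multiset intersection of bigram counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Lorelai gives the concert tickets to Rory and her classmates"
paraphrase = "Rory and her friends get the tickets from her mother"
print(round(rouge2_f1(reference, reference), 2))
print(round(rouge2_f1(paraphrase, reference), 2))
```

Because only exact bigram matches count, a faithful summary worded differently from the reference recap can still score low, which is worth keeping in mind when reading the ROUGE-2 numbers.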

In [95]:
# Let's run DeepSeek R1 (API pricing: https://api-docs.deepseek.com/quick_start/pricing)
request = WorkflowCreateRequest(
    name="Deepseek R1",
    description="Summarize with Deepseek R-1",
    model="deepseek/deepseek-reasoner",
    model_url="deepseek/deepseek-reasoner",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=1,
)
deepseek_response = client.workflows.create_workflow(request)
deepseek_id = deepseek_response.id

In [96]:
# Now let's run the same thing, but with GPT-4o
request = WorkflowCreateRequest(
    name="GPT-4o",
    description="Summarize with GPT-4o",
    model="gpt-4o",
    model_url="gpt-4o",
    dataset=dataset_id,
    experiment_id=experiment_id,
    max_samples=1,
)
gpt4_response = client.workflows.create_workflow(request)
gpt4_id = gpt4_response.id

In [97]:
# For each one, wait until it's done and then grab the results
deepseek_results = get_results(deepseek_id)
gpt4_results = get_results(gpt4_id)

Waiting for workflow 8338c0686462470e9d68c36b1dd5dd66 to complete
Workflow 8338c0686462470e9d68c36b1dd5dd66 has completed
Waiting for workflow 7ccca9a2ccfc4c61bf8f8a98cbc737ea to complete
Workflow 7ccca9a2ccfc4c61bf8f8a98cbc737ea has completed


In [98]:
# All Done! Let's print out a few results
print("=" * 10)
print("Rouge-2 Scores")
print(f"Deepseek R1: {deepseek_results['avg_rouge']}")
print(f"GPT-4o: {gpt4_results['avg_rouge']}")
print("=" * 10)

print("=" * 10)
print("BertScore Scores")
print(f"Deepseek R1: {deepseek_results['avg_bertscore']}")
print(f"GPT-4o: {gpt4_results['avg_bertscore']}")
print("=" * 10)

print("=" * 10)
print("Meteor Scores")
print(f"Deepseek R1: {deepseek_results['avg_meteor']}")
print(f"GPT-4o: {gpt4_results['avg_meteor']}")
print("=" * 10)

# print the first example
print("=" * 10)
print("Deepseek R1")
print("=" * 10)
print(deepseek_results["predictions"][0]["choices"][0]["message"]["content"])
print("=" * 10)
print("GPT-4o")
print("=" * 10)
print(gpt4_results["predictions"][0]["choices"][0]["message"]["content"])

Rouge-2 Scores
Deepseek R1: 0.03
GPT-4o: 0.03
BertScore Scores
Deepseek R1: 0.82
GPT-4o: 0.81
Meteor Scores
Deepseek R1: 0.19
GPT-4o: 0.24
Deepseek R1
**Summary of "Gilmore Girls: Concert Interruptus" (Season 1, Episode 13):**

Lorelai and Rory prepare for a town charity rummage sale, clashing over Lorelai’s sentimental attachment to her eccentric wardrobe (e.g., a tasseled halter top). Meanwhile, Sookie secures coveted tickets to a Bangles concert, prompting Lorelai to invite Rory, Lane, and Sookie. Lane’s strict mother, Mrs. Kim, forbids her from attending, highlighting their generational and cultural conflict.  

At Chilton, Rory’s debate team (with Paris, Madeline, and Louise) studies at the Gilmore house, showcasing Paris’s intensity and Rory’s efforts to navigate high school dynamics. Lorelai impulsively offers the girls her concert tickets, sacrificing her front-row seats to foster Rory’s social connections.  

During the concert, Madeline and Louise recklessly leave with strang

## Conclusion

At least for summarizing Gilmore Girls episodes, the hype about DeepSeek R1 looks like it's real.

On the single episode evaluated here (`max_samples=1`), the quality of its summary rivals GPT-4o's: the automatic metrics (ROUGE, METEOR, BERTScore) are comparable, and a manual comparison confirms that the two summaries are similarly comprehensive.