---
title: "Ingredients extraction"
execute:
  freeze: true
title-block-banner: "#497D74"
description: Explore possibilities for ingredients extraction from a given input text.
format:
  html:
    code-fold: true
    code-tools: true
    number-sections: false
    toc: true
    toc-location: right
    toc-depth: 2
    toc-expand: 1
    callout-icon: true
    highlight-style: tango
    code-line-numbers: ayu
    embed-resources: true
    theme: flatly
    grid:
        body-width: 1000px
---

## Context

We will add a feature that handles recipe requests written in natural language.
An LLM will parse the request sentence to extract the ingredients that the user wants to include in or exclude
from the recipe. This list will then be fed to the recipe recommendation system. The goal of this notebook
is to explore how well LLMs perform at this task.

## Test dataset

We will use a dataset of 20 recipe requests. Each example contains:

- A sentence in which the user asks for a recipe.
- The list of ingredients that:
  - The user likes
  - The user doesn't like

This will be used as a benchmark. We will see if the model can accurately extract the ingredients from the sentence.

Here are two sample recipe requests with their expected ingredients:

In [None]:
import json
from pathlib import Path
from typing import Any

import pandas as pd
from great_tables import GT
from openai import OpenAI
from pydantic import BaseModel

TEST_FILE_PATH = Path.cwd().parent / "data" / "test_cases_request_to_ingredients.json"


class RecipeRequest(BaseModel):
    """A request for a recipe recommendation."""

    recipe_request: str
    positive_ingredients: list[str]
    negative_ingredients: list[str]


# Load dataset
with TEST_FILE_PATH.open(encoding="utf-8") as json_file:
    _test_recipes_requests_ingredients = json.load(json_file)

test_recipes_requests_ingredients = {k: RecipeRequest(**v) for k, v in _test_recipes_requests_ingredients.items()}

# Display a sample
for i, k in enumerate(test_recipes_requests_ingredients.keys()):
    if i >= 2:
        break
    print(test_recipes_requests_ingredients[k].model_dump_json(indent=2))

## Benchmark

To compare models locally, we serve them with Ollama and query them through the OpenAI SDK, which can talk to Ollama's OpenAI-compatible endpoint.
Ollama supports structured output, so the LLM's response can be parsed directly into a Pydantic model.
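
To make concrete what structured output provides: the reply is expected to be JSON carrying the two list fields of `Ingredients`, so consuming it reduces to parsing and a shape check. The sketch below mimics that step with the standard library only. `ParsedIngredients`, `parse_ingredients_json`, and the sample reply are hypothetical names and data introduced for illustration; the notebook itself relies on `client.beta.chat.completions.parse` with a Pydantic model.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ParsedIngredients:
    """Hypothetical stand-in for the notebook's Pydantic `Ingredients` model."""

    positive_ingredients: list[str] = field(default_factory=list)
    negative_ingredients: list[str] = field(default_factory=list)


def _as_str_list(value: object) -> list[str]:
    """Return the value if it is a list of strings, else an empty list."""
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    return []


def parse_ingredients_json(raw: str) -> ParsedIngredients:
    """Parse a raw JSON reply, falling back to empty lists when it is malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ParsedIngredients()
    if not isinstance(payload, dict):
        return ParsedIngredients()
    return ParsedIngredients(
        positive_ingredients=_as_str_list(payload.get("positive_ingredients")),
        negative_ingredients=_as_str_list(payload.get("negative_ingredients")),
    )


# Hypothetical reply, written by hand for illustration (not real model output).
sample_reply = '{"positive_ingredients": ["tomato"], "negative_ingredients": ["onion"]}'
parsed = parse_ingredients_json(sample_reply)
print(parsed.positive_ingredients, parsed.negative_ingredients)
```

With structured output enabled, the fallback branches should rarely trigger, which is why the notebook can treat a missing `parsed` value as an empty result.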

In [None]:
# | code-fold: false

class Ingredients(BaseModel):
    """Lists of ingredients associated to positive and negative feelings"""

    positive_ingredients: list[str]
    negative_ingredients: list[str]


def get_ingredients(recipe_request: str, llm_credentials: dict, llm_model: str) -> Ingredients:
    """Get the list of positive and negative ingredients for a given recipe request."""

    client = OpenAI(**llm_credentials)

    try:
        completion = client.beta.chat.completions.parse(
            temperature=0,
            model=llm_model,
            messages=[
                {
                    "role": "user",
                    "content": (
                        "Provide the list of positive and negative ingredients for the following recipe "
                        f"request in lowercase: {recipe_request}"
                    ),
                }
            ],
            response_format=Ingredients,
        )

        recipe_response = completion.choices[0].message
        if recipe_response.parsed:  # noqa: SIM108
            ingredients = recipe_response.parsed
        else:
            ingredients = Ingredients(positive_ingredients=[], negative_ingredients=[])
    except Exception as e:
        print(e)
        ingredients = Ingredients(positive_ingredients=[], negative_ingredients=[])

    return ingredients

In [None]:
def benchmark(
    test_dataset: dict[str, RecipeRequest], llm_credentials: dict, llm_model: str
) -> dict[str, dict[str, Any]]:
    """Test a list of ingredients with a given LLM."""
    benchmark_results = {}
    for _id, req in test_dataset.items():
        computed_result = get_ingredients(
            recipe_request=req.recipe_request,
            llm_credentials=llm_credentials,
            llm_model=llm_model,
        )
        expected_result = Ingredients(
            positive_ingredients=req.positive_ingredients, negative_ingredients=req.negative_ingredients
        )

        benchmark_results[_id] = {
            "recipe_request": req.recipe_request,
            "computed_result": computed_result,
            "expected_result": expected_result,
            "correct_inference": computed_result == expected_result,
        }

    return benchmark_results


def benchmark_overview(benchmark_results: dict[str, dict[str, Any]]) -> tuple[pd.DataFrame, GT]:
    """Build a dataframe of the benchmark results and a table displaying them."""
    NB_DISPLAYED_SAMPLES = 10

    df_result = pd.DataFrame(benchmark_results).T
    df_result["short_id"] = df_result.index.str[:8]
    table_result = (
        GT(df_result.head(NB_DISPLAYED_SAMPLES), rowname_col="short_id")
        .tab_header(
            title="Recipes ingredients extraction overview",
            subtitle=f"Accuracy on full dataset: {df_result.correct_inference.mean():.1%}",
        )
        .tab_source_note(source_note=f"Only the first {NB_DISPLAYED_SAMPLES} cases are displayed.")
        .tab_options(
            heading_background_color="#36665e",
            column_labels_background_color="#479487",
        )
        .cols_label(recipe_request="Request", computed_result="Computed ingredients", correct_inference="Correct")
        .cols_hide("expected_result")
        .fmt(lambda x: "✅" if x else "❌", columns=["correct_inference"])
        .opt_align_table_header()
        .cols_align(align="center", columns=["recipe_request", "computed_result"])
        .opt_table_outline()
    )

    return df_result, table_result
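
Note that `computed_result == expected_result` compares the ingredient lists element by element, so a response with the right ingredients in a different order counts as incorrect. If that turns out to be too strict, an order-insensitive comparison is one option. The sketch below is an assumption about how this could be done, not part of the benchmark above; `ingredient_sets_match` and `jaccard_score` are hypothetical helpers.

```python
def _normalize(items: list[str]) -> set[str]:
    """Lowercase and strip each ingredient, and drop duplicates."""
    return {item.strip().lower() for item in items}


def ingredient_sets_match(computed: list[str], expected: list[str]) -> bool:
    """Return True when both lists contain the same ingredients, in any order."""
    return _normalize(computed) == _normalize(expected)


def jaccard_score(computed: list[str], expected: list[str]) -> float:
    """Partial-credit score: size of the intersection over size of the union."""
    computed_set, expected_set = _normalize(computed), _normalize(expected)
    union = computed_set | expected_set
    if not union:
        # Both lists are empty, which is a perfect match.
        return 1.0
    return len(computed_set & expected_set) / len(union)


print(ingredient_sets_match(["tomato", "basil"], ["basil", "tomato"]))
print(jaccard_score(["tomato"], ["tomato", "basil"]))
```

A partial-credit score such as this one would also help distinguish a model that misses one ingredient from a model that returns nothing at all.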

Let's run the benchmark against a list of models.

In [None]:
# | code-fold: false


LLM_CREDENTIALS = {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
AVAILABLE_MODELS = ["smollm2:135m", "smollm2:360m", "llama3.2:1b", "gemma2:2b"]

models_result = {}
for model in AVAILABLE_MODELS:
    benchmark_results = benchmark(
        test_dataset=test_recipes_requests_ingredients,
        llm_credentials=LLM_CREDENTIALS,
        llm_model=model,
    )
    df_result, table_result = benchmark_overview(benchmark_results)

    models_result[model] = (df_result, table_result)

    print(f"Model: {model:<15} | Accuracy: {df_result.correct_inference.mean():.1%}")


`gemma2:2b` is the only model that reaches 100% accuracy on this test set. It is the largest model tested, but at 2B
parameters it is still small enough for fast inference. We will therefore use it in the recipe inference app.


In [None]:
USED_MODEL = "gemma2:2b"
models_result[USED_MODEL][1].tab_header(f"[ {USED_MODEL} ] Recipes ingredients extraction overview")

## Improvements

The current method struggles with imprecise ingredient descriptions. For example, if a user mentions disliking "meat," this general term won't match specific entries like "beef" or "chicken" in the recipe database. Similarly, a preference against "spicy food" won't align with specific spicy ingredients. To address this, we could implement a semantic search to identify the closest matching ingredient in the database and substitute the original term.
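
As a rough sketch of this idea, the lookup can be framed as a nearest-neighbour search by cosine similarity over embedding vectors. Everything below is illustrative: `closest_ingredients` and `toy_embed` are hypothetical names, and the three-dimensional vectors are written by hand to stand in for a real embedding model, which this notebook has not selected.

```python
import math
from collections.abc import Callable

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_ingredients(
    term: str,
    database: list[str],
    embed: Callable[[str], Vector],
    top_k: int = 1,
) -> list[str]:
    """Return the `top_k` database ingredients most similar to `term`."""
    term_vector = embed(term)
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(term_vector, embed(name)),
        reverse=True,
    )
    return ranked[:top_k]


# Hand-written toy vectors (assumption): a real system would call an embedding model.
TOY_VECTORS: dict[str, Vector] = {
    "meat": [1.0, 0.0, 0.0],
    "beef": [0.9, 0.1, 0.0],
    "chicken": [0.8, 0.2, 0.1],
    "carrot": [0.0, 1.0, 0.0],
    "chili": [0.1, 0.3, 0.9],
    "spicy food": [0.0, 0.2, 1.0],
}


def toy_embed(text: str) -> Vector:
    """Look up the toy vector for a known term."""
    return TOY_VECTORS[text]


database = ["beef", "chicken", "carrot", "chili"]
print(closest_ingredients("meat", database, toy_embed, top_k=2))
print(closest_ingredients("spicy food", database, toy_embed))
```

Returning several matches (`top_k > 1`) may suit general terms such as "meat", where excluding a single closest ingredient would not capture the user's intent.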

Additionally, the format of ingredients could be standardized. Currently, some ingredients are listed in plural form while others are singular. To maintain consistency with the recipe database, which uses singular forms, we could convert all ingredients to singular in either a two-step process or a single step using a more advanced model.
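
For the two-step option, the second step could be a small rule-based post-processing pass over the extracted lists. The sketch below is a deliberately naive assumption of what that might look like: `singularize` and `normalize_ingredients` are hypothetical helpers, the exception lists are far from complete, and English plurals have many cases these rules get wrong.

```python
# Words that end in "s" but are already singular (incomplete, illustrative list).
INVARIANT = {"hummus", "asparagus", "couscous", "molasses"}
# Irregular plurals that the suffix rules below would mishandle (also incomplete).
IRREGULAR = {"leaves": "leaf", "loaves": "loaf"}


def singularize(word: str) -> str:
    """Naively convert an English ingredient name to its singular form."""
    word = word.strip().lower()
    if word in INVARIANT or word.endswith("ss"):
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"  # berries -> berry
    if word.endswith("oes"):
        return word[:-2]  # tomatoes -> tomato
    if word.endswith(("ches", "shes", "xes", "sses")):
        return word[:-2]  # peaches -> peach
    if word.endswith("s"):
        return word[:-1]  # eggs -> egg
    return word


def normalize_ingredients(ingredients: list[str]) -> list[str]:
    """Apply `singularize` to every ingredient, preserving order."""
    return [singularize(ingredient) for ingredient in ingredients]


print(normalize_ingredients(["Tomatoes", "berries", "peaches", "eggs", "hummus", "rice"]))
```

Because only the end of the string is inspected, multi-word names such as "bell peppers" are handled through their last word. A dedicated inflection library or the LLM itself would likely be more reliable than these rules.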