---
title: "Recipe recommendation system"
title-block-banner: "#497D74"
description: Exploratory data analysis for a recipe recommendation system leveraging word embeddings and similarities.
format:
  html:
    code-fold: true
    code-tools: true
    number-sections: false
    toc: true
    toc-location: right
    toc-depth: 2
    toc-expand: 1
    callout-icon: true
    highlight-style: tango
    code-line-numbers: ayu
    embed-resources: true
    theme: flatly
    grid:
        body-width: 1000px
---

# Goal and context

The aim of this study is to explore word embeddings and similarity measures on recipes in order to create a content-based recommender system.
The dataset used is a sample of more than 6,000 recipes extracted from [RecipeDB](https://cosylab.iiitd.edu.in/recipedb/).


For each recipe, the name, ingredients and origin are provided. The recommender will suggest a list of recipes based on two user inputs:

- A list of ingredients liked by the user.
- A list of ingredients disliked by the user.

*Inspired by Duarte Carmo's [work](https://duarteocarmo.com/blog/scandinavia-food-python-recommendation-systems)*


## Import and check dataset

We start by importing the data and performing basic quality checks.

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
from great_tables import GT

pio.templates.default = "none"
pio.renderers.default = "notebook"

DATA_FOLDER = Path.cwd().parent / "data"
SEED = 42

df_recipes = pd.read_parquet(DATA_FOLDER / "recipe_db_raw.parquet")

In [None]:
df_recipes.head()

The dataset is a table comprised of 4 columns:

- **id:** the id of the recipe in the source database
- **name:** the name of the recipe
- **ingredients:** the list of ingredients used in the recipe, with a `|` as a separator
- **origin:** the origin of the recipe, with different levels with a `>>` as a separator (e.g. `African >> Middle Eastern >> Egyptian`)
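
As a small illustration of these two separators, the sketch below splits one row into a list of ingredients and a list of origin levels. The ingredient values are made up for the example (only the origin string comes from the description above), and `parse_recipe_row` is a hypothetical helper, not part of the notebook.

```python
def parse_recipe_row(ingredients: str, origin: str) -> tuple[list[str], list[str]]:
    """Split the raw `ingredients` and `origin` strings into clean lists."""
    ingredient_list = [item.strip() for item in ingredients.split("|") if item.strip()]
    origin_levels = [level.strip() for level in origin.split(">>") if level.strip()]
    return ingredient_list, origin_levels


# Hypothetical row: the ingredient values are invented for illustration.
ingredients, levels = parse_recipe_row(
    "onion | garlic | olive oil",
    "African >> Middle Eastern >> Egyptian",
)
print(ingredients)  # ['onion', 'garlic', 'olive oil']
print(levels)  # ['African', 'Middle Eastern', 'Egyptian']
```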

We will perform a basic health check to make sure that the data doesn't contain unexpected things.

In [None]:
df_desc = pd.concat(
    [
        df_recipes.dtypes.replace("object", "str"),
        df_recipes.notna().sum(),
        df_recipes.isna().sum(),
        (df_recipes == "").sum(),  # noqa: PLC1901
        pd.concat([df_recipes[col].astype(str).str.len() for col in df_recipes.columns], axis=1).mean().round(1),
        pd.concat([df_recipes[col].astype(str).str.len() for col in df_recipes.columns], axis=1).min(),
        pd.concat([df_recipes[col].astype(str).str.len() for col in df_recipes.columns], axis=1).max(),
    ],
    axis=1,
).reset_index()

df_desc.columns = (
    "Column",
    "Column type",
    "# non null values",
    "# None values",
    "# value with empty string",
    "# of characters (mean)",
    "# of characters (min)",
    "# of characters (max)",
)

(
    GT(df_desc, rowname_col="Column")
    .tab_header(
        title="Recipes dataset overview",
    )
    .tab_options(
        column_labels_font_weight="bold",
        stub_font_weight="bold",
    )
)

**Observations:**

- There are very few missing values in the `ingredients` column.
- The column data types are as expected.
- The distribution of string lengths in each column is consistent.

--> The data doesn't seem to have obvious issues.

## Preprocessing

Some preprocessing must be done on the ingredients and origin columns in order to gain insight and remove / replace specific characters.

In [None]:
df_recipes = df_recipes.dropna().reset_index(drop=True)

df_recipes["origin"] = df_recipes["origin"].str.replace("\n", "").str.replace(" ", "")

df_recipes["origin_0"] = df_recipes["origin"].str.split(">>").apply(lambda origin_split: origin_split[0])
df_recipes["origin_1"] = df_recipes["origin"].str.split(">>").apply(lambda origin_split: origin_split[1])
df_recipes["origin_2"] = df_recipes["origin"].str.split(">>").apply(lambda origin_split: origin_split[2])

df_recipes["ingredients_str"] = df_recipes.ingredients.str.replace(" | ", ", ", regex=False)

df_recipes["origin"] = df_recipes["origin"].str.replace(">>", " > ")

## Preliminary EDA


### Recipes origin

In [None]:
px.bar(
    (df_recipes.origin_2.value_counts(ascending=True)).tail(35),
    title="Nb of recipes per origin",
    orientation="h",
    height=700,
    width=1100,
).update_xaxes(title="#").update_yaxes(title="Origin")  # horizontal bars: counts on x, origins on y

Some recipe origins seem over-represented; we'll use the Lorenz curve to quantify this more precisely.

In [None]:
recipes_per_origin = df_recipes.origin_2.value_counts(ascending=True).rename("nb_recipes").reset_index()

recipes_per_origin["rank"] = np.arange(len(recipes_per_origin)) + 1
recipes_per_origin["cum_nb_recipes"] = recipes_per_origin["nb_recipes"].cumsum()

recipes_per_origin["percent_rank"] = recipes_per_origin["rank"] / len(recipes_per_origin["rank"])
# Cumulative share of recipes: cumulative count divided by the total number of recipes
recipes_per_origin["percent_cum_nb_recipes"] = (
    recipes_per_origin["cum_nb_recipes"] / recipes_per_origin["nb_recipes"].sum()
)

recipes_per_origin["percent_rank"] = (recipes_per_origin["percent_rank"] * 100).round(2)
recipes_per_origin["percent_cum_nb_recipes"] = (recipes_per_origin["percent_cum_nb_recipes"] * 100).round(2)

px.scatter(
    recipes_per_origin,
    x="percent_rank",
    y="percent_cum_nb_recipes",
    hover_data=recipes_per_origin.columns,
    title="Lorenz curve of # recipes per origin",
    width=1100,
)

**Observations:**

The top 5% of recipe origins account for almost 40% of the recipes in the dataset.

> The origins of the recipes in the dataset are unbalanced, which could lead the recommender to suggest some kinds of food too often. This could be taken into account in the recommender system, for example by adding a heuristic to filter out some regions.
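
For reference, a Lorenz curve sorts the groups from smallest to largest and plots the cumulative share of recipes against the cumulative share of origins. The sketch below shows that computation with NumPy on invented counts; `lorenz_curve` and `gini` are hypothetical helper names, and the Gini coefficient is added here only as one possible single-number summary of the imbalance.

```python
import numpy as np


def lorenz_curve(counts):
    """Return cumulative share of groups and cumulative share of recipes."""
    sorted_counts = np.sort(np.asarray(counts, dtype=float))  # smallest groups first
    cum_share = np.cumsum(sorted_counts) / sorted_counts.sum()
    group_share = np.arange(1, len(sorted_counts) + 1) / len(sorted_counts)
    # Prepend the (0, 0) point so the curve starts at the origin.
    return np.insert(group_share, 0, 0.0), np.insert(cum_share, 0, 0.0)


def gini(counts):
    """Gini coefficient: 1 - 2 * area under the Lorenz curve (trapezoid rule)."""
    x, y = lorenz_curve(counts)
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2)
    return 1 - 2 * area


toy_counts = [1, 1, 2, 6]  # invented recipe counts for four origins
x, y = lorenz_curve(toy_counts)
print(y)  # cumulative share of recipes at each step
print(round(gini(toy_counts), 3))  # 0.4
```

A perfectly balanced dataset would give a Gini coefficient of 0, with the curve lying on the diagonal.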

### Ingredients

In [None]:
ingredients_per_recipe = df_recipes.ingredients.str.replace(" | ", "|", regex=False).str.split("|")
full_ingredients = pd.Series([element for list_ in ingredients_per_recipe for element in list_])

In [None]:
NB_INGREDIENTS_DISPLAY = 35

# Name the index and the counts explicitly so the column names do not depend on the pandas version
ingredients_occurrence = (
    full_ingredients.value_counts(ascending=True)
    .rename_axis("ingredient")
    .rename("num_occurrences")
    .reset_index()
    .assign(prop_occurences=lambda df: (df.num_occurrences / len(df_recipes) * 100))
    .sort_values("num_occurrences", ascending=True)
)
px.bar(
    ingredients_occurrence.reset_index(drop=True).tail(NB_INGREDIENTS_DISPLAY).round(1),
    x="num_occurrences",
    y="ingredient",
    hover_data=ingredients_occurrence.columns,
    title=f"# ingredients occurrences in recipes (Top {NB_INGREDIENTS_DISPLAY} by occurrence)",
    orientation="h",
    height=700,
    width=1100,
).update_xaxes(title="# occurrences").update_yaxes(title="Ingredient")

In [None]:
tmp = ingredients_occurrence.iloc[[*np.arange(5), *(np.arange(-5, 0))]].reset_index(drop=True)

(
    GT(tmp, rowname_col="ingredient")
    .tab_header(
        title="Top 5 and bottom 5 ingredients by occurrence",
    )
    .fmt_percent(
        "prop_occurences",
        decimals=2,
        scale_values=False,
    )
    # .fmt_nanoplot(columns="prop_occurences", plot_type="bar")
    .data_color(
        columns=[
            "prop_occurences",
            "num_occurrences",
        ],
        palette="RdBu",
        reverse=True,
        alpha=0.5,
    )
)

In [None]:
print(
    "List of ingredients containing the word milk, by occurrence: "
    + ", ".join(
        [
            f"{d['ingredient']} ({d['num_occurrences']})"
            for d in ingredients_occurrence[ingredients_occurrence.ingredient.str.contains("milk")]
            .iloc[::-1]
            .to_dict(orient="records")
        ]
    )
)

**Observations:**

- The ingredients are very unbalanced: some are present in almost 40% of the recipes, while others appear in only one recipe.
- The top 5 ingredients are basic and therefore not very discriminative.
- Some people may be intolerant to some of the most common ingredients (e.g. milk, ginger). This could be taken into account and used as a filter.
- The milk example shows that some ingredients have many variants, some of which are not really distinct ingredients (e.g. cold / hot milk).

Our approach based on vector embeddings and similarity search should be able to group similar ingredients together. However, the user might want only one specific ingredient and not its variants (for instance, they might want oat milk but not rice milk). This is a possible limitation of the solution, which could be mitigated by combining it with hard filters.
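
To make the idea of a hard filter concrete, here is a minimal sketch that excludes recipes containing an exact ingredient, so that excluding rice milk does not also exclude oat milk. The recipes are invented and `passes_hard_filter` is a hypothetical helper; it assumes the `|`-separated ingredient format described earlier.

```python
def passes_hard_filter(ingredients_str: str, excluded: set[str]) -> bool:
    """True if none of the recipe's ingredients is exactly in `excluded`."""
    recipe_ingredients = {item.strip().lower() for item in ingredients_str.split("|")}
    return recipe_ingredients.isdisjoint(excluded)


# Invented recipes for illustration.
toy_recipes = {
    "porridge": "oat milk | oats | honey",
    "rice pudding": "rice milk | rice | sugar",
}
excluded = {"rice milk"}

kept = [name for name, ingr in toy_recipes.items() if passes_hard_filter(ingr, excluded)]
print(kept)  # ['porridge']
```

Matching whole ingredients rather than substrings is a design choice: it is stricter, but it avoids removing every recipe whose ingredients merely contain the word "milk".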

## Embedding

We will compute an embedding for each recipe from its list of ingredients. We will use a small embedding model
that ranks well on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) from HuggingFace.
The choice of embedding model is a possible source of improvement in our pipeline, and several models could be tested in a refinement phase.


In [None]:
# | code-fold: false
from sentence_transformers import SentenceTransformer

# Import model
EMBEDDING_MODEL = "intfloat/multilingual-e5-small"
model = SentenceTransformer(EMBEDDING_MODEL)


# Create embeddings
sentences_embeddings = model.encode(df_recipes.ingredients_str.to_list())
embedding_matrix = sentences_embeddings

## Embedding visualisation

We reduce the embeddings' dimensionality with UMAP in order to get an overview of the embedding properties.

In [None]:
# | code-fold: false
import umap

# Reduce dimensionality
umap_fit = umap.UMAP(random_state=SEED)
umap_matrix = umap_fit.fit_transform(embedding_matrix)

df_recipes["umap_comp_0"] = umap_matrix[:, 0]
df_recipes["umap_comp_1"] = umap_matrix[:, 1]

In [None]:
df_recipes_subset = df_recipes.sample(5000, random_state=SEED).sort_values("id")

px.scatter(
    df_recipes_subset,
    x="umap_comp_0",
    y="umap_comp_1",
    color="origin_0",
    hover_data=["name", "ingredients", "origin"],
    width=800,
    height=800,
)


**Observations:**

We can see some clusters related to the recipe origin, which is a good sign: recipes from the same origin usually
have similar ingredients, and the origin can be a strong signal for taste.
As a sanity check, we will look in more detail at a given type of dish.

In [None]:
df_recipes_subset["is_pasta"] = df_recipes.name.str.contains("pasta", case=False)
df_recipes_subset["is_pizza"] = df_recipes.name.str.contains("pizza", case=False)
df_recipes_subset["is_salad"] = df_recipes.name.str.contains("salad", case=False)

px.scatter(
    df_recipes_subset,
    x="umap_comp_0",
    y="umap_comp_1",
    color="is_pasta",
    hover_data=["name", "ingredients", "origin"],
    width=800,
    height=800,
)

**Observations:**

Pasta recipes are mostly clustered together. Some deviations come from Asian recipes, which makes sense as they use
specific ingredients and tastes.

According to these sanity checks, the embeddings seem to be a good representation of the recipes, so we can
use similarity metrics on them in order to create a recommender system.
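
As a reminder of what such a similarity metric looks like, the sketch below computes pairwise cosine similarity with NumPy. The vectors are invented two-dimensional stand-ins for real embeddings, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # each row now has length 1
    return unit @ unit.T


# Toy 2-D "embeddings" (invented): rows 0 and 1 point in similar directions.
toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim.round(3))
```

Rows 0 and 1 get a similarity close to 1, while rows 0 and 2 are orthogonal and get a similarity of 0, regardless of the vectors' lengths.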

## Liked and disliked recipes visualisation

We'll work on a semi-supervised method:

1. Ask the user for their affinity with a few ingredients.
2. Use this information to label the recipes containing those ingredients.
3. Train a classification algorithm on those labelled recipes.
4. Use this algorithm to predict the user's affinity with the rest of the recipes.
5. Use these predictions to recommend recipes.


Here is an example based on a list of liked and disliked ingredients:

In [None]:
# | code-fold: false
liked_ingredients = [
    "salt",
    "onion",
    "butter",
    "olive oil",
    "egg",
    "soy sauce",
    "vegetable oil",
    "green onion",
    "lemon juice",
    "cream",
    "lime juice",
    "purpose flour",
    "beef",
]


disliked_ingredients = [
    "tomato",
    "garlic",
    "sugar",
    "black pepper",
    "cilantro",
    "cumin",
    "ginger",
    "milk",
    "cinnamon",
    "pepper",
    "salt pepper",
]

df_recipes["liked"] = 0

for ingredient in liked_ingredients:
    df_recipes["liked"] += df_recipes["ingredients"].str.contains(ingredient).astype("int")

for ingredient in disliked_ingredients:
    df_recipes["liked"] -= df_recipes["ingredients"].str.contains(ingredient).astype("int")


df_recipes["liked_bool"] = df_recipes["liked"] / df_recipes["liked"].abs().replace(0, 1)

df_recipes["labeled"] = df_recipes.liked_bool != 0

df_recipes.liked_bool.value_counts()

The liked and disliked labels are fairly evenly distributed.
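
The labelling rule used above (number of liked matches minus number of disliked matches, then the sign) can be illustrated on a toy example. The sketch uses plain substring matching, which is what `str.contains` amounts to for these plain-word patterns; the recipes and the shortened ingredient lists are invented, and `net_score` and `sign` are hypothetical helpers.

```python
def net_score(ingredients: str, liked: list[str], disliked: list[str]) -> int:
    """Liked matches minus disliked matches, using substring matching."""
    n_liked = sum(term in ingredients for term in liked)
    n_disliked = sum(term in ingredients for term in disliked)
    return n_liked - n_disliked


def sign(value: int) -> int:
    """Return 1, -1 or 0 depending on the sign of `value`."""
    return (value > 0) - (value < 0)


liked = ["butter", "egg"]
disliked = ["pepper", "garlic"]

# Invented ingredient strings for three recipes.
examples = ["butter | egg | garlic", "black pepper | garlic", "rice | water"]
labels = [sign(net_score(recipe, liked, disliked)) for recipe in examples]
print(labels)  # [1, -1, 0]
```

Note that with substring matching, `pepper` also matches `black pepper`, so overlapping terms in the lists are counted more than once.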

In [None]:
px.scatter(
    df_recipes,
    x="umap_comp_0",
    y="umap_comp_1",
    color="liked",
    hover_data=["name", "ingredients_str", "origin", "liked"],
    color_continuous_scale=px.colors.diverging.RdBu_r,
    title="Liked ingredients score",
    width=800,
    height=800,
).update_layout({"plot_bgcolor": "#E0E0E0"})

Some clusters seem to appear based on our list of liked and disliked ingredients.

## KNN

In order to label unlabelled data, we can define an affinity score based on the labels of the k nearest neighbours.
The faiss library is used because of its high computing efficiency.
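
The scoring idea can be sketched without faiss: an exhaustive inner-product search over L2-normalised vectors ranks neighbours by cosine similarity, and the affinity score is the mean label of the k nearest labelled vectors. The NumPy version below is only an illustration on invented data (`knn_affinity` is a hypothetical helper); it does not replace the faiss index used in this notebook.

```python
import numpy as np


def knn_affinity(labeled, labels, unlabeled, k):
    """Mean label of the k most cosine-similar labeled vectors."""
    labeled_unit = labeled / np.linalg.norm(labeled, axis=1, keepdims=True)
    unlabeled_unit = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
    similarities = unlabeled_unit @ labeled_unit.T  # inner product of unit vectors = cosine
    nearest = np.argsort(-similarities, axis=1)[:, :k]  # indices of the k largest similarities
    return labels[nearest].mean(axis=1)


# Toy data (invented): two liked vectors near the x-axis, two disliked near the y-axis.
labeled = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([1.0, 1.0, -1.0, -1.0])
unlabeled = np.array([[1.0, 0.05], [0.05, 1.0]])

scores = knn_affinity(labeled, labels, unlabeled, k=2)
print(scores)  # [ 1. -1.]
```

A score of 1 means all k neighbours are liked, -1 means all are disliked, and values in between reflect mixed neighbourhoods.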

In [None]:
# | code-fold: false
import faiss

NB_NEIGHBORS = 10

# get labeled and unlabeled embeddings
labeled_embeddings = embedding_matrix[df_recipes.labeled, :]
unlabeled_embeddings = embedding_matrix[~df_recipes.labeled, :]


# create faiss index
vector_dimension = labeled_embeddings.shape[1]
index = faiss.IndexFlatIP(vector_dimension)


# normalize and add labeled_embeddings
faiss.normalize_L2(labeled_embeddings)
index.add(labeled_embeddings)


# normalize and query unlabeled_embeddings
faiss.normalize_L2(unlabeled_embeddings)


r_distances, r_indexes = index.search(unlabeled_embeddings, k=NB_NEIGHBORS)

In [None]:
# | code-fold: false

df_labeled = df_recipes[df_recipes.liked_bool != 0].reset_index(drop=True)
df_unlabeled = df_recipes[df_recipes.liked_bool == 0].reset_index(drop=True)

neighbors_id_columns = [f"neighbor_{i}_id" for i in range(NB_NEIGHBORS)]
neighbors_liked_columns = [f"neighbor_{i}_liked_bool_value" for i in range(NB_NEIGHBORS)]

affinity = pd.DataFrame(r_indexes, index=df_unlabeled.id, columns=neighbors_id_columns)

for i in range(NB_NEIGHBORS):
    affinity[f"neighbor_{i}_liked_bool_value"] = affinity[f"neighbor_{i}_id"].map(df_labeled.liked_bool)


affinity["liked_estimated_score"] = affinity[neighbors_liked_columns].mean(axis=1)

df_recipes["liked_estimated_score"] = df_recipes["id"].map(affinity.liked_estimated_score)

## Affinity visualisation

### Affinity on embedding scatterplot

In [None]:
px.scatter(
    df_recipes[~df_recipes.labeled],
    x="umap_comp_0",
    y="umap_comp_1",
    color="liked_estimated_score",
    hover_data=["name", "ingredients_str", "origin", "liked"],
    color_continuous_scale=px.colors.diverging.RdBu_r,
    title="Estimation of affinity score on recipes without label (i.e. without initial list of ingredients)",
    width=800,
    height=800,
).update_layout({"plot_bgcolor": "#E0E0E0"})

### Sample of recipes with high estimated affinity

In [None]:
from pprint import pp

for recipe in (
    df_recipes[df_recipes.liked_estimated_score == 1][["name", "ingredients_str"]]
    .sample(5, random_state=SEED)
    .to_dict(orient="records")
):
    # Print the recipe name as a header, then its ingredients
    print(f"{recipe['name']}:")
    pp(recipe["ingredients_str"])
    print("\n")

Based on my tastes, the results seem very coherent. A more refined benchmark approach could be to define several pairs of liked and disliked ingredient lists along with a sample of recipes, rank the recipes manually, and estimate whether the recommender system is able to rank them in the same order.
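
One possible way to compare two rankings in such a benchmark is a rank correlation. The sketch below computes Spearman's correlation with NumPy for scores without ties; the scores are invented, `spearman_rank_correlation` is a hypothetical helper, and the choice of this metric is an assumption rather than something evaluated in this study.

```python
import numpy as np


def spearman_rank_correlation(reference_scores, predicted_scores):
    """Spearman correlation for scores without ties (Pearson correlation of ranks)."""
    reference_ranks = np.argsort(np.argsort(reference_scores))
    predicted_ranks = np.argsort(np.argsort(predicted_scores))
    return float(np.corrcoef(reference_ranks, predicted_ranks)[0, 1])


# Invented example: a user's own scores for five recipes vs. the recommender's estimates.
user_scores = [5, 3, 4, 1, 2]
estimated_scores = [0.9, 0.6, 0.2, -0.8, -0.1]

print(round(spearman_rank_correlation(user_scores, estimated_scores), 2))  # 0.9
```

A value close to 1 would indicate that the recommender orders the recipes almost like the user does; here two recipes are swapped, which lowers the correlation to 0.9.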

## Save data

We save the embeddings to use them as is for fast inference in the app.

In [None]:
df_features = df_recipes[
    [
        "id",
        "name",
        "origin_2",
        "ingredients_str",
        "umap_comp_0",
        "umap_comp_1",
    ]
]


df_embedding = pd.DataFrame(
    embedding_matrix,
    columns=[f"embedding_feature_{i}" for i in range(embedding_matrix.shape[1])],
)

df_features = df_features.join(df_embedding)
df_features = df_features.dropna()
df_features["name"] = df_features["name"].str.title()
df_features["ingredients_str"] = df_features["ingredients_str"].str.title()

df_features = df_features.rename(columns={"name": "Name", "ingredients_str": "Ingredients", "origin_2": "Origin"})
df_features["Link"] = "https://cosylab.iiitd.edu.in/recipedb/search_recipeInfo/" + df_features["id"].astype("str")

df_features.sample().T.iloc[:10, :]

df_features.to_parquet(DATA_FOLDER / "recipe_db.parquet", index=False)

print("Processed dataset saved")