# Create NER datasets

In this notebook, I create the training and test datasets that will be used to fine-tune a Named Entity Recognition (NER) spaCy pipeline.

The training dataset was built from the dataset available at https://www.kaggle.com/datasets/wilmerarltstrmberg/recipe-dataset-over-2m/data.
This is a very large dataset of recipes scraped from multiple websites. It contains recipe information, including keywords that can be used for NER, with the aim of recognising the ingredients needed for each recipe.

For this specific project, I only wanted to recognise "main" ingredients. I am not interested in spices, sauces or pantry items such as oil, but rather in fresh ingredients and staples such as pasta, rice and pulses. For this reason, for the training set, I chose 200 random recipes from a single website (epicurious.com) and annotated every ingredient I was interested in. For example, I annotated "tomatoes" but not "olive oil", so I expect the model to recognise "tomatoes" as an ingredient but not "olive oil".

To evaluate the NER model, I used my main dataset (the Ottolenghi recipes loaded at the end of this notebook). I randomly chose 20 recipes and repeated the same process, annotating the ingredients of interest and leaving the rest unlabelled.
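
To make the annotation scheme concrete, here is a minimal sketch of how a hand-picked ingredient could be turned into `(start, end, label)` character offsets, the kind of triple that can later be passed to spaCy's `Doc.char_span`. The helper `find_spans`, the label name `INGREDIENT` and the example line are assumptions for illustration only; the notebook does not say which tool or label was used for the manual annotation.

```python
import re

def find_spans(text, terms, label="INGREDIENT"):
    """Return sorted (start, end, label) character offsets for each term in text.

    `label` is a hypothetical entity label, not one taken from the notebook.
    """
    spans = []
    for term in terms:
        # Whole-word, case-insensitive match so "rice" is not found inside "price".
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

# Only "tomatoes" is annotated; "olive oil" is deliberately left unlabelled.
line = "2 large tomatoes, diced, plus 1 tablespoon olive oil"
spans = find_spans(line, ["tomatoes"])
print(spans)  # [(8, 16, 'INGREDIENT')]
```

Leaving "olive oil" out of the term list is what teaches the model to ignore pantry items: anything that is not annotated counts as a non-entity during training.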


In [32]:
from pathlib import Path

import pandas as pd
import numpy as np

import spacy
from spacy.tokens import Span, DocBin

In [33]:
BASE_DIR = Path().resolve().parent

def data_path(folder, file_name):
    return BASE_DIR / "data" / folder / file_name

raw_data = pd.read_csv(data_path("raw", "recipes_data_NER.csv"))
raw_data.site.unique()


In [34]:
# Let's keep data only from epicurious.com
subsample_recipes_df = raw_data[raw_data.site=='www.epicurious.com'].reset_index(drop=True)
subsample_recipes_df.iloc[0].ingredients

In [35]:
subsample_recipes_df.iloc[0].NER

In [36]:
subsample_recipes_df[['ingredients', 'NER']].to_csv(data_path("raw", "epicurious_recipes_df.csv"), index=False)

In [37]:
ingredients_df = pd.read_csv(data_path("raw", "epicurious_recipes_df.csv"))
type(ingredients_df['ingredients'].iloc[0])
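
The type check above presumably reports `str`: a list-like value stored in a CSV cell comes back as plain text, which is why the next cell parses it with `ast.literal_eval`. The sketch below reproduces the effect with the standard library only; the one-recipe table is made up for illustration and is not taken from the dataset.

```python
import ast
import csv
import io

# Hypothetical one-recipe table written to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ingredients"])
writer.writerow([["2 tomatoes", "1 cup rice"]])  # the list is stored via str()

# Read the table back: the cell is now a string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
cell = rows[1][0]
print(type(cell).__name__)  # str

# literal_eval safely turns the text back into a Python list.
parsed = ast.literal_eval(cell)
print(parsed)  # ['2 tomatoes', '1 cup rice']
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for parsing text that comes from an external file.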

In [38]:
import ast

np.random.seed(1)
idx = np.random.choice(ingredients_df.shape[0], 200, replace=False)
# Parse each stringified list of ingredients back into a Python list
ingredients_sub_df = ingredients_df.iloc[idx]['ingredients'].reset_index(drop=True).map(ast.literal_eval)

ingredients_sub_df.head()

In [39]:
ingredients_rows_df = ingredients_sub_df.explode(ignore_index=True)
ingredients_rows_df.head()

In [40]:
ingredients_rows_df = ingredients_rows_df.to_frame()

In [41]:
ingredients_rows_df.shape

In [42]:
# Tab-separated, because ingredient lines themselves contain commas
ingredients_rows_df.to_csv(data_path("interim", "selected_ingrendiens_NER_.csv"), index=False, sep='\t')

In [43]:
data_path("interim", "selected_ingrendiens_NER_.csv")
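
Note that the file written above is tab-separated even though its name ends in `.csv`, so whatever reads it back has to be told the delimiter. The sketch below shows what goes wrong otherwise, using the standard `csv` module and a made-up ingredient line; with pandas the equivalent would be passing `sep='\t'` to `pd.read_csv`.

```python
import csv
import io

# Hypothetical ingredient line containing a comma, written tab-separated.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerow(["1 onion, finely chopped"])

# Reading with the default comma delimiter splits the line in two.
buffer.seek(0)
wrong = next(csv.reader(buffer))
print(wrong)  # ['1 onion', ' finely chopped']

# Reading with the tab delimiter recovers the original single field.
buffer.seek(0)
right = next(csv.reader(buffer, delimiter="\t"))
print(right)  # ['1 onion, finely chopped']
```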

In [56]:
# Create dev dataset

In [57]:
# Get subsample from the ottolenghi dataset
np.random.seed(1)
dev_set_df = pd.read_json(data_path('interim', 'ottolenghi_recipes.json'))
# Sample 20 recipes without replacement, then put each ingredient on its own row
sample_idx = np.random.choice(dev_set_df.shape[0], 20, replace=False)
dev_set_df = dev_set_df.iloc[sample_idx].reset_index(drop=True).ingredients.explode()
dev_set_df.to_csv(data_path("interim", "dev_set_.csv"), index=False, sep='\t')
