# Create NER datasets

In this notebook, I create the training and test datasets that will be used to fine-tune a spaCy Named Entity Recognition (NER) pipeline.

The training dataset was created from the dataset available at https://www.kaggle.com/datasets/wilmerarltstrmberg/recipe-dataset-over-2m/data
This is a very large dataset that contains recipe information, including keywords that can be used for NER, with the aim of recognising the ingredients needed for recipes. The dataset contains recipes scraped from multiple websites.

For this project, I only wanted to recognise "main" ingredients: I am not interested in spices, sauces or pantry items such as oil, but rather in fresh ingredients and staples such as pasta, rice and pulses. For the training set, I therefore chose 400 random recipes from one website (epicurious.com) and annotated every ingredient I was interested in. For example, I annotated "tomatoes" but not "olive oil", so I expect the model to recognise "tomatoes" as an ingredient but not "olive oil".

To evaluate the NER model, I used my main dataset (the Ottolenghi recipes loaded further down). I randomly chose 20 recipes and repeated the same process, annotating the ingredients of interest and leaving the rest blank.
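
The annotation scheme described above can be expressed as character-offset spans, which is the shape NER training data commonly takes. The following is a minimal sketch, not the annotation workflow used for this project: the helper `find_spans` and the label name `INGREDIENT` are hypothetical and do not appear elsewhere in this notebook.

```python
import re


def find_spans(text, annotated_terms, label="INGREDIENT"):
    """Return sorted (start, end, label) character offsets for each annotated term in text.

    Hypothetical helper for illustration; the label name is an assumption.
    """
    spans = []
    for term in annotated_terms:
        # Whole-word, case-insensitive match, so "oil" would not match inside "boiled".
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


line = "5 pounds ripe plum tomatoes"
spans = find_spans(line, ["tomatoes"])

# "olive oil" was deliberately not annotated, so this line yields no spans.
oil_spans = find_spans("1/2 cup extra-virgin olive oil", ["tomatoes"])
```

Lines that contain only unannotated items produce an empty span list, which is how the "leave the rest blank" rule shows up in the data.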


In [1]:
from pathlib import Path

import pandas as pd
import numpy as np

import spacy
from spacy.tokens import Span, DocBin

In [2]:
BASE_DIR = Path().resolve().parent

def data_path(folder, file_name):
    return Path(BASE_DIR) / f"data/{folder}/{file_name}"

raw_data = pd.read_csv(data_path("raw", "recipes_data_NER.csv"))
raw_data.site.unique()


array(['www.cookbooks.com', 'www.allrecipes.com', 'www.food.com',
       'recipes-plus.com', 'www.epicurious.com', 'food52.com',
       'www.myrecipes.com', 'www.seriouseats.com', 'www.tasteofhome.com',
       'tastykitchen.com', 'www.yummly.com', 'cookeatshare.com',
       'www.foodnetwork.com', 'cookpad.com', 'www.kraftrecipes.com',
       'online-cookbook.com', 'www.lovefood.com', 'www.landolakes.com',
       'cooking.nytimes.com', 'allrecipes.com', 'www.foodgeeks.com',
       'www.cookstr.com', 'recipeland.com', 'www.vegetariantimes.com',
       'www.delish.com', 'www.foodandwine.com', 'www.chowhound.com',
       'www.foodrepublic.com'], dtype=object)

In [3]:
# Let's keep data only from epicurious.com
subsample_recipes_df = raw_data[raw_data.site=='www.epicurious.com'].reset_index(drop=True)
subsample_recipes_df.iloc[0].ingredients

'["Makes about 3 quarts of sauce", "5 pounds ripe plum tomatoes", "3-1/2 pounds firm eggplants", "1/2 cup extra-virgin olive oil", "3 cups finely chopped onions, about 1-1/4 pounds", "1/4 cup finely chopped garlic", "2 teaspoons salt, plus more to taste", "1/2 teaspoon peperoncino (or to your taste)", "3 or 4 large branches fresh basil with leaves"]'

In [4]:
subsample_recipes_df.iloc[0].NER

'["eggplants", "extra-virgin olive oil", "tomatoes", "of sauce", "onions", "peperoncino", "garlic", "fresh basil", "salt"]'

In [5]:
subsample_recipes_df[['ingredients', 'NER']].to_csv(data_path("raw", "epicurious_recipes_df.csv"), index=False)

In [6]:
ingredients_df = pd.read_csv(data_path("raw", "epicurious_recipes_df.csv"))
type(ingredients_df['ingredients'].iloc[0])

str
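
The `str` result above shows that each ingredient list is stored in the CSV as a string, which is why the next cell parses it with `ast.literal_eval`. A small standalone sketch of that parsing step, using a made-up two-item value in the same format as the `ingredients` column:

```python
import ast

# Illustrative value in the same format as the `ingredients` column (not read from the dataset).
raw_value = '["5 pounds ripe plum tomatoes", "1/4 cup finely chopped garlic"]'

# literal_eval only evaluates Python literals (lists, strings, numbers, ...),
# so it is a safer choice than eval for turning the string back into a list.
parsed = ast.literal_eval(raw_value)
```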

In [7]:
import ast

# Sample 400 recipes reproducibly and parse each stringified list back into a Python list
np.random.seed(0)
idx = np.random.choice(ingredients_df.shape[0], 400, replace=False)
ingredients_sub_df = ingredients_df.iloc[idx]['ingredients'].reset_index(drop=True).map(ast.literal_eval)

ingredients_sub_df.head()

0    [1 large red onion, halved lengthways, thinly ...
1    [6tsp feta cheese, 1/3 cup shredded fresh spin...
2    [3 tablespoons vegetable oil, 1 large onion, s...
3    [1 lb. mixed montery jack and cheddar cheeses,...
4    [left over meat, beef, pork, ham, or chicken. ...
Name: ingredients, dtype: object

In [8]:
ingredients_rows_df = ingredients_sub_df.explode(ignore_index=True)
ingredients_rows_df.head()

0    1 large red onion, halved lengthways, thinly s...
1                              2 small green jalapenos
2                                 2/3 cup rice vinegar
3                              1 tablespoon lime juice
4                           1 heaped teaspoon sea salt
Name: ingredients, dtype: object

In [9]:
ingredients_rows_df = ingredients_rows_df.to_frame()

In [10]:
ingredients_rows_df.shape

(3929, 1)

In [11]:
# The file is tab-separated despite its .csv extension
ingredients_rows_df.to_csv(data_path("interim", "selected_ingrendiens_NER_large.csv"), index=False, sep='\t')

In [12]:
data_path("interim", "selected_ingrendiens_NER_large.csv")

PosixPath('/Users/mariakalimeri/Documents/cooking_seasonally/data/interim/selected_ingrendiens_NER_large.csv')
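
The file above is written with `sep='\t'`. Many ingredient lines contain commas, so a tab delimiter keeps each line in a single unquoted field. Whether that was the reason for the choice is an assumption. The sketch below uses the standard-library `csv` module and an in-memory buffer rather than the notebook's files, and shows what happens when such a file is read back with the wrong delimiter:

```python
import csv
import io

rows = [
    ["ingredients"],
    ["1 large red onion, halved lengthways, thinly sliced"],
    ["2/3 cup rice vinegar"],
]

# Write the rows tab-separated into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerows(rows)

# Reading with the matching delimiter recovers one field per row.
buffer.seek(0)
as_tsv = list(csv.reader(buffer, delimiter="\t"))

# Reading with the default comma delimiter splits the first ingredient line apart.
buffer.seek(0)
as_csv = list(csv.reader(buffer))
```

When loading the file with pandas, the same delimiter has to be passed, for example `pd.read_csv(path, sep='\t')`.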

Add recipes from Ottolenghi as well: 20 randomly chosen recipes form the test set.

In [14]:
# Sample 20 Ottolenghi recipes and write one ingredient line per row
np.random.seed(0)
test_set_df = pd.read_json(data_path('interim', 'ottolenghi_recipes.json'))
test_idx = np.random.choice(test_set_df.shape[0], 20, replace=False)
test_set_df = test_set_df.iloc[test_idx].reset_index(drop=True).ingredients.explode()
test_set_df.to_csv(data_path("interim", "ottolenghi_test_set_.csv"), index=False, sep='\t')

In [56]:
# Create dev dataset

In [None]:
# Get a subsample of 20 recipes from the Ottolenghi dataset.
# Note: this draw (seed 1) is independent of the test-set draw (seed 0), so the two sets may share recipes.
np.random.seed(1)
dev_set_df = pd.read_json(data_path('interim', 'ottolenghi_recipes.json'))
dev_idx = np.random.choice(dev_set_df.shape[0], 20, replace=False)
dev_set_df = dev_set_df.iloc[dev_idx].reset_index(drop=True).ingredients.explode()
dev_set_df.to_csv(data_path("interim", "dev_set.csv"), index=False, sep='\t')
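
The test and dev sets are each drawn with a separate `np.random.choice` call on the same file, so nothing prevents a recipe from appearing in both. If a disjoint split is wanted, one option is to shuffle the row indices once and slice the result. This is a sketch under that assumption, not what the cells above do, and `n_recipes` is a placeholder for the number of rows in the Ottolenghi frame.

```python
import numpy as np

n_recipes = 100  # placeholder; in the notebook this would be the row count of the recipes frame

# One shuffled ordering of all row indices, then two non-overlapping slices.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_recipes)
test_idx = shuffled[:20]
dev_idx = shuffled[20:40]
```

The two index arrays can then be passed to `.iloc` in the same way as the sampled indices in the cells above.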
