## Named Entity Recognition

Named Entity Recognition (NER) is used to identify and extract ingredients and their quantities from recipe texts. This technique is essential for automating ingredient extraction, which can be applied to tasks like cuisine classification, recipe recommendations, and ingredient-based analysis.

In [3]:
import pandas as pd
import numpy as np
import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher

## Load Data
Loading and preprocessing of the recipe data is the first step.
The dataset contains the ground-truth NER labels, which identify the ingredients of each recipe, alongside the `normalized_combined` feature, which holds the combined, normalized recipe text obtained after preprocessing.

In [None]:
train_df = pd.read_csv('preprocessed_train_data.csv')

In [None]:
print(f'Length of train data is {len(train_df)}')
train_df.head()

Length of train data is 103271


Unnamed: 0,NER,normalized_combined
0,"['sugar', 'vanilla', 'graham cracker crumbs', 'egg whites', 'nuts', 'coconut']",coconut crunch pie 4 egg whites 1 cup nuts 1 cup sugar 1 cup coconut 1 cup graham cracker crumbs 1 teaspoon vanilla beat egg whites until frothy add sugar beat for 1 minute add other ingredients pour into buttered pie pan bake at 350 for 30 minutes top with sliced bananas and cool whip
1,"['creme fraiche', 'lobsters', 'lemon juice', 'leeks', 'sherry vinegar', 'freshly ground black pepper', 'red jalapeno chile', 'red bell pepper', 'lime juice', 'white wine', 'kosher salt', 'yellow heirloom tomatoes', 'orange bell pepper', 'fresh cilantro']",heirloom tomato gazpacho with lobster 8 large yellow heirloom tomatoes about 4 pound peeled and seeded 1 orange bell pepper quartered 1 red bell pepper quartered 1 large or 2 small leeks sliced 1 teaspoon finely chopped red jalapeno chile 0.5 cup sherry vinegar 0.25 cup white wine 1 tablespoon kosher salt 0.5 teaspoon freshly ground black pepper 0.75 cup chopped fresh cilantro divided 2 1.25 pound lobsters steamed and shelled 0.5 cup mexican crema creme fraiche or sour cream 1 tablespoon fresh lime juice 1 tablespoon fresh lemon juice process first 9 ingredients in batches in a blender until smooth stir in 0.5 cup chopped cilantro chill at least 1 hour or up to 1 day slice lobster tails into medallions and split each claw into 2 pieces by cutting across the flat side combine crema lime juice and lemon juice in a small bowl divide gazpacho among 6 bowls top evenly with lobster crema mixture and remaining 0.25 cup cilantro
2,"['lemon juice', 'celery', 'red apples', 'mayonnaise', 'walnuts', 'light cream']",waldorf salad serves 4 3 red apples 2 tablespoon lemon juice 1 cup sliced celery 0.5 cup broken walnuts 0.5 cup mayonnaise 2 tablespoon light cream or milk core and dice apples sprinkle lemon juice add celery and walnuts toss to mix
3,"['pineapple', 'yellow cake']",pineapple cake 1 yellow cake mix 1 large can crushed pineapple make cake as directions indicate or make a cake from scratch let cake cool punch holes in top and spoon pineapple over cake juice and all spread icing recipe following over pineapple
4,"['vinegar', 'sugar', 'vanilla', 'eggs', 'pecans', 'butter', 'dish pie shell']",pecan pie 1 cup sugar 1 stick butter 2 eggs 1 teaspoon vinegar 1 teaspoon vanilla 1 cup chopped pecans 1 deep dish pie shell melt butter mix with sugar add eggs one at a time beating after each blend in vinegar vanilla and pecans pour into deep dish pie shell bake at 350 for 40 45 minutes very quick and easy


In [None]:
texts = train_df['normalized_combined'].tolist()
texts[0]

'coconut crunch pie 4 egg whites 1 cup nuts 1 cup sugar 1 cup coconut 1 cup graham cracker crumbs 1 teaspoon vanilla beat egg whites until frothy add sugar beat for 1 minute add other ingredients pour into buttered pie pan bake at 350 for 30 minutes top with sliced bananas and cool whip'

In [None]:
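import ast

# Parse each stringified label list safely (literals only, no arbitrary code execution)
ner_tags = train_df['NER'].apply(ast.literal_eval).tolist()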
ner_tags[0]

['sugar', 'vanilla', 'graham cracker crumbs', 'egg whites', 'nuts', 'coconut']
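
The `NER` column is read from CSV as text, so each label list arrives as a string that has to be parsed. A minimal sketch of that parsing with `ast.literal_eval`, which accepts only Python literals; the sample string is copied from the first row shown above, and `parse_labels` is a hypothetical helper name, not part of the notebook.

```python
import ast

def parse_labels(raw):
    """Parse a stringified list of labels into a real list of strings."""
    labels = ast.literal_eval(raw)
    if not isinstance(labels, list):
        raise ValueError(f"expected a list literal, got {type(labels).__name__}")
    return labels

raw = "['sugar', 'vanilla', 'graham cracker crumbs', 'egg whites', 'nuts', 'coconut']"
labels = parse_labels(raw)
print(labels[2])
```

Unlike `eval`, this raises an error on anything that is not a plain literal, which is a safer default when the CSV comes from an external source.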

## Creating Patterns for Ingredient Extraction:

Based on the provided NER labels, ingredient patterns are created with spaCy's PhraseMatcher, and quantities are captured with a token-based Matcher rule. Together these identify ingredient and quantity entities in the recipe texts.  
Each match is mapped to a span (start and end positions) in the text so that it can be stored as an entity.

In [None]:
nlp = spacy.blank("en")

# Create a dictionary of terms
terms = {}
patterns = []

for tags in ner_tags:
    for tag in tags:
        if tag not in terms and tag!='mix':
            terms[tag] = {'label': 'INGREDIENT'}
            patterns.append(nlp(tag))

# Initialize the PhraseMatcher
ingredient_matcher = PhraseMatcher(nlp.vocab)
ingredient_matcher.add("INGREDIENT", patterns)  # spaCy v3 signature: key, then a list of pattern Docs

In [None]:
patterns[:20]

[sugar,
 vanilla,
 graham cracker crumbs,
 egg whites,
 nuts,
 coconut,
 creme fraiche,
 lobsters,
 lemon juice,
 leeks,
 sherry vinegar,
 freshly ground black pepper,
 red jalapeno chile,
 red bell pepper,
 lime juice,
 white wine,
 kosher salt,
 yellow heirloom tomatoes,
 orange bell pepper,
 fresh cilantro]
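
To make the matching step concrete, here is a rough pure-Python stand-in for what phrase matching does: it scans token n-grams and reports those that appear in the term list. This is only an illustration on whitespace tokens, not spaCy's implementation; `find_phrases` and the short term list are hypothetical.

```python
def find_phrases(text, terms):
    """Return (start_token, end_token, phrase) for every term found in text."""
    tokens = text.split()
    term_tokens = {tuple(term.split()) for term in terms}
    max_len = max((len(t) for t in term_tokens), default=0)
    matches = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            if tuple(tokens[start:end]) in term_tokens:
                matches.append((start, end, " ".join(tokens[start:end])))
    return matches

terms = ["egg whites", "sugar", "graham cracker crumbs"]
text = "4 egg whites 1 cup sugar 1 cup graham cracker crumbs"
matches = find_phrases(text, terms)
for start, end, phrase in matches:
    print(start, end, phrase)
```

Every occurrence of a listed term is reported, so very generic terms match widely. That is presumably why `mix` is excluded when the patterns are built.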

In [None]:
# Check that the pipeline has no components at this stage
nlp.analyze_pipes()

{'summary': {}, 'problems': {}, 'attrs': {}}

In [24]:
# Quantity extractor component
@Language.component("quantity_extractor")
def quantity_extractor(doc):
    # Extract quantities
    matcher = Matcher(nlp.vocab)
    pattern = [
        {"LIKE_NUM": True},  # Match numbers
        {"LIKE_NUM": True, "OP": "?"},  # Match the second optional number (e.g., 8)
        {"LOWER": {"IN": ["cup", "tablespoon", "teaspoon", "ounce", "pound", "gram", "kilogram", "package", "quart", "liter", "milliliter"]}}
    ]
    matcher.add("QUANTITY", [pattern])
    matches = matcher(doc)
    quantity_spans = [Span(doc, start, end, label="QUANTITY") for match_id, start, end in matches]

    filtered_spans = spacy.util.filter_spans(quantity_spans)
    # Drop any existing QUANTITY entities so they are not duplicated
    new_ents = [ent for ent in doc.ents if ent.label_ != "QUANTITY"]

    # Add the unique quantity spans to the new_ents list
    doc.ents = new_ents + filtered_spans  # Add unique quantity spans

    return doc
nlp.add_pipe("quantity_extractor", last=True)  # Quantity extractor runs first
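
The quantity pattern reads as "a number, an optional second number, then a unit". A simplified sketch of the same rule on whitespace tokens, assuming that anything `float()` can parse counts as a number (spaCy's `LIKE_NUM` is broader and also accepts number words); `find_quantities` and the shortened unit set are hypothetical.

```python
UNITS = {"cup", "tablespoon", "teaspoon", "ounce", "pound", "gram", "quart"}

def is_number(token):
    """Rough stand-in for LIKE_NUM: True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def find_quantities(text):
    """Return quantity phrases, preferring 'number number unit' over 'number unit'."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        if is_number(tokens[i]):
            # Longest form first: two numbers followed by a unit, e.g. "1 8 ounce"
            if (i + 2 < len(tokens) and is_number(tokens[i + 1])
                    and tokens[i + 2].lower() in UNITS):
                found.append(" ".join(tokens[i:i + 3]))
                i += 3
                continue
            # Short form: one number followed by a unit, e.g. "0.5 cup"
            if i + 1 < len(tokens) and tokens[i + 1].lower() in UNITS:
                found.append(" ".join(tokens[i:i + 2]))
                i += 2
                continue
        i += 1
    return found

quantities = find_quantities("1 8 ounce bottle russian dressing 0.5 cup ripe olives 2 tomatoes")
print(quantities)
```

A bare count such as `2 tomatoes` is not captured because no unit follows the number, which matches the behavior of the pattern in the component.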


In [25]:
# Ingredient extractor component
@Language.component("ingredient_extractor")
def ingredient_extractor(doc):
    # Extract ingredients after quantity extraction
    matches = ingredient_matcher(doc)
    spans = [Span(doc, start, end, label='INGREDIENT') for match_id, start, end in matches]

    unique_spans = []

    # Check if the ingredient overlaps with any quantity span before adding
    quantity_spans = [ent for ent in doc.ents if ent.label_ == "QUANTITY"]

    for span in spans:

        overlap_found = False

        # Check if the ingredient span overlaps with any quantity span
        for quantity in quantity_spans:
            if span.start < quantity.end and span.end > quantity.start:  # If overlap occurs
                overlap_found = True
                break  # Skip adding this ingredient if it overlaps with a quantity

        # Add ingredient span if there's no overlap with any quantity span
        if not overlap_found:
            unique_spans.append(span)


    # Resolve overlaps and filter out duplicate ingredient spans
    filtered_spans = spacy.util.filter_spans(unique_spans)
    new_ents = [ent for ent in doc.ents if ent.label_ != "INGREDIENT"]
    doc.ents = new_ents + filtered_spans  # Add unique ingredient spans, excluding overlapping ones

    return doc

nlp.add_pipe("ingredient_extractor", last=True)  # Ingredient extractor runs second
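
The ingredient component applies two filters: it drops candidates that overlap a quantity span, then resolves overlaps among the remaining candidates. A small sketch of both steps on plain `(start, end)` token ranges; `keep_longest` mirrors the documented longest-span preference of `spacy.util.filter_spans` but is a hypothetical stand-in, and the ranges are made up for illustration.

```python
def overlaps(a, b):
    """True if half-open ranges a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and a[1] > b[0]

def keep_longest(spans):
    """Greedy filter: prefer longer spans, then earlier ones; drop overlaps."""
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for span in ordered:
        if not any(overlaps(span, other) for other in kept):
            kept.append(span)
    return sorted(kept)

quantity = (3, 5)                              # hypothetical tokens "1 cup"
candidates = [(4, 5), (5, 6), (5, 8), (6, 8)]  # hypothetical ingredient matches

# Step 1: discard candidates that touch the quantity span
no_quantity_clash = [s for s in candidates if not overlaps(s, quantity)]
# Step 2: among the rest, keep the longest non-overlapping spans
result = keep_longest(no_quantity_clash)
print(result)
```

Because ranges are half-open, a span that starts exactly where the quantity ends, such as `(5, 6)`, does not count as overlapping.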



In [None]:
# Confirm that the quantity and ingredient extractor components were added to the pipeline
nlp.analyze_pipes()

{'summary': {'quantity_extractor': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'ingredient_extractor': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'quantity_extractor': [], 'ingredient_extractor': []},
 'attrs': {}}

## Annotating Data for Training:

The raw text is annotated with start and end positions of ingredients and their corresponding labels.
Annotations are used to create training data in the SpaCy format and saved as a .spacy file (training_data.spacy), which is later used for model training.

In [None]:
from spacy.tokens import DocBin

train_data = [(text, {"entities": []}) for text in texts]

for i, (text, annotations) in enumerate(train_data):
    doc = nlp(text)
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    train_data[i] = (text, {"entities": entities})


In [None]:
train_data[10]

('venison taco salad 1 pound ground venison fried and crumbled 1 can kidney beans drained 0.25 cup chopped onion 1 small can chopped green chilies 8 ounce cheddar cheese grated 1 bag taco chips crumbled 1 8 ounce bottle russian dressing 2 tomatoes diced 0.5 cup ripe olives sliced 1 head lettuce torn into bite size pieces toss all ingredients just before serving add chips last lo-cal russian dressing can be used if you like',
 {'entities': [(0, 7, 'INGREDIENT'),
   (8, 12, 'INGREDIENT'),
   (13, 18, 'INGREDIENT'),
   (19, 26, 'QUANTITY'),
   (27, 41, 'INGREDIENT'),
   (52, 60, 'INGREDIENT'),
   (67, 79, 'INGREDIENT'),
   (88, 96, 'QUANTITY'),
   (105, 110, 'INGREDIENT'),
   (113, 118, 'INGREDIENT'),
   (123, 136, 'INGREDIENT'),
   (137, 144, 'INGREDIENT'),
   (145, 152, 'QUANTITY'),
   (153, 167, 'INGREDIENT'),
   (177, 180, 'INGREDIENT'),
   (181, 191, 'INGREDIENT'),
   (192, 200, 'INGREDIENT'),
   (201, 210, 'QUANTITY'),
   (226, 234, 'INGREDIENT'),
   (237, 245, 'INGREDIENT'),
   (25

In [None]:
def save_training_data(data, output_file):
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for text, annotations in data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    doc_bin.to_disk(output_file)

save_training_data(train_data, 'training_data.spacy')
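
Before training on the saved file, the `(start_char, end_char, label)` tuples can be spot-checked by slicing the text with each offset pair. A minimal sketch using the first few entities of the `train_data[10]` output shown earlier; `show_entities` is a hypothetical helper.

```python
def show_entities(text, entities):
    """Return (surface_text, label) pairs for (start_char, end_char, label) tuples."""
    return [(text[start:end], label) for start, end, label in entities]

text = "venison taco salad 1 pound ground venison fried and crumbled"
entities = [(0, 7, "INGREDIENT"), (8, 12, "INGREDIENT"), (13, 18, "INGREDIENT"),
            (19, 26, "QUANTITY"), (27, 41, "INGREDIENT")]

pairs = show_entities(text, entities)
for surface, label in pairs:
    print(f"{label:<10} {surface}")
```

Here the title words `taco` and `salad` come out labeled as INGREDIENT, presumably because they also occur in the term list. Checks like this make such annotation noise visible.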

## Training the SpaCy Model:

A blank SpaCy model is trained using the annotated data.
Training uses mini-batches whose size compounds from 4 up to a maximum of 32, growing by a factor of 1.5 per batch, with a dropout rate of 0.5.
An early stopping check is also applied: training runs for at most 10 iterations and stops sooner if the loss improves by less than 5000 between two consecutive iterations.
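
The `compounding(4.0, 32.0, 1.5)` call in the training cell produces a growing series of batch sizes. A minimal sketch, assuming the documented behavior (start at the first value, multiply by the third on each step, cap at the second); `compounding_sizes` is a hypothetical stand-in, not spaCy's own function.

```python
from itertools import islice

def compounding_sizes(start, stop, factor):
    """Yield start, start*factor, start*factor**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= factor

sizes = list(islice(compounding_sizes(4.0, 32.0, 1.5), 8))
print(sizes)
```

The series reaches its cap of 32 after six steps, so with a dataset of this size almost every batch is full-sized.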


In [None]:
import random
import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding
from spacy.tokens import DocBin
import os

# Load the training data
nlp = spacy.blank("en")  # Using a blank 'en' model
db = DocBin().from_disk("training_data.spacy")
docs = list(db.get_docs(nlp.vocab))

# Create the NER component and add it to the pipeline
if 'ner' not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)

# Add the labels to the NER component
for doc in docs:
    for ent in doc.ents:
        ner.add_label(ent.label_)

# Disable other pipes during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.initialize()  # spaCy v3 name for the deprecated begin_training()

    max_iterations = 10  # Total iterations
    min_loss_improvement = 5000  # Minimum loss improvement to continue training
    prev_loss = None  # Track the previous loss

    # Iterate over shuffled mini-batches (processed sequentially, not in parallel)
    for itn in range(max_iterations):
        random.shuffle(docs)
        losses = {}
        batches = minibatch(docs, size=compounding(4.0, 32.0, 1.5))

        for batch in batches:
            # Examples from docs and update the model
            examples = [Example.from_dict(doc, {"entities": [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]}) for doc in batch]

            nlp.update(examples, drop=0.5, losses=losses)

        print(f"Iteration {itn}, Losses: {losses}")

        # Check for improvement in loss
        if prev_loss is not None:
            improvement = prev_loss - losses['ner']
            print(f"Improvement in this iteration: {improvement}")

            # Early stopping if the improvement is small
            if improvement < min_loss_improvement:
                print("Stopping early due to small improvement.")
                break

        # Update previous loss for next iteration
        prev_loss = losses['ner']

    # Save the trained model to disk
    nlp.to_disk("ner_model")


Iteration 0, Losses: {'ner': 855710.3838120555}
Iteration 1, Losses: {'ner': 449087.37117298826}
Improvement in this iteration: 406623.0126390672
Iteration 2, Losses: {'ner': 380580.9409768345}
Improvement in this iteration: 68506.43019615376
Iteration 3, Losses: {'ner': 344451.89294495125}
Improvement in this iteration: 36129.048031883256
Iteration 4, Losses: {'ner': 322777.2335223893}
Improvement in this iteration: 21674.659422561934
Iteration 5, Losses: {'ner': 308388.9426392976}
Improvement in this iteration: 14388.290883091744
Iteration 6, Losses: {'ner': 296030.89556097606}
Improvement in this iteration: 12358.047078321513
Iteration 7, Losses: {'ner': 286466.773712112}
Improvement in this iteration: 9564.121848864073
Iteration 8, Losses: {'ner': 279011.7287182981}
Improvement in this iteration: 7455.044993813906
Iteration 9, Losses: {'ner': 273060.43795126514}
Improvement in this iteration: 5951.290767032944
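
The early stopping rule can be replayed against the logged losses to see whether it would have fired. A small sketch with the loss values rounded from the log above; `early_stop_iteration` is a hypothetical helper that applies the same comparison as the training loop.

```python
def early_stop_iteration(losses, min_improvement):
    """Return the first iteration whose improvement is below the threshold, else None."""
    for itn in range(1, len(losses)):
        if losses[itn - 1] - losses[itn] < min_improvement:
            return itn
    return None

# Losses rounded from the training log
logged = [855710.38, 449087.37, 380580.94, 344451.89, 322777.23,
          308388.94, 296030.90, 286466.77, 279011.73, 273060.44]

stop_at_5000 = early_stop_iteration(logged, 5000)
stop_at_8000 = early_stop_iteration(logged, 8000)
print(stop_at_5000, stop_at_8000)
```

With the threshold of 5000 the rule never triggers in this run, since the smallest improvement is about 5951, so all ten iterations complete. A threshold of 8000 would have stopped training at iteration 8.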


## Prediction on Test Data:

After training, the model is used to predict ingredients and quantities on preprocessed test data.  
The predicted entities are extracted for further analysis, such as cuisine classification.

In [5]:
test_df = pd.read_csv('/content/drive/MyDrive/test_data.csv')
test_df.head()

Unnamed: 0,test_recipe,actual_NER
0,sour cream green beans 0.5 cup thinly sliced o...,"[""sour cream"", ""onion soup"", ""onion"", ""mushroo..."
1,great fluff 7 ounce marshmallows 2 tablespoon ...,"[""milk"", ""brown sugar"", ""marshmallows"", ""unsal..."
2,fruit punch 1 can frozen orange juice 1 can fr...,"[""sugar"", ""water"", ""pineapple juice"", ""orange ..."
3,crabmeat casserole serves 10 1 pt mayonnaise 1...,"[""fresh crabmeat"", ""stuffing mix"", ""eggs"", ""on..."
4,georgia peach-glazed pork roast 1 2 to 2.5 pou...,"[""pork loin roast"", ""cinnamon"", ""peach preserv..."


In [7]:
nlp = spacy.load("/content/drive/MyDrive/ner_model")

def testNER(recipe):
    """Return the sets of ingredient and quantity strings predicted for one recipe."""
    predicted_ingredients = set()
    predicted_quantities = set()
    doc = nlp(recipe)
    for ent in doc.ents:
        if ent.label_ == 'INGREDIENT':
            predicted_ingredients.add(ent.text)
        elif ent.label_ == 'QUANTITY':
            predicted_quantities.add(ent.text)

    return predicted_ingredients, predicted_quantities

test_df['predicted_ingredients'], test_df['predicted_quantities'] = zip(*test_df['test_recipe'].apply(testNER))



In [None]:
pd.set_option('display.max_colwidth',None)
test_df.head()

Unnamed: 0,test_recipe,actual_NER,predicted_ingredients,predicted_quantities
0,sour cream green beans 0.5 cup thinly sliced onion 8 ounce sliced mushrooms 0.25 cup butter 2 10 ounce packages green beans 1 cup sour cream 1 5 ounce package dry onion soup mix salt and pepper to taste 1 cup shredded cheese any kind you like 1 cup breadcrumbs pre heat oven to 350 . saute onions and mushrooms in butter in a bowl combine beans sour cream onion soup mix salt pepper mix lightly fold in cheese pour into a greased 2 quart casserole mix melted butter with bread crumbs top the beans when mixed bake for 20 25 until bubbly,"[""sour cream"", ""onion soup"", ""onion"", ""mushrooms"", ""shredded cheese"", ""breadcrumbs"", ""green beans"", ""butter"", ""salt""]","{saute, bake, a, bowl combine, mix, cheese, pepper, in, salt pepper, mushrooms, for, mixed, bread crumbs, onion, shredded cheese, pour, beans, sour cream, onions, onion soup mix, breadcrumbs, butter, oven, salt, packages, like, green beans, kind}","{1 cup, 0.5 cup, 0.25 cup, 2 quart, 1 5 ounce, 2 10 ounce, 8 ounce}"
1,great fluff 7 ounce marshmallows 2 tablespoon unsalted butter 12 cup brown sugar 1 cup evaporated milk first add sugar marshmallows and milk to a pot then wait till boiling when done bowling add in the marshmallows finally wait till the marshmallows melt in the bowling liquid then enjoy you yummy tasting fluff,"[""milk"", ""brown sugar"", ""marshmallows"", ""unsalted butter""]","{then, a, fluff, sugar, first, boiling, brown sugar, evaporated milk, in, enjoy, milk, unsalted butter, marshmallows, liquid}","{7 ounce, 1 cup, 12 cup, 2 tablespoon}"
2,fruit punch 1 can frozen orange juice 1 can frozen lemonade 4 cup sugar 1 quart pineapple juice 6 quart water 2 large ginger ale 6 small bottles 7 up thaw juices mix first 5 ingredients and stir until sugar is dissolved add ginger ale and 7 up fills large punch bowl,"[""sugar"", ""water"", ""pineapple juice"", ""orange juice"", ""frozen lemonade"", ""ginger ale""]","{fruit, mix, up, frozen lemonade, sugar, bowl, small, ingredients, first, large, orange juice, ginger ale, bottles, pineapple juice, is, water, frozen}","{1 quart, 4 cup, 6 quart}"
3,crabmeat casserole serves 10 1 pt mayonnaise 1 pt cream 1 pound fresh crabmeat 2 tablespoon chopped parsley 2 tablespoon chopped onion 3 hard-boiled eggs coarsely chopped 0.5 cup sherry taylor dry 1 package stuffing mix bake 45 minutes at 300 mix all together except stuffing and place on casserole dish you can place stuffing mix on top of mixture then bake 45 minutes at 300,"[""fresh crabmeat"", ""stuffing mix"", ""eggs"", ""onion"", ""sherry"", ""mayonnaise"", ""parsley"", ""cream""]","{bake, mix, mixture, all, mayonnaise, at, onion, stuffing, stuffing mix, fresh crabmeat, then bake, dish, place, of, sherry, together, parsley, hard-boiled eggs, cream, crabmeat, minutes}","{0.5 cup, 1 package, 1 pound, 2 tablespoon}"
4,georgia peach-glazed pork roast 1 2 to 2.5 pound boneless pork loin roast 0.5 cup peach preserves 1 tablespoon dijon mustard 0.5 teaspoon cinnamon heat oven to 325 coat a roasting pan with cooking spray,"[""pork loin roast"", ""cinnamon"", ""peach preserves"", ""mustard""]","{boneless pork loin roast, dijon mustard, cinnamon, a, cooking spray, peach, roasting pan, pork roast, peach preserves, oven}","{1 tablespoon, 2.5 pound, 0.5 cup, 0.5 teaspoon}"
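
A quick way to read the table above is to split each recipe's predictions into correct, spurious, and missed terms with set operations. A minimal sketch using the values of row 1 (`great fluff`); `compare` is a hypothetical helper and the comparison is on exact strings only.

```python
def compare(predicted, actual):
    """Split predictions into correct, spurious, and missed terms (exact string match)."""
    return {
        "correct": sorted(predicted & actual),
        "spurious": sorted(predicted - actual),
        "missed": sorted(actual - predicted),
    }

actual = {"milk", "brown sugar", "marshmallows", "unsalted butter"}
predicted = {"then", "a", "fluff", "sugar", "first", "boiling", "brown sugar",
             "evaporated milk", "in", "enjoy", "milk", "unsalted butter",
             "marshmallows", "liquid"}

report = compare(predicted, actual)
for key, values in report.items():
    print(f"{key}: {values}")
```

For this recipe every actual ingredient is found, but ten extra terms such as `then` and `in` are also predicted, which points to precision rather than recall as the weaker side.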


In [9]:
pd.set_option('display.max_colwidth', None)
test_df.to_csv('test_results.csv', index=False)
test_df['predicted_ingredients'].to_csv('predicted_ingredients.csv', index=False)
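
The `predicted_ingredients` column holds Python sets, which are written to CSV as their string representation. If the file needs to be read back later, one option (an assumption, not something the notebook does) is to store each set as a sorted JSON list. `encode_terms` and `decode_terms` are hypothetical helpers.

```python
import json

def encode_terms(terms):
    """Serialize a set of strings as a sorted JSON list for stable CSV output."""
    return json.dumps(sorted(terms))

def decode_terms(cell):
    """Restore the set from its JSON list representation."""
    return set(json.loads(cell))

original = {"milk", "brown sugar", "marshmallows"}
cell = encode_terms(original)
restored = decode_terms(cell)
print(cell)
```

Sorting before serialization keeps the output deterministic, since set iteration order is not guaranteed to be the same between runs.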

## Evaluation of Model Performance:

The model’s performance is evaluated by comparing the predicted ingredients with the ground-truth labels from the test data.  
Both the predicted and the actual NER values are first converted into character-offset annotations, the format spaCy expects.  
The annotations are produced by matching the predicted (or actual) terms against the raw recipe text and labeling each matched span.  
spaCy’s Scorer then computes precision, recall, and F1 scores over these spans. Because the QUANTITY spans come from the same rule-based `quantity_extractor` on both sides, the INGREDIENT scores are the ones that reflect the model’s extraction quality.

In [8]:
len(test_df)

30000

## Preparing the data for predicted ingredient values for evaluation
Creating annotation for the predicted values

In [None]:

texts = test_df['test_recipe'].tolist()
ner_tags = test_df['predicted_ingredients'].tolist()

nlp = spacy.blank("en")

# Create a dictionary of terms
terms = {}
patterns = []

for tags in ner_tags:
    for tag in tags:
        if tag not in terms and tag!='mix':
            terms[tag] = {'label': 'INGREDIENT'}
            patterns.append(nlp(tag))

# Initialize the PhraseMatcher
ingredient_matcher = PhraseMatcher(nlp.vocab)
ingredient_matcher.add("INGREDIENT", patterns)  # spaCy v3 signature: key, then a list of pattern Docs

In [None]:
len(patterns)

25272

In [None]:
patterns[:10]

[onion soup mix,
 mushrooms,
 pour,
 breadcrumbs,
 bowl combine,
 shredded cheese,
 salt,
 pepper,
 bread crumbs,
 mixed]

In [None]:
# Add the quantity_extractor component to the pipeline
nlp.add_pipe("quantity_extractor", last=True)

<function __main__.quantity_extractor(doc)>

In [None]:
# Add the ingredient_extractor component to the pipeline
nlp.add_pipe("ingredient_extractor", last=True)

<function __main__.ingredient_extractor(doc)>

In [None]:
nlp.analyze_pipes()

{'summary': {'quantity_extractor': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'ingredient_extractor': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'quantity_extractor': [], 'ingredient_extractor': []},
 'attrs': {}}

In [None]:
#Creating annotations for predicted ingredient values

from spacy.tokens import DocBin

predicted_data = [(text, {"entities": []}) for text in texts]

for i, (text, annotations) in enumerate(predicted_data):
    doc = nlp(text)
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    predicted_data[i] = (text, {"entities": entities})

In [None]:
predicted_data[1]

('great fluff 7 ounce marshmallows 2 tablespoon unsalted butter 12 cup brown sugar 1 cup evaporated milk first add sugar marshmallows and milk to a pot then wait till boiling when done bowling add in the marshmallows finally wait till the marshmallows melt in the bowling liquid then enjoy you yummy tasting fluff',
 {'entities': [(6, 11, 'INGREDIENT'),
   (12, 19, 'QUANTITY'),
   (20, 32, 'INGREDIENT'),
   (33, 45, 'QUANTITY'),
   (46, 61, 'INGREDIENT'),
   (62, 68, 'QUANTITY'),
   (69, 80, 'INGREDIENT'),
   (81, 86, 'QUANTITY'),
   (87, 102, 'INGREDIENT'),
   (103, 108, 'INGREDIENT'),
   (113, 118, 'INGREDIENT'),
   (119, 131, 'INGREDIENT'),
   (136, 140, 'INGREDIENT'),
   (144, 145, 'INGREDIENT'),
   (150, 154, 'INGREDIENT'),
   (165, 172, 'INGREDIENT'),
   (195, 197, 'INGREDIENT'),
   (202, 214, 'INGREDIENT'),
   (237, 249, 'INGREDIENT'),
   (255, 257, 'INGREDIENT'),
   (270, 276, 'INGREDIENT'),
   (277, 281, 'INGREDIENT'),
   (282, 287, 'INGREDIENT'),
   (306, 311, 'INGREDIENT')]})

## Preparing the data for actual NER labels to compare with the predicted labels
Creating annotation for the actual NER values

In [None]:
texts = test_df['test_recipe'].tolist()
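import ast

# Parse each stringified label list safely (literals only, no arbitrary code execution)
ner_tags = test_df['actual_NER'].apply(ast.literal_eval).tolist()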

nlp = spacy.blank("en")

# Create a dictionary of terms
terms = {}
patterns = []

for tags in ner_tags:
    for tag in tags:
        if tag not in terms and tag!='mix':
            terms[tag] = {'label': 'INGREDIENT'}
            patterns.append(nlp(tag))

# Initialize the PhraseMatcher
ingredient_matcher = PhraseMatcher(nlp.vocab)
ingredient_matcher.add("INGREDIENT", patterns)  # spaCy v3 signature: key, then a list of pattern Docs

In [None]:
patterns[:10]

[sour cream,
 onion soup,
 onion,
 mushrooms,
 shredded cheese,
 breadcrumbs,
 green beans,
 butter,
 salt,
 milk]

In [None]:
# Add the quantity_extractor component to the pipeline
nlp.add_pipe("quantity_extractor", last=True)

<function __main__.quantity_extractor(doc)>

In [None]:
# Add the ingredient_extractor component to the pipeline
nlp.add_pipe("ingredient_extractor", last=True)

<function __main__.ingredient_extractor(doc)>

In [None]:
nlp.analyze_pipes()

{'summary': {'quantity_extractor': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'ingredient_extractor': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'quantity_extractor': [], 'ingredient_extractor': []},
 'attrs': {}}

In [None]:
#Creating annotations for actual NER labels
from spacy.tokens import DocBin

actual_data = [(text, {"entities": []}) for text in texts]

for i, (text, annotations) in enumerate(actual_data):
    doc = nlp(text)
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    actual_data[i] = (text, {"entities": entities})

In [None]:
actual_data[1]

('great fluff 7 ounce marshmallows 2 tablespoon unsalted butter 12 cup brown sugar 1 cup evaporated milk first add sugar marshmallows and milk to a pot then wait till boiling when done bowling add in the marshmallows finally wait till the marshmallows melt in the bowling liquid then enjoy you yummy tasting fluff',
 {'entities': [(12, 19, 'QUANTITY'),
   (20, 32, 'INGREDIENT'),
   (33, 45, 'QUANTITY'),
   (46, 61, 'INGREDIENT'),
   (62, 68, 'QUANTITY'),
   (69, 80, 'INGREDIENT'),
   (81, 86, 'QUANTITY'),
   (87, 102, 'INGREDIENT'),
   (113, 118, 'INGREDIENT'),
   (119, 131, 'INGREDIENT'),
   (136, 140, 'INGREDIENT'),
   (146, 149, 'INGREDIENT'),
   (150, 154, 'INGREDIENT'),
   (165, 172, 'INGREDIENT'),
   (195, 197, 'INGREDIENT'),
   (202, 214, 'INGREDIENT'),
   (237, 249, 'INGREDIENT'),
   (255, 257, 'INGREDIENT'),
   (270, 276, 'INGREDIENT'),
   (277, 281, 'INGREDIENT')]})

In [None]:
#Evaluating the predicted ingredients against the actual labels

from spacy.training import Example
from spacy.scorer import Scorer

examples = []

# Iterate over both predicted and actual data to create Example objects
for actual, predicted in zip(actual_data, predicted_data):
    text = actual[0]

    # Actual annotations (ground truth)
    actual_anns = actual[1]

    # Predicted annotations
    predicted_anns = predicted[1]

    # Doc object for the actual (ground truth) annotations
    doc_actual = nlp.make_doc(text)
    # Example object with the actual annotations
    example = Example.from_dict(doc_actual, actual_anns)

    # Doc object for the predicted annotations
    doc_predicted = nlp.make_doc(text)

    # List of Span objects for the predicted entities
    predicted_spans = []
    for start, end, label in predicted_anns["entities"]:
        span = doc_predicted.char_span(start, end, label=label)
        if span is not None:
            predicted_spans.append(span)

    # Set predicted entities in the doc_predicted object
    doc_predicted.ents = predicted_spans

    # Set the predicted entities in the Example object
    example.predicted = doc_predicted

    # Append the Example object to the list
    examples.append(example)

# Evaluate the model with these examples
scorer = Scorer()
metrics = scorer.score(examples)

# Print evaluation results
print(metrics)


{'token_acc': 1.0, 'token_p': 1.0, 'token_r': 1.0, 'token_f': 1.0, 'sents_p': None, 'sents_r': None, 'sents_f': None, 'tag_acc': None, 'pos_acc': None, 'morph_acc': None, 'morph_micro_p': None, 'morph_micro_r': None, 'morph_micro_f': None, 'morph_per_feat': None, 'dep_uas': None, 'dep_las': None, 'dep_las_per_type': None, 'ents_p': 0.782902706363453, 'ents_r': 0.9016767679389511, 'ents_f': 0.838102556338916, 'ents_per_type': {'INGREDIENT': {'p': 0.7533481928528357, 'r': 0.885935777993243, 'f': 0.8142800505642817}, 'QUANTITY': {'p': 1.0, 'r': 1.0, 'f': 1.0}}, 'cats_score': 0.0, 'cats_score_desc': 'macro F', 'cats_micro_p': 0.0, 'cats_micro_r': 0.0, 'cats_micro_f': 0.0, 'cats_macro_p': 0.0, 'cats_macro_r': 0.0, 'cats_macro_f': 0.0, 'cats_macro_auc': 0.0, 'cats_f_per_type': {}, 'cats_auc_per_type': {}}
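
The entity scores above are based on exact span matches. A simplified sketch of that kind of scoring over `(start, end, label)` tuples, overall and per label; it is written in the spirit of the Scorer's entity metrics but is not spaCy's implementation. `score_spans` and `prf` are hypothetical helpers, and the tuples are a small subset taken from the `great fluff` annotations shown earlier.

```python
from collections import defaultdict

def prf(tp, n_pred, n_gold):
    """Precision, recall, and F1 from true-positive, predicted, and gold counts."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def score_spans(gold, pred):
    """Exact-match scoring of (start, end, label) tuples, overall and per label."""
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: [0, 0, 0])  # label -> [tp, n_pred, n_gold]
    for span in pred:
        counts[span[2]][1] += 1
        if span in gold:
            counts[span[2]][0] += 1
    for span in gold:
        counts[span[2]][2] += 1
    per_label = {label: prf(*c) for label, c in counts.items()}
    overall = prf(len(gold & pred), len(pred), len(gold))
    return overall, per_label

gold = [(12, 19, "QUANTITY"), (20, 32, "INGREDIENT"), (146, 149, "INGREDIENT")]
pred = [(12, 19, "QUANTITY"), (20, 32, "INGREDIENT"),
        (6, 11, "INGREDIENT"), (144, 145, "INGREDIENT")]

overall, per_label = score_spans(gold, pred)
print(overall)
print(per_label)
```

A span counts as correct only if its start, end, and label all match, so a prediction that is off by one token scores as both a false positive and a false negative.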


## Testing on raw sample data

In [36]:
import pandas as pd
import spacy
from spacy import displacy

# Load your SpaCy model
nlp = spacy.load("/content/drive/MyDrive/ner_model")

# List of recipes and their ingredients (with quantities) in the format of raw text
recipes = [
    {"recipe_name": "Mexican Tacos", "recipe_text": "For the tacos, you will need 1 cup of salsa, 2 tablespoon of chopped cilantro, and 1 teaspoon of chili powder. Serve the tacos with 1 pound of ground beef, and garnish with 0.5 cup of shredded cheese."},
    {"recipe_name": "Italian Margherita Pizza", "recipe_text": "For the pizza, you'll need 200 gram of mozzarella cheese, 1 cup of tomato sauce, and 2 teaspoon of dried oregano. Use 1 tablespoon of olive oil to coat the dough before baking."},
    {"recipe_name": "American Pancakes", "recipe_text": "To make pancakes, combine 1 cup of flour, 2 tablespoon of sugar, and 1 teaspoon of vanilla extract. Add 0.5 cup of milk and 0.5 cup of melted butter to the batter, then cook on a hot griddle."},
    {"recipe_name": "Chinese Fried Rice", "recipe_text": "For the fried rice, start with 2 cups of cooked rice, 0.5 cup of peas and carrots, and 2 tablespoon of soy sauce. Add 0.5 cup of chopped green onions and 1 teaspoon of sesame oil. Stir-fry until golden."},
    {"recipe_name": "Japanese Ramen", "recipe_text": "For the ramen, heat 1 liter of chicken broth and add 2 tablespoons of soy sauce. Add 3 ounces of ramen noodles and 0.5 teaspoon of miso paste. Simmer for 10 minutes, then garnish with 0.25 cup of chopped green onions."},
    {"recipe_name": "Indian Chole Bhature", "recipe_text": "For the chole, cook 1 cup of chickpeas and mix with 2 tablespoon of cumin powder and 1 teaspoon of coriander powder. Serve with 1 teaspoon of garam masala and 0.5 cup of chopped cilantro. Fry the bhature in 0.5 cup of oil until golden brown."}
]

# Convert recipes into a DataFrame
df_recipes = pd.DataFrame(recipes)

# Assuming you have a trained NER model (nlp), we can iterate over the recipes and display using displacy
for index, row in df_recipes.iterrows():
    recipe_text = row['recipe_text']
    recipe_name = row['recipe_name']

    # Process the recipe text through the model
    doc = nlp(recipe_text)

    # Print the recipe name
    print(f"Recipe: {recipe_name}")

    # Visualize the identified entities using displacy
    print(f"Visualizing Entities for {recipe_name}")
    displacy.render(doc, style='ent', page=True)  # page=True wraps the markup in a full HTML page

    # Add a line for separation between recipes
    print("\n" + "-"*50 + "\n")




Recipe: Mexican Tacos
Visualizing Entities for Mexican Tacos



--------------------------------------------------

Recipe: Italian Margherita Pizza
Visualizing Entities for Italian Margherita Pizza



--------------------------------------------------

Recipe: American Pancakes
Visualizing Entities for American Pancakes



--------------------------------------------------

Recipe: Chinese Fried Rice
Visualizing Entities for Chinese Fried Rice



--------------------------------------------------

Recipe: Japanese Ramen
Visualizing Entities for Japanese Ramen



--------------------------------------------------

Recipe: Indian Chole Bhature
Visualizing Entities for Indian Chole Bhature



--------------------------------------------------

