We're going to replicate the benchmark in [A Named Entity Based Approach to Model Recipes](https://arxiv.org/abs/2004.12184), by Diwan, Batra, and Bagler using StanfordNLP, and check it using [seqeval](https://github.com/chakki-works/seqeval).

Evaluating NER is surprisingly tricky, as [David Batista explains](https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/), and I want to check that the results in the paper are the same as what seqeval gives, so I can compare it to other models.

The authors share their data in an [associated git repository](https://github.com/cosylabiiit/recipe-knowledge-mining) and train a model using [Stanford NER](https://nlp.stanford.edu/software/CRF-NER.html), which is open source, so we have a chance of replicating the results.

# Installing Stanford NLP

We're going to install Stanford NLP which is a Java library.
To make things easier we will use [stanza](https://stanfordnlp.github.io/stanza/) which includes tools for [installing and invoking Stanford NLP](https://stanfordnlp.github.io/stanza/corenlp_client.html).

In [1]:
    !pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl (1.1 MB)
     ---------------------------------------- 1.1/1.1 MB 2.4 MB/s eta 0:00:00
Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
     -------------------------------------- 590.6/590.6 kB 6.2 MB/s eta 0:00:00
Collecting protobuf>=3.15.0
  Downloading protobuf-6.31.0-cp310-abi3-win_amd64.whl (435 kB)
     -------------------------------------- 435.1/435.1 kB 6.9 MB/s eta 0:00:00
Installing collected packages: protobuf, emoji, stanza
Successfully installed emoji-2.14.1 protobuf-6.31.0 stanza-1.10.1



[notice] A new release of pip available: 22.3 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


We can specify where to install Core NLP, but we will us the default, which is either "\\$CORE_NLP_HOME", or "\\$HOME/stanza_corenlp". (Ideally we'd use stanza to get this, but I couldn't easy work out how.)

In [2]:
import stanza
stanza.install_corenlp()

  from .autonotebook import tqdm as notebook_tqdm
2025-05-15 21:01:51 INFO: Installing CoreNLP package into C:\Users\Helena\stanza_corenlp
Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip: 100%|██████████| 508M/508M [01:23<00:00, 6.07MB/s] 
2025-05-15 21:03:16 INFO: Downloaded file to C:\Users\Helena\stanza_corenlp\corenlp.zip


We'll need to invoke the Stanford Core NLP JAR that we just installed, so let's find it.

In [16]:
import os
import re
from pathlib import Path


# Reimplement the logic to find the path where stanza_corenlp is installed.
core_nlp_path = os.getenv('CORENLP_HOME', str(Path.home() / 'stanza_corenlp'))

# A heuristic to find the right jar file
classpath = [str(p) for p in Path(core_nlp_path).iterdir() if re.match(r"stanford-corenlp-[0-9.]+\.jar", p.name)][0]
classpath

'C:\\Users\\Helena\\stanza_corenlp\\stanford-corenlp-4.5.9.jar'

Let's test the [basic usage](https://stanfordnlp.github.io/stanza/client_usage.html).

There are currently models for 8 languages, and for some fairly complex tasks like coreference resolution.

In [8]:
from stanza.server import CoreNLPClient

text = "David Batista wrote a blog post on NER evaluation. " \
       "Hiroki Nakayama wrote seqeval to evaluate sequential labelling tasks, such as NER. " \
       "We will test his library against Stanford Core NLP. "

with CoreNLPClient(
     annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse','coref'],
     timeout=30000,
     memory='6G',
     port=3001  # Use a different port to avoid conflicts
) as client:
    
    ann = client.annotate(text)

2025-05-15 21:09:50 INFO: Writing properties to tmp file: corenlp_server-2eae0dd9fb824ba3.props


PermanentlyFailedException: Error: unable to start the CoreNLP server on port 9000 (possibly something is already running there)

We get 3 sentences out.

In [7]:
for sentence in ann.sentence:
    print(" ".join([token.word for token in sentence.token]))

NameError: name 'ann' is not defined

It can even do clever things like coreference resolution; resolving that "his library" refers to "Hiroki Nakayama's library".

In [6]:
for chain in ann.corefChain:
    print([ann.mentionsForCoref[mention.mentionID].headString for mention in chain.mention])

['nakayama', 'his']


We can extract things such as lemmas, parts of speech and standard NER tags.

But we want to train our own NER model to detect ingredients. First we will need to collect the data.

In [4]:
import pandas as pd

tokens = ann.sentence[1].token

pd.DataFrame({'word': [s.word for s in tokens],
              'lemma': [s.lemma for s in tokens],
              'pos': [s.pos for s in tokens],
              'ner': [s.ner for s in tokens]}).T

NameError: name 'ann' is not defined

# Get Data

Helpfully the authors provide the annotated ingredients data in the format for Stanford NER that we can download [from github](https://github.com/cosylabiiit/recipe-knowledge-mining).

There are two sources of ingredients, `ar` is AllRecipes and `gk` is  FOOD.com (formerly GeniusKitchen.com).

In [6]:
from urllib.request import urlretrieve

data_sources = ['ar', 'gk']
data_splits = ['train', 'test']

base_url = 'https://raw.githubusercontent.com/cosylabiiit/recipe-knowledge-mining/master/'

def data_filename(source, split):
    return f'{source}_{split}.tsv'

for source in data_sources:
    for split in data_splits:
        name = data_filename(source, split)
        urlretrieve(base_url + name, name)

Each line of the file is either a single tab (separating different texts), or a token followed by a tab and then the entity type.

So for example the first ingredient is `4 cloves garlic`, which is a quantity (4) followed by a unit (cloves) and a name (garlic).

In [7]:
!head {data_filename('ar', 'train')} | cat -t

'head' is not recognized as an internal or external command,
operable program or batch file.


We can read this in to Python, converting it to a list of annotated sentences, which is just a sequence of token, label pairs.

In [8]:
from typing import List, Tuple, Generator

Annotation = Tuple[str, str]
AnnotatedSentence = List[Annotation]

def segment_texts(data: str) -> Generator[AnnotatedSentence, None, None]:
    output = []
    for line in data.split('\n'):
        if line.strip():
            text, token = line.split('\t')
            output.append((text.strip(), token.strip()))
        elif output:
            yield output
            output = []
            
def segment_file(filename: str) -> List[AnnotatedSentence]:
    with open(filename, 'rt') as f:
        return list(segment_texts(f.read()))

In [9]:
ar_train = segment_file(data_filename('ar', 'train'))

In [10]:
ar_train[:2]

[[('4', 'QUANTITY'), ('cloves', 'UNIT'), ('garlic', 'NAME')],
 [('2', 'QUANTITY'),
  ('tablespoons', 'UNIT'),
  ('vegetable', 'NAME'),
  ('oil', 'NAME'),
  (',', 'O'),
  ('divided', 'STATE')]]

We can then calculate the number of sentences in the training set for a source.

In [11]:
len(ar_train)

1470

We can use this to check the types of entities annotated, as in the paper (DF is Dried/Fresh).

In [12]:
from collections import Counter

tag_counts = Counter([annotation[1] for sentence in ar_train for annotation in sentence])
tag_counts

Counter({'NAME': 2501,
         'O': 1662,
         'QUANTITY': 1583,
         'UNIT': 1338,
         'STATE': 879,
         'DF': 154,
         'SIZE': 64,
         'TEMP': 31})

# Train NER Model

Now we want to train a Stanford NER model on the new annotations.

First we have to configure it; but there's no information on the paper on how it's configured.
I've copied this template configuration out of the [FAQ](https://nlp.stanford.edu/software/crf-faq.html)
For more information on the parameters you can check the [NERFeatureFactory documentation](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html) or the [source](https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/ie/NERFeatureFactory.java).

In [17]:
def ner_prop_str(train_files: List[str], test_files: List[str], output: str) -> str:
    """Returns configuration string to train NER model"""
    train_file_str = ','.join(train_files)
    test_file_str = ','.join(test_files)
    return f"""
trainFileList = {train_file_str}
testFiles = {test_file_str}
serializeTo = {output}
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
"""

This is expected to be a file, so let's write a helper that writes it to a file. (An alternative would be to pass these as arguments to the trainer).

In [18]:
def write_ner_prop_file(ner_prop_file: str, train_files: List[str], test_files: List[str], output_file: str) -> None:
    with open(ner_prop_file, 'wt') as f:
        props = ner_prop_str(train_files, test_files, output_file)
        f.write(props)

Stanza doesn't give an interface to train a CRF NER model using Stanford NLP, but we can invoke `edu.stanford.nlp.ie.crf.CRFClassifier` directly.

Let's write a properties file and invoke Java to run the classifier.
It prints a lot of training information, and importantly a summary report at the end which we want to see.

In [19]:
import subprocess
from typing import List

def train_model(model_name, train_files: List[str], test_files: List[str], print_report=True, classpath=classpath) -> str:
    """Trains CRF NER Model using StanfordNLP"""
    model_file = f'{model_name}.model.ser.gz'
    ner_prop_filename = f'{model_name}.model.props'
    write_ner_prop_file(ner_prop_filename, train_files, test_files, model_file)
        
    result = subprocess.run(
                ['java',
                 '-Xmx2g',
                 '-cp', classpath,
                 'edu.stanford.nlp.ie.crf.CRFClassifier',
                 '-prop', ner_prop_filename],
                capture_output=True)
    
    # If there's an error with invocation better log the stacktrace
    if result.returncode != 0:
        print(result.stderr.decode('utf-8'))
    result.check_returncode()
    
    if print_report:
        print(*result.stderr.decode('utf-8').split('\n')[-11:], sep='\n')
        
    return model_file

We can train models on each dataset separately, and all together.
For evaluation we'll use the corresponding test set.

This only takes a few minutes.

In [20]:
%%time

models = {}
for source in ['ar', 'gk', 'ar_gk']:
    print(source)
    train_files = [data_filename(s, 'train') for s in source.split('_')]
    test_files = [data_filename(s, 'test') for s in source.split('_')]
    models[source] = train_model(source, train_files, test_files)
    print()

ar
CRFClassifier tagged 2788 words in 483 documents at 3295.51 words per second.
         Entity	P	R	F1	TP	FP	FN
             DF	1.0000	0.9608	0.9800	49	0	2
           NAME	0.9297	0.9279	0.9288	463	35	36
       QUANTITY	1.0000	0.9962	0.9981	522	0	2
           SIZE	1.0000	1.0000	1.0000	20	0	0
          STATE	0.9601	0.9633	0.9617	289	12	11
           TEMP	0.8750	0.7000	0.7778	7	1	3
           UNIT	0.9819	0.9841	0.9830	434	8	7
         Totals	0.9696	0.9669	0.9682	1784	56	61


gk
CRFClassifier tagged 9886 words in 1705 documents at 9701.67 words per second.
         Entity	P	R	F1	TP	FP	FN
             DF	0.9718	0.9517	0.9617	138	4	7
           NAME	0.9132	0.9021	0.9076	1621	154	176
       QUANTITY	0.9882	0.9870	0.9876	1598	19	21
           SIZE	0.9750	0.9398	0.9571	78	2	5
          STATE	0.9255	0.9503	0.9377	708	57	37
           TEMP	0.8125	0.8125	0.8125	26	6	6
           UNIT	0.9810	0.9721	0.9766	1291	25	37
         Totals	0.9534	0.9497	0.9516	5460	267	289


ar_gk
CRFClassifier tagged 126

The summary report shows for each model and entity type:

* True Positives (TP): The number of times that entity was predicted correctly
* False Positives (FP): The number of times that entity in the text but not predicted correctly
* False Negative (FN): The number of times that entity was not in the text and predicted
* Precision (P): Probability a predicted entity is correct, TP/(TP+FP)
* Recall (R): Probability a correct entity is predicted, TP/(TP+FN)
* F1 Score (F1): Harmonic mean of precision and recall, 2/(1/P + 1/R).

We can compare the F1 Totals to the diagonal of Table IV in the paper

* AllRecipes.com (ar): We get 0.9682, they report 0.9682
* FOOD.com (gk): We get 0.9516, they report 0.9519
* Both (ar_gk): We get 0.9551, they report 0.9611

These are super close.
The furthest is `ar_gk` and in the repository they have a separate `ar_gk_train.tsv`; it would be interesting to check whether using it directly gives a closer result and why there is a difference.

# Running the model in Python

We can now use these trained models in Python by invoking Stanford NLP with Stanza.

First we'll load in the test data.

In [21]:
test_data = {}

for source in data_sources:
    test_data[source] = segment_file(data_filename(source, 'test'))
    print(source, len(test_data[source]))

ar 483
gk 1705


We can call StanfordNLP with our custom model by passing the property `ner.model`.

Our test data is already tokenized in a different way to StanfordNLP, so we'll add an option to the [Tokenizer](https://stanfordnlp.github.io/CoreNLP/tokenize.html) to use whitespace tokenization which is easy to invert.

It takes a while to start up the server so we want to annotate a large number of texts at once.

In [26]:
import random
import time
from stanza.server import CoreNLPClient  # Add this import
from tqdm import tqdm  # Also add this for the tqdm function

def annotate_ner_robust(ner_model_file: str, texts: List[str], tokenize_whitespace: bool = True):
    """A more robust version of annotate_ner that handles port conflicts better"""
    
    # 1. First, try to kill any lingering Java processes that might be using ports
    try:
        subprocess.run(['taskkill', '/F', '/IM', 'java.exe'], capture_output=True)
        time.sleep(2)  # Give system time to release resources
    except Exception as e:
        print(f"Warning: Could not kill Java processes: {e}")
    
    # 2. Generate random, high-numbered ports to avoid conflicts
    server_port = random.randint(20000, 50000)
    control_port = server_port + 1000  # Keep these well separated
    
    print(f"Trying server port: {server_port}, control port: {control_port}")
    
    # 3. Set additional parameters to avoid issues
    properties = {
        "ner.model": ner_model_file, 
        "tokenize.whitespace": tokenize_whitespace, 
        "ner.applyNumericClassifiers": False
    }
    
    annotated = []
    with CoreNLPClient(
         annotators=['tokenize','ssplit','ner'],
         properties=properties,
         timeout=60000,  # Longer timeout
         be_quiet=True,
         port=server_port,
         start_server=True,
         control_port=control_port,
         preload=False,  # Don't preload models
         memory='4G',    # Use less memory
         endpoint=f'http://localhost:{server_port}') as client:  # Include port in endpoint URL
        
        print("Server successfully started!")
        
        for text in tqdm(texts):
            annotated.append(client.annotate(text))
            
    return annotated

  from .autonotebook import tqdm as notebook_tqdm


We can then get the annotations

In [42]:
from stanza.server import CoreNLPClient

annotations = annotate_ner_robust(models['ar'],
                           [   "- 30 g of sweet potato",
"- 20 g of edamame",
"- 5 g of cornstarch" ,
"- A small bit of water",

])

2025-05-17 19:37:19 INFO: Writing properties to tmp file: corenlp_server-fdde4fae98604f80.props
2025-05-17 19:37:19 INFO: Starting server with command: java -Xmx4G -cp C:\Users\Helena\stanza_corenlp\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 32637 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-fdde4fae98604f80.props -annotators tokenize,ssplit,ner -outputFormat serialized


Trying server port: 32637, control port: 33637
Server successfully started!


100%|██████████| 4/4 [00:17<00:00,  4.31s/it]


Note here that the word "Italian" has ner "NATIONALITY", which comes from another model (it wasn't in the training set!).

We want to use the `coarseNER`.

In [28]:
annotations[2].sentence[0].token[2]

word: "pancetta"
pos: "NN"
value: "pancetta"
originalText: "pancetta"
ner: "NAME"
lemma: "pancetta"
beginChar: 10
endChar: 18
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false
coarseNER: "NAME"
fineGrainedNER: "NAME"
entityMentionIndex: 2
nerLabelProbs: "NAME=0.925096306225114"

When I didn't set `"ner.applyNumericClassifiers": False` this would come up as a `NUMBER`.

In [29]:
annotations[3].sentence[0].token[3]

word: "3"
pos: "CD"
value: "3"
originalText: "3"
ner: "O"
lemma: "3"
beginChar: 20
endChar: 21
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false
coarseNER: "O"
fineGrainedNER: "O"
nerLabelProbs: "O=0.8599875707736956"

We can then flatten the sentences and extract the NER tokens

In [30]:
from dataclasses import dataclass, asdict

@dataclass
class NERData:
    ner: List[str]
    tokens: List[str]
        
    # Let's use Pandas to make it pretty in a notebook
    def _repr_html_(self):
        return pd.DataFrame(asdict(self)).T._repr_html_()

def extract_ner_data(annotation) -> NERData:
    tokens = [token for sentence in annotation.sentence for token in sentence.token]
    return NERData(tokens=[t.word for t in tokens], ner=[t.coarseNER for t in tokens])

A relatively simple ingredient works well

In [34]:
extract_ner_data(annotations[0])

Unnamed: 0,0,1,2,3,4
ner,QUANTITY,UNIT,O,TEMP,NAME
tokens,1,cup,of,frozen,peas


A more complex sentence does quite badly, perhaps because this kind of thing wasn't seen.

In [32]:
extract_ner_data(annotations[1])

Unnamed: 0,0,1,2,3,4,5,6,7
ner,QUANTITY,UNIT,NAME,NAME,NAME,NAME,O,O
tokens,A,dash,of,salt,.,Or,to,taste


In [37]:
extract_ner_data(annotations[2])

Unnamed: 0,0,1,2,3,4
ner,QUANTITY,UNIT,STATE,NAME,NAME
tokens,1/2,teaspoon,ground,black,pepper


In [38]:
def extract_ingredient_names(ner_data):
    """
    Extract only the ingredient names from NER data.
    If names appear in consecutive tokens, join them together.
    
    Args:
        ner_data: NERData object returned by extract_ner_data
        
    Returns:
        List of extracted ingredient names (joined if multiple consecutive tokens)
    """
    names = []
    current_name = []
    
    # Zip tokens and NER tags together and iterate
    for token, tag in zip(ner_data.tokens, ner_data.ner):
        if tag == 'NAME':
            current_name.append(token)
        elif current_name:  # Not a NAME tag but we have collected name tokens
            names.append(' '.join(current_name))
            current_name = []
    
    # Add the last name if there's one at the end of the sequence
    if current_name:
        names.append(' '.join(current_name))
    
    return names

# Example usage:
ingredient_names = extract_ingredient_names(extract_ner_data(annotations[0]))
print(f"Ingredient names: {ingredient_names}")

Ingredient names: ['peas']


In [39]:
# Example usage:
ingredient_names = extract_ingredient_names(extract_ner_data(annotations[0]))
print(f"Ingredient names: {ingredient_names}")

Ingredient names: ['peas']


In [40]:
# Process all annotations and extract ingredient names
def process_all_annotations(annotations):
    """
    Process all annotations and extract ingredient names from each one
    
    Args:
        annotations: List of annotation objects returned by annotate_ner_robust
        
    Returns:
        List of dictionaries containing original text and extracted ingredient names
    """
    results = []
    
    for i, annotation in enumerate(annotations):
        # Extract NER data
        ner_data = extract_ner_data(annotation)
        
        # Extract ingredient names
        names = extract_ingredient_names(ner_data)
        
        # Add to results
        results.append({
            "index": i,
            "tokens": ner_data.tokens,
            "ingredient_names": names
        })
    
    return results

# Example usage:
all_results = process_all_annotations(annotations)

# Print results
for result in all_results:
    original_text = ' '.join(result['tokens'])
    print(f"Original: '{original_text}'")
    print(f"Ingredient names: {result['ingredient_names']}\n")

Original: '1 cup of frozen peas'
Ingredient names: ['peas']

Original: '2 tablespoons olive oil'
Ingredient names: ['olive oil']

Original: '1/2 teaspoon ground black pepper'
Ingredient names: ['black pepper']



In [41]:
# At the end of your model training notebook
import pickle

# Save models dictionary
with open('trained_models.pkl', 'wb') as f:
    pickle.dump(models, f)

In [None]:
def extract_names_from_ingredient(ingredient_text, model_file):
    """
    Process an ingredient text and extract only the ingredient names.
    
    Args:
        ingredient_text: String with ingredient text
        model_file: Path to the trained NER model
        
    Returns:
        List of ingredient names
    """
    # Make sure to import CoreNLPClient to avoid NameError
    from stanza.server import CoreNLPClient
    
    # Get annotations
    annotations = annotate_ner_robust(model_file, [ingredient_text])
    
    # Process each annotation
    if not annotations or annotations[0] is None:
        return []
    
    # Extract NER data
    ner_data = extract_ner_data(annotations[0])
    
    # Extract just the names
    return extract_ingredient_names(ner_data)

# Example:
# names = extract_names_from_ingredient("1 cup of frozen peas", ar_model_file)
# print(names)  # Should print ["peas"]

We can chain these functions together to get from text to NER

In [28]:
from typing import Dict

def ner_extract(ner_model_file: str, texts: List[str], tokenize_whitespace: bool = True) -> List[Dict[str, List[str]]]:
    annotations = annotate_ner(ner_model_file, texts, tokenize_whitespace)
    return [extract_ner_data(ann) for ann in annotations]

And then for each model, and test data we can calculate the predictions.

In [29]:
preds = {}
for model, modelfile in models.items():
    preds[model] = {}
    for test_source, token_data in test_data.items():
        texts = [' '.join([x[0] for x in text]) for text in token_data]
        preds[model][test_source] = ner_extract(modelfile, texts)

  0%|          | 0/483 [00:00<?, ?it/s]

  0%|          | 0/1705 [00:00<?, ?it/s]

  0%|          | 0/483 [00:00<?, ?it/s]

  0%|          | 0/1705 [00:00<?, ?it/s]

  0%|          | 0/483 [00:00<?, ?it/s]

  0%|          | 0/1705 [00:00<?, ?it/s]

## Sanity checks

Let's check the same tokens come through the model as were input

In [30]:
for test_source, token_data in test_data.items():
    tokens = [[x[0] for x in tokens] for tokens in token_data]
    
    for model in models:
        model_preds = preds[model][test_source]
        
        model_tokens = [p.tokens for p in model_preds]
        
        if tokens != model_tokens:
            raise ValueError("Tokenization issue in %s with model %s" % (test_source, model))

# Evaluating

Now that we have predictions we can evaulate with [seqeval](https://github.com/chakki-works/seqeval).

In [31]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     |████████████████████████████████| 43 kB 102 kB/s            
[?25h  Preparing metadata (setup.py) ... [?25l- done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l- \ | done
[?25h  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16181 sha256=117220ab957b2dfbf6fad8b7cf7fb429b409f1fb1b62fef7ea14d20e38b36203
  Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


Seqeval expects the data to be in one of the following formats:

* IOB1
* IOB2
* IOE1
* IOE2
* IOBES(only in strict mode)
* BILOU(only in strict mode)

These all become important when trying to distinguish distinct entities that are adjacent; these are quite rare in practice.
See Wikipedia for a detailed explanation of [IOB (inside-outside-beginning)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).

In this case it's assumed there's only one entity of each type (which can be wrong when multiple names are listing in a single ingredient).
We can easily convert it to IOB1 using this assumption by prefixing every tag other than 'O' with an 'I-'.

In [32]:
def convert_to_iob1(tokens):
    return ['I-' + label if label != 'O' else 'O' for label in tokens]

assert convert_to_iob1(['QUANTITY', 'SIZE', 'NAME', 'NAME', 'O', 'STATE']) == ['I-QUANTITY', 'I-SIZE', 'I-NAME', 'I-NAME', 'O', 'I-STATE']

Let's check the classification report for a single example and compare it to the report from StanfordNER.

The classification report doesn't have the TP, TN and FN, but instead has the support - the number of true entities in the data.
The set of data is equivalent:

* support = TP + FN
* TP = R * support
* FP = TP (1/P - 1)
* FN = support - TP

The results are the same.

In [33]:
from seqeval.metrics import classification_report

test_source = 'ar'
model = 'ar'

actual_ner = [convert_to_iob1([x[1] for x in ann]) for ann in test_data[test_source]]
pred_ner = [convert_to_iob1(p.ner) for p in preds[model][test_source]]

print(classification_report(actual_ner, pred_ner, digits=4))

              precision    recall  f1-score   support

          DF     1.0000    0.9608    0.9800        51
        NAME     0.9297    0.9279    0.9288       499
    QUANTITY     1.0000    0.9962    0.9981       524
        SIZE     1.0000    1.0000    1.0000        20
       STATE     0.9601    0.9633    0.9617       300
        TEMP     0.8750    0.7000    0.7778        10
        UNIT     0.9819    0.9841    0.9830       441

   micro avg     0.9696    0.9669    0.9682      1845
   macro avg     0.9638    0.9332    0.9471      1845
weighted avg     0.9695    0.9669    0.9682      1845



We can get the micro f1-score directly.

In [34]:
from seqeval.metrics import f1_score
'%0.4f' % f1_score(actual_ner, pred_ner)

'0.9682'

We can then try to reproduce Table IV by computing the f1-score for each model and data.

In [35]:
scores = {model: {} for model in models}
for test_source, data in test_data.items():
    actual_ner = [convert_to_iob1([x[1] for x in ann]) for ann in data]
    for model in models:
        pred_ner = [convert_to_iob1(p.ner) for p in preds[model][test_source]]
        scores[model][test_source] = f1_score(actual_ner, pred_ner)

We also need to calculate the scores on the combined test set, by contatenating them

In [36]:
actual_ner = [convert_to_iob1([x[1] for x in ann]) for data in test_data.values() for ann in data]
for model in models:
    pred_ner = [convert_to_iob1(p.ner) for test_source in test_data for p in preds[model][test_source]]
    scores[model]['combined'] = f1_score(actual_ner, pred_ner)

In [37]:
pd.DataFrame(scores).style.format('{:0.4f}')

Unnamed: 0,ar,gk,ar_gk
ar,0.9682,0.9331,0.9704
gk,0.8666,0.9511,0.9499
combined,0.8911,0.9469,0.9549


The results are *slightly* different to those in the paper, but all agree within 0.01 for each row.

So we've successfully reproduced the results in the paper, and shown the evaulation from Stanford NER toolkit is very close to that of seqeval (if you work around hallucinated entities).

In [38]:
reported_scores = pd.DataFrame([[0.9682, 0.9317, 0.9709],
              [0.8672, 0.9519, 0.9498],
              [0.8972, 0.9472, 0.9611]],
             columns = ['AllRecipes', 'FOOD.com', 'BOTH'],
             index = ['AllRecipes', 'FOOD.com', 'BOTH'])
reported_scores

Unnamed: 0,AllRecipes,FOOD.com,BOTH
AllRecipes,0.9682,0.9317,0.9709
FOOD.com,0.8672,0.9519,0.9498
BOTH,0.8972,0.9472,0.9611


In [2]:
ar_model_file = 'ar.model.ser.gz'  # This is the default filename format based on your code

# You can use it with annotate_ner_robust
annotations = annotate_ner_robust(
    ar_model_file,
    ['1 cup of frozen peas']
)

NameError: name 'annotate_ner_robust' is not defined

<h3><strong>Using The NER Trained Model</strong></h3>

In [1]:
#Open Excel File
import openpyxl
import os

# Load the workbook
workbook = openpyxl.load_workbook('1_food-dataset-final.xlsx')

# Select the active worksheet
worksheet = workbook["Sheet1"]


In [2]:
# Import pandas for data manipulation
import pandas as pd
import numpy as np

data = []
headers = []

# Get headers from the first row
for col in range(1, worksheet.max_column + 1):
    headers.append(worksheet.cell(row=1, column=col).value)

# Get data from remaining rows
for row in range(2, worksheet.max_row + 1):
    row_data = []
    for col in range(1, worksheet.max_column + 1):
        row_data.append(worksheet.cell(row=row, column=col).value)
    data.append(row_data)

# Create DataFrame
df = pd.DataFrame(data, columns=headers)

# Display first few rows to verify
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (465, 28)


Unnamed: 0,Food Name,None,Ingredients,NER Ingredient,Instructions,Min Age Group,Max Age Group,texture,Prep Time,Cook Time,...,Description,Flavor_type,Dietary Tags,choking_hazard,tips,allergen,hypoallergenic,Nutrion Value,None.1,None.2
0,"Chicken, Squash & Spinach",,"2 tsp sunflower oil_x000D_\n50g onion, diced_...",,1. Heat the oil in a saucepan.\n2. Add the oni...,6,36,Unknown,12.0,22.0,...,Medium,,,,,,,Yes,,
1,"Porridge with Apple, Pear & Apricot",,"4 tbsp water_x000D_\n1 apple, peeled, cored an...",,1. Put the fruit into a saucepan together with...,6,12,Unknown,5.0,10.0,...,Easy,,,,,,,Yes,,
2,"Spinach, Potato, Carrot & Cheddar Mash",,"350g potatoes, diced_x000D_\n200g carrot, dice...",,1. Put the potato and carrot into a steamer an...,6,10,Unknown,10.0,15.0,...,Easy,,,,,,,Yes,,
3,Fruity Chicken with Apricots & Sweet Potato Puree,,"2 tsp light olive oil_x000D_\n1 small onion, c...",,1. Heat the oil in a pan and sauté the onion f...,6,8,Unknown,10.0,20.0,...,Easy,,,,,,,Yes,,
4,Chicken Curry Puree,,"2 tsp sunflower oil_x000D_\n1 onion, roughly c...",,1. Heat the oil in a saucepan.\n2. Add the oni...,6,6,Thin Puree,12.0,25.0,...,Easy,,,,,,,Yes,,


In [3]:
# Get headers from the first row
headers = []
for col in range(1, worksheet.max_column + 1):
    headers.append(worksheet.cell(row=1, column=col).value)

print("Headers:", headers)

Headers: ['Food Name', None, 'Ingredients', 'NER Ingredient', 'Instructions', 'Min Age Group ', 'Max Age Group ', 'texture', 'Prep Time', 'Cook Time', 'Serving', 'Origin', 'Link ', 'Credibility ', 'Image Link ', 'Region', 'Difficulty', 'Meal Type', 'Description', 'Flavor_type', 'Dietary Tags', 'choking_hazard', 'tips', 'allergen', 'hypoallergenic', 'Nutrion Value', None, None]


In [4]:
# Display all column names to verify
print("Column names:", df.columns.tolist())

# Drop None value column if it exists
none_columns = [col for col in df.columns if col is None]
if none_columns:
    df = df.drop(columns=none_columns)
    print(f"Dropped {len(none_columns)} None column(s)")

# Display all column names to verify
print("Column post-clean:", df.columns.tolist())
print(f"Dataset shape: {df.shape}")


Column names: ['Food Name', None, 'Ingredients', 'NER Ingredient', 'Instructions', 'Min Age Group ', 'Max Age Group ', 'texture', 'Prep Time', 'Cook Time', 'Serving', 'Origin', 'Link ', 'Credibility ', 'Image Link ', 'Region', 'Difficulty', 'Meal Type', 'Description', 'Flavor_type', 'Dietary Tags', 'choking_hazard', 'tips', 'allergen', 'hypoallergenic', 'Nutrion Value', None, None]
Dropped 3 None column(s)
Column post-clean: ['Food Name', 'Ingredients', 'NER Ingredient', 'Instructions', 'Min Age Group ', 'Max Age Group ', 'texture', 'Prep Time', 'Cook Time', 'Serving', 'Origin', 'Link ', 'Credibility ', 'Image Link ', 'Region', 'Difficulty', 'Meal Type', 'Description', 'Flavor_type', 'Dietary Tags', 'choking_hazard', 'tips', 'allergen', 'hypoallergenic', 'Nutrion Value']
Dataset shape: (465, 25)


In [5]:
#renamed columns
df = df.rename(columns={
    'Food Name' : 'food_name',
    'Ingredients' : 'ingredient',
    'Instructions' : 'instructions',
    'Min Age Group ': 'min_age_group',
    'Max Age Group ': 'max_age_group',
    'NER Ingredient': 'ner_ingredient',
    'Texture': 'texture',
    'Prep Time': 'prep_time',
    'Cook Time': 'cook_time',
    'Serving': 'serving',
    'Difficulty': 'difficulty',
    'Origin': 'origin',
    'Region': 'region',
    'Description': 'description',
    'Image Link ': 'image_link',
    'Link ': 'recipe_link',
    'Credibility ': 'credibility',
    'Meal Type': 'meal_type',
    'Flavor_type': 'flavor_type',
    'Dietary Tags': 'dietary_tags',
    'Choking Hazards': 'choking_hazards',
    'Nutrion Value': 'nutrition_value',
    'tips': 'tips',
    'Allergen': 'allergen',
    'Hypoallergenic': 'hypoallergenic',
})

print("Column name post-clean:", df.columns.tolist())
df.head()

Column name post-clean: ['food_name', 'ingredient', 'ner_ingredient', 'instructions', 'min_age_group', 'max_age_group', 'texture', 'prep_time', 'cook_time', 'serving', 'origin', 'recipe_link', 'credibility', 'image_link', 'region', 'difficulty', 'meal_type', 'description', 'flavor_type', 'dietary_tags', 'choking_hazard', 'tips', 'allergen', 'hypoallergenic', 'nutrition_value']


Unnamed: 0,food_name,ingredient,ner_ingredient,instructions,min_age_group,max_age_group,texture,prep_time,cook_time,serving,...,difficulty,meal_type,description,flavor_type,dietary_tags,choking_hazard,tips,allergen,hypoallergenic,nutrition_value
0,"Chicken, Squash & Spinach","2 tsp sunflower oil_x000D_\n50g onion, diced_...",,1. Heat the oil in a saucepan.\n2. Add the oni...,6,36,Unknown,12.0,22.0,4,...,,,Medium,,,,,,,Yes
1,"Porridge with Apple, Pear & Apricot","4 tbsp water_x000D_\n1 apple, peeled, cored an...",,1. Put the fruit into a saucepan together with...,6,12,Unknown,5.0,10.0,4 portions,...,,,Easy,,,,,,,Yes
2,"Spinach, Potato, Carrot & Cheddar Mash","350g potatoes, diced_x000D_\n200g carrot, dice...",,1. Put the potato and carrot into a steamer an...,6,10,Unknown,10.0,15.0,4,...,,,Easy,,,,,,,Yes
3,Fruity Chicken with Apricots & Sweet Potato Puree,"2 tsp light olive oil_x000D_\n1 small onion, c...",,1. Heat the oil in a pan and sauté the onion f...,6,8,Unknown,10.0,20.0,4 Portions,...,,,Easy,,,,,,,Yes
4,Chicken Curry Puree,"2 tsp sunflower oil_x000D_\n1 onion, roughly c...",,1. Heat the oil in a saucepan.\n2. Add the oni...,6,6,Thin Puree,12.0,25.0,4 portions,...,,,Easy,,,,,,,Yes


In [6]:
#drop columns if some columns are empty
important_columns = ['food_name', 'ingredient', 'instructions',  'recipe_link']
for col in df.columns:
    if col in important_columns:
        null_count = df[col].isnull().sum()        
        if null_count >0:
            df = df.dropna(subset=[col])

# After dropping rows, you may want to reset the index if needed
df = df.reset_index(drop=True)
# Display the cleaned DataFrame
print("Cleaned DataFrame:")
print(df.shape)
df.head()



Cleaned DataFrame:
(241, 25)


Unnamed: 0,food_name,ingredient,ner_ingredient,instructions,min_age_group,max_age_group,texture,prep_time,cook_time,serving,...,difficulty,meal_type,description,flavor_type,dietary_tags,choking_hazard,tips,allergen,hypoallergenic,nutrition_value
0,Poached Chicken Breast with Carrots and Beans,- 100g finely chopped or minced chicken breast...,,"1. Place chicken, chopped carrots, and green b...",6,7,Thin Puree,10.0,15.0,2,...,Oceanic,nz,Medium,,"A standout among chicken and carrot dishes, th...",,,,,Yes
1,Beef and Vegetable Casserole Recipe,- 1 tsp olive oil\n-100g finely minced beef\n-...,,1. Heat 1 tsp oil in a medium saucepan over me...,6,7,Unknown,15.0,20.0,2,...,Oceanic,nz,Medium,,,,,,,Yes
2,Chicken and Tomato Risoni Recipe,- Spray oil for cooking\n- 100g chicken mince\...,,"1. In a small saucepan, cook mince over medium...",6,7,Unknown,10.0,18.0,6,...,Oceanic,nz,Medium,,This chicken and tomato risoni recipe is a del...,,,,,Yes
3,Cream of Pumpkin and Potato Soup,- Spray oil for cooking\n- ½ small (40g) onion...,,1. Lightly spray a medium saucepan with oil an...,6,7,Unknown,10.0,12.0,2,...,Oceanic,nz,Medium,,,,,,,No
4,Baby-Friendly Lentil Dhal Recipe,– 1 cup (250 mL) water\n– 1 small (120 g) pota...,,"1. Place water, chopped potato, onion, carrot,...",6,7,Unknown,10.0,15.0,2,...,Oceanic,nz,Easy,,With its soft texture and carefully chosen ing...,,,,,Yes


In [7]:
# Import dataclass from dataclasses module
from dataclasses import dataclass
from typing import List, Tuple

# Helper functions from your existing code
@dataclass
class NERData:
    ner: List[str]
    tokens: List[str]

def extract_ner_data(annotation) -> NERData:
    tokens = [token for sentence in annotation.sentence for token in sentence.token]
    return NERData(tokens=[t.word for t in tokens], ner=[t.coarseNER for t in tokens])

def extract_ingredient_names(ner_data):
    """
    Extract only the ingredient names from NER data.
    If names appear in consecutive tokens, join them together.
    """
    names = []
    current_name = []
    
    # Zip tokens and NER tags together and iterate
    for token, tag in zip(ner_data.tokens, ner_data.ner):
        if tag == 'NAME':
            current_name.append(token)
        elif current_name:  # Not a NAME tag but we have collected name tokens
            names.append(' '.join(current_name))
            current_name = []
    
    # Add the last name if there's one at the end of the sequence
    if current_name:
        names.append(' '.join(current_name))
    
    return names

# Import pickle and load the models
import pickle
import os

# Check if the model file exists directly
ar_model_file = 'ar.model.ser.gz'
if not os.path.exists(ar_model_file):
    # Try loading from the pickle file if direct model file doesn't exist
    try:
        with open('trained_models.pkl', 'rb') as f:
            models = pickle.load(f)
        ar_model_name = models['ar']
        
        # If models['ar'] is not a full path, prepend the current directory
        if not os.path.isabs(ar_model_name):
            ar_model_file = os.path.join(os.getcwd(), ar_model_name)
        else:
            ar_model_file = ar_model_name
    except:
        print("Could not find model file. Using the simple parser instead.")
        use_simple_parser = True


In [8]:
# ----- STEP 3: Find and load the NER model -----
print("Looking for NER model...")
model_file_path = 'ar.model.ser.gz'

# If the model file doesn't exist directly, try to find it
if not os.path.exists(model_file_path):
    try:
        # Try to load from trained_models.pkl
        with open('trained_models.pkl', 'rb') as f:
            models = pickle.load(f)
            if 'ar' in models:
                model_path = models['ar']
                if os.path.exists(model_path):
                    model_file_path = model_path
                    print(f"Found model at {model_file_path}")
                else:
                    print(f"Model path {model_path} from pickle doesn't exist")
    except Exception as e:
        print(f"Error loading model: {e}")
else:
    print(f"Found model at {model_file_path}")

Looking for NER model...
Found model at ar.model.ser.gz


In [9]:
# ----- STEP 4: Process each ingredient using the NER model -----
from stanza.server import CoreNLPClient
import random
import time
import subprocess

def annotate_ner_robust(ner_model_file, texts, tokenize_whitespace=True):
    """A more robust version of annotate_ner that handles port conflicts better"""
    try:
        subprocess.run(['taskkill', '/F', '/IM', 'java.exe'], capture_output=True)
        time.sleep(2)
    except Exception as e:
        print(f"Warning: Could not kill Java processes: {e}")
    
    server_port = random.randint(20000, 50000)
    control_port = server_port + 1000
    
    print(f"Starting NER server on port: {server_port}")
    
    properties = {
        "ner.model": ner_model_file, 
        "tokenize.whitespace": tokenize_whitespace, 
        "ner.applyNumericClassifiers": False
    }
    
    annotated = []
    with CoreNLPClient(
         annotators=['tokenize','ssplit','ner'],
         properties=properties,
         timeout=60000,
         be_quiet=True,
         port=server_port,
         start_server=True,
         control_port=control_port,
         preload=False,
         memory='4G',
         endpoint=f'http://localhost:{server_port}') as client:
        
        print("Server successfully started!")
        
        for text in tqdm(texts):
            annotated.append(client.annotate(text))
            
    return annotated

  from .autonotebook import tqdm as notebook_tqdm


In [10]:
def process_ingredients_with_ner(df, model_file):
    """Process all ingredients and extract names using NER model"""
    processed_df = df.copy()
    all_extracted_names = []
    
    print("Processing ingredients with NER model...")
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        ingredient_text = row['ingredient']
        
        # Skip if ingredient is missing
        if pd.isna(ingredient_text):
            all_extracted_names.append("")
            continue
            
        # Split by either \n or \\n
        ingredient_lines = re.split(r'\\n|\n', str(ingredient_text))
        
        # Clean each line (remove leading "- " if present)
        ingredient_lines = [line.strip('- ').strip() for line in ingredient_lines if line.strip()]
        
        try:
            # Process all ingredient lines for this recipe at once
            annotations = annotate_ner_robust(model_file, ingredient_lines)
            
            # Extract ingredient names from all annotations
            extracted_names = []
            for annotation in annotations:
                ner_data = extract_ner_data(annotation)
                names = extract_ingredient_names(ner_data)
                if names:
                    extracted_names.extend(names)
            
            # Join all extracted names with commas
            all_extracted_names.append(', '.join(extracted_names))
            
        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            all_extracted_names.append("")
    
    # Add the extracted names as a new column
    processed_df['ner_ingredient'] = all_extracted_names
    
    return processed_df

In [10]:
# ----- STEP 5: Run the processing on first 5 rows only -----
print("Starting ingredient extraction (first 5 rows only)...")
if os.path.exists(model_file_path):
    # Take only the first 5 rows
    df_sample = df.head(5).copy()
    print(f"Processing sample of {len(df_sample)} rows")
    
    # Process only these 5 rows
    df_sample_with_ner = process_ingredients_with_ner(df_sample, model_file_path)
    
    print("\nExtraction complete! Results:")
    print(df_sample_with_ner[['food_name', 'ingredient', 'ner_ingredient']])
    
    # Save the sample processed dataframe to a different file name
    df_sample_with_ner.to_excel('food_dataset_with_ner_sample.xlsx', index=False)
    print("Saved sample dataset to 'food_dataset_with_ner_sample.xlsx'")
else:
    print("NER model file not found. Cannot process ingredients.")

Starting ingredient extraction (first 5 rows only)...
Processing sample of 5 rows


NameError: name 'process_ingredients_with_ner' is not defined

In [11]:
import spacy
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Download NLTK resources if not already available
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

def clean_ingredient_names(ingredient_names_list):
    """
    Process a list of ingredient names to:
    1. Remove measurements and units
    2. Remove stopwords
    3. Lemmatize words
    4. Keep only the core ingredient name
    
    Args:
        ingredient_names_list: List of strings containing extracted ingredient names
        
    Returns:
        List of cleaned, lemmatized ingredient names
    """
    # Custom measurement and cooking stopwords
    compound_ingredients = {
        "sweet potato", "bell pepper", "olive oil", "coconut milk", 
        "soy sauce", "maple syrup", "peanut butter", "baking powder",
        "baking soda", "salad greens", "sesame oil", "rice vinegar",
        "whole wheat", "green onion", "red onion", "red cabbage"
    }
    measurement_units = {
        "cup", "cups", "tablespoon", "tablespoons", "tbsp", "teaspoon", "teaspoons", "tsp",
        "oz", "ounce", "ounces", "pound", "pounds", "lb", "lbs", "gram", "grams", "g",
        "kilogram", "kilograms", "kg", "ml", "milliliter", "milliliters", "liter", "liters",
        "l", "pinch", "pinches", "dash", "dashes", "slice", "slices", "piece", "pieces"
    }
    
    cooking_words = {
        "chopped", "diced", "minced", "sliced", "grated", "shredded", "crushed",
        "ground", "mashed", "pureed", "julienned", "cubed", "quartered", "halved",
        "frozen", "fresh", "dried", "canned", "boiled", "steamed", "roasted", "baked",
        "fried", "ripe", "raw", "cooked", "processed", "peeled", "pitted"
    }
    
    # Combine with NLTK stopwords
    all_stopwords = set(stopwords.words('english')).union(measurement_units).union(cooking_words)
    
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    cleaned_ingredients = []
    
    for ingredient in ingredient_names_list:
        if not ingredient:
            continue
            
        lower_ingredient = ingredient.lower()
        if any(compound in lower_ingredient for compound in compound_ingredients):
            # Just clean up numbers and extra punctuation but preserve the compound term
            ingredient = re.sub(r'\d+\/?\d*', '', lower_ingredient)
            ingredient = re.sub(r'[^\w\s-]', '', ingredient).strip()
            if ingredient in compound_ingredients:
                cleaned_ingredients.append(ingredient)
                continue
        
        # Process with spaCy for better part-of-speech tagging
        doc = nlp(ingredient.lower())
        
        # Extract only nouns, skipping stopwords and lemmatizing
        tokens = []
        for token in doc:
            # Keep only nouns and proper nouns
            if token.pos_ in ("NOUN", "PROPN"):
                lemma = lemmatizer.lemmatize(token.text)
                if lemma.lower() not in all_stopwords and len(lemma) > 1:
                    tokens.append(lemma.lower())
        
        if tokens:
            cleaned_ingredients.append(" ".join(tokens))
    
    return cleaned_ingredients

# Modify the process_ingredients_with_ner function to include lemmatization
def process_ingredients_with_ner_and_lemmatize(df, model_file):
    """Process all ingredients and extract names using NER model with lemmatization"""
    processed_df = df.copy()
    all_extracted_names = []
    all_lemmatized_names = []
    
    print("Processing ingredients with NER model and lemmatization...")
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        ingredient_text = row['ingredient']
        
        # Skip if ingredient is missing
        if pd.isna(ingredient_text):
            all_extracted_names.append("")
            all_lemmatized_names.append("")
            continue
            
        # Split by either \n or \\n
        ingredient_lines = re.split(r'\\n|\n', str(ingredient_text))
        
        # Clean each line (remove leading "- " if present)
        ingredient_lines = [line.strip('- ').strip() for line in ingredient_lines if line.strip()]
        
        try:
            # Process all ingredient lines for this recipe at once
            annotations = annotate_ner_robust(model_file, ingredient_lines)
            
            # Extract ingredient names from all annotations
            extracted_names = []
            for annotation in annotations:
                ner_data = extract_ner_data(annotation)
                names = extract_ingredient_names(ner_data)
                if names:
                    extracted_names.extend(names)
            
            # Store original extracted names
            all_extracted_names.append(', '.join(extracted_names))
            
            # Apply lemmatization and cleaning
            lemmatized_names = clean_ingredient_names(extracted_names)
            all_lemmatized_names.append(', '.join(lemmatized_names))
            
        except Exception as e:
            print(f"Error processing row {idx}: {e}")
            all_extracted_names.append("")
            all_lemmatized_names.append("")
    
    # Add both the raw and lemmatized extracted names as new columns
    processed_df['ner_ingredient'] = all_extracted_names
    processed_df['lemmatized_ingredient'] = all_lemmatized_names
    
    return processed_df

In [12]:
import os
from tqdm import tqdm
# Run the processing on first 5 rows with lemmatization
print("Starting ingredient extraction with lemmatization (first 100 rows only)...")
model_file_path = 'ar.model.ser.gz'  # This is the default filename format based on your code
if os.path.exists(model_file_path):
    # Take only the first 5 rows
    df_sample = df.head(100).copy()
    print(f"Processing sample of {len(df_sample)} rows")
    
    # Process only these 5 rows with lemmatization
    df_sample_with_ner = process_ingredients_with_ner_and_lemmatize(df_sample, model_file_path)
    
    print("\nExtraction complete! Results:")
    # print(df_sample_with_ner[['food_name', 'ingredient', 'ner_ingredient', 'lemmatized_ingredient']])
    df_sample_with_ner[['food_name', 'ingredient', 'ner_ingredient', 'lemmatized_ingredient']]
    # Save the sample processed dataframe to a different file name
    df_sample_with_ner.to_excel('100_food_dataset_with_ner_lemmatized_sample.xlsx', index=False)
    print("Saved sample dataset to 'food_dataset_with_ner_lemmatized_sample.xlsx'")
else:
    print("NER model file not found. Cannot process ingredients.")

Starting ingredient extraction with lemmatization (first 100 rows only)...
Processing sample of 100 rows
Processing ingredients with NER model and lemmatization...


2025-05-23 09:17:33 INFO: Writing properties to tmp file: corenlp_server-c96f08511bde4e52.props
2025-05-23 09:17:33 INFO: Starting server with command: java -Xmx4G -cp C:\Users\Helena\stanza_corenlp\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 48444 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-c96f08511bde4e52.props -annotators tokenize,ssplit,ner -outputFormat serialized


Starting NER server on port: 48444
Server successfully started!


  0%|          | 0/3 [00:10<?, ?it/s]
  0%|          | 0/100 [00:13<?, ?it/s]


KeyboardInterrupt: 

In [23]:
df_sample_with_ner[['food_name', 'ingredient', 'ner_ingredient', 'lemmatized_ingredient']]

Unnamed: 0,food_name,ingredient,ner_ingredient,lemmatized_ingredient
0,Edamame and Sweet Potato Dumplings (Oyaki),- 30 g of sweet potato\n- 20 g of edamame\n- 5...,"g, sweet potato, g of edamame, g of cornstarch...","sweet potato, edamame, cornstarch, bit water"
1,"Chicken, Carrot, and Onion Udon For Babies",- 10 g of onion\n- 10 g of chicken breast (pre...,"g of onion, chicken breast, g of udon, ml, broth)","onion, chicken breast, udon, broth"
2,Kinaki Yogurt,- 2-3 tablespoons of plain yogurt \n- A pinch ...,"plain yogurt, soybean flour (kinako)","yogurt, soybean flour kinako"
3,Natto Oyaki and Grilled Salmon,Natto Oyaki :\n- 200g Japanese rice *steamed\n...,"Natto Oyaki :, Japanese rice *steamed, pack Na...","natto oyaki, rice, pack natto, onion, egg, but..."
4,Miso Soup,- 120 ml of dashi \n- 20 grams of tofu \n- 1/4...,"ml of dashi, grams of tofu, of miso paste","dashi, tofu, miso paste"
...,...,...,...,...
95,Peach puree with pain d’épice,- 1/2 peach\n- 1/2 slice of pain d’épice\n- 1 ...,"peach, of pain d’épice, petit-suisse","peach, pain d’épice, petit"
96,Cream of pumpkin soup with thyme,- 150g pumpkin\n- 1/3 onion\n- 1 potato\n- ½ t...,"pumpkin, onion, potato, of crème fraîche","onion, potato, fraîche"
97,Ham with pumpkin mash,- 10g cooked ham\n- 200g pumpkin\n- 2 measurin...,"ham, pumpkin, measuring scoops of follow-on fo...","ham, scoop formula"
98,Courgette and sweet potato puree with sage,- 1 sweet potato\n- 1 small courgette\n- 1 tab...,"sweet potato, courgette, of crème fraîche, of ...","sweet potato, courgette, fraîche, sage"


In [14]:
# Process the dataset in batches
import os
import subprocess
import time
import random
from tqdm import tqdm

def process_dataset_in_batches(df, model_file_path, batch_size=10, output_file='full_dataset_processed.xlsx'):
    """
    Process the full dataset in smaller batches to avoid memory issues with CoreNLP
    """
    print(f"Processing full dataset of {len(df)} rows in batches of {batch_size}...")
    
    # Initialize an empty DataFrame to store all results
    all_results = pd.DataFrame()
    
    # Calculate the number of batches
    num_batches = (len(df) // batch_size) + (1 if len(df) % batch_size > 0 else 0)
    
    for batch_num in range(num_batches):
        # Calculate start and end indices for this batch
        start_idx = batch_num * batch_size
        end_idx = min(start_idx + batch_size, len(df))
        
        print(f"\n{'='*60}")
        print(f"Processing Batch {batch_num+1}/{num_batches}, rows {start_idx}-{end_idx}")
        print(f"{'='*60}")
        
        # Extract the batch
        batch_df = df.iloc[start_idx:end_idx].copy()
        
        # First, ensure any existing Java processes are killed before starting
        kill_java_processes()
        
        try:
            # Process this batch
            batch_results = process_ingredients_with_ner_and_lemmatize(batch_df, model_file_path)
            
            # Save intermediate results
            batch_file = f'batch_{batch_num+1}_of_{num_batches}.xlsx'
            batch_results.to_excel(batch_file, index=False)
            print(f"Saved intermediate batch results to '{batch_file}'")
            
            # Append to the full results
            all_results = pd.concat([all_results, batch_results], ignore_index=True)
            
            # Save all results so far (in case of crash)
            all_results.to_excel(output_file, index=False)
            print(f"Updated combined results in '{output_file}'")
            
            # Remove previous batch file if it exists (keeping only the most recent)
            if batch_num > 0:
                prev_batch_file = f'batch_{batch_num}_of_{num_batches}.xlsx'
                if os.path.exists(prev_batch_file):
                    try:
                        os.remove(prev_batch_file)
                        print(f"Removed previous batch file: {prev_batch_file}")
                    except Exception as e:
                        print(f"Could not remove previous batch file: {e}")
            
        except Exception as e:
            print(f"Error processing batch {batch_num+1}: {e}")
        
        # After each batch, kill Java processes to free memory
        kill_java_processes()
    
    print(f"\nProcessing complete! Processed {len(all_results)} rows in total.")
    print(f"Final results saved to '{output_file}'")
    
    # Clean up any remaining batch files
    clean_up_batch_files(num_batches)
    
    return all_results

def kill_java_processes():
    """Kill all Java processes more thoroughly"""
    print("Cleaning up Java processes...")
    try:
        # For Windows
        subprocess.run(['taskkill', '/F', '/IM', 'java.exe'], capture_output=True)
        
        # Additional cleanup for potential Java server processes
        subprocess.run(['netstat', '-ano'], capture_output=True, text=True)
        
        # Give system time to release resources
        time.sleep(8)
        print("Java processes terminated")
    except Exception as e:
        print(f"Warning: Could not kill Java processes: {e}")

def clean_up_batch_files(num_batches):
    """Clean up all batch files after completion"""
    print("Cleaning up batch files...")
    for batch_num in range(1, num_batches + 1):
        batch_file = f'batch_{batch_num}_of_{num_batches}.xlsx'
        if os.path.exists(batch_file):
            try:
                os.remove(batch_file)
                print(f"Removed batch file: {batch_file}")
            except Exception as e:
                print(f"Could not remove batch file {batch_file}: {e}")

In [16]:
# Process the dataset in batches
import os
import subprocess
import time
import random
from tqdm import tqdm

def process_dataset_in_batches(df, model_file_path, batch_size=10, output_file='full_dataset_processed.xlsx'):
    """
    Process the full dataset in smaller batches to avoid memory issues with CoreNLP
    
    Args:
        df: DataFrame with recipe data
        model_file_path: Path to the NER model
        batch_size: Number of recipes to process in each batch
        output_file: Output file name for the complete processed data
    """
    print(f"Processing full dataset of {len(df)} rows in batches of {batch_size}...")
    
    # Initialize an empty DataFrame to store all results
    all_results = pd.DataFrame()
    
    # Calculate the number of batches
    num_batches = (len(df) // batch_size) + (1 if len(df) % batch_size > 0 else 0)
    
    for batch_num in range(num_batches):
        # Calculate start and end indices for this batch
        start_idx = batch_num * batch_size
        end_idx = min(start_idx + batch_size, len(df))
        
        print(f"\n{'='*60}")
        print(f"Processing Batch {batch_num+1}/{num_batches}, rows {start_idx}-{end_idx}")
        print(f"{'='*60}")
        
        # Extract the batch
        batch_df = df.iloc[start_idx:end_idx].copy()
        kill_java_processes()
        try:
            # Process this batch
            batch_results = process_ingredients_with_ner_and_lemmatize(batch_df, model_file_path)
            
            # Save intermediate results
            batch_file = f'batch_{batch_num+1}_of_{num_batches}.xlsx'
            batch_results.to_excel(batch_file, index=False)
            print(f"Saved intermediate batch results to '{batch_file}'")
            
            # Append to the full results
            all_results = pd.concat([all_results, batch_results], ignore_index=True)
            
            # Save all results so far (in case of crash)
            all_results.to_excel(output_file, index=False)
            print(f"Updated combined results in '{output_file}'")
            
        except Exception as e:
            print(f"Error processing batch {batch_num+1}: {e}")
            
        # Make sure to kill all Java processes to free memory
        try:
            subprocess.run(['taskkill', '/F', '/IM', 'java.exe'], capture_output=True)
            print("Killed Java processes to free memory")
            time.sleep(10)  # Give more time to fully release resources between batches
        except Exception as e:
            print(f"Warning: Could not kill Java processes: {e}")
    
    print(f"\nProcessing complete! Processed {len(all_results)} rows in total.")
    print(f"Final results saved to '{output_file}'")
    
    return all_results

# Enhanced annotate_ner_robust function with better timeout and memory management
def annotate_ner_robust(ner_model_file, texts, tokenize_whitespace=True, memory='6G', timeout=120000):
    """An enhanced version of annotate_ner that handles memory and timeouts better"""
    # Kill any lingering Java processes
    try:
        subprocess.run(['taskkill', '/F', '/IM', 'java.exe'], capture_output=True)
        time.sleep(5)  # Give system time to release resources
    except Exception as e:
        print(f"Warning: Could not kill Java processes: {e}")
    
    # Use higher port numbers to avoid conflicts
    server_port = random.randint(20000, 50000)
    control_port = server_port + 1000
    
    print(f"Starting NER server on port: {server_port}, control port: {control_port}")
    print(f"Using memory: {memory}, timeout: {timeout}ms")
    
    properties = {
        "ner.model": ner_model_file, 
        "tokenize.whitespace": tokenize_whitespace, 
        "ner.applyNumericClassifiers": False
    }
    
    annotated = []
    with CoreNLPClient(
         annotators=['tokenize','ssplit','ner'],
         properties=properties,
         timeout=timeout,  # Increased timeout
         be_quiet=True,
         port=server_port,
         start_server=True,
         control_port=control_port,
         preload=False,
         memory=memory,    # More memory for larger batches
         endpoint=f'http://localhost:{server_port}',
         max_char_length=100000  # Support longer texts
    ) as client:
        
        print("Server successfully started!")
        
        # Process items with progress bar
        for text in tqdm(texts):
            try:
                # Handle potential CoreNLP timeouts for individual texts
                result = client.annotate(text)
                annotated.append(result)
            except Exception as e:
                print(f"Error annotating text: {e}")
                print(f"Problematic text: {text[:100]}...")
                # Add None to maintain index alignment
                annotated.append(None)
            
    return annotated

# Main execution
if __name__ == "__main__":
    print("Starting ingredient extraction with batch processing...")
    model_file_path = 'ar.model.ser.gz'
    
    if os.path.exists(model_file_path):
        # For testing, process first 100 rows
        df_to_process = df.head(400).copy()  # Adjust number as needed
        print(f"Processing sample of {len(df_to_process)} rows in batches")
        
        # Process in batches of 20 rows
        processed_df = process_dataset_in_batches(
            df_to_process, 
            model_file_path, 
            batch_size=20,
            output_file='2nd_food_dataset_with_ner_lemmatized.xlsx'
        )
        
        
        print("\nBatch processing complete!")
    else:
        print("NER model file not found. Cannot process ingredients.")

Starting ingredient extraction with batch processing...
Processing sample of 241 rows in batches
Processing full dataset of 241 rows in batches of 20...

Processing Batch 1/13, rows 0-20
Cleaning up Java processes...
Java processes terminated
Processing ingredients with NER model and lemmatization...


2025-05-23 13:35:58 INFO: Writing properties to tmp file: corenlp_server-8084e2f6367148fd.props
2025-05-23 13:35:58 INFO: Starting server with command: java -Xmx6G -cp C:\Users\Helena\stanza_corenlp\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 30749 -timeout 120000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-8084e2f6367148fd.props -annotators tokenize,ssplit,ner -outputFormat serialized


Starting NER server on port: 30749, control port: 31749
Using memory: 6G, timeout: 120000ms
Server successfully started!


  0%|          | 0/3 [00:15<?, ?it/s]
  0%|          | 0/20 [00:21<?, ?it/s]


KeyboardInterrupt: 

In [None]:
# Main execution
if __name__ == "__main__":
    print("Starting ingredient extraction with batch processing...")
    model_file_path = 'ar.model.ser.gz'
    
    if os.path.exists(model_file_path):
        # For testing, process first 100 rows
        df_to_process = df.head(400).copy()  # Adjust number as needed
        print(f"Processing sample of {len(df_to_process)} rows in batches")
        
        # Process in batches of 20 rows
        processed_df = process_dataset_in_batches(
            df_to_process, 
            model_file_path, 
            batch_size=20,
            output_file='4nd_food_dataset_with_ner_lemmatized.xlsx'
        )
        
        
        print("\nBatch processing complete!")
    else:
        print("NER model file not found. Cannot process ingredients.")