## CRF for Ingredient Tagging

This notebook walks through a typical training routine for generating and evaluating a CRF model from the NYT Cooking (or a similarly formatted) dataset.

### Dataset Generation

In [29]:
%%time
from training import generatedata

X, y, X_test, y_test = generatedata("nyt-cooking.csv", testprop=0.1, parallel=True)

CPU times: user 3.81 s, sys: 1.57 s, total: 5.38 s
Wall time: 13.5 s
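
For orientation, a held-out split with `testprop=0.1` can be pictured as in the sketch below. `split_dataset` is a hypothetical helper, and whether `generatedata` shuffles, seeds, or splits this way is an assumption; the notebook does not show its implementation.

```python
import random

def split_dataset(features, labels, testprop=0.1, seed=0):
    """Hypothetical helper: hold out a `testprop` fraction of the samples."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(round(len(indices) * testprop))  # size of the held-out set
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(labels, train_idx),
            pick(features, test_idx), pick(labels, test_idx))

# Toy data standing in for per-line feature sequences and tag sequences
X_toy = [[f"tok{i}"] for i in range(20)]
y_toy = [["INGR"] for _ in range(20)]
X_tr, y_tr, X_te, y_te = split_dataset(X_toy, y_toy, testprop=0.1)
print(len(X_tr), len(X_te))
```

With 20 toy samples and `testprop=0.1`, two samples are held out and 18 remain for training.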


### Training

**Note**: Training the CRF model with this dataset and the sample parameters takes roughly **5 to 15 minutes**, depending on your CPU (the run below took just under 5 minutes).

In [55]:
%%time
from training import trainCRF

filename = 'model.crfsuite'
trainCRF(X, y, output=filename)

Model successfully trained and saved as: model.crfsuite
CPU times: user 4min 50s, sys: 613 ms, total: 4min 50s
Wall time: 4min 50s


'model.crfsuite'

### Training Data Evaluation

In [31]:
from evaluate import evaluate

accuracy, precision, recall, fscore = evaluate(X, y, filename)

print("Accuracy:")
print(accuracy)
print("Precision:")
print(precision)
print("Recall:")
print(recall)
print("F-Score")
print(fscore)

Accuracy:
{'INGR': 0.8960391702864321, 'QTY': 0.9391432194575843, 'QTY-UR': 0.925129640780451, 'UNIT': 0.9346875729135202, 'Total': 0.9158781699187732}
Precision:
{'INGR': 0.9179626735578502, 'QTY': 0.9813072396655212, 'QTY-UR': 0.5434210526315789, 'UNIT': 0.9219856309870887, 'Total': 0.9334283898297093}
Recall:
{'INGR': 0.8517744533638759, 'QTY': 0.9826695021635432, 'QTY-UR': 0.7357482185273159, 'UNIT': 0.970481052891158, 'Total': 0.9077692874882153}
F-Score
{'INGR': 0.8836308427774342, 'QTY': 0.9819878984651712, 'QTY-UR': 0.6251261352169525, 'UNIT': 0.9456119820056333, 'Total': 0.9204200448387544}
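
The reported F-scores are consistent with the harmonic mean of precision and recall (F1). The sketch below recomputes them from the training-set precision and recall printed above, rounded to four decimals. `f1` is a helper introduced here for illustration; how `evaluate` computes its scores internally is not shown in this notebook.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Training-set values copied from the output above, rounded to four decimals
train_precision = {'INGR': 0.9180, 'QTY': 0.9813, 'QTY-UR': 0.5434, 'UNIT': 0.9220, 'Total': 0.9334}
train_recall = {'INGR': 0.8518, 'QTY': 0.9827, 'QTY-UR': 0.7357, 'UNIT': 0.9705, 'Total': 0.9078}
train_fscore = {'INGR': 0.8836, 'QTY': 0.9820, 'QTY-UR': 0.6251, 'UNIT': 0.9456, 'Total': 0.9204}

for label in train_precision:
    recomputed = f1(train_precision[label], train_recall[label])
    print(f"{label:7s} reported={train_fscore[label]:.4f} recomputed={recomputed:.4f}")
```

The `QTY-UR` label stands out: its precision of about 0.54 pulls its F-score well below the other labels, even though its recall is about 0.74.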


### Test Set Evaluation

In [32]:
accuracy, precision, recall, fscore = evaluate(X_test, y_test, filename)

print("Accuracy:")
print(accuracy)
print("Precision:")
print(precision)
print("Recall:")
print(recall)
print("F-Score")
print(fscore)

Accuracy:
{'INGR': 0.8947635870880317, 'QTY': 0.9380653168877512, 'QTY-UR': 0.9233284223697754, 'UNIT': 0.9332900081234768, 'Total': 0.91500930422548}
Precision:
{'INGR': 0.915966658051176, 'QTY': 0.9805991817341956, 'QTY-UR': 0.5719844357976653, 'UNIT': 0.9197607581171353, 'Total': 0.9318331448250249}
Recall:
{'INGR': 0.8512535655307011, 'QTY': 0.9835848557055864, 'QTY-UR': 0.7277227722772277, 'UNIT': 0.9703351634843891, 'Total': 0.9076095892663356}
F-Score
{'INGR': 0.8824252610610517, 'QTY': 0.9820897495208513, 'QTY-UR': 0.6405228758169934, 'UNIT': 0.9443713362842444, 'Total': 0.9195618674774061}
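
One quick way to read the two evaluations together is to subtract the test scores from the training scores. The sketch below does this for the F-scores printed above, rounded to four decimals; the variable names are introduced here for illustration only.

```python
# F-scores copied from the training and test outputs above, rounded to four decimals
fscore_train = {'INGR': 0.8836, 'QTY': 0.9820, 'QTY-UR': 0.6251, 'UNIT': 0.9456, 'Total': 0.9204}
fscore_test = {'INGR': 0.8824, 'QTY': 0.9821, 'QTY-UR': 0.6405, 'UNIT': 0.9444, 'Total': 0.9196}

# Positive gap: the model scores higher on training data than on held-out data
gaps = {label: fscore_train[label] - fscore_test[label] for label in fscore_train}

for label, gap in gaps.items():
    print(f"{label:7s} train-test gap = {gap:+.4f}")

largest = max(gaps, key=lambda label: abs(gaps[label]))
print("Largest absolute gap:", largest)
```

Every gap is below 0.02 in absolute value, which suggests the model generalizes to the held-out lines about as well as it fits the training lines. The largest difference is on `QTY-UR`, the label with the weakest scores overall.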


### Sample Taggings

In [42]:
import pandas as pd
from parsing import tokenize, removeiob
from evaluate import getlabels

def displaytags(tokens, tags):
    # Build a table indexed by tag (IOB prefixes removed), then transpose so it reads horizontally
    df = pd.DataFrame(tokens, index=removeiob(tags)).transpose()
    # Print the string representation with adjusted spacing and display options
    print(df.to_string(index=False, justify='center', col_space=8, max_cols=15))

df = pd.read_csv('nyt-cooking.csv')
df = df.loc[pd.notna(df.name)&pd.notna(df.input)]
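
`removeiob` is imported from the project's `parsing` module and its implementation is not shown here. Assuming the model emits IOB-style tags such as `B-INGR` and `I-INGR`, and that untagged tokens carry an `O` tag, stripping the prefixes could look like the following sketch. `strip_iob` is a hypothetical stand-in, not the project's function.

```python
def strip_iob(tags):
    """Hypothetical stand-in for removeiob: drop B-/I- prefixes, blank out 'O'."""
    cleaned = []
    for tag in tags:
        if tag == 'O':
            cleaned.append('')        # assumption: untagged tokens display as blank
        elif tag[:2] in ('B-', 'I-'):
            cleaned.append(tag[2:])   # keep only the label name
        else:
            cleaned.append(tag)       # already prefix-free
    return cleaned

print(strip_iob(['B-QTY', 'B-UNIT', 'O', 'B-INGR', 'I-INGR']))
```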

Tag 10 random samples drawn from the original dataset (train and test combined) and display the results:

In [54]:
from numpy.random import randint
samples = df.input.iloc[randint(0, len(df), 10)]
    
for item in samples:
    tokens = tokenize(item, preprocess=True)
    pred = getlabels(item, filename)
    displaytags(tokens, pred)
    print('-'*100)

  QTY      INGR  
    3      eggs  
----------------------------------------------------------------------------------------------------
  QTY        UNIT        INGR      INGR  
    4     tablespoons  vegetable    oil  
----------------------------------------------------------------------------------------------------
  QTY      UNIT              INGR     INGR     INGR     INGR     INGR     INGR  
    4     fillets    of      beef       ,      about     1/4     pound    each  
----------------------------------------------------------------------------------------------------
           INGR     INGR  
  Fresh    mint    leaves 
----------------------------------------------------------------------------------------------------
  QTY      UNIT              INGR     INGR  
    1       cup    chopped    red     onion 
----------------------------------------------------------------------------------------------------
  QTY      UNIT               INGR     INGR  
    3      cups     col

Tag a sample recipe from an outside source:

In [46]:
recipe = ['1 14-oz. package firm or extra-firm tofu, drained',
'1 Tbsp. black peppercorns',
'2 garlic cloves',
'1 1½" piece ginger, peeled',
'1 Tbsp. cornstarch',
'½ tsp. kosher salt',
'3 Tbsp. extra-virgin olive oil',
'1 lb. asparagus, trimmed, cut into 1½" pieces',
'⅓ cup soy sauce',
'1 Tbsp. sugar',
'1 tsp. unseasoned rice vinegar',
'Cooked white or brown rice (for serving)']

for item in recipe:
    tokens = tokenize(item, preprocess=True)
    pred = getlabels(item, filename)
    displaytags(tokens, pred)
    print('-'*100)

  QTY      INGR     INGR     INGR     INGR     INGR       INGR      INGR     INGR     INGR  
    1      14-oz      .     package   firm      or     extra-firm   tofu       ,     drained
----------------------------------------------------------------------------------------------------
  QTY      UNIT              INGR       INGR    
    1      Tbsp       .      black   peppercorns
----------------------------------------------------------------------------------------------------
  QTY      INGR     INGR  
    2     garlic   cloves 
----------------------------------------------------------------------------------------------------
  QTY                        UNIT     INGR     INGR     INGR  
    1      1$1/2      "      piece   ginger      ,     peeled 
----------------------------------------------------------------------------------------------------
  QTY      UNIT                INGR   
    1      Tbsp       .     cornstarch
------------------------------------------------------

Enter your own samples to test:

In [72]:
print('Enter a single recipe line to be tagged or type EXIT to stop')
while True:
    
    s = input('')
    if s.strip().lower() == 'exit': break
    
    tokens = tokenize(s, preprocess=True)
    pred = getlabels(s, filename)
    print('')
    displaytags(tokens, pred)
    print('')
    print('-'*100)

Enter a single recipe line to be tagged or type EXIT to stop


 1 can of Campbell's chicken noodle soup



  QTY      UNIT                          INGR     INGR     INGR  
    1       can      of     Campbell's  chicken  noodle    soup  

----------------------------------------------------------------------------------------------------


 1 1/2 kilograms of churned butter



  QTY       UNIT               INGR     INGR  
  1$1/2   kilograms    of     churned  butter 

----------------------------------------------------------------------------------------------------


 exit
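
In the outputs above, mixed quantities such as `1 1/2` and `1½` show up as the single token `1$1/2`, which suggests that `tokenize(..., preprocess=True)` joins a whole number and its fraction before splitting. The sketch below is one way such a step could work; `merge_mixed_fractions` and the `UNICODE_FRACTIONS` table are hypothetical and are not taken from the project's `parsing` module.

```python
import re

# Assumed mapping; extend as needed for other vulgar-fraction characters
UNICODE_FRACTIONS = {'½': '1/2', '⅓': '1/3', '⅔': '2/3', '¼': '1/4', '¾': '3/4'}

def merge_mixed_fractions(text):
    """Hypothetical preprocessing: '1 1/2' -> '1$1/2' and '1½' -> '1$1/2'."""
    # Spell out Unicode fractions in ASCII, separated from any preceding digit
    for char, ascii_form in UNICODE_FRACTIONS.items():
        text = text.replace(char, ' ' + ascii_form)
    # Join a whole number and the fraction that follows it into one token
    text = re.sub(r'(\d+)\s+(\d+/\d+)', r'\1$\2', text)
    return text.strip()

print(merge_mixed_fractions('1 1/2 kilograms of churned butter'))
print(merge_mixed_fractions('1 1½" piece ginger, peeled'))
```

Keeping the mixed number as one token lets the tagger assign it a single `QTY` label instead of tagging the whole number and the fraction separately.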
