## Ingredient Phrase Model

This notebook trains a model that tags each word of an ingredient phrase with a Named Entity Recognition label, so that the food name, quantity, and other information can be identified separately.

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import re

# Generate training data using NY Times ingredient phrase tagger
from ingredient_phrase_tagger.training.cli import Cli
from ingredient_phrase_tagger.training import utils, reshape

from sklearn.model_selection import train_test_split

# Model libraries
from tagger_model import *

from IPython.core.debugger import set_trace

Using TensorFlow backend.


In [2]:
# Some default parameters
n_word_embedding_nodes=300
n_tag_embedding_nodes=150
n_RNN_nodes=400
n_dense_nodes=200

dataPath = '../data'

In [3]:
filename = 'cleaned_nyt_ingred_data.pkl'

# reshape.read_and_save_raw_data(dataPath, filename)
cleaned_dat = pd.read_pickle(os.path.join(dataPath, filename))

# Strip the leading B-/I- prefixes so that there are fewer tags to predict
cleaned_dat['tags'] = [[re.sub(r'^[BI]-', '', tag) for tag in tags] for tags in cleaned_dat.tags]
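
The prefix stripping above can be checked on a toy tag list. This is a minimal sketch: the tag names and the helper `strip_bio_prefix` are illustrative assumptions, not part of the notebook. Note that inside a character class `|` is a literal character, so `[B|I]` would also match a stray `|`; the anchored class `^[BI]-` removes only a leading `B-` or `I-`.

```python
import re

# Anchored pattern: only a leading "B-" or "I-" is removed.
PREFIX = re.compile(r'^[BI]-')

def strip_bio_prefix(tag):
    """Hypothetical helper: drop a leading BIO prefix, e.g. 'B-NAME' -> 'NAME'."""
    return PREFIX.sub('', tag)

# Illustrative tags (assumed, not read from the data set)
tags = ['B-QTY', 'B-UNIT', 'B-NAME', 'I-NAME', 'OTHER']
cleaned = [strip_bio_prefix(t) for t in tags]
print(cleaned)  # ['QTY', 'UNIT', 'NAME', 'NAME', 'OTHER']
```

Collapsing `B-NAME` and `I-NAME` into a single `NAME` tag shrinks the label set, at the cost of no longer marking where an entity begins.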

In [4]:
train, test = train_test_split(cleaned_dat, test_size=0.2, random_state=10)

In [5]:
# Create lexicon
lexicon = lexiconTransformer(words_min_freq=2, unknown_tag_token='OTHER', saveNamePrefix='Ingred_mod')

lexicon.fit(train.sents, train.tags)

train['sent_indx'], train['tag_indx'] = lexicon.transform(train.sents, train.tags)

# Get length of longest sequence
max_seq_len = get_max_seq_len(train['sent_indx'])

# Pad to max length + 1 to leave room for offsetting the tag sequence by one step
train_padded_words = pad_idx_seqs(train['sent_indx'], 
                                  max_seq_len + 1) 

train_padded_tags = pad_idx_seqs(train['tag_indx'],
                                 max_seq_len + 1)

# Shift tags right by one step for training, since the tag from the
# previous time step is used as input at the next time step
shifted_train_padded_tags = np.insert(train_padded_tags, 0, 1, axis=1)[:, :-1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
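
To make the padding and shifting in the cell above concrete, here is a small sketch on toy index sequences. `pad_right` is a hypothetical stand-in for `pad_idx_seqs` (whose implementation lives in `tagger_model` and is not shown here); right-padding with 0 is an assumption. The `np.insert(...)[:, :-1]` line is the same expression the notebook uses.

```python
import numpy as np

def pad_right(seqs, length, pad_value=0):
    """Hypothetical stand-in for pad_idx_seqs: right-pad each sequence with pad_value."""
    out = np.full((len(seqs), length), pad_value, dtype=int)
    for row, seq in enumerate(seqs):
        out[row, :len(seq)] = seq[:length]
    return out

# Toy tag index sequences (values are illustrative)
tag_idx = [[3, 4, 5], [3, 5]]
max_len = max(len(s) for s in tag_idx)

# Pad to max_len + 1, as in the notebook
padded = pad_right(tag_idx, max_len + 1)
print(padded.tolist())   # [[3, 4, 5, 0], [3, 5, 0, 0]]

# Insert a column of 1s at position 0, then drop the last column:
# every tag moves one step to the right and the width is unchanged.
shifted = np.insert(padded, 0, 1, axis=1)[:, :-1]
print(shifted.tolist())  # [[1, 3, 4, 5], [1, 3, 5, 0]]
```

After the shift, position `t` of `shifted` holds the tag that position `t - 1` of `padded` held, which is what a model conditioning on the previous tag would consume.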


In [6]:
# Convert to one-hot vector encoding for y
# train_y = [to_categorical(i, num_classes=len(lexicon.tags_lexicon) + 1) for i in train_padded_tags]
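
The notebook leaves the one-hot step commented out. For reference, the same encoding can be sketched with plain NumPy; the helper name `one_hot` and the toy indices are assumptions for illustration, and `num_classes` plays the role of `len(lexicon.tags_lexicon) + 1` in the commented line.

```python
import numpy as np

def one_hot(idx_seqs, num_classes):
    """Map an integer array of shape (batch, time) to (batch, time, num_classes)."""
    idx = np.asarray(idx_seqs)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[idx]

# Toy padded tag indices (illustrative values)
padded_tags = np.array([[3, 4, 0], [2, 0, 0]])
y = one_hot(padded_tags, num_classes=5)
print(y.shape)  # (2, 3, 5)
```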

In [7]:
mod_save_name = 'ingredient_model_clean_tags_crf_wordOnly'
crf_mod = True

In [None]:
ingredient_model = run_training_model(train_padded_words, train_padded_tags, 
                                      train_padded_tags, mod_save_name, lexicon, crf=crf_mod,
                                      print_summary=True, batch_size=256, epochs=200,
                                      n_word_embedding_nodes=n_word_embedding_nodes,
                                      n_tag_embedding_nodes=n_tag_embedding_nodes,
                                      n_RNN_nodes=n_RNN_nodes, 
                                      n_dense_nodes=n_dense_nodes)

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
word_input_layer (InputLayer (None, 56)                0
_________________________________________________________________
word_embedding_layer (Embedd (None, 56, 300)           1699800
_________________________________________________________________
bidirectional_1 (Bidirection (None, 56, 800)           1682400
_________________________________________________________________
dense_layer (TimeDistributed (None, 56, 200)           160200
_________________________________________________________________
output_layer (CRF)           (None, 56, 7)             1470
=================================================================
Total params: 3,543,870
Trainable params: 3,543,870
Non-trainable params: 0
_________________________________________________________________
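
The parameter counts in the summary can be reproduced by hand, which also hints at what the layers contain. The sketch below is an inference from the numbers, not a reading of `tagger_model`: the bidirectional count matches a GRU with 3 gates (an LSTM with 4 gates would give 2,243,200), and the CRF count matches an emission kernel, a tag transition matrix, a bias, and two boundary vectors. Both breakdowns are assumptions.

```python
# Embedding: one 300-dimensional row per vocabulary entry
vocab_rows = 1699800 // 300          # 5666 rows implied by the summary
embedding = vocab_rows * 300

def gru_params(n_in, n_units):
    """3 gates, each with input weights, recurrent weights and a bias (assumed GRU)."""
    return 3 * (n_in * n_units + n_units * n_units + n_units)

# Two directions, 400 units each, fed by the 300-d embeddings
bidirectional = 2 * gru_params(300, 400)

# Dense layer applied at each time step: 800 inputs -> 200 outputs, plus bias
dense = 800 * 200 + 200

# CRF over 7 tags (assumed breakdown): kernel + transitions + bias + 2 boundary vectors
n_tags = 7
crf = 200 * n_tags + n_tags * n_tags + 3 * n_tags

total = embedding + bidirectional + dense + crf
print(embedding, bidirectional, dense, crf, total)
# 1699800 1682400 160200 1470 3543870
```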
Train on 114600 samples, validate on 28650 samples
Epoch 1/200

Epoch 00001: saving model to models/ingredient_model_clean_tags_crf_


Epoch 00032: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 33/200

Epoch 00033: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 34/200

Epoch 00034: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 35/200

Epoch 00035: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 36/200

Epoch 00036: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 37/200

Epoch 00037: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 38/200

Epoch 00038: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 39/200

Epoch 00039: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 40/200

Epoch 00040: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 41/200

Epoch 00041: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 42/200

Epoch 00042: saving model to models/ing


Epoch 00069: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 70/200

Epoch 00070: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 71/200

Epoch 00071: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 72/200

Epoch 00072: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 73/200

Epoch 00073: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 74/200

Epoch 00074: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 75/200

Epoch 00075: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 76/200

Epoch 00076: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 77/200

Epoch 00077: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 78/200

Epoch 00078: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 79/200

Epoch 00079: saving model to models/ing


Epoch 00141: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 142/200

Epoch 00142: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 143/200

Epoch 00143: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 144/200

Epoch 00144: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 145/200

Epoch 00145: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 146/200

Epoch 00146: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 147/200

Epoch 00147: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 148/200

Epoch 00148: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 149/200

Epoch 00149: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 150/200

Epoch 00150: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 151/200

Epoch 00151: saving model to 


Epoch 00177: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 178/200

Epoch 00178: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 179/200

Epoch 00179: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 180/200

Epoch 00180: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 181/200

Epoch 00181: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 182/200

Epoch 00182: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 183/200

Epoch 00183: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 184/200

Epoch 00184: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 185/200

Epoch 00185: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 186/200

Epoch 00186: saving model to models/ingredient_model_clean_tags_crf_wordOnly.hdf5
Epoch 187/200

Epoch 00187: saving model to 

In [8]:
test['sent_indx'], test['tag_indx'] = lexicon.transform(test.sents, test.tags)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
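
The test split is transformed with the lexicon that was fitted on the training split, so words that were unseen or too rare in training have to fall back to an unknown index. A minimal sketch of that idea follows; `TinyLexicon` is a hypothetical, simplified stand-in for `lexiconTransformer`, and reserving index 0 for padding and 1 for unknown words is an assumption.

```python
from collections import Counter

class TinyLexicon:
    """Hypothetical, simplified stand-in for lexiconTransformer (words only)."""

    def __init__(self, min_freq=2, unknown='<UNK>'):
        self.min_freq = min_freq
        self.unknown = unknown
        self.word_to_idx = {}

    def fit(self, sents):
        counts = Counter(w for sent in sents for w in sent)
        kept = sorted(w for w, c in counts.items() if c >= self.min_freq)
        # Assumed layout: 0 is reserved for padding, 1 for unknown words
        self.word_to_idx = {self.unknown: 1}
        for i, word in enumerate(kept, start=2):
            self.word_to_idx[word] = i
        return self

    def transform(self, sents):
        unk = self.word_to_idx[self.unknown]
        return [[self.word_to_idx.get(w, unk) for w in sent] for sent in sents]

# Toy data (illustrative)
train_sents = [['1', 'cup', 'heavy', 'cream'], ['2', 'cup', 'sour', 'cream']]
test_sents = [['1', 'cup', 'coconut', 'cream']]

lex = TinyLexicon(min_freq=2).fit(train_sents)
test_idx = lex.transform(test_sents)
print(test_idx)  # [[1, 3, 1, 2]]
```

Here `'1'` occurs only once in training and `'coconut'` never does, so both map to the unknown index, while `'cup'` and `'cream'` keep their own indices.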


In [9]:
test_mod = create_test_model(mod_save_name, lexicon, crf=crf_mod, 
                             n_word_embedding_nodes=n_word_embedding_nodes,
                             n_tag_embedding_nodes=n_tag_embedding_nodes,
                             n_RNN_nodes=n_RNN_nodes, 
                             n_dense_nodes=n_dense_nodes)

In [10]:
preds = predict_new_tag(test_mod, test, lexicon)

In [11]:
evaluate_model(preds, test, print_sample=True)

SENTENCE:	1/2	cup	heavy	cream
PREDICTED:	QTY	UNIT	NAME	NAME
GOLD:		QTY	UNIT	NAME	NAME
CORRECT:	True	True	True	True 


SENTENCE:	1/4	cup	low-fat	vanilla	yogurt
PREDICTED:	QTY	UNIT	NAME	NAME	NAME
GOLD:		QTY	UNIT	COMMENT	COMMENT	NAME
CORRECT:	True	True	False	False	True 


SENTENCE:	1/3	cup	extra-virgin	olive	oil
PREDICTED:	OTHER	UNIT	COMMENT	NAME	NAME
GOLD:		OTHER	UNIT	COMMENT	NAME	NAME
CORRECT:	True	True	True	True	True 


SENTENCE:	Coarse	salt	or	kosher	salt
PREDICTED:	COMMENT	NAME	OTHER	NAME	NAME
GOLD:		COMMENT	NAME	COMMENT	COMMENT	NAME
CORRECT:	True	True	False	False	True 


SENTENCE:	2	teaspoons	minced	ginger	root
PREDICTED:	QTY	UNIT	COMMENT	NAME	NAME
GOLD:		QTY	UNIT	COMMENT	NAME	NAME
CORRECT:	True	True	True	True	True 


SENTENCE:	1	ounce	Carpano	Antica	Formula	vermouth
PREDICTED:	QTY	UNIT	NAME	COMMENT	NAME	NAME
GOLD:		QTY	UNIT	NAME	NAME	NAME	NAME
CORRECT:	True	True	True	False	True	True 


SENTENCE:	2\ntablespoons	olive	oil
PREDICTED:	OTHER	NAME	NAME
GOLD:		OTHER	NAME	NAME
CORRECT:	Tru



ACCURACY: 0.689
PRECISION: 0.721
RECALL: 0.689
F1: 0.688
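
The implementation of `evaluate_model` is not shown, so the following is only a sketch of how token-level scores of this kind can be computed. It assumes support-weighted averaging over tags; under that assumption recall always equals accuracy, which is consistent with the identical ACCURACY and RECALL values above. The function name `token_metrics` is hypothetical, and the example tags are taken from the yogurt sample in the printed output.

```python
from collections import Counter

def token_metrics(pred_seqs, gold_seqs):
    """Token-level accuracy plus support-weighted precision and recall."""
    pairs = [(p, g)
             for preds, golds in zip(pred_seqs, gold_seqs)
             for p, g in zip(preds, golds)]
    n = len(pairs)
    accuracy = sum(p == g for p, g in pairs) / n

    true_pos = Counter(g for p, g in pairs if p == g)
    pred_count = Counter(p for p, _ in pairs)
    gold_count = Counter(g for _, g in pairs)

    precision = 0.0
    recall = 0.0
    for tag, support in gold_count.items():
        weight = support / n
        # A tag that is never predicted has undefined precision; treat it as 0
        tag_precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else 0.0
        precision += weight * tag_precision
        recall += weight * (true_pos[tag] / support)
    return accuracy, precision, recall

# "1/4 cup low-fat vanilla yogurt" from the sample output
pred = [['QTY', 'UNIT', 'NAME', 'NAME', 'NAME']]
gold = [['QTY', 'UNIT', 'COMMENT', 'COMMENT', 'NAME']]

accuracy, precision, recall = token_metrics(pred, gold)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.6 0.467 0.6
```

In this example `COMMENT` is never predicted, so it contributes nothing to either score, and the three spurious `NAME` predictions pull weighted precision below accuracy.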


