# Building an NER tagger

Using the NY Times labelled dataset to train a named entity resolution model to identify tags such as name, quantity and measurement for ingredient phrases. 

In [23]:
import pandas as pd
import numpy as np
import re
import spacy

**Loading in the training data**

In [2]:
nyt_data = pd.read_csv("https://raw.githubusercontent.com/mtlynch/ingredient-phrase-tagger/master/nyt-ingredients-snapshot-2015.csv")
nyt_data

Unnamed: 0,index,input,name,qty,range_end,unit,comment
0,0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,"cooked and pureed fresh, or 1 10-ounce package..."
1,1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.00,0.0,cup,"peeled and cooked fresh (about 20), or 1 cup c..."
2,2,"1 medium-size onion, peeled and chopped",onion,1.00,0.0,,"medium-size, peeled and chopped"
3,3,"2 stalks celery, chopped coarse",celery,2.00,0.0,stalk,chopped coarse
4,4,1 1/2 tablespoons vegetable oil,vegetable oil,1.50,0.0,tablespoon,
...,...,...,...,...,...,...,...
179202,179202,3/4 oz. pineapple juice,pineapple juice,0.75,0.0,ounce,
179203,179203,1 tsp. fresh lemon juice,lemon juice,1.00,0.0,teaspoon,fresh
179204,179204,Angostura bitters,Angostura bitters,0.00,0.0,,
179205,179205,Wedge of pineapple,pineapple,1.00,0.0,wedge,


### Focusing on ingredient name

Initially, I shall build an NER model which focuses only on 1 entity, the name of the ingredient.

I have decided to do this as the name of the ingredient from the labelled dataset always* corresponds to a string from the input, whereas the quantity and unit do not directly relate to strings from the input. For example:

- Quantity "1 1/4" changed to "1.25" by annotator.
- Unit "oz.", "tsp." changed to "ounce", "teaspoon" by annotator.

*There are ~8000 cases where this is not the case. These cases are removed.

In [3]:
nyt_data[nyt_data.apply(lambda row: str(row['name']) in str(row['input']), axis=1) == False].head(3)

Unnamed: 0,index,input,name,qty,range_end,unit,comment
5,5,,water,0.5,0.0,cup,
123,123,1/2 teaspoon freshly ground black pepper,Freshly ground black pepper,0.5,0.0,teaspoon,
178,178,Chopped fresh parsley leaves for garnish,chopped fresh parsley leaves for garnish,0.0,0.0,,


In [4]:
nyt_data = nyt_data[nyt_data.apply(lambda row: str(row['name']) in str(row['input']), axis=1)]

**Edit:** It turns out that most units are also correct, so I could also use units as named entities. This would require some regex to include plurals, e.g. "cup" or "cups"

In [5]:
no_nans = nyt_data[nyt_data.unit.isna()==False]
no_nans[no_nans.apply(lambda row: str(row['unit']) in str(row['input']), axis=1)]

Unnamed: 0,index,input,name,qty,range_end,unit,comment
0,0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,"cooked and pureed fresh, or 1 10-ounce package..."
1,1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.00,0.0,cup,"peeled and cooked fresh (about 20), or 1 cup c..."
3,3,"2 stalks celery, chopped coarse",celery,2.00,0.0,stalk,chopped coarse
4,4,1 1/2 tablespoons vegetable oil,vegetable oil,1.50,0.0,tablespoon,
6,6,"2 tablespoons unflavored gelatin, dissolved in...",gelatin,2.00,0.0,tablespoon,"unflavored, dissolved in 1/2 cup water"
...,...,...,...,...,...,...,...
179187,179187,"2 cups cherry tomatoes, sliced into quarters",cherry tomatoes,2.00,0.0,cup,", sliced into quarters"
179188,179188,½ cup sour cream,sour cream,0.50,0.0,cup,
179189,179189,½ cup roughly chopped cilantro leaves,cilantro leaves,0.50,0.0,cup,roughly chopped
179195,179195,2 dashes Angostura or Regans’ Orange Bitters,Orange Bitters,2.00,0.0,dash,Angostura or Regans’


### Format for training data

The data for training the custom NER tagger must be of the form:

    TRAIN_DATA = [('3 tablespoons chopped fresh sage',
                {'entities': [(28, 32, 'INGREDIENT')]}),
                ('1/4 cup brown sugar',
                {'entities': [(8, 19, 'INGREDIENT')]}),
                ('1 1/2 cups heavy cream',
                {'entities': [(11, 22, 'INGREDIENT')]}),
                ('1 1/4 cups whole milk',
                {'entities': [(11, 21, 'INGREDIENT')]}),
                ...
                ]


***Cleaning NYTimes Dataset***


1. The columns - input, name - are converted to string
2. Words to be removed from the name column are identified as sometimes the name column contains descriptive information
3. Words are removed from name column if there is additional information present
4. Parantheses and text between parantheses is eliminated for name column as it contains additional information
5. Drop rows which contain null values for input and name column. This form of imputation is acceptable here as we are utilizing a very small subset of the data and there aren't many null values in the original dataset.

In [6]:
#convert input and name to string
nyt_data['input'] = nyt_data['input'].astype(str)
nyt_data['name'] = nyt_data['name'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nyt_data['input'] = nyt_data['input'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nyt_data['name'] = nyt_data['name'].astype(str)


In [7]:
#clean the name column with these words
remove_words = ['ground','to','taste', 'and', 'or', 'powder','white','red','green','yellow', 'can', 'seed', 'into', 'cut', 'grated',\
                'leaf','package','finely','divided','a','piece','optional','inch','needed','more','drained','for','flake','juice','dry','breast',\
                'extract','yellow','thinly','boneless','skinless','cubed','bell','bunch','cube','slice','pod','beaten','seeded','broth','uncooked',\
                'root','plain','baking','heavy','halved','crumbled','sweet','with','hot','confectioner','room','temperature','trimmed',\
                'all-purpose','sauce','crumb','deveined','bulk','seasoning','jar','food','sundried','italianstyle','if','bag','mix','in',\
                'each','roll','instant','double','such','extra-virgin','frying','thawed','whipping','stock','rinsed','mild','sprig','brown',\
                'freshly','toasted','link','boiling','cooked','basmati','unsalted','container','split','cooking','thin','lengthwise','warm',\
                'softened','thick','quartered','juiced','pitted','chunk','melted','cold','coloring','puree','cored','stewed',\
                'floret','coarsely','the','clarified','blanched','zested','sweetened','powdered','longgrain','garnish','indian','dressing',\
                'soup','at','active','french','lean','chip','sour','condensed','long','smoked','ripe','skinned','fillet','from','stem','flaked',\
                'removed','zest','stalk','unsweetened','baby','cover','crust', 'extra', 'prepared', 'blend', 'of', 'ring','plus','firmly', 'packed',\
                'lightly','level','even','rounded','heaping','heaped','sifted','bushel','peck','stick','chopped','sliced','halves', 'shredded',\
                'slivered','sliced','whole','paste','whole',' fresh', 'peeled', 'diced','mashed','dried','frozen','fresh','peeled','candied',\
                'no', 'pulp','crystallized','canned','crushed','minced','julienned','clove','head', 'small','large','medium', 'good', 'quality', \
                'freshly']

In [8]:
#drop null 
nyt_data = nyt_data.dropna(axis = 0, subset = ['input', 'name'])

In [9]:
def clean_nyt_data(df, col, size):
  #clean extra words and brackets
    cleaned_col = []
    for _, row in df.iloc[:size].iterrows():
        #remove text within parantheses along with the parantheses
        row[col] = re.sub("[\(\[].*?[\)\]]", "", row[col])
        row[col] = row[col].replace("-", "")
        curr_row =  row[col].split()
        if len(curr_row) > 1:
            resultwords  = [word for word in curr_row if word.lower() not in remove_words]
            row[col] = ' '.join(resultwords)
        if row[col] == '':
            cleaned_col.append(" ")
        else:
            cleaned_col.append(row[col])
    df.iloc[:size][col] = cleaned_col
    return(df.iloc[:size])

In [10]:
cleaned_nyt_data = clean_nyt_data(nyt_data, 'name', 1000)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.iloc[:size][col] = cleaned_col


In [11]:
cleaned_nyt_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 1045
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   index      1000 non-null   int64  
 1   input      1000 non-null   object 
 2   name       1000 non-null   object 
 3   qty        1000 non-null   float64
 4   range_end  1000 non-null   float64
 5   unit       748 non-null    object 
 6   comment    621 non-null    object 
dtypes: float64(2), int64(1), object(4)
memory usage: 62.5+ KB


In [14]:
# Removing any obviously wrong names, e.g. small red onion, finely diced
cleaned_nyt_data = cleaned_nyt_data[cleaned_nyt_data['name'].str.len()<18]

***Transforming the NYTimes Dataset***

In [15]:
def generateEntity(line, ingredient_list, entity):
    curr_dict = {}
    if len(ingredient_list) == 1:
        ingd_regex = re.compile(ingredient_list[0])
        entity_match = ingd_regex.search(line)
        curr_dict['entities'] = [(entity_match.start(), entity_match.end(), entity)]
        return(curr_dict['entities'])
    else:
        for i in range(len(ingredient_list)):
            ingd_regex = re.compile(ingredient_list[i])
            entity_match = ingd_regex.search(line)
            if i == 0:
                curr_dict['entities'] = [(entity_match.start(), entity_match.end(), entity)]
            else:
                curr_dict['entities'].append((entity_match.start(), entity_match.end(), entity))
    return(curr_dict['entities'])


def generateTrainingData(df, inputCol, ingredientCol, entity):
    TRAIN_DATA = []
    subset = df[[inputCol, ingredientCol]]
    for ix in range(len(df)):
        line = subset.iloc[ix, 0]
        ingd_name = subset.iloc[ix, 1]
        ent_dict = {}
        ingd_list = ingd_name.split()
        flag = 0
        #for each token
        for ingredient in ingd_list:
            if line == 'nan' or ingredient == 'nan':
                flag = 1
                continue
            if ingredient not in line:
                flag = 1
                continue
        if flag == 0:
            ent_dict['entities'] = generateEntity(line, ingd_list, entity)
            TRAIN_DATA.append((line, ent_dict))
            print("\n", "Adding", (line, ent_dict), "to row {0}".format(ix + 1), end = " ")
        else:
            print("\n","Skipping {} row".format(ix + 1), end = " ")
    return(TRAIN_DATA)

In [16]:
TRAIN_DATA = generateTrainingData(cleaned_nyt_data, "input", 'name', 'INGREDIENT')


 Adding ('1 1/4 cups cooked and pureed fresh butternut squash, or 1 10-ounce package frozen squash, defrosted', {'entities': [(35, 44, 'INGREDIENT'), (45, 51, 'INGREDIENT')]}) to row 1 
 Adding ('1 cup peeled and cooked fresh chestnuts (about 20), or 1 cup canned, unsweetened chestnuts', {'entities': [(30, 39, 'INGREDIENT')]}) to row 2 
 Adding ('1 medium-size onion, peeled and chopped', {'entities': [(14, 19, 'INGREDIENT')]}) to row 3 
 Adding ('2 stalks celery, chopped coarse', {'entities': [(9, 15, 'INGREDIENT')]}) to row 4 
 Adding ('1 1/2 tablespoons vegetable oil', {'entities': [(18, 27, 'INGREDIENT'), (28, 31, 'INGREDIENT')]}) to row 5 
 Adding ('2 tablespoons unflavored gelatin, dissolved in 1/2 cup water', {'entities': [(25, 32, 'INGREDIENT')]}) to row 6 
 Adding ('Salt', {'entities': [(0, 4, 'INGREDIENT')]}) to row 7 
 Adding ('1 cup canned plum tomatoes with juice', {'entities': [(13, 17, 'INGREDIENT'), (18, 26, 'INGREDIENT')]}) to row 8 
 Adding ('6 cups veal or beef stock

 Adding ('1 cup short-grain rice', {'entities': [(6, 17, 'INGREDIENT'), (18, 22, 'INGREDIENT')]}) to row 287 
 Adding ('2 cups chicken stock, preferably homemade, or water', {'entities': [(7, 14, 'INGREDIENT'), (15, 20, 'INGREDIENT')]}) to row 288 
 Adding ('3 tablespoons extra-virgin olive oil', {'entities': [(27, 32, 'INGREDIENT'), (33, 36, 'INGREDIENT')]}) to row 289 
 Adding ('1 tablespoon chili powder', {'entities': [(13, 18, 'INGREDIENT'), (19, 25, 'INGREDIENT')]}) to row 290 
 Adding ('6 cups cooked pinto or kidney beans, or a mixture', {'entities': [(30, 35, 'INGREDIENT')]}) to row 291 
 Adding ('1/2 cup fresh lime juice', {'entities': [(14, 18, 'INGREDIENT'), (19, 24, 'INGREDIENT')]}) to row 292 
 Adding ('3 ripe Haas avocados', {'entities': [(7, 11, 'INGREDIENT'), (12, 20, 'INGREDIENT')]}) to row 293 
 Adding ('1 fresh jalapeno pepper, seeded and minced', {'entities': [(8, 16, 'INGREDIENT'), (17, 23, 'INGREDIENT')]}) to row 294 
 Adding ('2 medium-size cucumbers, peeled and s

 Adding ('2 tablespoons chopped parsley', {'entities': [(22, 29, 'INGREDIENT')]}) to row 537 
 Adding ('1 medium-size orange, peeled and sectioned', {'entities': [(14, 20, 'INGREDIENT')]}) to row 538 
 Adding ('1/3 cup chopped red onion', {'entities': [(20, 25, 'INGREDIENT')]}) to row 539 
 Adding ('2 tablespoons red-wine vinegar', {'entities': [(14, 22, 'INGREDIENT'), (23, 30, 'INGREDIENT')]}) to row 540 
 Adding ('1/3 cup olive or vegetable or corn oil', {'entities': [(35, 38, 'INGREDIENT')]}) to row 541 
 Adding ('2 endives', {'entities': [(2, 9, 'INGREDIENT')]}) to row 542 
 Adding ('4 tablespoons grated Parmesan or pecorino cheese', {'entities': [(42, 48, 'INGREDIENT')]}) to row 543 
 Adding ('1 bunch watercress, washed and drained', {'entities': [(8, 18, 'INGREDIENT')]}) to row 544 
 Adding ('4 tablespoons coarsely chopped Italian parsley', {'entities': [(39, 46, 'INGREDIENT')]}) to row 545 
 Adding ('3/4 cup heavy cream', {'entities': [(8, 13, 'INGREDIENT'), (14, 19, 'INGREDIENT

 Adding ('1 egg yolk', {'entities': [(2, 5, 'INGREDIENT')]}) to row 787 
 Adding ('20 ounces golden (whitefish) caviar', {'entities': [(29, 35, 'INGREDIENT')]}) to row 788 
 Adding ('1 1/2 pounds angel-hair pasta', {'entities': [(24, 29, 'INGREDIENT')]}) to row 789 
 Adding ('8 tablespoons unsalted butter, softened', {'entities': [(14, 22, 'INGREDIENT'), (23, 29, 'INGREDIENT')]}) to row 790 
 Adding ('2 bay leaves', {'entities': [(2, 5, 'INGREDIENT'), (6, 12, 'INGREDIENT')]}) to row 791 
 Adding ('1/3 cup Calvados or other brandy', {'entities': [(26, 32, 'INGREDIENT')]}) to row 792 
 Adding ('3 cups créme fraîche', {'entities': [(7, 12, 'INGREDIENT'), (13, 20, 'INGREDIENT')]}) to row 793 
 Adding ('Freshly grated Reggiano Parmesan', {'entities': [(15, 23, 'INGREDIENT'), (24, 32, 'INGREDIENT')]}) to row 794 
 Adding ('About 1 cup heavy cream', {'entities': [(12, 17, 'INGREDIENT'), (18, 23, 'INGREDIENT')]}) to row 795 
 Adding ('1 clove garlic, minced', {'entities': [(8, 14, 'INGREDIENT'

In [17]:
TRAIN_DATA

[('1 1/4 cups cooked and pureed fresh butternut squash, or 1 10-ounce package frozen squash, defrosted',
  {'entities': [(35, 44, 'INGREDIENT'), (45, 51, 'INGREDIENT')]}),
 ('1 cup peeled and cooked fresh chestnuts (about 20), or 1 cup canned, unsweetened chestnuts',
  {'entities': [(30, 39, 'INGREDIENT')]}),
 ('1 medium-size onion, peeled and chopped',
  {'entities': [(14, 19, 'INGREDIENT')]}),
 ('2 stalks celery, chopped coarse', {'entities': [(9, 15, 'INGREDIENT')]}),
 ('1 1/2 tablespoons vegetable oil',
  {'entities': [(18, 27, 'INGREDIENT'), (28, 31, 'INGREDIENT')]}),
 ('2 tablespoons unflavored gelatin, dissolved in 1/2 cup water',
  {'entities': [(25, 32, 'INGREDIENT')]}),
 ('Salt', {'entities': [(0, 4, 'INGREDIENT')]}),
 ('1 cup canned plum tomatoes with juice',
  {'entities': [(13, 17, 'INGREDIENT'), (18, 26, 'INGREDIENT')]}),
 ('6 cups veal or beef stock', {'entities': [(20, 25, 'INGREDIENT')]}),
 ('4 bay leaves', {'entities': [(2, 5, 'INGREDIENT'), (6, 12, 'INGREDIENT')]}),


In [18]:
print(f'Our training set has {len(TRAIN_DATA)} observations')

Our training set has 939 observations


In [24]:
def train_spacy(data,iterations):
    TRAIN_DATA = data
    nlp = spacy.load('en_core_web_sm')
    ner = nlp.get_pipe("ner")
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
         for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Statring iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp

In [26]:
ner_model = train_spacy(TRAIN_DATA, 25)

ValueError: [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.