# Steamboat Squad

Import and load data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import json

with open("/content/drive/MyDrive/NLP Assignment/recipes_ingredients.json", "r") as json_file:
    recipes = json.load(json_file)
    
len(recipes)

4702

Overview of data structure. This is a list of dictionary, where each dictionary is a recipe with its name, ingredients and url

In [3]:
recipes[0]

{'ingredients': ['¼ cup butter ',
  '2 tablespoons olive oil ',
  '1 teaspoon coarse salt ',
  '¼ teaspoon ground black pepper ',
  '3 cloves garlic, minced ',
  '1 pound fresh asparagus spears, trimmed '],
 'name': 'Pan-Fried Asparagus',
 'url': 'https://www.allrecipes.com/recipe/18318/pan-fried-asparagus/'}

Deleting url key

In [4]:
for recipe in recipes:
    del recipe['url']
recipes[0]

{'ingredients': ['¼ cup butter ',
  '2 tablespoons olive oil ',
  '1 teaspoon coarse salt ',
  '¼ teaspoon ground black pepper ',
  '3 cloves garlic, minced ',
  '1 pound fresh asparagus spears, trimmed '],
 'name': 'Pan-Fried Asparagus'}

# Preprocessing Recipe Names
- Lower-casing (normalise words by using POS tagging)
- Change numbers to fix number (place holder)

NLTK has a help function that explains its POS tags.

In [5]:
import nltk
from nltk import pos_tag, word_tokenize, RegexpParser, Tree
from nltk.tokenize import PunktSentenceTokenizer

nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [6]:
nltk.help.upenn_tagset() # retrieve the POS tags from NLTK

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

Using %%capture, save the NLTK help text as a string

In [7]:
%%capture cap --no-stderr

nltk.help.upenn_tagset() # retrieve the POS tags from NLTK

In [8]:
cap.stdout

'$: dollar\n    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$\n\'\': closing quotation mark\n    \' \'\'\n(: opening parenthesis\n    ( [ {\n): closing parenthesis\n    ) ] }\n,: comma\n    ,\n--: dash\n    --\n.: sentence terminator\n    . ! ?\n:: colon or ellipsis\n    : ; ...\nCC: conjunction, coordinating\n    & \'n and both but either et for less minus neither nor or plus so\n    therefore times v. versus vs. whether yet\nCD: numeral, cardinal\n    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-\n    seven 1987 twenty \'79 zero two 78-degrees eighty-four IX \'60s .025\n    fifteen 271,124 dozen quintillion DM2,000 ...\nDT: determiner\n    all an another any both del each either every half la many much nary\n    neither no some such that the them these this those\nEX: existential there\n    there\nFW: foreign word\n    gemeinschaft hund ich jeux habeas Haementeria Herr K\'ang-si vous\n    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte\n    terram 

Using RE, get all the tag names

In [9]:
# convert the POS tags retrieved from NLTK and convert into a list

import re

ALL_POS = re.findall(".*: +", cap.stdout)

for i, pos in enumerate(ALL_POS):
  ALL_POS[i] = pos.replace(': ', '')


ALL_POS

['$',
 "''",
 '(',
 ')',
 ',',
 '--',
 '.',
 ':',
 '    ',
 'CC',
 'CD',
 'DT',
 'EX',
 'FW',
 'IN',
 'JJ',
 'JJR',
 'JJS',
 'LS',
 'MD',
 'NN',
 'NNP',
 'NNPS',
 'NNS',
 'PDT',
 'POS',
 'PRP',
 'PRP$',
 'RB',
 'RBR',
 'RBS',
 'RP',
 'SYM',
 'TO',
 'UH',
 'VB',
 'VBD',
 'VBG',
 'VBN',
 'VBP',
 'VBZ',
 'WDT',
 'WP',
 'WP$',
 'WRB',
 '``']

In [10]:
ALL_POS.remove('    ') # this remove the unintentionally created tag
ALL_POS

['$',
 "''",
 '(',
 ')',
 ',',
 '--',
 '.',
 ':',
 'CC',
 'CD',
 'DT',
 'EX',
 'FW',
 'IN',
 'JJ',
 'JJR',
 'JJS',
 'LS',
 'MD',
 'NN',
 'NNP',
 'NNPS',
 'NNS',
 'PDT',
 'POS',
 'PRP',
 'PRP$',
 'RB',
 'RBR',
 'RBS',
 'RP',
 'SYM',
 'TO',
 'UH',
 'VB',
 'VBD',
 'VBG',
 'VBN',
 'VBP',
 'VBZ',
 'WDT',
 'WP',
 'WP$',
 'WRB',
 '``']

This series of codes attempts to extract the POS tags from NLTK's tagsets and put them into a list (for our usage). 

During the process, an additional tag '     ' was created unintentionally, hence, we remove it from the list. 


**Create a function to pos tag a text**

In [11]:
# Create a function to pos tag a text
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

def tag_pos(corpus):
    text=word_tokenize(corpus) # word_tokenize is a tokenization function in NLTK
    return nltk.pos_tag(text)

# To test if the tagging function works well
tag_pos("This is a test sentence.")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('test', 'NN'),
 ('sentence', 'NN'),
 ('.', '.')]

Create a function that POS tag and returns words with specific POS

In [12]:
# Create a function that POS tag and returns words with specific POS
def get_words_with_pos(text, pos):
  tagged = tag_pos(text)
  return [t for t in tagged if t[1].startswith(pos)]

# To test if the return function generates what we need
get_words_with_pos("This is a test sentence.", "NN")

[('test', 'NN'), ('sentence', 'NN')]

POS tag all recipe names

In [13]:
# POS-tagging the 'Recipe Names'

tagged_recipe_names = []

for i, recipe in enumerate(recipes):
  try:
    tagged_recipe_names.append(tag_pos(recipes[i]['name']))
  except Exception as e:
    pass

len(tagged_recipe_names)

4701

## Data cleaning for Recipe names based on POS tagging

Looking at the first 10 tagged recipe names, there is a need for pre-processing, as NLTK's tagging is confused by the letter casing.

For example, Pan, which is a noun, is tagged as proper noun (NNP) instead of noun (NN). this is because the POS tagger of NLTK identifies the capital letter P of the word 'pan', thus treating it as a proper noun. Similar observations are found on several more words such as 'bread', 'dead', 'potatoes'. 

Hence, we'll deep-dive into how words are POS-tagged. 

In [14]:
tagged_recipe_names[:10]

[[('Pan-Fried', 'JJ'), ('Asparagus', 'NNP')],
 [('Pan', 'NNP'),
  ('de', 'FW'),
  ('Muertos', 'NNP'),
  ('(', '('),
  ('Mexican', 'NNP'),
  ('Bread', 'NNP'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('Dead', 'NNP'),
  (')', ')')],
 [('Creamy', 'NNP'), ('Au', 'NNP'), ('Gratin', 'NNP'), ('Potatoes', 'NNP')],
 [('Super-Delicious', 'JJ'), ('Zuppa', 'NNP'), ('Toscana', 'NNP')],
 [('Simple', 'JJ'), ('Teriyaki', 'NNP'), ('Sauce', 'NNP')],
 [('Spicy', 'JJ'),
  ('Korean', 'NNP'),
  ('Fried', 'NNP'),
  ('Chicken', 'NNP'),
  ('with', 'IN'),
  ('Gochujang', 'NNP'),
  ('Sauce', 'NNP')],
 [('Spaghetti', 'NNP'), ('Aglio', 'NNP'), ('e', 'NN'), ('Olio', 'NNP')],
 [('Easy', 'JJ'), ('Garam', 'NNP'), ('Masala', 'NNP')],
 [('Easy', 'NNP'), ('Chorizo', 'NNP'), ('Street', 'NNP'), ('Tacos', 'NNP')],
 [('Tres', 'NNS'),
  ('Leches', 'NNP'),
  ('(', '('),
  ('Milk', 'NNP'),
  ('Cake', 'NNP'),
  (')', ')')]]

Create a function that returns all tagged words with the same tag. NLTK's POS tagging assumes that capitalized noun means proper noun (name).

In [15]:
# Create a function that returns all tagged words with the same tag. 
# NLTK's POS tagging assumes that capitalized noun means proper noun (name).

def list_words_with_tag(tuple_list, pos):
  results = []
  for name in tuple_list:
    for tag in name:
      if tag[1] == pos:
        results.append(tag[0])
  return results

list_words_with_tag(tagged_recipe_names, "NNP")

['Asparagus',
 'Pan',
 'Muertos',
 'Mexican',
 'Bread',
 'Dead',
 'Creamy',
 'Au',
 'Gratin',
 'Potatoes',
 'Zuppa',
 'Toscana',
 'Teriyaki',
 'Sauce',
 'Korean',
 'Fried',
 'Chicken',
 'Gochujang',
 'Sauce',
 'Spaghetti',
 'Aglio',
 'Olio',
 'Garam',
 'Masala',
 'Easy',
 'Chorizo',
 'Street',
 'Tacos',
 'Leches',
 'Milk',
 'Cake',
 'Cabbage',
 'Rolls',
 'Gravy',
 'Shrimp',
 'Scampi',
 'Pasta',
 'Lemon',
 'Chicken',
 'Potato',
 'Bake',
 'Mexican',
 'Casserole',
 'Caldo',
 'Res',
 'Mexican',
 'Beef',
 'Soup',
 'Nogada',
 'Mexican',
 'Stuffed',
 'Poblano',
 'Peppers',
 'Walnut',
 'Sauce',
 'Apple',
 'Cake',
 'Flan',
 'Pork',
 'Chops',
 'Sauerkraut',
 'Spicy',
 'Thai',
 'Basil',
 'Chicken',
 'Pad',
 'Krapow',
 'Gai',
 'Spaghetti',
 'Cacio',
 'Pepe',
 'Chef',
 'John',
 'Chicken',
 'Kiev',
 'Chicken',
 'Onions',
 'Fajita',
 'Perfect',
 'Sushi',
 'Rice',
 'Baked',
 'Chicken',
 'German',
 'Potato',
 'Salad',
 'Miso',
 'Soup',
 'Mexican',
 'Rice',
 'II',
 'Haluski',
 'Labneh',
 'Lebanese',
 'Y

Get the number of each POS tag

In [16]:
# Get the number of tokens in each POS tag
all_name_tags = []

for POS in ALL_POS:
  new_dic = {POS: list_words_with_tag(tagged_recipe_names, POS)}
  all_name_tags.append(new_dic)

In [17]:
def get_tag_number(tag_list):
  tag_numbers = []
  for tag in tag_list:
    for key, value in tag.items(): 
      new_dict = {key: len(value)}
    tag_numbers.append(new_dict)
  return tag_numbers

get_tag_number(all_name_tags)

[{'$': 1},
 {"''": 7},
 {'(': 529},
 {')': 529},
 {',': 63},
 {'--': 0},
 {'.': 10},
 {':': 98},
 {'CC': 555},
 {'CD': 74},
 {'DT': 104},
 {'EX': 0},
 {'FW': 47},
 {'IN': 482},
 {'JJ': 1822},
 {'JJR': 4},
 {'JJS': 27},
 {'LS': 0},
 {'MD': 2},
 {'NN': 571},
 {'NNP': 13139},
 {'NNPS': 46},
 {'NNS': 307},
 {'PDT': 0},
 {'POS': 348},
 {'PRP': 72},
 {'PRP$': 20},
 {'RB': 33},
 {'RBR': 0},
 {'RBS': 1},
 {'RP': 2},
 {'SYM': 0},
 {'TO': 20},
 {'UH': 0},
 {'VB': 24},
 {'VBD': 39},
 {'VBG': 50},
 {'VBN': 133},
 {'VBP': 10},
 {'VBZ': 22},
 {'WDT': 4},
 {'WP': 0},
 {'WP$': 0},
 {'WRB': 7},
 {'``': 6}]

Some names have numbers (CD). Some are obviously not numbers, like 'Figgy'

In [18]:
def get_values_from_dict_list(dict_list, key):
  values = []
  for d in dict_list:
    if key in d:
      values.append(d[key])
  return values

cd_tokens = get_values_from_dict_list(all_name_tags, 'CD')[0]
cd_tokens

['5',
 '16',
 '2',
 '13',
 '300',
 'Figgy',
 '3',
 '9',
 'Two',
 '9',
 '22',
 '10',
 '15',
 'One',
 '18',
 'Ten',
 'Flounder',
 'Three',
 'Ziti',
 'One',
 '21',
 'Four',
 '9',
 '65',
 '17',
 '14',
 '10',
 "'n",
 '15',
 '8',
 'Minestrone',
 'Four',
 '35',
 'Fly',
 '15',
 '23',
 '8',
 '15',
 '21',
 "That's-a",
 'Tex-Mex',
 '14',
 '17',
 'Five',
 '10',
 '18',
 '5',
 "'Otai",
 '17',
 '3',
 '17',
 '75',
 '17',
 '20',
 'Take-Out',
 '16',
 '12',
 'Three',
 "'Three",
 '15',
 '20',
 '16',
 '12',
 '15',
 '22',
 '12',
 'Three',
 '21',
 '21',
 '25',
 '7',
 '10',
 '19',
 '20']

Create a function that searches for recipe name with specific string

In [19]:
def find_value_with_char(dic_list, key, char):
  matches = []
  for recipe in dic_list:
    try:
      if char in recipe[key]:
        matches.append(recipe[key])
    except Exception as e:
      pass
  return matches

find_value_with_char(recipes, 'name', 'Figgy')

['Figgy Pudding']

'Three cup chicken' is indeed a name. On the other hand, numerics, such as 9 and 13 are not part of the actual names of dishes. So, numerics, instead of NLTK's CD, should be treated. This treatment should be done using regex.

In [20]:
for cd in cd_tokens:
  print(find_value_with_char(recipes, 'name', cd))

['Our 5 Best Avgolemono Soup Recipes', '5-Ingredient Mexican Casserole', '15 Mexican-Inspired Ground Beef Casseroles That Deliver Big Flavor With Every Satisfying Bite', 'Chicken 65', 'Pan-Roasted 5-Spice Pork Loin', 'The 15 Most Iconic French Desserts', '35 Quick and Easy Chinese Dinners You Can Make at Home', '15 Essential North Indian Recipes', '15 Essential North Indian Recipes', '18 Easy Mexican Dishes With 5 Ingredients or Less', 'French 75 Cocktail', '15 Top-Rated Traditional German Christmas Cookies', '15 Traditional Italian Christmas Dinner Recipes', "25 Italian Cookies You'll Love"]
['16 German Recipes That Are Comfort Food Favorites', '16 Mexican-Inspired Casseroles for Family-Pleasing Dinners', '16 Essential Puerto Rican Recipes']
['2 Minute Cheese Quesadillas', "22 Recipes Using a Whole Baguette (That Aren't Sandwiches)", 'Our 21 Best Authentic Mexican Recipes', '23 Delicious Ways the World Cooks Pork Shoulder', '21 Easy Dinners That Start with Packaged Gnocchi', 'Our 20 B

Create a function that searches a regex pattern from a text

In [21]:
def searchWordsPatt(text, patt):
    array = re.findall(patt, text)
    return array

NUMPATTERN = r'[0-9]+'
searchWordsPatt("I want 1 cup of tea", NUMPATTERN)

['1']

Create a function that substitutes regex patterns with a given value

In [22]:
def searchReplacePatt(text, patt, new_val):
  return re.sub(patt, new_val, text)

NUMSPACEPATTERN = r'(\d+\s)' # what is \d+\s ? 
searchReplacePatt("I want 1 cup of tea", NUMSPACEPATTERN, "")

'I want cup of tea'

searchReplacePatt, except it iterates recipe list

In [23]:
def searchReplacePattList(dict_list, patt, new_val, key="name"):
    for i, recipe in enumerate(dict_list):
        try:
            dict_list[i][key] = searchReplacePatt(dict_list[i][key], patt, new_val)
        except Exception as e:
            pass

searchReplacePattList, but adds a substring at given index

In [24]:
def searchReplaceAddPattList(dict_list, patt, new_val, substring, index=0, key="name"):
    for i, recipe in enumerate(dict_list):
        try:
            dict_list[i][key] = searchReplacePatt(dict_list[i][key], patt, new_val)
            added_string = list(dict_list[i][key]).insert(index, substring)
            dict_list[i][key]=''.join(added_string)
        except Exception as e:
            pass

Remove numerics from name

In [25]:
import re

p_recipes = recipes

searchReplacePattList(p_recipes, NUMSPACEPATTERN, "")

def retag(text_list, key):
  new_list = []
  for i, recipe in enumerate(text_list):
    try:
      new_list.append(tag_pos(recipes[i][key]))
    except Exception as e:
      pass
  return new_list

tagged_recipe_names = retag(p_recipes, "name")

Get the new remaining CD

In [26]:
new_cd_tokens = list_words_with_tag(tagged_recipe_names, "CD")
new_cd_tokens

['Figgy',
 'Two',
 'One',
 'Ten',
 'Flounder',
 'Three',
 'Ziti',
 'One',
 'Four',
 '65',
 "'n",
 'Minestrone',
 'Four',
 'Fly',
 "That's-a",
 'Tex-Mex',
 'Five',
 "'Otai",
 'Take-Out',
 'Three',
 "'Three",
 'Three']

The remaining numbers (CD) are part of actual recipe names

In [27]:
for cd in new_cd_tokens:
  print(find_value_with_char(p_recipes, 'name', cd))

['Figgy Pudding']
['Two-Ingredient Naan', 'Pollo alla Birra for Two']
['A Number One Egg Bread', 'One-Egg Egg Drop Soup', 'One Pot Thai-Style Rice Noodles', 'One-Pot Vegan Potato-Lentil Curry', 'One-Bite Thai "Flavor Bomb" Salad Wraps (Miang Kham)', 'Easy One-Skillet Ground Beef Burrito', 'One-Pot Greek Lemon Chicken and Rice']
['Tender Italian Baked Chicken', 'Tuscan Pork Tenderloin', 'Asian Pork Tenderloin', 'Italian Pork Tenderloin', 'Sweet and Sour Pork Tenderloin', 'Chipotle Crusted Pork Tenderloin', 'Ten Minute Szechuan Chicken', 'Thai Quivering Tenderloins', 'Spicy Pork Tenderloin', 'Chinese Pork Tenderloin', 'Grecian Pork Tenderloin', 'Havana Slow Cooker Pork Tenderloin', 'Curry Pork Tenderloin', 'Tender Juicy Skirt Steak  (Churrasco)', 'Spicy and Tender Corned Beef', 'Pan Roasted Pork Tenderloin with a Blue Cheese and Olive Stuffing']
['Flounder Mediterranean']
['Pastel de Tres Leches (Three Milk Cake)', 'Three-Meat Italian Meatballs', 'Three Cheese Manicotti II', 'Taiwanese-S

In [28]:
new_all_name_tags = []

for POS in ALL_POS:
  new_dic = {POS: list_words_with_tag(tagged_recipe_names, POS)}
  new_all_name_tags.append(new_dic)

Can and 'll are the modal verbs found

In [29]:
md_tokens = list_words_with_tag(tagged_recipe_names, "MD")
md_tokens

['Can', "'ll"]

'can' is caused by words such as Canadian, which is processed in next section. But, 'you'll love' is not part of recipe name and more of an expression

In [30]:
for md in md_tokens:
  print(find_value_with_char(p_recipes, 'name', md))

['Canadian Yellow Split Pea Soup with Ham', 'French Canadian Tourtiere', 'Pure Maple Candy', 'Cannoli', 'The Original Donair From the East Coast of Canada', 'Sauerkraut for Canning', 'Tourtiere (French Canadian Meat Pie)', 'Pumpkin Cannoli', 'Puerto Rican Canned Corned Beef Stew', 'Canadian Pork Loin Chops', 'Caneles de Bordeaux', 'Canadian Walleye (Pickerel)', "Thera's Canadian Fried Dough", 'Italian Baked Cannelloni', 'Canary Island Red Mojo Sauce', 'Mexican Tamarind Candy', 'Cantonese Chicken Chow Mein', 'Roti Canai/Paratha (Indian Pancake)', 'Polvorones de Canele (Cinnamon Cookies)', 'Miraculous Canadian Sugar Pie', 'Canadian Tea Biscuits', 'Peanut Butter Potato Candy', 'Irish Potato Candy', 'Filipino Pancit Bihon with Canton', 'Gorton (French-Canadian Pork Spread)', 'Quick and Easy Chinese Dinners You Can Make at Home', 'Chocolate Cantucci', 'Cantonese Style Lobster', 'Real Canadian Poutine', 'French Canadian Meatball Stew', 'Canadian Butter Tarts', 'Canadian Apple Pie', 'Cantones

Removing "You'll" and retagging new list

In [31]:
searchReplacePattList(p_recipes, r"(You'll Love)", "") # Shall we also remove 'Can'?
tagged_recipe_names = retag(p_recipes, "name")

'll' removed

In [32]:
new_md_tokens = list_words_with_tag(tagged_recipe_names, "MD")
new_md_tokens

['Can']

In [33]:
for md in new_md_tokens:
  print(find_value_with_char(p_recipes, 'name', md))

['Canadian Yellow Split Pea Soup with Ham', 'French Canadian Tourtiere', 'Pure Maple Candy', 'Cannoli', 'The Original Donair From the East Coast of Canada', 'Sauerkraut for Canning', 'Tourtiere (French Canadian Meat Pie)', 'Pumpkin Cannoli', 'Puerto Rican Canned Corned Beef Stew', 'Canadian Pork Loin Chops', 'Caneles de Bordeaux', 'Canadian Walleye (Pickerel)', "Thera's Canadian Fried Dough", 'Italian Baked Cannelloni', 'Canary Island Red Mojo Sauce', 'Mexican Tamarind Candy', 'Cantonese Chicken Chow Mein', 'Roti Canai/Paratha (Indian Pancake)', 'Polvorones de Canele (Cinnamon Cookies)', 'Miraculous Canadian Sugar Pie', 'Canadian Tea Biscuits', 'Peanut Butter Potato Candy', 'Irish Potato Candy', 'Filipino Pancit Bihon with Canton', 'Gorton (French-Canadian Pork Spread)', 'Quick and Easy Chinese Dinners You Can Make at Home', 'Chocolate Cantucci', 'Cantonese Style Lobster', 'Real Canadian Poutine', 'French Canadian Meatball Stew', 'Canadian Butter Tarts', 'Canadian Apple Pie', 'Cantones

Replacing any "/" with "or" word

In [34]:
searchReplacePattList(p_recipes, r"\/", " or ")
tagged_recipe_names = retag(p_recipes, "name")

In [35]:
bracket_tokens = list(set(list_words_with_tag(tagged_recipe_names, "(")))
bracket_tokens

['(']

Examining brackers in names. Most of the words in brackets are translations

In [36]:
bracketed_names = []
for bracket in bracket_tokens:
    names = find_value_with_char(p_recipes, 'name', bracket)
    print(names)
    bracketed_names = bracketed_names + names

bracketed_names = list(set(bracketed_names))

['Pan de Muertos (Mexican Bread of the Dead)', 'Tres Leches (Milk Cake)', 'Caldo de Res (Mexican Beef Soup)', 'Chiles en Nogada (Mexican Stuffed Poblano Peppers in Walnut Sauce)', 'Spicy Thai Basil Chicken (Pad Krapow Gai)', 'Labneh (Lebanese Yogurt)', 'Indian Chicken Curry (Murgh Kari)', 'Keema Aloo (Ground Beef and Potatoes)', 'Turkish Eggs (Cilbir)', 'South African Melktert (Milk Tart)', 'Ukrainian Apple Cake (Yabluchnyk)', 'Spanish Garlic Shrimp (Gambas al Ajillo)', 'Polish Noodles (Cottage Cheese and Noodles)', 'German Potato Dumplings (Kartoffelkloesse)', 'Apfelkuchen (Apple Cake)', 'Oyakodon (Japanese Chicken and Egg Rice Bowl)', 'Bibimbap (Korean Rice With Mixed Vegetables)', 'Eggplant Caponata (Sicilian Version)', 'Chana Masala (Savory Indian Chick Peas)', 'Ricotta Pie (Old Italian Recipe)', 'Easy Blini (Russian Pancake)', 'Easy Bulgogi (Korean BBQ Beef)', 'Carne en su Jugo (Meat in its Juices)', 'Ghormeh Sabzi (Persian Herb Stew)', 'Puerto Rican Tostones (Fried Plantains)', '

"(no red sauce here...golden)" needs to be removed

In [37]:
# Redundant descriptions
searchReplacePattList(p_recipes,  r"(no red sauce here...golden)", "")
searchReplacePattList(p_recipes, r"(From a Swede!)", "")
searchReplacePattList(p_recipes, r"(from a Chinese person)", "")
searchReplacePattList(p_recipes, r"(Now Vegetarian!)", "")
searchReplacePattList(p_recipes, r"a.k.a. ", "")
searchReplacePattList(p_recipes, r"(That Aren't Sandwiches)", "")

# Remove copyright symbol
searchReplacePattList(p_recipes, r"&reg;", "")
# Asian Sesame Seared or Grilled Tuna (Gluten Free) => Gluten Free Asian Sesame Seared or Grilled Tuna
searchReplaceAddPattList(p_recipes, r"(Gluten Free)", "", "glutten-free")
tagged_recipe_names = retag(p_recipes, "name")

In [38]:
bracketed_names = []
for bracket in bracket_tokens:
    names = find_value_with_char(p_recipes, 'name', bracket)
    print(names)
    bracketed_names = bracketed_names + names

bracketed_names = list(set(bracketed_names))

['Pan de Muertos (Mexican Bread of the Dead)', 'Tres Leches (Milk Cake)', 'Caldo de Res (Mexican Beef Soup)', 'Chiles en Nogada (Mexican Stuffed Poblano Peppers in Walnut Sauce)', 'Spicy Thai Basil Chicken (Pad Krapow Gai)', 'Labneh (Lebanese Yogurt)', 'Indian Chicken Curry (Murgh Kari)', 'Keema Aloo (Ground Beef and Potatoes)', 'Turkish Eggs (Cilbir)', 'South African Melktert (Milk Tart)', 'Ukrainian Apple Cake (Yabluchnyk)', 'Spanish Garlic Shrimp (Gambas al Ajillo)', 'Polish Noodles (Cottage Cheese and Noodles)', 'German Potato Dumplings (Kartoffelkloesse)', 'Apfelkuchen (Apple Cake)', 'Oyakodon (Japanese Chicken and Egg Rice Bowl)', 'Bibimbap (Korean Rice With Mixed Vegetables)', 'Eggplant Caponata (Sicilian Version)', 'Chana Masala (Savory Indian Chick Peas)', 'Ricotta Pie (Old Italian Recipe)', 'Easy Blini (Russian Pancake)', 'Easy Bulgogi (Korean BBQ Beef)', 'Carne en su Jugo (Meat in its Juices)', 'Ghormeh Sabzi (Persian Herb Stew)', 'Puerto Rican Tostones (Fried Plantains)', '

Only three foreign words detected by NLTK, which is not true

In [39]:
fw_tokens = list(set(list_words_with_tag(tagged_recipe_names, "FW")))
fw_tokens

['de', 'et', 'Rassolnik']

From the three unique foreign words, these are the names

In [40]:
fw_names = []
for fw in fw_tokens:
    names = find_value_with_char(p_recipes, 'name', fw)
    print(names)
    fw_names = fw_names + names
fw_names = list(set(fw_names))

['Pan de Muertos (Mexican Bread of the Dead)', 'Caldo de Res (Mexican Beef Soup)', 'Tender Italian Baked Chicken', 'Herbs de Provence', "Chef John's Beef Rouladen", 'Fideo', 'Tomatillo Salsa Verde', 'Ground Beef with Homemade Taco Seasoning Mix', 'German Beef Rouladen', 'Buche de Noel', 'Tuscan Pork Tenderloin', 'Sauteed Sweet Plantains (Tajaditas Dulces de Platano)', 'Homemade Mozzarella Cheese', 'Kotlet Schabowy (Polish Breaded Pork Chop)', 'Semmelknoedel (Bread Dumplings)', 'Homemade Manti (Traditional Turkish Dumplings)', 'Kalamata Olive Tapenade', 'Barbacoa-Style Shredded Beef', "Ingrid's Rouladen", 'Original Homemade Italian Beef', 'Slow Cooker Chile Verde', 'Chicken and Sliders', 'Caldo Verde (Portuguese Sausage Kale Soup)', 'German Hamburgers (Frikadellen)', 'Slow Cooker Mexican Recipes Under Calories', 'Asian Pork Tenderloin', 'Harissa Powder', 'Colorado Green Chili (Chile Verde)', 'Schupfnudeln (German Fried Potato Dumplings)', 'French Butter Cakes (Madeleines)', 'Italian Chi

In [41]:
fw_names

['Tamales Oaxaque&ntilde;os (Oaxacan-Style Tamales)',
 'Beef Enchiladas with Homemade Sauce',
 'Pastel de Elote (Mexican Corn Cake)',
 'Turmeric Golden Milk with Turmeric Paste',
 'Sweet Corn Cake',
 'Scottish Butter Tablet',
 "Chef John's Baby Porchetta",
 'French Tartiflette',
 'Easy and Delicious Slow Cooker Cassoulet',
 'Frikadeller (Danish Meatballs)',
 'Delicious Spaghetti Bread',
 'Homemade Spaghetti Sauce',
 'Vietnamese Rice Noodle Salad',
 'Easy Italian Sausage Spaghetti',
 'Tonkatsu Shoyu Ramen (Pork Cutlet Soy Sauce Ramen)',
 "Chef John's Spanish Garlic Soup (Sopa de Ajo)",
 'Stir Fried Sesame Vegetables with Rice',
 "Sarah's Feta Rice Pilaf",
 'Mushroom Stuffed Beef Rouladen',
 'Roasted Spaghetti Squash Lasagna Boats',
 'No-Cook Chicken Lettuce Wraps',
 'Caldo de Res (Mexican Beef Soup)',
 'Sweet and Crunchy Salad',
 'Iskender Kebab',
 'Galette des Rois',
 'Colorado Green Chili (Chile Verde)',
 'Stir Fried Wok Vegetables',
 'Easy Vegetarian Kofta Curry',
 'Fettuccini Carbon

Names that both have foreign words and bracket

In [42]:
bracket_and_fw = [name for name in bracketed_names if name in fw_names]
bracket_and_fw

['Tamales Oaxaque&ntilde;os (Oaxacan-Style Tamales)',
 'Pastel de Elote (Mexican Corn Cake)',
 'Frikadeller (Danish Meatballs)',
 'Tonkatsu Shoyu Ramen (Pork Cutlet Soy Sauce Ramen)',
 "Chef John's Spanish Garlic Soup (Sopa de Ajo)",
 'Caldo de Res (Mexican Beef Soup)',
 'Colorado Green Chili (Chile Verde)',
 'Coconut Cheese Flan (Flan de Coco y Queso)',
 'French Cookies (Belgi Galettes)',
 'Mexican Chicken and Rice Soup (Sopa de Pollo y Arroz)',
 'Yogurt-Marinated Salmon Fillets (Dahi Machhali Masaledar)',
 'Pupusas de Queso (Cheese-Stuffed Tortillas)',
 'Caldo Verde (Portuguese Green Soup)',
 "Paksiw na Pata (Pig's Feet Stew)",
 'Schupfnudeln (German Fried Potato Dumplings)',
 'Mini Molletes de Frijoles (Mexican Bruschetta with Beans)',
 "Peposa Dell'Impruneta (Tuscan Black Pepper Beef)",
 'Postre de Limon (Mexican Lime Dessert)',
 'Korean Sweet Potato Noodles (Japchae)',
 'Rigatoni al Segreto (Rigatoni with Secret Sauce)',
 'Feta Cheese Burek (Phyllo Dough)',
 'Birria de Res Tacos (

Split the names into two names, one outside and one inside

In [43]:
BRACKET_REGEX = " \(.*\)"
def break_fw_bracket(name):
    name1 = re.findall(BRACKET_REGEX, name)[0]
    name1 = name1[name1.find("(")+1:name1.find(")")]
    name2 = re.sub(BRACKET_REGEX, "", name)
    return name1, name2

print(break_fw_bracket("Hearty Caldo de Res (Mexican Beef Soup)"))
print(break_fw_bracket("Ukha (Russian Fish Soup)"))

('Mexican Beef Soup', 'Hearty Caldo de Res')
('Russian Fish Soup', 'Ukha')


Apply the split function. Delete old recipe with bracket and foreign words. In both of the new recipes, duplicate old ingredients.

In [44]:
for i, recipe in enumerate(p_recipes):
    try:
        if p_recipes[i]["name"] in bracket_and_fw:
            newname1, newname2 = break_fw_bracket(p_recipes[i]["name"])
            new_recipe1 = {'name': newname1, 'ingredients': p_recipes[i]["ingredients"]}
            new_recipe2 = {'name': newname2, 'ingredients': p_recipes[i]["ingredients"]}
            p_recipes.append(new_recipe1)
            p_recipes.append(new_recipe2)
            p_recipes.remove(p_recipes[i])
    except Exception as e:
        pass

tagged_recipe_names = retag(p_recipes, "name")

There are still remaining names with bracket, mostly due to the foreign words not being recognized.

In [45]:
bracketed_names = []
for bracket in bracket_tokens:
    names = find_value_with_char(p_recipes, 'name', bracket)
    print(names)
    bracketed_names = bracketed_names + names

bracketed_names = list(set(bracketed_names))

['Tres Leches (Milk Cake)', 'Chiles en Nogada (Mexican Stuffed Poblano Peppers in Walnut Sauce)', 'Spicy Thai Basil Chicken (Pad Krapow Gai)', 'Labneh (Lebanese Yogurt)', 'Indian Chicken Curry (Murgh Kari)', 'Keema Aloo (Ground Beef and Potatoes)', 'Turkish Eggs (Cilbir)', 'South African Melktert (Milk Tart)', 'Ukrainian Apple Cake (Yabluchnyk)', 'Spanish Garlic Shrimp (Gambas al Ajillo)', 'Polish Noodles (Cottage Cheese and Noodles)', 'German Potato Dumplings (Kartoffelkloesse)', 'Apfelkuchen (Apple Cake)', 'Oyakodon (Japanese Chicken and Egg Rice Bowl)', 'Eggplant Caponata (Sicilian Version)', 'Chana Masala (Savory Indian Chick Peas)', 'Ricotta Pie (Old Italian Recipe)', 'Easy Blini (Russian Pancake)', 'Easy Bulgogi (Korean BBQ Beef)', 'Carne en su Jugo (Meat in its Juices)', 'Ghormeh Sabzi (Persian Herb Stew)', 'Puerto Rican Tostones (Fried Plantains)', 'Kalbi (Korean BBQ Short Ribs)', 'Macaron (French Macaroon)', 'Atsara (Papaya Relish)', 'Authentic Chinese Egg Rolls ()', 'Greek Le

In [46]:
bracketed_names

['Jeera (Cumin) Rice',
 'Kagianas (Greek Eggs and Tomato)',
 'Papa a la Huancaina (Huancayo-Style Potatoes)',
 'Lentils and Rice with Fried Onions (Mujadarrah)',
 'Vampiros Mexicanos (Mexican Vampires)',
 'Marranitos (Mexican Pig-Shaped Cookies)',
 'Steamed Egg (Chawan Mushi)',
 'Real German Potato Salad (No Mayo)',
 'Authentic Chinese Egg Rolls ()',
 'Gulab Jamun or Kala Jam (Waffle Balls)',
 'Exotic Brinjal (Spicy Eggplant)',
 'Dal Makhani (Indian Lentils)',
 'Ginataang Manok (Chicken Cooked in Coconut Milk)',
 'Hawaiian Bruddah Potato Mac (Macaroni) Salad',
 'Ash-e-jow (Iranian or Persian Barley Soup)',
 'Frijoles Refritos (Refried Beans)',
 'Soy Eggs (Shoyu Tamago)',
 'Chapati (East African Bread)',
 'Kransekake (Norwegian Almond Ring Cake)',
 'Ginataang Alimasag (Crabs in Coconut Milk)',
 'Dansk Aebleskiver (Danish Doughnuts)',
 'Appeltaart (Dutch Apple Tart)',
 'Mongo Guisado (Mung Bean Soup)',
 'Lahmahjoon (Armenian Pizza)',
 'Quick Nariyal Burfi (Indian Coconut Fudge)',
 'Puert

Most of the brackets are at the end of each name. For those that are in the middle, they are translations of one of the words in the name.

In [47]:
b_name_end = []
b_name_mid = []
for b_name in bracketed_names:
    if b_name.endswith(')'):
        b_name_end.append(b_name)
    else:
        b_name_mid.append(b_name)
        
b_name_end

['Kagianas (Greek Eggs and Tomato)',
 'Papa a la Huancaina (Huancayo-Style Potatoes)',
 'Lentils and Rice with Fried Onions (Mujadarrah)',
 'Vampiros Mexicanos (Mexican Vampires)',
 'Marranitos (Mexican Pig-Shaped Cookies)',
 'Steamed Egg (Chawan Mushi)',
 'Real German Potato Salad (No Mayo)',
 'Authentic Chinese Egg Rolls ()',
 'Gulab Jamun or Kala Jam (Waffle Balls)',
 'Exotic Brinjal (Spicy Eggplant)',
 'Dal Makhani (Indian Lentils)',
 'Ginataang Manok (Chicken Cooked in Coconut Milk)',
 'Ash-e-jow (Iranian or Persian Barley Soup)',
 'Frijoles Refritos (Refried Beans)',
 'Soy Eggs (Shoyu Tamago)',
 'Chapati (East African Bread)',
 'Kransekake (Norwegian Almond Ring Cake)',
 'Ginataang Alimasag (Crabs in Coconut Milk)',
 'Dansk Aebleskiver (Danish Doughnuts)',
 'Appeltaart (Dutch Apple Tart)',
 'Mongo Guisado (Mung Bean Soup)',
 'Lahmahjoon (Armenian Pizza)',
 'Quick Nariyal Burfi (Indian Coconut Fudge)',
 'Puerto Rican Rice and Beans (Arroz con Gandules)',
 'Boterkoek (Dutch Butter 

In [48]:
b_name_mid

['Jeera (Cumin) Rice',
 'Hawaiian Bruddah Potato Mac (Macaroni) Salad',
 'Lamb (Gosht) Biryani',
 'Vareniki (Russian Pierogi) with Potatoes and Mushrooms',
 'Kimchi Jun (Kimchi Pancake) and Dipping Sauce',
 'Coconut (Haupia) and Chocolate Pie',
 'Albondigas (Meatballs) en Chipotle',
 'Ulu (Breadfruit) Pancakes',
 'Karaage (Japanese Fried Chicken) with Honey Mayoster Sauce',
 'Bee Sting Cake (Bienenstich) II',
 'Korean Bean Curd (Miso) Soup',
 'Fusilli with Rapini (Broccoli Rabe), Garlic, and Tomato Wine Sauce',
 'Spicy Indian (Gujarati) Green Beans',
 'Pollo (Chicken) Fricassee from Puerto Rico',
 "World's Best () Lasagna",
 'Lengua (Beef Tongue) Stew',
 'Seaweed (Nori) Soup',
 'Classic Cuban Midnight (Medianoche) Sandwich',
 'Fish Sinigang (Tilapia) - Filipino Sour Broth Dish',
 'Fried Chicken Chunks (Chicharrones De Pollo) Dominican',
 'Besan (Gram Flour) Halwa',
 'Zito (Zhito or Koljivo) - Serbian Wheat Pudding',
 'Lazy Golumpki (Stuffed Cabbage) Soup']

On the other hand, without parenthesis anymore, names with foregin words tagged are now clean

In [49]:
fw_names = []
for fw in fw_tokens:
    names = find_value_with_char(p_recipes, 'name', fw)
    print(names)
    fw_names = fw_names + names
fw_names = list(set(fw_names))

['Tender Italian Baked Chicken', 'Herbs de Provence', "Chef John's Beef Rouladen", 'Fideo', 'Tomatillo Salsa Verde', 'Ground Beef with Homemade Taco Seasoning Mix', 'German Beef Rouladen', 'Buche de Noel', 'Tuscan Pork Tenderloin', 'Homemade Mozzarella Cheese', 'Kalamata Olive Tapenade', 'Barbacoa-Style Shredded Beef', "Ingrid's Rouladen", 'Original Homemade Italian Beef', 'Slow Cooker Chile Verde', 'Chicken and Sliders', 'Slow Cooker Mexican Recipes Under Calories', 'Asian Pork Tenderloin', 'Harissa Powder', 'Italian Chicken Marinade', 'Cinder Toffee', 'Enchiladas Verdes', 'Authentic Enchiladas Verdes', 'Korean BBQ Chicken Marinade', 'Homemade Lasagna Sheets', 'Elk Steak Marinade', 'Modenese Pork Chops', 'Italian Pork Tenderloin', 'German Rouladen', 'Brazilian Lemonade', 'Shredded Beef Enchiladas', 'Brigadeiro', 'Homemade Hoisin Sauce', 'Caneles de Bordeaux', 'Homemade Portuguese Chicken', 'Homemade Spaghetti Sauce', 'Pasta de Sardine', 'Sweet and Sour Pork Tenderloin', 'Instant Pot C

In [50]:
fw_names

['Beef Enchiladas with Homemade Sauce',
 'Turmeric Golden Milk with Turmeric Paste',
 'Sweet Corn Cake',
 'Scottish Butter Tablet',
 "Chef John's Baby Porchetta",
 'French Tartiflette',
 'Easy and Delicious Slow Cooker Cassoulet',
 'Delicious Spaghetti Bread',
 'Homemade Spaghetti Sauce',
 'Caldereta',
 'Vietnamese Rice Noodle Salad',
 'Easy Italian Sausage Spaghetti',
 'Rigatoni al Segreto',
 'Stir Fried Sesame Vegetables with Rice',
 "Sarah's Feta Rice Pilaf",
 'Mushroom Stuffed Beef Rouladen',
 'Roasted Spaghetti Squash Lasagna Boats',
 'No-Cook Chicken Lettuce Wraps',
 'Sopa de Tortilla',
 'Sweet and Crunchy Salad',
 'Iskender Kebab',
 'Galette des Rois',
 'Vietnamese Noodle Soup',
 'Stir Fried Wok Vegetables',
 'Easy Vegetarian Kofta Curry',
 'Fettuccini Carbonara',
 'Homemade Lasagna Sheets',
 'Favorite Apple Galette',
 'Zucchini Taco Skillet',
 'Sweet Chicken Marsala',
 'Japanese Steakhouse Golden Shrimp',
 'Sweet Cornmeal Cake Brazilian-Style',
 'Vegetarian Borscht',
 'Sweet Co

For the remaining names with bracket at the end, split into two new recipe names

In [51]:
for i, recipe in enumerate(p_recipes):
    try:
        if p_recipes[i]["name"] in b_name_end:
            newname1, newname2 = break_fw_bracket(p_recipes[i]["name"])
            print(p_recipes[i]["name"])
            new_recipe1 = {'name': newname1, 'ingredients': p_recipes[i]["ingredients"]}
            new_recipe2 = {'name': newname2, 'ingredients': p_recipes[i]["ingredients"]}
            p_recipes.append(new_recipe1)
            p_recipes.append(new_recipe2)
            p_recipes.remove(p_recipes[i])
    except Exception as e:
        pass

tagged_recipe_names = retag(p_recipes, "name")

Tres Leches (Milk Cake)
Chiles en Nogada (Mexican Stuffed Poblano Peppers in Walnut Sauce)
Spicy Thai Basil Chicken (Pad Krapow Gai)
Labneh (Lebanese Yogurt)
Indian Chicken Curry (Murgh Kari)
Keema Aloo (Ground Beef and Potatoes)
Turkish Eggs (Cilbir)
South African Melktert (Milk Tart)
Ukrainian Apple Cake (Yabluchnyk)
Spanish Garlic Shrimp (Gambas al Ajillo)
German Potato Dumplings (Kartoffelkloesse)
Apfelkuchen (Apple Cake)
Eggplant Caponata (Sicilian Version)
Chana Masala (Savory Indian Chick Peas)
Ricotta Pie (Old Italian Recipe)
Easy Blini (Russian Pancake)
Easy Bulgogi (Korean BBQ Beef)
Carne en su Jugo (Meat in its Juices)
Ghormeh Sabzi (Persian Herb Stew)
Puerto Rican Tostones (Fried Plantains)
Kalbi (Korean BBQ Short Ribs)
Macaron (French Macaroon)
Atsara (Papaya Relish)
Authentic Chinese Egg Rolls ()
Greek Lentil Soup (Fakes)
Lumpia (Shanghai version)
Northern Ontario Partridge (Ruffed Grouse)
Vampiros Mexicanos (Mexican Vampires)
Jamaican Saltfish Fritters (Stamp and Go)
Slo

For some reasons, need to run the cell twice

In [52]:
for i, recipe in enumerate(p_recipes):
    try:
        if p_recipes[i]["name"] in b_name_end:
            newname1, newname2 = break_fw_bracket(p_recipes[i]["name"])
            print(p_recipes[i]["name"])
            new_recipe1 = {'name': newname1, 'ingredients': p_recipes[i]["ingredients"]}
            new_recipe2 = {'name': newname2, 'ingredients': p_recipes[i]["ingredients"]}
            p_recipes.append(new_recipe1)
            p_recipes.append(new_recipe2)
            p_recipes.remove(p_recipes[i])
    except Exception as e:
        pass

tagged_recipe_names = retag(p_recipes, "name")

Polish Noodles (Cottage Cheese and Noodles)
Oyakodon (Japanese Chicken and Egg Rice Bowl)
Papas Rellenas (Fried Stuffed Potatoes)
Blaukraut (German Red Cabbage)
Irish Boiled Dinner (Corned Beef)
True Dominican Sancocho (Latin 7-Meat Stew)
Blini (Russian Pancakes)
Oeufs Cocotte (Baked Eggs)
Ropa Vieja (Cuban Beef)
Lace Cookies (Florentine Cookies)
Sinigang na Bangus (Filipino Milkfish in Tamarind Broth)
Schwabischer Kartoffelsalat (German Potato Salad - Schwabisch Style)
Roti Canai or Paratha (Indian Pancake)
Melanzana alla Parmigiana (Perfect Eggplant Parmigiana)
Pierogi (Traditional Polish Dumplings)
Nipples of Venus (Capezzoli di Venere)
Samosadilla (Samosa Quesadilla)
Bulgogi (Korean Barbecued Beef)
Sabaayad (Somali Flatbread)
Filipino Baked Milkfish (Baked Bangus)
Ash-e Reshteh (Persian Legume Soup)
Lentil and Cactus Soup (Mom's Recipe)
Ethiopian Cabbage and Potato Dish (Atkilt)
Finnish Kropser (Baked Pancakes)
Oma's Griessnockerlsuppe (Beef and Semolina Dumpling Soup)
Kewa Datshi 

Only the names with bracket in the middle of their names remain

In [53]:
bracketed_names = []
for bracket in bracket_tokens:
    names = find_value_with_char(p_recipes, 'name', bracket)
    print(names)
    bracketed_names= bracketed_names + names

bracketed_names = list(set(bracketed_names))

['Classic Cuban Midnight (Medianoche) Sandwich', 'Spicy Indian (Gujarati) Green Beans', "World's Best () Lasagna", 'Karaage (Japanese Fried Chicken) with Honey Mayoster Sauce', 'Kimchi Jun (Kimchi Pancake) and Dipping Sauce', 'Bee Sting Cake (Bienenstich) II', 'Coconut (Haupia) and Chocolate Pie', 'Lamb (Gosht) Biryani', 'Jeera (Cumin) Rice', 'Pollo (Chicken) Fricassee from Puerto Rico', 'Fish Sinigang (Tilapia) - Filipino Sour Broth Dish', 'Lazy Golumpki (Stuffed Cabbage) Soup', 'Ulu (Breadfruit) Pancakes', 'Fried Chicken Chunks (Chicharrones De Pollo) Dominican', 'Fusilli with Rapini (Broccoli Rabe), Garlic, and Tomato Wine Sauce', 'Seaweed (Nori) Soup', 'Vareniki (Russian Pierogi) with Potatoes and Mushrooms', 'Hawaiian Bruddah Potato Mac (Macaroni) Salad', 'Korean Bean Curd (Miso) Soup', 'Lengua (Beef Tongue) Stew', 'Albondigas (Meatballs) en Chipotle', 'Zito (Zhito or Koljivo) - Serbian Wheat Pudding', 'Besan (Gram Flour) Halwa']


Mac and rapini is only synonymous the the one word before them. Otherwise, the bracketed words are synonymous to all the words before them combined.

In [54]:
bracketed_names

['Coconut (Haupia) and Chocolate Pie',
 "World's Best () Lasagna",
 'Jeera (Cumin) Rice',
 'Seaweed (Nori) Soup',
 'Classic Cuban Midnight (Medianoche) Sandwich',
 'Korean Bean Curd (Miso) Soup',
 'Lengua (Beef Tongue) Stew',
 'Zito (Zhito or Koljivo) - Serbian Wheat Pudding',
 'Albondigas (Meatballs) en Chipotle',
 'Besan (Gram Flour) Halwa',
 'Fusilli with Rapini (Broccoli Rabe), Garlic, and Tomato Wine Sauce',
 'Ulu (Breadfruit) Pancakes',
 'Karaage (Japanese Fried Chicken) with Honey Mayoster Sauce',
 'Kimchi Jun (Kimchi Pancake) and Dipping Sauce',
 'Lazy Golumpki (Stuffed Cabbage) Soup',
 'Spicy Indian (Gujarati) Green Beans',
 'Pollo (Chicken) Fricassee from Puerto Rico',
 'Fried Chicken Chunks (Chicharrones De Pollo) Dominican',
 'Lamb (Gosht) Biryani',
 'Bee Sting Cake (Bienenstich) II',
 'Fish Sinigang (Tilapia) - Filipino Sour Broth Dish',
 'Hawaiian Bruddah Potato Mac (Macaroni) Salad',
 'Vareniki (Russian Pierogi) with Potatoes and Mushrooms']

The names can still be duplicated into 2, except that the bracketed word replaces the words before in the second new name, treating them as synonyms.

In [55]:
def convert_bracket_synonym(name, num=0):
    name1 = re.findall(BRACKET_REGEX, name)[0]
    name1 = name1[name1.find("(")+1:name1.find(")")]
    name1_suffix = name.split(')')[1]
    if num==0:
        name1 = name1 + name1_suffix
        name2 = re.sub(BRACKET_REGEX, "", name)
    else:
        name1_prefix = name.split('(')[0]
        name1_prefix = name1_prefix[:-num]
        name1 = name1_prefix + name1 + name1_suffix
        name2 = re.sub(BRACKET_REGEX, " ", name)
    return name1, name2

print(convert_bracket_synonym("Lamb (Gosht) Biryani"))
print(convert_bracket_synonym("Fusilli with Rapini (Broccoli Rabe), Garlic, and Tomato Wine Sauce", 1))
print(convert_bracket_synonym("Hawaiian Bruddah Potato Mac (Macaroni) Salad", 1))

('Gosht Biryani', 'Lamb Biryani')
('Fusilli with RapiniBroccoli Rabe, Garlic, and Tomato Wine Sauce', 'Fusilli with Rapini , Garlic, and Tomato Wine Sauce')
('Hawaiian Bruddah Potato MacMacaroni Salad', 'Hawaiian Bruddah Potato Mac  Salad')


### *Just a small bug, space between words after extracting the bracketted word*

In [56]:
for i, recipe in enumerate(p_recipes):
    try:
        if p_recipes[i]["name"] in b_name_mid:
            newname1, newname2 = convert_bracket_synonym(p_recipes[i]["name"])
            print(p_recipes[i]["name"])
            new_recipe1 = {'name': newname1, 'ingredients': p_recipes[i]["ingredients"]}
            new_recipe2 = {'name': newname2, 'ingredients': p_recipes[i]["ingredients"]}
            p_recipes.append(new_recipe1)
            p_recipes.append(new_recipe2)
            p_recipes.remove(p_recipes[i])
    except Exception as e:
        pass

tagged_recipe_names = retag(p_recipes, "name")

Classic Cuban Midnight (Medianoche) Sandwich
Spicy Indian (Gujarati) Green Beans
World's Best () Lasagna
Karaage (Japanese Fried Chicken) with Honey Mayoster Sauce
Kimchi Jun (Kimchi Pancake) and Dipping Sauce
Bee Sting Cake (Bienenstich) II
Coconut (Haupia) and Chocolate Pie
Lamb (Gosht) Biryani
Jeera (Cumin) Rice
Pollo (Chicken) Fricassee from Puerto Rico
Fish Sinigang (Tilapia) - Filipino Sour Broth Dish
Lazy Golumpki (Stuffed Cabbage) Soup
Ulu (Breadfruit) Pancakes
Fried Chicken Chunks (Chicharrones De Pollo) Dominican
Fusilli with Rapini (Broccoli Rabe), Garlic, and Tomato Wine Sauce
Seaweed (Nori) Soup
Vareniki (Russian Pierogi) with Potatoes and Mushrooms
Hawaiian Bruddah Potato Mac (Macaroni) Salad
Korean Bean Curd (Miso) Soup
Lengua (Beef Tongue) Stew
Albondigas (Meatballs) en Chipotle
Zito (Zhito or Koljivo) - Serbian Wheat Pudding
Besan (Gram Flour) Halwa


Successfully removed all brackets from recipe names

In [57]:
bracketed_names = []
for bracket in bracket_tokens:
    names = find_value_with_char(p_recipes, 'name', bracket)
    print(names)
    bracketed_names= bracketed_names + names

bracketed_names = list(set(bracketed_names))
bracketed_names

[]


[]

Dashes are mostly adjectives, but things like semi colon need to be removed. As for colons, its mostly translation. Semicolons are caused by K&auml;, which are dishes with special characters or German words.

In [58]:
colon_tokens = list(set(list_words_with_tag(tagged_recipe_names, ":")))
colon_tokens

[';', ':', '-']

In [59]:
for colon in colon_tokens:
  print(find_value_with_char(p_recipes, 'name', colon))

['Quorn&trade; and Chickpea Curry', 'Empanadas Salte&ntilde;as', 'Sp&auml;tzle', 'Tamales Oaxaque&ntilde;os', 'K&auml;sesahnetorte', 'K&auml;sesahnetorte']
['Spaghetti alla Carbonara: the Traditional Italian Recipe', 'Doro Wat: Ethiopian Chicken Dish', "Grandma's Focaccia: Baraise Style"]
['Pan-Fried Asparagus', 'Super-Delicious Zuppa Toscana', 'Indian-Style Chicken and Onions', 'Haluski - Cabbage and Noodles', 'Chicken Stir-Fry', 'Quick Beef Stir-Fry', 'How to Make Coquilles Saint-Jacques', 'Mexican-Style Chicken Taco Casserole', 'Make-Ahead Vegetarian Moroccan Stew', 'Japanese-Style Deep-Fried Shrimp', 'Carnitas - Pressure Cooker', 'Chicken and Broccoli Stir-Fry', 'Broccoli and Chicken Stir-Fry', 'Ginger Veggie Stir-Fry', 'White Chicken Enchilada Slow-Cooker Casserole', 'Old-Fashioned Swedish Glogg', 'Stir-Fry Chicken and Vegetables', 'Barbacoa-Style Shredded Beef', 'Simple Slow-Cooked Korean Beef Soft Tacos', 'Air-Fried Korean Chicken Wings', 'Kouign-Amann', 'Gnocchi with Sage-Butte

In [60]:
def remove_entry_with(dict_list, target, key="name"):
    for i, recipe in enumerate(dict_list):
        try:
            if target in dict_list[i]["name"]:
                dict_list.remove(dict_list[i])
        except Exception as e:
            pass

In [61]:
for semicolon in ["Quorn&trade;", "Sp&auml;tzle", "Tamales Oaxaque&ntilde;os", "K&auml;sesahnetorte", "Salte&ntilde;as"]:
    remove_entry_with(p_recipes, semicolon)
tagged_recipe_names = retag(p_recipes, "name")

Semi colons cleaned

In [62]:
colon_tokens = list(set(list_words_with_tag(tagged_recipe_names, ":")))
colon_tokens

[':', '-']

In [63]:
for colon in colon_tokens:
  print(find_value_with_char(p_recipes, 'name', colon))

['Spaghetti alla Carbonara: the Traditional Italian Recipe', 'Doro Wat: Ethiopian Chicken Dish', "Grandma's Focaccia: Baraise Style"]
['Pan-Fried Asparagus', 'Super-Delicious Zuppa Toscana', 'Indian-Style Chicken and Onions', 'Haluski - Cabbage and Noodles', 'Chicken Stir-Fry', 'Quick Beef Stir-Fry', 'How to Make Coquilles Saint-Jacques', 'Mexican-Style Chicken Taco Casserole', 'Make-Ahead Vegetarian Moroccan Stew', 'Japanese-Style Deep-Fried Shrimp', 'Carnitas - Pressure Cooker', 'Chicken and Broccoli Stir-Fry', 'Broccoli and Chicken Stir-Fry', 'Ginger Veggie Stir-Fry', 'White Chicken Enchilada Slow-Cooker Casserole', 'Old-Fashioned Swedish Glogg', 'Stir-Fry Chicken and Vegetables', 'Barbacoa-Style Shredded Beef', 'Simple Slow-Cooked Korean Beef Soft Tacos', 'Air-Fried Korean Chicken Wings', 'Kouign-Amann', 'Gnocchi with Sage-Butter Sauce', 'Giant Bacon-Wrapped Meatballs', 'Low-Carb Cauliflower Rice Sushi Rolls', 'Onigiri - Japanese Rice Balls', "Frank's Favorite Slow-Cooker Thai Chic

For these 2 names, colons are used for describing

In [64]:
# Spaghetti alla Carbonara: the Traditional Italian Recipe => traditional Italian Spaghetti alla Carbonara
searchReplaceAddPattList(p_recipes, r": the Traditional Italian Recipe", "", "traditional Italian ")
# Grandma's Focaccia: Baraise Style => Grandma's Baraise Style Focaccia
searchReplaceAddPattList(p_recipes, r": Baraise Style", "", "Baraise Style ", index=10)
tagged_recipe_names = retag(p_recipes, "name")

Cleaned 2 names with colon. If the dashes are between a word, they are either part of a word's spelling or joining two words together, typically as an adjective. However, if it is between spaces, they are translations.

In [65]:
colon_tokens = list(set(list_words_with_tag(tagged_recipe_names, ":")))
colon_tokens

[':', '-']

In [66]:
new_colon_names = []
for colon in colon_tokens:
    print(find_value_with_char(p_recipes, 'name', colon))
    new_colon_names=new_colon_names+find_value_with_char(p_recipes, 'name', colon)
new_colon_names

['Doro Wat: Ethiopian Chicken Dish']
['Pan-Fried Asparagus', 'Super-Delicious Zuppa Toscana', 'Indian-Style Chicken and Onions', 'Haluski - Cabbage and Noodles', 'Chicken Stir-Fry', 'Quick Beef Stir-Fry', 'How to Make Coquilles Saint-Jacques', 'Mexican-Style Chicken Taco Casserole', 'Make-Ahead Vegetarian Moroccan Stew', 'Japanese-Style Deep-Fried Shrimp', 'Carnitas - Pressure Cooker', 'Chicken and Broccoli Stir-Fry', 'Broccoli and Chicken Stir-Fry', 'Ginger Veggie Stir-Fry', 'White Chicken Enchilada Slow-Cooker Casserole', 'Old-Fashioned Swedish Glogg', 'Stir-Fry Chicken and Vegetables', 'Barbacoa-Style Shredded Beef', 'Simple Slow-Cooked Korean Beef Soft Tacos', 'Air-Fried Korean Chicken Wings', 'Kouign-Amann', 'Gnocchi with Sage-Butter Sauce', 'Giant Bacon-Wrapped Meatballs', 'Low-Carb Cauliflower Rice Sushi Rolls', 'Onigiri - Japanese Rice Balls', "Frank's Favorite Slow-Cooker Thai Chicken", 'Two-Ingredient Naan', 'Chicken French - Rochester, NY Style', 'Velveting Chicken Breast, C

['Doro Wat: Ethiopian Chicken Dish',
 'Pan-Fried Asparagus',
 'Super-Delicious Zuppa Toscana',
 'Indian-Style Chicken and Onions',
 'Haluski - Cabbage and Noodles',
 'Chicken Stir-Fry',
 'Quick Beef Stir-Fry',
 'How to Make Coquilles Saint-Jacques',
 'Mexican-Style Chicken Taco Casserole',
 'Make-Ahead Vegetarian Moroccan Stew',
 'Japanese-Style Deep-Fried Shrimp',
 'Carnitas - Pressure Cooker',
 'Chicken and Broccoli Stir-Fry',
 'Broccoli and Chicken Stir-Fry',
 'Ginger Veggie Stir-Fry',
 'White Chicken Enchilada Slow-Cooker Casserole',
 'Old-Fashioned Swedish Glogg',
 'Stir-Fry Chicken and Vegetables',
 'Barbacoa-Style Shredded Beef',
 'Simple Slow-Cooked Korean Beef Soft Tacos',
 'Air-Fried Korean Chicken Wings',
 'Kouign-Amann',
 'Gnocchi with Sage-Butter Sauce',
 'Giant Bacon-Wrapped Meatballs',
 'Low-Carb Cauliflower Rice Sushi Rolls',
 'Onigiri - Japanese Rice Balls',
 "Frank's Favorite Slow-Cooker Thai Chicken",
 'Two-Ingredient Naan',
 'Chicken French - Rochester, NY Style',
 

But in some cases, they are words after the dashes describe the dish, such as Rochester, NY Style and Restaurant Style

In [67]:
for colname in new_colon_names:
    if len(re.findall("( - )|(: )", colname)) > 0:
        print(colname)

Doro Wat: Ethiopian Chicken Dish
Haluski - Cabbage and Noodles
Carnitas - Pressure Cooker
Onigiri - Japanese Rice Balls
Chicken French - Rochester, NY Style
Taqueria Style Tacos - Carne Asada
Al Kabsa - Traditional Saudi Rice and Chicken
Italian Subs - Restaurant Style
Bazlama - Turkish Flat Bread
Norwegian Pancakes - Pannekaken
Pain de Campagne - Country French Bread
Flemish Frites - Belgian Fries with Andalouse Sauce
Portuguese Custard Tarts - Pasteis de Nata
Eggplant Parmesan - Gluten-Free
Tonkatsu - Asian-Style Pork Chop
Indian Eggplant - Bhurtha
Hot Pepper Sauce - A Trinidadian Staple
The Sarge's Goetta - German Breakfast Treat
Italian Sausage - Tuscan Style
Honey Milk Tea - Hong Kong Style
Mexican Lasagna - No Lasagna Noodles!
Lumpia - Filipino Shrimp and Pork Egg Rolls
Portuguese Muffins - Bolo Levedo
Curry Pasta - Pakistani Style
Cauliflower and Potato Stir-Fry - East Indian Recipe
Keftedes - Greek Meatballs
Brasato al Barolo - Braised Chuck Roast in Red Wine
Potato Salad - Ger

Replace or remove the remaining dashes that are surrounded by spaces

In [68]:
# Chicken French - Rochester, NY Style => Rochester, NY Style Chicken French
searchReplaceAddPattList(p_recipes, r" - Rochester, NY Style", "", "Rochester, NY Style ")
# Carnitas - Pressure Cooker => pressure cooker carnitas
searchReplaceAddPattList(p_recipes, r" - Rochester, NY Style", "", "Rochester, NY Style ")
# Italian Subs - Restaurant Style => restaurant style Italian subs
searchReplaceAddPattList(p_recipes, r" - Restaurant Style", "", "restaurant style ")
# Eggplant Parmesan - Gluten-Free => glutten-free Eggplant Parmesan
searchReplaceAddPattList(p_recipes, r" - Gluten-Free", "", "glutten-free ")
# Italian Sausage - Tuscan Style => Tuscan style Italian Sausage
searchReplaceAddPattList(p_recipes, r" - Tuscan Style", "", "Tuscan style ")
# Honey Milk Tea - Hong Kong Style => Hong Kong style Honey Milk Tea
searchReplaceAddPattList(p_recipes, r" - Hong Kong Style", "", "Hong Kong style ")
# Curry Pasta - Pakistani Style => Pakistani style Curry Pasta
searchReplaceAddPattList(p_recipes, r" - Pakistani Style", "", "Pakistani style ")
# Cauliflower and Potato Stir-Fry - East Indian Recipe => East Indian style Cauliflower and Potato Stir-Fry
searchReplaceAddPattList(p_recipes, r" - East Indian Recipe", "", "East Indian style ")
# German Potato Salad - Schwabisch Style => Schwabisch style German Potato Salad
searchReplaceAddPattList(p_recipes, r" - Schwabisch Style", "", "Schwabisch style ")
# Tilapia - Filipino Sour Broth Dish => Filipino Sour Broth tilapia
searchReplaceAddPattList(p_recipes, r"Tilapia - ", "", "tilapia", index=20)
# Fish Sinigang - Filipino Sour Broth Dish - Schwabisch Style => Filipino Sour Broth Sinigang fish
searchReplaceAddPattList(p_recipes, r"Fish Sinigang - ", "", "Sinigang fish", index=20)

# remove  - A Trinidadian Staple from Hot Pepper Sauce - A Trinidadian Staple
searchReplacePattList(p_recipes, r" - A Trinidadian Staple", "")
# remove  - German Breakfast Treat from The Sarge's Goetta - German Breakfast Treat
searchReplacePattList(p_recipes, r" - German Breakfast Treat", "")
# remove  - No Lasagna Noodles! from Mexican Lasagna - No Lasagna Noodles!
searchReplacePattList(p_recipes, r" - No Lasagna Noodles!", "")
# remove  - Not Just for Chicken from Sweet and Sour Jam - Not Just for Chicken
searchReplacePattList(p_recipes, r" - Not Just for Chicken", "")
                      
tagged_recipe_names = retag(p_recipes, "name")

### *please check the line Carnitas - Pressure Cooker => pressure cooker carnitas*

In [69]:
new_colon_names = []
for colon in colon_tokens:
    print(find_value_with_char(p_recipes, 'name', colon))
    new_colon_names=new_colon_names+find_value_with_char(p_recipes, 'name', colon)
new_colon_names

['Doro Wat: Ethiopian Chicken Dish']
['Pan-Fried Asparagus', 'Super-Delicious Zuppa Toscana', 'Indian-Style Chicken and Onions', 'Haluski - Cabbage and Noodles', 'Chicken Stir-Fry', 'Quick Beef Stir-Fry', 'How to Make Coquilles Saint-Jacques', 'Mexican-Style Chicken Taco Casserole', 'Make-Ahead Vegetarian Moroccan Stew', 'Japanese-Style Deep-Fried Shrimp', 'Carnitas - Pressure Cooker', 'Chicken and Broccoli Stir-Fry', 'Broccoli and Chicken Stir-Fry', 'Ginger Veggie Stir-Fry', 'White Chicken Enchilada Slow-Cooker Casserole', 'Old-Fashioned Swedish Glogg', 'Stir-Fry Chicken and Vegetables', 'Barbacoa-Style Shredded Beef', 'Simple Slow-Cooked Korean Beef Soft Tacos', 'Air-Fried Korean Chicken Wings', 'Kouign-Amann', 'Gnocchi with Sage-Butter Sauce', 'Giant Bacon-Wrapped Meatballs', 'Low-Carb Cauliflower Rice Sushi Rolls', 'Onigiri - Japanese Rice Balls', "Frank's Favorite Slow-Cooker Thai Chicken", 'Two-Ingredient Naan', 'Velveting Chicken Breast, Chinese Restaurant-Style', 'Garlic-Herb L

['Doro Wat: Ethiopian Chicken Dish',
 'Pan-Fried Asparagus',
 'Super-Delicious Zuppa Toscana',
 'Indian-Style Chicken and Onions',
 'Haluski - Cabbage and Noodles',
 'Chicken Stir-Fry',
 'Quick Beef Stir-Fry',
 'How to Make Coquilles Saint-Jacques',
 'Mexican-Style Chicken Taco Casserole',
 'Make-Ahead Vegetarian Moroccan Stew',
 'Japanese-Style Deep-Fried Shrimp',
 'Carnitas - Pressure Cooker',
 'Chicken and Broccoli Stir-Fry',
 'Broccoli and Chicken Stir-Fry',
 'Ginger Veggie Stir-Fry',
 'White Chicken Enchilada Slow-Cooker Casserole',
 'Old-Fashioned Swedish Glogg',
 'Stir-Fry Chicken and Vegetables',
 'Barbacoa-Style Shredded Beef',
 'Simple Slow-Cooked Korean Beef Soft Tacos',
 'Air-Fried Korean Chicken Wings',
 'Kouign-Amann',
 'Gnocchi with Sage-Butter Sauce',
 'Giant Bacon-Wrapped Meatballs',
 'Low-Carb Cauliflower Rice Sushi Rolls',
 'Onigiri - Japanese Rice Balls',
 "Frank's Favorite Slow-Cooker Thai Chicken",
 'Two-Ingredient Naan',
 'Velveting Chicken Breast, Chinese Restau

The remaining names with dashes surrounded by dashes are translations, which can be split into two names

In [70]:
colnames_to_split = []
for colname in new_colon_names:
    if len(re.findall("( - )|(: )", colname)) > 0:
        print(colname)
        colnames_to_split.append(colname)

Doro Wat: Ethiopian Chicken Dish
Haluski - Cabbage and Noodles
Carnitas - Pressure Cooker
Onigiri - Japanese Rice Balls
Taqueria Style Tacos - Carne Asada
Al Kabsa - Traditional Saudi Rice and Chicken
Bazlama - Turkish Flat Bread
Norwegian Pancakes - Pannekaken
Pain de Campagne - Country French Bread
Flemish Frites - Belgian Fries with Andalouse Sauce
Portuguese Custard Tarts - Pasteis de Nata
Tonkatsu - Asian-Style Pork Chop
Indian Eggplant - Bhurtha
Lumpia - Filipino Shrimp and Pork Egg Rolls
Portuguese Muffins - Bolo Levedo
Keftedes - Greek Meatballs
Brasato al Barolo - Braised Chuck Roast in Red Wine
Potato Salad - German Kartoffel
Tembleque de Coco - Coconut Tembleque
Kroppkakor - Swedish Potato Dumplings
Ladolemono - Lemon Oil Sauce for Fish or Chicken
Mie Goreng - Indonesian Fried Noodles
Vaselopita - Greek New Years Cake
Knedliky - Czech Dumpling with Sauerkraut
Zhito or Koljivo - Serbian Wheat Pudding
Zito - Serbian Wheat Pudding


In [71]:
for i, recipe in enumerate(p_recipes):
    try:
        if p_recipes[i]["name"] in colnames_to_split:
            splits = re.split("( - )|(: )", p_recipes[i]["name"])
            newname1 = splits[0]
            newname2 = splits[len(splits)-1]
            new_recipe1 = {'name': newname1, 'ingredients': p_recipes[i]["ingredients"]}
            new_recipe2 = {'name': newname2, 'ingredients': p_recipes[i]["ingredients"]}
            p_recipes.append(new_recipe1)
            p_recipes.append(new_recipe2)
            p_recipes.remove(p_recipes[i])
    except Exception as e:
        pass

tagged_recipe_names = retag(p_recipes, "name")

The remaining names with dash are those in words

In [72]:
colon_tokens = list(set(list_words_with_tag(tagged_recipe_names, ":")))
colon_tokens

['-']

In [73]:
new_colon_names = []
for colon in colon_tokens:
    print(find_value_with_char(p_recipes, 'name', colon))
    new_colon_names=new_colon_names+find_value_with_char(p_recipes, 'name', colon)
new_colon_names

['Pan-Fried Asparagus', 'Super-Delicious Zuppa Toscana', 'Indian-Style Chicken and Onions', 'Chicken Stir-Fry', 'Quick Beef Stir-Fry', 'How to Make Coquilles Saint-Jacques', 'Mexican-Style Chicken Taco Casserole', 'Make-Ahead Vegetarian Moroccan Stew', 'Japanese-Style Deep-Fried Shrimp', 'Chicken and Broccoli Stir-Fry', 'Broccoli and Chicken Stir-Fry', 'Ginger Veggie Stir-Fry', 'White Chicken Enchilada Slow-Cooker Casserole', 'Old-Fashioned Swedish Glogg', 'Stir-Fry Chicken and Vegetables', 'Barbacoa-Style Shredded Beef', 'Simple Slow-Cooked Korean Beef Soft Tacos', 'Air-Fried Korean Chicken Wings', 'Kouign-Amann', 'Gnocchi with Sage-Butter Sauce', 'Giant Bacon-Wrapped Meatballs', 'Low-Carb Cauliflower Rice Sushi Rolls', "Frank's Favorite Slow-Cooker Thai Chicken", 'Two-Ingredient Naan', 'Velveting Chicken Breast, Chinese Restaurant-Style', 'Garlic-Herb Linguine', 'Korean-style Seaweed Soup', 'Ube-Macapuno Cake', 'Cuban-Style Yuca', 'Japanese-Style Cabbage Salad', "Jorge's Indian-Spice

['Pan-Fried Asparagus',
 'Super-Delicious Zuppa Toscana',
 'Indian-Style Chicken and Onions',
 'Chicken Stir-Fry',
 'Quick Beef Stir-Fry',
 'How to Make Coquilles Saint-Jacques',
 'Mexican-Style Chicken Taco Casserole',
 'Make-Ahead Vegetarian Moroccan Stew',
 'Japanese-Style Deep-Fried Shrimp',
 'Chicken and Broccoli Stir-Fry',
 'Broccoli and Chicken Stir-Fry',
 'Ginger Veggie Stir-Fry',
 'White Chicken Enchilada Slow-Cooker Casserole',
 'Old-Fashioned Swedish Glogg',
 'Stir-Fry Chicken and Vegetables',
 'Barbacoa-Style Shredded Beef',
 'Simple Slow-Cooked Korean Beef Soft Tacos',
 'Air-Fried Korean Chicken Wings',
 'Kouign-Amann',
 'Gnocchi with Sage-Butter Sauce',
 'Giant Bacon-Wrapped Meatballs',
 'Low-Carb Cauliflower Rice Sushi Rolls',
 "Frank's Favorite Slow-Cooker Thai Chicken",
 'Two-Ingredient Naan',
 'Velveting Chicken Breast, Chinese Restaurant-Style',
 'Garlic-Herb Linguine',
 'Korean-style Seaweed Soup',
 'Ube-Macapuno Cake',
 'Cuban-Style Yuca',
 'Japanese-Style Cabbage 

!, ? and . are found, which are odd for recipe names

In [74]:
punc_tokens = list_words_with_tag(tagged_recipe_names, ".")
punc_tokens

['!', '!', '!', '!', '.', '?']

The punctuations are mostly slang abbreviations and exclamations

In [75]:
for punc in list(set(punc_tokens)):
  print(find_value_with_char(p_recipes, 'name', punc))

['Sangria! Sangria!', 'Oatmeal Apple Crisp To Die For!', "Sushi House Salad Dressing, It's ORANGE!"]
["Our Top P.F. Chang's Copycat Recipes", "Perfect St. Patrick's Day Cake"]
['Real Canadian Butter Tarts, eh?']


Remove the exclamations

In [76]:
searchReplacePattList(p_recipes, r"! Sangria!", "")
searchReplacePattList(p_recipes, r" To Die For!", "")
searchReplacePattList(p_recipes, r", It's ORANGE!", "")
searchReplacePattList(p_recipes, r", eh\?", "")
searchReplacePattList(p_recipes, r"Our Top ", "")

tagged_recipe_names = retag(p_recipes, "name")

Fullstops that remain are part of recipe names

In [77]:
punc_tokens = list_words_with_tag(tagged_recipe_names, ".")
punc_tokens

['.']

In [78]:
for punc in list(set(punc_tokens)):
  print(find_value_with_char(p_recipes, 'name', punc))

["P.F. Chang's Copycat Recipes", "Perfect St. Patrick's Day Cake"]


Some 'that' can be found

In [79]:
wdt_tokens = list_words_with_tag(tagged_recipe_names, "WDT")
wdt_tokens

['That', 'That', 'That', 'That']

The 'that's are used to add details, but not actual recipe name

In [80]:
for wdt in list(set(wdt_tokens)):
  print(find_value_with_char(p_recipes, 'name', wdt))

['German Recipes That Are Comfort Food Favorites', 'Mexican-Inspired Ground Beef Casseroles That Deliver Big Flavor With Every Satisfying Bite', 'Tuscan Recipes That Reveal the Best of Italian Cooking', 'Easy Dinners That Start with Packaged Gnocchi', "That's-a Meatloaf", 'Favorite Recipes That Show Off Armenian Cuisine', 'Our Best Stir-Fry Recipes That Are Even Better Than Take-Out', 'Comforting Polish Cabbage Recipes That Are Family Favorites']


Remove

In [81]:
searchReplacePattList(p_recipes, r" That Are Comfort Food Favorites", "")
searchReplacePattList(p_recipes, r" That Deliver Big Flavor With Every Satisfying Bite", "")
searchReplacePattList(p_recipes, r" That Reveal the Best of Italian Cooking", "")
searchReplacePattList(p_recipes, r"That's-a ", "")
searchReplacePattList(p_recipes, r"Favorite Recipes That Show Off ", "")
searchReplacePattList(p_recipes, r" That Are Even Better Than Take-Out", "")
searchReplacePattList(p_recipes, r" That Are Family Favorites", "")

searchReplaceAddPattList(p_recipes, r" That Start with Packaged Gnocchi", "", "packaged gnocchi ", index=5)
tagged_recipe_names = retag(p_recipes, "name")

That removed

In [82]:
wdt_tokens = list_words_with_tag(tagged_recipe_names, "WDT")
wdt_tokens

[]

There's some 'how's

In [83]:
wrb_tokens = list_words_with_tag(tagged_recipe_names, "WRB")
wrb_tokens

['How', 'How', 'How', 'How', 'How', 'How', 'How']

In [84]:
for wrb in list(set(wrb_tokens)):
  print(find_value_with_char(p_recipes, 'name', wrb))

['How to Make Coquilles Saint-Jacques', 'How to Make Bolognese Sauce', 'How to Make Beef Satay', 'How to Make Peanut Dipping Sauce', 'How to Make Tres Leches Cake', 'How to Make Cassoulet', 'How to Make Turkey Manicotti']


Remove the 'how's and keep only the name

In [85]:
searchReplacePattList(p_recipes, r"How to Make ", "")

tagged_recipe_names = retag(p_recipes, "name")

In [86]:
list_words_with_tag(tagged_recipe_names, "WRB")

[]

There's some personal pronouns (possessive)

In [87]:
prp_tokens = list_words_with_tag(tagged_recipe_names, "PRP$")
prp_tokens

['Our',
 'My',
 'My',
 'My',
 'Our',
 'My',
 'Our',
 'My',
 'My',
 'My',
 'Our',
 'My',
 'My',
 'Your',
 'Our',
 'Our',
 'Our',
 'My',
 'its']

In [88]:
for prp in list(set(prp_tokens)):
  print(find_value_with_char(p_recipes, 'name', prp))

['Sweet Recipes to Complete Your Indian Dinner', 'Melt-in-Your-Mouth Beef Cacciatore', 'Polish Recipes to Make Your Grandmother Proud']
['Anzac Biscuits I', "Sadie's Buttermilk Biscuits", 'Canadian Tea Biscuits', 'Empire Biscuits', 'Pastitsio IV', 'Crescent Butter Biscuits', 'Pastitsio', "Nanny's Newfoundland Tea Biscuits", 'Meat in its Juices']
['My Own Famous Stuffed Grape Leaves', 'My Best Chicken Piccata', 'My Favorite Sesame Noodles', 'My Chicken Parmesan', "My Mom's Greek Lemon Rice", 'My Fly Stir-Fry', 'My Chicken Pho Recipe', 'My Tangy German Potato Salad', 'My Big Fat Greek Baked Beans', "My Grandmother's French Dressing"]
['Our Best Avgolemono Soup Recipes', 'Our Best Authentic Mexican Recipes', 'Our Best Empanada Recipes', 'Our Best Indian Recipes for Beginner Cooks', 'Our Best Stir-Fry Recipes', 'Our Favorite German Potato Recipes', 'Say Aloha to Our Best Hawaiian Recipes']


Most can be removed

In [89]:
searchReplacePattList(p_recipes, r"Our ", "")
searchReplacePattList(p_recipes, r"Your ", "")
searchReplacePattList(p_recipes, r"Melt-in-Your-Mouth ", "")
searchReplacePattList(p_recipes, r"My Own ", "")
searchReplacePattList(p_recipes, r"My Best ", "")
searchReplacePattList(p_recipes, r"My Favorite ", "")
searchReplacePattList(p_recipes, r"My Mom's ", "")
searchReplacePattList(p_recipes, r"My Grandmother's ", "")
searchReplacePattList(p_recipes, r"My ", "")

tagged_recipe_names = retag(p_recipes, "name")

The remaining ones are misclassified tags by nltk

In [90]:
prp_tokens = list_words_with_tag(tagged_recipe_names, "PRP$")
prp_tokens

['its']

In [91]:
for prp in list(set(prp_tokens)):
  print(find_value_with_char(p_recipes, 'name', prp))

['Anzac Biscuits I', "Sadie's Buttermilk Biscuits", 'Canadian Tea Biscuits', 'Empire Biscuits', 'Pastitsio IV', 'Crescent Butter Biscuits', 'Pastitsio', "Nanny's Newfoundland Tea Biscuits", 'Meat in its Juices']


There's some personal pronouns

In [92]:
prp_tokens = list_words_with_tag(tagged_recipe_names, "PRP")
prp_tokens

['I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'You',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'You',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'I',
 'We',
 'I',
 'I',
 'I']

In [93]:
for prp in list(set(prp_tokens)):
  print(find_value_with_char(p_recipes, 'name', prp))

['West African Peanut Stew', 'Real Welsh Rarebit', 'Fabulous Wet Burritos', 'Mexican Wedding Cookies', 'Italian Wedding Cookies III', 'Beef Wellington', 'West African-Style Peanut Stew with Chicken', 'Party Italian Wedding Soup', 'West Coast Trail Cookies', 'Italian Wedding Cake', 'Weeknight Mexican Chicken Lasagna', 'Comforting Russian Soups for Fall and Winter Weather', 'Comforting Russian Soups for Fall and Winter Weather', 'West Indian Curried Chicken', 'Welsh Cakes', "Mrs Welch's Butter Tarts", 'Italian Wedding Cake Martini', 'West African Lime Cake', 'Hawaiian Wedding Cake II', 'Weeknight Wonton Soup', 'Traditional Welsh Rarebit', 'West African Peanut Soup', "We Be Jammin' Jamaican Banana Bread", 'Italian Wedding Soup II', 'Chocolate Mexican Wedding Cookies', 'Traditional Welsh Broth']
['German Apple Cake I', 'Indian-Style Chicken and Onions', 'Tender Italian Baked Chicken', 'Mexican Rice II', 'Sweet and Sour Chicken I', 'Chicken Cordon Bleu II', 'Hot German Potato Salad III', 'S

Not much to remove, since most are misclassified POS

In [94]:
searchReplacePattList(p_recipes, r" You Can Make at Home", "")

tagged_recipe_names = retag(p_recipes, "name")

Some base verbs can be removed

In [95]:
vb_tokens = list_words_with_tag(tagged_recipe_names, "VB")
vb_tokens

['Take',
 'Make',
 'Take',
 'Kedgeree',
 'Swordfish',
 'Serve',
 'Make',
 'Celebrate',
 'Chicken',
 'Pata',
 'aux',
 'Poulet',
 'Papa',
 'Tarte',
 'Pollo',
 'Pancake',
 'Dutch',
 'Kransekake',
 'Dish',
 'Pannekaken']

In [96]:
for vb in list(set(vb_tokens)):
  print(find_value_with_char(p_recipes, 'name', vb))

['Poulet de Provencal', 'Tajine de Poulet aux Carottes et Patates Douces', 'Poulet a la Moutarde']
['German Pancakes II', "Mom's Buttermilk Pancakes", 'Japanese-Style Fluffy Pancakes', 'Arvidson Swedish Pancakes', 'Easy Swedish Pancakes', 'Easy Potato Pancakes', 'Finnish Pancakes', 'Coconut Pancake Syrup', 'Japanese Souffle Pancakes', "Barbarella's German Pancakes", 'Pan-Fried Chinese Pancakes', 'The Best Ricotta Pancakes', "Chef John's Chinese Scallion Pancakes", 'Traditional Swedish Pancakes', 'Chinese Scallion Pancakes', 'Authentic Potato Pancakes', 'German Pancake with Buttermilk Sauce', 'German Puff Pancakes', 'Dutch Pancakes', 'Russian Pancake', 'Russian Cheese Pancakes', 'Czech Savory Potato Pancakes', 'Japanese Pancake', 'Dutch Mini Pancakes', 'Moroccan Pancakes', 'Polish Apple Pancakes', 'Russian Pancakes', 'Indian Pancake', 'Baked Pancakes', 'Kimchi Pancake and Dipping Sauce', 'Breadfruit Pancakes', 'Ulu Pancakes', 'Norwegian Pancakes']
['Kransekake']
['Apple Tarte Tatin', 'P

Remove recipe names with instruction

In [97]:
searchReplacePattList(p_recipes, r" to Make at Home", "")
searchReplacePattList(p_recipes, r" to Make Grandmother Proud", "")
searchReplacePattList(p_recipes, r"Ways The World Makes Chicken And ", "")

searchReplaceAddPattList(p_recipes, r"Make Ahead ", "", "packaged gnocchi ")

tagged_recipe_names = retag(p_recipes, "name")

In [98]:
for vb in list(set(vb_tokens)):
  print(find_value_with_char(p_recipes, 'name', vb))

['Poulet de Provencal', 'Tajine de Poulet aux Carottes et Patates Douces', 'Poulet a la Moutarde']
['German Pancakes II', "Mom's Buttermilk Pancakes", 'Japanese-Style Fluffy Pancakes', 'Arvidson Swedish Pancakes', 'Easy Swedish Pancakes', 'Easy Potato Pancakes', 'Finnish Pancakes', 'Coconut Pancake Syrup', 'Japanese Souffle Pancakes', "Barbarella's German Pancakes", 'Pan-Fried Chinese Pancakes', 'The Best Ricotta Pancakes', "Chef John's Chinese Scallion Pancakes", 'Traditional Swedish Pancakes', 'Chinese Scallion Pancakes', 'Authentic Potato Pancakes', 'German Pancake with Buttermilk Sauce', 'German Puff Pancakes', 'Dutch Pancakes', 'Russian Pancake', 'Russian Cheese Pancakes', 'Czech Savory Potato Pancakes', 'Japanese Pancake', 'Dutch Mini Pancakes', 'Moroccan Pancakes', 'Polish Apple Pancakes', 'Russian Pancakes', 'Indian Pancake', 'Baked Pancakes', 'Kimchi Pancake and Dipping Sauce', 'Breadfruit Pancakes', 'Ulu Pancakes', 'Norwegian Pancakes']
['Kransekake']
['Apple Tarte Tatin', 'P

Words like best and most can be removed

In [99]:
rbs_tokens = list_words_with_tag(tagged_recipe_names, "RBS")
rbs_tokens

['Best', 'Most', 'Best']

In [100]:
for rbs in list(set(rbs_tokens)):
  print(find_value_with_char(p_recipes, 'name', rbs))

['The Most Iconic French Desserts', 'Alfredo Mostaccioli']
['Best Bobotie', 'Best Fried Walleye', 'Best Avgolemono Soup Recipes', "Chef John's Best German Recipes", 'The Best Thai Peanut Sauce', 'Best Ever Russian Beef Stroganoff', "Grandma's Best Ever Sour Cream Lasagna", 'Best Guacamole', 'Best Ever Slow Cooker Italian Beef Roast', 'The Best Pavlova', "Savannah's Best Marinated Portobello Mushrooms", 'Best Peanut Sauce', 'Best Ever Carne Asada Marinade', "Mom's Best Spaghetti Sauce", 'The Best Korean Chicken Recipes', 'Best Instant Pot Chicken Cacciatore', 'Best Ziti Ever', 'Best Authentic Mexican Recipes', 'Best Empanada Recipes', 'Best Ziti Ever with Sausage', 'Best Chicken Parmesan', 'Best Pernil Ever', 'The Best Ricotta Pancakes', 'Best Indian Recipes for Beginner Cooks', 'Best Hot Sauce', 'Best Ever Irish Soda Bread', 'Best Hummus', 'The Best Thai Tom Kha Soup Recipe', 'Best French Macarons', 'Best Falafel', "Gordo's Best of the Best Lasagna", 'The Best Classic Beef Stroganoff',

In [101]:
searchReplacePattList(p_recipes, r"Best Ever ", "")
searchReplacePattList(p_recipes, r"Best ", "")
searchReplacePattList(p_recipes, r" ever", "")
searchReplacePattList(p_recipes, r"The Most Iconic ", "")

tagged_recipe_names = retag(p_recipes, "name")

In [102]:
rbs_tokens = list_words_with_tag(tagged_recipe_names, "RBS")
rbs_tokens

[]

Adverbs with -ly can be removed, except for the misclassified ones mainly caused by foreign recipe names

In [103]:
rb_tokens = list_words_with_tag(tagged_recipe_names, "RB")
rb_tokens

['Absolutely',
 'Aebleskiver',
 'Incredibly',
 'Perfectly',
 'Absolutely',
 'Oven',
 'Perfectly',
 'Absolutely',
 'Heavenly',
 'Asiago',
 'Philly',
 'Family',
 'Deadly',
 'Yet',
 'Absolutely',
 'Ever',
 'Tourtiere',
 'Tourtiere',
 'Soon',
 'Here',
 'Long',
 'Tourtiere',
 'Tourtiere']

In [104]:
for rb in list(set(rb_tokens)):
  print(find_value_with_char(p_recipes, 'name', rb))

['Heavenly Raspberry Dessert']
['Soon Du Bu Jigae']
['Deadly Delicious Lasagna']
['Ziti Ever', 'Ziti Ever with Sausage', 'Pernil Ever', 'Date Squares Ever']
['No Tomato Paste Here']
['Absolutely Fabulous Greek or House Dressing', 'Absolutely Amazing Ahi', 'Absolutely Delicious Stuffed Calamari', 'Absolutely Perfect Palak Paneer']
['French Canadian Tourtiere', 'Traditional French Canadian Tourtiere', 'Reveillon Tourtiere', 'Tourtiere Spices', 'Tourtiere', 'Tourtiere', 'Tourtiere', 'Tourtiere']
['Yet Turkey Chili']
['Chicken Long Rice Soup', 'Vietnamese Chicken and Long-Grain Rice Congee', 'Long Soup', 'Philippine Longanisa de Eugenio', 'Long Drink']
['Willard Family German Chocolate Cake', 'Mexican-Inspired Casseroles for Family-Pleasing Dinners', 'Chinese Happy Family', 'Family Sicilian Sauce and Meatballs', 'Greek Ground Beef Recipes Sure To Become Family Favorites']
['Philly Cheesesteak Quesadillas']
['Asiago Sun-Dried Tomato Pasta', 'Chicken and Bowtie Pasta with Asiago Cream Sauce'

In [105]:
searchReplacePattList(p_recipes, r"Deadly Delicious ", "")
searchReplacePattList(p_recipes, r"Heavenly ", "")
searchReplacePattList(p_recipes, r"Perfectly ", "")
searchReplacePattList(p_recipes, r"Absolutely Fabulous ", "")
searchReplacePattList(p_recipes, r"Absolutely Amazing  ", "")
searchReplacePattList(p_recipes, r"Absolutely Delicious ", "")
searchReplacePattList(p_recipes, r"Absolutely Perfect ", "")

searchReplaceAddPattList(p_recipes, r"No Tomato Paste Here", "", "tomato paste")

tagged_recipe_names = retag(p_recipes, "name")

In [106]:
rb_tokens = list_words_with_tag(tagged_recipe_names, "RB")
rb_tokens

['Aebleskiver',
 'Incredibly',
 'Absolutely',
 'Oven',
 'Asiago',
 'Philly',
 'Family',
 'Yet',
 'Ever',
 'Tourtiere',
 'Tourtiere',
 'Soon',
 'Long',
 'Tourtiere',
 'Tourtiere']

In [107]:
for rb in list(set(rb_tokens)):
  print(find_value_with_char(p_recipes, 'name', rb))

['Soon Du Bu Jigae']
['Ziti Ever', 'Ziti Ever with Sausage', 'Pernil Ever', 'Date Squares Ever']
['Absolutely Amazing Ahi']
['French Canadian Tourtiere', 'Traditional French Canadian Tourtiere', 'Reveillon Tourtiere', 'Tourtiere Spices', 'Tourtiere', 'Tourtiere', 'Tourtiere', 'Tourtiere']
['Yet Turkey Chili']
['Chicken Long Rice Soup', 'Vietnamese Chicken and Long-Grain Rice Congee', 'Long Soup', 'Philippine Longanisa de Eugenio', 'Long Drink']
['Willard Family German Chocolate Cake', 'Mexican-Inspired Casseroles for Family-Pleasing Dinners', 'Chinese Happy Family', 'Family Sicilian Sauce and Meatballs', 'Greek Ground Beef Recipes Sure To Become Family Favorites']
['Philly Cheesesteak Quesadillas']
['Asiago Sun-Dried Tomato Pasta', 'Chicken and Bowtie Pasta with Asiago Cream Sauce']
['Incredibly Delicious Italian Cream Cake']
['Air Fryer Oven Taco Shells', 'Oven Kalua Pork', 'Oven-Roasted Chicken Thighs', 'Oven Baked Chicken Teriyaki', 'Oven-Baked Chicken Fajitas', 'Oven-Baked Teriyaki

In [108]:
all_name_tags = []

for POS in ALL_POS:
  new_dic = {POS: list_words_with_tag(tagged_recipe_names, POS)}
  all_name_tags.append(new_dic)

get_tag_number(all_name_tags)

[{'$': 1},
 {"''": 7},
 {'(': 0},
 {')': 0},
 {',': 62},
 {'--': 0},
 {'.': 1},
 {':': 1},
 {'CC': 506},
 {'CD': 23},
 {'DT': 96},
 {'EX': 0},
 {'FW': 67},
 {'IN': 464},
 {'JJ': 1897},
 {'JJR': 2},
 {'JJS': 1},
 {'LS': 0},
 {'MD': 0},
 {'NN': 659},
 {'NNP': 12712},
 {'NNPS': 36},
 {'NNS': 389},
 {'PDT': 0},
 {'POS': 346},
 {'PRP': 69},
 {'PRP$': 1},
 {'RB': 15},
 {'RBR': 0},
 {'RBS': 0},
 {'RP': 2},
 {'SYM': 0},
 {'TO': 10},
 {'UH': 0},
 {'VB': 18},
 {'VBD': 39},
 {'VBG': 59},
 {'VBN': 139},
 {'VBP': 9},
 {'VBZ': 29},
 {'WDT': 0},
 {'WP': 0},
 {'WP$': 0},
 {'WRB': 0},
 {'``': 6}]

## Examining other POS in names

So as to get an idea of POS tagging in the later section

In [109]:
vbz_tokens = list_words_with_tag(tagged_recipe_names, "VBZ")
vbz_tokens

['Ties',
 'el',
 'Leaves',
 'al',
 'al',
 'Leaves',
 'au',
 'di',
 'Ways',
 'de',
 'al',
 'Breasts',
 'en',
 'e',
 'al',
 'Leaves',
 'Breasts',
 'Squares',
 'al',
 'di',
 'aux',
 'di',
 'Leaves',
 'au',
 'di',
 'di',
 'al',
 'en',
 'en']

In [110]:
vbp_tokens = list_words_with_tag(tagged_recipe_names, "VBP")
vbp_tokens

['Rellenos',
 'Greek',
 'Divine',
 'Wat',
 'Be',
 'en',
 'Mexicanos',
 'Rellenos',
 'en']

In [111]:
vbg_tokens = list_words_with_tag(tagged_recipe_names, "VBG")
vbg_tokens

['Seasoning',
 'Dressing',
 'Pudding',
 'Using',
 'Canning',
 'Pudding',
 'Velveting',
 'Pudding',
 'Pudding',
 'Pudding',
 'Seasoning',
 'Comforting',
 'Seasoning',
 'Pouding',
 'Pudding',
 'Amazing',
 'Pudding',
 'Refreshing',
 'Pudding',
 'Seasoning',
 'Dressing',
 'Comforting',
 'Pudding',
 'Making',
 'Comforting',
 'Pudding',
 'Dumpling',
 'Dipping',
 'Refreshing',
 'Pudding',
 'Seasoning',
 'Seasoning',
 'Filling',
 'Thanksgiving',
 'Stuffing',
 'Pudding',
 'Pudding',
 'Refreshing',
 'Pudding',
 'Sizzling',
 'Topping',
 'Amazing',
 'Refreshing',
 'Comforting',
 'Dressing',
 'Using',
 'Seasoning',
 'Refreshing',
 'Pudding',
 'Pudding',
 'Pudding',
 'Ping',
 'Pudding',
 'Pudding',
 'Pudding',
 'Pudding',
 'Pudding',
 'Dumpling',
 'Pudding']

In [112]:
vbd_tokens = list_words_with_tag(tagged_recipe_names, "VBD")
vbd_tokens

['Braised',
 'Corned',
 'Corned',
 'Pickled',
 'Shredded',
 'Braised',
 'Fashioned',
 'Filled',
 'Corned',
 'Fashioned',
 'Pickled',
 'Braised',
 'Breaded',
 'Fried',
 'Grilled',
 'Braised',
 'Pickled',
 'Braised',
 'Braised',
 'Planked',
 'Corned',
 'Corned',
 'Braised',
 'Infused',
 'Corned',
 'Obsessed',
 'Pickled',
 'Pulled',
 'Roasted',
 'Broiled',
 'Pickled',
 'Roasted',
 'di',
 'Braised',
 'Braised',
 'Pickled',
 'Mulled',
 'Pickled',
 'Boiled']

In [113]:
rp_tokens = set(list(list_words_with_tag(tagged_recipe_names, "RP")))
rp_tokens

{'Hanout', 'Over'}

In [114]:
comma_tokens = set(list(list_words_with_tag(tagged_recipe_names, ",")))
comma_tokens

{','}

In [115]:
for c in list(set(comma_tokens)):
  print(find_value_with_char(p_recipes, 'name', c))

['Bow Ties with Sausage, Tomatoes and Cream', 'Velveting Chicken Breast, Chinese Restaurant-Style', 'Chicken, Spinach, and Cheese Pasta Bake', 'Super-Simple, Super-Spicy Mongolian Beef', 'Creamy Potato, Carrot, and Leek Soup', 'Beef, Mushroom and Guinness Pie', 'Easy, Chewy Flourless Peanut Butter Cookies', 'Filipino Steamed Rice, Cebu Style', 'Orange, Honey and Soy Chicken', 'Chicken Francese, Italian-Style', 'Duck with Honey, Soy, and Ginger', 'Steak, Onion, and Pepper Fajitas', 'Indian Carrots, Peas and Potatoes', 'Simple, Baked Finnan Haddie', 'Indian-Style Rice with Cashews, Raisins and Turmeric', 'Serbian Ground Beef, Veggie, and Potato Bake', 'Fried Rice with Ginger, Hoisin, and Sesame', 'Chard Lentil Soup, Lebanese-Style', 'Easy, Cheesy Tortellini Bake', 'Curried Cashew, Pear, and Grape Salad', 'Pork, Sauerkraut and Dumplings', 'Spinach, Feta, and Pine Nut Ravioli Filling', 'Bell Pepper, Tomato, and Potato Indian Curry', 'Mascarpone Pasta with Chicken, Bacon and Spinach', 'Past

In [116]:
jjr_tokens = list_words_with_tag(tagged_recipe_names, "JJR")
jjr_tokens

['Healthier', 'Lighter']

In [117]:
for j in list(set(jjr_tokens)):
  print(find_value_with_char(p_recipes, 'name', j))

['Lighter Mexican Meatloaf']
['Healthier Bang Bang Chicken in the Air Fryer', 'Healthier Swedish Meatballs', 'Healthier Pan-Fried Honey-Sesame Chicken', 'Healthier Chicken Enchiladas I', 'Healthier Honey-Sesame Chicken']


In [118]:
jjs_tokens = list_words_with_tag(tagged_recipe_names, "JJS")
jjs_tokens

['Oktoberfest']

In [119]:
for j in list(set(jjs_tokens)):
  print(find_value_with_char(p_recipes, 'name', j))

['Oktoberfest Chicken and Red Cabbage', 'Oktoberfest Potato Salad', 'Oktoberfest Chili', 'The Recipes to Celebrate Oktoberfest']


In [120]:
dt_tokens = list_words_with_tag(tagged_recipe_names, "DT")
dt_tokens

['a',
 'The',
 'No',
 'The',
 'the',
 'a',
 'the',
 'The',
 'the',
 'the',
 'a',
 'the',
 'the',
 'A',
 'a',
 'The',
 'a',
 'the',
 'the',
 'a',
 'a',
 'A',
 'The',
 'A',
 'the',
 'a',
 'a',
 'The',
 'a',
 'a',
 'The',
 'the',
 'The',
 'This',
 'The',
 'a',
 'a',
 'the',
 'The',
 'a',
 'a',
 'The',
 'a',
 'A',
 'the',
 'the',
 'No',
 'the',
 'a',
 'a',
 'The',
 'The',
 'a',
 'The',
 'the',
 'the',
 'The',
 'the',
 'a',
 'a',
 'The',
 'a',
 'the',
 'a',
 'The',
 'All',
 'The',
 'a',
 'the',
 'the',
 'the',
 'The',
 'The',
 'A',
 'a',
 'the',
 'a',
 'the',
 'The',
 'the',
 'a',
 'a',
 'a',
 'the',
 'a',
 'a',
 'the',
 'a',
 'An',
 'the',
 'a',
 'a',
 'a',
 'No',
 'a',
 'No']

In [121]:
for dt in list(set(dt_tokens)):
  print(find_value_with_char(p_recipes, 'name', dt))

['Authentic German Potato Salad', 'Easy Authentic Mexican Rice', "Authentic Russian Salad 'Olivye'", 'Authentic Mexican Tortillas', 'The Original Donair From the East Coast of Canada', 'Authentic Paella Valenciana', 'Authentic Pad Thai', 'Lumpia in the Air Fryer', 'Refried Beans Without the Refry', 'Beef Stifado in the Slow Cooker', 'Authentic Mexican Breakfast Tacos', 'Authentic Enchiladas Verdes', 'Toad in the Hole', 'Authentic Chicken Tikka Masala', 'Authentic Mexican Enchiladas', 'Authentic Miso Soup', 'Healthier Bang Bang Chicken in the Air Fryer', 'Authentic Mexican Picadillo', 'Margaritas on the Rocks', 'Authentic French Meringues', 'Authentic Hungarian Goulash', 'Eggplant Parmesan For the Slow Cooker', 'Authentic Greek Moussaka', 'Authentic Chicken Madras', 'Authentic Thai Coconut Soup', 'Mongolian Beef from the Slow Cooker', 'Authentic and Easy Shrimp Curry', 'Authentic Patatas Bravas', 'Cuban Black Bean Soup in the Slow Cooker', 'Authentic Bangladeshi Beef Curry', 'Dave Matth

In [122]:
to_tokens = list_words_with_tag(tagged_recipe_names, "TO")
to_tokens

['to', 'na', 'to', 'to', 'to', 'To', 'to', 'na', 'na', 'na']

In [123]:
for to in list(set(to_tokens)):
  print(find_value_with_char(p_recipes, 'name', to))

['Super-Delicious Zuppa Toscana', 'Spinach Tomato Tortellini', 'Tomatillo Salsa Verde', 'French Canadian Tourtiere', 'Bow Ties with Sausage, Tomatoes and Cream', 'Authentic Mexican Tortillas', 'Tomato Basil Salmon', 'Sticky Toffee Pudding Cake', 'Italian Stewed Tomatoes', 'Cinder Toffee', 'Toad in the Hole', 'Baked Corn Tortilla Strips for Mexican Soups', "Jorge's Indian-Spiced Tomato Lentil Soup", 'Corn Tortillas', "Daddy's Shrimp Toast", 'Tofu Salad', 'Tomatillo Soup', 'Air Fryer Tonkatsu', 'Double Tomato Bruschetta', 'Dobos Torte', 'Korean Street Toast', 'Tonkatsu  or  Katsu Sauce', 'Tomato Bredie', 'Sweet Corn Tomalito', 'Gnocchi with Tomato Sauce and Mozzarella', 'Chicken Tortilla Soup II', 'Tofu Parmigiana', 'Indian Tomato Chicken', "Chef John's Shrimp Toast", 'Zuppa Toscana', 'Tomato and Mozzarella Bites', 'Cod with Italian Crumb Topping', 'Spinach and Sun-Dried Tomato Pasta', 'Roasted Tomatillo and Garlic Salsa', "Chef John's Spaghetti al Tonno", 'Chicken with Artichokes and Su

Chicken is considered dollar?

In [124]:
dol_tokens = list_words_with_tag(tagged_recipe_names, "$")
dol_tokens

['Chicken']

It's a tagging error, so this can be ignored

In [125]:
for dol in dol_tokens:
  print(find_value_with_char(p_recipes, 'name', dol))

['Spicy Korean Fried Chicken with Gochujang Sauce', 'Greek Lemon Chicken and Potato Bake', "Chef John's Chicken Kiev", 'Indian-Style Chicken and Onions', 'Tender Italian Baked Chicken', 'Chicken Katsu', 'Chicken Stir-Fry', 'Mexican-Style Chicken Taco Casserole', 'Curry Stand Chicken Tikka Masala Sauce', 'Chicken Enchiladas V', 'Jamaican Style Curry Chicken', 'Salsa Chicken', 'Grilled Asian Chicken', 'Chicken Tikka Masala', 'Sweet and Sour Chicken I', 'Chicken Cordon Bleu II', 'Turkish Chicken Kebabs', 'Chicken Souvlaki with Tzatziki Sauce', 'Greek Lemon Chicken Soup', 'Chicken Cacciatore in a Slow Cooker', 'Chicken and Broccoli Stir-Fry', 'Creamy Chicken Lasagna', 'Broccoli and Chicken Stir-Fry', 'Chicken Parmigiana', 'Shoyu Chicken', 'Skillet Chicken Bulgogi', 'Easy Slow Cooker Chicken Tetrazzini', 'Sheet Pan Chicken Fajitas', 'White Chicken Enchilada Slow-Cooker Casserole', 'Chicken Enchiladas II', 'Chinese Chicken Fried Rice II', 'Chicken Milanese', 'Chicken Massaman Curry', "Chef J

There are some quotation marks

In [126]:
quote_tokens = list_words_with_tag(tagged_recipe_names, "''")
quote_tokens

["''", "''", "'", "''", "''", "''", "''"]

Quotation marks are caused by possessive -'s

In [127]:
for quote in quote_tokens:
  print(find_value_with_char(p_recipes, 'name', quote))

[]
[]
["Chef John's Chicken Kiev", "Angela's Awesome Enchiladas", "Randy's Slow Cooker Ravioli Lasagna", "'Chinese Buffet' Green Beans", "Chef John's Beef Rouladen", "Corned Beef and Cabbage Shepherd's Pie", "Gramma's Date Squares", "Authentic Russian Salad 'Olivye'", "Chef John's Meatless Meatballs", "Chef John's Beef Goulash", "Grandma's Noodles II", "Chef John's Clotted Cream", "Newfoundland Jigg's Dinner", "Chef John's Coq Au Vin", "Chef John's Loco Moco", "Dash's Donair", "Turkey Shepherd's Pie", "Papa Drexler's Bavarian Pretzels", "Bob's Stuffed Banana Peppers", "Chef John's Swedish Meatballs", "Chef John's German Recipes", "Chef John's Chicken Tikka Masala", "Maria's Mexican Rice", "Mom's Buttermilk Pancakes", "Geneva's Ultimate Hungarian Mushroom Soup", "Charley's Slow Cooker Mexican Style Meat", "Ingrid's Rouladen", "Chef John's Lasagna", "Lola's Horchata", "Chef John's Italian Sausage Chili", "Kid's Favorite Pizza Casserole", "Traci's Adobo Seasoning", "Frank's Favorite Slow-

 For now, leave the preprocessing of the recipe names first.

## Preprocessing of ingredients

Ingriendts are a lot more straightforward to preprocess, since recipe names have to be attractive to encourage user to click in

In [128]:
p_ingredients = []

for recipe in p_recipes:
    p_ingredients = p_ingredients + recipe['ingredients']
    
p_ingredients = list(set(p_ingredients))
len(p_ingredients)

19342

In [129]:
p_ingredients[:10]

['1 tablespoon soy sauce, or more to taste ',
 '1 large onion, cut into rings ',
 '½ teaspoon onion salt ',
 '2 skinless, bone-in chicken breast halves - cut in half ',
 '½ red onion, finely diced ',
 '2 zucchini, halved lengthwise ',
 '3 large apples - peeled, cored, and sliced ',
 '8 kaffir lime leaves, thinly sliced ',
 '2 fresh jalapeno peppers - seeded, sliced, and divided ',
 '2 lime, juiced ']

In [130]:
for i, ingre in enumerate(p_ingredients):
    p_ingredients[i] = p_ingredients[i].strip()

p_ingredients[:10]

['1 tablespoon soy sauce, or more to taste',
 '1 large onion, cut into rings',
 '½ teaspoon onion salt',
 '2 skinless, bone-in chicken breast halves - cut in half',
 '½ red onion, finely diced',
 '2 zucchini, halved lengthwise',
 '3 large apples - peeled, cored, and sliced',
 '8 kaffir lime leaves, thinly sliced',
 '2 fresh jalapeno peppers - seeded, sliced, and divided',
 '2 lime, juiced']

A reusable function that re-tags ingredients

In [131]:
def retag_ingredients():
    tagged_recipe_ingredients = []

    for ingredient in p_ingredients:
        tagged_recipe_ingredients.append(tag_pos(ingredient))
        
    return tagged_recipe_ingredients

tagged_recipe_ingredients = retag_ingredients()
tagged_recipe_ingredients[:10]

[[('1', 'CD'),
  ('tablespoon', 'NN'),
  ('soy', 'NN'),
  ('sauce', 'NN'),
  (',', ','),
  ('or', 'CC'),
  ('more', 'JJR'),
  ('to', 'TO'),
  ('taste', 'VB')],
 [('1', 'CD'),
  ('large', 'JJ'),
  ('onion', 'NN'),
  (',', ','),
  ('cut', 'VBN'),
  ('into', 'IN'),
  ('rings', 'NNS')],
 [('½', 'JJ'), ('teaspoon', 'NN'), ('onion', 'NN'), ('salt', 'NN')],
 [('2', 'CD'),
  ('skinless', 'NN'),
  (',', ','),
  ('bone-in', 'JJ'),
  ('chicken', 'NN'),
  ('breast', 'NN'),
  ('halves', 'VBZ'),
  ('-', ':'),
  ('cut', 'NN'),
  ('in', 'IN'),
  ('half', 'NN')],
 [('½', 'RB'),
  ('red', 'JJ'),
  ('onion', 'NN'),
  (',', ','),
  ('finely', 'RB'),
  ('diced', 'VBD')],
 [('2', 'CD'),
  ('zucchini', 'NN'),
  (',', ','),
  ('halved', 'VBD'),
  ('lengthwise', 'NN')],
 [('3', 'CD'),
  ('large', 'JJ'),
  ('apples', 'NNS'),
  ('-', ':'),
  ('peeled', 'VBN'),
  (',', ','),
  ('cored', 'VBN'),
  (',', ','),
  ('and', 'CC'),
  ('sliced', 'VBD')],
 [('8', 'CD'),
  ('kaffir', 'NN'),
  ('lime', 'NN'),
  ('leaves', '

Numbers need a placeholder

In [132]:
list_words_with_tag(tagged_recipe_ingredients, "CD")

['1',
 '1',
 '2',
 '2',
 '3',
 '8',
 '2',
 '2',
 '1',
 '3',
 '3',
 '1',
 '1',
 '1',
 '2',
 '1',
 '1',
 '1',
 '3',
 '3',
 '4',
 '3',
 '1',
 '1',
 '2',
 '1',
 '4',
 '4',
 '3',
 '8',
 '1',
 '15.25',
 '1',
 '1',
 '1',
 '1',
 '1',
 '2',
 '2',
 '1',
 '6',
 '2',
 '1',
 '2',
 '15.5',
 '2',
 '5',
 '1',
 '2',
 '6',
 '6',
 '1',
 '1',
 '2',
 '1',
 '16',
 '1',
 '1',
 '4',
 '12',
 '1',
 '1',
 '14',
 '4',
 '4',
 '2',
 '1',
 '5',
 '1/2',
 '1',
 '3',
 '1',
 '1',
 '1',
 '18',
 '2',
 '1',
 '1',
 '3',
 '1',
 '8',
 '4',
 '2',
 '3',
 '1',
 '16',
 '1',
 '1',
 '1',
 '4',
 '1',
 '1',
 '1',
 '1',
 '16',
 '6',
 '3',
 '1',
 '2',
 '1',
 '1',
 '1',
 '1/2',
 '4',
 '4',
 '1',
 '3',
 '3',
 '1',
 '17',
 '1',
 '8',
 '8',
 '1',
 '4',
 '2',
 '2',
 '1',
 '1',
 '4',
 '4',
 '1',
 '2',
 '1',
 '1',
 '1',
 '8',
 '3',
 '6',
 '1/2',
 '2',
 '3',
 '1',
 '10.5',
 '3',
 '5',
 '9',
 '1',
 '16',
 '1',
 '1',
 '2',
 '1',
 '3',
 '6',
 '2',
 '1',
 '2',
 '2',
 '1',
 '1',
 '1',
 '8',
 '1',
 '16',
 '1',
 '1',
 '1',
 '1',
 '2',
 '2',
 '2',
 '1

NLTK assumes fractions as JJ (adjectives)

In [133]:
list_words_with_tag(tagged_recipe_ingredients, "JJ")

['large',
 '½',
 'bone-in',
 'red',
 'large',
 'fresh',
 '¼',
 'green',
 'bone-in',
 '½',
 'black',
 'red',
 '½',
 'boneless',
 '1/2-inch',
 'Japanese',
 'sweet',
 'whole',
 '¼',
 'red',
 'bite-size',
 'beef',
 '1-inch',
 '½',
 'large',
 'sweet',
 'thin',
 '½',
 'all-purpose',
 'white',
 '¼',
 'torn',
 'chopped',
 '¼',
 'dry',
 'fresh',
 '¼',
 'Japanese',
 'sweet',
 'large',
 '¾',
 'black',
 'raw',
 '½',
 'fresh',
 'red',
 '½',
 'ripe',
 'white',
 'reduced-fat',
 '1/2-inch',
 '¼',
 'tart',
 'pork',
 '¼',
 'pita',
 '½',
 'frozen',
 'fresh',
 'fresh',
 '⅔',
 '½',
 'black',
 '½',
 'blue',
 'whole',
 'large',
 'garlic',
 '½',
 'sweet',
 'frozen',
 'pinch',
 'avocado',
 'green',
 'red',
 '¼',
 'bow',
 '⅔',
 'cold',
 'Swiss',
 '½',
 'skinless',
 'boneless',
 'fresh',
 'fresh',
 '¼',
 'salsa',
 'fresh',
 'fresh',
 'Irish',
 'such',
 'angel',
 'bunch',
 'olive',
 'instant',
 '¼',
 'dash',
 'hot',
 'green',
 '¼',
 '½',
 'white',
 '½',
 'olive',
 'black',
 'garlic',
 '⅛',
 'red',
 'coarse',
 'sa

Create a function that converts any fraction in a text to integer

In [134]:
import unicodedata
from decimal import Decimal

def fraction_to_int(text):
  for i, char in enumerate(text):
    try:
      # unicode.numeric converts fractions such as ½ to decimal place, 0.25
      # remove trailing decimals, otherwise keep decimals
      text = text[:i] + str(Decimal(unicodedata.numeric(char)).normalize()) + text[i + 1:]
    except Exception as e:
      pass
  # Because number + fraction, such as 1 1/4 may be converted to 1 0, so use re.sub to remove
  text = re.sub("([0-9]+ [0])+", "4", text)
  return text

for i, ingre in enumerate(p_ingredients):
    p_ingredients[i] = fraction_to_int(p_ingredients[i])

tagged_recipe_ingredients = retag_ingredients()
p_ingredients[:20]

['1 tablespoon soy sauce, or more to taste',
 '1 large onion, cut into rings',
 '0.5 teaspoon onion salt',
 '2 skinless, bone-in chicken breast halves - cut in half',
 '0.5 red onion, finely diced',
 '2 zucchini, halved lengthwise',
 '3 large apples - peeled, cored, and sliced',
 '8 kaffir lime leaves, thinly sliced',
 '2 fresh jalapeno peppers - seeded, sliced, and divided',
 '2 lime, juiced',
 '0.25 cup ice, or as needed',
 '1 cup chunky peanut butter',
 '3 tablespoons chopped green onion',
 '3 pounds bone-in chicken pieces',
 '4.5 teaspoons black pepper',
 '1 gallon lard for frying (manteca)',
 '1 red pepper, seeded and thinly sliced',
 '2 tablespoons thinly sliced lemongrass',
 '4.5 pounds boneless pork chops',
 '1 pound broccoli rabe, cut into 1 1/2-inch lengths']

By converting fractions into integers, NLTK stops seeing them as adjectives (JJ) and instead, they are considered numbers (CD)

In [135]:
list_words_with_tag(tagged_recipe_ingredients, "JJ")

['large',
 'bone-in',
 'red',
 'large',
 'fresh',
 'green',
 'bone-in',
 'black',
 'red',
 'boneless',
 '1/2-inch',
 'Japanese',
 'sweet',
 'whole',
 'red',
 'bite-size',
 'beef',
 '1-inch',
 'large',
 'sweet',
 'thin',
 'all-purpose',
 'white',
 'torn',
 'chopped',
 'dry',
 'fresh',
 'Japanese',
 'sweet',
 'large',
 'black',
 'raw',
 'fresh',
 'red',
 'soy',
 'ripe',
 'white',
 'reduced-fat',
 '1/2-inch',
 'tart',
 'pork',
 'pita',
 'frozen',
 'fresh',
 'fresh',
 'black',
 'blue',
 'whole',
 'large',
 'garlic',
 'sweet',
 'frozen',
 'pinch',
 'avocado',
 'green',
 'red',
 'bow',
 'cold',
 'Swiss',
 'skinless',
 'boneless',
 'fresh',
 'fresh',
 'cup',
 'salsa',
 'fresh',
 'fresh',
 'Irish',
 'such',
 'angel',
 'bunch',
 'olive',
 'instant',
 'cup',
 'dash',
 'hot',
 'green',
 'cup',
 'white',
 'olive',
 'black',
 'garlic',
 'red',
 'coarse',
 'sangrita',
 'Mexican-style',
 'mary',
 'unsalted',
 'kielbasa',
 'maple',
 'cod',
 'all-purpose',
 'small',
 'light',
 'stale',
 'Italian',
 'mi

Replace all the numbers with placeholder of 4

In [136]:
for i, ingre in enumerate(p_ingredients):
    p_ingredients[i] = searchReplacePatt(p_ingredients[i], NUMPATTERN, "4")
    
tagged_recipe_ingredients = retag_ingredients()
p_ingredients[:20]

['4 tablespoon soy sauce, or more to taste',
 '4 large onion, cut into rings',
 '4.4 teaspoon onion salt',
 '4 skinless, bone-in chicken breast halves - cut in half',
 '4.4 red onion, finely diced',
 '4 zucchini, halved lengthwise',
 '4 large apples - peeled, cored, and sliced',
 '4 kaffir lime leaves, thinly sliced',
 '4 fresh jalapeno peppers - seeded, sliced, and divided',
 '4 lime, juiced',
 '4.4 cup ice, or as needed',
 '4 cup chunky peanut butter',
 '4 tablespoons chopped green onion',
 '4 pounds bone-in chicken pieces',
 '4.4 teaspoons black pepper',
 '4 gallon lard for frying (manteca)',
 '4 red pepper, seeded and thinly sliced',
 '4 tablespoons thinly sliced lemongrass',
 '4.4 pounds boneless pork chops',
 '4 pound broccoli rabe, cut into 4 4/4-inch lengths']

In [137]:
new_cd_tokens = list(set(list_words_with_tag(tagged_recipe_ingredients, "CD")))
new_cd_tokens.remove('4')
new_cd_tokens

['4.4',
 'beef4',
 '4up',
 'mascarpone',
 'four',
 'mozzarella',
 'fontina',
 'provolone',
 'marinara',
 'zapallo',
 'seven',
 'yellow',
 '4.4.4',
 'mostaccioli',
 'kalamata',
 '4/4',
 '4p',
 "za'atar",
 'one',
 '4/4x4/4',
 'yum',
 'xanthan',
 'zucchini',
 'bleu',
 'ziti',
 'millet']

Define a function that returns ingredient with specific substring

In [138]:
def find_ingre_with_substring(sub):
    ingres = []
    for ingre in p_ingredients:
        matches = searchWordsPatt(ingre, sub)
        if len(matches)  > 0:
            ingres.append(ingre)
    return ingres

find_ingre_with_substring('4/4')

['4 pound broccoli rabe, cut into 4 4/4-inch lengths',
 '4 russet potato, cut into 4/4-inch cubes',
 '4 (4 4/4 pound) corned beef brisket',
 '4 (4 4/4 inch) piece chopped fresh ginger, or to taste',
 '4 boneless pork loin chops, 4/4 inch thick',
 '4 green plantains, peeled and cut into 4 4/4-inch chunks',
 '4 (4 ounce) package extra-firm tofu, cut into 4/4-inch cubes',
 '4 zucchini, sliced 4/4-inch thick',
 '4/4 teaspoon liquid rennet',
 '4 pounds skinless, boneless chicken breast, cut into 4 4/4-inch pieces',
 '4 pound paneer, cut into 4/4-inch cubes',
 '4 slices bacon, sliced crosswise into 4/4-inch pieces',
 '4 eggplant, peeled into long strips 4/4-inch thick',
 '4 (4/4 inch x 4 inch) strip lime peel',
 '4 bone-in pork loin chops, 4 4/4-inch thick',
 '4 pound skinless, boneless chicken thighs, cut into 4 4/4-inch pieces',
 '4 (4/4 inch thick) pork loin chops',
 '4 orange bell pepper, cut into 4/4-inch dice',
 '4 (4 4/4 inch) piece fresh ginger, peeled',
 '4.4 yellow onion, cut int4 

Define a function that searches and replace specific regex pattern from ingredients

In [139]:
def search_edit_ingredient(regex, new_val):
    for i, ingre in enumerate(p_ingredients):
        p_ingredients[i] = searchReplacePatt(p_ingredients[i], regex, new_val)
        
search_edit_ingredient(r"4/4", "4.4")

find_ingre_with_substring('4/4')

[]

Remove copyright symbols

In [140]:
search_edit_ingredient(r"®", "")

find_ingre_with_substring('®')

[]

Remove 4p

In [141]:
find_ingre_with_substring('4p')

['4.4 4p warm milk (4 degrees F/4 degrees C)',
 '4.4 c4p4.4-inch long vermicelli']

In [142]:
search_edit_ingredient(r"c4p", "")
search_edit_ingredient(r"4p", "")

find_ingre_with_substring('4p')

[]

Change 4up back to 7up

In [143]:
find_ingre_with_substring('4up')

['4.4 4up 4% milk']

In [144]:
search_edit_ingredient(r"4up", "7up")

find_ingre_with_substring('7up')

['4.4 7up 4% milk']

Define a function that splits a list element into two new elements and deletes it

In [145]:
def split_ingre_to_two(target, search, retain_target=False):
    for i, ingre in enumerate(p_ingredients):
        if p_ingredients[i] == target:
            splits = re.split(search, p_ingredients[i])
            new_ingre1 = splits[0].strip()
            new_ingre2 = splits[1].strip()
            if retain_target:
                new_ingre2 = search.strip()
            del p_ingredients[i]
            p_ingredients.append(new_ingre1)
            p_ingredients.append(new_ingre2)

split_ingre_to_two('4.4 7up 4% milk', " 4% milk", retain_target=True)

find_ingre_with_substring('7up')

['4.4 7up']

In [146]:
tagged_recipe_ingredients = retag_ingredients()
p_ingredients[:20]

['4 tablespoon soy sauce, or more to taste',
 '4 large onion, cut into rings',
 '4.4 teaspoon onion salt',
 '4 skinless, bone-in chicken breast halves - cut in half',
 '4.4 red onion, finely diced',
 '4 zucchini, halved lengthwise',
 '4 large apples - peeled, cored, and sliced',
 '4 kaffir lime leaves, thinly sliced',
 '4 fresh jalapeno peppers - seeded, sliced, and divided',
 '4 lime, juiced',
 '4.4 cup ice, or as needed',
 '4 cup chunky peanut butter',
 '4 tablespoons chopped green onion',
 '4 pounds bone-in chicken pieces',
 '4.4 teaspoons black pepper',
 '4 gallon lard for frying (manteca)',
 '4 red pepper, seeded and thinly sliced',
 '4 tablespoons thinly sliced lemongrass',
 '4.4 pounds boneless pork chops',
 '4 pound broccoli rabe, cut into 4 4.4-inch lengths']

Numbers are mostly cleaned

In [147]:
new_cd_tokens = list(set(list_words_with_tag(tagged_recipe_ingredients, "CD")))
new_cd_tokens

['4',
 '7up',
 '4.4',
 'beef4',
 'mascarpone',
 'four',
 'mozzarella',
 'fontina',
 'provolone',
 'marinara',
 'zapallo',
 'seven',
 'yellow',
 '4.4x4.4',
 '4.4.4',
 'mostaccioli',
 'kalamata',
 "za'atar",
 'one',
 'yum',
 'xanthan',
 'zucchini',
 'bleu',
 'ziti',
 'millet']

Looking at the number of each POS tag for ingredient list

In [148]:
tagged_recipe_ingredients = retag_ingredients()

all_ingre_tags = []

for POS in ALL_POS:
  new_dic = {POS: list_words_with_tag(tagged_recipe_ingredients, POS)}
  all_ingre_tags.append(new_dic)

get_tag_number(all_ingre_tags)

[{'$': 0},
 {"''": 14},
 {'(': 3744},
 {')': 3828},
 {',': 8512},
 {'--': 0},
 {'.': 23},
 {':': 304},
 {'CC': 3074},
 {'CD': 21788},
 {'DT': 99},
 {'EX': 0},
 {'FW': 52},
 {'IN': 2849},
 {'JJ': 13401},
 {'JJR': 523},
 {'JJS': 6},
 {'LS': 0},
 {'MD': 612},
 {'NN': 32987},
 {'NNP': 2395},
 {'NNPS': 2},
 {'NNS': 13598},
 {'PDT': 1},
 {'POS': 126},
 {'PRP': 2},
 {'PRP$': 1},
 {'RB': 1452},
 {'RBR': 5},
 {'RBS': 0},
 {'RP': 13},
 {'SYM': 53},
 {'TO': 1039},
 {'UH': 0},
 {'VB': 1725},
 {'VBD': 8949},
 {'VBG': 354},
 {'VBN': 3434},
 {'VBP': 646},
 {'VBZ': 588},
 {'WDT': 1},
 {'WP': 0},
 {'WP$': 0},
 {'WRB': 0},
 {'``': 0}]

In [149]:
colon_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, ":")))
colon_tags

['--', ';', ':', '-']

In [150]:
for c in colon_tags:
    print(find_ingre_with_substring(c))

['4 large skinless, boneless chicken breast halves -- trimmed and cut into 4-inch pieces']
['4 raw chop with refuse, 4 g; (blank) 4.4 ounces boneless pork chops, pounded to 4.4 inch thick', '4 cups assorted mushrooms, sliced (I like white buttons, oyster, shiitake, portobello and crimini; if using shiitake, discard stems)', '4 (4 ounce) can black beans; drain and reserve liquid']
['Gravy:', 'Dipping Sauce:', 'Meatballs:', 'Caramel:', 'Chipotle Mayonnaise:', 'Fillings:', 'Spice Blend:']
['4 skinless, bone-in chicken breast halves - cut in half', '4 large apples - peeled, cored, and sliced', '4 fresh jalapeno peppers - seeded, sliced, and divided', '4 pounds bone-in chicken pieces', '4 pound broccoli rabe, cut into 4 4.4-inch lengths', '4 red bell pepper, cut into bite-size pieces', '4 pounds beef chuck, cut into 4-inch cubes', '4.4 cups all-purpose flour', '4 tablespoon reduced-fat cream cheese', '4 russet potato, cut into 4.4-inch cubes', '4 avocado - peeled, pitted, and sliced', '4 ou

In [151]:
find_ingre_with_substring("--")

['4 large skinless, boneless chicken breast halves -- trimmed and cut into 4-inch pieces']

In [152]:
search_edit_ingredient(r"--", ",")

find_ingre_with_substring('--')

[]

Remove the hanging colons

In [153]:
find_ingre_with_substring(":")

['Gravy:',
 'Dipping Sauce:',
 'Meatballs:',
 'Caramel:',
 'Chipotle Mayonnaise:',
 'Fillings:',
 'Spice Blend:']

In [154]:
search_edit_ingredient(r":", "")

find_ingre_with_substring(':')

[]

In [155]:
find_ingre_with_substring(";")

['4 raw chop with refuse, 4 g; (blank) 4.4 ounces boneless pork chops, pounded to 4.4 inch thick',
 '4 cups assorted mushrooms, sliced (I like white buttons, oyster, shiitake, portobello and crimini; if using shiitake, discard stems)',
 '4 (4 ounce) can black beans; drain and reserve liquid']

In [156]:
find_ingre_with_substring(', 4 g')

['4 raw chop with refuse, 4 g; (blank) 4.4 ounces boneless pork chops, pounded to 4.4 inch thick']

Remove the \(blanlk\) typo

In [157]:
search_edit_ingredient(r", 4 g; \(blank\)", ", 4g")

find_ingre_with_substring(";")

['4 cups assorted mushrooms, sliced (I like white buttons, oyster, shiitake, portobello and crimini; if using shiitake, discard stems)',
 '4 (4 ounce) can black beans; drain and reserve liquid']

In [158]:
split_ingre_to_two('4 raw chop with refuse, 4g; (blank) 4.4 ounces boneless pork chops, pounded to 4.4 inch thick', "; ")

find_ingre_with_substring(";")

['4 cups assorted mushrooms, sliced (I like white buttons, oyster, shiitake, portobello and crimini; if using shiitake, discard stems)',
 '4 (4 ounce) can black beans; drain and reserve liquid']

In [159]:
split_ingre_to_two("4 cups assorted mushrooms, sliced (I like white buttons, oyster, shiitake, portobello and crimini; if using shiitake, discard stems)", r"\(I like ")

find_ingre_with_substring(";")

['4 (4 ounce) can black beans; drain and reserve liquid',
 'white buttons, oyster, shiitake, portobello and crimini; if using shiitake, discard stems)']

In [160]:
find_ingre_with_substring("/")

['4 tablespoons warm milk (4 degrees F/4 degrees C)',
 '4 (4.4 ounce) package corn bread/muffin mix',
 '4 tablespoons warm water (4 degrees F/4 degrees C)',
 '4.4 tablespoon Guacamole, salsa, and/or sour cream',
 '4.4 c4 warm water (4 degrees F/4 degrees C)',
 '4 cups warm water (4 degrees F/4 degrees C)',
 '4.4 c4 warm water (4 degrees F/4 degrees C)',
 '4 cups warm water (4 degrees F/4 degrees C)',
 '4 cup warm milk (4 degrees F/4 degrees C)',
 '4 cup warm water (4 degrees F/4 degrees C)',
 '4 cups warm water (4 degrees F/4 degrees C)',
 '4 cup warm water (4 degrees F/4 degrees C)',
 '4.4 cu4 warm water (4 degrees F/4 degrees C)',
 '4 cup shredded Cheddar/Monterey Jack cheese blend',
 '4 (4 ounce) package round gyoza/potsticker wrappers',
 '4.4  warm milk (4 degrees F/4 degrees C)',
 '4.4 cups warm wat4(4 degree4F/4 degrees C)',
 '4 tablespoons warm water (4 degrees F/4 degrees C)']

Replace / with or

In [161]:
search_edit_ingredient(r"\/", " or ")
find_ingre_with_substring("/")

[]

In [162]:
tagged_recipe_ingredients = retag_ingredients()

tagged_recipe_ingredients[:20]

[[('4', 'CD'),
  ('tablespoon', 'NN'),
  ('soy', 'NN'),
  ('sauce', 'NN'),
  (',', ','),
  ('or', 'CC'),
  ('more', 'JJR'),
  ('to', 'TO'),
  ('taste', 'VB')],
 [('4', 'CD'),
  ('large', 'JJ'),
  ('onion', 'NN'),
  (',', ','),
  ('cut', 'VBN'),
  ('into', 'IN'),
  ('rings', 'NNS')],
 [('4.4', 'CD'), ('teaspoon', 'NN'), ('onion', 'NN'), ('salt', 'NN')],
 [('4', 'CD'),
  ('skinless', 'NN'),
  (',', ','),
  ('bone-in', 'JJ'),
  ('chicken', 'NN'),
  ('breast', 'NN'),
  ('halves', 'VBZ'),
  ('-', ':'),
  ('cut', 'NN'),
  ('in', 'IN'),
  ('half', 'NN')],
 [('4.4', 'CD'),
  ('red', 'JJ'),
  ('onion', 'NN'),
  (',', ','),
  ('finely', 'RB'),
  ('diced', 'VBD')],
 [('4', 'CD'),
  ('zucchini', 'NN'),
  (',', ','),
  ('halved', 'VBD'),
  ('lengthwise', 'NN')],
 [('4', 'CD'),
  ('large', 'JJ'),
  ('apples', 'NNS'),
  ('-', ':'),
  ('peeled', 'VBN'),
  (',', ','),
  ('cored', 'VBN'),
  (',', ','),
  ('and', 'CC'),
  ('sliced', 'VBD')],
 [('4', 'CD'),
  ('kaffir', 'NN'),
  ('lime', 'NN'),
  ('leaves

## Examining other POS in ingredients

So as to get an idea of POS tagging in the later section

In [163]:
fw_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "FW")))
fw_tags

['arbol',
 'di',
 'skin',
 's',
 'gallo',
 'de',
 'paprika',
 'kielbasa',
 'vanilla',
 'pico',
 'kalamansi',
 'kalonji',
 'mirin',
 'bilbao',
 'miso',
 'herbes']

In [164]:
rp_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "RP")))
rp_tags

['up', 'tomato', 'out', 'dashi', 'off', 'aside']

In [165]:
for rp in rp_tags:
    print(find_ingre_with_substring(" " + rp))

['4 slices bread, broken up into small pieces', '4 cut up chicken pieces']
['4 whole tomatoes', '4 slices ripe tomato', '4 cup finely diced tomato', '4 large tomatoes, coarsely chopped', '4 cup cherry tomatoes', '4 (4 ounce) cans crushed tomatoes', '4.4 cups chopped tomato', '4 (4.4 ounce) container cherry tomatoes, halved', '4 (4 ounce) can diced tomatoes', '4 tomatoes, diced', '4 sun-dried tomatoes (not oil packed)', '4 cups tomato juice', '4 large tomato, cubed', '4 gallon tomato juice', '4.4 cup chopped sun-dried tomatoes', '4 tomatoes, seeded and sliced', '4 tablespoon tomato puree', '4 roma tomato, cut into wedges', '4 large ripe tomatoes, quartered', '4 tomatoes, chopped', '4 (4.4 ounce) can diced tomatoes', '4 tablespoons chopped sun-dried tomatoes', '4 (4 ounce) can diced tomatoes, undrained', '4 cherry tomatoes, halved', '4.4 cup diced fresh tomato', '4.4 cup oil-packed sun-dried tomatoes, coarsely chopped', '4.4(4 ounce) can crushed San Marzano tomatoes', '4 (4.4 ounce) can 

Lamb, lobster and leeks are supposed to be nouns!

In [166]:
rbr_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "RBR")))
rbr_tags

['lobster', 'leeks', 'lamb']

In [167]:
wdt_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "WDT")))
wdt_tags

['whole']

In [168]:
pdt_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "PDT")))
pdt_tags

['half']

In [169]:
prp_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "PRP")))
prp_tags

['you']

In [170]:
find_ingre_with_substring("you ")

['4 (4 ounce) packages garlic and herb couscous mix (or any flavor you prefer)']

In [171]:
prp_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "PRP$")))
prp_tags

['your']

In [172]:
find_ingre_with_substring("your")

['4 (4 ounce) package pasta, your choice of shape']

In [173]:
punc_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, ".")))
punc_tags

['!', '.']

In [174]:
find_ingre_with_substring("!")

['4.4 cup Greek salad dressing, such as Yazzo!']

In [175]:
quote_tags = list(set(list_words_with_tag(tagged_recipe_ingredients, "''")))
quote_tags

["''"]

In [176]:
for q in quote_tags:
    print(find_ingre_with_substring("'"))

["4.4 cup confectioners' sugar", "4.4 cup hot honey (such as Mike's Hot Honey)", "4 tablespoons za'atar", "4 (4.4 ounce) can reduced-fat, reduced-sodium cream of mushroom soup (such as Campbell's Healthy Request)", "4 tablespoons hot sauce (such as Frank's Red Hot )", "4 pound smoked sausage (such as farmer's sausage), sliced", "4 (4 ounce) bottle red Thai curry sauce (such as Trader Joe's)", "4 tablespoon golden syrup (such as Lyle's)", "4.4 cup cajeta, sweetened caramelized goat's milk syrup", "4.4 cups pearl sugar (such as Lars' Own)", "4 cups confectioners' sugar, or more as needed", "4.4 cup confectioners' sugar for dusting", "4 ounce sheep's milk feta cheese", "4 ounces chocolate confectioners' coating", "4.4 cups sifted confectioners' sugar", "4.4 cup confectioners' sugar for dusting", "4 cup confectioners' sugar, sifted", "4 cups confectioners' sugar", "4 tablespoon confectioners' sugar, or to taste (Optional)", "4 frozen meatless vegetable meatballs (such as IKEA's frozen vege

How can these words be symbols?

In [177]:
list(set(list_words_with_tag(tagged_recipe_ingredients, "SYM")))

['tomato',
 'lettuce',
 'thighs',
 'beaten',
 'spinach',
 'mangos',
 'leeks',
 'avocados',
 'basil',
 'choy',
 'kale',
 'mangoes',
 'squash',
 'avocado',
 'shrimp',
 'mango',
 'breast',
 'sauerkraut',
 'lemon',
 'cucumber']

## Casing of recipe names

Because almost all words are capitalized by default in recipe name, need to correct the casing

In [178]:
all_recipe_names = []

for recipe in p_recipes:
    try:
        all_recipe_names.append(recipe['name'])
    except Exception as e:
        pass
    
all_recipe_names[:10]

['Pan-Fried Asparagus',
 'Creamy Au Gratin Potatoes',
 'Super-Delicious Zuppa Toscana',
 'Simple Teriyaki Sauce',
 'Spicy Korean Fried Chicken with Gochujang Sauce',
 'Spaghetti Aglio e Olio',
 'Easy Garam Masala',
 'Easy Chorizo Street Tacos',
 'Russian Cabbage Rolls with Gravy',
 'Shrimp Scampi with Pasta']

Create a corpus by joining all recipe names with \n, because the names were not literally a single text originally. Othwewise it will confuse the tokenisation

In [179]:
all_recipe_names_corpus = ("\n").join(all_recipe_names)

all_recipe_names_corpus

'Pan-Fried Asparagus\nCreamy Au Gratin Potatoes\nSuper-Delicious Zuppa Toscana\nSimple Teriyaki Sauce\nSpicy Korean Fried Chicken with Gochujang Sauce\nSpaghetti Aglio e Olio\nEasy Garam Masala\nEasy Chorizo Street Tacos\nRussian Cabbage Rolls with Gravy\nShrimp Scampi with Pasta\nGreek Lemon Chicken and Potato Bake\nEasy Mexican Casserole\nGerman Apple Cake I\nSpanish Flan\nGerman Pork Chops and Sauerkraut\nSpaghetti Cacio e Pepe\nChef John\'s Chicken Kiev\nIndian-Style Chicken and Onions\nFajita Seasoning\nPerfect Sushi Rice\nTender Italian Baked Chicken\nAuthentic German Potato Salad\nMiso Soup\nMexican Rice II\nSpongy Japanese Cheesecake\nChicken Katsu\nChicken Stir-Fry\nQuick Beef Stir-Fry\nEasy Authentic Mexican Rice\nHerbs de Provence\nGreek or House Dressing\nFrench Bread\nFocaccia Bread\nJamaican Fried Dumplings\nGluehwein\nCoquilles Saint-Jacques\nMexican-Style Chicken Taco Casserole\nRosemary Braised Lamb Shanks\nMake-Ahead Vegetarian Moroccan Stew\nCurry Stand Chicken Tikka

Tokenize

In [180]:
import nltk

recipe_tokens = list(set(nltk.word_tokenize(all_recipe_names_corpus)))
recipe_tokens[:10]

['Cupcakes',
 'Codfish',
 'Chinese',
 'Steak',
 'Chips',
 'Molo',
 'Biscuits',
 'Rice-On-Top',
 'Cocoa',
 'Cashews']

In [181]:
len(recipe_tokens)

3271

Join ingredients into a text with \n and tokenize

In [182]:
ingredients_corpus = ("\n").join(p_ingredients)

ingredients_corpus

'4 tablespoon soy sauce, or more to taste\n4 large onion, cut into rings\n4.4 teaspoon onion salt\n4 skinless, bone-in chicken breast halves - cut in half\n4.4 red onion, finely diced\n4 zucchini, halved lengthwise\n4 large apples - peeled, cored, and sliced\n4 kaffir lime leaves, thinly sliced\n4 fresh jalapeno peppers - seeded, sliced, and divided\n4 lime, juiced\n4.4 cup ice, or as needed\n4 cup chunky peanut butter\n4 tablespoons chopped green onion\n4 pounds bone-in chicken pieces\n4.4 teaspoons black pepper\n4 gallon lard for frying (manteca)\n4 red pepper, seeded and thinly sliced\n4 tablespoons thinly sliced lemongrass\n4.4 pounds boneless pork chops\n4 pound broccoli rabe, cut into 4 4.4-inch lengths\n4 pounds pork tenderloin\n4 pounds pork shoulder\n4 teaspoons mirin (Japanese sweet wine)\n4 whole tomatoes\n4.4 cup vanilla sugar, or as needed\n4 teaspoon ketchup, or to taste\n4 red bell pepper, cut into bite-size pieces\n4 pounds beef chuck, cut into 4-inch cubes\n4.4 teaspoo

In [183]:
ingre_tokens = list(set(nltk.word_tokenize(ingredients_corpus)))
ingre_tokens[:10]

['reserve',
 'Creole-style',
 'fennel',
 'Steak',
 'Chinese',
 'chilies',
 'packages',
 'out',
 'ramen',
 'San']

In [184]:
len(ingre_tokens)

2813

Most words in recipe tokens are capitalized

In [185]:
lower_recipe_tokens = []
for token in recipe_tokens:
    if token[0].islower():
        lower_recipe_tokens.append(token)
        
lower_recipe_tokens

['des',
 'from',
 'la',
 'without',
 'et',
 'le',
 'y',
 'powder',
 'e',
 'a',
 'on',
 'de',
 'to',
 'version',
 'bil',
 "l'Oignon",
 "all'Amatriciana",
 'au',
 'con',
 'na',
 'over',
 'aka',
 'for',
 'or',
 'by',
 'laziale',
 'nach',
 'in',
 'alla',
 'chili',
 'el',
 'aux',
 'and',
 'en',
 'its',
 'al',
 'su',
 'di',
 'the',
 'z',
 'sa',
 'with',
 'of']

Number of words that are not capitalized increased significantly crosschecking with lowercase words in ingredient tokens

In [186]:
for i, name in enumerate(recipe_tokens):
    for ingre in ingre_tokens:
        if recipe_tokens[i].lower() == ingre:
            recipe_tokens[i] = recipe_tokens[i].lower()

lower_recipe_tokens = []
for token in recipe_tokens:
    if token[0].islower():
        lower_recipe_tokens.append(token)
        
len(lower_recipe_tokens)

923

In [187]:
upper_recipe_tokens = list(filter(str.istitle, recipe_tokens))
len(upper_recipe_tokens)

2314

In [188]:
upper_recipe_tokens[:20]

['Cupcakes',
 'Codfish',
 'Chinese',
 'Molo',
 'Biscuits',
 'Rice-On-Top',
 'Be',
 'Fatoosh',
 'Jagerschnitzel',
 'Sum',
 'Kristen',
 'Ube-Macapuno',
 'Scotian',
 'Salat',
 'Costa',
 'Scotia',
 'Egg-Fried',
 'Waffles',
 'Blini',
 'Mangonada']

Use country names to get the words related to country names for capitalization

In [189]:
!pip install country_list

Collecting country_list
  Downloading country_list-1.0.0-py3-none-any.whl (1.5 MB)
[?25l[K     |▏                               | 10 kB 21.4 MB/s eta 0:00:01[K     |▍                               | 20 kB 22.3 MB/s eta 0:00:01[K     |▋                               | 30 kB 26.7 MB/s eta 0:00:01[K     |▉                               | 40 kB 26.6 MB/s eta 0:00:01[K     |█                               | 51 kB 24.2 MB/s eta 0:00:01[K     |█▎                              | 61 kB 25.9 MB/s eta 0:00:01[K     |█▌                              | 71 kB 22.2 MB/s eta 0:00:01[K     |█▊                              | 81 kB 22.0 MB/s eta 0:00:01[K     |██                              | 92 kB 23.2 MB/s eta 0:00:01[K     |██▏                             | 102 kB 24.8 MB/s eta 0:00:01[K     |██▍                             | 112 kB 24.8 MB/s eta 0:00:01[K     |██▋                             | 122 kB 24.8 MB/s eta 0:00:01[K     |██▉                             | 133 kB 24.8 

In [190]:
from country_list import countries_for_language

countries = dict(countries_for_language('en'))
countries = list(countries.values())

countries

['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua & Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia & Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'British Indian Ocean Territory',
 'British Virgin Islands',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Caribbean Netherlands',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Cocos (Keeling) Islands',
 'Colombia',
 'Comoros',
 'Congo - Brazzaville',
 'Congo - Kinshasa',
 'Cook Islands',
 'Costa Rica',
 'Côte d’Ivoire',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egyp

Not all words are captured in the country names library, added some more.

In [191]:
countries = ' '.join([elem for elem in countries])
countries = countries.replace('&', '')
countries = countries.split(" ")
countries = [i.strip() for i in countries]
countries = [string for string in countries if string != ""]
countries = [string for string in countries if string != "-"]

countries = countries + ["Filipino", "Malay", "Spanish", "Danish", "Welsh", "Polish", "Schwabisch", "Rochester", "Asia",
                         "Aussie", "Greek", "German", "Mexica", "Hawaii", "Irish", "Mediterranean", "Middle", "East",
                        "Norwegian", "Persian", "Pollo", "Thai", "West"]

countries

['Afghanistan',
 'Åland',
 'Islands',
 'Albania',
 'Algeria',
 'American',
 'Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua',
 'Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia',
 'Herzegovina',
 'Botswana',
 'Bouvet',
 'Island',
 'Brazil',
 'British',
 'Indian',
 'Ocean',
 'Territory',
 'British',
 'Virgin',
 'Islands',
 'Brunei',
 'Bulgaria',
 'Burkina',
 'Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape',
 'Verde',
 'Caribbean',
 'Netherlands',
 'Cayman',
 'Islands',
 'Central',
 'African',
 'Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas',
 'Island',
 'Cocos',
 '(Keeling)',
 'Islands',
 'Colombia',
 'Comoros',
 'Congo',
 'Brazzaville',
 'Congo',
 'Kinshasa',
 'Cook',
 'Islands',
 'Costa',
 'Rica',
 'Côte',
 'd’Ivoire',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'C

Then use stemmer to get the stem of the words in the country names. But if the stem is too short, just use the first 5 characters of the word

In [192]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

porter = PorterStemmer()
lancaster=LancasterStemmer()

porter_c = []
lancester_c = []

for c in countries:
    port = porter.stem(c.split(' ').pop(0))
    if len(port) < 5:
        port = c[:4]
    porter_c.append(port.capitalize())
    lan = lancaster.stem(c.split(' ').pop(0))
    if len(lan) < 5:
        lan = c[:4]
    lancester_c.append(lan.capitalize())

print(porter_c[:10])
print(lancester_c[:10])

['Afghanistan', 'Åland', 'Island', 'Albania', 'Algeria', 'American', 'Samoa', 'Andorra', 'Angola', 'Anguilla']
['Afgh', 'Åland', 'Island', 'Alban', 'Alger', 'Amer', 'Samo', 'Andorr', 'Angol', 'Anguill']


In [193]:
lancester_c.append("Victoria")
lancester_c

['Afgh',
 'Åland',
 'Island',
 'Alban',
 'Alger',
 'Amer',
 'Samo',
 'Andorr',
 'Angol',
 'Anguill',
 'Antarctic',
 'Antigu',
 'Barbud',
 'Argentin',
 'Armen',
 'Arub',
 'Austral',
 'Austr',
 'Azerbaid',
 'Bahama',
 'Bahrain',
 'Bangladesh',
 'Barbado',
 'Belar',
 'Belg',
 'Beli',
 'Benin',
 'Bermud',
 'Bhut',
 'Boliv',
 'Bosn',
 'Herzegovin',
 'Botswan',
 'Bouvet',
 'Island',
 'Brazil',
 'Brit',
 'Indi',
 'Ocea',
 'Territ',
 'Brit',
 'Virgin',
 'Island',
 'Brune',
 'Bulgar',
 'Burkin',
 'Faso',
 'Burund',
 'Cambod',
 'Cameroon',
 'Canad',
 'Cape',
 'Verd',
 'Carib',
 'Netherland',
 'Caym',
 'Island',
 'Cent',
 'Afri',
 'Republ',
 'Chad',
 'Chil',
 'Chin',
 'Christmas',
 'Island',
 'Coco',
 '(keeling)',
 'Island',
 'Colomb',
 'Comoro',
 'Congo',
 'Brazzavil',
 'Congo',
 'Kinshas',
 'Cook',
 'Island',
 'Cost',
 'Rica',
 'Côte',
 'D’ivoire',
 'Croat',
 'Cuba',
 'Curaçao',
 'Cypr',
 'Czech',
 'Denmark',
 'Djibout',
 'Dominic',
 'Domin',
 'Republ',
 'Ecuad',
 'Egypt',
 'El',
 'Salvad',
 'E

Get all the recipe tokens that have the country names stem and remove the unrelated tokens

In [194]:
token_with_country_prefix = []
for rt in recipe_tokens:
    for lan in lancester_c:
        if lan in rt:
            token_with_country_prefix.append(rt)

token_with_country_prefix = sorted(list(set(token_with_country_prefix)))
token_with_country_prefix.remove("No-Cook")
token_with_country_prefix.remove("Man")
token_with_country_prefix.remove("Slow-Cooked")
token_with_country_prefix.remove("Slow-Cooker")
token_with_country_prefix.remove("Garlic-Anchovy-Sardine")
token_with_country_prefix

["'Chinese",
 'Afghan',
 'Afghani',
 'African',
 'African-Style',
 'Afritada',
 'Algerian',
 'Almond-Ricotta',
 'American',
 'Americano',
 'Arabic',
 'Argentine',
 'Argentinean',
 'Armenian',
 'Asiago',
 'Asian',
 'Asian-Inspired',
 'Asian-Style',
 'Asian-Themed',
 'Australian',
 'Bangladeshi',
 'Belgi',
 'Belgian',
 'Belizean',
 'Bermuda',
 'Bhutanese',
 'Bolivian',
 'Brazilian',
 'Brazilian-Style',
 'British',
 'Bulgarian',
 'Cambodian',
 'Canada',
 'Canadian',
 'Cape',
 'Capezzoli',
 'Caribbean',
 'Caribbean-Spiced',
 'Chad',
 'Chilaquiles',
 'Chilean',
 'Chilean-Style',
 'Chinese',
 'Chinese-Style',
 'Christmas',
 'Coco',
 'Coconut-Lentil',
 'Coconut-Lime',
 'Cocotte',
 'Colombian',
 'Cooker',
 'Cooks',
 'Cookup',
 'Costa',
 'Croatian',
 'Cuban',
 'Cuban-Inspired',
 'Cuban-Style',
 'Cubanos',
 'Curry-Coconut',
 'Czech',
 'Czechoslovakian',
 'Danielle',
 'Danish',
 'Dominican',
 'Dominican-Style',
 'East',
 'Easter',
 'Eastern',
 'Eastern-Style',
 'Egyptian',
 'Elizabeth',
 'Ellen',

In [195]:
token_with_country_prefix

["'Chinese",
 'Afghan',
 'Afghani',
 'African',
 'African-Style',
 'Afritada',
 'Algerian',
 'Almond-Ricotta',
 'American',
 'Americano',
 'Arabic',
 'Argentine',
 'Argentinean',
 'Armenian',
 'Asiago',
 'Asian',
 'Asian-Inspired',
 'Asian-Style',
 'Asian-Themed',
 'Australian',
 'Bangladeshi',
 'Belgi',
 'Belgian',
 'Belizean',
 'Bermuda',
 'Bhutanese',
 'Bolivian',
 'Brazilian',
 'Brazilian-Style',
 'British',
 'Bulgarian',
 'Cambodian',
 'Canada',
 'Canadian',
 'Cape',
 'Capezzoli',
 'Caribbean',
 'Caribbean-Spiced',
 'Chad',
 'Chilaquiles',
 'Chilean',
 'Chilean-Style',
 'Chinese',
 'Chinese-Style',
 'Christmas',
 'Coco',
 'Coconut-Lentil',
 'Coconut-Lime',
 'Cocotte',
 'Colombian',
 'Cooker',
 'Cooks',
 'Cookup',
 'Costa',
 'Croatian',
 'Cuban',
 'Cuban-Inspired',
 'Cuban-Style',
 'Cubanos',
 'Curry-Coconut',
 'Czech',
 'Czechoslovakian',
 'Danielle',
 'Danish',
 'Dominican',
 'Dominican-Style',
 'East',
 'Easter',
 'Eastern',
 'Eastern-Style',
 'Egyptian',
 'Elizabeth',
 'Ellen',

Possessives can also be used for capitalizing, since proper names like Chef John's occur a lot

In [196]:
possesive_tokens = list_words_with_tag(tagged_recipe_names, "''")
possesive_tokens

["''", "''", "'", "''", "''", "''", "''"]

In [197]:
possessive_names = []
for ps in possesive_tokens:
    print(find_value_with_char(p_recipes, 'name', ps))
    possessive_names = possessive_names + find_value_with_char(p_recipes, 'name', ps)

[]
[]
["Chef John's Chicken Kiev", "Angela's Awesome Enchiladas", "Randy's Slow Cooker Ravioli Lasagna", "'Chinese Buffet' Green Beans", "Chef John's Beef Rouladen", "Corned Beef and Cabbage Shepherd's Pie", "Gramma's Date Squares", "Authentic Russian Salad 'Olivye'", "Chef John's Meatless Meatballs", "Chef John's Beef Goulash", "Grandma's Noodles II", "Chef John's Clotted Cream", "Newfoundland Jigg's Dinner", "Chef John's Coq Au Vin", "Chef John's Loco Moco", "Dash's Donair", "Turkey Shepherd's Pie", "Papa Drexler's Bavarian Pretzels", "Bob's Stuffed Banana Peppers", "Chef John's Swedish Meatballs", "Chef John's German Recipes", "Chef John's Chicken Tikka Masala", "Maria's Mexican Rice", "Mom's Buttermilk Pancakes", "Geneva's Ultimate Hungarian Mushroom Soup", "Charley's Slow Cooker Mexican Style Meat", "Ingrid's Rouladen", "Chef John's Lasagna", "Lola's Horchata", "Chef John's Italian Sausage Chili", "Kid's Favorite Pizza Casserole", "Traci's Adobo Seasoning", "Frank's Favorite Slow-

In [198]:
possessive_names

["Chef John's Chicken Kiev",
 "Angela's Awesome Enchiladas",
 "Randy's Slow Cooker Ravioli Lasagna",
 "'Chinese Buffet' Green Beans",
 "Chef John's Beef Rouladen",
 "Corned Beef and Cabbage Shepherd's Pie",
 "Gramma's Date Squares",
 "Authentic Russian Salad 'Olivye'",
 "Chef John's Meatless Meatballs",
 "Chef John's Beef Goulash",
 "Grandma's Noodles II",
 "Chef John's Clotted Cream",
 "Newfoundland Jigg's Dinner",
 "Chef John's Coq Au Vin",
 "Chef John's Loco Moco",
 "Dash's Donair",
 "Turkey Shepherd's Pie",
 "Papa Drexler's Bavarian Pretzels",
 "Bob's Stuffed Banana Peppers",
 "Chef John's Swedish Meatballs",
 "Chef John's German Recipes",
 "Chef John's Chicken Tikka Masala",
 "Maria's Mexican Rice",
 "Mom's Buttermilk Pancakes",
 "Geneva's Ultimate Hungarian Mushroom Soup",
 "Charley's Slow Cooker Mexican Style Meat",
 "Ingrid's Rouladen",
 "Chef John's Lasagna",
 "Lola's Horchata",
 "Chef John's Italian Sausage Chili",
 "Kid's Favorite Pizza Casserole",
 "Traci's Adobo Seasoning"

Chef John's Lasagna, but how about just lasagna? Saving both

In [199]:
non_possessive = []
for ps in possessive_names:
    if "'s " in ps:
        non_possessive.append(ps.split("'s ",1)[1].lower())

non_possessive

['chicken kiev',
 'awesome enchiladas',
 'slow cooker ravioli lasagna',
 'beef rouladen',
 'pie',
 'date squares',
 'meatless meatballs',
 'beef goulash',
 'noodles ii',
 'clotted cream',
 'dinner',
 'coq au vin',
 'loco moco',
 'donair',
 'pie',
 'bavarian pretzels',
 'stuffed banana peppers',
 'swedish meatballs',
 'german recipes',
 'chicken tikka masala',
 'mexican rice',
 'buttermilk pancakes',
 'ultimate hungarian mushroom soup',
 'slow cooker mexican style meat',
 'rouladen',
 'lasagna',
 'horchata',
 'italian sausage chili',
 'favorite pizza casserole',
 'adobo seasoning',
 'favorite slow-cooker thai chicken',
 'shrimp fra diavolo',
 'chicken paprikash',
 'french omelette',
 'pie',
 'hazelnut christmas cookies',
 'patatas bravas',
 'italian bread',
 'cuban bread',
 'pie',
 'chimichurri sauce',
 'easy german sauerbraten',
 'pie',
 'german marble cake',
 'steak pizzaiola',
 'sour cream lasagna',
 'beef shish kabobs',
 'polish perogies',
 'indian-spiced tomato lentil soup',
 'shep

In [200]:
all_recipe_names[:10]

['Pan-Fried Asparagus',
 'Creamy Au Gratin Potatoes',
 'Super-Delicious Zuppa Toscana',
 'Simple Teriyaki Sauce',
 'Spicy Korean Fried Chicken with Gochujang Sauce',
 'Spaghetti Aglio e Olio',
 'Easy Garam Masala',
 'Easy Chorizo Street Tacos',
 'Russian Cabbage Rolls with Gravy',
 'Shrimp Scampi with Pasta']

Create a copy of all_recipe_names as backup

In [201]:
all_recipe_names2 = all_recipe_names.copy()
all_recipe_names2[:10]

['Pan-Fried Asparagus',
 'Creamy Au Gratin Potatoes',
 'Super-Delicious Zuppa Toscana',
 'Simple Teriyaki Sauce',
 'Spicy Korean Fried Chicken with Gochujang Sauce',
 'Spaghetti Aglio e Olio',
 'Easy Garam Masala',
 'Easy Chorizo Street Tacos',
 'Russian Cabbage Rolls with Gravy',
 'Shrimp Scampi with Pasta']

Drop the recipe names that have possessives temporarily

In [202]:
print(len(all_recipe_names))
  
all_recipe_names2 = [ele for ele in all_recipe_names2 if ele not in possessive_names] 
print(len(all_recipe_names2))

#what is the purpose of this?

5249
4890


If a word in a recipe does not belong to the tokens with country prefix, lowercase it by default

In [203]:
# https://stackoverflow.com/questions/40291443/python-convert-a-string-to-lowercase-except-some-special-strings/40291577
lowerAllExcept = lambda x: " ".join( a if a in token_with_country_prefix else a.lower()
                                    for a in x.split() )

for i, recipe in enumerate(all_recipe_names2):
    for t in token_with_country_prefix:
        all_recipe_names2[i] = lowerAllExcept(all_recipe_names2[i])

Join the names with possessives back to the list

In [204]:
all_recipe_names2 = all_recipe_names2 +  possessive_names
print(len(all_recipe_names2))
all_recipe_names2 = all_recipe_names2 +  non_possessive
print(len(all_recipe_names2))
all_recipe_names2 = list(set(all_recipe_names2)) # what does this code do?
print(len(all_recipe_names2))

5249
5577
5362


For some reasons, 'Thai' is saved as 'thai'

In [205]:
all_recipe_names_corpus = ("\n").join(all_recipe_names2)

recipe_tokens = list(set(nltk.word_tokenize(all_recipe_names_corpus)))

recipe_tokens = [i.replace('thai','Thai') for i in recipe_tokens]

lower_recipe_tokens = []
for token in recipe_tokens:
    if token[0].islower():
        lower_recipe_tokens.append(token)
        
lower_recipe_tokens

['texas',
 'fennel',
 'gods',
 'drink',
 'out',
 'unbeatable',
 'ramen',
 'boscobel',
 'tartare',
 'belly',
 'famous',
 'bao',
 'nutty',
 'piccata',
 'brewis',
 'minestrone',
 'koong',
 'dynamite',
 'cowboy',
 'timballo',
 'kanafa',
 'mint',
 'prizewinning',
 'pignoli',
 'potica',
 'des',
 'kaiserschmarrn',
 'tilapia',
 'flourless',
 'gujarati',
 'oil-poached',
 'furikake',
 'shiitake',
 'rhubarb',
 'hunan-style',
 'mayo',
 'reuben',
 'halloween',
 'ribs',
 'deluxe',
 'shaking',
 'family-pleasing',
 'karjalan',
 'tinaktak',
 'octoberfest',
 'hasenpfeffer',
 'beef-and-bean',
 'pancetta',
 'florentine',
 'chipas',
 'oyster',
 'cashews',
 'stovetop',
 'sum',
 'souvlaki',
 'wraps',
 'ghormeh',
 'churrasco',
 'hamantashen',
 'biriyani',
 'paprikash',
 'grill',
 'melenzana',
 'tak',
 'marinade',
 'aguadito',
 'kofta',
 'dressing',
 'seed',
 'brinjal',
 'moco',
 'biscuits',
 'noodles',
 'say',
 'enchilada',
 'rice-on-top',
 'pizzelles',
 'teff',
 'alentejana',
 'crabmeat',
 'soda',
 'garlic',

In [206]:
len(lower_recipe_tokens)

2811

In [207]:
upper_recipe_tokens = list(filter(str.istitle, recipe_tokens))
len(upper_recipe_tokens)

853

In [208]:
upper_recipe_tokens

['Chinese',
 'Steak',
 'Biscuits',
 'Be',
 'Simple',
 'Kristen',
 'Costa',
 'Scotia',
 'Honey',
 'Mangonada',
 'Santorini',
 'Diavolo',
 'Samosas',
 'Kong',
 'Jansson',
 'D',
 'Famous',
 'Jamaica',
 'Pita',
 'Wings',
 'Sans',
 'Mozambique',
 'Beans',
 'Porchetta',
 'Asian',
 'Samosa',
 'Alfredo',
 'Guyanese',
 'Mike',
 'Eastern-Style',
 'Tonno',
 'Fruit',
 'Almond',
 'Spanish-Style',
 'Taiwanese',
 'Apple',
 'Tuscan',
 'Salt',
 'Prizewinning',
 'Brisket',
 'Nut',
 'Danish',
 'Thai',
 'Hazelnut',
 'Leek',
 'Colombian',
 'Poor',
 'Lucy',
 'Victorian',
 'Cabbage',
 'Americano',
 'Keon',
 'Canadian',
 'Cubanos',
 'Armenian',
 'Sangria',
 'Coconut-Lime',
 'Out',
 'India',
 'Whitney',
 'Parmesan',
 'Afghan',
 'Buttermilk',
 'Dough',
 'Pad',
 'Varenyky',
 'Manok',
 'Melissa',
 'Malaysian',
 'South',
 'Jamaican',
 'Pierogy',
 'Sea',
 'Cornish',
 'Rustica',
 'Donair',
 'Tonkatsu',
 'Thai',
 'Spicy',
 'Slammin',
 'Joy',
 'Indio',
 'Drexler',
 'Rico',
 'Gambas',
 'Tofu',
 'Southwestern',
 'Maltes

Recreating recipe for token list after changing casing of the corpus 

In [209]:
all_recipe_names_corpus = ("\n").join(all_recipe_names2)

all_recipe_names_corpus

'\neasy German apple streusel pie\ntraditional Turkish dumplings\ntofu hiyayakko\nA Scotsman\'s Shepherd Pie\ntamale pie\nsopapilla cheesecake dessert\nPortuguese chourico and peppers\neasy bruschetta\nRussian Pierogi with potatoes and mushrooms\nKrista\'s Sticky Honey Garlic Wings\neasy baked Indian Samosas\nPerfect St. Patrick\'s Day Cake\ntraditional Italian limoncello\nrabbit stew\nCara\'s Creamy Stuffed Shells\nfried chicken chunks Dominican\nquick chicken enchiladas\nkaiserschmarrn\nGerman cucumber salad\nperfect parkin\nlasagna alfredo\nSamosas\ntake-out fake-out Pollo con crema\nMillionaire\'s Shortbread\nkartoffelsuppe nach bayrischer art\nArgentine chimichurri bread\nJamaican me crazy chili\nchiles rellenos\nsuper easy slow Cooker chicken enchilada meat\nIndian tomato chicken\nbiriyani\nslow Cooker pork cacciatore\nrigatoni alla genovese\nMexican pot roast\nfettuccine alfredo v\nduck fried rice\nIsolde\'s German Cheesecake\nthree cheese manicotti\npork in peanut sauce\nziti w

In [210]:
recipe_tokens = list(set(nltk.word_tokenize(all_recipe_names_corpus)))

recipe_tokens[:10]

['texas',
 'Chinese',
 'fennel',
 'gods',
 'Steak',
 'Biscuits',
 'drink',
 'out',
 'unbeatable',
 'Be']

## Updating POS tags in names after changing casing

Previously, almost all the words belong to NNP or NNPS, due to capitalization. By fixing the letter casings, now most of the words are NN (common nouns)

In [211]:
final_tagged_names = []

for recipe in all_recipe_names2:
    final_tagged_names.append(tag_pos(recipe))

all_name_tags = []

for POS in ALL_POS:
  new_dic = {POS: list_words_with_tag(final_tagged_names, POS)}
  all_name_tags.append(new_dic)

get_tag_number(all_name_tags)

[{'$': 0},
 {"''": 8},
 {'(': 0},
 {')': 0},
 {',': 63},
 {'--': 0},
 {'.': 1},
 {':': 1},
 {'CC': 509},
 {'CD': 28},
 {'DT': 101},
 {'EX': 3},
 {'FW': 46},
 {'IN': 523},
 {'JJ': 3132},
 {'JJR': 10},
 {'JJS': 6},
 {'LS': 0},
 {'MD': 2},
 {'NN': 9070},
 {'NNP': 1724},
 {'NNPS': 5},
 {'NNS': 1426},
 {'PDT': 0},
 {'POS': 344},
 {'PRP': 3},
 {'PRP$': 1},
 {'RB': 61},
 {'RBR': 0},
 {'RBS': 0},
 {'RP': 4},
 {'SYM': 1},
 {'TO': 10},
 {'UH': 0},
 {'VB': 52},
 {'VBD': 261},
 {'VBG': 101},
 {'VBN': 232},
 {'VBP': 211},
 {'VBZ': 20},
 {'WDT': 0},
 {'WP': 0},
 {'WP$': 0},
 {'WRB': 0},
 {'``': 6}]

## Chunking (recipe names)

If the number of words in recipes are more than 2 (since bigram can deal with 2-word names), then it can be treated as a recipe name chunk

In [212]:
def sort_unique_list(old_list):
    return sorted(list(set(old_list)))

In [213]:
recipe_name_chunk = []

for recipe in all_recipe_names2:
    if len(recipe.split()) > 2:
        recipe_name_chunk.append(recipe)

recipe_name_chunk = sort_unique_list(recipe_name_chunk)

for n in recipe_name_chunk:
    print(n)

"million dollar" Chinese cabbage salad
"pantry raid" chicken enchilada casserole
"skinny" chicken tacos
'Chinese Buffet' Green Beans
3-ingredient lemon scones
5-ingredient Mexican casserole
A Firefighter's Meatloaf
A Scotsman's Shepherd Pie
Adriel's Chinese Curry Chicken
Afghan beef raviolis
Afghani kabli pulao
African cabbage stew
African chicken stew
African sweet potato and peanut soup
African sweet potato stew
African-Style oxtail stew
Al's Baked Swiss Steak
Al's Burmese Chicken Curry
Ali's Amazing Bruschetta
Alicia's Aloo Gobi
Allie's Mushroom Pizza
Alysia's Basic Meat Lasagna
Amanda's Stuffed Peppers
Andy's Spicy Green Chile Pork
Angela's Asian-Inspired Chicken Noodle Soup
Angela's Awesome Enchiladas
Anne's Chicken Chilaquiles Rojas
Arabic fattoush salad
Argentine chimichurri bread
Argentine meat empanadas
Argentinean cheese bread
Armenian Easter bread
Armenian shish kabob
Armenian stuffed eggplant
Asiago sun-dried tomato pasta
Asian beef with snow peas
Asian chicken salad
Asian 

Get all the prepositions found by NLTK

In [214]:
in_tokens = sort_unique_list(get_values_from_dict_list(all_name_tags, 'IN')[0])

in_tokens

['Of',
 'Under',
 'arroz',
 'bayrischer',
 'before',
 'beyond',
 'brown',
 'by',
 'de',
 'dough',
 'en',
 'for',
 'from',
 'in',
 'of',
 'on',
 'out',
 'over',
 'pina',
 'so',
 'trout',
 'under',
 'with',
 'without',
 'worth']

Keep only the actual prepositions

In [215]:
in_tokens = ['Of',
 'Under',
 'before',
 'beyond',
 'by',
 'for',
 'from',
 'in',
 'of',
 'on',
 'out',
 'over',
 'so',
 'under',
 'with',
 'without']

in_tokens

['Of',
 'Under',
 'before',
 'beyond',
 'by',
 'for',
 'from',
 'in',
 'of',
 'on',
 'out',
 'over',
 'so',
 'under',
 'with',
 'without']

Get all the recipe names with prepositions

In [216]:
names_in_tokens = [s for s in all_recipe_names2 if any(xs in s for xs in in_tokens)]

names_in_tokens

['traditional Turkish dumplings',
 'tofu hiyayakko',
 'sopapilla cheesecake dessert',
 'Russian Pierogi with potatoes and mushrooms',
 "Krista's Sticky Honey Garlic Wings",
 'traditional Italian limoncello',
 'fried chicken chunks Dominican',
 'perfect parkin',
 'take-out fake-out Pollo con crema',
 "Millionaire's Shortbread",
 'kartoffelsuppe nach bayrischer art',
 'Argentine chimichurri bread',
 'rigatoni alla genovese',
 'fettuccine alfredo v',
 "Isolde's German Cheesecake",
 'pork in peanut sauce',
 'ziti with Italian sausage',
 'jerk chicken wings',
 'marzipan Christmas kringle',
 'camarones con crema',
 'sweet and sour sauce ii',
 'baked penne with Italian sausage',
 'azteca soup',
 'Chinese steamed buns with meat filling',
 'tonkatsu shoyu ramen',
 'crispy tilapia fish tacos with slaw',
 'sopa de tortilla',
 'fereni starch pudding',
 'pork, apple, and ginger stir-fry with hoisin sauce',
 'instant pot barbacoa',
 'Indonesian chicken skewers with peanut sauce',
 '5-ingredient Mexi

Define a function that chunks based on grammar given, but only return chunk that have more than 2 words, since bigrams already can deal with phrases with 2 words anyway

In [217]:
from nltk import RegexpParser

def chunk(corpus, grammar, target):
    chunker = RegexpParser(grammar)
    tagged = pos_tag(word_tokenize(corpus))
    output = chunker.parse(tagged)
    outputs = []
    for subtree in output.subtrees(filter=lambda t: t.label() == target):
        result = re.sub("(\([A-Z]+ )|(\/[A-Z]+)|(\))+", "", str(subtree))
        if len(result.split()) > 2:
            outputs.append(result)
    return outputs

# https://github.com/nopynospy/pos_tagging/blob/main/pos.ipynb

PP_REGEX = r"""
  ADJP: {<RB>?<JJ|JJR|JJS|RBR|RBS>}    # Adjectives may have comparative and superlative, and come after adverbs like very
  NP: {<DT|WDT|WP$>?<CD>?<AdjP>*<NN|NNS|NNP|NNPS><POS>*<NN|NNS|NNP|NNPS|PP|CD>*<VBG>?}    # Determiner, number and adjectives come before nouns and nouns may have possessive -s and followed by another noun
  NP: {<PRP|EX|CD|WP|WRB|PRP$|WP$>}    # Pronouns and numbers can also replace nouns and function as one
  PP: {<IN>?<IN>?<IN|TO><NP>}    # Prepositions come before nouns and sometimes two prepositions come together
"""

chunk("chicken marsala with portobello mushrooms", PP_REGEX, "PP")

['with portobello mushrooms']

Get prepositional phrases from recipe names

In [218]:
prepositional_phrases = []

for name in names_in_tokens:
    prepositional_phrases = prepositional_phrases + (chunk(name, PP_REGEX, "PP"))
    
prepositional_phrases

['in peanut sauce',
 'with meat filling',
 'with hoisin sauce',
 'with peanut sauce',
 'with cream cheese',
 'with tzatziki sauce',
 'with jasmine rice',
 'with ramen noodles',
 'with bechamel sauce',
 'with cauli rice',
 'with peanut sauce',
 'with Seared Sea Scallops',
 'under a brick',
 'with Guinness Chocolate Icing',
 'with crumbly topping',
 'with coconut crust',
 'with coconut milk',
 'with homemade beef stock',
 'with coconut milk',
 'in tomato sauce',
 'with homemade taco seasoning',
 'of cauliflower soup',
 'on the cob',
 'with hollandaise sauce',
 'in puff pastry',
 'with sake butter',
 'on the rocks',
 'for Mexican soups',
 'in tamarind broth',
 'in tomato sauce',
 'of the Lasagna',
 'with Mango-Pineapple slaw',
 'out of this world spaghetti',
 'in the air fryer',
 'in chicken broth',
 'with water chestnuts',
 'without coconut milk',
 'with a twist',
 'in a bag',
 'with honey mayoster sauce',
 'with peanut butter',
 'in coconut milk',
 'with gochujang sauce',
 'with Asiago 

## Chunking (ingredients)

In [219]:
all_ingre_tags = []

for POS in ALL_POS:
  new_dic = {POS: list_words_with_tag(tagged_recipe_ingredients, POS)}
  all_ingre_tags.append(new_dic)

get_tag_number(all_ingre_tags)

[{'$': 0},
 {"''": 14},
 {'(': 3742},
 {')': 3827},
 {',': 8513},
 {'--': 0},
 {'.': 23},
 {':': 295},
 {'CC': 3094},
 {'CD': 21802},
 {'DT': 99},
 {'EX': 0},
 {'FW': 52},
 {'IN': 2849},
 {'JJ': 13400},
 {'JJR': 523},
 {'JJS': 6},
 {'LS': 0},
 {'MD': 612},
 {'NN': 32984},
 {'NNP': 2400},
 {'NNPS': 2},
 {'NNS': 13598},
 {'PDT': 1},
 {'POS': 126},
 {'PRP': 1},
 {'PRP$': 1},
 {'RB': 1452},
 {'RBR': 5},
 {'RBS': 0},
 {'RP': 13},
 {'SYM': 53},
 {'TO': 1039},
 {'UH': 0},
 {'VB': 1724},
 {'VBD': 8947},
 {'VBG': 354},
 {'VBN': 3436},
 {'VBP': 645},
 {'VBZ': 588},
 {'WDT': 1},
 {'WP': 0},
 {'WP$': 0},
 {'WRB': 0},
 {'``': 0}]

Get all the prepositions detected by NLTK from ingredients

In [220]:
in_tokens = sort_unique_list(get_values_from_dict_list(all_ingre_tags, 'IN')[0])

in_tokens

['OF',
 'about',
 'across',
 'against',
 'aji',
 'almond',
 'ancho',
 'aonori',
 'as',
 'at',
 'brown',
 'by',
 'de',
 'dough',
 'for',
 'from',
 'if',
 'in',
 'into',
 'nonfat',
 'nutmeg',
 'of',
 'on',
 'orzo',
 'out',
 'over',
 'pepper',
 'per',
 'pimento',
 'pinto',
 'taco',
 'tamarind',
 'through',
 'trout',
 'until',
 'with',
 'without',
 'wrapper']

Keep only actual prepositions

In [221]:
in_tokens = ['OF',
 'about',
 'across',
 'against',
 'as',
 'at',
 'by',
 'for',
 'from',
 'if',
 'in',
 'into',
 'of',
 'on',
 'out',
 'over',
 'per',
 'through',
 'until',
 'with',
 'without']

in_tokens

['OF',
 'about',
 'across',
 'against',
 'as',
 'at',
 'by',
 'for',
 'from',
 'if',
 'in',
 'into',
 'of',
 'on',
 'out',
 'over',
 'per',
 'through',
 'until',
 'with',
 'without']

Get all the ingredients with prepositions

In [222]:
ingres_in_tokens = [s for s in p_ingredients if any(xs in s for xs in in_tokens)]

ingres_in_tokens

['4 tablespoon soy sauce, or more to taste',
 '4 large onion, cut into rings',
 '4.4 teaspoon onion salt',
 '4 skinless, bone-in chicken breast halves - cut in half',
 '4.4 red onion, finely diced',
 '4 zucchini, halved lengthwise',
 '4 kaffir lime leaves, thinly sliced',
 '4 fresh jalapeno peppers - seeded, sliced, and divided',
 '4.4 cup ice, or as needed',
 '4 tablespoons chopped green onion',
 '4 pounds bone-in chicken pieces',
 '4.4 teaspoons black pepper',
 '4 gallon lard for frying (manteca)',
 '4 red pepper, seeded and thinly sliced',
 '4 tablespoons thinly sliced lemongrass',
 '4.4 pounds boneless pork chops',
 '4 pound broccoli rabe, cut into 4 4.4-inch lengths',
 '4 pounds pork tenderloin',
 '4 teaspoons mirin (Japanese sweet wine)',
 '4 whole tomatoes',
 '4.4 cup vanilla sugar, or as needed',
 '4 teaspoon ketchup, or to taste',
 '4 red bell pepper, cut into bite-size pieces',
 '4 pounds beef chuck, cut into 4-inch cubes',
 '4.4 teaspoon ground cloves',
 '4 large sweet onion

In [223]:
prepositional_phrases2 = []

for name in ingres_in_tokens:
    prepositional_phrases2 = prepositional_phrases2 + (chunk(name, PP_REGEX, "PP"))
    
prepositional_phrases2

['into 4 inch pieces',
 'into 4 inch pieces',
 'of mushroom soup',
 'at room temperature',
 'into 4 inch pieces',
 'at room temperature',
 'as Smart Balance',
 "as Mike 's Hot Honey",
 'in adobo sauce',
 'into 4 inch cubes',
 'at room temperature',
 'of mushroom soup',
 "as Campbell 's Healthy Request",
 "as Frank 's Red Hot",
 'of mushroom soup',
 "as farmer 's sausage",
 'into 4 wedges',
 'de arbol peppers',
 'into 4 inch pieces',
 'as Trader Joe',
 'of 4 limes',
 'into 4 wedges',
 'at room temperature',
 'from the cob',
 'into 4 pieces',
 'for 4 minutes',
 'in au jus',
 'into 4 inch pieces',
 "as Lyle 's",
 'against the grain',
 'in adobo sauce',
 'as Montreal Steak Seasoning',
 'of one lemon',
 'to 4 % cocao',
 'into 4 inch pieces',
 'as RO*TEL Hot',
 'in adobo sauce',
 'de arbol peppers',
 'at room temperature',
 "as Lars ' Own",
 'as FAGE Total',
 'into 4 inch cubes',
 'into 4 pieces',
 'of chicken soup',
 'into 4.4 inch slices',
 'at room temperature',
 'about 4 inches',
 'as Di

Fix typo

In [224]:
prepositional_phrases2 = ["as SuzyQ's Santa Maria Valley Style Seasoning" if x=="(PP\n  as\n  (NP\n    SuzyQ\n    's\n    Santa\n    Maria\n    Valley\n    Style\n    Seasoning" else x for x in prepositional_phrases2]

prepositional_phrases2

['into 4 inch pieces',
 'into 4 inch pieces',
 'of mushroom soup',
 'at room temperature',
 'into 4 inch pieces',
 'at room temperature',
 'as Smart Balance',
 "as Mike 's Hot Honey",
 'in adobo sauce',
 'into 4 inch cubes',
 'at room temperature',
 'of mushroom soup',
 "as Campbell 's Healthy Request",
 "as Frank 's Red Hot",
 'of mushroom soup',
 "as farmer 's sausage",
 'into 4 wedges',
 'de arbol peppers',
 'into 4 inch pieces',
 'as Trader Joe',
 'of 4 limes',
 'into 4 wedges',
 'at room temperature',
 'from the cob',
 'into 4 pieces',
 'for 4 minutes',
 'in au jus',
 'into 4 inch pieces',
 "as Lyle 's",
 'against the grain',
 'in adobo sauce',
 'as Montreal Steak Seasoning',
 'of one lemon',
 'to 4 % cocao',
 'into 4 inch pieces',
 'as RO*TEL Hot',
 'in adobo sauce',
 'de arbol peppers',
 'at room temperature',
 "as Lars ' Own",
 'as FAGE Total',
 'into 4 inch cubes',
 'into 4 pieces',
 'of chicken soup',
 'into 4.4 inch slices',
 'at room temperature',
 'about 4 inches',
 'as Di

Get all the singular common nouns detected by NLTK from ingredients

In [225]:
nn_tokens = sort_unique_list(get_values_from_dict_list(all_ingre_tags, 'NN')[0])

nn_tokens

['%',
 '4.4-pound',
 'Caramel',
 'Class',
 'Italian',
 'Moist',
 'Oil',
 'SHAKE-N-BAKE',
 'TOUCH',
 'Yazzo',
 'acacia',
 'achiote',
 'acid',
 'acini',
 'adobo',
 'advieh',
 'agave',
 'ahi',
 'aisle',
 'alcohol',
 'ale',
 'allspice',
 'almond',
 'aluminum',
 'amani',
 'amaretto',
 'amarillo',
 'amber',
 'ammonia',
 'amount',
 'ancho',
 'anchovy',
 'angel',
 'anise',
 'annato',
 'annatto',
 'aperitif',
 'apple',
 'applesauce',
 'apricot',
 'arbol',
 'arborio',
 'arrachera',
 'arrowroot',
 'artichoke',
 'arugula',
 'asadero',
 'asafoetida',
 'asparagus',
 'au',
 'avocado',
 'avocados',
 'baby',
 'bacon',
 'bag',
 'baguette',
 'baking',
 'ball',
 'balsamic',
 'bamboo',
 'banana',
 'bananas',
 'bangus',
 'bar',
 'barbecue',
 'barbeque',
 'barley',
 'base',
 'basil',
 'basmati',
 'bass',
 'batter',
 'bay',
 'bean',
 'beaten',
 'bechamel',
 'bee4',
 'beech',
 'beef',
 'beer',
 'beeswax',
 'beet',
 'bell',
 'bella',
 'bellas',
 'beluga',
 'berry',
 'besan',
 'beverage',
 'bhaji',
 'bias',
 'bi

Get all the ingredients with the common nouns

In [226]:
ingres_nn_tokens = [s for s in p_ingredients if any(xs in s for xs in nn_tokens)]

ingres_nn_tokens

['4 tablespoon soy sauce, or more to taste',
 '4 large onion, cut into rings',
 '4.4 teaspoon onion salt',
 '4 skinless, bone-in chicken breast halves - cut in half',
 '4.4 red onion, finely diced',
 '4 zucchini, halved lengthwise',
 '4 large apples - peeled, cored, and sliced',
 '4 kaffir lime leaves, thinly sliced',
 '4 fresh jalapeno peppers - seeded, sliced, and divided',
 '4 lime, juiced',
 '4.4 cup ice, or as needed',
 '4 cup chunky peanut butter',
 '4 tablespoons chopped green onion',
 '4 pounds bone-in chicken pieces',
 '4.4 teaspoons black pepper',
 '4 gallon lard for frying (manteca)',
 '4 red pepper, seeded and thinly sliced',
 '4 tablespoons thinly sliced lemongrass',
 '4.4 pounds boneless pork chops',
 '4 pound broccoli rabe, cut into 4 4.4-inch lengths',
 '4 pounds pork tenderloin',
 '4 pounds pork shoulder',
 '4 teaspoons mirin (Japanese sweet wine)',
 '4 whole tomatoes',
 '4.4 cup vanilla sugar, or as needed',
 '4 teaspoon ketchup, or to taste',
 '4 red bell pepper, cut

Filter for those without numbers at the beginning and make sure that each has at least 3 words

In [227]:
ingres_nn_tokens = [s for s in ingres_nn_tokens if not any(xs in s for xs in ["4", "4.4"]) and len(s.split()) > 2]

ingres_nn_tokens

['sweetened flaked coconut for decorating',
 'cayenne pepper, to taste',
 'black pepper to taste',
 'Hog casing, rinsed well',
 'plain bread crumbs',
 'vegetable oil as needed',
 'chopped roasted peanuts',
 'freshly ground pepper, to taste',
 'oil for frying',
 'ground white pepper to taste',
 'salt and freshly ground black pepper',
 'Salt and black pepper to taste',
 'cold water, as needed',
 'coarse sea salt to taste',
 'sour cream for garnish',
 'sweet Thai basil',
 'salt and ground black pepper, to taste',
 'onion salt to taste',
 'lemon pepper to taste',
 'sliced French bread',
 'juice of one lemon',
 'fresh cilantro sprigs, for garnish',
 'chopped peanuts, or to taste',
 'chili powder to taste',
 'Vegetable oil, for deep-frying',
 'Canola oil, for frying',
 'nonstick cooking spray',
 'canola oil for frying',
 'spicy cilantro chutney',
 'fresh-ground black pepper',
 'romaine leaves, rinsed and dried',
 'olive oil for frying, as needed',
 'fresh ground black pepper to taste',
 'plu

Get all the proper nouns from ingredients

In [228]:
nnp_tokens = sort_unique_list(get_values_from_dict_list(all_ingre_tags, 'NNP')[0])

nnp_tokens

["Ac'cent",
 'Accent',
 'Adobo',
 'Agave',
 'Aji-No-Moto',
 'Ajinomoto',
 'Alcaparrado',
 'Aleppo',
 'Alfredo',
 'All-Purpose',
 'Aloha™',
 'Aluminum',
 'Anaheim',
 'Ancho',
 'Angel',
 'Angeli',
 'Angostura',
 'Annatto',
 'Arborio',
 'Archer',
 'Arthur',
 'Asafoetida',
 'Asiago',
 'Asian',
 'Authentic',
 'Azafran',
 'B',
 'BC',
 'BEN',
 'BOCA',
 'Bacardi',
 'Badia',
 'Baileys',
 'Baker',
 'Balance',
 'Barbeque',
 'Barilla',
 'Barolo',
 'Base',
 'Basics',
 'Basil',
 'Basmati',
 'Bavarian-style',
 'Bay',
 'Bay™',
 'Beaujolais',
 'Beef',
 'Ben',
 'Bengal',
 'Betty',
 'Beyond',
 'Bing',
 'Bisquick',
 'Black',
 'Blanc',
 'Blend',
 'Blue',
 'Bob',
 'Bold',
 'Bosc',
 'Boston',
 'Bouillon',
 'Bouquet',
 'Bragg',
 'Brand',
 'Branzino',
 'Bread',
 'Brie',
 'Broth',
 'Brown',
 'Brussels',
 'Buffalo',
 'Buitoni',
 "Bull's-Eye",
 'Buns',
 'Burgundy',
 'Butter',
 'Buttercream',
 'C',
 'Cabernet',
 'Cabot',
 'Cajun',
 'California',
 'Calimyrna',
 'Campari',
 'Campbell',
 'Canilla',
 'Canola',
 'Canto

Get all the ingredients with the proper noun

In [229]:
ingres_nnp_tokens = [s for s in p_ingredients if any(xs in s for xs in nnp_tokens)]

ingres_nnp_tokens

['4.4 cup vanilla sugar, or as needed',
 '4 tablespoons chopped raw cashews (Optional)',
 '4.4 cup Marsala wine',
 '4 teaspoons water (Optional)',
 '4.4 cup grated Parmesan cheese',
 '4.4 cup Goya Tomato Sauce',
 '4 slices Swiss cheese',
 '4 cups Muscatel wine, or orange Muscat',
 '4 tablespoon chopped fresh mint leaves (Optional)',
 '4 tablespoons Irish cream liqueur (such as Baileys), or more to taste',
 '4 dash Sriracha hot sauce, or more to taste',
 '4.4 cups white sugar',
 '4 ounces sangrita (Mexican-style bloody mary mix with orange and lime)',
 '4 cup light brown sugar',
 '4 cups stale Italian bread, crumbled',
 '4 tablespoons warm milk (4 degrees F or 4 degrees C)',
 '4 tablespoons freshly grated Parmesan cheese',
 '4 cup guacamole (Optional)',
 '4.4 cup shredded Chihuahua cheese',
 '4 cup pico de gallo (Optional)',
 '4.4 cup rock sugar candy',
 "4.4 cup confectioners' sugar",
 '4 avocado - peeled, pitted and diced (Optional)',
 '4.4 ounce) tub vegan margarine (such as Smart Ba

Filter for those without numbers at the beginning and make sure that each has at least 3 words

In [230]:
ingres_nnp_tokens = [s for s in ingres_nnp_tokens if not any(xs in s for xs in ["4", "4.4"]) and len(s.split()) > 2]

ingres_nnp_tokens

['Hog casing, rinsed well',
 'Salt and black pepper to taste',
 'sweet Thai basil',
 'sliced French bread',
 'Vegetable oil, for deep-frying',
 'Canola oil, for frying',
 'Salt and freshly ground black pepper',
 'Parsley or cilantro for garnish',
 'Lime wedges for serving',
 'white sugar for decoration',
 'Goya Ground Black Pepper, to taste',
 'Salt and pepper, to taste',
 'Salt and freshly ground pepper to taste',
 'Hot cooked regular long-grain white rice',
 'Kosher salt and fresh cracked pepper to taste',
 'grated zest of one orange',
 'Salt and ground black pepper to taste',
 'grated Parmesan cheese',
 'Water to cover',
 'cooking spray (such as Misto)',
 'Chopped Italian parsley',
 'Curry powder to taste',
 'Freshly grated lemon zest',
 'Vegetable oil for deep-frying',
 'German stone ground mustard, to taste',
 "confectioners' sugar for dusting",
 'Salt and pepper to taste',
 'superfine sugar as needed',
 'freshly grated Parmesan cheese',
 'lemon, zested and juiced',
 'pearl sugar,

Define noun phrase rule

In [231]:
NP_REGEX = r"""
  ADJP: {<RB>?<JJ|JJR|JJS|RBR|RBS>}    # Adjectives may have comparative and superlative, and come after adverbs like very
  NP: {<DT|WDT|WP$>?<CD>?<AdjP>*<NN|NNS|NNP|NNPS><POS>*<NN|NNS|NNP|NNPS|PP|CD>*<VBG>?}    # Determiner, number and adjectives come before nouns and nouns may have possessive -s and followed by another noun
  NP: {<NP><,>*<NP>*<,>*<NP>*<CC>?<NP>}    # Multiple nouns can come in comma and 'and'
"""

chunk("salt and pepper", NP_REGEX, "NP")
# pos_tag("salt and pepper")

['salt and pepper']

These are the results of the chunking. Some chunks are used more than once

In [232]:
noun_phrases = []

for name in ingres_nn_tokens:
    noun_phrases = noun_phrases + (chunk(name, NP_REGEX, "NP"))
    
noun_phrases

['plain bread crumbs',
 'coarse sea salt',
 'salt and ground',
 'spicy cilantro chutney',
 'margarita or kosher salt',
 'salt and ground',
 'Parsley or cilantro',
 'Goya Ground Black Pepper',
 'salt and ground',
 'salt and ground',
 'Salt and pepper',
 'paper candy cups',
 'Salt and ground',
 'salt and ground pepper',
 'salt and ground',
 'Chopped Italian parsley',
 'kosher salt and ground',
 'sea salt and ground',
 'oil cooking spray',
 "confectioners ' sugar",
 'Salt and pepper',
 'cheesecloth and kitchen string',
 'salt and pepper',
 'tomato and clam juice cocktail',
 'clam juice cocktail',
 'salt and pepper',
 'oil cooking spray',
 'Goya Hot Sauce',
 'Reynolds Wrap Heavy Duty Aluminum Foil',
 'coarse kosher salt',
 'salt and pepper',
 "confectioners ' sugar",
 'cream or half-and-half',
 'mustard or Kikkoman Sweet',
 'Goya Corn Oil',
 'buttons ,/, oyster ,/, shiitake',
 'portobello and crimini']

Fix typo

In [233]:
noun_phrases = ["buttons, oyster, shitake" if x=="buttons ,/, oyster ,/, shiitake" else x for x in noun_phrases]

noun_phrases

['plain bread crumbs',
 'coarse sea salt',
 'salt and ground',
 'spicy cilantro chutney',
 'margarita or kosher salt',
 'salt and ground',
 'Parsley or cilantro',
 'Goya Ground Black Pepper',
 'salt and ground',
 'salt and ground',
 'Salt and pepper',
 'paper candy cups',
 'Salt and ground',
 'salt and ground pepper',
 'salt and ground',
 'Chopped Italian parsley',
 'kosher salt and ground',
 'sea salt and ground',
 'oil cooking spray',
 "confectioners ' sugar",
 'Salt and pepper',
 'cheesecloth and kitchen string',
 'salt and pepper',
 'tomato and clam juice cocktail',
 'clam juice cocktail',
 'salt and pepper',
 'oil cooking spray',
 'Goya Hot Sauce',
 'Reynolds Wrap Heavy Duty Aluminum Foil',
 'coarse kosher salt',
 'salt and pepper',
 "confectioners ' sugar",
 'cream or half-and-half',
 'mustard or Kikkoman Sweet',
 'Goya Corn Oil',
 'buttons, oyster, shitake',
 'portobello and crimini']

In [234]:
for name in ingres_nnp_tokens:
    noun_phrases = noun_phrases + (chunk(name, NP_REGEX, "NP"))
    
noun_phrases

['plain bread crumbs',
 'coarse sea salt',
 'salt and ground',
 'spicy cilantro chutney',
 'margarita or kosher salt',
 'salt and ground',
 'Parsley or cilantro',
 'Goya Ground Black Pepper',
 'salt and ground',
 'salt and ground',
 'Salt and pepper',
 'paper candy cups',
 'Salt and ground',
 'salt and ground pepper',
 'salt and ground',
 'Chopped Italian parsley',
 'kosher salt and ground',
 'sea salt and ground',
 'oil cooking spray',
 "confectioners ' sugar",
 'Salt and pepper',
 'cheesecloth and kitchen string',
 'salt and pepper',
 'tomato and clam juice cocktail',
 'clam juice cocktail',
 'salt and pepper',
 'oil cooking spray',
 'Goya Hot Sauce',
 'Reynolds Wrap Heavy Duty Aluminum Foil',
 'coarse kosher salt',
 'salt and pepper',
 "confectioners ' sugar",
 'cream or half-and-half',
 'mustard or Kikkoman Sweet',
 'Goya Corn Oil',
 'buttons, oyster, shitake',
 'portobello and crimini',
 'Parsley or cilantro',
 'Goya Ground Black Pepper',
 'Salt and pepper',
 'Salt and ground',
 '

In [235]:
noun_phrases = sort_unique_list(noun_phrases)

noun_phrases

['Chopped Italian parsley',
 'Goya Corn Oil',
 'Goya Ground Black Pepper',
 'Goya Hot Sauce',
 'Parsley or cilantro',
 'Reynolds Wrap Heavy Duty Aluminum Foil',
 'Salt and ground',
 'Salt and pepper',
 'buttons, oyster, shitake',
 'cheesecloth and kitchen string',
 'clam juice cocktail',
 'coarse kosher salt',
 'coarse sea salt',
 "confectioners ' sugar",
 'cream or half-and-half',
 'kosher salt and ground',
 'margarita or kosher salt',
 'mustard or Kikkoman Sweet',
 'oil cooking spray',
 'paper candy cups',
 'plain bread crumbs',
 'portobello and crimini',
 'salt and ground',
 'salt and ground pepper',
 'salt and pepper',
 'sea salt and ground',
 'spicy cilantro chutney',
 'tomato and clam juice cocktail']

Fix typos

In [236]:
prepositional_phrases = sort_unique_list(prepositional_phrases + prepositional_phrases2)

prepositional_phrases = ["as Bull's-Eye Texas-Style Bold Barbeque Sauce" if x=="(PP\n  as\n  Bull's-Eye Texas-Style Bold Barbeque Sauce" else x for x in prepositional_phrases]
prepositional_phrases = ["as Grill Mates Montreal Chicken Seasoning" if x=="(PP\n  as\n  Grill Mates Montreal Chicken Seasoning" else x for x in prepositional_phrases]

prepositional_phrases

["as Bull's-Eye Texas-Style Bold Barbeque Sauce",
 'as Grill Mates Montreal Chicken Seasoning',
 'Of This World Spaghetti',
 'Under a Brick',
 'about 4 inches',
 'about 4 inches thick',
 'across the grain',
 'against the grain',
 'ancho chile powder',
 'arroz con Pollo',
 'as Aloha™ Shoyu',
 'as Archer Farms',
 'as Bacardi Coconut™',
 'as Badia Complete Seasoning',
 'as Badia Tropical',
 "as Baker 's Angel Flake",
 "as Baker 's German",
 'as Baker Fine Dessert Filling',
 'as Barilla Napoletana',
 'as Betty Crocker',
 'as Beyond Meat',
 'as Beyond Meat Beyond Beef',
 "as Bob 's Red Mill",
 'as Bob Evans',
 'as Cabernet Sauvignon',
 'as Cabot Seriously Sharp',
 "as Campbell 's",
 "as Campbell 's Healthy Request",
 "as Cavender 's",
 'as Chantaboon Rice Noodles',
 'as Chocolate Ibarra',
 'as Classico Cabernet Marinara',
 'as Coco Lopez',
 'as Cool Whip',
 'as Country Crock',
 'as De Cecco',
 'as Diamond Crystal',
 'as Diet Sprite',
 'as Duncan Hines',
 'as El Paso',
 'as El Pato',
 'as FA

Save all the phrases as a txt file

In [238]:
all_phrases = sort_unique_list(prepositional_phrases + noun_phrases)

with open('/content/drive/MyDrive/NLP Assignment/all_phrases.txt', 'w') as filehandle:
    for listitem in all_phrases:
        filehandle.write('%s\n' % listitem)

## Data merging and creating bigram

In [239]:
all_recipe_names2[:10]

['',
 'easy German apple streusel pie',
 'traditional Turkish dumplings',
 'tofu hiyayakko',
 "A Scotsman's Shepherd Pie",
 'tamale pie',
 'sopapilla cheesecake dessert',
 'Portuguese chourico and peppers',
 'easy bruschetta',
 'Russian Pierogi with potatoes and mushrooms']

In [240]:
p_ingredients[:10]

['4 tablespoon soy sauce, or more to taste',
 '4 large onion, cut into rings',
 '4.4 teaspoon onion salt',
 '4 skinless, bone-in chicken breast halves - cut in half',
 '4.4 red onion, finely diced',
 '4 zucchini, halved lengthwise',
 '4 large apples - peeled, cored, and sliced',
 '4 kaffir lime leaves, thinly sliced',
 '4 fresh jalapeno peppers - seeded, sliced, and divided',
 '4 lime, juiced']

In [241]:
ingres_and_names = all_recipe_names2 + p_ingredients

In [242]:
tok = nltk.TweetTokenizer()
tok.tokenize("Alicia's aloo gobi")

["Alicia's", 'aloo', 'gobi']

Generate bigram from each entry, rather than directly as a whole chunk of text, since they were not joined originally in the source

In [243]:
from nltk import ngrams
from nltk import TweetTokenizer
from collections import Counter

def generate_bigram_from_entry(entry):
  tokenizer = nltk.TweetTokenizer()
  bigrams = nltk.bigrams(tokenizer.tokenize(entry))
  frequence = nltk.FreqDist(bigrams)
  return dict(sorted(frequence.items(), key = lambda item: item[0]))

generate_bigram_from_entry("4 tablespoons grated orange peel")


{('4', 'tablespoons'): 1,
 ('grated', 'orange'): 1,
 ('orange', 'peel'): 1,
 ('tablespoons', 'grated'): 1}

Combine individual bigram

In [244]:
from collections import Counter

all_bigrams = {}

for name in ingres_and_names:
  try:
    all_bigrams = dict(Counter(all_bigrams)+Counter(generate_bigram_from_entry(name)))
  except Exception as e:
    pass

all_bigrams

{('German', 'apple'): 7,
 ('apple', 'streusel'): 1,
 ('easy', 'German'): 3,
 ('streusel', 'pie'): 1,
 ('Turkish', 'dumplings'): 1,
 ('traditional', 'Turkish'): 1,
 ('tofu', 'hiyayakko'): 1,
 ('A', "Scotsman's"): 1,
 ("Scotsman's", 'Shepherd'): 1,
 ('Shepherd', 'Pie'): 1,
 ('tamale', 'pie'): 6,
 ('cheesecake', 'dessert'): 1,
 ('sopapilla', 'cheesecake'): 2,
 ('Portuguese', 'chourico'): 2,
 ('and', 'peppers'): 5,
 ('chourico', 'and'): 1,
 ('easy', 'bruschetta'): 1,
 ('Pierogi', 'with'): 1,
 ('Russian', 'Pierogi'): 1,
 ('and', 'mushrooms'): 5,
 ('potatoes', 'and'): 4,
 ('with', 'potatoes'): 7,
 ('Garlic', 'Wings'): 1,
 ('Honey', 'Garlic'): 1,
 ("Krista's", 'Sticky'): 1,
 ('Sticky', 'Honey'): 1,
 ('Indian', 'Samosas'): 1,
 ('baked', 'Indian'): 1,
 ('easy', 'baked'): 4,
 ('.', "Patrick's"): 1,
 ('Day', 'Cake'): 1,
 ("Patrick's", 'Day'): 1,
 ('Perfect', 'St'): 1,
 ('St', '.'): 1,
 ('Italian', 'limoncello'): 1,
 ('traditional', 'Italian'): 2,
 ('rabbit', 'stew'): 2,
 ("Cara's", 'Creamy'): 1,


Defining function to generate unigram from entry

In [245]:
def generate_unigram_from_entry(entry):
  tokenizer = nltk.TweetTokenizer()
  unigrams = nltk.ngrams(tokenizer.tokenize(entry),1)
  frequence = nltk.FreqDist(unigrams)
  return dict(sorted(frequence.items(), key = lambda item: item[0]))

generate_unigram_from_entry("4 tablespoons grated orange peel")

{('4',): 1, ('grated',): 1, ('orange',): 1, ('peel',): 1, ('tablespoons',): 1}

Combining unigram generated for each entry into dictionary

In [246]:
from collections import Counter

all_unigrams = {}

for name in ingres_and_names:
    all_unigrams = dict(Counter(all_unigrams)+Counter(generate_unigram_from_entry(name)))
    
all_unigrams

{('German',): 83,
 ('apple',): 100,
 ('easy',): 166,
 ('pie',): 115,
 ('streusel',): 2,
 ('Turkish',): 16,
 ('dumplings',): 21,
 ('traditional',): 36,
 ('hiyayakko',): 1,
 ('tofu',): 75,
 ('A',): 4,
 ('Pie',): 10,
 ("Scotsman's",): 1,
 ('Shepherd',): 1,
 ('tamale',): 9,
 ('cheesecake',): 14,
 ('dessert',): 8,
 ('sopapilla',): 3,
 ('Portuguese',): 29,
 ('and',): 2058,
 ('chourico',): 2,
 ('peppers',): 400,
 ('bruschetta',): 7,
 ('Pierogi',): 12,
 ('Russian',): 30,
 ('mushrooms',): 236,
 ('potatoes',): 381,
 ('with',): 505,
 ('Garlic',): 8,
 ('Honey',): 4,
 ("Krista's",): 1,
 ('Sticky',): 1,
 ('Wings',): 1,
 ('Indian',): 68,
 ('Samosas',): 3,
 ('baked',): 81,
 ('.',): 87,
 ('Cake',): 6,
 ('Day',): 1,
 ("Patrick's",): 1,
 ('Perfect',): 1,
 ('St',): 1,
 ('Italian',): 323,
 ('limoncello',): 5,
 ('rabbit',): 5,
 ('stew',): 107,
 ("Cara's",): 1,
 ('Creamy',): 1,
 ('Shells',): 1,
 ('Stuffed',): 5,
 ('Dominican',): 4,
 ('chicken',): 1453,
 ('chunks',): 156,
 ('fried',): 94,
 ('enchiladas',): 69

Calculating bigram probabilities for the bigram language model

In [247]:
bigram_prob = {}

x=0
for e in all_bigrams:
  w1 = list(all_bigrams.keys())[x][0]
  w2 = list(all_bigrams.keys())[x][1]
  bigram_freq = list(all_bigrams.values())[x]
  unigram_freq = all_unigrams[(w1, )]
  # Round to 2 decimal places
  bigram_prob[e] = round(bigram_freq/unigram_freq, 2)
  x += 1

bigram_prob



{('German', 'apple'): 0.08,
 ('apple', 'streusel'): 0.01,
 ('easy', 'German'): 0.02,
 ('streusel', 'pie'): 0.5,
 ('Turkish', 'dumplings'): 0.06,
 ('traditional', 'Turkish'): 0.03,
 ('tofu', 'hiyayakko'): 0.01,
 ('A', "Scotsman's"): 0.25,
 ("Scotsman's", 'Shepherd'): 1.0,
 ('Shepherd', 'Pie'): 1.0,
 ('tamale', 'pie'): 0.67,
 ('cheesecake', 'dessert'): 0.07,
 ('sopapilla', 'cheesecake'): 0.67,
 ('Portuguese', 'chourico'): 0.07,
 ('and', 'peppers'): 0.0,
 ('chourico', 'and'): 0.5,
 ('easy', 'bruschetta'): 0.01,
 ('Pierogi', 'with'): 0.08,
 ('Russian', 'Pierogi'): 0.03,
 ('and', 'mushrooms'): 0.0,
 ('potatoes', 'and'): 0.01,
 ('with', 'potatoes'): 0.01,
 ('Garlic', 'Wings'): 0.12,
 ('Honey', 'Garlic'): 0.25,
 ("Krista's", 'Sticky'): 1.0,
 ('Sticky', 'Honey'): 1.0,
 ('Indian', 'Samosas'): 0.01,
 ('baked', 'Indian'): 0.01,
 ('easy', 'baked'): 0.02,
 ('.', "Patrick's"): 0.01,
 ('Day', 'Cake'): 1.0,
 ("Patrick's", 'Day'): 1.0,
 ('Perfect', 'St'): 1.0,
 ('St', '.'): 1.0,
 ('Italian', 'limoncell

Group bigrams with same first word together into a dictionary

In [248]:
def find_dict_tuple_key(search):
    entry = {
        "token": search,
        "bigrams": []
    }
    bigrams = {x: bigram_prob[x] for x in bigram_prob.keys() if x[0] == search}
    for key, value in bigrams.items():
        newDict = {key[1]: value}
        entry["bigrams"].append(newDict)
    return entry

find_dict_tuple_key('spicy')

{'bigrams': [{'rice': 0.02},
  {'pork': 0.05},
  {'eggplant': 0.03},
  {'mango': 0.01},
  {'yellowtail': 0.01},
  {'beef': 0.03},
  {'Chinese': 0.02},
  {'calabrian': 0.01},
  {'Vietnamese': 0.02},
  {'thai': 0.04},
  {'Indian': 0.04},
  {'orange': 0.04},
  {'tuna': 0.02},
  {'green': 0.01},
  {'banana': 0.01},
  {'Asian-Style': 0.01},
  {'Korean': 0.03},
  {'African': 0.01},
  {'cabbage': 0.01},
  {'Peruvian': 0.01},
  {'shrimp': 0.03},
  {'vegan': 0.01},
  {'Asian': 0.01},
  {'Italian': 0.04},
  {'Sinterklass': 0.01},
  {'yogurt': 0.01},
  {'pesto': 0.01},
  {'chicken': 0.04},
  {'stir-fry': 0.01},
  {'feta': 0.01},
  {'fried': 0.01},
  {'tomato': 0.01},
  {'penyet': 0.01},
  {'peach': 0.01},
  {'salmon': 0.01},
  {'noodles': 0.01},
  {'red': 0.02},
  {'crispy': 0.01},
  {'Mexican-American': 0.01},
  {'avocado': 0.01},
  {'himalayan': 0.01},
  {'bok': 0.01},
  {'Southwest': 0.01},
  {'dipping': 0.01},
  {'szechuan': 0.01},
  {'marinated': 0.01},
  {'and': 0.01},
  {'sushi': 0.01},
  

In [249]:
all_tokens = recipe_tokens + ingre_tokens
all_tokens = list(set(all_tokens))

len(all_tokens)

5511

In [250]:
bigram_in_list = []
for value in all_tokens:
    bigram_in_list.append(find_dict_tuple_key(value))
    
bigram_in_list

[{'bigrams': [{'Pizzaiola': 0.33}, {'Seasoning': 0.33}], 'token': 'Steak'},
 {'bigrams': [], 'token': 'drink'},
 {'bigrams': [{',': 0.48}, {'in': 0.05}], 'token': 'chilies'},
 {'bigrams': [{'Jammin': 1.0}], 'token': 'Be'},
 {'bigrams': [{'tofu': 0.01},
   {'kielbasa': 0.01},
   {'ramen': 0.08},
   {'chicken': 0.03},
   {'active': 0.04},
   {'sliced': 0.02},
   {'cream': 0.06},
   {'wide': 0.01},
   {'ladyfingers': 0.01},
   {'frozen': 0.09},
   {'pastry': 0.01},
   {'buckwheat': 0.01},
   {'cheese': 0.02},
   {'corn': 0.04},
   {'crispy': 0.01},
   {'firm': 0.02},
   {'sazon': 0.03},
   {'phyllo': 0.01},
   {'fresh': 0.03},
   {'egg': 0.01},
   {'fat-free': 0.01},
   {'fudge': 0.01},
   {'dried': 0.04},
   {'thin': 0.01},
   {'brown': 0.01},
   {'potato': 0.01},
   {'wonton': 0.01},
   {'rice': 0.01},
   {'dry': 0.06},
   {'raspberry-flavored': 0.01},
   {'beef': 0.01},
   {'garlic': 0.01},
   {'Oriental': 0.01},
   {'refrigerated': 0.04},
   {'unflavored': 0.01},
   {'shredded': 0.01}

## Add Phonetics

In [251]:
!pip install eng-to-ipa 

Collecting eng-to-ipa
  Downloading eng_to_ipa-0.0.2.tar.gz (2.8 MB)
[K     |████████████████████████████████| 2.8 MB 29.1 MB/s 
[?25hBuilding wheels for collected packages: eng-to-ipa
  Building wheel for eng-to-ipa (setup.py) ... [?25l[?25hdone
  Created wheel for eng-to-ipa: filename=eng_to_ipa-0.0.2-py3-none-any.whl size=2822640 sha256=f54c4e898f05a49a327d931d856aca4234b38ee55974bd412accb50b9c1811d4
  Stored in directory: /root/.cache/pip/wheels/96/c0/dd/aeddfbebc2c3301c3dd09670d9954b0574ac4cd982664c1110
Successfully built eng-to-ipa
Installing collected packages: eng-to-ipa
Successfully installed eng-to-ipa-0.0.2


In [252]:
import eng_to_ipa as eng_to_ipa

eng_to_ipa.convert("hey!")

'heɪ!'

In some cases, when ipa conversion fails, the original spelling is used instead. So, for file efficiency, only those that are converted successfully should be kept.

In [253]:
eng_to_ipa.convert("bite-sized")

'bite-sized*'

In [259]:
unigrams_ipa = []
for i, unigram in enumerate(all_tokens):
  entry = {"token": all_tokens[i]}
  try:
      ipa =  eng_to_ipa.convert(eng_to_ipa.convert(unigram))
      if unigram not in ipa:
          entry["ipa"] = ipa
  except Exception as e:
      pass
  unigrams_ipa.append(entry)

unigrams_ipa

[{'ipa': 'steɪk*', 'token': 'Steak'},
 {'ipa': 'drɪŋk*', 'token': 'drink'},
 {'ipa': 'ˈʧɪˈʧɪliz*', 'token': 'chilies'},
 {'ipa': 'baɪ', 'token': 'Be'},
 {'ipa': 'ˈˈpækɪʤɪz*', 'token': 'packages'},
 {'token': 'ramen'},
 {'ipa': 'ˈˈsɪmpəl*', 'token': 'Simple'},
 {'ipa': 'ˈˈbɛli*', 'token': 'belly'},
 {'ipa': 'ˈˈkrɪstən*', 'token': 'Kristen'},
 {'ipa': 'ˈˈkɪʧən*', 'token': 'kitchen'},
 {'token': 'brewis'},
 {'ipa': 'ˈˈlɪkwɪd*', 'token': 'liquid'},
 {'ipa': 'ʃʃeɪvd*', 'token': 'shaved'},
 {'ipa': 'mangonada**', 'token': 'Mangonada'},
 {'token': 'prizewinning'},
 {'token': 'pignoli'},
 {'ipa': 'sɛt*', 'token': 'set'},
 {'ipa': 'haɪnz*', 'token': 'Hines'},
 {'token': 'kaiserschmarrn'},
 {'token': 'flourless'},
 {'token': 'oil-poached'},
 {'ipa': 'meɪoʊ*ʊ', 'token': 'mayo'},
 {'ipa': 'blæk*', 'token': 'Black'},
 {'ipa': 'ˈˈnəkəlz*', 'token': 'knuckles'},
 {'ipa': 'ˈˈpætɪd*', 'token': 'patted'},
 {'ipa': 'samosas**', 'token': 'Samosas'},
 {'ipa': 'kɔŋg*', 'token': 'Kong'},
 {'ipa': 'rɪbz*', 't

Save bigram list, which contain IPA symbol and bigrams of each token into a json file

In [260]:
import json

with open('bigrams.json', 'w') as f:
    json.dump(bigram_in_list, f)

In [261]:
import json

with open('unigrams.json', 'w') as f:
    json.dump(unigrams_ipa, f)

# Create edit distance

In [None]:
from nltk import edit_distance

#Defining a function to get similar words with edit_distance < 2 from the corpus
def get_similar_words(word): 

  similar_words = []
  corpus = all_tokens
  if word in corpus:
    return
  else:
    w2 = word
    similar_words.append([w1 for w1 in corpus if edit_distance(w1, w2) < 2])

  return(similar_words)

In [None]:
#Testing the function
get_similar_words('gren')

[['green', 'grey']]

In [None]:
#Defining a function to get similar words from phrases with edit_distance < 2 (returns a list)
def get_similar_words_phrase(phrase): 

  similar_words = []
  corpus = all_tokens
  phrase = phrase.split()
  for w1 in phrase:
    if w1 in corpus:
      pass
    else:
      similar_words.append([w2 for w2 in corpus if edit_distance(w1, w2) < 2])
    
  return(similar_words)

In [None]:
#Defining a function to get similar words from phrases with edit_distance < 2 (returns a dictionary) 
def get_similar_words_phrase(phrase): 

  dictionary = dict()
  corpus = all_tokens
  phrase = phrase.split()
  for w1 in phrase:
    if w1 in corpus:
      pass
    else:
      similar_words = [w2 for w2 in corpus if edit_distance(w1, w2) < 2]
      dictionary[w1] = similar_words
    
  return(dictionary)

In [None]:
phrase = ('Frenchi blue chese')

get_similar_words_phrase(phrase)

{'Frenchi': ['French'], 'chese': ['cheese', 'chee']}

#Bigrams

Getting bigrams from bigram list using frequency of bigrams


In [None]:
def get_bigrams(tppl):
  w = tppl[0] #getting the first word in a tuple
  #Checking whether the first word exist in the dictionary
  cont = False 
  for dic in bigram_in_list:
    if w in dic['token']:
      cont = True
  if cont is True:
    pass
  else:
    return
  #Extracting bigrams with the highest frequency from the dictionary for the first word in tuple
  for dic in bigram_in_list:
    if dic['token'] == w:
      bigrams = dic['bigrams']
  bigrams = sorted(bigrams, key=lambda x: list(x.values())[0], reverse = True)
  h_entry = [bigrams[0]]
  for e in bigrams:
    if list(e.keys()) != list(h_entry[0].keys()):
      if list(e.values())[0] == list(h_entry[0].values())[0]:
        h_entry.append(e)
  
  return(h_entry)


In [None]:
sent = 'french blue cheese'

tokenizer= TweetTokenizer()

bi = list(nltk.bigrams(tokenizer.tokenize(sent)))

bi

[('french', 'blue'), ('blue', 'cheese')]

In [None]:
get_bigrams(bi[1])

[{'cheese': 0.6666666666666666}]

Check to see whether the function really picks the bigram with the highest probability

In [None]:
x = 0
for entry in bigram_prob:
  w1 = list(bigram_prob.keys())[x][0]
  w2 = list(bigram_prob.keys())[x][1]
  if w1 == 'blue':
    print(list(bigram_prob.keys())[x],":", list(bigram_prob.values())[x])
  x += 1

('blue', 'cheese') : 0.6666666666666666
('blue', 'food') : 0.16666666666666666
('blue', 'crabs') : 0.16666666666666666


#Saving the bigram_in_list, all_bigrams and all_token

In [None]:
import pickle 

object = bigram_in_list
filehandler = open('/content/drive/MyDrive/NLP Assignment/bigram_in_list', 'wb')
pickle.dump(object, filehandler)


In [None]:
filehandler = open('/content/drive/MyDrive/NLP Assignment/bigram_in_list', 'rb')
bigram_in_list = pickle.load(filehandler)

In [None]:
object = all_bigrams
filehandler = open('/content/drive/MyDrive/NLP Assignment/all_bigrams', 'wb')
pickle.dump(object, filehandler)

In [None]:
filehandler = open('/content/drive/MyDrive/NLP Assignment/all_bigrams', 'rb')
all_bigrams = pickle.load(filehandler)

In [None]:
object = all_unigrams
filehandler = open('/content/drive/MyDrive/NLP Assignment/all_unigrams', 'wb')
pickle.dump(object, filehandler)

In [None]:
filehandler = open('/content/drive/MyDrive/NLP Assignment/all_unigrams', 'rb')
all_unigrams = pickle.load(filehandler)

In [None]:
object = all_tokens
filehandler = open('/content/drive/MyDrive/NLP Assignment/all_tokens', 'wb')
pickle.dump(object, filehandler)

In [None]:
filehandler = open('/content/drive/MyDrive/NLP Assignment/all_tokens', 'rb')
all_tokens = pickle.load(filehandler)