# SNLP project

## Ingredient substitution task 

**Objective:**

Produce substitutions for certain ingredients in a recipe, e.g. “vegan” and “pesto chicken pasta” might give us “pesto tofu pasta”.

## 1. Text preprocessing

In [1]:
import pandas as pd
import numpy as np
from typing import List
import gensim
from gensim.models import Word2Vec
import spacy
import nltk
from nltk.corpus import stopwords
from nltk import download
from nltk.stem import WordNetLemmatizer
import time 



In [96]:
download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [97]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [145]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [2]:
df = pd.read_csv("./data/cleaneddata.csv", index_col=0)

In [3]:
df.head()

Unnamed: 0,title,summary,instructions,ingredients,ingredient types,diets,vegetarian,vegan,glutenFree,dairyFree,veryHealthy,cheap,veryPopular,sustainable,healthScore,pricePerServing,readyInMinutes,servings
0,orange fig teacake with caramel glaze,orange fig teacake with caramel glaze is a veg...,you will need a 9 springform pan or a cake ...,ap flour; baking powder; cardamom; eggs; fresh...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,True,False,False,False,False,False,False,False,3.0,75.55,45,10
1,poached eggs on a bed of fried mushrooms and c...,poached eggs on a bed of fried mushrooms and c...,in a frying pan heat up oil then add mushroom...,bread; butter; eggs; eggs; mushrooms; oil; sal...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,True,False,False,False,False,False,False,False,15.0,147.7,45,2
2,pandan chiffon cake,for 26 cents per serving this recipe covers ...,preheat the oven to 170c blend the pandan le...,all purpose flour; bay leaves; coconut milk; c...,Ethnic Foods Produce Spices and Seasonings Bev...,dairy free; lacto ovo vegetarian,True,False,False,True,False,False,False,False,1.0,26.06,45,9
3,pork chop with honey mustard and apples,pork chop with honey mustard and apples might...,pre heat your oven to 200c 400f line a roa...,apples; dijon mustard; garlic cloves; honey; j...,Meat Spices and Seasonings Condiments Oil Vine...,gluten free; dairy free; paleolithic; primal,False,False,True,True,False,False,False,False,17.0,242.23,45,4
4,beet gnocchi with steak and brown butter sauce,the recipe beet gnocchi with steak and brown b...,cooking beets heat oven to 400 degrees wash be...,gnocchi; beets; olive oil; s p; goat cheese; r...,Produce Spices and Seasonings Meat Spices and ...,,False,False,False,False,False,False,False,False,12.0,417.69,45,4


#### 1.1 Drop columns we don't need and rename columns

In [100]:
df = df.drop(["healthScore", "pricePerServing", "readyInMinutes", "servings"], axis=1)


In [101]:
df = df.rename(columns={
    "glutenFree": "gluten-free", 
    "dairyFree": "dairy-free", 
    "veryHealthy":"very-healthy", 
    "veryPopular": "very-popular", 
})

#### 1.2 Replace the `True` classification labels with the class name

In [102]:
classes = ['vegetarian', 
           'vegan', 
           'gluten-free', 
           'dairy-free', 
           'very-healthy', 
           'cheap', 
           'very-popular', 
           'sustainable']

In [103]:
for c in classes:
    df[c] = df[c].replace(to_replace=[True, False], value=[c, np.nan])

In [104]:
df.head()

Unnamed: 0,title,summary,instructions,ingredients,ingredient types,diets,vegetarian,vegan,gluten-free,dairy-free,very-healthy,cheap,very-popular,sustainable
0,orange fig teacake with caramel glaze,orange fig teacake with caramel glaze is a veg...,you will need a 9 springform pan or a cake ...,ap flour; baking powder; cardamom; eggs; fresh...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,
1,poached eggs on a bed of fried mushrooms and c...,poached eggs on a bed of fried mushrooms and c...,in a frying pan heat up oil then add mushroom...,bread; butter; eggs; eggs; mushrooms; oil; sal...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,
2,pandan chiffon cake,for 26 cents per serving this recipe covers ...,preheat the oven to 170c blend the pandan le...,all purpose flour; bay leaves; coconut milk; c...,Ethnic Foods Produce Spices and Seasonings Bev...,dairy free; lacto ovo vegetarian,vegetarian,,,dairy-free,,,,
3,pork chop with honey mustard and apples,pork chop with honey mustard and apples might...,pre heat your oven to 200c 400f line a roa...,apples; dijon mustard; garlic cloves; honey; j...,Meat Spices and Seasonings Condiments Oil Vine...,gluten free; dairy free; paleolithic; primal,,,gluten-free,dairy-free,,,,
4,beet gnocchi with steak and brown butter sauce,the recipe beet gnocchi with steak and brown b...,cooking beets heat oven to 400 degrees wash be...,gnocchi; beets; olive oil; s p; goat cheese; r...,Produce Spices and Seasonings Meat Spices and ...,,,,,,,,,


#### 1.3 Creating two copies of the diets and ingredients columns

The ingredients and diets are going to be important for this task. To make our model more flexible, we will expand the vocabularies in these categories.

For diets, we want to include both `"gluten free"` and `"gluten-free"` in the corpus. For ingredients, we want to include both `"extra virgin olive oil"` and `"extra", "virgin", "olive", "oil"`. 
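The expansion described above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the helper names `expand_diets` and `expand_ingredient` are hypothetical, and the input format (labels separated by `"; "`) is assumed from the `diets` and `ingredients` columns shown earlier.

```python
from typing import List


def expand_diets(diets: str) -> List[str]:
    """Return each diet label in both spaced and hyphenated form."""
    expanded: List[str] = []
    for label in diets.split("; "):
        label = label.strip()
        if not label:
            continue
        for variant in (label, label.replace(" ", "-")):
            if variant not in expanded:  # single-word labels have only one form
                expanded.append(variant)
    return expanded


def expand_ingredient(ingredient: str) -> List[str]:
    """Return the full ingredient phrase followed by its individual words."""
    words = ingredient.split()
    # A one-word ingredient is already its own "split" form.
    return [ingredient] if len(words) <= 1 else [ingredient] + words


print(expand_diets("gluten free; dairy free; primal"))
print(expand_ingredient("extra virgin olive oil"))
```

With both forms in the corpus, a query can use either `"gluten free"` or `"gluten-free"`, and either the full ingredient phrase or one of its words.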

In [105]:
# df["ingredient split"] = df["ingredients"].str.replace("; ", " ")
df["diet split"] = df["diets"].str.replace(" ", "-").str.replace(";-", "; ")

In [106]:
df.head()

Unnamed: 0,title,summary,instructions,ingredients,ingredient types,diets,vegetarian,vegan,gluten-free,dairy-free,very-healthy,cheap,very-popular,sustainable,diet split
0,orange fig teacake with caramel glaze,orange fig teacake with caramel glaze is a veg...,you will need a 9 springform pan or a cake ...,ap flour; baking powder; cardamom; eggs; fresh...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,,lacto-ovo-vegetarian
1,poached eggs on a bed of fried mushrooms and c...,poached eggs on a bed of fried mushrooms and c...,in a frying pan heat up oil then add mushroom...,bread; butter; eggs; eggs; mushrooms; oil; sal...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,,lacto-ovo-vegetarian
2,pandan chiffon cake,for 26 cents per serving this recipe covers ...,preheat the oven to 170c blend the pandan le...,all purpose flour; bay leaves; coconut milk; c...,Ethnic Foods Produce Spices and Seasonings Bev...,dairy free; lacto ovo vegetarian,vegetarian,,,dairy-free,,,,,dairy-free; lacto-ovo-vegetarian
3,pork chop with honey mustard and apples,pork chop with honey mustard and apples might...,pre heat your oven to 200c 400f line a roa...,apples; dijon mustard; garlic cloves; honey; j...,Meat Spices and Seasonings Condiments Oil Vine...,gluten free; dairy free; paleolithic; primal,,,gluten-free,dairy-free,,,,,gluten-free; dairy-free; paleolithic; primal
4,beet gnocchi with steak and brown butter sauce,the recipe beet gnocchi with steak and brown b...,cooking beets heat oven to 400 degrees wash be...,gnocchi; beets; olive oil; s p; goat cheese; r...,Produce Spices and Seasonings Meat Spices and ...,,,,,,,,,,


In [107]:
ingredients = df["ingredients"].to_list()
# ingredients_split = df["ingredient split"].to_list()

all_ingredients = []
for i in ingredients: 
    items = i.split("; ")
    all_ingredients += items

# for i in ingredients_split: 
#     items = i.split()
#     all_ingredients += items

all_ingredients = list(set(all_ingredients))

In [108]:
len(all_ingredients)

1635

#### 1.4 Create sentences from recipes

In [154]:
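class Sentence(object):
    """Iterate over the DataFrame rows, yielding one token list per recipe."""
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.n_rows = df.shape[0]
        self.lemmatizer = WordNetLemmatizer()
        
    def __iter__(self):
        for n in range(self.n_rows):
            row = self.df.iloc[n]
            sentence : List[str] = []
            for i, cell in enumerate(row):
                if isinstance(cell, str):
                    # Keep ingredients and diets as whole phrases; split other columns into words.
                    # Look up the column name on self.df rather than the global df.
                    sent = cell.split("; ") if self.df.columns[i] in ["ingredients", "diets"] else cell.split()
                    # Lowercase before the stop-word check (the NLTK stop-word list is lowercase).
                    sent = [w.lower() for w in sent if w.lower() not in stop_words]
                    # Only the last token of each cell is lemmatized.
                    if len(sent) > 0: sent[-1] = self.lemmatizer.lemmatize(sent[-1])
                    sentence += sent
            yield sentence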
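To make the output of `Sentence` concrete, here is a simplified sketch of how a single row turns into one training "sentence". It is an illustration under assumptions: the row is a plain dictionary rather than a DataFrame row, `STOP_WORDS` is a small hypothetical stand-in for the NLTK list, and lemmatization is left out.

```python
from typing import Dict, List, Optional

# Hypothetical stand-ins: the notebook uses NLTK's English stop words and a DataFrame row.
STOP_WORDS = {"a", "the", "with", "and", "is", "to"}
PHRASE_COLUMNS = {"ingredients", "diets"}


def row_to_tokens(row: Dict[str, Optional[str]]) -> List[str]:
    """Turn one recipe row into a flat token list, keeping ingredients and diets as phrases."""
    tokens: List[str] = []
    for column, cell in row.items():
        if not isinstance(cell, str):
            continue  # skip missing values (NaN in the DataFrame)
        parts = cell.split("; ") if column in PHRASE_COLUMNS else cell.split()
        tokens += [p.lower() for p in parts if p.lower() not in STOP_WORDS]
    return tokens


row = {
    "title": "pesto chicken pasta",
    "ingredients": "extra virgin olive oil; chicken; pesto",
    "diets": "gluten free",
    "vegan": None,
}
print(row_to_tokens(row))
```

Because the phrase columns are split on `"; "` only, a multi-word ingredient such as `"extra virgin olive oil"` becomes a single vocabulary entry for Word2Vec.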

In [155]:
sentences = Sentence(df)

In [156]:
start = time.time()
model = Word2Vec(
    sentences, 
    min_count=1,
    vector_size=300, 
    workers=2,
    window=2,
    epochs=30, 
    sg=0
)
end = time.time()
model.save("word2vec.model")
print(f"Training completed in {end-start}s.")

Training completed in 83.4347779750824s.


In [163]:
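def get_ingredients(ingredients: List[str], 
                    original_recipe: str, 
                    positive: List[str], 
                    negative: List[str] = None, 
                    topn=20):
    """Return up to `topn` known ingredients closest to the recipe words plus the positive terms."""
    negative = negative or []  # avoid a mutable default argument
    known = set(ingredients)   # set membership is faster than scanning a list
    
    # Repeat the positive terms three times (and the negative terms twice)
    # so that they weigh more than the recipe words in the query.
    pos = original_recipe.split() + positive*3
    candidates = model.wv.most_similar(positive=pos, negative=negative*2, topn=200)
    
    # Keep only candidates that are known ingredients, in similarity order.
    substitutions = []
    for ingredient_name, _ in candidates:
        if len(substitutions) >= topn:
            break
        if ingredient_name in known:
            substitutions.append(ingredient_name)
    return substitutions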

In [164]:
get_ingredients(all_ingredients, 
                original_recipe="pesto chicken pasta", 
                positive=["vegan"], 
                negative=[], topn=20)

['feta',
 'mushroom',
 'bulgur',
 'gorgonzola',
 'arugula',
 'mayo',
 'orzo',
 'fresh mozzarella cheese',
 'chickpeas',
 'cornbread',
 'crepes',
 'honey',
 'turkey',
 'bread',
 'broccoli',
 'chipotle',
 'buckwheat',
 'cauliflower',
 'baguettes']
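The ranking behind `get_ingredients` can be illustrated without gensim. The sketch below combines unit vectors of the positive words (minus the negative ones), ranks the remaining words by cosine similarity, and keeps only known ingredients. It is a simplified approximation of what `most_similar` does, not gensim's code; `rank_substitutions` is a hypothetical helper and the 2-d vectors are invented for the example.

```python
import math
from typing import Dict, List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def rank_substitutions(vectors: Dict[str, List[float]],
                       ingredients: Sequence[str],
                       positive: Sequence[str],
                       negative: Sequence[str] = (),
                       topn: int = 3) -> List[str]:
    """Rank known ingredients by cosine similarity to the combined query vector."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    # Add unit vectors of positive words and subtract those of negative words.
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for k, x in enumerate(unit(vectors[word])):
                query[k] += sign * x
    query = unit(query)

    known = set(ingredients)
    used = set(positive) | set(negative)  # query words are never returned
    scored = [
        (sum(q * x for q, x in zip(query, unit(vec))), word)
        for word, vec in vectors.items()
        if word in known and word not in used
    ]
    return [word for _, word in sorted(scored, reverse=True)[:topn]]


# Toy 2-d vectors, invented for illustration only:
# axis 0 ~ "pasta dish", axis 1 ~ "plant-based".
toy_vectors = {
    "pasta": [1.0, 0.1],
    "chicken": [0.9, -0.8],
    "tofu": [0.7, 0.9],
    "vegan": [0.0, 1.0],
    "honey": [-0.8, -0.2],
}
print(rank_substitutions(toy_vectors, ["tofu", "chicken", "honey"],
                         positive=["pasta", "vegan"]))
```

In this toy space the plant-based ingredient ranks first. With the trained model, the quality of the ranking depends on how well the diet words and ingredient phrases co-occur in the corpus.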

## BERT (RoBERTa masked language model)

In [2]:
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast, RobertaConfig, RobertaForMaskedLM, pipeline, FillMaskPipeline
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated; use the PyTorch one
import os 
import torch
from tqdm.auto import tqdm
import glob

#### Create the text file for training

In [3]:
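# `df` is the DataFrame prepared in section 1. After a kernel restart, re-run those cells first,
# otherwise this cell raises a NameError.
df1 = df.apply(lambda x: " ".join([cell for cell in x if isinstance(cell, str)]), axis=1)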

recipes = "\n".join(df1.to_list())
with open("recipes.txt", 'w', encoding="utf-8") as f:
    f.write(recipes)

NameError: name 'df' is not defined

#### Train tokenizer

In [6]:
tokenizer = ByteLevelBPETokenizer()

In [7]:
tokenizer.train(files=["recipes.txt"], 
                vocab_size=30_522, 
                min_frequency=1, 
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])






In [8]:
tokenizer_path = "recipe_tokenizer"
# os.mkdir("recipe_tokenizer")
tokenizer.save_model(tokenizer_path)

['recipe_tokenizer/vocab.json', 'recipe_tokenizer/merges.txt']

In [10]:
tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_path)

In [11]:
tokenizer.save_pretrained(tokenizer_path)

('recipe_tokenizer/tokenizer_config.json',
 'recipe_tokenizer/special_tokens_map.json',
 'recipe_tokenizer/vocab.json',
 'recipe_tokenizer/merges.txt',
 'recipe_tokenizer/added_tokens.json',
 'recipe_tokenizer/tokenizer.json')

In [12]:
tokenizer("pesto chich pasta")

{'input_ids': [0, 8327, 14297, 974, 2], 'attention_mask': [1, 1, 1, 1, 1]}

#### Input pipeline

In [13]:
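def mlm(tensor):
    """Randomly replace about 15% of the non-special tokens with the mask token."""
    rand = torch.rand(tensor.shape)
    # Ids 0, 1 and 2 are <s>, <pad> and </s> in the tokenizer trained above; never mask them.
    mask_arr = (rand < 0.15) * (tensor > 2)
    for i in range(tensor.shape[0]):
        # positions selected for masking in row i
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4 # id of <mask>
    return tensor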
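In the cells that follow, `labels` holds the full original `input_ids`, so the loss is computed over every position, padding included. A common alternative is to set the label to `-100` wherever a token was not masked, because the cross-entropy loss used by Hugging Face models ignores that value. The sketch below shows the idea on plain Python lists. `mask_tokens` is a hypothetical helper, not part of the notebook, and the ids `0, 1, 2, 4` are the special-token ids assumed from the tokenizer trained above.

```python
import random
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(ids: List[int],
                mask_id: int = 4,
                special_ids: Tuple[int, ...] = (0, 1, 2),
                mask_prob: float = 0.15,
                rng: random.Random = None) -> Tuple[List[int], List[int]]:
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in ids:
        if token not in special_ids and rng.random() < mask_prob:
            masked.append(mask_id)        # hide the token from the model
            labels.append(token)          # and ask the model to predict it
        else:
            masked.append(token)          # leave the token visible
            labels.append(IGNORE_INDEX)   # and exclude it from the loss
    return masked, labels


# Illustrative ids: a short sequence wrapped in <s> ... </s>, followed by two <pad> tokens.
ids = [0, 8327, 14297, 974, 2, 1, 1]
masked, labels = mask_tokens(ids, rng=random.Random(0))
print(masked)
print(labels)
```

With labels built this way, only the masked positions contribute to the loss, so the model is not rewarded for copying visible tokens or predicting padding.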

In [14]:
input_ids = []
attn_mask = [] 
labels = []

with open("recipes.txt", 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
    
sample = tokenizer(lines, 
                   max_length=512, 
                   padding="max_length", 
                   truncation=True, 
                   return_tensors='pt')
labels.append(sample.input_ids)
attn_mask.append(sample.attention_mask)
input_ids.append(mlm(sample.input_ids.detach().clone()))





In [15]:
input_ids = torch.cat(input_ids)
attn_mask = torch.cat(attn_mask)
labels = torch.cat(labels)

In [16]:
encodings = {
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'labels': labels
}

In [17]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    
    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

In [18]:
dataset = Dataset(encodings)

In [19]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

In [20]:
config = RobertaConfig(
    vocab_size = tokenizer.vocab_size,
    max_position_embeddings=514, 
    hidden_size = 768, 
    num_attention_heads=12, 
    num_hidden_layers= 6, 
    type_vocab_size = 1
)

In [21]:
model = RobertaForMaskedLM(config)

In [22]:
device = torch.device("cuda"  if torch.cuda.is_available() else "cpu")

In [23]:
model.to(device)

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(17725, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [24]:
optim = AdamW(model.parameters(), lr=1e-4)



In [25]:
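# NOTE: `epochs` is only used as the progress-bar label; the loop below makes a single
# pass over the data. Wrap it in `for epoch in range(epochs):` to train for several epochs.
epochs = 2

model.train()  # make sure dropout is active during training
loop = tqdm(dataloader, leave=True)

for batch in loop:
    optim.zero_grad()
    
    # move the batch to the same device as the model
    input_ids = batch['input_ids'].to(device)
    mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    
    # forward pass: the model computes the masked-LM loss from the labels
    outputs = model(input_ids, attention_mask=mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optim.step()
    
    loop.set_description(f"Epoch: {epochs}")
    loop.set_postfix(loss=loss.item())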

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Epoch: 2: 100%|█████████████████| 270/270 [6:18:06<00:00, 84.02s/it, loss=0.999]


In [27]:
model.save_pretrained('./model')

#### Testing

In [28]:
fill = pipeline("fill-mask", 
                model="model", 
                tokenizer="recipe_tokenizer", 
                top_k=5)

In [38]:
fill(f"yummy {fill.tokenizer.mask_token} pasta")

[{'score': 0.10166112333536148,
  'token': 225,
  'token_str': ' ',
  'sequence': 'yummy  pasta'},
 {'score': 0.042107418179512024,
  'token': 274,
  'token_str': ' and',
  'sequence': 'yummy and pasta'},
 {'score': 0.02661365643143654,
  'token': 291,
  'token_str': ' to',
  'sequence': 'yummy to pasta'},
 {'score': 0.026019858196377754,
  'token': 278,
  'token_str': ' the',
  'sequence': 'yummy the pasta'},
 {'score': 0.025529174134135246,
  'token': 31,
  'token_str': ';',
  'sequence': 'yummy; pasta'}]

In [35]:
fill2 = FillMaskPipeline(model=model, 
                tokenizer=tokenizer, 
                top_k=5)

In [39]:
fill2(f"vegan {fill2.tokenizer.mask_token} pasta")

[{'score': 0.06521295756101608,
  'token': 225,
  'token_str': ' ',
  'sequence': 'vegan  pasta'},
 {'score': 0.05326294153928757,
  'token': 274,
  'token_str': ' and',
  'sequence': 'vegan and pasta'},
 {'score': 0.04045328125357628,
  'token': 31,
  'token_str': ';',
  'sequence': 'vegan; pasta'},
 {'score': 0.03585265949368477,
  'token': 291,
  'token_str': ' to',
  'sequence': 'vegan to pasta'},
 {'score': 0.02414020150899887,
  'token': 278,
  'token_str': ' the',
  'sequence': 'vegan the pasta'}]

## Using pre-trained embeddings

In [24]:
nlp = spacy.load('en_core_web_lg')

In [262]:
all_docs = []
for recipe in sentences:
    # for word in recipe:
    recipe = " ".join(recipe)
    all_docs.append(nlp(recipe))

In [28]:
nlp.vocab["pasta"].vector

array([-0.47621  , -0.12565  ,  0.57201  , -0.41622  ,  0.087897 ,
        0.94786  ,  0.17435  ,  0.38734  ,  0.076891 ,  0.54134  ,
       -0.83674  ,  0.89006  ,  0.16874  , -0.20026  ,  0.3535   ,
       -0.52456  , -0.48566  ,  1.4866   ,  0.12526  ,  0.18175  ,
        0.13386  , -0.6219   ,  0.14722  ,  0.25752  ,  0.01989  ,
       -0.26677  , -0.52299  , -0.26495  ,  0.26422  , -0.46492  ,
        0.52974  ,  0.17449  ,  0.23151  , -0.5197   ,  0.50632  ,
        0.38656  ,  0.30754  , -0.20152  ,  0.38965  ,  1.3301   ,
       -0.11162  ,  0.48363  ,  0.14996  , -0.25947  , -0.10856  ,
        0.87353  , -0.19608  ,  0.86464  , -0.10013  ,  0.049008 ,
       -0.36184  , -0.29676  ,  0.48032  , -0.073478 ,  0.02197  ,
       -0.51162  ,  0.19295  ,  0.27889  ,  0.095985 ,  0.4911   ,
        0.048671 , -0.20332  , -0.27704  , -0.39282  , -0.19507  ,
       -1.3053   ,  0.23504  ,  0.33168  , -0.10465  ,  0.77648  ,
       -0.21061  ,  0.77946  , -0.22824  ,  0.14612  , -0.8556

In [149]:
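def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    
    # Candidate lexemes: same casing as the query, reasonably frequent, with a non-zero vector.
    # NOTE: in spaCy v3 the probability table is an optional extra (`spacy-lookups-data`);
    # without it `w.prob` may fall back to a low default and fail the `w.prob >= -15` filter,
    # which would leave `queries` empty.
    queries = [
        w for w in word.vocab 
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    # Take one extra candidate so that dropping the query word itself can still leave `topn` results.
    return [(w.lower_,w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]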

most_similar("dog", topn=3)

[]

In [17]:
nlp.vocab["pesto"]

<spacy.lexeme.Lexeme at 0x7fed41b99ac0>

In [290]:
test = [s for s in sentences]

In [293]:
" ".join(test[0])

'roasted butternut squash bisque roasted butternut squash bisque could gluten free lacto ovo vegetarian recipe looking recipe serves 6 costs 5 1 per serving one serving contains 575 calories 6g protein 28g fat mixture coursely garlic salt bay leaves handful ingredients takes make recipe delicious recipe foodista 1 fans preparation plate recipe takes approximately approximately 45 minutes works well expensive soup taking factors account recipe earns spoonacular score 63 solid users liked recipe also liked roasted butternut squash bisque frangelico roasted butternut squash bisque sage cream butternut squash bisque garlic broth add ingredients stock bring boil reduce heat cover simmer 30 minutes strain cheesecloth use immediately soup freeze individual portions tip freezing broth ice cube trays allows use small portions time butternut squash soup cut squash half lengthwise scoop seeds quarter squash large place olive oil coated squash flesh side tray roasting roast top rack oven 425 degre

In [299]:
from nltk.tokenize import word_tokenize, sent_tokenize
txt = " ".join(test[0])

# sent_tokenize splits the text into sentences
# (it uses the Punkt sentence tokenizer from nltk.tokenize.punkt)
tokenized = sent_tokenize(txt)

# Iterate over the sentences, not over the characters of txt
for sentence in tokenized:

    # word_tokenize splits a sentence into words and punctuation
    wordsList = word_tokenize(sentence)

    # remove stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]

    # tag each remaining word with its part of speech (POS)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

