# SNLP project

## Ingredient recommendation task 

**Objective:**

Produce substitution suggestions for certain ingredients, e.g. “vegan” and “pesto pasta” might give us “tofu”.

In [1]:
import pandas as pd
import numpy as np
from typing import List
import gensim
from gensim.models import Word2Vec
import spacy
import nltk
from nltk.corpus import stopwords
from nltk import download
from nltk.stem import WordNetLemmatizer
import time 


In [2]:
df = pd.read_csv("../data/cleaneddata.csv", index_col=0)

In [3]:
# df.head()

## 1. Text preprocessing

### 1.1 Drop the columns we don't need and rename columns

In [4]:
df = df.drop(["healthScore", "pricePerServing", "readyInMinutes", "servings"], axis=1)


In [5]:
df = df.rename(columns={
    "glutenFree": "gluten-free", 
    "dairyFree": "dairy-free", 
    "veryHealthy":"very-healthy", 
    "veryPopular": "very-popular", 
})

### 1.2 Transform the True classification labels into class names

In [6]:
classes = ['vegetarian', 
           'vegan', 
           'gluten-free', 
           'dairy-free', 
           'very-healthy', 
           'cheap', 
           'very-popular', 
           'sustainable']

In [7]:
for c in classes:
    df[c] = df[c].replace(to_replace=[True, False], value=[c, np.nan])
# df = df.drop(["instructions", "summary"], axis=1)

In [8]:
df.head()

Unnamed: 0,title,summary,instructions,ingredients,ingredient types,diets,vegetarian,vegan,gluten-free,dairy-free,very-healthy,cheap,very-popular,sustainable
0,orange fig teacake with caramel glaze,orange fig teacake with caramel glaze is a veg...,you will need a 9 springform pan or a cake ...,ap flour; baking powder; cardamom; eggs; fresh...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,
1,poached eggs on a bed of fried mushrooms and c...,poached eggs on a bed of fried mushrooms and c...,in a frying pan heat up oil then add mushroom...,bread; butter; eggs; eggs; mushrooms; oil; sal...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,
2,pandan chiffon cake,for 26 cents per serving this recipe covers ...,preheat the oven to 170c blend the pandan le...,all purpose flour; bay leaves; coconut milk; c...,Ethnic Foods Produce Spices and Seasonings Bev...,dairy free; lacto ovo vegetarian,vegetarian,,,dairy-free,,,,
3,pork chop with honey mustard and apples,pork chop with honey mustard and apples might...,pre heat your oven to 200c 400f line a roa...,apples; dijon mustard; garlic cloves; honey; j...,Meat Spices and Seasonings Condiments Oil Vine...,gluten free; dairy free; paleolithic; primal,,,gluten-free,dairy-free,,,,
4,beet gnocchi with steak and brown butter sauce,the recipe beet gnocchi with steak and brown b...,cooking beets heat oven to 400 degrees wash be...,gnocchi; beets; olive oil; s p; goat cheese; r...,Produce Spices and Seasonings Meat Spices and ...,,,,,,,,,


### 1.3 Creating two copies of the diets and ingredients columns

The ingredients and diets are going to be important for this task. To make our model more flexible, we will expand the vocabularies in these categories. 

For diets, we want to include both `"gluten free"` and `"gluten-free"` in the corpus. For ingredients, we want to include both `"extra virgin olive oil"` and `"extra", "virgin", "olive", "oil"`. 
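
As a minimal sketch of the two expansions described above, the snippet below works on plain Python strings rather than on the project's DataFrame. The helper names `hyphenate_diets` and `expand_ingredient` are hypothetical and are not part of the notebook.

```python
def hyphenate_diets(diets: str) -> str:
    """Turn 'gluten free; dairy free' into 'gluten-free; dairy-free'."""
    # Hyphenate every space, then restore the '; ' separator between diets.
    return diets.replace(" ", "-").replace(";-", "; ")


def expand_ingredient(name: str) -> list:
    """Return the full ingredient name plus its individual words."""
    words = name.split()
    # Single-word ingredients need no expansion.
    return [name] if len(words) <= 1 else [name] + words


print(hyphenate_diets("gluten free; dairy free"))
print(expand_ingredient("extra virgin olive oil"))
```

Keeping both the full name and its words means a query can match either `"extra virgin olive oil"` or just `"olive"`.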

In [9]:
# df["ingredient split"] = df["ingredients"].str.replace("; ", " ")
df["diet-split"] = df["diets"].str.replace(" ", "-").str.replace(";-", "; ")
modify_cols = ["gluten-free", "dairy-free", "very-healthy", "very-popular"]
for col in modify_cols:
    i = df.columns.get_loc(col)
    df.iloc[:, i] = df.iloc[:, i] + "; " + df.iloc[:, i].str.replace("-", " ")

In [10]:
df.head()

Unnamed: 0,title,summary,instructions,ingredients,ingredient types,diets,vegetarian,vegan,gluten-free,dairy-free,very-healthy,cheap,very-popular,sustainable,diet-split
0,orange fig teacake with caramel glaze,orange fig teacake with caramel glaze is a veg...,you will need a 9 springform pan or a cake ...,ap flour; baking powder; cardamom; eggs; fresh...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,,lacto-ovo-vegetarian
1,poached eggs on a bed of fried mushrooms and c...,poached eggs on a bed of fried mushrooms and c...,in a frying pan heat up oil then add mushroom...,bread; butter; eggs; eggs; mushrooms; oil; sal...,Beverages Milk Eggs Other Dairy Spices and Sea...,lacto ovo vegetarian,vegetarian,,,,,,,,lacto-ovo-vegetarian
2,pandan chiffon cake,for 26 cents per serving this recipe covers ...,preheat the oven to 170c blend the pandan le...,all purpose flour; bay leaves; coconut milk; c...,Ethnic Foods Produce Spices and Seasonings Bev...,dairy free; lacto ovo vegetarian,vegetarian,,,dairy-free; dairy free,,,,,dairy-free; lacto-ovo-vegetarian
3,pork chop with honey mustard and apples,pork chop with honey mustard and apples might...,pre heat your oven to 200c 400f line a roa...,apples; dijon mustard; garlic cloves; honey; j...,Meat Spices and Seasonings Condiments Oil Vine...,gluten free; dairy free; paleolithic; primal,,,gluten-free; gluten free,dairy-free; dairy free,,,,,gluten-free; dairy-free; paleolithic; primal
4,beet gnocchi with steak and brown butter sauce,the recipe beet gnocchi with steak and brown b...,cooking beets heat oven to 400 degrees wash be...,gnocchi; beets; olive oil; s p; goat cheese; r...,Produce Spices and Seasonings Meat Spices and ...,,,,,,,,,,


## 2. Word2Vec

Here I use the Gensim library to create a word2vec embedding. 

Reference: 
- Gensim Word2Vec documentation: https://radimrehurek.com/gensim/models/word2vec.html
- DAS, P, 2020. How to train word2vec model using gensim library [online]. Medium. [viewed 26/03/2020]. Available from: https://medium.com/swlh/how-to-train-word2vec-model-using-gensim-library-115b35440c90

### 2.1 Transform data into the required format

In [11]:
trainWord2Vec = False
word2vec_path = "../Word2Vec/word2vec.model"

In [12]:
if trainWord2Vec:
    download('stopwords')
    stop_words = stopwords.words('english')
    
    nltk.download('averaged_perceptron_tagger')
    nltk.download('wordnet')
    nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/sannalun/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [13]:
class Sentence(object):
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.n_rows = df.shape[0]
        self.lemmatizer = WordNetLemmatizer()
        # these columns are strings with values separated by ;
        self.semi_colon_cols = ["ingredients", "diets", "diet-split", "gluten-free", 
                                "dairy-free", "very-healthy", "very-popular"]
        
    def __iter__(self):
        for n in range(self.n_rows):
            row = self.df.iloc[n]
            sentence : List[str] = []
            for i, cell in enumerate(row):
                # skip NaN and other non-text cells
                if isinstance(cell, str):
                    # use self.df (not the global df) to look up the column name
                    sent = cell.split("; ") if self.df.columns[i] in self.semi_colon_cols else cell.split()
                    sent = [w.lower() for w in sent if w not in stop_words]
                    # lemmatize only the last token of each cell
                    if len(sent) > 0: sent[-1] = self.lemmatizer.lemmatize(sent[-1])
                    sentence += sent
            yield sentence
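
To make the iterator's output concrete, here is a simplified sketch of how one row becomes one training sentence. It assumes a plain dict in place of a DataFrame row, uses a tiny hypothetical stop-word set instead of NLTK's list, and leaves out the lemmatization step, so it illustrates the splitting logic only.

```python
SEMI_COLON_COLS = {"ingredients", "diets"}
STOP_WORDS = {"with", "a", "the", "and"}  # hypothetical stand-in for NLTK's list


def row_to_sentence(row: dict) -> list:
    sentence = []
    for column, cell in row.items():
        if not isinstance(cell, str):
            continue  # skip NaN and other non-text cells
        # Semicolon-separated columns keep multi-word items as single tokens;
        # every other column is split on whitespace.
        tokens = cell.split("; ") if column in SEMI_COLON_COLS else cell.split()
        sentence += [t.lower() for t in tokens if t not in STOP_WORDS]
    return sentence


row = {
    "title": "pork chop with honey mustard and apples",
    "ingredients": "apples; dijon mustard; garlic cloves",
    "diets": "gluten free; dairy free",
    "vegan": float("nan"),
}
print(row_to_sentence(row))
```

Note that `"dijon mustard"` stays a single vocabulary entry, which is why multi-word ingredients can later appear as suggestions.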

In [14]:
# initialise
if trainWord2Vec:
    sentences = Sentence(df)

### 2.2 Training

After experimenting with different values, the following setup seems to give reasonable results:
- vector size 300
- min count 1
- window 5
- epoch 30

Training time: ~80 seconds

In [15]:
if trainWord2Vec:
    start = time.time()
    model = Word2Vec(sentences, min_count=1, vector_size=300, 
                     workers=2, window=2, epochs=30, sg=0)
    end = time.time()
    model.save(word2vec_path)
    print(f"Training completed in {end-start}s.")

Training completed in 84.1069917678833s.


### 2.3 Generate ingredient suggestions

To get ingredient suggestions, we provide the desired positive (and optionally negative) keywords.
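
As a rough illustration of how positive and negative keywords combine, the sketch below follows the general idea behind Gensim's `most_similar`: add up the unit-normalised positive vectors, subtract the unit-normalised negative ones, and rank the remaining vocabulary by cosine similarity to the result. The three-dimensional vectors are made up for the example and are not taken from the trained model.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative=(), topn=3):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += weight * x
    query = unit(query)
    # Input words are excluded from the ranking.
    inputs = set(positive) | set(negative)
    scored = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items() if w not in inputs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy embeddings, invented for illustration only.
toy_vectors = {
    "vegan": [1.0, 0.0, 0.0],
    "pasta": [0.0, 1.0, 0.0],
    "meat": [0.0, 0.0, 1.0],
    "tofu": [0.9, 0.3, 0.0],
    "bacon": [0.1, 0.3, 0.9],
    "spaghetti": [0.1, 0.9, 0.2],
}
ranking = most_similar(toy_vectors, positive=["vegan", "pasta"], negative=["meat"])
print([word for word, _ in ranking])
```

Because the query is an average, listing a keyword several times gives it proportionally more weight.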

In [16]:
w2v_model = Word2Vec.load(word2vec_path)

In [17]:
results = w2v_model.wv.most_similar(positive=["vegan", "pesto", "pasta"], topn=10)
results = [res[0] for res in results]
results

['feta',
 'fresh mozzarella cheese',
 'dressing',
 'gluten-free',
 'pescatarian',
 'rigate',
 'primal',
 'lasagna',
 'gluten free',
 'gorgonzola']

The results are satisfactory. For example, even though "feta" and "fresh mozzarella cheese" are not vegan, they are somewhat reasonable suggestions, and at least there is no meat on the list. 

However, we also observed many non-ingredient words in the list of suggestions, such as "gluten-free" and "pescatarian". We therefore decided to exclude suggestions that are not in the list of ingredients. 

### 2.4 Improved ingredient suggestion generator

Our strategy is:
1. Obtain the candidate ingredients and save them in a list
2. Generate suggestions like above, but filter out suggestions that are not in the ingredient list
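
The two steps can be sketched on their own as follows. The candidate names are taken from the unfiltered output above, but the similarity scores are placeholders rather than real model output, and `filter_candidates` is a hypothetical helper name. Converting the ingredient list to a set keeps each membership check cheap.

```python
def filter_candidates(candidates, ingredients, topn=20):
    """Keep the first `topn` candidates that are known ingredients."""
    allowed = set(ingredients)
    kept = []
    for name, _score in candidates:
        if len(kept) >= topn:
            break
        if name in allowed:
            kept.append(name)
    return kept


# (name, score) pairs ordered by similarity; the scores are placeholders.
candidates = [
    ("feta", 0.9),
    ("fresh mozzarella cheese", 0.8),
    ("dressing", 0.7),
    ("gluten-free", 0.6),
    ("pescatarian", 0.5),
    ("gorgonzola", 0.4),
]
known_ingredients = ["feta", "fresh mozzarella cheese", "gorgonzola", "mushroom"]

print(filter_candidates(candidates, known_ingredients, topn=2))
```

Requesting more raw candidates than `topn` leaves room for the entries that get filtered out.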

In [18]:
ingredients = df["ingredients"].to_list()
# ingredients_split = df["ingredient split"].to_list()

all_ingredients = []
for i in ingredients: 
    items = i.split("; ")
    all_ingredients += items

# for i in ingredients_split: 
#     items = i.split()
#     all_ingredients += items

all_ingredients = list(set(all_ingredients))

In [19]:
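def get_ingredients(ingredients: List[str], 
                    recipe: str, 
                    positive: List[str], 
                    negative: List[str] = [], 
                    topn=20):
    
    # Repeat the keywords so they weigh more than the recipe words
    pos = recipe.split() + positive*3
    # Query the loaded model (`model` only exists when trainWord2Vec is True)
    candidates = w2v_model.wv.most_similar(positive=pos, negative=negative*2, topn=200)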
    
    substitutions = []
    for ingredient_name, _ in candidates:
        if len(substitutions) >= topn:
            break
        if ingredient_name in ingredients:
            substitutions.append(ingredient_name)
    return substitutions

In [20]:
get_ingredients(all_ingredients, 
                recipe="pesto pasta", 
                positive=["vegan"], 
                negative=[], topn=20)

['feta',
 'fresh mozzarella cheese',
 'mushroom',
 'gorgonzola',
 'honey',
 'cherry',
 'bread',
 'arugula',
 'seasoning blend',
 'bulgur',
 'crepes',
 'mayo',
 'buckwheat',
 'ricotta',
 'nutella']

## 3. BERT

### 3.1 Using a pretrained BERT model

There are many powerful pretrained BERT models, and we would like to see whether the expressive power of BERT helps it succeed at this task better than Word2Vec.

In [21]:
from transformers import pipeline
import torch

In [22]:
fill_pretrained = pipeline("fill-mask", 
                model="bert-base-uncased", 
                tokenizer="bert-base-uncased", 
                top_k=10)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
mask = fill_pretrained.tokenizer.mask_token
results = fill_pretrained(f"vegan pesto {mask} pasta")
results = [r['token_str'] for r in results]
results

['and', '&', 'salad', 'with', '-', ',', 'fried', ':', 'or', 'chicken']

The results are not satisfactory: most outputs are function words or punctuation, and none of them captures our requirements. The first ingredient in the output is chicken, which is clearly not vegan and not even vegetarian. We would therefore like to explore whether a custom-trained BERT performs better on this task. 

### 3.2 Custom RoBERTa model

There are multiple BERT-like models available. We chose to train a RoBERTa model, whose implementation is essentially the same as BERT's with a few tweaks. In particular, it removes the next-sentence pretraining objective and focuses on the masked language modelling objective. We believe the latter objective is more relevant to the task at hand, which makes RoBERTa the more suitable choice. 

Reference:
- RoBERTa documentation [online]. Hugging Face. [viewed 30/03/2020]. Available from: https://huggingface.co/docs/transformers/model_doc/roberta
- BRIGGS, J. How to Train a BERT Model From Scratch [online]. Medium. [viewed 30/03/2020]. Available from: https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6

Training time: 6 hours 18 mins

In [24]:
# Set train to True to train model 
trainBert = False

In [25]:
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaModel, RobertaTokenizerFast, RobertaConfig, RobertaForMaskedLM, AdamW
import os 
from tqdm.auto import tqdm
import glob

In [26]:
tokenizer_path = "../BERT/recipe_tokenizer"
bert_model_path = "../BERT/model"
bert_recipe_path = "../BERT/recipes.txt"

#### Create the text file for training

In [27]:
if trainBert:
    df1 = df.apply(lambda x: " ".join([cell for cell in x if type(cell)==str]), axis=1)

    recipes = "\n".join(df1.to_list())
    with open(bert_recipe_path, 'w', encoding="utf-8") as f:
        f.write(recipes)

#### Train tokenizer

In [28]:
if trainBert:
    tokenizer = ByteLevelBPETokenizer()
    
    # Train the tokenizer with text
    tokenizer.train(files=[bert_recipe_path], 
                vocab_size=30_522, 
                min_frequency=1, 
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
    # Create the output directory; do not fail if it already exists
    os.makedirs(tokenizer_path, exist_ok=True)
    tokenizer.save_model(tokenizer_path)

In [29]:
tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_path)

In [30]:
tokenizer.save_pretrained(tokenizer_path)

('../BERT/recipe_tokenizer/tokenizer_config.json',
 '../BERT/recipe_tokenizer/special_tokens_map.json',
 '../BERT/recipe_tokenizer/vocab.json',
 '../BERT/recipe_tokenizer/merges.txt',
 '../BERT/recipe_tokenizer/added_tokens.json',
 '../BERT/recipe_tokenizer/tokenizer.json')

In [31]:
tokenizer("pesto pasta")

{'input_ids': [0, 8327, 974, 2], 'attention_mask': [1, 1, 1, 1]}

#### Input pipeline

In [32]:
def mlm(tensor):
    """
    input: tensor of token ids, one row per sentence
    output: the same tensor with roughly 15% of the non-special tokens
            (ids > 2) replaced by the mask token id
    """
    rand = torch.rand(tensor.shape)
    mask_arr = (rand < 0.15) * (tensor > 2)
    for i in range(tensor.shape[0]):
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = 4 # mask token
    return tensor
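
For readers without torch at hand, the same masking idea can be sketched in plain Python. The token ids are assumed to follow the special-token order used when training the tokenizer (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3, `<mask>`=4); the function name `mask_tokens` and the example ids are illustrative only. The sketch also returns labels set to -100 at unmasked positions, the index that Hugging Face's masked-LM loss ignores. That is a common variant and not what the notebook does, since the notebook keeps the full input ids as labels.

```python
import random

MASK_ID = 4          # assumed id of <mask>
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(ids, mask_prob=0.15, rng=None):
    """Return (masked inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in ids:
        # Ids 0..2 (<s>, <pad>, </s>) are never masked.
        if tok > 2 and rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)           # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)  # no loss at unmasked positions
    return inputs, labels


ids = [0, 8327, 974, 2, 1, 1]  # "<s> pesto pasta </s>" followed by padding
inputs, labels = mask_tokens(ids, mask_prob=0.5, rng=random.Random(0))
print(inputs)
print(labels)
```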

In [33]:
if trainBert:
    input_ids = []
    attn_mask = [] 
    labels = []

    with open(bert_recipe_path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')

    sample = tokenizer(lines, 
                       max_length=512, 
                       padding="max_length", 
                       truncation=True, 
                       return_tensors='pt')
    labels.append(sample.input_ids)
    attn_mask.append(sample.attention_mask)
    input_ids.append(mlm(sample.input_ids.detach().clone()))

In [34]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    
    def __len__(self):
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        return {key: tensor[i] for key, tensor in self.encodings.items()}

In [35]:
if trainBert:
    # set up for training
    input_ids = torch.cat(input_ids)
    attn_mask = torch.cat(attn_mask)
    labels = torch.cat(labels)
    
    encodings = {
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'labels': labels
    }
    
    dataset = Dataset(encodings)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
    
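    config = RobertaConfig(
        vocab_size=tokenizer.vocab_size,
        max_position_embeddings=514,
        hidden_size=768,
        num_attention_heads=12,
        num_hidden_layers=6,
        type_vocab_size=1
    )
    # Instantiate an untrained RoBERTa with a masked-LM head from the config
    model = RobertaForMaskedLM(config)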

In [36]:
if trainBert:
    device = torch.device("cuda"  if torch.cuda.is_available() else "cpu")
    optim = AdamW(model.parameters(), lr=1e-4)
    model.to(device)

In [37]:
if trainBert:
    loop = tqdm(dataloader, leave=True)

    for batch in loop:
        optim.zero_grad()

        input_ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optim.step()

        # `epochs` was never defined; this loop makes a single pass over the data
        loop.set_description("Training (single pass)")
        loop.set_postfix(loss=loss.item())
    
    model.save_pretrained(bert_model_path)

#### Testing

In [38]:
model = RobertaModel.from_pretrained(bert_model_path)

Some weights of the model checkpoint at ../BERT/model were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at ../BERT/model and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [42]:
fill = pipeline("fill-mask", model=bert_model_path, tokenizer=tokenizer_path, top_k=10)

results = fill(f"vegan pesto {fill.tokenizer.mask_token} pasta")
results = [c['token_str'] for c in results]
results

[' ', ' and', ';', ' the', ' to', ' a', ' with', '<pad>', ' of', '  ']

## 4. Evaluation

1. Come up with 13 random recipe requirements.
2. Generate predictions with all three models.
3. Have two evaluators score the predictions on a scale of 0 to 5.
4. Calculate the average score for each model.
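
Step 4 can be sketched as below. The scores are invented purely to show the arithmetic and are not the project's evaluation data, and `average_score` is a hypothetical helper. It assumes each evaluator's sheet holds one column of scores per recipe; it stacks the sheets, takes the mean of each column, and then takes the mean of those column means, mirroring a `.mean().mean()` call on the concatenated table.

```python
from statistics import mean


def average_score(sheets):
    """sheets: list of {recipe: [scores]} dicts, one dict per evaluator."""
    # Stack the evaluators' rows under each recipe column.
    columns = {}
    for sheet in sheets:
        for recipe, scores in sheet.items():
            columns.setdefault(recipe, []).extend(scores)
    # Mean of each column, then the mean of the column means.
    return mean(mean(scores) for scores in columns.values())


# Invented scores on the 0 to 5 scale, for illustration only.
evaluator_a = {"vegan soup": [3, 0], "quick vegan pasta": [2, 1]}
evaluator_b = {"vegan soup": [1, 0], "quick vegan pasta": [4, 1]}

print(average_score([evaluator_a, evaluator_b]))
```

When every column holds the same number of scores, this equals the plain mean over all scores.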

In [43]:
# Custom Word2Vec
w2v_model = Word2Vec.load(word2vec_path)

# Pretrained BERT
bert_pretrained = fill_pretrained = pipeline(
    "fill-mask", 
    model="bert-base-uncased", 
    tokenizer="bert-base-uncased",
    top_k=10)

# Custom BERT
tokenizer_path = "../BERT/recipe_tokenizer"
bert_model_path = "../BERT/model"
bert_recipe_path = "../BERT/recipes.txt"
bert_custom = pipeline(
    "fill-mask", 
    model=bert_model_path, 
    tokenizer=tokenizer_path, 
    top_k=10)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### 4.1 Come up with 13 random recipe requirements

In [77]:
w2v_test = [
    "budget vegan curry rice", 
    "vegan soup",
    "quick one pot pasta", 
    "vegetarian fried rice", 
    "asian soup noodles", 
    "quick and easy fried noodles",
    "quick vegan pasta",
    "cheap dairy-free soup", 
    "gluten-free cheesecake", 
    "quick vegetarian salad with mozzarella", 
    "vegetarian creamy pasta", 
    "cheap vegetarian pasta", 
    "gluten-free tomato pasta"
]

In [64]:
def generate_bert_test_set(mask):
    bert_test = [
        f"budget vegan curry {mask} rice", 
        f"vegan {mask} soup",
        f"quick one pot {mask} pasta", 
        f"vegetarian {mask} fried rice", 
        f"asian {mask} soup noodles", 
        f"quick and easy {mask} fried noodles",
        f"quick vegan {mask} pasta",
        f"cheap dairy-free {mask} soup", 
        f"gluten-free {mask} cheesecake", 
        f"quick vegetarian {mask} salad with mozzarella", 
        f"vegetarian creamy {mask} pasta", 
        f"cheap vegetarian {mask} pasta", 
        f"gluten-free tomato {mask} pasta"
        ]
    return bert_test

#### 4.2 Generate predictions with all three models

In [78]:
w2v_recommendations = {}
for recipe in w2v_test:
    words = recipe.split()
    words = [word for word in words if w2v_model.wv.has_index_for(word)]
    results = w2v_model.wv.most_similar(words, topn=10)
    results = [r[0] for r in results]
    w2v_recommendations[recipe] = results
df = pd.DataFrame(w2v_recommendations)
df.to_csv("Task3_evaluations/w2v_output.csv")

In [73]:
bert_pretrained_test = generate_bert_test_set(bert_pretrained.tokenizer.mask_token)
bert_pretrained_recommendations = {}
for recipe in bert_pretrained_test:
    results = bert_pretrained(recipe)
    results = [res['token_str'] for res in results]
    bert_pretrained_recommendations[recipe] = results

df = pd.DataFrame(bert_pretrained_recommendations)
df.to_csv("Task3_evaluations/bert_pretrained_output.csv")

In [76]:
bert_custom_test = generate_bert_test_set(bert_custom.tokenizer.mask_token)
bert_custom_recommendations = {}
for recipe in bert_custom_test:
    results = bert_custom(recipe)
    results = [res['token_str'] for res in results]
    bert_custom_recommendations[recipe] = results
    
df = pd.DataFrame(bert_custom_recommendations)
df.to_csv("Task3_evaluations/bert_custom_output.csv")

#### Calculate the average score for each model

In [85]:
import glob

In [99]:
w2v_evaluation_files = glob.glob("Task3_evaluations/w2v_evaluation_*.csv")
dfs = map(lambda path: pd.read_csv(path, index_col=0, sep=";"), w2v_evaluation_files)
w2v_evaluations = pd.concat(dfs)

In [104]:
pretrained_bert_eval_files = glob.glob("Task3_evaluations/bert_pretrained_evaluation_*.csv")
dfs = map(lambda path: pd.read_csv(path, index_col=0, sep=";"), pretrained_bert_eval_files)
pretrained_bert_eval = pd.concat(dfs)


In [108]:
custom_bert_eval_files = glob.glob("Task3_evaluations/bert_custom_evaluation*.csv")
dfs = map(lambda path: pd.read_csv(path, index_col=0, sep=";"), custom_bert_eval_files)
custom_bert_eval = pd.concat(dfs)


In [111]:
pretrained_bert_eval.mean().mean()

1.2038461538461538

In [112]:
w2v_evaluations.mean().mean()

1.5923076923076924

In [113]:
custom_bert_eval.mean().mean()

0.0

In [9]:
# This file holds the custom BERT output, so name the variable accordingly
bert_custom_output = pd.read_csv("Task3_evaluations/bert_custom_output.csv", index_col=0)
bert_custom_output

Unnamed: 0,budget vegan curry <mask> rice,vegan <mask> soup,quick one pot <mask> pasta,vegetarian <mask> fried rice,asian <mask> soup noodles,quick and easy <mask> fried noodles,quick vegan <mask> pasta,cheap dairy-free <mask> soup,gluten-free <mask> cheesecake,quick vegetarian <mask> salad with mozzarella,vegetarian creamy <mask> pasta,cheap vegetarian <mask> pasta,gluten-free tomato <mask> pasta
0,,,,,,,,,,,,,
1,and,and,and,and,and,the,and,and,and,and,and,and,and
2,the,;,the,the,;,;,;,;,;,;,the,;,;
3,to,to,;,;,the,and,the,the,the,the,;,the,the
4,a,a,to,to,a,to,a,a,to,to,to,to,with
5,with,the,a,with,to,a,to,to,a,a,a,a,to
6,;,of,of,a,of,with,with,of,of,with,with,with,a
7,of,<pad>,with,of,<pad>,<pad>,of,with,with,of,of,of,of
8,<pad>,with,<pad>,<pad>,with,of,<pad>,recipe,<pad>,<pad>,<pad>,<pad>,<pad>
9,in,,for,,,for,for,<pad>,recipe,for,,for,in
