## Data preprocessing
The recipes are defined in a semi-structured format. This section groups each recipe's steps and ingredients into a single string and defines a few utility functions.
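
As a minimal sketch of the grouping step, the snippet below parses a stringified list with `ast.literal_eval` and joins its items into one string. The sample cells and the helper name `join_field` are hypothetical and only illustrate the idea; the separators are the ones the notebook uses when it loads the dataset.

```python
from ast import literal_eval


def join_field(raw, separator):
    """Parse a stringified Python list and join its items with `separator`."""
    # literal_eval only evaluates Python literals, so it does not run arbitrary code.
    return separator.join(literal_eval(raw))


# Hypothetical CSV cells, stored as string representations of lists.
raw_ingredients = '["1 c. sugar", "8 oz. soft butter"]'
raw_steps = '["Mix sugar and butter.", "Roll into balls."]'

ingredients = join_field(raw_ingredients, ".\n")  # one ingredient per line
steps = join_field(raw_steps, "")                 # steps concatenated as-is
print(ingredients)
print(steps)
```

Note that joining the steps with an empty string leaves no space after each period, which is visible in the recipe printed later in this section.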

### Utility functions and imports

In [35]:
from pathlib import Path
import pandas as pd
from IPython.display import display
from ast import literal_eval
import re
from nltk.tokenize import word_tokenize, sent_tokenize


def string_recipe(i):
    """Return recipe i as one string: title, ingredients and steps, separated by blank lines."""
    row = dataset.iloc[i]
    return row['title'] + "\n\n" + row['ingredients'] + "\n\n" + row['step']

### Loading the dataset

In [37]:
dataset = pd.read_csv(
    Path("../data/test_dataset.csv").resolve(), 
    index_col=[0], 
    names=["index", "title","ingredients","step"], 
    usecols=[0,1,2,3]
    )

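# Parse the stringified lists and join them. Assigning whole columns avoids
# chained indexing (dataset.iloc[i][col] = ...), which may write to a copy.
dataset['ingredients'] = dataset['ingredients'].apply(lambda s: ".\n".join(literal_eval(s)))
dataset['step'] = dataset['step'].apply(lambda s: "".join(literal_eval(s)))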

display(dataset.head())
print(string_recipe(10))


index,title,ingredients,step
1,Jewell Ball'S Chicken,"1 small jar chipped beef, cut up.\n4 boned chi...",Place chipped beef on bottom of baking dish.Pl...
2,Creamy Corn,2 (16 oz.) pkg. frozen corn.\n1 (8 oz.) pkg. c...,"In a slow cooker, combine all ingredients. Cov..."
3,Chicken Funny,1 large whole chicken.\n2 (10 1/2 oz.) cans ch...,Boil and debone chicken.Put bite size pieces i...
4,Reeses Cups(Candy),1 c. peanut butter.\n3/4 c. graham cracker cru...,Combine first four ingredients and press in 13...
5,Cheeseburger Potato Soup,6 baking potatoes.\n1 lb. of extra lean ground...,Wash potatoes; prick several times with a fork...


Buckeye Candy

1 box powdered sugar.
8 oz. soft butter.
1 (8 oz.) peanut butter.
paraffin.
12 oz. chocolate chips

Mix sugar, butter and peanut butter.Roll into balls and place on cookie sheet.Set in freezer for at least 30 minutes. Melt chocolate chips and paraffin in double boiler.Using a toothpick, dip balls 3/4 of way into chocolate chip and paraffin mixture to make them look like buckeyes.


### Extracting abbreviations

As a first step, collect the set of abbreviations that appear in the ingredient lists.

In [38]:
abbrv_dataset = pd.read_csv(
    Path("../data/test_dataset.csv").resolve(), 
    index_col=[0], 
    names=["index", "title","ingredients","step"], 
    usecols=[0,1,2,3]
    )

abbrv = set()
for raw_ingredients in abbrv_dataset['ingredients']:
    # Join the parsed ingredient list with spaces, then collect every run of
    # one or more letters that is immediately followed by a period.
    joined = " ".join(literal_eval(raw_ingredients))
    abbrv.update(re.findall(r"[A-Za-z]+\.", joined))
    
print(abbrv)

{'No.', 'tbsp.', 'gal.', 'c.', 'oz.', 'pt.', 'tsp.', 'Tbsp.', 'sq.', 'lb.', 'qt.', 'pkg.'}
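
The regular expression also matches any ordinary word that happens to precede a period, so it can help to look at how often each candidate occurs before deciding what to expand. The following is a minimal sketch under that assumption; the sample strings and the name `count_candidates` are hypothetical and not part of the notebook.

```python
import re
from collections import Counter


def count_candidates(ingredient_strings):
    """Count every run of letters that is immediately followed by a period."""
    counts = Counter()
    for text in ingredient_strings:
        counts.update(re.findall(r"[A-Za-z]+\.", text))
    return counts


# Hypothetical ingredient strings, one per recipe.
samples = [
    "1 c. sugar 8 oz. soft butter",
    "2 (16 oz.) pkg. frozen corn",
    "1 small jar chipped beef, cut up.",
]
counts = count_candidates(samples)
for candidate, n in counts.most_common():
    print(candidate, n)
```

In this toy input `up.` is counted alongside the real abbreviations, which is the kind of false positive a manual review of the set would need to filter out.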


The abbreviations are expanded because they can harm tokenization: the trailing period may be mistaken for the end of a sentence. E.g. pkg. ---> package
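
To illustrate the problem, consider a deliberately naive splitter that breaks after every period followed by whitespace. This is a simplified stand-in for illustration only, not how NLTK's `sent_tokenize` works internally; the function name and the sample text are hypothetical.

```python
import re


def naive_sentence_split(text):
    """Split after each period that is followed by whitespace (simplified stand-in)."""
    return re.split(r"(?<=\.)\s+", text)


before = "2 (16 oz.) pkg. frozen corn.\n1 c. butter"
after = "2 (16 ounce) package frozen corn.\n1 cup butter"

# With abbreviations, the ingredient lines are broken into fragments.
print(naive_sentence_split(before))
# Once expanded, each ingredient stays in one piece.
print(naive_sentence_split(after))
```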

In [39]:
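def expand_abbreviations(ingredients_string):
    """Lowercase the string and replace each unit abbreviation with its full word."""
    # Keys are the lowercase forms of the abbreviations extracted above.
    abbreviations = {
        'pkg.'  :   'package',
        'no.'   :   'number',
        'pt.'   :   'pint',
        'gal.'  :   'gallon',
        'tbsp.' :   'tablespoon',
        'sq.'   :   'square',
        'oz.'   :   'ounce',
        'lb.'   :   'pound',
        'qt.'   :   'quart',
        'c.'    :   'cup',
        'tsp.'  :   'teaspoon'
    }
    ingredients_string = ingredients_string.lower()
    for item, value in abbreviations.items():
        # \b stops short keys such as 'c.' from matching the end of a longer word (e.g. "garlic.").
        ingredients_string = re.sub(r"\b" + re.escape(item), value, ingredients_string)
    return ingredients_string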


# Assign the whole column at once instead of writing through chained indexing.
dataset['ingredients'] = dataset['ingredients'].apply(expand_abbreviations)

display(dataset.head())

index,title,ingredients,step
1,Jewell Ball'S Chicken,"1 small jar chipped beef, cut up.\n4 boned chi...",Place chipped beef on bottom of baking dish.Pl...
2,Creamy Corn,2 (16 ounce) package frozen corn.\n1 (8 ounce)...,"In a slow cooker, combine all ingredients. Cov..."
3,Chicken Funny,1 large whole chicken.\n2 (10 1/2 ounce) cans ...,Boil and debone chicken.Put bite size pieces i...
4,Reeses Cups(Candy),1 cup peanut butter.\n3/4 cup graham cracker c...,Combine first four ingredients and press in 13...
5,Cheeseburger Potato Soup,6 baking potatoes.\n1 pound of extra lean grou...,Wash potatoes; prick several times with a fork...


### Sentence splitting

In [42]:
# Split each ingredients string into a list of sentences, one per ingredient.
dataset['ingredients'] = dataset['ingredients'].apply(sent_tokenize)

In [45]:
display(dataset.head())
print(dataset.iloc[10]['ingredients'])

index,title,ingredients,step
1,Jewell Ball'S Chicken,"[1 small jar chipped beef, cut up., 4 boned ch...",Place chipped beef on bottom of baking dish.Pl...
2,Creamy Corn,"[2 (16 ounce) package frozen corn., 1 (8 ounce...","In a slow cooker, combine all ingredients. Cov..."
3,Chicken Funny,"[1 large whole chicken., 2 (10 1/2 ounce) cans...",Boil and debone chicken.Put bite size pieces i...
4,Reeses Cups(Candy),"[1 cup peanut butter., 3/4 cup graham cracker ...",Combine first four ingredients and press in 13...
5,Cheeseburger Potato Soup,"[6 baking potatoes., 1 pound of extra lean gro...",Wash potatoes; prick several times with a fork...


['1 box powdered sugar.', '8 ounce soft butter.', '1 (8 ounce) peanut butter.', 'paraffin.', '12 ounce chocolate chips']
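
One way to sanity-check the split is to compare the number of sentences with the number of items in the original ingredient list, since each ingredient was joined as its own sentence. This is a minimal sketch, assuming both versions are available as lists of lists; `find_split_mismatches` and the sample data are hypothetical.

```python
def find_split_mismatches(original_lists, split_lists):
    """Return positions where the sentence count differs from the ingredient count."""
    return [
        i
        for i, (original, split) in enumerate(zip(original_lists, split_lists))
        if len(original) != len(split)
    ]


# Hypothetical data: the second recipe was split into too many sentences.
original_lists = [
    ["1 box powdered sugar", "8 ounce soft butter"],
    ["2 (16 oz.) pkg. frozen corn", "1 c. butter"],
]
split_lists = [
    ["1 box powdered sugar.", "8 ounce soft butter"],
    ["2 (16 oz.) pkg.", "frozen corn.", "1 c.", "butter"],
]
print(find_split_mismatches(original_lists, split_lists))
```

Any position reported here points at a recipe worth inspecting by hand, for example for an abbreviation that was not expanded.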
