In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Analysis: Frame the problem and look at the big picture.

## Objective
Enable people to generate new and tasty recipes based on the ingredients they currently have at home.

## Framing
We treat this as an unsupervised problem: the model generates new recipes (text generation), and there is no target or label to predict or classify.

## Performance Measuring
We do not have the domain knowledge to judge whether the generated recipes will taste <em>good</em>, but we can evaluate whether the model generates gibberish.

We will train the model on all of our data and then test the model on custom inputs.
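
One crude way to flag gibberish is to measure how many of the generated words also occur in the training recipes. The sketch below is only an illustration of that idea, not the evaluation used in this notebook; the helper names `tokenize` and `vocabulary_coverage` and the toy sentences are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vocabulary_coverage(generated, vocabulary):
    """Return the fraction of tokens in `generated` found in `vocabulary`."""
    tokens = tokenize(generated)
    if not tokens:
        return 0.0
    known = sum(token in vocabulary for token in tokens)
    return known / len(tokens)


# Toy vocabulary (assumption); in practice it would be built from the
# directions of the training recipes.
vocabulary = set(
    tokenize("Mix brown sugar, milk and butter in a saucepan. Stir in vanilla and nuts.")
)

good = vocabulary_coverage("Stir milk and sugar in a saucepan", vocabulary)
bad = vocabulary_coverage("Stxr mlik qzt sugar wv saucepan", vocabulary)
print(good, bad)
```

A low coverage score hints at invented or misspelled words, but a high score does not guarantee a sensible recipe, so manual inspection of the custom inputs is still needed.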


# 2. Get the data 

We will use the `RecipeNLG` data set, which can be downloaded through this link: https://recipenlg.cs.put.poznan.pl/dataset

The data set is ~2 GB and contains 2,231,142 recipes, each with a `title`, `directions`, the necessary ingredients and more. It has been downloaded to this folder as `full_dataset.csv`.
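
Because the file is large, it can help to peek at a few rows before loading everything. The sketch below uses a small in-memory stand-in for `full_dataset.csv` (shortened, illustrative rows) and assumes the file's first column has no header name, which is what the `Unnamed: 0` column suggests. `index_col` and `nrows` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Stand-in for full_dataset.csv with the same column layout (assumption:
# the first column is an unnamed row index). Row contents are shortened.
sample_csv = io.StringIO(
    ',title,ingredients,directions,link,source,NER\n'
    '0,Creamy Corn,"[""frozen corn""]","[""Combine all ingredients.""]",'
    'www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn""]"\n'
    '1,Chicken Funny,"[""1 large whole chicken""]","[""Boil and debone chicken.""]",'
    'www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken""]"\n'
)

# index_col=0 uses the unnamed first column as the index, so no
# 'Unnamed: 0' column is created; nrows limits how many rows are parsed.
preview = pd.read_csv(sample_csv, index_col=0, nrows=1)
print(preview.shape)
print(list(preview.columns))
```

With the real file, the same call with the path `"full_dataset.csv"` in place of `sample_csv` would read only the requested rows.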

In [23]:
data = pd.read_csv("full_dataset.csv").drop('Unnamed: 0', axis=1)

In [26]:
data.head()

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [27]:
data.shape

(2231142, 6)

In [28]:
data.columns

Index(['title', 'ingredients', 'directions', 'link', 'source', 'NER'], dtype='object')

## Features

### title
Title of the recipe

### ingredients
Vector of strings describing the amount of each ingredient required.

### directions
Vector of sentences giving the step-by-step actions needed to reproduce the recipe.

### link
The URL the recipe was scraped from.

### source
Whether the recipe comes from the `Recipes1M` data set or was scraped (`Gathered`).

### NER
Vector of the ingredient names only (without amounts), stored in string format.

# 3. Explore the data

In [31]:
eda_copy = data.copy()

## Exploring the features in the data set

In [32]:
eda_copy.head()

Unnamed: 0,title,ingredients,directions,link,source,NER
0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [39]:
eda_copy.dtypes

title          object
ingredients    object
directions     object
link           object
source         object
NER            object
dtype: object

Every feature has dtype `object`, i.e. it is stored as a string. The `ingredients`, `directions` and `NER` columns must be parsed into arrays.
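
The list-like columns shown above look like JSON arrays (square brackets, double-quoted strings), so one possible way to parse them is `json.loads`. This is a minimal sketch on a toy frame with illustrative values, assuming every cell is valid JSON; it is not necessarily the parsing used later in this notebook.

```python
import json

import pandas as pd

# Toy frame mimicking the stored format: each list is kept as one string.
toy = pd.DataFrame(
    {
        "title": ["Creamy Corn"],
        "ingredients": ['["2 (16 oz.) pkg. frozen corn", "1/2 c. butter"]'],
        "NER": ['["frozen corn", "butter"]'],
    }
)

# Parse each JSON-style string into a Python list, column by column.
list_columns = ["ingredients", "NER"]
for column in list_columns:
    toy[column] = toy[column].apply(json.loads)

print(type(toy.loc[0, "NER"]))
print(len(toy.loc[0, "ingredients"]))
```

On the full data set, `directions` would be added to `list_columns`. If some cells turn out not to be valid JSON, `json.loads` raises an error, and `ast.literal_eval` from the standard library is a common fallback.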

The data set has already been cleansed in terms of duplicate recipes where the whitespace was 