## Recipe-maker: Part 1  <img style="float: left;" src="assets/potatochef.png" width="150" height="150"> 

#### Introduction

In this notebook, I'm going to explore recipe data from _allrecipes.com_. It will be part 1 of the two parts I have split my project into:

* **Part 1**  
  
  My main question is 'What ingredients typically go well together?'.
  
  But there are some problems with this question:  
  The data I am currently using includes recipes for a vast array of food types: breads, cakes, desserts, pasta, soups etc. It might be difficult to divide/classify the data, and without dividing the data, it might be difficult to answer the question well. 
  
  Possible solutions:  
  * Only use a subset of the data whose recipes are easy to identify as belonging to one category. E.g. building a subset of soup dishes may be as easy as keeping recipes with the word 'soup', or a synonym, in their title.
  * Divide the data into a subset that includes single dishes cooked on the hob only. 
  * Tag the data with categories
  * Go back later and use a different dataset that I scrape myself to include only a subsection of _allrecipes.com_. 
  
  What will be the measure of how well ingredients go together?  
  **(I haven't figured this out yet)**
  * The ratings may be an indicator but ratings would reflect many other things, including how complicated the steps are, how long it takes to make the dish etc.  
  * The fact that a combination exists in the first place, or how frequently it exists, may be another indicator, but again there would be many factors contributing to a recipe's frequency.
  
  * A recipe involving cauliflower, chocolate and olives is less likely to make it onto allrecipes, and if it did, I'd expect the reviews to be bad, so it might be an idea to combine several features into the 'outcome'. Naive Bayes may be useful in this scenario.
  * Another possible solution: use a different source of data, in which recipes receive separate ratings for different factors - I would be interested in a 'taste' rating.
  
  


* **Part 2**  
  
  My plan is a little half-baked but the ultimate goal is to make a program that could do one or more of the following:  
  * Given some user-inputted ingredients, recommend other ingredients to use
  * Given a user-inputted list of ingredients, create a recommended recipe for the user to follow (seems more difficult)
  * Generate recommended recipes and tune the recommendations based on users' ratings of other recipes

#### Imports

In [61]:
import pandas as pd
import numpy as np


from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

#### Read the data

In [62]:
#recipes1 = pd.read_csv('data/recipes.csv')
recipes = pd.read_csv('data/recipes.csv', delimiter=';')
reviews = pd.read_csv('data/reviews.csv', delimiter=',')

In [54]:
recipes.shape

(12351, 10)

In [55]:
reviews.shape

(1563566, 3)

In [56]:
recipes.head(6)

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID
0,Golden Crescent Rolls Recipe,304,https://images.media-allrecipes.com/userphotos...,Mike A.,25 m,15 m,3 h 10 m,"yeast,water,white sugar,salt,egg,butter,flour,...","Dissolve yeast in warm water.**Stir in sugar, ...",7000
1,Poppy Seed Bread with Glaze Recipe,137,https://images.media-allrecipes.com/userphotos...,Christina Jun,15 m,1 h,1 h 20 m,"flour,salt,baking powder,poppy,butter,vegetabl...",'Preheat oven to 350 degrees F (175 degrees C)...,7001
2,Applesauce Bread I Recipe,124,https://images.media-allrecipes.com/userphotos...,GAF55,10 m,1 h 20 m,1 h 30 m,"flour,egg,white sugar,vegetable oil,applesauce...",Preheat oven to 350 degrees F (175 degrees C)....,7003
3,Apple Raisin Bread Recipe,39,https://images.media-allrecipes.com/userphotos...,Helen Hanson,15 m,1 h,1 h 15 m,"flour,baking powder,baking soda,salt,cinnamon,...",Preheat oven to 350 degrees F (175 degrees C)....,7006
4,Buttermilk Oatmeal Bread Recipe,41,https://images.media-allrecipes.com/userphotos...,Helen Hanson,10 m,1 h,1 h 40 m,"oat,buttermilk,vegetable oil,egg,brown sugar,f...",Mix oats with buttermilk. Let stand for 1/2 h...,7007
5,Kolaches II Recipe,27,https://images.media-allrecipes.com/userphotos...,Nan,30 m,20 m,2 h 5 m,"shortening,white sugar,salt,milk,egg,lemon,yea...",Cream shortening and sugar together. Stir in ...,7008


Questions:

* Do all of the recipes end with the word 'Recipe'?
* Shall I drop the recipe photo column?
* Shall I take into consideration Author and groups of recipes with the same author?
* Should I take into consideration prepare time and cook time, and if so, how? Shall I include in the dataset only total time within a specified range? 
* Shall I take into consideration whether the recipe calls for use of oven/hob/microwave etc?
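
Before any filtering on time, the time columns would need to be numeric. Here is a minimal sketch, assuming the strings always look like the ones in the preview above ('25 m', '1 h', '3 h 10 m') and that 'X' means the time is missing. The helper name `to_minutes` is made up for illustration and is not part of the notebook.

```python
import re

def to_minutes(text):
    """Convert strings like '3 h 10 m' to minutes; return None for 'X' or anything unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", str(text))
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

# Toy values in the same format as the Prepare/Cook/Total Time columns
samples = ["25 m", "1 h", "3 h 10 m", "X"]
parsed = [to_minutes(s) for s in samples]
print(parsed)
```

On the real data this could be applied with something like `recipes["Total Time"].map(to_minutes)`, after which a range filter becomes a simple comparison.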

Things to explore: 
* Use association rules on ingredients

In [130]:
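ingredients = recipes[["Recipe Name","Ingredients"]]
# Note: this match is case-sensitive, so names containing 'Pasta' with a capital P are not matched
ingredients = ingredients[ingredients["Recipe Name"].str.contains('pasta', regex=False)]
ingredients.head(1000)


# One-hot encode the comma-separated ingredients (left commented out, as in the original run)
#onehot_ingredients = pd.concat([ingredients.drop(columns='Ingredients'), ingredients['Ingredients'].str.get_dummies(sep=",")], axis=1)

#print(onehot_ingredients)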

Unnamed: 0,Recipe Name,Ingredients
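
The association-rule step needs one row per recipe and one true/false column per ingredient. Below is a minimal sketch of that encoding on a made-up three-recipe frame (the names and ingredient lists are invented for illustration). It only uses pandas and is not necessarily how the `onehot_ingredients` table in this notebook was built.

```python
import pandas as pd

# Toy data in the same shape as the 'Recipe Name' / 'Ingredients' columns
toy = pd.DataFrame({
    "Recipe Name": ["A Recipe", "B Recipe", "C Recipe"],
    "Ingredients": ["flour,egg,salt", "flour,egg,egg", "onion,salt"],
})

# Split on commas and build indicator columns; a repeated ingredient still counts once
onehot = toy["Ingredients"].str.get_dummies(sep=",").astype(bool)
onehot.index = toy["Recipe Name"]

# Share of recipes containing each ingredient (the single-item 'support')
support = onehot.mean().sort_values(ascending=False)
print(onehot)
print(support)
```

Keeping the recipe name in the index rather than as a column means the table holds only boolean columns.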


# Mining Association Rules

In [90]:
frq_ingredients = apriori(onehot_ingredients, min_support = 0.04, use_colnames = True)

# Collect the inferred rules in a dataframe, keeping only rules with a lift of at least 2
rules = association_rules(frq_ingredients, metric='lift', min_threshold = 2)

# Sort by lift first, then by confidence, both in descending order
ingredients_rules = rules.sort_values(['lift','confidence'],ascending = [False,False])

ingredients_rules[:20]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1266,"(salt, vanilla, flour)","(baking soda, egg)",0.090681,0.096915,0.044774,0.49375,5.094659,0.035985,1.783871
1283,"(baking soda, egg)","(salt, vanilla, flour)",0.096915,0.090681,0.044774,0.461988,5.094659,0.035985,1.690147
212,"(salt, onion)",(black pepper),0.077726,0.115294,0.04534,0.583333,5.059515,0.036379,2.123294
217,(black pepper),"(salt, onion)",0.115294,0.077726,0.04534,0.393258,5.059515,0.036379,1.520043
1165,"(baking powder, egg)","(salt, vanilla, flour)",0.089871,0.090681,0.040159,0.446847,4.927683,0.032009,1.643883
1148,"(salt, vanilla, flour)","(baking powder, egg)",0.090681,0.089871,0.040159,0.442857,4.927683,0.032009,1.633564
1356,"(salt, vanilla, flour)","(baking soda, white sugar)",0.090681,0.093029,0.040968,0.451786,4.856402,0.032532,1.65441
1373,"(baking soda, white sugar)","(salt, vanilla, flour)",0.093029,0.090681,0.040968,0.440383,4.856402,0.032532,1.624895
128,"(flour, brown sugar)",(baking soda),0.080074,0.115456,0.04445,0.555106,4.807936,0.035205,1.988213
133,(baking soda),"(flour, brown sugar)",0.115456,0.080074,0.04445,0.384993,4.807936,0.035205,1.495797


In [91]:
# High support = the itemset is present in many recipes, so rules drawn from it rest on more data
# High confidence = recipes containing the antecedent very often also contain the consequent
# High lift (> 1) = antecedent and consequent appear together more often than expected if they were independent
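
The metrics can be checked by hand against the rules table above. This sketch copies the three support figures from the '(salt, onion)' to '(black pepper)' row and recomputes the remaining columns from their standard definitions. The results should agree with the table to roughly four decimal places, since the inputs are already rounded.

```python
# Figures copied from the (salt, onion) -> (black pepper) row of the rules table
antecedent_support = 0.077726   # share of recipes with salt and onion
consequent_support = 0.115294   # share of recipes with black pepper
support = 0.04534               # share of recipes with all three

# confidence: of the recipes with the antecedent, how many also have the consequent
confidence = support / antecedent_support
# lift: confidence relative to how common the consequent is overall
lift = confidence / consequent_support
# leverage: observed co-occurrence minus what independence would predict
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail under independence
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```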

In [26]:
recipes["Suffix"] = recipes["Recipe Name"].str.split().str[-1]
recipes["Suffix"].value_counts()

Recipe        10962
'              1015
Restaurant       17
Old              17
Sun              13
              ...  
Sixty             1
Butter            1
She               1
Child             1
Paris             1
Name: Suffix, Length: 194, dtype: int64

In [28]:
recipes.loc[recipes['Suffix'] != 'Recipe'].head()

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID,Suffix
9,'Ruby''s Special Cornbread Recipe ',4,https://images.media-allrecipes.com/userphotos...,Mitzi Lyons,10 m,45 m,55 m,"cornmeal,milk,egg,bell pepper,onion,garlic,sal...",Preheat oven to 350 degrees F (175 degrees C)....,7014,'
21,'Mary Anne''s Moist and Nutty Carrot Loaf Reci...,36,https://images.media-allrecipes.com/userphotos...,P.Weiss,15 m,45 m,1 h,"baking soda,salt,cinnamon,nutmeg,flour,carrot,...",Preheat oven to 375 degrees F (190 degrees C)....,7033,'
26,'J.P.''s Big Daddy Biscuits Recipe ',3k,https://images.media-allrecipes.com/userphotos...,John Pickett,30 m,15 m,45 m,"flour,baking powder,salt,white sugar,shortenin...",Preheat oven to 425 degrees F (220 degrees C)....,7040,'
32,'Steve''s Whole Wheat Recipe ',11,https://images.media-allrecipes.com/userphotos...,Steve Lockhart,5 m,3 h,3 h 5 m,"water,milk,molasses,honey,margarine,white suga...",Place the ingredients in the pan of the bread ...,7051,'
40,Golly Gee Gluten,12,https://images.media-allrecipes.com/userphotos...,Kevin Ryan,10 m,15 m,25 m,"egg,apple juice,butter,flour,tapioca,flour,cin...","In a medium mixing bowl, beat the egg with the...",7060,Gluten


It looks like recipe names containing an apostrophe have that apostrophe doubled, and the whole name is enclosed in inverted commas, with a trailing space before the closing one.

In [31]:
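# Collapse the doubled apostrophes, then remove the enclosing quotes and the stray trailing space
recipes['Recipe Name'] = recipes['Recipe Name'].str.replace("''", "'", regex=False)
recipes['Recipe Name'] = recipes['Recipe Name'].str.strip("'").str.strip()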
recipes.loc[recipes['Suffix'] != 'Recipe'].head()

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID,Suffix
9,Ruby's Special Cornbread Recipe,4,https://images.media-allrecipes.com/userphotos...,Mitzi Lyons,10 m,45 m,55 m,"cornmeal,milk,egg,bell pepper,onion,garlic,sal...",Preheat oven to 350 degrees F (175 degrees C)....,7014,'
21,Mary Anne's Moist and Nutty Carrot Loaf Recipe,36,https://images.media-allrecipes.com/userphotos...,P.Weiss,15 m,45 m,1 h,"baking soda,salt,cinnamon,nutmeg,flour,carrot,...",Preheat oven to 375 degrees F (190 degrees C)....,7033,'
26,J.P.'s Big Daddy Biscuits Recipe,3k,https://images.media-allrecipes.com/userphotos...,John Pickett,30 m,15 m,45 m,"flour,baking powder,salt,white sugar,shortenin...",Preheat oven to 425 degrees F (220 degrees C)....,7040,'
32,Steve's Whole Wheat Recipe,11,https://images.media-allrecipes.com/userphotos...,Steve Lockhart,5 m,3 h,3 h 5 m,"water,milk,molasses,honey,margarine,white suga...",Place the ingredients in the pan of the bread ...,7051,'
40,Golly Gee Gluten,12,https://images.media-allrecipes.com/userphotos...,Kevin Ryan,10 m,15 m,25 m,"egg,apple juice,butter,flour,tapioca,flour,cin...","In a medium mixing bowl, beat the egg with the...",7060,Gluten
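
The clean-up can be tried on a couple of names in isolation. This is a small sketch using two names from the table above, typed out by hand in their raw form. The literal (non-regex) replacement is my own choice here, so the behaviour does not depend on the pandas version's default for `regex`.

```python
import pandas as pd

# Raw names as they appear before cleaning: doubled apostrophe, enclosing quotes, trailing space
names = pd.Series(["'Ruby''s Special Cornbread Recipe '", "Golly Gee Gluten"])

cleaned = (
    names.str.replace("''", "'", regex=False)  # '' -> '
         .str.strip("'")                        # drop the enclosing quotes
         .str.strip()                           # drop leftover whitespace
)

# Last word of each cleaned name, as used for the 'Suffix' column
suffix = cleaned.str.split().str[-1]
print(cleaned.tolist())
print(suffix.tolist())
```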


In [33]:
recipes["Suffix"] = recipes["Recipe Name"].str.split().str[-1]
recipes["Suffix"].value_counts()

Recipe        11977
Old              17
Restaurant       17
Sun              13
Upside           10
              ...  
Child             1
Blackberry        1
Lime              1
Habanero          1
Paris             1
Name: Suffix, Length: 182, dtype: int64

In [36]:
pd.set_option('display.max_rows', 500)
recipes.loc[recipes['Suffix'] != 'Recipe'].head(100)

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID,Suffix
40,Golly Gee Gluten,12,https://images.media-allrecipes.com/userphotos...,Kevin Ryan,10 m,15 m,25 m,"egg,apple juice,butter,flour,tapioca,flour,cin...","In a medium mixing bowl, beat the egg with the...",7060,Gluten
60,The All,3,https://images.media-allrecipes.com/images/795...,Michelle L.,5 m,3 h,3 h 5 m,"water,salt,whole wheat,quinoa,flour,rosemary,y...",Add ingredients in order suggested by your man...,7084,All
63,Yeast,39,https://images.media-allrecipes.com/userphotos...,Andrew Chin,5 m,45 m,50 m,"flour,white sugar,lemon",Preheat oven to 350 degrees F (175 degrees C)....,7087,Yeast
83,Banana Bread,345,https://images.media-allrecipes.com/userphotos...,Dee,15 m,50 m,1 h 15 m,"flour,baking powder,baking soda,white sugar,ve...",Place ingredients in the pan of the bread mach...,7116,Bread
95,Pear,24,https://images.media-allrecipes.com/userphotos...,MARBALET,X,X,X,"flour,whole wheat,baking soda,cinnamon,baking ...",Preheat oven to 375 degrees F (190 degrees C)....,7133,Pear
133,Old,7,https://images.media-allrecipes.com/userphotos...,MARBALET,X,X,X,"flour,baking powder,salt,shortening,white suga...","In a medium bowl mix together the flour, bakin...",7182,Old
136,Gluten,170,https://images.media-allrecipes.com/userphotos...,Aaron Atkinson,X,X,X,"egg,vinegar,olive,honey,buttermilk,salt,1 tabl...",Place ingredients in the pan of the bread mach...,7185,Gluten
162,Muesli,34,https://images.media-allrecipes.com/userphotos...,KLODE,X,X,X,"applesauce,vegetable oil,white sugar,egg,water...",Preheat oven to 375 degrees F (190 degrees C)....,7220,Muesli
275,Fabulous Oatmeal,46,https://images.media-allrecipes.com/userphotos...,Carol Farrington,X,X,X,"water,oat,cereal,white sugar,brown sugar,short...","Pour boiling water over oats and bran cereal, ...",7359,Oatmeal
278,Old,10,https://images.media-allrecipes.com/userphotos...,Winona,30 m,30 m,1 h,"raisin,water,shortening,egg,flour,cinnamon,nut...",Preheat oven to 350 degrees F (175 degrees C)....,7363,Old


In [37]:
recipes.describe(include="all")

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID,Suffix
count,12351,12351.0,12351,12351,12351,12351,12351,12351,12345,12351.0,12351
unique,11964,741.0,11248,6582,73,110,301,12097,12184,,182
top,Old,4.0,https://images.media-allrecipes.com/images/795...,sal,X,X,X,"chocolate,chocolate",'Preheat oven to 350 degrees F (175 degrees C).,,Recipe
freq,17,347.0,971,274,4156,5877,4091,4,5,,11977
mean,,,,,,,,,,16149.829326,
std,,,,,,,,,,5745.770833,
min,,,,,,,,,,7000.0,
25%,,,,,,,,,,11150.5,
50%,,,,,,,,,,15564.0,
75%,,,,,,,,,,20889.5,


In [7]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12351 entries, 0 to 12350
Data columns (total 10 columns):
Recipe Name     12351 non-null object
Review Count    12351 non-null object
Recipe Photo    12351 non-null object
Author          12351 non-null object
Prepare Time    12351 non-null object
Cook Time       12351 non-null object
Total Time      12351 non-null object
Ingredients     12351 non-null object
Directions      12345 non-null object
RecipeID        12351 non-null int64
dtypes: int64(1), object(9)
memory usage: 965.0+ KB


* Looking at count, there are no null values in any column except Directions (12345 of 12351 non-null).
* We can't assess 'unique', 'top' and 'freq' for RecipeID because it is of type int64, so I am going to convert this column to type string.
* There seem to be some entries (6) without any directions - I may remove these from the dataset.
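
A minimal sketch of those steps on a made-up three-row frame (two rows without directions, borrowing RecipeIDs from this dataset purely as sample values). Whether to drop the rows is still an open decision, so `dropna` here only shows what it would look like.

```python
import pandas as pd

toy = pd.DataFrame({
    "RecipeID": [7378, 12251, 7000],
    "Directions": [None, None, "Dissolve yeast in warm water."],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Treat RecipeID as a label rather than a number, so describe() reports unique/top/freq
toy["RecipeID"] = toy["RecipeID"].astype(str)
id_summary = toy["RecipeID"].describe()

# Keep only the rows that have directions
with_directions = toy.dropna(subset=["Directions"])

print(missing_per_column)
print(id_summary)
print(len(with_directions))
```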

In [15]:
recipes.loc[recipes['Directions'].isna()]

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID
292,Sunshine Cake Recipe,9,https://images.media-allrecipes.com/userphotos...,Helga,X,X,X,"egg,white sugar,water,flour,baking powder,salt...",,7378
3818,Cherry Pie I Recipe,11,https://images.media-allrecipes.com/userphotos...,Cali,X,X,X,"vanilla,gelatin,water,cherry,white sugar,corns...",,12251
4320,Quick Clam Chowder Recipe,42,https://images.media-allrecipes.com/userphotos...,Lew Sweet,X,X,X,"england,potato,celery,clam,onion,celery,margar...",,12981
6238,Hot Clam Dip II Recipe,30,https://images.media-allrecipes.com/userphotos...,lara,X,X,X,"bread,cream cheese,onion,beer,worcestershire s...",,15648
9330,Chocolate Sausage (Salame di Cioccolato) Recipe,2,https://images.media-allrecipes.com/images/795...,Manuela,X,X,X,"white sugar,egg,egg,butter,cocoa powder,cookie...",,21002
10676,Oatmeal Kiss Cookies Recipe,6,https://images.media-allrecipes.com/userphotos...,HersheysKitchens.com,X,X,X,"chocolate,butter,shortening,white sugar,brown ...",,24031


In [8]:
recipes.loc[recipes['Recipe Name'] == 'Old'].head(5)

Unnamed: 0,Recipe Name,Review Count,Recipe Photo,Author,Prepare Time,Cook Time,Total Time,Ingredients,Directions,RecipeID
133,Old,7,https://images.media-allrecipes.com/userphotos...,MARBALET,X,X,X,"flour,baking powder,salt,shortening,white suga...","In a medium bowl mix together the flour, bakin...",7182
278,Old,10,https://images.media-allrecipes.com/userphotos...,Winona,30 m,30 m,1 h,"raisin,water,shortening,egg,flour,cinnamon,nut...",Preheat oven to 350 degrees F (175 degrees C)....,7363
1634,Old,171,https://images.media-allrecipes.com/userphotos...,Cali,X,X,X,"cream,egg,salt,butter,paprika,black pepper",Preheat oven to 350 degrees F (175 degrees C)....,9166
1803,Old,119,https://images.media-allrecipes.com/userphotos...,Juanita,15 m,2 h 30 m,2 h 45 m,"egg,milk,white sugar,rice,butter,vanilla,raisi...",Preheat oven to 300 degrees F (150 degrees C)....,9402
1814,Old,522,https://images.media-allrecipes.com/userphotos...,BOOK_WORM,X,X,X,"white sugar,cocoa,milk,butter,vanilla",Grease an 8x8 inch square baking pan. Set asid...,9420


In [10]:
reviews.head(10)

Unnamed: 0,RecipeID,profileID,Rate
0,7000,675719,5.0
1,7000,1478626,5.0
2,7000,608663,5.0
3,7000,2785736,5.0
4,7000,594474,5.0
5,7000,5468,5.0
6,7000,2926455,5.0
7,7000,1896099,5.0
8,7000,25495,4.0
9,7000,539102,5.0


In [11]:
reviews.describe(include="all")

Unnamed: 0,RecipeID,profileID,Rate
count,1563566.0,1563566.0,1563566.0
mean,16433.07,4255404.0,4.506039
std,5617.319,4841387.0,0.8861726
min,7000.0,16.0,1.0
25%,11815.0,1123636.0,4.0
50%,16080.0,2517790.0,5.0
75%,21135.0,5094301.0,5.0
max,27511.0,24896380.0,5.0


In [12]:
# Change the number format so floats are shown in full rather than in scientific notation
with pd.option_context('display.float_format', '{:f}'.format):
    display(reviews.describe(include=[np.number]))

Unnamed: 0,RecipeID,profileID,Rate
count,1563566.0,1563566.0,1563566.0
mean,16433.072111,4255403.731482,4.506039
std,5617.319181,4841387.469674,0.886173
min,7000.0,16.0,1.0
25%,11815.0,1123636.0,4.0
50%,16080.0,2517790.5,5.0
75%,21135.0,5094301.0,5.0
max,27511.0,24896382.0,5.0


I'd like to know how many unique recipe IDs there are in both datasets, recipes and reviews. 
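
One way to get at this, sketched on two made-up frames that stand in for `recipes` and `reviews` (the IDs and ratings are invented): count the distinct IDs on each side, then compare the two sets to see which IDs appear in only one of them.

```python
import pandas as pd

recipes_toy = pd.DataFrame({"RecipeID": [7000, 7001, 7003]})
reviews_toy = pd.DataFrame({
    "RecipeID": [7000, 7000, 7001, 9999],
    "Rate": [5.0, 4.0, 5.0, 3.0],
})

# Number of distinct recipe IDs in each dataset
n_recipe_ids = recipes_toy["RecipeID"].nunique()
n_review_ids = reviews_toy["RecipeID"].nunique()

# IDs present on one side only
recipe_ids = set(recipes_toy["RecipeID"])
review_ids = set(reviews_toy["RecipeID"])
unreviewed = recipe_ids - review_ids   # recipes with no reviews
orphaned = review_ids - recipe_ids     # reviews pointing at an unknown recipe

print(n_recipe_ids, n_review_ids, unreviewed, orphaned)
```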

In [13]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1563566 entries, 0 to 1563565
Data columns (total 3 columns):
RecipeID     1563566 non-null int64
profileID    1563566 non-null int64
Rate         1563566 non-null float64
dtypes: float64(1), int64(2)
memory usage: 35.8 MB


In [14]:
# Keep reviews whose RecipeID occurs more than once, then sort that subset by RecipeID
dupRecipes = reviews[reviews.duplicated(['RecipeID'],keep=False)]
dupRecipes = dupRecipes.sort_values(by='RecipeID')
dupRecipes.head(15)

Unnamed: 0,RecipeID,profileID,Rate
0,7000,675719,5.0
168,7000,681370,5.0
169,7000,1532140,4.0
170,7000,2724635,4.0
171,7000,2691767,5.0
172,7000,2554450,5.0
173,7000,1968152,5.0
174,7000,888201,5.0
175,7000,1694031,5.0
176,7000,1241884,5.0
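
Since a recipe normally has many reviews, repeated RecipeIDs are expected here, and a per-recipe summary may be more useful than the duplicate rows themselves. A possible sketch on made-up review rows follows. The column names match `reviews`, but the output names `n_reviews` and `mean_rate` are my own.

```python
import pandas as pd

reviews_toy = pd.DataFrame({
    "RecipeID": [7000, 7000, 7000, 7001],
    "profileID": [101, 102, 103, 104],
    "Rate": [5.0, 5.0, 4.0, 3.0],
})

# One row per recipe: how many ratings it has and their average
per_recipe = reviews_toy.groupby("RecipeID")["Rate"].agg(
    n_reviews="count",
    mean_rate="mean",
)
print(per_recipe)
```

A table like this could later be joined to `recipes` on RecipeID if ratings end up being part of the 'outcome'.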


In [1]:
from time import sleep
from selenium import webdriver

ModuleNotFoundError: No module named 'selenium'