# Finding Meals that Conform to Your Taste Within Budget

## Introduction 

According to a recent [survey](https://www.iol.co.za/lifestyle/food-drink/many-cookbooks-yet-same-old-supper-1824825), although Americans own an average of 6 cookbooks, they make the same 9 meals on rotation. This lack of variety likely stems in part from the fear that a new meal will not suit the cook's taste profile and will go to waste.

### Project Objective

In this project, my aim is to help users discover new meals that match their specific tastes and fit within their budget.

### Datasets

1. **Instacart Market Basket Analysis Datasets**
    - orders
    - aisles: 132 unique store aisles
    - departments: 24 unique departments
    - products: 49.7k unique products


2. **Mariano's Grocery Prices**


3. **Simply Recipes recipe database**

### Summary of Results

*Results summary to be added.*

## Data Collection - Web Crawling

To start the project, we need to collect three types of data:
* First, we need a dataset of users' food preferences. Because food rating data is difficult to come by, we can instead use point-of-purchase grocery store data and rely on implicit feedback (i.e., assume that customers who bought an item liked it).

* Second, we need a repository of diverse recipes. Although a number of recipe datasets are available online, none that I found has all the required attributes, so this data will need to be collected from a website via web scraping.

* Third, we need a database of grocery prices for pricing our recipes. Because datasets with prices are few and far between and quickly become outdated, we will need to collect grocery pricing data ourselves from a store website.
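
To make the implicit feedback idea from the first bullet concrete, here is a minimal sketch. The `purchases` table and its values are made up for illustration; the only assumption is that each row records one purchase of a product by a user.

```python
import pandas as pd

# Hypothetical point-of-purchase records: one row per purchase.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "product_name": ["Banana", "Banana", "Whole Milk", "Banana", "Tofu", "Tofu"],
})

# Count how often each user bought each product. The count can serve as a
# rough confidence that the user likes the item.
purchase_counts = pd.crosstab(purchases["user_id"], purchases["product_name"])

# Binarize: 1 if the user ever bought the item, 0 otherwise.
implicit_feedback = (purchase_counts > 0).astype(int)
print(implicit_feedback)
```

The binary matrix treats "bought at least once" as "liked", while the raw counts keep more signal for items that are bought repeatedly. Which of the two works better is something to check against the model, not something this sketch decides.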

In [35]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import sys
import warnings
warnings.filterwarnings('ignore')

from selenium import webdriver

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

from d00_utils import utils
from d01_data import clean_data
from d01_data.web_scraping import sr_scraping, marianos_insta_scraping



### Recipe Web Scraping - simplyrecipes.com

After researching a number of sites, I chose to scrape simplyrecipes.com for two reasons:

* The website contains diverse recipes that could appeal to a range of palates
* Each recipe comes pre-tagged with meal and dietary preferences

If you would like to scrape the website yourself, run ```sr_scraping()``` in a cell within this notebook. The full script takes around 1.5 hours to run, and all files are saved to the ```data/01_raw/simply_recipes``` folder. For convenience, I read in the already scraped and concatenated dataset here.

In [10]:
recipes_sr_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/simply_recipes/simply_recipes*')
recipes_sr_orig.drop(columns='Unnamed: 0', inplace=True)
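
The source of `utils.read_multiple_csv_and_concat` is not shown in this notebook. As a rough sketch of what such a helper might look like, assuming it simply globs for matching files and stacks them (the function name `read_csvs_and_concat` below is hypothetical):

```python
import glob
import os
import tempfile

import pandas as pd


def read_csvs_and_concat(pattern):
    """Read every CSV matching a glob pattern and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Usage with two throwaway files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"title": ["A"]}).to_csv(os.path.join(tmp_dir, "simply_recipes_1.csv"))
    pd.DataFrame({"title": ["B"]}).to_csv(os.path.join(tmp_dir, "simply_recipes_2.csv"))
    demo = read_csvs_and_concat(os.path.join(tmp_dir, "simply_recipes*"))

print(demo.columns.tolist())  # ['Unnamed: 0', 'title']
```

The demo also shows where a column like `Unnamed: 0` can come from: `to_csv` writes the index by default, and `read_csv` reads it back as an unnamed column.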

In [18]:
recipes_sr_inter = clean_data.intermediate_clean_recipes_sr(recipes_sr_orig)

At this point we have 1752 recipe entries. A quick glance shows that some entries marked as recipes are actually how-to guides. These need to be removed because they don't provide ingredients to base our model on.

In [22]:
recipes_sr_inter.head(3)

| | title | prep_time | cook_time | tags | ingredients | recipe_yield | byline | link_food |
|---|---|---|---|---|---|---|---|---|
| 0 | Grilled Cheese BLT | 10 minutes | 10 minutes | ['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ... | [8 slices sourdough bread, 4 tablespoon unsalt... | 4 sandwiches | Aaron Hutcherson | https://www.simplyrecipes.com/recipes/grilled_... |
| 1 | Pulled Pork Sandwich | 10 minutes | 2 hours, 45 minutes | ['Dinner', 'Sandwich', 'Budget', 'Comfort Food... | [For the sauce:, 1 large onion, chopped, 6 gar... | Serves 6 to 8 | Elise Bauer | https://www.simplyrecipes.com/recipes/pulled_p... |
| 2 | How to Make Bacon in the Oven | 5 minutes | 20 minutes | ['Tips', 'Breakfast and Brunch', 'Baking', 'Ho... | [12 strips bacon, 1/2 teaspoon ground black pe... | 12 strips | Nick Evans | https://www.simplyrecipes.com/recipes/how_to_m... |
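
One simple way to drop the how-to guides is to filter on the title. The sketch below is only an illustration: the "How to" prefix rule is my assumption, not necessarily what `clean_data` does, and the toy DataFrame just reuses the three titles from the output above.

```python
import pandas as pd

recipes = pd.DataFrame({
    "title": [
        "Grilled Cheese BLT",
        "Pulled Pork Sandwich",
        "How to Make Bacon in the Oven",
    ],
})

# Assumed heuristic: how-to guides have titles that start with "How to".
is_how_to = recipes["title"].str.lower().str.startswith("how to")

# Keep only the rows that are not flagged as how-to guides.
recipes_only = recipes.loc[~is_how_to].reset_index(drop=True)
print(recipes_only["title"].tolist())
```

A title rule is easy to audit by eye, but it will miss guides with other title patterns, so it would be worth cross-checking against the `tags` column (for example the `Tips` tag visible on the bacon entry).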


### Grocery Price Web Scraping - Mariano's (Chicago)

If you would like to scrape the website yourself, run ```marianos_insta_scraping()``` in a cell within this notebook. The full script takes around 5 hours to run, and all files are saved to the ```data/01_raw``` folder as ```prod_aile_*```. Once you run the script, Selenium will open a browser window and you'll need to sign in to Instacart with your username and password. After 80 seconds, the script will begin scraping the site on its own.

Let's go ahead and read in the concatenated files from Mariano's.

In [38]:
grocery_prices_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/grocery_prices_marianos/prod_aile*')

In [39]:
grocery_prices_orig.head()

| | product | unit_price | item_size | prod_aile |
|---|---|---|---|---|
| 0 | Halls Defense Dietary Supplement Drops, Assort... | $1.79 | `<li class="item-card" data-radium="true"><div ...` | Cold, Flu & Allergy |
| 1 | Halls Suppressant/Oral Anesthetic Halls Relief... | $1.79 | `<li class="item-card" data-radium="true"><div ...` | Cold, Flu & Allergy |
| 2 | Kroger Co. Mucus Relief Expectorant & Cough Su... | $9.29 | `<li class="item-card" data-radium="true"><div ...` | Cold, Flu & Allergy |
| 3 | Ricola Sugar Free Lemon Mint Herb Throat Drops | $2.29 | `<li class="item-card" data-radium="true"><div ...` | Cold, Flu & Allergy |
| 4 | Benadryl Allergy Ultratabs Tablets | $4.99 | `<li class="item-card" data-radium="true"><div ...` | Cold, Flu & Allergy |


In [40]:
grocery_prices_inter = clean_data.intermediate_clean_marianos_prices(grocery_prices_orig)

In [42]:
grocery_prices_inter.head(3)

| | product | main_price | prod_aile | price_per_lb | measure_words_main_price | item_weight_count_vol | date_collected | store | location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Halls Defense Dietary Supplement Drops, Assort... | $1.79 | Cold, Flu & Allergy | | | 30 count | 2019-08-28 | Marianos | 60615 |
| 1 | Halls Suppressant/Oral Anesthetic Halls Relief... | $1.79 | Cold, Flu & Allergy | | | 30 count | 2019-08-28 | Marianos | 60615 |
| 2 | Kroger Co. Mucus Relief Expectorant & Cough Su... | $9.29 | Cold, Flu & Allergy | | | 14 count | 2019-08-28 | Marianos | 60615 |
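
The prices are still strings such as `$1.79`, so they need to become numbers before any recipe can be priced. Below is a minimal sketch of one way to do that, using the three rows shown above. The new column names (`price_usd`, `quantity`, `unit`, `price_per_unit`) are hypothetical and are not part of `clean_data`.

```python
import pandas as pd

prices = pd.DataFrame({
    "main_price": ["$1.79", "$1.79", "$9.29"],
    "item_weight_count_vol": ["30 count", "30 count", "14 count"],
})

# Strip the dollar sign and convert the price to a float.
prices["price_usd"] = (
    prices["main_price"].str.replace("$", "", regex=False).astype(float)
)

# Split a size like "30 count" into a numeric quantity and a unit label.
size_parts = prices["item_weight_count_vol"].str.extract(
    r"^(?P<quantity>\d+(?:\.\d+)?)\s*(?P<unit>.+)$"
)
prices["quantity"] = size_parts["quantity"].astype(float)
prices["unit"] = size_parts["unit"]

# Price per single unit, e.g. dollars per drop or per tablet.
prices["price_per_unit"] = prices["price_usd"] / prices["quantity"]
print(prices[["price_usd", "quantity", "unit", "price_per_unit"]])
```

This pattern only covers sizes written as a number followed by a unit. Sizes in other formats would not match the pattern, so the real data would need a check for unmatched rows first.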


## Exploratory Data Analysis

In [2]:
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')
order = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')
order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')

Let's combine the Instacart Kaggle datasets in a way that lets us see what is in each user's basket.

In [6]:
instacart_baskets = clean_data.combine_instacart_kaggle_datasets(aisles, 
                                                                 departments, 
                                                                 order, 
                                                                 order_products__prior, 
                                                                 products)
instacart_baskets.head()
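
The body of `clean_data.combine_instacart_kaggle_datasets` is not shown here. As a hedged sketch of the kind of join it presumably performs, the toy tables below are chained together with left merges on their shared ID columns. All values are made up, and only a few of the real columns are included.

```python
import pandas as pd

# Tiny stand-ins for the Instacart tables (illustrative values only).
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["fresh fruits", "milk"]})
departments = pd.DataFrame(
    {"department_id": [1, 2], "department": ["produce", "dairy eggs"]}
)
products = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Banana", "Whole Milk"],
    "aisle_id": [1, 2],
    "department_id": [1, 2],
})
order = pd.DataFrame({"order_id": [100, 101], "user_id": [7, 8]})
order_products = pd.DataFrame(
    {"order_id": [100, 100, 101], "product_id": [10, 20, 10]}
)

# Start from one row per product in an order, then attach order, product,
# aisle, and department details with left joins.
baskets = (
    order_products
    .merge(order, on="order_id", how="left")
    .merge(products, on="product_id", how="left")
    .merge(aisles, on="aisle_id", how="left")
    .merge(departments, on="department_id", how="left")
)
print(baskets)
```

Starting from the order-products table and using left joins keeps exactly one row per purchased item, so the row count of the result should equal the row count of the order-products table.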

In [8]:
instacart_baskets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  int64
product_id                int64
add_to_cart_order         int64
reordered                 int64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
product_name              object
aisle_id                  int64
department_id             int64
aisle                     object
department                object
dtypes: float64(1), int64(10), object(4)
memory usage: 3.9+ GB


The highest number of baskets for any single customer is 99. Each of the 5 customers below placed 99 different orders.

In [9]:
pd.DataFrame(instacart_baskets.groupby('user_id')['order_id']\
             .nunique()).sort_values('order_id', ascending=False)\
             .head(5)

| user_id | order_id |
|---|---|
| 152340 | 99 |
| 185641 | 99 |
| 185524 | 99 |
| 81678 | 99 |
| 70922 | 99 |
