This notebook builds a scraper to collect meals from Tina's favorite recipe blogs. Each scraped meal should get a tag saying where it came from.

The only purpose is to save us the time of entering all the meals manually up front.

In [1]:
import json
import pickle
import os

In [2]:
os.listdir()

['.git',
 '.ipynb_checkpoints',
 'Beebes Meal Planner',
 'beebes_apps_logo',
 'debug.log',
 'functions.py',
 'meals.pkl',
 'meal_planner_app.ipynb',
 'meal_scraper.ipynb',
 'README.txt',
 'Troubleshooting.ipynb',
 '__pycache__']

In [3]:
# Load meals so far (the with-block closes the file once loading is done)
with open('meals.pkl', 'rb') as f:
    meals = pickle.load(f)

In [4]:
len(meals)

13

In [163]:
# Remember format of meal data
display(meals[0])

{'name': 'better than takeout sweet thai basil chicken',
 'ingredients': {'ground chicken': '1 pound',
  'seasame oil': '2 tablespoons',
  'black pepper': 'to taste',
  'garlic': '4 cloves',
  'ginger': '1 inch, grated',
  'bell pepper, red or orange': '2, chopped',
  'cashews': '1/2 cup',
  'soy sauce, low sodium': '1/2 cup',
  'fish sauce': '2 tablespoons',
  'honey': '1/4 cup',
  'chili paste, sambal oelek': '2-3 tablespoons',
  'basil': '1 cup, torn',
  'mint': '1/4 cup, torn',
  'rice, white or brown': 'as needed',
  'mango': '1, sliced or diced'},
 'instructions': {1: 'Heat the oil in a large skillet over medium heat. When the oil shimmers, add the chicken. Season with black pepper and brown all over, breaking the chicken up as it cooks, about 5 minutes. Add the garlic, ginger, peppers, and cashews, cook another 2-3 minutes, until the garlic is fragrant. Pour in the soy sauce, fish sauce, chili paste, and honey. Bring the sauce to a boil over medium-high heat and cook until the s

In [5]:
# Imports
import urllib.request
from bs4 import BeautifulSoup

In [58]:
# Specify url to scrape
url = 'https://www.skinnytaste.com/recipes/dinner-recipes/'

In [59]:
# User-Agent header (tells the server which browser is asking, so the request looks like normal browser traffic)
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

In [60]:
# Request the page data with user_agent
req = urllib.request.Request(
    url, 
    data=None, 
    headers={
        'User-Agent': user_agent
    }
)

In [61]:
# open the url data and save to page
page = urllib.request.urlopen(req)

In [62]:
# Use BeautifulSoup to parse the HTML data
soup = BeautifulSoup(page, 'lxml')

Now that I have my soup object, I need to examine the HTML structure of the page and find the elements I'm looking for (the meals). The goal is to build a list of all the meals on the page, then go through each meal and extract its details into a dictionary, collecting those dictionaries in a list.

__Note to future self:__
Use Chrome's Inspect tool (right-click an element on the webpage and choose "Inspect" to highlight it in the HTML code).

In [81]:
# Initialize storage

# This is where I'll store the initial list of all the meals on the page
meals_to_scrape = []

# This is where I'll store the meal dicts to be combined with the rest of my meals
meals_scraped = []

In [64]:
# Identify the HTML objects on the page that I need
    ### USED Google Chrome right-click "Inspect" to find the class of div I need

# This stores all of the archived posts on the page (there are many pages of archives though...)
table = soup.find_all('div', {'class': 'archive-post'})

In [82]:
# Create a list of all the meals on the page

# In general, you make a table of all the elements you want, then can access various parts of them in a loop as below:
# Here I am finding the first link ('a') and accessing the text portion of it (which is the name of the meal)
for x in table:
    # for troubleshooting as I set this up - to make sure I was getting desired details
    #print(x.find('a').text)
    meals_to_scrape.append(x.find('a').text)

## Meal Details Function

I'll need a function that takes the name of a meal, then goes to the website and parses the details for that meal into my dictionary format.

To do this I'll need to spend time understanding the layout of each meal page on the site.

I'll also need to keep in mind the desired end format of my dictionary for meal details.

---
This function will then be called on each name in the list, meals_to_scrape, to get the details of every meal.

In [91]:
# Create a function that takes the meal name, and returns the format for the website link
# this will be used inside my get_meal_details() function
def name_to_link(name):
    return name.strip().lower().replace(' ','-')
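The replacement above only handles spaces, so a title containing an apostrophe or an ampersand would produce a link that probably does not exist. Below is a minimal sketch of a more defensive version. It assumes, without confirmation from the site, that recipe URLs drop punctuation and use a single hyphen between words; `name_to_slug` is a hypothetical name, not something already in this notebook.

```python
import re

def name_to_slug(name):
    """Hypothetical variant of name_to_link; the slug rules are an assumption."""
    slug = name.strip().lower()
    # Drop straight and curly apostrophes so "mom's" becomes "moms", not "mom-s"
    slug = slug.replace("'", "").replace("\u2019", "")
    # Replace every run of characters other than a-z and 0-9 with one hyphen
    slug = re.sub(r'[^a-z0-9]+', '-', slug)
    # Trim any hyphen left over at either end
    return slug.strip('-')
```

For plain titles such as 'instant pot baked ziti' this gives the same result as `name_to_link`. Accented letters are treated as separators here, which may or may not match the real URLs.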

In [166]:
# Create a function that takes in the name of a meal, then returns a dictionary with the details of that meal

### This function took a while to build...don't lose it!

def get_meal_details(name, blog_name):    
    # turn the name of the meal into the website link
    link = name_to_link(name)
    
    ### The problem with having this as an input is that each blog is set up differently, probably need a 
    ### separate function for each.
    blogs_dict={'skinny taste': 'https://www.skinnytaste.com/'}
    
    # go to that webpage
    url = blogs_dict[blog_name.lower()]+link
    
    
    # Request the page data with user_agent
    req = urllib.request.Request(
        url, 
        data=None, 
        headers={
            'User-Agent': user_agent
                }
    )
    
    # open the url data and save to page
    page = urllib.request.urlopen(req)
    
    # Use BeautifulSoup to parse the HTML data
    soup = BeautifulSoup(page, 'lxml')
    
    # Get the class 'wprm-recipe-container'
    x = soup.find('div',{'class':'wprm-recipe-container'})
    
    ### Details to return
    
    # Meal name:
    name = x.h2.text
    
    # Ingredients:
    ingredients = []

    # Access the ingredient area (the first div with class 'wprm-recipe-ingredient-group')
    table = soup.find('div',{'class':'wprm-recipe-ingredient-group'})

    # Append the text of each li inside that div to ingredients
    for li in table.find_all('li'):
        ingredients.append(li.text)
    
    # Instructions
    instructions = []

    table = soup.find('div',class_='wprm-recipe-instructions-container')

    # Append the text of each li inside the container to instructions
    for li in table.find_all('li'):
        instructions.append(li.text)
    
    # Cuisine
    cuisine = []
    cuisine.append(soup.find('span',{'class':'wprm-recipe-cuisine wprm-block-text-normal'}).text.lower())
    
    # Tags
    tags = [blog_name.lower()]
    
    
    # Build it all into a dictionary
    meal_dict = {'name':name,
                'ingredients':ingredients,
                'instructions':instructions,
                'cuisine':cuisine,
                'tags':tags}
    
    return meal_dict

In [167]:
# Testing my function
display(get_meal_details('instant pot baked ziti','skinny taste'))

{'name': 'Instant Pot Baked Ziti',
 'ingredients': ['1 teaspoon olive oil',
  '3 garlic cloves, smashed with the side of a knife',
  '2 cups chopped baby spinach',
  '2 cups water',
  '3/4 teaspoon Kosher salt',
  '10 ounces Delallo whole wheat pasta such as ziti or cavatappi, about 3 cups ',
  '2 cups homemade or jarred marinara sauce',
  '1/2 cup part skim ricotta',
  '1/4 cup grated Pecorino Romano',
  '1 cup part-skim mozzarella cheese, grated'],
 'instructions': ['Using the saute button, when hot add the oil and garlic; stir 1 minute, or until golden.',
  'Add water and salt to the pot to deglaze, making sure the garlic is not stuck to the bottom of the pot.',
  'Add spinach and pasta and stir.',
  "Pour the marinara sauce evenly over the uncooked pasta, making sure it's covering all the pasta. Do not stir.",
  'Cover and cook high pressure 7 minutes.',
  'Quick release, then open the lid, stir the pasta, dollop in the ricotta, top with Pecorino and the mozzarella.',
  'Cover the 

## Awesome! - Loop through list of meals

So that function works. 

Now I just need to loop through the names from the original list and call the function on each one.

In [170]:
# Loop through all the meals in my list of meals to scrape
for meal in meals_to_scrape:
    try:
        meals_scraped.append(get_meal_details(meal, 'skinny taste'))
    except Exception:
        # Skip meals whose page is missing or laid out differently
        continue
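The loop above skips failed meals without saying which ones they were. A rough sketch of one way to keep track of them follows. `scrape_all`, `fetch`, and `delay_seconds` are hypothetical names introduced only for illustration; the pause between requests is an optional courtesy to the server, not something the site is known to require.

```python
import time

def scrape_all(names, fetch, delay_seconds=0.0):
    """Call fetch(name) for each name and return (results, failures)."""
    results, failures = [], []
    for name in names:
        try:
            results.append(fetch(name))
        except Exception as err:
            # Keep the name and the reason so the page can be checked by hand
            failures.append((name, repr(err)))
        # Optional pause between requests
        if delay_seconds:
            time.sleep(delay_seconds)
    return results, failures
```

In this notebook it could be called as `meals_scraped, failed = scrape_all(meals_to_scrape, lambda n: get_meal_details(n, 'skinny taste'), delay_seconds=1)`, after which `failed` lists each skipped name alongside the error it raised.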

In [172]:
print(len(meals_scraped))
display(meals_scraped)

26


[{'name': 'Instant Pot Baked Ziti',
  'ingredients': ['1 teaspoon olive oil',
   '3 garlic cloves, smashed with the side of a knife',
   '2 cups chopped baby spinach',
   '2 cups water',
   '3/4 teaspoon Kosher salt',
   '10 ounces Delallo whole wheat pasta such as ziti or cavatappi, about 3 cups ',
   '2 cups homemade or jarred marinara sauce',
   '1/2 cup part skim ricotta',
   '1/4 cup grated Pecorino Romano',
   '1 cup part-skim mozzarella cheese, grated'],
  'instructions': ['Using the saute button, when hot add the oil and garlic; stir 1 minute, or until golden.',
   'Add water and salt to the pot to deglaze, making sure the garlic is not stuck to the bottom of the pot.',
   'Add spinach and pasta and stir.',
   "Pour the marinara sauce evenly over the uncooked pasta, making sure it's covering all the pasta. Do not stir.",
   'Cover and cook high pressure 7 minutes.',
   'Quick release, then open the lid, stir the pasta, dollop in the ricotta, top with Pecorino and the mozzarella

### Note: Change original file data structure

I need to change my meal format so ingredients and instructions are just lists for now.
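A minimal sketch of that conversion, assuming the old format shown in `meals[0]` (ingredients as a dict of item to amount, instructions as a dict keyed by step number). `to_list_format` is a hypothetical helper, and joining amount and item as "amount item" is only one possible choice; entries such as 'to taste' will read a little awkwardly.

```python
def to_list_format(meal):
    """Return a copy of an old-format meal with list-valued fields."""
    converted = dict(meal)

    ingredients = meal.get('ingredients', [])
    if isinstance(ingredients, dict):
        # {'honey': '1/4 cup'} -> ['1/4 cup honey']
        converted['ingredients'] = [
            f'{amount} {item}' for item, amount in ingredients.items()
        ]

    instructions = meal.get('instructions', [])
    if isinstance(instructions, dict):
        # Order the steps by their numeric key, then drop the keys
        converted['instructions'] = [
            instructions[step] for step in sorted(instructions)
        ]

    return converted
```

Meals that already use lists pass through unchanged, so something like `meals = [to_list_format(m) for m in meals]` could be run on a mix of old and scraped meals.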