This notebook builds a scraper that collects meals from Tina's favorite places. Be sure to tag each meal with the place it came from.

Its only purpose is to save us the time of entering all the meals by hand up front.

In [1]:
import json
import pickle
import os

In [2]:
os.listdir()

['.git',
 '.ipynb_checkpoints',
 'Beebes Meal Planner',
 'beebes_apps_logo',
 'debug.log',
 'functions.py',
 'meals.pkl',
 'meal_planner_app.ipynb',
 'meal_scraper.ipynb',
 'README.txt',
 'Troubleshooting.ipynb',
 '__pycache__']

In [82]:
# Load meals so far (the with block closes the file once loading is done)
with open('meals.pkl', 'rb') as f:
    meals = pickle.load(f)

In [55]:
len(meals)

13

In [56]:
# Remember format of meal data
display(meals[0])

{'name': 'better than takeout sweet thai basil chicken',
 'ingredients': {'ground chicken': '1 pound',
  'seasame oil': '2 tablespoons',
  'black pepper': 'to taste',
  'garlic': '4 cloves',
  'ginger': '1 inch, grated',
  'bell pepper, red or orange': '2, chopped',
  'cashews': '1/2 cup',
  'soy sauce, low sodium': '1/2 cup',
  'fish sauce': '2 tablespoons',
  'honey': '1/4 cup',
  'chili paste, sambal oelek': '2-3 tablespoons',
  'basil': '1 cup, torn',
  'mint': '1/4 cup, torn',
  'rice, white or brown': 'as needed',
  'mango': '1, sliced or diced'},
 'instructions': {1: 'Heat the oil in a large skillet over medium heat. When the oil shimmers, add the chicken. Season with black pepper and brown all over, breaking the chicken up as it cooks, about 5 minutes. Add the garlic, ginger, peppers, and cashews, cook another 2-3 minutes, until the garlic is fragrant. Pour in the soy sauce, fish sauce, chili paste, and honey. Bring the sauce to a boil over medium-high heat and cook until the s

In [6]:
# Imports
import urllib.request
from bs4 import BeautifulSoup

In [7]:
# Specify url to scrape
url = 'https://www.skinnytaste.com/recipes/dinner-recipes/'

In [8]:
# User Agent (provides details of web browser to server so it knows traffic is legit)
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

In [9]:
# Request the page data with user_agent
req = urllib.request.Request(
    url, 
    data=None, 
    headers={
        'User-Agent': user_agent
    }
)
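
The same request is built again in the pagination loop and in `get_meal_details` further down, so it could be wrapped in a small helper. A minimal sketch, assuming a hypothetical `build_request` function that is not part of this notebook; it only constructs the request and never touches the network:

```python
import urllib.request

# Same user agent string as in In [8]
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36')


def build_request(url, user_agent=USER_AGENT):
    """Hypothetical helper: return a GET request that sends a browser-like User-Agent."""
    return urllib.request.Request(url, data=None, headers={'User-Agent': user_agent})


req = build_request('https://www.skinnytaste.com/recipes/dinner-recipes/')
print(req.full_url)
print(req.get_header('User-agent'))  # urllib stores header names capitalized
```

The result can be passed to `urllib.request.urlopen` exactly like the `req` object built by hand.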

In [10]:
# open the url data and save to page
page = urllib.request.urlopen(req)

In [11]:
# Use BeautifulSoup to parse the HTML data
soup = BeautifulSoup(page, 'lxml')

Now that I have my soup object, I need to examine the page's HTML structure and find the elements I'm looking for (the meals). The goal is to build a list of all the meals on the page, then go through each meal and extract its details into a dictionary, collecting those dictionaries in a list.

__Note to future self:__
Use Chrome's Inspect tool (right-click an element on the page and choose "Inspect" to highlight it in the HTML code)

In [12]:
# Initialize storage

# This is where I'll store the initial list of all the meals on the page
meals_to_scrape = []

# This is where I'll store the meal dicts to be combined with the rest of my meals
meals_scraped = []

In [13]:
# Identify the HTML objects on the page that I need
    ### USED Google Chrome right-click "Inspect" to find the class of div I need

# This stores all of the archive posts on the page (the archive spans many pages, though...)
# find_all is the current name for findAll
table = soup.find_all('div', {'class': 'archive-post'})

In [14]:
# Create a list of all the meals on the page

# In general, you make a table of all the elements you want, then can access various parts of them in a loop as below:
# Here I am finding the first link ('a') and accessing the text portion of it (which is the name of the meal)
for x in table:
    # for troubleshooting as I set this up - to make sure I was getting desired details
    #print(x.find('a').text)
    meals_to_scrape.append(x.find('a').text)

## Other Pages with Dinner Meals

So far I've got the meals from the first page of dinner meals. Following pages are displayed as:

https://www.skinnytaste.com/recipes/dinner-recipes/page/3/

with the number at the end increasing until there are no more meals.

There are 21 pages, so I'll write a loop that repeats the steps above for pages 2 through 21.
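
The page-numbering scheme described above can be sketched as a small URL builder. This assumes the first page lives at the base URL with no `/page/N/` suffix, as in In [7]; `page_urls` is a hypothetical helper name, not something defined elsewhere in this notebook:

```python
BASE_URL = 'https://www.skinnytaste.com/recipes/dinner-recipes/'


def page_urls(base_url, last_page):
    """Return the archive URLs for pages 1 through last_page."""
    urls = [base_url]  # page 1 has no /page/N/ suffix
    for n in range(2, last_page + 1):
        urls.append('{}page/{}/'.format(base_url, n))
    return urls


urls = page_urls(BASE_URL, 21)
print(len(urls))
print(urls[2])  # the page 3 URL shown above
```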

In [21]:
# Check meals_to_scrape from first page (to compare to our final list)
len(meals_to_scrape)

30

In [22]:
# Loop through the remaining pages (2 through 21)
for page_num in range(2, 22):
    # Build the URL for this page
    url = 'https://www.skinnytaste.com/recipes/dinner-recipes/page/{}/'.format(page_num)
    
    # Request the page data with user_agent
    req = urllib.request.Request(
        url, 
        data=None, 
        headers={
            'User-Agent': user_agent
        }
    )
    
    # open the url data and save to page
    page = urllib.request.urlopen(req)
    
    # Use BeautifulSoup to parse the HTML data
    soup = BeautifulSoup(page, 'lxml')
    
    table = soup.find_all('div', {'class': 'archive-post'})
    
    for x in table:
        meals_to_scrape.append(x.find('a').text)

In [23]:
# Check new list of meals_to_scrape
len(meals_to_scrape)

610

## Meal Details Function

I'll need a function that takes the name of a meal, then goes to the website and parses the details for that meal into my dictionary format.

To do this I'll need to spend time understanding the layout of each meal page on the site.

I'll also need to keep in mind the desired end format of my dictionary for meal details.

---
Then this function will be used to loop through the list, meals_to_scrape, to get all the details of the meals.

In [24]:
# Create a function that takes the meal name, and returns the format for the website link
# this will be used inside my get_meal_details() function
def name_to_link(name):
    return name.strip().lower().replace(' ','-')
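
`name_to_link` only swaps spaces for hyphens, so a title containing punctuation produces a link with that punctuation still in it. A minimal sketch of a stricter slug builder follows; `slugify` is a hypothetical name, the sample title is made up, and the site's real slug rules are not known here (posts may also use custom slugs), so treat it as one possible approach rather than a fix:

```python
import re


def name_to_link(name):
    # Same logic as the notebook's helper, repeated so this sketch runs on its own
    return name.strip().lower().replace(' ', '-')


def slugify(name):
    """Hypothetical alternative: drop apostrophes, keep letters and digits, join with hyphens."""
    cleaned = re.sub(r"['\u2019]", '', name.lower())
    words = re.findall(r'[a-z0-9]+', cleaned)
    return '-'.join(words)


title = "Mom's Chicken & Rice"  # made-up title for illustration
print(name_to_link(title))
print(slugify(title))
```

For plain titles such as 'instant pot baked ziti' both functions return the same link.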

In [25]:
# Create a function that takes in the name of a meal, then returns a dictionary with the details of that meal

### This function took a while to build...don't lose it!

def get_meal_details(name, blog_name):    
    # turn the name of the meal into the website link
    link = name_to_link(name)
    
    ### The problem with having this as an input is that each blog is set up differently, probably need a 
    ### separate function for each.
    blogs_dict={'skinny taste': 'https://www.skinnytaste.com/'}
    
    # go to that webpage
    url = blogs_dict[blog_name.lower()]+link
    
    
    # Request the page data with user_agent
    req = urllib.request.Request(
        url, 
        data=None, 
        headers={
            'User-Agent': user_agent
                }
    )
    
    # open the url data and save to page
    page = urllib.request.urlopen(req)
    
    # Use BeautifulSoup to parse the HTML data
    soup = BeautifulSoup(page, 'lxml')
    
    # Get the class 'wprm-recipe-container'
    x = soup.find('div',{'class':'wprm-recipe-container'})
    
    ### Details to return
    
    # Meal name:
    name = x.h2.text
    
    # Ingredients:
    # Access the first ingredient group (a div that wraps the ul)
    group = soup.find('div', {'class': 'wprm-recipe-ingredient-group'})

    # find_all searches every descendant, so there is no need to loop over the
    # div's children first (child text nodes do not support find_all)
    ingredients = [li.text for li in group.find_all('li')]
    
    # Instructions
    container = soup.find('div', class_='wprm-recipe-instructions-container')
    instructions = [li.text for li in container.find_all('li')]
    
    # Cuisine
    cuisine = []
    cuisine.append(soup.find('span',{'class':'wprm-recipe-cuisine wprm-block-text-normal'}).text.lower())
    
    # Tags
    tags = ['skinny taste']
    
    
    # Build it all into a dictionary
    meal_dict = {'name':name,
                'ingredients':ingredients,
                'instructions':instructions,
                'cuisine':cuisine,
                'tags':tags}
    
    return meal_dict

In [26]:
# Testing my function
display(get_meal_details('instant pot baked ziti','skinny taste'))

{'name': 'Instant Pot Baked Ziti',
 'ingredients': ['1 teaspoon olive oil',
  '3 garlic cloves, smashed with the side of a knife',
  '2 cups chopped baby spinach',
  '2 cups water',
  '3/4 teaspoon Kosher salt',
  '10 ounces Delallo whole wheat pasta such as ziti or cavatappi, about 3 cups ',
  '2 cups homemade or jarred marinara sauce',
  '1/2 cup part skim ricotta',
  '1/4 cup grated Pecorino Romano',
  '1 cup part-skim mozzarella cheese, grated'],
 'instructions': ['Using the saute button, when hot add the oil and garlic; stir 1 minute, or until golden.',
  'Add water and salt to the pot to deglaze, making sure the garlic is not stuck to the bottom of the pot.',
  'Add spinach and pasta and stir.',
  "Pour the marinara sauce evenly over the uncooked pasta, making sure it's covering all the pasta. Do not stir.",
  'Cover and cook high pressure 7 minutes.',
  'Quick release, then open the lid, stir the pasta, dollop in the ricotta, top with Pecorino and the mozzarella.',
  'Cover the 

## Awesome! - Loop through list of meals

So that function works. 

Now I just need to loop through the names in meals_to_scrape and call the function on each one.
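
Some pages will fail to load or will not match the expected layout, so it helps to keep track of which meals were skipped and why. A minimal sketch of that pattern, assuming hypothetical names `scrape_all` and `fake_fetch`; the stand-in fetcher replaces `get_meal_details` so the example runs without network access:

```python
def scrape_all(names, fetch):
    """Call fetch(name) for every name; return (results, failures)."""
    results, failures = [], []
    for name in names:
        try:
            results.append(fetch(name))
        except Exception as err:
            # Keep the name and the error type so skipped meals can be reviewed later
            failures.append((name, type(err).__name__))
    return results, failures


def fake_fetch(name):
    # Stand-in for get_meal_details, for illustration only
    if 'broken' in name:
        raise ValueError('no recipe card found')
    return {'name': name}


scraped, skipped = scrape_all(['baked ziti', 'broken link meal'], fake_fetch)
print(len(scraped), skipped)
```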

In [27]:
# Loop through all the meals in my list of meals to scrape
for meal in meals_to_scrape:
    try:
        meals_scraped.append(get_meal_details(meal, 'skinny taste'))
    except Exception:
        # Skip meals whose page is missing or lacks the expected recipe markup
        continue

In [29]:
print(len(meals_scraped))
display(meals_scraped[:5])

387


[{'name': 'Instant Pot Baked Ziti',
  'ingredients': ['1 teaspoon olive oil',
   '3 garlic cloves, smashed with the side of a knife',
   '2 cups chopped baby spinach',
   '2 cups water',
   '3/4 teaspoon Kosher salt',
   '10 ounces Delallo whole wheat pasta such as ziti or cavatappi, about 3 cups ',
   '2 cups homemade or jarred marinara sauce',
   '1/2 cup part skim ricotta',
   '1/4 cup grated Pecorino Romano',
   '1 cup part-skim mozzarella cheese, grated'],
  'instructions': ['Using the saute button, when hot add the oil and garlic; stir 1 minute, or until golden.',
   'Add water and salt to the pot to deglaze, making sure the garlic is not stuck to the bottom of the pot.',
   'Add spinach and pasta and stir.',
   "Pour the marinara sauce evenly over the uncooked pasta, making sure it's covering all the pasta. Do not stir.",
   'Cover and cook high pressure 7 minutes.',
   'Quick release, then open the lid, stir the pasta, dollop in the ricotta, top with Pecorino and the mozzarella

### Note: Change original file data structure

I need to change my meal format so ingredients and instructions are just lists for now.
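
A minimal sketch of that conversion on a single meal, assuming the instruction keys are step numbers. `flatten_meal` is a hypothetical helper name and the sample is a shortened meal in the old format:

```python
def flatten_meal(meal):
    """Return a copy of meal with ingredients and instructions as lists of strings."""
    flat = dict(meal)
    # "amount name" strings, e.g. '1 can beef broth'
    flat['ingredients'] = ['{} {}'.format(amount, item)
                           for item, amount in meal['ingredients'].items()]
    # Sort by step number so the order does not depend on dict insertion order
    flat['instructions'] = [text for _, text in sorted(meal['instructions'].items())]
    return flat


sample = {
    'name': 'slow roasted beef',
    'ingredients': {'chuck beef roast': '3-4 pounds', 'beef broth': '1 can'},
    'instructions': {2: 'cook on high in crock pot for 5 hours.',
                     1: 'put everything in crock pot.'},
}
print(flatten_meal(sample))
```

Returning a copy leaves the original dict untouched, so the conversion can be rerun without errors.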

In [75]:
# Check the data types of my current meals
meals[1]

{'name': 'slow roasted beef',
 'ingredients': {'chuck beef roast': '3-4 pounds',
  'giardinera (mild)': '1 jar',
  'giardinera (hot)': '1 small jar',
  'beef broth': '1 can'},
 'instructions': {1: 'drain the jars of giardinera oil in a strainer, then put everything in crock pot.',
  2: 'cook on high in crock pot for 5 hours.'},
 'cuisine': ['american'],
 'tags': ['slow cooker']}

In [83]:
# Need to change ingredients and instructions into simple lists
# Note: run this cell only once; after it runs, the values are already lists

for i, meal in enumerate(meals):
    new_ingredients = []
    new_instructions = []
    
    # Turn each ingredient into an "amount name" string (value + key) and append to new_ingredients
    for name, amount in meal['ingredients'].items():
        new_ingredients.append(amount + ' ' + name)

    meals[i]['ingredients'] = new_ingredients
    
    # Keep only the instruction text (the dict values) as a plain list
    for step in meal['instructions'].values():
        new_instructions.append(step)
        
    meals[i]['instructions'] = new_instructions

In [77]:
meals[4]

{'name': 'bison lettuce cups with garnet yam home fries',
 'ingredients': ['2 unpeeled yams, well scrubbed',
  '2 teaspoons sunflower oil',
  '1 1/2 teaspoons sea salt',
  '1 teaspoon five spice powder',
  '1 teaspoon coconut oil',
  '1 pound ground bison (buffalo)',
  '1, chopped white onion, small',
  '2 carrots, shredded',
  '1 yellow summer squash or zucchini',
  '2 tablespoons fresh oregano, chopped',
  '2 tablespoons garlic powder',
  '1 teaspoon sweet paprika',
  '1/2 teaspoon black pepper',
  '1/2 teaspoon cayenne pepper',
  '1 large head lettuce',
  '1/4 cup green onions, chopped'],
 'instructions': ['preheat oven to 400',
  'cut yams into fries. toss with sunflower oil, 1 teaspoon of salt, and five spice powder until well coated. spread the fries evenly on a baking pan. bake for 25 to 30 minutes, or until the fries start to brown and edges are becoming crispy. turn off the oven, leaving fries in there until ready to serve.',
  'heat coconut oil in a large skillet over medium-

In [85]:
# Now join the meals and meals_scraped lists and save to file
print(len(meals_scraped))
print(len(meals))

387
13


In [84]:
meals_combined = meals + meals_scraped
print(len(meals_combined))

400


In [88]:
# Save new meals list to file
f = open('meals.pkl', 'wb')
pickle.dump(meals_combined, f)
f.close()
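
Because this cell overwrites `meals.pkl` in place, an interrupted write could leave the file unreadable. One possible safeguard is to write to a temporary file first and then swap it into place. A minimal sketch, assuming a hypothetical `save_pickle_atomic` helper and a throwaway demo file rather than the real `meals.pkl`:

```python
import os
import pickle
import tempfile


def save_pickle_atomic(obj, path):
    """Hypothetical helper: write obj to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            pickle.dump(obj, tmp)
        os.replace(tmp_path, path)  # rename over the old file (atomic on POSIX)
    except Exception:
        os.remove(tmp_path)  # clean up the partial temp file
        raise


# Demo in a throwaway directory so the real meals.pkl is not touched
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, 'meals_demo.pkl')
    save_pickle_atomic([{'name': 'demo meal'}], demo_path)
    with open(demo_path, 'rb') as f:
        restored = pickle.load(f)

print(restored)
```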

In [90]:
# Confirm that it worked (the with block closes the file after reading)
with open('meals.pkl', 'rb') as f:
    test = pickle.load(f)

print(type(test))
print(len(test))
display(test[28])

<class 'list'>
400


{'name': 'Detox Vegetable Soup',
 'ingredients': ['1 sweet onion, chopped',
  '1 carrot, chopped',
  '3 stalks celery, chopped',
  '5 cups chopped broccoli, florets and stalks',
  '7 cups water, divided',
  '1 teaspoon dried basil',
  '1 teaspoon sea salt',
  '1 cup raw cashews',
  '2 cups cooked green lentils',
  '2 cups packed baby spinach',
  'Olive oil, for drizzling (optional)',
  'Ground pepper, optional'],
 'instructions': ['Place the onion, carrot, celery and broccoli in a large pot. Add 6 cups water, basil and salt to the pot and stir. Bring to a boil over high heat then cover and reduce heat to a low simmer.',
  'Let simmer for 15 to 20 minutes or until broccoli is tender.',
  'Meanwhile, in a blender, create your cashew cream. Blend together the cashews and remaining 1 cup water. (If you adjust the serving size, just keep the cashew to water ratio 1:1.)',
  'Pour the cashew cream into the pot with the veggies and stir.',
  ' Add the green lentils and stir again.',
  'Add the