# Data Collection <a id='data_collection'></a>

Collecting data on each recipe from foodnetwork.com turned out to be a logistical nightmare given the time constraints of this project. Additionally, we were unable to use scrapy to extract data from each recipe page, for reasons we explain later in this section.

### A. gather all the recipe urls <a id='get_urls'></a>
The first thing we did was gather the urls for all 65,000+ recipes at foodnetwork.com. In our first attempt, we ran scrapy from the terminal to crawl foodnetwork.com's a-z directory of recipes (http://www.foodnetwork.com/recipes/a-z.html). Food Network detected our spiders and would either suspend us or redirect us to null pages. Our second attempt was to use scrapy's Selector together with requests in a jupyter notebook and do the following:
1. create a list of a-z index urls (e.g. http://www.foodnetwork.com/recipes/a-z.D.1.html)
2. iterate through that list with scrapy, determine the range of pages for each a-z index, then create each url and append it to a list of urls (e.g. http://www.foodnetwork.com/recipes/a-z.R.23.html)
3. lastly, iterate through that list with scrapy and extract every recipe url for each page in each a-z index
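
Step 2 can be sketched without touching the network. The helper name `build_index_urls` and the sample page counts below are hypothetical and chosen only for illustration; the url pattern is the one shown in the list above.

```python
def build_index_urls(last_pages):
    """Expand (letter, last_page) pairs into one url per a-z index page."""
    pattern = "http://www.foodnetwork.com/recipes/a-z.%s.%s.html"
    urls = []
    for letter, last_page in last_pages:
        # pages are numbered from 1 up to and including last_page
        for page in range(1, last_page + 1):
            urls.append(pattern % (letter, page))
    return urls


# made-up page counts, for illustration only
sample_urls = build_index_urls([("D", 2), ("R", 3)])
print(sample_urls[0])
print(sample_urls[-1])
```

In the real run, the last page number for each letter comes from the pagination links on the first index page of that letter.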

In [3]:
# import libraries
import pandas as pd
import numpy as np
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import requests
import time
from datetime import datetime

In [4]:
# lists filled in by get_az_urls (they must exist before the function is called)
page_ranges = []
letter_ranges = []
urls = []

def get_az_urls(alphabet):
    
    # iterate through each letter in 'alphabet', create the a-z index urls
    for a in alphabet:
        url = "http://www.foodnetwork.com/recipes/a-z.%s.1.html" % a
        
        # extract the page_range (pagination link text) for each a-z index url
        response = requests.get(url)
        HTML = response.text
        page_range = Selector(text=HTML).xpath("//div[@class='pagination']/ul/li/a/text()").extract()
        # append the page_range for each letter in alphabet to a list=page_ranges
        page_ranges.append(page_range)
        # sleep between requests
        time.sleep(1)
    
    # the last pagination entry is taken as the number of pages for that letter;
    # zip each letter in 'alphabet' with its corresponding page count
    for rng in page_ranges:
        letter_ranges.append(int(rng[-1]))
    letters_page_ranges = zip(alphabet, letter_ranges)
    
    # create a url for each a-z index, along with a page number, and append to a final url list=urls
    for letter, n_pages in letters_page_ranges:
        for i in range(n_pages):
            url_ = "http://www.foodnetwork.com/recipes/a-z.%s.%s.html" % (letter, (i+1))
            urls.append(url_)
            

In [5]:
# create an a-z index for the argument given to get_az_urls
alphabet = ['123','A','B','C','D','E','F','G','H','I','J','K','L',
            'M','N','O','P','Q','R','S','T','V','W','XYZ']

# get_az_urls(alphabet)

In [6]:
# lists filled in by get_recipe_urls; defined outside the function so they persist between calls
recipe_names = []
recipe_urls = []

def get_recipe_urls(url):
    
    # download a single a-z index page
    response = requests.get(url)
    HTML = response.text
    
    # extract recipe names
    recipe_name = Selector(text=HTML).xpath("//li[@class='col18']/span[@class='arrow']/a/text()").extract()
    for recipe in recipe_name:
        recipe_names.append(recipe)
    
    # extract recipe urls (the hrefs are relative, so prepend the domain)
    recipe_url = Selector(text=HTML).xpath("//li[@class='col18']/span[@class='arrow']/a/@href").extract()
    for recipe_url_ in recipe_url:
        recipe_urls.append('http://www.foodnetwork.com'+recipe_url_)
    # sleep between requests
    time.sleep(1)


### Running the Function

In [7]:
for url in urls:
    get_recipe_urls(url)

This strategy was successful, yielding a final list of all 65,000+ recipe urls. We now had a starting point for collecting our final data set. The next goal was to write a function that would go through that list and extract all the information from each recipe.

### B. get the recipe data <a id='get_details'></a>
In our first attempt to extract all the information from each recipe page, we once again wrote a function utilizing scrapy in jupyter notebook. The extracted data, however, was incomplete. After examining the source code of foodnetwork.com's recipe pages, we determined that a significant amount of the content on those pages is rendered dynamically with javascript, including some of the fields we were trying to extract. As a result, our function would move on to the next page before the data we were trying to collect had fully rendered. Even after building delays into our function, the results were inconsistent and incomplete.
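
Fixed delays are fragile when render time varies from page to page. A general alternative is to poll until the field of interest shows up or a timeout expires. The following is only a minimal sketch of that idea: `wait_for`, `fake_field`, and the timing values are hypothetical names and numbers, and the fake field stands in for a real page lookup.

```python
import time


def wait_for(fetch_field, timeout=10.0, poll=0.5):
    """Call fetch_field() until it returns a non-empty result or timeout expires."""
    deadline = time.time() + timeout
    while True:
        result = fetch_field()
        if result:
            return result
        if time.time() >= deadline:
            return result  # still empty; the caller decides how to handle it
        time.sleep(poll)


# stand-in for a field that only appears on the third look at the page
attempts = {"n": 0}

def fake_field():
    attempts["n"] += 1
    return ["Easy"] if attempts["n"] >= 3 else []

value = wait_for(fake_field, timeout=2.0, poll=0.01)
print(value)
```

Polling only helps when something is actually executing the page's javascript, which is what a browser driven by a webdriver provides.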

To overcome this extraction barrier, we used the selenium webdriver, which opened each web page in a browser and allowed it to fully render before we extracted the information we needed. While this proved effective and successfully extracted all the fields we wanted for our dataset, it took about 8 seconds for each page to fully render. With over 65,000 recipes to collect, at 8 seconds per recipe and 3600 seconds per hour, that is over 144 hours, or just over 6 days if running continuously. This is the "logistical nightmare" we opened this section with. In order to collect the data, we had to run 2-3 of our crawlers nearly continuously in batches, clearing our kernel cache after every 2-3 thousand recipes so the computer would not crash. The whole process took over a week, but the final result was a [63868 x 15] dataframe full of glorious recipe data.
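
The time estimate above is simple arithmetic, spelled out here for a single crawler. The function name `crawl_time` is hypothetical; the inputs are the round figures quoted in the text.

```python
def crawl_time(n_pages, seconds_per_page):
    """Return (hours, days) needed to render n_pages one after another."""
    total_seconds = n_pages * seconds_per_page
    hours = total_seconds / 3600.0
    return hours, hours / 24.0


hours, days = crawl_time(65000, 8)
print("%.1f hours, %.1f days" % (hours, days))  # 144.4 hours, 6.0 days
```

This is a lower bound for one crawler: it ignores failed pages, restarts between batches, and the short random sleep between requests.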

In [11]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chromedriver = '/Users/kylemulle/Downloads/chromedriver' # change path as needed

chop = webdriver.ChromeOptions()
chop.add_extension('/Users/kylemulle/Downloads/AdBlock_v3.1.1.crx') # change path as needed

browser = webdriver.Chrome(executable_path = chromedriver, chrome_options=chop)

In [8]:
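# list of scraped recipes, filled in by get_details; defined outside the function so it persists between calls
recipe_details = []

def get_details(url):
                
    recipe = []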
    
    try: 
        browser.get(url)
        HTML = browser.page_source
        
        # url
        recipe.append(url)

        # recipe name
        name = Selector(text=HTML).xpath("//div[@class='tier-3 title']/h1/text()").extract()
        recipe.append(name)

        # ingredients
        ingredients = Selector(text=HTML).xpath("//ul/li/div[@class='box-block']/text()").extract()
        recipe.append(ingredients)

        # yield
        yield_ = Selector(text=HTML).xpath("//div[@class='difficulty']/dl[1]/dd/text()").extract()
        recipe.append(yield_)

        # difficulty
        difficulty = Selector(text=HTML).xpath("//div[@class='difficulty']/dl[2]/dd/text()").extract()
        recipe.append(difficulty)

        # total time
        time_total = Selector(text=HTML).xpath("//div[@class='cooking-times']/dl/dd[@class='total']/text()").extract()
        recipe.append(time_total)
        
        # preparation time
        time_prep = Selector(text=HTML).xpath("//div[@class='cooking-times']/dl/dd[2]/text()").extract()
        recipe.append(time_prep)
        
        # cook time/inactive time 
        time_cook_3 = Selector(text=HTML).xpath("//div[@class='cooking-times']/dl/dd[3]").extract()
        time_cook_4 = Selector(text=HTML).xpath("//div[@class='cooking-times']/dl/dd[4]").extract()
        # w/ inactive time
        if time_cook_4 != []:
            recipe.append(time_cook_3) # inactive time
            recipe.append(time_cook_4) # cook time
        # NO inactive time
        else:
            recipe.append(time_cook_4) # null result
            recipe.append(time_cook_3) # cook time
        
        # categories
        categories = Selector(text=HTML).xpath("//ul[@class='categories']/li/a/text()").extract()
        recipe.append(categories)

        # rating
        rating = Selector(text=HTML).xpath("//a[contains(@class, 'community-rating-stars')]//@title").extract()
        recipe.append(rating)

        # ratings
        ratings = Selector(text=HTML).xpath("//div[@class='col18']/section[@class='review-rating section']/a[@class='community-rating-stars']/div[@class='gig-rating-ratingsum']/text()").extract()
        recipe.append(ratings)

        # directions
        directions = Selector(text=HTML).xpath("//ul[@class='recipe-directions-list']/li/p").extract()
        recipe.append(directions)

        # chef
        chef = Selector(text=HTML).xpath("//div[@class='col10 directions']/p[@class='copyright']/text()").extract()
        recipe.append(chef)
        
        # photo
        image = Selector(text=HTML).xpath("//section[@class='single-photo-recipe']/a[@class='ico-wrap']/img/@src").extract()
        image_ = Selector(text=HTML).xpath("//div[@class='ico-wrap']/img/@src").extract() 
        if image != []:
            recipe.append(image)
        else:
            recipe.append(image_)

        recipe_details.append(recipe)
        
    except Exception:
        # log the failure with a timestamp and move on to the next recipe
        print('FAIL:  %s - %s' % (datetime.strftime(datetime.now(), '%H:%M:%S'), url))


In [9]:
# a printable variable for keeping track of progress
indx = 0

# first 1000 recipes in urls
for recipe in urls[0:1000]:
    indx += 1
    print('%s: %s - %s' % (indx, datetime.strftime(datetime.now(), '%H:%M:%S'), recipe))
    get_details(recipe)
    # sleep for a random fraction of a second
    time.sleep(np.random.rand())


In [10]:
recipe_dict = {}

recipe_dict['url']            = [row[0] for row in recipe_details]
recipe_dict['name']           = [row[1] for row in recipe_details]
recipe_dict['ingredients']    = [row[2] for row in recipe_details]
recipe_dict['yield']          = [row[3] for row in recipe_details]
recipe_dict['difficulty']     = [row[4] for row in recipe_details]
recipe_dict['time_total']     = [row[5] for row in recipe_details]
recipe_dict['time_prep']      = [row[6] for row in recipe_details]
recipe_dict['time_inactive']  = [row[7] for row in recipe_details]
recipe_dict['time_cook']      = [row[8] for row in recipe_details]
recipe_dict['categories']     = [row[9] for row in recipe_details]
recipe_dict['rating']         = [row[10] for row in recipe_details]
recipe_dict['ratings']        = [row[11] for row in recipe_details]
recipe_dict['directions']     = [row[12] for row in recipe_details]
recipe_dict['chef']           = [row[13] for row in recipe_details]
recipe_dict['photo']          = [row[14] for row in recipe_details]

recipe_df = pd.DataFrame(recipe_dict)

The only thing left to do at this point was to map each position in the rows of the recipe_details list to a dictionary key:value pair, and convert the dictionary into a dataframe.

In [1]:
recipe_df = pd.DataFrame(recipe_dict)

recipe_df.to_csv('../data_raw/recipes_0-1000.csv')
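
The batch-by-batch collection described earlier can be sketched as follows. The helper name `batch_bounds` is hypothetical; the batch size of 1000 is taken from the slice and file name used above, and the file name pattern simply mirrors `recipes_0-1000.csv`.

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, stop) slice bounds covering range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# 2500 is a made-up count to keep the output short
for start, stop in batch_bounds(2500, 1000):
    print("../data_raw/recipes_%s-%s.csv" % (start, stop))

bounds = list(batch_bounds(2500, 1000))
```

Each `(start, stop)` pair would be used as `urls[start:stop]`, with the resulting dataframe written to the matching file before moving on to the next batch.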