# Capstone Project: Predicting New York Times Cooking Recipe Popularity
### Notebook 01 - Data Scraping and Preliminary Cleaning


_Author: Joe Serigano (jserigano4@gmail.com)_

---

**Objectives:**
- Gather and prepare URLs of [NYT Cooking](https://cooking.nytimes.com/) recipes using an [XML Sitemap creator](https://www.xml-sitemaps.com/).
- Scrape recipe data from 10,000 posts using [recipe_scrapers](https://github.com/hhursev/recipe-scrapers), a web scraping library available on GitHub.
- Format the recipe data into a DataFrame and save it for EDA and preprocessing.

The main question we are trying to answer is: **Which characteristics of a recipe are most likely to increase its popularity (and overall site traffic), and what changes can be made before posting to make a recipe more popular?**

In [40]:
from recipe_scrapers import scrape_me
import time
import bs4 as bs
import requests
import pandas as pd
import numpy as np
import re

# The data set is large, so remove the limits on how many columns and rows pandas displays
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

We'll be using a web scraper, defined in the function below, to pull the appropriate recipe data from NYT Cooking.

In [41]:
def recipe_scrape(url):
    '''
    Scrape content from an NYT Cooking recipe post using schema.org.
    **************
    Input:
    url: URL of an NYT Cooking recipe
    **************
    Output:
    scraper: scraper object holding the recipe's scraped data, to be stored in the DataFrame
    **************
    '''
    scraper = scrape_me(url)
    return scraper

The URL list includes 10,000 posts, but not all of them are recipes. To fix this, we'll keep only the links whose URL contains the string '/recipes/'.

In [42]:
# Reading in recipe urls
recipes = pd.read_csv('data/recipe_url.csv')
print(recipes.shape)
recipes = recipes[['recipe_url']]

# Filtering out recipe urls that are not actual recipe posts.
recipes = recipes[recipes['recipe_url'].str.contains('/recipes/')]
recipes.reset_index(inplace = True, drop = True)
print(recipes.shape)
recipes.head()

(10000, 6)
(8276, 1)


Unnamed: 0,recipe_url
0,https://cooking.nytimes.com/recipes/1023386-su...
1,https://cooking.nytimes.com/recipes/1023385-sl...
2,https://cooking.nytimes.com/recipes/1023384-to...
3,https://cooking.nytimes.com/recipes/1019764-po...
4,https://cooking.nytimes.com/recipes/1023380-ba...


In total, 1,724 of the 10,000 NYT Cooking posts that were pulled were not recipes, so we will continue with the remaining 8,276 recipes.

Next, we will scrape the appropriate data from each recipe link:

In [43]:
start = time.time()
recipes['scraped'] = recipes['recipe_url'].apply(recipe_scrape)
end = time.time()
print("The time of execution of above program is :", end-start)

The time of execution of above program is : 4996.33850979805
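
The run above takes roughly 83 minutes (4,996 seconds), and an exception raised for any single URL would abort the whole `apply`. A minimal sketch of a more defensive wrapper follows, assuming it is acceptable to record `None` for a URL that cannot be fetched. The names `safe_scrape`, `retries`, `delay`, and `flaky_fetch` are hypothetical and introduced only for illustration; the fetch function is passed in as an argument so the sketch runs without network access.

```python
import time

def safe_scrape(url, fetch, retries=3, delay=1.0):
    """Call fetch(url), retrying on failure; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            # Wait before the next attempt, but not after the final one
            if attempt < retries - 1:
                time.sleep(delay)
    return None

# Stand-in for recipe_scrape that fails on its first call (illustration only)
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated network error")
    return f"scraped:{url}"

result = safe_scrape("https://example.com/recipes/1", flaky_fetch, delay=0.0)
print(result)
```

In this notebook it could be applied as `recipes['recipe_url'].apply(lambda u: safe_scrape(u, recipe_scrape))`, with rows holding `None` dropped afterwards.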


The DataFrame now consists of the recipe URLs and, for each link, a scraper object holding the scraped data. Next, we'll loop over the rows to extract the desired content from each recipe.

In [226]:
recipes.head()

Unnamed: 0,recipe_url,scraped
0,https://cooking.nytimes.com/recipes/1023386-su...,<recipe_scrapers.nytimes.NYTimes object at 0x7...
1,https://cooking.nytimes.com/recipes/1023385-sl...,<recipe_scrapers.nytimes.NYTimes object at 0x7...
2,https://cooking.nytimes.com/recipes/1023384-to...,<recipe_scrapers.nytimes.NYTimes object at 0x7...
3,https://cooking.nytimes.com/recipes/1019764-po...,<recipe_scrapers.nytimes.NYTimes object at 0x7...
4,https://cooking.nytimes.com/recipes/1023380-ba...,<recipe_scrapers.nytimes.NYTimes object at 0x7...


In [227]:
recipes.reset_index(inplace = True, drop = True)

Features from each recipe include:
- Title of recipe
- Author of recipe
- Date of posting
- Recipe description
- Total time for recipe
- Recipe yield
- Number of steps
- Text for all steps
- Number of ingredients
- Text for all ingredients
- Number of ratings for recipe
- Recipe tags
- Cuisine type
- Recipe categories
- Recipe rating

The number of ratings will ultimately be our target variable. The other features will be used for EDA and modelling purposes.
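
Several of these fields (date, description, rating count, tags, cuisine, category) are pulled from the raw page source with lookbehind regular expressions rather than through scraper methods. The sketch below runs the same patterns used in the extraction loop on a shortened snippet. The `sample` and `empty` strings are invented for illustration, and the real NYT Cooking markup may differ.

```python
import re

# Hypothetical, heavily shortened page source; the real markup may differ.
sample = (
    '<img src="https://static01.nyt.com/images/2018/11/16/dining/meatballs.jpg">'
    '{"keywords":"fish sauce, ginger","recipeCuisine":"vietnamese",'
    '"aggregateRating":{"ratingValue":5,"ratingCount":3271}}'
)

# (?<=...) is a lookbehind: the prefix must be present but is not part of the match
date = re.search(r'(?<=https://static01.nyt.com/images/).[^a-z]*', sample)
date_str = date.group(0)[:-1]  # drop the trailing slash

tags = re.search(r'(?<=keywords":").[^"]*', sample).group(0)

# Note: the match is a string, not an integer
rating_count = re.search(r'(?<=ratingCount":).[^}]*', sample).group(0)

# With an empty cuisine field the leading "." consumes the closing quote,
# which is why the loop checks for the literal string '",'
empty = '{"recipeCuisine":"","keywords":"x"}'
empty_cuisine = re.search(r'(?<=recipeCuisine":").[^"]*', empty).group(0)

print(date_str, tags, rating_count, repr(empty_cuisine))
```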

In [228]:
start = time.time()

for i in range(len(recipes)):
    dict_ = recipes.loc[i, 'scraped'].__dict__['page_data'].decode("utf8")
    dict_ = str(dict_)
    
    try:
        recipes.loc[i, 'title'] = recipes.loc[i, 'scraped'].title()
    except:
        recipes.loc[i, 'title'] = np.nan
        
    try:
        recipes.loc[i, 'author'] = recipes.loc[i, 'scraped'].author()
    except:
        recipes.loc[i, 'author'] = np.nan

    date = re.search(r'(?<=https://static01.nyt.com/images/).[^a-z]*', dict_)
    try:
        date_str = date.group(0)[:-1]
        # Discard incorrect dates (e.g. 3019 instead of 2019): the first digit must be at most 2
        recipes.loc[i, 'date'] = date_str if int(date_str[0]) <= 2 else np.nan
    except:
        recipes.loc[i, 'date'] = np.nan
        
    description = re.search(r'(?<="description":").[^@]*', dict_)
    try:
        recipes.loc[i, 'description'] = description.group(0)[:-14]
    except:
        recipes.loc[i, 'description'] = np.nan

    try:
        recipes.loc[i, 'total_time'] = recipes.loc[i, 'scraped'].total_time()
    except:
        recipes.loc[i, 'total_time'] = np.nan

    try:
        recipes.loc[i, 'yields'] = recipes.loc[i, 'scraped'].yields()
    except:
        recipes.loc[i, 'yields'] = np.nan

    try:
        recipes.loc[i, 'n_steps'] = len(recipes.loc[i, 'scraped'].instructions_list())
    except:
        recipes.loc[i, 'n_steps'] = np.nan

    try:
        # Note: the steps are concatenated with no separator between them
        recipes.loc[i, 'steps'] = ''.join(recipes.loc[i, 'scraped'].instructions_list())
    except:
        recipes.loc[i, 'steps'] = np.nan

    try:
        recipes.loc[i, 'n_ingredients'] = len(recipes.loc[i, 'scraped'].ingredients())
    except:
        recipes.loc[i, 'n_ingredients'] = np.nan

    try:
        # Note: the ingredients are concatenated with no separator between them
        recipes.loc[i, 'ingredients'] = ''.join(recipes.loc[i, 'scraped'].ingredients())
    except:
        recipes.loc[i, 'ingredients'] = np.nan

    rating_count = re.search(r'(?<=ratingCount":).[^}]*', dict_)
    try:
        recipes.loc[i, 'rating_count'] = rating_count.group(0)
    except:
        recipes.loc[i, 'rating_count'] = np.nan
        
    tags = re.search(r'(?<=keywords":").[^"]*', dict_)
    try:
        recipes.loc[i, 'tags'] = tags.group(0)
    except:
        recipes.loc[i, 'tags'] = np.nan
    
    cuisine = re.search(r'(?<=recipeCuisine":").[^"]*', dict_)
    try:
        recipes.loc[i, 'cuisine'] = cuisine.group(0)
        if recipes.loc[i, 'cuisine'] == '",':
            recipes.loc[i, 'cuisine'] = 'NA'
    except:
        recipes.loc[i, 'cuisine'] = np.nan
        
    category = re.search(r'(?<=recipeCategory":").[^"]*', dict_)
    try:
        recipes.loc[i, 'category'] = category.group(0)
    except:
        recipes.loc[i, 'category'] = np.nan
        
    try:
        recipes.loc[i, 'rating'] = recipes.loc[i, 'scraped'].ratings()
    except:
        recipes.loc[i, 'rating'] = np.nan

    
end = time.time()
print("The time of execution of above program is :", end-start)

The time of execution of above program is : 30.727094173431396
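
The hard-coded slice `[:-14]` on the description and the `'",'` check on the cuisine hint at how brittle regex extraction from raw page text can be. As an alternative (not the method used in this notebook), the sketch below parses the metadata as JSON. It assumes the page embeds its schema.org data in a `<script type="application/ld+json">` block, which the keys targeted by the regexes suggest but which is not confirmed here. `extract_json_ld` and `sample_html` are hypothetical.

```python
import json
import re

def extract_json_ld(html):
    """Return the first JSON-LD object found in html, or None if absent or invalid."""
    match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL,
    )
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

# Hypothetical page fragment; the real markup may differ.
sample_html = '''
<html><head>
<script type="application/ld+json">
{"@type": "Recipe",
 "description": "A quick weeknight dinner.",
 "keywords": "ginger, pork",
 "recipeCuisine": "vietnamese",
 "aggregateRating": {"ratingValue": 5, "ratingCount": 3271}}
</script>
</head></html>
'''

data = extract_json_ld(sample_html)
# Nested lookups with defaults avoid a KeyError when a field is missing
rating_count = data.get("aggregateRating", {}).get("ratingCount")
print(data["recipeCuisine"], rating_count)
```

Parsing the JSON keeps types intact (the rating count arrives as an integer) and removes the need for character-offset slicing.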


In [229]:
recipes.head(10)

Unnamed: 0,recipe_url,scraped,title,author,date,description,total_time,yields,n_steps,steps,n_ingredients,ingredients,rating_count,tags,cuisine,category,rating
0,https://cooking.nytimes.com/recipes/1023386-su...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Summer Fruit Compote,David Tanis,2022/08/13,"In another era, this kind of chopped fruit sal...",5.0,6 servings,2.0,"In a large bowl, combine melon, apricots, nect...",9.0,"2 cups melon, such as cantaloupe or honeydew, ...",38,"apricot, melon, nectarine",,"parfaits and trifles, dessert",4.0
1,https://cooking.nytimes.com/recipes/1023385-sl...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Slow-Cooked Lamb Shoulder With Green Beans,David Tanis,2022/08/13,There are many ways to achieve a succulent bra...,240.0,6 servings,6.0,Prepare a covered gas or charcoal grill for me...,13.0,1 (3-pound) boneless lamb shoulder roastSalt a...,15,"green beans, herbs, lamb shoulder",,"dinner, meat, roasts, main course",3.0
2,https://cooking.nytimes.com/recipes/1023384-to...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Tomato Salad With Smoky Eggplant Flatbread,David Tanis,2022/08/13,Buy lavash or pita at a local Middle Eastern m...,40.0,6 servings,6.0,"Set the whole, unpeeled eggplant directly over...",19.0,1 large eggplant (about 1 pound)4 tablespoons ...,8,"eggplant, lavash, tomato, summer, vegetarian",,"lunch, salads and dressings, vegetables, appet...",4.0
3,https://cooking.nytimes.com/recipes/1019764-po...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Pork Meatballs With Ginger and Fish Sauce,Kay Chun,2018/11/16,These nuoc cham-inspired meatballs are perfect...,20.0,4 servings,3.0,"Heat oven to 425 degrees. In a large bowl, com...",7.0,2 tablespoons peeled and minced ginger1 tables...,3271,"fish sauce, ginger, ground pork, ritz crackers...",vietnamese,"dinner, lunch, weekday, weeknight, meatballs, ...",5.0
4,https://cooking.nytimes.com/recipes/1023380-ba...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Basil and Tomato Fried Rice,Hetty McKinnon,2022/08/01,Summer’s dynamic duo of tomato and basil make ...,15.0,4 servings,5.0,"In a bowl, whisk the eggs with 1/2 teaspoon sa...",10.0,4 eggsKosher salt (such as Diamond Crystal) an...,136,"basil, egg, rice, tomatoes, summer, vegetarian",,"dinner, easy, lunch, quick, weekday, grains an...",4.0
5,https://cooking.nytimes.com/recipes/1023369-sn...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Snapper Escovitch,Millie Peartree,2022/07/27,"A Caribbean favorite, this light, tender and f...",25.0,4 servings,5.0,"Mix together 1 teaspoon thyme leaves, garlic p...",18.0,2 teaspoons fresh thyme leaves1 teaspoon garli...,26,"bell pepper, fish, dairy-free",,"dinner, weekday, weeknight, seafood, main course",4.0
6,https://cooking.nytimes.com/recipes/1017463-su...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,"Summer Pasta With Zucchini, Ricotta and Basil",David Tanis,2021/06/22,"A summer pasta should be simple and fresh, ide...",30.0,6 servings,4.0,Put a pot of water on to boil. In a large skil...,11.0,"Extra-virgin olive oil1 small onion, finely di...",6831,"basil, parmesan, ricotta, ziti, zucchini, summ...",,"dinner, lunch, quick, weeknight, pastas, main ...",5.0
7,https://cooking.nytimes.com/recipes/1022662-fr...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Fried Oysters With Tartar Sauce,Gabrielle Hamilton,2021/10/24,Getting fried oysters from your summer seafood...,45.0,24 servings,5.0,Prepare the oysters: Season flour with salt an...,15.0,1/2 cup all-purpose flourKosher salt and finel...,228,"cornichons, mayonnaise, oyster",,appetizer,4.0
8,https://cooking.nytimes.com/recipes/1023204-fi...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Fish Sticks With Peas,Naz Deravian,2022/05/17,Two childhood classics — fish sticks and green...,35.0,6 servings,5.0,Prepare your dredging station and make the fis...,17.0,1/4 cup all-purpose flour1/2 teaspoon ground t...,176,"bread crumbs, fish, peas, spring",,"dinner, finger foods, seafood, main course",4.0
9,https://cooking.nytimes.com/recipes/1022442-bl...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Blackberry Crisp With Cardamom Custard Sauce,David Tanis,2021/08/04,You could use a combination of berries (raspbe...,80.0,6 servings,5.0,"Make the topping: Put flour, sugar, butter and...",10.0,1 cup/128 grams all-purpose flour1/2 cup/100 g...,554,"blackberries, cardamom, half and half, summer",,"custards and puddings, dessert",4.0


Checking for null values and removing the 647 recipes that contain them:

In [230]:
recipes.isnull().sum()

recipe_url       0
scraped          0
title            0
author           0
date             0
description      0
total_time       0
yields           0
n_steps          0
steps            0
n_ingredients    0
ingredients      0
rating_count     0
tags             0
cuisine          0
category         0
rating           0
dtype: int64

In [201]:
recipes.dropna(inplace = True)
recipes.shape

(7629, 16)

Converting `date` to a datetime and creating word count columns for the title, description, and steps:

In [241]:
# 'infer_datetime_format' is deprecated as of pandas 2.0, where format inference is the default
recipes['date'] = pd.to_datetime(recipes['date'])
recipes['title_wordcount'] = recipes['title'].str.split().str.len()
recipes['description_wordcount'] = recipes['description'].str.split().str.len()
recipes['steps_wordcount'] = recipes['steps'].str.split().str.len()

In [242]:
recipes.head()

Unnamed: 0,recipe_url,scraped,title,author,date,description,total_time,yields,n_steps,steps,n_ingredients,ingredients,rating_count,tags,cuisine,category,rating,description_wordcount,steps_wordcount,title_wordcount
0,https://cooking.nytimes.com/recipes/1023386-su...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Summer Fruit Compote,David Tanis,2022-08-13,"In another era, this kind of chopped fruit sal...",5.0,6 servings,2.0,"In a large bowl, combine melon, apricots, nect...",9.0,"2 cups melon, such as cantaloupe or honeydew, ...",38,"apricot, melon, nectarine",,"parfaits and trifles, dessert",4.0,67,51,3
1,https://cooking.nytimes.com/recipes/1023385-sl...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Slow-Cooked Lamb Shoulder With Green Beans,David Tanis,2022-08-13,There are many ways to achieve a succulent bra...,240.0,6 servings,6.0,Prepare a covered gas or charcoal grill for me...,13.0,1 (3-pound) boneless lamb shoulder roastSalt a...,15,"green beans, herbs, lamb shoulder",,"dinner, meat, roasts, main course",3.0,96,290,6
2,https://cooking.nytimes.com/recipes/1023384-to...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Tomato Salad With Smoky Eggplant Flatbread,David Tanis,2022-08-13,Buy lavash or pita at a local Middle Eastern m...,40.0,6 servings,6.0,"Set the whole, unpeeled eggplant directly over...",19.0,1 large eggplant (about 1 pound)4 tablespoons ...,8,"eggplant, lavash, tomato, summer, vegetarian",,"lunch, salads and dressings, vegetables, appet...",4.0,94,256,6
3,https://cooking.nytimes.com/recipes/1019764-po...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Pork Meatballs With Ginger and Fish Sauce,Kay Chun,2018-11-16,These nuoc cham-inspired meatballs are perfect...,20.0,4 servings,3.0,"Heat oven to 425 degrees. In a large bowl, com...",7.0,2 tablespoons peeled and minced ginger1 tables...,3271,"fish sauce, ginger, ground pork, ritz crackers...",vietnamese,"dinner, lunch, weekday, weeknight, meatballs, ...",5.0,119,49,7
4,https://cooking.nytimes.com/recipes/1023380-ba...,<recipe_scrapers.nytimes.NYTimes object at 0x7...,Basil and Tomato Fried Rice,Hetty McKinnon,2022-08-01,Summer’s dynamic duo of tomato and basil make ...,15.0,4 servings,5.0,"In a bowl, whisk the eggs with 1/2 teaspoon sa...",10.0,4 eggsKosher salt (such as Diamond Crystal) an...,136,"basil, egg, rice, tomatoes, summer, vegetarian",,"dinner, easy, lunch, quick, weekday, grains an...",4.0,99,234,5


Finally, we save the DataFrame for further cleaning, preprocessing, and modelling.

In [243]:
recipes.to_csv('data/recipes_cleaned_data.csv')
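
One thing to keep in mind when saving: the `scraped` column holds Python objects, which a CSV can only store as their string representation, and the default `to_csv` call also writes the index, which comes back as an `Unnamed: 0` column on reload. A minimal sketch of a leaner round trip follows, using a toy DataFrame with illustrative values and an in-memory buffer instead of a file. Whether later notebooks rely on the extra columns is not known here, so treat this as an option rather than a correction.

```python
import io
import pandas as pd

# Toy stand-in for the final DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "title": ["Summer Fruit Compote"],
    "scraped": [object()],  # placeholder for a scraper object
    "date": pd.to_datetime(["2022-08-13"]),
    "rating_count": ["38"],
})

buffer = io.StringIO()  # in-memory file, so nothing is written to disk

# Drop the object column and skip the index when writing
toy.drop(columns=["scraped"]).to_csv(buffer, index=False)

# Read the data back, asking pandas to parse the date column
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```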