# Project #2, Part 1: Questions and Dataset Selection  
**Author**: Maricarl Sibal  
**Date**: April 2, 2025  

This notebook outlines the planning and dataset selection process for **Project 2**. It includes:

- A curated set of international recipes from AllRecipes.com  
- A list of data-driven questions to guide the analysis  
- A structured approach for web scraping using JSON-LD metadata  

The objective is to establish a clear and ethical foundation for data extraction and exploration in the next phases of the project.

## Source Website
The dataset for this project will be sourced from **[AllRecipes.com](https://www.allrecipes.com)**, a widely used online platform offering detailed content on a broad range of international dishes.  

Each recipe page includes structured metadata, which makes the site well-suited for programmatic data extraction.

In [4]:
# Source website for recipe data
website = "https://www.allrecipes.com"

## Selected Recipe URLs
The following five recipe pages were selected to represent a diverse set of international cuisines. These pages are content-rich and contain structured metadata embedded in JSON-LD format.

In [6]:
# Dictionary of international recipe URLs selected for scraping
urls = {
    "Traditional Filipino Lumpia": "https://www.allrecipes.com/recipe/35151/traditional-filipino-lumpia/",
    "Beef Bulgogi": "https://www.allrecipes.com/recipe/100606/beef-bulgogi/",
    "Miso Soup": "https://www.allrecipes.com/recipe/13107/miso-soup/",
    "Bánh Mì": "https://www.allrecipes.com/recipe/187342/banh-mi/",
    "Spanish Paella": "https://www.allrecipes.com/recipe/12728/paella-i/"
}

# Display selected URLs
urls

{'Traditional Filipino Lumpia': 'https://www.allrecipes.com/recipe/35151/traditional-filipino-lumpia/',
 'Beef Bulgogi': 'https://www.allrecipes.com/recipe/100606/beef-bulgogi/',
 'Miso Soup': 'https://www.allrecipes.com/recipe/13107/miso-soup/',
 'Bánh Mì': 'https://www.allrecipes.com/recipe/187342/banh-mi/',
 'Spanish Paella': 'https://www.allrecipes.com/recipe/12728/paella-i/'}

## Proposed Data Science Questions
The following data-driven questions are designed to support comparative analysis across the selected international recipes. They focus on ingredient complexity, preparation time, user ratings, nutritional content, and popularity.

1. What is the average number of ingredients used in these recipes?
   - Helps assess the relative complexity of each dish.

2. How does the total preparation time (prep + cook) vary across different cuisines?
   - Useful for comparing effort and cooking time across cultures.

3. What is the relationship between user rating and total preparation time?
   - Investigates whether quick or slow recipes are rated more favorably.

4. How do the nutritional profiles (e.g., calories, fat, protein) differ among these dishes?
   - Enables a comparison of health-related aspects of different cuisines.

5. Which recipes appear to be more popular based on the number of user reviews?
   - Popularity can be an indirect indicator of accessibility and appeal.

In [8]:
# List of questions for analysis
questions = [
    "1. What is the average number of ingredients used in these recipes?",
    "2. How does the total preparation time (including both prep and cook time) vary across different cuisines?",
    "3. What is the relationship between user rating and total preparation time?",
    "4. How do the nutritional profiles (e.g., calories, fat, protein) differ among these dishes?",
    "5. Which recipes appear to be more popular based on the number of user reviews?"
]

# Display the questions
for q in questions:
    print(q)

1. What is the average number of ingredients used in these recipes?
2. How does the total preparation time (including both prep and cook time) vary across different cuisines?
3. What is the relationship between user rating and total preparation time?
4. How do the nutritional profiles (e.g., calories, fat, protein) differ among these dishes?
5. Which recipes appear to be more popular based on the number of user reviews?


## Data Scraping Plan
AllRecipes.com embeds recipe metadata as JSON-LD inside a `<script type="application/ld+json">` tag. This structured format provides consistent access to the relevant recipe attributes without parsing the visible page layout.
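As a rough illustration of how such a tag could be read, the sketch below pulls every JSON-LD block out of an HTML string with the standard library and returns the first object typed as a recipe. The class `JsonLdParser`, the function `find_recipe`, and the `sample_html` fragment are hypothetical names and data introduced here for illustration; the real pages may nest the recipe differently (for example under an `@graph` key), which this sketch does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipe(html):
    """Return the first JSON-LD object whose @type includes "Recipe", else None."""
    parser = JsonLdParser()
    parser.feed(html)
    for block in parser.blocks:
        data = json.loads(block)
        # A block may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                return item
    return None


# Hypothetical page fragment for illustration only (not real AllRecipes content).
sample_html = """
<html><head>
<script type="application/ld+json">
[{"@type": ["Recipe"], "name": "Example Soup",
  "recipeIngredient": ["1 cup water", "1 tsp salt"]}]
</script>
</head><body></body></html>
"""

recipe = find_recipe(sample_html)
print(recipe["name"], len(recipe["recipeIngredient"]))
```

Working on an HTML string rather than a live request keeps the parsing step separate from downloading, so it can be tested without touching the site.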

The scraping plan will focus on extracting the following fields:
- Ingredients: Accessible via the "recipeIngredient" array
- Preparation and Cook Times: Found in "prepTime" and "cookTime" (in ISO 8601 duration format, e.g., "PT20M" for 20 minutes)
- User Ratings and Review Counts: Stored under the "aggregateRating" object
- Nutritional Information: Contained in the "nutrition" object (e.g., "calories", "fatContent", "proteinContent", etc.)
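The fields listed above could be flattened into one record per recipe along the following lines. This is a minimal sketch, not the final scraper: `duration_to_minutes`, `summarize_recipe`, and the `sample_recipe` values are hypothetical, and the keys inside "aggregateRating" (`ratingValue`, `reviewCount`, `ratingCount`) are assumed from the schema.org vocabulary rather than confirmed against the actual pages.

```python
import re

# Matches simple ISO 8601 durations such as PT20M, PT1H30M, or P1DT2H.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value):
    """Convert an ISO 8601 duration string to whole minutes, or None if unparseable."""
    match = _DURATION.match(value or "")
    if not match:
        return None
    parts = match.groupdict()
    if all(v is None for v in parts.values()):
        return None  # "P" or "PT" with no components
    days = int(parts["days"] or 0)
    hours = int(parts["hours"] or 0)
    minutes = int(parts["minutes"] or 0)
    seconds = int(parts["seconds"] or 0)
    # Leftover seconds below one minute are dropped.
    return days * 1440 + hours * 60 + minutes + seconds // 60


def summarize_recipe(recipe):
    """Flatten the fields of interest from a recipe JSON-LD dict."""
    rating = recipe.get("aggregateRating") or {}
    nutrition = recipe.get("nutrition") or {}
    prep = duration_to_minutes(recipe.get("prepTime"))
    cook = duration_to_minutes(recipe.get("cookTime"))
    total = prep + cook if prep is not None and cook is not None else None
    return {
        "n_ingredients": len(recipe.get("recipeIngredient") or []),
        "prep_minutes": prep,
        "cook_minutes": cook,
        "total_minutes": total,
        "rating": rating.get("ratingValue"),
        # Assumed key names; fall back from one to the other.
        "review_count": rating.get("reviewCount", rating.get("ratingCount")),
        "calories": nutrition.get("calories"),
        "fat": nutrition.get("fatContent"),
        "protein": nutrition.get("proteinContent"),
    }


# Hypothetical values for illustration only (not scraped data).
sample_recipe = {
    "recipeIngredient": ["a", "b", "c"],
    "prepTime": "PT15M",
    "cookTime": "PT1H5M",
    "aggregateRating": {"ratingValue": "4.5", "ratingCount": "10"},
    "nutrition": {"calories": "100 kcal"},
}

summary = summarize_recipe(sample_recipe)
print(summary)
```

Missing fields come back as `None` instead of raising an error, which should make it easier to spot pages whose metadata is incomplete.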

The URL format for recipe pages follows a consistent pattern:
`https://www.allrecipes.com/recipe/<id>/<recipe-name>/`

This predictable structure simplifies scraping and supports easy expansion of the dataset.
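One way to make use of this pattern is to validate URLs before they are added to the dataset. The sketch below assumes, based only on the five URLs selected above, that the ID is numeric and the recipe name is a lowercase, hyphen-separated slug; `RECIPE_URL` and `parse_recipe_url` are hypothetical names.

```python
import re

# Assumed shape: https://www.allrecipes.com/recipe/<numeric id>/<lowercase-slug>/
RECIPE_URL = re.compile(
    r"^https://www\.allrecipes\.com/recipe/(?P<id>\d+)/(?P<slug>[a-z0-9-]+)/$"
)


def parse_recipe_url(url):
    """Return (recipe_id, slug) if the URL matches the recipe pattern, else None."""
    match = RECIPE_URL.match(url)
    if match is None:
        return None
    return int(match["id"]), match["slug"]


print(parse_recipe_url("https://www.allrecipes.com/recipe/13107/miso-soup/"))
print(parse_recipe_url("https://www.allrecipes.com/robots.txt"))
```

The numeric ID returned here could also serve as a stable key when the dataset grows beyond the initial five recipes.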

## Ethical Considerations: Robots.txt Review
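Before starting web scraping, it is essential to confirm that the website's crawling rules allow it. The robots.txt file is the first check; it does not replace the site's terms of use, which should be reviewed separately.

A review of AllRecipes.com’s robots.txt file shows:
- No Disallow rules for the /recipe/ directory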

In [11]:
# URL of robots.txt file
robots_txt_url = "https://www.allrecipes.com/robots.txt"
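The robots.txt check can also be done programmatically with Python's built-in `urllib.robotparser`. The sketch below parses a list of rule lines and asks whether a given URL may be fetched. The `sample_rules` are hypothetical and are not the actual contents of the AllRecipes.com file; `is_allowed` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not the real AllRecipes robots.txt).
sample_rules = [
    "User-agent: *",
    "Disallow: /account/",
]

print(is_allowed(sample_rules, "https://www.allrecipes.com/recipe/13107/miso-soup/"))
print(is_allowed(sample_rules, "https://www.allrecipes.com/account/settings"))
```

To check the live file instead, the same parser can be pointed at the URL with `parser.set_url(...)` followed by `parser.read()`, which downloads and parses the current rules.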

**Conclusion:** Based on this robots.txt review, scraping the selected recipe pages is not disallowed by AllRecipes.com's current crawling rules.