# Web Scraping Exercise

## 1. Introduction and Planning

### Objective:
The goal of this exercise is to build a web scraper that collects data from a chosen website. You will learn how to send HTTP requests, parse HTML content, extract relevant data, and store it in a structured format.

### Tasks:
1. Identify the data you want to scrape.
2. Choose the target website(s).
3. Plan the structure of your project.

### Example:
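For this exercise, we will scrape job listings from Indeed.com. We will extract job titles, company names, locations, and job descriptions. The code cells in Section 3 demonstrate the same techniques on a saved Allrecipes recipe page (`data/web/granola.html`).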
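
One way to make the plan concrete before writing any scraping code is to define the record you expect to collect. The sketch below assumes a hypothetical `JobListing` dataclass and `to_csv` helper (neither is part of the exercise) holding the four fields named above, and writes made-up sample data as CSV.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class JobListing:
    """One scraped job posting (hypothetical record layout)."""
    title: str
    company: str
    location: str
    description: str


def to_csv(listings):
    """Serialize a list of JobListing records to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(JobListing)])
    writer.writeheader()
    for listing in listings:
        writer.writerow(asdict(listing))
    return buffer.getvalue()


# Made-up sample data, only to show the output shape
sample = [JobListing("Data Analyst", "Example Corp", "Remote", "Analyze data.")]
csv_text = to_csv(sample)
print(csv_text)
```

Fixing the field names early keeps the parsing code and the storage code in agreement about what one scraped item looks like.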

## 2. Understanding the Target Website

### Objective:
Analyze the structure of the web pages to be scraped.

### Tasks:
* Inspect the target website using browser developer tools.
* Identify the HTML elements that contain the desired data.

### Instructions:
* Open your browser and navigate to the target website (e.g., Indeed.com).
* Right-click on the webpage and select "Inspect", or press Ctrl+Shift+I (Cmd+Option+I on macOS).
* Use the developer tools to explore the HTML structure of the webpage.
* Identify the tags and classes of the elements that contain the job titles, company names, locations, and descriptions.
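
Once the tags and classes are known, they translate directly into selectors. The sketch below runs on a small inline HTML string. The class names `job-card`, `job-title`, `company`, and `location` are hypothetical placeholders, not the markup of any real site. Substitute whatever the developer tools show for your target page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: real pages use different tags and class names,
# which you find with the browser's developer tools.
demo_html = """
<div class="job-card">
  <h2 class="job-title">Data Analyst</h2>
  <span class="company">Example Corp</span>
  <span class="location">Remote</span>
</div>
<div class="job-card">
  <h2 class="job-title">Web Developer</h2>
  <span class="company">Sample LLC</span>
  <span class="location">Austin, TX</span>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

jobs = []
# Each card is one listing; look up the fields relative to the card
for card in demo_soup.select("div.job-card"):
    jobs.append({
        "title": card.select_one(".job-title").get_text(strip=True),
        "company": card.select_one(".company").get_text(strip=True),
        "location": card.select_one(".location").get_text(strip=True),
    })

for job in jobs:
    print(job)
```

Selecting the repeating container first and then searching inside it keeps each title paired with its own company and location.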

## 3. Writing the Scraper

### Objective:
Develop the code to scrape data from the target website.

### Tasks:
* Send HTTP requests to the target website.
* Parse the HTML content and extract the required data.
* Handle pagination to scrape data from multiple pages.
* Implement error handling.
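
The pagination and error-handling tasks can be sketched without touching the network. The example below assumes a hypothetical `scrape_all_pages` helper and made-up markup (`li.item` for entries, `a.next` for the link to the following page). It takes the fetching function as a parameter, so a dictionary stands in for the website here.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(fetch_html, start_url, max_pages=5):
    """Follow "next" links from start_url, collecting item texts from each page.

    fetch_html is any callable that takes a URL and returns HTML text,
    raising an exception when the page cannot be loaded.
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        try:
            html = fetch_html(url)
        except Exception as exc:
            # Stop paginating but keep what was collected so far
            print(f"Failed to fetch {url}: {exc}")
            break
        page = BeautifulSoup(html, "html.parser")
        items += [li.get_text(strip=True) for li in page.select("li.item")]
        next_link = page.select_one("a.next[href]")
        # Resolve relative links against the current page URL
        url = urljoin(url, next_link["href"]) if next_link else None
    return items


# Offline stand-in for a website: two pages linked by a "next" anchor
fake_site = {
    "https://example.com/list?page=1":
        '<li class="item">A</li><li class="item">B</li>'
        '<a class="next" href="?page=2">Next</a>',
    "https://example.com/list?page=2":
        '<li class="item">C</li>',
}

results = scrape_all_pages(fake_site.__getitem__, "https://example.com/list?page=1")
print(results)
```

Against a live site, `fetch_html` could be something like `lambda u: requests.get(u, timeout=10).text`. The `seen` set and `max_pages` cap guard against link loops and runaway crawls.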

In [4]:
from bs4 import BeautifulSoup

# Load the HTML file
with open("data/web/granola.html", "r", encoding="utf-8") as file:
    html_content = file.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Extracting the recipe title
title = soup.find("meta", {"property": "og:title"})["content"]

# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", title)
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Granola Cups
Description: These granola cups, filled with yogurt and fruit, are a great grab-and-go breakfast option.
Ingredients:
- 1/4 cup unsalted butter
- 1/2 cup pure maple syrup
- 1 teaspoon vanilla extract
- 1/2 teaspoon ground cinnamon
- 1/4 teaspoon salt, or to taste
- 1/4 teaspoon ground nutmeg
- 2 cups old fashioned oats
- 1/2 cup sliced almonds
- 1/2 cup sweetened flaked coconut
- 2 tablespoons flax meal
- nonstick baking spray with flour (such as Baker's Joy®)
- 1 cup plain whole milk Greek yogurt, or as needed
- 1 cup fresh berries, or as needed
Instructions:
1. Place butter and maple syrup in a large bowl and place into the microwave. Heat in 30 second intervals, stirring after every 30 seconds, until butter is melted, about 1 to 1 1/2 minutes. Remove bowl from microwave and add in vanilla, cinnamon, salt, and nutmeg. Stir until thoroughly combined. Add in oats, almonds, coconut, and flax meal. Stir until mixture is thoroughly combined.
2. Generously grease
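
The cell above only prints what it extracts. To keep the data in a structured format, one option is to collect the fields in a dictionary and serialize it as JSON. The sketch below hard-codes a few values copied from the output above so that it runs on its own. The `recipe` and `recipe_json` names are assumptions for illustration.

```python
import json

# A few values copied from the output above; in the notebook, use the
# `title`, `description`, and `ingredients` variables directly.
recipe = {
    "title": "Granola Cups",
    "description": (
        "These granola cups, filled with yogurt and fruit, "
        "are a great grab-and-go breakfast option."
    ),
    "ingredients": ["1/4 cup unsalted butter", "1/2 cup pure maple syrup"],
}

# ensure_ascii=False keeps characters such as "®" readable in the output
recipe_json = json.dumps(recipe, indent=2, ensure_ascii=False)
print(recipe_json)

# Round-trip check: parsing the JSON gives back the same structure
assert json.loads(recipe_json) == recipe
```

Writing `recipe_json` to a file opened with `encoding="utf-8"` (for example a hypothetical `granola.json`) makes the result reusable outside the notebook.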

In [5]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Keep links whose URL contains "recipe" (a loose filter: it also matches login, magazine, and category pages)
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Fgranola-cups-recipe-8669258
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Fgranola-cups-recipe-8669258
/account/add-recipe
https://www.magazines.com/allrecipes-magazine.html?utm_source=allrecipes.com&utm_medium=owned&utm_campaign=i111arr1w2661
https://www.magazines.com/allrecipes-magazine.html
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/94/soups-stews-and-chili/
https://www.allrecipes.com/recipes/16099/everyday-cooking/comfort-food/
https://www.allrecipes.com/recipes/80/main-dish/
https://www
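
As the output shows, the substring test also keeps login pages, magazine links, category indexes, and a relative path. A stricter filter might resolve relative links and match on the URL path instead. The sketch below is only a guess at the site's URL scheme: the `-recipe-<number>` pattern and the base URL are inferred from the page's own path in the login redirect above, and `filter_recipe_urls` and the last sample link are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

# Inferred from the redirect parameter in the output above (an assumption)
BASE_URL = "https://www.allrecipes.com/granola-cups-recipe-8669258"
# Assumed shape of a recipe page path: /some-name-recipe-1234567
RECIPE_PATH = re.compile(r"^/[a-z0-9-]+-recipe-\d+/?$")


def filter_recipe_urls(hrefs, base_url=BASE_URL):
    """Return absolute, de-duplicated URLs whose path looks like a recipe page."""
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # turn relative links into full URLs
        parts = urlparse(absolute)
        if parts.netloc != "www.allrecipes.com":
            continue  # skip links to other sites
        if RECIPE_PATH.match(parts.path) and absolute not in kept:
            kept.append(absolute)
    return kept


sample_hrefs = [
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Fgranola-cups-recipe-8669258",
    "/account/add-recipe",
    "https://www.magazines.com/allrecipes-magazine.html",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    "/granola-cups-recipe-8669258",  # hypothetical relative recipe link
]
filtered = filter_recipe_urls(sample_hrefs)
print(filtered)
```

Matching on the parsed path rather than the whole URL avoids false positives from query strings that merely mention a recipe.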

In [None]:
import requests
from bs4 import BeautifulSoup  # already imported above; repeated so this cell runs on its own

def fetch_recipe_content(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
    try:
        # Without a timeout, requests can wait indefinitely on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for HTTP errors
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Extract relevant information (title, ingredients, instructions, etc.)
        title_tag = soup.find("meta", {"property": "og:title"})
        title = title_tag["content"] if title_tag else "No title found"
        description_tag = soup.find("meta", {"name": "description"})
        description = description_tag["content"] if description_tag else "No description found"
        
        # Reuse the selectors that worked on the saved page above; other pages may use different markup
        ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
        ingredients = [ingredient.get_text(" ", strip=True) for ingredient in ingredients_section] if ingredients_section else ["No ingredients found"]

        instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
        instructions = [instruction.get_text(" ", strip=True) for instruction in instructions_section] if instructions_section else ["No instructions found"]
        
        # Print the extracted information for each recipe
        print("Recipe Title:", title)
        print("Description:", description)
        print("Ingredients:")
        for ingredient in ingredients:
            print("-", ingredient)
        print("Instructions:")
        for i, instruction in enumerate(instructions, 1):
            print(f"{i}. {instruction}")
        print("\n")
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch {url}: {e}")

# Loop through the list of recipe URLs and fetch their contents
for url in recipe_urls[:10]:
    fetch_recipe_content(url)
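
The cell above reports a failed request and moves on. For transient failures, a common extension of that error handling is to retry with a growing delay. The sketch below is one possible approach, not part of the original exercise: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the fetcher is passed in as a callable so the demonstration runs offline.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts and
    re-raises the last error once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries - 1:
                raise
            delay = backoff * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://example.com", sleep=waits.append)
print(page, waits)
```

With `requests`, the `fetch` argument could be a small wrapper such as `lambda u: requests.get(u, headers=headers, timeout=10)`. Injecting `sleep` keeps the example fast and also leaves room for a fixed pause between successive pages.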