# Web Scraping Exercise

## 1. Introduction and Planning

### Objective:
The goal of this exercise is to build a web scraper that collects data from a chosen website. You will learn how to send HTTP requests, parse HTML content, extract relevant data, and store it in a structured format.

### Tasks:
1. Identify the data you want to scrape.
2. Choose the target website(s).
3. Plan the structure of your project.

### Example:
For this exercise, we will scrape job listings from Indeed.com. We will extract job titles, company names, locations, and job descriptions.

## 2. Understanding the Target Website

### Objective:
Analyze the structure of the web pages to be scraped.

### Tasks:
* Inspect the target website using browser developer tools.
* Identify the HTML elements that contain the desired data.

### Instructions:

* Open your browser and navigate to the target website (e.g., Indeed.com).
* Right-click on the webpage and select "Inspect" or press Ctrl+Shift+I.
* Use the developer tools to explore the HTML structure of the webpage.
* Identify the tags and classes of the elements that contain the job titles, company names, locations, and descriptions.
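
Once the tags and classes are known, they can be checked against a small HTML fragment before writing the full scraper. The sketch below uses only the standard library's `html.parser`; the class names (`job-title`, `company`) and the sample markup are hypothetical, and it assumes that matching elements do not contain nested elements of the same tag. With BeautifulSoup, the equivalent lookup would be `soup.find_all("h2", class_="job-title")`.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> element that carries a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._inside = False  # True while parsing inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.results[-1] += data


# Hypothetical markup, standing in for what the developer tools would show.
sample_html = """
<div class="job-card">
  <h2 class="job-title">Data Engineer</h2>
  <span class="company">Example Corp</span>
</div>
<div class="job-card">
  <h2 class="job-title">Web Developer</h2>
  <span class="company">Sample Ltd</span>
</div>
"""

collector = ClassTextCollector("h2", "job-title")
collector.feed(sample_html)
titles = [text.strip() for text in collector.results]
print(titles)
```

If the printed list is empty, the tag or class name noted during inspection is probably wrong or generated dynamically by JavaScript.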

## 3. Writing the Scraper

### Objective:
Develop the code to scrape data from the target website.

### Tasks:
* Send HTTP requests to the target website.
* Parse the HTML content and extract the required data.
* Handle pagination to scrape data from multiple pages.
* Implement error handling.
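
Pagination is often handled by generating one URL per results page. The following is a minimal sketch under the assumption that the site exposes the page number as a query parameter; the parameter name `page`, the helper `build_page_urls`, and the `example.com` URL are all hypothetical and must be adapted to what the target site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, first_page, last_page, page_param="page"):
    """Return one URL per page by setting `page_param` in the query string."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # keep any existing query parameters
    urls = []
    for page in range(first_page, last_page + 1):
        query[page_param] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


page_urls = build_page_urls("https://example.com/jobs?q=python", 1, 3)
for url in page_urls:
    print(url)
```

The resulting list can be passed to any function that downloads a list of URLs.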

##### Library imports

In [1]:
from bs4 import BeautifulSoup
import numpy as np  # provides np.nan, used in get_recipe_descriptions
import pandas as pd
import requests
from tqdm import tqdm

##### Reads the saved Allrecipes "Recipes A-Z" page from a local HTML file and parses it with BeautifulSoup

In [3]:
headers = {
    "User-Agent": "My Python App"
}

with open(r"C:\Users\VELA\Desktop\Recuperacion\Information-Retrieval\webScraping\data\Allrecipes_RecipesA-Z.html", "r", encoding="utf-8") as file: 
    html_content = file.read()

soup = BeautifulSoup(html_content, "html.parser")

##### Retrieves the HTML content from a list of URLs, parses each page with BeautifulSoup, and handles request errors and timeouts

In [4]:
def fetch_html_content(urls, headers, timeout=10):
    """Download each URL and return the successfully fetched pages as BeautifulSoup objects."""
    html_contents = []
    for url in tqdm(urls, desc="Progress"):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            # Treat HTTP 4xx/5xx responses as failures instead of parsing error pages.
            response.raise_for_status()
            html_contents.append(BeautifulSoup(response.text, "html.parser"))
        except requests.Timeout:
            print(f"Timeout occurred for URL: {url}")
        except requests.RequestException as e:
            # Failed URLs are skipped, so the result may be shorter than `urls`.
            print(f"Request failed for URL: {url} with exception: {e}")
    return html_contents
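
Transient failures such as timeouts can often be recovered by retrying after a short pause. The sketch below shows one possible retry wrapper with exponential backoff; it is not part of the notebook. The names `fetch_with_retries` and `flaky_fetch` are hypothetical, and the fetch function is passed in as a parameter so that the example runs without network access. In real use it would wrap a call such as `requests.get` and catch `requests.RequestException` rather than every `Exception`.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # assumption: narrow this to the client's error type
            if attempt == max_attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed for {url}: {exc}; retrying in {delay}s")
            sleep(delay)


# Stand-in fetcher that fails twice and then succeeds (no real HTTP request).
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


delays = []  # record the requested delays instead of actually sleeping
html = fetch_with_retries(flaky_fetch, "https://example.com/recipe", sleep=delays.append)
print(html, delays)
```

Injecting `sleep` keeps the demonstration instantaneous; leaving the default makes the wrapper pause between attempts.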
 

##### Extracts the links from the parsed page's link list (`li` items with class `comp mntl-link-list__item`) and returns them as a list of URLs

In [5]:
def extract_links(url_elements):
    all_links = []
    for link_item in url_elements.find_all("li", class_="comp mntl-link-list__item"):
        anchor_tag = link_item.find("a", href=True)
        if anchor_tag:
            all_links.append(anchor_tag['href'])
    return all_links


##### Extracts links from an HTML page based on specific attributes and returns a list of URLs

In [6]:
def get_links_from_page(page_elements):
    all_links = []
    for link_item in page_elements.find_all("a", class_="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image"):
        if link_item.has_attr('href'):
            all_links.append(link_item['href'])
    return all_links

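##### Extracts the `<title>` text from each parsed page and returns a list of titles

In [7]:
def get_titles(pages):
    titles = []
    for page in pages:
        title_tag = page.find("title")
        # Pages without a <title> tag get None instead of raising AttributeError.
        titles.append(title_tag.text if title_tag else None)
    return titles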

##### Extracts descriptions from a list of HTML elements and returns a list of descriptions, appending np.nan if no description is found

In [8]:
def get_recipe_descriptions(recipe_elements):
    descriptions = []
    for element in recipe_elements:
        description = element.find("p", class_="article-subheading type--dog")
        if description:
            descriptions.append(description.text)
        else:
            descriptions.append(np.nan)
    return descriptions

##### Extracts ingredients from a list of HTML elements, concatenating each ingredient into a text separated by line breaks

In [9]:
def get_recipe_ingredients(ingredient_elements):
    ingredients = []
    for element in ingredient_elements:
        ingredient_text = ''
        for item in element.find_all("li", class_="mm-recipes-structured-ingredients__list-item"):
            ingredient_text += item.text.strip() + '\n'
        ingredients.append(ingredient_text)
    return ingredients

##### Extracts steps from a list of HTML elements, removing certain unwanted tags, and returns a list of steps as texts separated by line breaks

In [10]:
def get_recipe_steps(step_elements):
    steps = []
    for element in step_elements:
        step_text = ''
        for step_item in element.find_all("li", class_="comp mntl-sc-block mntl-sc-block-startgroup mntl-sc-block-group--LI"):
            for tag in step_item.find_all(["figure", "div"]):
                tag.decompose()
            step_text += step_item.text.strip() + '\n'
        steps.append(step_text)
    return steps


##### Extracts a list of links from the BeautifulSoup object using the extract_links function.

In [11]:
links_html = extract_links(soup)

##### Retrieves the HTML content from a list of links using the fetch_html_content function

In [12]:
recipe_links = fetch_html_content(links_html, headers)

Progress: 100%|██████████| 378/378 [03:07<00:00,  2.02it/s]


##### Extracts the links from each fetched page using the get_links_from_page function, producing one list of links per page

In [13]:
links_per_recipe_page = [get_links_from_page(link) for link in recipe_links]

##### Combines all links from recipe pages into a single list

In [14]:
all_links = [] 
for links_in_page in links_per_recipe_page:
    for link in links_in_page:
        all_links.append(link)
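
If the same link appears on more than one page, the combined list will contain duplicates and those pages will be downloaded more than once. A possible way to flatten the nested lists and drop repeats while keeping the original order is sketched below; the helper name `flatten_unique` and the sample URLs are hypothetical.

```python
from itertools import chain


def flatten_unique(link_lists):
    """Flatten a list of lists and remove duplicates, preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(chain.from_iterable(link_lists)))


links_per_page = [
    ["https://example.com/recipe/1", "https://example.com/recipe/2"],
    ["https://example.com/recipe/2", "https://example.com/recipe/3"],
    [],
]
unique_links = flatten_unique(links_per_page)
print(unique_links)
```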

##### Retrieves the HTML content of every collected link using the fetch_html_content function

In [16]:
recipe_links = fetch_html_content(all_links, headers)

Progress:  31%|███▏      | 5705/18122 [2:14:07<1:36:42,  2.14it/s]    

##### Extracts titles, descriptions, ingredients, and steps from the fetched recipe pages and builds a pandas DataFrame with the columns 'Title', 'Description', 'Ingredients', and 'Steps'

In [None]:
titles = get_titles(recipe_links)
recipe = pd.DataFrame(titles, columns=['Title'])

descriptions = get_recipe_descriptions(recipe_links)
recipe['Description'] = descriptions

ingredients = get_recipe_ingredients(recipe_links)
recipe['Ingredients'] = ingredients

steps = get_recipe_steps(recipe_links)
recipe['Steps'] = steps

recipe
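
The notebook does not show how the results are written to disk. As one possible approach, the sketch below stores records in CSV form with the standard library's `csv` module and reads them back, showing that multi-line fields such as ingredients survive the round trip. The sample records are invented, and an in-memory buffer stands in for a real file. With pandas, the equivalent would be `DataFrame.to_csv` with `index=False`, which also avoids extra `Unnamed: 0` index columns when the file is read back.

```python
import csv
import io

# Hypothetical records with the same four fields as the DataFrame above.
records = [
    {
        "Title": "Example Soup",
        "Description": "A hypothetical recipe.",
        "Ingredients": "1 cup water\n1 pinch salt\n",
        "Steps": "Boil water.\nAdd salt.\n",
    },
    {
        "Title": "Example Toast",
        "Description": "",
        "Ingredients": "1 slice bread\n",
        "Steps": "Toast the bread.\n",
    },
]
fieldnames = ["Title", "Description", "Ingredients", "Steps"]

# For a real file, use open(path, "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)  # fields containing line breaks are quoted automatically

buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["Ingredients"])
```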

#### Multiprocessing application

![multiprocessing](multiprocesamiento_webScraping.jpg)
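
The diagram's implementation is not included in the notebook, so the following is only an illustrative sketch of fetching pages concurrently, not the author's code. It uses a thread pool from `concurrent.futures` rather than separate processes, since downloading is mostly waiting on the network. The names `fetch_page` and `fetch_all`, the worker count, and the URLs are hypothetical, and `fetch_page` returns fake HTML so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    """Stand-in for a real download; returns fake HTML so the sketch runs offline."""
    return f"<html><title>{url}</title></html>"


def fetch_all(urls, fetch=fetch_page, max_workers=8):
    """Fetch all URLs concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as `urls`.
        return list(pool.map(fetch, urls))


urls = [f"https://example.com/recipe/{i}" for i in range(5)]
pages = fetch_all(urls)
print(len(pages), pages[0])
```

Because the results keep the input order, each page can still be matched to the URL it came from. When scraping a real site, a small `max_workers` value helps avoid overloading the server.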

In [3]:
df = pd.read_csv(r"C:\Users\VELA\Desktop\Recuperacion\Information-Retrieval\AllRecipes.csv")
df

| Unnamed: 0.1 | Unnamed: 0 | Title | Description | Ingredients | Steps |
|---|---|---|---|---|---|
| 0 | 0 | Air Fryer Buffalo Wings Recipe | These crispy air fryer Buffalo wings are seaso... | 2 teaspoons sea salt\n1 teaspoon garlic powder... | Preheat an air fryer to 380 degrees F (190 deg... |
| 1 | 1 | Air Fryer Smashed Potatoes Recipe | These golden, crispy air fryer smashed potatoe... | 8 ounces baby gold potatoes\n1 tablespoon melt... | Preheat an air fryer to 400 degrees F (200 deg... |
| 2 | 2 | Air Fryer Quesadillas Recipe | These air fryer quesadillas are golden and cri... | 2 flour tortillas\n1/2 cup shredded cheese\nno... | Heat tortillas in the microwave until pliable,... |
| 3 | 3 | Air Fryer Truffle Polenta Fries Recipe | These air fryer truffle polenta fries, flavore... | 1 (18 ounce) tube prepared polenta\n1 1/2 tabl... | Preheat an air fryer to 400 degrees F (200 deg... |
| 4 | 4 | Air Fryer Firecracker Salmon Bites Recipe | These air fryer firecracker salmon bites get a... | 1/4 cup balsamic vinegar\n1/4 cup brown sugar\... | Combine balsamic vinegar, brown sugar, oil, so... |
| ... | ... | ... | ... | ... | ... |
| 18117 | 18117 | Vegan Zucchini Banana Bread Recipe | This yummy, moist, rich zucchini banana bread ... | 3 cups all-purpose flour\n1 teaspoon salt\n1 t... | Preheat the oven to 325 degrees F (165 degrees... |
| 18118 | 18118 | Zucchini-Raspberry Bread Recipe | It's a simple zucchini nut bread with a splash... | 1 ½ cups self-rising flour\n1 teaspoon ground ... | Preheat an oven to 350 degrees F (175 degrees ... |
| 18119 | 18119 | Healthier Mom's Zucchini Bread Recipe | We packed even more zucchini into Mom's wonder... | 1 ½ cups all-purpose flour\n1 ½ cups white who... | Preheat oven to 325 degrees F (165 degrees C).... |
| 18120 | 18120 | Zucchini Bread, Pumpkin Style Recipe | Although I love zucchini bread, I wanted a new... | 3 medium zucchini, cut into chunks\n4 ¾ cups a... | Preheat an oven to 350 degrees F (175 degrees ... |
