# Web Scraping Exercise

## 1. Introduction and Planning

### Objective:
The goal of this exercise is to build a web scraper that collects data from a chosen website. You will learn how to send HTTP requests, parse HTML content, extract relevant data, and store it in a structured format.

### Tasks:
1. Identify the data you want to scrape.
2. Choose the target website(s).
3. Plan the structure of your project.

### Example:
For this exercise, we will scrape job listings from Indeed.com. We will extract job titles, company names, locations, and job descriptions.
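
To make the plan concrete, the four fields can be modeled as a record type before any scraping code exists. The following is a minimal sketch: the `JobListing` class, the `listings_to_csv` helper, and the sample values are hypothetical names and placeholder data chosen for illustration, not a required design for the exercise.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class JobListing:
    """One scraped job posting (hypothetical record type for this exercise)."""
    title: str
    company: str
    location: str
    description: str


def listings_to_csv(listings):
    """Serialize a list of JobListing records to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(JobListing)])
    writer.writeheader()
    for listing in listings:
        writer.writerow(asdict(listing))
    return buffer.getvalue()


# Placeholder data, not real scraped output.
sample = [JobListing("Data Scientist", "Example Corp", "Remote", "Build models.")]
csv_text = listings_to_csv(sample)
print(csv_text)
```

Deciding on the record shape first makes it easier to see which HTML elements need to be located in the next section.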

## 2. Understanding the Target Website

### Objective:

Analyze the structure of the web pages to be scraped.

### Tasks:

* Inspect the target website using browser developer tools.
* Identify the HTML elements that contain the desired data.

### Instructions:

* Open your browser and navigate to the target website (e.g., Indeed.com).
* Right-click on the webpage and select "Inspect", or press Ctrl+Shift+I (Cmd+Option+I on macOS).
* Use the developer tools to explore the HTML structure of the webpage.
* Identify the tags and classes of the elements that contain the job titles, company names, locations, and descriptions.
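
When a page has many nested elements, a quick inventory of tag and class combinations can complement manual inspection. The sketch below uses only the standard library's `html.parser`. The `ClassInventory` name and the sample HTML are made up for illustration; the class names in the sample simply mirror the ones used later in this exercise.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassInventory(HTMLParser):
    """Count every (tag, class) pair seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        class_value = dict(attrs).get("class") or ""
        for class_name in class_value.split():
            self.counts[(tag, class_name)] += 1


# Hand-written sample markup, not copied from a real site.
sample_html = """
<div class="card"><h2 class="jobTitle">Data Scientist</h2>
  <span class="companyName">Example Corp</span></div>
<div class="card"><h2 class="jobTitle">ML Engineer</h2>
  <span class="companyName">Sample Inc</span></div>
"""

inventory = ClassInventory()
inventory.feed(sample_html)
for (tag, class_name), count in inventory.counts.most_common():
    print(f"<{tag} class='{class_name}'> x{count}")
```

Pairs that repeat once per listing are good candidates for the elements that hold each field.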

## 3. Writing the Scraper

### Objective:

Develop the code to scrape data from the target website.

### Tasks:

* Send HTTP requests to the target website.
* Parse the HTML content and extract the required data.
* Handle pagination to scrape data from multiple pages.
* Implement error handling.
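
Pagination usually comes down to generating one URL per results page. The sketch below assumes the site exposes an offset-style query parameter. The `start` parameter name and the page size of 10 are assumptions that should be checked against the URLs the site actually produces when you click through result pages, and `build_page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, query, pages, page_size=10, offset_param="start"):
    """Return one search URL per results page.

    offset_param and page_size are assumptions; verify them in the browser.
    """
    urls = []
    for page in range(pages):
        params = {"q": query, "l": "", offset_param: page * page_size}
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


page_urls = build_page_urls("https://www.indeed.com/jobs", "data scientist", pages=3)
for page_url in page_urls:
    print(page_url)
```

Each generated URL can then be requested and parsed in turn, ideally with a short pause between requests.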

In [None]:
import requests
from bs4 import BeautifulSoup

# Define the URL of the target website
url = 'https://www.indeed.com/jobs?q=data+scientist&l='

# Define a user-agent to make the request look like it's coming from a web browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

# Send an HTTP request to the target URL; the timeout keeps the call from hanging indefinitely
response = requests.get(url, headers=headers, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
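    # Extract the job listings. These class names are site-specific and may
    # change; confirm them in the developer tools before running.
    job_titles = soup.find_all('h2', class_='jobTitle')
    companies = soup.find_all('span', class_='companyName')
    locations = soup.find_all('div', class_='companyLocation')
    descriptions = soup.find_all('div', class_='job-snippet')

    # Print the extracted data. zip() stops at the shortest list, so a listing
    # that lacks one of the fields can shift later values out of alignment.
    for title, company, location, description in zip(job_titles, companies, locations, descriptions):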
        print(f'Job Title: {title.get_text(strip=True)}')
        print(f'Company: {company.get_text(strip=True)}')
        print(f'Location: {location.get_text(strip=True)}')
        print(f'Description: {description.get_text(strip=True)}')
        print('-' * 40)
else:
    print(f'Failed to retrieve the webpage. Status code: {response.status_code}')
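
The status-code check above covers only one failure mode; timeouts and dropped connections raise exceptions instead. A minimal sketch of retrying with exponential backoff follows. `fetch_with_retries` is a hypothetical helper that takes the fetch function as an argument so it stays independent of any particular HTTP library, and the retry count and delays are arbitrary starting values.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    fetch is any callable that returns a result or raises an exception.
    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to the library's error type in real use
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)


# Demonstration with a fake fetch that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


delays = []  # record the requested delays instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://example.com", sleep=delays.append)
print(page)
```

With `requests`, the `fetch` argument could be a small function that calls `requests.get(url, headers=headers, timeout=10)` and then `response.raise_for_status()`, so that HTTP error statuses also trigger a retry.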
