# Web Crawling with Python - Project
## Crawling and Analyzing Job Postings

---

### Context

Job listings are a valuable source of data for understanding labor market trends, popular skills, and opportunities by location.

**Ethics Notice:**  
Never scrape real sites without permission or in violation of their terms of service. For this project, use a public demo site (e.g. https://realpython.github.io/fake-jobs/), a local HTML file, or a permitted sandbox.


---

## Task 1: Planning

Visit https://realpython.github.io/fake-jobs/ and inspect several job postings.  
Sketch a plan for the main steps of your crawler:

- Download HTML
- Parse the HTML
- Extract the relevant fields (title, company, location, description, link)
- Store data in a structured format
- Analyze and visualize the data

_List your plan or make a diagram below:_


### TODO: List your plan here or draw a diagram (you can use Markdown cells for diagrams!)

# Import the libraries and dependencies
import requests
from bs4 import BeautifulSoup


---

## Task 2: Download and Parse the Webpage


In [None]:
def get_soup(url):
    """
    Download the page content from `url` and return a BeautifulSoup object.
    HINT: Use requests and BeautifulSoup.
    """
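    # A timeout keeps the request from hanging forever on an unresponsive server
    page = requests.get(url, timeout=10)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    page.raise_for_status()

    soup = BeautifulSoup(page.content, "html.parser")

    return soup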

# Test: Try calling get_soup with the jobs site URL
job_url = "https://realpython.github.io/fake-jobs/"
soup = get_soup(job_url)
print(soup.prettify()[:500])


---

## Task 3: Extract Job Postings


In [None]:
def extract_jobs(soup):
    """
    Find and extract all job postings from the BeautifulSoup object.
    Each job should include: title, company, location, description, and link.
    HINT: Identify the main container for each job posting. Use soup.find_all and element.find.
    Return a list of job dictionaries.
    """
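    job_cards = soup.find_all("div", class_="card-content")

    def text_of(element):
        # Return the stripped text, or an empty string when the element is missing
        return element.text.strip() if element else ""

    jobs = []
    for job_card in job_cards:
        title_element = job_card.find("h2", class_="title")
        company_element = job_card.find("h3", class_="company")
        location_element = job_card.find("p", class_="location")
        # Assumption: the last link in a card points to the posting's detail page
        link_elements = job_card.find_all("a", href=True)
        jobs.append({
            "title": text_of(title_element),
            "company": text_of(company_element),
            "location": text_of(location_element),
            # Assumption: the listing cards carry no description (it would have
            # to come from the detail page), so the field is left empty here
            "description": "",
            "link": link_elements[-1]["href"] if link_elements else "",
        })
    return jobs

# HINT: Print the length of the jobs list and show a sample job.
jobs = extract_jobs(soup)
print(len(jobs))
print(jobs[0] if jobs else "No jobs found")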


---

## Task 4: Data Cleaning


In [None]:
def clean_job_data(jobs):
    """
    Clean and normalize job data if necessary.
    HINT: Remove extra whitespace, handle missing fields, standardize case, etc.
    Return a new list of cleaned job dicts.
    """
    cleaned_jobs = []
    for job in jobs:
        cleaned_job = {}
        for field, value in job.items():
            # Treat missing values as empty strings and collapse repeated whitespace
            cleaned_job[field] = " ".join(str(value or "").split())
        cleaned_jobs.append(cleaned_job)
    return cleaned_jobs

# HINT: What fields might need cleaning in this dataset?
# Titles vary in case and wording, so match "python" case-insensitively.
python_jobs = soup.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)
print(len(python_jobs))

# An exact string match only finds headings whose full text is "Python".
exact_matches = soup.find_all("h2", string="Python")
print(len(exact_matches))


---

## Task 5: Filtering and Searching


In [None]:
def filter_jobs_by_keyword(jobs, keyword):
    """
    Return all jobs where the keyword appears in the title or description (case-insensitive).
    """
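    keyword = keyword.lower()
    return [
        job for job in jobs
        if keyword in (job.get("title") or "").lower()
        or keyword in (job.get("description") or "").lower()
    ]


# Example: print the postings that mention "python"
for job in filter_jobs_by_keyword(extract_jobs(soup), "python"):
    print(job["title"])
    print(job["company"])
    print(job["location"])
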

def count_jobs_by_field(jobs, field):
    """
    Count jobs by a specific field, e.g., 'location' or 'company'.
    Return a dict {field_value: count}
    """
    counts = {}
    for job in jobs:
        # Group missing or empty values under a single "Unknown" key
        value = job.get(field) or "Unknown"
        counts[value] = counts.get(value, 0) + 1
    return counts


# Example: count the postings per location
location_counts = count_jobs_by_field(extract_jobs(soup), "location")
print(location_counts)
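
Counting by a field can also lean on the standard library. The sketch below is one possible variant using `collections.Counter`; the function name `count_with_counter` and the `sample_jobs` records are made up for illustration and are not part of the project template.

```python
from collections import Counter


def count_with_counter(jobs, field):
    """Count job dicts by `field`; missing or empty values are grouped under 'Unknown'."""
    return Counter(job.get(field) or "Unknown" for job in jobs)


# Hypothetical sample data, only to show the call
sample_jobs = [
    {"title": "Python Developer", "location": "City A"},
    {"title": "Data Engineer", "location": "City A"},
    {"title": "Energy Engineer", "location": "City B"},
    {"title": "Legal Executive"},  # no location recorded
]

sample_counts = count_with_counter(sample_jobs, "location")
print(sample_counts.most_common(1))  # the single most frequent location
```

Because `Counter` is a `dict` subclass, the result can be passed to a function that expects a plain `{value: count}` dictionary, and `most_common(n)` gives the top entries without a manual sort.
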
        
    

---

## Task 6: Analysis and Visualization


In [None]:
import matplotlib.pyplot as plt


def plot_jobs_by_location(job_counts):
    """
    Plot a bar chart of the number of jobs per location.
    HINT: Use matplotlib.
    """
    if not job_counts:
        print("No job location data to plot.")
        return

    locations = list(job_counts.keys())
    counts = list(job_counts.values())

    plt.figure(figsize=(10, 6)) # Adjust figure size for better readability
    plt.bar(locations, counts, color='skyblue')
    plt.xlabel("Location")
    plt.ylabel("Number of Job Postings")
    plt.title("Number of Job Postings by Location")
    plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
    plt.tight_layout() # Adjust layout to prevent labels from overlapping
    plt.show()

def plot_top_companies(job_counts, top_n=5):
    """
    Plot a pie chart of the top N companies by number of job postings.
    """
    if not job_counts:
        print("No data to plot.")
        return

    # Sort companies by job count in descending order
    sorted_companies = sorted(job_counts.items(), key=lambda item: item[1], reverse=True)

    # Get the top N companies
    top_n_companies = sorted_companies[:top_n]
    
    # Separate names and counts for plotting
    company_names = [company[0] for company in top_n_companies]
    job_counts_values = [company[1] for company in top_n_companies]

    # Handle "Other" category if there are more than top_n companies
    if len(sorted_companies) > top_n:
        other_count = sum(company[1] for company in sorted_companies[top_n:])
        company_names.append(f'Other ({len(sorted_companies) - top_n} companies)')
        job_counts_values.append(other_count)

    plt.figure(figsize=(8, 8)) # Make the figure square for a better looking pie chart
    plt.pie(job_counts_values, labels=company_names, autopct='%1.1f%%', startangle=140, pctdistance=0.85)
    plt.title(f"Top {top_n} Companies by Job Postings")
    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.show()



---

## Bonus: Pagination

The example site does not have pagination, but many real job boards do.
How would you modify your crawler to follow "next page" links and collect jobs from multiple pages?
_Describe your approach here:_


A typical approach is to inspect the page's HTML, locate the "next page" link (which could be text, an image, or a button), and extract its `href` attribute (or the equivalent if it is not an anchor tag). The crawler then adds that URL to its queue of pages to process and repeats the step until no next link is found.
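
The loop described above could be sketched roughly as follows. This is an illustration under assumptions, not a finished crawler: `crawl_pages`, `fetch_html`, and `find_next_href` are hypothetical names, the page download and the link lookup are passed in as functions, and the three-page `fake_site` is invented so the loop can run without a network connection.

```python
from urllib.parse import urljoin


def crawl_pages(start_url, fetch_html, find_next_href, max_pages=50):
    """
    Follow "next page" links starting at `start_url`.

    fetch_html(url)      -> HTML text of the page
    find_next_href(html) -> href of the next-page link, or None on the last page
    Returns a list of (url, html) tuples in visiting order.
    """
    pages = []
    seen = set()
    url = start_url
    # Stop when there is no next link, a URL repeats, or the page cap is reached
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch_html(url)
        pages.append((url, html))
        href = find_next_href(html)
        # hrefs are often relative, so resolve them against the current page URL
        url = urljoin(url, href) if href else None
    return pages


# Hypothetical three-page site held in memory so the loop can run offline
fake_site = {
    "https://example.com/jobs/": "page 1 | next=page2.html",
    "https://example.com/jobs/page2.html": "page 2 | next=page3.html",
    "https://example.com/jobs/page3.html": "page 3",
}


def fake_find_next_href(html):
    # Stand-in for real HTML parsing: read the marker used in fake_site
    return html.split("next=")[1] if "next=" in html else None


pages = crawl_pages("https://example.com/jobs/", fake_site.get, fake_find_next_href)
print([url for url, _ in pages])
```

In this project, `fetch_html` could wrap `requests.get(url).text`, and `find_next_href` could use BeautifulSoup to look up the link. Which tag, text, or class marks the next link depends on the site, so that part has to be worked out by inspecting the page.
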

---

## Bonus: Exporting Data

How would you save your job data to a CSV or JSON file?
_Outline your steps or code below:_


Python's standard library can convert data to JSON or CSV files. A JSON file looks much like a list of Python dictionaries, so first import the `json` module:

import json

### collect the job dictionaries in a list
extracted_data = []

### append each job dictionary (from the `jobs` list built in Task 3) to the list
for job in jobs:
    extracted_data.append(job)

### print the JSON text to check it, then save it to a file ("name.json")
print(json.dumps(extracted_data, indent=2))
with open("name.json", "w", encoding="utf-8") as f:
    json.dump(extracted_data, f, indent=2)
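
For the CSV half of the question, one possible sketch uses `csv.DictWriter` from the standard library, with a matching JSON writer alongside it. The helper names `save_jobs_csv` and `save_jobs_json`, the `FIELDS` list, and the sample record are assumptions made for this example; the files are written to a temporary folder so nothing in the project directory is touched.

```python
import csv
import json
import os
import tempfile

# Column order for the CSV file (assumed to match the fields named in Task 3)
FIELDS = ["title", "company", "location", "description", "link"]


def save_jobs_csv(jobs, path):
    """Write job dicts to a CSV file: one header row, then one row per job."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        # restval fills missing fields; extrasaction="ignore" drops unknown keys
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(jobs)


def save_jobs_json(jobs, path):
    """Write job dicts to a JSON file as a list of objects."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(jobs, f, indent=2, ensure_ascii=False)


# Hypothetical sample record, written to a temporary folder
sample_jobs = [
    {
        "title": "Python Developer",
        "company": "Example Co",
        "location": "City A",
        "link": "https://example.com/jobs/1",
    }
]
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "jobs.csv")
json_path = os.path.join(out_dir, "jobs.json")
save_jobs_csv(sample_jobs, csv_path)
save_jobs_json(sample_jobs, json_path)

# Read the CSV back to confirm the round trip
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["title"], rows[0]["location"])
```

CSV suits flat records that will be opened in a spreadsheet, while JSON keeps the list-of-dictionaries shape and can be loaded back with `json.load` unchanged.
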

---

## Reflection

- Did you notice any anti-scraping protections?
- How can you make your crawler more polite (e.g., delays, headers, respecting robots.txt)?
- Why is it important to use demo/test data when learning scraping?

_Write your thoughts below:_


No, I did not notice any anti-scraping protections on the demo site. Even so, a crawler should be implemented carefully to avoid problems such as IP blocking and user-agent detection.

- The robots.txt file specifies which parts of a website crawlers are allowed to access. Always check this file before crawling and follow its rules.
- Introduce delays between requests to avoid overwhelming the target website's server. The delay can be fixed or randomized.
- Websites analyze request headers to identify bots. Send realistic headers, such as User-Agent and Referer, that mimic real browser behavior.
- When scraping a large number of pages, consider IP rotation, which distributes requests across multiple IP addresses and makes the crawler harder to detect and block.
- As with IP rotation, rotating User-Agent strings can help avoid detection based on user-agent analysis.
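
The robots.txt check and the request delay mentioned above can be sketched with the standard library alone. The names `build_robot_rules`, `polite_delay`, and `USER_AGENT`, as well as the inlined robots.txt text, are assumptions made for this example so that it runs offline; a real crawler would download the site's actual robots.txt first.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity; a real crawler would put its own name and contact here
USER_AGENT = "job-crawler-exercise"


def build_robot_rules(robots_txt_text):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt_text.splitlines())
    return rules


def polite_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random time between requests and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Hypothetical robots.txt content, inlined so the example runs offline
example_robots = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(example_robots)
for url in ["https://example.com/jobs/", "https://example.com/private/admin"]:
    if rules.can_fetch(USER_AGENT, url):
        print("allowed:", url)
        polite_delay(0.01, 0.02)  # tiny delay for the demo; use seconds in practice
    else:
        print("skipped (disallowed by robots.txt):", url)
```

With `requests`, the same identity could be sent on each call as `requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)`, so the site operator can see who is crawling.
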

**Why use demo/test data?**

- Scraping real websites without caution can lead to your IP address being blocked, especially if you send many requests or violate their terms of service. Using demo data prevents this.
- Testing your scraper on demo data lets you experiment with different approaches and refine your code without the risk of disrupting real website operations.
- Demo data can help you understand the structure of the website you want to scrape, so you can develop more robust and efficient scraping logic.
- Real websites may have dynamic elements, changing structures, or other unexpected behaviors. Testing on demo data lets you develop comprehensive error-handling routines first.

