### Downloading and Importing Necessary Libraries ###

# Data Analytics Job Market Analysis - Web Scraping & Insights

## Overview  
This project focuses on **web scraping job listings** for Data Analytics roles using **Selenium**. I extracted job titles, company names, locations, job types, and salary ranges from SimplyHired to analyze salary distributions and hiring trends.

## Challenges Faced & Fixes  

1. **Website Blocking & CAPTCHAs**  
   - Some sites detected automation, so I switched to a scrape-friendly job portal.  

2. **Dynamic Page Loading Issues**  
   - Used `WebDriverWait` to ensure elements were fully loaded before extraction.  

3. **Data Extraction Errors**  
   - **Company Names Missing**: Adjusted XPath selectors to correctly extract employer names.  
   - **Job Location Extraction**: Ensured the right HTML elements were targeted to capture city/state info.  
   - **Salary Formatting Issues**:  
     - Converted "95.8K" to "95,800" to ensure accurate numerical representation.  
     - Removed "Estimated:" text to extract clean salary values.  

4. **Inconsistent Job Details Structure**  
   - Some job postings did not list salaries, so missing values were handled as "Not Provided."  
   - Job type (Full-time/Part-time) was mixed with salary details, so I split them into separate columns.  

5. **Data Cleaning & Visualization Issues**  
   - Filtered out unrealistic salary values (e.g., below $20,000).  
   - Adjusted bin sizes in histograms to avoid distorted distributions.  
   - Ensured company analysis only included valid employers (not "Not Listed").  

## Key Analyses Conducted  

- **Top Hiring Companies** – Identifying companies with the highest number of job postings.  
- **Salary Distribution** – Analyzing the common salary ranges in Data Analytics roles.  
- **Highest-Paying Companies** – Determining which firms offer the best max salaries.  

This project provides a structured analysis of the Data Analytics job market, offering insights into salary expectations and hiring trends.  


In [3]:
!pip install selenium
!pip install webdriver_manager
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
import matplotlib.pyplot as plt
import re
import numpy as np



### Set-Up Selenium Web Driver ###

In [5]:
# Initialize WebDriver
driver = webdriver.Chrome()

# Open SimplyHired
driver.get("https://www.simplyhired.com/")
driver.maximize_window()

# Set wait time
wait = WebDriverWait(driver, 15)


### Search for "Data Analytics" Jobs ###

In [7]:
# Locate the job title input field and enter "Data Analytics"
search_box = wait.until(EC.presence_of_element_located((By.NAME, "q")))
search_box.clear()
search_box.send_keys("Data Analytics")

# Click the search button
search_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='findJobsSearchSubmit']")))
search_button.click()

# Wait for the page to load
time.sleep(5)


### Scrape Job Titles & Links (Up to 10 Pages) ###

In [9]:
# Dictionary to store job data
job_data = {"Job Title": [], "Job Link": []}

# Set page limit
page_number = 1
max_pages = 10

while page_number <= max_pages:
    print(f"Scraping page {page_number}...")

    job_cards = driver.find_elements(By.XPATH, "//div[@data-testid='searchSerpJob']")

    if not job_cards:
        print("No more job listings found. Stopping scraper.")
        break

    for job in job_cards:
        try:
            # Extract Job Title
            title = job.find_element(By.XPATH, ".//h2[@data-testid='searchSerpJobTitle']").text

            # Extract Job Link
            link_element = job.find_element(By.XPATH, ".//a")
            job_link = link_element.get_attribute("href")

            # Store data
            job_data["Job Title"].append(title)
            job_data["Job Link"].append(job_link)

        except Exception as e:
            print(f"Error extracting job: {e}")

    # Click 'Next Page' if available
    try:
        next_button = driver.find_element(By.XPATH, "//a[contains(@aria-label, 'Next')]")

        if not next_button.is_displayed() or "disabled" in next_button.get_attribute("class"):
            print(f"End of pagination reached. Total jobs extracted: {len(job_data['Job Title'])}")
            break

        next_button.click()
        time.sleep(5)
        page_number += 1

    except Exception:
        print(f"Scraping complete. Total jobs extracted: {len(job_data['Job Title'])}")
        break


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...


### Convert Job Titles & Links to DataFrame ###

In [11]:
# Convert scraped job data to DataFrame
df = pd.DataFrame(job_data)

# Display first 5 rows
df.head()

print(f"Total job listings in the DataFrame: {len(df)}")


Total job listings in the DataFrame: 200


### Limit Dataframe to 50 Jobs ###

In [13]:
df_limited = df.head(50)  # Select only the first 100 jobs


 ### Extract Additional Details (for 100 Jobs) and Convert to CSV ###

In [None]:

# Step 1: Add new columns for extracted details
df_limited["Company Name"] = ""
df_limited["Location"] = ""
df_limited["Expected Pay"] = ""
df_limited["Job Description"] = ""

# Step 2: Set up Selenium WebDriver
driver = webdriver.Chrome()

print("\nExtracting additional job details...")

# Step 3: Loop through the first 50 job links
for index, row in df_limited.iterrows():
    job_link = row["Job Link"]
    try:
        driver.get(job_link)
        time.sleep(random.uniform(5, 10))  # Random delay to avoid detection

        # Extract Company Name
        try:
            company = driver.find_element(By.XPATH, "//span[@data-testid='detailText']").text
        except:
            company = "Not Listed"

        # Extract Location (Updated XPath)
        try:
            location = driver.find_element(By.XPATH, "//span[@data-testid='detailText' and ancestor::span[@data-testid='viewJobCompanyLocation']]").text
        except:
            location = "Not Listed"

        # Extract Expected Pay (Salary)
        try:
            salary = driver.find_element(By.XPATH, "//span[@data-testid='detailText'][preceding::div[@data-testid='viewJobBodyJobCompensation']]").text
        except:
            salary = "Not Provided"

        # Extract Job Description (First 500 characters)
        try:
            job_desc = driver.find_element(By.XPATH, "//div[@data-testid='viewJobBodyJobDetailsContainer']").text[:500]
        except:
            job_desc = "Description Not Available"

        # Store extracted details
        df_limited.at[index, "Company Name"] = company
        df_limited.at[index, "Location"] = location
        df_limited.at[index, "Expected Pay"] = salary
        df_limited.at[index, "Job Description"] = job_desc

        print(f"Extracted details for: {row['Job Title']}")

    except Exception as e:
        print(f"Error extracting details for {job_link}: {e}")

# Step 4: Save the Updated Data to CSV
csv_filename = "data_analytics_jobs_50_fixed.csv"
df_limited.to_csv(csv_filename, index=False)
print(f"\nData saved to {csv_filename}")

# Step 5: Close the Browser
driver.quit()

# Step 6: Display the Top 10 Rows
df_limited.head(10)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_limited["Company Name"] = ""
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_limited["Location"] = ""
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_limited["Expected Pay"] = ""
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = valu


Extracting additional job details...


### Split Job Details to Show 'Job Type' and 'Expected Pay' Separately ###

In [None]:
# Step 1: Split Job Description into Job Type and Expected Pay
df_limited["Job Type"] = df_limited["Job Description"].apply(lambda x: x.split("\n")[0] if isinstance(x, str) else "Not Listed")
df_limited["Expected Pay"] = df_limited["Job Description"].apply(lambda x: x.split("\n")[1] if isinstance(x, str) and "\n" in x else "Not Provided")

# Step 2: Drop the old "Job Description" column
df_limited.drop(columns=["Job Description"], inplace=True)

# Step 3: Save the cleaned data to CSV
csv_filename = "data_analytics_jobs_cleaned.csv"
df_limited.to_csv(csv_filename, index=False)

# Step 4: Display the first 10 rows
df_limited.head(10)


### Load the CSV for Data Analysis ###

In [None]:

# Load the cleaned CSV
df = pd.read_csv("data_analytics_jobs_cleaned.csv")  

# Display first 5 rows
df.head()


### Top Hiring Companies ###

In [None]:
# Filter out "Not Listed" company names
filtered_companies = df[df["Company Name"] != "Not Listed"]

# Count job postings by company (after filtering)
top_companies = filtered_companies["Company Name"].value_counts().head(10)

# Check if we have enough data to plot
if not top_companies.empty:
    # Plot the data
    plt.figure(figsize=(10,5))
    top_companies.plot(kind="bar", color="cornflowerblue")
    plt.xlabel("Company Name")
    plt.ylabel("Number of Job Postings")
    plt.title("Top 10 Hiring Companies for Data Analytics Jobs")
    plt.xticks(rotation=45, ha="right")
    plt.show()
else:
    print("No valid company names available for analysis.")


### Extracting Maximum and Minimum Salaries ###

In [None]:
# Function to clean and extract salary values
def extract_salary(salary):
    if pd.isna(salary):  # Check for NaN values
        return [None, None]

    salary = str(salary).replace("Estimated:", "").strip()  # Remove "Estimated"

    # Extract numeric values (handling "K")
    numbers = re.findall(r"(\d{1,3}(?:\.\d)?)[K]?", salary)  # Capture numbers with optional "K"
    
    # Convert "K" values to full numbers
    clean_numbers = []
    for num in numbers:
        if "K" in salary:  # If "K" is in text, multiply by 1,000
            clean_numbers.append(float(num) * 1000)
        else:
            clean_numbers.append(float(num))
    
    # Ensure we return exactly two values (min & max salary)
    return clean_numbers if len(clean_numbers) == 2 else [None, None]

# Apply extraction
df[["Min Salary", "Max Salary"]] = df["Expected Pay"].apply(lambda x: pd.Series(extract_salary(x)))

# Convert to numeric values
df["Min Salary"] = pd.to_numeric(df["Min Salary"], errors="coerce")
df["Max Salary"] = pd.to_numeric(df["Max Salary"], errors="coerce")

# Display first few rows to confirm the fix
df.head()


### Distribution of Maximum Salaries ###

In [None]:

# Drop NaN values from Max Salary
salary_data = df.dropna(subset=["Max Salary"])

# Filter out unrealistic values (e.g., below $20,000)
salary_data = salary_data[salary_data["Max Salary"] > 20000]

# Define bins dynamically based on data range
bins = np.linspace(salary_data["Max Salary"].min(), salary_data["Max Salary"].max(), 15)  # 15 bins

# Plot histogram for Max Salary only
plt.figure(figsize=(10,5))
plt.hist(salary_data["Max Salary"], bins=bins, alpha=0.7, color="red", edgecolor="black")
plt.xlabel("Max Salary (USD)")
plt.ylabel("Number of Jobs")
plt.title("Distribution of Max Salaries for Data Analytics Jobs")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()


### Top 10 Companies by Average Max Salary ###

In [None]:

# Filter out "Not Listed" companies
filtered_data = df[df["Company Name"] != "Not Listed"]

# Group by company and calculate the average max salary
company_salary = filtered_data.groupby("Company Name")["Max Salary"].mean().sort_values(ascending=False).head(10)

# Check if we have enough data to plot
if not company_salary.empty:
    # Plot the data
    plt.figure(figsize=(10,5))
    company_salary.plot(kind="bar", color="green", edgecolor="black")
    plt.xlabel("Company Name")
    plt.ylabel("Average Max Salary (USD)")
    plt.title("Top 10 Companies by Average Max Salary for Data Analytics Jobs")
    plt.xticks(rotation=45, ha="right")
    plt.grid(axis="y", linestyle="--", alpha=0.7)
    plt.show()
else:
    print("No valid company salary data available for analysis.")
