# Job Scraping and Data Processing

This notebook demonstrates how to scrape job listings from SimplyHired using Selenium and how to merge and clean the resulting CSV files. The final output is a single merged dataset containing each job's title, company name, location, estimated salary, details, and link.

## Sections:
1. Scrape job listings for specific roles.
2. Merge scraped data from multiple CSV files.
3. Standardize column names and handle missing data.

## Scrape Job Listings

The `scrape_jobs` function performs web scraping on the SimplyHired website. Given a job title and location, it:
- Constructs the URL.
- Collects job titles, company names, locations, estimated salaries, job details, and job links.
- Handles pagination and error logging.
- Saves the data as a CSV file named after the job title, location, and current date.

**Key Features:**
- Scrapes job listings using Selenium.
- Handles multiple pages and collects job details from each job link.
- Logs errors to a file (`web_scraping_errors.log`) to help with debugging.
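
The URL-construction step can be sketched on its own. The helper below is hypothetical (`build_search_url` is not part of the notebook). It assumes the same `q` and `l` query parameters that `scrape_jobs` uses, and relies on the standard library's `urlencode` so that spaces and characters such as `&` in a job title are escaped.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.simplyhired.com/search"  # same endpoint scrape_jobs visits


def build_search_url(job_title, location):
    """Return a search URL with the job title and location safely encoded."""
    query = urlencode({"q": job_title, "l": location})
    return f"{BASE_URL}?{query}"


print(build_search_url("Data Engineer", "California"))
# https://www.simplyhired.com/search?q=Data+Engineer&l=California
```

Encoding matters most for titles that contain reserved characters: an unescaped `&` would otherwise be read as the start of a new query parameter.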

In [1]:
import logging
import time
from urllib.parse import urlencode

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def scrape_jobs(job_title, location):
    # Set up logging to capture errors
    logging.basicConfig(filename='web_scraping_errors.log', level=logging.ERROR)

    # Create a Chrome browser instance
    driver = webdriver.Chrome()

    # Construct the URL; urlencode escapes spaces and special characters in the query
    url = "https://www.simplyhired.com/search?" + urlencode({"q": job_title, "l": location})

    # Initialize data lists
    job_titles, company_names, job_locations, est_salary, job_details, job_hrefs = [], [], [], [], [], []

    total_job_count = 0

    try:
        # Visit the URL
        driver.get(url)

        while True:
            # Extract the total job count from the element
            total_job_element = driver.find_element(By.CSS_SELECTOR, "p.css-gu0het")
            # Remove thousands separators in case the count is shown as e.g. "1,234"
            total_job_count = int(total_job_element.text.replace(",", ""))

            # Find all job title elements and corresponding company name elements on the current page
            job_title_elements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='searchSerpJobTitle'] a")
            company_name_elements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='companyName']")
            job_location_elements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='searchSerpJobLocation']")
            est_salary_elements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='searchSerpJobSalaryEst']")
            job_href_elements = driver.find_elements(By.CSS_SELECTOR, "[data-testid='searchSerpJobTitle'] a")

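            # Iterate through the job elements to extract titles, companies, locations, salaries, and href links.
            # Note: zip() stops at the shortest list, so if a listing lacks a field
            # (e.g. no salary estimate), later values pair with the wrong job.
            for (job_title_element, company_name_element, job_location_element,
                 est_salary_element, job_href_element) in zip(
                    job_title_elements, company_name_elements, job_location_elements,
                    est_salary_elements, job_href_elements):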
                job_titles.append(job_title_element.text)
                company_names.append(company_name_element.text)
                job_locations.append(job_location_element.text)
                est_salary.append(est_salary_element.text)

                # Open the job link in a new tab to read the full description
                job_href = job_href_element.get_attribute("href")
                job_hrefs.append(job_href)  # Store the href link
                driver.execute_script("window.open('', '_blank');")
                driver.switch_to.window(driver.window_handles[1])
                driver.get(job_href)

                try:
                    job_detail_element = driver.find_element(By.CSS_SELECTOR, "[data-testid='viewJobBodyJobFullDescriptionContent']")
                    job_details.append(job_detail_element.text)
                except NoSuchElementException:
                    job_details.append("N/A")

                driver.close()  # Close the job details tab
                driver.switch_to.window(driver.window_handles[0])  # Switch back to the job listing tab

            # Stop once we have collected data from all pages
            if len(job_titles) >= total_job_count:
                break

            # Find the next page button; if it is absent, the
            # NoSuchElementException handler below ends the scrape
            next_page_button = driver.find_element(By.XPATH, "//a[@aria-label='Next page']")

            # Scroll to the next page button to make it clickable
            driver.execute_script("arguments[0].scrollIntoView();", next_page_button)

            # Click the next page button
            next_page_button.click()

            # Give the next page time to load
            time.sleep(5)  # Wait for 5 seconds

    except NoSuchElementException as e:
        # Handle the specific exception (element not found)
        logging.error(f"Element not found error: {str(e)}")
    except Exception as e:
        # Handle other exceptions and log them
        logging.error(f"An error occurred: {str(e)}")

    finally:
        # Close the browser when done
        driver.quit()

    # Create a dictionary to store the data
    data = {
        'Job Title': job_titles,
        'Company Name': company_names,
        'Job Location': job_locations,
        'Estimated Salary': est_salary,
        'Job details': job_details,
        'Job Href': job_hrefs  # Include the href links
    }

    # Check if all lists have the same length
    lengths = set(len(lst) for lst in data.values())
    if len(lengths) == 1:
        # All lists have the same length, create the DataFrame
        df = pd.DataFrame(data)
    else:
        # Lists have different lengths, handle the error
        print("Error: Lists have different lengths")
        df = None  # Set df to None or handle the error as needed

    # Save the data to a CSV file if the DataFrame is not None
    if df is not None:
        file_name = f'{job_title}_{location}_{time.strftime("%Y-%m-%d")}.csv'
        df.to_csv(file_name, index=False)

    # Return the DataFrame (or None)
    return df
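
When the scrape is interrupted partway through a listing, the lists can end up with different lengths and the function returns `None`, discarding everything collected so far. One possible alternative, sketched here under the assumption that only the most recent record is incomplete, is to pad the shorter lists before building the DataFrame. `pad_columns` is a hypothetical helper, not part of the notebook.

```python
def pad_columns(columns, fill_value=None):
    """Pad every list in `columns` to the length of the longest one.

    Padding is appended at the end, so it is only appropriate when the
    missing values belong to the last record(s) collected.
    """
    longest = max((len(values) for values in columns.values()), default=0)
    return {
        name: list(values) + [fill_value] * (longest - len(values))
        for name, values in columns.items()
    }


# Illustrative data: the second job's details were never collected.
data = {
    "Job Title": ["Data Engineer", "Data Analyst"],
    "Job details": ["Build pipelines"],
}
padded = pad_columns(data)
print(padded["Job details"])  # ['Build pipelines', None]
```

The padded dictionary can then be passed to `pd.DataFrame(...)` as before. If the gap could be anywhere in the lists rather than at the end, padding would misalign rows, and collecting one dictionary per job would be the safer design.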


## Scraping Data for Different Job Roles

Here we scrape job listings for several roles, including Data Engineer, Software Engineer, and Data Scientist. The results of each successful search are saved to a separate CSV file.

For example:
- **Data Engineer** jobs in California.
- **Software Engineer** jobs in California.
- **Data Scientist** jobs in California.
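
The repeated calls can also be driven by a loop. The sketch below assumes `scrape_jobs` returns a DataFrame on success and `None` on failure, as defined above. `scrape_roles` and its `scraper` parameter are hypothetical names introduced for illustration. Passing the scraper in as an argument keeps the loop testable without a browser.

```python
def scrape_roles(roles, location, scraper):
    """Call `scraper(role, location)` for each role and separate successes from failures."""
    results, failed = {}, []
    for role in roles:
        df = scraper(role, location)
        if df is None:
            failed.append(role)  # remember which searches need a retry
        else:
            results[role] = df
    return results, failed


# Stand-in scraper for demonstration only; the notebook would pass scrape_jobs.
def fake_scraper(role, location):
    return None if role == "Business Systems Analyst" else f"{role} in {location}"


results, failed = scrape_roles(
    ["Data Engineer", "Business Systems Analyst"], "California", fake_scraper
)
print(sorted(results))  # ['Data Engineer']
print(failed)           # ['Business Systems Analyst']
```

With the real scraper, the call would look like `scrape_roles(["Data Engineer", "Software Engineer", "Data Scientist"], "California", scrape_jobs)`.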

In [2]:
data_engineer = scrape_jobs("Data Engineer", "California")
data_engineer

Unnamed: 0,Job Title,Company Name,Job Location,Estimated Salary,Job details,Job Href
0,Senior Network Engineer - RRMS Data Center,Rapid Response Monitoring,"Corona, CA",Estimated: $150K - $190K a year,Location\nThis position is 100% in-office and ...,https://www.simplyhired.com/job/CXnvktX5Idn-qe...
1,Data Engineer,RECRUIT ROOTS GLOBAL SERVICES PRIVATE LIMITED,"Glendale, CA",Estimated: $151K - $191K a year,Job Title: Sr Data Engineer\nLocation: Glendal...,https://www.simplyhired.com/job/L8wcbSMF2gGGbM...
2,"Data Engineer, Real Estate and Workplace",OpenAI,"San Francisco, CA",Estimated: $88.7K - $112K a year,"About the Team\nOpenAI has a beautiful, custom...",https://www.simplyhired.com/job/gSntIEgdlm-bZW...
3,Principal Data Modeling Engineer,Oracle,"Pleasanton, CA",Estimated: $173K - $218K a year,This position is onsite/hybrid in our Pleasant...,https://www.simplyhired.com/job/5yV0h_UhLlnrYk...
4,"Lead Data Engineer, Data and Systems",Pendulum™,"San Francisco, CA",Estimated: $118K - $149K a year,About Pendulum\nPendulum® is leading a revolut...,https://www.simplyhired.com/job/2SJyt_v7FSdSA8...
...,...,...,...,...,...,...
251,Application Engineer II - Data Management (Min...,Ansys,"San Diego, CA",Estimated: $116K - $147K a year,Requisition #: 14831\n\nOur Mission: Powering ...,https://www.simplyhired.com/job/KDvCkdU6RSLBMr...
252,Senior Network Engineer - RRMS Data Center,Rapid Response Monitoring,"Corona, CA",Estimated: $113K - $143K a year,Location\nThis position is 100% in-office and ...,https://www.simplyhired.com/job/CXnvktX5Idn-qe...
253,Data Platform Engineer,Joyent,"Mountain View, CA",Estimated: $150K - $190K a year,"Mountain View, CA, Hybrid (3 times per week)\n...",https://www.simplyhired.com/job/47Ex9cyS9zJSmA...
254,Cellular 4G/5G Firmware Data Science & Machine...,Apple,"Sunnyvale, CA",Estimated: $114K - $144K a year,"Summary\n\nPosted: Jul 25, 2024\n\nRole Number...",https://www.simplyhired.com/job/gMY9aLS1SsD1bC...


In [3]:
Software_engineer = scrape_jobs("Software Engineer", "California")
Software_engineer

Unnamed: 0,Job Title,Company Name,Job Location,Estimated Salary,Job details,Job Href
0,Landing Page Developer/Designer,Confidential,"Los Angeles, CA",Estimated: $137K - $174K a year,Description:\nThis is an in house position wit...,https://www.simplyhired.com/job/NUmNZU3i-XDtMe...
1,WordPress Developer (Front-end design & robust...,Rhodes Wolfe,"Palm Springs, CA",Estimated: $103K - $130K a year,WordPress Developer (Web Design & Website Mana...,https://www.simplyhired.com/job/D2a3iro1uKJEJg...
2,Sr. Principal Engineer Software - Simulation (...,Northrop Grumman,"Palmdale, CA",Estimated: $113K - $144K a year,Requisition ID: R10160394\nCategory: Engineeri...,https://www.simplyhired.com/job/adtBmo2BSL-nII...
3,AS400 EDI & RPG Specialist,Amphastar Pharmaceuticals Inc.,"Rancho Cucamonga, CA",Estimated: $119K - $151K a year,A Pharmaceutical Manufacturing company is look...,https://www.simplyhired.com/job/mX2hi4lFGqhiNd...
4,Sr. Principal Software Engineer - Database,Northrop Grumman,"Camarillo, CA",Estimated: $170K - $215K a year,Requisition ID: R10167691\nCategory: Engineeri...,https://www.simplyhired.com/job/cQSQ4F8Kx55j5v...
...,...,...,...,...,...,...
862,Controls / Automation Engineer,Actalent,"San Diego, CA",Estimated: $119K - $151K a year,Title: Automation / Controls Engineer\nPositio...,https://www.simplyhired.com/job/m5Cu7OhrOQo4F2...
863,Visualization Engineer (Unity),Dynamoid,"Oakland, CA",Estimated: $168K - $213K a year,"We are a small, growing team and value those w...",https://www.simplyhired.com/job/3l4J36FPhNXjuH...
864,Senior Software Engineer,"Software Resources, Inc.","Glendale, CA",Estimated: $95.5K - $121K a year,Software Resources has an immediate job opport...,https://www.simplyhired.com/job/WtDmSvYZ7VXE4U...
865,"Principal Software Engineer, ADAS Compute Plat...",General Motors,"Mountain View, CA",Estimated: $138K - $174K a year,Job Description\nIf you are a Principal S oftw...,https://www.simplyhired.com/job/dhRW87m04ODLaV...


In [4]:
Data_Scientist = scrape_jobs("Data Scientist", "California")
Data_Scientist

Unnamed: 0,Job Title,Company Name,Job Location,Estimated Salary,Job details,Job Href
0,Sr. Data Analyst (SQL),MOBIS PARTS AMERICA LLC,"Fountain Valley, CA",Estimated: $159K - $201K a year,****Interested applicants must reside in South...,https://www.simplyhired.com/job/WXIOxG-Z-lU17v...
1,Medicine - Artificial Intelligence Faculty,CEDARS-SINAI,"Los Angeles, CA",Estimated: $99.3K - $126K a year,The Division of Artificial Intelligence in Med...,https://www.simplyhired.com/job/_t1HoPxcMcxPUC...
2,"Sr. WW Specialist, GenAI, Model Training & Inf...","Amazon Web Services, Inc. - A97","Santa Clara, CA",Estimated: $152K - $192K a year,Are you a customer-obsessed builder with a pas...,https://www.simplyhired.com/job/AoiOKM4p3NAK7V...
3,"Senior Data Scientist, AI Foundations",Capital One,"San Francisco, CA",Estimated: $120K - $152K a year,"Center 2 (19050), United States of America, Mc...",https://www.simplyhired.com/job/MPrBjKlXrmVzLl...
4,"AI Models Engineer, Efficient Generative AI","Advanced Micro Devices, Inc","San Jose, CA",Estimated: $165K - $210K a year,Overview:\nWHAT YOU DO AT AMD CHANGES EVERYTHI...,https://www.simplyhired.com/job/93vfawT-eCyU9v...
...,...,...,...,...,...,...
174,Medicine - Artificial Intelligence Faculty,CEDARS-SINAI,"Los Angeles, CA",Estimated: $159K - $201K a year,The Division of Artificial Intelligence in Med...,https://www.simplyhired.com/job/_t1HoPxcMcxPUC...
175,Forensics - Discovery Data Scientist - Manager...,EY,"San Francisco, CA",Estimated: $142K - $180K a year,"At EY, you’ll have the chance to build a caree...",https://www.simplyhired.com/job/knH1yTbFl2eDWy...
176,Machine Learning Lead - Drug Discovery,Vertex Pharmaceuticals,"San Diego, CA",Estimated: $169K - $214K a year,Job Description\nVertex Pharmaceuticals is see...,https://www.simplyhired.com/job/tZ74-gbXNPt6rA...
177,Medicine - Artificial Intelligence Faculty,CEDARS-SINAI,"Los Angeles, CA",Estimated: $69.8K - $88.4K a year,The Division of Artificial Intelligence in Med...,https://www.simplyhired.com/job/_t1HoPxcMcxPUC...


In [7]:
Data_Analyst = scrape_jobs("Data Analyst", "California")
Data_Analyst

Unnamed: 0,Job Title,Company Name,Job Location,Estimated Salary,Job details,Job Href
0,Lab Analyst - C4I WISE,"F2 Systems, LLC","Camp Pendleton, CA",Estimated: $99.3K - $126K a year,ACTIVE DOD CLEARANCE REQUIRED\nPosition Title:...,https://www.simplyhired.com/job/wGUNJXVxNE1ZCn...
1,Healthcare Data Analyst II-Health Services Eva...,Inland Empire Health Plan,"Rancho Cucamonga, CA",Estimated: $80.9K - $102K a year,What You Can You Expect!\n\nFind joy in servin...,https://www.simplyhired.com/job/xLjvAZ5FKMLmEU...
2,Laboratory Data and Inspection Analyst (Associ...,California Department of Public Health (CDPH),"Los Angeles, CA",Estimated: $62.1K - $78.6K a year,The California Department of Public Health (CD...,https://www.simplyhired.com/job/3YhmQ1PfVoNANc...
3,IT Data Analyst,California Commercial Investment Group Inc,"Sacramento, CA",Estimated: $103K - $131K a year,"We make good investments in our people, proper...",https://www.simplyhired.com/job/8hk-ERsohqWOvA...
4,Board Certified Behavioral Analyst- San Diego ...,Redwood Family Care Network,"San Diego, CA",Estimated: $103K - $130K a year,Board Certified Behavioral Analyst- San Diego ...,https://www.simplyhired.com/job/HYrAqWgEFCFNqx...
...,...,...,...,...,...,...
160,Provider Compensation Analyst,Medica Talent Group,"Los Angeles, CA",Estimated: $103K - $131K a year,"Location: HYBRID (3 days onsite Commerce, CA +...",https://www.simplyhired.com/job/5dSSskqDxf3roS...
161,Business Analyst 4,"Ursus, Inc.","San Francisco, CA",Estimated: $73.4K - $93K a year,JOB TITLE: Business Analyst 4\nLOCATION: 100% ...,https://www.simplyhired.com/job/cCQ5-XUdEJtm7Q...
162,Lab Analyst - C4I WISE,"F2 Systems, LLC","Camp Pendleton, CA",Estimated: $99.3K - $126K a year,ACTIVE DOD CLEARANCE REQUIRED\nPosition Title:...,https://www.simplyhired.com/job/wGUNJXVxNE1ZCn...
163,Healthcare Data Analyst II-Health Services Eva...,Inland Empire Health Plan,"Rancho Cucamonga, CA",Estimated: $80.9K - $102K a year,What You Can You Expect!\n\nFind joy in servin...,https://www.simplyhired.com/job/xLjvAZ5FKMLmEU...


In [8]:
Business_Systems_Analyst = scrape_jobs("Business Systems Analyst", "California")
Business_Systems_Analyst

Error: Lists have different lengths


In [9]:
Software_Developer = scrape_jobs("Software_Developer", "California")
Software_Developer

Unnamed: 0,Job Title,Company Name,Job Location,Estimated Salary,Job details,Job Href
0,BIM Manager / Architecture Firm,Tegrastaff,"San Francisco, CA",Estimated: $115K - $146K a year,BIM MANAGER (SAN FRANCISCO)\nWe are looking fo...,https://www.simplyhired.com/job/wBmJcA0LKa1OMQ...
1,IT Specialist,Primus Auditing Ops,"Santa Maria, CA",Estimated: $141K - $178K a year,GENERAL DESCRIPTION\nSoftware developer: Creat...,https://www.simplyhired.com/job/4m9PvrpIq_oD_F...
2,Senior Software Developer – Compute Platform S...,General Motors,"Mountain View, CA",Estimated: $149K - $188K a year,Job Description\nThe Software Defined Vehicle ...,https://www.simplyhired.com/job/bzWI8Tjg_4q9Jp...
3,Jr Software Developer,"Alutiiq, LLC","San Diego, CA",Estimated: $150K - $189K a year,"Job Description:\n\nAlutiiq Career Ventures, L...",https://www.simplyhired.com/job/FOC_fVn2W9pd8Y...
4,Senior Software Developer in Test / Aerospace ...,Motion Recruitment,"El Segundo, CA",Estimated: $89.9K - $114K a year,Job Description A SaaS company that builds pla...,https://www.simplyhired.com/job/EvURDeb00H9Khp...
...,...,...,...,...,...,...
235,SMTS Software Development Eng.,"Advanced Micro Devices, Inc","Santa Clara, CA",Estimated: $89.9K - $114K a year,Overview:\nWHAT YOU DO AT AMD CHANGES EVERYTHI...,https://www.simplyhired.com/job/dsxAmCj8o8Gwsd...
236,Software Developer in Test - Embedded Sensors QE,Apple,"Cupertino, CA",Estimated: $166K - $210K a year,"Summary\n\nPosted: Sep 8, 2024\n\nWeekly Hours...",https://www.simplyhired.com/job/iuznYATd2x2pCo...
237,"Software Dev Engineer II, Last Mile Routing Pl...",Amazon.com Services LLC,"San Luis Obispo, CA",Estimated: $183K - $231K a year,3+ years of non-internship professional softwa...,https://www.simplyhired.com/job/z-hyo3rMavkhsV...
238,Software Engineer,"Harbor Truck Bodies, Inc.","Fontana, CA",Estimated: $151K - $192K a year,Position: Software Developer & Reporting Engin...,https://www.simplyhired.com/job/MOYjZllMfjAbWS...


# Map Column Names and Merge the CSV Files

## Merging CSV Files

After scraping multiple job roles, the data is stored in separate CSV files. In this section, we:
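- Load all CSV files from the `data/` folder (`scrape_jobs` writes its CSV files to the working directory, so they need to be moved into `data/` first).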
- Standardize the column names to maintain consistency across different datasets.
- Merge all CSV files into a single DataFrame.

**Column Mapping:**
We use a dictionary to map inconsistent column names (like `Job Title`, `Company`, `Location`) to standardized names (`Job_Title`, `Company_Name`, `Job_Location`, etc.).

## Standardize and Merge Data

In this step:
- All columns in the CSV files are renamed according to the `column_name_mapping`.
- The data is concatenated into a single DataFrame, ensuring that all records have a uniform structure.
- The resulting DataFrame contains columns like `Job_Title`, `Company_Name`, `Job_Location`, `Estimated_Salary`, `Job_Details`, and `Hyperlink`.
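
The renaming and concatenation steps can be tried on small in-memory frames before running them on the real files. This is a minimal sketch with made-up rows and only a subset of the mapping. `standardize_and_merge` is a hypothetical helper, and its check for missing columns is an addition that the notebook does not perform.

```python
import pandas as pd

column_name_mapping = {
    "Job Title": "Job_Title",
    "Company Name": "Company_Name",
    "Company": "Company_Name",
}
desired_columns = ["Job_Title", "Company_Name"]


def standardize_and_merge(frames, mapping, columns):
    """Rename columns, verify each frame has the expected ones, then stack the rows."""
    renamed = []
    for i, frame in enumerate(frames):
        frame = frame.rename(columns=mapping)  # returns a copy; unmapped names are kept
        missing = [c for c in columns if c not in frame.columns]
        if missing:
            raise ValueError(f"frame {i} is missing columns: {missing}")
        renamed.append(frame[columns])
    return pd.concat(renamed, ignore_index=True)


# Two illustrative frames whose headers differ (invented rows, not scraped data).
a = pd.DataFrame({"Job Title": ["Data Engineer"], "Company Name": ["Acme"]})
b = pd.DataFrame({"Job Title": ["Data Analyst"], "Company": ["Globex"]})

merged = standardize_and_merge([a, b], column_name_mapping, desired_columns)
print(merged.shape)  # (2, 2)
```

Raising early with the name of the missing column tends to be easier to debug than the `KeyError` that column selection would raise after the frames have already been concatenated.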

In [12]:
import os
# Define the path to the folder containing CSV files
folder_path = 'data/'

# Get a list of all CSV files in the folder
file_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.csv')]

# Define a dictionary to map old column names to new column names
column_name_mapping = {
    'Job Title': 'Job_Title',
    'Company Name': 'Company_Name',
    'Company' : 'Company_Name',
    'Location': 'Job_Location',
    'Job Location': 'Job_Location',
    'Salary': 'Estimated_Salary',
    'Estimated Salary': 'Estimated_Salary',
    'Href': 'Hyperlink',
    'Job Href': 'Hyperlink',
    'Job Description': 'Job_Details',
    'Job details': 'Job_Details'
}

# Read all CSV files into DataFrames and rename columns
dataframes = []

for file_path in file_paths:
    df = pd.read_csv(file_path)
    df.rename(columns=column_name_mapping, inplace=True)
    dataframes.append(df)

# Concatenate the DataFrames row-wise
merged_df = pd.concat(dataframes, ignore_index=True)

# Define the desired column order
desired_columns = [
    'Job_Title',
    'Company_Name',
    'Job_Location',
    'Estimated_Salary',
    'Job_Details',
    'Hyperlink',
]

# Reorder columns to match the desired order
merged_df = merged_df[desired_columns]

# Save the merged DataFrame to a CSV file
merged_df.to_csv('merged_file.csv', index=False)


## Final Merged DataFrame

The merged dataset contains all the scraped job listings across the different job roles. Below, we display a preview of its first and last rows (1,466 rows in total).

The previous cell already saved this DataFrame to `merged_file.csv` for further analysis.

In [13]:
merged_df

Unnamed: 0,Job_Title,Company_Name,Job_Location,Estimated_Salary,Job_Details,Hyperlink
0,Lab Analyst - C4I WISE,"F2 Systems, LLC","Camp Pendleton, CA",Estimated: $99.3K - $126K a year,ACTIVE DOD CLEARANCE REQUIRED\nPosition Title:...,https://www.simplyhired.com/job/wGUNJXVxNE1ZCn...
1,Healthcare Data Analyst II-Health Services Eva...,Inland Empire Health Plan,"Rancho Cucamonga, CA",Estimated: $80.9K - $102K a year,What You Can You Expect!\n\nFind joy in servin...,https://www.simplyhired.com/job/xLjvAZ5FKMLmEU...
2,Laboratory Data and Inspection Analyst (Associ...,California Department of Public Health (CDPH),"Los Angeles, CA",Estimated: $62.1K - $78.6K a year,The California Department of Public Health (CD...,https://www.simplyhired.com/job/3YhmQ1PfVoNANc...
3,IT Data Analyst,California Commercial Investment Group Inc,"Sacramento, CA",Estimated: $103K - $131K a year,"We make good investments in our people, proper...",https://www.simplyhired.com/job/8hk-ERsohqWOvA...
4,Board Certified Behavioral Analyst- San Diego ...,Redwood Family Care Network,"San Diego, CA",Estimated: $103K - $130K a year,Board Certified Behavioral Analyst- San Diego ...,https://www.simplyhired.com/job/HYrAqWgEFCFNqx...
...,...,...,...,...,...,...
1462,Controls / Automation Engineer,Actalent,"San Diego, CA",Estimated: $119K - $151K a year,Title: Automation / Controls Engineer\nPositio...,https://www.simplyhired.com/job/m5Cu7OhrOQo4F2...
1463,Visualization Engineer (Unity),Dynamoid,"Oakland, CA",Estimated: $168K - $213K a year,"We are a small, growing team and value those w...",https://www.simplyhired.com/job/3l4J36FPhNXjuH...
1464,Senior Software Engineer,"Software Resources, Inc.","Glendale, CA",Estimated: $95.5K - $121K a year,Software Resources has an immediate job opport...,https://www.simplyhired.com/job/WtDmSvYZ7VXE4U...
1465,"Principal Software Engineer, ADAS Compute Plat...",General Motors,"Mountain View, CA",Estimated: $138K - $174K a year,Job Description\nIf you are a Principal S oftw...,https://www.simplyhired.com/job/dhRW87m04ODLaV...


In [14]:
merged_df['Job_Title'][0]

'Lab Analyst - C4I WISE'

In [15]:
merged_df.Job_Details.isna().sum()

0
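
A zero from `isna()` is worth reading with one caveat in mind: `scrape_jobs` writes the string `"N/A"` when a description is missing, and `pd.read_csv` treats `"N/A"` as a missing value by default, so such rows would show up as NaN once the merge step reloads the files. The sketch below reproduces that round trip on an invented two-row CSV and then counts missing values per column. The data is illustrative only.

```python
from io import StringIO

import pandas as pd

# Invented sample in the scraper's output format; the second row uses the "N/A" placeholder.
csv_text = (
    "Job Title,Job details\n"
    "Data Engineer,Build pipelines\n"
    "Data Analyst,N/A\n"
)

df = pd.read_csv(StringIO(csv_text))

# read_csv converts "N/A" to NaN by default, so isna() counts the placeholder as missing.
missing_per_column = df.isna().sum()
print(missing_per_column["Job details"])  # 1

# Passing keep_default_na=False keeps the literal string instead.
raw = pd.read_csv(StringIO(csv_text), keep_default_na=False)
print(raw["Job details"].tolist())  # ['Build pipelines', 'N/A']
```

Under that default, the zero reported above suggests that every merged listing came with a description.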