Credit for Selenium: https://www.kaggle.com/code/cristaliss/selenium-on-kaggle-easy-tutorial

# 🎯 Selenium on Kaggle: A Comprehensive Tutorial

This notebook aims to provide a detailed guide on using Selenium effectively within Kaggle environments. Selenium offers powerful web automation capabilities, but specific configurations are often required for successful execution on Kaggle. This tutorial addresses these challenges and presents a step-by-step approach to using Selenium for web scraping and interaction within Kaggle notebooks.


### Introduction
**Selenium** is a powerful web automation tool that allows you to interact with web pages programmatically. It is a popular choice for web scraping, automating repetitive tasks, and testing web applications.

### Why Selenium?

There are several advantages of using Selenium over other web scraping libraries:

* **Wide browser support**: Selenium supports a wide range of browsers, including Chrome, Firefox, and Safari.
* **Easy to use**: Selenium provides a simple and intuitive API for interacting with web pages.
* **Powerful**: Selenium can be used to automate complex tasks, such as filling out forms, clicking buttons, and scrolling through pages.
* **Extensible**: Selenium can be extended with custom code to meet your specific needs.

# 1. Setting Up the Environment

## 1.1. Install dependencies:
This command updates the system's package list and installs various libraries required for running Chrome and Selenium.

In [1]:
!apt-get update -y
!apt-get install -y \
libglib2.0-0 \
libnss3 \
libdbus-glib-1-2 \
libgconf-2-4 \
libfontconfig1 \
libvulkan1 \
gconf2-common \
libwayland-server0 \
libgbm1 \
udev \
libu2f-udev 
!apt --fix-broken install -y  

Get:1 https://packages.cloud.google.com/apt cloud-sdk InRelease [6361 B]
Get:2 http://packages.cloud.google.com/apt gcsfuse-focal InRelease [1225 B]
Get:3 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease
Get:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:6 https://packages.cloud.google.com/apt cloud-sdk/main amd64 Packages [629 kB]
Hit:7 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Get:8 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [29.8 kB]
Get:9 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [1200 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [3686 kB]
Get:11 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [3594 kB]
Get:12 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [3536 kB]
Get:13 http://archive.ubuntu.com/ubuntu fo

## 1.2. Download and extract Chrome:

To use Selenium, you will need to download and install Chrome and Chromedriver.

* **Chrome**: Chrome is a popular web browser that is known for its speed and security.
* **Chromedriver**: Chromedriver is a tool that allows Selenium to interact with Chrome.

Downloads the latest stable version of Chrome for Linux and extracts it to the /usr/bin directory.

In [2]:
!wget -P /tmp https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chrome-linux64.zip
!unzip /tmp/chrome-linux64.zip -d /usr/bin/

--2024-04-24 22:38:13--  https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chrome-linux64.zip
Resolving edgedl.me.gvt1.com (edgedl.me.gvt1.com)... 34.104.35.123, 2600:1900:4110:86f::
Connecting to edgedl.me.gvt1.com (edgedl.me.gvt1.com)|34.104.35.123|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145898081 (139M) [application/octet-stream]
Saving to: '/tmp/chrome-linux64.zip'


2024-04-24 22:38:16 (54.6 MB/s) - '/tmp/chrome-linux64.zip' saved [145898081/145898081]

Archive:  /tmp/chrome-linux64.zip
  inflating: /usr/bin/chrome-linux64/MEIPreload/manifest.json  
  inflating: /usr/bin/chrome-linux64/MEIPreload/preloaded_data.pb  
  inflating: /usr/bin/chrome-linux64/chrome  
  inflating: /usr/bin/chrome-linux64/chrome-wrapper  
  inflating: /usr/bin/chrome-linux64/chrome_100_percent.pak  
  inflating: /usr/bin/chrome-linux64/chrome_200_percent.pak  
  inflating: /usr/bin/chrome-linux64/chrome_crashpad_handler  


## 1.3. Download and extract Chromedriver:

As it was done in the previous code.

In [3]:
!wget -P /tmp https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chromedriver-linux64.zip
!unzip /tmp/chromedriver-linux64.zip -d /usr/bin/

--2024-04-24 22:38:22--  https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chromedriver-linux64.zip
Resolving edgedl.me.gvt1.com (edgedl.me.gvt1.com)... 34.104.35.123, 2600:1900:4110:86f::
Connecting to edgedl.me.gvt1.com (edgedl.me.gvt1.com)|34.104.35.123|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7271942 (6.9M) [application/octet-stream]
Saving to: '/tmp/chromedriver-linux64.zip'


2024-04-24 22:38:23 (75.5 MB/s) - '/tmp/chromedriver-linux64.zip' saved [7271942/7271942]

Archive:  /tmp/chromedriver-linux64.zip
  inflating: /usr/bin/chromedriver-linux64/LICENSE.chromedriver  
  inflating: /usr/bin/chromedriver-linux64/chromedriver  


## 1.4. Install Python libraries

In [4]:
!apt install -y python3-selenium
!pip install selenium==3.141.0




The following additional packages will be installed:
  apparmor chromium-browser chromium-chromedriver liblzo2-2 snapd
  squashfs-tools
Suggested packages:
  apparmor-profiles-extra apparmor-utils zenity | kdialog
The following NEW packages will be installed:
  apparmor chromium-browser chromium-chromedriver liblzo2-2 python3-selenium
  snapd squashfs-tools
0 upgraded, 7 newly installed, 0 to remove and 112 not upgraded.
Need to get 25.2 MB of archives.
After this operation, 104 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 apparmor amd64 2.13.3-7ubuntu5.3 [502 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 liblzo2-2 amd64 2.10-2 [50.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 squashfs-tools amd64 1:4.4-1ubuntu0.3 [117 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 snapd amd64 2.61.3+20.04 [24.4 MB]
Get:5 http://archive.ubuntu.com/ubuntu focal-upd

# 2.Importing Libraries

You will also need to install the following Python libraries:

* **selenium**: The Selenium library provides the API for interacting with web pages.
* **webdriver**: The webdriver library provides a way to interact with web drivers, such as Chromedriver.
* **BeautifulSoup**: The BeautifulSoup library is used for parsing HTML content.

In [5]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

In [6]:
from retrying import retry
import time
import traceback

# 3. Configuring Chrome Driver

These functions define the locations of Chrome and Chromedriver executables. Additionally, initialize_driver creates a Chrome webdriver instance with specific options:

* *--headless*: Runs Chrome in headless mode, making it invisible.
* *--no-sandbox*: Disables the sandbox for improved performance.
* *--start-fullscreen*: Starts Chrome in fullscreen mode.
* *--allow-insecure-localhost*: Allows access to insecure local websites (if needed).
* *--disable-dev-shm-usage*: Disables shared memory usage for Chrome.
* *user-agent*: Sets the user agent string to mimic a regular browser.

In [7]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

CHROME_BINARY_LOCATION = "/usr/bin/chrome-linux64/chrome"
CHROMEDRIVER_BINARY_LOCATION = "/usr/bin/chromedriver-linux64/chromedriver"

def add_driver_options(options):
    """
    Add configurable options
    """
    chrome_options = Options()
    for opt in options:
        chrome_options.add_argument(opt)
    return chrome_options

def initialize_driver():
    """
    Initialize the web driver
    """
    driver_config = {
        "options": [
            "--headless",
            "--no-sandbox",
            "--start-fullscreen",
            "--allow-insecure-localhost",
            "--disable-dev-shm-usage",
            "user-agent=Chrome/116.0.5845.96"
        ],
    }
    options = add_driver_options(driver_config["options"])
    options.binary_location = CHROME_BINARY_LOCATION
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_BINARY_LOCATION,
        options=options)
    return driver


# 4. Using Selenium: Example

Here's a breakdown of how I'm using Selenium to extract book information from Goodreads, with a practical example:


If you want a full version of this dataset, you can check and vote it in https://www.kaggle.com/datasets/cristaliss/ultimate-book-collection-top-100-books-up-to-2023

In [8]:
def extract_series_info(series_string):
    # Split the series string based on ', #' or ' #'
    if ', #' in series_string:
        series_list = series_string.split(', #')
    elif ' #' in series_string:
        series_list = series_string.split(' #')
    else:
        # If no separator is found, assume the whole string is the series name
        return series_string, ''

    # Extract the series name and release number
    series_name = series_list[0]
    release_number = series_list[1]

    return series_name, release_number

def extract_series_and_release(title_name):
    series_info_temp = title_name.split('(')
    if len(series_info_temp) > 1:
        release_info = series_info_temp[-1].replace(')', '')
        if len(release_info.split(';')) > 1:
            series_temp = []
            release_temp = []
            for b in release_info.split(";"):
                series_list, release_list = extract_series_info(b)
                series_temp.append(series_list)
                release_temp.append(release_list)
            series_list = ','.join(series_temp)
            release_list = ','.join(release_temp)
        else:
            series_list, release_list = extract_series_info(release_info)

        title_name = title_name.replace(f' ({release_info})', '')
    else:
        series_list = ''
        release_list = ''

    return series_list, release_list


def extract_books_info(driver, url):
    """
    Extracts book information from a Goodreads URL using Selenium.

    Args:
        driver (selenium.webdriver.chrome.webdriver.WebDriver): The initialized Chrome driver.
        url (str): The URL of the Goodreads page containing book information.

    Returns:
        pd.DataFrame: A DataFrame containing the extracted book data.

    Raises:
        Exception: If an error occurs during the scraping process.
    """
    try:
        driver.get(url)

        # Wait for the page to load completely
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "bookTitle"))
        )

        # Extract book elements using Selenium
        title_elements = driver.find_elements(By.CLASS_NAME, "bookTitle")
        author_elements = driver.find_elements(By.CLASS_NAME, "authorName")

        book_data = {
            'book_titles': [],
            'series_info': [],
            'release_numbers': [],
            'authors_raw': []
        }

        for title in tqdm(title_elements, total=len(title_elements), desc='Processing Books'):
            book_url = "https://www.goodreads.com" + title.get_attribute('href')

            # Extract title using Selenium
            title_span = title.find_element(By.TAG_NAME, 'span')
            title_name = title_span.get_attribute('innerHTML').strip()

            # Extract series and release information using Selenium
            series_list, release_list = extract_series_and_release(title_name)
            book_data['series_info'].append(series_list)
            book_data['release_numbers'].append(release_list)
            book_data['book_titles'].append(title_name)

            # Extract authors using Selenium
            authors_raw = []
            for author in author_elements:
                author_element = author.find_element(By.TAG_NAME, 'span')
                if author_element:
                    authors_raw.append(author_element.get_attribute('innerHTML').strip())
                else:
                    authors_raw.append('')
            book_data['authors_raw'] = authors_raw

        df_book = pd.DataFrame(book_data)

        return df_book

    except Exception as e:
        print(f"An error occurred during scraping: {e}")
        raise

# Example usage
driver = initialize_driver()
url = "https://www.goodreads.com/list/best_of_year/2023"
books = extract_books_info(driver, url)

Processing Books: 100%|██████████| 100/100 [04:00<00:00,  2.40s/it]


In [9]:
books

Unnamed: 0,book_titles,series_info,release_numbers,authors_raw
0,"Fourth Wing (The Empyrean, #1)",The Empyrean,1,Rebecca Yarros
1,Happy Place,,,Emily Henry
2,Yellowface,,,R.F. Kuang
3,"Divine Rivals (Letters of Enchantment, #1)",Letters of Enchantment,1,Rebecca Ross
4,"Love, Theoretically",,,Ali Hazelwood
...,...,...,...,...
95,"A Fire in the Flesh (Flesh and Fire, #3)",Flesh and Fire,3,Jennifer L. Armentrout
96,Stone Cold Fox,,,Rachel Koller Croft
97,The Heaven &amp; Earth Grocery Store,,,James McBride
98,"Finlay Donovan Jumps the Gun (Finlay Donovan, #3)",Finlay Donovan,3,Elle Cosimano


# Farewell and a Call to Action:

**Thank you** for following along with this **Selenium** tutorial! I hope you found it informative and helpful.

If you enjoyed this tutorial:

* ❤️ **Vote it up!** Your feedback helps me improve my content and create more valuable resources for the community.
* 💬 **Leave a comment below**: Share your thoughts, questions, or suggestions. I'd love to hear from you!
* 🔔 **Check out my other notebooks**: Explore my collection of tutorials on various topics, including web scraping, data analysis, and machine learning.

Keep learning and growing your skills! Selenium is a powerful tool that can unlock a wide range of possibilities in web scraping. With continued practice and exploration, you can master Selenium and use it to achieve your goals.

**Farewell for now, and happy coding!**

_______

In [10]:
import json
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [11]:
# Using this Selenium instance for glassdoor


# Replace the website link of Glassdoor job search if needed
#website = "https://www.glassdoor.com/Job/jobs.htm?sc.occupationParam=%22data+analyst%22"
website = "https://www.glassdoor.com/Job/palo-alto-ca-us-product-analyst-jobs-SRCH_IL.0,15_IC1147434_KO16,31.htm"

driver.get(website)

In [12]:
def close_login_popup(driver):
    """
    Closes the login popup window if it is present.

    Input:
    - driver: Selenium WebDriver object.
    """
    try:
        # Check for login popup, if present then click CloseButton
        close_login_popup = driver.find_element(By.CLASS_NAME,"CloseButton")
        close_login_popup.click()
        close_login_popup.click()
    except:
        time.sleep(0)

In [13]:
def click_show_more(driver, num_iteration):
    """
    Clicks the 'Show more jobs' button a specified number of times.

    Inputs:
    - driver: Selenium WebDriver object.
    - num_iteration (int): Number of times to click the 'Show more jobs' button.
    """
    for i in tqdm(range(num_iteration), desc ="Progress"):

        close_login_popup(driver)

        try:
            time.sleep(3)
            all_matches_button = driver.find_element(By.CSS_SELECTOR, '[data-test="load-more"]')
        except:
            time.sleep(0)
            print("Show more jobs button not found")

        close_login_popup(driver)
        all_matches_button.click()

In [14]:
def glassdoor_web_scraping(driver):
    """
    Scrapes job details from Glassdoor website using the provided Selenium driver.

    Input:
    - driver: Selenium WebDriver object.

    Output:
    - job_dataset: DataFrame containing scraped job details.
    """
    job_records = []

    close_login_popup(driver)
    all_jobs_list = driver.find_elements(By.XPATH, "//li[@class='JobsList_jobListItem__wjTHv']")

    for job in tqdm(all_jobs_list, desc="Progress"):

        job_record = {}

        close_login_popup(driver)
        job.click()
        time.sleep(4)

        try:
            show_more_button = driver.find_element(By.CLASS_NAME, "JobDetails_showMore___Le6L")
            close_login_popup(driver)
            show_more_button.click()
        except:
            time.sleep(0)

        job_details_tab = driver.find_element(By.CLASS_NAME, "JobDetails_jobDetailsContainer__y9P3L")

        job_record["job_id"] = ""
        job_record["company"] = ""
        job_record["job_title"] = ""
        job_record["company_rating"] = ""
        job_record["job_description"] = ""
        job_record["location"] = ""
        job_record["salary_avg_estimate"] = ""
        job_record["salary_estimate_payperiod"] = ""
        job_record["company_size"] = ""
        job_record["company_founded"] = ""
        job_record["employment_type"] = ""
        job_record["industry"] = ""
        job_record["sector"] = ""
        job_record["revenue"] = ""
        job_record["career_opportunities_rating"] = ""
        job_record["comp_and_benefits_rating"] = ""
        job_record["culture_and_values_rating"] = ""
        job_record["senior_management_rating"] = ""
        job_record["work_life_balance_rating"] = ""

        try:
            job_element = job_details_tab.find_element(By.CLASS_NAME, "JobDetails_jobTitle__Xvsha")
            job_id = job_element.get_attribute("id")
            job_record["job_id"] = ''.join(filter(str.isdigit, job_id))
        except:
            pass   

        try:
            company = job_details_tab.find_element(By.CLASS_NAME, "EmployerProfile_employerName__qujuA")
            job_record["company"] = company.text
        except:
            pass

        try:
            job_title = job_details_tab.find_element(By.CLASS_NAME, "JobDetails_jobTitle__Xvsha")
            job_record["job_title"] = job_title.text
        except:
            pass

        try:
            company_rating = job_details_tab.find_element(By.CLASS_NAME, "EmployerProfile_ratingContainer__ul0Ef")
            job_record["company_rating"] = company_rating.text
        except:
            pass

        try:
            job_description = job_details_tab.find_element(By.CLASS_NAME, "JobDetails_jobDescription__uW_fK")
            job_record["job_description"] = job_description.text
        except:
            pass

        try:
            location = job_details_tab.find_element(By.CLASS_NAME, "JobDetails_location__mSg5h")
            job_record["location"] = location.text
        except:
            pass

        try:
            salary_avg_estimate = job_details_tab.find_element(By.CLASS_NAME, "SalaryEstimate_averageEstimate__xIgkL")
            job_record["salary_avg_estimate"] = salary_avg_estimate.text
        except:
            pass

        try:
            salary_estimate_payperiod = job_details_tab.find_element(By.CLASS_NAME, "SalaryEstimate_payPeriod__RsvG_")
            job_record["salary_estimate_payperiod"] = salary_estimate_payperiod.text
        except:
            pass

        try:
            company_overview_values = job_details_tab.find_elements(By.CLASS_NAME, "JobDetails_overviewItemValue__xn8EF")
            if len(company_overview_values) == 6:
                job_record.update({
                    "company_size": company_overview_values[0].text,
                    "company_founded": company_overview_values[1].text,
                    "employment_type": company_overview_values[2].text,
                    "industry": company_overview_values[3].text,
                    "sector": company_overview_values[4].text,
                    "revenue": company_overview_values[5].text
                })
        except:
            pass

        try:
            company_ratings = job_details_tab.find_elements(By.CLASS_NAME, "JobDetails_ratingScore___xSXK")
            if len(company_ratings) == 5:
                job_record.update({
                    "career_opportunities_rating": company_ratings[0].text,
                    "comp_and_benefits_rating": company_ratings[1].text,
                    "culture_and_values_rating": company_ratings[2].text,
                    "senior_management_rating": company_ratings[3].text,
                    "work_life_balance_rating": company_ratings[4].text
                })
        except:
            pass

        job_records.append(job_record)

    job_dataset = pd.concat([pd.DataFrame([record]) for record in job_records], ignore_index=True)
    return job_dataset

In [15]:
# Sometimes no click more button so may need to comment this out
# click_show_more(driver, 40)

In [16]:
job_dataset = glassdoor_web_scraping(driver)

Progress: 100%|██████████| 30/30 [02:24<00:00,  4.82s/it]


In [17]:
df = job_dataset.copy()
df.head()

Unnamed: 0,job_id,company,job_title,company_rating,job_description,location,salary_avg_estimate,salary_estimate_payperiod,company_size,company_founded,employment_type,industry,sector,revenue,career_opportunities_rating,comp_and_benefits_rating,culture_and_values_rating,senior_management_rating,work_life_balance_rating
0,,,,4.1,Company Description\n\nVisa is a world leader ...,"Foster City, CA",,/yr (Employer est.),10000+ Employees,1958,Company - Public,Information Technology Support Services,Information Technology,$10+ billion (USD),3.8,4.0,4.1,3.6,4.1
1,,,,3.9,Position Summary\nAs a Competitive Intelligenc...,"Foster City, CA",,/yr (Employer est.),501 to 1000 Employees,2010,Company - Private,Enterprise Software & Network Solutions,Information Technology,Unknown / Non-Applicable,3.8,3.6,4.0,3.6,4.1
2,,,,4.2,"Summary\n\nPosted: Feb 14, 2024\n\nRole Number...","Sunnyvale, CA",,/yr (Employer est.),10000+ Employees,1976,Company - Public,Computer Hardware Development,Information Technology,$10+ billion (USD),3.7,4.2,4.1,3.6,3.6
3,,,,3.1,Leading the future in luxury electric and mobi...,"Newark, CA",,/yr (Employer est.),5001 to 10000 Employees,2007,Company - Public,Transportation Equipment Manufacturing,Manufacturing,Unknown / Non-Applicable,3.3,3.3,2.9,2.6,2.5
4,,,,3.3,Responsibilities\nTikTok is the leading destin...,"San Jose, CA",,/yr (Employer est.),10000+ Employees,2016,Company - Private,Internet & Web Services,Information Technology,Unknown / Non-Applicable,3.1,3.5,3.0,2.8,2.8


In [18]:
len(df)

30

In [19]:
df['job_description'][0]

'Company Description\n\nVisa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive while driven by a common purpose – to uplift everyone, everywhere by being the best way to pay and be paid.\nMake an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.\n\nJob Description\n\nVisa Digital Products team is on the forefront of Visa’s innovation responsible for building digital platforms such as Visa Token Services and Visa Click to Pay (a.k.a., Secure Remote Commerce/SRC). This team is also responsible for setting standards in the payment industry level (EMVCo and W3C) for digital commerce. Platform prod

# Let's try and clean up the data upstream

In [20]:
def glassdoor_web_scraping_v2(driver):
    """
    Scrapes job details from Glassdoor website using the provided Selenium driver.

    Input:
    - driver: Selenium WebDriver object.

    Output:
    - job_dataset: DataFrame containing scraped job details.
    """
    job_records = []

    close_login_popup(driver)
    all_jobs_list = driver.find_elements(By.XPATH, "//li[@class='JobsList_jobListItem__wjTHv']")

    for job in tqdm(all_jobs_list, desc="Progress"):

        job_record = {}

        close_login_popup(driver)
        job.click()
        time.sleep(4)

        try:
            show_more_button = driver.find_element(By.CLASS_NAME, "JobDetails_showMore___Le6L")
            close_login_popup(driver)
            show_more_button.click()
        except:
            time.sleep(0)

        job_details_tab = driver.find_element(By.CLASS_NAME, "JobDetails_jobDetailsContainer__y9P3L")

#         job_record["job_id"] = ""
        job_record["company"] = ""
        job_record["job_title"] = ""
        job_record["company_rating"] = ""
        job_record["job_description"] = ""
        job_record["location"] = ""
        job_record["salary_median_estimate"] = ""
        job_record["salary_estimate_payperiod"] = ""
        job_record["company_size"] = ""
        job_record["company_founded"] = ""
        job_record["employment_type"] = ""
        job_record["industry"] = ""
        job_record["sector"] = ""
        job_record["revenue"] = ""
        job_record["career_opportunities_rating"] = ""
        job_record["comp_and_benefits_rating"] = ""
        job_record["culture_and_values_rating"] = ""
        job_record["senior_management_rating"] = ""
        job_record["work_life_balance_rating"] = ""

#         try:
#             job_id = all_jobs_list.get_attribute('data-jobid')
#             job_record["job_id"] = job_id

#         except:
#             pass   

        try:    
            company = job_details_tab.find_element(By.CLASS_NAME, "heading_Heading__BqX5J heading_Subhead__Ip1aW") 
            job_record["company"] = company.text 
        except:
            pass

        try:    
            job_title = job_details_tab.find_element(By.CLASS_NAME, "heading_Heading__BqX5J.heading_Level1__soLZs") 
            job_record["job_title"] = job_title.text 
        except:
            pass

        try:
            company_rating = job_details_tab.find_element(By.CLASS_NAME, "EmployerProfile_ratingContainer__ul0Ef")
            job_record["company_rating"] = company_rating.text
        except:
            pass

        try:
            job_description = job_details_tab.find_element(By.CLASS_NAME, "JobDetails_jobDescription__uW_fK")
            job_record["job_description"] = job_description.text
        except:
            pass

        try:
            location = job_details_tab.find_element(By.CLASS_NAME, "JobDetails_location__mSg5h") 
            job_record["location"] = location.text
        except:
            pass

        try:
            salary_estimate = job_details_tab.find_element(By.CLASS_NAME, "SalaryEstimate_medianEstimate__fOYN1") 
            job_record["salary_median_estimate"] = salary_estimate.text
        except:
            pass

        try:
            salary_estimate_payperiod = job_details_tab.find_element(By.CLASS_NAME, "SalaryEstimate_payPeriod__RsvG_")
            job_record["salary_estimate_payperiod"] = salary_estimate_payperiod.text
        except:
            pass

        try:
            company_overview_values = job_details_tab.find_elements(By.CLASS_NAME, "JobDetails_overviewItemValue__xn8EF")
            if len(company_overview_values) == 6:
                job_record.update({
                    "company_size": company_overview_values[0].text,
                    "company_founded": company_overview_values[1].text,
                    "employment_type": company_overview_values[2].text,
                    "industry": company_overview_values[3].text,
                    "sector": company_overview_values[4].text,
                    "revenue": company_overview_values[5].text
                })
        except:
            pass

        try:
            company_ratings = job_details_tab.find_elements(By.CLASS_NAME, "JobDetails_ratingScore___xSXK")
            if len(company_ratings) == 5:
                job_record.update({
                    "career_opportunities_rating": company_ratings[0].text,
                    "comp_and_benefits_rating": company_ratings[1].text,
                    "culture_and_values_rating": company_ratings[2].text,
                    "senior_management_rating": company_ratings[3].text,
                    "work_life_balance_rating": company_ratings[4].text
                })
        except:
            pass

        job_records.append(job_record)

    job_dataset = pd.concat([pd.DataFrame([record]) for record in job_records], ignore_index=True)
    return job_dataset

In [21]:
df2 = glassdoor_web_scraping_v2(driver)
df2.head()

Progress: 100%|██████████| 30/30 [02:30<00:00,  5.02s/it]


Unnamed: 0,company,job_title,company_rating,job_description,location,salary_median_estimate,salary_estimate_payperiod,company_size,company_founded,employment_type,industry,sector,revenue,career_opportunities_rating,comp_and_benefits_rating,culture_and_values_rating,senior_management_rating,work_life_balance_rating
0,,Product Analyst,4.1,Company Description\n\nVisa is a world leader ...,"Foster City, CA",$137K,/yr (Employer est.),10000+ Employees,1958,Company - Public,Information Technology Support Services,Information Technology,$10+ billion (USD),3.8,4.0,4.1,3.6,4.1
1,,Competitive Intelligence Analyst - Product Str...,3.9,Position Summary\nAs a Competitive Intelligenc...,"Foster City, CA",$90K,/yr (Employer est.),501 to 1000 Employees,2010,Company - Private,Enterprise Software & Network Solutions,Information Technology,Unknown / Non-Applicable,3.8,3.6,4.0,3.6,4.1
2,,Financial Analyst (Product Cost),4.2,"Summary\n\nPosted: Feb 14, 2024\n\nRole Number...","Sunnyvale, CA",$132K,/yr (Employer est.),10000+ Employees,1976,Company - Public,Computer Hardware Development,Information Technology,$10+ billion (USD),3.7,4.2,4.1,3.6,3.6
3,,"Manager, Program and Product Cost Finance",3.1,Leading the future in luxury electric and mobi...,"Newark, CA",$163K,/yr (Employer est.),5001 to 10000 Employees,2007,Company - Public,Transportation Equipment Manufacturing,Manufacturing,Unknown / Non-Applicable,3.3,3.3,2.9,2.6,2.5
4,,Business Analyst - Global Monetization Product...,3.3,Responsibilities\nTikTok is the leading destin...,"San Jose, CA",$209K,/yr (Employer est.),10000+ Employees,2016,Company - Private,Internet & Web Services,Information Technology,Unknown / Non-Applicable,3.1,3.5,3.0,2.8,2.8


In [22]:
# create csv so that don't need to run this again
# df2.to_csv('LLM_df.csv')