# Job Posting from LinkedIn

This notebook provides a script for scraping LinkedIn job postings, including the job description, skill set, number of applicants, workplace type (Remote, On-site, Hybrid), job level, job type (Contract, Full-time, Part-time, Internship, etc.), the size and industry of the hiring company, and how long ago the job was posted.

This is the second step of the job market analysis described in the article: https://orlovtsu.github.io/job_postings_analysis.html.

If you use this code to scrape data from LinkedIn, be aware of the LinkedIn Terms of Use and make sure that you do not violate them.

## Import Libraries
1. Selenium is a tool for automating web browsers. The imported modules (webdriver, By, Keys) let you control the browser, locate elements by various criteria, and simulate keyboard actions.
2. BeautifulSoup parses HTML and XML documents and provides convenient methods for extracting data from web pages.
3. Pandas is a data manipulation and analysis library. It provides data structures and functions for handling structured data efficiently.
4. The time module provides functions for time-related operations, such as delays and timestamps.
5. The randint function from the random module generates the random delays used between page requests.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import randint

## Initialization 
The next chunk assigns the path to the ChromeDriver executable to the variable chromedriver_path. ChromeDriver is a separate executable that is required when using Selenium with Google Chrome. It acts as a bridge between Selenium WebDriver and the Chrome browser, allowing automated interaction with the browser.

In this case, chromedriver_path is set to './chromedriver.exe', indicating that the ChromeDriver executable is located in the current directory (denoted by '.') and that its filename is chromedriver.exe. This is the Windows name; on macOS and Linux the file is usually named chromedriver. The specific path may vary depending on the actual location of the ChromeDriver executable on your system.

Make sure to provide the correct path to the ChromeDriver executable so that Selenium works properly with Google Chrome.

For more details: https://chromedriver.chromium.org/downloads

In [8]:
chromedriver_path = './chromedriver.exe'
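
Because the executable's name differs between operating systems, the path can be derived instead of hard-coded. The following is a minimal sketch, assuming the driver is stored in the current working directory; the helper name chromedriver_filename is hypothetical and is not part of Selenium.

```python
import os
import platform


def chromedriver_filename(system=None):
    """Return the ChromeDriver file name for the given OS name.

    `system` follows platform.system() values ('Windows', 'Linux', 'Darwin');
    when it is omitted, the current operating system is used.
    """
    system = system or platform.system()
    return 'chromedriver.exe' if system == 'Windows' else 'chromedriver'


# Assumption: the executable is stored in the current working directory
chromedriver_path = os.path.join('.', chromedriver_filename())
```

With this approach the same notebook can run unchanged on Windows, macOS, and Linux, as long as the driver file sits next to it.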

The next chunk creates an instance of the ChromeOptions class from the Selenium webdriver module. ChromeOptions lets you customize how the Chrome browser behaves when it is launched. Here, the --start-maximized argument is added to the options, which instructs Chrome to start with a maximized window.

The code then creates an instance of the webdriver.Chrome class, passing the options object. Note that chromedriver_path is not passed to the constructor in this cell: Selenium 4.6 and later can locate a matching driver automatically through Selenium Manager. To use the path explicitly in Selenium 4, wrap it in a Service object (from selenium.webdriver.chrome.service) and pass it as the service argument.

Finally, the code sets an implicit wait of 10 seconds on the driver object. With an implicit wait, Selenium keeps polling for up to that long when locating an element and raises a NoSuchElementException only if the element still has not appeared.

In [9]:
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")

driver = webdriver.Chrome(options=options)

driver.implicitly_wait(10)

The following chunk opens the LinkedIn login page, where you sign in manually, and reduces the zoom level of the page.

In [10]:
# Open the web page with the login and password fields
# Enter your username and password to authenticate as a LinkedIn user
driver.get('https://www.linkedin.com/login')


# Reducing the page zoom to 25% fits more of the page into the viewport,
# so most of the content you need is loaded sooner.
# Note: this inline style applies only to the currently loaded page; it is reset when the driver navigates to another URL.
driver.execute_script("document.body.style.zoom = '25%'")

## Job Posting Scraping Script

In [3]:
# Read the job list from a CSV file
job_list = pd.read_csv('job_list_all.csv')
data = []

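# Iterate through each job in the job list, starting from index 526
# (presumably a hard-coded resume point; use job_list.iterrows() to start from the first row)
for index, job in job_list[526:].iterrows():
    # Keep only the first three path segments of the stored URL; this assumes job['URL'] is a relative path starting with '/'
    URL = 'https://www.linkedin.com' + '/'.join(job['URL'].split('/')[0:4])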
    driver.get(URL)  # Open the URL in the web driver
    time.sleep(randint(4,10))  # Pause for a random time between 4 and 10 seconds
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')# Parse the page source using BeautifulSoup
    
    job_location = job['Location']  # Extract job location from the job list
    job_title = job['Job Title']  # Extract job title from the job list
    company_name = job['Company Name']  # Extract company name from the job list

    try:
        no_longer_message =  soup.find('span', class_ = 'artdeco-inline-feedback__message').get_text(strip = True)
        print(index, no_longer_message)
        no_longer = 'No longer' in no_longer_message  # Check if the job posting is no longer available
    except:
        no_longer = False
        
    if not no_longer:
        job_insights = soup.find_all('li', {'class': 'jobs-unified-top-card__job-insight'})# Find job insights
 
        # Extract job type and job level from job insights
        try:
            job_type = job_insights[0].find('span').get_text(strip = True).split('·')[0]
        except:
            job_type = 'Not defined'
        try:
            job_level = job_insights[0].find('span').get_text(strip = True).split('·')[1]
        except:
            job_level = 'Not defined'

        # Extract company size and job industry from job insights
        try:
            company_size = job_insights[1].find('span').get_text(strip = True).split('·')[0]
        except:
            company_size = 'Not defined'
        try:
            job_industry = job_insights[1].find('span').get_text(strip = True).split('·')[1]
        except:
            job_industry = 'Not defined'

        # Extract number of job applicants
        job_applicants = 'Not defined'  # Default, so a value from a previous job is never reused
        try:
            for insight in job_insights[2:]:
                text = insight.find('span').get_text(strip = True)
                if 'applicants' in text:
                    job_applicants = text.split(' ')[5]
                    break
        except Exception:
            pass
        if job_applicants == 'Not defined':
            # Fall back to the other elements where the applicant count can appear
            try:
                job_applicants = soup.find('span', {'class': 'jobs-unified-top-card__applicant-count'}).get_text(strip = True).split(' ')[0]
            except Exception:
                try:
                    span_element = soup.find('span', class_='jobs-unified-top-card__subtitle-secondary-grouping')
                    job_applicants = span_element.find('span', class_='jobs-unified-top-card__bullet').get_text(strip=True)
                except Exception:
                    job_applicants = 'Not defined'
        
        try:
            posted = soup.find('span', {'class': 'jobs-unified-top-card__posted-date'}).get_text(strip = True)
        except:
            posted = 'Not defined'
        try:
            workplacetype = soup.find('span', {'class': 'jobs-unified-top-card__workplace-type'}).get_text(strip = True)
        except:
            workplacetype = 'Not defined'
            
        # Click the "Skills" button to view job skills
        try:
            button_skills = driver.find_element(by=By.CLASS_NAME, value = 'jobs-unified-top-card__job-insight-text-button')
            button_skills.click()
            time.sleep(randint(4,10))
            soup_skills = BeautifulSoup(driver.page_source, 'html.parser')   
            skills = soup_skills.find_all('li', {'class': 'job-details-skill-match-status-list__unmatched-skill'})
            skill_set = ''
            for skill in skills:
                div_element = skill.find('div', class_='display-flex')  # The skill name sits in the div with the 'display-flex' class
                skill_text = div_element.get_text(strip=True)
                skill_set = skill_set + skill_text + '; '

            button_exit = driver.find_element(By.XPATH, "//span[text()='Done']")
            button_exit.click()
        except:
            skill_set = ''
        
        # Click the "More" button to view the full job description
        soup_more = soup  # Fall back to the already parsed page if the button is missing
        try:
            button_more = driver.find_element(by=By.CLASS_NAME, value = 'jobs-description__footer-button')
            button_more.click()
            time.sleep(randint(4,10))
            soup_more = BeautifulSoup(driver.page_source, 'html.parser')  # Re-parse the page with the expanded description
        except Exception:
            button_more = None
                
        # Extract the job description from the expanded page
        try:
            description = soup_more.find('div', {'class': 'jobs-box__html-content'}).get_text(strip = True)
        except Exception:
            description = 'Not defined'
        
        # Output for checking the progress
        print(index, job_title, company_name)
        
        # Append the extracted job information to the data list
        data.append({
            'Job Title': job_title,
            'Company Name': company_name, 
            'Company Size': company_size, 
            'Location': job_location, 
            'URL': URL, 
            'Workplace_Type': workplacetype, 
            'Posted': posted,
            'Applicants': job_applicants,
            'Industry': job_industry, 
            'Job level': job_level,
            'Job type': job_type, 
            'Skillset': skill_set,
            'Description': description
        })
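
The repeated try/except blocks that split an insight string on '·' and pick out the applicant count can be factored into small helpers. The following is a minimal sketch, not the notebook's actual code: the names split_insight and extract_applicants are hypothetical, and the sample strings are assumed examples of the text LinkedIn displays, which may differ. Unlike the loop above, the helper also strips the spaces around each part.

```python
import re

NOT_DEFINED = 'Not defined'


def split_insight(text, parts=2, sep='·'):
    """Split an insight string on `sep`, padding missing parts with a default."""
    pieces = [p.strip() for p in text.split(sep)] if text else []
    pieces = [p if p else NOT_DEFINED for p in pieces[:parts]]
    return pieces + [NOT_DEFINED] * (parts - len(pieces))


def extract_applicants(text):
    """Return the number directly preceding 'applicant' as a digit string, or the default."""
    match = re.search(r'(\d[\d,]*)\s+applicant', text or '')
    return match.group(1).replace(',', '') if match else NOT_DEFINED


# Assumed sample strings, for illustration only
job_type, job_level = split_insight('Full-time · Mid-Senior level')
company_size, job_industry = split_insight('51-200 employees')
job_applicants = extract_applicants('See how you compare to 123 applicants. Try Premium')
```

In the loop, each pair of try/except blocks could then be replaced by a single call on the text of the corresponding insight element, and a missing part falls back to 'Not defined' without raising an exception.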


## Results saving
Do not forget to save your results:

In [27]:
# Create a DataFrame from the collected job data
df = pd.DataFrame(data)
df.to_csv('job_postings.csv')
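
Because the scraping loop can run for a long time and may be interrupted, it can help to write rows to disk as they are collected rather than only at the end. The following is a minimal sketch using the standard csv module; the helper name append_rows and the file name used in the note below are hypothetical and not part of the original notebook.

```python
import csv
import os


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

For example, calling append_rows('job_postings_partial.csv', [row], list(row)) after each job, where row is the dictionary appended to data, would keep everything scraped so far even if the session stops midway.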