# Job List from LinkedIn 

This notebook scrapes a list of job postings from the LinkedIn job search engine and collects the Job Title, Company Name, Location, and URL of each job.

This is the first step of the job market analysis described in the article: https://orlovtsu.github.io/job_postings_analysis.html.

If you use this code to scrape data from LinkedIn, be aware of the LinkedIn Terms of Use and make sure that you do not violate them.

## Import Libraries
1. Selenium is a tool for automating web browsers. The imported modules let you drive the browser (webdriver), locate elements by various criteria (By), and simulate keyboard actions (Keys).
2. BeautifulSoup parses HTML and XML documents and provides convenient methods for extracting data from web pages.
3. Pandas is a data manipulation and analysis library. It provides data structures and functions for handling structured data efficiently.
4. The time module provides functions for time-related operations, such as delays and timestamps.
5. The randint function from the random module generates random integers; it is used here to randomize the pauses between requests.

In [11]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import randint

## Initialization 
This next chunk assigns the path to the ChromeDriver executable to the variable chromedriver_path. The ChromeDriver is a separate executable that is required when using Selenium with Google Chrome. It acts as a bridge between the Selenium WebDriver and the Chrome browser, allowing automated interactions with the browser.

In this case, the chromedriver_path is set to './chromedriver', indicating that the ChromeDriver executable is located in the current directory (denoted by '.') and its filename is chromedriver. The specific path may vary depending on the actual location of the ChromeDriver executable on your system.

Make sure to provide the correct path to the ChromeDriver executable file in order for Selenium to work properly with Google Chrome.

For more details: https://chromedriver.chromium.org/downloads

In [2]:
chromedriver_path = './chromedriver'
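
Since a wrong path is a common reason for Selenium failing to start, the path assigned above can be checked before the browser is launched. The following is a minimal sketch, not part of Selenium: `check_driver_path` is a hypothetical helper name, and it only verifies that the file exists and is executable, not that its version matches the installed Chrome.

```python
import os


def check_driver_path(path):
    """Return the absolute driver path, or raise if it is missing or not executable."""
    full_path = os.path.abspath(path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"ChromeDriver not found at {full_path}")
    if not os.access(full_path, os.X_OK):
        raise PermissionError(f"ChromeDriver at {full_path} is not executable")
    return full_path
```

It could be used as `chromedriver_path = check_driver_path('./chromedriver')`, so that a misplaced driver is reported with a clear message before Selenium is involved.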

The next chunk creates an instance of the ChromeOptions class from the Selenium webdriver module. ChromeOptions lets you customize the behavior of the Chrome browser when it is launched. Here, the --start-maximized argument is added to the options, which instructs Chrome to start with a maximized window.

The code then creates an instance of the webdriver.Chrome class from the options object and chromedriver_path. This starts Chrome through the ChromeDriver executable located at chromedriver_path, with the specified options applied. Finally, it sets an implicit wait of 10 seconds on the driver object: when an element is not immediately available on the page, Selenium keeps looking for it for up to 10 seconds before raising a NoSuchElementException.

In [4]:

options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")

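# Selenium 4.10+ no longer accepts executable_path here; pass the driver path through a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)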

driver.implicitly_wait(10)
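
Conceptually, the implicit wait set above behaves like a polling loop: the lookup is repeated until it succeeds or the time limit runs out. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `poll_until`, `locate`, `timeout`, and `interval` are hypothetical names chosen for this example.

```python
import time


def poll_until(locate, timeout=10.0, interval=0.5):
    """Call locate() repeatedly until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = locate()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing found within {timeout} seconds")
        time.sleep(interval)
```

Here `locate` stands for any lookup that returns None while the element is still missing. Selenium performs the equivalent retrying internally for every element lookup once `implicitly_wait` has been called.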

The following chunk opens the LinkedIn login page, where you should sign in, and reduces the zoom level of the page.

In [14]:
# Open the web page with the login and password fields
# Enter your username and password to authenticate as a LinkedIn user
driver.get('https://www.linkedin.com/login')


# To speed up the downloading process for scraping the page content, it is recommended to reduce the page scale to 25%.
# This will result in faster download of the majority of the content you require.
driver.execute_script("document.body.style.zoom = '25%'")
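
Because the login is typed in by hand, the scraping cell should not start before it has finished. One way to make the notebook wait is sketched below. It assumes that leaving the `/login` path means the login succeeded, which may not hold if LinkedIn shows an additional verification page; `wait_for_login` and its parameters are hypothetical names, not part of Selenium.

```python
import time


def wait_for_login(driver, timeout=300, interval=2):
    """Return True once driver.current_url has left the login page, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumption: any URL without '/login' means the user is signed in
        if '/login' not in driver.current_url:
            return True
        time.sleep(interval)
    return False
```

Calling `wait_for_login(driver)` right after opening the login page would block for up to five minutes while the credentials are entered.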

## Job List Scraping script:

In [5]:
# Set the search query and location
query = '"BI"'
location = 'Canada'

# Create an empty list to store the job data
data = []

# Loop through multiple pages
for page_num in range(1, 40):
    # Construct the URL for each page based on the query, location, and page number
    url = f'https://www.linkedin.com/jobs/search/?keywords={query}&location={location}&start={25 * (page_num - 1)}'

    # Open the URL in the web driver
    driver.get(url)

    # Decrease the page scale to 25% for faster content downloading.
    # This has to run after the page has loaded, because navigating to a new page resets the zoom.
    driver.execute_script("document.body.style.zoom = '25%'")
    
    # Pause for a random time between 10 and 20 seconds
    time.sleep(randint(10,20))  
    
    # Parse the page source using BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Find all job postings on the page
    job_postings = soup.find_all('li', {'class': 'jobs-search-results__list-item'})
    
    # Print job titles for each job posting
    for k, job in enumerate(job_postings):
        try:                
            print(k, ':', job.find('a', class_='job-card-list__title').get_text().strip())
        except AttributeError:
            print(None)
            
    print('-----')
    
    # Extract relevant information from each job posting and store it in a list of dictionaries
    for i in range(len(job_postings)):
        j = 0
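        # Retry while the title link of card i is missing (the card has not rendered yet)
        while job_postings[i].find('a', class_='job-card-list__title') is None and j < 10:
            print(i, ': Attempt again -', j)
            # Re-read the live page; re-parsing the unchanged soup would return the same cards
            time.sleep(1)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            job_postings = soup.find_all('li', {'class': 'jobs-search-results__list-item'})
            for k, job in enumerate(job_postings):
                try:
                    print(k, ':', job.find('a', class_='job-card-list__title').get_text().strip())
                except AttributeError:
                    print(None)
            print('-----')
            j += 1
            # After 4 failed attempts, move on to the next card if there is one
            if j == 4 and i + 1 < len(job_postings):
                i += 1
                j = 0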
        
        job_posting = job_postings[i]

        # Extract job title, company name, location, and URL from the job posting
        try:
            job_title = job_posting.find('a', class_='job-card-list__title').get_text().strip()
        except AttributeError:
            job_title = None
        
        # Extract company name. Depending on personal account settings, the HTML tag names and structure may vary.
        # If this line does not work, uncomment one of the other lines and try again
        try:
            #company_name = job_posting.find('span', class_='job-card-container__primary-description ').get_text().strip()
            company_name = job_posting.find('div', class_='job-card-container__company-name').text.strip()
            #company_name = job_posting.find('span', class_='job-card-container__primary-description').get_text(strip=True)
        except AttributeError:
            company_name = None
        print(i, company_name)
        try:
            job_location = job_posting.find('li', class_='job-card-container__metadata-item').get_text().strip()
        except AttributeError:
            job_location = None

        try:
            URL = job_posting.find('a', class_='job-card-container__link', href=True).get('href')
        except AttributeError:
            URL = None

        try:
            job_next = job_postings[i+1].find('a', class_='job-card-list__title').get_text().strip()
        except (AttributeError, IndexError):
            job_next = None
        
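        # The next card has no title yet (or this is the last card): click the current job's link and reload.
        # Skipped when no URL was found, because URL.split would fail on None.
        if not job_next and URL: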
            URL1 = '/'.join(URL.split('/')[0:4])
            button = driver.find_element(by=By.XPATH, value = f"//a[contains(@href, '{URL1}')]")
            button.click()
      
            time.sleep(2)
            current_url = driver.current_url   
            
            driver.get(current_url)
            
            if i < len(job_postings) - 1:
                time.sleep(randint(10, 20))  # Wait for a random time between 10 and 20 seconds
                soup = BeautifulSoup(driver.page_source, 'html.parser')
                j = 0
                while job_postings[i+1].find('a', class_='job-card-list__title') is None and j < 10:
                    print(i, ': Attempt again')
                    job_postings = soup.find_all('li', {'class': 'jobs-search-results__list-item'})
                    for k, job in enumerate(job_postings):
                        try:
                            print(k, ':', job.find('a', class_='job-card-list__title').get_text().strip())
                        except AttributeError:
                            print(None)
                    print('-----')
                    j += 1
                    # After 4 failed attempts, move on only if a following card exists
                    if j == 4 and i + 2 < len(job_postings):
                        i += 1
                        j = 0
        # Append the extracted job information to the data list   
        data.append({
            'Job Title': job_title,
            'Company Name': company_name,
            'Location': job_location,
            'URL': URL,
         })
        
        # Pause for 5 seconds
        time.sleep(5) 
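
The search URL in the loop above is assembled with an f-string, so characters such as the double quotes in `'"BI"'` or spaces in a location go into the URL unescaped. A minimal sketch of the same URL built with percent-encoding from the standard library follows. The helper name `build_search_url` and the constant `RESULTS_PER_PAGE` are hypothetical; the parameter names and the offset step of 25 are taken from the loop above.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 25  # offset step used by the scraping loop


def build_search_url(query, location, page_num):
    """Build the job search URL for a 1-based page number, escaping special characters."""
    params = {
        'keywords': query,
        'location': location,
        'start': RESULTS_PER_PAGE * (page_num - 1),
    }
    return 'https://www.linkedin.com/jobs/search/?' + urlencode(params)
```

For example, `build_search_url('"BI"', 'Canada', 3)` returns `https://www.linkedin.com/jobs/search/?keywords=%22BI%22&location=Canada&start=50`.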

Depending on the account and some personal settings, the structure of the HTML page may vary, which is why the previous chunk may sometimes fail with an error. When that happens, the code needs to be tuned, and the DataFrame should be saved so that the data already collected is kept and the scraping does not have to be repeated.
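
One way to cope with the varying page structure without commenting lines in and out is to try several tag and class combinations in order. The sketch below is only an illustration: `first_matching_text` and `COMPANY_SELECTORS` are hypothetical names, and the selector pairs are the ones that already appear in the scraping cell. It works with any object that offers a BeautifulSoup-style `find(name, class_=...)` method.

```python
def first_matching_text(card, candidates):
    """Return the stripped text of the first (tag, class) pair found in card, else None."""
    for tag_name, class_name in candidates:
        element = card.find(tag_name, class_=class_name)
        if element is not None:
            return element.get_text(strip=True)
    return None


# Selector pairs taken from the active and commented-out lines of the scraping cell
COMPANY_SELECTORS = [
    ('div', 'job-card-container__company-name'),
    ('span', 'job-card-container__primary-description'),
]
```

Inside the loop, this could replace the try/except around the company name with `company_name = first_matching_text(job_posting, COMPANY_SELECTORS)`. New variants can then be added to the list as they are discovered.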

## Results saving
Do not forget to save your results:

In [6]:
# Create a DataFrame from the collected job data        
df = pd.DataFrame(data)
df.to_csv('job_list_all.csv')
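
The cell above saves everything once, at the end. To keep partial results when a long run fails midway, rows can instead be appended to a file after each page. The sketch below uses the csv module from the standard library rather than pandas so that it has no extra dependencies. `append_rows`, `FIELDS`, and the file name in the usage note are hypothetical; the column names match the dictionaries built in the scraping cell.

```python
import csv
import os

FIELDS = ['Job Title', 'Company Name', 'Location', 'URL']


def append_rows(csv_path, rows):
    """Append job dictionaries to csv_path, writing the header only for a new or empty file."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling, for example, `append_rows('job_list_partial.csv', page_rows)` at the end of each page iteration would leave a file that can later be loaded with `pd.read_csv`.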