# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [5]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

def scrape_linkedin_job_search(keywords):
    """
    This function searches job posts from LinkedIn and converts the results into a dataframe.
    """
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?currentJobId=4187409133&'
    
    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])

    # Create a request to get the data from the server 
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    titles = []
    companies = []
    locations = []
    results = soup.select('ul.jobs-search__results-list')[0] # get list with cards
    cards = results.select("div.base-search-card__info") # get each card individually
    for card in cards:
        title = card.find("h3", class_="base-search-card__title").get_text(strip=True)
        company = card.find("h4", class_="base-search-card__subtitle").get_text(strip=True)
        location = card.find("span", class_="job-search-card__location").get_text(strip=True)
        titles.append(title)
        companies.append(company)
        locations.append(location)
    data = pd.DataFrame({
        "title":titles,
        "company":companies,
        "location":locations
    })
    
    # Return dataframe
    return data

In [6]:
# Example to call the function

results = scrape_linkedin_job_search('data%20analysis')
results

Unnamed: 0,title,company,location
0,Data Analyst,The Walt Disney Company,"New York, NY"
1,Business Analyst,Foley,"Texas, United States"
2,Business Analyst,Foley,"South Carolina, United States"
3,Business Intelligence Analyst,Staples,"Framingham, MA"
4,"Business Intelligence Analyst, Growth Marketin...",Google,"Sunnyvale, CA"
5,"Analyst, Business Analytics",Petco,"Austin, TX"
6,Business Analyst,Foley,"Florida, United States"
7,Business Analyst,Foley,"Georgia, United States"
8,"Business Intelligence Analyst, Growth Marketin...",Google,"San Francisco, CA"
9,Portfolio Management Analyst - Crystal Creek C...,The Yarrow Group,"Jackson, WY"


## Challenge 1

The first challenge for you is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will allow you to search more than 25 jobs with this function. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough to **not** trigger errors. If the search has already reached the end, skip any further requests and return the results collected so far.
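The paging and early-stop logic described above can be sketched without touching the network. This is a minimal illustration, not the lab's reference solution: `build_page_url`, `collect_pages`, `fake_fetch`, and the fixed `PAGE_SIZE` of 25 are hypothetical names and assumptions introduced here, and `start` is the offset parameter you should confirm in your browser.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.linkedin.com/jobs/search/"
PAGE_SIZE = 25  # assumption: one page of results holds 25 job cards


def build_page_url(keywords, page_index, page_size=PAGE_SIZE):
    """Return the search URL for a zero-based page index."""
    params = {"keywords": keywords, "start": page_index * page_size}
    return f"{SEARCH_URL}?{urlencode(params)}"


def collect_pages(keywords, num_pages, fetch_cards):
    """Call fetch_cards(url) once per page and stop at the first empty page."""
    collected = []
    for page_index in range(num_pages):
        cards = fetch_cards(build_page_url(keywords, page_index))
        if not cards:
            break  # ran out of results before num_pages was reached
        collected.extend(cards)
    return collected


# Stand-in for a real scraper: only offsets 0 and 25 return any "cards".
fake_site = {0: ["job A", "job B"], 25: ["job C"]}


def fake_fetch(url):
    start = int(url.rsplit("start=", 1)[1])
    return fake_site.get(start, [])


print(collect_pages("data analysis", 5, fake_fetch))
```

Five pages are requested, but the loop stops after the third request returns nothing. In a real scraper, `fetch_cards` would wrap the `requests` and Beautiful Soup code from the example above.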

In [7]:
def scrape_linkedin_job_search_pages(keywords, pages):
    """
    This function searches job posts from LinkedIn and converts the results into a dataframe.
    """
    
    listing_count = 0
    titles = []
    companies = []
    locations = []
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?currentJobId=4187409133&'

    for _ in range(pages):
        # Assemble the full url with parameters
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, '&start=', str(listing_count)])
        # Create a request to get the data from the server 
        page = requests.get(scrape_url)
        soup = BeautifulSoup(page.text, 'html.parser')

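        result_lists = soup.select('ul.jobs-search__results-list')
        if not result_lists:
            break # no results list on this page, so stop early
        results = result_lists[0] # get list with cards
        cards = results.select("div.base-search-card__info") # get each card individually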
        for card in cards:
            title = card.find("h3", class_="base-search-card__title").get_text(strip=True)
            company = card.find("h4", class_="base-search-card__subtitle").get_text(strip=True)
            location = card.find("span", class_="job-search-card__location").get_text(strip=True)
            titles.append(title)
            companies.append(company)
            locations.append(location)
            listing_count += 1
    data = pd.DataFrame({
        "title":titles,
        "company":companies,
        "location":locations
    })
    
    # Return dataframe
    return data

In [8]:
# Example to call the function

results = scrape_linkedin_job_search_pages('data%20analysis', 5)
results

Unnamed: 0,title,company,location
0,Data Analyst,The Walt Disney Company,"New York, NY"
1,Business Analyst,Foley,"Texas, United States"
2,Business Analyst,Foley,"South Carolina, United States"
3,Business Intelligence Analyst,Staples,"Framingham, MA"
4,"Business Intelligence Analyst, Growth Marketin...",Google,"Sunnyvale, CA"
...,...,...,...
295,Analytics Engineer (Hybrid),Weedmaps,"Austin, TX"
296,Data and Process Analyst,PharmaForce,United States
297,DATA ANALYST,Lensa,"New York, NY"
298,Growth Strategy & Analytics Coordinator,Privia Health,"Arlington, VA"


## Challenge 2

Further improve your function so that it can search jobs in a specific country. Add a third parameter called `country` to your function. The steps are identical to those in Challenge 1.
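One way to add the country filter is to treat it as an optional query parameter and let `urllib.parse.urlencode` handle spaces and special characters. The sketch below assumes the parameter is named `location`, which you should confirm by watching the URL in your browser; `build_search_url` is a hypothetical helper, not part of the lab code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(keywords, country=None, start=0):
    """Assemble a search URL, adding the country filter only when one is given."""
    params = {"keywords": keywords}
    if country:
        params["location"] = country  # assumed parameter name
    params["start"] = start
    return "https://www.linkedin.com/jobs/search/?" + urlencode(params)


url = build_search_url("data analysis", country="United Kingdom")
print(url)

# Decoding the query string recovers the original, unencoded values.
print(parse_qs(urlsplit(url).query))
```

Because `urlencode` does the escaping, callers can pass plain strings such as `'data analysis'` instead of pre-encoding them as `'data%20analysis'`.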

In [9]:
def scrape_linkedin_job_search_pages_countries(keywords, pages, country):
    """
    This function searches job posts from LinkedIn and converts the results into a dataframe.
    """
    
    listing_count = 0
    titles = []
    companies = []
    locations = []
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?currentJobId=4187409133&'

    for _ in range(pages):
        # Assemble the full url with parameters
        scrape_url = ''.join([BASE_URL, 'keywords=', keywords, '&location=', country, '&start=', str(listing_count)])
        # Create a request to get the data from the server 
        page = requests.get(scrape_url)
        soup = BeautifulSoup(page.text, 'html.parser')

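        result_lists = soup.select('ul.jobs-search__results-list')
        if not result_lists:
            break # no results list on this page, so stop early
        results = result_lists[0] # get list with cards
        cards = results.select("div.base-search-card__info") # get each card individually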
        for card in cards:
            title = card.find("h3", class_="base-search-card__title").get_text(strip=True)
            company = card.find("h4", class_="base-search-card__subtitle").get_text(strip=True)
            location = card.find("span", class_="job-search-card__location").get_text(strip=True)
            titles.append(title)
            companies.append(company)
            locations.append(location)
            listing_count += 1
    data = pd.DataFrame({
        "title":titles,
        "company":companies,
        "location":locations
    })
    
    # Return dataframe
    return data

In [10]:
# Example to call the function

results = scrape_linkedin_job_search_pages_countries('data%20analysis', 5, 'Germany')
results

Unnamed: 0,title,company,location
0,Data Scientist,Mondu,"Berlin, Berlin, Germany"
1,Graduate Private Equity Analyst,CityGrad,"Munich, Bavaria, Germany"
2,People Analytics Specialist (m/w/d),ZARA,"Hamburg, Hamburg, Germany"
3,Data Analytics Analyst (H/F),ODDO BHF,"Frankfurt, Hesse, Germany"
4,Junior Data Analyst,AMBOSS,"Berlin, Berlin, Germany"
...,...,...,...
295,Spezialist/in Statistik (reine Statistik + SP...,WissPro,Germany
296,Junior Business Operations Associate (m/f/d),Fusion Consulting,"Mainz, Rhineland-Palatinate, Germany"
297,Junior Data Scientist (m/w/d),Lidl in Germany,"Neckarsulm, Baden-Württemberg, Germany"
298,Power BI Business Analyst (m/w/d),TALENTLOTSEN GmbH,"Offenburg, Baden-Württemberg, Germany"


## Challenge 3

Add the 4th param called `num_days` to your function to allow it to search jobs posted in the past X days. Note that in the LinkedIn job search the searched timespan is specified with the following param:

```
f_TPR=r259200
```

The numeric part of the param value is a number of seconds: 259,200 seconds equals 3 days (3 × 86,400). You need to convert `num_days` to seconds and pass that value to the LinkedIn job search.
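The conversion is a single multiplication. The helper below is a small sketch of it; the name `time_filter` and the check for non-positive input are illustrative additions, not something the lab requires.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def time_filter(num_days):
    """Return the f_TPR query fragment for jobs posted in the last num_days days."""
    if num_days <= 0:
        raise ValueError("num_days must be a positive number of days")
    return f"f_TPR=r{num_days * SECONDS_PER_DAY}"


print(time_filter(3))  # f_TPR=r259200
```

The returned fragment can be joined to the rest of the query string with `&`, in the same way as the other parameters.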

In [11]:
def scrape_linkedin_job_search_pages_countries_days(keywords, pages, country, num_days):
    """
    Searches LinkedIn jobs with specified keywords, country, and date range.
    
    Parameters:
    - keywords (str): URL-encoded job title or skill
    - pages (int): Number of result pages to scrape
    - country (str): Country name or region
    - num_days (int): Filter jobs posted in the last X days
    """
    listing_count = 0
    titles = []
    companies = []
    locations = []

    # Convert days to seconds and form the f_TPR param
    seconds = num_days * 86400
    time_filter = f"f_TPR=r{seconds}"

    BASE_URL = 'https://www.linkedin.com/jobs/search/?'

    for _ in range(pages):
        scrape_url = ''.join([
            BASE_URL,
            'keywords=', keywords,
            '&location=', country,
            '&', time_filter,
            '&start=', str(listing_count)
        ])
        page = requests.get(scrape_url)
        soup = BeautifulSoup(page.text, 'html.parser')

        try:
            results = soup.select('ul.jobs-search__results-list')[0]
        except IndexError:
            break  # no results found
        cards = results.select("div.base-search-card__info")
        for card in cards:
            title = card.find("h3", class_="base-search-card__title").get_text(strip=True)
            company = card.find("h4", class_="base-search-card__subtitle").get_text(strip=True)
            location = card.find("span", class_="job-search-card__location").get_text(strip=True)
            titles.append(title)
            companies.append(company)
            locations.append(location)
            listing_count += 1

    return pd.DataFrame({
        "title": titles,
        "company": companies,
        "location": locations
    })

In [12]:
results = scrape_linkedin_job_search_pages_countries_days('data%20analysis', 3, 'Canada', 5)
results

Unnamed: 0,title,company,location
0,Data Analyst - Banking,Capgemini,"Mississauga, Ontario, Canada"
1,Business Systems Analyst,Scotiabank,"Toronto, Ontario, Canada"
2,Decision Sciences Analyst,Volkswagen Financial Services Canada,"Pickering, Ontario, Canada"
3,Data Scientist,Scotiabank,"Toronto, Ontario, Canada"
4,Business Analyst,Brookfield,"Toronto, Ontario, Canada"
...,...,...,...
175,Data BA,CyberSolve IT Inc.,"Toronto, Ontario, Canada"
176,Technical Business Analyst,Southampton Financial Inc,"Toronto, Ontario, Canada"
177,"Investment Banking Analyst, Co-op Winter 2026",Raymond James Ltd.,"Toronto, Ontario, Canada"
178,"Business Intelligence Engineer, P2 Science, Da...",Amazon,"Vancouver, British Columbia, Canada"


## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
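As a starting point, the job ID can be pulled out of a card's link with a regular expression. The sketch below assumes the link ends with a hyphen followed by the numeric ID, optionally followed by a query string; the sample URL is made up for illustration, and `extract_job_id` and `detail_url` are hypothetical helper names. Inspect a real card's HTML to confirm the shape before relying on it.

```python
import re

# Assumption: the numeric job ID is the last hyphen-separated part of the URL path.
JOB_ID_PATTERN = re.compile(r"-(\d+)(?:[/?#]|$)")


def extract_job_id(job_url):
    """Return the job ID as a string, or None if the URL does not contain one."""
    match = JOB_ID_PATTERN.search(job_url)
    return match.group(1) if match else None


def detail_url(job_id):
    """Build the URL of the job detail page for a given job ID."""
    return f"https://www.linkedin.com/jobs/view/{job_id}"


# Made-up example link in the assumed format.
sample = "https://www.linkedin.com/jobs/view/data-analyst-at-example-co-1234567890?position=1"

job_id = extract_job_id(sample)
print(job_id)
print(detail_url(job_id))
```

Each detail page costs one extra request per job card, so it is worth pausing briefly between requests, for example with `time.sleep`.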

In [37]:
import re
import time

def scrape_linkedin_job_search_pages_countries_days_seniorities(keywords, pages, country, num_days):
    listing_count = 0
    titles = []
    companies = []
    locations = []
    seniorities = []

    BASE_URL = 'https://www.linkedin.com/jobs/search/?'

    # Convert num_days to seconds for LinkedIn param f_TPR
    seconds = num_days * 86400
    country_param = f'location={country}&' if country else ''

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.google.com/"
    }

    for _ in range(pages):
        scrape_url = (f"{BASE_URL}"
                      f"keywords={keywords}&"
                      f"{country_param}"
                      f"f_TPR=r{seconds}&"
                      f"start={listing_count}")

        page = requests.get(scrape_url, headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')

        results = soup.select('ul.jobs-search__results-list')
        if not results:
            break  # No results on this page, stop early
        results = results[0]

        cards = results.select("li")
        if not cards:
            break  # No job cards found, stop early

        for card in cards:
            job_link = card.find("a", href=True)
            if not job_link:
                continue

            job_url = job_link["href"]
            # Match the trailing numeric ID; this also works when the company name contains hyphens
            job_id_match = re.search(r'-(\d+)(?:[/?#]|$)', job_url)
            if not job_id_match:
                continue
            job_id = job_id_match.group(1)
            job_detail_url = f"https://www.linkedin.com/jobs/view/{job_id}"

            title_tag = card.select_one("h3.base-search-card__title")
            company_tag = card.select_one("h4.base-search-card__subtitle")
            location_tag = card.select_one("span.job-search-card__location")

            if not (title_tag and company_tag and location_tag):
                continue

            title = title_tag.get_text(strip=True)
            company = company_tag.get_text(strip=True)
            location = location_tag.get_text(strip=True)

            job_detail_page = requests.get(job_detail_url, headers=headers)
            job_soup = BeautifulSoup(job_detail_page.text, 'html.parser')

            seniority = "Not Listed"
            for li in job_soup.select("li"):
                if 'Seniority level' in li.get_text():
                    seniority = li.get_text(strip=True).replace('Seniority level', '').strip()
                    break

            titles.append(title)
            companies.append(company)
            locations.append(location)
            seniorities.append(seniority)

            listing_count += 1
            time.sleep(0.5)  # be polite

    data = pd.DataFrame({
        "title": titles,
        "company": companies,
        "location": locations,
        "seniority": seniorities
    })

    return data

In [39]:
results = scrape_linkedin_job_search_pages_countries_days_seniorities('data%20analyst', 2, 'Canada', 3)
results

Unnamed: 0,title,company,location,seniority
0,Data Analyst,MRM,"Toronto, Ontario, Canada",Mid-Senior level
1,"(Canada) Sr. Data Analyst, Life Sciences",PointClickCare,"Mississauga, Ontario, Canada",Mid-Senior level
2,Senior Data Analyst,KOHO,Canada,Mid-Senior level
3,"Intern, Data Science (Fall 2025)",Wealthsimple,"Toronto, Ontario, Canada",Not Applicable
4,Senior Data Analyst (Remote - Anywhere),Jobgether,Canada,Mid-Senior level
5,Data Analyst,Provident10,"St John’s, Newfoundland and Labrador, Canada",Associate
6,Senior Data Analyst - Data Products & Insights,RBC,"Toronto, Ontario, Canada",Not Applicable
7,Data Centre Analyst,TransLink,"Vancouver, British Columbia, Canada",Mid-Senior level
8,Senior Data Analyst,Scopely,Canada,Mid-Senior level
9,Business Systems Analyst,Scotiabank,"Toronto, Ontario, Canada",Mid-Senior level
