# Job Board Scraping Lab

In this lab you will first see a minimal but fully functional code snippet that scrapes the LinkedIn Job Search webpage. You will then build on the example code to complete several challenges.

### Some Resources 

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

In [1]:
# Import the required libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

"""
This function searches job posts from LinkedIn and converts the results into a dataframe.
"""
def scrape_linkedin_job_search(keywords):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])

    # Create a request to get the data from the server 
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    titles = []
    companies = []
    locations = []
    for card in soup.select("div.result-card__contents"):
        title = card.findChild("h3", recursive=False)
        company = card.findChild("h4", recursive=False)
        location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
        titles.append(title.string)
        companies.append(company.string)
        locations.append(location.string)
    
    # Build the dataframe from the collected job titles, companies, and locations.
    # DataFrame.append was removed in pandas 2.0, so the rows are passed to the constructor instead.
    rows = list(zip(titles, companies, locations))
    data = pd.DataFrame(rows, columns=columns)
    
    # Return dataframe
    return data

In [51]:
# Example to call the function

results = scrape_linkedin_job_search('data%20analysis')
results

Unnamed: 0,Title,Company,Location
0,Data Analyst,Populus Group,"Troy, Michigan"
1,Data Analyst,Atlantic Partners Corporation,"Wilmington, Delaware, United States"
2,Data Analyst,EPITEC,"Plano, Texas"
3,Business Data Analyst,Fluxtek Solutions Inc,"West Palm Beach, Florida Area"
4,Data Analysis Associate,Genesis,"Mobile, Alabama"
5,Sr Data Modeler,Saama,"South San Francisco, California"
6,Data Analyst,Motion Recruitment,"Philadelphia, Pennsylvania"
7,Senior Data Analyst,Ferguson Consulting Inc,"St Louis, Missouri, United States"
8,Data Analyst,CardinalCommerce,"Cleveland/Akron, Ohio Area"
9,Maps Data Analysis,Modis,"San Antonio, Texas Area"


## Challenge 1

The first challenge for you is to update the `scrape_linkedin_job_search` function by adding a new parameter called `num_pages`. This will allow you to search more than 25 jobs with this function. Suggested steps:

1. Go to https://www.linkedin.com/jobs/search/?keywords=data%20analysis in your browser.
1. Scroll down the left panel and click the page 2 link. Look at how the URL changes and identify the page offset parameter.
1. Add `num_pages` as a new param to the `scrape_linkedin_job_search` function. Update the function code so that it uses a "for" loop to retrieve several pages of search results.
1. Test your new function by scraping 5 pages of the search results.

Hint: Prepare for the case where there are fewer than 5 pages of search results. Your function should be robust enough **not** to trigger errors. If the search has already reached the end, simply skip the remaining requests and return all results collected so far.
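The offset arithmetic behind these steps can be tried out without touching the network. The following is a minimal sketch, assuming the page offset parameter is named `start` and that each page holds 25 results; `page_urls` and `PAGE_SIZE` are hypothetical names introduced only for illustration.

```python
BASE_URL = 'https://www.linkedin.com/jobs/search/?'
PAGE_SIZE = 25  # assumed number of results per page


def page_urls(keywords, num_pages):
    """Return one search URL per requested page (hypothetical helper)."""
    first_page = ''.join([BASE_URL, 'keywords=', keywords])
    urls = []
    for page_index in range(num_pages):
        if page_index == 0:
            # The first page needs no offset
            urls.append(first_page)
        else:
            # Page n starts at result n * PAGE_SIZE
            offset = page_index * PAGE_SIZE
            urls.append(''.join([first_page, '&start=', str(offset)]))
    return urls


for url in page_urls('data%20analysis', 3):
    print(url)
```

Inside the scraping loop, a page that yields no job cards is a reasonable signal to stop requesting further pages.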

In [52]:
# your code here

def scrape_linkedin_job_search_pages(keywords,total_page_number):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
    # Assemble the full url with parameters
    scrape_url = ''.join([BASE_URL, 'keywords=', keywords])
    
    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    
    
    # Request up to total_page_number pages. Each page holds 25 results,
    # so page `num` starts at offset num * 25.
    rows = []
    for num in range(total_page_number):
        if num == 0:
            page = requests.get(scrape_url)
        else:
            page = requests.get(''.join([scrape_url, '&start=', str(num * 25)]))
        soup = BeautifulSoup(page.text, 'html.parser')
        cards = soup.select("div.result-card__contents")

        # A page without job cards means the search has reached the end
        if not cards:
            break

        for card in cards:
            title = card.findChild("h3", recursive=False)
            company = card.findChild("h4", recursive=False)
            location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            rows.append({'Title': title.string, 'Company': company.string, 'Location': location.string})

    # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    data = pd.DataFrame(rows, columns=columns)
    
    
    # Return dataframe
    return data

In [53]:
scrape_linkedin_job_search_pages('data%20analysis',3)

Unnamed: 0,Title,Company,Location
0,Data Analyst,Populus Group,"Troy, Michigan"
1,Data Analyst,Atlantic Partners Corporation,"Wilmington, Delaware, United States"
2,Data Analyst,EPITEC,"Plano, Texas"
3,Business Data Analyst,Fluxtek Solutions Inc,"West Palm Beach, Florida Area"
4,Data Analysis Associate,Genesis,"Mobile, Alabama"
5,Sr Data Modeler,Saama,"South San Francisco, California"
6,Data Analyst,Motion Recruitment,"Philadelphia, Pennsylvania"
7,Senior Data Analyst,Ferguson Consulting Inc,"St Louis, Missouri, United States"
8,Data Analyst,CardinalCommerce,"Cleveland/Akron, Ohio Area"
9,Maps Data Analysis,Modis,"San Antonio, Texas Area"


## Challenge 2

Further improve your function so that it can search jobs in a specific country. Add a third parameter called `country` to your function. The steps are identical to those in Challenge 1.
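As the number of query parameters grows, joining strings by hand becomes error-prone. The following is a minimal sketch of assembling the URL with the standard library instead, assuming the country is passed in a `location` query parameter; `build_search_url` is a hypothetical helper name, not part of the lab's starter code.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://www.linkedin.com/jobs/search/?'


def build_search_url(keywords, country=None):
    """Assemble the search URL, adding a location filter only when a country is given."""
    params = {'keywords': keywords}
    if country:
        # Assumed parameter name for the country filter
        params['location'] = country
    # quote_via=quote encodes spaces as %20 rather than +
    return BASE_URL + urlencode(params, quote_via=quote)


print(build_search_url('data analysis', 'United Kingdom'))
print(build_search_url('data analysis'))
```

With this approach the caller can pass plain text such as `'data analysis'` and leave the percent-encoding to `urlencode`.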

In [5]:
def scrape_linkedin_job_search_country(keywords,total_page_number, country):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
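    # Assemble the full url with parameters.
    # Skip the location filter when no country is requested (True, None, or an empty string).
    if country is True or not country: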
        scrape_url = ''.join([BASE_URL, "keywords=", keywords.replace(" ","%20")])
    else:            
        scrape_url = ''.join([BASE_URL, "keywords=", keywords.replace(" ","%20"),"&location=",country.replace(" ","%20")])
    
    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    
    
    # Request up to total_page_number pages. Each page holds 25 results,
    # so page `num` starts at offset num * 25.
    rows = []
    for num in range(total_page_number):
        if num == 0:
            page = requests.get(scrape_url)
        else:
            page = requests.get(''.join([scrape_url, '&start=', str(num * 25)]))
        soup = BeautifulSoup(page.text, 'html.parser')
        cards = soup.select("div.result-card__contents")

        # A page without job cards means the search has reached the end
        if not cards:
            break

        for card in cards:
            title = card.findChild("h3", recursive=False)
            company = card.findChild("h4", recursive=False)
            location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            rows.append({'Title': title.string, 'Company': company.string, 'Location': location.string})

    # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    data = pd.DataFrame(rows, columns=columns)
    
    
    # Return dataframe
    return data

In [6]:
scrape_linkedin_job_search_country('data analysis',3,"United Kingdom")

Unnamed: 0,Title,Company,Location
0,Senior Data Analyst,Huxley,"London, United Kingdom"
1,Data Analyst,Shred-it,"Exeter, United Kingdom"
2,Data Analyst - Global Supplier Management,Taylor & Francis Group,"Oxford, United Kingdom"
3,DATA ANALYST,Gymshark,"Solihull, GB"
4,Data Analyst,UST Global,"Leeds, West Yorkshire, United Kingdom"
...,...,...,...
70,Data Analyst,Attivo Group,"Jessop House Jessop Avenue, Cheltenham GL50 3S..."
71,Data Analyst- Analytics,Kier Group,"Northampton, GB"
72,Data Analyst,BWD Search & Selection,"Manchester, United Kingdom"
73,Data Engineer,Hitachi Capital (UK) PLC,"Leeds, GB"


## Challenge 3

Add a fourth parameter called `num_days` to your function so that it can search jobs posted in the past X days. In the LinkedIn job search, the timespan is specified with the following parameter:

```
f_TPR=r259200
```

The numeric part of the parameter value is a number of seconds: 259,200 seconds equal 3 days (3 × 24 × 60 × 60). You need to convert `num_days` to seconds and supply that value to the LinkedIn job search.
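The conversion can be checked in isolation before wiring it into the scraper. This is a minimal sketch; `time_posted_param` and `SECONDS_PER_DAY` are hypothetical names introduced for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def time_posted_param(num_days):
    """Return the f_TPR query parameter for jobs posted in the past num_days days."""
    return 'f_TPR=r' + str(num_days * SECONDS_PER_DAY)


print(time_posted_param(3))  # f_TPR=r259200
```

The returned string can then be joined to the other query parameters with `&`.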

In [7]:
def scrape_linkedin_job_search_days(keywords,total_page_number, country, days):
    
    # Define the base url to be scraped.
    # An all-uppercase variable name signifies a constant whose value should never change
    BASE_URL = 'https://www.linkedin.com/jobs/search/?'
    
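    # Assemble the full url with parameters (one day is 86400 seconds).
    # Skip the location filter when no country is requested (True, None, or an empty string).
    if country is True or not country: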
        scrape_url = ''.join([BASE_URL,"f_TPR=r",str(days*86400),"&", "keywords=", keywords.replace(" ","%20")])
    else:            
        scrape_url = ''.join([BASE_URL,"f_TPR=r",str(days*86400),"&", "keywords=", keywords.replace(" ","%20"),"&location=",country.replace(" ","%20")])
    
    # Create an empty dataframe with the columns consisting of the information you want to capture
    columns = ['Title', 'Company', 'Location']
    data = pd.DataFrame(columns=columns)

    # Retrieve HTML code from the webpage. Parse the HTML into a list of "cards".
    # Then in each job card, extract the job title, company, and location data.
    
    
    # Request up to total_page_number pages. Each page holds 25 results,
    # so page `num` starts at offset num * 25.
    rows = []
    for num in range(total_page_number):
        if num == 0:
            page = requests.get(scrape_url)
        else:
            page = requests.get(''.join([scrape_url, '&start=', str(num * 25)]))
        soup = BeautifulSoup(page.text, 'html.parser')
        cards = soup.select("div.result-card__contents")

        # A page without job cards means the search has reached the end
        if not cards:
            break

        for card in cards:
            title = card.findChild("h3", recursive=False)
            company = card.findChild("h4", recursive=False)
            location = card.findChild("span", attrs={"class": "job-result-card__location"}, recursive=True)
            rows.append({'Title': title.string, 'Company': company.string, 'Location': location.string})

    # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    data = pd.DataFrame(rows, columns=columns)
    
    
    # Return dataframe
    return data

In [8]:
pd.set_option("display.max_rows",100)

scrape_linkedin_job_search_days("data analysis", 3, "Germany", 3)

Unnamed: 0,Title,Company,Location
0,Senior Data Analyst,Takeaway.com,"Berlin, DE"
1,Senior Data Analyst,Requaero Limited - Recruitment Search and Sele...,"Jena, Thuringia, Germany"
2,Senior Data Analyst,komoot,"Munich, Bavaria, Germany"
3,Data Engineer,ICIS,"Karlsruhe, Baden-Württemberg, Germany"
4,Senior Data Analyst for digital feedback system,Xcede,"Berlin, DE"
5,Data Developer,Movement8,"Berlin Area, Germany"
6,Data Analyst - Finance,Jimdo,"Hamburg, Hamburg, Germany"
7,Senior Data Scientist,Glocomms,"Stuttgart Area, Germany"
8,Data Analyst,GetYourGuide,"Berlin, DE"
9,Consumer Research & Data Analyst,Mintel,"Düsseldorf, DE"


## Bonus Challenge

Allow your function to also retrieve the "Seniority Level" of each job searched. Note that the Seniority Level info is not in the initial search results. You need to make a separate search request for each job card based on the `currentJobId` value which you can extract from the job card HTML.

After you obtain the Seniority Level info, update the function and add it to a new column of the returned dataframe.
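One way to keep the extra requests separate from the bookkeeping is to pass the page lookup in as a function. The following is a minimal sketch under that assumption: `job_page_url` and `add_seniority` are hypothetical helpers, the job IDs `'111'` and `'222'` are placeholders, and the dictionary lookup stands in for a real function that would request a job page and parse the seniority level out of it.

```python
JOB_VIEW_URL = 'https://www.linkedin.com/jobs/view/'


def job_page_url(job_id):
    """Build the URL of a single job page from its ID."""
    return JOB_VIEW_URL + str(job_id) + '/'


def add_seniority(rows, job_ids, fetch_seniority):
    """Return copies of rows with an added 'Seniority Level' key.

    fetch_seniority is any callable that maps a job page URL to a
    seniority string, or to None when the information is missing.
    """
    enriched = []
    for row, job_id in zip(rows, job_ids):
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row['Seniority Level'] = fetch_seniority(job_page_url(job_id))
        enriched.append(new_row)
    return enriched


# Stand-in lookup so the sketch runs offline
fake_pages = {job_page_url('111'): 'Associate'}
rows = [{'Title': 'Data Analyst'}, {'Title': 'Data Engineer'}]
result = add_seniority(rows, ['111', '222'], fake_pages.get)
print(result)
```

Recording `None` for a job whose page lacks the field keeps the seniority column aligned with the other columns.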

In [9]:
# your code here

TEST_URL = "https://www.linkedin.com/jobs/search/?keywords=data%20analysis"

In [26]:
# Download the search results page and parse it
response = requests.get(TEST_URL).content
soup = BeautifulSoup(response, "html.parser")


In [27]:
test_url="https://www.linkedin.com/jobs/view/1520646488/"

# Download a single job page and parse it
response = requests.get(test_url)
soup = BeautifulSoup(response.text, "html.parser")



soup.body

<body><input aria-hidden="true" class="global-alert__state" id="cookie-policy" type="checkbox"/><div class="global-alert global-alert--yield" data-id="cookie-policy"><figure class="global-alert__icon global-alert__icon--yield lazy-load onload"></figure><p class="global-alert__message-content">This website uses cookies to improve service and provide tailored ads. By using this site, you agree to this use. See our <a data-tracking-control-name="guest_job_details_cookie_policy_click" data-tracking-will-navigate="" href="https://www.linkedin.com/legal/cookie-policy?trk=guest_job_details_cookie_policy_click">Cookie Policy</a>.</p><label class="global-alert__label" for="cookie-policy"><figure class="global-alert__icon global-alert__icon--dismiss lazy-load onload"></figure></label></div><header class="header"><nav class="nav"><a class="nav__logo-link" data-tracking-control-name="guest_job_details_nav-header-logo" data-tracking-will-navigate="" href="/?trk=guest_job_details_nav-header-logo"><s

In [45]:
# Find the seniority level: it is the text of the first span with the job-criteria classes below
seniority = soup.find("span", class_= "job-criteria__text job-criteria__text--criteria").contents[0]

In [46]:
seniority

'Mid-Senior level'

In [49]:
linkedin_url= "https://www.linkedin.com/jobs/search/?keywords=data%20analysis"

linkedin_response = requests.get(linkedin_url)
linkedin_soup = BeautifulSoup(linkedin_response.text, "html.parser")


In [50]:
linkedin_soup

<!DOCTYPE html>
<html lang="en"><head><meta content="d_jobs_guest_search" name="pageKey"/><meta content="en_US" name="locale"/><meta data-app-id="com.linkedin.jobs-guest-frontend.d_web" data-custom-tracking-code="" data-tracking-page-type="" id="config"><link href="https://www.linkedin.com/jobs/data-analysis-jobs" rel="canonical"/><link href="https://static-exp1.licdn.com/scds/common/u/images/logos/favicons/v1/favicon.ico" rel="icon"/><script>function getDfd() {let yFn,nFn;const p=new Promise((y, n)=>{yFn=y;nFn=n;});p.resolve=yFn;p.reject=nFn;return p;}window.lazyloader = getDfd();window.tracking = getDfd();window.impressionTracking = getDfd();</script><script async="" src="https://static-exp1.licdn.com/sc/h/gyr9f1lbpyirhri42ukub2rz"></script><meta content="Today's top 190,000+ Data Analysis jobs in United States. Leverage your professional network, and get hired. New Data Analysis jobs added daily." name="description"/><meta content="noarchive" name="robots"/><meta content="width=devi

In [59]:
# Find the first currentJobId (stored in the data-id attribute of each job card)


linkedin_soup.select("li.result-card.job-result-card.result-card--with-hover-state")[0]["data-id"]

'1584105172'

In [64]:
# Use a for loop to go through the selected job cards and collect every currentJobId

Job_ID= []
Job_list= linkedin_soup.select("li.result-card.job-result-card.result-card--with-hover-state")
for ID in Job_list:
    Job_ID.append(ID["data-id"])
    
Job_ID

['1584105172',
 '1545599594',
 '1545596913',
 '1581494653',
 '1585268399',
 '1548122822',
 '1585679592',
 '1582388343',
 '1545549647',
 '1541359745',
 '1587448683',
 '1589998666',
 '1589196496',
 '1589499027',
 '1589394873',
 '1548195599',
 '1572531876',
 '1589323018',
 '1580793717',
 '1505105484',
 '1524057360',
 '1507488837',
 '1568359165',
 '1589282384']

In [67]:
seniority_list = []
job_url = "https://www.linkedin.com/jobs/view/"
for ID in Job_ID:
    # Combine the base job url with the ID obtained above to create the job page url
    actual_url = job_url + ID + "/"
    job_response = requests.get(actual_url)
    job_soup = BeautifulSoup(job_response.text, "html.parser")

    # The first matching span holds the seniority level; record None when it is missing
    span = job_soup.find("span", class_="job-criteria__text job-criteria__text--criteria")
    seniority_list.append(span.contents[0] if span else None)
    
seniority_list  

['Associate',
 'Associate',
 'Mid-Senior level',
 'Entry level',
 'Entry level',
 'Mid-Senior level',
 'Not Applicable',
 'Mid-Senior level',
 'Mid-Senior level',
 'Mid-Senior level',
 'Not Applicable',
 'Entry level',
 'Entry level',
 'Associate',
 'Entry level',
 'Mid-Senior level',
 'Internship',
 'Associate',
 'Mid-Senior level',
 'Entry level',
 'Entry level',
 'Not Applicable',
 'Mid-Senior level',
 'Mid-Senior level']

In [68]:
len(seniority_list)

24