## LinkedIn API Scraping

Prompt: What jobs are dedicated to data activities versus which jobs have a stated data component?

See LinkedIn's [API terms of use](https://legal.linkedin.com/api-terms-of-use).

LinkedIn discourages scraping and does not offer an API for personal use. We can still collect what we need with the scraping packages shown here.

There is a lot we could collect, but everything here is limited to what I can normally see when logged in to my own LinkedIn account.

Selenium provides an API for driving browsers such as Firefox, Internet Explorer, and Chrome through their web drivers. I then use BeautifulSoup to parse the parts of each page I am interested in.

Source: https://levelup.gitconnected.com/linkedin-scrapper-a3e6790099b5

From my research, there are two kinds of information people usually want to scrape from LinkedIn. The first is profiles (there are tools like PhantomBuster for that). The second, and the one that matters for this project, is job descriptions. The code below covers this second kind.

In [10]:
CONFIG_PATH = "/Users/mtaruno/Documents/DevZone/job-research/scraping/config.txt"
CHROMEDRIVER_PATH = (
    "/Users/mtaruno/Documents/DevZone/job-research/scraping/chromedriver-m1-15"
)
OUTPUT_PATH = "../data/scraping_results/linkedin_"

In [2]:
import re
import time
from datetime import datetime

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

In [3]:
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support import expected_conditions as ec


# Initializing account info
# My personal login info lives in config.txt (kept out of the notebook for privacy)
# The first line of the file holds the username and password, separated by whitespace
with open(CONFIG_PATH) as f:
    username, password = f.readline().split()
    

def initialize():
    ''' Initializes a Chrome driver, opens the LinkedIn login page, and enters my username and password. '''
    # Input path to chrome driver executable
    browser = webdriver.Chrome(CHROMEDRIVER_PATH) # Now we're connected to a browser!

    # This driver allows us to access webpages from a chrome browser
    # Logging in to LinkedIn
    browser.get('https://www.linkedin.com/login')

    # Entering login info
    elementID = browser.find_element_by_id('username')
    elementID.send_keys(username)

    elementID = browser.find_element_by_id('password')
    elementID.send_keys(password)

    elementID.submit()
    
    return browser

browser = initialize()
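
The login cell assumes that the first line of config.txt holds exactly two whitespace-separated fields. A minimal sketch of a stricter parser follows; the name `parse_credentials` and the sample values are hypothetical and not part of the original notebook. It raises a descriptive error when the line is malformed, instead of a bare unpacking error.

```python
def parse_credentials(line):
    """Split one 'username password' line into its two fields."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(
            f"expected 'username password' on one line, got {len(fields)} field(s)"
        )
    username, password = fields
    return username, password


# Hypothetical example values, not real credentials
print(parse_credentials("me@example.com hunter2\n"))
```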

In [227]:
# Code to control how many jobs are shown (currently disabled)

# no_of_jobs = 50

# # To show more jobs. The number of clicks depends on no_of_jobs
# i = 2
# while i <= (no_of_jobs/25): 
#     browser.find_element_by_xpath('/html/body/main/div/section/button').click()
#     i = i + 1
#     time.sleep(5)  # only `time` is imported above, so call time.sleep
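
The disabled loop above rounds down: with `no_of_jobs = 60` it clicks once, which would show only 50 jobs. A small sketch of the click count, assuming the first page shows 25 jobs and each click loads 25 more (an assumption read off the division by 25, not something LinkedIn documents). The helper name `clicks_needed` is hypothetical.

```python
import math

JOBS_PER_PAGE = 25  # assumption taken from the division by 25 in the cell above


def clicks_needed(no_of_jobs, per_page=JOBS_PER_PAGE):
    """Number of extra-page clicks needed to display at least no_of_jobs results."""
    if no_of_jobs <= per_page:
        return 0
    # Round up so a partial last batch still gets its own click
    return math.ceil(no_of_jobs / per_page) - 1


for n in (25, 50, 60):
    print(n, clicks_needed(n))
```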

In [4]:
# Searching "data" in jobs

browser.get('https://www.linkedin.com/jobs/?showJobAlertsModal=false')
jobID = browser.find_element_by_class_name('jobs-search-box__text-input')
jobID.send_keys("data")

# Make sure to close messages first so that the search button is "clickable"
# This code is an automated way of doing that

from selenium.common.exceptions import NoSuchElementException

def popup():
    ''' Closes the messaging overlay if it is open, so that it cannot cover
    the search button. Does nothing if the overlay is minimized or absent. '''
    try:
        # If this lookup succeeds, the overlay is already minimized
        browser.find_element_by_class_name('msg-overlay-list-bubble--is-minimized')
    except NoSuchElementException:
        try:
            # The overlay is expanded: clicking its header minimizes it
            browser.find_element_by_class_name('msg-overlay-bubble-header').click()
        except NoSuchElementException:
            pass  # No messaging overlay on this page

popup()

# Actually clicking the search button (we have to find the HTML element, 
# then use click method on that element)
search = browser.find_element_by_class_name('jobs-search-box__submit-button')
search.click()
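
Clicking the search button starts a page load, and the notebook handles that with fixed `time.sleep` calls. Selenium ships `WebDriverWait` and `expected_conditions` for waiting on a condition instead (the commented-out imports earlier point at them). The sketch below is a library-free illustration of the same polling idea, not Selenium's implementation; `wait_until` is a hypothetical helper.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check
checks = iter([None, None, "<html>ready</html>"])
print(wait_until(lambda: next(checks), timeout=2.0, interval=0.01))
```

In the notebook, the condition could be something like `lambda: 'jobs-search-results__list-item' in browser.page_source`, assuming that class name is still present on the results page.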

In [6]:
# Get page source code
time.sleep(5) # While waiting for page to load

src = browser.page_source

print(src)

# Beautiful Soup object
soup = BeautifulSoup(src, 'lxml') # Using lxml parser
# Make sure to do pip install lxml if you haven't

# Get the search result number
results = soup.find('small', {'class': 'display-flex t-12 t-black--light t-normal'}).get_text().strip().split()[0]
results = f"There are {int(results.replace(',', ''))} results"
print(results)

<html lang="en" class="theme theme--mercado artdeco osx theme--dark"><head>

    <title>(28) data Jobs | LinkedIn</title>

    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="asset-url" class="mercado-icons-sprite" id="artdeco-icons/static/images/sprite-asset" content="https://static-exp1.licdn.com/sc/h/7438dbnn8galtczp2gk2s4bgb">
 

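AttributeError: 'NoneType' object has no attribute 'get_text'

The error means `soup.find` returned `None`: no `small` tag matched that exact class string, most likely because LinkedIn's class names had changed since the selector was written.

Tip: You can right-click whatever you want to find and choose "Inspect" to pull up the relevant HTML.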
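
One way to keep a missing element from crashing the cell is to separate the lookup from the number parsing. The sketch below handles only the parsing half and accepts `None`; `parse_result_count` is a hypothetical helper, and the sample strings are made up to match the "N results" shape the cell expects.

```python
import re


def parse_result_count(text):
    """Return the leading integer in text such as '1,234 results', or None."""
    if text is None:
        return None
    match = re.match(r"\s*([\d,]+)", text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None


print(parse_result_count("1,234 results"))
print(parse_result_count(None))
```

With BeautifulSoup, the call would look like `parse_result_count(tag.get_text() if tag else None)`, where `tag` is whatever `soup.find` returned.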

### Narrowing to Information We Want

Reference: https://amandeepsaluja.com/extracting-job-information-from-linkedin-jobs-using-beautifulsoup-and-selenium/

In [7]:
# Looking for all the job containers
# I found the class by searching the page source for Umbel (my first job result in the Chrome driver)
# Note: job_container is not used below; the links are pulled from every anchor tag instead

job_container = soup.find_all('li', {"class":"jobs-search-results__list-item occludable-update p0 relative ember-view"})

# Filtering down to the links for individual job postings
expression = re.compile(r"/jobs/view")
# href=True skips anchors without an href attribute, which would otherwise raise a KeyError
l1 = [job['href'] for job in soup.find_all('a', href=True)]
postings = ["https://linkedin.com" + s for s in l1 if expression.match(s)]
postings = list(set(postings)) # Getting unique postings

In [8]:
postings

['https://linkedin.com/jobs/view/2590484203/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=IN_NETWORK&refId=BIAdEjx1%2BuyHGomSYEDA7w%3D%3D&trackingId=PB6fZE7GUvGc8JudwQCyBQ%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2768830654/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=TOP_APPLICANT&refId=BIAdEjx1%2BuyHGomSYEDA7w%3D%3D&trackingId=X0hw80FTUWk0owe5XOM%2BhA%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2716169913/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=SKILL_ASSESSMENTS&refId=BIAdEjx1%2BuyHGomSYEDA7w%3D%3D&trackingId=buRArFeyQw1yAbP%2Bx3r4GA%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2767838197/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=IN_NETWORK&refId=BIAdEjx1%2BuyHGomSYEDA7w%3D%3D&trackingId=%2Bw9gSTBscsqQJR54jtPfRA%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2590484203/?alternateChannel=search&refId=BIAdEjx1%2BuyHGomSYEDA7w%3D%3D&trackingId=PB6fZE7GUvGc8JudwQCyBQ%3D%3D&trk=d_flagship

In [225]:
postings # Displays the links to the actual jobs

['https://linkedin.com/jobs/view/2387983431/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=IN_NETWORK&refId=h0NcM8dRPY3OXon4V8lm2w%3D%3D&trackingId=XjU1koQg9PaJ5zzXb%2F40dQ%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2374903762/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=SKILL_ASSESSMENTS&refId=h0NcM8dRPY3OXon4V8lm2w%3D%3D&trackingId=%2FUpN6pWGjirpSV4%2BViBHgA%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2387979749/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=IN_NETWORK&refId=h0NcM8dRPY3OXon4V8lm2w%3D%3D&trackingId=C12HFSkujcb8nRC9XcVywQ%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2374919220/?eBP=JOB_SEARCH_ORGANIC&recommendedFlavor=SKILL_ASSESSMENTS&refId=h0NcM8dRPY3OXon4V8lm2w%3D%3D&trackingId=y7aX6VajGMnuZEjiUNlTJw%3D%3D&trk=flagship3_search_srp_jobs',
 'https://linkedin.com/jobs/view/2374972050/?alternateChannel=search&refId=h0NcM8dRPY3OXon4V8lm2w%3D%3D&trackingId=qoGaY88qa%2BXk4Aw7J3%2BdCg%3D%3D&trk=flagship3_se
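
In the first `postings` output, job 2590484203 appears twice with different tracking parameters, so `set()` on the full URL does not remove it. A sketch of deduplicating on the path alone follows. It assumes the job page loads without its query string, which I have not verified; `canonical_job_url` and `unique_jobs` are hypothetical helpers.

```python
from urllib.parse import urlsplit


def canonical_job_url(url):
    """Drop the query string and fragment, keeping only scheme, host, and path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def unique_jobs(urls):
    """Deduplicate by canonical URL while preserving first-seen order."""
    return list(dict.fromkeys(canonical_job_url(u) for u in urls))


# Shortened versions of links from the output above (first two are the same job)
sample = [
    "https://linkedin.com/jobs/view/2590484203/?eBP=JOB_SEARCH_ORGANIC&trk=flagship3_search_srp_jobs",
    "https://linkedin.com/jobs/view/2590484203/?alternateChannel=search",
    "https://linkedin.com/jobs/view/2768830654/?eBP=JOB_SEARCH_ORGANIC",
]
print(unique_jobs(sample))
```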

In [11]:
def scrape():
    ''' Visits each URL in postings (built in the cells above), scrapes the
    desired fields from the page, and stores them in a dataframe.
    
    Saves the dataframe as a timestamped csv file and returns it.
    '''
    title = []
    description = []
    company_name = []
    industry = []
    location = []
    job_functions = []
    time_posted = []
    employment_type = []
    applicant_count = []
    
    for post in postings:
    
        browser.get(post)
        popup()
        time.sleep(2)

        # Parsing out the wanted attributes for each posting

        # Each field sits in an element with a distinctive class name, so we filter by class.
        # This works because every LinkedIn job page shares the same layout.

        html = browser.page_source
        time.sleep(2)

        page = BeautifulSoup(html, "lxml")

        # Getting the description
        result = page.find_all("div", {"class": "jobs-box--fadein"})
        description.append(result[0].span.text)

        # Getting the title
        result = page.find_all("div", {"class": "p5"})
        title.append(result[0].h1.text)
        
        # Getting the company name
        result = page.find_all("a", {"class": "ember-view t-black t-normal"})
        company_name.append(result[0].text.replace("\n",""))
        try:
            # Getting the industry
            result = page.find_all("li", {"class": "jobs-description-details__list-item t-14"})
            industry.append(result[0].text.replace("\n",""))
        except IndexError:
            # No matching element on this posting
            industry.append(np.nan)
        try:
            # Getting the job functions (same class, second match)
            job_functions.append(result[1].text.replace("\n", ""))
        except IndexError:
            job_functions.append(np.nan)
        
        # Getting the location
        result = page.find_all("span", {"class": "jobs-unified-top-card__bullet"})
        location.append(result[0].text)
        
        try:
            # Getting the employment type
            result = page.find_all("p", {"class": "t-14 mb3"})
            employment_type.append(result[0].text.replace("\n", ""))
        except IndexError:
            # No matching element on this posting
            employment_type.append(np.nan)
        
        # Getting the time posted
        result = page.find_all("span", {"class": "jobs-unified-top-card__posted-date"})
        time_posted.append(result[0].text.replace("\n", ""))
        
        # Getting the applicant count
        result = page.find_all("span", {"class": "jobs-unified-top-card__applicant-count"})
        applicant_count.append(result[0].text.replace("\n", ""))
    
    # Storing in dataframe
    df = pd.DataFrame({"Title": title, "Description": description,
                      "Company Name": company_name, "Location": location,
                      "Industry": industry, "Job Functions": job_functions, 
                      "Time Posted": time_posted, "Employment Type": employment_type,
                      "Applicant Count": applicant_count})
    
    # Saving dataframe into csv, with a timestamp attached to it
    df.to_csv(OUTPUT_PATH + str(datetime.now())[:19] + ".csv")
    
    return df

df = scrape()
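
The output filename is built from `str(datetime.now())[:19]`, which contains a space and colons; colons are not allowed in Windows filenames. A sketch of a more portable name using `strftime` follows. The helper `timestamped_path` and the sample date are hypothetical; the prefix is the notebook's `OUTPUT_PATH` value.

```python
from datetime import datetime


def timestamped_path(prefix, when=None):
    """Build '<prefix>YYYY-MM-DD_HH-MM-SS.csv' with no spaces or colons."""
    when = when or datetime.now()
    return f"{prefix}{when.strftime('%Y-%m-%d_%H-%M-%S')}.csv"


# Fixed example date so the result is reproducible
print(timestamped_path("../data/scraping_results/linkedin_", datetime(2021, 10, 25, 9, 30, 5)))
```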

In [275]:
df

Unnamed: 0,Title,Description,Company Name,Location,Industry,Job Functions,Time Posted,Employment Type,Applicant Count
0,Data Analyst – People Data Solutions,\n Facebook's mission is to give peop...,Facebook,"\n Menlo Park, CA\n",Internet,Information Technology,20 hours ago,Full-time,51 applicants
1,Data Scientist Intern - Digital Media Analytic...,\nJob Summary\n\nThe Data Scientist Intern on ...,Disney Media & Entertainment Dis...,"\n Seattle, WA\n",Marketing & Advertising,Online Media,6 days ago,Mid-Senior level,17 applicants
2,Data Analyst – People Data Solutions,\n Facebook's mission is to give peop...,Facebook,"\n New York, NY\n",Internet,Information Technology,20 hours ago,Full-time,58 applicants
3,Data Scientist,\nWhat are we Building?We believe it is import...,Photomath,\n San Francisco Bay Area\n,,,14 hours ago,,137 applicants
4,Data Analyst - Field Operations,"\n At Confluent, we’re creating a cat...",Confluent,"\n Mountain View, CA\n",Computer Software,Computer Networking,1 hour ago,Entry level,1 applicant
5,Data Analyst - Field Operations,"\n At Confluent, we’re creating a cat...",Confluent,"\n Mountain View, CA\n",Computer Software,Computer Networking,1 hour ago,Entry level,1 applicant
6,Data Analyst,"\n At Urbint, our mission is to make ...",Urbint,"\n New York, NY\n",Information Technology & Service...,Computer Software,14 hours ago,Mid-Senior level,129 applicants
7,"Intern, Data Science","\nJob Description\n\nAt Rockwell Automation, w...",Rockwell Automation,"\n Austin, TX\n",Industrial Automation,Computer Software,19 hours ago,Associate,23 applicants
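
The table shows two things worth cleaning before analysis: locations carry stray newlines and spaces, and applicant counts are strings such as "51 applicants". A minimal sketch of both clean-ups, using sample values copied from the rows above; `clean_text` and `applicant_number` are hypothetical helpers.

```python
import re


def clean_text(value):
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return " ".join(value.split())


def applicant_number(value):
    """Extract the integer from strings like '51 applicants' or '1 applicant'."""
    match = re.search(r"\d[\d,]*", value)
    return int(match.group().replace(",", "")) if match else None


print(repr(clean_text("\n Menlo Park, CA\n")))
print(applicant_number("51 applicants"), applicant_number("1 applicant"))
```

With pandas these could be applied column-wise, for example `df["Location"].map(clean_text)`. Rows 4 and 5 are identical, so something like `df.drop_duplicates(subset=["Title", "Company Name", "Location"])` may also be worth running, assuming those three columns are enough to identify a posting.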


In [268]:
# Spot-checking one selector from scrape() on a single posting

browser.get(postings[3])
popup()
time.sleep(2)
html = browser.page_source
time.sleep(2)

page = BeautifulSoup(html, "lxml")
result = page.find_all("p", {"class": "t-14 mb3"})
# result[0].text.replace("\n", "")
result

# Note: Some postings lack certain components, so they cannot be scraped.
# For those I store np.nan as the entry.

In [279]:
# Here's an example of the first description
df['Description'][0]

"\n          Facebook's mission is to give people the power to build community and bring the world closer together. Through our family of apps and services, we're building a different kind of company that connects billions of people around the world, gives them ways to share what matters most to them, and helps bring people closer together. Whether we're creating new products or helping a small business expand its reach, people at Facebook are builders at heart. Our global teams are constantly iterating, solving problems, and working together to empower people around the world to build community and connect in meaningful ways. Together, we can help people build stronger communities - we're just getting started.\n\nThe Data Analyst is responsible for ensuring the efficiency of Facebook’s People Data Team, identifying areas for improvement, and building both long term and ad-hoc solutions. The ideal candidate will have a strong technical, analytical, and operational background. SQL and T