# Import All Necessary Libraries

The objective of the code below is to scrape data on Data Science jobs in Barcelona from LinkedIn.
The libraries used are the following:
<ul>
  <li>requests</li>
  <li>numpy</li>
  <li>pandas</li>
  <li>bs4</li>
  <li>time</li>
  <li>selenium</li>
</ul>

In [1]:
import requests
import pandas as pd
import numpy as np
import bs4
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains



# ChromeDriver

ChromeDriver is used to control the browser. It can be downloaded from this <a href="https://chromedriver.storage.googleapis.com/index.html?path=107.0.5304.62/">link</a>; the ChromeDriver version should match the major version of the installed Chrome browser.

In [2]:
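from selenium.webdriver.chrome.service import Service
driver_path = "chromedriver.exe"
# Selenium 4 deprecates executable_path in favor of a Service object
driver = webdriver.Chrome(service=Service(driver_path))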



# Opening LinkedIn

The code below automatically opens the LinkedIn job search page and does the following:
<ul>
  <li>Waits to ensure the page is fully loaded</li>
  <li>Calculates the screen height, which is used to scroll one screen at a time</li>
  <li>Pauses between scrolls so that the page finishes loading and the necessary elements can be found</li>
</ul>
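
The scrolling steps above can be sketched as a small helper. This is a minimal sketch rather than the notebook's exact loop: the name `scroll_to_bottom` and its parameters are hypothetical, and it assumes only that `driver` exposes Selenium's `execute_script` method. Unlike the notebook cell, it stops once the scroll position passes the page height instead of clicking a "show more" button.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_scrolls=190):
    """Scroll one screen height at a time; return the number of scroll steps.

    Hypothetical helper: stops when the next scroll position would pass the
    current page height, or when max_scrolls steps have been performed.
    """
    screen_height = driver.execute_script("return window.screen.height;")
    steps = 0
    while steps < max_scrolls:
        # Jump to the top of the next "screen" of content
        driver.execute_script(f"window.scrollTo(0, {screen_height * steps});")
        steps += 1
        # Give lazily loaded content time to appear before measuring the page
        time.sleep(pause)
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        if screen_height * steps > scroll_height:
            break
    return steps
```

With a real Selenium driver this could be called as `scroll_to_bottom(driver, pause=1)` right after `driver.get(...)`. The `max_scrolls` cap plays the same role as the limit of 190 in the notebook's loop.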

## Creating Data Frame of Jobs

Now that the page is loaded as far as planned, we begin scraping the data. The code below parses the page HTML with Python and follows these steps:


 <ul>
  <li>Parse to HTML</li>
  <li>Find unordered list with jobs</li>
  <li>Find all elements in the list</li>
  <li>Take the URL of the detailed page for one job, request it, and parse it to HTML</li>
  <li>Find the title, company, and location from the detailed page</li>
  <li>Look for the current state of the job and categorize it as "Early Applications", "On-going", or "Others"</li>
  <li>Find the date since the job was published</li>
  <li>Print a running count of the jobs added</li>
  <li>Retrieve the number of applicants; a try and except block is used because the tag and class differ when the number of applicants is lower than 25 or higher than 200</li>
</ul>
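
Two of the steps above, categorizing the job state and reading the number of applicants, can be sketched as standalone functions. This is a minimal sketch under stated assumptions: the names `STATE_LABELS`, `categorize_state`, and `parse_applicants` are hypothetical, missing values are returned as `None` rather than NaN, and the regular expression is an alternative to the word-count branching used in the notebook cell. It assumes the caption contains at most one number, as in the captions quoted in the notebook's comments.

```python
import re

# Badge text on a job card mapped to the state label stored in the dataframe
STATE_LABELS = {
    "Be an early applicant": "Early Applications",
    "Actively Hiring": "On-going",
}


def categorize_state(badge_text):
    """Return the state label for a badge text, or None when there is no badge."""
    if badge_text is None:
        return None
    # Any badge that is not one of the two known texts falls into "Others"
    return STATE_LABELS.get(badge_text.strip(), "Others")


def parse_applicants(caption):
    """Return the first integer found in an applicants caption, or None."""
    match = re.search(r"\d+", caption)
    return int(match.group()) if match else None
```

For example, `parse_applicants("57 solicitudes")` yields the integer 57 without depending on how many words surround the number.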


In [None]:
# Opening LinkedIn's job search page for data scientist jobs in Barcelona
driver.get("https://www.linkedin.com/jobs/search/?currentJobId=3270287326&geoId=107025191&keywords=data%20scientist%2C%20barcelona&location=Barcelona%2C%20Catalonia%2C%20Spain&refresh=true&position=1&pageNum=0")
 
# waiting for the page to load
time.sleep(1)

scroll_pause_time = 1
screen_height = driver.execute_script("return window.screen.height;")
i = 0

while i < 190: # cap the number of scrolls; the number needed varies, so the limit includes a buffer
    #scroll one screen height every time
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
    i += 1
    time.sleep(scroll_pause_time)
    scroll_height = driver.execute_script("return document.body.scrollHeight;")
    if screen_height * i > scroll_height:
        try:
            element=driver.find_element(By.CLASS_NAME, "infinite-scroller__show-more-button")
            time.sleep(scroll_pause_time)
            driver.execute_script("arguments[0].click();", element)      
        except:
            time.sleep(scroll_pause_time)
            driver.find_element(By.CSS_SELECTOR, "button[class='infinite-scroller__show-more-button infinite-scroller__show-more-button--visible']").click()
                  
j = 1
soup = bs4.BeautifulSoup(driver.page_source, "html.parser") #parse to HTML
resList = soup.find("ul", {"class": "jobs-search__results-list"}) #find unordered list with jobs
jobs = resList.find_all("li") #find all the elements in the list
#print(len(jobs))

df = pd.DataFrame(columns = ["Job Title", "Company Name", "Location", "State", "Posting Date", "Offer URL", "Number of Applicants",
                            "Promoted", "Workspace", "Seniority", "Employment Type", "Industry", "Python Required",
                            "Application through Linkedin", "Number of Employees"])

for job in jobs: #loop on all the jobs in the list
    
    url = job.a["href"] #take the url of the detailed page for one job
    detailedPage = requests.get(url) #request to that url
    detailedSoup = bs4.BeautifulSoup(detailedPage.text, "html.parser") #parse to html
    
    #find title, company and location from the detailed page; if not present, don't add the job
    generalInfo = detailedSoup.find("div", {"class": "top-card-layout__card relative p-2 papabear:p-details-container-padding"}) #find the section with the main info
    try:
        title = generalInfo.find("h1").text.strip() #take the title and remove the extra spaces with strip
        company = generalInfo.find("a", {"class": "topcard__org-name-link topcard__flavor--black-link"}).text.split("\n")[1].strip()
        location = generalInfo.find("span", {"class": "topcard__flavor topcard__flavor--bullet"}).text.split("\n")[1].strip()
    except:
        continue
        
    #Find the state from the general page
    stateText = job.find("span", {"class": "result-benefits__text"})
    if stateText is None: #if the alert is not there, state is NaN
        state = np.nan #np.NaN was removed in NumPy 2.0; np.nan works in all versions
    else: #if the alert exists, we can take the text, strip and find the right state according to the situation
        stateText = stateText.text.strip()
        if stateText == "Be an early applicant":
            state = "Early Applications"
        elif stateText == "Actively Hiring":
            state = "On-going"
        else:
            state = "Others"
        
    
    #date since publication of the job
    date = job.find("time", {"class": "job-search-card__listdate"})
    if date is None:
        date = job.find("time", {"class": "job-search-card__listdate--new"})
    date = date.text.strip()
    
    #retrieve the number of applicants; try-except block because when number of applicants is lower than 25 or higher than 200,
    # the tag and the class are different
    try:
        applicants = generalInfo.find("span", {"class": "num-applicants__caption topcard__flavor--metadata topcard__flavor--bullet"}).text.strip()
    except AttributeError:
        applicants = generalInfo.find("figcaption", {"class": "num-applicants__caption"}).text.strip()
        
    applicants = applicants.split(" ") #split at the space
    if len(applicants) == 6: #this is the case when <25 applicants: "Sé de los primeros 25 solicitantes"
        applicants = int(applicants[4])
    elif len(applicants) == 4: #this is the case when >200 applicants: "Mas de 200 solicitudes"
        applicants = int(applicants[2])
    else:
        applicants = int(applicants[0]) #this is the regular case: "x solicitudes"
    
   
    
    coreSection = detailedSoup.find("div", {"class": "core-section-container__content break-words"}) #find the section in the middle of the page
    items = coreSection.find_all("span", {"class": "description__job-criteria-text description__job-criteria-text--criteria"}) #find the 4 items there
    if len(items) == 4: #if they are all present
        seniority = items[0].text.strip()
        employment = items[1].text.strip()
        industry = items[3].text.strip()
    else: #if only the employment is present (it happens in some cases)
        employment = items[0].text.strip()
        seniority = np.nan
        industry = np.nan
    
    #Python required
    description = detailedSoup.find("div", {"class": "show-more-less-html__markup"}).text #find the description section and take text
    #case-insensitive check for the word "python" anywhere in the description
    required = "python" in description.lower()
        
    #Application through Linkedin
    applyFromCompany = detailedSoup.find("a", {"class": "apply-button apply-button--link top-card-layout__cta mt-2 ml-1.5 h-auto babybear:flex-auto top-card-layout__cta--primary btn-md btn-primary"})
    if applyFromCompany is not None:
        easyApply = False
    else:
        applyFromLinkedin = detailedSoup.find("button", {"class": "apply-button apply-button--default top-card-layout__cta mt-2 ml-1.5 h-auto babybear:flex-auto top-card-layout__cta--primary btn-md btn-primary"})
        if applyFromLinkedin is not None:
            easyApply = True
        else:
            easyApply = np.nan
    
    
    my_dict = {"Job Title": title,
              "Company Name": company,
              "Location": location,
              "State": state,
              "Posting Date": date,
              "Offer URL": url,
              "Number of Applicants": applicants,
              "Promoted": np.nan,
              "Workspace": np.nan,
              "Seniority": seniority,
              "Employment Type": employment,
              "Industry": industry,
              "Python Required": required,
              "Application through Linkedin": easyApply,
              "Number of Employees": np.nan}
    
    entry = pd.DataFrame([my_dict])
    df = pd.concat([df, entry], ignore_index = True)
    
    
    print(j, "jobs added to the df")
    
    j += 1
    
    time.sleep(1)

print("DATAFRAME FILLED") #note that a few jobs might not have been added because of blocks from LinkedIn (so the row count is almost 1000)

driver.close()

# Notes on Un-Retrievable Data

The fields below could not be retrieved. The code that would be used if they were retrievable is shown under each one.

### Promoted
Information on whether a job has been promoted could not be retrieved. If it could, the code below would be used:

`promotedSection = job.find("li", {"class": "t-12 t-normal t-black--light job-card-container__footer-item"})
    if promotedSection is not None:
        promoted = True
    else:
        promoted = False`
    
### Workplace
Information on the workplace type of a job could not be retrieved. If it could, the code below would be used:

`workspace = detailedSoup.find("span", {"class": "jobs-unified-top-card__workplace-type"})`
            
### Number of Employees
The number of employees of the company could not be retrieved. If it could, the code below would be used:

`companyNameUrl = generalInfo.find("a", {"class": "topcard__org-name-link topcard__flavor--black-link"})["href"]
companyDetail = requests.get(companyNameUrl)
companyDetailSoup = bs4.BeautifulSoup(companyDetail.text, "html.parser")
employees = companyDetailSoup.find("a", {"class": "face-pile__cta self-center link-no-visited-state"})`

### Company Name URL
The following code would retrieve the URL of the company page:

`companyNameUrl = generalInfo.find("a", {"class": "topcard__org-name-link topcard__flavor--black-link"})["href"]`

# Print Data Frame

The cell below confirms that the scraped data has been collected into the dataframe. Note that some jobs might not have been added because of blocks from LinkedIn, so the dataframe should contain almost 1,000 rows.
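
One possible way to lose fewer jobs to such blocks is to retry a failed request after a pause. The following is only a sketch under stated assumptions, not part of the original notebook: the helper name `fetch_with_retries` and its parameters are hypothetical, and it assumes a block shows up as a response whose `status_code` is not 200.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Call fetch(url) until it returns a response with status_code 200.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    Returns the successful response, or None if every attempt fails.
    """
    for attempt in range(attempts):
        response = fetch(url)
        if getattr(response, "status_code", None) == 200:
            return response
        # Back off before the next attempt, but not after the last one
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)
    return None
```

In the scraping loop, `requests.get(url)` could be replaced by `fetch_with_retries(requests.get, url)`, skipping the job when the result is `None`.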


In [None]:
print("DATAFRAME FILLED")

# Display and Save Data Frame

Show the beginning and the end of the dataframe, and finally save the dataframe to a CSV file.

In [None]:
display(df)
df.to_csv("linkedin.csv")