# Using Selenium and BeautifulSoup to scrape Indeed

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Goal" data-toc-modified-id="Goal-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Goal</a></span></li><li><span><a href="#Importing-relevant-libraries" data-toc-modified-id="Importing-relevant-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing relevant libraries</a></span></li><li><span><a href="#A-function-to-compile-a-dictionary-for-the-information-collected" data-toc-modified-id="A-function-to-compile-a-dictionary-for-the-information-collected-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>A function to compile a dictionary for the information collected</a></span></li><li><span><a href="#Scrapping-the-web" data-toc-modified-id="Scrapping-the-web-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Scrapping the web</a></span></li><li><span><a href="#Creating-the-new-dataframe" data-toc-modified-id="Creating-the-new-dataframe-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Creating the new dataframe</a></span><ul class="toc-item"><li><span><a href="#Inspecting-one-of-the-dataframes-created" data-toc-modified-id="Inspecting-one-of-the-dataframes-created-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Inspecting one of the dataframes created</a></span></li></ul></li></ul></div>

## Goal
To scrape jobs off a job aggregation website (Indeed.com) to obtain information that can be used to firstly predict salary and also to determine industry skills that are relevant to specific roles and seniority.

## Importing relevant libraries

In [9]:
from bs4 import BeautifulSoup
import urllib.request, urllib.parse, urllib.error
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy.selector import Selector
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## A function to compile a dictionary for the information collected
I will be collecting 4 key bits of information from each job, the title, job description, salary and location. Once I have collected this data from a website, I will need it to be compiled into a format in which I can manipulate into a Dataframe.

In [10]:
def information(title,description,salary,location):
    info={'title':title,
    'description':description,
    'salary':salary,
         'location':location}
    return info

## Scrapping the web
To scrape Indeed, I have chosen two tools (Selenium and BeautifulSoup). I will be using Selenium to act as my browser and navigate the website because there is quite a lot of information stored within the __java script__ that can only be scrapped if the java script runs.

Once On the website I then pull the html and decipher the information I am after using __BeautifulSoup__.

In [19]:
def get_info(url):
    #Initiallising the driver
    driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver.exe")
    
    #Navigating to the base website with a predetermined search query in the URL
    driver.get(url)
    html = driver.page_source
    
    #Use BeautifulSoup to pasrse the html into a format in which we can search for data
    soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    
    #Find all the job titles on that page and store it in a list
    list_links = driver.find_elements_by_class_name('jobtitle')
    
    #Create a new empty list and new empty dictionary to use for storing the information
    newurl_list=[]
    information_total={}
    
    # (i) will be the indexes for my final dictionary so I will instantiate it as (-1) to start
    i=-1
    
    #Search the html to pull the total number of jobs found and 
    #use it to define how to move to the next page of job listings
    #Once the pages have been established, close the driver
    for number in soup.find_all('div', {'id':"searchCount"}):
        number=number.text.split(' ')
        total=number[-2]
        print(total)
        for i in range(0,int(total.replace(',', '')),10):
            newurl=url+str(i)
            newurl_list.append(newurl)
    driver.close()
    
    #Open a new instance of the driver using the list of urls with the pages added to the urls
    driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver.exe")
    for item in newurl_list:
        driver.get(item)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
        list_links = driver.find_elements_by_class_name('jobtitle')
        
        #Start a try loop in case the for loop fails (pop-up windoes/redirects)
        try:
            
            #For each job on the page, sleep for 1 second, this allows time for the whole page to laod
            #i+1 becaue (i) is the key for that job,
            #look for the job title, description, salary and location
            for link in list_links:
                i=i+1
                link.click()
                sleep(1)
                html = driver.page_source
                soup = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
                for title in soup.find_all('div', {'id':"vjs-jobtitle"}):
                    title=title.text
                for description in soup.find_all('div', {'id':"vjs-desc"}):
                    description=description.text
                for salary in soup.find_all('div', {'id':"vjs-jobinfo"}):
                    salary=salary.text
                for location in soup.find_all('span', {'id':
                    "vjs-loc"}):
                    location=location.text
                close=driver.find_element_by_id("vjs-x")
                close.click()
                
                #Use the function made previously to put the information into a dictionary with the unique key
                information_total[i]=information(title,description,salary,location)


        except:
            pass
    driver.close()
    
    #The result of running the function will be a dictionary of dictionaries which is what I am after
    return information_total

Now, it is as simple as running the function with the correct url and collecting the resulting dictionary and turning it into a dataframe which I can use for my analysis

In [20]:
information_total=get_info("https://au.indeed.com/jobs?q=%22data+scientist%22&l=WA&start=")

10


## Creating the new dataframe
Dataframe is made using pandas but needs to be transposed due ot the layout

In [21]:
df=pd.DataFrame(information_total)

In [22]:
df=df.T

### Inspecting one of the dataframes created
All the columns have been loaded with the right labels which is what I am after.

This is just a small subset of the data I scrapped, I passed the function a total of 12 different queeries and I have 12 different csv files. This will mean I do have a lot of duplicates but I don't think this is an issue for me in the long term as I can easily drop any duplicates later on in the track.

In [24]:
df

Unnamed: 0,description,location,salary,title
1,Data and Analytics Team | Perth\n-------------...,- Perth WA,Data ScientistVGW - Perth WA,Data Scientist
2,Graduate Data Scientist-234026\nWe believe suc...,- Perth WA,Graduate Data ScientistUGL Limited257 reviews ...,Graduate Data Scientist
3,Requisition ID: 16575\nJob Category: Consultin...,- Perth WA,Intermediate Data ScientistHatch279 reviews - ...,Intermediate Data Scientist
4,Job ID: 554101\n\nJob type: Full Time - Fixed ...,- Perth WA,Lead Data ScientistDowner Group291 reviews - P...,Lead Data Scientist
5,Permanent Role\n\nBuisness Critical Position\n...,- Perth WA,Data ScientistMichael Page169 reviews - Perth ...,Data Scientist
6,About our Client:\n\nMy client is global leade...,- Perth WA,Data ScientistHydrogen Group6 reviews - Perth ...,Data Scientist
7,Advanced Analytics Consulting Projects\nPerman...,- Perth WA,"Data ScientistKelly Services12,380 reviews - P...",Data Scientist
8,The Company:\n\nOur client is a well known pla...,- Perth WA,Data ScientistBeacham Group Pty Ltd - Perth WA...,Data Scientist
9,Long-Term contract\nSuperannuation paid on all...,- Perth WA,SAP Master Data ScientistChandler Macleod80 re...,SAP Master Data Scientist
10,Newly created role for a Data Scientist who th...,- Western Australia,Applied Maths SpecialistTalent International6 ...,Applied Maths Specialist


In [10]:
df.shape

(13, 4)

In [11]:
df.to_csv('data_scientist_vic', sep='\t')