# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Request the page source (HTML) using the website URL
3. Download the page content
4. Parse the content using the tags and class names of the elements of interest
5. Extract relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
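
Step 1 can be partly automated by reading the site's `robots.txt`. The sketch below is only an illustration: the rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the `my-scraper` user agent are hypothetical and are not Indeed's actual policy. It uses the standard library's `urllib.robotparser` and parses the rules from a string so it runs offline. A site's terms of service may impose further limits that `robots.txt` does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/report"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` would download the real file instead of parsing a string.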

### Import Dependencies 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By

### Path to webdriver (Firefox, Chrome) 

In [2]:
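# Ensure that the driver path is correct before running this script.
from selenium.webdriver.firefox.service import Service

# Microsoft Windows
driver_path = "./drivers/windows/geckodriver.exe"
# Linux
#driver_path = "./drivers/linux/geckodriver"
# Selenium 4 deprecates executable_path on the driver; pass the path through a Service object.
driver = webdriver.Firefox(service=Service(executable_path=driver_path))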


### Define position and location 

In [24]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "remote"

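from urllib.parse import quote_plus

def get_url(position, location):
    """Build the Indeed search URL, URL-encoding spaces and special characters."""
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url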

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
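
As a rough sketch of how the search URL and its result pages fit together, the helper below builds one URL per page with `urllib.parse.urlencode`. The names `build_search_url` and `page_urls` are hypothetical. The `q`, `l`, and `start` parameters and the step of 10 results per page are taken from this notebook's own code and are assumed, not guaranteed, to match what the site currently expects.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(position: str, location: str, start: int = 0) -> str:
    """Return a search URL with properly encoded query parameters."""
    params = {"q": position, "l": location}
    if start:
        # Offset of the first result on the page (0 means the first page).
        params["start"] = start
    return BASE_URL + "?" + urlencode(params)


def page_urls(position: str, location: str, postings: int, page_size: int = 10) -> list:
    """Return one URL per result page needed to cover `postings` results."""
    return [
        build_search_url(position, location, start)
        for start in range(0, postings, page_size)
    ]


for page_url in page_urls("data scientist", "remote", postings=30):
    print(page_url)
```

`urlencode` replaces the space in "data scientist" with `+`, so the query stays valid without manual string handling.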

### Scrape job postings

In [5]:
## Number of postings to scrape
postings = 100

jn=0
for i in range(0, postings, 10):
    driver.get(url + "&start=" + str(i))
    driver.implicitly_wait(3)

    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')
        
        jn += 1
        
        liens = job.find_elements(By.TAG_NAME, "a")
        links = liens[0].get_attribute("href")
        
        title = soup.select('.jobTitle')[0].get_text().strip()
        company = soup.select('.companyName')[0].get_text().strip()
        location = soup.select('.companyLocation')[0].get_text().strip()
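        # Optional fields: select() returns an empty list when the element is
        # missing, so indexing [0] raises IndexError. Catch only that error.
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''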
       
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

Job number    1 added - Data Scientist
Job number    2 added - Data Scientist
Job number    3 added - Data Scientist
Job number    4 added - Data Scientist (REMOTE)
Job number    5 added - Data Scientist
Job number    6 added - Data Scientist
Job number    7 added - Data Analyst - Sr Specialist
Job number    8 added - Network Data Scientist
Job number    9 added - Data Scientist
Job number   10 added - Data Scientist
Job number   11 added - Data Scientist
Job number   12 added - Data Scientist
Job number   13 added - Associate Data Scientist
Job number   14 added - Data Scientist
Job number   15 added - Data Scientist
Job number   16 added - Data Scientist
Job number   17 added - Data Scientist
Job number   18 added - Jr. Data Scientist
Job number   19 added - Junior Data Scientist
Job number   20 added - Data Scientist
Job number   21 added - Data Scientist, Baseball Research & Development
Job number   22 added - Senior Data Scientist (Remote)
Job number   23 added - Data Scientist

In [6]:
driver.quit()

### Scrape full job descriptions

In [7]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [16]:
import random
import time

In [17]:
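from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.service import Service

driver = webdriver.Firefox(service=Service(executable_path=driver_path))
# The implicit wait is a driver-wide setting, so set it once instead of per page.
driver.implicitly_wait(8)
descriptions = []
for link in Links_list:
    driver.get(link)
    try:
        jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    except NoSuchElementException:
        # Keep the list aligned with the dataframe rows when a page has no description.
        jd = ''
    descriptions.append(jd)
    # Pause 5 to 10 seconds between requests.
    time.sleep(random.randint(5, 10))

dataframe['Descriptions'] = descriptions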


In [19]:
driver.quit()
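
The description loop has to return exactly one entry per link, otherwise assigning the list to a dataframe column fails. A minimal sketch of that pattern, independent of Selenium, is shown below. `collect_descriptions`, `fake_fetch`, and the example links are hypothetical; in the notebook, `fetch` would wrap `driver.get` and `find_element`.

```python
import random
import time
from typing import Callable, Iterable, List


def collect_descriptions(
    links: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 5.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch one description per link, keeping positions aligned with `links`."""
    descriptions = []
    for link in links:
        try:
            descriptions.append(fetch(link))
        except Exception:
            # A failed page still gets a placeholder so row order is preserved.
            descriptions.append("")
        # Random pause between requests to avoid hammering the server.
        sleep(random.uniform(min_delay, max_delay))
    return descriptions


def fake_fetch(link: str) -> str:
    """Stand-in for a real page fetch; fails for one hypothetical link."""
    if link == "b":
        raise RuntimeError("page without a description block")
    return "text for " + link


# The sleep is replaced by a no-op so the demonstration finishes immediately.
print(collect_descriptions(["a", "b", "c"], fake_fetch, sleep=lambda seconds: None))
```

Passing `fetch` and `sleep` as arguments keeps the loop testable without opening a browser or waiting between requests.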

### Save results

In [25]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [21]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Data Scientist,PNC Financial Services Group,"Remote in Pittsburgh, PA 15222",3.5,PostedJust posted,,"Participates in the data gathering, data proce...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"Job Profile\n\nPosition Overview\n\nAt PNC, ou..."
1,Data Scientist,PNC Financial Services Group,"Remote in Pittsburgh, PA 15222",3.5,PostedJust posted,,"Participates in the data gathering, data proce...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"Job Profile\n\nPosition Overview\n\nAt PNC, ou..."
2,Data Scientist,"Shaw Industries Group, Inc.",Remote,3.8,PostedToday,,Partner with data scientists across the enterp...,https://www.indeed.com/company/Shaw-Industries...,We are looking for a data scientist to join ou...
3,Data Scientist,TVision,"Remote in Boston, MA 02109",,PostedPosted 18 days ago,,Your choice of comprehensive health benefits f...,https://www.indeed.com/rc/clk?jk=6516da9c2bd81...,The Company\nTVision measures who watches TV a...
4,Data Scientist (REMOTE),State Farm,"Remote in Bloomington, IL 61701",3.7,PostedPosted 3 days ago,"$69,115 - $169,250 a year",Strong communication skills and the ability to...,https://www.indeed.com/rc/clk?jk=4b1e7c89804f1...,Overview:\nWe are not just offering a job but ...
...,...,...,...,...,...,...,...,...,...
146,Data Science Engineer,Sensydia,Remote,,PostedPosted 10 days ago,"From $150,000 a year","A Master’s degree in engineering or science, a...",https://www.indeed.com/company/Sensydia/jobs/D...,About this role\nAre you interested in applyin...
147,Data Scientist (Remote),Silverxis,Remote,3.5,EmployerActive 5 days ago,From $25 an hour,Familiarity with health/public health data.\nF...,https://www.indeed.com/company/Silverxis/jobs/...,Data Scientist (Remote)\nRICHMOND VA\nL*ocal c...
148,Data Engineer with Machine learning Exp.,Albet Technologies LLC,Remote,,PostedPosted 12 days ago,$39.88 - $70.00 an hour,Work alongside data scientists and product own...,https://www.indeed.com/company/Albet-Technolog...,"Candidates needs to be strong in spark, big da..."
149,Staff Data Scientist,Mothership Technologies,Remote,,PostedPosted 30+ days ago,,Make significant and early contributions to th...,https://www.indeed.com/rc/clk?jk=2dca28ccb7473...,We are building the future of freight\n\nMothe...
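
The `Date` values in the output above carry a doubled prefix (for example `PostedPosted 18 days ago`), which appears to come from a label element whose text `get_text()` joins to the visible date. The `Salary` column is also free text. The sketch below shows one possible cleanup with regular expressions. `clean_date` and `parse_salary` are hypothetical helpers, and the patterns are assumptions based only on the sample rows shown here.

```python
import re

# A leading "Posted" or "Employer" label glued directly to a capitalized word.
_LABEL = re.compile(r"^(Posted|Employer)(?=[A-Z])")
# Dollar amounts such as $69,115 or $39.88.
_AMOUNT = re.compile(r"\$([\d,]+(?:\.\d+)?)")
# Pay period such as "a year" or "an hour".
_PERIOD = re.compile(r"\b(?:a|an)\s+(hour|day|week|month|year)\b")


def clean_date(raw: str) -> str:
    """Drop the glued-on label, leaving the human-readable date text."""
    return _LABEL.sub("", raw)


def parse_salary(raw: str):
    """Return (list of amounts as floats, pay period or None)."""
    amounts = [float(value.replace(",", "")) for value in _AMOUNT.findall(raw)]
    period = _PERIOD.search(raw)
    return amounts, (period.group(1) if period else None)


for raw_date in ["PostedJust posted", "PostedPosted 18 days ago", "EmployerActive 5 days ago"]:
    print(clean_date(raw_date))

for raw_salary in ["$69,115 - $169,250 a year", "From $25 an hour"]:
    print(parse_salary(raw_salary))
```

Applied to the dataframe, these could be used as `dataframe['Date'].map(clean_date)` before saving, provided the column holds strings.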
