# Web Scraping - Indeed.com
General steps for Web Scraping
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content using keywords tags for elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)

### Instal Firefox, Selenium, Gecko Driver, Beautiful Soup

In [None]:
#Install firefox
!apt-get update
!apt install firefox

#Install selenium
!pip install selenium

#Updating and installing firefox libraries
!apt-get update && apt-get install -y wget bzip2 libxtst6 libgtk-3-0 libx11-xcb-dev libdbus-glib-1-2 libxt6 libpci-dev && rm -rf /var/lib/apt/lists/*

#Installing Geck Driver
!wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
!tar -xvzf geckodriver*
!chmod +x geckodriver
!export PATH=$PATH:/path-to-extracted-file/.

#Instal beautifulsoup
!pip install beautifulsoup4

### Install UC(undetected chromedriver)

Firefox trigger cloudflare protection, so I use UC instead

In [1]:
%pip install selenium
%pip install beautifulsoup4
%pip install undetected-chromedriver

Collecting selenium
  Downloading selenium-4.16.0-py3-none-any.whl (10.0 MB)
     ---------------------------------------- 10.0/10.0 MB 8.2 MB/s eta 0:00:00
Collecting trio~=0.17
  Downloading trio-0.23.1-py3-none-any.whl (448 kB)
     -------------------------------------- 448.3/448.3 kB 7.0 MB/s eta 0:00:00
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Collecting outcome
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting sniffio>=1.3.0
  Using cached sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
     ---------------------------------------- 58.3/58.3 kB 3.0 MB/s eta 0:00:00
Installing collected packages: sniffio, outcome, h11, exceptiongroup, wsproto, trio, trio-websocket, selen

### Import Dependencies

In [1]:
import selenium.webdriver as webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options as FirefoxOptions

import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By

import random
import time

import undetected_chromedriver as uc

### Define Position and Location

In [2]:
## Enter a job position
position = "data+scientist"
## Enter a location (City, State or Zip or remote)
locations = "united+states"

def get_url(position, location):
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(position, location)
    return url

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
print(url)

https://www.indeed.com/jobs?q=data+scientist&l=united+states


### Set Path to Webdriver

In [12]:
# legacy
driver_path = '/content/geckodriver'
firefox_driver_path = '/content/geckodriver'

# random user agent
user_agents= ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
              'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
              'Mozilla/5.0 (iPhone; CPU iPhone OS 17_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/120.0.6099.101 Mobile/15E148 Safari/604.1',
              'Mozilla/5.0 (iPad; CPU OS 17_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/120.0.6099.101 Mobile/15E148 Safari/604.1',
              'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.43 Mobile Safari/537.36']
random_user_agent = random.choice(user_agents)

# options for chrome driver
options = ['--headless',
           '--no-sandbox',
           f'--user-agent={random_user_agent}',
           '--disable-blink-features=AutomationControlled']
chrome_options = uc.ChromeOptions()
for option in options:
  chrome_options.add_argument(option)

# initialize the driver
driver = uc.Chrome(version_main=119, options=chrome_options)

### Scrape Job Postings

In [5]:
## Number of postings to scrape
postings = 1500

jn=0
for i in range(0, postings, 10):
    driver.implicitly_wait(random.randint(1, 3))
    driver.get(url + "&start=" + str(i))
    driver.implicitly_wait(3)
    # res.append(driver.page_source)
    
    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')
    #print(jobs)

    for job in jobs:
        #print(job)
        result_html = job.get_attribute('innerHTML')
        #print(result_html)
        soup = BeautifulSoup(result_html, 'html.parser')
        #print(soup , '\n')

        jn += 1

        liens = job.find_elements(By.TAG_NAME, "a")
        #print(liens)
        links = liens[0].get_attribute("href")
        #print(links)

        title = soup.select('.jobTitle')[0].get_text().strip()
        print(title)

        #company = soup.find_all(attrs={'data-testid': 'company-name'})[0].get_text().strip()
        #print(company)
        try:
            company = soup.find_all(attrs={'data-testid': 'company-name'})[0].get_text().strip()
            #print(company)
        except:
            company = 'NaN'
        print(company)
        #location = soup.select('.companyLocation')[0].get_text().strip() #origional
        #location = soup.select('.company_location')[0].get_text().strip()
        location = soup.find_all(attrs={'data-testid': 'text-location'})[0].get_text().strip()
        print(location)
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except:
            description = ''

        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

Data Scientist – NLP
Solytics Partners LLC
Hybrid remote in New York, NY 10001
Job number    1 added - Data Scientist – NLP
Staff Data Scientist, Core AI
Indeed
Remote
Job number    2 added - Staff Data Scientist, Core AI
Data Analyst / Data Scientist
DATSURA
Washington, DC 20549 (NoMa area)
Job number    3 added - Data Analyst / Data Scientist
Sr Data Scientist
Public Storage
Glendale, CA
Job number    4 added - Sr Data Scientist
Climate Data Scientist
Leidos
Remote
Job number    5 added - Climate Data Scientist
VISS/AXIS Ai Developer - (Ai modeling, data engineering, X-ray imaging) - Kentucky (BOSK)
SK Battery America
Glendale, KY 42740
Job number    6 added - VISS/AXIS Ai Developer - (Ai modeling, data engineering, X-ray imaging) - Kentucky (BOSK)
Sr Statistician
The Joint Commission
Hybrid remote in Oakbrook Terrace, IL 60181
Job number    7 added - Sr Statistician
Data Scientist I
Battelle
Egg Harbor Township, NJ
Job number    8 added - Data Scientist I
Data Scientist (L5) - Ad-Me

In [43]:
dataframe.head()

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Data Scientist – NLP,Solytics Partners LLC,"Hybrid remote in New York, NY 10001",,PostedJust posted,"$90,000 - $120,000 a year",Solytics Partners provide products and service...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Solytics Partners provide products and service...
1,"Staff Data Scientist, Core AI",Indeed,Remote,,EmployerActive 2 days ago,"$164,000 - $238,000 a year",Our Mission\nAs the world’s number 1 job site*...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Our Mission\nAs the world’s number 1 job site*...
2,Data Analyst / Data Scientist,DATSURA,"Washington, DC 20549 (NoMa area)",,EmployerActive 2 days ago,"$130,000 - $150,000 a year","We are small, technology consulting firm assis...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"We are small, technology consulting firm assis..."
3,Sr Data Scientist,Public Storage,"Glendale, CA",,PostedPosted 2 days ago,"$140,000 - $180,000 a year",Company Description\n\nPublic Storage is recog...,https://www.indeed.com/rc/clk?jk=5a78bb898c7c9...,Company Description\n\nPublic Storage is recog...
4,Climate Data Scientist,Leidos,Remote,,PostedJust posted,"$78,000 - $141,000 a year",Description\nUnleash Your Potential\nAt Leidos...,https://www.indeed.com/rc/clk?jk=3a1687c4e456c...,Description\nUnleash Your Potential\nAt Leidos...


### Scrape Full Job Descriptions

In [6]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [17]:
if descriptions: 
    # Update existing descriptions in dataframe
    dataframe.loc[:len(descriptions)-1, 'Description'] = descriptions

    # Identify the URL to resume from
    resume_url = dataframe.loc[len(descriptions)-1, 'Links']

    # Find index of resume_url in your Links_list
    resume_index = Links_list.index(resume_url) + 1

686


In [31]:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

target_length = dataframe.shape[0]
if 'descriptions' not in globals(): 
    descriptions=[]
for i, link in enumerate(Links_list):
    driver.get(link)
    try:
        element = WebDriverWait(driver, 6).until(
            EC.visibility_of_element_located((By.XPATH, '//div[@id="jobDescriptionText"]'))
        )
        jd = element.text
    except:
        jd = 'NaN'
    descriptions.append(jd)
    print("{0} remaining... Jd for job {1} added - {2}".format(target_length-i-1, i, jd[:50]))
    time.sleep(random.randint(2,5))

dataframe['Descriptions'] = descriptions

2126 remaining... Jd for job 0 added - Solytics Partners provide products and services to
2125 remaining... Jd for job 1 added - Our Mission
As the world’s number 1 job site*, our
2124 remaining... Jd for job 2 added - We are small, technology consulting firm assisting
2123 remaining... Jd for job 3 added - Company Description

Public Storage is recognized 
2122 remaining... Jd for job 4 added - Description
Unleash Your Potential
At Leidos, we d
2121 remaining... Jd for job 5 added - Come join us and build your future with SK battery
2120 remaining... Jd for job 6 added - Overview:
This is a hybrid position and will requi
2119 remaining... Jd for job 7 added - Battelle is guided by a founding mission. We inves
2118 remaining... Jd for job 8 added - Los Gatos, California
Data Science and Engineering
2117 remaining... Jd for job 9 added - Solutions
Oil & Gas Pipeline Power Utilities Water
2116 remaining... Jd for job 10 added - Applied Materials is the leader in materials engin
2115 rema

ValueError: Length of values (2129) does not match length of index (2127)

### Save Results

In [40]:
dataframe['Descriptions'] = descriptions

In [41]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [42]:
dataframe.head()

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Data Scientist – NLP,Solytics Partners LLC,"Hybrid remote in New York, NY 10001",,PostedJust posted,"$90,000 - $120,000 a year",Solytics Partners provide products and service...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Solytics Partners provide products and service...
1,"Staff Data Scientist, Core AI",Indeed,Remote,,EmployerActive 2 days ago,"$164,000 - $238,000 a year",Our Mission\nAs the world’s number 1 job site*...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Our Mission\nAs the world’s number 1 job site*...
2,Data Analyst / Data Scientist,DATSURA,"Washington, DC 20549 (NoMa area)",,EmployerActive 2 days ago,"$130,000 - $150,000 a year","We are small, technology consulting firm assis...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,"We are small, technology consulting firm assis..."
3,Sr Data Scientist,Public Storage,"Glendale, CA",,PostedPosted 2 days ago,"$140,000 - $180,000 a year",Company Description\n\nPublic Storage is recog...,https://www.indeed.com/rc/clk?jk=5a78bb898c7c9...,Company Description\n\nPublic Storage is recog...
4,Climate Data Scientist,Leidos,Remote,,PostedJust posted,"$78,000 - $141,000 a year",Description\nUnleash Your Potential\nAt Leidos...,https://www.indeed.com/rc/clk?jk=3a1687c4e456c...,Description\nUnleash Your Potential\nAt Leidos...


In [39]:
dataframe.shape
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        2127 non-null   object
 1   Company      2127 non-null   object
 2   Location     2127 non-null   object
 3   Rating       2127 non-null   object
 4   Date         2127 non-null   object
 5   Salary       2127 non-null   object
 6   Description  2127 non-null   object
 7   Links        2127 non-null   object
dtypes: object(8)
memory usage: 133.1+ KB


### Exit WebDriver

In [44]:
driver.quit()