# Web Scraping Data Analyst Jobs from Indeed

#### In this project, I use [Selenium](https://www.selenium.dev), [BeautifulSoup4](https://pypi.org/project/beautifulsoup4/) and [Pandas](https://pandas.pydata.org/) to scrape data about Data Analyst jobs in Alberta, Canada, from [Indeed](https://ca.indeed.com).

This project uses the following Python modules:
- pandas
- selenium
- webdriver-manager
- nltk
- BeautifulSoup4  

### Overview
___
  
To scrape the data, we'll use Selenium to enter "Data Analyst" in the main search bar and "Alberta" in the location search bar. For each job in the search results, we'll open the posting in a new tab with Selenium, parse the page source with BeautifulSoup, and append the extracted details to a Pandas DataFrame. When the results span multiple pages, we'll have Selenium click the ">" (i.e. "Next Page") button after finishing each page and repeat the extraction on the next one.  
  
Finally, we'll store the collected data in a csv file.  
  
As a bonus, we'll use the nltk module to filter common stop words out of each scraped job summary and count how often every remaining word appears overall, in an attempt to find the skills and attributes most commonly required of applicants for Data Analyst jobs posted on Indeed.  
  
Also, note that it is possible to use the Requests Python module instead of Selenium for this project. I just preferred to use Selenium.
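
For readers curious about the Requests route, here is a minimal sketch of how the same search could be expressed as a URL. The `q` and `l` parameter names mirror the input names used in Step 5; the `/jobs` path and the `build_search_url` helper are assumptions made for illustration, not something this notebook relies on.

```python
from urllib.parse import urlencode


def build_search_url(query, location, base="https://ca.indeed.com/jobs"):
    """Build a search URL from a job query and a location.

    The "/jobs" path is an assumption; "q" and "l" mirror the names of the
    search inputs located with Selenium in Step 5.
    """
    return f"{base}?{urlencode({'q': query, 'l': location})}"


url = build_search_url("Data Analyst", "Alberta")
print(url)

# With Requests installed, the page could then be fetched with:
#   import requests
#   html = requests.get(url, timeout=10).text
# A site may reject non-browser clients, which is one reason to prefer Selenium.
```

The resulting HTML could be passed to BeautifulSoup exactly as the Selenium page source is in the steps below.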

#### Step 1: Import the necessary modules for the webscraping

In [None]:
import re, csv
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

#### Step 2: Define a function for getting the "next page" button in the job search results page.

In [101]:
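def find_next(soup, driver):
    """Return the clickable "Next" link as a Selenium element, or None on the last page."""
    # The pagination list is absent when the results fit on a single page
    pages_ul = soup.find("ul", attrs={"class": "pagination-list"})
    if pages_ul is None:
        return None

    page_items = pages_ul.find_all("li")
    if not page_items or not page_items[-1].contents:
        return None

    # The last list item holds the "Next" anchor on every page except the last one
    nextpage_anchor = page_items[-1].contents[0]
    if nextpage_anchor.get("aria-label") != "Next":
        return None

    # Locate the same anchor in the live page so that Selenium can click it
    href = nextpage_anchor.get("href")
    return driver.find_element(By.XPATH, f"//a[@href='{href}']")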
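
The parsing half of this function can be tried without a browser. The sketch below runs the same idea on a small, made-up HTML snippet; the markup and the `next_page_href` helper are hypothetical, and the real Indeed page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the pagination structure assumed above
SAMPLE_HTML = """
<ul class="pagination-list">
  <li><a href="/jobs?start=0" aria-label="1">1</a></li>
  <li><a href="/jobs?start=10" aria-label="Next">&gt;</a></li>
</ul>
"""


def next_page_href(html):
    """Return the href of the "Next" link, or None if there is no next page."""
    soup = BeautifulSoup(html, "html.parser")
    pages_ul = soup.find("ul", attrs={"class": "pagination-list"})
    if pages_ul is None:
        return None  # no pagination list: single page of results
    anchor = pages_ul.find("a", attrs={"aria-label": "Next"})
    return anchor.get("href") if anchor else None


print(next_page_href(SAMPLE_HTML))
print(next_page_href("<p>No pagination here</p>"))
```

Searching for the anchor by its `aria-label` avoids depending on the position of the last list item, at the cost of still depending on the label text.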

#### Step 4: Initialize the Chrome Webdriver

In [None]:
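# Initialize web driver (Selenium 4 expects the driver path to be wrapped in a Service object)
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)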

#### Step 5: Search for Data Analyst jobs in Alberta using Selenium

In [None]:
query = "Data Analyst"
location = "Alberta"
indeed_url = "https://ca.indeed.com/"

# Get Indeed homepage
driver.get(indeed_url)

# Find input elements
query_input = driver.find_element(By.NAME, "q")
location_input = driver.find_element(By.NAME, "l")

# Enter job query and location
query_input.send_keys(Keys.CONTROL + "a")
query_input.send_keys(Keys.DELETE)
query_input.send_keys(query)

location_input.send_keys(Keys.CONTROL + "a")
location_input.send_keys(Keys.DELETE)
location_input.send_keys(location)

# "Press 'ENTER'" to search DA jobs
location_input.send_keys(Keys.RETURN)

#### Step 6: Iterate over jobs in search results and extract data into a DataFrame

In [93]:
# Create an empty DataFrame with the column headers
column_headers = ["Job_Title", "Job_Type", "Company", "Location", "Remote", "Salary", "Summary"]
df = pd.DataFrame(columns=column_headers)

# Iterate over all result pages
while True:
    # Get jobs list and job links
    joblist_page = driver.page_source

    soup = BeautifulSoup(joblist_page, "html.parser")

    # Get job list html
    joblist = soup.find("ul", attrs={"class": "jobsearch-ResultsList"})
    
    # Iterate over jobs in the job list and scrape relevant data from the current page
    for job in joblist.find_all("div", attrs={"class": "cardOutline"}):
        job_title_anchor = job.find("a", attrs={"class": "jcs-JobTitle"})

        title = job_title_anchor.get_text()
        company = job.find("span", attrs={"class": "companyName"}).get_text()
        location = job.find("div", attrs={"class": "companyLocation"}).get_text()

        # Flag remote jobs and strip the "Remote in " prefix from the location
        remote = "remote" in location.lower()
        if remote:
            location = re.sub(r"Remote in ", "", location, flags=re.I)

        # Drop suffixes such as "+2 locations"
        location = re.sub(r"\+\d+\s*locations?", "", location, flags=re.I)

        job_url = indeed_url + job_title_anchor.get("href")[1:]   # the slicing is for removing the leading '/'

        # Open job details page in new tab
        driver.execute_script("window.open('');")
        driver.switch_to.window(driver.window_handles[1])
        driver.get(job_url)

        job_view = driver.find_element(By.ID, "viewJobSSRRoot")
        job_view_soup = BeautifulSoup(job_view.get_attribute("outerHTML"), "html.parser")

        job_details = job_view_soup.find_all("div", attrs={"class": "jobsearch-JobDescriptionSection-sectionItem"})

        # Obtain the job salary and job type ('full time', 'temporary', etc...)
        # if these details are available
        salary = None
        job_type = None
        for detail in job_details:

            # Parse the salary text. Postings quote a single amount or a range,
            # per hour, per month or per year. We take the mean of a range
            # and convert every salary to a monthly value for storage
            if detail.contents[0].get_text() == "Salary":
                salary_text = detail.contents[1].get_text().replace(",", "")
                salary_l = re.findall(r"\d+(?:\.\d+)?", salary_text)

                if len(salary_l) == 2:
                    salary = (float(salary_l[0]) + float(salary_l[1])) / 2
                elif salary_l:
                    salary = float(salary_l[0])

                if salary is not None:
                    # Hourly rates: assume 8-hour shifts, 5 work days a week and 4 weeks a month
                    if "an hour" in salary_text or "per hour" in salary_text:
                        salary *= 8 * 5 * 4
                    # Yearly salaries: divide by 12 months
                    elif "a year" in salary_text or "per year" in salary_text:
                        salary /= 12

                    salary = round(salary, 2)

            # Get job type if available
            elif detail.contents[0].get_text() == "Job type":
                job_type = detail.contents[1].get_text()

        # Get the job description text and remove unnecessary whitespace and newlines
        job_descr = job_view_soup.find("div", attrs={"id": "jobDescriptionText"}).get_text()
        job_descr = job_descr.strip()
        job_descr = re.sub("\n+", " ", job_descr)
        job_descr = re.sub(" +", " ", job_descr)

        # Close new tab and switch back to tab with jobs list
        driver.close()
        driver.switch_to.window(driver.window_handles[0])

        df.loc[len(df.index)] = [title, job_type, company, location, remote, salary, job_descr]
            
    # Close the popover if one appears on the page. Note that find_elements (emphasis on the "s")
    # returns an empty list instead of raising an exception when nothing matches
    close_btn_popover = driver.find_elements(By.CSS_SELECTOR, "button.popover-x-button-close")
    
    if len(close_btn_popover) > 0:
        close_btn_popover[0].click()
    
    # Get next search result page
    nextpage = find_next(soup, driver)
    
    if nextpage:
        nextpage.click()
    else:
        break
    
# Quit the browser
driver.quit()
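
The location clean-up and the salary normalization buried in the loop above can also be written as small pure functions, which makes them easy to check without launching a browser. This is a sketch under the same assumptions as the loop (8-hour shifts, 5 work days a week, 4 weeks a month); `clean_location` and `monthly_salary` are hypothetical helper names, and the sample strings are made up.

```python
import re

HOURS_PER_MONTH = 8 * 5 * 4  # assumption carried over from the scraping loop


def clean_location(raw):
    """Return (location, remote) with the remote prefix and '+N locations' suffix removed."""
    remote = "remote" in raw.lower()
    location = re.sub(r"remote in ", "", raw, flags=re.I)
    location = re.sub(r"\+\d+\s*locations?", "", location, flags=re.I)
    return location.strip(), remote


def monthly_salary(salary_text):
    """Convert a salary string to a monthly amount, or None if it holds no number."""
    text = salary_text.replace(",", "")
    amounts = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)]
    if not amounts:
        return None

    # Use the mean when exactly two amounts (a range) are given, else the first amount
    value = sum(amounts) / 2 if len(amounts) == 2 else amounts[0]

    if "an hour" in text or "per hour" in text:
        value *= HOURS_PER_MONTH
    elif "a year" in text or "per year" in text:
        value /= 12
    return round(value, 2)


print(clean_location("Remote in Calgary, AB+2 locations"))
print(monthly_salary("$60,000 - $80,000 a year"))
print(monthly_salary("$25 an hour"))
```

Postings without a recognizable amount yield `None`, which matches how the loop leaves `salary` unset when no salary section is present.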






#### Step 7: Store collected data in a csv file

In [94]:
from os import remove
from os.path import exists

# Remove any existing file called "indeed_da_jobs.csv"
if exists("indeed_da_jobs.csv"):
    remove("indeed_da_jobs.csv")
    
# Store scraped data in csv file
df.to_csv("indeed_da_jobs.csv")
    
# Test-read the csv file
test = pd.read_csv("indeed_da_jobs.csv", index_col=[0])
test.head(5)

| | Job_Title | Job_Type | Company | Location | Remote | Salary | Summary |
|---|---|---|---|---|---|---|---|
| 0 | Data Security Analyst | Full-time | CWB Financial Group | Hybrid Canada | True | | At CWB , we strive to build value for the peop... |
| 1 | Inventory Analyst | Full-time | Canadian Natural | Calgary, AB | False | | Inventory Analyst- (2210923)The Opportunity:Ar... |
| 2 | Junior Financial Data Analyst | Full-time | Lawson Lundell LLP | Calgary, AB | False | | Lawson Lundell LLP is a leading regional Canad... |
| 3 | Contract Data Reporting Analyst | Temporary | Robert Half | Edmonton, AB | True | | Robert Half is recruiting now for an experienc... |
| 4 | Financial Data Analyst | Full-time | Government of Alberta | Edmonton, AB | True | 6190.0 | Job Information Job Requisition ID: 29202 Mini... |


___
### Bonus
  
#### Step 1: Get the summary of each scraped job

In [95]:
job_summaries = test.loc[:, ['Summary']]
job_summaries

| | Summary |
|---|---|
| 0 | At CWB , we strive to build value for the peop... |
| 1 | Inventory Analyst- (2210923)The Opportunity:Ar... |
| 2 | Lawson Lundell LLP is a leading regional Canad... |
| 3 | Robert Half is recruiting now for an experienc... |
| 4 | Job Information Job Requisition ID: 29202 Mini... |
| ... | ... |
| 145 | Overview: KPMG is an industry leading firm tha... |
| 146 | About Athennian Athennian increases trust in b... |
| 147 | Resolute is a Full-Service IT firm with a mult... |
| 148 | 12-month contract Let’s impact lives for the b... |


#### Step 2: Download English stop words & add more stop words if needed
  
"[Stop words](https://en.wikipedia.org/wiki/Stop_word) are any word in a stop list which are filtered out before or after processing of natural language data." - Wikipedia

In [107]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = stopwords.words("english")   # lowercase file name; "English" fails on case-sensitive file systems
stop_words.extend(["skills", "data", "ability", "information", "business", "analyst", 
                   "requirements", "strong", "analysis", "process", "working"])
stop_words

[nltk_data] Downloading package stopwords to
[nltk_data]     D:\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

#### Step 3: Filter out stop words from job summaries

In [108]:
summary_words_list = []
for row in job_summaries.itertuples():
    summary = re.findall(r"[\w']+", row[1])   # Split the summary into word tokens
    summary = filter(lambda x: x.lower() not in stop_words, summary)   # Not a stop word
    summary = map(str.capitalize, summary)
    
    summary_words_list.extend(summary)
    
print(summary_words_list)
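
To see what this filtering does to a single sentence, here is a small standalone sketch. The `sample_stop_words` set is a hand-picked stand-in for the nltk list, and `keywords` is a hypothetical helper that applies the same tokenize, filter and capitalize steps.

```python
import re

# Hand-picked stand-in for the nltk stop word list (assumption for illustration)
sample_stop_words = {"the", "and", "of", "with", "in", "is", "a"}


def keywords(text, stop_words):
    """Tokenize text, drop stop words (case-insensitively) and capitalize the rest."""
    tokens = re.findall(r"[\w']+", text)
    return [token.capitalize() for token in tokens if token.lower() not in stop_words]


print(keywords("Experience with SQL and Power BI is required", sample_stop_words))
# ['Experience', 'Sql', 'Power', 'Bi', 'Required']
```

Note that `str.capitalize` lowercases everything after the first letter, which is why acronyms end up stored as `Sql` and `Bi` in the word counts.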



#### Step 4: Count the number of times each word appears overall in the summaries

In [109]:
from collections import Counter

summary_counter = Counter(summary_words_list)
summary_counter

Counter({'Cwb': 24,
         'Strive': 18,
         'Build': 91,
         'Value': 83,
         'People': 132,
         'Choose': 6,
         'Us': 110,
         'Every': 49,
         'Day': 64,
         'Clients': 119,
         'Investors': 7,
         'Holding': 2,
         'True': 13,
         'Values': 44,
         'Guide': 8,
         'Put': 10,
         'First': 32,
         'Relationships': 53,
         'Intention': 2,
         'Seek': 13,
         'Embrace': 10,
         'New': 136,
         'Ideas': 37,
         'Knowing': 3,
         'Better': 44,
         'Always': 28,
         'Possible': 13,
         'Believe': 27,
         'Things': 14,
         'Important': 22,
         'Harness': 2,
         'Power': 109,
         'Inclusion': 44,
         'Culture': 63,
         'Show': 10,
         'Individuals': 45,
         'Team': 416,
         'Accomplish': 9,
         'Strategy': 40,
         'Canadian': 38,
         'Western': 21,
         'Bank': 11,
         'Full': 90,
      

#### Step 5: Obtain the 10 most common words overall in the job descriptions

In [110]:
summary_counter.most_common(10)

[('Experience', 599),
 ('Work', 594),
 ('Team', 416),
 ('Management', 357),
 ('Support', 269),
 ('Knowledge', 212),
 ('Development', 205),
 ('Environment', 202),
 ('Technical', 186),
 ('Required', 183)]

"Experience", "Work" and "Team" are the most frequent words across the scraped Data Analyst job descriptions in Alberta, which suggests that prior experience and teamwork are recurring themes.

#### Step 6: Find the per-job count for some specific words

In [111]:
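# Average number of mentions per scraped job (total occurrences / number of jobs)
job_count = len(df.index)
count_per_job = {k: round(v / job_count, 3) for k, v in summary_counter.items()}

# Keys are capitalized tokens, so "SQL" is stored as "Sql" and the "BI" in "Power BI" as "Bi"
print("Python", count_per_job["Python"])
print("Excel", count_per_job["Excel"])
print("SQL", count_per_job["Sql"])
print("Tableau", count_per_job["Tableau"])
print("Power BI", count_per_job["Bi"])
print("Remote", count_per_job["Remote"])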

Python 0.167
Excel 0.58
SQL 0.74
Tableau 0.18
Power BI 0.66
Remote 0.227


SQL and Power BI seem to be very common in job descriptions. These figures are average mentions per job, so they only translate into a share of jobs under an assumption: if each word appears about once in every job description that mentions it, then about 74% of Data Analyst jobs in Alberta ask for SQL and about 66% ask for Power BI.
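
The difference between mentions per job and the share of jobs that mention a word at least once can be made concrete with a small sketch. The sample summaries below are made up, and `mentions_per_posting` and `share_of_postings` are hypothetical helpers; on the real data, the second function could be applied to the `Summary` column.

```python
import re


def _tokens(text):
    """Lowercased word tokens, using the same pattern as the notebook."""
    return [word.lower() for word in re.findall(r"[\w']+", text)]


def mentions_per_posting(summaries, keyword):
    """Total occurrences of keyword divided by the number of postings."""
    key = keyword.lower()
    total = sum(_tokens(text).count(key) for text in summaries)
    return round(total / len(summaries), 3)


def share_of_postings(summaries, keyword):
    """Fraction of postings that mention keyword at least once."""
    key = keyword.lower()
    hits = sum(1 for text in summaries if key in _tokens(text))
    return round(hits / len(summaries), 3)


# Made-up summaries for illustration only
sample_summaries = [
    "SQL reporting, SQL tuning and Excel",
    "Python and Tableau dashboards",
    "Advanced SQL required",
    "Excel reporting",
]

print(mentions_per_posting(sample_summaries, "SQL"))  # 0.75
print(share_of_postings(sample_summaries, "SQL"))     # 0.5
```

In this toy data, one posting repeats "SQL", so the per-job average overstates how many postings actually mention it; the same effect could inflate the figures above.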