- This notebook is part of a project to redesign the course curriculum for MIE 1624: Introduction to Data Science and Analytics. The project scrapes well-known job posting sites (Indeed, Glassdoor, LinkedIn, Upwork, etc.) to extract the skills required for data analysts, data scientists, data managers, data engineers, and similar roles. Additional data will be obtained from Kaggle datasets and other online platforms such as CognitiveClass.ai, Coursera, edX, and DataCamp.
- This notebook extracts the skills for data-related jobs from Indeed, focusing on two North American countries: the US and Canada.
- The scraping uses the Python libraries "requests" and "BeautifulSoup" and, if needed, "Selenium"; the "pandas" library then assembles the data into a dataframe for further pre-processing and cleaning. Note that BeautifulSoup:
  * is a Python parsing library that lets you extract data from web pages
  * builds a navigable tree from an HTML or XML page, and can work with different parsers such as html.parser, lxml, and html5lib
  * is user-friendly
  
- Selenium is a library that lets a Python script drive a web browser the way a human user would. It is useful when the target website relies heavily on JavaScript. Selenium exposes an API for controlling a browser (optionally headless) from code, including actions such as mouse clicks and filling in forms.
- A URL for a data scientist job search in Toronto on the Indeed site looks like "https://ca.indeed.com/jobs?q=data%20scientist&l=Toronto%2C%20ON", where:

    * "q=" begins the string for the "what" field on the page; spaces between search terms are encoded as "%20" or "+" (e.g. "data+scientist")
    * "&l=" begins the string for the city of interest, with spaces encoded the same way if the city is more than one word (e.g. "New+York")
    * Each page of job results has 15 job posts.
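
The query-string rules above can be sketched with the standard library. This is a minimal illustration rather than the notebook's own code: the helper name `build_search_url` and its parameters are assumptions introduced here, and `urlencode` does the escaping (by default it encodes spaces as "+" and the comma as "%2C").

```python
from urllib.parse import urlencode


def build_search_url(country, what, where=None, start=0):
    """Build an Indeed-style search URL (hypothetical helper for illustration)."""
    params = {"q": what}
    if where:
        params["l"] = where
    if start:
        params["start"] = start
    # urlencode escapes each value; spaces become "+" and "," becomes "%2C"
    return f"https://{country}.indeed.com/jobs?{urlencode(params)}"


url = build_search_url("ca", "data scientist", "Toronto, ON")
print(url)
```

Letting `urlencode` handle the escaping means job titles can be written with ordinary spaces instead of hand-typed "+" separators.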


In [None]:
# fake_useragent generates realistic browser user-agent strings, so requests are less likely to be blocked by the site
!pip install fake_useragent



In [None]:
# Dependencies
from bs4 import BeautifulSoup
import requests
# import pymongo
import pandas as pd
import random
from fake_useragent import UserAgent  #generate random UAs
import time

In [None]:
# From here we generate a random user agent
ua = UserAgent()

user_agent = ua.random
header = {"user-agent": str(user_agent)}
# header = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

In [None]:
# Country of indeed website to be scraped
country = "ca"

In [None]:
# job_dict = ["Data+Analyst", "Data+Analytics", "Data+Manager", "Business+Analyst", "AI+System+Designer", "Data+Scientist", "Data+Engineer", "Machine+Learning+Engineer"]

In [None]:
# Create the list of job titles to search (words joined with "+")
job_dict = ["Data+Manager"]

# Create empty lists for job posting information
job_title_list = []
job_title_index = []
job_link_list = []
job_description_list = []

# Start the main web scraping
for i, title in enumerate(job_dict):

  print("Starting job search for: ", title)

  for page in range(0,210,10):

    # Build the search URL for the current results offset
    url = f"https://{country}.indeed.com/jobs?q={title}&start={page}"    # use this for countries other than the US
    # url = f"https://indeed.com/jobs?q={title}&start={page}"              # use this for the US only
    
    print("Current page: ", url)
    
    # Pause for a random 3 to 6 seconds between requests
    time_gap = random.randrange(3, 7, 1)
    time.sleep(time_gap)
        
    # Retrieve page with the requests module
    response = requests.get(
                            url,
                            # proxies=proxy_protocol,
                            headers=header)
    
    # Create BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')       
        
    # Retrieve the parent <a> elements for all results
    results = soup.find_all('a', class_='result')
     
    # Start looping over results to get each job data
    for result in results:
      try:
        #  get job title and create job index
        job_title = result.find('h2', class_='jobTitle').text.replace('new', '')
        job_index = i + 1

        # get job link (the relative href is appended to the selected country's site)
        href = result.get('href')
        job_link = f'https://{country}.indeed.com{href}'

        # Go into each job_link and scrape job description
        job_description_response = requests.get(job_link, headers=header)
        description_soup = BeautifulSoup(job_description_response.text, 'html.parser')
        job_description = description_soup.find('div', {'id': 'jobDescriptionText'}).text.replace('\n', '') 

        # Append to their lists
        job_title_list.append(job_title)
        job_title_index.append(job_index)
        job_link_list.append(job_link)
        job_description_list.append(job_description)

      except Exception:
        # Skip results with a missing title, link, or description
        pass

    # Every 10 pages, get a new random UA; the for loop advances `page` itself,
    # so (page + 10) is the offset of the next page to be requested
    if (page + 10) % 100 == 0:
      user_agent = ua.random
      header = {"user-agent": str(user_agent)}
      print(f"----------------\n\
              A new user-agent was created:\n\
              {user_agent}\n----------------")
          
print("===================\nScraping completed")

Starting job search for:  Data+Manager
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=0
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=10
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=20
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=30
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=40
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=50
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=60
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=70
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=80
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=90
----------------
              A new user-agent was created:
              Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/4.0; InfoPath.2; SV1; .NET CLR 2.0.50727; WOW64)
----------------
Current page:  https://ca.indeed.com/jobs?q=Data+Manager&start=100
Current page:  https://ca.indeed.c
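
The link-building step in the loop above can be sketched in isolation. The scraped hrefs are relative paths (the links collected by the notebook begin with /pagead/clk, /rc/clk, or /company/ once the host is removed), so they have to be resolved against the site's base URL. The helper `absolute_job_link` is a hypothetical name used only for illustration; it relies on `urllib.parse.urljoin` from the standard library, and the sample href is made up.

```python
from urllib.parse import urljoin


def absolute_job_link(country, href):
    """Resolve a result's href against the country-specific Indeed host."""
    base = f"https://{country}.indeed.com"
    # urljoin resolves relative paths and leaves absolute URLs untouched
    return urljoin(base, href)


# Made-up href for illustration only
link = absolute_job_link("ca", "/rc/clk?jk=example123")
print(link)
```

Passing the country in keeps the job links consistent with the search URL when a site other than "ca" is scraped.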

In [None]:
# Putting list into dataframe
jobmarket = {"Job Title Index" : job_title_index,
                "Job Title" : job_title_list, 
                "Link": job_link_list,
                "Job Description": job_description_list}

# Generating the dataframe
jobmarket_df = pd.DataFrame(jobmarket)
# job_index was stored as i + 1; shift it back to a zero-based index
jobmarket_df["Job Title Index"] = jobmarket_df["Job Title Index"]-1
jobmarket_df

Unnamed: 0,Job Title Index,Job Title,Link,Job Description
0,0,Product Data Manager,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,FortNine is a rapidly growing motorcycle and p...
1,0,"BAND 3 - Manager, Data Analytics",https://ca.indeed.com/rc/clk?jk=7b3c5d5fb0b45e...,"Manager, Data AnalyticsClassification: BAND 3J..."
2,0,Data Manager (Remote),https://ca.indeed.com/company/ITFO-Communicati...,"At ITFO, we are in the business of connecting ..."
3,0,"Manager, Data Insights",https://ca.indeed.com/rc/clk?jk=d2841378e12536...,"Reports ToDirector, AnalyticsLocationSysco Can..."
4,0,"Manager, Data Centre Infastructure",https://ca.indeed.com/rc/clk?jk=7a32cce564d29f...,Who is eHealth Saskatchewan?eHealth Saskatchew...
...,...,...,...,...
127,0,"Senior Manager, AML Data Governance",https://ca.indeed.com/rc/clk?jk=7a76bc8e78dd41...,Requisition ID: 120812Join a purpose driven wi...
128,0,Data Architect & Migration Manager,https://ca.indeed.com/rc/clk?jk=b2c8b98857a97c...,Company DescriptionBecome part of our growing ...
129,0,"Procurement Manager, Technology & Data Busines...",https://ca.indeed.com/rc/clk?jk=8367d2a205ff78...,Company DescriptionMake an impact at a global ...
130,0,"Manager, Analytics",https://ca.indeed.com/rc/clk?jk=4a744c6d07c2bf...,RBC Customer Relationship Management & Analyti...


# Topics

In [None]:
topics = ['ETL/ELT', 'Ad-hoc', 'EDA', 'data analysis', 'Business Intelligence tools', 'dashboards', 'visualization', 'business analysis', 'analytical tools', 'ROI analysis',\
          'machine learning', 'supervised', 'unsupervised', 'feature engineering', 'neural networks', 'artificial intelligence', 'deep learning', 'reinforcement learning',\
          'computer vision', 'natural language processing', 'recommendation system',\
          'data mining', 'data warehousing', 'databases', 'data architechture', 'math', 'statistics', 'text analytics']
          
jobmarket_df_topics = jobmarket_df.copy()

for item in topics:
  try:
    jobmarket_df_topics[item] = jobmarket_df_topics['Job Description'].str.contains(item, na=False, case=False)
    jobmarket_df_topics[[item]] = jobmarket_df_topics[[item]].astype(int)
  except:
    pass
  # print(item) 

In [None]:
jobmarket_df_topics.head()

Unnamed: 0,Job Title Index,Job Title,Link,Job Description,ETL/ELT,Ad-hoc,EDA,data analysis,Business Intelligence tools,dashboards,visualization,business analysis,analytical tools,ROI analysis,machine learning,supervised,unsupervised,feature engineering,neural networks,artificial intelligence,deep learning,reinforcement learning,computer vision,natural language processing,recommendation system,data mining,data warehousing,databases,data architechture,math,statistics,text analytics
0,0,Product Data Manager,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,FortNine is a rapidly growing motorcycle and p...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,"BAND 3 - Manager, Data Analytics",https://ca.indeed.com/rc/clk?jk=7b3c5d5fb0b45e...,"Manager, Data AnalyticsClassification: BAND 3J...",0,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0
2,0,Data Manager (Remote),https://ca.indeed.com/company/ITFO-Communicati...,"At ITFO, we are in the business of connecting ...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,"Manager, Data Insights",https://ca.indeed.com/rc/clk?jk=d2841378e12536...,"Reports ToDirector, AnalyticsLocationSysco Can...",0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,"Manager, Data Centre Infastructure",https://ca.indeed.com/rc/clk?jk=7a32cce564d29f...,Who is eHealth Saskatchewan?eHealth Saskatchew...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
jobmarket_df_topics.iloc[:, 4:].sum()

ETL/ELT                         0
Ad-hoc                          4
EDA                            18
data analysis                  14
Business Intelligence tools     4
dashboards                     15
visualization                  19
business analysis               5
analytical tools                4
ROI analysis                    0
machine learning               13
supervised                      3
unsupervised                    1
feature engineering             0
neural networks                 0
artificial intelligence         4
deep learning                   3
reinforcement learning          0
computer vision                 2
natural language processing     1
recommendation system           0
data mining                     6
data warehousing                4
databases                      20
data architechture              0
math                           19
statistics                     25
text analytics                  0
dtype: int64
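
Some of the zero counts above may reflect how the keywords are written rather than a real absence in the postings. `str.contains` searches for each keyword as given, so 'ETL/ELT' only matches the literal text "ETL/ELT", and 'data architechture' only matches that exact spelling. Below is a minimal sketch of one alternative, assuming a hand-written mapping from column label to regular expression. The mapping, the `flag_topics` helper, and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Column label -> regex pattern (illustrative assumptions)
topic_patterns = {
    "ETL/ELT": r"\b(?:ETL|ELT)\b",                 # either acronym on its own
    "data architecture": r"\bdata architecture\b",
    "dashboards": r"\bdashboards?\b",              # singular or plural
}


def flag_topics(description, patterns):
    """Return {label: 0 or 1} for one job description, case-insensitively."""
    return {
        label: int(re.search(pattern, description, flags=re.IGNORECASE) is not None)
        for label, pattern in patterns.items()
    }


sample = "Build ETL pipelines and a dashboard for the data architecture team."
flags = flag_topics(sample, topic_patterns)
print(flags)
```

With pandas, the same patterns could be passed to `Series.str.contains(pattern, case=False, na=False)` while the label is kept as the column name.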

In [None]:
jobmarket_df_topics.shape

(132, 32)

# Tools

In [None]:
tools = ['Excel', 'Python', 'R', 'SQL', 'NoSQL', 'pyspark', 'java', 'Javascript', 'html', 'Julia', 'Swift', 'Bash', 'Matlab', 'GCP', 'AWS',\
         'Azure', 'Big data', 'Hadoop', 'MySQL', 'PostgreSQL', 'Oracle', 'MongoDB', 'Snowflake', 'IBM', 'Redshift', 'Power BI', 'Tableau', 'SAP', 'Qlik']

jobmarket_df_tools = jobmarket_df.copy()

for item in tools:
  try:
    jobmarket_df_tools[item] = jobmarket_df_tools['Job Description'].str.contains(item, na=False, case=False)
    jobmarket_df_tools[[item]] = jobmarket_df_tools[[item]].astype(int)
  except:
    pass


In [None]:
jobmarket_df_tools.head()

Unnamed: 0,Job Title Index,Job Title,Link,Job Description,Excel,Python,R,SQL,NoSQL,pyspark,java,Javascript,html,Julia,Swift,Bash,Matlab,GCP,AWS,Azure,Big data,Hadoop,MySQL,PostgreSQL,Oracle,MongoDB,Snowflake,IBM,Redshift,Power BI,Tableau,SAP,Qlik
0,0,Product Data Manager,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,FortNine is a rapidly growing motorcycle and p...,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,"BAND 3 - Manager, Data Analytics",https://ca.indeed.com/rc/clk?jk=7b3c5d5fb0b45e...,"Manager, Data AnalyticsClassification: BAND 3J...",1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1
2,0,Data Manager (Remote),https://ca.indeed.com/company/ITFO-Communicati...,"At ITFO, we are in the business of connecting ...",1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,"Manager, Data Insights",https://ca.indeed.com/rc/clk?jk=d2841378e12536...,"Reports ToDirector, AnalyticsLocationSysco Can...",1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,0,"Manager, Data Centre Infastructure",https://ca.indeed.com/rc/clk?jk=7a32cce564d29f...,Who is eHealth Saskatchewan?eHealth Saskatchew...,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
jobmarket_df_tools.shape

(132, 33)

In [None]:
jobmarket_df_tools.iloc[:, 4:].sum()

Excel          88
Python         16
R             132
SQL            28
NoSQL           1
pyspark         1
java            0
Javascript      0
html            1
Julia           0
Swift           0
Bash            1
Matlab          0
GCP             7
AWS            14
Azure           8
Big data        6
Hadoop          3
MySQL           2
PostgreSQL      0
Oracle          3
MongoDB         1
Snowflake       4
IBM             3
Redshift        1
Power BI        7
Tableau        15
SAP            16
Qlik            4
dtype: int64
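
The count of 132 for 'R' equals the number of rows in the dataframe, which is what a case-insensitive substring search produces: any description containing the letter "r" matches. Other short keywords can likewise match inside longer words (for example 'Excel' inside "excellent"). The sketch below shows one way to require whole-word matches using `re.escape` and lookarounds. The `mentions_tool` helper and the sample sentence are assumptions for illustration, and matching 'R' case-sensitively is a deliberate choice made here, not something the notebook does.

```python
import re


def mentions_tool(description, tool, case_sensitive=False):
    """True if `tool` appears as a whole word in `description`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape keeps any special characters in the tool name literal;
    # the lookarounds act as word boundaries on both sides
    pattern = rf"(?<!\w){re.escape(tool)}(?!\w)"
    return re.search(pattern, description, flags) is not None


sample = "Excellent communicator with strong reporting skills in Tableau."
print("r" in sample.lower())                             # plain substring search
print(mentions_tool(sample, "R", case_sensitive=True))   # whole-word search for R
print(mentions_tool(sample, "Excel"))                    # "Excellent" is not "Excel"
print(mentions_tool(sample, "Tableau"))
```

The same pattern string could be handed to `Series.str.contains` in place of the bare keyword.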

# Soft Skills

In [None]:
soft_skills = ['learning', 'story telling', 'presentation', 'decision making', 'research',  'creativity', 'critical thinking', 'analytical thinking', 'curiosity',\
               'domain knowledge', 'communication', 'report writing', 'team', 'business acumen']

jobmarket_df_softskills = jobmarket_df.copy()

for item in soft_skills:
  try:
    jobmarket_df_softskills[item] = jobmarket_df_softskills['Job Description'].str.contains(item, na=False, case=False)
    jobmarket_df_softskills[[item]] = jobmarket_df_softskills[[item]].astype(int)
  except:
    pass


In [None]:
jobmarket_df_softskills.head()

Unnamed: 0,Job Title Index,Job Title,Link,Job Description,learning,story telling,presentation,decision making,research,creativity,critical thinking,analytical thinking,curiosity,domain knowledge,communication,report writing,team,business acumen
0,0,Product Data Manager,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,FortNine is a rapidly growing motorcycle and p...,1,0,0,0,0,0,0,0,1,0,0,0,1,0
1,0,"BAND 3 - Manager, Data Analytics",https://ca.indeed.com/rc/clk?jk=7b3c5d5fb0b45e...,"Manager, Data AnalyticsClassification: BAND 3J...",0,0,1,1,1,0,0,0,0,0,0,0,1,0
2,0,Data Manager (Remote),https://ca.indeed.com/company/ITFO-Communicati...,"At ITFO, we are in the business of connecting ...",1,0,0,0,0,0,0,0,0,0,1,0,1,0
3,0,"Manager, Data Insights",https://ca.indeed.com/rc/clk?jk=d2841378e12536...,"Reports ToDirector, AnalyticsLocationSysco Can...",1,0,1,0,1,0,0,0,0,0,1,0,1,0
4,0,"Manager, Data Centre Infastructure",https://ca.indeed.com/rc/clk?jk=7a32cce564d29f...,Who is eHealth Saskatchewan?eHealth Saskatchew...,0,0,1,0,0,0,0,0,0,0,0,0,1,0


In [None]:
jobmarket_df_softskills.shape

(132, 18)

In [None]:
jobmarket_df_softskills.iloc[:, 4:].sum()

learning                53
story telling            0
presentation            40
decision making         14
research                46
creativity               9
critical thinking        7
analytical thinking      1
curiosity               11
domain knowledge         2
communication           91
report writing           0
team                   125
business acumen          7
dtype: int64
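
Raw counts are easier to compare across job titles when expressed as a share of postings. Below is a minimal sketch using a few of the counts above and the 132 rows reported by `.shape`. The dictionary literal and variable names are illustrative; in the notebook the same figures could be derived from the dataframe itself, for instance by dividing the column sums by `len(jobmarket_df_softskills)`.

```python
# A few counts copied from the output above; 132 is the row count from .shape
counts = {"team": 125, "communication": 91, "learning": 53, "research": 46}
n_postings = 132

# Percentage of postings mentioning each keyword, highest first
shares = {
    skill: round(100 * count / n_postings, 1)
    for skill, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
}

for skill, pct in shares.items():
    print(f"{skill:<15}{pct:>6.1f}%")
```

These shares inherit the substring matching used for the counts, so a keyword such as 'learning' also counts postings that mention "machine learning".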

# Export to CSV

In [None]:
jobmarket_df_topics.to_csv("CA-JobMarket-DataManager-topics.csv")
jobmarket_df_tools.to_csv("CA-JobMarket-DataManager-tools.csv")
jobmarket_df_softskills.to_csv("CA-JobMarket-DataManager-softskills.csv")