<h1>Job Market Trends</h1>
<h2>Extract, Transform, and Load Data</h2>

Data Analyst vs. Data Scientist job postings

In [1]:
import os
import codecs
from bs4 import BeautifulSoup
import csv

<h2>Part 1: Access data files within a Directory</h2>

The job postings are stored as files within a directory, so we will write a function that iterates through the directory and opens each file.

In [2]:
# first check that we are in the correct directory
print(os.getcwd())

/Users/jennifer/nlp-jobmarket


In [3]:
# print a list of the files in the working directory
!ls

1A main_etl_analyst_csv.ipynb
1A main_etl_analyst_csv_UPDATE.ipynb
1B main_etl_analyst_sql.ipynb
1B main_etl_analyst_sql_UPDATE.ipynb
1B main_etl_scientist_sql.ipynb
24 Jun popup window
2A main_csv_jobdesc_nlp_preproc.ipynb
2B Stemming code that didn't work.ipynb
2B main_sql_jobdesc_nlp_preproc.html
2B main_sql_jobdesc_nlp_preproc.ipynb
2B main_sql_jobdesc_nlp_topicmodeling.ipynb
3B main_sql_nlp_tfidf_modelling.ipynb
Data Analyst
Data Scientist
README.md
joblist.sqlite
main_etl_scientist_sql.py
main_jobdesc_eda.ipynb
results.csv
test_folder
test_folder2


In [4]:
def get_raw_data(directory):
    '''Open file containing html of job description and prepare soup object.'''
    fileList = []
    soupList = []
    # Iterate through each file in directory
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            # add each filename to list
            fileList.append(file)
            print(fileList)
            # open and load html
            with codecs.open(os.path.join(directory, file), 'r', "utf-8") as f:
                job_html = f.read()
                job_soup = BeautifulSoup(job_html, "html.parser")
                soupList.append(job_soup)
    return soupList

In [5]:
# Check to make sure all items are in list
#len(soupList)

Great. We are able to open each of the .txt files that are in our directory of interest.
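
As an alternative to `os.listdir` with `endswith`, the same file discovery can be sketched with `pathlib`. The helper name `list_text_files` is hypothetical and not part of this notebook, and sorting the result is an added assumption to make the file order reproducible.

```python
from pathlib import Path


def list_text_files(directory):
    """Return the .txt files directly inside `directory`, sorted by name."""
    # glob("*.txt") does not descend into subdirectories, like os.listdir;
    # is_file() skips any directory whose name happens to end in .txt
    return sorted(p for p in Path(directory).glob("*.txt") if p.is_file())
```

Comparing `len(list_text_files(directory))` with `len(get_raw_data(directory))` is one way to check that every file was parsed.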

<h2>Part 2: Opening and extracting information from files</h2>

First, we use two test files to confirm that we can pull out the information we want. Some companies have ratings available and some do not, which changes the HTML slightly and caused some extraction problems. Below is the result from one of the two test files.

"24 Jun popup window/Untitled 14-52-48.txt"

Untitled 14-42-55.txt - hourly rate

Untitled 15-6-7.txt - no $

Untitled 15-7-2.txt - no $

Untitled 15-23-34.txt - uses h1 tag, company: try1 correct, try2 incorrect, no $

In [17]:
with codecs.open("24 Jun popup window/Untitled 14-42-55.txt", 'r', "utf-8") as f:
    job_html = f.read()
job_soup = BeautifulSoup(job_html, "html.parser")

#print(job_soup)

In [18]:
# job_title
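try:
    job_title = job_soup.find("div", id="vjs-jobtitle").text.strip()
    print('Try 1: ', job_title)
except AttributeError:
    # find() returned None: no <div> with this id
    pass

try:
    job_title = job_soup.find("h1", id="vjs-jobtitle").text.strip()
    print('Try 2: ', job_title)
except AttributeError:
    # find() returned None: no <h1> with this id
    pass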

Try 1:  Developer (Machine Learning)


The code above worked for only some of the job listings (many of which seem to have a hyperlink).
Next, try another way to extract the company information for the job descriptions where NaN appeared.
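
The try-one-tag-then-another pattern used here can be sketched as a single helper that walks a list of candidate (tag, id) pairs and returns the first hit. The name `first_text` is hypothetical; the function only assumes that `soup` offers `find(name, id=...)` and that the result offers `get_text()`, as a BeautifulSoup object does.

```python
def first_text(soup, selectors, default="NaN"):
    """Return the stripped text of the first (tag, id) pair found in `soup`.

    `selectors` is a list of (tag_name, element_id) tuples, tried in order.
    `default` is returned when none of them match.
    """
    for tag_name, element_id in selectors:
        element = soup.find(tag_name, id=element_id)
        if element is not None:
            return element.get_text().strip()
    return default
```

For the job title this would be called as `first_text(job_soup, [("div", "vjs-jobtitle"), ("h1", "vjs-jobtitle")])`, and further fallbacks can be added to the list without nesting more `try` blocks.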

In [19]:
# company
try:
    company = job_soup.find("span", id="vjs-cn").text.strip()
    print('try 1: ', company)
except AttributeError:
    # find() returned None: no <span> with this id
    pass

try 1:  University Health Network


In [20]:
job_location = job_soup.find("span", id="vjs-loc").text.strip().replace("- ", "")
print(job_location)

Toronto, ON


In [32]:
# Salary
# In this version of Indeed's HTML, there is no obvious dedicated tag for salary
# Try to find any element that contains a salary, then extract its text


# extract hourly rate
try:
    job_salary = job_soup.find("span", id="vjs-loc").next_sibling.text.strip()
except AttributeError:
    job_salary = "NaN"
print(job_salary)

NaN
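
Since the sibling lookup returns NaN here, one possible fallback is to search the visible text for a dollar amount. The sketch below is an assumption about what salary strings look like (for example "$25 - $30 an hour" or "$60,000 a year"), not a pattern confirmed from the postings; `SALARY_RE` and `find_salary` are hypothetical names.

```python
import re

# One dollar amount such as $25, $18.50 or $60,000
_AMOUNT = r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?"

SALARY_RE = re.compile(
    _AMOUNT
    + r"(?:\s*(?:-|to)\s*" + _AMOUNT + r")?"            # optional upper bound
    + r"(?:\s*(?:an hour|a year|per hour|per year))?",   # optional pay period
    re.IGNORECASE,
)


def find_salary(text, default="NaN"):
    """Return the first salary-like substring in `text`, or `default`."""
    match = SALARY_RE.search(text)
    return match.group(0).strip() if match else default
```

It could be applied to the header text, for instance `find_salary(job_soup.get_text(" "))`. Postings without a "$" sign would still come back as NaN.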


In [24]:
for element in job_soup.find("div").next_elements:
    print(repr(element))

<div id="vjs-header-jobinfo"><div id="vjs-cmL"><img alt="University Health Network logo" id="vjs-img-cmL" role="presentation" src="https://d2q79iu7y748jz.cloudfront.net/s/_logo/a27154ab749ae71b9b40e6c6fa01ea51"/></div><div id="vjs-jobinfo">
<div id="vjs-jobtitle" tabindex="0">Developer (Machine Learning)</div>
<div>
<span id="vjs-cn"><a href="/cmp/University-Health-Network" onmousedown="this.href = appendParamsOnce(this.href, 'campaignid\x3d2pane-name\x26from\x3d2pane\x26fromjk\x3d7a2cdb3a10781934\x26jcid\x3d113b7be91d4edc65')" rel="nofollow noopener" target="_blank">University Health Network</a></span>
<a class="turnstileLink slNoUnderline" data-tn-element="reviewStars" data-tn-variant="113b7be91d4edc65" href="/cmp/University-Health-Network/reviews" onmousedown="this.href = appendParamsOnce(this.href, 'campaignid\x3d2pane-review\x26from\x3d2pane\x26fromjk\x3d7a2cdb3a10781934\x26jcid\x3d113b7be91d4edc65\x26jt\x3dDeveloper+%28Machine+Learning%29')" target="_blank"><span class="ratings">

In [21]:
print(job_soup.prettify())

<div class="vjs-header-with-company-logo branding--hi vjs-header-no-shadow" id="vjs-header">
 <div id="vjs-header-jobinfo">
  <div id="vjs-cmL">
   <img alt="University Health Network logo" id="vjs-img-cmL" role="presentation" src="https://d2q79iu7y748jz.cloudfront.net/s/_logo/a27154ab749ae71b9b40e6c6fa01ea51"/>
  </div>
  <div id="vjs-jobinfo">
   <div id="vjs-jobtitle" tabindex="0">
    Developer (Machine Learning)
   </div>
   <div>
    <span id="vjs-cn">
     <a href="/cmp/University-Health-Network" onmousedown="this.href = appendParamsOnce(this.href, 'campaignid\x3d2pane-name\x26from\x3d2pane\x26fromjk\x3d7a2cdb3a10781934\x26jcid\x3d113b7be91d4edc65')" rel="nofollow noopener" target="_blank">
      University Health Network
     </a>
    </span>
    <a class="turnstileLink slNoUnderline" data-tn-element="reviewStars" data-tn-variant="113b7be91d4edc65" href="/cmp/University-Health-Network/reviews" onmousedown="this.href = appendParamsOnce(this.href, 'campaignid\x3d2pane-review\x26f

In [33]:
try:
    job_description = job_soup.find("div", id="vsj-desc").text.strip().replace("\n", " ")
    print('try 1: ', job_description)
except AttributeError:
    # find() returned None: no <div> with this id
    pass

try:
    job_description = job_soup.find("div", id="vjs-content").text.strip().replace("\n", " ")
    print("try 2: ",job_description)
except AttributeError:
    # find() returned None: no <div> with this id
    pass

try 2:  What is the opportunity? Corporate Systems helps RBC functions & businesses achieve business objectives through app development & technology support. We’re also the centre of excellence for employee social collaboration & mobile apps. We’re building a team that embraces innovation & enthusiasm to bring fresh perspectives. We’ve been on a journey to build out high performing, highly resilient technology platforms that can grow with the continuous demands from Group Risk, Human Resources, Chief Administrative Office & Audit, Capital Markets, P&CB and Wealth. We’re looking for talented and passionate technologists to join our team! With an engineering mind-set you will work as part of an agile team to deliver high performing applications built on a cloud platform and streaming technologies. We believe in continuous growth and expanding your capabilities. Join our team today and have a big impact influencing the strength of our advanced insight and analytics. What will you do? Deve

In [None]:
# Write a function to automatically determine whether the label should be 0 or 1 based on the extracted job title

def get_label(job_title):
    if 'cientist' in job_title:
        label = '1'
    else:
        label = '0' #analyst
    return label

job_title1 = "Data Scientist"
job_title2 = "Data Analyst"

label = get_label(job_title1)
print("job_title1 label: ", label)

label1 = get_label(job_title2)
print("job_title2 label: ", label1)

In [None]:
job_record = {'jobtitle': job_title,
              'company': company,
              'location': job_location,
              'salary': job_salary,
              'jobdescription': job_description,
              'label': 1
             }
print(job_record)
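
The record above sets the label by hand. A minimal sketch of deriving it from the title instead is shown below; `label_from_title` and `build_job_record` are hypothetical names, and unlike `get_label` the label is returned as an integer to match the `1` used in the record.

```python
def label_from_title(job_title):
    """Return 1 for data scientist titles, 0 otherwise (treated as analyst)."""
    # lower() makes the check independent of capitalisation
    return 1 if "scientist" in job_title.lower() else 0


def build_job_record(job_title, company, location, salary, description):
    """Assemble one job record with the label derived from the title."""
    return {
        "jobtitle": job_title,
        "company": company,
        "location": location,
        "salary": salary,
        "jobdescription": description,
        "label": label_from_title(job_title),
    }
```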

<h2>Part 3: Put it all together</h2>

Put all the steps together so that we can easily extract job information from each text file and keep a record of which files we have opened.

In [52]:
# Works!
import os
import codecs
from bs4 import BeautifulSoup
import csv

def get_raw_data(directory):
    '''Open file containing html of job description and prepare soup object.'''
    fileList = []
    soupList = []
    # Iterate through each file in directory
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            # add each filename to list
            fileList.append(file)
            #print(fileList)
            # open and load html
            with codecs.open(os.path.join(directory, file), 'r', "utf-8") as f:
                job_html = f.read()
                job_soup = BeautifulSoup(job_html, "html.parser")
                soupList.append(job_soup)
    print("soup_list is done.")
    return soupList

# From the loaded text, extract job information using beautiful soup
def get_job_record(job_soup):
    '''Create a record of information for one job.'''
    # Title
    try:
        job_title = job_soup.find("div", id="vjs-jobtitle").text.strip()
    except AttributeError:
        # no <div> with this id; some postings use an <h1> instead
        try:
            job_title = job_soup.find("h1", id="vjs-jobtitle").text.strip()
        except AttributeError:
            job_title = "NaN"

    # Company
    try:
        company = job_soup.find("span", id="vjs-cn").text.strip()
    except AttributeError:
        company = "NaN"

    # Location
    try:
        job_location = job_soup.find("span", id="vjs-loc").text.strip().replace("- ", "")
    except AttributeError:
        job_location = "NaN"
    
    # Job Description
    try:
        job_description = job_soup.find("div", id="vsj-desc").text.strip().replace("\n", " ")
    except AttributeError:
        try:
            job_description = job_soup.find("div", id="vjs-content").text.strip().replace("\n", " ")
        except AttributeError:
            # neither container was found
            job_description = "NaN"
    
    job_record = [job_title, company, job_location, job_description]
    return job_record

def main_etl(directory):
    '''This function loads text data, extracts pertinent job information, and saves data in a csv file.'''
    #while True:
    soupList = get_raw_data(directory)
        
    # add each job record to a list
    job_records = []
    for job_soup in soupList:
        job_record = get_job_record(job_soup)
        job_records.append(job_record)
    
    print("Added items to job_records list. Length of job_records is: ", len(job_records))

    # add job records to csv by row
    with open('results.csv', 'w', newline = '', encoding = 'utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Job Title', 'Company', 'Location', 'Job Description'])
        writer.writerows(job_records)
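
The goal stated for this part includes keeping a record of which files were opened, but the CSV written above has no filename column. A minimal sketch of one way to add it uses `csv.DictWriter` with an extra field; `FIELDNAMES`, `write_job_records`, and the "Source File" column are all assumptions, not part of the notebook.

```python
import csv

FIELDNAMES = ["Source File", "Job Title", "Company", "Location", "Job Description"]


def write_job_records(rows, out_file):
    """Write job records (dicts keyed by FIELDNAMES) to an open text file."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

The function takes an already opened file object, so it would be called inside `with open('results.csv', 'w', newline='', encoding='utf-8') as f:` in the same way as the `csv.writer` code.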

Let's test the functions on another folder containing files with job descriptions in HTML format.

In [44]:
print(os.getcwd())

/Users/jennifer/nlp-jobmarket


In [45]:
!ls

1A main_etl_analyst_csv.ipynb
1A main_etl_analyst_csv_UPDATE.ipynb
1B main_etl_analyst_sql.ipynb
1B main_etl_analyst_sql_UPDATE.ipynb
1B main_etl_scientist_sql.ipynb
24 Jun popup window
2A main_csv_jobdesc_nlp_preproc.ipynb
2B Stemming code that didn't work.ipynb
2B main_sql_jobdesc_nlp_preproc.html
2B main_sql_jobdesc_nlp_preproc.ipynb
2B main_sql_jobdesc_nlp_topicmodeling.ipynb
3B main_sql_nlp_tfidf_modelling.ipynb
Data Analyst
Data Scientist
README.md
joblist.sqlite
main_etl_scientist_sql.py
main_jobdesc_eda.ipynb
results.csv
test_folder
test_folder2


In [53]:
%%timeit
dataAnalyst = main_etl("24 Jun popup window")

soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
soup_list is done.
Added items to job_records list. Length of job_records is:  75
205 ms ± 928 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
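
`%%timeit` executes the cell repeatedly, which is why the status messages are printed several times and `results.csv` is rewritten on every run. When one timed run is enough, a sketch like the following measures a single call; `time_once` is a hypothetical helper, not part of the notebook.

```python
import time


def time_once(func, *args, **kwargs):
    """Call `func` once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `_, seconds = time_once(main_etl, "24 Jun popup window")` would run the ETL once and report how long it took.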
