<h1>Job Market Trends</h1>
<h2>Extract, Transform, and Load Data</h2>

Data Analyst vs. Data Scientist job postings

In [2]:
import os
import codecs
from bs4 import BeautifulSoup
import csv

<h2>Part 1: Access data files within a Directory</h2>

The job postings are stored as files within a directory, so we write a function that iterates through the directory and opens each file.

In [3]:
# first check that we are in the correct directory
print(os.getcwd())

/Users/jennifer/nlp-jobmarket


In [4]:
# print a list of the files in the working directory
!ls

1A main_etl_analyst_csv.ipynb
1A main_etl_analyst_csv_UPDATE.ipynb
1B main_etl_analyst_sql.ipynb
1B main_etl_analyst_sql_UPDATE.ipynb
1B main_etl_scientist_sql.ipynb
24 Jun popup window
2A main_csv_jobdesc_nlp_preproc.ipynb
2B Stemming code that didn't work.ipynb
2B main_sql_jobdesc_nlp_preproc.html
2B main_sql_jobdesc_nlp_preproc.ipynb
2B main_sql_jobdesc_nlp_topicmodeling.ipynb
3B main_sql_nlp_tfidf_modelling.ipynb
Data Analyst
Data Scientist
README.md
joblist.sqlite
main_etl_scientist_sql.py
main_jobdesc_eda.ipynb
results.csv
test_folder
test_folder2


In [5]:
def get_raw_data(directory):
    '''Open file containing html of job description and prepare soup object.'''
    fileList = []
    soupList = []
    # Iterate through each file in directory
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            # add each filename to list
            fileList.append(file)
            # print(fileList)  # debug: uncomment to watch the list grow
            # open and load html
            with codecs.open(os.path.join(directory, file), 'r', "utf-8") as f:
                job_html = f.read()
                job_soup = BeautifulSoup(job_html, "html.parser")
                soupList.append(job_soup)
    return soupList

In [6]:
# Check to make sure all items are in list
#len(soupList)

Great. We are able to open each of the .txt files that are in our directory of interest.
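
The check that every file was picked up can be made explicit by comparing counts. The following is a minimal sketch, not part of the original notebook: `list_txt_files` is a hypothetical helper that mirrors the `.txt` filter in `get_raw_data`, and the demo runs on a throwaway temporary directory with made-up file names rather than on the real job folders.

```python
import os
import tempfile

def list_txt_files(directory):
    """Return the sorted names of the .txt files directly inside directory."""
    # Same filter as get_raw_data: keep only names ending in ".txt".
    return sorted(name for name in os.listdir(directory) if name.endswith(".txt"))

# Demo on a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["Untitled 1.txt", "Untitled 2.txt", "notes.html"]:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write("<div></div>")
    demo_files = list_txt_files(demo_dir)

print(demo_files)
```

For a real folder, `len(list_txt_files(directory))` should equal `len(get_raw_data(directory))` if every `.txt` file was parsed.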

<h2>Part 2: Opening and extracting information from files</h2>

First, we use two test files to confirm that we can pull out the information we want. Some companies have ratings available and some do not, which changes the HTML slightly and caused some extraction problems. Below is the result from one of the two test files.

"24 Jun popup window/Untitled 14-52-48.txt"

Untitled 14-42-55.txt - hourly rate

Untitled 15-6-7.txt - no $

Untitled 15-7-2.txt - no $

Untitled 15-23-34.txt - uses h1 tag, company: try1 correct, try2 incorrect, no $

Untitled 14-22-8.txt Canadian Tire

Untitled 14-46-33.txt Antuit AI

In [16]:
with codecs.open("24 Jun popup window/Untitled 14-46-33.txt", 'r', "utf-8") as f:
    job_html = f.read()
job_soup = BeautifulSoup(job_html, "html.parser")

#print(job_soup)

In [17]:
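# job_title: postings use either a div or an h1 for the title
try:
    job_title = job_soup.find("div", id="vjs-jobtitle").text.strip()
    print('Try 1: ', job_title)
except AttributeError:  # find() returned None, so the div is absent
    pass

try:
    job_title = job_soup.find("h1", id="vjs-jobtitle").text.strip()
    print('Try 2: ', job_title)
except AttributeError:  # find() returned None, so the h1 is absent
    pass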

Try 1:  Lead Data Scientist


The code above worked for only some of the job listings (many of which seem to have a hyperlink).
For the listings where NaN appeared, we need another way to extract the company information from the job description.
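
The "try one selector, then fall back to another" pattern can also be written without nested `try` blocks. This is a minimal sketch, not the notebook's own method: `first_text` is a hypothetical helper, and the one-line HTML string is a made-up example of a posting that uses an `h1` title and has no company span.

```python
from bs4 import BeautifulSoup

def first_text(job_soup, candidates, default="NaN"):
    """Return the stripped text of the first (tag, id) pair found, else default."""
    for tag_name, tag_id in candidates:
        tag = job_soup.find(tag_name, id=tag_id)
        if tag is not None:
            return tag.text.strip()
    return default

# Made-up posting: title in an h1, company span missing.
demo_soup = BeautifulSoup('<h1 id="vjs-jobtitle"> Data Analyst </h1>', "html.parser")

demo_title = first_text(demo_soup, [("div", "vjs-jobtitle"), ("h1", "vjs-jobtitle")])
demo_company = first_text(demo_soup, [("span", "vjs-cn")])
print(demo_title, "|", demo_company)
```

Adding a new fallback then only means appending another `(tag, id)` pair to the candidate list.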

In [18]:
# company
try:
    company = job_soup.find("span", id="vjs-cn").text.strip()
    print('try 1: ', company)
except AttributeError:  # find() returned None, so the span is absent
    pass

try 1:  Antuit.ai


In [19]:
job_location = job_soup.find("span", id="vjs-loc").text.strip().replace("- ", "")
print(job_location)

Toronto, ON


In [20]:
print(job_soup.find_all("span", attrs = {"id": None, "class": None, "aria-hidden": None}))

[<span>Full-time, Permanent</span>]


In [21]:
# Salary - extract hourly rate

try:
    job_salary = job_soup.find("span", attrs = {"id": None, "class": None, "aria-hidden": None}).text.strip()
except AttributeError:
    job_salary = "NaN"
print(job_salary)

Full-time, Permanent


Because the salary tag has no identifying attributes, we select it by stating which attributes it must not have (no id, class, or aria-hidden) rather than which it has. This is why, for job postings that list no salary, the salary doesn't appear and the first matching span is returned instead, such as "Full-time, Permanent" above.
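
The effect of matching on absent attributes can be reproduced on a small snippet. This is a sketch under assumptions: `SAMPLE_HTML` is a trimmed, hand-written copy of the header markup from this posting, and `get_salary` is a hypothetical helper that keeps the text only when it contains a `$`, which is a heuristic rather than something the notebook implements.

```python
from bs4 import BeautifulSoup

# Hand-written snippet modelled on the posting header (no salary listed).
SAMPLE_HTML = """
<div>
  <span id="vjs-cn">Antuit.ai</span>
  <span id="vjs-loc"><span aria-hidden="true">- </span>Toronto, ON</span>
</div>
<div>
  <span>Full-time, Permanent</span>
  <span aria-hidden="true" class="remote-bullet">-</span>
  <span class="remote">Remote</span>
</div>
"""

NO_ATTRS = {"id": None, "class": None, "aria-hidden": None}

def get_salary(job_soup):
    """Return the bare span's text only if it contains '$' (assumed heuristic)."""
    span = job_soup.find("span", attrs=NO_ATTRS)
    if span is None:
        return "NaN"
    text = span.text.strip()
    return text if "$" in text else "NaN"

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
bare_span = sample_soup.find("span", attrs=NO_ATTRS)
print(bare_span.text.strip())   # the job type, not a salary
print(get_salary(sample_soup))  # falls back to "NaN"
```

A value of `None` in `attrs` matches only tags that lack that attribute, so the first span with no id, class, or aria-hidden wins regardless of what it contains.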

In [22]:
print(job_soup.prettify())

<div class="vjs-header-no-shadow" id="vjs-header">
 <div id="vjs-header-jobinfo">
  <div id="vjs-jobinfo">
   <div id="vjs-jobtitle" tabindex="0">
    Lead Data Scientist
   </div>
   <div>
    <span id="vjs-cn">
     Antuit.ai
    </span>
    <span id="vjs-loc">
     <span aria-hidden="true">
      -
     </span>
     Toronto, ON
    </span>
   </div>
   <div>
    <span>
     Full-time, Permanent
    </span>
    <span aria-hidden="true" class="remote-bullet">
     -
    </span>
    <span class="remote">
     Remote
    </span>
   </div>
  </div>
 </div>
 <div id="vjs-x">
  <button aria-label="Close job details" class="icl-CloseButton vjs-x-button-close">
   <svg class="icl-Icon icl-Icon--md icl-Icon--black close" role="img" viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg">
    <rect fill="#fff" fill-opacity=".5" height="24" rx="12" width="24">
    </rect>
    <path clip-rule="evenodd" d="m15.536 7.8987c-0.1953-0.19526-0.5119-0.19526-0.7071 0l-2.8284 2.8284-2.8285-2.8284c-0.19526

In [13]:
try:
    job_description = job_soup.find("div", id="vsj-desc").text.strip().replace("\n", " ")
    print('try 1: ', job_description)
except AttributeError:  # find() returned None, so this div is absent
    pass

try:
    job_description = job_soup.find("div", id="vjs-content").text.strip().replace("\n", " ")
    print("try 2: ",job_description)
except AttributeError:  # find() returned None, so this div is absent
    pass

try 2:  The CTC Personalization & Customer Analytics team is the central hub for engaging consumers with exciting and inspirational loyalty and product offers through better use of customer data, driving incremental sales and profit. The Promo Analytics and Operations team is accountable for creating a high performance, cross-banner, silo-free source of all customer data enabling quick and efficient customer insights, audience creation, customer journey analyses and advanced customer modelling. The Data Scientist role provides technical leadership in all facets of the project, from selecting key customer features and interactions to ensuring accurate data blending to deriving new customer attributes through descriptive and predictive analytics through a deep understanding of applying the data science lifecycle to customer modelling. The Data Scientist will be the subject matter expert in combining customer-related data sources, collaborating with teams that produce customer insights, d

In [23]:
# Write a function to automatically determine whether the label should be 0 or 1 based on the extracted job title

def get_label(job_title):
    if 'cientist' in job_title:
        label = '1'
    else:
        label = '0' #analyst
    return label

job_title1 = "Data Scientist"
job_title2 = "Data Analyst"

label = get_label(job_title1)
print("job_title1 label: ", label)

label1 = get_label(job_title2)
print("job_title2 label: ", label1)

job_title1 label:  1
job_title2 label:  0


In [26]:
job_record = [label, job_title, company, job_location, job_salary, job_description]
print(job_record)

['1', 'Lead Data Scientist', 'Antuit.ai', 'Toronto, ON', 'Full-time,\xa0Permanent', 'The CTC Personalization & Customer Analytics team is the central hub for engaging consumers with exciting and inspirational loyalty and product offers through better use of customer data, driving incremental sales and profit. The Promo Analytics and Operations team is accountable for creating a high performance, cross-banner, silo-free source of all customer data enabling quick and efficient customer insights, audience creation, customer journey analyses and advanced customer modelling. The Data Scientist role provides technical leadership in all facets of the project, from selecting key customer features and interactions to ensuring accurate data blending to deriving new customer attributes through descriptive and predictive analytics through a deep understanding of applying the data science lifecycle to customer modelling. The Data Scientist will be the subject matter expert in combining customer-rel

<h2>Part 3: Put it all together</h2>

Put all the steps together so that we can easily extract job information from each text file and keep a record of which files we have opened.

In [71]:
# Works!
import os
import codecs
from bs4 import BeautifulSoup
import csv

def get_raw_data(directory):
    '''Open file containing html of job description and prepare soup object.'''
    fileList = []
    soupList = []
    # Iterate through each file in directory
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            # add each filename to list
            fileList.append(file)
            #print(fileList)
            # open and load html
            with codecs.open(os.path.join(directory, file), 'r', "utf-8") as f:
                job_html = f.read()
                job_soup = BeautifulSoup(job_html, "html.parser")
                soupList.append(job_soup)
    print("soup_list is done.")
    return soupList

# From the loaded text, extract job information using beautiful soup
def get_job_record(job_soup):
    '''Create a record of information for one job.'''
    # Title: some postings use a div, others an h1
    try:
        job_title = job_soup.find("div", id="vjs-jobtitle").text.strip()
    except AttributeError:
        try:
            job_title = job_soup.find("h1", id="vjs-jobtitle").text.strip()
        except AttributeError:
            job_title = "NaN"

    # Label: '1' for scientist, '0' for analyst (get_label is defined in Part 2)
    label = get_label(job_title)

    # Company
    try:
        company = job_soup.find("span", id="vjs-cn").text.strip()
    except AttributeError:
        company = "NaN"

    # Location
    try:
        job_location = job_soup.find("span", id="vjs-loc").text.strip().replace("- ", "")
    except AttributeError:
        job_location = "NaN"

    # Job Salary: first span with no id, class, or aria-hidden attribute
    try:
        job_salary = job_soup.find("span", attrs = {"id": None, "class": None, "aria-hidden": None}).text.strip()
    except AttributeError:
        job_salary = "NaN"

    # Job Description
    try:
        job_description = job_soup.find("div", id="vsj-desc").text.strip().replace("\n", " ")
    except AttributeError:
        try:
            job_description = job_soup.find("div", id="vjs-content").text.strip().replace("\n", " ")
        except AttributeError:
            job_description = "NaN"

    job_record = [label, job_title, company, job_location, job_salary, job_description]
    return job_record

def main_etl(directory):
    '''This function loads text data, extracts pertinent job information, and saves data in a csv file.'''
    #while True:
    soupList = get_raw_data(directory)
        
    # add each job record to a list
    job_records = []
    for job_soup in soupList:
        job_record = get_job_record(job_soup)
        job_records.append(job_record)
    
    print("Added items to job_records list. Length of job_records is: ", len(job_records))

    # add job records to csv by row
    with open('results.csv', 'w', newline = '', encoding = 'utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Label', 'Job Title', 'Company', 'Location', 'Salary', 'Job Description'])
        writer.writerows(job_records)

    # return the records so the caller can inspect them
    return job_records

Let's test the pipeline on another folder containing job descriptions in HTML format.

In [72]:
print(os.getcwd())

/Users/jennifer/nlp-jobmarket


In [73]:
!ls

1A main_etl_analyst_csv.ipynb
1A main_etl_analyst_csv_UPDATE.ipynb
1B main_etl_analyst_sql.ipynb
1B main_etl_analyst_sql_UPDATE.ipynb
1B main_etl_scientist_sql.ipynb
24 Jun popup window
2A main_csv_jobdesc_nlp_preproc.ipynb
2B Stemming code that didn't work.ipynb
2B main_sql_jobdesc_nlp_preproc.html
2B main_sql_jobdesc_nlp_preproc.ipynb
2B main_sql_jobdesc_nlp_topicmodeling.ipynb
3B main_sql_nlp_tfidf_modelling.ipynb
Data Analyst
Data Scientist
README.md
joblist.sqlite
main_etl_scientist_sql.py
main_jobdesc_eda.ipynb
results.csv
test_folder
test_folder2


In [74]:
dataAnalyst = main_etl("24 Jun popup window")

soup_list is done.
Added items to job_records list. Length of job_records is:  75
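
One way to confirm the export is to read the CSV back and compare the row count with the length printed above. A minimal sketch under assumptions: `read_job_records` is a hypothetical helper, and the demo writes a one-row sample file (made-up values, same header as `main_etl`) instead of touching the real `results.csv`.

```python
import csv
import os
import tempfile

HEADER = ['Label', 'Job Title', 'Company', 'Location', 'Salary', 'Job Description']

def read_job_records(csv_path):
    """Load a CSV written with the main_etl header into a list of dicts."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))

# Demo with a made-up row in a temporary file.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "sample.csv")
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerow(['1', 'Data Scientist', 'Example Co', 'Toronto, ON', 'NaN', 'Sample text'])
    rows = read_job_records(demo_path)

print(len(rows), rows[0]['Job Title'])
```

On the real output, `len(read_job_records('results.csv'))` should match the count reported by `main_etl`.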
