# Jobs Classification

Given *onet.csv* and *alternate-titles.tsv* as template documents and *jobs.json* as the items to classify, perform a classification of jobs.

The result should be *jobs-answers.csv* with the fields "title", "jobdesc", "code", and "matched_title", where "title" and "jobdesc" come from *jobs.json*, and "code" and "matched_title" are the code and title of the best-matching template from *onet.csv*.

Any technique is allowed, though utilizing some techniques from NLP domain is preferred.

In [30]:
import pandas as pd
onet = pd.read_csv("onet.csv")
onet.head()

Unnamed: 0,code,title,uri,description,parents
0,11-1011.00,Chief Executives,https://www.onetonline.org/link/summary/11-101...,Determine and formulate policies and provide o...,"[""Management Occupations"",""Top Executives"",""Ch..."
1,11-1011.03,Chief Sustainability Officers,https://www.onetonline.org/link/summary/11-101...,"Communicate and coordinate with management, sh...","[""Management Occupations"",""Top Executives"",""Ch..."
2,11-1021.00,General and Operations Managers,https://www.onetonline.org/link/summary/11-102...,"Plan, direct, or coordinate the operations of ...","[""Management Occupations"",""Top Executives"",""Ge..."
3,11-1031.00,Legislators,https://www.onetonline.org/link/summary/11-103...,"Develop, introduce, or enact laws and statutes...","[""Management Occupations"",""Top Executives"",""Le..."
4,11-2011.00,Advertising and Promotions Managers,https://www.onetonline.org/link/summary/11-201...,"Plan, direct, or coordinate advertising polici...","[""Management Occupations"",""Advertising, Market..."


In [31]:
onet['parents'].iloc[1]

'["Management Occupations","Top Executives","Chief Executives","Chief Executives"]'

In [32]:
onet.columns

Index(['code', 'title', 'uri', 'description', 'parents'], dtype='object')

In [33]:
import json
json.loads(onet.iloc[0].parents)

['Management Occupations',
 'Top Executives',
 'Chief Executives',
 'Chief Executives']

In [34]:
onet.iloc[0].description

"Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers. Tasks - Direct or coordinate an organization's financial or budget activities to fund operations, maximize investments, or increase efficiency. - Confer with board members, organization officials, or staff members to discuss issues, coordinate activities, or resolve problems. - Analyze operations to evaluate performance of a company or its staff in meeting objectives or to determine areas of potential cost reduction, program improvement, or policy change. - Direct, plan, or implement policies, objectives, or activities of organizations or businesses to ensure continuing operations, to maximize returns on investments, or to increase productivity. - Pr

In [35]:
alternate_titles = pd.read_csv("alternate-titles.tsv", delimiter="\t")
alternate_titles.head()

Unnamed: 0,code,alternate_title
0,11-1011.00,"[""Aeronautics Commission Director"",""Agricultur..."
1,11-1011.03,"[""Chief Environmental Commitment Officer (CECO..."
2,11-1021.00,"[""Building Manager"",""Business Manager"",""Chief ..."
3,11-1031.00,"[""Alderman"",""Assembly Member"",""Assembly Person..."
4,11-2011.00,"[""Account Director"",""Account Executive"",""Accou..."


In [36]:
alternate_titles.columns

Index(['code', 'alternate_title'], dtype='object')

In [88]:
jobs = pd.read_json("jobs.json")#.head(50)
jobs.head(50)

Unnamed: 0,_id,_score,jobdesc,title
0,7dab585f01fd,,We are currently seeking a Warehouse Associate...,Warehouse Associate
1,f71bc73edb39,,"For 70 years, Charles River employees have wor...",Supply Chain Associate II
2,c49152708aa9,,Looking for an individual who will be responsi...,Azure DevOps Engineer
3,d1b402364b59,,Salesforce Administrator: REMOTE – C-4 Analyti...,Salesforce Administrator: REMOTE
4,495cc0d7c26d,,DescriptionHCR ManorCare provides a range of s...,Floor Care
5,6a18532d3ac9,,It's time to take control of your career! When...,FT/PT LPN - $34+/hr Plus Bonuses!
6,8365d8230c6f,,Per Assessment Nurse Practitioner House Calls ...,Nurse Practitioner
7,4a01dbf371b2,,Acara Solutions is seeking experienced electri...,Panel Wirers - FlexCell
8,8a83545f84be,,Great Clips salon owners are hiring hair styli...,Hair Stylist - Shoppers Plaza
9,a81ef50e5420,,Job DescriptionAmerican Homes 4 RentAs one of ...,Systems Manager III - Lead (Remote)


In [68]:
jobs.columns

Index(['_id', '_score', 'jobdesc', 'title'], dtype='object')

In [69]:
jobs.tail(50)

Unnamed: 0,_id,_score,jobdesc,title
1000,6221598c6ce2,,&lt;h3&gt;Ready to be a Titan?&lt;/h3&gt; &lt;...,Director of Engineering - Remote
1001,d4dd360db299,,"Shipping & Receiving Clerk, Portland, ORBONUS*...",Shipping and Receiving Clerk
1002,f98e107e6030,,CDL A Truck Drivers Needed!We offer 100% paid ...,Class A CDL Truck Driver
1003,3e9813a20e08,,OverviewJoin ECS as a Virtual Customer Support...,Customer Service Representatives to Work from ...
1004,dc0742bdbb87,,Job DescriptionWerner is Hiring CDL-A Company ...,"Need CDL-A Truck Driver Now, 03/08/2022, Multi..."
1005,82b287a4395b,,Job DescriptionRegistered Nurse - Immediate an...,Registered Nurse - RN - Long Term Acute Care (...
1006,4f1912339687,,"Job DescriptionABOUT Effective, secure communi...",Information Professional
1007,894cfc879493,,Reports To (Title): Market LeaderDepartment: F...,Restaurant Manager
1008,409eac16231d,,We are hiring dedicated sales professionals to...,Remote Sales Associate
1009,7379a0887924,,"Start an exciting career, helping seniors. We'...",Compassionate Caregivers | No Experience Needed


In [70]:
jobs.iloc[42].jobdesc

'Real Estate Agent RecruiterPalo Alto, CAReal Estate Agent Recruiter Palo Alto CA    We\'re searching for a motivated Real Estate Agent Recruiter to join the Keller Williams Office in Palo Alto, CA. Are you an excellent communicator that can easily turn challenges into opportunities? Do you take pride in your ability to connect with people, and enjoy learning about their experiences and aspirations?     If you\'re passionate about helping others achieve their professional goals, then we\'re looking for someone like you.         We Know You     When it comes to recruiting, you\'re in a league of your own. You set the bar high for yourself, and you always strive to achieve more than you originally intended. You\'re comfortable chatting with people from all backgrounds and cultures and you can easily relate to anyone you meet. You\'re a sales pro, and one of your favorite things about real estate is that it can offer unlimited financial freedom. When the going gets tough, you\'re unstoppa

In [71]:
jobs.head()

Unnamed: 0,_id,_score,jobdesc,title
0,7dab585f01fd,,We are currently seeking a Warehouse Associate...,Warehouse Associate
1,f71bc73edb39,,"For 70 years, Charles River employees have wor...",Supply Chain Associate II
2,c49152708aa9,,Looking for an individual who will be responsi...,Azure DevOps Engineer
3,d1b402364b59,,Salesforce Administrator: REMOTE – C-4 Analyti...,Salesforce Administrator: REMOTE
4,495cc0d7c26d,,DescriptionHCR ManorCare provides a range of s...,Floor Care


In [72]:
alternate_titles.head()

Unnamed: 0,code,alternate_title
0,11-1011.00,"[""Aeronautics Commission Director"",""Agricultur..."
1,11-1011.03,"[""Chief Environmental Commitment Officer (CECO..."
2,11-1021.00,"[""Building Manager"",""Business Manager"",""Chief ..."
3,11-1031.00,"[""Alderman"",""Assembly Member"",""Assembly Person..."
4,11-2011.00,"[""Account Director"",""Account Executive"",""Accou..."


In [73]:
import ast
ast.literal_eval(alternate_titles["alternate_title"].iloc[0])  # parse the list literal without executing arbitrary code

['Aeronautics Commission Director',
 'Agricultural Services Director',
 'Arts and Humanities Council Director',
 'Bank President',
 'Bureau Chief',
 'Business Development Executive',
 'Business Executive',
 'Chief Administrative Officer',
 'Chief Diversity Officer (CDO)',
 'Chief Executive Officer (CEO)',
 'Chief Financial Officer (CFO)',
 'Chief Information Officer (CIO)',
 'Chief Innovation Officer',
 'Chief Nursing Officer',
 'Chief Operating Officer (COO)',
 'Chief Technical Officer (CTO)',
 'Chief Warden',
 'City Administrator',
 'City Manager',
 'City Superintendent',
 'Classification and Treatment Director',
 'Commissioner of Internal Revenue',
 'Consumer Affairs Director',
 'Corporate Executive',
 'Correctional Agency Director',
 'Council On Aging Director',
 'Deputy District Customs Director',
 'Deputy Insurance Commissioner',
 'Director of Technology',
 'Director of Vital Statistics',
 'District Customs Director',
 'Employment Research and Planning Director',
 'Employment Ser

# **Solution**

In this solution, I used word embeddings and cosine similarity to find the best match between each job title and the O*NET alternate titles. With pre-trained FastText word vectors, the comparison captures semantic similarity between titles rather than only literal string overlap.

Here are the main steps I followed:

1. Data loading

2. Word Embeddings and Cosine Similarity: I downloaded the pre-trained English FastText model (`cc.en.300.bin`), which represents each word as a 300-dimensional vector. A title is embedded with `get_sentence_vector`, which averages the vectors of its words. The function `computer_similarity` then computes the cosine similarity between the input job title and each title in a list of alternate titles, and returns the closest one with its score.

3. Job Title Matching: For each job title in the 'title' column of the jobs dataframe, I first checked whether the title appears verbatim in a row's 'alternate_title' string. This is a substring test on the raw string, so it also fires when the title is part of a longer alternate title. If it matched, I assigned that row's code, used the job title itself as the matched title, and stopped searching. Otherwise, I computed the similarity between the job title and every alternate title in the row, and kept the alternate title with the highest score across all rows as the best match, together with its code.

4. Saving Results: Finally, I added the 'code' and 'matched_title' columns to the jobs dataframe and wrote the 'title', 'jobdesc', 'code', and 'matched_title' columns to jobs-answers.csv.

By using word embeddings and cosine similarity, titles that are worded differently can still be matched: for example, 'Supply Chain Associate II' is matched to 'Supply Chain Specialist' (code 13-1081.02).
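
As a reminder of what the similarity score in step 2 measures, here is a minimal sketch of cosine similarity written directly with NumPy. The helper name `cosine` and the toy vectors are illustrative assumptions; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Toy 2-d vectors (not real embeddings): only the direction matters, not the length
print(cosine([1, 0], [2, 0]))   # same direction
print(cosine([1, 0], [0, 3]))   # orthogonal
print(cosine([1, 0], [-1, 0]))  # opposite direction
```

Because the score depends only on direction, a long and a short title that point the same way in embedding space are treated as equally close.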

In [None]:
#!pip install fasttext
import ast
import numpy as np
import fasttext.util
from sklearn.metrics.pairwise import cosine_similarity

In [76]:
# Load the pre-trained FastText word vectors
fasttext.util.download_model('en', if_exists='ignore')  # Download the English pre-trained model
fasttext_model = fasttext.load_model('cc.en.300.bin')

In [77]:
def computer_similarity(input_string, list_of_strings):
    # Generate embeddings for the input strings
    input_embedding = fasttext_model.get_sentence_vector(input_string)
    list_embeddings = np.array([fasttext_model.get_sentence_vector(s) for s in list_of_strings])

    # Reshape the input embedding to match the list_embeddings shape
    input_embedding = np.reshape(input_embedding, (1, -1))

    # Compute pairwise cosine similarity
    pairwise_similarity = cosine_similarity(input_embedding, list_embeddings)

    # Find the most similar string
    most_similar_index = np.argmax(pairwise_similarity)
    most_similar_string = list_of_strings[most_similar_index]
    similarity_score = pairwise_similarity[0][most_similar_index]
    return most_similar_string, similarity_score

computer_similarity("there is a village in india", ["indian village", "indian cities", "what about a village in india"])

('what about a village in india', 0.87691)

In [78]:
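# Drop rows without alternate titles; assigning a dropna() result back to the column would realign on the index and keep the NaNs
alternate_titles = alternate_titles.dropna(subset=['alternate_title'])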

In [79]:
code_list = []
matched_title = []

for input_string in jobs["title"]:
    # Best candidate found so far for this job title
    old_score = -0.00001
    old_matched_title = ''
    code = None

    for index, row in alternate_titles.iterrows():
        alternate_title = row["alternate_title"]
        # Skip rows whose alternate titles are missing (NaN)
        if not isinstance(alternate_title, str):
            continue

        # Shortcut: the job title occurs verbatim in the raw list string.
        # This is a substring test, so it also fires inside longer alternate titles.
        if input_string in alternate_title:
            code = row['code']
            old_matched_title = input_string
            break

        # Otherwise score the job title against every alternate title of this code
        list_of_strings = ast.literal_eval(alternate_title)
        most_similar_string, similarity_score = computer_similarity(input_string, list_of_strings)
        if similarity_score > old_score:
            old_score = similarity_score
            old_matched_title = most_similar_string
            code = row['code']

    code_list.append(code)
    matched_title.append(old_matched_title)
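
The loop above re-embeds every alternate title for every job title, which dominates the running time. A possible speed-up, sketched here under assumptions, is to embed all alternate titles once and score each job title with a single matrix product. The helper names `build_title_index`, `best_match`, and `toy_embed` are hypothetical; in the notebook, `embed` would be `fasttext_model.get_sentence_vector`. The toy bag-of-words embedding is only there to keep the sketch runnable without the FastText model.

```python
import numpy as np

def build_title_index(code_to_titles, embed):
    """Flatten {code: [titles]} and embed every title once, L2-normalised."""
    codes, titles = [], []
    for code, alt_titles in code_to_titles.items():
        for title in alt_titles:
            codes.append(code)
            titles.append(title)
    matrix = np.array([embed(t) for t in titles], dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / np.where(norms == 0, 1.0, norms)  # leave all-zero rows untouched
    return codes, titles, matrix

def best_match(query, codes, titles, matrix, embed):
    """Return (code, title, cosine score) of the closest indexed title."""
    q = np.asarray(embed(query), dtype=float)
    norm = np.linalg.norm(q)
    if norm > 0:
        q = q / norm
    scores = matrix @ q  # cosine similarity against every title at once
    i = int(np.argmax(scores))
    return codes[i], titles[i], float(scores[i])

# Toy stand-in for FastText: word counts over a tiny, made-up vocabulary
VOCAB = ["chief", "executive", "bank", "president", "assembly", "member"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

# Codes and alternate titles taken from the outputs shown earlier
code_to_titles = {
    "11-1011.00": ["Bank President", "Chief Executive Officer (CEO)"],
    "11-1031.00": ["Alderman", "Assembly Member"],
}

codes, titles, matrix = build_title_index(code_to_titles, toy_embed)
print(best_match("Assembly Member II", codes, titles, matrix, toy_embed))
```

With the index built once, each job title costs one embedding and one matrix-vector product instead of a full pass over `alternate_titles`.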


In [89]:
jobs['code'] = code_list
jobs['matched_title'] = matched_title

In [90]:
jobs.head()

Unnamed: 0,_id,_score,jobdesc,title,code,matched_title
0,7dab585f01fd,,We are currently seeking a Warehouse Associate...,Warehouse Associate,53-7062.00,Warehouse Associate
1,f71bc73edb39,,"For 70 years, Charles River employees have wor...",Supply Chain Associate II,13-1081.02,Supply Chain Specialist
2,c49152708aa9,,Looking for an individual who will be responsi...,Azure DevOps Engineer,15-1252.00,DevOps Engineer
3,d1b402364b59,,Salesforce Administrator: REMOTE – C-4 Analyti...,Salesforce Administrator: REMOTE,15-1299.05,IT Administrator (Information Technology Admin...
4,495cc0d7c26d,,DescriptionHCR ManorCare provides a range of s...,Floor Care,37-2011.00,Floor Care


In [91]:
jobs["jobdesc"].iloc[4]

"DescriptionHCR ManorCare provides a range of services, including skilled nursing care, assisted living, post-acute medical and rehabilitation care, hospice care, home health care and rehabilitation therapy.The floor care position is responsible for total floor care including wash, seal, wax and buff all floors at the nursing facility.In return for your expertise, you'll enjoy excellent training, industry-leading benefits and unlimited opportunities to learn and grow. Be a part of the team leading the nation in healthcare.Location546 - ProMedica Skilled Nursing and Rehabilitation - Lely Palms, FLEducational RequirementsHigh school diploma requiredPosition RequirementsTraining and/or expertise qualifying the individual to operate floor care equipment and perform assigned duties related to floor care within the nursing facility."

In [92]:
jobs[["title", "jobdesc", "code", "matched_title"]].to_csv('jobs-answers.csv', index=False)

# Alternative approaches

1. I also tried appending the job description (`jobdesc`) to `input_string` before computing cosine similarity. Even with only the first 10 words of the description, processing took longer and the matches were not satisfactory.

    if isinstance(jobdesc, str):
        words = jobdesc.split()[:10]  # Select the first 10 words
        input_string_jobdesc = f"{input_string} {' '.join(words)}"  # Concatenate input_string and jobdesc
        
    To improve on this, the full description could be compared rather than only its first 10 words. The complete job description gives the matcher more context and may improve accuracy, at the cost of further processing time.


2. Another alternative is to represent the job titles and alternate titles with TF-IDF vectors instead of word embeddings. TF-IDF weights each word by how often it occurs in a title and how rare it is across all titles, so distinctive words count more than common ones. Once the titles are converted into TF-IDF vectors, cosine similarity between them can be used to pick the best-matching alternate title, as before. Unlike embeddings, this only rewards shared words, but it needs no pre-trained model.


3. A different approach is to analyze the main topics or themes present in the job titles and alternate titles. This can be done using techniques such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF). By identifying common topics in the titles, we can group similar titles together and consider them as potential matches, even if the wording is different. This method is particularly useful when titles share similar meanings but may use different words to express them.
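
The TF-IDF alternative (item 2) can be sketched with scikit-learn as follows. This is an illustrative sketch rather than code that was run for this notebook: the helper name `best_tfidf_match` and the query title are assumptions, while the candidate titles are taken from the alternate-title outputs shown earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_tfidf_match(query, candidates):
    """Return (best_candidate, score) by cosine similarity of TF-IDF vectors."""
    vectorizer = TfidfVectorizer()  # word-level tokens, lowercased by default
    candidate_matrix = vectorizer.fit_transform(candidates)
    query_vector = vectorizer.transform([query])  # words unseen in candidates are ignored
    scores = cosine_similarity(query_vector, candidate_matrix)[0]
    best = int(scores.argmax())
    return candidates[best], float(scores[best])

candidates = ["Bank President", "City Manager", "Assembly Member", "Account Executive"]
print(best_tfidf_match("Regional Bank President", candidates))
```

For the full dataset it would make sense to fit the vectorizer once over all alternate titles and reuse it for every job title. Note that a query sharing no words with any candidate scores 0 everywhere, in which case `argmax` simply returns the first candidate, so a minimum-score threshold would be needed in practice.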