# LinkedIn Job analysis 🗺️

## Data download

First, let's download the dataset locally for analysis.

In [2]:
!kaggle datasets download arshkon/linkedin-job-postings

Dataset URL: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings
License(s): CC-BY-SA-4.0
Downloading linkedin-job-postings.zip to /content
 97% 153M/158M [00:02<00:00, 61.7MB/s]
100% 158M/158M [00:02<00:00, 75.0MB/s]


We should now have a `linkedin-job-postings.zip` file, so we unzip it into the `linkedin-job-postings` directory, overwriting any existing files.

In [3]:
!unzip -o -d linkedin-job-postings linkedin-job-postings.zip

Archive:  linkedin-job-postings.zip
  inflating: linkedin-job-postings/companies/companies.csv  
  inflating: linkedin-job-postings/companies/company_industries.csv  
  inflating: linkedin-job-postings/companies/company_specialities.csv  
  inflating: linkedin-job-postings/companies/employee_counts.csv  
  inflating: linkedin-job-postings/jobs/benefits.csv  
  inflating: linkedin-job-postings/jobs/job_industries.csv  
  inflating: linkedin-job-postings/jobs/job_skills.csv  
  inflating: linkedin-job-postings/jobs/salaries.csv  
  inflating: linkedin-job-postings/mappings/industries.csv  
  inflating: linkedin-job-postings/mappings/skills.csv  
  inflating: linkedin-job-postings/postings.csv  


Let's list the files we obtained. There are quite a few, but not all of them are relevant for clustering.

In [4]:
!find linkedin-job-postings -type f

linkedin-job-postings/postings.csv
linkedin-job-postings/jobs/job_industries.csv
linkedin-job-postings/jobs/benefits.csv
linkedin-job-postings/jobs/job_skills.csv
linkedin-job-postings/jobs/salaries.csv
linkedin-job-postings/companies/companies.csv
linkedin-job-postings/companies/company_specialities.csv
linkedin-job-postings/companies/employee_counts.csv
linkedin-job-postings/companies/company_industries.csv
linkedin-job-postings/mappings/skills.csv
linkedin-job-postings/mappings/industries.csv


Let's quickly inspect what the files look like. Apart from `postings.csv`, the files are
not that useful for inferring whether a job is a Machine Learning / AI one.

In [5]:
from glob import glob
import pandas as pd

for f in glob('linkedin-job-postings/**/*.csv', recursive=True):
    display(f)
    df = pd.read_csv(f)
    print(df.shape)
    display(df.sample(10))
    del df


'linkedin-job-postings/postings.csv'

(123849, 28)


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,expiry,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type
90835,3904429031,VolunteerMatch,Volunteer: Costumer Winnetka,Brief Overview Of Programs\n\nMusical Theatre ...,,,"Winnetka, IL",22503.0,4.0,,...,1716000000000.0,,Associate,This position requires the following skills: M...,1713408000000.0,www.volunteermatch.org,0,VOLUNTEER,,
118912,3906220767,Hagerty,Temporary Billing & Payments Specialist,"As a Temporary Billing & Payments Specialist, ...",,,"Traverse City, MI",59240.0,2.0,,...,1716153000000.0,,Entry level,,1713561000000.0,hagerty.wd5.myworkdayjobs.com,0,FULL_TIME,,
98479,3904981896,Phoenix Franchise Brands,Staff Accountant,STAFF ACCOUNTANT - Elevate Your Career with Ph...,65000.0,YEARLY,"Livonia, MI",82784440.0,3.0,,...,1716052000000.0,,,,1713460000000.0,,0,FULL_TIME,USD,BASE_SALARY
1346,3884435400,Brightspeed,"Technical Implementation Specialist, Digital E...","At Brightspeed, we are reimagining how people ...",,,"Charlotte, NC",2721914.0,41.0,,...,1714940000000.0,,Associate,,1712348000000.0,jobs.smartrecruiters.com,0,FULL_TIME,,
5471,3885100820,Career Systems Development Corp,Bilingual Employment Counselor - CDS,Description:\nBASIC FUNCTION:Responsible for o...,,,"Jamestown, NY",137350.0,2.0,,...,1714985000000.0,,,,1712393000000.0,www.click2apply.net,0,FULL_TIME,,
22353,3889712101,"Alliance Services, Inc.",Operating Room - Registered Nurse,Job Description\n\nAlliance Services is lookin...,,HOURLY,"Madison, WI",10826403.0,2.0,85.0,...,1715254000000.0,,Mid-Senior level,,1712662000000.0,www.ziprecruiter.com,0,FULL_TIME,USD,BASE_SALARY
104957,3905301880,New Seasons Market,Cashier,Job Brief\n\nJoin the Front End team at the Hi...,,,"Raleigh Hills, OR",56350.0,3.0,,...,1716068000000.0,,Entry level,,1713476000000.0,phf.tbe.taleo.net,0,FULL_TIME,,
91007,3904432304,HireTalent - Diversity Staffing & Recruiting Firm,Helpdesk Representative III #: 24-02655,Job Title: Helpdesk Representative III\n\nLoca...,,,"Cranberry Township, PA",1028156.0,5.0,,...,1716001000000.0,,Mid-Senior level,,1713409000000.0,www1.jobdiva.com,0,CONTRACT,,
71267,3902860724,Lush Fresh Handmade Cosmetics North America,Sales Ambassador - Briarwood Mall,Position: CasualSales Ambassador\n\nWeekly: 0-...,,,"Ann Arbor, MI",344707.0,4.0,,...,1716156000000.0,,Entry level,,1713564000000.0,boards.greenhouse.io,0,FULL_TIME,,
39583,3898167103,"Tanisha Systems, Inc",Program or Project Manager,Position – Senior Program Manager Location – P...,,,"Philadelphia, PA",2562128.0,44.0,,...,1715803000000.0,1713217000000.0,,,1713211000000.0,,0,CONTRACT,,


'linkedin-job-postings/jobs/job_industries.csv'

(164808, 2)


Unnamed: 0,job_id,industry_id
140440,3905854022,47
117438,3905325843,19
12601,3885850620,31
133066,3902322575,74
75337,3900959716,13
44240,3891281034,14
158144,3902866573,27
62289,3901164632,25
117963,3905333304,43
84621,3901352887,25


'linkedin-job-postings/jobs/benefits.csv'

(67943, 3)


Unnamed: 0,job_id,inferred,type
28497,3902786799,1,Vision insurance
33715,3901368207,1,Disability insurance
63294,3902833718,0,401(k)
14190,3895571483,1,401(k)
59520,3905867317,1,Disability insurance
9342,3891017287,0,Dental insurance
65166,3906228575,1,Tuition assistance
37882,3903474913,1,401(k)
57983,3902349830,0,Medical insurance
3650,3886214621,1,Tuition assistance


'linkedin-job-postings/jobs/job_skills.csv'

(213768, 2)


Unnamed: 0,job_id,skill_abr
42891,3891007817,RSCH
110688,3901354137,HCPR
91172,3902753783,BD
126925,3904988854,IT
131264,3901396124,ENG
179533,3904088635,RSCH
193353,3902833369,PROD
86510,3902742980,MNFC
126677,3903444449,SALE
167633,3905372690,QA


'linkedin-job-postings/jobs/salaries.csv'

(40785, 8)


Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type
9499,9500,3894297598,,19.0,,HOURLY,USD,BASE_SALARY
18760,18761,3900975574,,488668.0,,YEARLY,USD,BASE_SALARY
37997,37998,3906222735,,18.0,,HOURLY,USD,BASE_SALARY
9955,9956,3894543740,170000.0,,120000.0,YEARLY,USD,BASE_SALARY
19768,19769,3904414190,,2500.0,,MONTHLY,USD,BASE_SALARY
27369,27370,3901947245,97000.0,,77000.0,YEARLY,USD,BASE_SALARY
35328,35329,3905862813,225000.0,,200000.0,YEARLY,USD,BASE_SALARY
27703,27704,3905302384,87000.0,,68000.0,YEARLY,USD,BASE_SALARY
4137,4138,3887571633,105000.0,,95000.0,YEARLY,USD,BASE_SALARY
34114,34115,3904088118,150000.0,,120000.0,YEARLY,USD,BASE_SALARY


'linkedin-job-postings/companies/companies.csv'

(24473, 10)


Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
21430,76087697,Honest Medical Group,"At Honest, we’re committed to realizing the qu...",3.0,0,US,Nashville,0,1215 5th Ave N,https://www.linkedin.com/company/honest-medica...
7242,362321,REPRO Rising Virginia,REPRO Rising Virginia is the political arm of ...,,VA,US,Alexandria,22313-1204,P.O. Box 1204,https://www.linkedin.com/company/repro-rising-...
24102,100637578,Trail Ridge Power,We work with commercial and industrial partner...,1.0,MA,US,Cambridge,02139,0,https://www.linkedin.com/company/trail-ridge-p...
11743,2463641,Miracles for Kids,Miracles for Kids is one of the only organizat...,1.0,California,US,Irvine,92614,17848 Sky Park Circle,https://www.linkedin.com/company/miracles-for-...
19271,35668062,IFG Chicago,IFG - International Financial Group is a staff...,1.0,IL,US,Chicago,60601,200 N. LaSalle Street,https://www.linkedin.com/company/ifg-chicago
13511,3802195,RightTalents LLC,WELCOME TO RIGHTTALENTS!\nhttps://www.righttal...,1.0,New Jersey,US,Nutley,07110,639 Passaic Ave,https://www.linkedin.com/company/righttalents-llc
18775,30116522,Babyface Brows,Our Permanent Makeup studio is dedicated to en...,,NY,US,Brooklyn,11232,807 42nd St,https://www.linkedin.com/company/babyfacebrows
7624,449579,"FBC Mortgage, LLC","FBC Mortgage, LLC (“FBC”) is a Top 20 National...",4.0,FL,US,Orlando,32801,189 S. Orange Avenue,https://www.linkedin.com/company/fbc-mortgage-llc
16,1073,EY,"EY exists to build a better working world, hel...",7.0,0,GB,London,SE1 2DA,6 More London Place,https://www.linkedin.com/company/ernstandyoung
22262,81690544,Sanctuary Recovery Centers,Welcome to true healing and continued care.\nT...,2.0,Arizona,US,Phoenix,85020,11645 N Cave Creek Rd,https://www.linkedin.com/company/sanctuary-rec...


'linkedin-job-postings/companies/company_specialities.csv'

(169387, 2)


Unnamed: 0,company_id,speciality
98953,1376946,instructional design
101327,305697,Co-Occurring Disorders
131932,18048742,Cassandra Database Management
49964,417618,Cyber Security
89009,132193,intellectual property
84747,545865,Staffing
5239,765008,Headcount planning
35154,2219,Base Operation Support
133,75056372,Legal
116793,2832056,Baseline Planning and Management


'linkedin-job-postings/companies/employee_counts.csv'

(35787, 4)


Unnamed: 0,company_id,employee_count,follower_count,time_recorded
9819,11830178,634,7218,1712864984
21836,16167895,88,143663,1713453628
15522,3439,36433,1165499,1713277507
29153,166658,12135,197367,1713481802
27401,116872,261,9723,1713473925
29170,7584447,831,51685,1713481952
791,33269,2162,26604,1712350347
6022,1837539,3342,45964,1712670678
6222,1009990,93,678,1712671943
11427,84126489,466,9050,1712892675


'linkedin-job-postings/companies/company_industries.csv'

(24375, 2)


Unnamed: 0,company_id,industry
10908,988690,Advertising Services
20293,83076758,Insurance
1047,16239,Truck Transportation
11293,12939,Non-profit Organizations
6998,92407,Staffing and Recruiting
12990,11400117,Staffing and Recruiting
11382,3235980,Software Development
13984,128737,Construction
16525,6423583,Wellness and Fitness Services
1263,35475885,Staffing and Recruiting


'linkedin-job-postings/mappings/skills.csv'

(35, 2)


Unnamed: 0,skill_abr,skill_name
25,ADM,Administrative
34,MGMT,Management
18,FIN,Finance
19,OTHR,Other
3,PRDM,Product Management
17,STRA,Strategy/Planning
28,PR,Public Relations
4,DIST,Distribution
2,ADVR,Advertising
16,CUST,Customer Service


'linkedin-job-postings/mappings/industries.csv'

(422, 2)


Unnamed: 0,industry_id,industry_name
207,1128,Wholesale Motor Vehicles and Parts
74,77,Law Enforcement
311,2247,
186,861,"Boilers, Tanks, and Shipping Container Manufac..."
183,807,Primary Metal Manufacturing
23,25,Manufacturing
356,3132,Internet Publishing
214,1187,Wholesale Machinery
266,1798,Commercial and Industrial Equipment Rental
109,112,"Appliances, Electrical, and Electronics Manufa..."


Let's focus on the `postings.csv` file. To speed up later reads, we will save it in Parquet format.

In [6]:
from pathlib import Path
import pandas as pd


POSTINGS_PARQUET  = Path('linkedin-job-postings/postings.parquet')

if not POSTINGS_PARQUET.exists():
    postings = pd.read_csv('linkedin-job-postings/postings.csv')
    postings.to_parquet(POSTINGS_PARQUET)
    del postings

postings = pd.read_parquet(POSTINGS_PARQUET)
postings

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,expiry,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,1.715990e+12,,,Requirements: \n\nWe are seeking a College or ...,1.713398e+12,,0,FULL_TIME,USD,BASE_SALARY
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,1.715450e+12,,,,1.712858e+12,,0,FULL_TIME,USD,BASE_SALARY
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,1.715870e+12,,,We are currently accepting resumes for FOH - A...,1.713278e+12,,0,FULL_TIME,USD,BASE_SALARY
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,1.715488e+12,,,This position requires a baseline understandin...,1.712896e+12,,0,FULL_TIME,USD,BASE_SALARY
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,1.716044e+12,,,,1.713452e+12,,0,FULL_TIME,USD,BASE_SALARY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123844,3906267117,Lozano Smith,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...,195000.0,YEARLY,"Walnut Creek, CA",56120.0,1.0,,...,1.716163e+12,,Mid-Senior level,,1.713571e+12,,0,FULL_TIME,USD,BASE_SALARY
123845,3906267126,Pinterest,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...,,,United States,1124131.0,3.0,,...,1.716164e+12,,Mid-Senior level,,1.713572e+12,www.pinterestcareers.com,0,FULL_TIME,,
123846,3906267131,EPS Learning,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...,,,"Spokane, WA",90552133.0,3.0,,...,1.716164e+12,,Mid-Senior level,,1.713572e+12,epsoperations.bamboohr.com,0,FULL_TIME,,
123847,3906267195,Trelleborg Applied Technologies,Business Development Manager,The Business Development Manager is a 'hunter'...,,,"Texas, United States",2793699.0,4.0,,...,1.716165e+12,,,,1.713573e+12,,0,FULL_TIME,,


## Which techniques should we use?

We have multiple ways to go about classifying / clustering jobs by type. We can try
a few techniques and see which one works best:

- LDA (topic modelling)
- Clustering based on embeddings.
- Semi supervised classification.

The first two aren't promising, for one main reason: both clustering and LDA require choosing a hyperparameter, the number of topics (or clusters). Too few topics yields a coarse classification that puts many different job types in the same topic, while too many topics can split our jobs of interest across several topics.

## LDA

Regardless of how good we think LDA is, we will try it anyway. It is a simple method and can be used for unsupervised classification.

First, we will load a minimal set of processing components from the spaCy library. spaCy can also compute word embeddings and run other, more involved components, which aren't necessary here and would slow down a process that is somewhat slow already :)

In [7]:
import spacy

PIPELINE_COMPONENTS = ['tagger', 'parser', 'lemmatizer', 'attribute_ruler']

try:
    nlp = spacy.load("en_core_web_md", enable=PIPELINE_COMPONENTS)
except OSError:
    print('Downloading model')
    !python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md", enable=PIPELINE_COMPONENTS)


print(f'NLP pipeline: {nlp.pipe_names}')

Downloading model
Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 6.7 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
NLP pipeline: ['tagger', 'parser', 'attribute_ruler', 'lemmatizer']


### Start small, then go big.

We will run a small analysis on a subset of the full dataset to speed up LDA and the embedding computation. A subset also simplifies the analysis: it is easier to check individual samples when the set under study is smaller.

We will select 10000 samples (around 8% of the data) and store them in the `mini_postings` variable.

In [8]:
mini_postings = postings.sample(10000, random_state=567).copy()
display(mini_postings['title'] + ' ' + mini_postings['description'])

76698     Registered Nurse (37.5) Long Island Community ...
10089     Value Stream Coach Wabtec Corporation is a lea...
122320    Doughnut Specialist If you love spreading joy,...
71946     Email Marketing Specialist Robert Half's clien...
102006    Medical Coding Supervisor Description:Job Titl...
                                ...                        
23252     PHYSICAL THERAPIST (PT) - PISGAH MANOR \nLiber...
107255    Client-Facing Recruiter, Manufacturing/Skilled...
86600     Director of Continuous Improvement Finance Per...
11018     Automotive Mobile Glass Technician Caliber Aut...
52883     Sr Director M&A Integration Location: Work is ...
Length: 10000, dtype: object

We will remove tokens that are not useful for LDA because they say little about the type of job.

In [9]:
from spacy.tokens import Doc

OMIT_TOKENS = {'ADV', 'ADJ', 'PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE', 'NUM', 'SYM'}

def get_tokens_for_posting(posting: Doc) -> list[str]:
    return [
        token.lemma_.lower()
        for token in posting
        if not token.is_stop and token.is_alpha and token.pos_ not in OMIT_TOKENS
    ]

get_tokens_for_posting(nlp('I am a great software and ai engineer.'))

['great', 'software', 'ai', 'engineer']

### Titles or descriptions?

LDA requires more than a couple of sentences per document to derive the conditional probabilities of topics and of words per topic. Job titles alone aren't enough because they provide too little evidence to perform the EM step in LDA. Therefore, we extend them with descriptions. There's a risk with this: we (humans) can classify a job just by looking at the title most of the time, while the description tends to talk about the company, which can confuse LDA into classifying "companies/jobs" instead of the jobs themselves. This can reduce the usefulness of the obtained topics.

To be mindful of time, we will process only `mini_postings`. If things go well, we can extend it to the full dataset.

In [12]:
from tqdm import tqdm
import pandas as pd

def process_tokens(postings: pd.DataFrame):
    str_docs = (postings['title'] + ' ' + postings['description']).astype(str)
    nlp_pipe = nlp.pipe(str_docs, n_process=-1)

    tokens = []
    for doc in tqdm(nlp_pipe, total=len(str_docs), unit="posting", desc="Processing tokens"):
        tokens.append(get_tokens_for_posting(doc))

    postings['tokens'] = tokens


process_tokens(mini_postings)

Processing tokens: 100%|██████████| 10000/10000 [09:14<00:00, 18.03posting/s]


The collected tokens look ok for our job title and description.

In [13]:
mini_postings['tokens']

76698     [registered, nurse, long, island, community, h...
10089     [value, stream, coach, wabtec, corporation, le...
122320    [doughnut, specialist, love, spreading, joy, p...
71946     [email, marketing, specialist, robert, half, c...
102006    [medical, coding, supervisor, description, job...
                                ...                        
23252     [physical, therapist, pt, pisgah, manor, liber...
107255    [client, facing, recruiter, manufacturing, ski...
86600     [director, continuous, improvement, finance, p...
11018     [automotive, mobile, glass, technician, calibe...
52883     [sr, director, integration, location, work, ro...
Name: tokens, Length: 10000, dtype: object

We filter down the dictionary of tokens to speed up the process. Terms that appear too infrequently aren't useful for deriving topics because they would appear in few documents. Also, we benefit from having a small dictionary to speed up the LDA optimization step.

In [14]:
from gensim.corpora.dictionary import Dictionary


dictionary = Dictionary(mini_postings['tokens'])
dictionary.filter_extremes(no_below=5, keep_n=5000)

pd.Series(dictionary.token2id)

able               0
accepts            1
accordance         2
according          3
accredited         4
                ... 
furthermore     4995
ship            4996
instrument      4997
experimental    4998
plc             4999
Length: 5000, dtype: int64
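
To make the filtering step more concrete, here is a minimal pure-Python sketch of a document-frequency filter in the spirit of `filter_extremes`. The function name `filter_vocabulary` and the toy documents are hypothetical and only for illustration. Note that gensim's `filter_extremes` also has a `no_above` parameter that defaults to 0.5, so the call above additionally drops tokens present in more than half of the postings.

```python
from collections import Counter


def filter_vocabulary(docs, no_below=5, no_above=0.5, keep_n=5000):
    """Return the tokens that survive a document-frequency filter."""
    # Document frequency: in how many documents each token occurs at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [
        (token, freq)
        for token, freq in doc_freq.items()
        if no_below <= freq <= max_docs
    ]
    # Keep the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [token for token, _ in kept[:keep_n]]


# Hypothetical toy corpus: "care" is in every document, several tokens are in only one.
toy_docs = [
    ["nurse", "patient", "care"],
    ["nurse", "hospital", "care"],
    ["engineer", "python", "care"],
    ["engineer", "java", "care"],
]
vocab = filter_vocabulary(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(vocab)  # ['engineer', 'nurse']
```

In the toy corpus, "care" is dropped for being too common and the single-occurrence tokens for being too rare, which leaves only the tokens that help tell the two groups of postings apart.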

Now that we have a dictionary (tokens that appear in at least five postings), we can build the bag-of-words corpus and start the optimization step. For that, we use a variant of LDA that runs on multiple cores.

In [15]:
corpus = [dictionary.doc2bow(doc) for doc in mini_postings['tokens']]
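
As a rough illustration of what each entry of `corpus` looks like, the sketch below builds the same kind of bag-of-words representation by hand: a sorted list of `(token_id, count)` pairs in which tokens missing from the dictionary are ignored. The helper `to_bow` and the toy mapping are hypothetical, not part of gensim.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical three-word dictionary; "doughnut" is not in it and is skipped.
toy_token2id = {"engineer": 0, "learning": 1, "machine": 2}
toy_bow = to_bow(
    ["machine", "learning", "engineer", "machine", "doughnut"], toy_token2id
)
print(toy_bow)  # [(0, 1), (1, 1), (2, 2)]
```

Word order is lost in this representation, which is one reason LDA cannot use context to separate similar-sounding jobs.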

In [None]:
from gensim.models import LdaMulticore

lda_model = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    iterations=50,
    num_topics=30,
    passes=10,
)

We need to "guess" which topic corresponds to Machine Learning / AI. To do so, we ask for the topic distribution of a sample job title and take the topic with the highest probability as the ML topic.

In the run below it is 24 (it can change depending on how LDA converges).

In [None]:
topics_for_ml = lda_model.get_document_topics(
    dictionary.doc2bow(get_tokens_for_posting(nlp('Machine Learning Engineer')))
)
print(topics_for_ml)
ml_topic = max(topics_for_ml, key=lambda x: x[1])[0]
print(f"ML topic: {ml_topic}")

[(16, 0.22709362), (24, 0.53955895)]
ML topic: 24


To get "how ML" a job posting is, we compute the probability of the ML topic guessed in the previous step.

We assume a job posting is an ML one if there's a good chance the document belongs to that topic. Formally, this means the probability is above some threshold.

To compute the probability, we run the LDA model again on each document and store the ML topic probability in a separate column.

In [None]:
def get_document_topics(model: LdaMulticore, corpus: list[list[tuple[int, int]]]) -> list[float]:
    # One sparse list of (topic_id, probability) pairs per document.
    topics_corpus = model.get_document_topics(corpus)
    ml_topic_probs = []
    for topics in tqdm(topics_corpus, total=len(corpus), unit="posting", desc="Processing topics"):
        # Low-probability topics are omitted from the list, so default to 0.
        ml_topic_prob = next((prob for t, prob in topics if t == ml_topic), 0.0)
        ml_topic_probs.append(ml_topic_prob)
    return ml_topic_probs

document_topics = get_document_topics(lda_model, corpus)


Processing topics: 100%|██████████| 10000/10000 [00:23<00:00, 432.97posting/s]


In [None]:
mini_postings['ml_topic_prob'] = document_topics

Then we try different thresholds and check whether the documents in that topic are indeed about Machine Learning or AI. After several runs, we never found a set of parameters (number of topics, LDA iterations, etc.) that yielded a topic in which all jobs were about ML.

The model does distinguish "engineering" jobs correctly, but not "ML / AI" jobs. Perhaps that is too big of an ask. LDA only uses word counts and ignores word context, which oversimplifies reality. Also, we include the job descriptions, and those do not differ significantly between ML and other engineering jobs.
We can try other methods, though.
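
The threshold comparison can be made a little less manual with a sketch like the one below. It assumes, as a weak proxy, that a posting is an ML one when its title contains a key phrase; the helper `precision_at_thresholds`, the keyword list and the toy data are all hypothetical. Titles such as "Data Scientist" would be missed by this proxy, so the numbers are only indicative.

```python
def precision_at_thresholds(
    probs,
    titles,
    thresholds,
    keywords=("machine learning", "artificial intelligence"),
):
    """For each threshold, report how many postings are selected and what
    fraction of them mention one of the keywords in the title."""
    rows = []
    for threshold in thresholds:
        selected = [t for p, t in zip(probs, titles) if p > threshold]
        hits = sum(any(k in t.lower() for k in keywords) for t in selected)
        precision = hits / len(selected) if selected else 0.0
        rows.append((threshold, len(selected), precision))
    return rows


# Hypothetical toy data standing in for the ML topic probabilities and titles.
toy_probs = [0.95, 0.90, 0.85, 0.40, 0.10]
toy_titles = [
    "Machine Learning Engineer",
    "DevOps Engineer",
    "Java Backend Developer",
    "Artificial Intelligence Engineer",
    "Cashier",
]
rows = precision_at_thresholds(toy_probs, toy_titles, thresholds=(0.3, 0.8))
for threshold, n_selected, precision in rows:
    print(f"threshold={threshold} selected={n_selected} precision={precision:.2f}")
```

With the notebook's data, the same function could be called as `precision_at_thresholds(mini_postings['ml_topic_prob'], mini_postings['title'].astype(str), (0.5, 0.8, 0.9))`.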

In [None]:
mini_postings.loc[mini_postings['ml_topic_prob'] > 0.8]

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,tokens,ml_topic_prob
646,3868450640,Diamondpick,.Net Azure Architect (Local to WA),"10+ years of experience, previous hands on sof...",,,"Seattle, WA",14377562.0,4.0,,...,,,1.713474e+12,,0,CONTRACT,,,"[azure, architect, local, wa, years, experienc...",0.975821
83730,3904081220,,DevOps Engineer,Job Title: Sr.DevOps EngineerLocation: REMOTE\...,,,United States,,1.0,,...,Mid-Senior level,,1.713524e+12,,0,CONTRACT,,,"[devops, engineer, job, title, sr, devops, eng...",0.903256
52823,3901627838,Agile Tech Labs,Microsoft Dynamics Consultant,Job Title: Microsoft 365 Dynamic DeveloperJob ...,,,"Lansing, MI",33202713.0,36.0,,...,Mid-Senior level,,1.713210e+12,,0,CONTRACT,,,"[microsoft, dynamics, consultant, job, title, ...",0.938695
63533,3902350082,Diverse Lynx,Salesforce Architect with Copado,Role: Salesforce Architect with CopadoLocation...,,,United States,90396.0,,,...,,,1.713533e+12,,0,CONTRACT,,,"[salesforce, architect, copado, role, salesfor...",0.853906
17152,3888028934,Tiposi,Field Application Engineer (FAE),We are currently seeking a highly skilled and ...,,,"Milpitas, CA",76216204.0,4.0,,...,,,1.712442e+12,,0,FULL_TIME,,,"[field, application, engineer, fae, currently,...",0.855017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73673,3902938491,Talent Groups,Azure Data Engineer- Technical Lead,Mandatory Skills:Proven experience in data eng...,,,United States,51701268.0,170.0,,...,Mid-Senior level,,1.713279e+12,,0,FULL_TIME,,,"[azure, data, technical, lead, mandatory, skil...",0.878692
109237,3905353031,JBA International,Senior System Engineer,Company Overview:\nA leading Law Firm client o...,,,Los Angeles Metropolitan Area,219642.0,2.0,,...,Mid-Senior level,,1.713486e+12,,0,FULL_TIME,,,"[senior, system, engineer, company, overview, ...",0.878748
53473,3901648108,MCubeSoft,Java Backend Developer (with GraphQL),Job Title: Java (with GraphQL)Location: Austin...,,,"Austin, TX",74604183.0,130.0,,...,,,1.713211e+12,,0,FULL_TIME,,,"[java, backend, developer, graphql, job, title...",0.954542
83906,3904090106,Alldus,Senior ServiceNow Developer,Senior ServiceNow DeveloperFull Time-Remote bu...,150000.0,YEARLY,"Austin, TX",12618960.0,5.0,,...,Mid-Senior level,,1.713528e+12,,0,FULL_TIME,USD,BASE_SALARY,"[senior, servicenow, developer, senior, servic...",0.919060


In [None]:
mini_postings.loc[mini_postings['title'].str.lower().str.contains('machine learning'), 'topic'].value_counts()

topic
21    6
44    2
43    2
48    1
15    1
Name: count, dtype: int64
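
The cell above reads a `topic` column that none of the cells shown here creates, and topic ids such as 44 and 48 suggest it comes from a run with more than 30 topics. A minimal sketch of how such a column could be derived, assuming it holds the most probable topic of each posting (the helper `dominant_topic` is hypothetical):

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability, or -1 if the
    list is empty (no topic cleared the model's minimum probability)."""
    if not doc_topics:
        return -1
    return max(doc_topics, key=lambda pair: pair[1])[0]


# Toy input in the (topic_id, probability) format returned per document.
toy_doc_topics = [
    [(16, 0.22709362), (24, 0.53955895)],
    [(3, 0.91)],
    [],
]
toy_dominant = [dominant_topic(t) for t in toy_doc_topics]
print(toy_dominant)  # [24, 3, -1]
```

Under that assumption, the column could be filled with `mini_postings['topic'] = [dominant_topic(t) for t in lda_model.get_document_topics(corpus)]`.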

## Clustering using K-means

The following method relies on generating an embedding of each job posting and using a distance or similarity metric to cluster the postings into K topics. Given those embeddings, we can generate clusters.

I won't explore K-means or other clustering methods here because they can have the same problems as LDA. Clustering with K-means or other methods based on embeddings can work, and we can even use cosine similarity as the distance measure under some conditions. The problem with clustering is that we rely on a good choice of K (the number of clusters) to avoid splitting ML/AI into several topics or merging unrelated jobs into the same cluster.
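
One condition under which cosine similarity and the Euclidean distance used by K-means agree is that the embeddings are L2-normalized. This is presumably the kind of condition meant above; the symbols $a$ and $b$ are just two embedding vectors introduced for illustration:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\cos(a, b) \quad \text{when } \lVert a \rVert = \lVert b \rVert = 1
$$

For unit-length vectors, a smaller Euclidean distance therefore always means a larger cosine similarity. Cluster centroids are averages and are generally not unit length, so an exact equivalence would also require re-normalizing the centroids at each iteration.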


## Embedding semi supervised classification

This idea is related to clustering, is relatively simple, and relies on semi-supervised classification. We manually classify some samples that are "easy" to classify, and we decide that a job is an ML job if it is "similar" to a job we manually classified as ML.

Formally, the cluster of ML and AI jobs is the set of jobs whose embedding is near any of the anchors (the manually classified ML jobs). This way we can use human knowledge to "improve" our unsupervised classification.
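
The "near any of the anchors" rule can be sketched as follows, assuming cosine similarity as the closeness measure and an arbitrary threshold of 0.8. Both are assumptions made for illustration, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_near_any_anchor(embedding, anchors, threshold=0.8):
    """True if the embedding is at least `threshold` similar to some anchor."""
    return max(cosine_similarity(embedding, a) for a in anchors) >= threshold


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_anchors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
toy_near = is_near_any_anchor([0.9, 0.1, 0.0], toy_anchors)
toy_far = is_near_any_anchor([0.0, 0.0, 1.0], toy_anchors)
print(toy_near, toy_far)  # True False
```

Taking the maximum over anchors means a job only needs to resemble one known ML job, so adding a new anchor can only grow the set of jobs classified as ML.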

First of all, we will install Sentence Transformers, which will help us compute embeddings for the job postings.

In [16]:
try:
  from sentence_transformers import SentenceTransformer
except ModuleNotFoundError:
  !pip install sentence_transformers
  from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Collecting sentence_transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.1/227.1 kB 4.6 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence_transform

  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Then, we will explore the `mini_postings` subset of jobs to ease the analysis. If our technique works for this set, which we use for training, we will expand it to the rest of the dataset for validation.

In [17]:
from tqdm import tqdm
import pandas as pd


def compute_embeddings(model: SentenceTransformer, postings: pd.DataFrame):
  texts = (
      'Job title: ' +
      postings['title'].astype(str)
  )
  embeddings = []

  for text in tqdm(texts, unit="posting", desc="Computing embeddings"):
    embeddings.append(model.encode(text))

  postings['embeddings'] = embeddings

compute_embeddings(model, mini_postings)

Computing embeddings: 100%|██████████| 10000/10000 [04:17<00:00, 38.89posting/s]


We can use the `mini_postings` set for manual classification because it is smaller and faster to classify; we can also use the rest of the job postings for validation. In this set we will select the "anchors", or positive samples of ML jobs, by searching for key phrases like "machine learning" or "artificial intelligence" in the title. We know all those jobs belong to the cluster of AI jobs.

In [18]:
some_ml_jobs = mini_postings.loc[
    mini_postings['title'].str.lower().str.contains('machine learning') |
    mini_postings['title'].str.lower().str.contains('artificial intelligence')
    ]
some_ml_jobs

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,tokens,embeddings
93346,3904714315,Square One Resources,"Artificial Intelligence Engineer ($225,000 - $...","Artificial Intelligence EngineerSalary: $225,0...",300000.0,YEARLY,"Washington, United States",23254.0,2.0,,...,Mid-Senior level,,1713563000000.0,,0,FULL_TIME,USD,BASE_SALARY,"[artificial, intelligence, engineer, artificia...","[-0.07574634, -0.015229617, 0.049184386, 0.043..."
36598,3895587299,TikTok,"Machine Learning Engineer, E-Commerce",Responsibilities\n\n TikTok is the leading des...,337250.0,YEARLY,"Seattle, WA",33246798.0,5.0,,...,,,1712896000000.0,careers.tiktok.com,0,FULL_TIME,USD,BASE_SALARY,"[machine, learning, engineer, e, commerce, res...","[-0.084470145, -0.01710696, 0.07959444, 0.0296..."
117491,3905888074,hackajob,Artificial Intelligence Engineer,hackajob transforms your job search into a per...,270000.0,YEARLY,"McLean, VA",5396873.0,1.0,,...,Mid-Senior level,,1713540000000.0,,0,FULL_TIME,USD,BASE_SALARY,"[artificial, intelligence, engineer, hackajob,...","[-0.10161027, 0.018857576, 0.05280516, 0.04283..."
100516,3905213436,Alsym Energy,Chemist Machine Learning,Job Title: Chemist with AI Expertise (Scientis...,,,"Woburn, MA",80862166.0,3.0,,...,Mid-Senior level,,1713460000000.0,,0,FULL_TIME,,,"[chemist, machine, learning, job, title, chemi...","[-0.08512944, -0.013795285, 0.04394056, 0.0528..."
68602,3902780924,Capital One,"Senior Machine Learning Engineer (Python, PySp...","Center 1 (19052), United States of America, Mc...",,,"McLean, VA",1419.0,4.0,,...,Mid-Senior level,,1713404000000.0,dsp.prng.co,0,FULL_TIME,,,"[senior, machine, learning, engineer, python, ...","[-0.049656063, -0.008628289, 0.023114905, 0.00..."
42779,3899542407,Thomas Thor,Artificial Intelligence Engineer,Seeking an AI System Engineer to Revolutionize...,100.0,HOURLY,United States,733272.0,251.0,,...,Mid-Senior level,,1713281000000.0,,0,CONTRACT,USD,BASE_SALARY,"[artificial, intelligence, engineer, seeking, ...","[-0.10161027, 0.018857576, 0.05280516, 0.04283..."
115905,3905858580,Barrington James,Senior Machine Learning Engineer,-- Senior Machine Learning Scientist - Drug Di...,,,United States,112333.0,3.0,,...,Mid-Senior level,,1713537000000.0,,0,FULL_TIME,,,"[senior, machine, learning, engineer, senior, ...","[-0.08761789, 0.021859843, 0.07799667, 0.00514..."
96284,3904945029,,Artificial Intelligence Engineer,"Company Overview:\nAt LessAlone.AI, we are pio...",,,United States,,4.0,,...,,,1713452000000.0,,0,INTERNSHIP,,,"[artificial, intelligence, engineer, company, ...","[-0.10161027, 0.018857576, 0.05280516, 0.04283..."
98514,3904983008,TikTok,"Machine Learning Engineer, Search Ads - USDS",Responsibilities\n\n TikTok is the leading des...,168150.0,YEARLY,"Seattle, WA",33246798.0,4.0,,...,,,1713458000000.0,careers.tiktok.com,0,FULL_TIME,USD,BASE_SALARY,"[machine, learning, engineer, search, ads, usd...","[-0.08666537, -0.0468133, 0.028774887, 0.06730..."
55989,3901937564,Lorven Technologies Inc.,AI /Machine Learning Engineer / Architect,"Hi,\nOur client is a currently looking for AI ...",,,"Nashville, TN",1683442.0,1.0,,...,Mid-Senior level,,1713468000000.0,,0,CONTRACT,,,"[ai, learning, engineer, architect, hi, client...","[-0.08616383, -0.00078839425, 0.034689542, 0.0..."


Next, for every job posting we find its highest similarity to any of the anchors. Our ML job cluster is then the set of jobs that are similar enough to at least one anchor.

In [19]:
import numpy as np

some_embeddings = np.vstack(some_ml_jobs['embeddings'].values)

mini_postings_embeddings = np.vstack(mini_postings['embeddings'].values)
mini_postings_embeddings.shape

(10000, 384)

To compute this similarity we use `model.similarity`. We only compare against the anchors, not the entire set of postings: comparing every posting with every other posting would make both the computation time and the memory needed grow quadratically with the number of postings.

In [20]:
max_similarities = model.similarity(mini_postings_embeddings, some_embeddings).max(axis=1)
mini_postings['max_similarity'] = max_similarities.values
max_similarities

torch.return_types.max(
values=tensor([0.4604, 0.5447, 0.4733,  ..., 0.5016, 0.4667, 0.6116]),
indices=tensor([ 4, 12,  6,  ...,  6, 14, 13]))
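
For reference, the same "best match against any anchor" idea can be sketched in plain NumPy. This assumes `model.similarity` is using cosine similarity (its default, as far as we know); `max_cosine_to_anchors` and the random demo arrays are hypothetical and only illustrate the shapes involved.

```python
import numpy as np

def max_cosine_to_anchors(embeddings, anchors):
    """Return, per row of `embeddings`, the best cosine similarity to any
    anchor row and the index of that anchor."""
    # Normalise rows so the dot product equals the cosine similarity
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    anc = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sims = emb @ anc.T  # shape: (n_postings, n_anchors)
    return sims.max(axis=1), sims.argmax(axis=1)

# Made-up data: 5 "postings" with 384 dimensions, 2 of them reused as anchors
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(5, 384))
demo_anchors = demo_embeddings[[1, 3]]
best, which = max_cosine_to_anchors(demo_embeddings, demo_anchors)
```

A posting that is itself an anchor gets a similarity of 1 to itself, which is why the anchors show up with `max_similarity` equal to 1.000000 in the filtered tables.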

Finally, we can tune the similarity cutoff, which defines how close a job posting must be to an anchor to count as part of the cluster. In this case, we selected $0.85$. Values between 0.7 and 0.9 should be OK; the choice depends on how many false positives we are willing to admit in our cluster. A false positive is a non-ML job that ends up in our cluster, for example "Senior Full Stack Software Engineer | Data Center", an engineering job that is not necessarily an AI job.

In [21]:
mini_postings.loc[mini_postings['max_similarity'] >= 0.85].sort_values('max_similarity')

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,tokens,embeddings,max_similarity
122066,3906251569,Visa,Senior Manager of Data Science,Company Description\n\nVisa is a world leader ...,210850.0,YEARLY,"Washington, DC",2190.0,4.0,,...,,1.713567e+12,jobs.smartrecruiters.com,0,FULL_TIME,USD,BASE_SALARY,"[senior, manager, data, science, company, desc...","[-0.06684876, 0.043088328, 0.03930743, 0.04812...",0.771953
70934,3902845624,Prysmian,Senior Product Development Engineer,Prysmian is the world leader in the energy and...,,,"Willimantic, CT",11316.0,2.0,,...,,1.713562e+12,,0,FULL_TIME,,,"[senior, product, development, engineer, prysm...","[-0.08412757, 0.04888611, 0.040766906, -0.0592...",0.772281
64462,3902358993,Microsoft,Principle Software Engineer,The Artificial Intelligence (AI) Frameworks te...,282200.0,YEARLY,United States,1035.0,4.0,,...,,1.713540e+12,careers.microsoft.com,0,FULL_TIME,USD,BASE_SALARY,"[principle, software, engineer, artificial, in...","[-0.10990921, 0.0634587, 0.030148502, -0.04658...",0.774104
3792,3884854235,Neighborly Software,Senior Full Stack Software Engineer,Who We Are\nNeighborly Software was built to h...,,,Atlanta Metropolitan Area,18009120.0,31.0,,...,,1.712377e+12,,0,FULL_TIME,,,"[senior, stack, software, engineer, neighborly...","[-0.08194241, 0.021732403, 0.05001767, -0.0432...",0.776185
6185,3885109960,dot. cards,Senior Full Stack Software Engineer,Senior Software Engineer - Full Stack Consumer...,,,Los Angeles Metropolitan Area,69192017.0,4.0,,...,,1.712459e+12,,0,FULL_TIME,,,"[senior, stack, software, engineer, senior, so...","[-0.08194241, 0.021732403, 0.05001767, -0.0432...",0.776185
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100516,3905213436,Alsym Energy,Chemist Machine Learning,Job Title: Chemist with AI Expertise (Scientis...,,,"Woburn, MA",80862166.0,3.0,,...,,1.713460e+12,,0,FULL_TIME,,,"[chemist, machine, learning, job, title, chemi...","[-0.08512944, -0.013795285, 0.04394056, 0.0528...",1.000000
112076,3905611091,FreeWheel,Sr Machine Learning Engineer,"FreeWheel, a Comcast company, provides compreh...",,,"Reston, VA",458871.0,3.0,,...,,1.713505e+12,jobs.comcast.com,0,FULL_TIME,,,"[sr, machine, learning, engineer, freewheel, c...","[-0.10226952, -0.0018580917, 0.033680152, 0.05...",1.000000
93346,3904714315,Square One Resources,"Artificial Intelligence Engineer ($225,000 - $...","Artificial Intelligence EngineerSalary: $225,0...",300000.0,YEARLY,"Washington, United States",23254.0,2.0,,...,,1.713563e+12,,0,FULL_TIME,USD,BASE_SALARY,"[artificial, intelligence, engineer, artificia...","[-0.07574634, -0.015229617, 0.049184386, 0.043...",1.000000
115905,3905858580,Barrington James,Senior Machine Learning Engineer,-- Senior Machine Learning Scientist - Drug Di...,,,United States,112333.0,3.0,,...,,1.713537e+12,,0,FULL_TIME,,,"[senior, machine, learning, engineer, senior, ...","[-0.08761789, 0.021859843, 0.07799667, 0.00514...",1.000000
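
To get a feel for the cutoff, one option is to count how many postings would fall into the cluster at several thresholds before picking one. The helper below is a hypothetical sketch, and the similarity values in it are made up for illustration; in the notebook one would pass `mini_postings['max_similarity']` instead.

```python
import numpy as np

def cluster_size_by_cutoff(max_similarity, cutoffs):
    """Map each cutoff to the number of postings at or above it."""
    sims = np.asarray(max_similarity, dtype=float)
    return {float(c): int((sims >= c).sum()) for c in cutoffs}

# Made-up similarity values, not taken from the dataset
demo_similarities = np.array([0.46, 0.54, 0.77, 0.86, 0.91, 1.0])
sizes = cluster_size_by_cutoff(demo_similarities, [0.7, 0.8, 0.85, 0.9])
print(sizes)
```

A sharp jump in cluster size between two neighbouring cutoffs is a hint that the lower one starts admitting a new family of titles, which is worth inspecting by hand for false positives.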


Let's do this again with all postings, given the method works for our smaller set.

In [22]:
compute_embeddings(model, postings)

Computing embeddings: 100%|██████████| 123849/123849 [51:49<00:00, 39.83posting/s]


We save the embeddings so that we don't have to wait 50+ minutes to compute them again.

In [23]:
postings_embeddings = np.vstack(postings['embeddings'].values)
np.savez('postings_embeddings.npz', postings_embeddings)
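
A small round-trip sketch of saving and reloading. When an array is passed to `np.savez` positionally, as in the cell above, NumPy stores it under the default key `arr_0`; passing it as a keyword argument gives it an explicit name instead. The key name `embeddings`, the temporary path, and the random demo array are assumptions made for this example.

```python
import os
import tempfile
import numpy as np

# Made-up stand-in for the real (n_postings, 384) embedding matrix
demo = np.random.default_rng(0).normal(size=(4, 384)).astype(np.float32)

# Write to a temporary directory so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'postings_embeddings.npz')
np.savez(path, embeddings=demo)  # keyword argument sets the key name

# np.load on an .npz file returns a dict-like object keyed by array name
with np.load(path) as data:
    restored = data['embeddings']
```

For the file written by the cell above, the array would be read back with `np.load('postings_embeddings.npz')['arr_0']`.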

We repeat the similarity step, now computing it for all the job postings against the same anchors.

Note: We could search for more anchors in the entire set of job postings, but we want to keep the set of anchors small and evaluate how the method works on "unseen data".

In [24]:
max_similarities = model.similarity(postings_embeddings, some_embeddings).max(axis=1)
postings['max_similarity'] = max_similarities.values
max_similarities

torch.return_types.max(
values=tensor([0.5359, 0.4114, 0.4871,  ..., 0.4498, 0.6434, 0.5461]),
indices=tensor([12, 10, 15,  ..., 10, 15, 11]))
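
At this scale the quadratic growth mentioned earlier becomes concrete. The back-of-the-envelope estimate below assumes the similarity matrix is stored as 4-byte floats; the anchor count of 16 is a hypothetical value for illustration, since the exact number of anchors is not printed above.

```python
def similarity_matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Memory needed for a dense n_rows x n_cols similarity matrix."""
    return n_rows * n_cols * bytes_per_value

n_postings = 123_849  # size of the full postings set
n_anchors = 16        # hypothetical anchor count

all_pairs = similarity_matrix_bytes(n_postings, n_postings)
anchors_only = similarity_matrix_bytes(n_postings, n_anchors)

print(f"every posting vs every posting: {all_pairs / 1e9:.1f} GB")
print(f"every posting vs anchors only:  {anchors_only / 1e6:.1f} MB")
```

Under these assumptions the all-pairs matrix would need roughly 61 GB, while the anchors-only matrix fits in a few megabytes.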

In [31]:
postings.loc[postings['max_similarity'] >= 0.85].sort_values('max_similarity')

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,embeddings,max_similarity
75453,3903451918,SEPHORA,AI/ML Engineering Manager,Your role at Sephora:This is an opportunity fo...,200000.0,YEARLY,San Francisco Bay Area,6814.0,50.0,,...,Mid-Senior level,,1.713455e+12,,0,FULL_TIME,USD,BASE_SALARY,"[-0.07451673, -0.039411265, 0.04947235, 0.0234...",0.851571
52488,3901481818,Shure Incorporated,"Engineer Associate Staff, DSP Machine Learning",Calling all audio industry disruptors! We are ...,,,"Niles, IL",6579.0,53.0,,...,Entry level,,1.713208e+12,careersus-shure.icims.com,0,FULL_TIME,,,"[-0.12563434, -0.03097785, 0.10350252, 0.04282...",0.852664
121211,3906237182,General Mills,Sr Data Engineer,"Employer: General Mills, Inc.\n\nJob Title: Sr...",174600.0,YEARLY,"Minneapolis, MN",2822.0,3.0,,...,Mid-Senior level,,1.713564e+12,careers.generalmills.com,0,FULL_TIME,USD,BASE_SALARY,"[-0.1133646, 0.046189804, 0.01754632, 0.038489...",0.852850
96474,3904948162,Apex Systems,Sr Data Engineer,Job#: 2023101\n\nJob Description:\n\n Senior D...,,,"San Antonio, TX",4787.0,6.0,,...,Mid-Senior level,,1.713454e+12,www.apexsystems.com,0,FULL_TIME,,,"[-0.1133646, 0.046189804, 0.01754632, 0.038489...",0.852850
70533,3902838650,Experis,Sr Data Engineer,Job title: Senior Data Engineer (Databricks Pl...,,,"Philadelphia, PA",2203697.0,5.0,,...,Mid-Senior level,,1.713561e+12,click.appcast.io,0,CONTRACT,,,"[-0.1133646, 0.046189804, 0.01754632, 0.038489...",0.852850
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88004,3904392124,SAE International,Senior Machine Learning Engineer,Job Summary\n\nSAE is seeking an experienced S...,,,"Warrendale, PA",25098.0,46.0,,...,Mid-Senior level,,1.713394e+12,recruiting.ultipro.com,0,FULL_TIME,,,"[-0.08761789, 0.021859843, 0.07799667, 0.00514...",1.000000
68594,3902780915,Capital One,Senior Machine Learning Engineer,"West Creek 1 (12071), United States of America...",,,"Richmond, VA",1419.0,19.0,,...,Mid-Senior level,,1.713404e+12,dsp.prng.co,0,FULL_TIME,,,"[-0.08761789, 0.021859843, 0.07799667, 0.00514...",1.000000
91970,3904572433,Square One Resources,Senior Machine Learning Engineer,"Senior Machine Learning EngineerSalary: $190,0...",210000.0,YEARLY,"California, United States",23254.0,4.0,,...,Mid-Senior level,,1.713551e+12,,0,FULL_TIME,USD,BASE_SALARY,"[-0.08761789, 0.021859843, 0.07799667, 0.00514...",1.000000
83960,3904092471,Quantiphi,Senior Machine Learning Engineer,Company ProfileQuantiphi is an award-winning A...,,,United States,3618960.0,4.0,,...,Mid-Senior level,,1.713531e+12,,0,FULL_TIME,,,"[-0.08761789, 0.021859843, 0.07799667, 0.00514...",1.000000
