
# Moonhub Acronym Expansion and Variations using KMeans

**Approach:** 
1. Get job titles from LinkedIn dataset (`people_sample.jsonl`)
2. Get embeddings for each title (`title_to_embedding`)
3. Find k=1000 clusters to bucket similar job titles (`kmeans`)
4. For a given title, find its cluster and return the similar variations (`get_variations()`)


**Analysis:** We can tighten or loosen the clusters by increasing or decreasing `k`. For example, to force the variations to be _very_ similar, increase `k` so that each cluster holds only a few titles.

Conversely, to loosen the clusters and return every variation that is _somewhat_ similar, decrease `k`.
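As a rough guide to how `k` sets the bucket size, the mean number of titles per cluster is the number of embedded titles divided by `k`. The sketch below uses the 5,682 embedded titles from this notebook; `mean_cluster_size` is a hypothetical helper, and real clusters are uneven, so the result is an average rather than a guarantee.

```python
def mean_cluster_size(num_titles: int, k: int) -> float:
    """Average number of titles per cluster when num_titles are split into k clusters."""
    if k <= 0:
        raise ValueError("k must be positive")
    return num_titles / k


# 5,682 is the number of embedded titles in this notebook; the k values are illustrative.
for k in (250, 1000, 2500):
    print(f"k={k}: about {mean_cluster_size(5682, k):.1f} titles per cluster")
```

A larger `k` lowers the average, which is why raising `k` tends to return fewer, closer variations.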


**Improvement:** Increase the size of the dataset. I capped the set at ~5,600 titles because querying `openai` for the embeddings took roughly 30 minutes.
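One possible way to shorten that wait, sketched under assumptions: the embedding call in this notebook already passes `input` as a list, so several titles could plausibly go in one request instead of one title per request. In the sketch below, `batched`, `embed_in_batches`, and the `embed_batch` callable are hypothetical names. `embed_batch` stands in for a function that sends a list of titles to the API and returns one embedding per title in the same order. The batch size of 100 is an arbitrary illustration, not a measured optimum.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    titles: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> Dict[str, List[float]]:
    """Embed titles one chunk at a time and return a title -> embedding map."""
    title_to_embedding: Dict[str, List[float]] = {}
    for chunk in batched(titles, batch_size):
        embeddings = embed_batch(chunk)  # assumed: one request per chunk
        if len(embeddings) != len(chunk):
            raise ValueError("expected one embedding per title")
        title_to_embedding.update(zip(chunk, embeddings))
    return title_to_embedding


# Stand-in for the real API call so the sketch runs offline.
def fake_embed_batch(chunk: List[str]) -> List[List[float]]:
    return [[float(len(title))] for title in chunk]


result = embed_in_batches(["owner", "founder", "director"], fake_embed_batch, batch_size=2)
print(result)
```

Passing the API call in as a function keeps the chunking logic testable without network access.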

In [10]:
import json
import os
import requests
import io
import time

import warnings
warnings.filterwarnings("ignore")

#from IPython.display import Image, clear_output
from PIL import Image
import urllib.request
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle

In [5]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.0-py3-none-any.whl (70 kB)
Installing collected packages: openai
Successfully installed openai-0.27.0


In [7]:
import openai
from google.colab import drive

drive.mount('/content/drive')

API_KEY_PATH = '/content/drive/MyDrive/openai_api_key.txt'

with open(API_KEY_PATH, 'r') as f:
  openai.api_key  = f.read().strip()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
%%time

candidates = []
load_errors = 0
with open('/content/drive/MyDrive/people_sample.jsonl', 'r') as f:
    for line in f:
        try:
            candidates.append(json.loads(line))
        except ValueError as e:
            load_errors += 1

print(f'Read {len(candidates)} lines, could not read {load_errors} lines.')

Read 200000 lines, could not read 0 lines.
CPU times: user 27 s, sys: 4.69 s, total: 31.6 s
Wall time: 38.6 s


In [7]:
df = pd.DataFrame(candidates)

df.head(2)

Unnamed: 0,profile_pic,last_name,profile_id,updated_at,education,connection_count,first_name,user_id,title,headline,...,created_at,locality,experience,summary,skills,specialties,publications,position,follower_count,patents
0,https://s3.amazonaws.com/media.mixrank.com/pro...,Ran,-2147480832,2022-02-24T07:35:42.327624,"[{'activities': None, 'school': {'id': None, '...",43.0,Yu,,Executive Assistant,百岳特生物科技（上海） - Executive Assistant,...,2021-05-27T07:33:06.478974,"The Hague, South Holland, Netherlands","[{'end_date': '2019-07-01', 'locality': 'Nethe...",Enthusiastic individual with strong communicat...,,,,"{'title': 'Executive Assistant', 'linkedin_com...",,
1,,Duel,-2147477352,2021-04-18T02:18:24.714766,,,David,,Senior Affiliate Manager at AFFBROS,Senior Affiliate Manager at AFFBROS,...,2019-04-24T19:07:29.90645,,"[{'end_date': None, 'locality': None, 'linkedi...",,,,,{'title': 'Senior Affiliate Manager at AFFBROS...,,


In [9]:
# We have 86,683 unique titles with "Owner" and "Software Engineer" being most common
df['title'].value_counts()

Owner                                        1751
Software Engineer                            1446
Founder                                      1006
Director                                      914
Project Manager                               843
                                             ... 
Founder & Lead Activist                         1
Matematik öğretmeni                             1
Projektleitung                                  1
Director, Contents Data & Direct Supplier       1
Global Commercial Manager                       1
Name: title, Length: 86683, dtype: int64

**Pre-processing**

Room for improvement when it comes to pre-processing:
- Handle different languages, maybe translate
- Use `experience[0]` rather than `title`

Fortunately, apart from language differences, embeddings are fairly robust to stray characters and other quirks of natural language.

In [48]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import re

stop_words = set(stopwords.words('english'))

def preprocess_titles(titles):
  """Strips, lowercases, and removes non-strings"""
  new_titles = []

  for title in titles:
    # Skip missing values (e.g. NaN), which are not strings
    if isinstance(title, str):
      # Remove company names and location information
      title = title.split(' at ')[0]
      title = title.split(' - ')[0]
      title = title.split(' | ')[0]

      # Lowercase and remove numbers
      title = title.lower()
      title = re.sub(r'\d+', '', title)
      title = title.strip()

      # Tokenize
      tokens = word_tokenize(title)

      # Remove punctuation
      tokens = [token for token in tokens if token not in string.punctuation]

      # Remove stop words
      tokens = [token for token in tokens if token not in stop_words]

      # NOTE: the filtered tokens are computed but not stored. Every output in
      # this notebook was produced from the cleaned `title` string, so that is
      # what is kept here. To use the tokens instead, append ' '.join(tokens)
      # (joined with a space, not an empty string).
      new_titles.append(title)

  return new_titles

print('Sample:\n')
preprocess_titles(df['title'].iloc[:10])

Sample:



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['executive assistant',
 'senior affiliate manager',
 'cyber security stream delivery lead',
 'long term engineer',
 'sports medicine specialist',
 'assembler',
 'financial analyst',
 'mortgage advisor',
 'co-founder/partner']
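The part of the cleaning that shapes the sample output above (splitting on separators, lowercasing, dropping digits) can be sketched with the standard library alone. `clean_title` and `SEPARATORS` are hypothetical names for illustration, and the third example title is made up.

```python
import re

# Separators that usually introduce a company name or location.
SEPARATORS = (" at ", " - ", " | ")


def clean_title(raw_title: str) -> str:
    """Keep the text before the first separator, lowercase it, and drop digits."""
    title = raw_title
    for separator in SEPARATORS:
        title = title.split(separator)[0]
    title = title.lower()
    title = re.sub(r"\d+", "", title)
    return title.strip()


for raw in ("Senior Affiliate Manager at AFFBROS", "Executive Assistant", "Head of Sales - EMEA"):
    print(repr(raw), "->", repr(clean_title(raw)))
```

Splitting on ` - ` also discards qualifiers such as a region, so the heuristic trades some detail for fewer near-duplicate titles.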

In [8]:
MAP_FILE_PATH = '/content/drive/MyDrive/title_to_embedding.pkl'

In [66]:
# {"software engineer": [.01, -.04, ...]}
#title_to_embedding = {}

with open(MAP_FILE_PATH, 'rb') as f:
  title_to_embedding = pickle.load(f)

print(f'Read {len(title_to_embedding)} titles/embeddings.')

Read 691 titles/embeddings.
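The load above assumes the pickle file already exists. A minimal sketch of a loader that falls back to an empty map on the first run follows; `load_embedding_map` is a hypothetical helper, and the demonstration writes to a temporary directory rather than the Drive path.

```python
import os
import pickle
import tempfile


def load_embedding_map(path: str) -> dict:
    """Return the saved title -> embedding map, or an empty dict if no file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


# Demonstration with a temporary file instead of the Drive path.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "title_to_embedding.pkl")
    first_load = load_embedding_map(demo_path)   # no file yet, so an empty map
    with open(demo_path, "wb") as f:
        pickle.dump({"software engineer": [0.01, -0.04]}, f)
    second_load = load_embedding_map(demo_path)  # reads the saved map

print(len(first_load), len(second_load))
```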


In [68]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

# Retry a single API call with exponential backoff. Decorating the whole loop
# would never retry, because the loop catches its own exceptions.
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def embed_title(title):
  response = openai.Embedding.create(input=[title], model="text-embedding-ada-002")
  return response['data'][0]['embedding']

def get_embeddings(titles):
  """Gets the embedding for each title to store in a map"""

  # Do the preprocessing (set so we avoid dups)
  processed_titles = set(preprocess_titles(titles))

  for title in processed_titles:
    # Get embedding and create mapping
    try:
      title_to_embedding[title] = embed_title(title)
    except Exception as e:
      # Reached only after all retry attempts have failed
      print(f'Error with title: {title} ({e})')

  # Save to drive
  with open(MAP_FILE_PATH, 'wb') as f:
    pickle.dump(title_to_embedding, f)

In [69]:
%%time

LAST_IDX = 1000
NUM = 9000

print(f'\nGetting {NUM} more embeddings\n')
get_embeddings(df['title'].iloc[LAST_IDX:LAST_IDX+NUM])
#title_to_embedding['executive assistant'][:5]


Getting 9000 more embeddings

CPU times: user 34.2 s, sys: 2.19 s, total: 36.4 s
Wall time: 22min 1s
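The retry policy used for the API calls (randomized exponential backoff, up to six attempts) can be approximated with the standard library. This is a hand-rolled sketch, not tenacity's implementation: `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the demo can skip real waiting.

```python
import random
import time


def call_with_backoff(fn, max_attempts=6, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with randomized exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Wait a random time up to base_delay * 2**(attempt - 1), capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))


# Demo: a function that fails twice, then succeeds. Sleeping is disabled here.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = call_with_backoff(flaky, sleep=lambda seconds: None)
print(outcome, calls["count"])
```

The random wait spreads retries out, which helps when many requests hit a rate limit at the same moment.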


In [13]:
with open('/content/drive/MyDrive/title_to_embedding.pkl', 'rb') as f:
  title_to_embedding = pickle.load(f)

print(f'We now have {len(title_to_embedding)} titles/embeddings.')

We now have 5682 titles/embeddings.


In [14]:
title_to_embedding['software engineer'][:5]

[0.0034764218144118786,
 -0.008387231267988682,
 0.006127814296633005,
 -0.02333362028002739,
 -0.013775600120425224]

## Fit KMeans


In [54]:
train = pd.DataFrame(list(title_to_embedding.items()), columns=['title', 'embedding'])

train.head()

Unnamed: 0,title,embedding
0,executive assistant,"[-0.03487851843237877, -0.0056572575122118, 0...."
1,senior affiliate manager,"[-0.021203506737947464, -0.04454905539751053, ..."
2,cyber security stream delivery lead,"[0.008899341337382793, -0.036653582006692886, ..."
3,long term engineer,"[-0.00446087634190917, -0.012415383942425251, ..."
4,sports medicine specialist,"[-0.0020971628837287426, 0.010949467308819294,..."


In [49]:
# Fit Kmeans
from sklearn.cluster import KMeans

k = 1000
embeddings = train['embedding'].to_list()
kmeans = KMeans(n_clusters=k, init='k-means++', n_init='auto')
train['labels'] = kmeans.fit_predict(embeddings)

train.head(5)

Unnamed: 0,title,embedding,labels
0,executive assistant,"[-0.03487851843237877, -0.0056572575122118, 0....",591
1,senior affiliate manager,"[-0.021203506737947464, -0.04454905539751053, ...",279
2,cyber security stream delivery lead,"[0.008899341337382793, -0.036653582006692886, ...",511
3,long term engineer,"[-0.00446087634190917, -0.012415383942425251, ...",56
4,sports medicine specialist,"[-0.0020971628837287426, 0.010949467308819294,...",497


In [50]:
# Create cluster_to_titles map
from collections import defaultdict

cluster_to_titles = defaultdict(list)

for idx, row in train.iterrows():
  cluster_to_titles[row['labels']].append(row['title'])

cluster_to_titles[45]

['vp procurement neutral grey llc',
 'vp sales & marketing specializing in ddos protection & high bandwidth dedicated/cloud server hosting',
 'vp analytics',
 'vp of products',
 'vp imaging',
 'vp products',
 'vp consumer marketing',
 'vp, products',
 'vp, product',
 'vp, growth']

Now predict the cluster for a given title string (e.g. "nlp scientist") and return all titles in that cluster.

In [51]:
def get_variations(given_title):
  """Gets the variations of a given job title by finding its cluster and printing all titles within that cluster"""
  if given_title in title_to_embedding:
    print('Get existing embedding\n')
    embedding = title_to_embedding[given_title]
  else:
    print('Get new embedding\n')
    embedding = openai.Embedding.create(input=[given_title], model="text-embedding-ada-002")['data'][0]['embedding']

  # Assign the embedding to its nearest cluster and look up that cluster's titles
  cluster = kmeans.predict([embedding])[0]
  variations = cluster_to_titles[cluster]

  print(f'Pulling titles from cluster {cluster}\n')
  print(f"The variations of '{given_title}' we have are: {variations}")
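The `kmeans.predict` step assigns an embedding to the cluster whose centroid is nearest in Euclidean distance. A minimal pure-Python sketch of that assignment follows, using made-up 2-D centroids and a toy cluster map rather than real embeddings; `nearest_centroid` and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def nearest_centroid(point: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Toy stand-ins for the fitted centroids and the cluster -> titles map.
toy_centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_cluster_to_titles = {
    0: ["owner"],
    1: ["lyft driver", "private driver"],
    2: ["vp products"],
}

cluster = nearest_centroid((0.9, 1.2), toy_centroids)
print(cluster, toy_cluster_to_titles[cluster])
```

Because every point is assigned to some centroid, a title unlike anything in the dataset still lands in a cluster, so the returned variations are only as good as the nearest bucket.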

## Results Sample

In [52]:
get_variations('machine learning engineer')

Get existing embedding

Pulling titles from cluster 158

The variations of 'machine learning engineer' we have are: ['software systems engineer, self-driving', 'machine learning engineer', 'artificial intelligence/machine learning engineer', 'machine learning intern', 'machine learning engineer and software developer']


In [53]:
get_variations('taxi driver')

Get new embedding

Pulling titles from cluster 837

The variations of 'taxi driver' we have are: ['lyft driver', 'door dasher', 'private driver']


## Summary

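In the two samples above, the clusters return closely related titles: 'machine learning engineer' pulls in other machine learning roles, and 'taxi driver', which had no stored embedding, lands in a cluster of other driving jobs.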