
In [52]:
import numpy as np

In [53]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [54]:
import openai
from google.colab import drive

drive.mount('/content/drive')

API_KEY_PATH = '/content/drive/MyDrive/openai_api_key.txt'

with open(API_KEY_PATH, 'r') as f:
  openai.api_key = f.read().strip()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [57]:
from sklearn.metrics.pairwise import cosine_similarity

# Test cosine similarity
nlp_scientist_embedding = openai.Embedding.create(input = ["nlp scientist"], model="text-embedding-ada-002")['data'][0]['embedding']
nlp_researcher_embedding = openai.Embedding.create(input = ["nlp researcher"], model="text-embedding-ada-002")['data'][0]['embedding']

cosine_similarity(np.array([nlp_scientist_embedding, nlp_researcher_embedding]))[1][0]

0.9728179683871935
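
For reference, the score above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their lengths. A minimal pure-Python sketch of that formula follows; the helper name `cosine` and the toy 3-dimensional vectors are illustrative assumptions, not real embeddings.

```python
import math

def cosine(u, v):
  """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
  dot = sum(a * b for a, b in zip(u, v))
  norm_u = math.sqrt(sum(a * a for a in u))
  norm_v = math.sqrt(sum(b * b for b in v))
  return dot / (norm_u * norm_v)

# Toy vectors (assumed for illustration only)
same_direction = cosine([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 and perpendicular vectors score 0, which is why two near-synonymous titles land near 0.97.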

- Don't forget about adjectives: "senior", "sr.", "staff", "principal", "lead", etc.

- Misspellings

- Or the classic "software engineer **(ml)**", "machine learning engineer **(nlp)**"
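
One possible way to handle the adjective and parenthetical cases above is to split them off before embedding. The sketch below is an assumption, not part of the pipeline in this notebook: `normalize_title` and `SENIORITY` are hypothetical names, and misspellings are not handled.

```python
import re

# Seniority adjectives from the list above, plus the undotted "sr" (an assumed extra)
SENIORITY = {"senior", "sr.", "sr", "staff", "principal", "lead"}

def normalize_title(title):
  """Split a raw title into (seniority, core title, parenthetical specialty)."""
  title = title.strip().lower()

  # Pull out a trailing parenthetical such as "(nlp)"
  specialty = None
  match = re.search(r"\(([^)]*)\)\s*$", title)
  if match:
    specialty = match.group(1).strip()
    title = title[:match.start()].strip()

  # Treat a leading seniority adjective as a separate field
  words = title.split()
  seniority = None
  if words and words[0] in SENIORITY:
    seniority = words[0].rstrip(".")
    words = words[1:]

  return seniority, " ".join(words), specialty

print(normalize_title("Sr. Software Engineer"))
print(normalize_title("machine learning engineer (nlp)"))
```

The core title could then be matched against the stored titles, with seniority and specialty re-attached afterwards.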

Store each most common title (e.g. "software engineer") as a key, mapped to its embedding and a list of its known variations:

```
job_title_mapping = {
    most_common_title_string: [most_common_title_embedding, [variation1, variation2, ...]]
}
```

**Potential Improvement:** Create a `JobTitleMapping` class with explicit types instead of nested lists.
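
A rough sketch of what such a class might look like, assuming a dataclass per entry. The names `JobTitleEntry`, `add_title`, `add_variation`, and `variations` are hypothetical and do not exist elsewhere in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class JobTitleEntry:
  """Embedding and known variations for one most common title."""
  embedding: Optional[List[float]] = None  # filled in once, ahead of time
  variations: List[str] = field(default_factory=list)

class JobTitleMapping:
  """Typed wrapper around {most_common_title: JobTitleEntry}."""

  def __init__(self) -> None:
    self._entries: Dict[str, JobTitleEntry] = {}

  def add_title(self, title: str, variations: List[str]) -> None:
    if not isinstance(title, str):
      raise TypeError("title must be a str")
    self._entries[title] = JobTitleEntry(variations=list(variations))

  def add_variation(self, title: str, variation: str) -> None:
    # Ignore duplicates and the most common title itself
    entry = self._entries[title]
    if variation != title and variation not in entry.variations:
      entry.variations.append(variation)

  def variations(self, title: str) -> List[str]:
    return list(self._entries[title].variations)

mapping = JobTitleMapping()
mapping.add_title('software engineer', ['swe', 'software eng'])
mapping.add_variation('software engineer', 'software dev')
print(mapping.variations('software engineer'))
```

Named fields replace the positional `[embedding, variations]` list, so a mix-up between index 0 and index 1 becomes harder to write.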

In [74]:
job_title_mapping = {
    'nlp scientist': [None, ['nlp researcher','natural language processing scientist', 'natural language processing researcher']],
    'nlp engineer': [None, ['natural language processing engineer', 'machine learning engineer (nlp)', 'ml engineer (nlp)', 'mle (nlp)']],
    'machine learning engineer': [None, ['mle', 'ml engineer', 'ml eng', 'machine learning eng']],
    'software engineer': [None, ['swe', 'software eng', 'software developer', 'software dev']]
}

# Get embeddings once, ahead of time
for most_common_title, embedding_and_variations in job_title_mapping.items():
  most_common_title_embedding = openai.Embedding.create(input = [most_common_title], model="text-embedding-ada-002")['data'][0]['embedding']
  embedding_and_variations[0] = most_common_title_embedding

In [75]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def get_title_variations(title):
  """
  Gets the best similarity score and the variations of a given title, including acronyms

  (e.g. "nlp scientist" => {"nlp researcher", "natural language processing scientist", ...})
  """
  # Get embedding of given title
  title = title.strip().lower()
  title_embedding = openai.Embedding.create(input = [title], model="text-embedding-ada-002")['data'][0]['embedding']

  # Check similarity to given "most common" job titles
  highest_similarity_val = float('-inf')
  highest_similarity_title = None
  for most_common_title, (most_common_title_embedding, variations) in job_title_mapping.items():
    current_similarity = cosine_similarity(np.array([title_embedding, most_common_title_embedding]))[1][0]

    if current_similarity > highest_similarity_val:
      highest_similarity_title = most_common_title
      highest_similarity_val = current_similarity

  # Now we have the variations of `title`
  variations = job_title_mapping[highest_similarity_title][1]

  # Add to our knowledge base: if this is a new variation, let's add it to our list
  # (skip the most common title itself so it is never stored as its own variation)
  if title != highest_similarity_title and title not in variations:
    job_title_mapping[highest_similarity_title][1].append(title)

  # Leave out `title` if it's in the variations
  # Not concerned about time complexity given there are only so many ways to represent a single job
  return highest_similarity_val, set([highest_similarity_title] + variations) - set([title])

### Results

We can see that it returns the relevant variations _and_ adds an unseen variation (e.g. 'Senior software engineer') to the stored set of titles.

In [76]:
get_title_variations('mle')

(0.8053917071349258,
 {'machine learning eng',
  'machine learning engineer',
  'ml eng',
  'ml engineer'})

In [77]:
get_title_variations('Senior software engineer')

(0.9335964545280065,
 {'software dev',
  'software developer',
  'software eng',
  'software engineer',
  'swe'})

In [78]:
get_title_variations('sr. software engineer')

(0.9205671151784326,
 {'senior software engineer',
  'software dev',
  'software developer',
  'software eng',
  'software engineer',
  'swe'})


**Limitation/Improvement Opportunity:**
What about when we ask for a new title completely unrelated to what is stored?

In [79]:
# Try one
get_title_variations("sales development representative")

(0.7910379410037526,
 {'senior software engineer',
  'software dev',
  'software developer',
  'software eng',
  'software engineer',
  'sr. software engineer',
  'swe'})

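Set a similarity threshold at 0.8? That seems a bit high, and it leaves little margin: the legitimate match 'mle' scored only 0.805, barely above the 0.791 of this unrelated title.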
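
To make the trade-off concrete, here is a small sketch that applies a cutoff to the scores observed in the cells above. `MIN_SIMILARITY` and `accept_match` are hypothetical names, and 0.8 is just the value floated here, not a tuned threshold.

```python
MIN_SIMILARITY = 0.8  # assumed cutoff; would need tuning on labeled examples

def accept_match(similarity, threshold=MIN_SIMILARITY):
  """Return True if the best match is similar enough to trust."""
  return similarity >= threshold

# Similarity scores copied from the outputs above
observed = {
  'mle': 0.8053917071349258,
  'Senior software engineer': 0.9335964545280065,
  'sr. software engineer': 0.9205671151784326,
  'sales development representative': 0.7910379410037526,
}

for title, score in observed.items():
  print(f"{title}: {score:.3f} accepted={accept_match(score)}")
```

With this cutoff the unrelated title is rejected, but 'mle' passes by less than 0.01, so a small change in embeddings could flip either result.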

**Alternative:** We can enter the titles we care about up front, say 50 or 100 job titles. This is slightly tedious but robust.

We could also grab a dataset of most common job titles from the web. If we're focused on the tech industry, we can grab the most common jobs in tech.