<a href="https://colab.research.google.com/github/grantgasser/moonhub/blob/master/Moonhub_Acronym_Expansion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import numpy as np

In [1]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.26.5.tar.gz (55 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai
  Building wheel for openai (pyproject.toml) ... done
  Created wheel for openai: filename=openai-0.26.5-py3-none-any.whl size=67620 sha256=b232b3c991743885d92da7eb745d5d1363904cc9a19de6f08a8d37b4ab64be1e
  Stored in directory: /root/.cache/pip/wheels/a7/47/99/8273a59fbd59c303e8ff175416d5c1c9c03a2e83ebf7525a99
Successfully built openai
Installing collected packages: openai
Successfully installed openai-0.26.5


In [3]:
import openai
from google.colab import drive

drive.mount('/content/drive')

API_KEY_PATH = '/content/drive/MyDrive/openai_api_key.txt'

with open(API_KEY_PATH, 'r') as f:
  openai.api_key = f.read().strip()

Mounted at /content/drive


In [5]:
nlp_scientist_embedding = openai.Embedding.create(input = ["nlp scientist"], model="text-embedding-ada-002")['data'][0]['embedding']

nlp_researcher_embedding = openai.Embedding.create(input = ["nlp researcher"], model="text-embedding-ada-002")['data'][0]['embedding']

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity of the two title embeddings (2x2 matrix; the off-diagonal entries are the score)
print(cosine_similarity(np.array([nlp_scientist_embedding, nlp_researcher_embedding])))

[[1.         0.97279782]
 [0.97279782 1.        ]]


In [17]:
cosine_similarity(np.array([nlp_scientist_embedding, nlp_researcher_embedding]))[1][0]

0.9727978167954503
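For reference, the score above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their lengths. A minimal sketch of that formula follows, using made-up 3-dimensional vectors rather than real `text-embedding-ada-002` embeddings; the helper name `cosine` is illustrative and not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (assumption for illustration)
a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]

print(cosine(a, a))  # same direction, so approximately 1
print(cosine(a, b))  # 60 degrees apart, so approximately 0.5
```

A score near 1, like the one for "nlp scientist" and "nlp researcher", means the two vectors point in almost the same direction.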

- Don't forget about adjectives: "senior", "sr.", "staff", "principal", "lead", etc.

- Misspellings

- Or the classic "software engineer **(ml)**", "machine learning engineer **(nlp)**"
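One possible way to handle the adjective and parenthetical cases before computing an embedding is to split a raw title into seniority, core title, and specialty. The sketch below is illustrative only: `split_title`, `SENIORITY_WORDS`, and the regular expression are assumptions rather than part of the notebook, and it does nothing about misspellings.

```python
import re

# Hypothetical set of seniority adjectives, taken from the bullet above
SENIORITY_WORDS = {"senior", "sr", "sr.", "staff", "principal", "lead"}

def split_title(raw_title):
    """Return (seniority, core_title, specialty) for a raw job title."""
    title = raw_title.strip().lower()

    # Pull out a trailing parenthetical such as "(nlp)"
    specialty = None
    match = re.search(r"\(([^)]*)\)\s*$", title)
    if match:
        specialty = match.group(1).strip()
        title = title[:match.start()].strip()

    # Peel seniority adjectives off the front of the title
    words = title.split()
    seniority = []
    while words and words[0] in SENIORITY_WORDS:
        seniority.append(words.pop(0))

    return " ".join(seniority) or None, " ".join(words), specialty

print(split_title("Sr. Software Engineer (ML)"))
print(split_title("machine learning engineer (nlp)"))
```

Only the core title would then need to be matched against the known titles; the seniority and specialty could be re-attached afterwards.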

Store each most common title (e.g. "software engineer") together with its embedding and its variations like so:

```
job_title_mapping = {
    most_common_title_string: [most_common_title_embedding, [variation1, variation2, ...]]
}
```

**Potential Improvement:** Create a `JobTitleMapping` class and restrict typing
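A rough sketch of what such a class might look like is shown below. Everything except the name `JobTitleMapping` is an assumption (`JobTitleEntry`, the method names, and the choice of a dataclass), and Python type hints are not enforced at runtime, so the "restriction" here comes from the explicit checks and from routing all writes through a few methods.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class JobTitleEntry:
    """Embedding and known variations for one most common title."""
    embedding: Optional[List[float]] = None
    variations: List[str] = field(default_factory=list)

class JobTitleMapping:
    """Maps each most common title to its embedding and variations."""

    def __init__(self) -> None:
        self._entries: Dict[str, JobTitleEntry] = {}

    def add_title(self, title: str, variations: List[str]) -> None:
        if not isinstance(title, str):
            raise TypeError("title must be a string")
        self._entries[title] = JobTitleEntry(variations=list(variations))

    def set_embedding(self, title: str, embedding: List[float]) -> None:
        self._entries[title].embedding = [float(x) for x in embedding]

    def add_variation(self, title: str, variation: str) -> bool:
        """Add `variation` if it is new; return True when it was added."""
        entry = self._entries[title]
        if variation == title or variation in entry.variations:
            return False
        entry.variations.append(variation)
        return True

    def variations(self, title: str) -> List[str]:
        return list(self._entries[title].variations)

mapping = JobTitleMapping()
mapping.add_title("software engineer", ["swe", "software eng"])
mapping.add_variation("software engineer", "software dev")  # new, gets added
mapping.add_variation("software engineer", "swe")           # already known, ignored
print(mapping.variations("software engineer"))
```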

In [47]:
job_title_mapping = {
    'nlp scientist': [None, ['nlp researcher','natural language processing scientist', 'natural language processing researcher']],
    'nlp engineer': [None, ['natural language processing engineer', 'machine learning engineer (nlp)', 'ml engineer (nlp)', 'mle (nlp)']],
    'machine learning engineer': [None, ['mle', 'ml engineer', 'ml eng', 'machine learning eng']],
    'software engineer': [None, ['swe', 'software eng', 'software developer', 'software dev']]
}

# Get embeddings once, ahead of time
for most_common_title, embedding_and_variations in job_title_mapping.items():
  most_common_title_embedding = openai.Embedding.create(input = [most_common_title], model="text-embedding-ada-002")['data'][0]['embedding']
  embedding_and_variations[0] = most_common_title_embedding

In [48]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def get_title_variations(title):
  """
  Gets variations of a given title, including its acronyms and their expansions.

  (e.g. "nlp scientist" => {"nlp researcher", "natural language processing scientist", ...})
  """
  # Get embedding of given title
  title = title.strip().lower()
  title_embedding = openai.Embedding.create(input = [title], model="text-embedding-ada-002")['data'][0]['embedding']

  # Check similarity to given "most common" job titles
  highest_similarity_val = float('-inf')
  highest_similarity_title = None
  for most_common_title, (most_common_title_embedding, variations) in job_title_mapping.items():
    current_similarity = cosine_similarity(np.array([title_embedding, most_common_title_embedding]))[1][0]

    if current_similarity > highest_similarity_val:
      #print(f'New highest similarity between {title} and {most_common_title}: {current_similarity} vs. {highest_similarity_val}')
      highest_similarity_title = most_common_title
      highest_similarity_val = current_similarity


  # Now we have the variations of `title`
  variations = job_title_mapping[highest_similarity_title][1]

  # Add to our knowledge base: if this is a new variation, let's add it to our list
  # (skip the most common title itself so it is never stored as its own variation)
  if title != highest_similarity_title and title not in variations:
    variations.append(title)

  # Leave out `title` if it's in the variations
  # Not concerned about time complexity given there are only so many ways to represent a single job
  return set([highest_similarity_title] + variations) - set([title])

### Results

We can see that it returns the relevant variations _and_ adds an unseen variation (e.g. 'Senior software engineer', stored in lowercase) to the set of titles, so later lookups return it as well.

In [49]:
get_title_variations('mle')

{'machine learning eng', 'machine learning engineer', 'ml eng', 'ml engineer'}

In [50]:
get_title_variations('Senior software engineer')

{'software dev',
 'software developer',
 'software eng',
 'software engineer',
 'swe'}

In [51]:
get_title_variations('sr. software engineer')

{'senior software engineer',
 'software dev',
 'software developer',
 'software eng',
 'software engineer',
 'swe'}
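The appearance of 'senior software engineer' in the last result follows from the append inside `get_title_variations`: the earlier call stored it under 'software engineer'. The offline sketch below reproduces that behavior without calling the API. The 2-dimensional vectors are hand-picked stand-ins for real embeddings, and `TOY_EMBEDDINGS`, `toy_mapping`, and `nearest_title_variations` are hypothetical names.

```python
import math

# Hand-picked 2-D stand-ins for real embeddings (assumption for illustration)
TOY_EMBEDDINGS = {
    "software engineer": (1.0, 0.1),
    "machine learning engineer": (0.1, 1.0),
    "senior software engineer": (0.9, 0.2),
    "sr. software engineer": (0.95, 0.2),
}

toy_mapping = {
    "software engineer": ["swe", "software dev"],
    "machine learning engineer": ["mle", "ml engineer"],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_title_variations(title):
    title = title.strip().lower()
    # Pick the most common title whose toy embedding is closest
    best = max(
        toy_mapping,
        key=lambda t: cosine(TOY_EMBEDDINGS[title], TOY_EMBEDDINGS[t]),
    )
    variations = toy_mapping[best]
    # Remember unseen variations so later calls return them too
    if title != best and title not in variations:
        variations.append(title)
    return set([best] + variations) - {title}

first = nearest_title_variations("Senior software engineer")
second = nearest_title_variations("sr. software engineer")
print(sorted(first))   # ['software dev', 'software engineer', 'swe']
print(sorted(second))  # ['senior software engineer', 'software dev', 'software engineer', 'swe']
```

Because the nearest title always wins, an unrelated input would also be appended to whichever list happens to be closest; results depend on the order of earlier calls.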