# HNSW Generation Notebook

This notebook generates the Hierarchical Navigable Small World (HNSW) graph and saves it to a file for later use in the job market analysis application. We build our own HNSW graph even though Weaviate already maintains one because this application needs its own custom search.

We load the vectors and UUIDs that we downloaded from our Weaviate cluster, build the graph, store it in a dictionary, and save it so it can be loaded later by the site at app.nbdata.co. More information on the custom search can be found in the README or on the Notion page.

## Overview


1. Load the job postings dataset and their text embeddings.
2. Initialize the HNSW object with the appropriate parameters.
3. Add each job posting and its embedding to the HNSW graph.
4. Save the HNSW graph to a file for later use.
5. Perform some testing on 3D plots.
6. Perform some testing on prompts and generation.

In [11]:
# from hnsw.hnsw_python import HNSW
from main.hnsw.hnsw_python import HNSW
import pandas as pd
import numpy as np

import pprint
import weaviate
import pickle

In [12]:
import openai

In [4]:
listing_df = pd.read_feather('data/postings_w_embeddings_v2.fth')
listing_df.columns

Index(['job_id', 'scraped', 'company_id', 'work_type', 'formatted_work_type',
       'location', 'job_posting_url', 'applies', 'original_listed_time',
       'remote_allowed', 'application_url', 'application_type', 'expiry',
       'inferred_benefits', 'closed_time', 'formatted_experience_level',
       'years_experience', 'description', 'title', 'skills_desc', 'views',
       'job_region', 'listed_time', 'degree', 'posting_domain', 'sponsored',
       'country', 'country_code', 'job_functions', 'industry_names',
       'company_name', 'description_company', 'company_size', 'state',
       'country_company', 'city', 'zip_code', 'address', 'url', 'text',
       'entities_COMPANY', 'entities_METHODS', 'entities_TOOLS',
       'entities_EXPERIENCE', 'entities_LEVEL', 'entities_REMOTE',
       'entities_RESPONSABILITY', 'entities_TITLE', 'entities_QUALIFICATION',
       'vector', 'wv_uuid', 'annotations'],
      dtype='object')

In [4]:
listing_df =  pd.read_feather('data/data_ner_embeddings.fth')
listing_df.columns

Index(['job_id', 'scraped', 'company_id', 'work_type', 'formatted_work_type',
       'location', 'job_posting_url', 'applies', 'original_listed_time',
       'remote_allowed', 'application_url', 'application_type', 'expiry',
       'inferred_benefits', 'closed_time', 'formatted_experience_level',
       'years_experience', 'description', 'title', 'skills_desc', 'views',
       'job_region', 'listed_time', 'degree', 'posting_domain', 'sponsored',
       'country', 'country_code', 'job_functions', 'industry_names',
       'company_name', 'description_company', 'company_size', 'state',
       'country_company', 'city', 'zip_code', 'address', 'url', 'text',
       'entities_COMPANY', 'entities_METHODS', 'entities_TOOLS',
       'entities_EXPERIENCE', 'entities_LEVEL', 'entities_REMOTE',
       'entities_RESPONSABILITY', 'entities_TITLE', 'entities_QUALIFICATION',
       'vector', 'wv_uuid', 'annotations', 'average_embedding',
       'num_embedding', 'embedding_indexes'],
      dtype='objec

In [7]:
listing_df[['wv_uuid']].to_feather('data/data_wv_uuid.fth')

In [5]:
listing_df = listing_df.drop(columns=['vector'])
listing_df.rename(columns={'average_embedding':'vector'}, inplace=True)

In [6]:
listing_df.to_feather('data/data_ner_embeddings_V3_SEC.fth')

In [10]:
listing_df.query('wv_uuid == "a4c3e389-a64d-4bf2-b23f-6be8f854d42b"').job_id

Series([], Name: job_id, dtype: int64)

In [1]:
len("efa83e0c-0485-4d29-9450-c3659954581f")

36

In [18]:
df_from_listing = listing_df[['job_id', 'vector', 'wv_uuid']].copy()
df_from_listing['vector'] = df_from_listing['vector'].apply(lambda x: np.array(x))
df_from_listing

Unnamed: 0,job_id,vector,wv_uuid
0,3940522647,"[-0.02221361800496067, -0.004398933944425413, ...",efa83e0c-0485-4d29-9450-c3659954581f
1,3940943977,"[-0.019391799678227732, -0.002695409581065178,...",3390cdfe-4ede-4853-b619-5f239e5440be
2,3940421349,"[-0.019783145304124145, 0.007412008540834197, ...",8fa0e76b-64d5-4249-aa52-8582719263bf
3,3940514543,"[-0.02427271308584346, 0.0004912702044950695, ...",be25a046-c6b3-4997-baca-ef07dbd09d67
4,3941779467,"[-0.024040757431066595, 0.023920072009786963, ...",fc339ec2-b14c-46e6-8595-f09c6799e9a9
...,...,...,...
99508,3940812784,"[-0.02023694177235787, -0.0034155561806983306,...",17f43adc-1508-4e0a-ba87-99f5ed580260
99509,3966761724,"[-0.02417820757411483, -0.0036339300785882852,...",0c18adc3-d1b1-4151-a6c3-4e1a44afa40c
99510,3940961160,"[-0.00516968808531399, 0.006455827741673501, 0...",b1929ec7-306c-48af-ba67-c0d5fd785841
99511,3941379263,"[-0.012572629726491868, 0.001053989011173447, ...",a7114086-a202-4fe4-87a3-4bdee2007ebf


In [20]:
vectors_listings = np.stack(df_from_listing['vector'].tolist())
vectors_listings.shape
data = vectors_listings

In [16]:
dim = vectors_listings.shape[1]
num_elements = vectors_listings.shape[0]
data = vectors_listings
data_labels = df_from_listing['job_id'].values
hnsw = HNSW('cosine',df=df_from_listing, m=20,m0=40, ef=100)

In [17]:
for index, i in enumerate(data):
    if index % 1000 == 0:
        pprint.pprint('train No.%d' % index)
    hnsw.add(i)

'train No.0'
'train No.1000'
'train No.2000'
'train No.3000'
'train No.4000'
'train No.5000'
'train No.6000'
'train No.7000'
'train No.8000'
'train No.9000'
'train No.10000'
'train No.11000'
'train No.12000'
'train No.13000'
'train No.14000'
'train No.15000'
'train No.16000'
'train No.17000'
'train No.18000'
'train No.19000'
'train No.20000'
'train No.21000'
'train No.22000'
'train No.23000'
'train No.24000'
'train No.25000'
'train No.26000'
'train No.27000'
'train No.28000'
'train No.29000'
'train No.30000'
'train No.31000'
'train No.32000'
'train No.33000'
'train No.34000'
'train No.35000'
'train No.36000'
'train No.37000'
'train No.38000'
'train No.39000'
'train No.40000'
'train No.41000'
'train No.42000'
'train No.43000'
'train No.44000'
'train No.45000'
'train No.46000'
'train No.47000'
'train No.48000'
'train No.49000'
'train No.50000'
'train No.51000'
'train No.52000'
'train No.53000'
'train No.54000'
'train No.55000'
'train No.56000'
'train No.57000'
'train No.58000'
'train No.

In [21]:
# Check that the graph can find its own data
hnsw.search(data[0], 2)

[(0, 0.0), (82504, 0.013301795109639358)]

In [22]:
min_dist = 1
min_index = 0
max_dist = 0
max_index = 0
for i in range(1000):
    dist = 1 - np.dot(data[0], data[i])/(np.linalg.norm(data[0])*(np.linalg.norm(data[i])))
    if dist < min_dist:
        min_dist = dist
        min_index = i
    if dist > max_dist:
        max_dist = dist
        max_index = i
        print(f"max_index: {max_index}, max_dist: {max_dist}")
        

max_index: 1, max_dist: 0.2274328948847013
max_index: 2, max_dist: 0.25316812175978276
max_index: 5, max_dist: 0.3232100244303666
max_index: 26, max_dist: 0.33485546483113005
max_index: 35, max_dist: 0.3381619810842019
max_index: 251, max_dist: 0.34613895154514795
max_index: 334, max_dist: 0.40410717203989044
max_index: 640, max_dist: 0.4387150115322521
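
The loop above compares `data[0]` against the first 1,000 rows one at a time. A vectorized exact search over the whole array gives a reference to compare the HNSW result against. This is only a minimal sketch: `exact_cosine_knn` is a hypothetical helper name, and random toy vectors stand in for the listing embeddings.

```python
import numpy as np

def exact_cosine_knn(query, data, k):
    """Return the k nearest rows of `data` to `query` as (index, cosine distance) pairs."""
    # Normalize the rows and the query so a dot product gives cosine similarity.
    data_norm = data / np.linalg.norm(data, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    distances = 1.0 - data_norm @ query_norm
    # argsort orders indices from nearest to farthest; keep the first k.
    nearest = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in nearest]

# Toy stand-in for the embedding matrix (hypothetical data, not the real listings).
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(500, 16))
toy_neighbors = exact_cosine_knn(toy_data[0], toy_data, k=2)
```

Running the same helper on the real `data` array and comparing it with `hnsw.search(data[0], 2)` would show whether the approximate search returns the true nearest neighbors for that query.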


In [23]:
#Number of Layers
len(hnsw._graphs)

4
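
Four layers is plausible for a dataset of this size, assuming `hnsw_python` follows the level assignment of the original HNSW design, where the probability that a node reaches level `l` or higher is `m ** -l`. This notebook does not confirm that assumption. Under it, the expected number of nodes per level can be sketched as follows (`expected_nodes_per_level` is a hypothetical helper):

```python
def expected_nodes_per_level(num_elements, m, max_level=5):
    """Expected number of nodes reaching each level, assuming P(level >= l) = m ** -l."""
    return [num_elements * m ** -level for level in range(max_level + 1)]

# 99,513 listings and m=20, the values used in this notebook.
counts = expected_nodes_per_level(99513, 20)
for level, count in enumerate(counts):
    print(f"level {level}: ~{count:.1f} nodes")
```

With these values, roughly a dozen nodes are expected at level 3 and fewer than one at level 4, which is consistent with a graph of four layers.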

In [24]:
search_list = hnsw.serach_along_axis(data[0], 5)

min_dist: 0.0, (0, 0.0), [-0.02221362 -0.00439893  0.03778345 ... -0.02267204  0.01819182
  0.00676272]

max_vectors = [(24577, 1.4497503924839703), (32198, 1.4537960896837472), (72336, 1.4857112637848084), (85199, 1.502616621561403), (86330, 1.5281002390398704)]
max_dist: [0.5502496075160297, 0.5462039103162528, 0.5142887362151916, 0.49738337843859715, 0.47189976096012964], (24577, 1.4497503924839703)


In [25]:
from typing import List
import os

import openai
import weaviate
from main.keys import open_ai_key, weaviate_url, weaviate_key

os.environ["OPENAI_APIKEY"] = open_ai_key
os.environ["WCD_URL"] = weaviate_url
os.environ["WCD_API_KEY"] = weaviate_key

openai_api_key = os.environ.get("OPENAI_APIKEY", "<your OpenAI API key if not set as env var>")
openai.api_key = openai_api_key

# Call the OpenAI embeddings endpoint and return the embedding of the first text
def vectorize(texts: List[str]) -> List[float]:
    response = openai.embeddings.create(
        input=texts, model="text-embedding-3-small"
    )
    # Only the first embedding is returned, so pass one text per call
    return response.data[0].embedding

x_text = "Machine Learning Engineer"
x_vector = vectorize([x_text])

y_text = "Data Scientist"
y_vector = vectorize([y_text])

z_text = "Accountant"
z_vector = vectorize([z_text])

In [26]:
hnsw.df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99513 entries, 0 to 99512
Data columns (total 54 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   job_id                      99513 non-null  int64  
 1   scraped                     99513 non-null  int64  
 2   company_id                  99513 non-null  float64
 3   work_type                   99513 non-null  object 
 4   formatted_work_type         99513 non-null  object 
 5   location                    99513 non-null  object 
 6   job_posting_url             99513 non-null  object 
 7   applies                     22006 non-null  float64
 8   original_listed_time        99513 non-null  float64
 9   remote_allowed              99513 non-null  float64
 10  application_url             95894 non-null  object 
 11  application_type            99513 non-null  object 
 12  expiry                      99513 non-null  float64
 13  inferred_benefits           0 n

In [41]:
#Search along axis with vectors we generated from openAI
x_list = hnsw.serach_along_axis(x_vector, k=50, n=10)
y_list = hnsw.serach_along_axis(y_vector, k=50, n=10)
z_list = hnsw.serach_along_axis(z_vector, k=50, n=10)

min_dist: 0.3344864182008471, (66250, 0.3344864182008471), [-3.05277924e-03  3.51996277e-06  3.83750286e-02 ... -2.07611188e-02
  2.08118904e-02  7.52957848e-04]

max_vectors = [(79495, 1.18143497462973), (85013, 1.184873144662384), (65885, 1.1886757669893047), (98045, 1.1898247615400237), (65160, 1.1906538723942872), (51890, 1.1916904426262358), (17921, 1.1933274432776164), (82486, 1.194017850442944), (93586, 1.1948736716684976), (15430, 1.1948736716684976), (10593, 1.196928207262062), (55138, 1.1975978561341523), (13965, 1.1979946013149718), (50684, 1.2007413635559434), (96257, 1.200906734246058), (91386, 1.2053261719172692), (90412, 1.2065584605868733), (7243, 1.2067298319396247), (63494, 1.2091906772299397), (45652, 1.210731017342467), (66583, 1.2116006251709988), (54032, 1.2122905175613417), (42830, 1.2127726762965785), (55077, 1.2134384293989857), (31607, 1.214509171990443), (15903, 1.2147492278576744), (27284, 1.215023962793461), (75031, 1.2161136189767587), (74841, 1.2172244528

In [43]:
full_list = x_list|y_list|z_list
distance_list = []
for uuid,array in full_list.items():
    print(uuid, array)
    x_dist = hnsw.distance(x_vector, array)
    y_dist = hnsw.distance(y_vector, array)
    z_dist = hnsw.distance(z_vector, array)
    distance_list.append((uuid, x_dist, y_dist, z_dist))

49b4c37d-c69b-433f-8ee8-13ec8cf94f77 [-3.05277924e-03  3.51996277e-06  3.83750286e-02 ... -2.07611188e-02
  2.08118904e-02  7.52957848e-04]
64741739-49da-4633-83ec-200698c02df7 [-0.01536675 -0.00158799  0.03449921 ... -0.01977081  0.01581342
  0.0004023 ]
77d39379-94e6-441f-a6e1-83d6286575b2 [-0.01332302  0.00042782  0.04214333 ... -0.01345114  0.01609977
  0.00396079]
d7ea77c8-cbc2-4b10-90a5-815e212b03ce [-0.0181454  -0.00158145  0.04517341 ... -0.01708312  0.01444402
  0.00357879]
3934503a-4faa-4b5b-8c2c-7289c841b7a3 [-0.0181454  -0.00158145  0.04517341 ... -0.01708312  0.01444402
  0.00357879]
442b72d7-1c31-490b-9ce0-9e20a2e3d470 [-0.0181454  -0.00158145  0.04517341 ... -0.01708312  0.01444402
  0.00357879]
8aa7bddd-295c-46c5-8776-ad24455ee703 [-0.0181454  -0.00158145  0.04517341 ... -0.01708312  0.01444402
  0.00357879]
645b7e55-ff3e-4672-8943-4d0a2cf6cde0 [-0.01252114  0.00012793  0.04340964 ... -0.01285087  0.01694614
  0.00268447]
e6bb49cf-aee1-49e1-92bd-55c978adb312 [-0.0160258
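
The loop above places each listing on three axes by measuring its distance to the three query embeddings. Assuming `hnsw.distance` with the `'cosine'` setting computes one minus cosine similarity (which matches the manual check earlier in the notebook), the same table can be produced in a single matrix operation. The sketch below uses a hypothetical helper, `axis_distances`, and toy 2-D vectors in place of real embeddings.

```python
import numpy as np

def axis_distances(anchors, vectors):
    """Cosine distance from every row of `vectors` to every row of `anchors`.

    Returns an array of shape (len(vectors), len(anchors)).
    """
    anchors = np.asarray(anchors, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    anchors_unit = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    vectors_unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - vectors_unit @ anchors_unit.T

# Toy 2-D stand-ins for the three query embeddings and two listings.
toy_anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_vectors = [[2.0, 0.0], [0.0, 3.0]]
toy_dists = axis_distances(toy_anchors, toy_vectors)
```

Each row of the result corresponds to one listing, and its three columns play the role of `x_dist`, `y_dist`, and `z_dist`.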

In [26]:
#Graph for saving data so we can easily load it on the server
hnsw_save_dict = {'entrypoint': hnsw._enter_point, 'm': hnsw._m, 'm0': hnsw._m0, 'ef': hnsw._ef, 'data': hnsw.data, 'graphs': hnsw._graphs}

In [16]:
#Pickle hnsw_save_dict
# with open('hnsw_save_dict_V3_SEC.pkl', 'wb') as f:
#     pickle.dump(hnsw_save_dict, f)

#Load hnsw_save_dict to test
with open('hnsw_save_dict_V3_SEC.pkl', 'rb') as f:
    hnsw_save_dict = pickle.load(f)

hnsw = HNSW('cosine', listing_df, m=20, m0=40, ef=100)
hnsw._graphs = hnsw_save_dict['graphs']
hnsw._enter_point = hnsw_save_dict['entrypoint']
hnsw.data = hnsw_save_dict['data']

print(f"hnsw_obj initialized")

hnsw_obj initialized
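
Before shipping the pickle to the server, it can help to confirm that the saved dictionary survives a round trip with every key the loader reads. The sketch below uses a toy dictionary with the same keys as `hnsw_save_dict`; its values and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Toy stand-in with the same keys as hnsw_save_dict (values are made up).
toy_save_dict = {
    'entrypoint': 0,
    'm': 20,
    'm0': 40,
    'ef': 100,
    'data': [[0.1, 0.2], [0.3, 0.4]],
    'graphs': [{0: {1: 0.05}, 1: {0: 0.05}}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_hnsw.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_save_dict, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# Keys the loading cell reads back; none of them should be missing.
missing = {'entrypoint', 'data', 'graphs'} - restored.keys()
```

Note that the loading cell above takes `m`, `m0`, and `ef` from the constructor call rather than from the dictionary, so those values have to be kept in sync by hand. Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.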


In [14]:
hnsw_save_dict['graphs'][2]

{466: {5494: 0.07708688715787237,
  5606: 0.09041156668517714,
  10264: 0.08687905409833918,
  13881: 0.08694844337508534,
  17702: 0.08790280470496648,
  20647: 0.09620298121795434,
  22621: 0.08479963932332868,
  25060: 0.0858659798792405,
  34782: 0.07832542363682793,
  36321: 0.08183322234003121,
  53271: 0.09247711659981117,
  61322: 0.0705433711069906,
  67618: 0.07837095181701392,
  69666: 0.07986168450941622,
  70186: 0.07370641084980833,
  71523: 0.08576367206821311,
  72383: 0.06951658585004394,
  75467: 0.08602827947529823,
  77321: 0.1563079179164245,
  83945: 0.09029309440308131},
 1353: {1906: 0.2543685442432668,
  2156: 0.22553434089149615,
  2714: 0.2547715306228997,
  2929: 0.15626360534742512,
  3404: 0.2171882209009498,
  6317: 0.23899684521807418,
  10264: 0.2604154513424024,
  10371: 0.18889453092732011,
  13679: 0.19901833121003143,
  15091: 0.22196101317568784,
  17920: 0.23266487739905162,
  20259: 0.2418698725401618,
  24796: 0.22657579957049911,
  35536: 0.222

In [44]:
distance_df = pd.DataFrame(data=distance_list, columns=['uuid', 'x_dist', 'y_dist', 'z_dist'])
distance_df = distance_df.merge(listing_df, left_on='uuid', right_on='wv_uuid')
distance_df

Unnamed: 0,uuid,x_dist,y_dist,z_dist,job_id,scraped,company_id,work_type,formatted_work_type,location,...,entities_LEVEL,entities_REMOTE,entities_RESPONSABILITY,entities_TITLE,entities_QUALIFICATION,wv_uuid,annotations,vector,num_embedding,embedding_indexes
0,49b4c37d-c69b-433f-8ee8-13ec8cf94f77,0.334486,0.523652,0.710615,3940854002,1,1072429.0,FULL_TIME,Full-time,"Austin, Texas Metropolitan Area",...,[],[],"[**ML Engineer, collaborate with data science ...",[Machine Learning Engineer],[],49b4c37d-c69b-433f-8ee8-13ec8cf94f77,"[{'end': 25, 'label': 'TITLE', 'start': 0, 'te...","[-0.0030527792405337095, 3.5199627745896578e-0...",4,"[1775, 35399, 77383, 82913, 85580, 144566, 373..."
1,64741739-49da-4633-83ec-200698c02df7,0.339148,0.496511,0.703695,3940018380,1,1456380.0,FULL_TIME,Full-time,"London, England, United Kingdom",...,[Senior],[],"[design, build, and deploy production-grade so...",[Machine Learning Engineer],[],64741739-49da-4633-83ec-200698c02df7,"[{'end': 6, 'label': 'LEVEL', 'start': 0, 'tex...","[-0.01536674847981582, -0.0015879875004646325,...",9,"[6356, 49699, 76282, 76294, 76299, 76618, 8491..."
2,77d39379-94e6-441f-a6e1-83d6286575b2,0.341286,0.502081,0.673617,3939314477,1,3516935.0,FULL_TIME,Full-time,"New York, NY",...,[Senior],[],[have the opportunity to continuously learn an...,[Machine Learning Engineer],"[Bachelor’s degree., Master's or doctoral degr...",77d39379-94e6-441f-a6e1-83d6286575b2,"[{'end': 6, 'label': 'LEVEL', 'start': 0, 'tex...","[-0.013323024251800764, 0.0004278155631674543,...",8,"[2761, 49698, 76282, 76294, 76299, 76461, 8208..."
3,d7ea77c8-cbc2-4b10-90a5-815e212b03ce,0.351619,0.507872,0.682071,3939315205,1,3516935.0,FULL_TIME,Full-time,"Philadelphia, PA",...,[],[],[have the opportunity to continuously learn an...,[Lead Machine Learning Engineer],"[Bachelor’s degree \n, Master's or doctoral de...",d7ea77c8-cbc2-4b10-90a5-815e212b03ce,"[{'end': 30, 'label': 'TITLE', 'start': 0, 'te...","[-0.018145395289916037, -0.001581446780404473,...",8,"[2761, 33097, 76282, 76294, 76299, 76461, 8208..."
4,3934503a-4faa-4b5b-8c2c-7289c841b7a3,0.351619,0.507872,0.682071,3939314404,1,3516935.0,FULL_TIME,Full-time,"Annapolis, MD",...,[],[],[have the opportunity to continuously learn an...,[Lead Machine Learning Engineer],"[Bachelor’s degree \n, Master's or doctoral de...",3934503a-4faa-4b5b-8c2c-7289c841b7a3,"[{'end': 30, 'label': 'TITLE', 'start': 0, 'te...","[-0.018145395289916037, -0.001581446780404473,...",8,"[2761, 33097, 76282, 76294, 76299, 76461, 8208..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1371,a080c9fa-d8cf-482c-be84-475a30b1216a,0.695609,0.742077,0.705127,3939546448,1,1358653.0,FULL_TIME,Full-time,"Wichita, KS",...,[],[],[Estimate of weekly payments is intended for i...,[Travel Nurse RN - Med/Surg],[],a080c9fa-d8cf-482c-be84-475a30b1216a,"[{'end': 26, 'label': 'TITLE', 'start': 0, 'te...","[-0.021400268791088212, -0.01575994441130509, ...",6,"[12311, 62995, 76282, 76297, 76453, 234460, 36..."
1372,17836dec-4b12-47f4-8c9f-ebe5553d6a5e,0.696016,0.758684,0.705107,3939520217,1,1358653.0,FULL_TIME,Full-time,"Wichita, KS",...,[],[],[provide an experience that's simply unmatched...,[Travel Nurse RN - Med/Surg],[],17836dec-4b12-47f4-8c9f-ebe5553d6a5e,"[{'end': 26, 'label': 'TITLE', 'start': 0, 'te...","[-0.025098913417120155, -0.011631644408529004,...",6,"[12311, 62911, 76282, 76297, 76453, 402290, 41..."
1373,81ee44c5-02d3-4b78-9dbc-b9cf7fb5c884,0.704974,0.756150,0.705107,3939595364,1,1358653.0,FULL_TIME,Full-time,"Evansville, IN",...,[],[],"[ONC - Skills \n, Stand out in the competitive...","[Travel Registered Nurse - Oncology - $1,908 /...",[],81ee44c5-02d3-4b78-9dbc-b9cf7fb5c884,"[{'end': 50, 'label': 'TITLE', 'start': 0, 'te...","[-0.021229377119905416, -0.013566014512131611,...",6,"[12311, 72572, 76282, 76297, 76453, 278754, 33..."
1374,02e8e307-3eaa-49b9-9103-da4eb32686d7,0.693578,0.754231,0.705594,3939950250,1,1358653.0,FULL_TIME,Full-time,"Phoenix, AZ",...,[],[],[Estimate of weekly payments is intended for i...,"[Travel Registered Nurse - Med/Surg, Registere...",[],02e8e307-3eaa-49b9-9103-da4eb32686d7,"[{'end': 34, 'label': 'TITLE', 'start': 0, 'te...","[-0.02056002547033131, -0.01579916214880844, 0...",6,"[12311, 72096, 76282, 76297, 76453, 234460, 22..."


In [40]:
# Initial distance DataFrame used to render the first plot on page load, so there is something to show
distance_df[['title','company_name', 'uuid', 'x_dist', 'y_dist', 'z_dist', 'wv_uuid']].to_feather('data/distance_df.fth')

In [31]:
distance_df = pd.read_feather('data/distance_df.fth')

In [39]:
distance_df

Unnamed: 0,uuid,x_dist,y_dist,z_dist,job_id,scraped,company_id,work_type,formatted_work_type,location,...,entities_LEVEL,entities_REMOTE,entities_RESPONSABILITY,entities_TITLE,entities_QUALIFICATION,wv_uuid,annotations,vector,num_embedding,embedding_indexes
0,49b4c37d-c69b-433f-8ee8-13ec8cf94f77,0.334486,0.523652,0.710615,3940854002,1,1072429.0,FULL_TIME,Full-time,"Austin, Texas Metropolitan Area",...,[],[],"[**ML Engineer, collaborate with data science ...",[Machine Learning Engineer],[],49b4c37d-c69b-433f-8ee8-13ec8cf94f77,"[{'end': 25, 'label': 'TITLE', 'start': 0, 'te...","[-0.0030527792405337095, 3.5199627745896578e-0...",4,"[1775, 35399, 77383, 82913, 85580, 144566, 373..."
1,64741739-49da-4633-83ec-200698c02df7,0.339148,0.496511,0.703695,3940018380,1,1456380.0,FULL_TIME,Full-time,"London, England, United Kingdom",...,[Senior],[],"[design, build, and deploy production-grade so...",[Machine Learning Engineer],[],64741739-49da-4633-83ec-200698c02df7,"[{'end': 6, 'label': 'LEVEL', 'start': 0, 'tex...","[-0.01536674847981582, -0.0015879875004646325,...",9,"[6356, 49699, 76282, 76294, 76299, 76618, 8491..."
2,77d39379-94e6-441f-a6e1-83d6286575b2,0.341286,0.502081,0.673617,3939314477,1,3516935.0,FULL_TIME,Full-time,"New York, NY",...,[Senior],[],[have the opportunity to continuously learn an...,[Machine Learning Engineer],"[Bachelor’s degree., Master's or doctoral degr...",77d39379-94e6-441f-a6e1-83d6286575b2,"[{'end': 6, 'label': 'LEVEL', 'start': 0, 'tex...","[-0.013323024251800764, 0.0004278155631674543,...",8,"[2761, 49698, 76282, 76294, 76299, 76461, 8208..."
3,d7ea77c8-cbc2-4b10-90a5-815e212b03ce,0.351619,0.507872,0.682071,3939315205,1,3516935.0,FULL_TIME,Full-time,"Philadelphia, PA",...,[],[],[have the opportunity to continuously learn an...,[Lead Machine Learning Engineer],"[Bachelor’s degree \n, Master's or doctoral de...",d7ea77c8-cbc2-4b10-90a5-815e212b03ce,"[{'end': 30, 'label': 'TITLE', 'start': 0, 'te...","[-0.018145395289916037, -0.001581446780404473,...",8,"[2761, 33097, 76282, 76294, 76299, 76461, 8208..."
4,3934503a-4faa-4b5b-8c2c-7289c841b7a3,0.351619,0.507872,0.682071,3939314404,1,3516935.0,FULL_TIME,Full-time,"Annapolis, MD",...,[],[],[have the opportunity to continuously learn an...,[Lead Machine Learning Engineer],"[Bachelor’s degree \n, Master's or doctoral de...",3934503a-4faa-4b5b-8c2c-7289c841b7a3,"[{'end': 30, 'label': 'TITLE', 'start': 0, 'te...","[-0.018145395289916037, -0.001581446780404473,...",8,"[2761, 33097, 76282, 76294, 76299, 76461, 8208..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1279,b6ece5dc-f5a5-4540-a201-16e4a7d04efa,0.690287,0.736975,0.741332,3942020179,1,11402317.0,FULL_TIME,Full-time,"Garden City, KS",...,[],[],[],[Travel RN - Med Surg],[],b6ece5dc-f5a5-4540-a201-16e4a7d04efa,"[{'end': 20, 'label': 'TITLE', 'start': 0, 'te...","[-0.027615307923406364, -0.0033340769819915294...",5,"[4773, 69970, 76282, 76297, 76453]"
1280,3a1fb5b1-fa50-4fbd-a608-98c5e61081ea,0.689148,0.745511,0.741183,3939421962,1,1358653.0,FULL_TIME,Full-time,"Wichita, KS",...,[],[],[],[Travel Nurse RN - Med/Surg],"[PT's and PTA's \n, CNA's \n \nSince we are ne...",3a1fb5b1-fa50-4fbd-a608-98c5e61081ea,"[{'end': 26, 'label': 'TITLE', 'start': 0, 'te...","[-0.03319906753798326, -0.0005788077833130956,...",6,"[12311, 61842, 76282, 76297, 76453, 112498]"
1281,52e53406-32c1-4096-a66a-6212ff8a12eb,0.700044,0.767919,0.741140,3941382021,1,1358653.0,PART_TIME,Part-time,"Owensboro, KY",...,[],[],[],"[Travel Nurse RN - Med/Surg, Registered Nurse \n]",[],52e53406-32c1-4096-a66a-6212ff8a12eb,"[{'end': 26, 'label': 'TITLE', 'start': 0, 'te...","[-0.03315471671521664, -0.009832729570916854, ...",7,"[12311, 62269, 76282, 76297, 76453, 80459, 110..."
1282,70904fb5-25c4-4765-a4fe-329e37ef5dde,0.706278,0.756998,0.740969,3939254040,1,1358653.0,FULL_TIME,Full-time,"Salina, KS",...,[],[],[],[Travel Nurse RN - Med/Surg],[],70904fb5-25c4-4765-a4fe-329e37ef5dde,"[{'end': 26, 'label': 'TITLE', 'start': 0, 'te...","[-0.03232770636677742, -0.011976820882409811, ...",5,"[12311, 63862, 76282, 76297, 76453]"


In [37]:
distance_df.columns

Index(['uuid', 'x_dist', 'y_dist', 'z_dist', 'job_id', 'scraped', 'company_id',
       'work_type', 'formatted_work_type', 'location', 'job_posting_url',
       'applies', 'original_listed_time', 'remote_allowed', 'application_url',
       'application_type', 'expiry', 'inferred_benefits', 'closed_time',
       'formatted_experience_level', 'years_experience', 'description',
       'title', 'skills_desc', 'views', 'job_region', 'listed_time', 'degree',
       'posting_domain', 'sponsored', 'country', 'country_code',
       'job_functions', 'industry_names', 'company_name',
       'description_company', 'company_size', 'state', 'country_company',
       'city', 'zip_code', 'address', 'url', 'text', 'entities_COMPANY',
       'entities_METHODS', 'entities_TOOLS', 'entities_EXPERIENCE',
       'entities_LEVEL', 'entities_REMOTE', 'entities_RESPONSABILITY',
       'entities_TITLE', 'entities_QUALIFICATION', 'wv_uuid', 'annotations',
       'vector', 'num_embedding', 'embedding_indexes'],
 

In [45]:
import pandas as pd
import plotly.graph_objects as go
from plotly.offline import plot
# distance_df holds each listing's cosine distance to the three query vectors (x_dist, y_dist, z_dist)

x = distance_df['x_dist'].values
y = distance_df['y_dist'].values
z = distance_df['z_dist'].values
global_min = min(x.min(), y.min(), z.min())
global_max = max(x.max(), y.max(), z.max())
custom_data = distance_df['wv_uuid'].values
fig = go.Figure(data=go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=8,
        color=x+y+z,  # set color to cluster values
        colorscale='Viridis',  # choose a colorscale
        opacity=0.8,
        colorbar=dict(title='Semantic Distance'),
        
    ),
    # marker=dict(size=8, opacity=0.5),
    text=distance_df['title']+' | '+distance_df['company_name'],  # this will set the hover text
    hoverinfo='text',
    name='Semantic Distance',
    customdata = custom_data

    
))

fig.update_layout(title='Listing cosine distance to each query',
                  scene=dict(
                      xaxis_title=x_text,
                      yaxis_title=y_text,
                      zaxis_title=z_text,
                      aspectmode='cube',
                      xaxis=dict(range=[global_min, global_max]),  # set range for x axis
                      yaxis=dict(range=[global_min, global_max]),  # set range for y axis
                      zaxis=dict(range=[global_min, global_max]),  # set range for z axis
                  ),
                  hovermode='closest')


fig.show()

plot(fig)

'temp-plot.html'

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.offline import plot
# distance_df holds each listing's cosine distance to the three query vectors (x_dist, y_dist, z_dist)

x = distance_df['x_dist'].values
y = distance_df['y_dist'].values
z = distance_df['z_dist'].values
global_min = min(x.min(), y.min(), z.min())
global_max = max(x.max(), y.max(), z.max())
fig = go.Figure(data=go.Scatter(
    x=x,
    y=y,
    # z=z,
    mode='markers',
    marker=dict(
        size=8,
        color=x+y,  # set color to cluster values
        colorscale='Viridis',  # choose a colorscale
        opacity=0.8,
        colorbar=dict(title='Semantic Distance')
    ),
    # marker=dict(size=8, opacity=0.5),
    text=distance_df['title']+' | '+distance_df['company_name'],  # this will set the hover text
    hoverinfo='text',
    name='Semantic Distance'

    
))

# A 2D Scatter trace uses the top-level xaxis/yaxis, not a 3D scene
fig.update_layout(title='Listing cosine distance to each query',
                  xaxis=dict(title=x_text, range=[global_min, global_max]),  # set title and range for x axis
                  yaxis=dict(title=y_text, range=[global_min, global_max]),  # set title and range for y axis
                  hovermode='closest')


fig.show()
config = {
    'displayModeBar': False,  # Remove the mode bar
    'showTips': False,        # Remove the tooltips
    # 'staticPlot': True        # Make the plot static (non-interactive)
}


plot(fig)

'temp-plot.html'

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.offline import plot
from sklearn.linear_model import LinearRegression

def update_plot(x, y):
    # Calculate the line of best fit
    model = LinearRegression()
    model.fit(x.reshape(-1, 1), y)
    y_pred = model.predict(x.reshape(-1, 1))

    # Sort x and y_pred based on x values
    sorted_indices = np.argsort(x)
    x_sorted = x[sorted_indices]
    y_pred_sorted = y_pred[sorted_indices]

    fig = go.Figure(data=[
        go.Scatter(
            x=x,
            y=y,
            mode='markers',
            marker=dict(
                size=8,
                color=x+y,  # set color to cluster values
                colorscale='Viridis',  # choose a colorscale
                opacity=0.8,
                colorbar=dict(title='Semantic Distance')
            ),
            text=distance_df['title']+' | '+distance_df['company_name'],  # this will set the hover text
            hoverinfo='text',
            name='Semantic Distance',
            showlegend=False  # Set showlegend to False for the scatter trace
        ),
        go.Scatter(
            x=x_sorted,
            y=y_pred_sorted,
            mode='lines',
            line=dict(color='red', width=2),
            name='Line of Best Fit',
            showlegend=False  # Set showlegend to False for the line trace
        )
    ])

    fig.update_layout(
        title=f'Listing Cosine Distance: {x_text} v {y_text}',
        xaxis_title=f'Cosine distance from {x_text}',
        yaxis_title=f'Cosine distance from {y_text}',
        hovermode='closest'
    )

    return fig

# distance_df holds each listing's cosine distance to the query vectors (x_dist, y_dist)
x = distance_df['x_dist'].values
y = distance_df['y_dist'].values

fig = update_plot(x, y)
fig.show()
config = {
    'displayModeBar': False,  # Remove the mode bar
    'showTips': False,        # Remove the tooltips
    # 'staticPlot': True        # Make the plot static (non-interactive)
}

plot(fig, filename='ml_vs_data_scientist.html', config=config)


'ml_vs_data_scientist.html'
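
The line of best fit in the cell above comes from scikit-learn's `LinearRegression`. For a single predictor, the same least-squares line can be obtained with NumPy alone. This is a minimal sketch with a hypothetical helper, `best_fit_line`, and toy values in place of the real distances.

```python
import numpy as np

def best_fit_line(x, y):
    """Least-squares line through (x, y); returns sorted x and the fitted y values."""
    slope, intercept = np.polyfit(x, y, deg=1)
    x_sorted = np.sort(x)
    return x_sorted, slope * x_sorted + intercept

# Toy points that lie exactly on y = 2x + 0.5.
toy_x = np.array([0.4, 0.1, 0.3, 0.2])
toy_y = 2.0 * toy_x + 0.5
line_x, line_y = best_fit_line(toy_x, toy_y)
```

The returned arrays can be passed directly as `x` and `y` to the `go.Scatter` line trace.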

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.offline import plot
from sklearn.linear_model import LinearRegression

def update_plot(x, y):
    # Calculate the line of best fit
    model = LinearRegression()
    model.fit(x.reshape(-1, 1), z)
    y_pred = model.predict(x.reshape(-1, 1))

    # Sort x and y_pred based on x values
    sorted_indices = np.argsort(x)
    x_sorted = x[sorted_indices]
    y_pred_sorted = y_pred[sorted_indices]

    fig = go.Figure(data=[
        go.Scatter(
            x=x,
            y=z,
            mode='markers',
            marker=dict(
                size=8,
                color=x+y,  # set color to cluster values
                colorscale='Viridis',  # choose a colorscale
                opacity=0.8,
                colorbar=dict(title='Semantic Distance')
            ),
            text=distance_df['title']+' | '+distance_df['company_name'],  # this will set the hover text
            hoverinfo='text',
            name='Semantic Distance',
            showlegend=False  # Set showlegend to False for the scatter trace
        ),
        go.Scatter(
            x=x_sorted,
            y=y_pred_sorted,
            mode='lines',
            line=dict(color='red', width=2),
            name='Line of Best Fit',
            showlegend=False  # Set showlegend to False for the line trace
        )
    ])

    fig.update_layout(
        title=f'Listing Cosine Distance: {x_text} v {z_text}',
        xaxis_title=f'Cosine distance from {x_text}',
        yaxis_title=f'Cosine distance from {z_text}',
        hovermode='closest'
    )

    return fig

# update_plot fits and plots against the global z (z_dist) defined in the 3D plot cell
x = distance_df['x_dist'].values
y = distance_df['y_dist'].values

fig = update_plot(x, y)
fig.show()
config = {
    'displayModeBar': False,  # Remove the mode bar
    'showTips': False,        # Remove the tooltips
    # 'staticPlot': True        # Make the plot static (non-interactive)
}

plot(fig, filename='ml_vs_accountant.html', config=config)



'ml_vs_accountant.html'

In [None]:
min_x_vector = distance_df[distance_df['x_dist'] == distance_df['x_dist'].min()]['uuid'].values[0]
min_y_vector = distance_df[distance_df['y_dist'] == distance_df['y_dist'].min()]['uuid'].values[0]
min_z_vector = distance_df[distance_df['z_dist'] == distance_df['z_dist'].min()]['uuid'].values[0]

In [None]:
client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCD_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCD_API_KEY")),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)
import json
listings = client.collections.get("JobListings")

In [None]:
x_wev = listings.query.fetch_object_by_id(min_x_vector, include_vector=True)

### From here on, we are just testing some prompts to use on the site

In [None]:
task = """Analyze the following three job listings:

You will be passed them in this prompt

Based on the provided job listings, please generate a detailed explanation of how these job listings relate to each other in terms of their responsibilities, requirements, and domain. Identify any common themes, skills, or qualifications that are shared among the job listings.

Additionally, extract and list the specific technologies and methods mentioned in each job listing. Provide a clear and concise summary of the key technologies and methods used in these job roles.

Please format your response as follows:

Relationship Explanation:
[Provide a detailed explanation of how the job listings relate to each other]

Technologies and Methods:
Job Listing 1:
- [Technology/Method 1]
- [Technology/Method 2]
- ...

Job Listing 2:
- [Technology/Method 1]
- [Technology/Method 2]
- ...

Job Listing 3:
- [Technology/Method 1]
- [Technology/Method 2]
- ...

Summary:
[Provide a concise summary of the key technologies and methods used across the job listings]"""

In [None]:
prompt = f"""Analyze the following job listing:

The job listing is measured on three axes: [Axis 1], [Axis 2], and [Axis 3]. The cosine distances of the job listing on these axes are as follows:
- {x_text}: {x_dist}
- {y_text}: {y_dist}
- {z_text}: {z_dist}

Based on the provided job listing and its cosine distances on the three axes, please generate a summary that describes how well the job listing aligns with each axis. Provide insights into the relevance and significance of the job listing in relation to the axes.
Make sure you differentiate between the axes and provide a detailed explanation of the alignment or misalignment of the job listing with each axis. Even if it aligns strongly on two or three axes, make sure to pick out a point of difference between them.
Consider the following questions in your summary:
1. How closely does the job listing match the characteristics and requirements of each axis?
2. What are the key aspects of the job listing that contribute to its alignment or misalignment with each axis?
3. Are there any notable strengths or weaknesses of the job listing in relation to the axes?
4. How do the cosine distances reflect the overall fit of the job listing to the axes?

Please format your response as follows:

Summary:
[Provide a detailed summary of how the job listing aligns with each axis, considering the cosine distances and the key aspects of the job listing]

Axis Alignment:
- [Axis 1]: [Description of alignment with Axis 1]
- [Axis 2]: [Description of alignment with Axis 2]
- [Axis 3]: [Description of alignment with Axis 3]

Strengths and Weaknesses:
- Strengths: [List the notable strengths of the job listing in relation to the axes]
- Weaknesses: [List the notable weaknesses of the job listing in relation to the axes]

Overall Fit:
[Provide an assessment of the overall fit of the job listing to the axes based on the cosine distances and the analysis provided. Make sure you get the order correct: a smaller cosine distance means closer alignment]"""
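
The prompt relies on the rule that a smaller cosine distance means closer alignment. A minimal sketch of that rule follows, assuming the usual definition of cosine distance as one minus cosine similarity. The `cosine_distance` helper and the toy vectors are illustrative assumptions, not the notebook's own code.

```python
import numpy as np


def cosine_distance(a, b) -> float:
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 2-d "embeddings" (hypothetical values, chosen for easy arithmetic).
listing = np.array([1.0, 0.0])
axis_same = np.array([2.0, 0.0])      # same direction as the listing
axis_orthogonal = np.array([0.0, 3.0])  # unrelated direction
axis_opposite = np.array([-1.0, 0.0])   # opposite direction

d_same = cosine_distance(listing, axis_same)              # 0.0: closest possible
d_orthogonal = cosine_distance(listing, axis_orthogonal)  # 1.0: unrelated
d_opposite = cosine_distance(listing, axis_opposite)      # 2.0: farthest possible
```

Under this definition the distance depends only on direction, not vector length, and ranges from 0 (identical direction) to 2 (opposite direction).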

In [None]:
x_text

'Machine Learning Engineer'

In [None]:
output = listings.generate.near_vector(near_vector=x_wev.vector['default'], limit=3, grouped_task=task)

In [None]:
output


GenerativeReturn(objects=[GenerativeObject(uuid=_WeaviateUUIDInt('2d098399-481b-4303-ada5-8b523345098b'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'title': 'AI Software Engineer', 'description': "Artisan is revolutionizing the future of work by building AI digital workers, called Artisans, capable of fulfilling roles traditionally done by humans. Our product roadmap includes a plethora of Artisans, from a marketing manager to a sales rep, all complimented with feature-rich SaaS platforms that Artisans are able to interact with in the same way the user can. We're on a mission to create the world's most capable and human-like digital workers, and world-class SaaS for them to operate within. If you thrive in fast-paced settings and want to work on cutting-edge AI we’d love to hear from you! \nKey Responsibilities:Fine-tuning LLMs for our Artisans, he

In [None]:
single_query = listings.generate.near_vector(near_vector=x_wev.vector['default'], limit=1, grouped_task=prompt)

In [None]:
pprint.pprint(single_query.generated)

('Summary:\n'
 'The job listing for an AI Software Engineer at Artisan AI aligns well with '
 'Axis 1 and Axis 2, which are likely related to technical skills and '
 'experience in machine learning and NLP. However, it shows a misalignment '
 'with Axis 3, which could be associated with financial or accounting skills. '
 'The cosine distances indicate a closer fit with the Machine Learning '
 'Engineer and Data Scientist roles compared to an Accountant.\n'
 '\n'
 'Axis Alignment:\n'
 '- Axis 1: The job listing aligns well with Axis 1, which could be related to '
 'technical skills and experience in machine learning and NLP. The '
 'responsibilities and qualifications mentioned in the listing emphasize the '
 'need for expertise in these areas, which likely contribute to the closer '
 'alignment.\n'
 '- Axis 2: Similarly, the job listing aligns well with Axis 2, which could be '
 'associated with advanced technical skills and experience in AI software '
 'engineering. The emphasis on de