## Introductory steps & Data preprocessing

### Importing necessary libraries

In [105]:
import weaviate
from weaviate.classes.config import Configure, Property, DataType, VectorDistances

import os
import pandas as pd
import ast

### Loading the dataset

We load the dataset into a pandas DataFrame. The dataset used for this workflow is the [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings) dataset from Kaggle.

In [106]:
filename = "postings.csv"
df = pd.read_csv(filename)
df

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,expiry,closed_time,formatted_experience_level,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,1.715990e+12,,,Requirements: \n\nWe are seeking a College or ...,1.713398e+12,,0,FULL_TIME,USD,BASE_SALARY
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,1.715450e+12,,,,1.712858e+12,,0,FULL_TIME,USD,BASE_SALARY
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,1.715870e+12,,,We are currently accepting resumes for FOH - A...,1.713278e+12,,0,FULL_TIME,USD,BASE_SALARY
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,1.715488e+12,,,This position requires a baseline understandin...,1.712896e+12,,0,FULL_TIME,USD,BASE_SALARY
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,1.716044e+12,,,,1.713452e+12,,0,FULL_TIME,USD,BASE_SALARY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123844,3906267117,Lozano Smith,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...,195000.0,YEARLY,"Walnut Creek, CA",56120.0,1.0,,...,1.716163e+12,,Mid-Senior level,,1.713571e+12,,0,FULL_TIME,USD,BASE_SALARY
123845,3906267126,Pinterest,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...,,,United States,1124131.0,3.0,,...,1.716164e+12,,Mid-Senior level,,1.713572e+12,www.pinterestcareers.com,0,FULL_TIME,,
123846,3906267131,EPS Learning,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...,,,"Spokane, WA",90552133.0,3.0,,...,1.716164e+12,,Mid-Senior level,,1.713572e+12,epsoperations.bamboohr.com,0,FULL_TIME,,
123847,3906267195,Trelleborg Applied Technologies,Business Development Manager,The Business Development Manager is a 'hunter'...,,,"Texas, United States",2793699.0,4.0,,...,1.716165e+12,,,,1.713573e+12,,0,FULL_TIME,,


### Preprocessing the dataset

We merge the columns company_name, title, skills_desc and description into a new column named merged_col. Missing values are first replaced with empty strings so that the string concatenation does not produce NaN.
In merged_col, the company name is prefixed with "Company Name:", the title with "Job Title:", and the skills_desc and description follow a single "Job Description:" prefix.

In [107]:
df['company_name'] = df['company_name'].fillna('')
df['title'] = df['title'].fillna('')
df['skills_desc'] = df['skills_desc'].fillna('')
df['description'] = df['description'].fillna('')
df['job_posting_url'] = df['job_posting_url'].fillna('')

df["merged_col"] = "Company Name: " + df["company_name"] + "\nJob Title: " + df["title"] + "\nJob Description:\n" + df["skills_desc"] + "\n" + df["description"]

df1 = df[["company_name", "title", "description", "skills_desc", "job_posting_url", "merged_col"]]
df1

Unnamed: 0,company_name,title,description,skills_desc,job_posting_url,merged_col
0,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,Requirements: \n\nWe are seeking a College or ...,https://www.linkedin.com/jobs/view/921716/?trk...,Company Name: Corcoran Sawyer Smith\nJob Title...
1,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",,https://www.linkedin.com/jobs/view/1829192/?tr...,Company Name: \nJob Title: Mental Health Thera...
2,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,We are currently accepting resumes for FOH - A...,https://www.linkedin.com/jobs/view/10998357/?t...,Company Name: The National Exemplar \nJob Titl...
3,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,This position requires a baseline understandin...,https://www.linkedin.com/jobs/view/23221523/?t...,"Company Name: Abrams Fensterman, LLP\nJob Titl..."
4,,Service Technician,Looking for HVAC service tech with experience ...,,https://www.linkedin.com/jobs/view/35982263/?t...,Company Name: \nJob Title: Service Technician...
...,...,...,...,...,...,...
123844,Lozano Smith,Title IX/Investigations Attorney,Our Walnut Creek office is currently seeking a...,,https://www.linkedin.com/jobs/view/3906267117/...,Company Name: Lozano Smith\nJob Title: Title I...
123845,Pinterest,"Staff Software Engineer, ML Serving Platform",About Pinterest:\n\nMillions of people across ...,,https://www.linkedin.com/jobs/view/3906267126/...,Company Name: Pinterest\nJob Title: Staff Soft...
123846,EPS Learning,"Account Executive, Oregon/Washington",Company Overview\n\nEPS Learning is a leading ...,,https://www.linkedin.com/jobs/view/3906267131/...,Company Name: EPS Learning\nJob Title: Account...
123847,Trelleborg Applied Technologies,Business Development Manager,The Business Development Manager is a 'hunter'...,,https://www.linkedin.com/jobs/view/3906267195/...,Company Name: Trelleborg Applied Technologies\...
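
To make the layout of merged_col concrete, here is a minimal sketch that builds the same text for a single record using plain Python. The function name `build_merged_text` and the example values are hypothetical and only illustrate the format; the notebook itself does this with pandas column operations.

```python
def build_merged_text(record: dict) -> str:
    """Build the text to be vectorized, mirroring the merged_col format."""
    # Missing fields fall back to an empty string, like fillna('') in the notebook.
    company = record.get("company_name") or ""
    title = record.get("title") or ""
    skills = record.get("skills_desc") or ""
    description = record.get("description") or ""
    return (
        "Company Name: " + company
        + "\nJob Title: " + title
        + "\nJob Description:\n" + skills + "\n" + description
    )


# Hypothetical record with a missing skills_desc.
example = {
    "company_name": "Example Corp",
    "title": "Data Analyst",
    "skills_desc": None,
    "description": "Analyze sales data.",
}
merged = build_merged_text(example)
print(merged)
```

Note that a missing skills_desc leaves an empty line under "Job Description:", and a missing company name leaves the "Company Name:" prefix with nothing after it, as in rows 1 and 4 of the output above.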


We save the new version of our dataframe to a new CSV file named "postings_new.csv".

In [4]:
df1.to_csv("postings_new.csv", index=False)

## Collection creation and data import

### Connect to Weaviate Cloud with an API

Before running the code below, you need to set OPENAI_API_KEY, WCD_URL and WCD_API_KEY as environment variables.

In [None]:
# If running the code in a virtual environment, you need to uncomment the code below and add your OPENAI_API_KEY. Else, you can set it from terminal.
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
# os.environ["WCD_URL"] = "YOUR_WCD_URL"
# os.environ["WCD_API_KEY"] = "YOUR_WCD_API_KEY"
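
Since the connection fails in a less obvious way when one of these variables is missing, it can help to check them up front. The following is a small sketch, not part of the original notebook; the helper name `missing_env_vars` is an assumption made for illustration.

```python
import os

# The three variables the notebook expects to find in the environment.
REQUIRED_VARS = ("OPENAI_API_KEY", "WCD_URL", "WCD_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```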


In [3]:
client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCD_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCD_API_KEY")),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # Replace with your inference API key
    }
)

### Collection creation

We create a collection named "LinkedIn_Job_Postings". It has 6 properties, but we vectorize only the "merged_col" property, using the "text-embedding-3-small" model from OpenAI. The vector index is HNSW with cosine distance.

In [9]:
client.collections.create(
    "LinkedIn_Job_Postings",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(model="text-embedding-3-small"),
    vector_index_config=Configure.VectorIndex.hnsw(distance_metric=VectorDistances.COSINE),
    properties=[
        Property(name="company_name", data_type=DataType.TEXT,skip_vectorization=True),
        Property(name="title", data_type=DataType.TEXT,skip_vectorization=True),
        Property(name="description", data_type=DataType.TEXT, skip_vectorization=True),
        Property(name="skills_desc", data_type=DataType.TEXT, skip_vectorization=True),
        Property(name="merged_col", data_type=DataType.TEXT, skip_vectorization=False),
        Property(name="job_posting_url", data_type=DataType.TEXT, skip_vectorization=True),
    ],
    
)

<weaviate.collections.collection.Collection at 0x7c35f781e590>
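
The collection above is configured with the cosine distance metric. As a rough illustration of what that metric measures, here is a minimal sketch in plain Python, assuming the usual definition of cosine distance as one minus cosine similarity. The toy vectors are made up; real embeddings have far more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Same direction gives a distance near 0, orthogonal vectors near 1,
# and opposite directions near 2.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Because only the angle between vectors matters, two texts of very different length can still be close as long as their embeddings point in a similar direction.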

### Import data

Let's populate our database with data. We use the data from the "postings_new.csv" that we created before and add them in batches.

In [10]:
collection = client.collections.get("LinkedIn_Job_Postings")

with client.batch.fixed_size(batch_size=200) as batch:
    with pd.read_csv(
        "postings_new.csv",
        usecols=["company_name", "title", "description", "skills_desc", "job_posting_url", "merged_col"],
        chunksize=100,
        # By default pandas reads empty CSV fields back as NaN; keep them as empty strings
        keep_default_na=False,
    ) as csv_iterator:
        for chunk in csv_iterator:
            for index, row in chunk.iterrows():
                batch.add_object(
                    collection="LinkedIn_Job_Postings",
                    properties={
                        "company_name": row["company_name"],
                        "title": row["title"],
                        "description": row["description"],
                        "skills_desc": row["skills_desc"],
                        "job_posting_url": row["job_posting_url"],
                        "merged_col": row["merged_col"],
                    },
                )
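
One thing to watch when importing from a CSV is missing values: pandas reads empty fields back as NaN by default, which is a float rather than text. The sketch below shows one way to normalize a row before adding it to a batch. The helper names `clean_text` and `to_properties` are hypothetical, and the row is a plain dict standing in for a pandas row.

```python
import math

# The text properties defined on the collection.
TEXT_FIELDS = [
    "company_name", "title", "description",
    "skills_desc", "job_posting_url", "merged_col",
]


def clean_text(value):
    """Return a string suitable for a text property; None and NaN become ''."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


def to_properties(row):
    """Build the properties dict for one row, cleaning every text field."""
    return {field: clean_text(row.get(field)) for field in TEXT_FIELDS}


# Hypothetical row with a NaN company name, as pandas would produce for an empty field.
row = {"company_name": float("nan"), "title": "Service Technician"}
props = to_properties(row)
print(props["company_name"] == "", props["title"])
```

The resulting dict could then be passed as the `properties` argument of `batch.add_object`.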

## Finding the job openings that are related to machine learning or AI

Let's get the collection and search for the jobs related to ML or AI.

In [41]:
collection = client.collections.get("LinkedIn_Job_Postings")

We create a dataframe where we will collect the unique job offers based on all of our different methods of search.

In [100]:
df_ml = pd.DataFrame(columns=('uuid', 'company_name', 'title', 'job_posting_url'))

These are the terms that we are going to search for: "Machine Learning AI", "Machine Learning", "Data Science", "Artificial Intelligence". We will skip searching for abbreviations as they may return irrelevant jobs.

In [101]:
queries = ["Machine Learning AI", "Machine Learning", "Data Science", "Artificial Intelligence"]

### Vector search

In [102]:
for q in queries:
    response = collection.query.near_text(query=q)
    for o in response.objects:
        new_row = {"uuid": o.uuid, "company_name": o.properties["company_name"], "title": o.properties["title"], "job_posting_url": o.properties["job_posting_url"]}
        if o.uuid not in df_ml["uuid"].values:
            # Create a DataFrame from the new row
            new_row_df = pd.DataFrame([new_row])
            
            # Concatenate the new row to the existing DataFrame
            df_ml = pd.concat([df_ml, new_row_df], ignore_index=True)
df_ml[["company_name", "title", "job_posting_url"]]


Unnamed: 0,company_name,title,job_posting_url
0,hackajob,Machine Learning Engineer,https://www.linkedin.com/jobs/view/3905881936/...
1,hackajob,Artificial Intelligence Engineer,https://www.linkedin.com/jobs/view/3905888074/...
2,hackajob,Data Scientist,https://www.linkedin.com/jobs/view/3905875688/...
3,hackajob,Data Scientist,https://www.linkedin.com/jobs/view/3905875652/...
4,McCulloh Consulting,Machine Learning Engineer,https://www.linkedin.com/jobs/view/3903865758/...
...,...,...,...
263,Husch Blackwell,Legal Artificial Intelligence Implementation A...,https://www.linkedin.com/jobs/view/3904988939/...
264,Deloitte,AI Solution Architect,https://www.linkedin.com/jobs/view/3884830087/...
265,Hyperion Technologies LLC,AI Engineer/Architect,https://www.linkedin.com/jobs/view/3903445676/...
266,Photon,Lead AI Engineer,https://www.linkedin.com/jobs/view/3886883337/...


Based on our vector search, we have 268 unique job postings related to ML and AI.
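
The loop above keeps only postings whose UUID has not been seen yet. The same idea can be sketched with a dictionary keyed by UUID, which avoids scanning the dataframe and concatenating on every new row. This is an alternative sketch, not the notebook's method; `collect_unique` and `fake_object` are hypothetical helpers, and the objects are stand-ins that only mimic the `uuid` and `properties` attributes used above.

```python
from types import SimpleNamespace


def collect_unique(objects, seen):
    """Add objects to `seen` (a dict keyed by UUID string), skipping repeats."""
    for o in objects:
        key = str(o.uuid)
        if key not in seen:
            seen[key] = {
                "uuid": key,
                "company_name": o.properties["company_name"],
                "title": o.properties["title"],
                "job_posting_url": o.properties["job_posting_url"],
            }
    return seen


def fake_object(uuid, company, title):
    """Stand-in for a search result object (illustration only)."""
    return SimpleNamespace(
        uuid=uuid,
        properties={
            "company_name": company,
            "title": title,
            "job_posting_url": "https://example.com/" + uuid,
        },
    )


first_hits = [fake_object("a1", "Acme", "ML Engineer"), fake_object("b2", "Initech", "Data Scientist")]
second_hits = [fake_object("b2", "Initech", "Data Scientist"), fake_object("c3", "Globex", "AI Researcher")]

seen = {}
collect_unique(first_hits, seen)
collect_unique(second_hits, seen)
rows = list(seen.values())
print(len(rows))
```

Since dictionaries preserve insertion order, `rows` keeps the order in which postings were first found, and the list of dicts can be turned into a dataframe with `pd.DataFrame(rows)` at the end.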

### Keyword search

Let's run the same queries using BM25 (keyword) search and return the company name, job title and URL of each job offer. Weaviate looks for objects that contain our search terms in the "merged_col" property and ranks them with the BM25 scoring function, which rewards terms that appear often in an object, gives more weight to terms that are rare across the collection, and normalizes for text length.

In [103]:
for q in queries:
    response = collection.query.bm25(
        query=q,
        query_properties=["merged_col"],
    )
    for o in response.objects:
        new_row = {"uuid": o.uuid, "company_name": o.properties["company_name"], "title": o.properties["title"], "job_posting_url": o.properties["job_posting_url"]}
        if o.uuid not in df_ml["uuid"].values:
            # Create a DataFrame from the new row
            new_row_df = pd.DataFrame([new_row])
            
            # Concatenate the new row to the existing DataFrame
            df_ml = pd.concat([df_ml, new_row_df], ignore_index=True)
df_ml[["company_name", "title", "job_posting_url"]]

Unnamed: 0,company_name,title,job_posting_url
0,hackajob,Machine Learning Engineer,https://www.linkedin.com/jobs/view/3905881936/...
1,hackajob,Artificial Intelligence Engineer,https://www.linkedin.com/jobs/view/3905888074/...
2,hackajob,Data Scientist,https://www.linkedin.com/jobs/view/3905875688/...
3,hackajob,Data Scientist,https://www.linkedin.com/jobs/view/3905875652/...
4,McCulloh Consulting,Machine Learning Engineer,https://www.linkedin.com/jobs/view/3903865758/...
...,...,...,...
491,Microsoft,Senior Technical Program Manager,https://www.linkedin.com/jobs/view/3901355763/...
492,Greystone,Digital Automation Architect,https://www.linkedin.com/jobs/view/3904382623/...
493,Cargill,"Manager, Global HR Systems & Process Owner",https://www.linkedin.com/jobs/view/3904993645/...
494,Polygraf,AI & ML Engineer – Audio Specialization,https://www.linkedin.com/jobs/view/3903808878/...
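
To give a feel for how keyword ranking works, here is a simplified sketch of the textbook BM25 formula in plain Python. It is an illustration under stated assumptions, not Weaviate's actual implementation: tokenization is a naive lowercase split, `k1=1.2` and `b=0.75` are commonly used parameter values, and the three documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with the textbook BM25 formula."""
    if not documents:
        return []
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization dampens the effect of long documents.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "machine learning engineer building machine learning models",
    "restaurant manager",
    "data engineer with some machine learning exposure",
]
scores = bm25_scores("machine learning", docs)
print(scores)
```

The first document mentions both query terms twice and scores highest, the third mentions them once and scores lower, and the second contains neither term and scores zero.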


Keyword search added 228 extra job postings related to ML and AI, so we now have 496 in total. Let's save them to a CSV file.

In [104]:
df_ml.to_csv('ml_jobs.csv', index=False)

### Closing the client

Once we are done with our experiments, we can close the client.

In [None]:
client.close() 