# Potential Talent






## Background

#### As a talent sourcing and management company, we find talented individuals and place them with technology companies. Finding talented candidates is hard for three reasons. First, filling a role requires understanding it well, which means understanding the client’s needs and what they look for in a candidate. Second, we need to understand what makes a candidate stand out for that role. Third, knowing where to find talented individuals is a challenge in itself.

#### Our work is labor-intensive and full of manual operations. To automate it, we want an approach that saves time and helps us spot candidates who fit the roles we are trying to fill. Beyond that, for a specific role we want to develop a machine-learning-powered pipeline that spots talented individuals and ranks them by fitness.

#### We currently source a few candidates semi-automatically, so sourcing is not a concern at this time. The first task is to determine the best-matching candidates based on how well they fit a given role. We generally search with keywords such as “full-stack software engineer”, “engineering manager”, or “aspiring human resources”, depending on the role we are trying to fill. These keywords may change, and specific keywords will be provided to you.

#### Once fitting candidates are listed and ranked, each one is reviewed manually to judge how good a fit they are. At the end of this review we might choose not the first candidate in the list but, say, the 7th. When that happens, we want to re-rank the list using this information. The supervisory signal is supplied by starring the chosen candidate (the 7th in this example), which marks that candidate as an ideal candidate for the given role. The list should then be re-ranked each time a candidate is starred.

## Goals
#### - Predict how fit the candidate is based on their available information (the `fit` variable).
#### - Rank candidates based on a fitness score.
#### - Re-rank candidates when a candidate is starred.
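
The three goals can be sketched end to end on toy data. This is a minimal illustration, not the notebook's implementation: the 3-dimensional vectors, the candidate ids `a`, `b`, `c`, and the helper names `cosine`, `rank`, and `centroid` are all hypothetical, and real embeddings would come from a sentence encoder rather than being typed in by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(candidates, target):
    """Return candidate ids sorted by descending similarity to the target vector."""
    return sorted(candidates, key=lambda cid: cosine(candidates[cid], target), reverse=True)

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 3-d "embeddings" (assumption: stand-ins for real title embeddings).
candidates = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.7, 0.6, 0.1],
    "c": [0.2, 0.9, 0.3],
}
query = [1.0, 0.0, 0.0]  # hypothetical embedding of the role keywords

# Goals 1 and 2: score each candidate against the role and rank by that score.
initial = rank(candidates, query)

# Goal 3: a reviewer stars candidate "c"; the starred centroid becomes the new target.
starred = ["c"]
reranked = rank(candidates, centroid([candidates[s] for s in starred]))

print(initial)
print(reranked)
```

In this toy setup the starred candidate moves from last place to first, and the candidate closest to it follows, which is the behavior the re-ranking goal describes.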

## Setup

In [1]:
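!pip install numpy
!pip install scipy seaborn
!pip install torch torchvision torchaudio
!pip install tensorflow
!pip install torchview
# The imports below also need these packages if the environment does not already provide them
!pip install pandas scikit-learn matplotlib plotly wordcloud regex requests sentence-transformers transformers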


In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import logging
import random
import requests
import sys
import regex as re
import plotly.express as px
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report, confusion_matrix, make_scorer, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction import text
from sklearn.feature_selection import RFE
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sentence_transformers import SentenceTransformer, util
import tensorflow as tf
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchview import draw_graph
import torch.nn.functional as F
from transformers import pipeline
warnings.filterwarnings('ignore', category=UserWarning)

print(torch.__version__)
tf.__version__

2.6.0+cu124


'2.18.0'

### Data extraction

In [3]:
df = pd.read_csv('talents.csv')
df


| Unnamed: 0 | id | title | sentence_bert_cossim |
|---|---|---|---|
| 0 | 1 | innovative and driven professional seeking a r... | 1.000000 |
| 1 | 431 | aspiring data science professional focused on ... | 0.769162 |
| 2 | 544 | data analyst data scientist business analyst d... | 0.768222 |
| 3 | 833 | data analyst turning complex data into actiona... | 0.745245 |
| 4 | 199 | ms in information systems northeastern univers... | 0.727268 |
| ... | ... | ... | ... |
| 1260 | 648 | research specialist university of rochester di... | 0.079923 |
| 1261 | 730 | medical biller at brick pediatric group | 0.072848 |
| 1262 | 990 | ingeniero elctrico | 0.067254 |
| 1263 | 296 | company owner at armstrong cleans carpets | 0.056890 |


In [4]:
# remove previous evaluation results
df = df.drop('sentence_bert_cossim', axis=1)
df

| Unnamed: 0 | id | title |
|---|---|---|
| 0 | 1 | innovative and driven professional seeking a r... |
| 1 | 431 | aspiring data science professional focused on ... |
| 2 | 544 | data analyst data scientist business analyst d... |
| 3 | 833 | data analyst turning complex data into actiona... |
| 4 | 199 | ms in information systems northeastern univers... |
| ... | ... | ... |
| 1260 | 648 | research specialist university of rochester di... |
| 1261 | 730 | medical biller at brick pediatric group |
| 1262 | 990 | ingeniero elctrico |
| 1263 | 296 | company owner at armstrong cleans carpets |


## Rank candidates based on job title

In [8]:
# Prompt for a job title and use Sentence-BERT to find the best matches in df['title']

job_title = input("Enter the job title: ").strip()

model = SentenceTransformer('all-mpnet-base-v2')

# Encode the query and all titles in one batch (much faster than encoding row by row)
job_title_embedding = model.encode(job_title)
title_embeddings = model.encode(df['title'].tolist())

# cos_sim returns a 1 x N matrix; row 0 holds one score per candidate
df['sentence_bert_cossim'] = util.cos_sim(job_title_embedding, title_embeddings)[0].numpy()

# Sort the DataFrame by cosine similarity in descending order to get best matches
df_ranked = df.sort_values(by='sentence_bert_cossim', ascending=False)

print(df_ranked[['id', 'title', 'sentence_bert_cossim']])

Enter the job title: data scientist
        id                                              title  \
75    1193                                     data scientist   
77     201                                     data scientist   
70    1138                                     data scientist   
71    1065                                     data scientist   
72     928                                     data scientist   
...    ...                                                ...   
1252  1064  aiml engineer freedom mortgage document classi...   
1262   990                                 ingeniero elctrico   
1258  1111         manager investment risk at cpp investments   
1254   680  full-stack developer react spring boot firebas...   
1264   551  python arcpy arcgis pro esri products geoai do...   

      sentence_bert_cossim  
75                1.000000  
77                1.000000  
70                1.000000  
71                1.000000  
72                1.000000  
...      
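
The `sentence_bert_cossim` score printed above is a cosine similarity. Writing $q$ for the embedding of the entered job title and $t$ for the embedding of a candidate title (both symbols are notation introduced here, not names from the notebook), the score and its value for an identical title can be written as:

```latex
\operatorname{cos\_sim}(q, t)
  = \frac{q \cdot t}{\lVert q \rVert \, \lVert t \rVert}
  = \frac{\sum_{i} q_i t_i}{\sqrt{\sum_{i} q_i^{2}} \, \sqrt{\sum_{i} t_i^{2}}},
\qquad
t = q \;\Rightarrow\; \operatorname{cos\_sim}(q, t)
  = \frac{\lVert q \rVert^{2}}{\lVert q \rVert^{2}} = 1
```

This is why every candidate whose title is exactly "data scientist" ties at 1.000000 for that query: assuming the encoder maps the same text to the same vector, the title embedding equals the query embedding. The order among those tied rows carries no information about fit.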

## Rank candidates based on starred candidates

In [11]:
# Re-rank against up to 5 starred candidate IDs entered at the prompt

starred_ids_input = input("Enter up to 5 starred candidate IDs separated by commas: ")
# Use `token` rather than `id` to avoid shadowing the built-in id()
starred_ids = [int(token.strip()) for token in starred_ids_input.split(',') if token.strip()]

if len(starred_ids) > 5:
    print("Warning: Only considering the first 5 IDs.")
    starred_ids = starred_ids[:5]

# Boolean mask of the rows in the dataframe corresponding to the starred IDs
starred_mask = df_ranked['id'].isin(starred_ids).to_numpy()

if starred_mask.any():
    # Encode every title once and reuse the embeddings for the average and the scores
    title_embeddings = model.encode(df_ranked['title'].tolist())

    # The average embedding of the starred titles acts as the "ideal candidate" vector
    avg_starred_embedding = title_embeddings[starred_mask].mean(axis=0)

    # Recalculate cosine similarity using the average starred embedding
    df_ranked['reranked_cossim'] = util.cos_sim(avg_starred_embedding, title_embeddings)[0].numpy()

    # Sort the DataFrame by the new reranked cosine similarity
    df_reranked = df_ranked.sort_values(by='reranked_cossim', ascending=False)

    print("\nReranked results based on starred candidates:")
    print(df_reranked[['id', 'title', 'reranked_cossim']])
else:
    print("\nNo starred candidates found with the provided IDs. Displaying original ranking.")
    print(df_ranked[['id', 'title', 'sentence_bert_cossim']])

Enter up to 5 starred candidate IDs separated by commas: 1,431

Reranked results based on starred candidates:
       id                                              title  reranked_cossim
0       1  innovative and driven professional seeking a r...         0.940522
1     431  aspiring data science professional focused on ...         0.940522
8     963  passionate data scientist seeking exciting opp...         0.767226
9     487  research assistant penn state seeking opportun...         0.766777
7     426  master of science in analytics at georgia inst...         0.762350
...   ...                                                ...              ...
1250  108                                   itil 4 comptia a         0.073270
1260  648  research specialist university of rochester di...         0.070922
1264  551  python arcpy arcgis pro esri products geoai do...         0.058634
1262  990                                 ingeniero elctrico         0.049575
1263  296          company owner
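
Both starred candidates (ids 1 and 431) receive the same `reranked_cossim` of 0.940522. That is expected when exactly two embeddings of equal length are averaged: for unit vectors `a` and `b` with similarity `s`, the mean is equally close to both, and its similarity to either is `sqrt((1 + s) / 2)`. The sketch below checks this on toy vectors. It assumes unit-length embeddings, and it assumes that the 0.769162 listed for candidate 431 in the originally loaded `sentence_bert_cossim` column was measured against candidate 1 with a comparable model. Neither assumption is confirmed by the notebook, and the 2-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Two hypothetical unit-length embeddings standing in for the starred titles.
a = [1.0, 0.0]
b = [0.6, 0.8]

# Average embedding, as used for re-ranking with two starred candidates.
mean = [(x + y) / 2 for x, y in zip(a, b)]

sim_ab = cosine(a, b)          # similarity between the two starred candidates
sim_mean_a = cosine(mean, a)   # similarity of the average to the first
sim_mean_b = cosine(mean, b)   # similarity of the average to the second
predicted = math.sqrt((1 + sim_ab) / 2)

# The same identity applied to the value shown for candidate 431 in the loaded data
# (assumption: that value is the similarity between candidates 1 and 431).
from_notebook = math.sqrt((1 + 0.769162) / 2)

print(sim_mean_a, sim_mean_b, predicted)
print(round(from_notebook, 6))
```

Under those assumptions the identity reproduces the 0.940522 in the output above. With a single starred candidate, that candidate scores 1.0 against itself; with three or more, the starred scores generally differ from one another.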