# **Potential Talents**

# Introduction

As a talent sourcing and management company, we are interested in finding talented individuals and sourcing them to technology companies. Finding talented candidates is not easy, for several reasons. First, one needs to understand the role very well in order to fill it, which requires understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role we are searching for. Third, knowing where to find talented individuals is a challenge of its own.

The nature of our job requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role, we are interested in developing a machine-learning-powered pipeline that can spot talented individuals and rank them by their fitness.

We currently source a few candidates semi-automatically, so sourcing is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how fit they are for a given role. We generally run these searches using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once we have listed and ranked fitting candidates, we apply a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role, and we expect the list to be re-ranked each time a candidate is starred.
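The starring step can be sketched as follows. This is a minimal illustration and not a prescribed method: it assumes each candidate is already represented by an embedding vector, and the names `cosine`, `rerank_with_star` and the blending weight `alpha` are hypothetical. The query vector is moved toward the starred candidate's vector, and every candidate is re-scored against the updated query.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rerank_with_star(query_vec, candidate_vecs, starred_id, alpha=0.5):
    """Blend the query with the starred candidate's vector and re-rank.

    alpha is a hypothetical weight: 0 keeps the original query,
    1 replaces it with the starred candidate's vector.
    Returns (updated_query, ranking), where ranking is a list of
    (candidate_id, score) pairs sorted from best to worst.
    """
    starred = candidate_vecs[starred_id]
    updated = [(1 - alpha) * q + alpha * s for q, s in zip(query_vec, starred)]
    scores = {cid: cosine(updated, vec) for cid, vec in candidate_vecs.items()}
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return updated, ranking

# Toy 2-dimensional vectors, made up for illustration only.
candidates = {1: [1.0, 0.0], 2: [0.7, 0.7], 7: [0.0, 1.0]}
query = [1.0, 0.0]
query, ranking = rerank_with_star(query, candidates, starred_id=7, alpha=1.0)
```

Returning the updated query lets the same function be called again on the next starring action, so each star nudges the ranking a little further.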

# Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

Attributes:

id : unique identifier for the candidate (numeric)

job_title : job title of the candidate (text)

location : geographical location of the candidate (text)

connection : number of connections the candidate has; 500+ means over 500 (text)

Output (desired target):

fit : how fit the candidate is for the role (numeric, probability between 0 and 1)

Keywords: “Aspiring human resources” or “seeking human resources”
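Because the connection count is stored as text, with `500+` standing for "over 500", it may need parsing before it can be used numerically. The sketch below is one possible approach; the helper name `parse_connections` and the choice to return a separate "capped" flag are assumptions, not part of the original brief.

```python
def parse_connections(raw):
    """Convert a raw connection string to (count, is_capped).

    '500+' becomes (500, True): the true count is unknown, only that
    it exceeds 500. Surrounding whitespace is ignored.
    """
    text = str(raw).strip()
    is_capped = text.endswith("+")
    count = int(text.rstrip("+"))
    return count, is_capped

# Example values in the style of the dataset (one with trailing whitespace).
examples = ["85", "500+ ", "1"]
parsed = [parse_connections(value) for value in examples]
```

With pandas, the same helper could be applied column-wise, for example through `Series.map`.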

# Goal(s):

Predict how fit the candidate is based on their available information (variable fit)

# Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

# Current Challenges:

We are interested in a robust algorithm. Tell us how your solution works and show us how your ranking gets better with each starring action.

How can we filter out candidates who should not be in this list in the first place?

Can we determine a cut-off point that would work for other roles without losing high-potential candidates?

Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?
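One idea for the cut-off question, offered as an assumption rather than a tested answer: instead of a fixed absolute score, keep the candidates whose score is within a fraction of the top score, so the threshold adapts to each role's score range. The names `relative_cutoff` and `keep_ratio` are hypothetical, and the scores are made up.

```python
def relative_cutoff(scores, keep_ratio=0.8):
    """Keep candidates scoring at least keep_ratio * best score.

    `scores` maps candidate id -> fit score. Returns the kept ids sorted
    from best to worst. If the best score is not positive, nothing is kept.
    """
    if not scores:
        return []
    best = max(scores.values())
    if best <= 0:
        return []
    threshold = keep_ratio * best
    kept = [cid for cid, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda cid: scores[cid], reverse=True)

# Made-up scores for illustration only.
scores = {88: 0.85, 72: 0.84, 3: 0.40, 11: 0.05}
shortlist = relative_cutoff(scores, keep_ratio=0.8)
```

Whether a ratio like 0.8 preserves high-potential candidates would have to be checked against manual reviews for each role.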

In [1]:
# Mount Google Drive
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import pandas as pd
# Path to the CSV file in Google Drive
path_dbset = '/content/gdrive/MyDrive/Proyectos APZIVA/Proyecto 3/potential_talents.csv'

# Read the CSV file with pd.read_csv()
db = pd.read_csv(path_dbset)

# **1. VISUALIZATION AND MISSING VALUE TREATMENT**

In [3]:
# Show the first 7 rows
print(db.head(7))


   id                                          job_title  \
0   1  2019 C.T. Bauer College of Business Graduate (...   
1   2  Native English Teacher at EPIK (English Progra...   
2   3              Aspiring Human Resources Professional   
3   4             People Development Coordinator at Ryan   
4   5    Advisory Board Member at Celal Bayar University   
5   6                Aspiring Human Resources Specialist   
6   7  Student at Humber College and Aspiring Human R...   

                              location connection  fit  
0                       Houston, Texas         85  NaN  
1                               Kanada      500+   NaN  
2  Raleigh-Durham, North Carolina Area         44  NaN  
3                        Denton, Texas      500+   NaN  
4                       İzmir, Türkiye      500+   NaN  
5           Greater New York City Area          1  NaN  
6                               Kanada         61  NaN  


In [4]:
# Inspect the dataset: column dtypes and non-null counts
db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [5]:
# Dataset shape as (rows, columns)
db.shape

(104, 5)

In [6]:
import numpy as np

# Replace missing value representations with NaN
db.replace(["?", "N/A", "NA", "null", ""], np.nan, inplace=True)

# Check for missing values (NaN) in absolute count
print("\nMissing values (NaN) per column (absolute count):")
print(db.isnull().sum())

# Check for missing values (NaN) as a percentage
print("\nMissing values (NaN) per column (percentage):")
print((db.isnull().sum() / len(db)) * 100)


Missing values (NaN) per column (absolute count):
id              0
job_title       0
location        0
connection      0
fit           104
dtype: int64

Missing values (NaN) per column (percentage):
id              0.0
job_title       0.0
location        0.0
connection      0.0
fit           100.0
dtype: float64


# **WORD EMBEDDINGS**

**STEP 1**: Keywords are chosen to build a vector by averaging the embeddings of these keywords: "human", "resources", "professional", "specialist", "management".

In [17]:
import numpy as np

# Keyword list for the ideal profile
#keywords = ["human", "resources", "professional", "specialist", "management"]
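
Step 1 can be sketched as below. The toy `embeddings` dictionary and the helper `mean_vector` are hypothetical stand-ins; in this notebook the vectors would instead come from a trained model such as `word2vec_model.wv`. Words missing from the vocabulary are skipped.

```python
def mean_vector(words, embeddings, dim):
    """Average the vectors of the words found in `embeddings`.

    Returns a zero vector of length `dim` when no word is known.
    """
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# Toy 3-dimensional embeddings, made up for illustration only.
embeddings = {
    "human": [1.0, 0.0, 2.0],
    "resources": [3.0, 2.0, 0.0],
}
# "professional" is out of vocabulary in this toy example and is skipped.
keywords = ["human", "resources", "professional"]
ideal_profile = mean_vector(keywords, embeddings, dim=3)
```

Since the job titles in this notebook are lemmatized before training, the keywords would need the same preprocessing (for example, "resources" becomes "resource") to be found in the model's vocabulary.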

## **1. WORD2VEC**

## PREPROCESSING AND TOKENIZATION

In [8]:
import gensim
from gensim.models import Word2Vec
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Assumes the DataFrame db with the job titles is already loaded
job_titles = db['job_title'].tolist()

# Preprocess a job title into a list of lemmas
def preprocess_text(title):
    doc = nlp(title.lower())  # Lowercase and process with spaCy
    tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]  # Keep alphabetic, non-stop-word lemmas
    return tokens

# Apply the preprocessing
tokenized_titles = [preprocess_text(title) for title in job_titles]

# Show the tokenized titles
print(tokenized_titles)

[['bauer', 'college', 'business', 'graduate', 'magna', 'cum', 'laude', 'aspire', 'human', 'resource', 'professional'], ['native', 'english', 'teacher', 'epik', 'english', 'program', 'korea'], ['aspire', 'human', 'resource', 'professional'], ['people', 'development', 'coordinator', 'ryan'], ['advisory', 'board', 'member', 'celal', 'bayar', 'university'], ['aspire', 'human', 'resource', 'specialist'], ['student', 'humber', 'college', 'aspire', 'human', 'resource', 'generalist'], ['hr', 'senior', 'specialist'], ['student', 'humber', 'college', 'aspire', 'human', 'resource', 'generalist'], ['seek', 'human', 'resource', 'hris', 'generalist', 'position'], ['student', 'chapman', 'university'], ['svp', 'chro', 'marketing', 'communication', 'csr', 'officer', 'engie', 'houston', 'woodland', 'energy', 'gphr', 'sphr'], ['human', 'resource', 'coordinator', 'intercontinental', 'buckhead', 'atlanta'], ['bauer', 'college', 'business', 'graduate', 'magna', 'cum', 'laude', 'aspire', 'human', 'resource',

## WORD2VEC IMPLEMENTATION

In [13]:
# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_titles, vector_size=50, window=5, min_count=2, workers=4)
# vector_size=50: each word is represented by a 50-dimensional vector.
# window=5: considers up to 5 words before and after each word in the sentence.
# min_count=2: ignores words that appear fewer than 2 times.
# workers=4: uses 4 worker threads to speed up training.

# Save the trained model
word2vec_model.save("word2vec_job_titles.model")

In [16]:
# Test the model by looking at the words closest to a given word
# Load the trained model
word2vec_model = Word2Vec.load("word2vec_job_titles.model")

# Words most similar to "professional"
print(word2vec_model.wv.most_similar("professional", topn=5))

# Words most similar to "director"
print(word2vec_model.wv.most_similar("director", topn=5))

print(word2vec_model.wv["professional"])  # Shows the 50-dimensional vector

[('internship', 0.2954030930995941), ('director', 0.26510682702064514), ('development', 0.22504682838916779), ('member', 0.21023446321487427), ('epik', 0.20366783440113068)]
[('english', 0.37316837906837463), ('human', 0.2753932476043701), ('professional', 0.26510685682296753), ('svp', 0.24958498775959015), ('graduate', 0.2377435266971588)]
[ 0.01558827 -0.01889127 -0.00050856  0.0071434  -0.00200001  0.0166946
  0.01824462  0.01325948 -0.00166025  0.01537751 -0.01703005  0.0062634
 -0.0089423  -0.00991498  0.00717901  0.01104002  0.0160055  -0.01130281
  0.01465252  0.01280677 -0.00726891 -0.01731933  0.01096957  0.01294866
 -0.00138649 -0.01336945 -0.01435918 -0.00465874  0.01038828 -0.00732733
 -0.01857967  0.00751339  0.00973422 -0.0129106   0.00205488 -0.00411569
  0.00041473 -0.01959814  0.00529207 -0.0096197   0.00244596 -0.00322307
  0.00457117 -0.01568082 -0.00520606  0.00512555  0.01048487 -0.00514207
 -0.0189737   0.00926049]


## **Compare titles with the query ("Human Resources")**

In [27]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Average vector of a set of tokens; zero vector if none is in the vocabulary
def get_vector(tokens, word2vec_model):
    vectors = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.vector_size)

# Build a vector for each job title
job_vectors = np.array([get_vector(tokens, word2vec_model) for tokens in tokenized_titles])

# Tokenize the query
query_tokens = preprocess_text("Human Resources")

# Get the query vector
query_vector = get_vector(query_tokens, word2vec_model)

# Compute the similarity of the query to each title
similarities = cosine_similarity([query_vector], job_vectors)[0]

# Add the similarity to the original DataFrame
db["fit_score"] = similarities

# Sort candidates by similarity score
db_sorted = db.sort_values(by="fit_score", ascending=False)

# Show the top 30 candidates
print(db_sorted[["job_title", "fit_score"]].head(30))

                                             job_title  fit_score
88                     Director Human Resources  at EY   0.851377
72   Aspiring Human Resources Manager, seeking inte...   0.843279
99   Aspiring Human Resources Manager | Graduating ...   0.827566
67             Human Resources Specialist at Luxottica   0.812408
77              Human Resources Generalist at Schwan's   0.810463
100              Human Resources Generalist at Loparex   0.810463
70     Human Resources Generalist at ScottMadden, Inc.   0.810463
73                        Human Resources Professional   0.802178
78   Liberal Arts Major. Aspiring Human Resources A...   0.774547
76   Human Resources|\nConflict Management|\nPolici...   0.768837
80   Senior Human Resources Business Partner at Hei...   0.748280
87                    Human Resources Management Major   0.745463
96               Aspiring Human Resources Professional   0.736790
45               Aspiring Human Resources Professional   0.736790
75   Aspir