# Job Applicants Recommender

Artificial intelligence (AI) has been gaining prominence across many sectors of society. At the same time, talent recruitment is becoming increasingly digital: processes tend to become more objective and can reach millions of people through social networks.

Recruiters end up buried under a pile of applications, only some of which are a good fit for the position. According to RecruiterBox, human resources professionals spend, on average, around 24 hours just filtering candidates, costing the company around R$1000.00 if it pays extra salary to fill a single position.

As HR departments digitize and transform rapidly, tools are emerging that streamline these processes with artificial intelligence, analyzing candidates one by one and looking for the best match between each candidate and the available position.

With that in mind, I created a demo web app that applies artificial intelligence to this problem: it uses Natural Language Processing (NLP) techniques to shortlist developers in selection processes, reducing the screening time to seconds, with possible cost reductions of up to 50%.

In [12]:
!pip install unidecode


## Imports

In [35]:
import random
import numpy as np
import pandas as pd
import unidecode

import nltk
nltk.download('rslp')
nltk.download('stopwords')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Package rslp is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
def limpa_dados(txt):
    """
    Remove accents, replace commas, periods and asterisks with spaces,
    and lowercase the text.
    """
    # str() guards against non-string values such as NaN
    txt = unidecode.unidecode(str(txt))
    for char in (",", ".", "*"):
        txt = txt.replace(char, " ")

    return txt.lower()
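
To show what the cleaning step does without the `unidecode` dependency, here is a minimal standard-library sketch. It is only an approximation: `unidecode` transliterates many more characters than plain accent stripping does. The names `strip_accents` and `clean_text` are hypothetical and not part of the notebook.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text(text):
    """Approximate limpa_dados: no accents, no ',', '.', '*', lowercase."""
    text = strip_accents(str(text))
    text = re.sub(r"[,.*]", " ", text)  # same three characters as limpa_dados
    return text.lower()


print(clean_text("Físico Remoto, Avançado.").split())
# ['fisico', 'remoto', 'avancado']
```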

# Applicants Section

## Reading Applicants Data

In this section I read the (synthetic) applicants dataset that will be fed to the NLP model.

In [3]:
# Read the csv file
df = pd.read_csv('aplicantes_bckp.csv',encoding='ISO-8859-1')

In [4]:
# Show the first 5 rows
df.head()

Unnamed: 0,AplicanteID,Nome,Sobrenome,Email,Cidade,Estado,LadoAplicacao,TipoTrabalho,Tecnologias,MelhorTecnologia,Ingles,ExperienciaTrabalho,DescricaoExperiencia
0,1,Joao,Sales,teste@demo.com.br,Garopaba,SC,backend,Físico Remoto,PowerBI Tableau R Python SQL Docker,Python,Avancado,Pleno,Estou procurando uma vaga de business intelige...
1,2,Pedro,Rodrigues,teste@demo.com.br,Sao Bernardo do Campo,SP,backend,Físico Remoto,DeepLearning Python Sckitlearn C C# JavaScript,DeepLearning,Avancado,Senior,Gostaria de uma vaga no time de Cientista de D...
2,3,Lucas,Pereira,teste@demo.com.br,Sao Paulo,SP,frontend,Remoto,ReactNative Bootstrap Java JavaScript jQuery,React,Basico,Pleno,Gostaria de uma vaga no campo de programacao f...
3,4,Marcelo,Brum,teste@demo.com.br,Canoas,RS,backend,Fisico,PowerBI R Python SQL Excel GoogleSheets,R,Avancado,Junior,Procuro vaga de junior de cientista de dados a...
4,5,Roger,Machado,teste@demo.com.br,Curitiba,PR,frontend,Fisico Remoto,Python Scitkitlearn Prophet,Prophet,Basico,Senior,Gostaria de uma vaga no time de Cientista de D...


## Preprocessing 

This is the most important section of this notebook. Here, I transform the structured data (generated by the application form) into unstructured text, which is what the NLP techniques need. To do so, I concatenate the relevant columns into a single text field, clean it, and apply stemming before the TF-IDF transformation.

In [5]:
# Concatenate the structured columns into a single text field
df['all_concat'] = df['Cidade'] + " " + df['Estado'] + " "+ df['LadoAplicacao']+ " " + df['TipoTrabalho'] + " " + "conhece" + " " + df['Tecnologias'] + " " + df['MelhorTecnologia'] + " " +df['Ingles']+ " " + df['ExperienciaTrabalho'] + " " + df['DescricaoExperiencia'] 

In [6]:
# Shows one example
df['all_concat'][0]

'Garopaba SC backend Físico Remoto conhece PowerBI Tableau R Python SQL Docker Python Avancado Pleno Estou procurando uma vaga de business inteligence, aplicado ao setor de delivery de comida'

In [9]:
# Create a new DataFrame with just the text and the AplicanteID
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
new_df = df[['AplicanteID', 'all_concat']].copy()

In [14]:
# Apply the limpa_dados function to the whole dataset
new_df['all_concat'] = new_df['all_concat'].apply(limpa_dados)


In [15]:
# Shows one example
new_df['all_concat'][0]

'garopaba sc backend fisico remoto conhece powerbi tableau r python sql docker python avancado pleno estou procurando uma vaga de business inteligence  aplicado ao setor de delivery de comida'

**Stemming:**

 Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form. The stem does not need to be a valid word itself.
 
 Example (Portuguese, with the RSLP stemmer used in this notebook): remoto -> remot
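
To make the idea concrete, here is a toy suffix stripper. It is not the RSLP algorithm, which applies a much larger set of rules; it only removes one trailing vowel. The function name `toy_stem` and the `min_length` parameter are hypothetical.

```python
def toy_stem(word, min_length=3):
    """Strip one trailing vowel, leaving short words untouched."""
    if len(word) > min_length and word[-1] in "aeio":
        return word[:-1]
    return word


words = ["garopaba", "remoto", "conhece", "fisico", "python", "sao"]
print([toy_stem(w) for w in words])
# ['garopab', 'remot', 'conhec', 'fisic', 'python', 'sao']
```

For these particular words the toy rule happens to give the same stems as the RSLP output shown in this notebook, but that will not hold in general.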

In [18]:

stemmer = nltk.stem.RSLPStemmer()
stop = nltk.corpus.stopwords.words('portuguese')

# Remove stop words
only_text = new_df['all_concat'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
only_text.head()

0    garopaba sc backend fisico remoto conhece powe...
1    sao bernardo campo sp backend fisico remoto co...
2    sao paulo sp frontend remoto conhece reactnati...
3    canoas rs backend fisico conhece powerbi r pyt...
4    curitiba pr frontend fisico remoto conhece pyt...
Name: all_concat, dtype: object

In [19]:
# Word tokenization
only_text = only_text.apply(lambda x : x.split(" "))

In [20]:
only_text

0     [garopaba, sc, backend, fisico, remoto, conhec...
1     [sao, bernardo, campo, sp, backend, fisico, re...
2     [sao, paulo, sp, frontend, remoto, conhece, re...
3     [canoas, rs, backend, fisico, conhece, powerbi...
4     [curitiba, pr, frontend, fisico, remoto, conhe...
5     [cabo, frio, rj, frontend, fisico, conhece, re...
6     [buzios, rj, frontend, fisico, remoto, conhece...
7     [rio, janeiro, rj, backend, remoto, conhece, s...
8     [sao, bernardo, campo, sp, backend, fisico, co...
9     [sao, bernardo, campo, sp, frontend, remoto, c...
10    [aracaju, backend, fisico, conhece, python, py...
11    [aracaju, backend, fisico, conhece, php, ruby,...
12    [sao, paulo, sp, frontend, fisico, conhece, ht...
13    [florianopolis, sc, backend, fisico, conhece, ...
14    [florianopolis, sc, gestao, remoto, conhece, r...
15    [porto, alegre, rs, backend, fisico, conhece, ...
16    [sao, paulo, sp, backend, fisico, conhece, dee...
17    [sao, bernardo, campo, sp, backend, fisico

In [21]:
# Apply stemming to the dataset
only_text = only_text.apply(lambda x : [stemmer.stem(y) for y in x])
print(only_text.head())

0    [garopab, sc, backend, fisic, remot, conhec, p...
1    [sao, bernard, camp, sp, backend, fisic, remot...
2    [sao, paul, sp, frontend, remot, conhec, react...
3    [cano, rs, backend, fisic, conhec, powerb, r, ...
4    [curitib, pr, frontend, fisic, remot, conhec, ...
Name: all_concat, dtype: object


In [22]:
# Join the words to re-create the phrases after stemming
only_text = only_text.apply(lambda x : " ".join(x))
print(only_text.head())

0    garopab sc backend fisic remot conhec powerb t...
1    sao bernard camp sp backend fisic remot conhec...
2    sao paul sp frontend remot conhec reactnativ b...
3    cano rs backend fisic conhec powerb r python s...
4    curitib pr frontend fisic remot conhec python ...
Name: all_concat, dtype: object


In [23]:
final_df = pd.DataFrame()

In [24]:
final_df['text'] = only_text
final_df['AplicanteID'] = new_df['AplicanteID']

In [25]:
final_df.head()

Unnamed: 0,text,AplicanteID
0,garopab sc backend fisic remot conhec powerb t...,1
1,sao bernard camp sp backend fisic remot conhec...,2
2,sao paul sp frontend remot conhec reactnativ b...,3
3,cano rs backend fisic conhec powerb r python s...,4
4,curitib pr frontend fisic remot conhec python ...,5


**TF-IDF:**

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying how important a word is to a document within a corpus: each word receives a weight that grows with its frequency in the document (TF) and shrinks with the number of documents that contain it (IDF). The IDF is therefore very low for the most frequent words, such as stop words: a word like “is” appears in almost every document, so N/df (the number of documents N divided by the number of documents df that contain the word) is close to 1 and its logarithm is close to 0. The result is what we want: a relative weighting of the terms.
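
As an illustration, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula idf = ln((1 + N) / (1 + df)) + 1 followed by L2 normalisation, which, as far as I know, matches the default settings of scikit-learn's `TfidfVectorizer`. The function name `tfidf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "frontend remot react",
    "backend fisic python",
    "frontend fisic react react",
]
vectors = tfidf(docs)
# 'remot' occurs in one document, 'frontend' in two, so 'remot' weighs more
print(vectors[0]["remot"] > vectors[0]["frontend"])
```

One difference worth keeping in mind: this sketch splits on whitespace, whereas the default token pattern of `TfidfVectorizer` keeps only tokens with two or more word characters, so single-letter skills such as `r` or `c` would be dropped there.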

In [27]:
# Instantiate the vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vocabulary on the applicants' text and transform it into TF-IDF vectors
tfidf_app = tfidf_vectorizer.fit_transform(final_df['text'])
tfidf_app

<30x131 sparse matrix of type '<class 'numpy.float64'>'
	with 553 stored elements in Compressed Sparse Row format>

# Enterprise Section

The cleaning and TF-IDF techniques described above are reused in this section for the job postings.

## Reading Enterprise Data

In [28]:
empresas_df = pd.read_csv('empresas_bckp.csv', encoding='ISO-8859-1')

In [29]:
empresas_df

Unnamed: 0,VagaID,NomeEmpresa,Setor,Cidade,Estado,NomeVaga,LadoAplicacao,TipoTrabalho,TecnologiasNecessarias,Ingles,InglesObrigatorio,Experiencia,DescricaoVaga
0,1,Globo,Tecnologia da informacao e servicos; Telecomun...,Rio de Janeiro,RJ,Cientista de Dados,backend,Fisico,PowerBI GoogleAnalytics GoogleDataStudio SQL R...,Intermediario,Nao,Pleno,Prover ao time Globoplay dados e informacao qu...
1,2,Infosys,Tecnologia da informacao e servicos,Sao Paulo,SP,Desenvolver React,frontend,Remoto,AWS React JavaScript ReactJS,Avancado,Sim,Pleno Senior,A Infosys está procurando um React Developer p...


## Preprocessing

In [30]:
empresas_df['all_concat'] = empresas_df['Cidade'] + " " + empresas_df['Estado'] + " "  + empresas_df['Setor'] + " " + empresas_df['NomeVaga'] + " "  + empresas_df['LadoAplicacao'] + " " +  empresas_df['TipoTrabalho'] + " " + empresas_df['TecnologiasNecessarias'] + " " + empresas_df['Ingles'] + " " +empresas_df['Experiencia'] + " " + empresas_df['TipoTrabalho']

In [31]:
empresas_df['all_concat'][0]

'Rio de Janeiro RJ Tecnologia da informacao e servicos; Telecomunicacoes Cientista de Dados backend Fisico PowerBI GoogleAnalytics GoogleDataStudio SQL R Python Java Machine Learning  Intermediario Pleno Fisico'

In [32]:
empresas_df['all_concat'] = empresas_df['all_concat'].apply(lambda x: limpa_dados(x))

In [33]:
VagaID = 2
index = np.where(empresas_df['VagaID'] == VagaID)[0][0]
vaga = empresas_df.iloc[[index]]
vaga

Unnamed: 0,VagaID,NomeEmpresa,Setor,Cidade,Estado,NomeVaga,LadoAplicacao,TipoTrabalho,TecnologiasNecessarias,Ingles,InglesObrigatorio,Experiencia,DescricaoVaga,all_concat
1,2,Infosys,Tecnologia da informacao e servicos,Sao Paulo,SP,Desenvolver React,frontend,Remoto,AWS React JavaScript ReactJS,Avancado,Sim,Pleno Senior,A Infosys está procurando um React Developer p...,sao paulo sp tecnologia da informacao e servic...


In [34]:
# Transform the job text with the vectorizer already fitted on the applicants (no re-fitting)
vaga_tfidf = tfidf_vectorizer.transform(vaga['all_concat']) 
vaga_tfidf

<1x131 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>
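
One thing to watch here: the vectorizer was fitted on applicant text that had stop words removed and stemming applied, while the job text above only goes through `limpa_dados`. An unstemmed token such as `remoto` is not in a vocabulary that contains `remot`, which may partly explain why so few elements are stored. The sketch below shows the effect with one shared preprocessing function; the stop-word set, the stand-in stemmer and the name `preprocess` are all hypothetical (the notebook itself uses NLTK's Portuguese stop words and `RSLPStemmer`).

```python
def preprocess(text, stopwords, stem):
    """Lowercase, drop stop words and stem: one function for both sides."""
    tokens = [tok for tok in text.lower().split() if tok not in stopwords]
    return " ".join(stem(tok) for tok in tokens)


# Hypothetical stand-ins for the NLTK stop-word list and stemmer
STOPWORDS = {"de", "da", "e"}


def toy_stem(word):
    return word[:-1] if len(word) > 3 and word[-1] in "aeio" else word


# Vocabulary learned from a (stemmed) applicant text
applicant = preprocess("Sao Paulo frontend Remoto", STOPWORDS, toy_stem)
vocabulary = set(applicant.split())

job_raw = "sao paulo tecnologia da informacao frontend remoto"
job_same = preprocess(job_raw, STOPWORDS, toy_stem)

shared_raw = vocabulary & set(job_raw.split())    # cleaned only
shared_same = vocabulary & set(job_same.split())  # same pipeline as applicants
print(sorted(shared_raw), sorted(shared_same))
# ['frontend', 'sao'] ['frontend', 'paul', 'remot', 'sao']
```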

# Cosine Similarity 

This section compares the TF-IDF vectors of the applicants with the TF-IDF vector of the job posting in order to find the best matches.

**Cosine Similarity:** 

Cosine similarity is a metric that measures how similar two documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. It is advantageous because even if two similar documents are far apart in Euclidean distance (due to a difference in document size), they may still point in nearly the same direction. The smaller the angle, the higher the cosine similarity.
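
A small sketch of the definition, using only the standard library. The vectors are made-up term counts, and returning 0.0 for an all-zero vector is a choice made for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare
    return dot / (norm_u * norm_v)


short_doc = [1, 2, 0]    # toy term counts
long_doc = [10, 20, 0]   # same proportions, ten times longer
other_doc = [0, 0, 5]    # shares no terms with the other two

print(cosine(short_doc, long_doc))     # ~1.0: same direction
print(math.dist(short_doc, long_doc))  # ~20.1: far apart in Euclidean terms
print(cosine(short_doc, other_doc))    # 0.0: nothing in common
```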

In [36]:
# Compute the similarity between the job vector and each applicant vector
output = [cosine_similarity(vaga_tfidf, candidate) for candidate in tfidf_app]

In [37]:
output

[array([[0.]]),
 array([[0.14495325]]),
 array([[0.36666596]]),
 array([[0.]]),
 array([[0.05164615]]),
 array([[0.26835093]]),
 array([[0.05591006]]),
 array([[0.]]),
 array([[0.14866374]]),
 array([[0.19802732]]),
 array([[0.]]),
 array([[0.05015067]]),
 array([[0.27667929]]),
 array([[0.]]),
 array([[0.15470045]]),
 array([[0.]]),
 array([[0.18608373]]),
 array([[0.18189931]]),
 array([[0.0446273]]),
 array([[0.04808267]]),
 array([[0.10468186]]),
 array([[0.]]),
 array([[0.]]),
 array([[0.0632705]]),
 array([[0.]]),
 array([[0.]]),
 array([[0.23959655]]),
 array([[0.]]),
 array([[0.03694011]]),
 array([[0.43994609]])]

In [38]:
# Collect the scalar similarity scores (named 'distances' here, although higher means more similar)
distances = []
for i in range(0,len(final_df)):
  distances.append(output[i][0][0])

In [39]:
# Put it to the dataframe
final_df['distances'] = distances

# Top 5 Matches

Sorting by the highest similarity scores (stored in the `distances` column) gives the best applicants for the job description.
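
As a cross-check outside pandas, the ranking step can be sketched with the standard library. The scores are the values printed by In [37], rounded to four decimals. The helper name `top_k` and the assumption that AplicanteID runs from 1 to 30 in row order are mine.

```python
import heapq


def top_k(ids, scores, k=5):
    """Return the ids of the k highest-scoring items, best first."""
    best = heapq.nlargest(k, zip(scores, ids))
    return [item_id for _, item_id in best]


# Similarity scores from In [37], rounded to four decimals
scores = [
    0.0, 0.1450, 0.3667, 0.0, 0.0516, 0.2684, 0.0559, 0.0, 0.1487, 0.1980,
    0.0, 0.0502, 0.2767, 0.0, 0.1547, 0.0, 0.1861, 0.1819, 0.0446, 0.0481,
    0.1047, 0.0, 0.0, 0.0633, 0.0, 0.0, 0.2396, 0.0, 0.0369, 0.4399,
]
applicant_ids = list(range(1, len(scores) + 1))  # assumed: IDs 1..30 in row order

print(top_k(applicant_ids, scores))
# [30, 3, 13, 6, 27]
```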

In [40]:
# Sort the dataframe by similarity, highest first
top_5 = final_df.sort_values('distances', ascending = False)['AplicanteID'][:5].values

In [41]:
# Best recommendations
recomendacao_aplicantes = pd.DataFrame(columns = ['Top 5','VagaID', 'AplicanteID Recomendado'])

for i in range(0,len(top_5)):
  recomendacao_aplicantes.at[i, 'Top 5'] =  i+1
  recomendacao_aplicantes.at[i, 'VagaID'] = VagaID 
  recomendacao_aplicantes.at[i, 'AplicanteID Recomendado'] = top_5[i]

In [42]:
recomendacao_aplicantes

Unnamed: 0,Top 5,VagaID,AplicanteID Recomendado
0,1,2,30
1,2,2,3
2,3,2,13
3,4,2,6
4,5,2,27


In [44]:
# Shows the job
vaga

Unnamed: 0,VagaID,NomeEmpresa,Setor,Cidade,Estado,NomeVaga,LadoAplicacao,TipoTrabalho,TecnologiasNecessarias,Ingles,InglesObrigatorio,Experiencia,DescricaoVaga,all_concat
1,2,Infosys,Tecnologia da informacao e servicos,Sao Paulo,SP,Desenvolver React,frontend,Remoto,AWS React JavaScript ReactJS,Avancado,Sim,Pleno Senior,A Infosys está procurando um React Developer p...,sao paulo sp tecnologia da informacao e servic...


In [43]:
# Shows the best applicants
df.set_index('AplicanteID').loc[top_5]

Unnamed: 0_level_0,Nome,Sobrenome,Email,Cidade,Estado,LadoAplicacao,TipoTrabalho,Tecnologias,MelhorTecnologia,Ingles,ExperienciaTrabalho,DescricaoExperiencia,all_concat
AplicanteID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
30,André,Duarte,teste@demo.com.br,Osasco,SP,frontend,Remoto,React ReactJS AWS SQL,ReactJS,Avancado,Senior,Procuro uma vaga de programador front-end para...,Osasco SP frontend Remoto conhece React ReactJ...
3,Lucas,Pereira,teste@demo.com.br,Sao Paulo,SP,frontend,Remoto,ReactNative Bootstrap Java JavaScript jQuery,React,Basico,Pleno,Gostaria de uma vaga no campo de programacao f...,Sao Paulo SP frontend Remoto conhece ReactNati...
13,Gabriela,Rumi,teste@demo.com.br,Sao Paulo,SP,frontend,Fisico,HTML CCS JavaScript Bootstrap,HTML,Basico,Senior,Procuro uma vaga de programador front-end,Sao Paulo SP frontend Fisico conhece HTML CCS ...
6,Antônia,Garibaldi,teste@demo.com.br,Cabo Frio,RJ,frontend,Fisico,React jQuery,jQuery,Intermediario,Junior,Procuro uma vaga de programador front-end,Cabo Frio RJ frontend Fisico conhece React jQu...
27,Pedro,Araújo,teste@demo.com.br,Novo Hamburgo,RS,fullstack,Remoto,HTML CCS JavaScript Python C AWS React ReactNa...,JavaScript,Intermediario,Senior,Gostaria de uma vaga no campo de programacao f...,Novo Hamburgo RS fullstack Remoto conhece HTML...


# Some further improvements:

* Filling out a form can be exhausting, so a possible improvement is to use an OCR library such as EasyOCR to read a résumé in seconds and feed the extracted text to the NLP model.