**VanHackaton challenge: How can we match you to your dream job? (Leonardo Bentes)**
===



# I - Setup

## Libraries

In [0]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import tensorflow
from tensorflow import keras
from tensorflow.keras.models import load_model

# text treatment
import nltk
import re
from string import punctuation

## Shared functions

In [0]:
# Text preprocessing for vectorization
def columnTreatment(dataframe, column, returnList=True):
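  # Drop a trailing period; regex=True is explicit because newer pandas treats the pattern literally by default
  aux = dataframe[column].str.replace(r'\.$', '', regex=True)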
  if returnList:
    aux = aux.str.split(pat=',')
  return aux
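
As a quick illustration of what `columnTreatment` returns, here is a minimal sketch on a made-up two-row DataFrame (`toy_df` and its values are assumptions for illustration only). The function is restated, with `regex=True` passed explicitly, so the snippet runs on its own.

```python
import pandas as pd

def columnTreatment(dataframe, column, returnList=True):
  # Drop a trailing period, then optionally split on commas
  aux = dataframe[column].str.replace(r'\.$', '', regex=True)
  if returnList:
    aux = aux.str.split(pat=',')
  return aux

# Toy data, not taken from the challenge files
toy_df = pd.DataFrame({'Skills': ['lua, c++, sql server.', 'node.js, react.js']})

as_lists = columnTreatment(toy_df, 'Skills').tolist()
as_text = columnTreatment(toy_df, 'Skills', returnList=False).tolist()
print(as_lists)
print(as_text)
```

Note that splitting on `','` keeps the space after each comma (for example `' c++'`). That is harmless here because the vectorizer tokenizes on word boundaries later, but it would matter if the items were compared as exact strings.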

In [0]:
def seriesVectorize(series):
  text = []
  for i in range(series.shape[0]):
    text = text + series[i]
  
  text = list(dict.fromkeys(text))
  text = pd.Series(text)

  vectorizer = CountVectorizer(ngram_range=(1,3))
  vectorizer.fit(text)

  return vectorizer
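
To make the effect of `ngram_range=(1,3)` concrete, the sketch below fits the same kind of `CountVectorizer` on three made-up terms (the `terms` series is an assumption, not data from the challenge). It also shows a side effect of the default `token_pattern`, which only keeps tokens of two or more word characters.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy terms chosen to show n-grams and punctuation handling
terms = pd.Series(['ruby on rails', 'c++', 'node.js'])

vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(terms)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

With the default settings, `'ruby on rails'` contributes its unigrams, bigrams and the full trigram, `'node.js'` is split into `'node'`, `'js'` and `'node js'`, and `'c++'` contributes nothing because `'c'` is a single character. Skills such as `c`, `c++` or `c#` are therefore invisible to a vectorizer built this way.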

In [0]:
# Function to strip HTML tags from job description.
# **NOTE**: After several tries, including the description in the prediction
# seemed to give worse results than leaving it out. Although this function is
# no longer used in this notebook, I left it here as documentation and as a
# backup in case I decide to try it again in the future.
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    tags_removed = TAG_RE.sub('', text)
    tags_removed = tags_removed.replace('&nbsp;', ' ')
    tags_removed = tags_removed.replace('\n', ' ')
    tags_removed = tags_removed.replace('\r', ' ')
    return tags_removed
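
If the description is ever revisited, a slightly more thorough cleanup could look like the sketch below. `clean_description` is a hypothetical variant, not the function used in this notebook: it replaces tags with a space (so words in adjacent tags do not fuse), decodes all HTML entities with the standard library's `html.unescape`, and collapses runs of whitespace. The sample string is made up in the style of the job descriptions.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_description(text):
    """Hypothetical variant of remove_tags: strip tags, decode entities, collapse whitespace."""
    no_tags = TAG_RE.sub(' ', text)      # a space keeps neighbouring words apart
    decoded = html.unescape(no_tags)     # handles &nbsp;, &amp;, and other entities
    return ' '.join(decoded.split())     # split() also treats non-breaking spaces as whitespace

# Made-up sample in the style of the Responsibilities column
sample = ('<p>This is&nbsp;senior Java Developer role.</p>\r\n'
          '<ul>\n <li>Strong knowledge of REST &amp; Linux</li>\n</ul>')
cleaned = clean_description(sample)
print(cleaned)
```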

In [0]:
# Defining experience level as:
# Entry - up to 3 years (0)
# Mid - between 4 and 7 years (1)
# Senior - 8+ years (2)
# Source: https://talent.works/2018/03/28/the-science-of-the-job-search-part-iii-61-of-entry-level-jobs-require-3-years-of-experience/

def experienceLevel(yearsOfExperience):
  if yearsOfExperience <= 3:
    return 0
  elif yearsOfExperience <= 7:
    return 1
  return 2
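
The same bucketing can be applied to a whole column at once. The sketch below is an alternative to calling `experienceLevel` row by row, using `pd.cut` with bin edges that mirror the thresholds above; the `years` values are made up to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Boundary values for the three buckets: Entry (<= 3), Mid (4-7), Senior (8+)
years = pd.Series([0, 3, 4, 7, 8, 31])

# Right-closed intervals: (-inf, 3], (3, 7], (7, inf)
levels = pd.cut(years, bins=[-np.inf, 3, 7, np.inf], labels=[0, 1, 2]).astype(int)
print(levels.tolist())
```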

## Acquire data

In [0]:
import urllib.request
import shutil
import os

def download(url, filename):
  if not os.path.isfile(filename):
    print('downloading %s' % filename)
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as out_file:
      shutil.copyfileobj(response, out_file)

In [0]:
download('https://drive.google.com/uc?export=download&id=1zAg8WaqxxOL9r6n8Mss5rD7ic_ZPz-JN', 'AvailableCandidates.csv')
download('https://drive.google.com/uc?export=download&id=1Iqqs9IDUVvhkXQw7Nu3otaJ83YL2fS3u', 'HiredCandidates.csv')
download('https://drive.google.com/uc?export=download&id=1Ud6VYR26rm7o6RzAaT8gwmESeKdUS1qN', 'HiredJobDetails.csv')
download('https://drive.google.com/uc?export=download&id=1h7E5uxOYvslS74Pw1_OLtBlrZ5_J95nT', 'JobsToPredict.csv')

downloading AvailableCandidates.csv
downloading HiredCandidates.csv
downloading HiredJobDetails.csv
downloading JobsToPredict.csv


In [0]:
available_candidates_df = pd.read_csv('AvailableCandidates.csv')
hired_candidates_df = pd.read_csv('HiredCandidates.csv')
hired_job_details_df = pd.read_csv('HiredJobDetails.csv')
jobs_to_predict_df = pd.read_csv('JobsToPredict.csv')

print('available candidates shape =', available_candidates_df.shape)
print('hired candidates shape =', hired_candidates_df.shape)
print('hired jobs details shape =', hired_job_details_df.shape)
print('job to predict shape =', jobs_to_predict_df.shape)

available candidates shape = (10000, 5)
hired candidates shape = (202, 6)
hired jobs details shape = (160, 4)
job to predict shape = (10, 4)


# II - Understand data

## Available Candidates (AC)

In [0]:
available_candidates_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
UserId               10000 non-null int64
Skills               10000 non-null object
UsersPosition        9997 non-null object
YearsOfExperience    10000 non-null int64
EnglishLevel         10000 non-null int64
dtypes: int64(3), object(2)
memory usage: 390.8+ KB


In [0]:
available_candidates_df.describe()

Unnamed: 0,UserId,YearsOfExperience,EnglishLevel
count,10000.0,10000.0,10000.0
mean,658855.2005,8.553,1.4753
std,19099.316565,5.176338,1.583237
min,533334.0,0.0,0.0
25%,646461.75,5.0,0.0
50%,662517.0,8.0,0.0
75%,672958.25,11.0,3.0
max,682045.0,50.0,4.0


In [0]:
available_candidates_df.head()

Unnamed: 0,UserId,Skills,UsersPosition,YearsOfExperience,EnglishLevel
0,533334,"game maker 1.4, lua, fabric8, h264 encoding, t...",Software Engineer,0,2
1,533339,"ruby on rails, sql server, javascript, html5, ...",Web Developer Fullstack,9,4
2,533343,"ruby on rails, scrum, javascript, wordpress, h...",Full stack Ruby on Rails developer,10,0
3,533344,"javascript, spring, jsp, hibernate, jpa, servl...",Results-driven Software Engineer,4,0
4,533348,"2d, ux, ilustrator, art direction, logo design...",UI/UX Designer & Digital Visual Design,10,0


In [0]:
available_candidates_df['Skills']

0       game maker 1.4, lua, fabric8, h264 encoding, t...
1       ruby on rails, sql server, javascript, html5, ...
2       ruby on rails, scrum, javascript, wordpress, h...
3       javascript, spring, jsp, hibernate, jpa, servl...
4       2d, ux, ilustrator, art direction, logo design...
                              ...                        
9995    python, c, c++, c for microcontroller, artific...
9996    sql, javascript, laravel, html, bootstrap, css...
9997    sql, ruby on rails, javascript, python, html5,...
9998    testing, test planning and test script, jira, ...
9999    sql, docker, spring boot, kubernetes, linux, p...
Name: Skills, Length: 10000, dtype: object

In [0]:
available_candidates_df['UsersPosition']

0                                  Software Engineer
1                           Web Developer Fullstack 
2                 Full stack Ruby on Rails developer
3                   Results-driven Software Engineer
4             UI/UX Designer & Digital Visual Design
                            ...                     
9995    Computer Engineer and Open Source enthusiast
9996                             Front End Developer
9997       back end developer  - desktop application
9998                             Senior Test Manager
9999                             Hardcore programmer
Name: UsersPosition, Length: 10000, dtype: object

In [0]:
# Check if there are users with more than one position
users_position_sr = available_candidates_df['UsersPosition'].dropna()
users_position_sr[users_position_sr.str.contains(',')]

7               Entrepreneur, Customer Success, Developer
10       DevOps, System Admin, Network Admin, IT Security
17        Support Analyst, Front-End Developer and DevOps
30                  Senior System Analyst, CSM® and CSPO®
32                            Social Media, Content & SEO
                              ...                        
9889         Talented Desktop, Web & Mobile App developer
9895                     Full Stack Engineer, Independent
9924      Data analysis,  Technical support,  Consultant,
9963    Analytics Consultant, Data Architect, Data Mod...
9991                     Physics Teacher, Java Programmer
Name: UsersPosition, Length: 594, dtype: object

## Hired Candidates (HC)

In [0]:
hired_candidates_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 6 columns):
UserId               202 non-null int64
HiredForJobId        202 non-null int64
Skills               202 non-null object
UsersPosition        202 non-null object
EnglishLevel         202 non-null int64
YearsOfExperience    202 non-null int64
dtypes: int64(4), object(2)
memory usage: 9.6+ KB


In [0]:
hired_candidates_df.describe()

Unnamed: 0,UserId,HiredForJobId,EnglishLevel,YearsOfExperience
count,202.0,202.0,202.0,202.0
mean,686442.212871,1610.316832,3.10396,8.693069
std,55340.539454,451.801346,0.735783,4.112214
min,533441.0,902.0,0.0,0.0
25%,656681.0,1359.25,3.0,6.0
50%,678221.0,1538.0,3.0,8.0
75%,685845.5,1666.5,3.0,11.0
max,955025.0,3002.0,4.0,31.0


In [0]:
hired_candidates_df.head()

Unnamed: 0,UserId,HiredForJobId,Skills,UsersPosition,EnglishLevel,YearsOfExperience
0,685478,902,".net, sql server, javascript, typescript, angu...",Full Stack Developer,0,12
1,726160,946,"java, bamboo, relational database, soap, ejb, ...",Fullstack Developer,3,4
2,656464,976,"docker, oracle db, postgresql, java, git, spri...",Developer,3,10
3,673774,976,"angular, architectural patterns, cloud computi...",Java Architect and Senior Java Developer,3,13
4,680888,976,"postgresql, architecture, junit, java, sql ser...",Software Developer,3,12


In [0]:
hired_candidates_df['Skills']

0      .net, sql server, javascript, typescript, angu...
1      java, bamboo, relational database, soap, ejb, ...
2      docker, oracle db, postgresql, java, git, spri...
3      angular, architectural patterns, cloud computi...
4      postgresql, architecture, junit, java, sql ser...
                             ...                        
197    c#, ASP.Net MVC, .net, azure, sql server, reac...
198    react native, docker, spring mvc, spring boot,...
199    python, java, quality assurance, test automati...
200    sidekiq, javascript, rspec, sql, postgresql, r...
201    git, node.js, mysql, javascript, angular, reac...
Name: Skills, Length: 202, dtype: object

In [0]:
users_position_sr = hired_candidates_df['UsersPosition']
users_position_sr[users_position_sr.str.contains(',')]

13     System Analyst/Full Stack Developer (Java, Ang...
29                Software Engineer, Fullstack Developer
30                        Android Engineer, iOS Engineer
80                            Software Developer, Andela
87     Backend focused web developer, with a Canadian...
98     Java  Developer ( J2EE ,Spring, SOA,MicroService)
117    DevOps, Infrastructure & Site Reliability Engi...
121    Expert at design, develop and maintain high-sc...
166    IT Service, Product & Software Engineering Man...
189    DevOps, Infrastructure & Site Reliability Engi...
197         .Net, Azure (IOT, Storage), React specialist
Name: UsersPosition, dtype: object

In [0]:
hired_candidates_df['EnglishLevel'].value_counts()

3    147
4     45
0      7
2      3
Name: EnglishLevel, dtype: int64

In [0]:
hired_candidates_df['YearsOfExperience'].value_counts()

9     25
8     25
5     23
6     20
4     18
12    14
10    14
7     14
11     9
13     9
15     8
3      6
14     5
19     3
16     3
1      1
31     1
22     1
17     1
18     1
0      1
Name: YearsOfExperience, dtype: int64

In [0]:
# Checking the outlier
hired_candidates_df[hired_candidates_df['YearsOfExperience'] == 0]

Unnamed: 0,UserId,HiredForJobId,Skills,UsersPosition,EnglishLevel,YearsOfExperience
65,699944,1419,"c, c for microcontroller, c++, python, network...",Back-End and System Software Engineer,3,0


In [0]:
# Show full column contents (None replaces the deprecated -1 in pandas >= 1.0)
pd.set_option('display.max_colwidth', None)
hired_job_details_df[hired_job_details_df['JobId'] == 1419]

Unnamed: 0,JobId,Responsibilities,POSITION,Skills
157,1419,"<p>Great company in Montreal working with cutting edge technologies! They are looking to hire a Linux Administrator to help them build the largest edge computing platform in the world!</p>\n<p><strong>Required Skills:</strong></p>\n<ul>\n <li>Experience in building and administering Linux operating systems (Red Hat/CentOS, Debian/Ubuntu)</li>\n <li>Strong knowledge of REST architecture</li>\n <li>Familiar with storage in the cloud and auto-scaling</li>\n <li>Experience with configuration managers (Ansible, Chef)</li>\n <li>Experience with large-scale, high availability server architectures</li>\n <li>Experience with load balancers (55, HAProxy)</li>\n</ul>\n<p><br></p>",Technical Operations Specialist,"kubernetes, linux, python, ubuntu server"


In [0]:
pd.set_option('display.max_colwidth', 50)

## Hired Job Details (HJ)

In [0]:
hired_job_details_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 4 columns):
JobId               160 non-null int64
Responsibilities    159 non-null object
POSITION            160 non-null object
Skills              160 non-null object
dtypes: int64(1), object(3)
memory usage: 5.1+ KB


In [0]:
# One job has no Responsibilities text; keeping the row for now
hired_job_details_df[hired_job_details_df['Responsibilities'].isna()]

Unnamed: 0,JobId,Responsibilities,POSITION,Skills
88,902,,Senior .NET Developer,".net, agile, ajax, angular, ASP.Net MVC, c#, h..."


In [0]:
hired_job_details_df.describe()

Unnamed: 0,JobId
count,160.0
mean,1641.46875
std,493.561552
min,902.0
25%,1361.75
50%,1540.0
75%,1696.5
max,3002.0


In [0]:
hired_job_details_df.head()

Unnamed: 0,JobId,Responsibilities,POSITION,Skills
0,1165,<p>AWS experience is a MUST HAVE!</p>\r\n<p>A ...,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst..."
1,1330,<ul>\n <li><u><em><strong>FLUENT SPANISH IS R...,DevOps Engineer,"docker, jenkins, kubernetes, nexus"
2,1352,<p><br></p>\n<p>SEE THE FULL JOB DESCRIPTION G...,DevOps Engineer,"aws, devops"
3,1585,<p>Company in Montreal is looking for a DevOps...,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube..."
4,1623,<p><strong>What you will do:</strong><br>\nYou...,DevOps Engineer,"ansible, aws, devops, docker, gitlab, go lang,..."


In [0]:
hired_job_details_df['Skills']

0      aws, azure, devops, docker, kubernetes, openst...
1                     docker, jenkins, kubernetes, nexus
2                                            aws, devops
3      aws, docker, ec2, elasticsearch, jenkins, kube...
4      ansible, aws, devops, docker, gitlab, go lang,...
                             ...                        
155    algorithms, data structures, java, javascript,...
156    account management, communication, experience ...
157             kubernetes, linux, python, ubuntu server
158    agile methodologies, leadership, product manag...
159    angular, java, javascript, node.js, python, re...
Name: Skills, Length: 160, dtype: object

In [0]:
users_position_sr = hired_job_details_df['POSITION']
users_position_sr[users_position_sr.str.contains(',')]

114    Senior Ruby on Rails Developer, Full Stack 
Name: POSITION, dtype: object

## Hired Job Details vs Hired Candidates

In [0]:
hired_jobs_vs_candidates_df = hired_job_details_df.merge(hired_candidates_df, left_on='JobId', right_on='HiredForJobId', suffixes=('_J', '_C'))
hired_jobs_vs_candidates_df.shape

(202, 10)

In [0]:
hired_jobs_vs_candidates_df.columns

Index(['JobId', 'Responsibilities', 'POSITION', 'Skills_J', 'UserId',
       'HiredForJobId', 'Skills_C', 'UsersPosition', 'EnglishLevel',
       'YearsOfExperience'],
      dtype='object')

In [0]:
hired_jobs_vs_candidates_df.rename(columns={'POSITION': 'Position_J', 'UsersPosition': 'Position_C'}, inplace=True)

In [0]:
hired_jobs_vs_candidates_df.head()


Unnamed: 0,JobId,Responsibilities,Position_J,Skills_J,UserId,HiredForJobId,Skills_C,Position_C,EnglishLevel,YearsOfExperience
0,1165,<p>AWS experience is a MUST HAVE!</p>\r\n<p>A ...,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",680450,1165,"tcpdump, http, itil, git flow, docker, linux, ...",Devops Engineer Senior,0,13
1,1330,<ul>\n <li><u><em><strong>FLUENT SPANISH IS R...,DevOps Engineer,"docker, jenkins, kubernetes, nexus",646309,1330,"operating systems, shell script, mysql, automa...",DevOps Engineer,3,10
2,1352,<p><br></p>\n<p>SEE THE FULL JOB DESCRIPTION G...,DevOps Engineer,"aws, devops",684038,1352,"kubernetes, openshift, git, elk stack, ci/cd a...",Senior DevOps Engineer,3,7
3,1585,<p>Company in Montreal is looking for a DevOps...,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube...",715544,1585,"devops, System Administration, docker, python,...","DevOps, Infrastructure & Site Reliability Engi...",3,9
4,1623,<p><strong>What you will do:</strong><br>\nYou...,DevOps Engineer,"ansible, aws, devops, docker, gitlab, go lang,...",679027,1623,"nginx, jenkins, cloud computing, kubernetes, j...",DevOps engineer,3,9


In [0]:
# Check for jobs with more than one hire
jobs_more_than_one_hire = hired_jobs_vs_candidates_df[hired_jobs_vs_candidates_df.duplicated(subset=['JobId'], keep=False)]
jobs_more_than_one_hire.shape

(65, 10)
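
`duplicated(..., keep=False)` flags every row that belongs to a repeated `JobId`, so the 65 above counts rows, not jobs. A minimal sketch of how the number of hires per job could be obtained instead, on a toy frame (`hires` is a small stand-in built for illustration, not the merged dataset):

```python
import pandas as pd

# Toy stand-in for the merged frame: three hires for one job, two for another, one single hire
hires = pd.DataFrame({
    'JobId':  [976, 976, 976, 1444, 1444, 1165],
    'UserId': [656464, 673774, 680888, 678956, 768863, 680450],
})

# Every row whose JobId appears more than once
dup_rows = hires[hires.duplicated(subset=['JobId'], keep=False)]

# Number of hires per job, restricted to jobs with more than one hire
hires_per_job = hires.groupby('JobId').size()
multi_hire_jobs = hires_per_job[hires_per_job > 1]

print(len(dup_rows))
print(multi_hire_jobs.to_dict())
```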

In [0]:
# It seems there's nothing special about the duplicates
jobs_more_than_one_hire

Unnamed: 0,JobId,Responsibilities,Position_J,Skills_J,UserId,HiredForJobId,Skills_C,Position_C,EnglishLevel,YearsOfExperience
9,976,<p>This is&nbsp;senior Java Developer role.</p...,senior Java Developer,"agile, ansible, design patterns, hadoop, java,...",656464,976,"docker, oracle db, postgresql, java, git, spri...",Developer,3,10
10,976,<p>This is&nbsp;senior Java Developer role.</p...,senior Java Developer,"agile, ansible, design patterns, hadoop, java,...",673774,976,"angular, architectural patterns, cloud computi...",Java Architect and Senior Java Developer,3,13
11,976,<p>This is&nbsp;senior Java Developer role.</p...,senior Java Developer,"agile, ansible, design patterns, hadoop, java,...",680888,976,"postgresql, architecture, junit, java, sql ser...",Software Developer,3,12
20,1444,<p>Toronto based company looks for a Back End ...,Back End Software Developer,"any programmer language, mysql, nosql, oriente...",678956,1444,"javascript, angular, mongodb, aws, git, react....",Software Engineer,4,4
21,1444,<p>Toronto based company looks for a Back End ...,Back End Software Developer,"any programmer language, mysql, nosql, oriente...",768863,1444,"docker, java, spring boot, mongodb, python, mi...",Software Engineer,3,3
...,...,...,...,...,...,...,...,...,...,...
196,1419,<p>Great company in Montreal working with cutt...,Technical Operations Specialist,"kubernetes, linux, python, ubuntu server",714908,1419,"jenkins, shell script, middleware, infrastruct...",DevOps/Infrastructure specialist,4,4
197,1522,<p>-4+ years in software development and produ...,Technical Product Manager,"agile methodologies, leadership, product manag...",679775,1522,"sql, business analysis, tdd, project managemen...",A product is business meeting tech to create v...,3,8
198,1522,<p>-4+ years in software development and produ...,Technical Product Manager,"agile methodologies, leadership, product manag...",699111,1522,"github, docker, ansible, linux, machine learni...",Product Manager,4,9
199,1522,<p>-4+ years in software development and produ...,Technical Product Manager,"agile methodologies, leadership, product manag...",713188,1522,"marketing strategy, pricing analysis, product ...",Product Manager - Internet of Things,0,11


In [0]:
# Check for inconsistencies
hired_jobs_vs_candidates_df[hired_jobs_vs_candidates_df['JobId'] != hired_jobs_vs_candidates_df['HiredForJobId']].shape

(0, 10)
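
Because the merge is an inner join on exactly these two columns, the comparison above can never find a mismatch. A sketch of a check that can actually surface unmatched rows uses an outer join with `indicator=True`; `validate='one_to_many'` additionally asserts that each `JobId` is unique on the job side. The frames below are toy data (the `UserId` values and job `9999` are made up).

```python
import pandas as pd

# Toy data: job 1419 has no hire, and one hire points at an unknown job 9999
jobs = pd.DataFrame({
    'JobId': [902, 976, 1419],
    'POSITION': ['Senior .NET Developer', 'senior Java Developer',
                 'Technical Operations Specialist'],
})
hires = pd.DataFrame({
    'UserId': [1, 2, 3, 4],
    'HiredForJobId': [902, 976, 976, 9999],
})

merged = jobs.merge(hires, left_on='JobId', right_on='HiredForJobId',
                    how='outer', indicator=True, validate='one_to_many')

# 'both' = matched, 'left_only' = job without hire, 'right_only' = hire without job
matched = int((merged['_merge'] == 'both').sum())
jobs_without_hire = int((merged['_merge'] == 'left_only').sum())
hires_without_job = int((merged['_merge'] == 'right_only').sum())
print(matched, jobs_without_hire, hires_without_job)
```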

In [0]:
# Check for candidates hired more than once
candidate_hired_more_than_once = hired_jobs_vs_candidates_df[hired_jobs_vs_candidates_df.duplicated(subset=['UserId'], keep=False)]
candidate_hired_more_than_once.shape

(6, 10)

In [0]:
candidate_hired_more_than_once.sort_values(by='UserId')

Unnamed: 0,JobId,Responsibilities,Position_J,Skills_J,UserId,HiredForJobId,Skills_C,Position_C,EnglishLevel,YearsOfExperience
100,1272,<p>— 5+ years of experience building and maint...,QA Automation Engineer,"api, c#, java, python, qa, selenium, sql, test...",673900,1272,"rest api, technical documentations, api, pytho...",Software QA Engineering,3,8
159,1335,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Sensei Labs' Candidates,".net, c#, front end, javascript, mobile develo...",673900,1335,"agile methodologies, sql query, bdd - behavior...",Software QA Engineering,3,8
39,1124,<p>About the hiring partner: Cryptocurrency co...,Front End Developer,"css, front end, react.js, redux, typescript",680879,1124,"mongoose, test automation, sql, redis, graphql...",React Application Architect,3,11
151,1593,"<ul>\n <li>Knowledge of modern JavaScript, HT...",Senior Software Developer,"java, javascript, node.js, python, ruby",680879,1593,"firebase, docker, bluemix, eslint, webpack, ba...",React Application Architect,3,11
3,1585,<p>Company in Montreal is looking for a DevOps...,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube...",715544,1585,"devops, System Administration, docker, python,...","DevOps, Infrastructure & Site Reliability Engi...",3,9
38,2871,<p>- 4+ years of experience working in product...,DevOps/Openstack Engineer,"bash script, ci/cd automation, devops, opensta...",715544,2871,"django, azure, server administration, cloudfor...","DevOps, Infrastructure & Site Reliability Engi...",3,9


## Jobs to Predict (JP)

In [0]:
jobs_to_predict_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
JobId               10 non-null int64
Responsibilities    10 non-null object
POSITION            10 non-null object
Skills              10 non-null object
dtypes: int64(1), object(3)
memory usage: 448.0+ bytes


In [0]:
jobs_to_predict_df.describe()

Unnamed: 0,JobId
count,10.0
mean,2998.8
std,30.312447
min,2946.0
25%,2976.25
50%,3012.0
75%,3019.5
max,3030.0


In [0]:
jobs_to_predict_df

Unnamed: 0,JobId,Responsibilities,POSITION,Skills
0,3030,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Backend Developer,"api, aws, google cloud, java, mongodb, node.js..."
1,3018,<p>A global event travel tech company with off...,Backend NodeJS Developer,"angular, http, javascript, node.js, react.js, ..."
2,3004,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Cloud DevOps Specialist,"aws, jboss, linux, perl, python"
3,2946,<p>&nbsp;<em>Good communication skills</em><br...,Game Developer,"c, c++, game development, unity"
4,2967,<p><strong>The perfect candidate:</strong></p>...,Mobile Developer (React Native),"gradle, javascript, jenkins, jest, react nativ..."
5,3020,<p>Awesome company in Toronto is hiring for se...,Senior Front End Engineer,"angular, css, html, javascript, react.js"
6,3009,<p><strong>Must-have-skills</strong></p>\n<p>-...,Senior Mobile App Developer,"dart, ionic framework, javascript, kotlin, nat..."
7,3022,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Software Developer,"angular, javascript, react.js"
8,3015,<p>Waterloo based startup is hiring a Sr Back ...,Sr Backend Developer (Python),"aws, continuos deployment, continuous integrat..."
9,2957,<p>Great startup located in the vibrant Berlin...,Sr Backend Engineer (Java),"clean code, ddd, distributed systems developme..."


In [0]:
jobs_to_predict_df['Skills']

0    api, aws, google cloud, java, mongodb, node.js...
1    angular, http, javascript, node.js, react.js, ...
2                      aws, jboss, linux, perl, python
3                      c, c++, game development, unity
4    gradle, javascript, jenkins, jest, react nativ...
5             angular, css, html, javascript, react.js
6    dart, ionic framework, javascript, kotlin, nat...
7                        angular, javascript, react.js
8    aws, continuos deployment, continuous integrat...
9    clean code, ddd, distributed systems developme...
Name: Skills, dtype: object

In [0]:
users_position_sr = jobs_to_predict_df['POSITION']
users_position_sr[users_position_sr.str.contains(',')]

Series([], Name: POSITION, dtype: object)

# III - Wrangle data

### Clean available candidates

In [0]:
available_candidates_df.dropna(inplace=True)
available_candidates_df.shape

(9997, 5)

### Skills vectorizer

In [0]:
s0 = columnTreatment(available_candidates_df, 'Skills')
s1 = columnTreatment(hired_candidates_df, 'Skills')
s2 = columnTreatment(hired_job_details_df, 'Skills')

In [0]:
aux_series = pd.concat([s0, s1, s2])
aux_series.reset_index(inplace=True, drop=True)
aux_series.shape

(10359,)
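
The `reset_index` call matters because `seriesVectorize` reads items with `series[i]`, which looks up index labels: after `dropna` and `concat` the labels have gaps and repeats. A minimal sketch of a flattening step that does not depend on the index at all, using `itertools.chain` on toy series (the values are made up):

```python
from itertools import chain

import pandas as pd

# Toy series: s0 has a gap in its index (as after dropna), s1 restarts at 0
s0 = pd.Series([['lua', ' c'], ['sql']], index=[0, 2])
s1 = pd.Series([['sql', ' java']])

combined = pd.concat([s0, s1])
index_labels = list(combined.index)   # labels repeat and skip values

# Iterating a Series yields its values in order, regardless of labels;
# dict.fromkeys removes duplicates while keeping first-seen order
unique_terms = list(dict.fromkeys(chain.from_iterable(combined)))
print(index_labels)
print(unique_terms)
```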

In [0]:
skills_vectorizer = seriesVectorize(aux_series)
skills_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 3), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [0]:
print(f'vocabulary: {len(skills_vectorizer.vocabulary_)} words')

vocabulary: 6021 words
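
One alternative worth trying, which this notebook does not use, is to treat each comma-separated skill as a single token so that names like `c++`, `c#` and `.net` survive intact. The sketch below assumes a hypothetical `split_skills` tokenizer and two made-up skill strings.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_skills(text):
    """Hypothetical tokenizer: one token per comma-separated skill."""
    return [skill.strip() for skill in text.split(',') if skill.strip()]

# token_pattern=None because the custom tokenizer replaces the default regex
skill_vectorizer = CountVectorizer(tokenizer=split_skills, token_pattern=None,
                                   lowercase=True)
skill_vectorizer.fit(['c, c++, game development, unity',
                      'C#, .net, sql server'])

skill_vocabulary = set(skill_vectorizer.vocabulary_)
print(sorted(skill_vocabulary))
```

The trade-off is that multi-word skills only match when they are spelled identically, whereas the word n-gram approach still finds partial overlaps such as `sql` inside `sql server`.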


In [0]:
#skills_vectorizer.vocabulary_

### Position vectorizer

In [0]:
s0 = columnTreatment(available_candidates_df, 'UsersPosition')
s1 = columnTreatment(hired_candidates_df, 'UsersPosition')
s2 = columnTreatment(hired_job_details_df, 'POSITION')

In [0]:
aux_series = pd.concat([s0, s1, s2])
aux_series.reset_index(inplace=True, drop=True)
aux_series.shape

(10359,)

In [0]:
aux_series

0                             [Software Engineer]
1                      [Web Developer Fullstack ]
2            [Full stack Ruby on Rails developer]
3              [Results-driven Software Engineer]
4        [UI/UX Designer & Digital Visual Design]
                           ...                   
10354          [Sr React Native Mobile Developer]
10355                 [Tech Recruiter at VanHack]
10356           [Technical Operations Specialist]
10357                 [Technical Product Manager]
10358        [Technical Team Lead (Any Language)]
Length: 10359, dtype: object

In [0]:
positions_vectorizer = seriesVectorize(aux_series)
positions_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 3), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [0]:
print(f'vocabulary: {len(positions_vectorizer.vocabulary_)} words')

vocabulary: 17879 words


In [0]:
#positions_vectorizer.vocabulary_

### Pre-process predict dataset

In [0]:
predict_dataset_base = available_candidates_df.copy()
predict_dataset_base.reset_index(inplace=True, drop=True)
predict_dataset_base.rename(columns={'Skills': 'Skills_C', 'UsersPosition': 'Position_C'}, inplace=True)
experience_list = [experienceLevel(i) for i in predict_dataset_base['YearsOfExperience']]
predict_dataset_base['ExperienceLevel'] = experience_list

In [0]:
predict_dataset_base

Unnamed: 0,UserId,Skills_C,Position_C,YearsOfExperience,EnglishLevel,ExperienceLevel
0,533334,"game maker 1.4, lua, fabric8, h264 encoding, t...",Software Engineer,0,2,0
1,533339,"ruby on rails, sql server, javascript, html5, ...",Web Developer Fullstack,9,4,2
2,533343,"ruby on rails, scrum, javascript, wordpress, h...",Full stack Ruby on Rails developer,10,0,2
3,533344,"javascript, spring, jsp, hibernate, jpa, servl...",Results-driven Software Engineer,4,0,1
4,533348,"2d, ux, ilustrator, art direction, logo design...",UI/UX Designer & Digital Visual Design,10,0,2
...,...,...,...,...,...,...
9992,682023,"python, c, c++, c for microcontroller, artific...",Computer Engineer and Open Source enthusiast,1,3,0
9993,682026,"sql, javascript, laravel, html, bootstrap, css...",Front End Developer,11,2,2
9994,682027,"sql, ruby on rails, javascript, python, html5,...",back end developer - desktop application,8,0,2
9995,682038,"testing, test planning and test script, jira, ...",Senior Test Manager,17,3,2


# IV - Modeling and training

## Build train dataset

### Create train dataset

In [0]:
train_dataset = hired_jobs_vs_candidates_df[['Position_J', 'Skills_J', 'Position_C', 'Skills_C', 'EnglishLevel', 'YearsOfExperience']].copy()
train_dataset.shape

(202, 6)

In [0]:
train_dataset["Hired"] = 1

In [0]:
train_dataset

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,Hired
0,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",Devops Engineer Senior,"tcpdump, http, itil, git flow, docker, linux, ...",0,13,1
1,DevOps Engineer,"docker, jenkins, kubernetes, nexus",DevOps Engineer,"operating systems, shell script, mysql, automa...",3,10,1
2,DevOps Engineer,"aws, devops",Senior DevOps Engineer,"kubernetes, openshift, git, elk stack, ci/cd a...",3,7,1
3,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube...","DevOps, Infrastructure & Site Reliability Engi...","devops, System Administration, docker, python,...",3,9,1
4,DevOps Engineer,"ansible, aws, devops, docker, gitlab, go lang,...",DevOps engineer,"nginx, jenkins, cloud computing, kubernetes, j...",3,9,1
...,...,...,...,...,...,...,...
197,Technical Product Manager,"agile methodologies, leadership, product manag...",A product is business meeting tech to create v...,"sql, business analysis, tdd, project managemen...",3,8,1
198,Technical Product Manager,"agile methodologies, leadership, product manag...",Product Manager,"github, docker, ansible, linux, machine learni...",4,9,1
199,Technical Product Manager,"agile methodologies, leadership, product manag...",Product Manager - Internet of Things,"marketing strategy, pricing analysis, product ...",0,11,1
200,Technical Product Manager,"agile methodologies, leadership, product manag...",MBA + Product + Tech = 5+ years of hands-on pr...,"c#, product management, project management, io...",3,7,1


### Insert some non-hired candidates into train_dataset

To train the chosen algorithms and predict with them, the training set also needs rows for non-hired candidates (negative examples). I tried a few different approaches and, so far, adding rows selected by "average" experience level (mid-level) and English level (level 3) seems to work best. However, I'm not really happy with the results and want to try other approaches in the future.

TODO: improve the position search in available_candidates_df (currently an exact string match) and try other approaches

In [0]:
aux_list = []
for index, row in train_dataset.iterrows():
  position = row['Position_J'].strip(' ')

  # Available candidates whose position is an exact match for the job position
  aux_df = available_candidates_df[available_candidates_df['UsersPosition'] == position].copy()
  if not aux_df.empty:
    aux_df['ExperienceLevel'] = [experienceLevel(i) for i in aux_df['YearsOfExperience']]

    # Select a non-hired candidate based on experience level:
    # the first mid-level match, or the first match if there is none
    aux1_df = aux_df[aux_df['ExperienceLevel'] == 1]
    aux1_df = aux1_df.iloc[0] if not aux1_df.empty else aux_df.iloc[0]
    aux_list.append([row['Position_J'], row['Skills_J'], aux1_df['UsersPosition'], aux1_df['Skills'], aux1_df['EnglishLevel'], aux1_df['YearsOfExperience'], 0])

    # Select a non-hired candidate based on English level:
    # the last match with level 3, or the last match if there is none
    aux1_df = aux_df[aux_df['EnglishLevel'] == 3]
    aux1_df = aux1_df.iloc[-1] if not aux1_df.empty else aux_df.iloc[-1]
    aux_list.append([row['Position_J'], row['Skills_J'], aux1_df['UsersPosition'], aux1_df['Skills'], aux1_df['EnglishLevel'], aux1_df['YearsOfExperience'], 0])

len(aux_list)

238
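
Since the exact position match finds candidates for only part of the jobs and tends to pick the same people repeatedly, one direction for the TODO is a looser, case-insensitive match that also excludes candidates who were actually hired. The sketch below is only an assumption about how that could look: `sample_negatives`, its parameters and the toy `candidates` frame are hypothetical and not part of this notebook.

```python
import pandas as pd

def sample_negatives(job_position, candidates, hired_user_ids, n=2, seed=0):
    """Hypothetical helper: sample up to n non-hired candidates whose
    position contains the job position (case-insensitive)."""
    wanted = job_position.strip().lower()
    positions = candidates['UsersPosition'].fillna('').str.strip().str.lower()
    is_match = positions.str.contains(wanted, regex=False)
    was_hired = candidates['UserId'].isin(hired_user_ids)
    pool = candidates[is_match & ~was_hired]
    if pool.empty:
        return pool
    return pool.sample(n=min(n, len(pool)), random_state=seed)

# Toy candidates; UserId 4 is treated as already hired
candidates = pd.DataFrame({
    'UserId': [1, 2, 3, 4, 5],
    'UsersPosition': ['DevOps Engineer', 'Senior DevOps engineer ',
                      'Front End Developer', 'devops engineer', None],
})
negatives = sample_negatives('DevOps Engineer ', candidates, hired_user_ids=[4])
negative_ids = set(negatives['UserId'])
print(negative_ids)
```

Fixing `random_state` keeps the sampled negatives reproducible between runs, which makes it easier to compare models trained on the same set.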

In [0]:
aux_df = pd.DataFrame(aux_list, columns=['Position_J', 'Skills_J', 'Position_C', 'Skills_C', 'EnglishLevel','YearsOfExperience', 'Hired'])
aux_df

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,Hired
0,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",DevOps Engineer,"docker, zabbix, infrastructure, shell script, ...",4,5,0
1,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",DevOps Engineer,"uml, postgresql, digital, spring mvc, apache, ...",3,9,0
2,DevOps Engineer,"docker, jenkins, kubernetes, nexus",DevOps Engineer,"docker, zabbix, infrastructure, shell script, ...",4,5,0
3,DevOps Engineer,"docker, jenkins, kubernetes, nexus",DevOps Engineer,"uml, postgresql, digital, spring mvc, apache, ...",3,9,0
4,DevOps Engineer,"aws, devops",DevOps Engineer,"docker, zabbix, infrastructure, shell script, ...",4,5,0
...,...,...,...,...,...,...,...
233,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0
234,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0
235,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0
236,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0


In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
train_dataset = pd.concat([train_dataset, aux_df], ignore_index=True)
train_dataset

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,Hired
0,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",Devops Engineer Senior,"tcpdump, http, itil, git flow, docker, linux, ...",0,13,1
1,DevOps Engineer,"docker, jenkins, kubernetes, nexus",DevOps Engineer,"operating systems, shell script, mysql, automa...",3,10,1
2,DevOps Engineer,"aws, devops",Senior DevOps Engineer,"kubernetes, openshift, git, elk stack, ci/cd a...",3,7,1
3,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube...","DevOps, Infrastructure & Site Reliability Engi...","devops, System Administration, docker, python,...",3,9,1
4,DevOps Engineer,"ansible, aws, devops, docker, gitlab, go lang,...",DevOps engineer,"nginx, jenkins, cloud computing, kubernetes, j...",3,9,1
...,...,...,...,...,...,...,...
435,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0
436,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0
437,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0
438,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0


### Convert YearsOfExperience (quantitative) to ExperienceLevel (categorical)

In [0]:
experience_list = [experienceLevel(i) for i in train_dataset['YearsOfExperience']]
len(experience_list)

440
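As a side note, the same binning can be done without a Python-level loop. The sketch below uses `pd.cut` and assumes the thresholds described for `experienceLevel` (up to 3 years is level 0, 4 to 7 years is level 1) plus an assumed top bucket of 8 or more years for level 2, which is inferred from the rows displayed in this notebook. The name `experience_levels` is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed thresholds: 0-3 years -> 0, 4-7 years -> 1, 8+ years -> 2.
# The 8+ bucket is an inference from the displayed rows, not a confirmed rule.
def experience_levels(years):
    """Vectorized binning of years of experience into levels 0, 1 and 2."""
    bins = [-np.inf, 3, 7, np.inf]  # right-closed intervals: (-inf, 3], (3, 7], (7, inf)
    return pd.cut(years, bins=bins, labels=[0, 1, 2]).astype(int)

years = pd.Series([2, 5, 7, 9, 13])
levels = experience_levels(years)
print(levels.tolist())
```

Missing values would need to be handled before the `astype(int)` call, since a categorical containing NaN cannot be cast to integers.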

In [0]:
train_dataset['ExperienceLevel'] = experience_list
train_dataset

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,Hired,ExperienceLevel
0,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",Devops Engineer Senior,"tcpdump, http, itil, git flow, docker, linux, ...",0,13,1,2
1,DevOps Engineer,"docker, jenkins, kubernetes, nexus",DevOps Engineer,"operating systems, shell script, mysql, automa...",3,10,1,2
2,DevOps Engineer,"aws, devops",Senior DevOps Engineer,"kubernetes, openshift, git, elk stack, ci/cd a...",3,7,1,1
3,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube...","DevOps, Infrastructure & Site Reliability Engi...","devops, System Administration, docker, python,...",3,9,1,2
4,DevOps Engineer,"ansible, aws, devops, docker, gitlab, go lang,...",DevOps engineer,"nginx, jenkins, cloud computing, kubernetes, j...",3,9,1,2
...,...,...,...,...,...,...,...,...
435,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0,2
436,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0,2
437,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0,2
438,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,12,0,2


In [0]:
train_dataset.drop('YearsOfExperience', axis=1, inplace=True)
train_dataset = train_dataset[['Position_J', 'Skills_J', 'Position_C', 'Skills_C', 'EnglishLevel', 'ExperienceLevel', 'Hired']]
train_dataset

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,ExperienceLevel,Hired
0,DevOps Engineer,"aws, azure, devops, docker, kubernetes, openst...",Devops Engineer Senior,"tcpdump, http, itil, git flow, docker, linux, ...",0,2,1
1,DevOps Engineer,"docker, jenkins, kubernetes, nexus",DevOps Engineer,"operating systems, shell script, mysql, automa...",3,2,1
2,DevOps Engineer,"aws, devops",Senior DevOps Engineer,"kubernetes, openshift, git, elk stack, ci/cd a...",3,1,1
3,DevOps Engineer,"aws, docker, ec2, elasticsearch, jenkins, kube...","DevOps, Infrastructure & Site Reliability Engi...","devops, System Administration, docker, python,...",3,2,1
4,DevOps Engineer,"ansible, aws, devops, docker, gitlab, go lang,...",DevOps engineer,"nginx, jenkins, cloud computing, kubernetes, j...",3,2,1
...,...,...,...,...,...,...,...
435,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,2,0
436,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,2,0
437,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,2,0
438,Technical Product Manager,"agile methodologies, leadership, product manag...",Technical Product Manager,"sql, scrum, javascript, python, uml, c, c++, h...",0,2,0


### Vectorize textual features

In [0]:
aux_series = columnTreatment(train_dataset, 'Position_J', returnList=False)
position_JV =  positions_vectorizer.transform(aux_series)

aux_series = columnTreatment(train_dataset, 'Skills_J', returnList=False)
skills_JV =  skills_vectorizer.transform(aux_series)

aux_series = columnTreatment(train_dataset, 'Position_C', returnList=False)
position_CV =  positions_vectorizer.transform(aux_series)

aux_series = columnTreatment(train_dataset, 'Skills_C', returnList=False)
skills_CV =  skills_vectorizer.transform(aux_series)
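The cell above relies on a fit-once, transform-many pattern: one vectorizer per kind of text, reused for both the job side and the candidate side so the columns line up. A minimal sketch of that pattern on toy data follows; the strings and the name `toy_vectorizer` are made up for illustration and are not part of the challenge data.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the notebook's skill strings (hypothetical data).
job_skills = ["aws, docker, kubernetes", "aws, devops"]
candidate_skills = ["docker, zabbix, shell script", "kubernetes, git"]

# Fit once on all known text so both sides share a single vocabulary.
toy_vectorizer = CountVectorizer(ngram_range=(1, 3))
toy_vectorizer.fit(job_skills + candidate_skills)

job_matrix = toy_vectorizer.transform(job_skills)
candidate_matrix = toy_vectorizer.transform(candidate_skills)

# Both blocks have the same number of columns and can be stacked row by row.
features = hstack((job_matrix, candidate_matrix), format="csr")
print(job_matrix.shape, candidate_matrix.shape, features.shape)

# Tokens never seen during fit are silently dropped at transform time.
unseen = toy_vectorizer.transform(["haskell"])
print(unseen.nnz)
```

One consequence worth keeping in mind: a skill that appears only in the jobs to predict, and never in the text used to fit the vectorizer, contributes nothing to the feature matrix.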


## Training 

In [0]:
x = hstack((position_JV, skills_JV, position_CV, skills_CV, train_dataset[['EnglishLevel']].values, train_dataset[['ExperienceLevel']].values), format='csr')
y = train_dataset["Hired"]
SEED = 5

X_train, X_val, Y_train, Y_val = train_test_split(x, y, random_state = SEED, test_size = 0.25)
print("Train with %d rows, validate with %d rows" % (X_train.shape[0], X_val.shape[0]))


Train with 330 rows, validate with 110 rows


In [0]:
# Logistic Regression
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, Y_train)
acc_log = round(logreg.score(X_val, Y_val) * 100, 2)
acc_log

88.18

In [0]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_val, Y_val) * 100, 2)
acc_random_forest

87.27

## Training Keras (TensorFlow)

### Keras modeling

In [0]:
input_dim = X_train.shape[1]
print('input_dim: ', input_dim)
output_classes = 2
print('output_classes = ', output_classes)

input_dim:  47802
output_classes =  2


In [0]:
keras_model = keras.Sequential([
  keras.layers.Dense(32, input_dim=input_dim, activation='relu'),
  keras.layers.Dropout(0.2),
  keras.layers.Dense(16, activation='relu'),
  keras.layers.Dense(output_classes, activation='sigmoid')
  ])



In [0]:
keras_model.compile(optimizer='adam', 
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
keras_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                1529696   
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
Total params: 1,530,258
Trainable params: 1,530,258
Non-trainable params: 0
_________________________________________________________________
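The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic using the `input_dim` printed earlier; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

input_dim = 47802          # value printed by the notebook
layer_sizes = [32, 16, 2]  # units of the three Dense layers

counts = []
previous = input_dim
for units in layer_sizes:
    counts.append(dense_params(previous, units))
    previous = units

print(counts, sum(counts))
```

Almost all of the 1.5 million parameters sit in the first layer, which is large compared with the 330 training rows.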


### Keras training and analysis

In [0]:
history = keras_model.fit(X_train, Y_train, epochs=1, 
                          validation_data=(X_val, Y_val))

Train on 330 samples, validate on 110 samples


# V - Predict

## Build predict dataset

In [0]:
jobs_to_predict_df

Unnamed: 0,JobId,Responsibilities,POSITION,Skills
0,3030,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Backend Developer,"api, aws, google cloud, java, mongodb, node.js..."
1,3018,<p>A global event travel tech company with off...,Backend NodeJS Developer,"angular, http, javascript, node.js, react.js, ..."
2,3004,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Cloud DevOps Specialist,"aws, jboss, linux, perl, python"
3,2946,<p>&nbsp;<em>Good communication skills</em><br...,Game Developer,"c, c++, game development, unity"
4,2967,<p><strong>The perfect candidate:</strong></p>...,Mobile Developer (React Native),"gradle, javascript, jenkins, jest, react nativ..."
5,3020,<p>Awesome company in Toronto is hiring for se...,Senior Front End Engineer,"angular, css, html, javascript, react.js"
6,3009,<p><strong>Must-have-skills</strong></p>\n<p>-...,Senior Mobile App Developer,"dart, ionic framework, javascript, kotlin, nat..."
7,3022,<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&...,Software Developer,"angular, javascript, react.js"
8,3015,<p>Waterloo based startup is hiring a Sr Back ...,Sr Backend Developer (Python),"aws, continuos deployment, continuous integrat..."
9,2957,<p>Great startup located in the vibrant Berlin...,Sr Backend Engineer (Java),"clean code, ddd, distributed systems developme..."


***IMPORTANT!!! Predictions are made for one job at a time. To predict for a given job, manually set `job_index` in the next cell to that job's row in `jobs_to_predict_df`.***

The cells above do not need to be re-run when switching between positions.

TODO: Automate the process
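One possible way to address this TODO is to wrap the feature building in a function and loop over every job. The sketch below is only an illustration under assumptions: `build_features` and `rank_all_jobs` are hypothetical names, the data is a toy sample, and the `columnTreatment` cleanup step is left out for brevity. It keeps the same column order used for training (job position, job skills, candidate position, candidate skills, English level, experience level).

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def build_features(job, candidates, pos_vec, skill_vec):
    """Stack job and candidate vectors in the column order used for training."""
    n = len(candidates)
    return hstack((
        pos_vec.transform([job["POSITION"]] * n),
        skill_vec.transform([job["Skills"]] * n),
        pos_vec.transform(candidates["Position_C"]),
        skill_vec.transform(candidates["Skills_C"]),
        candidates[["EnglishLevel"]].values,
        candidates[["ExperienceLevel"]].values,
    ), format="csr")

def rank_all_jobs(jobs, candidates, pos_vec, skill_vec, model, top_n=10):
    """Return {JobId: top_n candidates sorted by predicted hire probability}."""
    rankings = {}
    for _, job in jobs.iterrows():
        X = build_features(job, candidates, pos_vec, skill_vec)
        scored = candidates.assign(Hired=model.predict_proba(X)[:, 1])
        rankings[job["JobId"]] = scored.sort_values("Hired", ascending=False).head(top_n)
    return rankings

# Toy data (hypothetical), only to exercise the loop.
jobs = pd.DataFrame({
    "JobId": [1, 2],
    "POSITION": ["Backend Developer", "DevOps Engineer"],
    "Skills": ["java, aws", "docker, kubernetes"],
})
candidates = pd.DataFrame({
    "UserId": [10, 11, 12],
    "Position_C": ["Backend Developer", "DevOps Engineer", "Designer"],
    "Skills_C": ["java, sql", "docker, aws", "photoshop"],
    "EnglishLevel": [3, 4, 2],
    "ExperienceLevel": [2, 1, 0],
})
pos_vec = CountVectorizer().fit(list(jobs["POSITION"]) + list(candidates["Position_C"]))
skill_vec = CountVectorizer().fit(list(jobs["Skills"]) + list(candidates["Skills_C"]))

# Throwaway model fitted on the first job's pairs so predict_proba is available.
X_toy = build_features(jobs.iloc[0], candidates, pos_vec, skill_vec)
model = LogisticRegression(solver="lbfgs").fit(X_toy, [1, 0, 0])

rankings = rank_all_jobs(jobs, candidates, pos_vec, skill_vec, model, top_n=2)
print(rankings[1][["UserId", "Hired"]])
```

In the notebook itself, the trained `logreg` (or any model exposing `predict_proba`) and the already fitted `positions_vectorizer` and `skills_vectorizer` would take the place of the toy objects.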


In [0]:
# *** CHOOSE THE JOB TO PREDICT HERE ***
job_index = 0 
job = jobs_to_predict_df.iloc[job_index]
print('predicting for position', job['POSITION'])

predicting for position Backend Developer


In [0]:
predict_dataset = predict_dataset_base.copy()
predict_dataset['Position_J'] = job['POSITION']
predict_dataset['Skills_J'] = job['Skills']
predict_dataset = predict_dataset[['Position_J', 'Skills_J', 'Position_C', 'Skills_C', 'EnglishLevel', 'YearsOfExperience', 'ExperienceLevel', 'UserId']]

In [0]:
aux_series = columnTreatment(predict_dataset, 'Position_J', returnList=False)
position_JV =  positions_vectorizer.transform(aux_series)

aux_series = columnTreatment(predict_dataset, 'Skills_J', returnList=False)
skills_JV =  skills_vectorizer.transform(aux_series)

aux_series = columnTreatment(predict_dataset, 'Position_C', returnList=False)
position_CV =  positions_vectorizer.transform(aux_series)

aux_series = columnTreatment(predict_dataset, 'Skills_C', returnList=False)
skills_CV =  skills_vectorizer.transform(aux_series)



In [0]:
X_pred = hstack((position_JV, skills_JV, position_CV, skills_CV, predict_dataset[['EnglishLevel']].values, predict_dataset[['ExperienceLevel']].values), format='csr')

## Logistic Regression

In [0]:
logreg_results = logreg.predict_proba(X_pred)

In [0]:
logreg_results

array([[8.96709351e-01, 1.03290649e-01],
       [8.45531288e-02, 9.15446871e-01],
       [1.92787363e-03, 9.98072126e-01],
       ...,
       [9.99077905e-01, 9.22095175e-04],
       [3.12456685e-01, 6.87543315e-01],
       [6.62690924e-01, 3.37309076e-01]])

In [0]:
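# Reuse predict_dataset's index so each probability row stays aligned with its candidate
logreg_proba_df = pd.DataFrame(logreg_results, columns=['Not_Hired', 'Hired'], index=predict_dataset.index)
predict_logreg = pd.concat([predict_dataset, logreg_proba_df], axis=1)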
predict_logreg.sort_values(by='Hired', ascending=False).head(10)

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,ExperienceLevel,UserId,Not_Hired,Hired
7548,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Senior Software Engineer,"android, sql, ruby on rails, sql server, scrum...",3,14,2,673156,3e-06,0.999997
1579,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Full Stack Software Engineer,"sql, scrum, javascript, python, spring, uml, e...",3,14,2,641935,7e-06,0.999993
5835,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Deep knowledge in development and management,"javascript, c++, microsoft office, mvc, html, ...",3,33,2,666266,1.5e-05,0.999985
8635,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",JavaScript Developer,"scrum, javascript, python, wordpress, english,...",3,2,0,677297,2e-05,0.99998
6985,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Full Stack Developer,"sql, ruby on rails, scrum, javascript, wordpre...",3,11,2,671051,2.4e-05,0.999976
478,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Project Manager,"sql, ruby on rails, sql server, javascript, py...",4,12,2,636162,2.7e-05,0.999973
4360,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Senior Software Engineer,"sql, scrum, javascript, python, wordpress, lar...",3,10,2,658341,3.1e-05,0.999969
2783,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Mentor Android,"android, javascript, laravel, c++, html, boots...",3,9,2,648244,4.7e-05,0.999953
2785,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Software Engineer,"elixir, tdd, sql, scrum, javascript, spring, w...",3,13,2,648247,5.1e-05,0.999949
7372,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Frontend Engineer,"scrum, javascript, spring, html, bootstrap, cs...",3,5,1,672499,0.000139,0.999861


## Random Forest

In [0]:
random_forest_results = random_forest.predict_proba(X_pred)

In [0]:
random_forest_results

array([[0.58, 0.42],
       [0.35, 0.65],
       [0.19, 0.81],
       ...,
       [0.69, 0.31],
       [0.51, 0.49],
       [0.56, 0.44]])

In [0]:
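# Reuse predict_dataset's index so each probability row stays aligned with its candidate
random_forest_proba_df = pd.DataFrame(random_forest_results, columns=['Not_Hired', 'Hired'], index=predict_dataset.index)
predict_random_forest = pd.concat([predict_dataset, random_forest_proba_df], axis=1)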
predict_random_forest.sort_values(by='Hired', ascending=False).head(10)

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,ExperienceLevel,UserId,Not_Hired,Hired
5008,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Frontend Developer and Mobile Developer,"javascript, python, wordpress, testing, html, ...",0,8,2,662560,0.09,0.91
2783,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Mentor Android,"android, javascript, laravel, c++, html, boots...",3,9,2,648244,0.09,0.91
6985,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Full Stack Developer,"sql, ruby on rails, scrum, javascript, wordpre...",3,11,2,671051,0.1,0.9
6623,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...","CTO, Chief Architect, Software development man...","azure, sql, javascript, python, c++, agile dev...",3,15,2,669517,0.12,0.88
9785,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",AWS Certified DevOps Engineer,"tdd, sql, javascript, python, c++, english, ht...",3,20,2,681234,0.12,0.88
8635,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",JavaScript Developer,"scrum, javascript, python, wordpress, english,...",3,2,0,677297,0.12,0.88
6936,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...","Software Development Engineer, Technology Lead","sql, javascript, spring, mvc, hibernate, soap,...",4,9,2,670894,0.12,0.88
9331,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Software Engineer,"android, sql, javascript, laravel, english, te...",3,6,1,679718,0.13,0.87
3174,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Senior Software Engineer,"sql, javascript, python, html, json, postgresq...",3,11,2,650846,0.13,0.87
5867,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Full Stack Engineer ( BigData / Cloud ),"postgresql, node.js, mongodb, j2ee, react.js, ...",3,4,1,666408,0.14,0.86


## Keras

In [0]:
keras_results = keras_model.predict(X_pred)

In [0]:
keras_results 

array([[0.4731384 , 0.47191063],
       [0.4103439 , 0.4748675 ],
       [0.3470883 , 0.49659997],
       ...,
       [0.46913517, 0.47356233],
       [0.3740267 , 0.44780344],
       [0.46305776, 0.47501224]], dtype=float32)
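Note that the rows above do not sum to 1 (for example, 0.473 + 0.472). That is consistent with the output layer applying `sigmoid` to each of its two units independently, so the two columns are separate scores rather than one probability distribution. The NumPy sketch below illustrates the difference on made-up logits. Switching the last layer to `softmax`, or using a single sigmoid unit with `binary_crossentropy`, are common alternatives, but neither was tested in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; each output is independent."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax; each row becomes a probability distribution."""
    shifted = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for three candidates (columns: not hired, hired).
logits = np.array([[-0.1, -0.1], [-0.4, -0.1], [-0.6, 0.0]])

print(sigmoid(logits).sum(axis=1))  # rows are not constrained to sum to 1
print(softmax(logits).sum(axis=1))  # every row sums to 1
```

Sorting by the `Hired` column still produces a ranking, but the values should be read as relative scores rather than calibrated probabilities.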

In [0]:
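# Reuse predict_dataset's index so each output row stays aligned with its candidate
keras_proba_df = pd.DataFrame(keras_results, columns=['Not_Hired', 'Hired'], index=predict_dataset.index)
predict_keras = pd.concat([predict_dataset, keras_proba_df], axis=1)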
predict_keras.sort_values(by='Hired', ascending=False).head(10)

Unnamed: 0,Position_J,Skills_J,Position_C,Skills_C,EnglishLevel,YearsOfExperience,ExperienceLevel,UserId,Not_Hired,Hired
5475,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Program & Project Manager,"scrum, pmi, microsoft office, english, pl/sql,...",2,26,2,664868,0.305871,0.580331
1710,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Data Engineer/ Architect and Agile Analytics C...,"sql, javascript, python, html5, json, css3, da...",4,12,2,642682,0.314396,0.574435
1579,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Full Stack Software Engineer,"sql, scrum, javascript, python, spring, uml, e...",3,14,2,641935,0.260966,0.571044
1801,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",IT Project Manager,"scrum, itil, jira, agile project management, p...",0,22,2,643161,0.359621,0.558562
6623,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...","CTO, Chief Architect, Software development man...","azure, sql, javascript, python, c++, agile dev...",3,15,2,669517,0.270667,0.557393
5727,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...","Software Development Manager, Agile Coach","scrum, english, agile development methodology,...",4,13,2,665811,0.341146,0.555062
388,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...","Prince 2 Certified, IC Agile Certified, CLF - ...","sql server, scrum, javascript, pmi, english, c...",2,20,2,635641,0.29616,0.554525
9367,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Scrum Master / Agile Coach / Product Manager,"scrum, microsoft office, web development, node...",0,5,1,679853,0.353064,0.554184
816,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Business Analyst (Business Intelligence),"microsoft office, helpdesk, data modeling and ...",2,10,2,638025,0.386621,0.552935
2299,Backend Developer,"api, aws, google cloud, java, mongodb, node.js...",Project Manager,"scrum, pmi, erp, agile, soa, pmbok, agile meth...",3,8,2,645501,0.318812,0.547968


# VI - Last words

I know there are several other ways to approach this challenge, and I intend to try some of them as I continue my journey into machine learning and data science. That said, I do believe I (and probably other Vanhacktoners) could get better results with more data. I spent most of my time building a training dataset that would carry relevant information, from my perspective, of course ;-).

Thanks for the opportunity to work on this challenge. And please don't forget to release VanHack's expected output: having it as a benchmark would be of great value for ML Vanhacktoners looking to improve their skills.