First, I want to clean and transform the dataset into a format on which I can easily train and evaluate my models. <br>
The first step is loading and visualising the data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv('data.csv')
print(df.shape)

(50000, 60)


In [3]:
for col in df:
    print('-----------------')
    print(f'feature: {col}')
    print(df[col].value_counts())
    print('-----------------')

-----------------
feature: candidate_id
bgjikcrozdgkrtzigymnaylpypxxublc0        1
ofacorucqeodmhwxbrwfwjqsdaafyvky33350    1
cqvmqnymmdwptstkmujnneseoiqbwrne33328    1
jpiizgnklifjbrntontjpfohwcnwoekl33329    1
zdgybcguficcdfzlpstluhziokcsxswp33330    1
                                        ..
ltxvztomcjuwkgbgukceqgkjkyfpgytt16668    1
qmjtskergapokdedcmjgrphzrfmbhemw16669    1
muasztjtujpgnvkodjtqleglqtxububr16670    1
ppvtjjahslsxywbseiyrtkcxngsxxits16671    1
kifewbinvomrspttqkjbbqasqlglkysu49999    1
Name: candidate_id, Length: 50000, dtype: int64
-----------------
-----------------
feature: application_status
pre-interview    20350
hired            15122
interview        14528
Name: application_status, dtype: int64
-----------------
-----------------
feature: number_of_employees_log
4.0    18169
2.9     9902
2.0     9009
2.6     8572
1.0     4348
Name: number_of_employees_log, dtype: int64
-----------------
-----------------
feature: occupation_id
exvwhbxlejsfyqxnwjabksnntpwodf

This should help me to identify the nature of the features in the dataset, including which ones will be useful for making predictions and which values would be sensible to fill in for missing entries.
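
A more compact companion to the value counts is a per-column table of missing-value counts and cardinality. This is only a sketch: `summarise_columns` is a hypothetical helper, and the toy frame uses made-up values (the `age` column name is an assumption) in place of the real data.

```python
import pandas as pd


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, missing fraction and distinct-value count per column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns with the most missing data first.
    return summary.sort_values("frac_missing", ascending=False)


# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "candidate_id": ["a0", "b1", "c2", "d3"],
    "age": [31.0, None, 45.0, None],
    "candidate_interest_2": [1.0, 1.0, 1.0, 1.0],
})
summary = summarise_columns(toy)
print(summary)
```

Columns where `n_unique` equals the number of rows (ids) or equals 1 (single-valued) are the obvious candidates to discard.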

It is worth highlighting that the candidate, occupation and company ids are all unique for each row, so each entry is independent. These values won't offer any predictive power and can be discarded.

The occupation_skill count columns are all binary variables, so it is sensible to assume that a missing value indicates 0 (the skill is absent).

candidate_attribute_1 is another binary variable. Here it is less clear how to handle missing values, so I will exclude it (it could perhaps be predicted from the other features).

candidate_attribute_2 is a floating-point value, so there is not much choice but to replace missing values with an average. With more time, this could be made more granular by predicting the value from the other data.

application_attribute_1 contains only unique values, so discard it.

candidate_demographic_variable 1-4 are all binary, another case where missing values are hard to replace. I could consider excluding these features; with more time, more sensible values could be predicted within particular demographic groups using the other data.

In the case of ethnicity we can reasonably fill missing values as 'Rather not say'.
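
A minimal sketch of that fill, assuming the column is literally named `ethnicity` (the real column name is not shown above) and is loaded as plain strings rather than a pandas categorical. `fill_category` is a hypothetical helper and the toy labels are placeholders.

```python
import pandas as pd


def fill_category(df: pd.DataFrame, column: str, label: str = "Rather not say") -> pd.DataFrame:
    """Return a copy of df with missing entries in `column` replaced by `label`."""
    out = df.copy()
    out[column] = out[column].fillna(label)
    return out


# Toy stand-in; "A" and "B" are placeholder category labels.
toy = pd.DataFrame({"ethnicity": ["A", None, "B"]})
filled = fill_category(toy, "ethnicity")
print(filled)
```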

candidate_demographic_variable_5 is another hard-to-fill column, so I will likely exclude it.

candidate_demographic_variable 6-8 are similar to the above; 9 & 10 have no missing values.

Age, as a floating-point value, is another case for using the average (with more time, it could be predicted from other features).

candidate_attribute 3-5 are floats, so using the average to fill missing values makes sense (as before, they could be predicted). Another approach would be to sample randomly from the empirical distribution, which would also act as a source of randomness for the ensemble of models.
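
The two options for float columns can be sketched side by side. `fill_with_mean` and `fill_by_sampling` are hypothetical helpers, and the toy series is made up. The sampling version draws replacements, with replacement, from the observed values of the same column.

```python
import numpy as np
import pandas as pd


def fill_with_mean(s: pd.Series) -> pd.Series:
    """Replace missing values with the mean of the observed values."""
    return s.fillna(s.mean())


def fill_by_sampling(s: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Replace missing values with draws from the column's observed values."""
    missing = s.isna()
    observed = s.dropna().to_numpy()
    out = s.copy()
    if not missing.any() or observed.size == 0:
        # Nothing to fill, or nothing to sample from.
        return out
    out[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out


# Toy stand-in for a float column such as candidate_attribute_3.
s = pd.Series([1.0, np.nan, 3.0, np.nan])
mean_filled = fill_with_mean(s)
sampled = fill_by_sampling(s, np.random.default_rng(0))
print(mean_filled.tolist())
print(sampled.tolist())
```

Mean filling shrinks the column's variance, whereas sampling preserves the marginal distribution. Passing a differently seeded generator for each model would give each ensemble member a slightly different training set.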

candidate_attribute 6 & 7: discard (could predict, etc.).

candidate_interest 1 & 3-8: use mean/sample/predict; 2 is single-valued (discard).

candidate_attribute_8: average/sample/predict

number_years_feature 1-5: will assume here that a blank indicates zero years (assuming this refers to experience)

candidate_skill_count 1-9: will again assume that missing value indicates a count of 0.

candidate_relative_test 1 & 2: average/sample/predict



columns to discard:
candidate_id
occupation_id
company_id
candidate_attribute_1
application_attribute_1
candidate_demographic_variable_1
candidate_demographic_variable_2
candidate_demographic_variable_3
candidate_demographic_variable_4
candidate_demographic_variable_5
candidate_demographic_variable_6
candidate_demographic_variable_7
candidate_demographic_variable_8
candidate_attribute_6
candidate_attribute_7
candidate_interest_2
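
One way to apply the discard list above in pandas is sketched here. `drop_discarded` is a hypothetical helper; it skips any listed column that is absent, so it can also be run on a partial frame. The toy frame is made up (`age` is an assumed column name).

```python
import pandas as pd

COLUMNS_TO_DISCARD = (
    ["candidate_id", "occupation_id", "company_id",
     "candidate_attribute_1", "application_attribute_1"]
    + [f"candidate_demographic_variable_{i}" for i in range(1, 9)]
    + ["candidate_attribute_6", "candidate_attribute_7", "candidate_interest_2"]
)


def drop_discarded(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without the discarded columns, ignoring any that are absent."""
    present = [c for c in COLUMNS_TO_DISCARD if c in df.columns]
    return df.drop(columns=present)


# Toy stand-in for the real data.
toy = pd.DataFrame({
    "candidate_id": ["a0", "b1"],
    "company_id": ["x", "y"],
    "age": [31.0, 45.0],
    "application_status": ["hired", "interview"],
})
kept = drop_discarded(toy)
print(list(kept.columns))
```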

columns to 0:
occupation_skill_1_count
occupation_skill_2_count
occupation_skill_3_count
occupation_skill_4_count
occupation_skill_5_count
occupation_skill_6_count
occupation_skill_7_count
occupation_skill_8_count
occupation_skill_9_count
number_years_feature_1
number_years_feature_2
number_years_feature_3
number_years_feature_4
number_years_feature_5
candidate_skill_1_count
candidate_skill_2_count
candidate_skill_3_count
candidate_skill_4_count
candidate_skill_5_count
candidate_skill_6_count
candidate_skill_7_count
candidate_skill_8_count
candidate_skill_9_count
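
The zero-fill list above follows a regular naming pattern, so it can be generated rather than typed out. A sketch under that assumption follows; `fill_zero` is a hypothetical helper and the toy frame is made up.

```python
import numpy as np
import pandas as pd

ZERO_FILL_COLUMNS = (
    [f"occupation_skill_{i}_count" for i in range(1, 10)]
    + [f"number_years_feature_{i}" for i in range(1, 6)]
    + [f"candidate_skill_{i}_count" for i in range(1, 10)]
)


def fill_zero(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing values set to 0 in the zero-fill columns."""
    out = df.copy()
    present = [c for c in ZERO_FILL_COLUMNS if c in out.columns]
    out[present] = out[present].fillna(0)
    return out


# Toy stand-in; candidate_relative_test_1 is not in the list and keeps its NaN.
toy = pd.DataFrame({
    "candidate_skill_1_count": [2.0, np.nan],
    "number_years_feature_3": [np.nan, 4.0],
    "candidate_relative_test_1": [np.nan, 0.5],
})
zero_filled = fill_zero(toy)
print(zero_filled)
```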




                                candidate_id application_status  \
622      dekrthebotukkqsshgngjsyvcxisxrsj622          interview   
3537    gkgpjxtxagzummwtcbsvfegrinbmaxnt3537              hired   
4416    uunzkatluqgzgeocckqymamhtxrpboev4416          interview   
4932    todtdiqswllfzmocqlfndfmmvdyntbnj4932      pre-interview   
4984    uwgloptmdfeptulxhoylsauffgwdnzgf4984          interview   
6039    jsbxahdouleadvpdlwerjtshewhcwtaa6039      pre-interview   
8444    ywbaojbbpmqolwgmpqbhqpqtfuhxaegj8444              hired   
10820  znyixfrufvjehnmrfathlxinkfeuvlmq10820      pre-interview   
14028  mrkqbjjfbjisomowuczuvnudkzstqwlz14028      pre-interview   
14636  gzhplmkifunflgtmwqzojutrgzusedwd14636      pre-interview   
15238  rdyhcmfxdhzzhdkezahellnjixoxaqxg15238          interview   
18882  ltzmhndvmxkhqmtpeqqyfpxaknlwolji18882      pre-interview   
22298  vfrftpthkvndeeizwpkoezfryvppnypy22298          interview   
22324  dsewdvozstgskxvsyswmgmisrhbifcjr22324      pre-intervie