# Final Project - How well can we predict when somebody is looking for a job change? 


# Matthew Hui

# Load Data and Import Packages

I chose to train the model on 80% of the data and test on the remaining 20%. I also stratified the split so that the training and test sets keep the same target proportions.


In [1]:
import numpy as np
import pandas as pd
from sklearn.base                 import BaseEstimator
from   sklearn.compose            import *
from   sklearn.ensemble           import RandomForestClassifier
from   sklearn.ensemble           import ExtraTreesClassifier
from   sklearn.impute             import *
from   sklearn.metrics            import *
from   sklearn.model_selection    import train_test_split
from   sklearn.model_selection    import RandomizedSearchCV
from   sklearn.pipeline           import Pipeline
from   sklearn.preprocessing      import *

In [2]:
df = pd.read_csv('aug_train.csv')

target_col = df.columns == 'target'

X = df.loc[:, ~target_col]
y = df.loc[:,target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(), stratify=y,
                                                    test_size = 0.2, random_state=42) 
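
As a quick illustration of what `stratify` does, the sketch below runs the same kind of split on synthetic labels and compares the positive rate in each part. The data is made up (about 25% positives is an assumption standing in for the real target), and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 25% positives (an assumption)
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.25).astype(int)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# With stratify, the positive rate is nearly identical in both splits
print('Train positive rate:', round(y_tr.mean(), 3))
print('Test positive rate: ', round(y_te.mean(), 3))
```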

# Feature Engineering
There are many missing values in this dataset, so I used simple imputation: the median for continuous variables and a new 'unknown' category for discrete variables. I used one-hot encoding on the nominal variables and ordinal encoding on the ordinal variables.

In [3]:
# Impute median for missing values of continuous variables
con_pipe = Pipeline([('imputer', SimpleImputer(strategy='median'))])

# Impute 'unknown' when value is missing for nominal categorical variables
nom_pipe = Pipeline([
                     ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
                     ('ohe', OneHotEncoder())
                    ])

# Setting category orders for ordinal variables
nums = [str(i) for i in range(1, 21)]

categories = [['Primary School','unknown', 'High School', 'Graduate', 'Masters', 'Phd'],
              ['never','unknown', '1', '2', '3', '4', '>4'],
              ['<1', 'unknown'] + nums + ['>20'],
              ['no_enrollment', 'unknown', 'Part time course', 'Full time course'],
              ['unknown', '<10', '10/49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+']]

# Impute 'unknown' when value is missing for ordinal variables (do not want to make assumptions)
ord_pipe = Pipeline([
                     ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
                     ('ord', OrdinalEncoder(categories=categories))
                    ])

# Make column transformer combining the three pipelines (columns not listed are dropped by default)
preprocessing = ColumnTransformer([
                                   ('continuous', con_pipe, ['city_development_index', 'training_hours']),
                                   ('nominal', nom_pipe, ['gender', 'relevent_experience',
                                                              'major_discipline', 'company_type']),
                                   ('ordinal', ord_pipe, ['education_level', 'last_new_job', 'experience',
                                                          'enrolled_university', 'company_size'])
                                  ])
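
To make the ordinal step concrete, here is a minimal sketch on a toy column. The rows are made up, not taken from `aug_train.csv`, and `toy` and `toy_pipe` are hypothetical names. The missing value is first filled with 'unknown', then each value is replaced by its position in the category list.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy column with one missing value (made-up rows for illustration)
toy = pd.DataFrame({'last_new_job': ['never', np.nan, '2', '>4']}, dtype=object)

toy_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('encoder', OrdinalEncoder(categories=[['never', 'unknown', '1', '2', '3', '4', '>4']]))
])

# Each value becomes its index in the category list above
encoded = toy_pipe.fit_transform(toy)
print(encoded.ravel())
```

Because 'unknown' has a fixed slot in each category list, the encoder treats a missing value as its own level rather than guessing where it belongs.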

# Algorithms and Search - Extra Trees Classifier and Random Forest Classifier
Creating search space:
- Criterion: The function used to measure how good a split is (Gini impurity or entropy).
- Max Depth: Limits how deep each tree can grow (prevents overfitting).
- Min Samples Split/Min Samples Leaf: Similar purposes: stop splitting once a node is small enough (prevents overfitting).
- Max Features: The number of features considered at each split (using a subset keeps the trees from all looking the same).
- Class Weight: Weights each class according to its proportion in the target (deals with the slight class imbalance).
- N Estimators: The number of trees in the ensemble; more trees generally give more stable, better-generalizing predictions.
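
As a rough illustration of the `'balanced'` class weight, the sketch below applies scikit-learn's documented formula, `n_samples / (n_classes * count_per_class)`, to the class counts of the test set (the row totals of the confusion matrix reported in this notebook). The counts are used only to show the class ratio; the search itself computes weights from the training folds.

```python
import numpy as np

# Row totals of the test-set confusion matrix: [not looking, looking]
counts = np.array([2328 + 549, 245 + 710])

# 'balanced' weight for each class: n_samples / (n_classes * count)
weights = counts.sum() / (len(counts) * counts)

for label, weight in zip([0, 1], weights):
    print('class', label, 'weight', round(float(weight), 2))
```

The minority class gets about three times the weight of the majority class, so misclassifying a job seeker costs more during training.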

In [4]:
# Placeholder estimator; the search swaps in each candidate classifier for 'clf'
class DummyEstimator(BaseEstimator):
    "Pass-through class: methods are present but do nothing."
    def fit(self, X, y=None): return self
    def score(self, X, y=None): pass

# Create basic pipeline
pipe = Pipeline([
                 ('preprocessing', preprocessing),
                 ('clf', DummyEstimator())
                ])

# Create dictionary with hyperparameters to search through
search_space = {
                    'clf': [ExtraTreesClassifier(), RandomForestClassifier()],
                    'clf__criterion': ['gini', 'entropy'], 
                    'clf__max_depth': [3, 5, 10, 15, 20, 25, 50, 75, 100, 200, None],
                    'clf__min_samples_split': [2, 3, 5, 10, 20, 30, 50],
                    'clf__min_samples_leaf': [1, 2, 3, 5, 10, 15],
                    # Note: 'auto' is the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3
                    'clf__max_features': ['auto', 'sqrt', 'log2'],
                    'clf__class_weight': [None, 'balanced', 'balanced_subsample'],
                    'clf__n_estimators': [1, 3, 5, 10, 15, 25, 50, 75, 100, 125, 150, 200]
               }

In [5]:
# Use an F0.5 score to put more weight on precision than on recall
fbeta_scorer = make_scorer(fbeta_score, beta=0.5)

# Random search through parameter grid
rand_search = RandomizedSearchCV(estimator=pipe, 
                                 param_distributions=search_space, 
                                 n_iter=100,
                                 cv=5, 
                                 n_jobs=-1,
                                 scoring=fbeta_scorer,
                                 random_state=42)

# Fit search
final_model = rand_search.fit(X_train, y_train)

In [6]:
# Print best parameters
print('Best Model:', final_model.best_estimator_.get_params()['clf'])

Best Model: RandomForestClassifier(class_weight='balanced', max_depth=200,
                       min_samples_leaf=3, min_samples_split=5,
                       n_estimators=200)


# Final Model




In [11]:
# Fit the final model
pipe = Pipeline([
                 ('preprocessing', preprocessing),
                 ('rf', RandomForestClassifier(class_weight='balanced', max_depth=200,
                                               min_samples_leaf=3, min_samples_split=5,
                                               n_estimators=200, n_jobs=-1))
                ])

pipe.fit(X_train, y_train);

y_preds = pipe.predict(X_test)



# Evaluation Metrics

Accuracy: How good were our overall predictions?
- 79% of all of our predictions were correct.

Precision: When predicting job change, how often was it correct?
- 56% of all predicted positives were true positives.

Recall: How much of the actual candidate pool did we keep?
- 74% of all potential candidates were included in our predictions





In [14]:
print('Confusion Matrix:\n', confusion_matrix(y_test, y_preds))
print('Accuracy Score:', round(accuracy_score(y_test, y_preds), 2))
print('Precision:', round(precision_score(y_test, y_preds), 2))
print('Recall:', round(recall_score(y_test, y_preds), 2))

Confusion Matrix:
 [[2328  549]
 [ 245  710]]
Accuracy Score: 0.79
Precision: 0.56
Recall: 0.74
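
Since the search was scored with F0.5, it may help to see that metric for the final model as well. The sketch below recomputes precision and recall from the confusion matrix counts above and plugs them into the standard F-beta formula; `f_beta` is a helper written for this illustration, not a scikit-learn function.

```python
# Counts read off the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 2328, 549, 245, 710

precision = tp / (tp + fp)
recall = tp / (tp + fn)

def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print('F0.5:', round(f_beta(precision, recall, 0.5), 2))  # about 0.59
print('F1:  ', round(f_beta(precision, recall, 1.0), 2))  # about 0.64
```

F0.5 comes out lower than F1 here because it leans toward precision, which is the weaker of the two metrics for this model.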


# Conclusion
Final Model:
- RandomForestClassifier
- class weight: balanced
- max depth: 200
- min samples leaf: 3
- min samples split: 5
- number of estimators: 200

The model did not perform as well as I had hoped. More demographic data could help considerably. Sacrificing some recall for higher precision could also make the model more useful, since the candidate pool is large and a company only needs so many candidates.
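
One possible way to trade recall for precision, which this notebook did not try, is to raise the probability threshold used to call someone a job seeker. The sketch below shows the idea on synthetic data from `make_classification`, not the job-change dataset; the 0.7 threshold and all variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 75% negatives)
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.75], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=42)

clf = RandomForestClassifier(class_weight='balanced', random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

# Compare the default 0.5 cutoff with a stricter one
results = {}
for threshold in (0.5, 0.7):
    preds = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, preds, zero_division=0),
                          recall_score(y_te, preds))
    print(threshold, 'precision: %.2f  recall: %.2f' % results[threshold])
```

A higher threshold can only shrink the set of predicted positives, so recall never goes up; precision usually rises, though that is not guaranteed on every dataset.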

The model is useful for finding potential data scientists to hire. Using it can reduce the time and money spent on the hiring process: with accurate predictions, hiring managers spend less time on candidates who are not actually interested.

Next steps: try more models, or engineer new features from the ones already provided.


In [10]:
from sklearn import set_config
set_config(display='diagram')
pipe
