As this is an individual case study assignment, I agree and acknowledge that all code modified in this notebook is my own. I have not collaborated and will not collaborate with anyone on this assignment. If I have questions, I will ask the instructor or TAs.

In [1]:
# Please provide your name and agreement to the above statement as variables `name` (string) and `agree` (boolean)
name = "Lewis Blake"
agree = True

In [2]:
conditions = [isinstance(name, str), isinstance(agree, bool), agree]
for condition in conditions:
    if not condition:
        raise ValueError("Student has not agreed to work on this assignment alone and without collaboration or has not provided their name")

# Job Performance Prediction

You work for a software startup, Predict All The Things Inc. (PALT), and are approached by the CEO to build an algorithm that can help sift through resumes. PALT just closed a $3 million Series A round of funding and the CEO just landed a deal with a national retailer, SellsALOT, to help them with hiring Sales Associates.

SellsALOT is able to provide data on all the employees who work as Sales Associates throughout its stores, along with their customer satisfaction and sales performance scores.

In this case study, you are tasked with building a model that predicts job performance, to assist HR in selecting applicants to interview.

The data was provided to you by the new HR intern, Keegan. This is the email you got from Keegan with the attached data.

>Hi!
>
>I hope you're doing well. I've attached the data we have about all employees. Please ensure this data stays confidential and is not shared with anyone who has not signed the NDA. The columns have all the information we have about our employees and the scoring rating that they've received from our performance monitors. We also have some employees that were fired and I have included those as well.
>
>I was also able to dig up some more information about our employees that I found on the internet. It took a lot of time but I hope it helps in making the model even better. Can't wait to see this thing in action. Everyone here is very excited about our collaboration with you and we look forward to this making hiring a lot easier for us.
>
>Thanks,
>
>Keegan Thiel
>
>HR Intern
>
>Human Resources
>
>SellsALOT


Data is available in the `employees.csv` file provided. 


SellsALOT is an Equal Opportunity Employer, meaning it agrees not to discriminate against any employee or job applicant because of race, color, religion, national origin, sex, physical or mental disability, or age.


## Data Cleaning

First, let's investigate the data that we received from Keegan.



In [3]:
import pandas as pd
import matplotlib
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
import datetime
from datetime import date

In [76]:
df = pd.read_csv("employees.csv")

In [77]:
df.head()

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,...,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
0,Sarah,Chang,1989-12-24,764 Howard Tunnel,30167,Female,Black,Fluent,Basic,High School,...,2.52,8.8,0.0,ISTJ,693,1108,False,2.21,2.07,Current Employee
1,Daniel,Taylor,1985-03-15,4892 Jessica Turnpike Suite 781,86553,Male,Black,Fluent,Basic,High School,...,3.9,13.7,0.0,ISFJ,507,1259,False,3.37,2.98,Current Employee
2,Heather,Stewart,1993-09-20,778 Linda Orchard Apt. 609,30167,Female,Black,Proficient,Basic,High School,...,2.63,5.2,0.0,INFP,599,868,False,1.5,1.36,Current Employee
3,Katherine,Dillon,1986-12-22,139 Linda Crossroad Suite 115,30167,Female,Black,Basic,Basic,High School,...,3.88,12.5,0.0,ISFP,1321,889,True,2.89,2.62,Current Employee
4,Sheri,Bolton,1991-02-24,1858 Lauren Orchard,60531,Female,Black,Proficient,Proficient,High School,...,3.3,7.0,0.0,ISFJ,414,13760,True,1.94,1.78,Current Employee


In [6]:
df.describe(include="all")

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,...,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
count,2000,2000,2000,2000,2000.0,2000,2000,2000,2000,2000,...,1645.0,2000.0,2000.0,2000,2000.0,2000.0,2000,2000.0,2000.0,2000
unique,477,696,1701,2000,,2,3,3,3,5,...,,,,16,,,2,,,2
top,Michael,Smith,1990-05-09,64538 Harris Fork Suite 487,,Female,Black,Fluent,Basic,High School,...,,,,ISFJ,,,False,,,Current Employee
freq,42,45,4,1,,1201,1000,1183,1795,881,...,,,,218,,,1813,,,1844
mean,,,,,53300.7405,,,,,,...,3.251465,8.04255,0.263,,1065.1445,9586.589,,2.220175,2.06511,
std,,,,,17455.384226,,,,,,...,0.430466,4.674366,0.907487,,7499.266485,226956.3,,1.058473,0.97946,
min,,,,,24310.0,,,,,,...,2.5,0.0,0.0,,300.0,500.0,,0.0,0.0,
25%,,,,,43357.0,,,,,,...,2.87,3.9,0.0,,364.0,671.0,,1.3075,1.2575,
50%,,,,,55864.0,,,,,,...,3.27,8.1,0.0,,466.5,992.5,,2.25,2.06,
75%,,,,,60531.0,,,,,,...,3.61,12.1,0.0,,769.25,2042.0,,3.08,2.86,


In [34]:
print("The columns of data are:")
list(df.columns)

The columns of data are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'Gender',
 'Race / Ethnicity',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired']

Based on the data above we need to do the following:

1. Split Myers Briggs into subtypes
1. Convert categorical columns to dummy variables
1. Calculate Age based on date of birth

The [Myers Briggs Type Indicator](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator) (MBTI) describes people as one of two types for each of:

* extraversion (E) or introversion (I)
* sensing (S) or intuition (N)
* thinking (T) or feeling (F)
* judgment (J) or perception (P)

It makes more sense to represent each person by these four binary dimensions than to treat each of the 16 possible types as its own category. That way a model can learn from each factor individually as well as from their combinations.

Your next task is to split the MBTI column into four columns in the dataframe:

* MBTI_EI with value `E` or `I`
* MBTI_SN with value `S` or `N`
* MBTI_TF with value `T` or `F`
* MBTI_JP with value `J` or `P`

Each value should correspond to the Myers Briggs Type in the same row.

In [84]:
df["MBTI_EI"] = df["Myers Briggs Type"].str[:1]
df["MBTI_SN"] = df["Myers Briggs Type"].str[1:2]
df["MBTI_TF"] = df["Myers Briggs Type"].str[2:3]
df["MBTI_JP"] = df["Myers Briggs Type"].str[3:4]

Index(['First Name', 'Last Name', 'Date of Birth', 'Address', 'Zipcode',
       'Gender', 'Race / Ethnicity', 'English Fluency', 'Spanish Fluency',
       'Education', 'High School GPA', 'College GPA', 'Years of Experience',
       'Years of Volunteering', 'Myers Briggs Type', 'Twitter followers',
       'Instagram Followers', 'Requires Sponsorship',
       'Customer Satisfaction Rating', 'Sales Rating', 'Fired', 'MBTI_EI',
       'MBTI_SN', 'MBTI_TF', 'MBTI_JP'],
      dtype='object')

In [85]:
assert len(set(df["MBTI_EI"])) == 2
assert "E" in set(df["MBTI_EI"]) and "I" in set(df["MBTI_EI"])
assert len(set(df["MBTI_SN"])) == 2
assert "S" in set(df["MBTI_SN"]) and "N" in set(df["MBTI_SN"])
assert len(set(df["MBTI_TF"])) == 2
assert "T" in set(df["MBTI_TF"]) and "F" in set(df["MBTI_TF"])
assert len(set(df["MBTI_JP"])) == 2
assert "J" in set(df["MBTI_JP"]) and "P" in set(df["MBTI_JP"])

1. ~~Split Myers Briggs into subtypes~~
1. Convert categorical columns to dummy variables
1. Calculate Age based on date of birth

Dummy variables allow us to convert a category into several binary variables. For example, if we were storing a color value that could only be `red`, `green`, or `blue`, then instead of storing the color as one of those strings, we could store three binary variables: `is_red`, `is_green`, and `is_blue`.

We can do this in pandas easily by using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).
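
As a minimal sketch of the color example above, the cell below runs `get_dummies` on a small made-up dataframe. The `toy` dataframe and its `color` column are hypothetical and are not part of `employees.csv`.

```python
import pandas as pd

# Hypothetical toy data; the column name "color" is illustrative only
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category, prefixed with the source column name
dummies = pd.get_dummies(toy, columns=["color"], dtype=int)
print(dummies)

# join returns a new dataframe rather than modifying toy in place,
# so the result has to be assigned to a name to be kept
toy_with_dummies = toy.join(pd.get_dummies(toy["color"], prefix="is", dtype=int))
print(list(toy_with_dummies.columns))
```

Passing `dtype=int` keeps the indicators as 0/1 integers; without it, recent pandas versions return booleans.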

In [81]:
# Determine all the categorical columns and save them to categorical_columns
# Note that this does not include the binary features
categorical_columns = pd.DataFrame({"MBTI_EI": df["MBTI_EI"], "MBTI_SN": df["MBTI_SN"], "MBTI_TF": df["MBTI_TF"], "MBTI_JP": df["MBTI_JP"]})
#df.join(categorical_columns, how = "outer")
#df.columns

In [82]:
assert len(categorical_columns) > 8
for category in categorical_columns:
    assert category in df.columns

In [83]:
# For every column in the categorical features
# Calculate the dummy variables and add them to the dataframe
tmp = pd.get_dummies(categorical_columns).copy()
# join returns a new dataframe, so assign the result back to df to keep the dummy columns
df = df.join(tmp, how="outer")
df

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,...,MBTI_TF,MBTI_JP,MBTI_EI_E,MBTI_EI_I,MBTI_SN_N,MBTI_SN_S,MBTI_TF_F,MBTI_TF_T,MBTI_JP_J,MBTI_JP_P
0,Sarah,Chang,1989-12-24,764 Howard Tunnel,30167,Female,Black,Fluent,Basic,High School,...,T,J,0,1,0,1,0,1,1,0
1,Daniel,Taylor,1985-03-15,4892 Jessica Turnpike Suite 781,86553,Male,Black,Fluent,Basic,High School,...,F,J,0,1,0,1,1,0,1,0
2,Heather,Stewart,1993-09-20,778 Linda Orchard Apt. 609,30167,Female,Black,Proficient,Basic,High School,...,F,P,0,1,1,0,1,0,0,1
3,Katherine,Dillon,1986-12-22,139 Linda Crossroad Suite 115,30167,Female,Black,Basic,Basic,High School,...,F,P,0,1,0,1,1,0,0,1
4,Sheri,Bolton,1991-02-24,1858 Lauren Orchard,60531,Female,Black,Proficient,Proficient,High School,...,F,J,0,1,0,1,1,0,1,0
5,Donna,Davis,1996-05-26,4232 Tina Forks,86553,Female,Black,Proficient,Basic,Associates,...,T,J,1,0,0,1,0,1,1,0
6,Benjamin,Shelton,1985-01-15,186 Warren Mount Apt. 396,30167,Male,Black,Proficient,Basic,Associates,...,F,J,1,0,0,1,1,0,1,0
7,Kevin,Hayes,1994-01-21,515 Tucker Plaza Suite 304,59010,Male,Black,Fluent,Basic,High School,...,T,P,1,0,0,1,0,1,0,1
8,Autumn,Robinson,1996-05-05,0123 Audrey Union,60531,Female,Black,Fluent,Basic,,...,F,J,0,1,0,1,1,0,1,0
9,Kimberly,Becker,1983-04-12,91615 Wilson Place,60531,Female,Black,Fluent,Basic,High School,...,T,P,0,1,1,0,0,1,0,1


In [74]:
assert len(list(df.columns)) > 45

AssertionError: 

In [None]:
print("The current columns are:")
list(df.columns)

In [None]:
# Now let's drop all the categorical features columns from the dataframe
# So that we don't have duplicate information stored
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print("The current columns are:")
list(df.columns)

In [None]:
assert 45 > len(list(df.columns)) > 30
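
One possible way to drop the original categorical columns once their dummy columns exist is sketched below on made-up data. The `toy` dataframe, its column names, and the `categorical` list are assumptions for illustration, not the assignment's actual columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "size": ["S", "M", "S"],      # hypothetical categorical column
    "price": [10.0, 12.5, 9.0],   # hypothetical numeric column
})
categorical = ["size"]  # names of the columns to encode

# Add the dummy columns, then drop the original categorical column
encoded = toy.join(pd.get_dummies(toy[categorical], dtype=int))
encoded = encoded.drop(columns=categorical)
print(list(encoded.columns))
```

Keeping the categorical column names in a plain list means the same list can drive both the encoding and the drop.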

1. ~~Split Myers Briggs into subtypes~~
1. ~~Convert categorical columns to dummy variables~~
1. Calculate Age based on date of birth

In [None]:
def calculate_age(born):
    """Calculates age based on date of birth using https://stackoverflow.com/a/9754466/818687

    Args:
        born (datetime): The date of birth

    Returns:
        int: The age based on date of birth
    """
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
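
To see why the tuple comparison in `calculate_age` works, here is a small sketch that uses the same arithmetic with the reference date passed in explicitly, so the result does not depend on when the cell is run. The helper name `age_on` and the fixed reference date are assumptions made for this illustration.

```python
from datetime import date

def age_on(born, today):
    """Same arithmetic as calculate_age, with the reference date passed in."""
    # The tuple comparison is True (counts as 1) when the birthday
    # has not yet occurred in the reference year
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

reference = date(2018, 6, 1)  # arbitrary fixed date, chosen only for illustration

# Birthday already passed in the reference year: no adjustment
print(age_on(date(1990, 5, 9), reference))

# Birthday still to come: one year is subtracted.
# date.fromisoformat parses a "YYYY-MM-DD" string into a date object
print(age_on(date.fromisoformat("1990-12-24"), reference))
```

Note that the function expects date objects, so date-of-birth values stored as strings need to be parsed first.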

In [None]:
# Add an "Age" column to the dataframe that calculates people's ages based on their date of birth
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df["Age"].min() == 20
assert df["Age"].max() == 36
assert df["Age"].median() == 28

## Modelling

Based on your understanding of the data, select the features that you want to use to predict:

1. Customer Satisfaction
1. Sales Performance
1. Fired

In [None]:
# Save the columns we are trying to predict to targets
# If a target was a categorical column, make sure you use its dummy representation(s)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(targets) == 3
for target in targets:
    assert target in df.columns

Your prediction will be used to rank applicants for interviews with HR. **Which features will you select to use in your model?**

In [None]:
print("The available columns are:")
list(df)

In [None]:
# Enter all the features you want to use in a list and save it to rank_features
# These are the features for the model that will rank applicants for interviews
# YOUR CODE HERE
raise NotImplementedError()

Why did you choose the features you did?

In [None]:
## Save your reasoning in a string to the variable ranking_reason

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(ranking_reason, str)
assert len(ranking_reason) > 20

In [None]:
# Perform a train and test split on the data with the variable names:
# rank_x_train for the training features
# rank_x_test for the testing features
# rank_y_train for the training targets
# rank_y_test for the testing targets
# The test dataset should be 20% of the total dataset

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert (len(rank_x_train) / (len(rank_x_test) + len(rank_x_train))) == 0.8
assert (len(rank_y_train) / (len(rank_y_test) + len(rank_y_train))) == 0.8
assert len(rank_x_train) == len(rank_y_train)
assert len(rank_x_test) == len(rank_y_test)
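
For reference, an 80/20 split of the kind described above can be sketched with `train_test_split` on synthetic arrays. The data, the variable names, and the `random_state` value below are placeholders, not the assignment's features or targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 2 targets
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
targets = rng.normal(size=(100, 2))

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

Features and targets are split together in one call so that their rows stay aligned.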

In [None]:
# Select models of your choosing, import them and perform a parameter search to train them on each of the targets
# Determine an appropriate metric for measuring your performance and report that.

# YOUR CODE HERE
raise NotImplementedError()

# You should have print messages that state something like:
# To predict the target (target), I trained a (model) model
# and determined the best hyperparameters as (param1 = p1), (param2 = p2)...
# resulting in a (metric) score of (score)
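
A parameter search of the kind requested above could look roughly like the following sketch, which uses `GridSearchCV` with a `Ridge` regressor on synthetic data. The model, the parameter grid, and the metric are assumptions chosen for illustration; a categorical target such as `Fired` would need a classifier and a classification metric instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the real features and one target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validated search over a small, hypothetical grid of regularization strengths
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# The scorer returns negative MAE, so negate it to report the error itself
test_mae = -search.score(X_test, y_test)
print("To predict the target (synthetic y), I trained a Ridge model")
print(f"and determined the best hyperparameters as {search.best_params_}")
print(f"resulting in a mean absolute error of {test_mae:.3f}")
```

The search only ever sees the training split, so the held-out score gives a less optimistic estimate of performance.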

Would your feature choice change if HR was going to use your model to directly hire applicants without an interview? **Which features will you select to use in that model?**

In [None]:
print("The available columns are:")
list(df)

In [None]:
# Enter all the features you want to use in a list and save it to selection_features
# These are the features for the model that will directly hire the top applicants
# YOUR CODE HERE
raise NotImplementedError()

Why did you choose the features you did?

In [None]:
## Save your reasoning in a string to the variable selection_reason

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(selection_reason, str)
assert len(selection_reason) > 20

Why was your choice different from or the same as the ranking features?


In [None]:
# Save your reasoning in a string to the variable
# same_reason if the features are the same
# different_reason if the features are different
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
if all([rf in selection_features for rf in rank_features]) and all([sf in rank_features for sf in selection_features]):
    print("Your features for ranking and selection are the same.")
    assert isinstance(same_reason, str)
    assert len(same_reason) > 20
else:
    print("Your features for ranking and selection are different.")
    assert isinstance(different_reason, str)
    assert len(different_reason) > 20

In [None]:
# Perform a train and test split on the data with the variable names:
# selection_x_train for the training features
# selection_x_test for the testing features
# selection_y_train for the training targets
# selection_y_test for the testing targets
# The test dataset should be 20% of the total dataset

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert (len(selection_x_train) / (len(selection_x_test) + len(selection_x_train))) == 0.8
assert (len(selection_y_train) / (len(selection_y_test) + len(selection_y_train))) == 0.8
assert len(selection_x_train) == len(selection_y_train)
assert len(selection_x_test) == len(selection_y_test)

Now let's see if the model performs differently.

In [None]:
# Select models of your choosing, import them and perform a parameter search to train them on each of the targets
# Determine an appropriate metric for measuring your performance and report that.

# YOUR CODE HERE
raise NotImplementedError()

# You should have print messages that state something like:
# To predict the target (target), I trained a (model) model
# and determined the best hyperparameters as (param1 = p1), (param2 = p2)...
# resulting in a (metric) score of (score)

In [None]:
# Follow this up with a comparison of the performance of your two models, which use different features.
# You should print something like
# Using rank features for target (target) the model scored (score)
# versus using the selection features where it scored (score)
# YOUR CODE HERE
raise NotImplementedError()


## Feedback

In [None]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        str: Your feedback on the exercise
    """
    # YOUR CODE HERE
    raise NotImplementedError()