## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. You can add more notebook cells or edit existing notebook cells other than "# YOUR CODE HERE" to test out or debug your code. We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
2. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
3. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
4. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
5. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
6. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
7. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
8. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.
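
As an illustration of item 7, here is a minimal sketch of the difference between a function that silently depends on a notebook-level variable and one that uses only its parameters. The names `scale`, `scaled_sum_bad`, and `scaled_sum_good` are hypothetical and are not part of this assignment.

```python
scale = 10  # a global that happens to exist in this notebook session

def scaled_sum_bad(values):
    # Relies on the global `scale`; fails or changes behavior if the
    # grading environment does not define it the same way.
    return sum(values) * scale

def scaled_sum_good(values, scale):
    # Uses only its parameters, so it behaves the same in any environment.
    return sum(values) * scale

print(scaled_sum_good([1, 2, 3], scale=2))
```

The second form is the one that survives "Kernel > Restart and Run All", because nothing it needs lives outside its argument list.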

# Job Performance Prediction

You work for a software startup, Predict All The Things Inc. (PALT), and are approached by the CEO to build an algorithm that can help sift through resumes. PALT just closed a $3 million Series A round of funding and the CEO just landed a deal with a national retailer, SellsALOT, to help them with hiring Sales Associates.

They are able to obtain data on all the employees that work as Sales Associates throughout their stores as well as customer satisfaction and sales performance scores.

In this case study, you are tasked with building a model to predict job performance to assist HR in selecting applicants to interview.

The data was provided to you by the new HR intern, Keegan. This is the email you got from Keegan with the attached data.

>Hi!
>
>I hope you're doing well. I've attached the data we have about all employees. Please ensure this data stays confidential and is not shared with anyone who has not signed the NDA. The columns have all the information we have about our employees and the scoring rating that they've received from our performance monitors. We also have some employees who were fired and I have included those as well.
>
>I was also able to dig up some more information about our employees that I found on the internet. It took a lot of time but I hope it helps in making the model even better. Can't wait to see this thing in action. Everyone here is very excited about our collaboration with you and we look forward to this making hiring a lot easier for us.
>
>Thanks,
>
>Keegan Thiel
>
>HR Intern
>
>Human Resources
>
>SellsALOT


Data is available in the `employees.csv` file provided. 


SellsALOT is an Equal Opportunity Employer, meaning it agrees not to discriminate against any employee or job applicant based on race, color, religion, national origin, sex, physical or mental disability, or age.


## Import packages that are likely to be useful

### However, __do not__ use TensorFlow to build your models.

Below we import packages that are needed or may be useful. You may import additional packages as you see fit, with the exception of TensorFlow. You may use scikit-learn. **Ensure that you import additional packages in cells that say "### YOUR CODE HERE".**

In [1]:
import pandas as pd
import matplotlib
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
import datetime
from datetime import date

## Data Cleaning

First, let's investigate the data that we received from Keegan.

If you are using Colab, **make sure you upload the employees.csv file** so it can be loaded in the next cell. To do so, click the file folder icon in the Colab sidebar. You will see the contents of the current directory, which will include a "sample_data" folder. Click the upload icon (a piece of paper with an upward-pointing arrow, just below the word "Files"). Locate the employees.csv file that you downloaded from Canvas to your local machine and upload it.

In [2]:
df = pd.read_csv("employees.csv")

In [3]:
df.head(10)

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
0,Sarah,Chang,1989-12-24,764 Howard Tunnel,30167,Female,Black,Fluent,Basic,High School,3.1,2.52,8.8,0.0,ISTJ,693,1108,False,2.21,2.07,Current Employee
1,Daniel,Taylor,1985-03-15,4892 Jessica Turnpike Suite 781,86553,Male,Black,Fluent,Basic,High School,3.02,3.9,13.7,0.0,ISFJ,507,1259,False,3.37,2.98,Current Employee
2,Heather,Stewart,1993-09-20,778 Linda Orchard Apt. 609,30167,Female,Black,Proficient,Basic,High School,2.95,2.63,5.2,0.0,INFP,599,868,False,1.5,1.36,Current Employee
3,Katherine,Dillon,1986-12-22,139 Linda Crossroad Suite 115,30167,Female,Black,Basic,Basic,High School,3.99,3.88,12.5,0.0,ISFP,1321,889,True,2.89,2.62,Current Employee
4,Sheri,Bolton,1991-02-24,1858 Lauren Orchard,60531,Female,Black,Proficient,Proficient,High School,3.82,3.3,7.0,0.0,ISFJ,414,13760,True,1.94,1.78,Current Employee
5,Donna,Davis,1996-05-26,4232 Tina Forks,86553,Female,Black,Proficient,Basic,Associates,2.05,3.14,1.6,0.0,ESTJ,495,2401,False,0.78,0.87,Current Employee
6,Benjamin,Shelton,1985-01-15,186 Warren Mount Apt. 396,30167,Male,Black,Proficient,Basic,Associates,2.12,3.51,13.1,0.0,ESFJ,1696,1158,False,3.41,3.11,Current Employee
7,Kevin,Hayes,1994-01-21,515 Tucker Plaza Suite 304,59010,Male,Black,Fluent,Basic,High School,2.09,2.92,4.1,0.0,ESTP,319,538,False,1.3,1.29,Current Employee
8,Autumn,Robinson,1996-05-05,0123 Audrey Union,60531,Female,Black,Fluent,Basic,,,,3.0,0.0,ISFJ,988,510,False,0.89,0.63,Current Employee
9,Kimberly,Becker,1983-04-12,91615 Wilson Place,60531,Female,Black,Fluent,Basic,High School,2.99,2.97,14.8,0.0,INTP,1169,2254,False,3.59,3.4,Current Employee


In [4]:
df.describe(include="all")

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship,Customer Satisfaction Rating,Sales Rating,Fired
count,2000,2000,2000,2000,2000.0,2000,2000,2000,2000,2000,1645.0,1645.0,2000.0,2000.0,2000,2000.0,2000.0,2000,2000.0,2000.0,2000
unique,477,696,1701,2000,,2,3,3,3,5,,,,,16,,,2,,,2
top,Michael,Smith,1997-06-28,4945 Susan Pass Apt. 771,,Female,Black,Fluent,Basic,High School,,,,,ISFJ,,,False,,,Current Employee
freq,42,45,4,1,,1201,1000,1183,1795,881,,,,,218,,,1813,,,1844
mean,,,,,53300.7405,,,,,,3.010359,3.251465,8.04255,0.263,,1065.1445,9586.589,,2.220175,2.06511,
std,,,,,17455.384226,,,,,,0.584502,0.430466,4.674366,0.907487,,7499.266485,226956.3,,1.058473,0.97946,
min,,,,,24310.0,,,,,,2.0,2.5,0.0,0.0,,300.0,500.0,,0.0,0.0,
25%,,,,,43357.0,,,,,,2.51,2.87,3.9,0.0,,364.0,671.0,,1.3075,1.2575,
50%,,,,,55864.0,,,,,,3.02,3.27,8.1,0.0,,466.5,992.5,,2.25,2.06,
75%,,,,,60531.0,,,,,,3.51,3.61,12.1,0.0,,769.25,2042.0,,3.08,2.86,


In [5]:
print("The columns of data are:")
list(df.columns)

The columns of data are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'Zipcode',
 'Gender',
 'Race / Ethnicity',
 'English Fluency',
 'Spanish Fluency',
 'Education',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Myers Briggs Type',
 'Twitter followers',
 'Instagram Followers',
 'Requires Sponsorship',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Fired']

Before building any models, your manager has asked you to **convert all the feature data into formats that can easily be used for training and testing a variety of models**. This means:
1. Splitting the 16 Myers Briggs types into 4 subtypes
2. Converting categorical features into dummy binary features
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

In addition, you should remove columns that contain redundant information after going through the process above, e.g., removing the 'Date of Birth' column after an 'Age' column is added.


### MBTI Splitting

The [Myers Briggs Type Indicator](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator) (MBTI) describes people as one of two types for each of:

* extraversion (E) or introversion (I)
* sensing (S) or intuition (N)
* thinking (T) or feeling (F)
* judgment (J) or perception (P)

It would make more sense for us to represent people as one or the other of these instead of creating all the possible cases. That way a model can learn based on each of those factors as well as their combination. 

Your next task is to split the MBTI column into four columns in the dataframe, with the following column names and values:

* MBTI_EI with value `E` or `I`
* MBTI_SN with value `S` or `N`
* MBTI_TF with value `T` or `F`
* MBTI_JP with value `J` or `P`

that correspond to the same row's Myers Briggs Type, and add those columns to your DataFrame, ```df```. Consider using the Series ```apply()``` method.

Afterwards, remove the original "Myers Briggs Type" column.
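
The split described above can be sketched on a toy DataFrame. This is only one possible approach, assuming every MBTI code is exactly four letters in the fixed order E/I, S/N, T/F, J/P; the `toy` DataFrame and its three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
toy = pd.DataFrame({"Myers Briggs Type": ["ISTJ", "ENFP", "INTP"]})

# Each position in the four-letter code corresponds to one dimension.
new_columns = ["MBTI_EI", "MBTI_SN", "MBTI_TF", "MBTI_JP"]
for position, column in enumerate(new_columns):
    # Bind `position` as a default argument so each lambda keeps its own index.
    toy[column] = toy["Myers Briggs Type"].apply(lambda code, i=position: code[i])

# The original column is now redundant.
toy = toy.drop(columns=["Myers Briggs Type"])
print(toy)
```

Taking the letter directly, rather than testing for one letter and defaulting to the other, means an unexpected code would show up as an unexpected value instead of being silently mapped.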

In [6]:
df['MBTI_EI'] = np.where(df['Myers Briggs Type'].str[0]=='E', 'E', 'I')
df['MBTI_SN'] = np.where(df['Myers Briggs Type'].str[1]=='S', 'S', 'N')
df['MBTI_TF'] = np.where(df['Myers Briggs Type'].str[2]=='T', 'T', 'F')
df['MBTI_JP'] = np.where(df['Myers Briggs Type'].str[3]=='J', 'J', 'P')
df.drop(['Myers Briggs Type'], axis=1, inplace=True)

In [7]:
assert len(set(df["MBTI_EI"])) == 2
assert "E" in set(df["MBTI_EI"]) and "I" in set(df["MBTI_EI"])
assert len(set(df["MBTI_SN"])) == 2
assert "S" in set(df["MBTI_SN"]) and "N" in set(df["MBTI_SN"])
assert len(set(df["MBTI_TF"])) == 2
assert "T" in set(df["MBTI_TF"]) and "F" in set(df["MBTI_TF"])
assert len(set(df["MBTI_JP"])) == 2
assert "J" in set(df["MBTI_JP"]) and "P" in set(df["MBTI_JP"])
assert "Myers Briggs Type" not in list(df.columns)

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. Converting categorical features into dummy binary features
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

### Categorical to Dummy Variables

Dummy variables are variables that allow us to convert a category into several binary variables. For example, if we had a color value that we were storing and we knew it could only have the values `red`, `green`, and `blue`, then instead of storing the color as those strings, we can store three binary variables: `is_red`, `is_green`, and `is_blue`. 

We can do this in pandas easily by using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).
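
A minimal sketch of the color example above, using a made-up `colors` DataFrame. The `prefix="is"` and `dtype=int` arguments are choices made here for readability, not requirements of the assignment.

```python
import pandas as pd

# Hypothetical data matching the red/green/blue example.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; the original "color" column is replaced.
dummies = pd.get_dummies(colors, columns=["color"], prefix="is", dtype=int)
print(dummies)
```

Each row has exactly one of `is_blue`, `is_green`, and `is_red` set to 1, so for a two-valued category one of the two dummy columns carries no extra information.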

In [8]:
# Review the DataFrame columns and identify the columns that contain categorical
# features and save them to a list called "categorical_columns".

# Categorical here means that there is a discrete (albeit large in some cases)
# number of possible options for the column that are not just 0 or 1

categorical_columns = ['Zipcode', 'Gender', 'Race / Ethnicity', 'English Fluency',
                       'Spanish Fluency', 'Education', 'Requires Sponsorship', 'Fired',
                       'MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP']

In [9]:
assert len(categorical_columns) > 8
for category in categorical_columns:
    assert category in df.columns

In [10]:
# Before we get the dummy variables, we need to make sure that all these 
# categorical columns are actually recognized by pandas to be of 'category' type.
for column in categorical_columns:
    df[column] = df[column].astype('category')

In [11]:
# For every column in the categorical_columns,
# calculate the dummy variables and add them to the dataframe

df = pd.get_dummies(df, columns=categorical_columns)

In [12]:
assert len(list(df.columns)) > 45

In [13]:
print("The current columns are:")
list(df.columns)

The current columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Zipcode_24310',
 'Zipcode_30167',
 'Zipcode_43357',
 'Zipcode_43711',
 'Zipcode_54821',
 'Zipcode_55864',
 'Zipcode_59010',
 'Zipcode_60531',
 'Zipcode_72361',
 'Zipcode_86553',
 'Gender_Female',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_False',
 'Requires Sponsorship_True',
 'Fired_Current Employee',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_EI_I',
 'MBTI_SN_N',
 'MBTI_SN_S',
 'MBTI_T

In [14]:
# Now drop all the categorical features columns from the dataframe
# So that we don't have duplicate information stored

redundant_features = ['Gender_Female', 'Requires Sponsorship_False', 'Fired_Current Employee', 'MBTI_EI_I', 'MBTI_SN_N', 'MBTI_TF_F', 'MBTI_JP_P']

# errors='ignore' skips any listed column that is already gone or was never
# created (e.g., a binary category with no observations in the data)
df.drop(redundant_features, axis=1, inplace=True, errors='ignore')

In [15]:
print("The current columns are:")
list(df.columns)

The current columns are:


['First Name',
 'Last Name',
 'Date of Birth',
 'Address',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Zipcode_24310',
 'Zipcode_30167',
 'Zipcode_43357',
 'Zipcode_43711',
 'Zipcode_54821',
 'Zipcode_55864',
 'Zipcode_59010',
 'Zipcode_60531',
 'Zipcode_72361',
 'Zipcode_86553',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_True',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_SN_S',
 'MBTI_TF_T',
 'MBTI_JP_J']

In [16]:
assert 55 > len(list(df.columns)) > 30

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. ~~Converting categorical features into dummy binary features~~
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

### Age Calculation

In [17]:
def calculate_age(born):
    """Calculates age based on date of birth using https://stackoverflow.com/a/9754466/818687

    Args:
        born (datetime): The date of birth

    Returns:
        int: The age based on date of birth
    """
    
    today = datetime.datetime.strptime("2020-11-20", "%Y-%m-%d")
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

Add an "Age" column to the dataframe, with the help of the ```calculate_age()``` function above. Afterwards, remove the "Date of Birth" column.

The input to ```calculate_age()``` should be a datetime object. Review the ```datetime.datetime.strptime()```
function and [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) to determine how to convert the "Date of Birth" date string into a datetime object.
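
As a small worked example of the conversion, the sketch below parses one date string in the `YYYY-MM-DD` form used by the "Date of Birth" column and applies the same age arithmetic as ```calculate_age()```, with the same fixed reference date. The variable names are illustrative only.

```python
import datetime

# "%Y-%m-%d" matches a four-digit year, two-digit month, and two-digit day.
born = datetime.datetime.strptime("1989-12-24", "%Y-%m-%d")
today = datetime.datetime.strptime("2020-11-20", "%Y-%m-%d")

# True (counted as 1) when this year's birthday has not happened yet.
birthday_pending = (today.month, today.day) < (born.month, born.day)
age = today.year - born.year - birthday_pending
print(born, age)
```

Here the birthday (December 24) falls after the reference date (November 20), so one year is subtracted and the age is 30 rather than 31.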

In [18]:
df['Date of Birth'] = df['Date of Birth'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
df['Age'] = df['Date of Birth'].apply(calculate_age)
df.drop(['Date of Birth'], axis=1, inplace=True)

In [19]:
assert df["Age"].min() == 22
assert df["Age"].max() == 38
assert df["Age"].median() == 30
assert "Date of Birth" not in list(df.columns)

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. ~~Converting categorical features into dummy binary features~~
3. ~~Calculating age based on date of birth~~
4. Dealing with missing (NaN) values in the data

## Handle NaN values

In [20]:
# Create a list of columns that contain NaN values

nan_columns = df.columns[df.isna().any()].tolist()
print(nan_columns)

['High School GPA', 'College GPA']


We see that the data is not truly "missing" any values; rather, people who did not attend or complete high school or college have no GPA values.

How should you deal with this? If you had a large number of people without GPAs, you might consider making separate models for people with GPAs and for people without. For this case, your manager asks you to make sure there's one model for everyone. She recommends one of the following options:

1. Replace NaN values with the mean value of all the non-NaN values.
1. Replace NaN values with 0
1. Replace NaN values with some other value
1. Create a model to predict people's GPA values from other attributes and fill them in with those values

Consider the assumptions of each approach:
1. Replacing with the mean assumes that that person would receive the average of others who work at this company.
1. Replacing with 0 assumes that that person would fail if they attended high school or college.
1. Replacing with some arbitrary value will have assumptions based on what that value is
1. Creating a model to predict people's GPA values from the other attributes in the data assumes that those attributes are predictive of GPA. 


Regardless of the approach you take, just make sure there are no more NaN values. 
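
The first three options can be compared side by side on a toy column. This is a sketch with made-up GPA values; using the median as the "some other value" is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA column with two missing entries.
gpa = pd.Series([3.0, np.nan, 2.0, 4.0, np.nan, 3.5], name="College GPA")

filled_mean = gpa.fillna(gpa.mean())      # option 1: mean of the non-NaN values
filled_zero = gpa.fillna(0)               # option 2: constant zero
filled_median = gpa.fillna(gpa.median())  # option 3: another value (median here)

print(pd.DataFrame({"mean": filled_mean, "zero": filled_zero, "median": filled_median}))
```

Note that `mean()` and `median()` skip NaN values by default, so the fill value is computed only from the people who actually have a GPA.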

In [21]:
# For each of the two columns that contain NaN values, replace the NaN values
# with numerical values, using one of the approaches above, or some other approach
# that you devise yourself.

df[nan_columns] = df[nan_columns].fillna(df[nan_columns].mean())

In [22]:
for col in nan_columns:
    assert not df[col].isna().any()

In [23]:
# Describe the approach you chose and why and save that as a string called nan_filling_approach

nan_filling_approach = """My thinking behind this was that since this data is specific to a job,
we could expect the employees working there without these values to have somewhere around the mean
of the other employees. I interpreted this as implying that they have around the same knowledge."""
print(nan_filling_approach)

My thinking behind this was that since this data is specific to a job,
we could expect the employees working there without these values to have somewhere around the mean
of the other employees. I interpreted this as implying that they have around the same knowledge.


In [24]:
assert len(nan_filling_approach) > 30

## Modeling

### Interviewing model(s)

Having completed the conversion of the data into a format that can be used with machine learning models, your manager asks that you build three separate models which predict the following three targets, respectively:

1. Customer Satisfaction
1. Sales Performance
1. Fired

In [25]:
# Save the names of columns we are trying to predict to a list called "targets".
# Make sure that if we had a categorical column, that you use the dummy representation(s)

targets = ['Customer Satisfaction Rating', 'Sales Rating', 'Fired_Fired']

In [26]:
assert len(targets) == 3
for target in targets:
    assert target in df.columns

Ultimately, the predictions of your models will be used to rank applicants for interviews with HR.

**Which features will you select to use in your model?** You will specify them below.

In [27]:
print("The available columns are:")
list(df)

The available columns are:


['First Name',
 'Last Name',
 'Address',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Zipcode_24310',
 'Zipcode_30167',
 'Zipcode_43357',
 'Zipcode_43711',
 'Zipcode_54821',
 'Zipcode_55864',
 'Zipcode_59010',
 'Zipcode_60531',
 'Zipcode_72361',
 'Zipcode_86553',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_True',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_SN_S',
 'MBTI_TF_T',
 'MBTI_JP_J',
 'Age']

In [28]:
# Enter all the features you want to use in a list and save it to "interview_features".
# These are the features for the models that will predict the targets, and the
# predictions will be used to rank applicants for **interviews**.

interview_features = ['High School GPA', 'College GPA',  'Years of Experience', 'Years of Volunteering', 'Twitter followers', 'Instagram Followers', 'English Fluency_Basic', 'English Fluency_Fluent', 'English Fluency_Proficient',
                      'Spanish Fluency_Basic', 'Spanish Fluency_Fluent', 'Spanish Fluency_Proficient', 'Education_Associates', 'Education_Graduate', 'Education_High School', 'Education_None', 'Education_Undergraduate',
                      'Requires Sponsorship_True', 'Age']

Why did you choose the features you did?

In [29]:
## Save your reasoning in a string to the variable interview_reason

interview_reason = """I didn't think someone's ethnicity or gender should be included in picking an applicant, but whether they're fluent in other languages could be important depending on the application. I felt since we believe so
highly in this personality test it would be good to include all of those. Age also feels like an important thing when determining longevity. I didn't include any personal information such as names or location
because a name obviously doesn't matter, and location could only matter if the interview wasn't local, or if the person wasn't willing to relocate. I included the followers because I thought it may lead to
a representation of how personable someone is. Even though we replaced the GPA of some people, I felt those were both important. I do believe the approach to replacing those could be improved. On the topic
of education, I left those categorical variables in as well. Since sponsorship is important to jobs, I did include that."""

In [30]:
assert isinstance(interview_reason, str)
assert len(interview_reason) > 20

In [31]:
# Perform a train and test split on the data with the variable names:
# interview_x_train for the training features
# interview_x_test for the testing features
# interview_y_train for the training targets
# interview_y_test for the testing targets
# The test dataset should be 20% of the total dataset

interview_x_train, interview_x_test, interview_y_train, interview_y_test = train_test_split(df[interview_features], df[targets], test_size=0.2, random_state=0)

In [32]:
assert (len(interview_x_train) / (len(interview_x_test) + len(interview_x_train))) == 0.8
assert (len(interview_y_train) / (len(interview_y_test) + len(interview_y_train))) == 0.8
assert len(interview_x_train) == len(interview_y_train)
assert len(interview_x_test) == len(interview_y_test)

Build and train your interviewing models.
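
For orientation, here is a hedged sketch of one way to combine a train/test split, a hyperparameter search, and a held-out test score with scikit-learn. It uses synthetic data rather than `employees.csv`, and the choice of `Ridge`, the `alpha` grid, and mean squared error are assumptions for illustration, not the required solution.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data: 200 rows, 3 features, small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validated search over the regularization strength on the training set only.
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-5, 0, 10)},
    scoring="neg_mean_squared_error",
)
grid.fit(x_train, y_train)

# best_score_ is a cross-validation score; the test score is computed separately.
test_mse = mean_squared_error(y_test, grid.predict(x_test))
print(grid.best_params_, -grid.best_score_, test_mse)
```

The distinction matters for reporting: `best_score_` averages over validation folds drawn from the training data, while `test_mse` measures performance on rows the search never saw.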

In [33]:
# Select models of your choosing, import them here, and perform a
# hyperparameter search while training them on each of the targets.
#
# Do not use TensorFlow to build a model - you may use scikit-learn.
#
# Determine an appropriate metric for measuring your performance for each
# model/target, and report the test score for that metric. The metric may be
# different for each model/target.
#
# Save your models in a list, with models ordered in the same manner as the
# targets they are predicting in the list "targets" you created above.
# Call the list "my_interview_models", e.g.
#    my_interview_models = [interview_model_target1,
#                           interview_model_target2,
#                           interview_model_target3]
#
# You should use multiple print messages to print something like the
# following for each of your models/targets:
#
# To predict the target (target), I trained a (model) model
# and determined the best hyperparameters as (param1 = p1), (param2 = p2)...
# resulting in a (metric) score of (score).

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Customer Satisfaction Rating

n_alphas = 50
alphas = np.logspace(-5, 0, n_alphas)
scoring_metric = "neg_mean_squared_error"

# Ridge Regression
interview_cust_ridge_grid = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
interview_cust_ridge_grid.fit(interview_x_train, interview_y_train.iloc[:, 0])

# Lasso
interview_cust_lasso_grid = GridSearchCV(estimator=Lasso(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
interview_cust_lasso_grid.fit(interview_x_train, interview_y_train.iloc[:, 0])

interview_cust_results = {
    'Ridge Regression': -interview_cust_ridge_grid.best_score_,
    'Lasso': -interview_cust_lasso_grid.best_score_
}

print(f"""To predict the target {interview_y_train.columns[0]}, I trained a {min(interview_cust_results, key=interview_cust_results.get)} model
and determined the best hyperparameters as alpha = {interview_cust_lasso_grid.best_params_['alpha']:.6f}
resulting in a MSE score of {-interview_cust_lasso_grid.best_score_:.6f}.\n""")

# Sales Rating

# Ridge Regression
interview_sales_ridge_grid = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
interview_sales_ridge_grid.fit(interview_x_train, interview_y_train.iloc[:, 1])

# Lasso
interview_sales_lasso_grid = GridSearchCV(estimator=Lasso(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
interview_sales_lasso_grid.fit(interview_x_train, interview_y_train.iloc[:, 1])

interview_sales_results = {
    'Ridge Regression': -interview_sales_ridge_grid.best_score_,
    'Lasso': -interview_sales_lasso_grid.best_score_
}

print(f"""To predict the target {interview_y_train.columns[1]}, I trained a {min(interview_sales_results, key=interview_sales_results.get)} model
and determined the best hyperparameters as alpha = {interview_sales_ridge_grid.best_params_['alpha']:.6f}
resulting in a MSE score of {-interview_sales_ridge_grid.best_score_:.6f}.\n""")

# Fired_Fired

# SVM
svm_params = {
    "C": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e2, 1e3, 1e4],
    "random_state": [0],
    "gamma": np.logspace(-5, 0, 7)
}
interview_fired_svm_grid = GridSearchCV(estimator=SVC(kernel='rbf'), param_grid=svm_params, scoring='accuracy')
interview_fired_svm_grid.fit(interview_x_train, interview_y_train.iloc[:, 2])

# KNN
ks = list(range(1, 21))
interview_fired_knn_grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=dict(n_neighbors=ks), scoring='accuracy')
interview_fired_knn_grid.fit(interview_x_train, interview_y_train.iloc[:, 2])

# Logistic Regression
log_params = {
    'penalty': ['l1', 'l2'],
    'C':[0.001,.009,0.01,.09,1,5,10,25]
}

interview_fired_log_grid = GridSearchCV(estimator=LogisticRegression(), param_grid=log_params, scoring='accuracy')
interview_fired_log_grid.fit(interview_x_train, interview_y_train.iloc[:, 2])

interview_fired_results = {
    'Support Vector Machine': interview_fired_svm_grid.best_score_,
    'K Nearest Neighbors': interview_fired_knn_grid.best_score_,
    'Logistic Regression': interview_fired_log_grid.best_score_
}

print(f"""To predict the target {interview_y_train.columns[2]}, I trained a {max(interview_fired_results, key=interview_fired_results.get)} model
and determined the best hyperparameters as C = {interview_fired_log_grid.best_params_['C']}, with an {interview_fired_log_grid.best_params_['penalty']} penalty
resulting in an accuracy score of {interview_fired_log_grid.best_score_:.6f}.""")

To predict the target Customer Satisfaction Rating, I trained a Lasso model
and determined the best hyperparameters as alpha = 0.000687
resulting in a MSE score of 0.012080.

To predict the target Sales Rating, I trained a Ridge Regression model
and determined the best hyperparameters as alpha = 0.000010
resulting in a MSE score of 0.006842.

To predict the target Fired_Fired, I trained a Logistic Regression model
and determined the best hyperparameters as C = 25, with an l2 penalty
resulting in an accuracy score of 0.935000.


In [34]:
interview_cust_results

{'Lasso': 0.012080459543189207, 'Ridge Regression': 0.012201737474585592}

In [35]:
interview_sales_results

{'Lasso': 0.006843063399610025, 'Ridge Regression': 0.006841580911866799}

In [36]:
interview_fired_results

{'K Nearest Neighbors': 0.93,
 'Logistic Regression': 0.9349999999999999,
 'Support Vector Machine': 0.93}

In [37]:
my_interview_models = [interview_cust_lasso_grid.best_estimator_, 
                       interview_sales_ridge_grid.best_estimator_, 
                       interview_fired_log_grid.best_estimator_]

In [38]:
assert len(my_interview_models)==len(targets)

### Hiring model(s)

Your manager tells you that SellsALOT is considering doing away with interviews altogether in order to save money. SellsALOT would like a model that will be used to rank candidates for hiring them directly, rather than for interviewing them.

Will your choice of features change?

**Which features will you select to use in that model?** You will specify them below.

In [39]:
print("The available columns are:")
list(df)

The available columns are:


['First Name',
 'Last Name',
 'Address',
 'High School GPA',
 'College GPA',
 'Years of Experience',
 'Years of Volunteering',
 'Twitter followers',
 'Instagram Followers',
 'Customer Satisfaction Rating',
 'Sales Rating',
 'Zipcode_24310',
 'Zipcode_30167',
 'Zipcode_43357',
 'Zipcode_43711',
 'Zipcode_54821',
 'Zipcode_55864',
 'Zipcode_59010',
 'Zipcode_60531',
 'Zipcode_72361',
 'Zipcode_86553',
 'Gender_Male',
 'Race / Ethnicity_Black',
 'Race / Ethnicity_Caucasian',
 'Race / Ethnicity_Hispanic',
 'English Fluency_Basic',
 'English Fluency_Fluent',
 'English Fluency_Proficient',
 'Spanish Fluency_Basic',
 'Spanish Fluency_Fluent',
 'Spanish Fluency_Proficient',
 'Education_Associates',
 'Education_Graduate',
 'Education_High School',
 'Education_None',
 'Education_Undergraduate',
 'Requires Sponsorship_True',
 'Fired_Fired',
 'MBTI_EI_E',
 'MBTI_SN_S',
 'MBTI_TF_T',
 'MBTI_JP_J',
 'Age']

In [40]:
# Enter all the features you want to use in a list and save it to "hire_features".
# These are the features for the models that will predict the targets, and the
# predictions will be used to rank applicants for **hiring**.

# Keep the first 19 interview features except those at positions 4, 5, 8, 11 and 18.
hire_features = [interview_features[i] for i in range(19) if i not in (4, 5, 8, 11, 18)]

In [41]:
print('Customer Coefficients:')
print(interview_cust_lasso_grid.best_estimator_.sparse_coef_,'\n')
print('Sales Coefficients:')
print(interview_sales_lasso_grid.best_estimator_.sparse_coef_)

Customer Coefficients:
  (0, 0)	0.037916673460224995
  (0, 1)	0.057326883661508024
  (0, 2)	0.2133446609510015
  (0, 3)	0.21357734136996426
  (0, 4)	-6.702408547500355e-07
  (0, 5)	1.3176454907048578e-09
  (0, 6)	-0.2017407054379696
  (0, 7)	0.11846417540602998
  (0, 9)	-0.07124045163737841
  (0, 10)	0.02674679033801923
  (0, 13)	0.11760832708624114
  (0, 14)	-0.14705674481054526
  (0, 15)	-0.31645485633291404
  (0, 16)	0.07051610180739529
  (0, 18)	0.0034306621334382168 

Sales Coefficients:
  (0, 0)	0.03815621036202474
  (0, 1)	0.049713596130455076
  (0, 2)	0.19255789139742754
  (0, 3)	0.19198179507676527
  (0, 4)	-1.274373282962457e-07
  (0, 5)	-6.516758264568573e-09
  (0, 6)	-0.18821013667816783
  (0, 7)	0.09858188633167422
  (0, 9)	-0.14442816501250588
  (0, 10)	0.050531845928076496
  (0, 12)	0.1590714746675873
  (0, 13)	0.4133140007208122
  (0, 14)	-0.07207101880090003
  (0, 15)	-0.30986721886266433
  (0, 16)	0.2909296197147442
  (0, 17)	0.022654240217137696
  (0, 18)	0.003815535
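The sparse printout above lists only the nonzero coefficients, keyed by column position rather than by feature name. A minimal sketch of how such coefficients could be paired with names and split by magnitude follows. The feature names, the coefficient values, and the `threshold` are made up for illustration and are not taken from the fitted models.

```python
# Hypothetical names and coefficients, for illustration only. In the notebook
# these would come from interview_features and the fitted Lasso's coefficients.
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"]
coefficients = [0.21, -6.7e-07, 0.0, -0.32, 0.0034]


def split_by_magnitude(names, coefs, threshold=0.01):
    """Return (kept, dropped) feature names based on |coefficient| >= threshold."""
    kept, dropped = [], []
    for name, coef in zip(names, coefs):
        if abs(coef) >= threshold:
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped


kept, dropped = split_by_magnitude(feature_names, coefficients)
print("kept:", kept)
print("dropped:", dropped)
```

Coefficient size depends on the scale of each feature, so a tiny coefficient on an unscaled count (such as a follower count) does not by itself show that the feature is unimportant. Standardizing the features before fitting would make this kind of comparison fairer.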

Why did you choose the features you did?

In [42]:
## Save your reasoning in a string to the variable hire_reason

hire_reason = """For this model, I removed the five features whose coefficients were zeroed out or close to zero
in the lasso models for both the customer and sales ratings."""

In [43]:
assert isinstance(hire_reason, str)
assert len(hire_reason) > 20

Why was your choice different from or the same as the interviewing features?


In [44]:
# Save your reasoning in a string to the variable
# same_reason if the features are the same, or
# different_reason if the features are different.

different_reason = """I wanted to make the feature selection stricter. Since lasso can act as a form of feature selection,
I used it to identify the features whose coefficients it zeroed out or shrank to low values, and removed them."""

In [45]:
if all([rf in hire_features for rf in interview_features]) and all([sf in interview_features for sf in hire_features]):
    print("Your features for interviewing and hiring are the same.")
    assert isinstance(same_reason, str)
    assert len(same_reason) > 20
else:
    print("Your features for interviewing and hiring are different.")
    assert isinstance(different_reason, str)
    assert len(different_reason) > 20

Your features for interviewing and hiring are different.


In [46]:
# Perform a train and test split on the data with the variable names:
# hire_x_train for the training features
# hire_x_test for the testing features
# hire_y_train for the training targets
# hire_y_test for the testing targets
# The test dataset should be 20% of the total dataset

hire_x_train, hire_x_test, hire_y_train, hire_y_test = train_test_split(df[hire_features], df[targets], test_size=0.2, random_state=0)

In [47]:
assert (len(hire_x_train) / (len(hire_x_test) + len(hire_x_train))) == 0.8
assert (len(hire_y_train) / (len(hire_y_test) + len(hire_y_train))) == 0.8
assert len(hire_x_train) == len(hire_y_train)
assert len(hire_x_test) == len(hire_y_test)
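A minimal sketch of an 80/20 split on row indices, using only the standard library rather than `train_test_split`. The helper name `split_indices` and the row count of 1000 are assumptions for illustration. The check at the end uses `math.isclose`, which avoids depending on exact floating-point equality.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices reproducibly and cut off the test portion."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test size up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000)
train_fraction = len(train_idx) / (len(train_idx) + len(test_idx))
print(len(train_idx), len(test_idx), train_fraction)
```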

Build and train your hiring models.

Do you expect this model to perform differently?

In [48]:
# Select models of your choosing, import them here, and perform a
# hyperparameter search while training them on each of the targets.
#
# Do not use Tensorflow to build a model - you may use scikit-learn.
#
# Determine an appropriate metric for measuring your performance for each
# model/target, and report the test score for that metric. The metric may be
# different for each model/target.
#
# Save your models in a list, with models ordered in the same manner as the
# targets they are predicting in the list "targets" you created above.
# Call the list "my_hiring_models", e.g.
#    my_hiring_models = [hiring_model_target1,
#                        hiring_model_target2,
#                        hiring_model_target3]
#
# You should use multiple print messages to print something like the
# following for each of your models/targets:
#
# To predict the target (target), I trained a (model) model
# and determined the best hyperparameters as (param1 = p1), (param2 = p2)...
# resulting in a (metric) score of (score).

# Customer Satisfaction Rating

# Ridge Regression
hire_cust_ridge_grid = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
hire_cust_ridge_grid.fit(hire_x_train, hire_y_train.iloc[:, 0])

# Lasso
hire_cust_lasso_grid = GridSearchCV(estimator=Lasso(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
hire_cust_lasso_grid.fit(hire_x_train, hire_y_train.iloc[:, 0])

hire_cust_results = {
    'Ridge Regression': -hire_cust_ridge_grid.best_score_,
    'Lasso': -hire_cust_lasso_grid.best_score_
}

print(f"""To predict the target {hire_y_train.columns[0]}, I trained a {min(hire_cust_results, key=hire_cust_results.get)} model
and determined the best hyperparameters as alpha = {hire_cust_lasso_grid.best_params_['alpha']:.6f}
resulting in a MSE score of {-hire_cust_lasso_grid.best_score_:.6f}.\n""")

# Sales Rating

# Ridge Regression
hire_sales_ridge_grid = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
hire_sales_ridge_grid.fit(hire_x_train, hire_y_train.iloc[:, 1])

# Lasso
hire_sales_lasso_grid = GridSearchCV(estimator=Lasso(), param_grid=dict(alpha=alphas), scoring=scoring_metric)
hire_sales_lasso_grid.fit(hire_x_train, hire_y_train.iloc[:, 1])

hire_sales_results = {
    'Ridge Regression': -hire_sales_ridge_grid.best_score_,
    'Lasso': -hire_sales_lasso_grid.best_score_
}

print(f"""To predict the target {hire_y_train.columns[1]}, I trained a {min(hire_sales_results, key=hire_sales_results.get)} model
and determined the best hyperparameters as alpha = {hire_sales_ridge_grid.best_params_['alpha']:.6f}
resulting in a MSE score of {-hire_sales_ridge_grid.best_score_:.6f}.\n""")

# Fired_Fired

# SVM
hire_fired_svm_grid = GridSearchCV(estimator=SVC(kernel='rbf'), param_grid=svm_params, scoring='accuracy')
hire_fired_svm_grid.fit(hire_x_train, hire_y_train.iloc[:, 2])

# KNN
hire_fired_knn_grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=dict(n_neighbors=ks), scoring='accuracy')
hire_fired_knn_grid.fit(hire_x_train, hire_y_train.iloc[:, 2])

# Logistic Regression
hire_fired_log_grid = GridSearchCV(estimator=LogisticRegression(), param_grid=log_params, scoring='accuracy')
hire_fired_log_grid.fit(hire_x_train, hire_y_train.iloc[:, 2])

hire_fired_results = {
    'Support Vector Machine': hire_fired_svm_grid.best_score_,
    'K Nearest Neighbors': hire_fired_knn_grid.best_score_,
    'Logistic Regression': hire_fired_log_grid.best_score_
}

print(f"""To predict the target {hire_y_train.columns[2]}, I trained a {max(hire_fired_results, key=hire_fired_results.get)} model
and determined the best hyperparameters as K = {hire_fired_knn_grid.best_params_['n_neighbors']}
resulting in an accuracy score of {hire_fired_knn_grid.best_score_:.6f}.""")

To predict the target Customer Satisfaction Rating, I trained a Lasso model
and determined the best hyperparameters as alpha = 0.000212
resulting in a MSE score of 0.007443.

To predict the target Sales Rating, I trained a Ridge Regression model
and determined the best hyperparameters as alpha = 0.152642
resulting in a MSE score of 0.006470.

To predict the target Fired_Fired, I trained a K Nearest Neighbors model
and determined the best hyperparameters as K = 9
resulting in an accuracy score of 0.967500.
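The report above names the best model, so the hyperparameters and score it prints should come from that same model. A minimal sketch of one way to keep the three in sync is shown here with plain dictionaries standing in for the fitted searches. The scores and the KNN setting are copied from this notebook's output, while the SVM and logistic regression parameters are hypothetical placeholders.

```python
# Stand-ins for fitted grid searches. "C": 1.0 is a hypothetical placeholder;
# the real best parameters for those two models are not shown in the notebook.
searches = {
    "Support Vector Machine": {"best_score": 0.9650, "best_params": {"C": 1.0}},
    "K Nearest Neighbors": {"best_score": 0.9675, "best_params": {"n_neighbors": 9}},
    "Logistic Regression": {"best_score": 0.93625, "best_params": {"C": 1.0}},
}


def describe_best(target, searches, metric, higher_is_better=True):
    """Pick the best search and report its own parameters and score."""
    pick = max if higher_is_better else min
    name = pick(searches, key=lambda k: searches[k]["best_score"])
    best = searches[name]
    params = ", ".join(f"{k} = {v}" for k, v in best["best_params"].items())
    return (
        f"To predict the target {target}, I trained a {name} model\n"
        f"and determined the best hyperparameters as {params}\n"
        f"resulting in a {metric} score of {best['best_score']:.6f}."
    )


message = describe_best("Fired_Fired", searches, "accuracy")
print(message)
```

Note that `best_score_` on a `GridSearchCV` object is the mean cross-validated score on the training data. A held-out test score would instead come from scoring the chosen estimator on `hire_x_test` and `hire_y_test`.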


In [49]:
hire_cust_results

{'Lasso': 0.007442567656380475, 'Ridge Regression': 0.007447448771292197}

In [50]:
hire_sales_results

{'Lasso': 0.006469602647378789, 'Ridge Regression': 0.006469573420238262}

In [51]:
hire_fired_results

{'K Nearest Neighbors': 0.9675,
 'Logistic Regression': 0.9362499999999999,
 'Support Vector Machine': 0.9650000000000001}

In [52]:
my_hiring_models = [hire_cust_lasso_grid.best_estimator_, hire_sales_ridge_grid.best_estimator_, hire_fired_knn_grid.best_estimator_]

In [53]:
assert len(my_hiring_models)==len(targets)

In [54]:
# Follow this up with a comparison between the performance (test scores) on your
# two sets of models.
#
# You should print something like, for each of the targets:
#   Using interview features for target (target) the model scored (score)
#   versus using the hiring features where it scored (score)

final_cust_results = pd.DataFrame(
    np.array([list(interview_cust_results.values()),
              list(hire_cust_results.values())]),
    index=['Interview Models', 'Hire Models'],
    columns=['Ridge Regression MSE', 'Lasso MSE']
)
final_sales_results = pd.DataFrame(
    np.array([list(interview_sales_results.values()),
              list(hire_sales_results.values())]),
    index=['Interview Models', 'Hire Models'],
    columns=['Ridge Regression MSE', 'Lasso MSE']
)
final_fired_results = pd.DataFrame(
    np.array([list(interview_fired_results.values()),
              list(hire_fired_results.values())]),
    index=['Interview Models', 'Hire Models'],
    columns=['Support Vector Machine', 'K Nearest Neighbors', 'Logistic Regression']
)

In [55]:
final_cust_results

|  | Ridge Regression MSE | Lasso MSE |
|---|---|---|
| Interview Models | 0.012202 | 0.01208 |
| Hire Models | 0.007447 | 0.007443 |


In [56]:
final_sales_results

|  | Ridge Regression MSE | Lasso MSE |
|---|---|---|
| Interview Models | 0.006842 | 0.006843 |
| Hire Models | 0.00647 | 0.00647 |


In [57]:
final_fired_results

|  | Support Vector Machine | K Nearest Neighbors | Logistic Regression |
|---|---|---|---|
| Interview Models | 0.93 | 0.93 | 0.935 |
| Hire Models | 0.965 | 0.9675 | 0.93625 |
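The task comment asks for a printed comparison per target. A minimal sketch of that message follows, using the best score for each target from the results above (lowest MSE for the two ratings, highest accuracy for `Fired_Fired`). The dictionary layout and the helper name `comparison_lines` are assumptions for illustration.

```python
# Best score per target, copied from the results above.
# Each entry: (metric name, interview-feature score, hiring-feature score).
comparison = {
    "Customer Satisfaction Rating": ("MSE", 0.012080, 0.007443),
    "Sales Rating": ("MSE", 0.006842, 0.006470),
    "Fired_Fired": ("accuracy", 0.935, 0.9675),
}


def comparison_lines(comparison):
    """Build one comparison sentence per target."""
    lines = []
    for target, (metric, interview_score, hire_score) in comparison.items():
        lines.append(
            f"Using interview features for target {target} the model scored "
            f"{interview_score:.6f} ({metric}) versus using the hiring features "
            f"where it scored {hire_score:.6f} ({metric})"
        )
    return lines


lines = comparison_lines(comparison)
for line in lines:
    print(line)
```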


## Model Evaluation

In this section we'll create example applicants and see how they would fare based on their applications and your models. First, let's create some example applications. We've created four applicants, and you'll need to create a fifth one in the cell below.

In [58]:
applicant_1 = {
    'First Name': "Stefon",
    'Last Name': "Smith",
    'Date of Birth': "1989-12-24",
    'Address': "4892 Jessica Turnpike Suite 781",
    'Zipcode': 86553,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Proficient",
    'Spanish Fluency': "Basic",
    'Education': "Associates",
    'High School GPA': 2.9,
    'College GPA': 3.1,
    'Years of Experience': 5,
    'Years of Volunteering': 2,
    'Myers Briggs Type': "ESFJ",
    'Twitter followers': 524,
    'Instagram Followers': 857,
    'Requires Sponsorship': True
}
applicant_2 = {
    'First Name': "Sarah",
    'Last Name': "Chang",
    'Date of Birth': "1995-04-13",
    'Address': "9163 Rebecca Loop",
    'Zipcode': 43711,
    'Gender': "Female",
    'Race / Ethnicity': "Hispanic",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Fluent",
    'Education': "Undergraduate",
    'High School GPA': 4.0,
    'College GPA': 3.8,
    'Years of Experience': 5,
    'Years of Volunteering': 0,
    'Myers Briggs Type': "ISTJ",
    'Twitter followers': 97,
    'Instagram Followers': 204,
    'Requires Sponsorship': False
}
applicant_3 = {
    'First Name': "Daniel",
    'Last Name': "Richardson",
    'Date of Birth': "1998-10-23",
    'Address': "436 Lauren Stream",
    'Zipcode': 54821,
    'Gender': "Male",
    'Race / Ethnicity': "Black",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Proficient",
    'Education': "Undergraduate",
    'High School GPA': 3.0,
    'College GPA': 3.2,
    'Years of Experience': 1,
    'Years of Volunteering': 1,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 2087,
    'Instagram Followers': 3211,
    'Requires Sponsorship': False
}

applicant_4 = {
    'First Name': "Billy",
    'Last Name': "Bob",
    'Date of Birth': "1999-11-03",
    'Address': "412 Railway Stream",
    'Zipcode': 43711,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Basic",
    'Spanish Fluency': "Fluent",
    'Education': "Undergraduate",
    'High School GPA': 2.0,
    'College GPA': 3.5,
    'Years of Experience': 1,
    'Years of Volunteering': 1,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 207,
    'Instagram Followers': 309,
    'Requires Sponsorship': False
}

# Create a fictional applicant by copying the attributes above from any of the
# other applicants and/or adding example values that you would be curious to
# see how your model treats. For example, create an applicant you'd be sure to
# reject or sure to hire.

applicant_5 = {
    'First Name': "Billy",
    'Last Name': "Goat",
    'Date of Birth': "1999-09-25",
    'Address': "1050 Secrest St",
    'Zipcode': 80401,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Basic",
    'Education': "None",
    'High School GPA': 3.0,
    'College GPA': 1.94,
    'Years of Experience': 0,
    'Years of Volunteering': 0,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 0,
    'Instagram Followers': 0,
    'Requires Sponsorship': False
}

In [59]:
for key in applicant_4.keys():
    assert key in applicant_5.keys()

In [60]:
new_people = [applicant_1, applicant_2, applicant_3, applicant_4, applicant_5]
new_people_df = pd.DataFrame.from_records(new_people)

In [61]:
new_people_df

|  | First Name | Last Name | Date of Birth | Address | Zipcode | Gender | Race / Ethnicity | English Fluency | Spanish Fluency | Education | High School GPA | College GPA | Years of Experience | Years of Volunteering | Myers Briggs Type | Twitter followers | Instagram Followers | Requires Sponsorship |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Stefon | Smith | 1989-12-24 | 4892 Jessica Turnpike Suite 781 | 86553 | Male | Caucasian | Proficient | Basic | Associates | 2.9 | 3.1 | 5 | 2 | ESFJ | 524 | 857 | True |
| 1 | Sarah | Chang | 1995-04-13 | 9163 Rebecca Loop | 43711 | Female | Hispanic | Fluent | Fluent | Undergraduate | 4.0 | 3.8 | 5 | 0 | ISTJ | 97 | 204 | False |
| 2 | Daniel | Richardson | 1998-10-23 | 436 Lauren Stream | 54821 | Male | Black | Fluent | Proficient | Undergraduate | 3.0 | 3.2 | 1 | 1 | ENFJ | 2087 | 3211 | False |
| 3 | Billy | Bob | 1999-11-03 | 412 Railway Stream | 43711 | Male | Caucasian | Basic | Fluent | Undergraduate | 2.0 | 3.5 | 1 | 1 | ENFJ | 207 | 309 | False |
| 4 | Billy | Goat | 1999-09-25 | 1050 Secrest St | 80401 | Male | Caucasian | Fluent | Basic | None | 3.0 | 1.94 | 0 | 0 | ENFJ | 0 | 0 | False |


### Future Applicants Data Cleaning



In [62]:
# Apply all the cleaning and dummy variable creation you did above to this new
# DataFrame. You can copy your code from above and modify it to apply to
# new_people_df instead of df.

new_people_df['MBTI_EI'] = np.where(new_people_df['Myers Briggs Type'].str[0]=='E', 'E', 'I')
new_people_df['MBTI_SN'] = np.where(new_people_df['Myers Briggs Type'].str[1]=='S', 'S', 'N')
new_people_df['MBTI_TF'] = np.where(new_people_df['Myers Briggs Type'].str[2]=='T', 'T', 'F')
new_people_df['MBTI_JP'] = np.where(new_people_df['Myers Briggs Type'].str[3]=='J', 'J', 'P')
new_people_df.drop(['Myers Briggs Type'], axis=1, inplace=True)

categorical_columns = ['Zipcode', 'Gender', 'Race / Ethnicity', 'English Fluency',
                       'Spanish Fluency', 'Education', 'Requires Sponsorship',
                       'MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP']
for column in categorical_columns:
    new_people_df[column] = new_people_df[column].astype('category')

new_people_df = pd.get_dummies(new_people_df, columns=categorical_columns)
# Add any training-time feature column that the new applicants did not produce.
for column in interview_features:
    if column not in new_people_df.columns:
        new_people_df[column] = 0

redundant_features = ['Gender_Female', 'Requires Sponsorship_False', 'MBTI_EI_I', 'MBTI_SN_N', 'MBTI_TF_F', 'MBTI_JP_P']
new_people_df.drop(redundant_features, axis=1, inplace=True, errors='ignore')

new_people_df['Date of Birth'] = new_people_df['Date of Birth'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
new_people_df['Age'] = new_people_df['Date of Birth'].apply(lambda x: calculate_age(x))
new_people_df.drop(['Date of Birth'], axis=1, inplace=True)                       
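`calculate_age` is defined earlier in the notebook and is not shown in this section. One common definition, counting full years elapsed, can be sketched as below. The function name `age_on`, its two-argument signature, and the reference date are assumptions; the reference date was chosen only because it is consistent with the applicant ages shown in this notebook.

```python
from datetime import date, datetime


def age_on(born, today):
    """Full years elapsed between `born` and `today`."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


reference = date(2020, 12, 1)  # assumed reference date, not taken from the notebook
birth_dates = ["1989-12-24", "1995-04-13", "1998-10-23", "1999-11-03", "1999-09-25"]
ages = [
    age_on(datetime.strptime(d, "%Y-%m-%d").date(), reference)
    for d in birth_dates
]
print(ages)
```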

In [63]:
nan_columns = new_people_df.columns[new_people_df.isna().any()].tolist()
new_people_df[nan_columns] = new_people_df[nan_columns].fillna(new_people_df[nan_columns].mean())

In [64]:
new_people_df

|  | First Name | Last Name | Address | High School GPA | College GPA | Years of Experience | Years of Volunteering | Twitter followers | Instagram Followers | Zipcode_43711 | Zipcode_54821 | Zipcode_80401 | Zipcode_86553 | Gender_Male | Race / Ethnicity_Black | Race / Ethnicity_Caucasian | Race / Ethnicity_Hispanic | English Fluency_Basic | English Fluency_Fluent | English Fluency_Proficient | Spanish Fluency_Basic | Spanish Fluency_Fluent | Spanish Fluency_Proficient | Education_Associates | Education_None | Education_Undergraduate | Requires Sponsorship_True | MBTI_EI_E | MBTI_SN_S | MBTI_TF_T | MBTI_JP_J | Education_Graduate | Education_High School | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Stefon | Smith | 4892 Jessica Turnpike Suite 781 | 2.9 | 3.1 | 5 | 2 | 524 | 857 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 30 |
| 1 | Sarah | Chang | 9163 Rebecca Loop | 4.0 | 3.8 | 5 | 0 | 97 | 204 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 25 |
| 2 | Daniel | Richardson | 436 Lauren Stream | 3.0 | 3.2 | 1 | 1 | 2087 | 3211 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 22 |
| 3 | Billy | Bob | 412 Railway Stream | 2.0 | 3.5 | 1 | 1 | 207 | 309 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 21 |
| 4 | Billy | Goat | 1050 Secrest St | 3.0 | 1.94 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 21 |


In [65]:
for feature in interview_features:
    assert feature in new_people_df.columns
for feature in hire_features:
    assert feature in new_people_df.columns
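The cleaning cell above fills in any training-time dummy column that the new applicants did not produce. The same idea can be sketched with plain dictionaries: encode against the categories seen during training, so an unseen category simply leaves every known column at zero. The zipcode list is copied from the training columns listed earlier; the helper name `one_hot_zipcode` is hypothetical.

```python
# Zipcodes that appear as dummy columns in the training DataFrame.
TRAIN_ZIPCODES = [24310, 30167, 43357, 43711, 54821,
                  55864, 59010, 60531, 72361, 86553]


def one_hot_zipcode(zipcode, known=TRAIN_ZIPCODES):
    """Encode a zipcode against the training categories only."""
    return {f"Zipcode_{z}": int(zipcode == z) for z in known}


seen = one_hot_zipcode(43711)    # a zipcode present in training
unseen = one_hot_zipcode(80401)  # applicant_5's zipcode, absent from training

print(sum(seen.values()), sum(unseen.values()))
```

With pandas, `DataFrame.reindex(columns=..., fill_value=0)` is one way to get the same alignment in a single call. It also drops columns the model never saw, such as `Zipcode_80401`.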

### Future Applicant Model(s) Predictions

Now let's predict what the applicants' scores would be. Use the models you saved in `my_interview_models` and `my_hiring_models` to predict their scores.

In [66]:
# Save your predictions as new_people_interview and new_people_hire.
# Each of these should be a list of dictionaries, with one dictionary for
# each applicant. The keys of the dictionaries should be the same as the
# elements/strings in the "targets" list you created above.

new_people_interview_features = new_people_df[interview_features]
new_people_hire_features = new_people_df[hire_features]

# Predict every target with the interview models, then build one dictionary
# per applicant keyed by target name.
interview_cust_pred = my_interview_models[0].predict(new_people_interview_features)
interview_sales_pred = my_interview_models[1].predict(new_people_interview_features)
interview_fired_pred = my_interview_models[2].predict(new_people_interview_features)
new_people_interview = [
    {targets[0]: cust, targets[1]: sales, targets[2]: fired}
    for cust, sales, fired in zip(interview_cust_pred, interview_sales_pred, interview_fired_pred)
]

# Repeat with the hiring models and hiring features.
hire_cust_pred = my_hiring_models[0].predict(new_people_hire_features)
hire_sales_pred = my_hiring_models[1].predict(new_people_hire_features)
hire_fired_pred = my_hiring_models[2].predict(new_people_hire_features)
new_people_hire = [
    {targets[0]: cust, targets[1]: sales, targets[2]: fired}
    for cust, sales, fired in zip(hire_cust_pred, hire_sales_pred, hire_fired_pred)
]

In [67]:
new_people_interview

[{'Customer Satisfaction Rating': 1.9936867449458988,
  'Fired_Fired': 0,
  'Sales Rating': 1.9382944647355316},
 {'Customer Satisfaction Rating': 1.918468762519378,
  'Fired_Fired': 0,
  'Sales Rating': 2.0150816590613996},
 {'Customer Satisfaction Rating': 1.1679860625489105,
  'Fired_Fired': 0,
  'Sales Rating': 1.3062773011243294},
 {'Customer Satisfaction Rating': 0.8516349305474354,
  'Fired_Fired': 0,
  'Sales Rating': 1.0434187523468657},
 {'Customer Satisfaction Rating': 0.20858467660751143,
  'Fired_Fired': 0,
  'Sales Rating': 0.11012494787581634}]

In [68]:
new_people_hire

[{'Customer Satisfaction Rating': 1.9821925225530626,
  'Fired_Fired': 0,
  'Sales Rating': 1.926226365525524},
 {'Customer Satisfaction Rating': 1.9418280526308513,
  'Fired_Fired': 0,
  'Sales Rating': 2.0198036936068458},
 {'Customer Satisfaction Rating': 1.175339140798052,
  'Fired_Fired': 0,
  'Sales Rating': 1.3088800584141764},
 {'Customer Satisfaction Rating': 0.8705701186266215,
  'Fired_Fired': 0,
  'Sales Rating': 1.0493949418949342},
 {'Customer Satisfaction Rating': 0.20379951179270311,
  'Fired_Fired': 1,
  'Sales Rating': 0.11314320638861647}]

In [69]:
for new_person_interview in new_people_interview:
    for key in targets:
        assert key in new_person_interview.keys()

for new_person_hire in new_people_hire:
    for key in targets:
        assert key in new_person_hire.keys()

### Ranking Evaluation

Your manager notes that given that you might have more than one prediction target, the model predictions aren't really ranking or selecting people. There is no "best" person because there's more than one metric to look through. A human still needs to look through the predictions so your models don't yet really do what SellsALOT has asked for.

Your manager asks you to create a synthetic scalar variable that is calculated from the multiple target predictions of an individual person. That way we'll have one metric by which we can rank people. You need to create that synthetic metric (score).

Some candidate approaches:

1. Incorporating a binary value, x:
    - You can multiply x by some arbitrary value and add/subtract it to/from the total score:
      - score = t1 + t2 * x
    - You can multiply your entire score by the binary value to say something like "if not x, then the score is 0", e.g.:
      - score = x * (t1 + t2)
1. Balancing between different target values:
    - You can balance between different values by adding a multiplier. If t1 is twice as important as t2, then the score can be something like:
      - score = 2 * t1 + t2
1. Some combination of the items above
1. Something creative you devise on your own!
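The candidate approaches above can be sketched as three small functions. The function names, the default weights, and the example values are made up for illustration; `t1` and `t2` stand for numeric target predictions and `x` for a binary one.

```python
def offset_score(t1, x, weight):
    """score = t1 + weight * x: the binary value shifts the score up or down."""
    return t1 + weight * x


def gated_score(t1, t2, x):
    """score = x * (t1 + t2): the score is zero whenever x is 0."""
    return x * (t1 + t2)


def weighted_score(t1, t2, w1=2.0, w2=1.0):
    """score = w1 * t1 + w2 * t2: t1 counts twice as much as t2 by default."""
    return w1 * t1 + w2 * t2


t1, t2 = 2.0, 1.5  # hypothetical target predictions
print(offset_score(t1, x=1, weight=-0.5))
print(gated_score(t1, t2, x=0))
print(weighted_score(t1, t2))
```

The gated form is the harshest of the three, since a single binary prediction can erase everything else. The offset form lets the size of `weight` decide how much the binary prediction matters relative to the numeric ones.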

In [70]:
def calculate_synthetic_metric(targets):
    """Calculates a synthetic metric based on the targets of an individual.
    Your metric should result in a higher score being a better one.

    Args:
      targets (dict): The dictionary with keys as the target names and
                      values as the target values/predictions

    Returns:
      float: The synthetic score produced from the target values
    """

    # Match keys loosely in case someone's df used a different name,
    # e.g. Fired_Fired vs Fired_Current Employee
    cust_value = [value for key, value in targets.items() if 'customer satisfaction' in key.lower()][0]
    sales_value = [value for key, value in targets.items() if 'sales' in key.lower()][0]
    fired_value = [value for key, value in targets.items() if 'fired' in key.lower()][0]
    return sales_value + 0.75*cust_value - 0.075*fired_value

Let's try out the synthetic metric on the original data and see if you're happy with the result based on the past data.

In [71]:
# Add a column named "Metric" to the **original** DataFrame with the synthetic metric applied to each row

df['Metric'] = df[targets].apply(lambda x: calculate_synthetic_metric(x), axis=1)

In [72]:
assert "Metric" in df.columns
assert np.issubdtype(df["Metric"].dtype, np.number)

Are you happy with the synthetic score based on the values for each person here? Go back and update it until you're satisfied with this score.

In [73]:
# Explain the logic behind your synthetic scoring mechanism and save it as synthetic_score_reasoning

synthetic_score_reasoning = """I believe there should be more weight associated with the sales score. Even though the saying goes, 'the customer is always right',
I believe it is more important for an employee to maintain a better sales score. Therefore, I decreased the weight of the customer score to 75%. For the fired binary
variable I felt it needed to negatively impact the person if they have a fired response."""

In [74]:
assert len(synthetic_score_reasoning) > 100

Now let's calculate the synthetic scores for the new people (applicants) and see if you're satisfied with your models' rankings for interviewing and hiring.

In [75]:
new_people_interview_score = [calculate_synthetic_metric(target_values) for target_values in new_people_interview]
new_people_hire_score = [calculate_synthetic_metric(target_values) for target_values in new_people_hire]

In [76]:
best_interview_person = new_people[new_people_interview_score.index(max(new_people_interview_score))]
best_hire_person = new_people[new_people_hire_score.index(max(new_people_hire_score))]

Based on these scores, your model selected the following people:

In [77]:
print(f"""
Your interviewing model selected {best_interview_person['First Name']} {best_interview_person['Last Name']} as the person to interview.

Your hiring model selected {best_hire_person['First Name']} {best_hire_person['Last Name']} as the person to hire.
""")


Your interviewing model selected Sarah Chang as the person to interview.

Your hiring model selected Sarah Chang as the person to hire.
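Taking only the maximum hides how close the runners-up are. A minimal sketch of a full ranking follows, using the interview-model predictions printed earlier (rounded to four decimals) and the same weights as `calculate_synthetic_metric`. The names `predictions` and `synthetic_score` are hypothetical.

```python
# Interview-model predictions copied from the output above, rounded to four decimals.
predictions = {
    "Stefon Smith": {"Customer Satisfaction Rating": 1.9937, "Sales Rating": 1.9383, "Fired_Fired": 0},
    "Sarah Chang": {"Customer Satisfaction Rating": 1.9185, "Sales Rating": 2.0151, "Fired_Fired": 0},
    "Daniel Richardson": {"Customer Satisfaction Rating": 1.1680, "Sales Rating": 1.3063, "Fired_Fired": 0},
    "Billy Bob": {"Customer Satisfaction Rating": 0.8516, "Sales Rating": 1.0434, "Fired_Fired": 0},
    "Billy Goat": {"Customer Satisfaction Rating": 0.2086, "Sales Rating": 0.1101, "Fired_Fired": 0},
}


def synthetic_score(p):
    """Same weighting as calculate_synthetic_metric in this notebook."""
    return (p["Sales Rating"]
            + 0.75 * p["Customer Satisfaction Rating"]
            - 0.075 * p["Fired_Fired"])


# Sort applicants from best to worst score.
ranking = sorted(predictions, key=lambda name: synthetic_score(predictions[name]), reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {synthetic_score(predictions[name]):.4f}")
```

The top two applicants are separated by only about 0.02, which is smaller than the 0.075 penalty for a predicted firing, so a different fired prediction for either of them would change the order.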



Are you happy with these results? Feel free to modify `applicant_5`'s attributes and see how your model performs based on changing these values.

In [78]:
# Describe your level of satisfaction with your models
# Did you edit your model based on the results? What did you change?
# What general conclusions did you get from the exercise
# Save your answer to the above questions as conclusions

conclusions = """I am actually quite content with the way this turned out. It seems like there's a correlation between the variables that I believed would be most influential. I created what I felt would
be a 'bad' applicant, and even though they got through for an interview, the stricter hiring model classified them as fired...poor Billy Goat. Being a stats major, but not an avid Python user, I found this
exercise very helpful for working with the pandas library as well as the regression and classification functions. Overall, this was an amazing summary of a lot of the techniques we covered in this class. I liked
that it wasn't so heavy on the end material, as some of that is still fresh. It allowed us to work on it gradually rather than being at a standstill."""
print(conclusions)

I am actually quite content with the way this turned out. It seems like there's a correlation between the variables that I believed would be most influential. I created what I felt would
be a 'bad' applicant, and even though they got through for an interview, the stricter hiring model classified them as fired...poor Billy Goat. Being a stats major, but not an avid Python user, I found this
exercise very helpful for working with the pandas library as well as the regression and classification functions. Overall, this was an amazing summary of a lot of the techniques we covered in this class. I liked
that it wasn't so heavy on the end material, as some of that is still fresh. It allowed us to work on it gradually rather than being at a standstill.


In [79]:
assert len(conclusions) > 100

## Feedback

In [80]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    return "All good!"

In [81]:
print(feedback())

All good!
