## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. Cells in which "# YOUR CODE HERE" is found are the cells where your graded code should be written.
2. In order to test out or debug your code you may also create notebook cells or edit existing notebook cells other than "# YOUR CODE HERE". We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
2. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
3. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
4. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
5. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
6. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
7. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
8. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.

# Case Study: Job Placement

### Predicting job performance of candidate hires

You work for a software startup, Predict All The Things Inc. (PALT), and are approached by the CEO to build an algorithm that can help sift through resumes. PALT just closed a $3 million Series A round of funding and the CEO just landed a deal with a national retailer, SellsALOT, to help them with hiring Sales Associates.

They are able to obtain data on all the employees that work as Sales Associates throughout their stores as well as customer satisfaction and sales performance scores.

In this case study, you are tasked with building a model to predict job performance to assist HR in selecting applicants to interview.

The data was provided to you by the new HR intern, Keegan. This is the email you got from Keegan with the attached data.

>Hi!
>
>I hope you're doing well. I've attached the data we have about all employees. Please ensure this data stays confidential and is not shared with anyone who has not signed the NDA. The columns have all the information we have about our employees and the scoring rating that they've received from our performance monitors. We also have some employees who were fired and I have included those as well.
>
>I was also able to dig up some more information about our employees that I found on the internet. It took a lot of time but I hope it helps in making the model even better. Can't wait to see this thing in action. Everyone here is very excited about our collaboration with you and we look forward to this making hiring a lot easier for us.
>
>Thanks,
>
>Keegan Thiel
>
>HR Intern
>
>Human Resources
>
>SellsALOT


Data is available in the `employees.csv` file provided. 


SellsALOT is an Equal Opportunity Employer which is an employer who agrees not to discriminate against any employee or job applicant based on race, color, religion, national origin, sex, physical or mental disability, or age.


### Part 1: Data Cleaning

In part 1 of this notebook you will explore the data, manually decide which features you will keep for your models, and convert all features into numeric values that can be ingested by ML models.

By the __Part 1 deadline__, you will need to complete the data cleaning, save your cleaned data as a new CSV file, and __submit that CSV file in Canvas for grading__.


### Part 2: Modeling

In Part 2 you will build, train, and evaluate (on the test set) six different models.  
- __Interviewing__: You will build three models that are meant for selecting candidates for job __interviews__. Each model has a different target but takes in the same features.
- __Hiring__: You will build three additional models that are meant for selecting __direct hire__ candidates. Like the interviewing models, each model has a different target but the same input features.

After evaluating your individual models, you'll then create a new candidate-scoring function (not a model-scoring function) that combines all three predictions into a scalar value that can be used to rank applicants for interviewing, or for hiring.

Finally, you'll use data from some "new" applicants, as well as create some of your own, and observe your models' predictions for that data.

By the __Part 2 deadline__ you'll need to finalize and submit this notebook.

### Grading

- The Part 1 CSV file is worth __20 points__.  
- This notebook, both the Data Cleaning and Modeling sections together, is worth __80 points__.  
- The case study is worth __100 points in total__.

## Import packages that are likely to be useful

### However, __do not__ use TensorFlow to build your models.

Below we import packages that are needed or may be useful. You may import additional packages as you see fit, with the exception of TensorFlow. You may use scikit-learn. **Ensure that you import additional packages in cells that say "### YOUR CODE HERE".**

In [1]:
import pandas as pd
import matplotlib
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
import datetime
from datetime import date
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

## Data Cleaning

First, let's investigate the data that we received from Keegan.

If you are using Colab, **make sure you upload the employees.csv file** so it can be loaded in the next cell. To do so, click the file folder icon in the Colab sidebar. You will see the contents of the current directory, which will include a "sample_data" folder. Click the upload icon (a piece of paper with an upward-pointing arrow, just below the word "Files"). Locate the employees.csv file that you downloaded from Canvas to your local machine and open/upload the file.

In [2]:
df = pd.read_csv("employees.csv")

In [None]:
df.head(10)

In [None]:
df.describe(include="all")

In [None]:
print("The columns of data are:")
list(df.columns)

Before building any models, your manager has asked you to **convert all the feature data into formats that can easily be used for training and testing a variety of models**. This means:
1. Splitting the 16 Myers Briggs types into 4 subtypes
2. Converting categorical features into dummy binary features
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

In addition, you should remove columns that contain redundant information after going through the process above, e.g., removing the 'Date of Birth' column after an 'Age' column is added.


### MBTI Splitting

The [Myers Briggs Type Indicator](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator) (MBTI) describes people as one of two types for each of:

* extraversion (E) or introversion (I)
* sensing (S) or intuition (N)
* thinking (T) or feeling (F)
* judgment (J) or perception (P)

It would make more sense for us to represent people as one or the other of these instead of creating all the possible cases. That way a model can learn based on each of those factors as well as their combination. 

Your next task is to split the MBTI column into four columns in the dataframe, with the following column names and values:

* MBTI_EI with value `E` or `I`
* MBTI_SN with value `S` or `N`
* MBTI_TF with value `T` or `F`
* MBTI_JP with value `J` or `P`

that correspond to the same row's Myers Briggs Type, and add those columns to your DataFrame, ```df```. Consider using the Series ```apply()``` method.

Afterwards, `drop` (remove) the original "Myers Briggs Type" column.

In [6]:
# YOUR CODE HERE
df[['MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP']] = df['Myers Briggs Type'].apply(lambda x: pd.Series(list(x)))
df = df.drop(columns='Myers Briggs Type')

In [7]:
assert len(set(df["MBTI_EI"])) == 2
assert "E" in set(df["MBTI_EI"]) and "I" in set(df["MBTI_EI"])
assert len(set(df["MBTI_SN"])) == 2
assert "S" in set(df["MBTI_SN"]) and "N" in set(df["MBTI_SN"])
assert len(set(df["MBTI_TF"])) == 2
assert "T" in set(df["MBTI_TF"]) and "F" in set(df["MBTI_TF"])
assert len(set(df["MBTI_JP"])) == 2
assert "J" in set(df["MBTI_JP"]) and "P" in set(df["MBTI_JP"])
assert "Myers Briggs Type" not in list(df.columns)
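As an alternative to `apply()`, the split described above can also be sketched with pandas string indexing. The snippet below is only an illustration on a small made-up DataFrame (the name `toy` and its three rows are assumptions, not part of `employees.csv`); it relies on every MBTI code having exactly four letters in a fixed order.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical rows).
toy = pd.DataFrame({"Myers Briggs Type": ["ESFJ", "ISTJ", "ENTP"]})

# One output column per letter position of the four-letter code.
mbti_columns = ["MBTI_EI", "MBTI_SN", "MBTI_TF", "MBTI_JP"]
for position, column in enumerate(mbti_columns):
    # .str[position] picks the character at that index from every row.
    toy[column] = toy["Myers Briggs Type"].str[position]

# The original column is now redundant.
toy = toy.drop(columns="Myers Briggs Type")
print(toy)
```

Whether this or the `apply()` approach reads better is a matter of taste; both should produce the same four columns.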

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. Converting categorical features into dummy binary features
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

### Categorical to Dummy Variables

Dummy variables are variables that allow us to convert a category into several binary variables. For example, if we had a color value that we were storing and we knew it could only have the values `red`, `green`, and `blue`, then instead of storing the color as those strings, we can store three binary variables: `is_red`, `is_green`, and `is_blue`. This is identical to "one-hot" encoding.

We can do this in pandas easily by using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html).
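To make the color example concrete, here is a minimal sketch on a made-up DataFrame (the `colors` frame is an assumption for illustration only). Note that `get_dummies` names the new columns `<column>_<value>` rather than `is_<value>`, and that passing `columns=` removes the original column from the result.

```python
import pandas as pd

# Toy data matching the red/green/blue example (hypothetical rows).
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Replace the "color" column with one binary column per distinct value.
dummies = pd.get_dummies(colors, columns=["color"])

print(dummies.columns.tolist())
# Exactly one dummy is set per row, which is what "one-hot" means.
print(dummies.sum(axis=1).tolist())
```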

In [8]:
# Review the DataFrame columns and identify the columns that contain categorical
# features and save them to a list called "categorical_columns". Afterwards
# we will convert those features to one-hot encoded features, using get_dummies().

# "Categorical" means that there is a discrete (albeit large in some cases)
# number of possible options, and that those values have no ordinal (rankable)
# meaning.
# A slightly tricky one is zip codes. Are zip codes ordinal? Zip codes do increase,
# generally, as one moves from east to west in the US. But there is otherwise
# little ordinal relationship, so you may treat Zipcode as a categorical variable for
# this case study if you wish to use it as a feature for your ML models.

# While the following are technically categorical, they should have little to no
# predictive power, so don't include them in your list of categorical columns since
# they shouldn't be used as features for your ML model in any case.
#     'First Name'
#     'Last Name'
#     'Address'
    
# YOUR CODE HERE
categorical_columns = ['Zipcode', 'Gender', 'Race / Ethnicity', 'English Fluency', 'Spanish Fluency', 'Education', 'MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP', 'Requires Sponsorship', 'Fired']


In [9]:
assert len(categorical_columns) > 8
for category in categorical_columns:
    assert category in df.columns

In [10]:
# Before we get the dummy variables, we need to make sure that all these 
# categorical columns are actually recognized by pandas to be of 'category' type.

for column in categorical_columns:
    df[column] = df[column].astype('category')

In [11]:
# For every column in the categorical_columns,
# calculate the dummy variables and add them to the dataframe

# YOUR CODE HERE
df = pd.get_dummies(df, columns=categorical_columns)

In [12]:
assert len(list(df.columns)) > 45

In [None]:
print("The current columns are:")
list(df.columns)

In [14]:
# Now drop all the categorical features columns (those in your
# "categorical_columns" list) from the dataframe, so that we don't
# have duplicate information stored.

# YOUR CODE HERE
# No explicit drop is needed here: pd.get_dummies(df, columns=categorical_columns)
# already removes the original categorical columns when it adds the dummies,
# so calling df.drop(columns=categorical_columns) again would raise a KeyError.

In [None]:
print("The current columns are:")
list(df.columns)

In [16]:
assert 55 > len(list(df.columns)) > 30

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. ~~Converting categorical features into dummy binary features~~
3. Calculating age based on date of birth
4. Dealing with missing (NaN) values in the data

### Age Calculation

In [17]:
def calculate_age(born):
    """Calculates age (in years) based on date of birth
       using https://stackoverflow.com/a/9754466/818687

    Args:
        born (datetime): The date of birth

    Returns:
        int: The age based on date of birth
    """
    
    # We'll set a fixed date for "today" rather than use the actual date,
    # so the data will be the same regardless of when you run this notebook.
    today = datetime.datetime.strptime("2021-11-20", "%Y-%m-%d")
    
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

Add an "Age" column to the dataframe, with the help of the ```calculate_age()``` function above. Afterwards, remove the "Date of Birth" column.

The input to ```calculate_age()``` should be a datetime object. Review the ```datetime.datetime.strptime()```
function and [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) to determine how to convert the "Date of Birth" date string into a `datetime` object. Take note that __capitalization matters__ in the format code, e.g., the difference between (m)onth and (M)inute.
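The capitalization point can be illustrated with a short sketch. It assumes the date strings look like `"1989-12-24"` (year-month-day); check the actual column before relying on that layout. The second half repeats the age arithmetic from `calculate_age()` by hand, using the same fixed "today" of 2021-11-20.

```python
import datetime

date_string = "1989-12-24"  # assumed "YYYY-MM-DD" layout

# %Y = four-digit year, %m = month, %d = day of month.
born = datetime.datetime.strptime(date_string, "%Y-%m-%d")

# Uppercase %M means minutes, so "12" is parsed as minutes and the
# month silently falls back to its default of 1. No error is raised.
mistaken = datetime.datetime.strptime(date_string, "%Y-%M-%d")

print(born)      # 1989-12-24 00:00:00
print(mistaken)  # 1989-01-24 00:12:00

# Age on the fixed reference date: subtract one year if the birthday
# has not yet occurred in the reference year.
today = datetime.datetime(2021, 11, 20)
had_birthday_yet = (today.month, today.day) >= (born.month, born.day)
age = today.year - born.year - (not had_birthday_yet)
print(age)  # 31
```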

In [18]:
# YOUR CODE HERE
df['Date of Birth'] = pd.to_datetime(df['Date of Birth'])
df['Age'] = df['Date of Birth'].apply(calculate_age)
df = df.drop(columns='Date of Birth')

In [19]:
assert df["Age"].min() == 23
assert df["Age"].max() == 39
assert np.isclose(df["Age"].median(), 31)
assert "Date of Birth" not in list(df.columns)

1. ~~Splitting the 16 Myers Briggs types into 4 subtypes~~
2. ~~Converting categorical features into dummy binary features~~
3. ~~Calculating age based on date of birth~~
4. Dealing with missing (NaN) values in the data

## Handle NaN values

In [20]:
# Create a list of columns that contain NaN values

nan_columns = df.columns[df.isna().any()].tolist()
print(nan_columns)

['High School GPA', 'College GPA']


We see that data is not truly "missing" any values, but for people that did not attend or complete high school or college, there are no GPA values.

How should you deal with this? If you had a large number of people without GPAs, you might consider making separate models for people with GPAs and for people without. For this case, your manager asks you to make sure there's one model for everyone. She recommends one of the following options:

1. Replace NaN values with the mean value of all the non-NaN values.
1. Replace NaN values with 0
1. Replace NaN values with some other value
1. Create a model to predict people's GPA values from other attributes and fill them in with those values

Consider the assumptions of each approach:
1. Replacing with the mean assumes that that person would receive the average of others who work at this company.
1. Replacing with 0 assumes that that person would fail if they attended high school or college.
1. Replacing with some arbitrary value will have assumptions based on what that value is
1. Creating a model to predict people's GPA values from the other attributes in the data assumes that those attributes are predictive of GPA. 


Regardless of the approach you take, just make sure there are no more NaN values. 
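To see how two of the options above change the data, here is a small sketch on a made-up Series (the values are invented for illustration). The extra indicator column is one possible "other approach", not something the assignment requires.

```python
import numpy as np
import pandas as pd

# Toy GPA column with two missing entries (hypothetical values).
gpa = pd.Series([3.0, np.nan, 4.0, np.nan, 2.0], name="College GPA")

# Option 1: fill with the mean of the observed values.
# Series.mean() skips NaN by default, so this is (3 + 4 + 2) / 3 = 3.0.
filled_with_mean = gpa.fillna(gpa.mean())

# Option 2: fill with 0, which pulls the column average down.
filled_with_zero = gpa.fillna(0)

# Optional extra: remember which rows were originally missing, so a
# model can tell "no GPA" apart from a genuinely reported value.
has_gpa = gpa.notna().astype(int)

print(filled_with_mean.tolist(), filled_with_mean.mean())
print(filled_with_zero.tolist(), filled_with_zero.mean())
print(has_gpa.tolist())
```

Mean filling leaves the column average unchanged, while zero filling lowers it; which distortion is more acceptable depends on the assumptions listed above.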

In [21]:
# For each of the two columns that contain NaN values, replace the NaN values
# with numerical values, using one of the approaches above, or some other approach
# that you devise yourself.

# YOUR CODE HERE
# you can do in 1 line using df.fillna(0) but I didn't want to make an error in haste
df['High School GPA'] = df['High School GPA'].fillna(0)
df['College GPA'] = df['College GPA'].fillna(0)

In [22]:
for col in nan_columns:
    assert not df[col].isna().any()

In [23]:
# Describe the approach you chose and why and save that as a string called nan_filling_approach

# YOUR CODE HERE
nan_filling_approach = '''I filled the NAN values with 0, because if they didnt have a score it may not exactly be genuine to assign an average. This may be critical (of people) but its also theoretical so I
didn't care to dwell on it. Depending on the job perhaps this is not a good indicator at all and can be left out.'''
print(nan_filling_approach)

I filled the NAN values with 0, because if they didnt have a score it may not exactly be genuine to assign an average. This may be critical (of people) but its also theoretical so I
didn't care to dwell on it. Depending on the job perhaps this is not a good indicator at all and can be left out.


In [24]:
assert len(nan_filling_approach) > 30

## Save your cleaned data and submit it by the Part 1 deadline

Uncomment the code in the cell below and use it to save your DataFrame of cleaned data to a .csv file.  
Afterwards you can re-comment or delete the code if you like.

In [25]:
## Uncomment the line below to save your data to a .csv file, for Part 1 submission.

df.to_csv('employees_cleaned.csv', header=True, index=False)

## Modeling

### Interviewing model(s)

Having completed the conversion of the data into a format that can be used with machine learning models, your manager asks that you build three separate models which predict the following three targets, respectively:

1. Customer Satisfaction
1. Sales Performance
1. Fired

In [26]:
# Save the names of columns we are trying to predict to a list called "targets".
# Make sure that if we had a categorical column, that you use the dummy representation(s).

# YOUR CODE HERE
# df.head()
targets = ['Customer Satisfaction Rating', 'Sales Rating', 'Fired_Fired']

In [27]:
assert len(targets) == 3
for target in targets:
    assert target in df.columns

Ultimately, the predictions of your models will be used to rank applicants for interviews with HR.

**Which features will you select to use in your interview ranking model?** You will specify them below.
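Since SellsALOT is an Equal Opportunity Employer, one possible starting point is to filter out columns derived from protected attributes before choosing features. The sketch below is only an illustration: the `columns` list is a hypothetical stand-in for `list(df.columns)`, and the choice of prefixes is an assumption you should adapt to your own reasoning. Target columns would also need to be excluded, and other columns may still act as proxies for protected attributes.

```python
# Hypothetical stand-in for list(df.columns) after get_dummies().
columns = [
    "Years of Experience", "College GPA", "Age",
    "Gender_Female", "Gender_Male",
    "Race / Ethnicity_Hispanic", "Education_Graduate", "MBTI_EI_E",
]

# Prefixes of columns tied to protected attributes (an assumption).
protected_prefixes = ("Gender", "Race / Ethnicity", "Age")

# str.startswith accepts a tuple and matches any of its entries.
candidate_features = [c for c in columns if not c.startswith(protected_prefixes)]
print(candidate_features)
```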

In [None]:
print("The available columns are:")
list(df)

In [29]:
# Enter all the features you want to use in a list and save it to "interview_features".
# These are the features for the models that will predict the targets, and the
# predictions will be used to rank applicants for **interviews**.

# YOUR CODE HERE
interview_features = [
                      'Years of Experience', 
                      'Years of Volunteering',
                      'High School GPA',
                      'College GPA',  
                      'English Fluency_Basic',
                      'English Fluency_Fluent',
                      'English Fluency_Proficient',
                      'Spanish Fluency_Basic',
                      'Spanish Fluency_Fluent',
                      'Spanish Fluency_Proficient',
                      'Education_Associates',
                      'Education_Graduate',
                      'Education_Undergraduate',
                      'MBTI_EI_E',
                      'MBTI_EI_I',
                      'MBTI_SN_N',
                      'MBTI_SN_S',
                      'MBTI_TF_F',
                      'MBTI_TF_T',
                      'MBTI_JP_J',
                      'MBTI_JP_P',
                      ]

Why did you choose the features you did?

In [30]:
## Save your reasoning in a string to the variable interview_reason

# YOUR CODE HERE
interview_reason = '''In general I thought YOE/Volunteer History could be useful, as well as general language, education, and personality metrics. I stayed away from superficial things like 
race/gender, etc. Generally an interview is to ask questions about other things IDK. I picked these because they are telling of a candidate but also things that are controlled (mostly).'''
print(interview_reason)

In general I thought YOE/Volunteer History could be useful, as well as general language, education, and personality metrics. I stayed away from superficial things like 
race/gender, etc. Generally an interview is to ask questions about other things IDK. I picked these because they are telling of a candidate but also things that are controlled (mostly).


In [31]:
assert isinstance(interview_reason, str)
assert len(interview_reason) > 20

In [33]:
# Perform a train and test split on the data with the variable names:
#
# interview_x_train for the training features
# interview_x_test  for the testing features
#
# interview_y_train for the training targets
# interview_y_test  for the testing targets
#
# The test dataset should be 20% of the total dataset

# YOUR CODE HERE
# I saw on piazza that you generally avoid using the entire df[] in the split, but it seemed to work OK? (probably just best practice to make variables for them... naming-wise even)
interview_x_train, interview_x_test, interview_y_train, interview_y_test = train_test_split(df[interview_features], df[targets], test_size=0.2)

In [34]:
assert (len(interview_x_train) / (len(interview_x_test) + len(interview_x_train))) == 0.8
assert (len(interview_y_train) / (len(interview_y_test) + len(interview_y_train))) == 0.8
assert len(interview_x_train) == len(interview_y_train)
assert len(interview_x_test) == len(interview_y_test)
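A note on reproducibility: `train_test_split` shuffles the rows randomly, so scores will differ from run to run unless a seed is fixed. The sketch below uses made-up data (the `features` and `labels` frames are assumptions) to show that passing `random_state` makes the split repeatable and keeps feature and target rows aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: ten rows, one feature column and one target column.
features = pd.DataFrame({"x": np.arange(10)})
labels = pd.DataFrame({"y": np.arange(10) * 2})

# A fixed random_state makes the shuffle deterministic.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Repeating the call with the same seed yields the same rows.
x_train_again, _, _, _ = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

print(len(x_train), len(x_test))
print(x_train.equals(x_train_again))
```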

Build and train your interviewing models.

In [None]:
# Select models of your choosing, import them here, and perform a
# hyperparameter search while training them on each of the targets.
#
# Do not use Tensorflow to build a model - you may use scikit-learn.
#
# Determine an appropriate metric for measuring your performance for each
# model/target, and report the test score for that metric. The metric may be
# different for each model/target.
#
# Save your models in a list, with models ordered in the same manner as the
# targets they are predicting in the list "targets" you created above.
# Call the list "my_interview_models", e.g.
#
#    my_interview_models = [interview_model_target1,
#                           interview_model_target2,
#                           interview_model_target3]
#
# You should use multiple print messages to print something like the
# following for each of your models/targets:
#
#    To predict the target (target), I trained a (model) model
#    and determined the best hyperparameters as (param1 = p1), (param2 = p2), ...
#    resulting in a (metric) score of (score).

# YOUR CODE HERE
# targets = ['Customer Satisfaction Rating', 'Sales Rating', 'Fired_Fired']
# My initial thoughts are to just use a ridge model for all 3.
model = Ridge()

# GridSearch & MSE Calculations:
parameters = {'alpha':[1, 10]}
interview_scores = []
my_interview_models = []

for t in targets:
  # search & fit
  search = GridSearchCV(model, parameters, scoring='neg_mean_squared_error', cv=5)
  search.fit(interview_x_train, interview_y_train[t])
  # MSE calculations + print statement. 
  best_esti = search.best_estimator_
  pred = best_esti.predict(interview_x_test)
  mse = mean_squared_error(interview_y_test[t], pred)
  interview_scores.append(mse)
  my_interview_models.append(best_esti)
  output = 'To predict the target, {}, I trained a Ridge model and determined the best hyperparameter to be alpha = {}, resulting in an MSE score of {}'.format(t, search.best_params_['alpha'], mse)
  print(output)

# List of models to use in the skeleton code
print(len(my_interview_models))

In [36]:
assert len(my_interview_models)==len(targets)
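The instructions note that the metric may differ per target. Because the "Fired" target is binary, one option is to treat it as a classification problem and score it with accuracy instead of MSE. The following is a rough sketch on synthetic data generated with `make_classification`; the choice of `LogisticRegression`, the grid of `C` values, and the variable names are all assumptions, not the required approach.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Search over the inverse regularization strength C with 5-fold CV.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
search.fit(x_train, y_train)

# best_params_ holds the chosen grid values; best_estimator_ is the
# model refit on the full training set with those values.
best_c = search.best_params_["C"]
predictions = search.best_estimator_.predict(x_test)
accuracy = accuracy_score(y_test, predictions)

print("Best C = {}, test accuracy = {:.3f}".format(best_c, accuracy))
```

Reading the hyperparameter from `search.best_params_` rather than printing the estimator object keeps the report in the form the instructions ask for.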

### Hiring model(s)

Your manager tells you that SellsALOT has decided they wish to consider doing away with interviews altogether, in order to save money. SellsALOT would like a model that will be used to rank candidates for directly hiring them, rather than for interviewing them.

Will your choice of features change?

**Which features will you select to use in that model?** You will specify them below.

In [None]:
print("The available columns are:")
list(df)

In [38]:
# Enter all the features you want to use in a list and save it to "hire_features".
# These are the features for the models that will predict the targets, and the
# predictions will be used to rank applicants for **hiring**.

# YOUR CODE HERE
hire_features = [
                      'Zipcode_24310',
                      'Zipcode_30167',
                      'Zipcode_43357',
                      'Zipcode_43711',
                      'Zipcode_54821',
                      'Zipcode_55864',
                      'Zipcode_59010',
                      'Zipcode_60531',
                      'Zipcode_72361',
                      'Zipcode_86553',
                      'Years of Experience', 
                      'Years of Volunteering',
                      'High School GPA',
                      'College GPA',  
                      'English Fluency_Basic',
                      'English Fluency_Fluent',
                      'English Fluency_Proficient',
                      'Spanish Fluency_Basic',
                      'Spanish Fluency_Fluent',
                      'Spanish Fluency_Proficient',
                      'Education_Associates',
                      'Education_Graduate',
                      'Education_Undergraduate',
                      'MBTI_EI_E',
                      'MBTI_EI_I',
                      'MBTI_SN_N',
                      'MBTI_SN_S',
                      'MBTI_TF_F',
                      'MBTI_TF_T',
                      'MBTI_JP_J',
                      'MBTI_JP_P',
                      'Requires Sponsorship_False',
                      'Requires Sponsorship_True',
                      ]

Why did you choose the features you did?

In [39]:
## Save your reasoning in a string to the variable hire_reason

# YOUR CODE HERE
hire_reason = '''In general, I would imagine if money/urgency were factors that something like sponsorship for citizenship or something could also be important. I also thought some location info would be important (zip code). 
 Primarily just to avoid having a lag due to relocating an employee, or something like this. The other features are the same! '''

In [40]:
assert isinstance(hire_reason, str)
assert len(hire_reason) > 20

Why was your choice different from or the same as the interviewing features?


In [41]:
# Save your reasoning in a string to the variable
# "same_reason" if the features are the same, or
# "different_reason" if the features are different.

# YOUR CODE HERE
different_reason = hire_reason
if 'same_reason' in locals():
    print(same_reason)
else:
    print(different_reason)

In general, I would imagine if money/urgency were factors that something like sponsorship for citizenship or something could also be important. I also thought some location info would be important (zip code). 
 Primarily just to avoid having a lag due to relocating an employee, or something like this. The other features are the same! 


In [42]:
if all([rf in hire_features for rf in interview_features]) and all([sf in interview_features for sf in hire_features]):
    print("Your features for interviewing and hiring are the same.")
    assert isinstance(same_reason, str)
    assert len(same_reason) > 20
else:
    print("Your features for interviewing and hiring are different.")
    assert isinstance(different_reason, str)
    assert len(different_reason) > 20

Your features for interviewing and hiring are different.


In [43]:
# Perform a train and test split on the data with the variable names:
#
# hire_x_train for the training features
# hire_x_test  for the testing features
#
# hire_y_train for the training targets
# hire_y_test  for the testing targets
#
# The test dataset should be 20% of the total dataset

# YOUR CODE HERE
hire_x_train, hire_x_test, hire_y_train, hire_y_test = train_test_split(df[hire_features], df[targets], test_size=0.2)

In [44]:
assert (len(hire_x_train) / (len(hire_x_test) + len(hire_x_train))) == 0.8
assert (len(hire_y_train) / (len(hire_y_test) + len(hire_y_train))) == 0.8
assert len(hire_x_train) == len(hire_y_train)
assert len(hire_x_test) == len(hire_y_test)

Build and train your hiring models.

Do you expect this model to perform differently?

In [45]:
# Select models of your choosing, import them here, and perform a
# hyperparameter search while training them on each of the targets.
#
# Do not use Tensorflow to build a model - you may use scikit-learn.
#
# Determine an appropriate metric for measuring your performance for each
# model/target, and report the test score for that metric. The metric may be
# different for each model/target.
#
# Save your models in a list, with models ordered in the same manner as the
# targets they are predicting in the list "targets" you created above.
# Call the list "my_hiring_models", e.g.
#
#    my_hiring_models = [hiring_model_target1,
#                        hiring_model_target2,
#                        hiring_model_target3]
#
# You should use multiple print messages to print something like the
# following for each of your models/targets:
#
#    To predict the target (target), I trained a (model) model
#    and determined the best hyperparameters as (param1 = p1), (param2 = p2), ...
#    resulting in a (metric) score of (score).

# YOUR CODE HERE
# My initial thoughts are to just use a ridge model for all 3.
model = Ridge()

parameters = {'alpha':[1, 10]}
hire_scores = []
my_hiring_models = []

for t in targets:
  # search & fit
  search = GridSearchCV(model, parameters, scoring='neg_mean_squared_error', cv=5)
  search.fit(hire_x_train, hire_y_train[t])
  # MSE calculations + print statement. 
  best_esti = search.best_estimator_
  pred = best_esti.predict(hire_x_test)
  mse = mean_squared_error(hire_y_test[t], pred)
  hire_scores.append(mse)
  my_hiring_models.append(best_esti)
  output = 'To predict the target, {}, I trained a Ridge model and determined the best hyperparameter to be alpha = {}, resulting in an MSE score of {}'.format(t, search.best_params_['alpha'], mse)
  print(output)

# List of models to use in the skeleton code
print(len(my_hiring_models))

To predict the target, Customer Satisfaction Rating, I trained a Ridge model and determined the best hyperparameter to be alpha = 1, resulting in an MSE score of 0.0008172086489044972
To predict the target, Sales Rating, I trained a Ridge model and determined the best hyperparameter to be alpha = 1, resulting in an MSE score of 0.0005965592488081609
To predict the target, Fired_Fired, I trained a Ridge model and determined the best hyperparameter to be alpha = 10, resulting in an MSE score of 0.06600747555455964
3


In [46]:
assert len(my_hiring_models)==len(targets)

In [47]:
# Follow this up with a comparison between the performance (test set scores) on your
# two sets of models (six models in total).
#
# You should print something like, for each of the targets:
#
#   Using interview features for target (target) the model scored (score)
#   versus using the hiring features where it scored (score).

# YOUR CODE HERE
for i,t in enumerate(targets):
  out = 'Using interview features for target: {} the model scored: {}, versus using the hiring features where it scored: {}'.format(t, interview_scores[i], hire_scores[i])
  print(out)

Using interview features for target: Customer Satisfaction Rating the model scored: 0.0008721788657854672, versus using the hiring features where it scored: 0.0008172086489044972
Using interview features for target: Sales Rating the model scored: 0.0007203030444323621, versus using the hiring features where it scored: 0.0005965592488081609
Using interview features for target: Fired_Fired the model scored: 0.07678650013077483, versus using the hiring features where it scored: 0.06600747555455964
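
The raw MSE values are on different scales for each target, so a relative change can make the comparison easier to read. A minimal sketch, with the test-set scores copied by hand from the output above (the `scores` dictionary and the helper name are illustrative, not part of the assignment skeleton):

```python
# Test-set MSE per target: (interview features, hiring features),
# copied from the printed output above
scores = {
    "Customer Satisfaction Rating": (0.0008721788657854672, 0.0008172086489044972),
    "Sales Rating": (0.0007203030444323621, 0.0005965592488081609),
    "Fired_Fired": (0.07678650013077483, 0.06600747555455964),
}

def relative_mse_reduction(interview_mse, hire_mse):
    """Fraction by which the hiring-feature MSE is lower than the interview-feature MSE."""
    return (interview_mse - hire_mse) / interview_mse

for target, (interview_mse, hire_mse) in scores.items():
    reduction = relative_mse_reduction(interview_mse, hire_mse)
    print(f"{target}: hiring features reduce MSE by {reduction:.1%}")
```

A positive value means the hiring features gave the lower (better) test-set MSE for that target.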


## Model Evaluation

In this section we'll create example applicants and see how they would fare based on their applications and your models. First, let's create some example applications. We've created four applicants, and you'll need to create a fifth one in the cell below.

In [48]:
applicant_1 = {
    'First Name': "Stefon",
    'Last Name': "Smith",
    'Date of Birth': "1989-12-24",
    'Address': "4892 Jessica Turnpike Suite 781",
    'Zipcode': 86553,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Proficient",
    'Spanish Fluency': "Basic",
    'Education': "Associates",
    'High School GPA': 2.9,
    'College GPA': 3.1,
    'Years of Experience': 5,
    'Years of Volunteering': 2,
    'Myers Briggs Type': "ESFJ",
    'Twitter followers': 524,
    'Instagram Followers': 857,
    'Requires Sponsorship': True
}
applicant_2 = {
    'First Name': "Sarah",
    'Last Name': "Chang",
    'Date of Birth': "1995-04-13",
    'Address': "9163 Rebecca Loop",
    'Zipcode': 43711,
    'Gender': "Female",
    'Race / Ethnicity': "Hispanic",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Fluent",
    'Education': "Undergraduate",
    'High School GPA': 4.0,
    'College GPA': 3.8,
    'Years of Experience': 5,
    'Years of Volunteering': 0,
    'Myers Briggs Type': "ISTJ",
    'Twitter followers': 97,
    'Instagram Followers': 204,
    'Requires Sponsorship': False
}
applicant_3 = {
    'First Name': "Daniel",
    'Last Name': "Richardson",
    'Date of Birth': "1998-10-23",
    'Address': "436 Lauren Stream",
    'Zipcode': 54821,
    'Gender': "Male",
    'Race / Ethnicity': "Black",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Proficient",
    'Education': "Undergraduate",
    'High School GPA': 3.0,
    'College GPA': 3.2,
    'Years of Experience': 1,
    'Years of Volunteering': 1,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 2087,
    'Instagram Followers': 3211,
    'Requires Sponsorship': False
}

applicant_4 = {
    'First Name': "Billy",
    'Last Name': "Bob",
    'Date of Birth': "1999-11-03",
    'Address': "412 Railway Stream",
    'Zipcode': 43711,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Basic",
    'Spanish Fluency': "Fluent",
    'Education': "Undergraduate",
    'High School GPA': 2.0,
    'College GPA': 3.5,
    'Years of Experience': 1,
    'Years of Volunteering': 1,
    'Myers Briggs Type': "ENFJ",
    'Twitter followers': 207,
    'Instagram Followers': 309,
    'Requires Sponsorship': False
}

# Create a fictional applicant by copying the attributes above from any of the
# other applicants and altering values that you would be curious to
# see how your model treats. For example, create an applicant you'd (not your
# model) be sure to reject or sure to hire.

# YOUR CODE HERE
applicant_5 = {
    'First Name': "Ricky",
    'Last Name': "Bobby",
    'Date of Birth': "1996-01-01",
    'Address': "542 Reindeer Road",
    'Zipcode': 43711,
    'Gender': "Male",
    'Race / Ethnicity': "Caucasian",
    'English Fluency': "Fluent",
    'Spanish Fluency': "Fluent",
    'Education': "Graduate",
    'High School GPA': 3.0,
    'College GPA': 3.1,
    'Years of Experience': 10,
    'Years of Volunteering': 6,
    'Myers Briggs Type': "ENFP",
    'Twitter followers': 321,
    'Instagram Followers': 123,
    'Requires Sponsorship': False
}

In [49]:
for key in applicant_4.keys():
    assert key in applicant_5.keys()

In [50]:
new_people = [applicant_1, applicant_2, applicant_3, applicant_4, applicant_5]
new_people_df = pd.DataFrame.from_records(new_people)

In [51]:
new_people_df

Unnamed: 0,First Name,Last Name,Date of Birth,Address,Zipcode,Gender,Race / Ethnicity,English Fluency,Spanish Fluency,Education,High School GPA,College GPA,Years of Experience,Years of Volunteering,Myers Briggs Type,Twitter followers,Instagram Followers,Requires Sponsorship
0,Stefon,Smith,1989-12-24,4892 Jessica Turnpike Suite 781,86553,Male,Caucasian,Proficient,Basic,Associates,2.9,3.1,5,2,ESFJ,524,857,True
1,Sarah,Chang,1995-04-13,9163 Rebecca Loop,43711,Female,Hispanic,Fluent,Fluent,Undergraduate,4.0,3.8,5,0,ISTJ,97,204,False
2,Daniel,Richardson,1998-10-23,436 Lauren Stream,54821,Male,Black,Fluent,Proficient,Undergraduate,3.0,3.2,1,1,ENFJ,2087,3211,False
3,Billy,Bob,1999-11-03,412 Railway Stream,43711,Male,Caucasian,Basic,Fluent,Undergraduate,2.0,3.5,1,1,ENFJ,207,309,False
4,Ricky,Bobby,1996-01-01,542 Reindeer Road,43711,Male,Caucasian,Fluent,Fluent,Graduate,3.0,3.1,10,6,ENFP,321,123,False


### Future Applicants Data Cleaning



In [52]:
# Apply all the cleaning and dummy variable creation you did above to this new
# DataFrame. You can copy your code from above and modify it to apply to
# new_people_df instead of df.

# YOUR CODE HERE
# Explicitly add the zipcode dummy columns that exist in the training data.
# Without them I got errors, since these zipcodes do not appear among the applicants
# created above. I'm not sure this is the right approach.
zipcodes =           ['Zipcode_24310',
                      'Zipcode_30167',
                      'Zipcode_43357',
                      'Zipcode_55864',
                      'Zipcode_59010',
                      'Zipcode_60531',
                      'Zipcode_72361'
                      ]
new_people_df[zipcodes] = 0
# Split the Myers Briggs type into its four letters
new_people_df[['MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP']] = new_people_df['Myers Briggs Type'].apply(lambda x: pd.Series(list(x)))
new_people_df = new_people_df.drop(columns='Myers Briggs Type')
# Categorical stuff
categorical_columns2 = ['Zipcode', 'Gender', 'Race / Ethnicity', 'English Fluency', 'Spanish Fluency', 'Education', 'MBTI_EI', 'MBTI_SN', 'MBTI_TF', 'MBTI_JP', 'Requires Sponsorship']

new_people_df = pd.get_dummies(new_people_df, columns=categorical_columns2)
# Age stuff
new_people_df['Date of Birth'] = pd.to_datetime(new_people_df['Date of Birth'])
new_people_df['Age'] = new_people_df['Date of Birth'].apply(calculate_age)
new_people_df = new_people_df.drop(columns='Date of Birth')
# Handle NANs 
nan_columns = new_people_df.columns[new_people_df.isna().any()].tolist()
new_people_df['High School GPA'] = new_people_df['High School GPA'].fillna(0)
new_people_df['College GPA'] = new_people_df['College GPA'].fillna(0)

In [53]:
new_people_df

Unnamed: 0,First Name,Last Name,Address,High School GPA,College GPA,Years of Experience,Years of Volunteering,Twitter followers,Instagram Followers,Zipcode_24310,Zipcode_30167,Zipcode_43357,Zipcode_55864,Zipcode_59010,Zipcode_60531,Zipcode_72361,Zipcode_43711,Zipcode_54821,Zipcode_86553,Gender_Female,Gender_Male,Race / Ethnicity_Black,Race / Ethnicity_Caucasian,Race / Ethnicity_Hispanic,English Fluency_Basic,English Fluency_Fluent,English Fluency_Proficient,Spanish Fluency_Basic,Spanish Fluency_Fluent,Spanish Fluency_Proficient,Education_Associates,Education_Graduate,Education_Undergraduate,MBTI_EI_E,MBTI_EI_I,MBTI_SN_N,MBTI_SN_S,MBTI_TF_F,MBTI_TF_T,MBTI_JP_J,MBTI_JP_P,Requires Sponsorship_False,Requires Sponsorship_True,Age
0,Stefon,Smith,4892 Jessica Turnpike Suite 781,2.9,3.1,5,2,524,857,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,1,0,0,1,31
1,Sarah,Chang,9163 Rebecca Loop,4.0,3.8,5,0,97,204,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,1,1,0,1,0,26
2,Daniel,Richardson,436 Lauren Stream,3.0,3.2,1,1,2087,3211,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,23
3,Billy,Bob,412 Railway Stream,2.0,3.5,1,1,207,309,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,1,0,1,0,1,0,1,0,1,0,22
4,Ricky,Bobby,542 Reindeer Road,3.0,3.1,10,6,321,123,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,0,1,1,0,25


In [54]:
for feature in interview_features:
    assert feature in new_people_df.columns
for feature in hire_features:
    assert feature in new_people_df.columns
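
Adding the missing zipcode columns by hand works because `pd.get_dummies` only creates columns for categories that actually occur in the data. One alternative, sketched here on toy data, is to reindex the encoded frame to the training columns with `fill_value=0`. The names `train_columns` and `new_df` are hypothetical stand-ins, not the notebook's real objects:

```python
import pandas as pd

# Hypothetical list of columns the model saw at training time
train_columns = ["Years of Experience", "Zipcode_24310", "Zipcode_43711",
                 "Gender_Female", "Gender_Male"]

# Toy "new applicants" frame with only one zipcode present
new_df = pd.DataFrame({
    "Years of Experience": [5, 1],
    "Zipcode": [43711, 43711],
    "Gender": ["Male", "Female"],
})

# get_dummies creates columns only for the categories present in new_df
encoded = pd.get_dummies(new_df, columns=["Zipcode", "Gender"])

# reindex adds any missing training columns filled with 0, drops columns
# the model never saw, and puts the columns in the training order
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The same call also protects against the opposite problem, where a new applicant has a category that never appeared in training.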

### Future Applicant Model(s) Predictions

Now use your `my_interview_models` and your `my_hiring_models` to predict applicant scores, and store those predictions.

In [55]:
# Use your models to make predictions for people in new_people_df.
#
# Save your predictions as a "new_people_interview_pred" list and 
# a "new_people_hire_pred" list. Each should be a list of five dictionaries
# (one for each applicant). The keys of the dictionaries should be the
# same as the strings in the "targets" list you created above.

# YOUR CODE HERE
# To save our eyes: I couldn't find the right combination of .values calls when fitting/predicting to avoid the warnings, so I hide them here.
#   - The warning just says that the models were fitted with column names but are predicting without them; changing the obvious things broke the code in other spots.
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

new_people_interview_pred = []
new_people_hire_pred = []
# The keys of each dictionary are the entries of targets, and each value is the prediction from that target's model.
# Each dictionary is then appended to the lists above.
#   - This boils down to a list with 5 entries, where each entry stores 3 predictions in a dictionary.

for i in range(len(new_people_df)):
  person = new_people_df.iloc[i]
  # 2 empty lists to collect one prediction per target, in the order of targets
  interview_res = []
  hire_res = []
  for j in range(len(targets)):
    # interview
    interview_res.append(my_interview_models[j].predict(person[interview_features].to_numpy().reshape(1,-1)))
    # hire
    hire_res.append(my_hiring_models[j].predict(person[hire_features].to_numpy().reshape(1,-1)))
  # the target names are the keys of both dictionaries
  interview_dict = {targets[z]: interview_res[z] for z in range(len(targets))}
  hiring_dict = {targets[z]: hire_res[z] for z in range(len(targets))}
  new_people_interview_pred.append(interview_dict)
  new_people_hire_pred.append(hiring_dict)

print('Interview models:')
for p in new_people_interview_pred:
    print(p)

print('\nHiring models:')
for p in new_people_hire_pred:
    print(p)

Interview models:
{'Customer Satisfaction Rating': array([2.06369628]), 'Sales Rating': array([1.88096889]), 'Fired_Fired': array([0.07535652])}
{'Customer Satisfaction Rating': array([1.7680669]), 'Sales Rating': array([1.93027125]), 'Fired_Fired': array([0.06343208])}
{'Customer Satisfaction Rating': array([1.29552187]), 'Sales Rating': array([1.337515]), 'Fired_Fired': array([0.1215484])}
{'Customer Satisfaction Rating': array([0.98947136]), 'Sales Rating': array([1.0858942]), 'Fired_Fired': array([0.17336368])}
{'Customer Satisfaction Rating': array([4.44640278]), 'Sales Rating': array([4.30720478]), 'Fired_Fired': array([-0.05301207])}

Hiring models:
{'Customer Satisfaction Rating': array([2.05522462]), 'Sales Rating': array([1.91326915]), 'Fired_Fired': array([0.13608429])}
{'Customer Satisfaction Rating': array([1.76228811]), 'Sales Rating': array([1.92785851]), 'Fired_Fired': array([0.06601563])}
{'Customer Satisfaction Rating': array([1.30183634]), 'Sales Rating': array([1.33

In [56]:
for person in new_people_interview_pred:
    for key in targets:
        assert key in person.keys()

for person in new_people_hire_pred:
    for key in targets:
        assert key in person.keys()
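
The row-by-row loop above can also be written as one batched prediction per model. The sketch below uses toy data and hypothetical names (`features`, `train`, `models`), and assumes a scikit-learn version that records feature names at fit time (1.0 or later). Passing a DataFrame with the training column names to `predict` should avoid the feature-name warning, and converting each value with `float` stores plain numbers instead of one-element arrays:

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Toy training data; feature and target names are illustrative only
features = ["Years of Experience", "College GPA"]
targets = ["Sales Rating", "Fired_Fired"]
train = pd.DataFrame({
    "Years of Experience": [1, 3, 5, 7, 9],
    "College GPA": [2.5, 3.0, 3.5, 3.2, 3.8],
    "Sales Rating": [1.0, 1.8, 2.9, 3.6, 4.7],
    "Fired_Fired": [1, 0, 0, 0, 0],
})

# One model per target, fitted on a DataFrame so feature names are recorded
models = [Ridge(alpha=1).fit(train[features], train[t]) for t in targets]

new_people = pd.DataFrame({
    "Years of Experience": [2, 10],
    "College GPA": [3.1, 3.4],
})

# Predict for all rows at once, keeping the DataFrame (and its column names)
columns = {t: m.predict(new_people[features]) for t, m in zip(targets, models)}

# Reshape into one dictionary per person, with plain floats as values
predictions = [
    {t: float(columns[t][i]) for t in targets}
    for i in range(len(new_people))
]
print(predictions)
```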

### Ranking Evaluation

Your manager notes that given that you have more than one prediction target, the model predictions aren't really ranking or selecting people. There is no "best" person because there's more than one metric to look through. A human still needs to look at all three predictions so your models don't yet really do what SellsALOT has asked for.

Your manager asks you to create a synthetic scalar variable that is calculated from the multiple target predictions of an individual person. That way we'll have one metric by which we can rank people. You need to create that synthetic metric (score).

Some candidate approaches:

1. Incorporating a binary value (such as fired/not-fired), x:
    - You can multiply x by some arbitrary value and add/subtract it to/from the total score:
      - score = t1 + t2 * x
    - You can multiply your entire score output by the binary value to say something like "if not x, then score is 0", e.g.:
      - score = x * (t1 + t2)
1. Weighting different target values:
    - You can weight different values by adding a multiplier (if t1 is twice as important as t2, then the score can be something like):
      - score = 2 * t1 + t2
1. Some combination of the items above
1. Something creative you devise on your own!
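
The three formulas in the list above can be tried out on made-up numbers before committing to one. This is only a literal sketch of those formulas; the function names and the sample values are illustrative:

```python
def additive(t1, t2, x):
    """score = t1 + t2 * x: the binary value x is scaled by t2 and added to t1."""
    return t1 + t2 * x

def gated(t1, t2, x):
    """score = x * (t1 + t2): the score is 0 whenever x is 0."""
    return x * (t1 + t2)

def weighted(t1, t2, w1=2.0, w2=1.0):
    """score = w1 * t1 + w2 * t2: by default t1 counts twice as much as t2."""
    return w1 * t1 + w2 * t2

t1, t2 = 3.0, 2.0  # made-up target values
for x in (0, 1):
    print(x, additive(t1, t2, x), gated(t1, t2, x), weighted(t1, t2))
```

If the binary value marks a bad outcome (for example, 1 meaning fired), it would need to be inverted as `1 - x` before gating, or given a negative multiplier in the additive form, so that a higher score still means a better candidate.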

In [57]:
def calculate_synthetic_metric(targets):
    """Calculates a synthetic metric based on the targets of an individual.
    Your metric should result in a higher score being a better one.

    Args:
      targets (dict): The dictionary with keys as the target names and
                      values as the target values/predictions

    Returns:
      float: The synthetic score produced from the target values
    """
    # YOUR CODE HERE
    res = (2 * targets['Sales Rating'] + targets['Customer Satisfaction Rating']) - 10 * targets['Fired_Fired']
    return float(res)

Let's try out the synthetic metric on the original data and see if you're happy with the result based on the past data.

In [58]:
# Add a column named "Metric" to the *original* DataFrame ("df") with the
# synthetic metric applied to each row.

# YOUR CODE HERE
# print(new_people_df)
# # The function works with the prediction dictionaries calculated above, but how do I apply it to the rows of the original DataFrame?
# for p in new_people_interview_pred:
#   print(calculate_synthetic_metric(p))
# I'll repeat what I did for new_people_df on the original df (predict each target per row, then score the predictions) and add the scores as a column. This is probably not the ideal solution, but it should work.
synthetic_scores = []
for i in range(len(df)):
  keys = targets
  # I'll default to the hiring models rather than the interview models, since the latest scenario we are given is the one in which the company is directly hiring candidates.
  res = []
  for j in range(3):
    person = df.iloc[i]
    res.append(my_hiring_models[j].predict(person[hire_features].to_numpy().reshape(1,-1)))
  new_dict = {keys[z]: res[z] for z in range(3)}
  synthetic_scores.append(calculate_synthetic_metric(new_dict))

# create column
df['Metric'] = synthetic_scores

In [59]:
df.head()

Unnamed: 0,First Name,Last Name,Address,High School GPA,College GPA,Years of Experience,Years of Volunteering,Twitter followers,Instagram Followers,Customer Satisfaction Rating,Sales Rating,Zipcode_24310,Zipcode_30167,Zipcode_43357,Zipcode_43711,Zipcode_54821,Zipcode_55864,Zipcode_59010,Zipcode_60531,Zipcode_72361,Zipcode_86553,Gender_Female,Gender_Male,Race / Ethnicity_Black,Race / Ethnicity_Caucasian,Race / Ethnicity_Hispanic,English Fluency_Basic,English Fluency_Fluent,English Fluency_Proficient,Spanish Fluency_Basic,Spanish Fluency_Fluent,Spanish Fluency_Proficient,Education_Associates,Education_Graduate,Education_High School,Education_None,Education_Undergraduate,MBTI_EI_E,MBTI_EI_I,MBTI_SN_N,MBTI_SN_S,MBTI_TF_F,MBTI_TF_T,MBTI_JP_J,MBTI_JP_P,Requires Sponsorship_False,Requires Sponsorship_True,Fired_Current Employee,Fired_Fired,Age,Metric
0,Sarah,Chang,764 Howard Tunnel,3.1,2.52,8.8,0.0,693,1108,2.21,2.07,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,1,0,1,1,0,1,0,1,0,31,5.120133
1,Daniel,Taylor,4892 Jessica Turnpike Suite 781,3.02,3.9,13.7,0.0,507,1259,3.37,2.98,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,1,1,0,1,0,1,0,1,0,36,9.544119
2,Heather,Stewart,778 Linda Orchard Apt. 609,2.95,2.63,5.2,0.0,599,868,1.5,1.36,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,1,0,1,0,0,1,1,0,1,0,28,2.857821
3,Katherine,Dillon,139 Linda Crossroad Suite 115,3.99,3.88,12.5,0.0,1321,889,2.89,2.62,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,1,1,0,0,1,0,1,1,0,34,7.374188
4,Sheri,Bolton,1858 Lauren Orchard,3.82,3.3,7.0,0.0,414,13760,1.94,1.78,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,1,1,0,1,0,0,1,1,0,30,4.900776


In [60]:
assert "Metric" in df.columns
assert np.issubdtype(df["Metric"].dtype, np.number)

Are you happy with the synthetic score based on the values for each person here? Go back and update it until you're satisfied with this score.
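
One way to sanity-check the score is to apply the same weights to the recorded target values rather than to model predictions. The sketch below does that for the first three rows shown in `df.head()` above; `toy` and `metric_from_row` are hypothetical names, and this is an alternative to the prediction-based column, not a replacement for it:

```python
import pandas as pd

# Target columns for the first three rows of df.head() above
toy = pd.DataFrame({
    "Customer Satisfaction Rating": [2.21, 3.37, 1.50],
    "Sales Rating": [2.07, 2.98, 1.36],
    "Fired_Fired": [0, 0, 0],
})

def metric_from_row(row):
    """Same weights as calculate_synthetic_metric, applied to one DataFrame row."""
    return float(2 * row["Sales Rating"]
                 + row["Customer Satisfaction Rating"]
                 - 10 * row["Fired_Fired"])

# axis=1 passes each row to the function as a Series
toy["Metric"] = toy.apply(metric_from_row, axis=1)
print(toy)
```

Comparing these values with the prediction-based `Metric` column shows how much of the score comes from the models rather than from the recorded outcomes.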

In [61]:
# Explain the logic behind your synthetic scoring mechanism and
# save it as "synthetic_score_reasoning".

# YOUR CODE HERE
synthetic_score_reasoning = '''In general, I figured that a sales company likely prioritizes sales/profit over everything. Therefore, I weighted sales twice as heavily as satisfaction.
 I also adjusted the score by subtracting the predicted fired value. More specifically, the weight of the fired modifier was 10 (5x that of sales). I thought this would provide enough
 variability to make the score meaningful, without having to do any further adjustments (e.g. normalization).'''
print(synthetic_score_reasoning)

In general, I figured that a sales company likely prioritizes sales/profit over everything. Therefore, I weighted sales twice as heavily as satisfaction.
 I also adjusted the score by subtracting the predicted fired value. More specifically, the weight of the fired modifier was 10 (5x that of sales). I thought this would provide enough
 variability to make the score meaningful, without having to do any further adjustments (e.g. normalization).


In [62]:
assert len(synthetic_score_reasoning) > 100

Now let's calculate the synthetic scores for the new people (applicants) and see if you're satisfied with your models' rankings for interviewing and hiring.

In [63]:
new_people_interview_score = [calculate_synthetic_metric(target_vals) for target_vals in new_people_interview_pred]
new_people_hire_score = [calculate_synthetic_metric(target_vals) for target_vals in new_people_hire_pred]

In [64]:
best_interview_person = new_people[new_people_interview_score.index(max(new_people_interview_score))]
best_hire_person = new_people[new_people_hire_score.index(max(new_people_hire_score))]

Based on these scores, your model selected the following people:

In [65]:
print(f"""
Your interviewing model selected {best_interview_person['First Name']} {best_interview_person['Last Name']} as the person to interview.

Your hiring model selected {best_hire_person['First Name']} {best_hire_person['Last Name']} as the person to hire.
""")


Your interviewing model selected Ricky Bobby as the person to interview.

Your hiring model selected Ricky Bobby as the person to hire.



Are you happy with these results? Feel free to modify `applicant_5`'s attributes and see how your model performs based on changing these values.
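
Looking only at the top-scoring applicant hides how close the others are. A minimal sketch that ranks all five applicants, using the interview-model predictions printed earlier (copied by hand) and the same weights as `calculate_synthetic_metric`; the `score` helper and the variable names are illustrative:

```python
names = ["Stefon Smith", "Sarah Chang", "Daniel Richardson", "Billy Bob", "Ricky Bobby"]

# Interview-model predictions copied from the printed output earlier
interview_pred = [
    {"Customer Satisfaction Rating": 2.06369628, "Sales Rating": 1.88096889, "Fired_Fired": 0.07535652},
    {"Customer Satisfaction Rating": 1.7680669, "Sales Rating": 1.93027125, "Fired_Fired": 0.06343208},
    {"Customer Satisfaction Rating": 1.29552187, "Sales Rating": 1.337515, "Fired_Fired": 0.1215484},
    {"Customer Satisfaction Rating": 0.98947136, "Sales Rating": 1.0858942, "Fired_Fired": 0.17336368},
    {"Customer Satisfaction Rating": 4.44640278, "Sales Rating": 4.30720478, "Fired_Fired": -0.05301207},
]

def score(pred):
    """2 * sales + satisfaction - 10 * fired, matching the weights used above."""
    return 2 * pred["Sales Rating"] + pred["Customer Satisfaction Rating"] - 10 * pred["Fired_Fired"]

# Sort from highest to lowest score
ranking = sorted(zip(names, (score(p) for p in interview_pred)),
                 key=lambda pair: pair[1], reverse=True)

for rank, (name, value) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {value:.3f}")
```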

In [66]:
# Describe your level of satisfaction with your models.
# Did you edit your model based on the results? What did you change?
# What general conclusions did you get from the exercise?
# Save your answer to the above questions as "conclusions".

# YOUR CODE HERE
conclusions = '''I am pretty satisfied with this model! Looking at the few entries above, I was originally confused why Daniel Taylor and Katherine Dillon had such a discrepancy in points, but
one of them does have a slightly higher sales/customer satisfaction metric. Further, the need for sponsorship impacts the result indirectly because it is a feature in the hiring models, which I used
for these calculations. I did edit the model based on the results: I mostly tinkered with the 5 people at the end to see what kind of decisions I could get the model to make. When I felt
they were at least somewhat valid I continued. I changed the weighting of the variables used in the metric calculations. The general conclusion I got from the exercise was that the model was able
to predict a candidate's standing based on a number of different points of data. However, I think it also shows the need to have discussions, as these data points are not truly an accurate representation
of any candidate's output, etc. One thing I did notice is that it often recommends the same person for hire/interview, likely due to the similarity of features?'''
print(conclusions)

I am pretty satisfied with this model! Looking at the few entries above, I was originally confused why Daniel Taylor and Katherine Dillon had such a discrepancy in points, but
one of them does have a slightly higher sales/customer satisfaction metric. Further, the need for sponsorship impacts the result indirectly because it is a feature in the hiring models, which I used
for these calculations. I did edit the model based on the results: I mostly tinkered with the 5 people at the end to see what kind of decisions I could get the model to make. When I felt
they were at least somewhat valid I continued. I changed the weighting of the variables used in the metric calculations. The general conclusion I got from the exercise was that the model was able
to predict a candidate's standing based on a number of different points of data. However, I think it also shows the need to have discussions, as these data points are not truly an accurate representation
of any candidate's output, etc

In [67]:
assert len(conclusions) > 100

## Feedback

In [68]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    return 'N/A, thanks a lot for the semester!'