## Basic Data Exploration

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [2]:
employee_data_path = "Employee_Churn_Dataset.csv"
employee_data=pd.read_csv(employee_data_path)

In [3]:
employee_data.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,0.612834,0.716102,3.803054,201.050337,3.498233,0.14461,0.238083,0.021268
std,0.248631,0.171169,1.232592,49.943099,1.460136,0.351719,0.425924,0.144281
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0,1.0


In [4]:
employee_data.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Departments,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [5]:
employee_data.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Departments', 'salary'],
      dtype='object')

## Building First Machine Learning Model

In [6]:
#Desired prediction column
y = employee_data.left

In [13]:
#List of features
feature_names = ["satisfaction_level","last_evaluation","number_project","average_montly_hours",
                 "time_spend_company", "Work_accident","promotion_last_5years"]

In [14]:
#Seect data corresponding to features in feature_names
X = employee_data[feature_names]

In [15]:
X.describe()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,0.612834,0.716102,3.803054,201.050337,3.498233,0.14461,0.021268
std,0.248631,0.171169,1.232592,49.943099,1.460136,0.351719,0.144281
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0
50%,0.64,0.72,4.0,200.0,3.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0


In [16]:
X.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years
0,0.38,0.53,2,157,3,0,0
1,0.8,0.86,5,262,6,0,0
2,0.11,0.88,7,272,4,0,0
3,0.72,0.87,5,223,5,0,0
4,0.37,0.52,2,159,3,0,0


In [17]:
employee_model=DecisionTreeRegressor(random_state=1)

#Fit the model
employee_model.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')

In [22]:
#make predictions
predictions = employee_model.predict(X)
print(predictions)

[1. 1. 1. ... 1. 1. 1.]


In [21]:
print(employee_model.predict(X.head()))

[1. 1. 1. 1. 1.]


## Model Validation

In [23]:
#Calculate mean absolute error
from sklearn.metrics import mean_absolute_error

predicted_lefts = employee_model.predict(X)
mean_absolute_error(y, predicted_lefts)

0.00016001066737782518

In [24]:
#Training using data split
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
#Define the model
employee_model = DecisionTreeRegressor()
#fit Model
employee_model.fit(train_X, train_y)
#get predicted values from validation data
val_predictions = employee_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

0.019933333333333334


In [26]:
print("First in-sample predictions:", employee_model.predict(val_X.head()))
print("Actual target values for those employees:", y.head().tolist())

First in-sample predictions: [1. 0. 0. 0. 0.]
Actual target values for those employees: [1, 1, 1, 1, 1]


## Underfitting and Overfitting

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in

In [29]:
#Function to help comapre MAE
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)

In [31]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 500  		 Mean Absolute Error:  0
Max leaf nodes: 5000  		 Mean Absolute Error:  0
Max leaf nodes: 50000  		 Mean Absolute Error:  0
Max leaf nodes: 500000  		 Mean Absolute Error:  0


Here's the takeaway: Models can suffer from either:

* **Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
* **Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.
We use **validation** data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.