# Predicting Employee Attrition Using Decision Trees and Random Forests

### Business Understanding
- Explain the real-world problem the project aims to solve
- Identify the stakeholders who could use the project and how they would use it
- Summarize the project's implications for the real-world problem and its stakeholders

### Data Understanding
- Describe the data sources and explain why the data are suitable for the project
- Present the size of the dataset and descriptive statistics for all features used in the analysis
- Justify the inclusion of features based on their properties and relevance to the project
- Identify any limitations of the data that have implications for the project

#### Import the required libraries and load the dataset

In [6]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from sklearn.metrics import accuracy_score, precision_score, roc_curve, auc, recall_score, f1_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

# fixed random seed so that splits and models are reproducible
seed = 13

# loading in csv and looking at first 5 rows
df = pd.read_csv("../Data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
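
The Data Understanding items (dataset size, descriptive statistics, limitations) could be checked along the following lines. This is only a sketch: `sample` is a small illustrative frame rebuilt from the five rows and a handful of the columns visible in the `df.head()` output, not the full dataset, and the variable names are assumptions. In the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Illustrative frame: the five rows shown by df.head(), limited to a few of
# the visible columns. In the notebook, use the full `df` instead.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "DailyRate": [1102, 279, 1373, 1392, 591],
    "DistanceFromHome": [1, 8, 2, 3, 2],
    "EmployeeCount": [1, 1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80, 80],
})

# Size of the dataset (rows, columns)
n_rows, n_cols = sample.shape

# Descriptive statistics for the numeric columns
numeric_summary = sample.describe()

# Share of each target class, to see how imbalanced the target is
class_balance = sample["Attrition"].value_counts(normalize=True)

# Columns with a single distinct value carry no information for a model
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]

# Missing values per column, one common data limitation to report
missing_per_column = sample.isna().sum()

print(n_rows, n_cols)
print(numeric_summary)
print(class_balance)
print(constant_cols)
```

In these five rows `EmployeeCount` and `StandardHours` never vary; whether that holds for the whole file would need to be confirmed on `df` before treating them as uninformative.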


### Data Preparation
- Provide the instructions and code needed to get and prepare the raw data for analysis
- Use code comments and text to explain what the data preparation code does
- Justify why the steps taken are appropriate for the problem being solved

In [1]:
# keep the target (Attrition) plus the columns we want to use as features for prediction
df = df[[
    'Age', 'Attrition', 'Education', 'EnvironmentSatisfaction', 'Gender',
    'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
    'MaritalStatus', 'MonthlyIncome', 'PerformanceRating', 'RelationshipSatisfaction',
    'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion'
]]
df.shape
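
The imports at the top of the notebook (`LabelEncoder`, `train_test_split`) suggest that the next preparation steps are encoding the text columns and splitting the data. A minimal sketch of those two steps follows, under stated assumptions: `encode_text_columns` is a hypothetical helper, and `sample` reuses the five rows from `df.head()` with columns whose values are visible there, which are not all part of the selected feature set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

seed = 13  # same seed value the notebook sets

# Illustrative frame: the five rows shown by df.head(), restricted to a few
# columns whose values are visible there (not the notebook's final features).
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely",
                       "Travel_Frequently", "Travel_Rarely"],
    "Department": ["Sales", "Research & Development", "Research & Development",
                   "Research & Development", "Research & Development"],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})


def encode_text_columns(frame):
    """Hypothetical helper: label-encode every non-numeric column of a copy."""
    encoded = frame.copy()
    encoders = {}
    for col in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[col]):
            encoder = LabelEncoder()
            encoded[col] = encoder.fit_transform(encoded[col])
            encoders[col] = encoder  # kept so codes can be mapped back later
    return encoded, encoders


encoded, encoders = encode_text_columns(sample)

# Separate the target from the features, then hold out part of the data
X = encoded.drop(columns="Attrition")
y = encoded["Attrition"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=seed
)

print(encoded)
print(list(encoders["Attrition"].classes_))
```

`LabelEncoder` assigns integer codes in sorted order of the labels, so the codes imply an ordering that the categories may not have. On the full dataset, passing `stratify=y` to `train_test_split` is an option for keeping the class proportions similar across the splits; it is left out here because five rows are too few to stratify.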

### Modeling
- Run and interpret a simple baseline model for comparison
- Introduce new models that improve on prior models and interpret their results
- Explicitly justify model changes based on the results of prior models and the problem context
- Explicitly describe any improvements found from running new models
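
One way the baseline-then-improve workflow above might look for the two model families in the title is sketched below. The data are a synthetic stand-in produced by `make_classification` with an arbitrary class imbalance, not the attrition dataset, and the hyperparameters (`max_depth=3`, `n_estimators=200`) are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 13  # same seed value the notebook sets

# Synthetic stand-in for the prepared data (NOT the real attrition dataset);
# the 80/20 class weights are an arbitrary choice to mimic an imbalanced target.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=seed,
)

# Hold out a test set first, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=seed
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed
)

models = {
    # Baseline: a single shallow tree that is easy to interpret
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=seed),
    # Candidate improvement: an ensemble of trees
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
}

# Fit on the training split and compare on the validation split only
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_pred = model.predict(X_val)
    results[name] = {
        "recall": recall_score(y_val, val_pred),
        "f1": f1_score(y_val, val_pred),
    }

for name, scores in results.items():
    print(name, scores)
```

Comparing models on the validation split keeps the test split untouched until one final model has been chosen.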

### Evaluation
- Justify the choice of metrics using the context of the real-world problem and the consequences of errors
- Identify one final model based on its performance on the chosen metrics with validation data
- Evaluate the performance of the final model using holdout test data
- Discuss the implications of the final model evaluation for solving the real-world problem
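
The metrics imported at the top of the notebook can be computed as in the sketch below. The labels and predicted probabilities are made-up toy values chosen only to show the calls, with `1` standing for an employee who left. They are not results from any model in this project, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, auc, confusion_matrix, f1_score,
    precision_score, recall_score, roc_curve,
)

# Toy values for illustration only (1 = employee left, 0 = employee stayed)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.65, 0.8, 0.9])

# Turn probabilities into class predictions with an assumed 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # Of the employees flagged as leaving, how many actually left
    "precision": precision_score(y_true, y_pred),
    # Of the employees who left, how many were flagged
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# ROC AUC is threshold-independent, so it uses the probabilities directly
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
metrics["roc_auc"] = auc(fpr, tpr)

print(cm)
print(metrics)
```

Which metric to prioritize depends on the cost of each error type. If missing an employee who later leaves (a false negative) is judged costlier than flagging one who stays, recall would be the natural focus; that judgment belongs to the business context rather than to the code. `ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` can draw the same matrix as a plot.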