Welcome to IBM, where you have been hired as a data science consultant. Your job is to build a model that predicts the probability of an employee leaving the company, also known as "churn." You may use any classification algorithm you choose. The source data is located here:

https://github.com/IBM/employee-attrition-aif360/blob/master/data/emp_attrition.csv

## Step 0: Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Step 1: Load the data

Read the CSV from its raw GitHub URL (`raw.githubusercontent.com`). The `blob` page linked above is an HTML page, not the CSV file itself.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/IBM/employee-attrition-aif360/master/data/emp_attrition.csv')

In [None]:
df.head(2)

## Step 2: Data Quality Check

In [None]:
# What are the column names?
df.columns

In [None]:
# What are the data types? Do any columns need conversion? Select 3-5 features for the model.
df.dtypes

In [None]:
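# .copy() makes sub_df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
sub_df = df[['Attrition', 'Age', 'Gender', 'JobSatisfaction', 'PerformanceRating', 'YearsAtCompany']].copy()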

In [None]:
sub_df.head(2)

In [None]:
# Is there any missing data?
sub_df.isnull().sum()

In [None]:
# Which columns need conversion? Object (string) columns must be encoded as numbers.
sub_df.dtypes

In [None]:
sub_df['Attrition_r'] = sub_df['Attrition'].map({'Yes': 1, 'No': 0})
sub_df['Gender_r'] = sub_df['Gender'].map({'Male': 1, 'Female': 0})
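
A mapping like the ones above silently yields NaN for any label that is not in the dictionary, so it can be worth checking the result. A minimal sketch of such a check, using a small made-up frame (the `toy` data and its labels are illustrative assumptions, not values taken from the dataset):

```python
import pandas as pd

# Toy data for illustration only; note the unexpected lowercase label 'yes'.
toy = pd.DataFrame({'Attrition': ['Yes', 'No', 'yes', 'No']})

# .map() returns NaN for any value missing from the dictionary.
toy['Attrition_r'] = toy['Attrition'].map({'Yes': 1, 'No': 0})

# Count the rows that failed to map and list the offending labels.
n_unmapped = toy['Attrition_r'].isnull().sum()
bad_labels = toy.loc[toy['Attrition_r'].isnull(), 'Attrition'].unique().tolist()
print(n_unmapped, bad_labels)
```

If the count is nonzero, the listed labels show which spellings the dictionary needs to cover.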

## Step 3: Quick EDA

In [None]:
# What fraction of employees leave the company? (Multiply by 100 for a percentage.)
sub_df['Attrition_r'].mean()

In [None]:
# What is the distribution of Age?
sub_df['Age'].hist()

In [None]:
# What is the distribution of Job Satisfaction?
sub_df['JobSatisfaction'].hist()
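
For a rating with only a handful of distinct values, a table of shares can be easier to read than a histogram. A minimal sketch, assuming a 1-4 rating scale and using made-up values (the `ratings` series is hypothetical, not drawn from the dataset):

```python
import pandas as pd

# Toy ratings on a 1-4 scale, for illustration only.
ratings = pd.Series([1, 3, 4, 4, 2, 3, 4, 3])

# Share of each rating, ordered by rating rather than by frequency.
shares = ratings.value_counts(normalize=True).sort_index()
print(shares)
```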

## Step 4: Create X and y

In [None]:
feature_cols = ['Age', 'Gender_r', 'JobSatisfaction', 'PerformanceRating', 'YearsAtCompany']
X = sub_df[feature_cols]

In [None]:
y = sub_df['Attrition_r']

## Step 5: Split data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

## Step 6: Standardize data

In [None]:
from sklearn.preprocessing import StandardScaler

# Instantiate
scaler = StandardScaler()

# Fit on the training set only, then transform it
X_train = scaler.fit_transform(X_train)

# Transform the test set using the mean and standard deviation learned from the training set
X_test = scaler.transform(X_test)

In [None]:
X_train

In [None]:
X_test
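
Scaling and splitting interact: the scaler should learn its mean and standard deviation from the training rows only, which means the split has to happen before `fit_transform` is called. A minimal sketch on synthetic data (the `_demo` names and the random values are assumptions for illustration, not part of the exercise):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and labels, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=99)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Training columns have mean ~0 and std ~1 by construction;
# test columns are usually close to that but not exact.
print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
print(X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```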

## Step 7: Fit model and test for accuracy

In [None]:
# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X_train, y_train)

In [None]:
# predict on the test set and measure accuracy
from sklearn import metrics

y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))
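
Accuracy alone can mislead when one class is much more common than the other, and the brief asks for a probability rather than a class label. The sketch below, on synthetic data, compares a classifier against a majority-class baseline and reads probabilities from `predict_proba`. The `_demo` names, the data, and `n_neighbors=15` are illustrative assumptions; with `n_neighbors=1`, every predicted probability is either 0 or 1.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced data for illustration only (roughly 20% positives).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=400) > 1.2).astype(int)

# stratify keeps the class ratio the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=99, stratify=y_demo
)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
null_accuracy = metrics.accuracy_score(y_te, baseline.predict(X_te))

# k=15 is an arbitrary illustrative choice, not a tuned value.
knn_demo = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
knn_accuracy = metrics.accuracy_score(y_te, knn_demo.predict(X_te))

# Column 1 of predict_proba is the estimated probability of the positive class.
y_pred_prob = knn_demo.predict_proba(X_te)[:, 1]
print(null_accuracy, knn_accuracy, y_pred_prob[:5])
```

A model is only adding value if its accuracy beats the baseline figure.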

## BONUS ROUND

## Step 8: What are some improvements to the model pipeline we could make?

1. Shuffle data beforehand (note that `train_test_split` already shuffles by default)
2. Run correlation plots to identify the strongest relationships and any multicollinear features
3. Look at interactions between features
4. Optimize hyperparameters
5. Use grid search (`GridSearchCV`)
6. Identify the bias-variance tradeoff
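
Several of these ideas can be combined. Below is a minimal sketch of hyperparameter tuning with `GridSearchCV` over a scaler-plus-KNN `Pipeline`, run on synthetic data. The `_demo` names, the data, and the candidate values for `n_neighbors` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=99, stratify=y_demo
)

# Putting the scaler inside the pipeline refits it on each training fold,
# so the validation fold never influences the scaling.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Candidate values are arbitrary choices for this sketch.
param_grid = {'knn__n_neighbors': [1, 5, 15, 25]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_tr, y_tr)

# Best setting found by cross-validation, then a final check on held-out data.
test_accuracy = grid.score(X_te, y_te)
print(grid.best_params_, round(grid.best_score_, 3), test_accuracy)
```

Small values of `n_neighbors` tend toward high variance and large values toward high bias, so the cross-validated scores across the grid give a rough view of that tradeoff.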