# Data Science Taste Test - Exercise on IBM HR Data
## brAIn.hr

### 3. The Data Science Workflow (General Assembly Model)

![GA Pipeline](./assets/general_assembly_pipeline.png)

- Frame - Problems & Hypotheses 
- Prepare - Ingestion & Cleaning
- Analyze - Studying the Data
- Interpret - Inference & Prediction
- Communicate/Deploy - Enabling Decisions

Practicing Data Science & Machine Learning skills is going to require that you get hands-on with datasets solving problems that are interesting to you. Fortunately, there are now tons of places you can go to access data and practice running models. The most prominent one is Kaggle. They run paid competitions to see whose models can drive the most accurate results, and you can see other people's projects and approaches to assessing data and running predictive models. 

Even better than Kaggle is working with data that you're already using at work. If you're able to install Python in your work environment, then you have the best source of all to practice.

Today's dataset should be in the repository folder you downloaded, but if not, access it here: [Kaggle: IBM](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)

### Before we dive into the data...

The cell below installs a tool vital for graphing and visualizing some of our machine learning models. 

Notice the code comments that begin with "#". For your notebooks and code to be explainable, you have to use comments and docstrings """ """ heavily. You'll see more of this as you scroll through. 

In [None]:
#import sys
#!conda install -c pyviz --yes --prefix {sys.prefix} hvplot
#!conda install jupyterlab
#!jupyter labextension install -y @pyviz/jupyterlab_pyviz

The cell below installs two more tools: pydotplus, which lets us visualize decision trees, and scikit-plot, which lets us visualize machine learning metrics.

In [None]:
import sys
!pip install --prefix {sys.prefix} pydotplus  # enables visualization of decision trees
!pip install --prefix {sys.prefix} scikit-plot  # enables visualization of machine learning metrics

## IBM HR Data Project

IBM released a simulated dataset tracking various factors related to IBM employment: [Kaggle: IBM](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)

Our job is to use this data to understand and predict who is likely to quit their job. 

Most of the features (column names) are self-explanatory (DailyRate: money per day, YearsAtCompany: number of years at IBM, etc.).

However, the survey data columns need some explanation: 

- Education (1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor')

- EnvironmentSatisfaction (1 'Low', 2 'Medium', 3 'High', 4 'Very High')

- JobInvolvement (1 'Low', 2 'Medium', 3 'High', 4 'Very High')

- JobSatisfaction (1 'Low', 2 'Medium', 3 'High', 4 'Very High')

- PerformanceRating (1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding')

- RelationshipSatisfaction (1 'Low', 2 'Medium', 3 'High', 4 'Very High')

- WorkLifeBalance (1 'Bad', 2 'Good', 3 'Better', 4 'Best')

### Framing

Because the ask of this project is about attrition, we'll frame our null hypothesis in exactly that way.

The null hypothesis is that we cannot predict attrition better than randomly selecting or by taking the average likelihood.

We reject the null if we can demonstrate statistically significant improvement over random & average likelihood as baseline.

If we're able to do this, it then makes sense to explore how to optimize and automate our predictive and prescriptive power.

Not an easy task. Let's get started!

### Prepare

In this step, we import the libraries that we want to work with in order to assess and visualize our data. You'll typically see this step at the top of any Python project. Afterwards, we'll ingest and process our data.

The notes here are a bit more descriptive than what you'll commonly see, but you can cut and paste the block below for a good, cookie cutter start to many of your future analytics tasks!

Throughout the project, pay attention to the use of comments (#) and docstrings(""" """). These help your code's readability. You'll forget why you added certain features or functions, and other people may look at your code and not even have a clue. Well commented and documented code solves that problem and helps your science!

In [None]:
import pandas as pd  # This open source marvel is your key to deep analysis and manipulation of data
import matplotlib.pyplot as plt  # This is a low-level plotting language. Not friendly, but deep control
import seaborn as sns  # This is a friendlier, higher level statistical plotting API based on matplotlib
import numpy as np  # Matrix math & linear algebra

from tqdm import tqdm  # This is a progress bar that comes in handy to monitor lengthy operations stress-free
from IPython.display import display  # This gives more visibility options when running commands

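# A plotting style to make our plots look more appealing by default. matplotlib 3.6+ renamed it with a "v0_8" tag.
style_name = "seaborn-v0_8-darkgrid" if "seaborn-v0_8-darkgrid" in plt.style.available else "seaborn-darkgrid"
plt.style.use(style_name)
pd.set_option("display.max_columns", 100)  # Allows us to adjust the number of columns a DataFrame will show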

In [None]:
# Hooray! You're assigning your first variable. In Python, you assign a value to a name with the equal sign (=).
ibm_hr_path = "./data/WA_Fn-UseC_-HR-Employee-Attrition.csv"

#### Common Ways You'll Interact with Variables

Variables make it way easier to store, compare, and manipulate values that represent your data or real-world numbers!

You assign variables simply by declaring a name, using an equal sign, and inputting a value or expression (Example):
- variable = 20 + 7
- print(variable)
- output: 27

Common variable types you'll interact with are strings (str), integers (int), and floats (float). 

Strings store text, integers store numbers with no decimals, and floats store numbers with decimals.

Strings usually consume the most memory of these types, while numbers (ints and floats) consume far less. This matters heavily when working in datasets that have millions of rows, so it's important to be mindful!

We'll look at how these data types can interact with each other at a high level, but there are many more ways than this. 

In [None]:
type(ibm_hr_path)

In [None]:
path_length = len(ibm_hr_path)
print(path_length)

In [None]:
type(len(ibm_hr_path))

In [None]:
path_math = len(ibm_hr_path) * .59546
print(round(path_math, 2))

In [None]:
print(path_length+path_math)

In [None]:
print(type(path_math))
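One more interaction worth seeing is converting between types. The sketch below uses made-up values (they are not from the IBM dataset) to show how text becomes a number, how a number becomes text, and what happens when you forget to convert.

```python
# Hypothetical values for illustration only
years_text = "7"            # a number stored as text, common when reading raw files
years = int(years_text)     # str -> int
rate = float("1102.5")      # str -> float

total = years * rate                # int * float gives a float
label = "Years: " + str(years)      # convert the int before joining it with text

print(type(years), type(rate), type(total))
print(total)
print(label)

# Joining text and a number without converting raises a TypeError
try:
    broken = "Years: " + years
except TypeError as error:
    print("TypeError:", error)
```
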

#### Functions: What they are and why they're heavily used in this notebook
![Algorithm](./assets/algorithm.jpg "Image Source: https://www.verywellmind.com/what-is-an-algorithm-2794807")

Functions allow you to define a set of tasks and conditions so that you can drive repeatable results. You'll see them frequently in this notebook and explore the awesomeness of not having to repeat yourself. 

When experimenting, rather than designing and declaring the same codeblocks with multiple data sources, functions allow you to create algorithms that you can run, test, & tweak at will. 

It's easy to design functions and then forget what they do. Make heavy use of docstrings """ """ and code comments so that your functions are easily understood and remembered.
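Here is a minimal sketch of that idea before we meet the real functions in this notebook. The function name `attrition_rate` and the team numbers are hypothetical, chosen only to show one definition being reused on different inputs, with a docstring that explains what goes in and what comes out.

```python
def attrition_rate(leavers, headcount):
    """Return attrition as a percentage of headcount.

    leavers(int): number of employees who left
    headcount(int): total number of employees
    """
    return leavers / headcount * 100


# The same logic runs on any inputs, with no copy and paste
team_a = attrition_rate(12, 80)   # hypothetical team A
team_b = attrition_rate(5, 40)    # hypothetical team B
print(round(team_a, 2), round(team_b, 2))

# The docstring travels with the function, so anyone can look it up later
print(attrition_rate.__doc__)
```
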

In [None]:
def read_in(path):
    """Reads in and initially processes the IBM dataset. Takes in the path to the file as a string
    :path(str) - use the relative or absolute filepath to point to a data source"""
    
    df = pd.read_csv(path)  # Read in the file
    df.info()  # Prints non-null counts per column and memory usage (info() prints directly and returns None)
    print("\nNumeric Columns")
    display(df.describe())  # Descriptive statistics for each column
    print("Categorical Columns")
    categorical_columns = list(df.dtypes[df.dtypes == "object"].index)
    display(df[categorical_columns].describe())
    
    return df

We experimented with assigning strings, floats and integers to variables. What type of data structure are we storing in the line below?

In [None]:
ibm_hr_data_initial = read_in(ibm_hr_path)

Let's put on our stat hats. Does it make sense to keep all these columns? Are there potentially any columns that aren't conveying meaning or information?

In [None]:
throw_away_columns = ["EmployeeCount", "StandardHours", "Over18"]  # This is a list of values. Lists can contain any number of other data structures:
                                                                   # integers, strings, boolean, floats, other lists, dictionaries, tuples, etc.!
    
ibm_hr_data = ibm_hr_data_initial.drop(columns=throw_away_columns)  # we passed the list to the "drop()" function to tell it to discard these columns
ibm_hr_data

In [None]:
type(ibm_hr_data)

In [None]:
len(ibm_hr_data)

### Analyze

![3 Classes of Analytics](./assets/analytics_classes.png)

### Once we've identified people who are at-risk for attrition...
What are some things we can and should do?

A:

In this exercise, we begin with descriptive statistics that form the bedrock for our predictions and suggestions. 

In [None]:
# Use "value_counts()" liberally to quickly count up values in columns. Notice how
# convenient it is to count up categorical values.

# Notice also the ease with which we can run calculations on these findings.

attrition_value_counts = ibm_hr_data.Attrition.value_counts()
attrition_rate = attrition_value_counts["Yes"] / attrition_value_counts.sum() * 100  # employees who left as a share of all employees
display(attrition_value_counts, f"{attrition_rate:.2f}% rate of attrition.")

In [None]:
# Additional Data Cleanup
# We move back and forth between "Prepare" and "Analyze" as we discover more flaws in the data.
# Here, we're storing "Yes" & "No" as boolean values for more efficient math & machine learning operations.

ibm_hr_data.Attrition = np.where(ibm_hr_data.Attrition == "Yes", True, False)  # Conditional logic to change data from string to boolean
ibm_hr_data.OverTime = np.where(ibm_hr_data.OverTime == "Yes", True, False)  # Conditional logic to change data from string to boolean
ibm_hr_data.EducationField = ibm_hr_data.EducationField.str.strip()  # the .strip() function cleans white space before and after text
display(ibm_hr_data.Attrition.describe())  # describe() lets us quickly view relevant stats on a particular variable

A little more on boolean values.

These represent True (1) and False (0). That's right, True stores as 1 and False stores as 0. Below, we play with boolean values and show what they look like interacting with other numbers. 

They're preferable to strings when possible because they take up less space, and you can derive details by doing simple math. For example, if you want to find out how many True values are in a True/False column, just add up the values of that column. 

In [None]:
display(type(True), type(False))

In [None]:
True + 0

In [None]:
False + 0

What's the numeric value of True? 

a: 

In [None]:
ibm_hr_data.Attrition

In [None]:
display(sum(ibm_hr_data.Attrition))  # Adding the values of a column to get a count of the True values

In [None]:
numeric_columns = list(ibm_hr_data.dtypes[ibm_hr_data.dtypes.isin(["int64", "bool"])].index)  # Identifying the columns we can use for numeric calculations

In [None]:
number_of_numcols = len(numeric_columns)
print(f"There are {number_of_numcols} numeric columns.")  # f-strings let us put variables right into text. extremely fast and readable way to share your calculations!

In [None]:
display(numeric_columns)

In [None]:
def num_plotter(columns, groupby, data=ibm_hr_data, rot=0):
    """Returns a series of bar plots showing average difference of each variable.
    
    columns(list): list of columns you'd like to plot
    groupby(str): the name of the column you'd like to split the values
    data(DataFrame): the DataFrame being analyzed
    rot(int): input number of degrees to rotate the xticks for better aesthetics & readability"""
    
    # We want this to run quickly and effectively every time. We also want our users to understand
    # how to use the function. The docstring above helps, but sometimes they'll get something wrong
    # and fail to read the instructions. Below, you see that we can create our own error messages.
    # The idea behind this one is to let the user know that they've entered the wrong data, but there
    # are a wealth of options they can choose instead. 
    #
    # To err is human. To raise errors is divine.
    column_names = list(data.columns)
    
    # if statements allow for conditional logic, nested conditions, and powerful controls over
    # variables and outcomes
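    missing = [name for name in list(columns) + [groupby] if name not in column_names]  # check the plotted columns as well as groupby
    if missing:
        raise ValueError(f"Make sure that columns & groupby are in the columns of your data: \n{column_names}")  # F-strings make it easy and fast to explain your calculations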
    
    # The meat of this algorithm is how it iteratively generates graphs. The "for" loop below is 
    # telling the function to do some operation on each column of data we passed until it's
    # processed all of the columns. Only then does it execute the next code that's not indented.
    fontsize = 15
    plt.figure(figsize=(18, 20))   
    for index, col in enumerate(columns):
        plt.subplot(12, 3, index+1)
        sns.barplot(data=data, x=groupby, y=col)
        plt.xlabel(groupby, fontsize=fontsize-4)
        plt.ylabel(ylabel=col, fontsize=fontsize-5)
        plt.xticks(fontsize=fontsize-5, rotation=rot)
        plt.yticks(fontsize=fontsize-3)
        
    plt.tight_layout()
        
    return

Why didn't this code work?

In [None]:
# Why didn't this piece of code work?
num_plotter(columns=numeric_columns, groupby="Strength")

We programmed the function to raise an error if someone tried to pass a value that isn't in the columns. 
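If you want a notebook to keep running after an error like this, you can catch it. The sketch below uses a small stand-in function, `check_column`, which is hypothetical and not part of this notebook. It raises the same kind of `ValueError` and shows how `try` / `except` lets you read the message instead of stopping.

```python
def check_column(name, available_columns):
    """Return name if it is an available column, otherwise raise a ValueError."""
    if name not in available_columns:
        raise ValueError(f"'{name}' is not a column. Choose from: {available_columns}")
    return name


columns = ["Attrition", "Department", "OverTime"]  # hypothetical subset of column names

try:
    check_column("Strength", columns)
except ValueError as error:
    message = str(error)
    print("Caught:", message)

# A valid name passes straight through
print(check_column("Department", columns))
```
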

Our function in action! It's visualizing how various variables relate to Attrition. Which ones stand out?

In [None]:
num_plotter(columns=numeric_columns, groupby="Attrition")

We wondered how various variables related to job satisfaction. Obvious stuff: higher job satisfaction indicates lower attrition. Even if something is "obvious", it's much better that it be proven. 

In [None]:
num_plotter(columns=numeric_columns, groupby="JobSatisfaction")

Our fictional dataset lacks significant gender disparity. 

In [None]:
num_plotter(columns=numeric_columns, groupby="Gender")

It looks like R&D positions are relatively stable. 

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="Department" )

The field of education appears to be associated with attrition as well.

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="EducationField" )

Marital status has an impact as well. This could be indicative of other factors, such as age.

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="MaritalStatus")

Sales representatives are the least stable, whereas research directors are the most stable.

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="JobRole", rot=90)

Look at the difference in attrition rate for this variable.

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="OverTime")

And among the travelers

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="BusinessTravel")

And divided by sex

In [None]:
num_plotter(data=ibm_hr_data, columns=["Attrition"], groupby="Gender")

Let's unpack a different way to calculate the impact of these variables. Here, you see how we can chain together multiple functions to produce desired values. 

In [None]:
percent_differences = ibm_hr_data.groupby("Attrition").mean(numeric_only=True).pct_change().iloc[1].sort_values()  # numeric_only skips text columns, which cannot be averaged
display(percent_differences)
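To unpack that chain, here is the same sequence of steps split onto separate lines. It runs on a tiny made-up DataFrame (not the IBM data) so that you can check each intermediate result by hand.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Attrition": [False, False, True, True],
    "Age": [40, 44, 30, 33],
    "MonthlyIncome": [6000, 7000, 4000, 3800],
})

group_means = toy.groupby("Attrition").mean(numeric_only=True)  # one row per group: False, then True
changes = group_means.pct_change()      # each row relative to the row above; the first row is NaN
leavers_vs_stayers = changes.iloc[1]    # the True row: relative change going from stayers to leavers
ranked = leavers_vs_stayers.sort_values()

print(group_means)
print(ranked)
```

In this toy data, leavers average 31.5 years of age against 42 for stayers, so Age comes out as a 25% decrease. Reading the real output works the same way: negative values mean leavers score lower on that column, positive values mean they score higher.
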

We can then plot the variable to more easily see the impact.

In [None]:
ax = percent_differences.plot(kind="bar", figsize=(16, 8), cmap="viridis", fontsize=18)

Are any of these variables related to each other? We run a simple Pearson correlation to find out. 

In [None]:
data_correlations = ibm_hr_data[numeric_columns].corr()  # numeric_columns already includes the boolean Attrition column
display(data_correlations)
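If you're curious what a Pearson correlation actually computes, the sketch below works it out by hand in plain Python. The helper `pearson_r` and the numbers are hypothetical, picked so the answers are easy to predict: two columns that move in lockstep score close to 1, and two that move in opposite directions score close to -1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # How much the two variables move together around their means
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How much each variable moves on its own
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values for illustration only
years_at_company = [1, 3, 5, 7, 9]
monthly_income = [2000, 3000, 4000, 5000, 6000]  # rises in lockstep with tenure
distance_from_home = [9, 7, 5, 3, 1]             # falls in lockstep with tenure

print(pearson_r(years_at_company, monthly_income))
print(pearson_r(years_at_company, distance_from_home))
```
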

In [None]:
def heatmap_view(correlation_table, fontsize=20):
    """Intakes a correlation table and outputs a heatmap. """

    plt.figure(figsize=(18, 16))
    sns.heatmap(correlation_table.round(2), annot=True, cmap="plasma", robust=True, square=True)
    plt.yticks(fontsize=fontsize-8)
    plt.xticks(fontsize=fontsize-8)
    
    return

And to more easily see extreme values in how these numbers relate to each other, we use a heat map. 

In [None]:
heatmap_view(correlation_table=data_correlations)

We can use a pairplot to see distributions and how variables correlate with each other. 

In [None]:
top_difference_columns = percent_differences.abs().nlargest(10).index  # the 10 columns with the largest absolute percent difference

In [None]:
sns.pairplot(ibm_hr_data[top_difference_columns]);

#### Predictive Modeling with sklearn

Selecting the right algorithm in sklearn...
![ML Map](./assets/ml_map.png)

##### How to Handle the Data

Each class of algorithms comes with different considerations in how we work to obtain & measure predictions reliably. 

Because we're predicting a "True", "False" value, our work here will have to fall into the "Classification" category. 

What we're concerned with is how well the model outputs a probability that an employee will quit versus stay. 

Classification algorithms look for patterns in the data that are related to the output variable, the "y" variable, that we're trying to predict. Companies value these algorithms highly when they're able to perform well on new instances or new cases not seen in the original dataset. New cases not seen in the original data are called out-of-sample.

- Why is it important to perform well on out-of-sample data? 



a: 

How do we prepare for how well the model will perform on out-of-sample data, considering we can only train based on data we have available?

In [None]:
from sklearn.model_selection import train_test_split

def data_prep(df, features, target):
    """Input a DataFrame, a list of features, and a target column in order to get reliable
    training and test datasets that we can use to train and measure models.
    
    df(DataFrame): The DataFrame we're modeling
    features(list): These are the columns we're using to train the model. Always enter this as a list, and exclude your target variable.
    target(str): This is the name of the target column"""
        
    if target in features:
        raise ValueError("Your target is in your feature set. Please exclude your target variable from your features") 
    
    y = df[target]
    df = df[features].copy()
    
    categorical_features = list(df.dtypes[df.dtypes == "object"].index)  # df already holds only the feature columns
    
    if len(categorical_features) > 0:
        print("Processing Categorical Features: ")
        df = categorical_prep(df=df, categorical_columns=categorical_features)
        

        df = df.drop(columns=categorical_features)
        features = list(df.columns)
    
    # Split the dataframe into the feature columns & the target column    
    X = df[features]

    # This establishes a split where the model will train on 2/3rds of the data. The "Test" data is the remaining 33%.
    # By splitting into these sets, we're able to mimic the effect of performance on out-of-sample data. 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=10)
    
    return X_train, X_test, y_train, y_test

In [None]:
def categorical_prep(df, categorical_columns):
    """Prepares categorical columns for machine learning tasks"""
    
    # loop over the columns identified as categorical
    for col in categorical_columns:
        dummies = pd.get_dummies(df[col])  # form a matrix of 1's and 0's representing the categories
        print(col)
        dummy_col_list = list(dummies.columns)
        
        dummy_cols = []
        
        for column in dummy_col_list:
            dummy_cols.append(str(col) + "_" + str(column))  # Names the columns so they're more easily understood
            
        dummies.columns = dummy_cols
        
        display(dummies.head(3))
        
        df = pd.concat([df, dummies], axis=1)
        
    return df
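For reference, `pd.get_dummies` can also name the new columns for you through its `prefix` argument, which gives the same `Column_Category` naming as the inner loop above. This is a small sketch on a made-up DataFrame, offered as an alternative to compare against rather than a replacement for the function.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Department": ["Sales", "Research & Development", "Sales", "Human Resources"]
})

# prefix="Department" names each new column "Department_<category>"
dummies = pd.get_dummies(toy["Department"], prefix="Department")

print(dummies.columns.tolist())
print(dummies.astype(int))  # shown as 1's and 0's; each row has exactly one 1
```
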

##### Determining Our Features

Feature engineering is the process of selecting, creating, or processing features to prepare for machine learning algorithms.

Above, we saw numeric and categorical features that showed high association with attrition. We want to incorporate these into our model. 

In [None]:
ibm_hr_data.columns

In [None]:
target = "Attrition"
exclude_columns = ["Gender", "MaritalStatus"]
exclude_columns.append(target)
features = [feature for feature in ibm_hr_data.columns if feature not in exclude_columns]  # List comprehensions are a way to quickly populate your lists
                                                                                           # Notice that it's iterating over each column
    
print(features, f"\n\nWe're starting with {len(features)} features, but further engineering could yield more features.")
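If list comprehensions are new to you, it helps to see one next to the loop it replaces. The column names below are a hypothetical subset picked for illustration. Both versions build the same list.

```python
all_columns = ["Age", "Attrition", "Gender", "JobRole", "MaritalStatus", "OverTime"]  # hypothetical subset
exclude = ["Gender", "MaritalStatus", "Attrition"]

# Long form: start with an empty list and append inside a loop
kept_loop = []
for column in all_columns:
    if column not in exclude:
        kept_loop.append(column)

# Short form: the same loop and condition written as a list comprehension
kept_comprehension = [column for column in all_columns if column not in exclude]

print(kept_loop)
print(kept_comprehension)
```
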

In [None]:
exclude_columns

In [None]:
features

![Data Flow](./assets/data_flow.png)

In [None]:
X_train, X_test, y_train, y_test = data_prep(df=ibm_hr_data, features=features, target=target)

In [None]:
# .shape is a fast, common way to see how many rows and columns a dataframe, matrix, or an array has. 
# the first number is number of rows, the second is number of columns

display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
X_train

In [None]:
# Classification algorithms. In reality, there are many others to try,
# but given the challenge at hand, these are highly explainable, fast,
# and robust to the categorical data we're dealing with
from sklearn.tree import DecisionTreeClassifier  # Decision Tree
from sklearn.ensemble import RandomForestClassifier  # Random Forest

# Visualization tools that will help us visualize our decision tree
from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases; the standard library provides StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

def tree_fit_viz(X_train, y_train, depth=4, viz=True):
    """Fits a decision tree to prepared training data.
    
    Make sure that you've appropriately divided the data before 
    fitting your model. With viz set to True, returns a 
    visualization of a single decision tree fit to the data."""
    
    # Instantiate the model
    clf = DecisionTreeClassifier(max_depth = depth, 
                             random_state = 45)
    
    # fit it to the data
    clf.fit(X_train, y_train)
    
    # Visualize the decision tree if viz is set to true
    if viz:
        dot_data = StringIO()
        export_graphviz(clf, out_file=dot_data,filled=True, rounded=True,
                        special_characters=True, feature_names=list(X_train.columns), max_depth=depth)
        graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
        display(Image(graph.create_png()))
        
        # Extract and display feature importance
        importance = zip(list(X_train.columns), clf.feature_importances_)
        importance_ranks = pd.Series(dict(importance))
        importance_ranks_significant = (importance_ranks[importance_ranks > 0]).sort_values(ascending=False) * 100

        ax = importance_ranks_significant.sort_values().plot(kind="barh", figsize=(10,7), title="Feature Importance", cmap="viridis")
        display(ax);
    
    return clf

In [None]:
decision_tree = tree_fit_viz(X_train, y_train, depth=4);

### Predictions, Metrics, & Interpretations

Recall that the original framing was that it's worthwhile to pursue predictive modeling only if we find that we can outperform random performance or simply applying the majority class and assuming no one will quit. Let's check our accuracy results for random, majority class application, and our decision tree. 

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
import scikitplot as skplt

In [None]:
print(f"We'll be making {len(y_test)} predictions, which is the number of values in the test set. Let's create a DataFrame to better track these results")

In [None]:
result_frame = pd.DataFrame({"Real Results": list(y_test),  # The actual results form column 1
                             "Random Results": np.where(np.random.randint(0,2, size= len(y_test)) == 1, True, False),  # We use a random number generator for our random baseline
                             "Random Probabilities": 0.5,  # The probabilities under random results
                             "False Only Column": False,  # For our "False" baseline, reflecting the majority class, this will project "False" all down the column
                             "False Only Probability": 0,  # The predicted probabilities for False only
                             "Tree Results": decision_tree.predict(X_test), # This column reflects what our decision tree predicted
                             "Tree Probabilities": decision_tree.predict_proba(X_test)[:,1]})  # Probabilities for the Decision Tree


result_frame

In [None]:
false_only_accuracy = accuracy_score(result_frame['Real Results'], result_frame['False Only Column'])
tree_accuracy = accuracy_score(result_frame['Real Results'], result_frame['Tree Results'])
print(f"Predict No Attrition Accuracy Score: {false_only_accuracy}")
print(f"Decision Tree Accuracy Score: {tree_accuracy}")

Our decision tree only barely outperformed saying everything is false! Why would we want to go through the effort of making and using a decision tree? Or should we just find another model?

a: 



We will try another model (random forest) and use metrics that are actually effective for these problems.
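To see why accuracy alone can mislead when most employees stay, here is a hand-computed sketch in plain Python. The outcomes and predictions are made up, and the `scores` helper is hypothetical. It treats precision and recall as 0 when there is nothing to divide by, which is a simplifying assumption.

```python
# Hypothetical outcomes for 10 employees: True means the employee left
actual    = [True, True, False, False, False, False, False, False, False, False]
all_false = [False] * 10   # baseline: predict that nobody leaves
model     = [True, False, True, False, False, False, False, False, False, False]


def scores(actual, predicted):
    """Return accuracy, precision, and recall for boolean predictions."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a and p for a, p in pairs)           # leavers we flagged
    false_pos = sum((not a) and p for a, p in pairs)    # stayers we flagged by mistake
    false_neg = sum(a and (not p) for a, p in pairs)    # leavers we missed
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


print("All False:", scores(actual, all_false))
print("Model:    ", scores(actual, model))
```

Both sets of predictions are right 8 times out of 10, yet only the model finds any leavers at all. Precision and recall expose that difference, while accuracy hides it.
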

In [None]:
%%time

def random_forest_fit(X_train, y_train, depth, trees):
    """Instantiate and output a random forest model based on data and inputs"""
    
    rf = RandomForestClassifier(max_depth=depth, n_estimators=trees, random_state=42)
    rf.fit(X_train, y_train)
    
    return rf

%time random_forest = random_forest_fit(X_train, y_train, depth=10, trees=1000)

result_frame['Random Forest Results'] = random_forest.predict(X_test)
result_frame["Random Forest Probabilities"] = random_forest.predict_proba(X_test)[:,1]

In [None]:
result_frame

In [None]:
# This is the result of running a pure accuracy metric against the numbers.
model_results = ["Real Results", "Random Results", "Tree Results", "Random Forest Results"]
raw_accuracy_scores = result_frame[model_results].apply(lambda x: (accuracy_score(result_frame["Real Results"], x)*100))
raw_accuracy_scores.plot(kind="bar", rot=45, cmap="viridis")
display(raw_accuracy_scores)

In [None]:
classification_report = pd.DataFrame({"Accuracy Scores": result_frame[model_results].apply(lambda x: (accuracy_score(result_frame["Real Results"], x)*100)),
                                      "Precision Scores": result_frame[model_results].apply(lambda x: (precision_score(result_frame["Real Results"], x)*100)),                                      
                                      "Recall Scores": result_frame[model_results].apply(lambda x: (recall_score(result_frame["Real Results"], x)*100)),
                                      "F1 Scores": result_frame[model_results].apply(lambda x: (f1_score(result_frame["Real Results"], x)*100))})

classification_report

We have reason to believe that the Decision Tree classifier performs the best on this data. However, this is only on predicting "True/False" values. What if we want to select the model that's best at scoring people at various levels of risk? 

For that, we'll need another classification metric: the area under the Receiver Operating Characteristic curve (ROC AUC), much more commonly called AUC. It tests model performance across the full range of classification thresholds. The output is in a familiar format: between 0 and 1, with values closer to 1 being better and 0.5 matching random guessing. This helps us decide which model to select and put into production. 
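One way to build intuition for AUC: it equals the chance that a randomly chosen leaver receives a higher risk score than a randomly chosen stayer, with ties counted as half. The sketch below computes it that way in plain Python. The function `pairwise_auc` and the scores are hypothetical, and this pairwise loop is meant for small examples, not as the way sklearn computes it.

```python
def pairwise_auc(actual, scores):
    """AUC as the share of (leaver, stayer) pairs where the leaver scores higher."""
    positives = [s for a, s in zip(actual, scores) if a]
    negatives = [s for a, s in zip(actual, scores) if not a]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0     # leaver correctly ranked above stayer
            elif pos == neg:
                wins += 0.5     # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes and risk scores for 5 employees
actual = [True, True, False, False, False]
model_scores = [0.9, 0.4, 0.6, 0.2, 0.1]
constant_scores = [0.5] * 5     # the same score for everyone carries no ranking information

print(pairwise_auc(actual, model_scores))
print(pairwise_auc(actual, constant_scores))
```

The constant scores land on exactly 0.5, which is why our random and False-only baselines make a fair yardstick for the tree and the forest.
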

In [None]:
y_probas = decision_tree.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_probas, plot_micro=False, plot_macro=False)
plt.show()

In [None]:
y_probas = random_forest.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_probas, plot_micro=False, plot_macro=False)
plt.show()

In [None]:
proba_cols = ['Real Results','Random Probabilities', 'False Only Probability', 
              'Tree Probabilities', 'Random Forest Probabilities' ]
proba_frame = result_frame[proba_cols]
proba_roc_auc = proba_frame.iloc[:, 1:].apply(lambda x: roc_auc_score(proba_frame['Real Results'], x))*100
proba_roc_auc.plot(kind='bar', cmap="viridis", rot=45);
display(proba_roc_auc)

In [None]:
def lift_calc(df, model_col_name, bins=5, real_results="Real Results"):
    """Intakes predictive model probability output and returns the relative lift of various risk categories.
    This is a more laymen-interpretable way to look at the results than ROC curves! Interpretability is
    key for adoption."""
    
    # pd.cut() separates the results into evenly spaced bins. These are our risk categories. Think of them as star rankings.
    df_cut = df.groupby(pd.cut(df[model_col_name], bins=bins), observed=False)[real_results].agg(["count", "sum"])  # observed=False keeps empty bins so the 1-to-5 labels stay aligned
    proba_cut = df_cut["sum"]/df_cut["count"]
    
    # We'll change the bin names into easily understood numbers
    proba_cut.index = range(1,len(proba_cut)+1)
    df_cut.index = range(1,len(proba_cut)+1)
    
    # Here, we plot the results as a bar chart
    fontsize=20
    
    model = model_col_name.replace('Probabilities', 'Model')
    
    ax = (proba_cut*100).plot(kind="bar", cmap="viridis", figsize=(10,6))
    plt.suptitle("Attrition Probability by Risk Score Lift Chart", fontsize=fontsize-1, y=1.01)
    plt.title(model)
    plt.xlabel("Risk Category (Higher Number is Higher Risk)", fontsize=fontsize-3)
    plt.ylabel("Actual Attrition %", fontsize=fontsize-3)
    
    # And we present a DataFrame for easier analysis
    lift_frame = pd.concat([df_cut, proba_cut*100], axis=1)
    lift_frame.columns = ["Employees", "Employee Attrition #", "Attrition %"]
    
    return lift_frame

#### Interpretable Communication

The graphics below frame up the results of the models in a way that should be easier to understand. They bucket the models' risk scores in a familiar "5 star" format. Each bar represents the actual attrition rate observed in that bucket. For a good model, higher risk buckets should show a higher rate of attrition. 
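Here is a rough, plain-Python sketch of the bucketing idea, using made-up probabilities and outcomes. The helper `bin_rates` is hypothetical. It splits the score range into equal-width buckets much like `pd.cut` does, though the exact handling of bucket edges differs, so treat it as an illustration and not a drop-in replacement.

```python
def bin_rates(probabilities, outcomes, bins=5):
    """Return (counts, attrition % per bucket) for equal-width risk buckets."""
    low, high = min(probabilities), max(probabilities)
    if high == low:
        raise ValueError("All probabilities are identical, so there is nothing to bucket.")
    width = (high - low) / bins
    counts = [0] * bins
    leavers = [0] * bins
    for probability, left in zip(probabilities, outcomes):
        index = min(int((probability - low) / width), bins - 1)  # the top score goes in the last bucket
        counts[index] += 1
        leavers[index] += int(left)
    rates = [leavers[i] / counts[i] * 100 if counts[i] else None for i in range(bins)]
    return counts, rates


# Hypothetical model probabilities and real outcomes for 10 employees
probabilities = [0.05, 0.10, 0.15, 0.30, 0.35, 0.50, 0.55, 0.70, 0.90, 0.95]
outcomes = [False, False, False, False, True, False, True, True, True, True]

counts, rates = bin_rates(probabilities, outcomes)
for bucket, (count, rate) in enumerate(zip(counts, rates), start=1):
    print(f"Risk category {bucket}: {count} employees, {rate}% attrition")
```
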

In [None]:
lift_frame = lift_calc(df=proba_frame, model_col_name="Tree Probabilities")
lift_frame

In [None]:
lift_frame = lift_calc(df=proba_frame, model_col_name="Random Forest Probabilities")
lift_frame

### The Results

Based on the results of our modeling, we believe that the data we're collecting on employee engagement has some definite "signal" in predicting attrition. 

Attrition is potentially highly costly, especially with high replacement costs, cultural impact, and corporate knowledge lost when high quality employees leave.

Because of the nuance and sensitivity of the types of interventions we might use given attrition, we are suggesting that the ML model we use be explainable, able to provide reasons for flagging employees as high attrition risk. We will evaluate between Decision Trees and Random Forests to determine which model can be the most effective at reducing undesired attrition.

# You Made It!

![data](https://media.giphy.com/media/zEU2uwmialC4U/giphy.gif)

If you made it this far, that means you're serious about learning this stuff!

Here are some tracks that you can use:
- [DataQuest](https://www.dataquest.io/) - For a self-paced, structured, online journey through initial python skills to data science tasks
- [General Assembly](https://generalassemb.ly/education/data-science-remote-online) - For those who prefer a classroom or online class environment

Who to follow?
- [Kareem Carr](https://twitter.com/kareem_carr) - For your daily dose of data science & statistics snark
- [Rachel Thomas](https://twitter.com/math_rachel) - Co-founder of fast.ai, Natural Language Processing (NLP) guru, & heading up data ethics initiatives
- [Cathy O'Neil](https://twitter.com/mathbabedotorg) - Weapons of Math Destruction author, heavy on data ethics
- [Chris Albon](https://twitter.com/chrisalbon) - You'll see his work a ton when you're learning data analysis in pandas, extremely helpful
- [Kevin Markham](https://twitter.com/justmarkham) - Founder of data school, prolific poster of entry level content, great community around learning pandas!
- [Data Science Renee](https://twitter.com/BecomingDataSci) - Great guidance on becoming a data scientist, follow her blog as well!