# Top Performers Model 

This notebook documents how to build a model that predicts the top performers in a company and provides insight into the factors associated with their success. 
The model is built on a demo dataset composed of metrics from Viva Insights, using a random forest classifier from `sklearn`. 

## Set-up

We start off by loading in the required Python packages:

In [15]:
# data cleaning and utility
import numpy as np
import pandas as pd
import vivainsights as vi
import os

# visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt

# machine learning
from sklearn.ensemble import RandomForestClassifier # scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

The next step is to load the dataset and examine it. Our local directory contains a demo dataset with a structure similar to a Person Query, plus an additional `performance` attribute that holds performance scores on a 5-point scale. 

`vi.import_query()` imports the demo person query data and cleans the variable names. An alternative is `pd.read_csv()`, which also reads the input CSV file but leaves the column names as they are written in the file. 
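
As a rough illustration of the `pd.read_csv()` route, the sketch below reads a small in-memory CSV and then tidies the column names by hand. The sample rows and the cleaning rule (replacing spaces with underscores) are assumptions made for this example; they are not a description of what `vi.import_query()` does internally.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a person query export
csv_text = (
    "PersonId,Collaboration hours,Internal network size\n"
    "a1,12.5,40\n"
    "b2,20.0,55\n"
)

# pd.read_csv() only parses the file; column names are kept as written
df = pd.read_csv(io.StringIO(csv_text))

# Illustrative cleaning step (an assumption, not the vivainsights rule):
# strip surrounding whitespace and replace spaces with underscores
df.columns = df.columns.str.strip().str.replace(" ", "_", regex=False)

print(df.columns.tolist())
```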

In [16]:
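# Build a relative path that goes up one directory and into the data folder
# (os.path.join keeps the path portable across operating systems)
data_path = os.path.join(os.getcwd(), "..", "data", "Top_Performers_Dataset_v2.csv")
raw_data = vi.import_query(data_path)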

# Examine the data
raw_data.head() # first 5 rows

Unnamed: 0,PersonId,Internal_network_size,Collaboration_hours,weekend_collaboration_hours,After_hours_call_hours,performance
0,8c14b6ac-f043-39a6-9365-2898072a5951,79,130.04306,31.162695,0.0,5.0
1,8c14b6ac-f043-39a6-9365-2898072a5951,79,129.31995,31.0,0.0,5.0
2,8c14b6ac-f043-39a6-9365-2898072a5951,79,128.94211,31.065472,0.0,5.0
3,8c14b6ac-f043-39a6-9365-2898072a5951,79,128.54636,31.0325,0.0,5.0
4,8c14b6ac-f043-39a6-9365-2898072a5951,79,127.14406,31.0975,0.0,5.0


## Data preparation

There are typically a number of data preparation and validation procedures involved before fitting a model, such as: 
- Handling missing values
- Changing variable types
- Handling outliers and unwanted data
- Splitting data into training and test sets

In this notebook, we will assume that the dataset is of decent quality, and that all that is required are the standard procedures of changing variable types and splitting the data into train/test sets. 

We start by dropping any non-numeric columns (`PersonId` in this case). Although it is optional, we also convert the `performance` variable into a binary variable (`perform_cat`), so that the task becomes a classification problem. This step is for demo purposes, as use cases with a binary outcome are more common than those with an ordinal or continuous one. 

In [17]:
clean_data = raw_data.drop(columns=['PersonId']) # drop PersonId - not required for fitting
# Binary variable where >= 4 indicates High Performance
clean_data['perform_cat'] = np.where(clean_data['performance'] >= 4, 1, 0)


clean_data.head()

Unnamed: 0,Internal_network_size,Collaboration_hours,weekend_collaboration_hours,After_hours_call_hours,performance,perform_cat
0,79,130.04306,31.162695,0.0,5.0,1
1,79,129.31995,31.0,0.0,5.0,1
2,79,128.94211,31.065472,0.0,5.0,1
3,79,128.54636,31.0325,0.0,5.0,1
4,79,127.14406,31.0975,0.0,5.0,1
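
After creating a binary outcome it is often worth checking how balanced the two classes are, since this affects which evaluation metrics are meaningful. The sketch below applies the same `>= 4` rule to a small made-up set of scores (the `demo` frame is hypothetical) and reports the share of each class.

```python
import numpy as np
import pandas as pd

# Hypothetical performance scores on the 5-point scale
demo = pd.DataFrame({"performance": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 2.0, 5.0]})

# Same rule as above: scores of 4 or more are labelled as high performers
demo["perform_cat"] = np.where(demo["performance"] >= 4, 1, 0)

# Share of each class; a small share of 1s signals an imbalanced outcome
class_share = demo["perform_cat"].value_counts(normalize=True)
print(class_share)
```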


The `train_test_split()` function from `sklearn.model_selection` makes it easy to split the data into training and test datasets. In the following example, the parameters are provided in this order: (i) data frame containing the predictor variables only, (ii) the outcome variable only, and (iii) `test_size`, which controls the proportion of the dataset to include in the test split. Because no `random_state` is set, the rows assigned to each split will differ from run to run.

The result is assigned to four objects:
- `x_train` - predictors, train set
- `x_test` - predictors, test set
- `y_train` - outcome, train set
- `y_test` - outcome, test set

In [18]:
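# Split train and test data
outc_var_df = clean_data['perform_cat']
# Note: only the outcome is dropped here, so the predictors still include
# `performance` (the score `perform_cat` is derived from) and `Unnamed: 0`
pred_var_df = clean_data.drop(columns=['perform_cat'])

x_train, x_test, y_train, y_test = train_test_split(pred_var_df, outc_var_df, test_size = 0.30)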
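
Since `perform_cat` is computed directly from `performance`, keeping `performance` among the predictors lets the model read off the answer (target leakage). Below is a minimal sketch of a split that avoids this, assuming a synthetic stand-in for `clean_data`; the column values, the `leaky_cols` list, and the choice of `random_state=42` are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for clean_data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Unnamed: 0": np.arange(n),  # leftover row index
    "Internal_network_size": rng.integers(10, 100, n),
    "Collaboration_hours": rng.uniform(5, 130, n),
    "performance": rng.integers(1, 6, n).astype(float),
})
demo["perform_cat"] = np.where(demo["performance"] >= 4, 1, 0)

# Drop the outcome, the score it is derived from, and the row index
leaky_cols = ["perform_cat", "performance", "Unnamed: 0"]
X = demo.drop(columns=leaky_cols)
y = demo["perform_cat"]

# random_state fixes the shuffle; stratify keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratifying matters most when one class is rare, because an unstratified split can leave very few positive cases in the test set.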

## Fitting the model

The next step is to fit the random forest model, with `RandomForestClassifier()` from the `sklearn.ensemble` module.

After initializing the model and assigning to `rf`, we supply `x_train` and `y_train` to `fit()`, where the two variables represent the training data sets for the predictors and the outcome respectively.

Note that `RandomForestClassifier()` comes with many default parameters, which are documented [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We are using all the default parameters here, as we do not supply any additional parameters to `RandomForestClassifier()`.


In [19]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
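
One way to get at the factors that drive the predictions is the fitted model's `feature_importances_` attribute. The sketch below is an illustration on synthetic data, not the notebook's dataset: `X_demo`, `y_demo`, and `rf_demo` are hypothetical names, and the outcome is constructed so that only one predictor matters.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic predictors; only the first column actually drives the outcome
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Collaboration_hours": rng.uniform(0, 100, 500),
    "Internal_network_size": rng.uniform(0, 100, 500),
})
y_demo = (X_demo["Collaboration_hours"] > 50).astype(int)

# random_state makes the fitted forest reproducible across runs
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one per predictor, summing to 1
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances describe what the model relied on, not causal effects, so they are best read as a starting point for further analysis.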

## Evaluating the model

If no errors or warnings pop up, then the first iteration of the model is trained. The next step is to understand the model, and then to interpret and evaluate its outputs. 

Here are some metrics for assessing the model, all of which are available in `sklearn.metrics`:

- **Accuracy**: This is the ratio of correct predictions to the total number of predictions. It's a good measure when the target variable classes in the data are nearly balanced. However, it can be misleading if the classes are imbalanced.

- **Precision**: Precision is the ratio of true positives (correctly predicted positive observations) to the total predicted positives. It's a measure of a classifier's exactness. A low precision indicates a high number of false positives.

- **Recall (Sensitivity)**: Recall is the ratio of true positives to the total actual positives. It's a measure of a classifier's completeness. A low recall indicates a high number of false negatives.

- **F1 Score**: The F1 Score is the harmonic mean of Precision and Recall, so it is high only when both are high. It's a good measure to use if you need a balance between Precision and Recall and the class distribution is uneven.

- **Confusion Matrix**: The confusion matrix is a table that describes the performance of a classification model on a set of data for which the true values are known. It cross-tabulates actual against predicted classes; in `sklearn`, rows are the actual classes and columns are the predicted classes. It's a good way to visualize the performance of the model.
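
To make the definitions above concrete, the sketch below derives each metric by hand from a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not results from this notebook's model.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class
cm = np.array([[270, 15],
               [5, 10]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
```

In this made-up case accuracy is above 0.93 while precision is only 0.4, which shows how accuracy can look strong on imbalanced classes even when most positive predictions are wrong.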

The choice of metric depends on your business objective. For example, if false positives are costly, the strategy might be to optimize for precision; this arguably applies to a top performers use case, where it is preferable for the model to flag fewer employees as top performers so that those it does flag are more likely to be correct. If missing positives (false negatives) is costly, the strategy might be to optimize for recall, which could be more relevant for an attrition use case.  
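
One common way to trade recall for precision is to change the probability threshold at which a case is labelled positive, using `predict_proba()` instead of `predict()`. The sketch below is a minimal illustration on synthetic data generated with `make_classification`; the data, the thresholds, and all variable names are assumptions for this example. Raising the threshold flags fewer cases, which can only lower recall and typically, though not always, raises precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), for illustration only
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = model.predict_proba(X_te)[:, 1]

results = []
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    results.append({
        "threshold": threshold,
        "flagged": int(pred.sum()),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
    })

for row in results:
    print(row)
```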

In [20]:
# Predict the labels for the test set
y_pred = rf.predict(x_test)

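# Calculate metrics
# average='weighted' averages the per-class scores, weighted by class size;
# the default average='binary' would score the positive class (1) only,
# which matches the definitions given above
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)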

# Print metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n {conf_matrix}")

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Confusion Matrix:
 [[285   0]
 [  0  15]]
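
Perfect scores deserve a sanity check. One likely cause here is that the predictors still include `performance`, from which `perform_cat` is derived. Separately, the confusion matrix shows 285 negatives and 15 positives in the test set, so even a model that never predicts a top performer would score high on accuracy. The sketch below reproduces those class counts with made-up labels and fits scikit-learn's `DummyClassifier` as a baseline; `y_true` and `X_dummy` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the test set above: 285 zeros, 15 ones
y_true = np.array([0] * 285 + [1] * 15)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_recall = recall_score(y_true, y_base)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline recall: {baseline_recall:.2f}")
```

The baseline reaches 0.95 accuracy with a recall of 0, which is why precision, recall, and F1 for the positive class are more informative than accuracy for this dataset.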
