<a href="https://colab.research.google.com/github/vectice/vectice-examples/blob/master/Samples/Amazon_access_challenge/Amazon_employee_access_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Employee Access Challenge


## Frame the Problem

- The problem concerns the time wasted granting and revoking employee access within a company. To use any resource, an employee first needs permission to access it. Granting and revoking access is a manual process handled by a supervisor. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money.

- <b>Objective:</b> We have to build a model, learned using historical data, that will determine an employee's access needs, such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. The model will take an employee's role information and a resource code and will return whether or not access should be granted.


- <b>Data:</b> The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time. You must create an algorithm capable of learning from this historical data to predict approval/denial for an unseen set of employees.

Test dataset (10 columns): The test set for which predictions should be made.  Each row asks whether an employee having the listed characteristics should have access to the listed resource.

Training dataset (10 columns): Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
Following are the features present in the training dataset:
- ACTION: Target variable. ACTION is 1 if the resource was approved, 0 if the resource was not approved.
- RESOURCE: An ID for each resource
- MGR_ID: The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time
- ROLE_ROLLUP_1: Company role grouping category id 1 (e.g. US Engineering)
- ROLE_ROLLUP_2: Company role grouping category id 2 (e.g. US Retail)
- ROLE_DEPTNAME: Company role department name (e.g. Retail)
- ROLE_TITLE: Company role business title description (e.g. Senior Engineering Retail Manager)
- ROLE_FAMILY_DESC: Company role family extended description (e.g. Retail Manager, Software Engineering)
- ROLE_FAMILY: Company role family description (e.g. Retail Manager)
- ROLE_CODE: Company role code; this code is unique to each role (e.g. Manager)

All features hold numerical values, but they are all categorical features.
 


## Install Vectice and GCS packages

Vectice provides a generic metadata layer that is potentially suitable for most data science workflows. For this notebook we will use the scikit-learn library for modeling and track experiments directly through our Python SDK to illustrate how to fine-tune exactly what you would like to track: metrics, etc. The same mechanisms would apply to R, Java or even more generic REST APIs to track metadata from any programming language and library.

Here is a link to the [Vectice Python library documentation](https://doc.vectice.com/).

In [None]:
## Requirements
!pip install -q fsspec
!pip install -q gcsfs
# Install the Vectice Python library.
# In this tutorial we will do code versioning using GitHub; we also support GitLab
# and Bitbucket: !pip install -q "vectice[github, gitlab, bitbucket]"
!pip install -q "vectice[github]==22.3.5.1"


In [None]:
!pip show vectice

## Import the required libraries

In [None]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## Get the data from GCS 

We are going to load data stored in Google Cloud Storage, that is provided by Vectice for this notebook.

In [None]:
# Download the "JSON file" from the "Vectice Tutorial Page" in the application so that 
# you can access the GCS bucket. The name of the JSON file should be "readerKey.json"

from google.colab import files
uploaded = files.upload()

In [None]:
# Once your file is loaded set the credentials for GCS and load the file
# in a pandas frame, double check the json file name you uploaded.

### Complete with the name of the JSON key file to access GCS. It can be found in the tutorial page.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'readerKey.json'

#  Get the dataset from GCS
data = pd.read_csv('gs://vectice-examples-samples/Amazon_challenge/dataset.csv')
# Run head to make sure the data was loaded properly

data.head()

In [None]:
data.shape

## Data Exploration

Data exploration gives us a first look at the data, improves our overall understanding of the characteristics of the data domain, and helps detect correlation between the features, thereby allowing for the creation of more accurate models.

In [None]:
data_explore = data.copy()

In [None]:
data_explore.info()

There is no column with null values.

In [None]:
data_explore.nunique()

- The dataset description mentions that an employee can have only one manager at a time, so we can consider that the dataset contains information on at most 4243 employees.
- ROLE_TITLE and ROLE_CODE have the same number of unique values. There is a 1-to-1 mapping between these columns, so only one of the two features is sufficient for our problem.
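
Equal unique counts alone do not prove a 1-to-1 mapping, so it can be worth checking the pairs directly. The following is a minimal sketch of such a check in plain Python; the helper name `is_one_to_one` and the sample pairs are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def is_one_to_one(pairs):
    """Return True if each left value maps to exactly one right value and vice versa."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    return (all(len(v) == 1 for v in left_to_right.values())
            and all(len(v) == 1 for v in right_to_left.values()))


# Hypothetical (ROLE_TITLE, ROLE_CODE) pairs, made up for illustration.
title_code_pairs = [(101, 9001), (102, 9002), (101, 9001), (103, 9003)]
print(is_one_to_one(title_code_pairs))

# Adding a second title that shares an existing code breaks the mapping.
broken_pairs = title_code_pairs + [(104, 9003)]
print(is_one_to_one(broken_pairs))
```

With the notebook's DataFrame, the pairs could be built with `zip(data_explore['ROLE_TITLE'], data_explore['ROLE_CODE'])`.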

In [None]:
sns.countplot(x='ACTION', data=data_explore)

In [None]:
data['ACTION'].value_counts()

- We can see that we have an imbalanced dataset. There are very few records in which access was denied, so some algorithms may learn to predict only the majority class (1).
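
To see why imbalance matters, consider a classifier that always predicts "approved". The sketch below uses a made-up 94/6 class split (an assumption for illustration, not the real counts from `value_counts()`) and shows that plain accuracy looks good even though no denial is ever detected.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall_for_class(y_true, y_pred, label):
    """Fraction of true `label` records that were predicted as `label`."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    hits = sum(1 for t, p in relevant if p == label)
    return hits / len(relevant)


# Hypothetical labels: 94 approvals (1) and 6 denials (0).
y_true = [1] * 94 + [0] * 6
# A trivial model that always predicts the majority class.
y_pred = [1] * len(y_true)

acc = accuracy(y_true, y_pred)
denial_recall = recall_for_class(y_true, y_pred, label=0)
print(acc, denial_recall)
```

This is one reason the notebook later reports MCC and ROC AUC rather than accuracy.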


In [None]:
## Number of access records listed under each manager ID
data['MGR_ID'].value_counts()

- Let's find the top 15 resources, role departments, role codes, and role families for which access is most often requested.

In [None]:
data_explore_resources = data_explore[['RESOURCE', "ACTION"]].groupby(by='RESOURCE').count()
data_explore_resources.sort_values('ACTION', ascending=False).head(n=15).transpose()

In [None]:
data_explore_role_dept = data_explore[['ROLE_DEPTNAME', "ACTION"]].groupby(by='ROLE_DEPTNAME').count()
data_explore_role_dept.sort_values('ACTION', ascending=False).head(n=15).transpose()

In [None]:
data_explore_role_codes = data_explore[['ROLE_CODE', "ACTION"]].groupby(by='ROLE_CODE').count()
data_explore_role_codes.sort_values('ACTION', ascending=False).head(n=15).transpose()

In [None]:
data_explore_role_family = data_explore[['ROLE_FAMILY', "ACTION"]].groupby(by='ROLE_FAMILY').count()
data_explore_role_family.sort_values('ACTION', ascending=False).head(n=15).transpose()

In [None]:
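## data.describe() keeps only the numerical columns, so we plot just those and skip any non-numerical ones
## Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement
for i in data.describe().columns:
  sns.distplot(data[i].dropna())
  plt.show()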

### Correlation

If features are highly correlated, we have a multicollinearity problem: some features depend on other features, so we should reduce the dimensionality of our data. If A depends on B, we should either aggregate or combine the two features into one variable, or drop one of the variables that is too highly correlated with another. This can also be addressed using principal component analysis (PCA).

In [None]:
## High correlation between features signals multicollinearity, which can be addressed using PCA
plt.figure(figsize=(20,10))
sns.heatmap(data.corr(),annot=True,cmap='viridis',linewidths=1)


In [None]:
corr_matrix = data_explore.corr()
corr_matrix['ACTION'].sort_values(ascending=False)

- No attribute is strongly correlated with the target variable.
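
Because all features here are categorical IDs stored as numbers, a Pearson correlation should be read with care: it depends on how the IDs happen to be numbered. The sketch below is a small illustration in plain Python with made-up values; `pearson` is a hypothetical helper, not the implementation behind `DataFrame.corr()`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up category IDs and a binary target.
ids = [1, 2, 3, 4]
target = [0, 0, 1, 1]

# Same categories, different arbitrary numbering.
relabel = {1: 3, 2: 1, 3: 4, 4: 2}
relabelled_ids = [relabel[i] for i in ids]

r_original = pearson(ids, target)
r_relabelled = pearson(relabelled_ids, target)
print(round(r_original, 3), round(r_relabelled, 3))
```

The relationship between category and target is identical in both cases, yet the coefficient changes, so a low value here does not by itself mean a feature is uninformative.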

## Vectice Configuration

In [None]:
from vectice import Experiment
from vectice.api.json import JobType
from vectice.api.json import ModelType

# Specify the API endpoint for Vectice.
# You can specify your API endpoint here in the notebook, but we recommend adding it to a .env file
os.environ['VECTICE_API_ENDPOINT']= "app.vectice.com"

# To use the Vectice Python library, you first need to authenticate your account using an API token.
# You can generate an API token from the Vectice UI, by going to the "API Tokens" section in the "My Profile" section
# which is located under your profile picture.
# You can specify your API token here in the notebook, but we recommend adding it to a .env file
os.environ['VECTICE_API_TOKEN'] = "Your API Token"

# Add your project id. The project id can be found in the project settings page in the Vectice UI.
# Replace the ID placeholder below with that value before running this cell.
project_id = ID

## Data Preprocessing

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
X = data.drop(columns=['ACTION']).copy()
y = data['ACTION'].copy()
X.shape, y.shape

In [None]:
cat_attrs = list(X.columns)
cat_attrs

In [None]:
# We create our first experiment for data preparation and specify the workspace and the project we will be working on
# Each experiment only contains one job. Each invocation of the job is called a run.
# auto_code=True enables you to track your git changes for your code automatically every time you execute a run (see below).
experiment = Experiment(job="jobSplitData", job_type = JobType.PREPARATION, project=project_id, auto_code = True)

We can check if the datasets are already created in our workspace by calling **experiment.vectice.list_datasets()** which lists all the datasets existing in the project

In [None]:
experiment.vectice.list_datasets()

Let's split the dataset into train and test sets and save them in GCS. (The GCS code has been commented out as the data has already been generated). For this section, we will re-use some datasets that have been already created to illustrate dataset versioning.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# We use auto-versioning here.
# The Vectice library automatically detects if there have been changes to the dataset you are using.
# If it detects changes, it will generate a new version of your dataset automatically.
# For this notebook, we changed the data to illustrate dataset auto-versioning.
# So, the Vectice Python library will create a new dataset version when this code is executed for the first time.
input_ds_version =  experiment.add_dataset_version("amazon_employee_access_challenge_dataset")

# Because we are using Colab in this tutorial, we declare a reference to the code
# manually. It will be added as a reference to the run we create next.
# If you are using a local environment with Git installed, JupyterLab, etc., code
# tracking is automated.
uri = "https://github.com/vectice/vectice-examples"
entrypoint="Samples/Amazon_access_challenge/Amazon_employee_access_challenge.ipynb"
input_code = experiment.add_code_version_uri(git_uri=uri, entrypoint=entrypoint)

# The created dataset version and code version will be automatically attached as inputs of the run
experiment.start(run_properties={"Property1": "Value 1", "property2": "Value 2"})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(X, y):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

X_train = strat_train_set.drop('ACTION', axis=1)
y_train = strat_train_set['ACTION'].copy()
X_test = strat_test_set.drop('ACTION', axis=1)
y_test = strat_test_set['ACTION'].copy()
X_train.shape, X_test.shape

train_set = X_train.join(y_train)
test_set = X_test.join(y_test)

# We commented out the code that persists the training and testing sets in GCS
# because we already generated them for you, but feel free to uncomment and execute it.
# The service account key (readerKey.json) from the tutorial page may not have write permissions to GCS.
# Let us know if you want to be able to write files as well and we can issue you a different key.

#train_set.to_csv (r'gs://vectice-examples-samples/Amazon_challenge/training_data.csv', index = False, header = True)
#test_set.to_csv (r'gs://vectice-examples-samples/Amazon_challenge/testing_data.csv', index = False, header = True)

# We create new dataset versions 
train_ds_version = experiment.add_dataset_version("Training_data_Amazon")
test_ds_version = experiment.add_dataset_version("Testing_data_Amazon")

# We complete the current experiment's run 
## The created dataset versions will be automatically attached as outputs of the run
experiment.complete()

# We can preview one of our generated outputs to make sure that everything was executed properly.
X_train
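
The idea behind the stratified split above is that each class keeps roughly the same proportion in the train and test sets. The following is a minimal plain-Python sketch of that idea under stated assumptions: `stratified_split` is a hypothetical helper, the labels are made up, and scikit-learn's `StratifiedShuffleSplit` differs in its details.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes about `test_size` of its rows to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up imbalanced labels: 90 approvals and 10 denials.
labels = [1] * 90 + [0] * 10
train_idx, test_idx = stratified_split(labels)

# Both sets keep the 90/10 class ratio.
print(len(train_idx), sum(labels[i] for i in train_idx))
print(len(test_idx), sum(labels[i] for i in test_idx))
```

Stratifying matters here because a purely random split of an imbalanced dataset could leave very few denied records in the test set.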

We create a pipeline with the OneHotEncoder for algorithms that don't support categorical data.

In [None]:
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                         ('cat_enc', OneHotEncoder(handle_unknown='ignore'))])

pre_process = ColumnTransformer([('cat_process', cat_pipeline, cat_attrs)], remainder='passthrough')

X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)
X_train_transformed.shape, X_test_transformed.shape
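
To make the encoding step concrete, here is a small plain-Python sketch of what one-hot encoding with `handle_unknown='ignore'` does conceptually. The helpers `fit_one_hot` and `transform_one_hot` and the sample rows are hypothetical; scikit-learn's `OneHotEncoder` returns a sparse matrix and is implemented differently.

```python
def fit_one_hot(rows):
    """Learn the sorted set of categories seen in each column."""
    n_cols = len(rows[0])
    return [sorted({row[c] for row in rows}) for c in range(n_cols)]


def transform_one_hot(rows, categories):
    """Encode each value as a 0/1 indicator per known category.

    A value that was not seen during fitting yields all zeros for its column,
    which mirrors the 'ignore' behaviour for unknown categories.
    """
    encoded = []
    for row in rows:
        vec = []
        for value, cats in zip(row, categories):
            vec.extend(1 if value == cat else 0 for cat in cats)
        encoded.append(vec)
    return encoded


# Made-up rows with two categorical ID columns.
train_rows = [(10, 7), (20, 7), (10, 8)]
categories = fit_one_hot(train_rows)
train_encoded = transform_one_hot(train_rows, categories)

# The ID 30 never appeared in training, so its indicator columns stay at zero.
test_encoded = transform_one_hot([(30, 8)], categories)
print(train_encoded)
print(test_encoded)
```

Each distinct ID becomes its own column, which is why the transformed matrices are much wider than the original nine features.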

- We will also be using the CatBoost classifier. CatBoost handles categorical features natively, so there is no need to encode them. Hence we create a separate preprocessing pipeline for the CatBoost model.

In [None]:
cat_boost_pre_process = ColumnTransformer([('imputer', SimpleImputer(strategy='most_frequent'), cat_attrs)], remainder='passthrough')

X_cb_train_transformed = cat_boost_pre_process.fit_transform(X_train)
X_cb_test_transformed = cat_boost_pre_process.transform(X_test)
X_cb_train_transformed.shape, X_cb_test_transformed.shape

In [None]:
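# get_feature_names_out requires scikit-learn >= 1.0; older releases expose get_feature_names instead
feature_columns = list(pre_process.transformers_[0][1]['cat_enc'].get_feature_names_out(cat_attrs))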
len(feature_columns)

## Modeling

In [None]:
# We create our second experiment for modeling and specify the workspace and the project we will be working on
# Each experiment only contains one job. Each invocation of the job is called a run.
# auto_code=True enables you to track your git changes for your code automatically every time you execute a run (see below).
experiment = Experiment(job="Modeling", project=project_id, job_type=JobType.TRAINING, auto_code=True)

In [None]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
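
As a rough illustration of what a shuffled 5-fold split produces, the plain-Python sketch below generates train/validation index lists in which every sample is used for validation exactly once. `k_fold_indices` is a hypothetical helper, and the way it assigns samples to folds is an assumption that does not reproduce scikit-learn's `KFold` exactly.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, validation_indices) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, n_splits=5))
for train, validation in splits:
    print(sorted(validation), len(train))
```

The model is then fitted once per pair and the scores are averaged, which is what `cross_val_score` does with the `kf` object.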

In [None]:
from sklearn.metrics import matthews_corrcoef, make_scorer, roc_auc_score, roc_curve
Matthew = make_scorer(matthews_corrcoef)

results = []

def plot_custom_roc_curve(clf_name, y_true, y_scores):
    auc_score = np.round(roc_auc_score(y_true, y_scores), 3)
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    plt.plot(fpr, tpr, linewidth=2, label=clf_name+" (AUC Score: {})".format(str(auc_score)))
    plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
    plt.axis([0, 1, 0, 1])
    plt.xlabel("FPR", fontsize=16)
    plt.ylabel("TPR", fontsize=16)
    plt.legend()
    
    
def performance_measures(model, X_tr=X_train_transformed, y_tr=y_train, X_ts=X_test_transformed, y_ts=y_test,
                         store_results=True):
    train_mcc = cross_val_score(model, X_tr, y_tr, scoring=Matthew, cv=kf, n_jobs=-1)
    test_mcc = cross_val_score(model, X_ts, y_ts, scoring=Matthew, cv=kf, n_jobs=-1)
    print("Mean Train MCC: {}\nMean Test MCC: {}".format(train_mcc.mean(), test_mcc.mean()))

    
    train_roc_auc = cross_val_score(model, X_tr, y_tr, scoring='roc_auc', cv=kf, n_jobs=-1)
    test_roc_auc = cross_val_score(model, X_ts, y_ts, scoring='roc_auc', cv=kf, n_jobs=-1)
    print("Mean Train ROC AUC Score: {}\nMean Test ROC AUC Score: {}".format(train_roc_auc.mean(), test_roc_auc.mean()))
    # Record the rounded scores in the module-level `results` list before returning
    if store_results:
        results.append([model.__class__.__name__, np.round(np.mean(train_roc_auc), 3), np.round(np.mean(test_roc_auc), 3), np.round(np.mean(train_mcc), 3), np.round(np.mean(test_mcc), 3)])

    return train_mcc.mean(), test_mcc.mean(), train_roc_auc.mean(), test_roc_auc.mean()
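
The Matthews correlation coefficient (MCC) used in `performance_measures` can be computed from the four confusion-matrix counts. The following is a minimal plain-Python sketch for the binary case; `mcc_from_labels` is a hypothetical helper and the labels are made up. Returning 0.0 when the denominator is zero is the convention assumed here.

```python
import math


def mcc_from_labels(y_true, y_pred):
    """Binary Matthews correlation coefficient computed from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator


# A model that always predicts "approved" on imbalanced data scores 0.
always_approve = mcc_from_labels([1] * 94 + [0] * 6, [1] * 100)

# One missed approval out of four records.
partial = mcc_from_labels([1, 1, 1, 0], [1, 1, 0, 0])
print(always_approve, round(partial, 3))
```

Unlike accuracy, MCC stays at 0 for a majority-class predictor, which makes it a useful metric for this imbalanced dataset.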
    

    

In [None]:
def plot_feature_importance(feature_columns, importance_values, top_n_features=10):
    feature_imp = [ col for col in zip(feature_columns, importance_values)]
    feature_imp.sort(key=lambda x:x[1], reverse=True)
    
    if top_n_features:
        imp = pd.DataFrame(feature_imp[0:top_n_features], columns=['feature', 'importance'])
    else:
        imp = pd.DataFrame(feature_imp, columns=['feature', 'importance'])
    plt.figure(figsize=(20, 10))
    sns.barplot(y='feature', x='importance', data=imp, orient='h')
    plt.title('Most Important Features', fontsize=16)
    plt.ylabel("Feature", fontsize=16)
    plt.xlabel("")
    plt.savefig('Feature_importance.png')
    plt.show()

We can list the models that exist in the project by calling **experiment.vectice.list_models()**

In [None]:
experiment.vectice.list_models()

### Logistic regression

In [None]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression

# we declare the dataset versions and code to use as inputs of our run
experiment.start(inputs=[input_code, train_ds_version, test_ds_version],
                run_properties={"Property1": "Value 1", "property2": "Value 2"})

solver='liblinear'
C=1
penalty='l2'
max_iter=1000
random_state=42
logistic_reg = LogisticRegression(solver=solver, C=C, penalty=penalty, max_iter=max_iter, random_state=random_state)
logistic_reg.fit(X_train_transformed, y_train)
train_mcc, test_mcc, train_roc_auc, test_roc_auc = performance_measures(logistic_reg)
plot_feature_importance(feature_columns, logistic_reg.coef_[0], top_n_features=15)

metrics = {"train_mcc":  round(train_mcc, 3),"test_mcc":  round(test_mcc, 3), "train_roc_auc":  round(train_roc_auc, 3), 'test_roc_auc':  round(test_roc_auc, 3)}
hyper_parameters = {"solver": solver, "C": C, "penalty": penalty, "max_iter": max_iter, "random_state": random_state}

# Let's log the model we trained along with its metrics, as a new version 
# of the "Classifier" model in Vectice.
model_version = experiment.add_model_version(model="Classifier", algorithm="Logistic Regression", hyper_parameters=hyper_parameters, metrics=metrics, attachment="Feature_importance.png")

# We complete the current experiment's run 
## The created model version will be automatically attached as output of the run
experiment.complete()

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# we declare the dataset versions and code to use as inputs of our run
experiment.start(inputs=[input_code, train_ds_version, test_ds_version],
                run_properties={"Property1": "Value 1", "property2": "Value 2"})

n_estimators=300
max_depth=16
random_state=42

forest_clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=random_state)
forest_clf.fit(X_train_transformed, y_train)

train_mcc, test_mcc, train_roc_auc, test_roc_auc = performance_measures(forest_clf)

metrics = {"train_roc_auc": round(train_roc_auc, 3), 'test_roc_auc': round(test_roc_auc, 3)}
hyper_parameters = {"n_estimators": n_estimators, "max_depth": max_depth, "random_state": random_state}
plot_feature_importance(feature_columns, forest_clf.feature_importances_, top_n_features=15)

# Let's log the model we trained along with its metrics, as a new version 
# of the "Classifier" model in Vectice.
experiment.add_model_version(model="Classifier", algorithm="Random Forest",
                                              hyper_parameters=hyper_parameters, metrics=metrics, attachment="Feature_importance.png")

# We complete the current experiment's run 
## The created model version will be automatically attached as output of the run
experiment.complete()

### XGBoost

In [None]:
from xgboost import XGBClassifier

# we declare the dataset versions and code to use as inputs of our run
experiment.start(inputs=[input_code, train_ds_version, test_ds_version],
                run_properties={"Property1": "Value 1", "property2": "Value 2"})

n_estimators=300
max_depth=16
learning_rate=0.1
random_state=42

xgb_clf = XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate, random_state=random_state)
xgb_clf.fit(X_train_transformed, y_train)

train_mcc, test_mcc, train_roc_auc, test_roc_auc = performance_measures(xgb_clf)
hyper_parameters = {"n_estimators": n_estimators, "max_depth": max_depth, "learning_rate": learning_rate,"random_state": random_state}
metrics = {"train_mcc":  round(train_mcc, 3),"test_mcc":  round(test_mcc, 3), "train_roc_auc":  round(train_roc_auc, 3), 'test_roc_auc':  round(test_roc_auc, 3)}

plot_feature_importance(feature_columns, xgb_clf.feature_importances_, top_n_features=15)

# Let's log the model we trained along with its metrics, as a new version 
# of the "Classifier" model in Vectice.
experiment.add_model_version(model="Classifier", algorithm="XGBoost", hyper_parameters=hyper_parameters, metrics=metrics, attachment="Feature_importance.png")

# We complete the current experiment's run 
## The created model version will be automatically attached as output of the run
experiment.complete()

### CatBoost

In [None]:
!pip install -q catboost

In [None]:
from catboost import CatBoostClassifier

# we declare the dataset versions and code to use as inputs of our run
experiment.start(inputs=[input_code, train_ds_version, test_ds_version],
                run_properties={"Property1": "Value 1", "property2": "Value 2"})

loss_function='Logloss'
iterations=500
depth=6
eval_metric='AUC'
l2_leaf_reg=1
random_state=42
verbose=0
cat_features=list(range(X_cb_train_transformed.shape[1]))
catboost_clf = CatBoostClassifier(loss_function=loss_function, iterations=iterations, depth=depth, l2_leaf_reg=l2_leaf_reg, 
                                  cat_features=cat_features, 
                                  eval_metric=eval_metric, random_state=random_state, verbose=verbose)
catboost_clf.fit(X_cb_train_transformed, y_train)

train_mcc, test_mcc, train_roc_auc, test_roc_auc = performance_measures(catboost_clf, X_tr=X_cb_train_transformed, X_ts=X_cb_test_transformed)

metrics = {"train_mcc":  round(train_mcc, 3),"test_mcc":  round(test_mcc, 3), "train_roc_auc":  round(train_roc_auc, 3), 'test_roc_auc':  round(test_roc_auc, 3)}
hyper_parameters = {"loss_function": loss_function, "iterations": iterations, "categorical features": cat_features,
              "verbose": verbose, "depth": depth, "eval_metric": eval_metric, "l2_leaf_reg": l2_leaf_reg, "random_state": random_state}
# CatBoost was trained on the un-encoded columns, so its importances line up with cat_attrs rather than the one-hot feature_columns
plot_feature_importance(cat_attrs, catboost_clf.feature_importances_, top_n_features=15)

# Let's log the model we trained along with its metrics, as a new version 
# of the "Classifier" model in Vectice.
experiment.add_model_version(model="Classifier", algorithm="CatBoost", hyper_parameters=hyper_parameters, metrics=metrics, attachment="Feature_importance.png")
# We complete the current experiment's run 
## The created model version will be automatically attached as output of the run
experiment.complete()

We can update a model's type or description by using experiment.update_model()

In [None]:
experiment.update_model(model="Classifier", type=ModelType.CLASSIFICATION, description="Model description")

## Model Evaluation

In [None]:
plt.figure(figsize=(8, 5))
# We can save those plots and add them to the model version by using 
## experiment.add_model_version_attachment(file="File name", model_version= "The model version name or id", model="The model name or id")
plot_custom_roc_curve('Logistic Regression', y_test, logistic_reg.decision_function(X_test_transformed))
plot_custom_roc_curve('Random Forest', y_test, forest_clf.predict_proba(X_test_transformed)[:,1])
plot_custom_roc_curve('XGBoost', y_test, xgb_clf.predict_proba(X_test_transformed)[:,1])
plot_custom_roc_curve('CatBoost', y_test, catboost_clf.predict_proba(X_cb_test_transformed)[:,1])
plt.show()
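
The AUC score shown in each curve's legend can be read as the probability that a randomly chosen approved record receives a higher score than a randomly chosen denied one. The plain-Python sketch below computes it that way on made-up scores; `roc_auc_pairwise` is a hypothetical helper that compares every positive/negative pair, so it is only practical for small inputs and is not how `roc_auc_score` is implemented.

```python
def roc_auc_pairwise(y_true, y_scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_scores) if t == 1]
    negatives = [s for t, s in zip(y_true, y_scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because only the ordering of the scores matters, the raw `decision_function` output of logistic regression and the `predict_proba` output of the tree models can be compared on the same plot.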