# Scikit-Learn binary classification model training in a local notebook
## Plus Azure ML Dataset converted to Pandas DataFrames.
_**The code is plain vanilla Scikit-Learn training/creation of a Binary classification model.**_

_**Azure ML is only used to gather original data from an AML Dataset.**_

_**This notebook can run on a local PC or on any Azure ML Compute Instance or Azure ML VM.**_

This code is the baseline for the next labs/notebooks in the Workshop moving to AML remote compute, etc.

### Import libraries to use in notebook

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from azureml.core import Workspace, Dataset
from azureml.core import Environment

# Check versions
import azureml.core
import sklearn
import joblib
import pandas

print("Azure SDK version:", azureml.core.VERSION)
print('scikit-learn version is {}.'.format(sklearn.__version__))
print('joblib version is {}.'.format(joblib.__version__))
print('pandas version is {}.'.format(pandas.__version__))

### Create Workspace to load Tabular Datasets and log info from local training into AML

In [None]:
# Connect to an existing Azure ML Workspace in order to use Azure ML Datasets and log runs into AML

ws = Workspace.from_config()

### Load the IBM employee attrition data created before

Note: because you are now accessing the workspace, the notebook must authenticate through device authentication, so you will be prompted with a device login like this:

    Performing interactive authentication. Please follow the instructions on the terminal.
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ARQPFR4B4 to authenticate.
    
Please follow these instructions in a new browser tab.

In [None]:
# get the IBM employee attrition dataset from the workspace
attritionData = ws.datasets['IBM-Employee-Attrition'].to_pandas_dataframe()
attritionData.head()

## Clean up the initial dataset

In [None]:
# Dropping EmployeeCount and Over18: each holds a single constant value, so attrition is independent of these features
attritionData = attritionData.drop(['EmployeeCount'], axis=1)
attritionData = attritionData.drop(['Over18'], axis=1)

# Dropping Employee Number since it is merely an identifier
attritionData = attritionData.drop(['EmployeeNumber'], axis=1)

# Since all values are 80 let's drop also StandardHours
attritionData = attritionData.drop(['StandardHours'], axis=1)

target = attritionData["Attrition"]

attritionXData = attritionData.drop(['Attrition'], axis=1)
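
Constant columns like the ones dropped above can also be detected programmatically. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative only, not taken from the attrition dataset):

```python
import pandas as pd

# Toy frame standing in for the attrition data (values are made up).
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 33],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

# A column with a single distinct value cannot help a classifier separate classes.
constant_columns = [col for col in toy.columns
                    if toy[col].nunique(dropna=False) == 1]
print("Constant columns:", constant_columns)

# Drop them in one call instead of one drop per column.
toy_reduced = toy.drop(columns=constant_columns)
print("Remaining columns:", list(toy_reduced.columns))
```

Identifier columns such as EmployeeNumber are not caught by this check, because every value is distinct; they still have to be dropped by name.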

## Split in Train and Test datasets (DataFrames)

In [None]:
# Split data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(attritionXData, 
                                                    target, 
                                                    test_size = 0.2,
                                                    random_state=0,
                                                    stratify=target)
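
To see what `stratify` does, here is a small sketch on a synthetic, imbalanced target (the 80/20 class mix is an assumption chosen for illustration). The positive rate should be the same in the train and test partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 negatives and 20 positives (made-up data).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# With stratification, each partition keeps the overall class proportions.
print("positive rate overall:", y_toy.mean())
print("positive rate in train:", y_tr.mean())
print("positive rate in test:", y_te.mean())
```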

## Transform data

In [None]:
# Collect the categorical and numerical column names in separate lists
categorical = []
for col, value in attritionXData.items():  # iteritems() was removed in pandas 2.0
    if value.dtype == 'object':
        categorical.append(col)
        
numerical = attritionXData.columns.difference(categorical)
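
An alternative way to split the column names, sketched here on a made-up toy frame, is `DataFrame.select_dtypes`. Treating every non-numeric column as categorical is an assumption that suits this kind of data but may not hold for date columns, for example:

```python
import pandas as pd

# Toy frame with two numeric and two text columns (values are made up).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993, 5130, 2090],
    "Department": ["Sales", "R&D", "R&D"],
    "OverTime": ["Yes", "No", "Yes"],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = [col for col in toy.columns if col not in numerical_cols]

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```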

### Transform raw features
The raw features can be transformed either with a sklearn.compose.ColumnTransformer or with a list of fitted transformer tuples. The cell below uses sklearn.compose.ColumnTransformer, which applies one preprocessing pipeline to the numerical columns and another to the categorical columns.

## Create data processing pipelines (Scikit-Learn pipelines)
NOTE: This code uses Scikit-Learn pipelines, which are a different concept from AML Pipelines and unrelated to them.

In [None]:
from sklearn.compose import ColumnTransformer

# We create the transformations pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transforms_pipeline = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical),
        ('cat', categorical_transformer, categorical)])
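
To get a feel for what this kind of ColumnTransformer produces, the sketch below applies the same transformer layout to a tiny made-up frame (the `toy` data and the `to_dense` helper are illustrative assumptions). It shows median imputation plus scaling on the numeric column, one-hot encoding on the categorical one, and how `handle_unknown='ignore'` treats a category that was not seen during fitting:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric column with a missing value, one categorical column.
toy = pd.DataFrame({
    "Age": [30.0, 40.0, np.nan, 50.0],
    "Department": pd.Series(["Sales", "R&D", "R&D", "HR"], dtype=object),
})

toy_transformer = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), ["Age"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Department"]),
])

def to_dense(matrix):
    # The transformer may return a sparse matrix; convert for easy inspection.
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

transformed = to_dense(toy_transformer.fit_transform(toy))
print(transformed.shape)  # one scaled column plus one column per category

# A category unseen during fit is encoded as all zeros instead of raising.
unseen = pd.DataFrame({
    "Age": [35.0],
    "Department": pd.Series(["Legal"], dtype=object),
})
unseen_transformed = to_dense(toy_transformer.transform(unseen))
print(unseen_transformed)
```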

## Add classifier algorithm (SVC: Support Vector Classifier) to the pipeline

In [None]:
# Append classifier to Scikit-Learn transformations pipeline.
# Now we have a full Scikit-Learn prediction pipeline.
model_pipeline = Pipeline(steps=[('preprocessor', transforms_pipeline),
                      ('classifier', SVC(kernel='linear', C = 1.0, probability=True))]) 

## Create AML Experiment, run and log just for logging info while training locally in Notebook

Let's start an interactive logging session using *start_logging()*. It also creates an interactive run in the specified experiment. Any metrics logged during the session are added to the run record in the experiment. The method *run.complete()* ends the session and marks the run as completed.
You can find other logging examples [here](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/logging-api/logging-api.ipynb).

In [None]:
from azureml.core import Experiment

# Get an experiment object from AML
experiment = Experiment(workspace=ws, name="aml-wrkshp-local-train-notebook-log")

# Create an interactive run object in the experiment
run =  experiment.start_logging()

# Log the algorithm parameter C to the run
run.log('C', 1.0)

## Train the SVM (Support Vector Machine) Classifier Model

In [None]:
model = model_pipeline.fit(x_train, y_train)

## Make Predictions and calculate Accuracy metric

In [None]:
x_test.describe()

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score


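# Predict class labels for the whole test set
y_predictions = model.predict(x_test)

# ROC-AUC and average precision rank samples by score, so feed them the
# positive-class probability rather than the hard 0/1 labels
y_scores = model.predict_proba(x_test)[:, 1]

accuracy = accuracy_score(y_test, y_predictions)
rocauc = roc_auc_score(y_test, y_scores)
average_precision = average_precision_score(y_test, y_scores)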

model_details_df = pd.DataFrame([accuracy, rocauc, average_precision],
                                columns = ['SVM'],
                                index=['Accuracy','ROC-AUC','Avg Precision'])

model_details_df
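
ROC-AUC is a ranking metric, so its value depends on whether it receives hard class labels or continuous scores. A minimal sketch with made-up labels and scores (none of these numbers come from the attrition model):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and positive-class scores.
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.6, 0.55, 0.8, 0.45, 0.2])

# Hard labels obtained by thresholding the scores at 0.5.
hard_labels = (scores > 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print("ROC-AUC from scores:", auc_from_scores)
print("ROC-AUC from hard labels:", auc_from_labels)
```

Thresholding collapses the ranking to a single operating point, which is why the two values differ.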


In [None]:
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

y_pred_proba = model.predict_proba(x_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

plt.plot(fpr,tpr,label="data 1, auc="+str(rocauc))
plt.legend(loc=4)
run.log_image("ROC Curve", plot=plt)
#plt.show()

## Log metric and model into the AML run definition
### Note that training is local; we only use the run definition to log information about the run/training

In [None]:
# Current folder, where the pkl file will be created
import os
os.getcwd()

In [None]:
# Log the accuracy, ROC-AUC and average precision metrics to the run
 
run.log('Accuracy', accuracy)
run.log('ROC-AUC', rocauc)
run.log('Avg Precision', average_precision)


# Save the model to the outputs directory so it is captured with the run
import os
os.makedirs('outputs', exist_ok=True)
model_file_name = 'outputs/model.pkl'

# Persist the model into the pkl file (overwrites any existing local file)
joblib.dump(value=model, filename=model_file_name)
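
A quick way to check a persisted model is to load it back and compare predictions. The sketch below does this round trip with a small stand-in SVC and a temporary directory (both are assumptions for illustration; the notebook's own model and the outputs folder are not touched):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

# Small stand-in model trained on made-up, linearly separable data.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original.
same_predictions = np.array_equal(restored.predict(X_toy),
                                  toy_model.predict(X_toy))
print("Round trip preserved predictions:", same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, which is one reason the version printout at the top of the notebook is useful.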

In [None]:
# Any file in the ./outputs folder is automatically uploaded to the run artifacts
run.complete()

In [None]:
# # In case of duplicate artifact errors
# run.cancel() # Even if an exception is thrown, the run will be canceled anyway

### Check the experiment run and its logged info in Azure ML Workspace
Now, you should go to your AML Workspace and check the information logged for this run, such as the accuracy, hyper-parameters and any other info you logged for the experiment run.

## Confusion Matrix

You can easily get the values of the confusion matrix for the model in the following way:

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Use the label order learned by the model so the axis labels match the matrix rows
class_names = model.classes_

# Plot the raw-count and the row-normalized confusion matrices
# (plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator replaces it)
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(model, x_test, y_test,
                                                 display_labels=class_names,
                                                 cmap=plt.cm.Blues,
                                                 normalize=normalize)
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()
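
If only the numbers are needed, without a plot, `sklearn.metrics.confusion_matrix` returns them directly. A minimal sketch on made-up labels (not the attrition predictions), showing both the raw counts and the row-normalized version:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
counts = confusion_matrix(y_true, y_pred)
print(counts)

# normalize='true' divides each row by the number of samples in that true class.
row_normalized = confusion_matrix(y_true, y_pred, normalize="true")
print(row_normalized)
```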

In [None]:
# Index of the instance you want to use as input for a prediction
instance_num = 6 # The seventh instance from the beginning (0-based)

# Get the predictions and probabilities for the test set, then inspect the instance defined above
prediction_values = model.predict(x_test)
prediction_probs = model.predict_proba(x_test)

print("Classes:")
print(model.classes_)

print("Prediction label for instance", instance_num, ":")
print(prediction_values[instance_num])

print("True label for instance", instance_num, ":")
print(y_test.values[instance_num])

print("Prediction probabilities for instance", instance_num, ":")
print(prediction_probs[instance_num])

x_test.iloc[instance_num]

In [None]:
prediction_probs[0:7]

In [None]:
prediction_values[0:7]

I expected the default decision threshold to be 0.5. If that were the case, the 7th predicted label would be 0.

I found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) that *predict_proba* may be inconsistent with *predict*, as the probabilities are calibrated using Platt scaling (more details [here](https://scikit-learn.org/stable/modules/svm.html#scores-probabilities)).

Out of curiosity, I'm looking for the possible thresholds that would make the predicted values consistent with the probabilities.

In [None]:
# Assumes the labels are 0/1, so a boolean mask compares directly with the predictions
for tr in np.arange(0.0, 1.0, 0.01):
    if np.all((prediction_probs[:, 1] > tr) == prediction_values):
        print('Possible threshold to use:', tr)
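
As a cross-check on the explanation above: for a binary SVC, *predict* follows the sign of *decision_function*, not a 0.5 cut on *predict_proba*. The sketch below illustrates this on synthetic data from `make_classification` (the data and the `svc` name are assumptions for illustration, not the attrition model). Whether any rows actually disagree under the 0.5 rule depends on the data, so the count is only printed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem (made-up data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
svc = SVC(kernel="linear", C=1.0, probability=True, random_state=0).fit(X_toy, y_toy)

labels = svc.predict(X_toy)

# Labels rebuilt from the sign of the decision function.
from_decision = svc.classes_[(svc.decision_function(X_toy) > 0).astype(int)]

# Labels rebuilt from the Platt-scaled probabilities with a 0.5 threshold.
from_proba = svc.classes_[(svc.predict_proba(X_toy)[:, 1] > 0.5).astype(int)]

print("predict matches decision_function sign:",
      np.array_equal(labels, from_decision))
print("rows where the 0.5 probability rule disagrees with predict:",
      int((labels != from_proba).sum()))
```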