# Scikit-Learn binary classification model training on local notebook

## Plus Azure ML Dataset converted to Pandas DataFrames

_**The code is plain vanilla Scikit-Learn training of a binary classification model.**_

_**Azure ML is only used to gather original data from an AML Dataset.**_

_**This notebook can run on a local PC or on any Azure ML Compute Instance or Azure ML VM.**_

This code is the baseline for the next labs/notebooks in the Workshop moving to AML remote compute, etc.

### Import libraries to use in notebook

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from azureml.core import Workspace, Dataset
from azureml.core import Environment

# Check versions
import azureml.core
import sklearn
import joblib
import pandas

print("Azure SDK version:", azureml.core.VERSION)
print('scikit-learn version is {}.'.format(sklearn.__version__))
print('joblib version is {}.'.format(joblib.__version__))
print('pandas version is {}.'.format(pandas.__version__))

### Connect to the Workspace to load Tabular Datasets and log info from local training into AML

In [None]:
# Connect to an existing Azure ML Workspace in order to use Azure ML Datasets and log Runs into AML

ws = Workspace.from_config()

### Load the IBM employee attrition data created before

**Note:** Because the notebook now accesses the workspace, it must authenticate through device authentication. You will be prompted with a device login like this:

    Performing interactive authentication. Please follow the instructions on the terminal.
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ARQPFR4B4 to authenticate.
    
Please follow these instructions in a new browser tab.

In [None]:
# get the IBM employee attrition dataset from the workspace

attritionData = ws.datasets['IBM-Employee-Attrition'].to_pandas_dataframe()
attritionData.head()

## Clean up the initial dataset

In [None]:
# Dropping Employee count as all values are 1 and hence attrition is independent of this feature
attritionData = attritionData.drop(['EmployeeCount'], axis=1)

# Dropping Employee Number since it is merely an identifier
attritionData = attritionData.drop(['EmployeeNumber'], axis=1)

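# Dropping Over18 since all values are 'Y'
attritionData = attritionData.drop(['Over18'], axis=1)

# Dropping Standard Hours since all values are 80
attritionData = attritionData.drop(['StandardHours'], axis=1)

# Keep the label column separately as the prediction target
target = attritionData["Attrition"]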

attritionXData = attritionData.drop(['Attrition'], axis=1)
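
The constant columns dropped above can also be found programmatically instead of by inspection. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are assumptions for illustration, not the real dataset); it flags every column with a single distinct value.

```python
import pandas as pd

# Toy frame standing in for the attrition data (values are made up).
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

# A column with one distinct value carries no information for a classifier.
constant_cols = [col for col in toy.columns
                 if toy[col].nunique(dropna=False) == 1]
print(constant_cols)

# Drop all constant columns in one call.
toy_clean = toy.drop(columns=constant_cols)
print(list(toy_clean.columns))
```

Identifier columns such as `EmployeeNumber` are not constant, so they still have to be dropped by name.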

## Split in Train and Test datasets (DataFrames)

In [None]:
# Split data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(attritionXData, 
                                                    target, 
                                                    test_size = 0.2,
                                                    random_state=0,
                                                    stratify=target)
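
The `stratify=target` argument keeps the class proportions the same in the train and test splits. The sketch below illustrates this on a hypothetical imbalanced target (80 "No" and 20 "Yes" labels, made up for the example).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 "No" and 20 "Yes" labels.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["No"] * 80 + ["Yes"] * 20, name="Attrition")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y)

# Both splits keep the 80/20 class ratio of the full data.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without `stratify`, a random split of a small, imbalanced dataset can leave the test set with noticeably more or fewer positive cases than the training set.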

## Transform data

In [None]:
# Collect the categorical and numerical column names in separate lists
categorical = []
for col, value in attritionXData.items():  # DataFrame.iteritems() was removed in pandas 2.0
    if value.dtype == 'object':
        categorical.append(col)
        
numerical = attritionXData.columns.difference(categorical)
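
As an alternative to the explicit loop, `DataFrame.select_dtypes` can split the columns by type. This is a sketch on a made-up toy frame (the names `toy`, `numerical_cols` and `categorical_cols` are assumptions for illustration); it treats every non-numeric column as categorical.

```python
import pandas as pd

# Toy frame with two numeric and two text columns (values are made up).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993, 5130, 2090],
    "Department": ["Sales", "R&D", "R&D"],
    "OverTime": ["Yes", "No", "Yes"],
})

# Numeric columns, in their original order.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Note that `select_dtypes` preserves the original column order, whereas `columns.difference` returns the names sorted.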

### Transform raw features
Raw features can be transformed with either a sklearn.compose.ColumnTransformer or a list of fitted transformer tuples. This notebook uses sklearn.compose.ColumnTransformer.

## Create data processing pipelines (Scikit-Learn pipelines)
**Note:** This code uses Scikit-Learn pipelines, which are a different concept from AML Pipelines and are not related to them.

In [None]:
from sklearn.compose import ColumnTransformer

# We create the transformations pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transforms_pipeline = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical),
        ('cat', categorical_transformer, categorical)])
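
To see what these transformers produce, the sketch below applies the same structure to a tiny made-up frame with one numeric and one categorical column. The data and the `toy_transformer` name are assumptions for illustration; `sparse_threshold=0` is set only so the result is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing numeric value (values are made up).
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0, 40.0],
    "Department": ["Sales", "R&D", "Sales", "HR"],
})

toy_transformer = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())]), ["Age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Department"]),
    ],
    sparse_threshold=0)  # always return a dense array

# One scaled numeric column plus one one-hot column per department.
transformed = toy_transformer.fit_transform(toy)
print(transformed.shape)

# A category not seen during fit is encoded as all zeros instead of raising.
unseen = pd.DataFrame({"Age": [35.0], "Department": ["Legal"]})
unseen_row = toy_transformer.transform(unseen)
print(unseen_row)
```

The `handle_unknown='ignore'` setting matters at prediction time, when the test set or new data may contain a category that was absent from the training split.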

## Add classifier algorithm (SVC: Support Vector Classifier) to the pipeline

In [None]:
# Append classifier to Scikit-Learn transformations pipeline.
# Now we have a full Scikit-Learn prediction pipeline.
model_pipeline = Pipeline(steps=[('preprocessor', transforms_pipeline),
                      ('classifier', SVC(kernel='linear', C = 1.0, probability=True))]) 

## Create an AML Experiment and run, used only to log info while training locally in the notebook

In [None]:
from azureml.core import Experiment

# Get an experiment object from AML
experiment = Experiment(workspace=ws, name="local-train-notebook-aml-log")

# Create a run object in the experiment
run =  experiment.start_logging()

# Log the algorithm parameter C to the run
run.log('C', 1.0)

## Train the SVM (Support Vector Machine) Classifier Model

In [None]:
model = model_pipeline.fit(x_train, y_train)

## Make Predictions and calculate Accuracy metric

In [None]:
x_test.describe()

In [None]:
from sklearn.metrics import accuracy_score

# Make Multiple Predictions
y_predictions = model.predict(x_test)

accuracy = accuracy_score(y_test, y_predictions)
print('Accuracy: ', accuracy)
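
Accuracy alone can be misleading when one class dominates. As a rough sketch, assuming an imbalanced target (the 84/16 split below is made up for illustration), a baseline that always predicts the majority class already scores high accuracy, while balanced accuracy exposes that it never finds the minority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 84 "No" and 16 "Yes".
y_true = np.array(["No"] * 84 + ["Yes"] * 16)
X_dummy = np.zeros((100, 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_balanced = balanced_accuracy_score(y_true, y_base)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline balanced accuracy:", baseline_balanced)
```

Comparing the trained model's accuracy against such a baseline gives a better sense of how much the model has actually learned.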

## Log metric and model into the AML run definition
### Note that training is local; the run is only used to log information about the training

In [None]:
# Log the accuracy metric to the run
run.log('accuracy', accuracy)

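# Save the model to the outputs directory for capture
import os
os.makedirs('outputs', exist_ok=True)  # joblib.dump fails if the directory does not exist
model_file_name = 'outputs/model.pkl'

joblib.dump(value = model, filename = model_file_name)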

# upload the model file explicitly into artifacts 
run.upload_file(name = model_file_name, path_or_stream = model_file_name)

# Complete the run
run.complete()
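
A saved pipeline can be loaded back with `joblib.load` and used for predictions without retraining. The sketch below shows the round trip on a small made-up model written to a temporary directory; the data and the `toy_model` name are assumptions for illustration, and no Azure ML calls are involved.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny separable dataset (values are made up).
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="linear", C=1.0))]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the target directory first; joblib.dump does not create it.
    out_dir = os.path.join(tmp_dir, "outputs")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "model.pkl")

    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored pipeline predicts exactly like the original one.
same_predictions = bool((restored.predict(X) == toy_model.predict(X)).all())
print(same_predictions)
```

Because the whole pipeline is serialized, the preprocessing steps travel with the classifier. The file should be loaded with a scikit-learn version compatible with the one used to save it.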

### Check the experiment run and its logged info in Azure ML Workspace
Now, you should go to your AML Workspace and check the information logged for this run, such as the accuracy, hyper-parameters and any other info you logged for the experiment run.

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_predictions)

print(cm)

# Plot the confusion matrix
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
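
The four cells of a binary confusion matrix can be unpacked to derive precision and recall. This is a minimal sketch on made-up labels; passing `labels=["No", "Yes"]` fixes the row and column order so that "Yes" is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for ten employees.
y_true = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_pred = ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No"]

# Rows are true labels and columns are predicted labels, in the given order.
cm_toy = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = cm_toy.ravel()

precision = tp / (tp + fp)  # share of predicted "Yes" that are correct
recall = tp / (tp + fn)     # share of actual "Yes" that were found

print(cm_toy)
print("precision:", precision, "recall:", recall)
```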

In [None]:
# One Prediction
instance_num = 6
# Get the prediction for one member of the test set (the row at position instance_num)
prediction_value = model.predict(x_test)[instance_num]

print("One Prediction:")
print(prediction_value)

print(y_predictions[:20])

x_test.head(20)
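
Instead of predicting the whole test set and indexing the result, a single row can be passed to the model directly. The sketch below uses a made-up one-feature dataset and a plain `SVC` (both assumptions for illustration); the double brackets in `iloc[[...]]` keep the selection as a one-row DataFrame, which is the 2D shape `predict` expects.

```python
import pandas as pd
from sklearn.svm import SVC

# Tiny separable dataset (values are made up).
toy_X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]})
toy_y = pd.Series(["No", "No", "No", "Yes", "Yes", "Yes"])

clf = SVC(kernel="linear", C=1.0).fit(toy_X, toy_y)

instance_num = 4
# Double brackets return a DataFrame with one row, not a Series.
one_row = toy_X.iloc[[instance_num]]
single_prediction = clf.predict(one_row)[0]

print(one_row.shape)
print(single_prediction)
```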

In [None]:
y_test.head(5)