# Using ML tools to load data and train a model on it

This notebook performs the following tasks:
    - Load the dataset from the dprep file
    - Split the data, stratified by class, into a training set and a testing set
    - Train a decision tree model on the training set and evaluate it on the testing set
    - Serialize the model using pickle
    - Upload the model to Azure Storage
    - Download the model from Azure Storage and test it
    

# Imports and constant values

This section contains the necessary imports and constants used throughout the notebook.

In [1]:
# Use the Azure Machine Learning data collector to log various metrics
from azureml.logging import get_azureml_logger
logger = get_azureml_logger()

# Use the Azure Machine Learning data preparation package
from azureml.dataprep import package

# Data manipulation/random generation from pandas and numpy
import os
import pandas as pd
import numpy as np

# Use scikit-learn; this import fails if the package has not been added as a dependency
from sklearn.tree import DecisionTreeClassifier

# Serialize models
import pickle

# Azure Storage for uploading the output model
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PublicAccess
from azure.storage.blob import ContentSettings

# Azure storage and file name information
AZURE_STORAGE_ACCOUNT_NAME = "<STORAGE_ACCOUNT_NAME>"
AZURE_STORAGE_ACCOUNT_KEY = "<STORAGE_ACCOUNT_KEY>"
AZURE_STORAGE_CONTAINER_NAME = "readydemo"
AZURE_STORAGE_BLOB_NAME = "factory.pkl"

LOCAL_SYSTEM_DIRECTORY = "modelfile"
LOCAL_MODEL_FILE  = "./{}/{}".format(LOCAL_SYSTEM_DIRECTORY,AZURE_STORAGE_BLOB_NAME)


# Load data from the dataflow in workbench

This code, along with some of the imports above, was generated by right-clicking the dprep file and choosing 
__Generate Data Access Code File__ in the Azure ML Workbench tool.

This cell loads in the data we will use for training and testing the model.

In [2]:
# This call will load the referenced package and return a DataFrame.
# If run in a PySpark environment, this call returns a
# Spark DataFrame. If not, it will return a Pandas DataFrame.
projectDataFrame = package.run('dataset.dprep', dataflow_idx=0)

# Preview the first five rows of the DataFrame
projectDataFrame.head(5)

[Row(temp=45.9842594460449, volt=150.513223075022, rotate=277.294013981084, state=0.0, time=1.0, id=1.0),
 Row(temp=52.9039296937009, volt=110.434075269674, rotate=314.586726208661, state=0.0, time=2.0, id=1.0),
 Row(temp=53.8255072204536, volt=169.259518750327, rotate=315.602502059127, state=0.0, time=3.0, id=1.0),
 Row(temp=47.7826759112912, volt=110.829531516979, rotate=345.894651732999, state=0.0, time=4.0, id=1.0),
 Row(temp=43.4792263968699, volt=199.351674834705, rotate=325.36408065582, state=1.0, time=5.0, id=1.0)]

# Create datasets for training and testing 

We first count how many successes and how many failures the data contains. The counts show that
the data is heavily weighted towards successful devices. 

To ensure that we have reasonable data for both testing and training we will:
    - Split the data into two buckets: successful devices and failed devices
    - Randomly choose approximately 70% of the data from each bucket for training and the remaining 30% for testing.
    
This type of split stratifies the data so that both sets contain roughly the same proportion of each class.

In [3]:
%%time 

# On Ubuntu this runs in a PySpark environment, so convert the Spark DataFrame to a pandas DataFrame
machinereadings = projectDataFrame.toPandas()

# Split the datasets into successful devices and failed devices.
successrows = machinereadings.loc[machinereadings['state'] == 0]
failurerows = machinereadings.loc[machinereadings['state'] == 1]

print("Of the {} records, {} are devices that are OK, and {} are in a failure state".format(len(machinereadings), len(successrows), len(failurerows)))
print("")

# np.random.rand draws uniform values in [0, 1); comparing them to 0.7 yields a boolean mask
# that is True for roughly 70% of the rows in each bucket.
successmsk = np.random.rand(len(successrows)) < 0.7
failuremsk = np.random.rand(len(failurerows)) < 0.7

# For training, take the ~70% of rows selected by each mask
trainingDataFrame = pd.concat([successrows[successmsk], failurerows[failuremsk]])
# For testing, take the remaining ~30% of rows from each bucket
testingDataFrame = pd.concat([successrows[~successmsk], failurerows[~failuremsk]])

print("Training Data: {} records".format(len(trainingDataFrame)))
print(trainingDataFrame.head(5))

print("")

print("Testing Data : {} records".format(len(testingDataFrame)))
print(testingDataFrame.head(5))


Of the 10000 records, 8316 are devices that are OK, and 1684 are in a failure state

Training Data: 7045 records
         temp        volt      rotate  state  time   id
2   53.825507  169.259519  315.602502    0.0   3.0  1.0
5   49.593767  145.642949  370.523147    0.0   6.0  1.0
13  51.258633  180.305387  364.673218    0.0  14.0  1.0
14  47.227808  175.518532  317.506728    0.0  15.0  1.0
15  47.542005  163.243283  394.942473    0.0  16.0  1.0

Testing Data : 2955 records
        temp        volt      rotate  state  time   id
0  45.984259  150.513223  277.294014    0.0   1.0  1.0
1  52.903930  110.434075  314.586726    0.0   2.0  1.0
3  47.782676  110.829532  345.894652    0.0   4.0  1.0
7  49.624689  150.472998  310.409819    0.0   8.0  1.0
8  48.660651  155.330066  342.176581    0.0   9.0  1.0
CPU times: user 50.5 ms, sys: 16.6 ms, total: 67 ms
Wall time: 579 ms
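
Because the masks above are drawn without a fixed seed, each run produces a slightly different split (7045/2955 here rather than an exact 7000/3000). As a minimal sketch of an alternative, the helper below shuffles each class with a seeded generator and cuts it at exactly 70%. The function name `stratified_split`, its parameters, and the stand-in rows are assumptions for illustration and are not part of the notebook; only the class counts (8316 OK, 1684 failed) come from the output above.

```python
import random

def stratified_split(rows, label_key, train_fraction=0.7, seed=42):
    """Split rows into (train, test), sampling each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    buckets = {}
    for row in rows:
        buckets.setdefault(row[label_key], []).append(row)

    train, test = [], []
    for label in sorted(buckets):
        bucket = buckets[label][:]  # copy so the caller's data is not reordered
        rng.shuffle(bucket)
        cut = round(len(bucket) * train_fraction)
        train.extend(bucket[:cut])
        test.extend(bucket[cut:])
    return train, test

# Hypothetical stand-in rows with the same class balance as the dataset above
rows = [{"id": i, "state": 0.0} for i in range(8316)]
rows += [{"id": i, "state": 1.0} for i in range(8316, 10000)]

train, test = stratified_split(rows, "state")
train_failures = sum(1 for row in train if row["state"] == 1.0)
test_failures = sum(1 for row in test if row["state"] == 1.0)
print(len(train), len(test), train_failures, test_failures)
```

scikit-learn's `train_test_split` offers the same behavior through its `stratify` and `random_state` parameters, which is probably the more practical choice once the data is in a pandas DataFrame.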


# Create the model

The next step after splitting the data is to train a model. This is a two-class classification problem, which is well suited to
a decision tree.

We use the __scikit-learn__ DecisionTreeClassifier as the model, train it with our training data, then report on feature importance after training.

In [4]:
# We need to know which of the columns in our dataset are to be used as features, so we identify them here.
featureColumnNames = ["temp", "volt", "rotate", "time", "id"] 

# We set up two sets, one of features and one for labels
trainingFeatures = trainingDataFrame[featureColumnNames]
trainingLabels = trainingDataFrame["state"]

# fit/train the model with the training data
decisionTreeClassifier = DecisionTreeClassifier()
decisionTreeClassifier.fit(trainingFeatures, trainingLabels)

# Report on the feature importance. Note that the time and id fields contribute very little to
# predicting success or failure. In a real case, these fields would likely be removed from the
# feature set before production.
print("Results of fitting the classifier with the training data:")
print("Feature Columns: {}".format(featureColumnNames))
print("Feature Importance: {}".format(decisionTreeClassifier.feature_importances_))

Results of fitting the classifier with the training data:
Feature Columns: ['temp', 'volt', 'rotate', 'time', 'id']
Feature Importance: [ 0.30984062  0.29348829  0.29527884  0.07128591  0.03010634]
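
The two printed lists are easier to read when each feature name is paired with its score. The sketch below does that with plain Python, using the importance values copied from the output above; the variable names are illustrative only.

```python
# Values copied from the notebook output above
feature_names = ["temp", "volt", "rotate", "time", "id"]
importances = [0.30984062, 0.29348829, 0.29527884, 0.07128591, 0.03010634]

# Pair each name with its importance and sort from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print("{:<8}{:.3f}".format(name, score))
```

The importances sum to 1, so each value can be read as that feature's share of the tree's total impurity reduction.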


# Test the model 

Using our testing data (30% of the original dataset), we score the model by predicting results from the feature columns only.

In [5]:
# Select the feature columns from the testing data (30%) and pass them to predict
testdata = testingDataFrame[featureColumnNames]
results = decisionTreeClassifier.predict(testdata)

# Convert the results to a pandas DataFrame and name its single column "prediction" for easier reading.
pd_results = pd.DataFrame(results)
pd_results.columns = ["prediction"]

# Merge the test data set with the predictions. Because the test set is a subset of the main dataset, its original
# row index would cause pandas.concat to misalign the rows and fill the gaps with NaN. Create a new DataFrame with
# the index reset so that the concat works as expected.
mergetest = testingDataFrame.reset_index(drop=True)
resultset = pd.concat([mergetest, pd_results], axis=1)

# Visualize a few things
print("Prediction results using the testing data:")
print(resultset.head(5))

Prediction results using the testing data:
        temp        volt      rotate  state  time   id  prediction
0  45.984259  150.513223  277.294014    0.0   1.0  1.0         0.0
1  52.903930  110.434075  314.586726    0.0   2.0  1.0         1.0
2  47.782676  110.829532  345.894652    0.0   4.0  1.0         0.0
3  49.624689  150.472998  310.409819    0.0   8.0  1.0         0.0
4  48.660651  155.330066  342.176581    0.0   9.0  1.0         0.0
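
The index problem mentioned in the cell comments can be reproduced on a tiny frame. The sketch below uses made-up values (not the notebook's data) to show what `pd.concat` does when the indexes differ, and one alternative, `DataFrame.assign`, which attaches a NumPy array by position and so needs no index reset.

```python
import numpy as np
import pandas as pd

# Hypothetical test subset: the index has a gap, as it would after a random split
test_df = pd.DataFrame(
    {"temp": [45.9, 52.9, 47.8], "state": [0.0, 0.0, 1.0]},
    index=[0, 1, 3],
)
predictions = np.array([0.0, 1.0, 1.0])

# concat aligns on index labels, so labels 2 and 3 each appear in only one frame
misaligned = pd.concat([test_df, pd.DataFrame({"prediction": predictions})], axis=1)
print(misaligned)

# assign() adds the array positionally and keeps the original index
aligned = test_df.assign(prediction=predictions)
print(aligned)
```

Keeping the original index this way also makes it possible to trace a prediction back to its row in the full dataset.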


# Research the results of the model

Using the test done above, we can determine how the model performed. To do so:
    
- Use the model to calculate accuracy 
- Calculate the TP, FP, TN, FN 
- Calculate precision, recall and F-score of the model

We calculate precision and recall because accuracy alone is not enough when the classes are imbalanced. We should find that the model reaches, without any tweaking, 
close to 90% accuracy, but precision and recall are closer to 70%. In a real-world situation this is likely not sufficient and would
need tuning by the data scientist.

In [6]:
# Count the actual and predicted states in the result set
actualOkState = len(resultset.loc[resultset['state'] == 0.0])
actualFailState = len(resultset.loc[resultset['state'] == 1.0])
resultOk = len(resultset.loc[resultset['prediction'] == 0.0])
resultFail = len(resultset.loc[resultset['prediction'] == 1.0])

trueNegative = len(resultset.loc[(resultset['state'] == 0.0) & (resultset['prediction'] == 0.0)]) # TN
falsePositive = len(resultset.loc[(resultset['state'] == 0.0) & (resultset['prediction'] == 1.0)]) #FP

truePositive = len(resultset.loc[(resultset['state'] == 1.0) & (resultset['prediction'] == 1.0)]) #TP
falseNegative = len(resultset.loc[(resultset['state'] == 1.0) & (resultset['prediction'] == 0.0)]) #FN

# Precision is the fraction of failure predictions that are correct: Precision = TP/(TP+FP)
precision = truePositive / (truePositive+falsePositive)
# Recall is the fraction of actual failures that are correctly identified: Recall = TP/(TP+FN)
recall = truePositive / (truePositive+falseNegative)
# F-score is the harmonic mean of the two: F = 2*(precision*recall)/(precision+recall)
fscore = 2 * ((precision*recall)/(precision+recall))

# Print out the score of the model and information about the result set
print("Model accuracy {}".format(decisionTreeClassifier.score(testdata,testingDataFrame["state"])))

print("")

print("Result set size: {0}".format(len(resultset)))

print("")

print("Info on devices that are OK:")
print("Known good = {} , predicted good {}".format(actualOkState, resultOk))

print("")

print("Info on devices that have failed:")
print("Known failed = {} , predicted failed {}".format(actualFailState, resultFail))

print("")

print("Precision {}".format(precision))
print("Recall {}".format(recall))
print("FScore {}".format(fscore))



Model accuracy 0.896108291032149

Result set size: 2955

Info on devices that are OK:
Known good = 2430 , predicted good 2423

Info on devices that have failed:
Known failed = 525 , predicted failed 532

Precision 0.7048872180451128
Recall 0.7142857142857143
FScore 0.7095553453169348
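
To see why accuracy alone is misleading here, it helps to work the numbers by hand. The confusion-matrix counts below are not printed by the notebook; they are inferred from the output above (525 known failures at recall 0.714 gives TP = 375, so FN = 150, FP = 532 - 375 = 157 and TN = 2430 - 157 = 2273). The helper `classification_metrics` is an illustrative name, and the zero-denominator guards are an addition the notebook cell does not have.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    fscore = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "fscore": fscore}

# Counts inferred from the printed results above
model = classification_metrics(tp=375, fp=157, tn=2273, fn=150)

# A baseline that labels every device "OK" never predicts a failure
baseline = classification_metrics(tp=0, fp=0, tn=2430, fn=525)

for name, metrics in (("model", model), ("always OK", baseline)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The always-OK baseline still scores about 82% accuracy while catching no failures at all, which is why the decision tree's 90% accuracy says little until precision and recall are checked.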


# Save the model

This section will perform the following steps:
    
- Use pickle to serialize the model to a local file
- Upload the model file to an Azure Storage Account

In [8]:
# Create an azure block blob service object
az_blob_service = BlockBlobService(account_name=AZURE_STORAGE_ACCOUNT_NAME, account_key=AZURE_STORAGE_ACCOUNT_KEY)

# Create the local directory if not already present
if not os.path.exists(LOCAL_SYSTEM_DIRECTORY):
    os.makedirs(LOCAL_SYSTEM_DIRECTORY)
    print('DONE creating a local directory!')
else:
    print('Local directory already exists!')
    
# Open the local file and dump the model to it; the with-block closes the file afterwards.
with open(LOCAL_MODEL_FILE, 'wb') as filestream:
    pickle.dump(decisionTreeClassifier, filestream)
print("Model file was serialized to local path {}".format(LOCAL_MODEL_FILE))

# Upload the local file to Azure Storage
az_blob_service.create_blob_from_path(
    AZURE_STORAGE_CONTAINER_NAME,
    AZURE_STORAGE_BLOB_NAME,
    LOCAL_MODEL_FILE,
    content_settings=ContentSettings(content_type='application/octet-stream'))

print("Model {} was uploaded to storage in {} container".format(LOCAL_MODEL_FILE, AZURE_STORAGE_CONTAINER_NAME))

Local directory already exists!
Model file was serialized to local path ./modelfile/factory.pkl
Model ./modelfile/factory.pkl was uploaded to storage in readydemo container


# Download the model, de-serialize it and test it

In [9]:
# Delete the local model file (if it exists) so the download below is a real test
if os.path.isfile(LOCAL_MODEL_FILE):
    os.remove(LOCAL_MODEL_FILE)
    print("Local model file exists and was deleted")
    
# Pull model back and try to run it
az_blob_service.get_blob_to_path(AZURE_STORAGE_CONTAINER_NAME, AZURE_STORAGE_BLOB_NAME, LOCAL_MODEL_FILE)

# Load the model from the location it was downloaded
print("Import the model from storage - {}".format(AZURE_STORAGE_BLOB_NAME))
with open(LOCAL_MODEL_FILE, 'rb') as localFile:
    decisionTreeClassifier = pickle.load(localFile)
print("Model was downloaded from storage and the model re-created.")

# Compare the accuracy with the score reported before the upload
print("Model accuracy {}".format(decisionTreeClassifier.score(testdata,testingDataFrame["state"])))

Local model file exists and was deleted
Import the model from storage - factory.pkl
Model was downloaded from storage and the model re-created.
Model accuracy 0.896108291032149
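
The serialize and de-serialize steps can be tried without a storage account. The following is a minimal sketch of the pickle round trip on its own, using a plain dictionary as a hypothetical stand-in for the trained classifier and a temporary directory in place of `modelfile`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model; any picklable object behaves the same way
model = {"name": "factory", "feature_columns": ["temp", "volt", "rotate", "time", "id"]}

with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "factory.pkl")

    # Serialize: the with-block closes the file even if pickle.dump raises
    with open(model_path, "wb") as stream:
        pickle.dump(model, stream)

    # De-serialize into a new, independent object
    with open(model_path, "rb") as stream:
        restored = pickle.load(stream)

print(restored == model, restored is model)
```

Note that `pickle.load` can execute arbitrary code while loading, so it should only be used on files from a trusted source, such as a storage container you control.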
