## Imports and Variables ##

><b>Pre-requisites</b>
>
> Azure ML SDK has been installed.
>
> You have run <b>jupyter notebook</b> from a command line that was based in your project directory.
>
> You have successfully executed <b>az login</b> from said command line.

This notebook retrieves the training data from a predetermined location in Azure and separates it into training and test data sets.

From there, it trains a model, tests it, reports statistics on it (they aren't great, but this is just an exercise), and finally writes the model out to disk.

## Imports and Variables ##

In [None]:
# Data manipulation/acquisition
import os
import pandas as pd
import numpy as np
from shutil import copyfile
# BlockBlobService is part of the legacy azure-storage-blob SDK (releases before v12)
from azure.storage.blob import BlockBlobService
from azure.storage.blob import ContentSettings


# Uses scikit-learn; this import will fail if it has not been added as a dependency
from sklearn.tree import DecisionTreeClassifier

# Serialize models
import pickle

# Do NOT change this storage information as it is the training data provided for you.
AZURE_STORAGE_ACCOUNT_NAME = "catjulydemodata"
AZURE_STORAGE_ACCOUNT_KEY = "94nA2VT9YR8cWuT0eqUMM8uWJiRgTUpkGwONKKD/WsZIPC8UDH6PBQFSgTTh01qbY9WqCF0HoWvOnb1wRmIM6Q=="
AZURE_STORAGE_DATA_CONTAINER_NAME = "factorydemo"
AZURE_STORAGE_DATA_BLOB_NAME = "factorydataset.csv"

#AZURE_STORAGE_MODEL_CONTAINER_NAME = "factorymodel"
ML_MODEL_NAME = "factorymodel.pkl"

LOCAL_SYSTEM_DATA_DIRECTORY = "./TrainingData"
LOCAL_SYSTEM_DATA_FILE = "{}/{}".format(LOCAL_SYSTEM_DATA_DIRECTORY, AZURE_STORAGE_DATA_BLOB_NAME)
LOCAL_SYSTEM_MODEL_FILE = "./{}".format(ML_MODEL_NAME)


## Obtain Training Data ##

Downloads the pre-created training data for this lab.

In [None]:
# Create the local directory for the training data (provided for you)
if not os.path.exists(LOCAL_SYSTEM_DATA_DIRECTORY):
    os.makedirs(LOCAL_SYSTEM_DATA_DIRECTORY)
    print('DONE creating a local directory!')
else:
    print('Local directory already exists!')
    
#Create the blob service
az_blob_service = BlockBlobService(account_name=AZURE_STORAGE_ACCOUNT_NAME, account_key=AZURE_STORAGE_ACCOUNT_KEY)

#Download the training data
az_blob_service.get_blob_to_path(AZURE_STORAGE_DATA_CONTAINER_NAME, AZURE_STORAGE_DATA_BLOB_NAME, LOCAL_SYSTEM_DATA_FILE)

allTrainingDataFrame = pd.read_csv(LOCAL_SYSTEM_DATA_FILE)

allTrainingDataFrame.head(5)

## Create datasets for training and testing ##

We first count how many successes and how many failures the data contains. That count shows that the data is heavily weighted towards successful devices.

To ensure that we have reasonable data for both testing and training we will:

- Split the data into two buckets: successful devices and failed devices.
- Randomly choose approximately 70% of the data from each bucket for training and use the remaining 30% for testing.

This type of split stratifies the data so that neither set is too heavily weighted towards either class.
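
As a point of comparison, the same per-class 70/30 split can be sketched with pandas alone. This is a minimal sketch rather than the method this notebook uses: the `toy` frame and its values are made up for illustration, and `groupby(...).sample` assumes pandas 1.1 or later. Unlike a random mask, it picks a fixed fraction of each class.

```python
import pandas as pd

# Hypothetical toy data: 100 OK devices (state 0) and 20 failed devices (state 1).
toy = pd.DataFrame({
    "temp": range(120),
    "state": [0] * 100 + [1] * 20,
})

# Sample 70% of the rows inside each class; random_state makes the split repeatable.
train = toy.groupby("state").sample(frac=0.7, random_state=42)

# Every row that was not picked for training goes to the test set.
test = toy.drop(train.index)

print("Training rows per class:", train["state"].value_counts().to_dict())
print("Testing rows per class :", test["state"].value_counts().to_dict())
```

Both sets keep the original 5:1 ratio of OK to failed devices, which is the property the stratified split is meant to protect.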

In [None]:
# Split the datasets into successful devices and failed devices.
successrows = allTrainingDataFrame.loc[allTrainingDataFrame['state'] == 0]
failurerows = allTrainingDataFrame.loc[allTrainingDataFrame['state'] == 1]

print("Of the {} records, {} are devices that are OK, and {} are in a failure state".format(len(allTrainingDataFrame), len(successrows), len(failurerows)))
print("")

# NumPy generates one uniform random number in [0, 1) per row; comparing it with 0.7 yields a boolean mask
# that is True for roughly 70% of the rows. This lets us choose ~70% of the successful and failed devices.
successmsk = np.random.rand(len(successrows)) < 0.7
failuremsk = np.random.rand(len(failurerows)) < 0.7

# For training, take the ~70% of rows from each bucket that the masks selected
trainingDataFrame = pd.concat([successrows[successmsk], failurerows[failuremsk]])
# For testing, take the remaining ~30% of rows from each bucket
testingDataFrame = pd.concat([successrows[~successmsk], failurerows[~failuremsk]])

print("Training Data: {} records".format(len(trainingDataFrame)))
print(trainingDataFrame.head(5))

print("")

print("Testing Data : {} records".format(len(testingDataFrame)))
print(testingDataFrame.head(5))


## Train the model ##

The next step after splitting the data is to train a model. This is a two-class (binary) classification problem, which is well suited to a decision tree.

We use the __scikit-learn__ DecisionTreeClassifier as the model, train it with our training data, then report on feature importance after training.
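
To make the feature importance report easier to read, the sketch below pairs each importance with its column name. It is an illustration under stated assumptions, not this lab's data: the `toy` frame, the rule that a device fails above a temperature of 80, and the variable names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the label depends only on temp, never on id.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "temp": rng.uniform(20, 100, size=200),
    "id": np.arange(200),
})
toy["state"] = (toy["temp"] > 80).astype(int)  # assumed failure rule, for illustration only

features = ["temp", "id"]
clf = DecisionTreeClassifier(random_state=0)
clf.fit(toy[features], toy["state"])

# feature_importances_ follows the column order used in fit, so zip pairs them up.
importance = dict(zip(features, clf.feature_importances_))
for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

The importances sum to 1, so a column such as an identifier that scores near zero is adding little to the decisions.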

In [None]:
# We need to know which of the columns in our dataset are to be used as features, so we identify them here.
featureColumnNames = ["temp", "volt", "rotate", "time", "id"] 

# We set up two sets, one of features and one for labels
trainingFeatures = trainingDataFrame[featureColumnNames]
trainingLabels = trainingDataFrame["state"]

# fit/train the model with the training data
decisionTreeClassifier = DecisionTreeClassifier()
decisionTreeClassifier.fit(trainingFeatures, trainingLabels)

# Report on the feature importance. Note that the time and id fields contribute very little to
# predicting success or failure. In a real case, these fields would likely be left out
# of the feature set before production.
print("Results of fitting the classifier with the training data:")
print("Feature Columns: {}".format(featureColumnNames))
print("Feature Importance: {}".format(decisionTreeClassifier.feature_importances_))

As we can see, time and id contribute very little to the model's decisions, but we keep them in so that, when results are returned, we can correctly record them against the device.

## Test the model ##

Using our testing data (roughly 30% of the original dataset), we score the model by predicting results, passing in only the feature columns.

In [None]:
# Select the feature columns from the testing data (~30%) and pass them to predict
testdata = testingDataFrame[featureColumnNames]
results = decisionTreeClassifier.predict(testdata)

# Turn the results to pandas dataframe and rename the one column to prediction for easier reading of results.
pd_results = pd.DataFrame(results)
pd_results.columns = ["prediction"]

# Merge the test data set with the predictions. Because the test set is a subset of the main dataset, its row
# labels do not match the 0..n-1 labels of the predictions, so pandas.concat would misalign the rows and pad
# them with NaN. Create a new dataframe with the index reset so that the concat works as expected.
mergetest = testingDataFrame.reset_index(drop=True)
resultset = pd.concat([mergetest, pd_results], axis=1)

# Visualize a few things
print("Prediction results using the testing data:")
print(resultset.head(5))
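
The index reset in the cell above matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The following is a small sketch of that behavior using made-up frames (`subset` and `predictions` are hypothetical names, not part of this notebook):

```python
import pandas as pd

# Hypothetical test subset that kept its original row labels (3, 7, 9).
subset = pd.DataFrame({"state": [0, 1, 0]}, index=[3, 7, 9])
# Predictions arrive as a fresh frame labelled 0, 1, 2.
predictions = pd.DataFrame({"prediction": [0, 1, 1]})

# Without a reset, concat takes the union of both indexes and fills the gaps with NaN.
misaligned = pd.concat([subset, predictions], axis=1)

# After a reset, both frames share labels 0, 1, 2 and line up row by row.
aligned = pd.concat([subset.reset_index(drop=True), predictions], axis=1)

print("Without reset_index:", misaligned.shape)
print("With reset_index   :", aligned.shape)
```

The misaligned frame has twice as many rows as the test subset, and every row is missing either its state or its prediction.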

## Research the results of the model ##

Using the test done above, we can determine how the model performed. To do so:

- Use the model to calculate accuracy
- Calculate the TP, FP, TN, FN
- Calculate precision, recall and F-score of the model

We calculate precision and recall because accuracy alone is not enough: with data this heavily weighted towards successful devices, a model can score high accuracy while still missing many failures. We should find that the model performs, without any tweaking, at close to 90% accuracy, but precision and recall are closer to 70%. In a real-world situation this likely is not sufficient and would need tweaking by the data scientist.
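
To see how accuracy and precision/recall can diverge, here is a worked sketch with hypothetical counts. The numbers are invented for illustration and are not this notebook's results, and `classification_metrics` is a helper name introduced only for this example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F-score from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard each ratio so an empty denominator yields 0.0 instead of an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical test set: 900 OK devices and 100 failed devices.
metrics = classification_metrics(tp=70, fp=30, tn=870, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the large number of correctly classified OK devices lifts accuracy well above precision and recall, which only look at the failure class.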


In [None]:
# Count the actual and predicted states in the result set
actualOkState = len(resultset.loc[resultset['state'] == 0.0])
actualFailState = len(resultset.loc[resultset['state'] == 1.0])
resultOk = len(resultset.loc[resultset['prediction'] == 0.0])
resultFail = len(resultset.loc[resultset['prediction'] == 1.0])

# Calculate True/False Negatives
trueNegative = len(resultset.loc[(resultset['state'] == 0.0) & (resultset['prediction'] == 0.0)]) # TN
falseNegative = len(resultset.loc[(resultset['state'] == 1.0) & (resultset['prediction'] == 0.0)]) #FN

# Calculate True/False Positives
truePositive = len(resultset.loc[(resultset['state'] == 1.0) & (resultset['prediction'] == 1.0)]) #TP
falsePositive = len(resultset.loc[(resultset['state'] == 0.0) & (resultset['prediction'] == 1.0)]) #FP

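# Precision is the percentage of failure predictions that are correct: Precision = TP/(TP+FP)
# The conditionals guard against division by zero, e.g. when the model predicts no failures.
precision = truePositive / (truePositive + falsePositive) if (truePositive + falsePositive) else 0.0
# Recall is the percentage of actual failures that are correctly identified: Recall = TP/(TP+FN)
recall = truePositive / (truePositive + falseNegative) if (truePositive + falseNegative) else 0.0
# F score: f = 2*(precision*recall)/(precision + recall)
fscore = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0.0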

#Print out the score of the model and information about the result set
print("Model accuracy {}".format(decisionTreeClassifier.score(testdata,testingDataFrame["state"])))
print("")
print("Result set size: {0}".format(len(resultset)))
print("")
print("Info on devices that are OK:")
print("Known good = {} , predicted good {}".format(actualOkState, resultOk))
print("")
print("Info on devices that have failed:")
print("Known failed = {} , predicted failed {}".format(actualFailState, resultFail))
print("")
print("Precision {}".format(precision))
print("Recall {}".format(recall))
print("FScore {}".format(fscore))

Ok, so these results are pretty abysmal in the eyes of a data scientist, but we will continue on!

## Save the model ##

This section will perform the following steps:
    
- Use pickle to serialize the model to a local file
- Upload the model file to an Azure Storage Account (this step is commented out in the cell below)
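
Before relying on a saved file, it can help to confirm that the model survives a round trip through pickle. The sketch below does that with a tiny stand-in model and a temporary directory; the training data, the file name `toymodel.pkl`, and the variable names are hypothetical and separate from this notebook's model.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model trained on made-up data.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toymodel.pkl")

    # Serialize the model; the with block closes the file even if dump raises.
    with open(model_path, "wb") as stream:
        pickle.dump(model, stream)

    # Load it back the way a consumer of the file would.
    with open(model_path, "rb") as stream:
        restored = pickle.load(stream)

# The restored model should predict exactly what the original does.
same_predictions = list(model.predict(X)) == list(restored.predict(X))
print("Round trip preserved predictions:", same_predictions)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.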

In [None]:
# Open the local file and dump the model to it; the with block closes the file for us.
with open(LOCAL_SYSTEM_MODEL_FILE, 'wb') as filestream:
    pickle.dump(decisionTreeClassifier, filestream)
print("Model file was serialized to local path {}".format(LOCAL_SYSTEM_MODEL_FILE))

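# Upload the local file to Azure Storage (disabled). To enable it, uncomment AZURE_STORAGE_MODEL_CONTAINER_NAME
# above and define AZURE_STORAGE_MODEL_BLOB_NAME, which this notebook never sets.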
#az_blob_service.create_blob_from_path(
#    AZURE_STORAGE_MODEL_CONTAINER_NAME,
#    AZURE_STORAGE_MODEL_BLOB_NAME,
#    LOCAL_SYSTEM_MODEL_FILE,
#    content_settings=ContentSettings(content_type='application/octet-stream'))

#print("Model {} was uploaded to storage in {} container".format(AZURE_STORAGE_MODEL_BLOB_NAME, AZURE_STORAGE_MODEL_CONTAINER_NAME))