# Tutorial #1: Train a raw sound classification model with Azure Machine Learning
In this tutorial, you train a machine learning model on remote compute resources. You'll use the training and deployment workflow for Azure Machine Learning service (preview) in a Python Jupyter notebook. You can then use the notebook as a template to train your own machine learning model with your own data. This tutorial is part one of a two-part tutorial series.

This tutorial trains a CNN model on a raw sound dataset captured by [SoundCaptureModule](https://github.com/ms-iotkithol-jp/MicCaptureIoTSoundSample) running on an Azure IoT Edge device, using Azure Machine Learning. The sound dataset consists of a raw-format file and a CSV-format file for each time span. The goal is to create a multi-class classifier that identifies whether a guitar chord is major or minor.

Learn how to:

- Set up your development environment
- Access and examine the data
- Train a CNN model on a remote cluster
- Review training results, find and register the best model

You'll learn how to select a model and deploy it in part two of this tutorial.

## Prerequisites
See prerequisites in the Azure Machine Learning documentation.

## Set up your development environment
All the setup for your development work can be accomplished in a Python notebook. Setup includes:

- Importing Python packages
- Connecting to a workspace to enable communication between your local computer and remote resources
- Creating an experiment to track all your runs
- Creating a remote compute target to use for training

## Import packages
Import the Python packages you need in this session, and display the Azure Machine Learning SDK version.

In [None]:
# l-1
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Connect to workspace
Create a workspace object from the existing workspace. Workspace.from_config() reads the file config.json and loads the details into an object named ws.

In [None]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')
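
As a minimal sketch of what Workspace.from_config() needs, the snippet below writes a placeholder config.json and checks that it contains the three entries the SDK reads (subscription_id, resource_group, workspace_name). The helper name check_workspace_config and all values are hypothetical; replace the placeholders with the details of your own workspace.

```python
import json
import os
import tempfile

# Keys that Workspace.from_config() reads from config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path):
    """Hypothetical helper: return the required keys missing from config.json."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values for illustration only.
sample_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder so that a real config.json is not overwritten.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(sample_config, f, indent=2)

missing_keys = check_workspace_config(config_path)
print("missing keys:", missing_keys)
```

A check like this can be run before connecting, so that a missing entry is reported before any call to Azure is made.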

## Create experiment
Create an experiment to track the runs in your workspace. A workspace can have multiple experiments.

In [None]:
experiment_name = 'raw-sound-major-miner-cnn'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)



Specify the names and local folders of the training and test datasets.

In [None]:
# l-2
dataset_name = 'sound_data'
testset_name = 'sound_test'

data_folder_path = 'data'
test_folder_path = 'test'

soundDataDefFile = 'sounddata-csv.yml'
dataSrorageConfigFile = 'data-storage-config.yml'

## Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

<b>Creation of compute takes approximately 5 minutes</b>. If the AmlCompute with that name is already in your workspace the code will skip the creation process.

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D3_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
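
Values read from environment variables are always strings, while the node counts of a cluster are integers. The following sketch, which uses a hypothetical helper named env_int, shows the difference; the value "2" is set here only to simulate a variable exported in the shell.

```python
import os


def env_int(name, default):
    """Hypothetical helper: read an environment variable as an integer."""
    return int(os.environ.get(name, default))


# Simulate one variable being set in the shell and one being absent.
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "2"
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)

raw_value = os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 0)

print(type(raw_value).__name__, max_nodes, min_nodes)
```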

You now have the necessary packages and compute resources to train a model in the cloud.

## Explore data
Before you train a model, you need to understand the data that you are using to train it. In this section you learn how to:

- Download the captured raw sound dataset, which is stored as CSV files
- Split each CSV file into the chunks in which it was captured and convert it to a format suitable for CNN training
- Create a label dataset from a CSV file that you provide separately from the sound CSV files
- Display some sounds

Each sound CSV file is named as follows:
<i>deviceid</i>-sound-<i>yyyyMMddHHmmssffffff</i>.csv
The <i>yyyyMMddHHmmss</i> part is the time at which sound capture started, so you can select the files for a label by specifying a start time and an end time.
Each record of the label definition CSV file therefore has the following format:
<i>label_name</i>,<i>start timestamp</i>,<i>end timestamp</i>

Both timestamps must use the following format:
<i>yyyy</i>/<i>MM</i>/<i>dd</i> <i>HH</i>:<i>mm</i>:<i>ss</i>

The following code uses these label definition CSV file names:
 - for training  'train-label-range.csv'
 - for testing   'test-label-range.csv'

Run the following cell only when the dataset is ready or has been updated.

Note: You can use the [sample guitar raw sound dataset](https://egstorageiotkitvol5.blob.core.windows.net/sound-ml-data/raw-sound-data.zip). Extract the files from the zip file and upload them to a blob container that you create in your own Azure account.
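
To make the two formats concrete, here is a minimal sketch that parses label records and decides which label, if any, applies to a sound file based on the timestamp in its name. The function names are hypothetical, and the sample records and file names are chosen for illustration; the notebook's own code compares integer timestamps instead, which gives the same inclusive range check.

```python
import re
from datetime import datetime

# <deviceid>-sound-<yyyyMMddHHmmssffffff>.csv
FILE_PATTERN = re.compile(r"^(?P<device>.+)-sound-(?P<stamp>\d{20})\.csv$")
LABEL_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def capture_time(file_name):
    """Return the capture start time encoded in a sound file name, or None."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        return None
    return datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S%f")


def parse_label_record(line):
    """Parse 'label_name,start timestamp,end timestamp' into a tuple."""
    label, start, end = line.strip().split(",")
    return (label,
            datetime.strptime(start, LABEL_TIME_FORMAT),
            datetime.strptime(end, LABEL_TIME_FORMAT))


def label_for(file_name, records):
    """Return the first label whose time span contains the file's capture time."""
    captured = capture_time(file_name)
    if captured is None:
        return None
    for label, start, end in records:
        if start <= captured <= end:
            return label
    return None


records = [parse_label_record(line) for line in [
    "major,2020/02/18 11:36:37,2020/02/18 11:36:44",
    "minor,2020/02/18 11:36:56,2020/02/18 11:37:01",
]]

matched = label_for("cherry-sound-20200218113643319806.csv", records)
unmatched = label_for("cherry-sound-20200218113650000000.csv", records)
print(matched, unmatched)
```

Files whose capture time falls outside every labeled span, like the second one here, are simply not selected.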

In [None]:
# l-3
import os
train_label_range_csv_file = 'train-label-range.csv'
test_label_range_csv_file = 'test-label-range.csv'

csv_files = [train_label_range_csv_file, test_label_range_csv_file]
parsed_specs =[]
for csv in csv_files:
    specs ={}
    parsed_specs.append(specs)
    with open(csv,"rt") as f:
        ranges = f.readlines()
        for r in ranges:
            spec = r.rstrip().split(',')
            if not (spec[0] in specs):
                specs[spec[0]] = []
            specs[spec[0]].append([spec[1],spec[2]])

duration_for_train = parsed_specs[0]
duration_for_test = parsed_specs[1]
            
for i, ps in enumerate(parsed_specs):
    print('spec for - ' + csv_files[i])
    for d in ps.keys():
        for s in ps[d]:
            print(' {0}:{1}-{2}'.format(d,s[0],s[1]))


## Download guitar sound dataset from your own blob container
This code downloads only the CSV files whose capture time falls within a labeled time span. Before you run it, set source_azure_storage_account_connection_string and source_container_name to match the storage account and container that hold the sound data files. The connection string is read from the blob_connection_string entry in data-storage-config.yml.

The files that satisfy the criteria are downloaded into the data and test folders.

In [None]:
# l-4
#!pip install -U azure-storage-blob>=12.2.0
#!pip list
import os
import datetime
import yaml

yml = {}
with open(dataSrorageConfigFile,'r') as ymlfile:
    yml.update(yaml.safe_load(ymlfile))
#print('config - {}'.format(yml))

# Specify datasource
source_azure_storage_account_connection_string = yml['blob_connection_string']
#source_azure_storage_account_connection_string = '< - your azure storage connection string - >'
source_container_name = 'edgesounds'

# Specify start and end time of duration for each chord
# duration_for_train = {
#    'major':[['2020/02/09 16:58:34', '2020/02/09 16:58:40'], ['2020/02/09 17:06:02','2020/02/09 17:06:10'], ['2020/02/09 16:59:42','2020/02/09 16:59:50'], ['2020/02/09 17:00:41','2020/02/09 17:00:49'],['2020/02/18 11:27:20', '2020/02/18 11:27:26'],['2020/02/18 11:28:05', '2020/02/18 11:28:11'],['2020/02/18 11:28:41', '2020/02/18 11:28:44'],['2020/02/18 11:29:18', '2020/02/18 11:29:20'],['2020/02/18 11:29:51', '2020/02/18 11:29:57'],['2020/02/18 11:30:25', '2020/02/18 11:30:29'],['2020/02/18 11:31:05', '2020/02/18 11:31:12']
# ],
#     'minor':[['2020/02/09 16:58:49', '2020/02/09 16:59:00'], ['2020/02/09 17:06:26','2020/02/09 17:06:36'], ['2020/02/09 16:59:57','2020/02/09 17:00:05'], ['2020/02/09 17:00:56','2020/02/09 17:01:03'],['2020/02/18 11:27:41', '2020/02/18 11:27:47'],['2020/02/18 11:28:22', '2020/02/18 11:28:26'],['2020/02/18 11:28:59', '2020/02/18 11:29:03'],['2020/02/18 11:29:33', '2020/02/18 11:29:38'],['2020/02/18 11:30:07', '2020/02/18 11:30:13'],['2020/02/18 11:30:45', '2020/02/18 11:30:49'],['2020/02/18 11:31:24', '2020/02/18 11:31:35']
# ]
# }

# duration_for_test = {
#     'major':[['2020/02/18 11:36:37', '2020/02/18 11:36:44'],['2020/02/18 11:37:12', '2020/02/18 11:37:18'],['2020/02/18 11:37:43', '2020/02/18 11:37:49'],['2020/02/18 11:38:18', '2020/02/18 11:38:23'],['2020/02/18 11:38:58', '2020/02/18 11:39:06'],['2020/02/18 11:39:36', '2020/02/18 11:39:40'],['2020/02/18 11:40:14', '2020/02/18 11:40:20']],
#     'minor':[['2020/02/18 11:36:56', '2020/02/18 11:37:01'],['2020/02/18 11:37:25', '2020/02/18 11:37:33'],['2020/02/18 11:38:00', '2020/02/18 11:38:08'],['2020/02/18 11:38:35', '2020/02/18 11:38:42'],['2020/02/18 11:39:15', '2020/02/18 11:39:21'],['2020/02/18 11:39:53', '2020/02/18 11:39:59'],['2020/02/18 11:40:34', '2020/02/18 11:40:41']]
# }

def pickup_target_files(target_folder_name, duration_for_target):
    folder_for_label = {}
    condition_for_label = {}
    target_data = {}
    # local folder that stores the downloaded data
    data_folder = os.path.join(os.getcwd(), target_folder_name)
    for dflk in duration_for_target.keys():
        folder_for_key = os.path.join(data_folder, dflk)
        folder_for_label[dflk] = folder_for_key
        os.makedirs(folder_for_key, exist_ok=True)
        condition_for_label[dflk] = []
        durs = duration_for_target[dflk]
        for dur in durs:
            dur_se = []
            # iterate without pop() so the input lists are not emptied and the cell can be rerun
            for t in dur:
                ttime = datetime.datetime.strptime(t, '%Y/%m/%d %H:%M:%S')
                tnum = ttime.strftime('%Y%m%d%H%M%S') + '000000'
                dur_se.append(int(tnum))
            condition_for_label[dflk].append(dur_se)
        target_data[dflk] = []
    return folder_for_label, condition_for_label, target_data

data_folder_name = 'data'
test_folder_name = 'test'

train_folder_for_label, train_condition_for_label, train_data = pickup_target_files(data_folder_name, duration_for_train)
test_folder_for_label, test_condition_for_label, test_data = pickup_target_files(test_folder_name, duration_for_test)

from azure.storage.blob import BlobServiceClient
import datetime
import re
import numpy as np

# Connect to our blob via the BlobService
blobServiceClient = BlobServiceClient.from_connection_string(source_azure_storage_account_connection_string)
containerClient = blobServiceClient.get_container_client(source_container_name)
soundDataDefFile = 'sounddata-csv.yml'

def load_targeted_blobs(container, condition_for_target, folder_for_target, data_for_target ):
    with open(soundDataDefFile, "wb") as ymlFile:
        blobClient = containerClient.get_blob_client(soundDataDefFile)
        stream = blobClient.download_blob()
        blobContent = stream.readall()
        ymlFile.write(blobContent)
        
    target_blobs = []
    loaded_num_of_files = {}
    for blob in containerClient.list_blobs():
        # capture the timestamp that follows 'sound-' so digits in the device id are ignored
        matching = re.search(r'sound-([0-9]+)\.csv', blob.name)
        if matching:
            target_blobs.append({'blob':blob, 'num-of-ts':int(matching.group(1))})
    for l in condition_for_target.keys():
        for cfl in condition_for_target[l]:
            filtered = list(filter(lambda b: cfl[0] <= b['num-of-ts'] and b['num-of-ts'] <= cfl[1], target_blobs))
            data_for_target[l].append(filtered)
    
    for l in data_for_target.keys():
        num_of_files = 0
        print('Label - '+l)
        for dft in data_for_target[l]:
            for ltd in dft:
                blobClient = containerClient.get_blob_client(ltd['blob'])
                stream = blobClient.download_blob()
                csvFilePath = os.path.join(folder_for_target[l], ltd['blob'].name)
                print(' Downloading - ' + ltd['blob'].name)
                with open(csvFilePath,"wb") as csvFile:
                    blobContent = stream.readall()
                    csvFile.write(blobContent)
                num_of_files = num_of_files + 1
        loaded_num_of_files[l] = num_of_files

    return loaded_num_of_files

result = load_targeted_blobs(containerClient, train_condition_for_label, train_folder_for_label, train_data)
for k in result.keys():
    print('Loaded file for train:{0} - {1}'.format(k, result[k]))
result = load_targeted_blobs(containerClient, test_condition_for_label, test_folder_for_label, test_data)
for k in result.keys():
    print('Loaded file for test:{0} - {1}'.format(k, result[k]))

import shutil
for fldr in [data_folder_name, test_folder_name]:
    destFName = os.path.join(fldr,soundDataDefFile)
    shutil.copy(soundDataDefFile, destFName)
    


### Create Training and Test data
Reshape the data for training and testing.

In [None]:
%%writefile loadsounds.py
# l-5
import os
import csv
import numpy as np
import random
import yaml

def load_csvdata(file):
    with open(file) as f:
        reader = csv.reader(f)
        l = [row for row in reader]
        l.pop(0)
        return np.array(l).T.astype(np.int16) / 32768

# data_folder = 'data'
# data_folder_path = os.path.join(os.getcwd(), data_folder)

def parse_file(file_path, labeled_dataset, data_chunk, by_channel=False):
    csvdata = load_csvdata(file_path)
    if by_channel:
        if len(csvdata) > 0:
            num_of_block = int(len(csvdata[0]) / data_chunk)
            if num_of_block > 0 :
                labeled_dataset = np.ndarray(shape=(len(csvdata),num_of_block,data_chunk))

    num_of_chunk = 0
    for index, umicdata in enumerate(csvdata):
        if len(umicdata) % data_chunk == 0:
            micdata1024 = np.split(umicdata, len(umicdata) // data_chunk)
            if by_channel:
                labeled_dataset[index] = micdata1024
            else:
                if len(labeled_dataset) == 0:
                    labeled_dataset = micdata1024
                else:
                    labeled_dataset = np.append(labeled_dataset, micdata1024,axis=0)
            num_of_chunk = num_of_chunk + len(micdata1024)

    # print('{}:{} units'.format(file_path, num_of_chunk))
    return num_of_chunk, labeled_dataset

def load_data_definition(data_def_file_path):
    definition = {}
    with open(data_def_file_path, "r") as ymlFile:
        yml = yaml.safe_load(ymlFile)
        definition.update(yml)
    return definition

def reshape_dataset(sound_data, data_chunk):
    dataset = np.zeros((len(sound_data), 1, 1, data_chunk))
    for index, d in enumerate(sound_data):
        dataset[index][0][0] = d

    return dataset

def load_data(data_folder_path, data_def_file):
    data_chunk = 1024
    
    tdata ={}
    num_of_data = 0
    for df in os.listdir(path=data_folder_path):
        if df == data_def_file:
            ddefFile = os.path.join(data_folder_path, df)
            data_chunk = load_data_definition(ddefFile)['data-chunk']
            print('data_chunk - {}'.format(data_chunk))
            continue
            
        tdata[df] = np.array([])
        ldata_folder_path = os.path.join(data_folder_path,df)
        for datafile in os.listdir(path=ldata_folder_path):
            datafile_path = os.path.join(ldata_folder_path, datafile)
            num_of_chunks, tdata[df] = parse_file(datafile_path, tdata[df], data_chunk)
            num_of_data = num_of_data + num_of_chunks

    data_of_sounds = np.zeros((num_of_data, 1, 1, data_chunk))
    label_of_sounds = np.zeros(num_of_data,dtype=int)
    label_matrix_of_sounds = np.zeros((num_of_data, len(tdata.keys())))
    labelname_of_sounds = np.empty(num_of_data,dtype=object)
    index = 0
    lindex = 0
    labeling_for_train = {}
    for l in tdata.keys():
        for micdata1024 in tdata[l]:
            data_of_sounds[index][0][0] = micdata1024
            label_of_sounds[index] = lindex
            label_matrix_of_sounds[index, lindex] = 1.
            labelname_of_sounds[index] = l
            index = index + 1
        labeling_for_train[l] = lindex
        lindex = lindex + 1

    indexx = np.arange(num_of_data)
    random.shuffle(indexx)

    data_of_sounds = data_of_sounds[indexx]
    label_matrix_of_sounds = label_matrix_of_sounds[indexx]
    label_of_sounds = label_of_sounds[indexx]
    labelname_of_sounds = labelname_of_sounds[indexx]

    # train_dataset is labeled sound data set
    train_dataset = [data_of_sounds, label_of_sounds, labelname_of_sounds]
    
    return train_dataset
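
The core of the loading script is two steps: scale the 16-bit samples into the range -1.0 to 1.0, then cut each channel into fixed-size chunks. The sketch below shows both steps in plain Python as a stand-in for the NumPy calls in loadsounds.py; the function names, the eight sample values, and the chunk size of 4 are made up for illustration.

```python
def normalize_int16(samples):
    """Scale signed 16-bit samples into the range -1.0 to 1.0."""
    return [sample / 32768 for sample in samples]


def split_into_chunks(samples, chunk_size):
    """Return fixed-size chunks, or an empty list when the length is not a
    multiple of chunk_size (such a channel is skipped, as in parse_file)."""
    if len(samples) % chunk_size != 0:
        return []
    return [samples[i:i + chunk_size]
            for i in range(0, len(samples), chunk_size)]


# One channel with eight made-up samples, split into chunks of four.
channel = normalize_int16([0, 16384, -32768, 8192, -16384, 32767, 0, 0])
chunks = split_into_chunks(channel, 4)

print(len(chunks), chunks[0])
```

In the notebook the chunk size comes from the data-chunk entry of sounddata-csv.yml, and each chunk becomes one training sample.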


### Check sound data for training 
You can inspect the content of the sound data with the following code.

In [None]:
# l-6
from loadsounds import load_data

# data_chunk should be the same value as the chunk size used for sound capturing
data_def_file = 'sounddata-csv.yml'
print('loading train data...')
train_dataset = load_data(data_folder_path, data_def_file)
print('loading test data...')
test_dataset = load_data(test_folder_path, data_def_file)
print(train_dataset)
import matplotlib.pyplot as plt
sample_size = 4
# create a single figure; a separate plt.figure() call would leave an empty figure behind
fig, grps = plt.subplots(ncols=sample_size, figsize=(5,4))
for d in range(0,sample_size):
    grps[d].plot(train_dataset[0][d][0][0])
    grps[d].set_title(train_dataset[2][d])
plt.show()


## Create Data store for training on remote computer
Run this cell only when the data has been updated.


In [None]:
from azureml.core import Workspace, Datastore, Dataset

# retrieve current datastore in the workspace
datastore = ws.get_default_datastore()

# Upload files to dataset on datastore
datastore.upload(src_dir='data',
                 target_path= dataset_name,
                 overwrite=True,
                 show_progress=True)
datastore.upload(src_dir='test',
                 target_path= testset_name,
                 overwrite=True,
                 show_progress=True)

# create FileDatasets pointing to the files in the uploaded folders and their subfolders recursively
datastore_paths = [(datastore, dataset_name)]
sound_ds = Dataset.File.from_files(path=datastore_paths)
teststore_paths = [(datastore, testset_name)]
sound_ts = Dataset.File.from_files(path=teststore_paths)

print(sound_ds)

# Register dataset to current workspace
sound_dataset = sound_ds.register(workspace=ws,
                                 name=dataset_name,
                                 description='sound classification training data')
sound_testset = sound_ts.register(workspace=ws,
                                 name=testset_name,
                                 description='sound classification test data')

### Construct CNN model

The following cell shows a sample CNN model for sound classification and trains it.
The cell runs in this notebook's compute environment.
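
When reading a stack of convolution and pooling layers, it can help to trace the output shape of each layer by hand. The following sketch does that for three Conv2D and MaxPooling2D pairs with padding='same', where the output size along an axis is the input size divided by the stride, rounded up. It assumes Keras's default channels_last layout, in which an input shape of (1, 1, 1024) is read as height 1, width 1 and 1024 channels; with a different data format the numbers differ. The helper functions are hypothetical and do not call TensorFlow; model.summary() reports the actual shapes.

```python
import math


def same_padding_output(size, stride):
    """Output size along one axis for padding='same': ceil(size / stride)."""
    return math.ceil(size / stride)


def conv2d_same(shape, filters, strides):
    """Output shape of a Conv2D layer with padding='same' (channels_last)."""
    height, width, _channels = shape
    return (same_padding_output(height, strides[0]),
            same_padding_output(width, strides[1]),
            filters)


def maxpool2d_same(shape, pool_size):
    """Output shape of MaxPooling2D with padding='same'.
    When strides is not given, Keras uses pool_size as the stride."""
    height, width, channels = shape
    return (same_padding_output(height, pool_size[0]),
            same_padding_output(width, pool_size[1]),
            channels)


shape = (1, 1, 1024)  # (height, width, channels), assuming channels_last
shapes = [shape]
for _ in range(3):
    shape = conv2d_same(shape, filters=16, strides=(1, 4))
    shapes.append(shape)
    shape = maxpool2d_same(shape, pool_size=(1, 2))
    shapes.append(shape)

flattened = shape[0] * shape[1] * shape[2]
print(shapes)
print("flattened size:", flattened)
```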


In [None]:
# l-7
# https://www.tensorflow.org/tutorials/images/intro_to_cnns?hl=ja
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

print('tensorflow version - '+tf.__version__)

model = models.Sequential()
model.add(layers.Conv2D(16,input_shape=(1,1,1024),kernel_size=(1,8),padding='same', strides=(1,4), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(1,2), padding='same'))
model.add(layers.Conv2D(filters=16,input_shape=(1,1,128),kernel_size=(1,8),padding='same', strides=(1,4), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(1,2), padding='same'))
model.add(layers.Conv2D(filters=16,input_shape=(1,1,16),kernel_size=(1,8),padding='same', strides=(1,4), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(1,2), padding='same'))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(2, activation='sigmoid'))

# Above code part is same as training logic on remote computer
model.summary()

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

hist = model.fit(train_dataset[0], train_dataset[1], epochs=5, validation_data=(test_dataset[0], test_dataset[1]))

test_loss, test_acc = model.evaluate(test_dataset[0],test_dataset[1])
print('Test accuracy - '+str(test_acc))

for hk in hist.history.keys():
    print(hk)
    
import matplotlib.pyplot as plt

epoch_list = list(range(1, len(hist.history['accuracy']) + 1))
plt.plot(epoch_list, hist.history['accuracy'], epoch_list, hist.history['val_accuracy'])
plt.legend(('Training Accuracy', "Validation Accuracy"))
plt.show()

predictions = model.predict(test_dataset[0])

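# Plot a random sample of up to 15 test sounds with their predicted and true labels
# map each class index to its label name
label_names = {int(i): n for i, n in zip(test_dataset[1], test_dataset[2])}
num_samples = min(15, len(test_dataset[0]))
figure = plt.figure(figsize=(20, 8))
# choose sample indices, not label values
for i, index in enumerate(np.random.choice(len(test_dataset[0]), size=num_samples, replace=False)):
    ax = figure.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    # Display each sound waveform
    ax.plot(test_dataset[0][index][0][0])
    predict_index = int(np.argmax(predictions[index]))
    # labels are stored as integer class indices, so no argmax is needed
    true_index = int(test_dataset[1][index])
    # print('{}-{}'.format(predict_index, true_index))
    # Title shows "predicted (true)": green when they match, red otherwise
    ax.set_title("{} ({})".format(label_names.get(predict_index, predict_index),
                                  label_names[true_index]),
                 color=("green" if predict_index == true_index else "red"))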

    
# When you need h5 format file, change .pkl -> .h5
model_path = 'sound-classification-csv-model'
#model_path_ext = '.h5'
model_path_ext = '.pkl'
output_dir = 'outputs'
os.makedirs(output_dir, exist_ok=True)
model_pathname = os.path.join(output_dir, model_path + model_path_ext)
model.save(model_pathname)
#model.save('outputs/sound-classification-model.h5')
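
The model outputs one score per class for each sound chunk, and the predicted class is the index of the highest score. The sketch below shows that decoding step in plain Python. The scores, the true labels, and the class order ("major" as index 0, "minor" as index 1) are assumptions for illustration; in the notebook the order follows the folder listing used by load_data.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_predictions(predictions, class_names):
    """Map each row of class scores to the name of its highest-scoring class."""
    return [class_names[argmax(row)] for row in predictions]


class_names = ["major", "minor"]          # assumed class order
predictions = [[0.91, 0.12], [0.30, 0.85], [0.55, 0.40]]  # made-up scores
true_labels = [0, 1, 1]                   # integer class indices

predicted_names = decode_predictions(predictions, class_names)
correct = sum(argmax(row) == label
              for row, label in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

print(predicted_names, accuracy)
```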

### Export learned model

The exported model will be used in the IoT Edge SoundClassifierService module.

In [None]:
import os
import datetime
import tarfile

def compress_files(top, archive, dest_folder):
    topbase = os.path.basename(top)
    if archive is None:
        # no archive name given: build one from the folder name and the current time
        now = datetime.datetime.now()
        tarfilename = '{0}-{1:%Y%m%d%H%M%S}.tar.gz'.format(topbase, now)
    else:
        tarfilename = archive + '.tar.gz'
    if dest_folder is not None:
        tarfilename = os.path.join(dest_folder, tarfilename)
        os.makedirs(dest_folder,exist_ok=True)

    tar = tarfile.open(tarfilename, "w:gz")
    for root, dirs, files in os.walk(top):
        for filename in files:
            parent = root[len(top):]
            if parent.startswith('\\'):
                parent = parent[1:]
            archnameraw = os.path.join(parent,filename)
            archname = os.path.join(topbase, archnameraw).replace('\\','/',-1)
            tar.add(os.path.join(root, filename),archname)
    tar.close()
    return tarfilename

export_folder_path = 'export'
os.makedirs(export_folder_path,exist_ok=True)
if model_path_ext == '.pkl':
    compressed = compress_files(model_pathname, model_path + model_path_ext, export_folder_path)
    print('Learned model is exported as ' + compressed)
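
For comparison, the same kind of archive can be produced more compactly by letting tarfile walk the folder itself. This is a sketch of an alternative, not the notebook's function: archive_folder is a hypothetical helper, and the folder it packs here contains placeholder files standing in for a saved model. Unlike the loop above, tar.add also records directory entries.

```python
import os
import tarfile
import tempfile


def archive_folder(top, dest_folder):
    """Hypothetical helper: pack the folder `top` into
    <dest_folder>/<folder name>.tar.gz and return the archive path."""
    os.makedirs(dest_folder, exist_ok=True)
    base = os.path.basename(os.path.normpath(top))
    tar_path = os.path.join(dest_folder, base + ".tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the folder name
        # and always uses forward slashes, on Windows as well.
        tar.add(top, arcname=base)
    return tar_path


# Build a placeholder model folder in a temporary location.
work = tempfile.mkdtemp()
model_dir = os.path.join(work, "sound-classification-csv-model.pkl")
os.makedirs(os.path.join(model_dir, "variables"))
with open(os.path.join(model_dir, "saved_model.pb"), "wb") as f:
    f.write(b"placeholder")

tar_path = archive_folder(model_dir, os.path.join(work, "export"))
with tarfile.open(tar_path, "r:gz") as tar:
    members = sorted(tar.getnames())

print(members)
```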

### Convert to a TensorFlow Lite model for micro edge devices
Refer to https://www.tensorflow.org/lite/guide/inference and [/SoundAIonMicroEdge/README.md](../../SoundAIonMicroEdge/README.md) to use the converted file.

In [None]:
import tensorflow as tf
import os

model_file = 'sound-classification-csv-model.pkl' # must match the name the model was saved under above
output_dir = 'outputs'
model_pathname = os.path.join(output_dir, model_file)
# pass the saved model path defined above to the converter
converter = tf.lite.TFLiteConverter.from_saved_model(model_pathname)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

tflite_model_file = 'sound-classification-csv-model.tflite'
export_folder_path = 'export'
tflite_model_pathname = os.path.join(export_folder_path, tflite_model_file)
with open(tflite_model_pathname, "wb") as tflite_file:
    tflite_file.write(quantized_model)

### Create Training script

In [None]:
import os
script_folder = os.path.join(os.getcwd(), "sklearn-script")
os.makedirs(script_folder, exist_ok=True)

## CNN model training script
The following cell contains the script that defines and trains the model on the remote compute cluster.

In [None]:
%%writefile $script_folder/train.py
import argparse
import os
import numpy as np
import glob


from azureml.core import Run

from loadsounds import load_data
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--test-folder', type=str, dest='test_folder', help='test folder mounting point')
#parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')
args = parser.parse_args()

data_folder = args.data_folder
test_folder = args.test_folder
print('Data folder:{0}, Test folder:{1}'.format( data_folder, test_folder))

data_def_file = 'sounddata-csv.yml'

train_dataset = load_data(data_folder,data_def_file)
test_dataset = load_data(test_folder, data_def_file)

# get hold of the current run
run = Run.get_context()

print('tensorflow version - '+tf.__version__)

model = models.Sequential()
model.add(layers.Conv2D(16,input_shape=(1,1,1024),kernel_size=(1,8),padding='same', strides=(1,4), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(1,2), padding='same'))
model.add(layers.Conv2D(filters=16,input_shape=(1,1,128),kernel_size=(1,8),padding='same', strides=(1,4), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(1,2), padding='same'))
model.add(layers.Conv2D(filters=16,input_shape=(1,1,16),kernel_size=(1,8),padding='same', strides=(1,4), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(1,2), padding='same'))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(2, activation='sigmoid'))

# Above code part is same as training logic on remote computer
model.summary()

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

hist = model.fit(train_dataset[0], train_dataset[1], epochs=5, validation_data=(test_dataset[0], test_dataset[1]))

test_loss, test_acc = model.evaluate(test_dataset[0],test_dataset[1])
print('Test accuracy - '+str(test_acc))

for hk in hist.history.keys():
    print(hk)
    
epoch_list = list(range(1, len(hist.history['accuracy']) + 1))

#import matplotlib.pyplot as plt

# plt.plot(epoch_list, hist.history['accuracy'], epoch_list, hist.history['val_accuracy'])
# plt.legend(('Training Accuracy', "Validation Accuracy"))
# plt.show()

# predictions = model.predict(test_dataset[0])
# Plot a random sample of 10 test images, their predicted labels and ground truth
# figure = plt.figure(figsize=(20, 8))
# for i, index in enumerate(np.random.choice(test_dataset[1], size=15, replace=False)):
#     ax = figure.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
#     # Display each image
#     ax.plot(test_dataset[0][index][0][0])
#     predict_index = np.argmax(predictions[index])
#     true_index = np.argmax(test_dataset[1][index])
    # print('{}-{}'.format(predict_index, true_index))
    # Set the title for each image
#     ax.set_title("{} ({})".format(test_dataset[2][predict_index], 
#                                   test_dataset[2][true_index]),
#                                   color=("green" if predict_index == true_index else "red"))
# Above code result can't be shown in Azure ML Studio

model_path = 'sound-classification-csv-model'
model_path_ext = '.pkl'
output_dir = 'outputs'
os.makedirs(output_dir, exist_ok=True)
model_pathname = os.path.join(output_dir, model_path + model_path_ext)
model.save(model_pathname)
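
The training script receives the mounted dataset locations as command-line arguments. The sketch below isolates that part so it can be tried without a cluster: parse_args is given an explicit argument list, and the two paths are hypothetical stand-ins for the mount points that Azure Machine Learning passes at run time.

```python
import argparse


def build_parser():
    """Build the same two options that train.py accepts."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='data folder mounting point')
    parser.add_argument('--test-folder', type=str, dest='test_folder',
                        help='test folder mounting point')
    return parser


# Hypothetical mount points, passed explicitly instead of via sys.argv.
args = build_parser().parse_args(
    ['--data-folder', '/mnt/sound_data', '--test-folder', '/mnt/sound_test'])

print('Data folder:{0}, Test folder:{1}'.format(args.data_folder, args.test_folder))
```

The dest values turn the hyphenated option names into attribute names, so the script reads args.data_folder and args.test_folder.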


In [None]:
import shutil
shutil.copy('loadsounds.py', script_folder)

### Training and Learning!
Train the CNN model with the sound dataset.

In [None]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# to install required packages
env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=['azureml-sdk','scikit-learn','azureml-dataprep[pandas,fuse]>=1.1.14','tensorflow==2.1.0','matplotlib'])

env.python.conda_dependencies = cd

In [None]:
from azureml.train.sklearn import SKLearn
from azureml.core import Dataset, Run

# Get a dataset by name
sound_dataset = Dataset.get_by_name(workspace=ws, name=dataset_name)
data_mount = sound_dataset.as_named_input('sound_data').as_mount()
test_dataset = Dataset.get_by_name(workspace=ws, name=testset_name)
test_mount = test_dataset.as_named_input('sound_test').as_mount()

script_params = {
    # to mount files referenced by the sound datasets
    '--data-folder': data_mount,
    '--test-folder': test_mount
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              environment_definition=env,
              entry_script='train.py')

### For debug
The following three cells check that the specified parameters and datasets are valid.
If you don't need to debug, skip ahead to ["Submit the job to the cluster"](#Submit-the-job-to-the-cluster).

In [None]:
%%writefile $script_folder/testargs.py
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--test-folder', type=str, dest='test_folder', help='test folder mounting point')
#parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')
args = parser.parse_args()

data_folder = args.data_folder
test_folder = args.test_folder
print('Data folder:{0}, Test folder:{1}'.format( data_folder, test_folder))

import os
chdir = os.getcwd()
print('Current Dir - '+chdir)
folders = {'data':data_folder,'test':test_folder}
for fld in folders.keys():
    cfld = folders[fld]
    print('Check content of {0} - {1}'.format(fld, cfld))
    for f in os.listdir(cfld):
        print(' '+f)
        cdir = os.path.join(cfld, f)
        if os.path.isdir(cdir):
            for cf in os.listdir(cdir):
                print('  '+cf)
        
    

In [None]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# to install required packages
env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=['azureml-sdk','matplotlib','azureml-dataprep[pandas,fuse]>=1.1.14'])

env.python.conda_dependencies = cd

In [None]:
from azureml.train.sklearn import SKLearn
from azureml.core import Dataset, Run

# Get a dataset by name
sound_dataset = Dataset.get_by_name(workspace=ws, name=dataset_name)
data_mount = sound_dataset.as_named_input('sound_data').as_mount()
test_dataset = Dataset.get_by_name(workspace=ws, name=testset_name)
test_mount = test_dataset.as_named_input('sound_test').as_mount()
print(data_mount)
print(test_mount)
script_params = {
    # to mount files referenced by the sound datasets
    '--data-folder': data_mount,
    '--test-folder': test_mount
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              environment_definition=env,
              entry_script='testargs.py')

### Local Debug

In [None]:
import numpy as np
from loadsounds import parse_file, load_data_definition, reshape_dataset
import os
import random

#data_def_file = 'sounddata.yml'
#datafile = 'cherry-sound-20200218113643319806.csv'
data_chunk = load_data_definition(data_def_file)['data-chunk']
#csv_dataset = parse_file(datafile,np.array([]),data_chunk)
#sound_dataset = np.zeros((csv_dataset[0], 1, 1, data_chunk))
#index = 0
#for d in csv_dataset[1]:
#    sound_dataset[index][0][0] = d
#    index = index + 1
test_data_files = []
for d in os.listdir(test_folder_path):
    dname = os.path.join(test_folder_path, d)
    if os.path.isdir(dname):
        for f in os.listdir(dname):
            if f.rfind('.csv') >= 0:
                test_data_files.append(os.path.join(dname,f))
random.shuffle(test_data_files)
csv_dataset = parse_file(test_data_files[0], np.array([]), data_chunk, by_channel=True)

import tensorflow as tf
from tensorflow.keras import datasets, layers, models

print('tensorflow version - '+tf.__version__)

#model_file_path ='outputs/sound-classification-model.h5'
model_file_path ='outputs/sound-classification-model.pkl'
# NOTE: a different file-name style should be used for the model
model = models.load_model(model_file_path)

for channel, csvds in enumerate(csv_dataset[1]):
    # data_chunk already holds the 'data-chunk' value loaded above
    sound_dataset = reshape_dataset(csvds, data_chunk)
    predicted = model.predict(sound_dataset)
    print('channel:{}'.format(channel))
    result = predicted.tolist()
    for r in result:
        print('{0}<->{1}'.format(r[0],r[1]))
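
Each printed pair is one row of the model output with two values. As a sketch of how such rows could be turned into labels, the snippet below picks the index of the larger value in each row and then takes a majority vote. The label order (`major` at index 0, `minor` at index 1) is an assumption made for illustration, so use the order defined by your training script. The sample rows are made up.

```python
from collections import Counter

# ASSUMPTION: index 0 = 'major', index 1 = 'minor'; check your training script
LABELS = ['major', 'minor']

def rows_to_labels(rows, labels=LABELS):
    """Map each row of scores to the label with the highest score."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in rows]

def majority_label(rows, labels=LABELS):
    """Return the most frequent label across all rows."""
    return Counter(rows_to_labels(rows, labels)).most_common(1)[0][0]

# Made-up scores standing in for predicted.tolist()
sample_rows = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
print(rows_to_labels(sample_rows))
print(majority_label(sample_rows))
```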


### Submit the job to the cluster
Run the experiment by submitting the estimator object. You can then navigate to the Azure portal to monitor the run.

In [None]:
run = exp.submit(config=est)
run

Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.
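
Because `submit` returns immediately, the state can also be polled in code. The helper below is a sketch that relies only on the run object's `get_status()` method. The set of terminal states and the `FakeRun` stand-in are assumptions for illustration, so the snippet runs without a workspace.

```python
import time

# ASSUMPTION: states treated as terminal for this sketch
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}

def wait_until_done(run, poll_seconds=10.0, timeout_seconds=3600.0):
    """Poll run.get_status() until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout_seconds
    status = run.get_status()
    while status not in TERMINAL_STATES and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        status = run.get_status()
    return status

class FakeRun:
    """Hypothetical stand-in for a submitted run, for local testing only."""
    def __init__(self, states):
        self._states = list(states)

    def get_status(self):
        # Return the next state; keep returning the last one afterwards
        return self._states.pop(0) if len(self._states) > 1 else self._states[0]

final = wait_until_done(FakeRun(['Preparing', 'Running', 'Completed']), poll_seconds=0.01)
print(final)
```

With a real run, `wait_until_done(run)` would be called in place of the fake object.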

## Monitor a remote run

In total, the first run takes **approximately 10 minutes**. For subsequent runs, as long as the dependencies (the `pip_packages` of the environment passed to the estimator above) don't change, the same image is reused, so the container startup time is much shorter.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. 

  This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.


You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as the `wait_for_completion` method.

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=True) # specify True for a verbose log

### Evaluate the model output

In [None]:
print(run.get_metrics())
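
`get_metrics()` returns a dictionary keyed by metric name. When a metric is logged repeatedly, its value is typically a list. The sketch below condenses such a dictionary to the last logged value per metric. The metric names and numbers are invented placeholders, since the actual names depend on what the training script logs.

```python
def last_values(metrics):
    """Reduce list-valued metrics to their final entry; keep scalars as they are."""
    summary = {}
    for name, value in metrics.items():
        if isinstance(value, list) and value:
            summary[name] = value[-1]
        else:
            summary[name] = value
    return summary

# Invented example standing in for run.get_metrics()
sample_metrics = {'accuracy': [0.61, 0.74, 0.82], 'loss': [0.9, 0.6, 0.4], 'epochs': 3}
for name, value in sorted(last_values(sample_metrics).items()):
    print('{0}: {1}'.format(name, value))
```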

In [None]:
print(run.get_file_names())
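
`get_file_names()` returns the paths of the files stored with the run. The sketch below filters that list down to the files under `outputs/`, which is where the model is expected to be. Apart from the model path used earlier in this tutorial, the file names in the sample list are hypothetical.

```python
def files_under(file_names, prefix='outputs/'):
    """Return the file names that start with the given folder prefix."""
    return sorted(name for name in file_names if name.startswith(prefix))

# Hypothetical list standing in for run.get_file_names()
sample_files = [
    'azureml-logs/70_driver_log.txt',
    'outputs/sound-classification-model.pkl',
    'logs/train.log',
]
print(files_under(sample_files))
```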

## Happy with the model? Register it in Azure Machine Learning to manage it

In [None]:
# register model 
model = run.register_model(model_name='sound_clasification_model', model_path='outputs/')
print(model.name, model.id, model.version, sep = '\t')

## Next steps
In this Azure Machine Learning tutorial, you used Python to:

> * Set up your development environment
> * Access and examine the data
> * Train a CNN model on a remote cluster using the TensorFlow Keras machine learning library
> * Review training details and register the best model

You are ready to deploy this registered model using the instructions in the next part of the tutorial series:

> [Tutorial 2 - Deploy models](ai-sound-major-miner-classification-part2-deploy.ipynb)