Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 02.1 Retrain Failed models

This notebook demonstrates how to re-train failed models. It walks through how to download the log file, identify failed models' file names, clean the data, upload cleaned data back to the blob, then register the clean file dataset to the Workspace for re-training.

## Prerequisites

Run this notebook only if model training failed and the failures were recorded in the log file.

## 1.0 Set up Workspace and datastore

In [None]:
from azureml.core import Workspace, Datastore

# Set up workspace
ws = Workspace.from_config(path='../aml_config/ws_config.json')

# Take a look at Workspace
ws.get_details()

# Set up datastores
dstore = ws.get_default_datastore()
train_output_dstore = Datastore(ws, 'training_output_datastore')

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, 
      'Training datastore name: '+ train_output_dstore.name,
      'Default datastore name: '+ dstore.name,
      sep = '\n')

## 2.0 Download the log file from the blob

Download the log file from the blob. This example uses today's date; change it to the date of the training run you want to inspect.

In [None]:
import datetime

# Get today's date and set the log_filepath
today_date = datetime.date.today()
today_log_filepath = 'training_log_' + str(today_date) + '/training_log.csv'
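
To pull the log for a different day, the same path can be built from any date. This is a minimal sketch, assuming the `training_log_<date>/training_log.csv` layout used above; the helper name `log_filepath_for` is hypothetical.

```python
import datetime

def log_filepath_for(date):
    """Build the datastore path of the training log for a given date."""
    # isoformat() on a datetime.date gives the same text as str(date), e.g. '2020-05-01'
    return 'training_log_' + date.isoformat() + '/training_log.csv'

# Example: the path of the log written one day before today
yesterday = datetime.date.today() - datetime.timedelta(days=1)
print(log_filepath_for(yesterday))
```

The returned string can be passed as `prefix` to the download call in place of `today_log_filepath`.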

We download the log file from training output datastore to a local path called 'training_logs'.

In [None]:
local_path = './training_logs'

# Download log file
train_output_dstore.download(target_path=local_path, prefix=today_log_filepath, overwrite=True)

## 3.0 Read the log file into a dataframe

Then read the log file into a pandas dataframe to identify failed models.

In [None]:
import os
import pandas as pd

# Get filepath
path = os.path.join(local_path, today_log_filepath)

# Read log file
df_log = pd.read_csv(path)

## 4.0 Identify failed models

Here we generate a list that contains file names of all failed models.

In [None]:
# Get filenames of failed models
# na=False treats rows with a missing Status as not failed
failed_list = list(df_log['FileName'].loc[df_log.Status.str.contains('Fail', na=False)])
failed_list = [f.strip() + '.csv' for f in failed_list]

print(failed_list)
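
The filter can be tried on a small in-memory log before running it against the real file. This sketch assumes only the `FileName` and `Status` columns used above; the rows and status strings are made up for illustration, and the real log may contain other columns and values.

```python
import io
import pandas as pd

# Hypothetical log contents, including stray whitespace and a missing status
sample_log = io.StringIO(
    "FileName,Status\n"
    "Store1000_dominicks,Failed\n"
    "Store1001_tropicana,Success\n"
    " Store1031_minute.maid ,Failed\n"
    "Store1002_dominicks,\n"
)
df_sample = pd.read_csv(sample_log)

# na=False makes a missing Status count as "not failed"
is_failed = df_sample['Status'].str.contains('Fail', na=False)

# Strip whitespace from the names and append the file extension
sample_failed = [name.strip() + '.csv' for name in df_sample.loc[is_failed, 'FileName']]
print(sample_failed)
```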

## 5.0 Download and read dirty data from the blob

Download all the files that contain dirty data to a local path called 'dirty_data'. 

We use oj_sales_data_small as an example. Change it to oj_sales_data if you trained all 11,973 models.

In [None]:
dstore_dir = 'oj_sales_data_small/'

# Download dirty data
for file in failed_list:
    dstore.download(target_path = 'dirty_data', prefix = dstore_dir + str(file))

Read dirty data into dataframes. In this example we have 3 failed models.

In [None]:
# Read dirty data
df_Store1000_dominicks = pd.read_csv('dirty_data/oj_sales_data_small/Store1000_dominicks.csv')
df_Store1032_dominicks = pd.read_csv('dirty_data/oj_sales_data_small/Store1032_dominicks.csv')
df_Store1031_minute_maid = pd.read_csv('dirty_data/oj_sales_data_small/Store1031_minute.maid.csv')

## 6.0 Clean dirty data

Take a look at the data and identify where the data quality issues occur and clean them up.

Here we use Store1031_minute_maid.csv as an example.

In [None]:
# Clean data
# .loc selects by row label and column name
df_Store1031_minute_maid.loc[7, 'WeekStarting'] = '8/2/90'
df_Store1031_minute_maid.loc[9, 'Quantity'] = 12020
df_Store1031_minute_maid.loc[27, 'Quantity'] = 11002
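
One way to locate problem rows before fixing them by hand is to coerce each column to its expected type and see which values fail. This is a sketch, assuming `WeekStarting` should parse as a date and `Quantity` as a number; the sample values and the helper `find_bad_rows` are illustrative and not taken from the dataset.

```python
import pandas as pd

def find_bad_rows(df):
    """Return index labels of rows whose date or quantity cannot be parsed."""
    # errors='coerce' turns unparseable values into NaT / NaN instead of raising
    dates = pd.to_datetime(df['WeekStarting'], errors='coerce')
    quantities = pd.to_numeric(df['Quantity'], errors='coerce')
    bad = dates.isna() | quantities.isna()
    return df.index[bad].tolist()

# Hypothetical data with one bad date and one bad quantity
sample = pd.DataFrame({
    'WeekStarting': ['1990-06-14', 'not a date', '1990-06-28'],
    'Quantity': ['12020', '11002', 'n/a'],
})
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

The returned labels point at the rows to inspect; what the corrected values should be still has to be decided case by case.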

## 7.0 Save and upload clean data to the blob

Now we save the cleaned data as CSV files and upload them to the default datastore, under the directory 'clean_data'.

In [None]:
# Create a local directory (no error if it already exists)
os.makedirs('clean_data', exist_ok=True)

# Save each dataframe to csv; index=False avoids writing an extra index column
df_Store1000_dominicks.to_csv('clean_data/Store1000_dominicks.csv', index=False)
df_Store1032_dominicks.to_csv('clean_data/Store1032_dominicks.csv', index=False)
df_Store1031_minute_maid.to_csv('clean_data/Store1031_minute_maid.csv', index=False)
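
Before uploading, it can help to confirm that every failed file has a cleaned copy. This is a minimal sketch using only the standard library; `missing_clean_files` is a hypothetical helper, and it assumes cleaned files keep the same names as the originals.

```python
import os
import tempfile

def missing_clean_files(expected_names, clean_dir):
    """Return the expected file names that are not present in clean_dir."""
    present = set(os.listdir(clean_dir))
    return [name for name in expected_names if name not in present]

# Demonstration in a throwaway directory where only one of two files was cleaned
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, 'Store1000_dominicks.csv'), 'w').close()
    expected = ['Store1000_dominicks.csv', 'Store1032_dominicks.csv']
    still_missing = missing_clean_files(expected, tmp_dir)

print(still_missing)
```

In this notebook the call would be `missing_clean_files(failed_list, 'clean_data')`. An empty list means every failed file has a cleaned counterpart under the same name.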

In [None]:
clean_data_dir = 'clean_data'

# Upload clean data to the datastore
dstore.upload(src_dir=clean_data_dir,
              target_path=clean_data_dir,
              overwrite=True)

## 8.0 Register the clean filedataset to the Workspace

Finally, we register the clean_data folder as a file dataset in the Workspace. You can then return to the 02 Training Pipeline notebook and use 'oj_data_clean' from the Workspace as the input file dataset for ParallelRunStep to retrain the failed models.

In [None]:
from azureml.core.dataset import Dataset

clean_ds_name = 'oj_data_clean'
path_on_datastore = dstore.path(clean_data_dir + '/')

# Get files as input filedatasets from the path
input_ds = Dataset.File.from_files(path = path_on_datastore, validate=False)

# Register the filedatasets to the workspace
registered_ds = input_ds.register(ws, clean_ds_name, create_new_version=True)
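
When you return to the training notebook, the registered dataset can be looked up by name. This is a sketch, assuming the `azureml-core` SDK used above is installed; `load_clean_dataset` is a hypothetical helper, and the import sits inside the function only so that the definition can be read and loaded without the SDK present.

```python
def load_clean_dataset(workspace, name='oj_data_clean'):
    """Fetch the latest registered version of the cleaned file dataset."""
    # Imported lazily; requires the azureml-core package at call time
    from azureml.core import Dataset
    return Dataset.get_by_name(workspace, name=name)

# Usage in the training notebook (needs a real Workspace object):
# clean_ds = load_clean_dataset(ws)
```

Because the dataset is registered with `create_new_version=True`, looking it up by name without a version returns the most recently registered version.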