# HyperDrive Run Recovery
In this notebook, we recover the HyperDrive run that was created in the previous notebook and use it to find the best child run discovered by the HyperDrive search.

The steps in this notebook are
- [import libraries](#import),
- [read in the Azure ML workspace](#workspace),
- [recover a run](#recover), and
- [get the results](#results).

## Imports  <a id='import'></a>

In [None]:
import os
import shutil
import json
import time
import pandas as pd
from azureml.core import Workspace, Experiment, Run, get_run
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import azureml.core
from msrest.exceptions import HttpOperationError
from get_auth import get_auth
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

## Read in the Azure ML workspace  <a id='workspace'></a>
Read in the workspace created in a previous notebook.

In [None]:
auth = get_auth()
ws = Workspace.from_config(auth=auth)
ws_details = ws.get_details()
print('Name:\t\t{}\nLocation:\t{}'
      .format(ws_details['name'],
              ws_details['location']))

## Recover the run  <a id='recover'></a>
Get the experiment that ran the search.

In [None]:
exp = Experiment(workspace=ws, name='hypetuning')

Get the ID of the HyperDrive run created in the previous notebook. That ID was printed when the run was submitted, and it was also saved to a file. You can also find the ID in the Azure portal on your experiment's page; to see it, you may need to add a `RunId` column to the experiment's table of runs.

In [None]:
run_id_path = "run_id.txt"
with open(run_id_path, "r") as fp:
    # Strip surrounding whitespace so a trailing newline does not end up in the ID.
    run_id = fp.read().strip()
run_id
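
A run ID read from a text file can carry a trailing newline or be empty if the file was written incorrectly. The following is a minimal sketch of a defensive reader. The helper name `read_run_id` and the ID `example_run_id_123` are made up for illustration and are not part of the Azure ML SDK.

```python
import os
import tempfile


def read_run_id(path):
    """Read a run ID from a text file, dropping surrounding whitespace."""
    with open(path, "r") as fp:
        run_id = fp.read().strip()
    if not run_id:
        raise ValueError("No run ID found in {}".format(path))
    return run_id


# Demonstrate with a throwaway file; "example_run_id_123" is a made-up ID.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "run_id.txt")
    with open(demo_path, "w") as fp:
        fp.write("example_run_id_123\n")
    demo_run_id = read_run_id(demo_path)

print(repr(demo_run_id))
```

Failing early on an empty file gives a clearer error than passing an empty ID to the service.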

Use the run ID to get a handle to the run.

In [None]:
run = AutoMLRun(exp, run_id)
run

## Get the results <a id='results'></a>
Get the metrics logged with each run.

In [None]:
run_metrics = run.get_metrics(recursive=True)

In [None]:
run_metrics[list(run_metrics.keys())[0]]

Get a series with each run's `AUC_weighted` metric, which serves as the accuracy measure here.

In [None]:
run_accuracy = pd.Series([x['AUC_weighted'] for x in run_metrics.values()], index=run_metrics.keys(), name='accuracy')

Find the RunId of the best run.

In [None]:
best_run_id = run_accuracy.idxmax()
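
The selection step can be illustrated without a workspace. The sketch below uses made-up metrics in the same `{run_id: {metric_name: value}}` shape that the cells above assume for `run_metrics`. It also skips runs that did not log the metric, which is an added precaution and not something the notebook does. All names prefixed with `toy_` are hypothetical.

```python
import pandas as pd

# Made-up metrics in the shape of a {run_id: {metric_name: value}} mapping.
toy_metrics = {
    "run_a": {"AUC_weighted": 0.81},
    "run_b": {"AUC_weighted": 0.93},
    "run_c": {},  # a run that logged no AUC_weighted
    "run_d": {"AUC_weighted": 0.88},
}

metric_name = "AUC_weighted"

# Keep only the runs that logged the metric, indexed by run ID.
toy_scores = pd.Series(
    {run_id: metrics[metric_name]
     for run_id, metrics in toy_metrics.items()
     if metric_name in metrics},
    name=metric_name,
)

# idxmax returns the index label (here, the run ID) of the largest value.
toy_best_run_id = toy_scores.idxmax()
print(toy_best_run_id, toy_scores[toy_best_run_id])
```

Because the series is indexed by run ID, `idxmax` returns the ID directly, and that ID can then be passed to a run lookup.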

Use it to recover the best run.

In [None]:
best_run = get_run(exp, best_run_id)
best_run

In [None]:
best_run.get_file_names()

In [None]:
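# Make sure the local output directory exists before downloading into it.
os.makedirs("outputs", exist_ok=True)
best_run_model_path = os.path.join("outputs", best_run_id + ".pkl")
best_run.download_file("outputs/model.pkl", best_run_model_path)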

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the joblib package.
import joblib
best_run_model = joblib.load(best_run_model_path)

In [None]:
best_run_model

Read in the test data.

In [None]:
data_path = "data"
test_path = os.path.join(data_path, "balanced_pairs_test.tsv")
test = pd.read_csv(test_path, sep='\t', encoding='latin1')

In [None]:
feature_columns = ["Text_x", "Text_y"]
label_column = "Label"
group_column = 'Id_x'
answerid_column = 'AnswerId_y'
# Join the two text columns into one input string per question pair.
test_X = (test.Text_x + ' ' + test.Text_y).values
test_y = test[label_column]

In [None]:
test['probabilities'] = best_run_model.predict_proba(test_X)[:, 1]

In [None]:
# Order the testing data by dupe Id and question AnswerId.
test.sort_values([group_column, answerid_column], inplace=True)

# Extract the ordered probabilities.
probabilities = (
    test.probabilities
    .groupby(test[group_column], sort=False)
    .apply(lambda x: tuple(x.values)))

# Get the individual records.
output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
test_score = (test[output_columns_x]
              .drop_duplicates()
              .set_index(group_column))
test_score['probabilities'] = probabilities
test_score.reset_index(inplace=True)
test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
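
The grouping step above collects each group's probabilities into one tuple, ordered by the preceding sort. A small sketch on made-up data shows the effect. The frame `toy` and its values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows: two groups (Id_x 1 and 2), two candidates each, in shuffled order.
toy = pd.DataFrame({
    "Id_x": [2, 1, 2, 1],
    "AnswerId_y": [20, 10, 10, 20],
    "probabilities": [0.4, 0.9, 0.6, 0.1],
})

# Sort so that every group lists its candidates in AnswerId_y order.
toy = toy.sort_values(["Id_x", "AnswerId_y"])

# Collect each group's probabilities into one tuple, keeping the sorted order.
toy_probabilities = (
    toy.probabilities
    .groupby(toy["Id_x"], sort=False)
    .apply(lambda x: tuple(x.values)))

print(toy_probabilities.to_dict())
```

Sorting first matters: the position of each probability in the tuple then corresponds to the sorted candidate order within its group.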

In [None]:
import numpy as np

def score_rank(scores):
    """Compute the ranks of the scores."""
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    """Compute the index of label in label_order."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    """Compute the rank of label using the scores."""
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]
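
A worked example can make the ranking logic concrete. The sketch below is a condensed, self-contained restatement of the helpers above, applied to made-up candidate IDs and scores. Here the label is a single scalar ID, whereas the notebook passes an array of repeated IDs; the comparison broadcasts the same way in both cases.

```python
import numpy as np
import pandas as pd


def label_rank(label, scores, label_order):
    """Rank (1 = best) of the candidate whose ID equals label."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        # The label is not among the candidates: rank it after all of them.
        return len(scores) + 1
    # Higher scores get better (smaller) ranks.
    return pd.Series(scores).rank(ascending=False)[loc[0]]


# Made-up candidates: IDs and the model's score for each.
candidate_ids = np.array([101, 102, 103])
candidate_scores = np.array([0.2, 0.7, 0.5])

rank_found = label_rank(103, candidate_scores, candidate_ids)
rank_missing = label_rank(999, candidate_scores, candidate_ids)
print(rank_found, rank_missing)
```

Candidate 103 has the second-highest score, so its rank is 2. ID 999 is absent, so it receives rank 4, one past the number of candidates.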

In [None]:
print("Evaluating the model's performance.")

test_rank = test.groupby(group_column).apply(
    lambda x: label_rank(x.AnswerId_x.values,
                         x.probabilities.values,
                         x.AnswerId_y.values))

args_rank = 3
for i in range(1, args_rank+1):
    print('Accuracy @{} = {:.2%}'
          .format(i, (test_rank <= i).mean()))
mean_rank = test_rank.mean()
print('Mean Rank {:.4f}'.format(mean_rank))
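
The accuracy-at-k figures printed above are the fraction of groups whose correct answer is ranked at position k or better. A minimal sketch on made-up ranks follows. The helper `accuracy_at_k` and the values in `toy_ranks` are hypothetical.

```python
import pandas as pd

# Made-up ranks of the correct answer for five groups.
toy_ranks = pd.Series([1, 3, 2, 1, 5], name="rank")


def accuracy_at_k(ranks, k):
    """Fraction of groups whose correct answer is ranked k or better."""
    return (ranks <= k).mean()


toy_accuracy = {k: accuracy_at_k(toy_ranks, k) for k in (1, 2, 3)}
toy_mean_rank = toy_ranks.mean()

for k, value in toy_accuracy.items():
    print('Accuracy @{} = {:.2%}'.format(k, value))
print('Mean Rank {:.4f}'.format(toy_mean_rank))
```

Accuracy at k can only stay level or grow as k increases, while the mean rank summarizes the whole distribution in one number.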

In [None]:
help(run)

In [None]:
automl_run = AutoMLRun(exp, run_id)

In [None]:
help(automl_run)

In [None]:
automl_output = automl_run.get_output()

In [None]:
automl_output[0]

In [None]:
automl_output[1]

In [None]:
dir(automl_run)

In [None]:
automl_run.get_run_sdk_dependencies()

In [None]:
automl_run.get_properties()