Copyright (c) Microsoft Corporation. All rights reserved.

![alt text](https://www.microsoftevents.com/accounts/register123/microsoft/msft-v1/c-and-e-v2/events/ce2-ce-2c-mec0028133/Azure%20Academy%20banner_Data.png "Logo Title Text 1")

# HOL02: Azure Machine Learning service

This lab guides you through the Azure Machine Learning service: creating and setting up a workspace, building an experiment, and training a model, including with the automated machine learning technique.

In this use case you build a regression model to predict NYC taxi fare prices. 
Automated machine learning accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

In this lab you learn the following tasks:

* Create Azure Machine Learning Workspace
* Download, transform, and clean data using Azure Open Datasets
* Train an automated machine learning regression model
* Calculate model accuracy

If you don’t have an Azure subscription, create a free account before you begin. Try the [free or paid version](https://aka.ms/AMLFree) of Azure Machine Learning service today.

## Prerequisites

* Existing AML Workspace - the steps can be found in the GitHub repo for this lab: [AzureAcademy-DataAnalyst-II-ML-AI-HOL01-AML.md](https://github.com/michalmar/azure-labs/blob/master/AzureAcademy-DataAnalyst-II-ML-AI-HOL01-AML.md)

* Check and update the VM environment:

`pip install --upgrade azureml-sdk[explain,automl,notebooks] azureml-opendatasets azureml-widgets "urllib3==1.24"`


## PART 1: Download and prepare data

We will use [Azure Open Datasets](https://docs.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets) - curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are in the cloud on Microsoft Azure and are readily available to Azure Databricks, Machine Learning service, and Machine Learning Studio. You can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory.


We will use one particular dataset: [NYC Taxi & Limousine Commission - green taxi trip records](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/)



Import the necessary packages. The Open Datasets package contains a class representing each data source (`NycTlcGreen` for example) to easily filter date parameters before downloading.

In [None]:
from azureml.opendatasets import NycTlcGreen
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

%config IPCompleter.greedy=True

Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid `MemoryError` with large datasets. To download taxi data, iteratively fetch one month at a time, and before appending it to `green_taxi_df` randomly sample 2,000 records from each month to avoid bloating the dataframe. Then preview the data.

In [None]:
# green_taxi_df = pd.read_csv("./data/taxi_raw_df.csv")
# green_taxi_df['lpepPickupDatetime'] = green_taxi_df['lpepPickupDatetime'].astype('datetime64[ns]')
# # green_taxi_df.head(10)

In [None]:
green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2015","%m/%d/%Y")
end = datetime.strptime("1/31/2015","%m/%d/%Y")

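monthly_samples = []
for sample_month in range(12):
    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    # keep 2,000 random rows per month; DataFrame.append was removed in pandas 2.0, so collect and concatenate
    monthly_samples.append(temp_df_green.sample(2000))
green_taxi_df = pd.concat(monthly_samples)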

green_taxi_df.head(10)
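
The loop above shifts a January window forward one month at a time; `relativedelta` clamps "January 31 plus n months" to the last valid day of the target month. As a minimal sketch, the same twelve windows can be computed with the standard library only and inspected without downloading anything. `month_windows` is a hypothetical helper introduced here for illustration, not part of the Open Datasets package.

```python
import calendar
from datetime import datetime


def month_windows(year, months=12):
    """Return (first_day, last_day) datetimes for the first `months` months of `year`."""
    windows = []
    for month in range(1, months + 1):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        windows.append((datetime(year, month, 1), datetime(year, month, last_day)))
    return windows


windows = month_windows(2015)
for first, last in windows[:3]:
    print(first.date(), "->", last.date())
```

Each tuple corresponds to the `start` and `end` arguments passed to `NycTlcGreen` for that month.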

In [None]:
green_taxi_df.to_parquet("./data/taxi_raw_df.parquet")

In [None]:
green_taxi_df = pd.read_parquet("./data/taxi_raw_df.parquet")
green_taxi_df.head()

Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This will create new fields for the month number, day of month, day of week, and hour of day, and will allow the model to factor in time-based seasonality. 

Use the `apply()` function on the dataframe to iteratively apply the `build_time_features()` function to each row in the taxi data.

In [None]:
def build_time_features(vector):
    pickup_datetime = vector.iloc[0]  # positional access; vector holds only the pickup datetime
    month_num = pickup_datetime.month
    day_of_month = pickup_datetime.day
    day_of_week = pickup_datetime.weekday()
    hour_of_day = pickup_datetime.hour
    
    return pd.Series((month_num, day_of_month, day_of_week, hour_of_day))

green_taxi_df[["month_num", "day_of_month","day_of_week", "hour_of_day"]] = green_taxi_df[["lpepPickupDatetime"]].apply(build_time_features, axis=1)
green_taxi_df.head(10)
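
To make the meaning of the four derived fields concrete, here is a small standalone sketch that applies the same attribute lookups to a single timestamp. `time_features` is a hypothetical name used only for this illustration. Note that `weekday()` counts from Monday as 0, so Thursday is 3.

```python
from datetime import datetime


def time_features(pickup_datetime):
    """Return (month_num, day_of_month, day_of_week, hour_of_day) for one timestamp."""
    return (
        pickup_datetime.month,
        pickup_datetime.day,
        pickup_datetime.weekday(),  # Monday == 0 ... Sunday == 6
        pickup_datetime.hour,
    )


example = datetime(2015, 1, 1, 13, 30)  # 1 January 2015 was a Thursday
print(time_features(example))  # (1, 1, 3, 13)
```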

Remove some of the columns that you won't need for training or additional feature building.

In [None]:
columns_to_remove = ["lpepPickupDatetime", "lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID", 
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)
    
green_taxi_df.head(5)

### Cleanse data 

Run the `describe()` function on the new dataframe to see summary statistics for each field.

In [None]:
green_taxi_df.describe()

In [None]:
# Will allow us to embed images in the notebook
%matplotlib inline
import matplotlib.pyplot as plt


In [None]:
plt.rcParams['figure.figsize'] = [16, 10]
boxplot = green_taxi_df.boxplot(column=list(green_taxi_df.columns[1:-4]))

From the summary statistics, you see that there are several fields that have outliers or values that will reduce model accuracy. First filter the lat/long fields to be within the bounds of the Manhattan area. This will filter out longer taxi trips or trips that are outliers in respect to their relationship with other features. 

Additionally filter the `tripDistance` field to be at least 0.25 miles but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates very short trips and long outlier trips that have inconsistent trip cost.

Lastly, the `totalAmount` field has negative values for the taxi fares, which don't make sense in the context of our model, and the `passengerCount` field has bad data with the minimum values being zero.

Filter out these anomalies using query functions, and then remove the last few columns unnecessary for training.

In [None]:
final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)
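
The four `query` calls above combine into a single keep-or-drop rule per trip. The sketch below expresses that rule as a plain Python predicate so the bounds are easy to read and test in isolation. `keep_trip` and the two sample rows are hypothetical and invented for illustration; they are not taken from the dataset.

```python
def keep_trip(trip):
    """Return True if a trip (a dict of field values) passes all cleansing filters."""
    return (
        40.53 <= trip["pickupLatitude"] <= 40.88
        and -74.09 <= trip["pickupLongitude"] <= -73.72
        and 0.25 <= trip["tripDistance"] < 31
        and trip["passengerCount"] > 0
        and trip["totalAmount"] > 0
    )


# Invented sample rows, for illustration only
good_trip = {"pickupLatitude": 40.75, "pickupLongitude": -73.95,
             "tripDistance": 2.4, "passengerCount": 1, "totalAmount": 12.5}
bad_trip = dict(good_trip, totalAmount=-3.0)  # a negative fare is filtered out

print(keep_trip(good_trip), keep_trip(bad_trip))
```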

Call `describe()` again on the data to ensure cleansing worked as expected. You now have a prepared and cleansed set of taxi data to use for machine learning model training.

In [None]:
final_df.describe()

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(final_df, test_size=0.2, random_state=223)

print(f'train:{len(list(train["vendorID"]))} \ntest: {len(list(test["vendorID"]))}')

final_df.to_csv("./data/taxi_final_df.csv", index=False)
train.to_csv("./data/taxi_final_df_train.csv", index=False)
test.to_csv("./data/taxi_final_df_test.csv", index=False)

## PART 2: Train a regression model within the notebook with AML service

In PART 2 you train a simple regression model within the notebook environment while logging metrics and outputs through an AML service Experiment. You also run a simple single-parameter sweep of the regression model.

* first you configure a connection to the workspace
* then you run the simple training
* next you run a simple parameter sweep of the regression model
* lastly you review the results

In [None]:
final_df = pd.read_csv("./data/taxi_final_df.csv")
# final_df.head(10)

### Configure workspace


Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [None]:
import azureml
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
# print(f"Name: {ws.name}, Resource group: {ws.resource_group}, Location: {ws.location}, Subscription: {ws.subscription_id}")
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
# output['Run History Name'] = experiment_name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

### Train locally within notebook

Split the data into training and test sets by using the `train_test_split` function in the `scikit-learn` library. Here the features (**x**) and the values to predict (**y**) are each split into a training portion for fitting the model and a test portion for evaluating it. The `test_size` parameter determines the percentage of data to allocate to testing. The `random_state` parameter sets a seed to the random generator, so that your train-test splits are deterministic.

In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import lightgbm as lgb


y_df = final_df.pop("totalAmount")
x_df = final_df

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)
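
To show what `test_size` and `random_state` control, here is a minimal standard-library sketch of a seeded shuffle-and-cut split. It is not scikit-learn's implementation and will not reproduce the same row assignment as `train_test_split`; `seeded_split` is a hypothetical helper used only to illustrate the idea.

```python
import random


def seeded_split(rows, test_size=0.2, random_state=223):
    """Shuffle row positions with a fixed seed, then cut off the test portion."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)  # same seed -> same order every time
    n_test = int(round(len(rows) * test_size))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


train_rows, test_rows = seeded_split(list(range(100)))
print(len(train_rows), len(test_rows))  # 80 20
```

Calling the function again with the same seed returns the same split, which is what makes results comparable between runs.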

Let's train a simple LightGBM regression model through its scikit-learn API.  We use AML to record interesting information about the model in an Experiment.  An Experiment contains a series of trials called Runs.  During this trial we use AML in the following way:
* We access an experiment from our AML workspace by name, which will be created if it doesn't exist
* We use `start_logging` to create a new run in this experiment
* We use `run.log()` to record the parameters (`num_leaves`, `learning_rate`, `n_estimators`) and an error measure, the Mean Squared Error (MSE), to the run.  We will be able to review and compare these measures in the Azure Portal at a later time.
* We store the resulting model in the **outputs** directory, which is automatically captured by AML when the run is complete.
* We use `run.complete()` to indicate that the run is over and results can be captured and finalized

In [None]:
## get exact Run from Experiment
# from azureml.core import Run
# experiment = ws.experiments["train-within-notebook-lightgbm"]
# run  = Run(experiment, run_id = '27b20e6d-eb1a-4151-95eb-c94bd55a167d')
# run.complete()
# run.cancel()
# run.fail()

In [None]:
from azureml.core import Experiment

# Get an experiment object from Azure Machine Learning
experiment = Experiment(workspace=ws, name="HOL-train-in-notebook-lgbm")

# Create a run object in the experiment
run =  experiment.start_logging()


# Log the algorithm parameters to the run
run.log('num_leaves', 31)
run.log('learning_rate', 0.05)
run.log('n_estimators', 20)

# setup model, train and test
gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)
model_gbm = gbm.fit(x_train, y_train,
        eval_set=[(x_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)

preds = model_gbm.predict(x_test)

# Output the Mean Squared Error to the notebook and to the run
print('Mean Squared Error is', mean_squared_error(y_test, preds))
run.log('mse', mean_squared_error(y_test, preds))

# Save the model to the outputs directory for capture
import os
os.makedirs('./outputs', exist_ok=True)  # joblib.dump fails if the directory is missing
model_file_name = './outputs/model.pkl'

joblib.dump(value = model_gbm, filename = model_file_name)

# upload the model file explicitly into artifacts 
# run.upload_file(name = model_file_name, path_or_stream = model_file_name)

# Complete the run
run.complete()

### Simple parameter sweep
Now let's take the same concept from above and modify the **num_leaves** parameter.  For each value of num_leaves we will create a run that will store metrics and the resulting model.  In the end we can use the captured run history to determine which model was the best for us to deploy. 

Note that by using `with experiment.start_logging() as run` (and `with run.child_run() as child_run` inside the loop), AML automatically calls `complete()` on the run when its `with` block ends.

The parameter sweep also uses the **tqdm** library to show a progress bar.
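
To illustrate why no explicit `complete()` call is needed inside a `with` block, here is a minimal sketch of the context-manager pattern. `SketchRun` is a hypothetical stand-in, not the AML `Run` class; in particular, marking the run as failed when an exception escapes the block is an assumption of this sketch.

```python
class SketchRun:
    """Hypothetical stand-in for a run object; it only tracks a status and metrics."""

    def __init__(self):
        self.status = "Running"
        self.metrics = {}

    def log(self, name, value):
        self.metrics[name] = value

    def complete(self):
        self.status = "Completed"

    def fail(self):
        self.status = "Failed"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called automatically when the with block ends, with or without an error
        if exc_type is None:
            self.complete()
        else:
            self.fail()
        return False  # do not swallow exceptions


with SketchRun() as sketch_run:
    sketch_run.log("mse", 1.23)

print(sketch_run.status)  # Completed
```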

In [None]:
import numpy as np
from tqdm import tqdm
# experiment = Experiment(workspace=ws, name="train-locally-within-notebook-sweep3")
# num_leaves values to try: odd numbers from 5 to 33
num_leaves_sweep = np.arange(5, 35, 2)
mses = []

with experiment.start_logging() as run:

    for num_leaves in tqdm(num_leaves_sweep):
        # create one child run per value; each trains a model with a different num_leaves
        with run.child_run() as child_run:
            gbm = lgb.LGBMRegressor(num_leaves=num_leaves,
                                    learning_rate=0.05,
                                    n_estimators=20,
                                    silent=True)
            model_gbm = gbm.fit(x_train, y_train,
                    eval_set=[(x_test, y_test)],
                    eval_metric='l1',
                    early_stopping_rounds=5
                    , verbose=False)

            preds = model_gbm.predict(x_test)
            mse = mean_squared_error(y_true=y_test, y_pred=preds)

            # log num_leaves and mean_squared_error in run history
            child_run.log(name="num_leaves", value=num_leaves)
            child_run.log(name="mse", value=mse)
            mses.append(mse)

    run.log_list(name="mses", value=mses, description='')
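
Once the sweep has finished, the best setting is the one with the lowest MSE. The sketch below shows one way to pick it from two parallel lists. The MSE numbers are placeholders invented for illustration, not results from this lab; in the notebook you would use the `mses` list and `num_leaves_sweep` filled in by the cell above.

```python
# Same values as np.arange(5, 35, 2): 5, 7, ..., 33
num_leaves_values = list(range(5, 35, 2))

# Placeholder MSE values, invented for illustration only
example_mses = [9.0, 8.5, 8.1, 7.8, 7.6, 7.5, 7.45, 7.4,
                7.42, 7.41, 7.43, 7.44, 7.46, 7.5, 7.52]

# Pair each MSE with its num_leaves value and take the pair with the smallest MSE
best_mse, best_num_leaves = min(zip(example_mses, num_leaves_values))
print(f"best num_leaves: {best_num_leaves} (MSE {best_mse})")
```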


### Viewing run results
Azure Machine Learning stores all the details about the run in the Azure cloud.  Let's access those details by retrieving a link to the run using the default run output.  Clicking on the resulting link will take you to an interactive page presenting all run information.

In [None]:
run

An experiment is a logical container in an Azure ML Workspace. It contains a series of trials called Runs. As such, it hosts run records such as run metrics, logs, and other output artifacts from your experiments.

The test set held out in PART 1 provides data points that were not used to train the model, so that you can measure its true accuracy. 

In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have data prepared for auto-training a machine learning model.

## PART 3: Train Regression model on AML remote Compute

PART 3 focuses on training models on remote AML Compute and is divided into two parts:

* A) you train a single Regression model similar to the previous section, only using remote compute
* B) you train multiple Regression models at once and select the best one via the Automated ML component of Azure Machine Learning service. This happens on remote AML compute, a simple auto-scaled cluster of machines for parallel training.

The steps are:
* configure datasource - remote storage shared between the parallel runs
* configure AML compute target
* configure and run Automated ML Experiment
* review results

Get default blob store associated with your workspace. Alternatively, you can attach your own blob storage to the Workspace - see [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data).

In [None]:
ds = ws.datastores['workspaceblobstore']
# ds = ws.get_default_datastore()
for attr, value in ds.__dict__.items():
    if (attr in ['name', 'datastore_type', 'container_name', 'account_name']):
        print(f"{attr}: {value}")

Upload prepared data into associated Datastore.

In [None]:
ds.upload(src_dir='./data', target_path='data', overwrite=True, show_progress=True)

### Configure Compute Target (SDK)

Create the compute target in the Azure portal; alternatively, you can create it using the Python SDK.

Reuse the name of the compute cluster you created in the previous step and set the variable accordingly:

```python
amlcompute_cluster_name = "<#Name your cluster#>" 
```

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "aml-cluster" #Name your cluster
# amlcompute_cluster_name = "azdemocluster-f" #Name your cluster

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_d2_v2', # Standard_F4s_v2
                                                           max_nodes=10)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use the 'status' property to get a detailed status for the current cluster. 
cts = compute_target.status.serialize()
print(f'Found existing compute target: {amlcompute_cluster_name}\n({"cluster is running" if (int(cts["currentNodeCount"])>0) else "cluster is idle"}) currentNodeCount: {cts["currentNodeCount"]}, vmPriority: {cts["vmPriority"]}, vmSize: {cts["vmSize"]}')


The project folder is uploaded to the Docker container on the compute target and becomes the ``working directory`` of the executed code.

In [None]:
# from azureml.core.runconfig import DataReferenceConfiguration

# dr = DataReferenceConfiguration(datastore_name=ds.name, 
#                    path_on_datastore='data', 
#                    path_on_compute='/tmp/azureml_runs',
#                    mode='download', # download files from datastore to compute target
#                    overwrite=False)

In [None]:
import os

project_folder = "aml_prj"

if not os.path.exists(project_folder):
    os.makedirs(project_folder)
else:
    print(f"folder '{project_folder}' already there")

### Part 3 A: Train simple Regression model on remote AML Compute

In [None]:
from azureml.core import Dataset

dataset = Dataset.File.from_files((ds, 'data/taxi_final_df_train.csv'))
dataset

In [None]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name="HOL-train-on-compute-simple")


In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

conda_env = Environment('conda-env')
conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk',
                                                                             'azureml-dataprep[pandas,fuse]',
                                                                             'scikit-learn',
                                                                             'lightgbm',
                                                                            'joblib'])

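We must configure the run based on the environment, the script folder with the main script, and arguments such as the dataset.

**Important:** the script is an ordinary Python file named `train.py`, located in the project folder (`aml_prj`). This notebook does not create it, so place your training script there before you submit the run.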
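
The lab does not show the contents of `train.py`, so the following is only a hedged sketch of what such a script could look like, mirroring the LightGBM training from PART 2. It assumes the mounted dataset path arrives as the first command-line argument and may be either a file or a folder containing the CSV. `find_csv` and `main` are hypothetical names chosen for this sketch.

```python
# train.py - hypothetical sketch; save it into the project folder (aml_prj)
import os
import sys


def find_csv(path):
    """Return the first .csv file at or below `path` (a file or a mounted folder)."""
    if os.path.isfile(path):
        return path
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if name.endswith(".csv"):
                return os.path.join(root, name)
    raise FileNotFoundError(f"no CSV file found under {path}")


def main(mount_path):
    # Heavy imports live here so the helper above stays usable without them
    import joblib
    import lightgbm as lgb
    import pandas as pd
    from azureml.core import Run
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    run = Run.get_context()  # the run that was created by exp.submit(...)

    df = pd.read_csv(find_csv(mount_path))
    y = df.pop("totalAmount")
    x_train, x_valid, y_train, y_valid = train_test_split(
        df, y, test_size=0.2, random_state=223)

    model = lgb.LGBMRegressor(num_leaves=31, learning_rate=0.05, n_estimators=20)
    model.fit(x_train, y_train)

    run.log("mse", mean_squared_error(y_valid, model.predict(x_valid)))

    os.makedirs("outputs", exist_ok=True)  # files in ./outputs are captured with the run
    joblib.dump(model, "outputs/model.pkl")


if __name__ == "__main__":
    if len(sys.argv) == 2:  # expects exactly one argument: the dataset path
        main(sys.argv[1])
```

Keeping the path lookup in a separate helper makes it possible to test that part locally before paying for a remote run.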

In [None]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder, 
                      script='train.py', 
                      arguments =[dataset.as_named_input('taxi_data').as_mount()])

src.run_config.framework = 'python'
src.run_config.environment = conda_env
src.run_config.target = compute_target.name
# src.run_config.data_references = {ds.name: dr}

In [None]:
run = exp.submit(config=src)

The experiment is **submitted** to run on remote Compute (AzureML compute cluster) in background. If you run this for the first time, it will run for **10-20min** since it needs to:
- build docker image based on your environment
- send the image to Azure Container Registry - repository for your environment images
- start the compute cluster & upload the docker image in the compute
- start the image with the training code

Wait until the below widget turns green and says **Completed**.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

In [None]:
raise Exception('### INTENDED STOP ### to wait for asynchronous training Job')

### Part 3 B: Training multiple models in parallel using Automated ML

Observe the parameters of `DataReferenceConfiguration` (shown commented out above):

* `path_on_datastore`...folder in container
* `path_on_compute`...folder where the data is mounted/downloaded
* `mode`...whether to download or just mount

The RunConfiguration sets the docker Python environment - packages and Conda dependencies.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
import pkg_resources

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to the AML compute cluster
conda_run_config.target = compute_target
# set the data reference of the run configuration
# conda_run_config.data_references = {ds.name: dr}

pandas_dependency = 'pandas==' + pkg_resources.get_distribution("pandas").version

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], conda_packages=['numpy',pandas_dependency])
conda_run_config.environment.python.conda_dependencies = cd

### Automatically train a model

To automatically train a model, take the following steps:
1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
1. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.
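
The two steps above amount to "score several candidate models with one metric and keep the best". The toy sketch below illustrates only that selection idea; it is not how Automated ML searches, and the candidates are trivial constant predictors. It assumes that normalized RMSE means RMSE divided by the range of the actual values. All data values are invented for illustration.

```python
from math import sqrt
from statistics import mean, median


def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actual values (assumed definition)."""
    rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
    return rmse / (max(actual) - min(actual))


# Invented toy data, for illustration only
train_y = [5.0, 7.0, 9.0, 30.0]
valid_y = [6.0, 8.0, 10.0]

# Each candidate returns n predictions; here they just repeat a constant
candidates = {
    "constant_mean": lambda n: [mean(train_y)] * n,
    "constant_median": lambda n: [median(train_y)] * n,
}

scores = {name: normalized_rmse(valid_y, predict(len(valid_y)))
          for name, predict in candidates.items()}
best_name = min(scores, key=scores.get)  # lower error is better
print(best_name, round(scores[best_name], 3))
```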

#### Define training settings

Define the experiment parameter and model settings for training. View the full list of [settings](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings will take approximately 5-10 min, but if you want a shorter run time, reduce the `iterations` parameter.


|Property| Value in this tutorial |Description|
|----|----|---|
|**iteration_timeout_minutes**|30|Time limit in minutes for each iteration. Reduce this value to decrease total runtime.|
|**iterations**|10|Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time.|
|**primary_metric**| normalized_root_mean_squared_error | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|
|**featurization**| auto | With **auto**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|
|**validation_data**| valid_data_dprep | Held-out data used to score each iteration. Cross-validation (**n_cross_validations**) applies only when validation data is not specified.|
|**blocked_models**| XGBoostRegressor, ElasticNet | Algorithms excluded from the search.|
|**max_concurrent_iterations**|10|Number of parallel runs; it should match the cluster size.|


In [None]:
from azureml.train.automl import AutoMLConfig
import azureml.dataprep as dprep
from azureml.core.dataset import Dataset

# train_data_dprep = dprep.auto_read_file(path=ds.path("data/taxi_final_df_train.csv"))
# valid_data_dprep = dprep.auto_read_file(path=ds.path("data/taxi_final_df_test.csv"))

# training data comes from the train file, validation data from the test file
train_data_dprep = Dataset.Tabular.from_delimited_files(path=(ds, 'data/taxi_final_df_train.csv'))
valid_data_dprep = Dataset.Tabular.from_delimited_files(path=(ds, 'data/taxi_final_df_test.csv'))


automl_config = AutoMLConfig(task='regression',
                            iteration_timeout_minutes=30,
                            iterations=10,
                            featurization='auto',
                            blocked_models = ["XGBoostRegressor","ElasticNet"],
                            primary_metric='normalized_root_mean_squared_error',
                            training_data=train_data_dprep,
                            validation_data=valid_data_dprep,                             
                            label_column_name="totalAmount",
                            debug_log='automl.log',
                            run_configuration=conda_run_config,
                            model_explainability=False,
                            max_concurrent_iterations=10,
                            path= project_folder)

Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

#### Train the automatic regression model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined `automl_config` object to the experiment. The cell below uses `show_output=False`, so the call returns immediately and the run continues in the background; set `show_output=True` to view progress in the notebook during the run. 

With `show_output=True`, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type.

While the experiment is running, you can observe the results and changes in the Azure Portal. You can also view the state of the runs and results using the widget below in the sub-section **Explore the results**.

In [None]:
from azureml.core.experiment import Experiment
experiment=Experiment(ws, 'HOL-train-automl')
remote_run = experiment.submit(automl_config, show_output=False)

### Explore the results

Explore the results of automatic training with a [Jupyter widget](https://docs.microsoft.com/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget allows you to see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on different accuracy metrics than your primary metric with the dropdown selector.

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

In [None]:
raise Exception('### INTENDED STOP ### to wait for asynchronous AutoML Job')

### Review trained model & results

#### Retrieve the best model

Select the best model from your iterations. The `get_output` function returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration.

In [None]:
best_run, fitted_model = remote_run.get_output()
print("Best Run:")
print(best_run)

print("")
print("Fitted model:")
print(fitted_model)

#### Test the best model accuracy

Use the best model to run predictions on the test data set to predict taxi fares. The function `predict` uses the best model and predicts the values of y, **trip cost**, from the `x_test` data set. Print the first 10 predicted cost values from `y_predict`.


In [None]:
test = pd.read_csv("./data/taxi_final_df_test.csv")

In [None]:
test.head()

In [None]:
y_test = test.pop("totalAmount")
x_test = test

In [None]:
y_predict = fitted_model.predict(x_test)
print(y_predict[:10])

Calculate the `root mean squared error` of the results. Convert the `y_test` dataframe to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, **cost**. It indicates roughly how far the taxi fare predictions are from the actual fares.


In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
print(f"RMSE: {rmse}")
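
For readers who want to see what the RMSE calculation does step by step, here is a standard-library sketch of the same formula. The function name `rmse_by_hand` and the three sample values are illustrative only and are not results from the model.

```python
from math import sqrt


def rmse_by_hand(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Invented values: the errors are 2, 2 and 3, so the squared errors are 4, 4 and 9
print(rmse_by_hand([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```

Because of the squaring, one large miss raises RMSE more than several small ones of the same total size.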

Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` data sets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values.


In [None]:
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
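
The cell above divides the summed absolute errors by the summed actual values. A frequently used alternative averages the percentage error of each row instead, and the two can give quite different numbers. The sketch below compares both on invented values; the function names are hypothetical and the numbers are not results from this lab.

```python
def summed_abs_percent_error(actual, predicted):
    """Sum of absolute errors as a share of the sum of actual values (as in the cell above)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(actual)


def per_row_abs_percent_error(actual, predicted):
    """Average of each row's absolute error divided by that row's actual value."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Invented values: a cheap trip that is badly predicted and an expensive one that is exact
actual_fares = [10.0, 100.0]
predicted_fares = [5.0, 100.0]

print(summed_abs_percent_error(actual_fares, predicted_fares))
print(per_row_abs_percent_error(actual_fares, predicted_fares))
```

The summed version weights expensive trips more heavily, so a large relative miss on a cheap trip barely moves it.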

From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within +- $4.00, and approximately 15% error. 

The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
test_pred = plt.scatter(y_actual, y_predict, color='b')
test_test = plt.scatter(y_actual, y_actual, color='g')
plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)
plt.show()

## PART 4: Model Deployment [optional]

In this part of the lab, you use the best trained model to create a web service deployment through a Managed real-time Endpoint.

You will move from this notebook to AzureML studio for a while (just to showcase it), but you could do the same through the Python SDK or CLI. The steps of the deployment process are:

* Navigate to the AutoML experiment and select the best model
* Deploy the model to a real-time endpoint
* Test the endpoint with your data

First, navigate to your Experiments and find the AutoML Experiment.

![deployment1](./media/deployment01.jpg)

Then, on the Experiment, locate your latest run and select Models to see all models trained during AutoML:
![deployment2](./media/deployment02.jpg)

On the particular model, select Deploy -> Deploy to real-time endpoint (Preview) and follow the wizard.
![deployment3](./media/deployment03.jpg)

In the wizard, give your endpoint a unique name (it is the name of the publicly available service), and leave the next steps at their default values.
![deployment4](./media/deployment04.jpg)


At the Compute stage of the wizard, lower the number of instances to `1`:
![deployment5](./media/deployment05.jpg)

Finish the deployment by clicking the blue "Create" button at the bottom of the page.
![deployment6](./media/deployment06.jpg)

After a successful deployment, navigate to your Endpoints and select the newly deployed real-time Endpoint to see the details.
![deployment7](./media/deployment07.jpg)

You can navigate to the **Consume** tab to see the Key, the Scoring URI, and examples in various programming languages of how to call the endpoint. Choose Python and copy & paste the example into the cell below (or only fill in the necessary parameters).
![deployment8](./media/deployment08.jpg)
![deployment8](./media/deployment08.jpg)

### Test the webservices - score in real-time

In [None]:
service_name ="***" # YOUR ENDPOINT NAME
scoring_uri = "***" # YOUR SCORING URI
scoring_key = "***" # YOUR SCORING KEY 

import requests
import json

# get sample test/validation data
sample_df = pd.read_csv("./data/taxi_final_df_test.csv").sample()
vals = sample_df[["vendorID","passengerCount","tripDistance","month_num","day_of_month","day_of_week","hour_of_day","totalAmount"]].values
sample_data = vals.tolist()[0][0:-1]
sample_target = vals.tolist()[0][-1]
test_sample = {"Inputs": {"data": sample_data}}  # keep as a dict; it is serialized once below

# Convert to JSON string
input_data = json.dumps(test_sample)

# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
headers['Authorization'] = f'Bearer {scoring_key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
# print(resp.text)

resp_json = json.loads(resp.text)
prediction = resp_json["Results"][0]

print(f"Prediction: {prediction}\nActuals: {sample_target}")
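
A request body should be serialized to JSON exactly once. The sketch below shows what happens when `json.dumps` is applied twice: the receiver decodes a plain string instead of an object. The payload shape is copied from the cell above, and the feature values are placeholders invented for illustration; the schema your endpoint actually expects is shown on its **Consume** tab.

```python
import json

# Placeholder feature values, invented for illustration only
sample_row = [2, 1, 3.2, 1, 15, 3, 18]
payload = {"Inputs": {"data": sample_row}}

body = json.dumps(payload)         # encoded once: a JSON object
double_encoded = json.dumps(body)  # encoded twice: a JSON string literal

print(type(json.loads(body)).__name__)            # dict
print(type(json.loads(double_encoded)).__name__)  # str
```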

## Clean up resources

Do not complete this section if you plan on running other Azure Machine Learning service tutorials.

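### Delete service

The notebook does not define a `service` object, so retrieve the web service by name before deleting it. The cell below assumes an ACI or AKS web service registered in the workspace; if you deployed through the studio wizard as in PART 4, you can delete the endpoint from the **Endpoints** page in the studio instead.

In [None]:
from azureml.core.webservice import Webservice
Webservice(workspace=ws, name=service_name).delete()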

### Stop the notebook VM

If you used a cloud notebook server, stop the VM when you are not using it to reduce cost.

1. In your workspace, select **Notebook VMs**.
1. From the list, select the VM.
1. Select **Stop**.
1. When you're ready to use the server again, select **Start**.

### Delete everything

If you don't plan to use the resources you created, delete them, so you don't incur any charges.

1. In the Azure portal, select **Resource groups** on the far left.
1. From the list, select the resource group you created.
1. Select **Delete resource group**.
1. Enter the resource group name. Then select **Delete**.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select **Delete**.

## Next steps

In this machine learning lab, you did the following tasks:

> * Configured a workspace and prepared data for an experiment.
> * Trained a regression model locally with custom parameters.
> * Trained a regression model with automated ML on Azure Machine Learning compute.
> * Explored and reviewed training results.
> * Deployed the model to a real-time endpoint and tested the web service.

Visit the [docs](https://docs.microsoft.com/azure/machine-learning/service/) for Azure Machine Learning service documentation and tutorials.

Learn by examples and code at [AML GitHub](https://github.com/Azure/MachineLearningNotebooks)