# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment, Dataset, Datastore
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
import logging
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.forecasting_parameters import ForecastingParameters
from azureml.widgets import RunDetails
from sklearn.model_selection import train_test_split

## Dataset

### Overview

This dataset contains sales information for the stores of a retail company such as Walmart. It includes historical weekly sales values (the target column) and other
supporting variables for the same period:
1. Store identifier
2. Average temperature during the week
3. Whether or not there was a holiday during the week
4. Fuel price
5. Consumer Price Index (CPI)
6. Unemployment rate

The goal of this task is to use historical data to forecast weekly sales for the next four weeks (roughly one month). These predictions will help
the company's finance and business teams manage each store's inventory. <br>
This dataset comes from Kaggle, and further details about it can be found [here](https://www.kaggle.com/datasets/asahu40/walmart-data-analysis-and-forcasting).

### Get data
I downloaded the dataset from Kaggle and uploaded it to this notebook's working directory. <br>I now load it with the
`Dataset` class.


In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'capstone-project'

experiment = Experiment(ws, experiment_name)

In [3]:
datastore = Datastore.get(ws, datastore_name='workspaceworkingdirectory')

In [4]:
data = Dataset.Tabular.from_delimited_files(path=(datastore, "Users/hualcosa/nd00333-capstone/data/Walmart Data Analysis and Forcasting.csv"))
type(data)

azureml.data.tabular_dataset.TabularDataset

In [5]:
# getting data as pandas dataframe for local experiment
df = data.to_pandas_dataframe()
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,2010-02-05,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,2010-02-12,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,2010-03-05,1554806.68,0,46.5,2.625,211.350143,8.106
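
Because the forecast horizon is expressed in weeks, it is worth confirming that consecutive observations for a store are exactly seven days apart. The sketch below is a minimal check that uses only the five preview rows from the `df.head()` output above. The `weekly_gaps` helper is a hypothetical name, not part of the Azure ML SDK, and on the full data the same check would need to be applied per store.

```python
import csv
import io
from datetime import date

# The five preview rows from the df.head() output (only the columns needed here).
PREVIEW = """Store,Date,Weekly_Sales
1,2010-02-05,1643690.9
1,2010-02-12,1641957.44
1,2010-02-19,1611968.17
1,2010-02-26,1409727.59
1,2010-03-05,1554806.68
"""


def weekly_gaps(csv_text):
    """Return the gaps, in days, between consecutive dates in the CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    dates = [date.fromisoformat(row["Date"]) for row in rows]
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


gaps = weekly_gaps(PREVIEW)
print(gaps)
```

If any gap differs from 7, the series has missing or irregular weeks, which is worth knowing before configuring a weekly forecast.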


In [6]:
# register the dataset so it can be used for the AutoML experiment
data.register(ws, name="sales_forecasting", description="capstone project dataset")


{
  "source": [
    "('workspaceworkingdirectory', 'Users/hualcosa/nd00333-capstone/data/Walmart Data Analysis and Forcasting.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "a8489742-86d6-4cec-b52b-6098c3e55e4a",
    "name": "sales_forecasting",
    "version": 1,
    "description": "capstone project dataset",
    "workspace": "Workspace.create(name='capstone-project', subscription_id='d2706c67-acfc-4bd3-9067-3ff6ac190bc9', resource_group='capstone-project')"
  }
}

### Now that the dataset is registered, it appears as a data asset and can be used as an input source for the AutoML experiment.


In [7]:
# Choose a name for your CPU cluster
cpu_cluster_name = "capstone-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # To use a different region for the compute, add a location='<region>' parameter
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           min_nodes=1,
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned



## AutoML Configuration
To run the AutoML experiment, we set the following parameters:
- compute_target: where the experiment runs. In our case, the compute cluster configured above
- primary_metric: the metric we want to optimize. Since this is a time series forecasting problem, normalized root mean squared error is a reasonable choice because it expresses the error relative to the range of the target
- experiment_timeout_minutes: the maximum time the experiment can run. The configuration below caps it at 30 minutes
- enable_early_stopping: set to True so the experiment can end early if the model scores stop improving
- n_cross_validations and cv_step_size: cross-validation parameters. Both are set to "auto" so the AutoML job determines how best to split the data for cross-validation
- y_min and y_max: the minimum and maximum target values used to normalize RMSE. We set these values so the same normalization can be applied later when running the HyperDrive experiment
- verbosity: logging verbosity, set to INFO
- forecasting_parameters: an object describing the forecasting job. In our specific case, it specifies the name of the time column, the forecast horizon (4 weeks), and which column(s) identify each time series
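
To make the cross-validation settings more concrete, here is a minimal sketch of rolling-origin splits for a single time series. With `"auto"`, AutoML picks the number of folds and the step size itself, so the values used below (3 folds, a step of 4 weeks, 20 observations) and the `rolling_origin_splits` helper are illustrative assumptions, not what the service actually chose.

```python
def rolling_origin_splits(n_obs, horizon, n_folds, step):
    """Return (train_indices, validation_indices) pairs, oldest fold first.

    Each validation window covers `horizon` consecutive observations, and
    consecutive windows start `step` observations apart. The training set is
    everything before the validation window, so no future data leaks in.
    """
    splits = []
    for fold in range(n_folds):
        # The last fold ends at the final observation; earlier folds shift back.
        val_end = n_obs - (n_folds - 1 - fold) * step
        val_start = val_end - horizon
        if val_start <= 0:
            raise ValueError("not enough observations for the requested folds")
        train = list(range(val_start))
        validation = list(range(val_start, val_end))
        splits.append((train, validation))
    return splits


# Hypothetical series of 20 weekly observations; horizon of 4 weeks as in this project.
for train, val in rolling_origin_splits(n_obs=20, horizon=4, n_folds=3, step=4):
    print(f"train: 0..{train[-1]}  validate: {val[0]}..{val[-1]}")
```

The point of the design is that every validation window lies strictly after its training data, which mirrors how the model is used at forecast time.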

In [11]:
forecasting_parameters = ForecastingParameters(time_column_name='Date', 
                                               forecast_horizon=4,
                                               time_series_id_column_names='Store')
                                               
automl_settings = {'compute_target': cpu_cluster,
                   'primary_metric': 'normalized_root_mean_squared_error',
                   'experiment_timeout_minutes': 30,
                   'enable_early_stopping': True,
                   'n_cross_validations': "auto",
                   'cv_step_size': "auto",
                   'y_min': df.Weekly_Sales.min(),
                   'y_max': df.Weekly_Sales.max(),
                   'verbosity': logging.INFO,
                   'forecasting_parameters': forecasting_parameters}

automl_config = AutoMLConfig(
                             task='forecasting',
                             training_data=data,
                             label_column_name='Weekly_Sales',
                             **automl_settings
                             )

In [12]:
# TODO: Submit your experiment
automl_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
capstone-project,AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [13]:
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…



## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [15]:
best_model = automl_run.get_best_child()
best_model

Experiment,Id,Type,Status,Details Page,Docs Page
capstone-project,AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe_22,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [16]:
# getting model properties
best_model.get_properties()

{'runTemplate': 'automl_child',
 'pipeline_id': '__AutoML_Ensemble__',
 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'regression\',\'primary_metric\':\'normalized_root_mean_squared_error\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':True,\'name\':\'capstone-project\',\'compute_target\':\'capstone-cluster\',\'subscription_id\':\'d2706c67-acfc-4bd3-9067-3ff6ac190bc9\',\'region\':\'brazilsouth\',\'time_column_name\':\'Date\',\'grain_column_names\':[\'Store\'],\'max_horizon\':4,\'drop_column_names\':[],\'spark_service\':None}","ensemble_run_id":"AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe_22","experiment_name":"capstone-project","workspace_name":"capstone-project","subscription_id":"d2706c67-acfc-4bd3-9067-3ff6ac190bc9","resource_group_name":"capstone-project"}}]}',
 'training_percent': '100',
 'pr

In [17]:
best_model.get_metrics()

{'normalized_root_mean_squared_log_error': 0.06700666359424767,
 'explained_variance': 0.9883480244278656,
 'normalized_root_mean_squared_error': 0.05943158667421881,
 'spearman_correlation': 0.9944430383653817,
 'median_absolute_error': 28670.01742318944,
 'normalized_mean_absolute_error': 0.0514832799184612,
 'root_mean_squared_log_error': 0.05512074870889492,
 'root_mean_squared_error': 59726.31786033828,
 'mean_absolute_error': 40797.68969088384,
 'mean_absolute_percentage_error': 4.176242281946832,
 'normalized_median_absolute_error': 0.048039543643575704,
 'r2_score': 0.9871650655281033,
 'residuals': 'aml://artifactId/ExperimentRun/dcid.AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe_22/residuals',
 'predicted_true': 'aml://artifactId/ExperimentRun/dcid.AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe_22/predicted_true',
 'forecast_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe_22/forecast_table'}
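
The metrics above include both `root_mean_squared_error` and `normalized_root_mean_squared_error`. As a rough illustration of how the two relate, the sketch below divides RMSE by the range `y_max - y_min`. This is only the general definition: `normalized_rmse` is a hypothetical helper, the toy numbers are made up, and AutoML's forecasting metrics may be normalized differently (for example per time series), so it is not expected to reproduce the values reported above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def normalized_rmse(y_true, y_pred, y_min, y_max):
    """RMSE divided by the target range (y_max - y_min)."""
    if y_max <= y_min:
        raise ValueError("y_max must be greater than y_min")
    return rmse(y_true, y_pred) / (y_max - y_min)


# Toy values for illustration only; they are not taken from the sales data.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(actual, predicted))
print(normalized_rmse(actual, predicted, y_min=100.0, y_max=400.0))
```

Fixing `y_min` and `y_max` up front, as done in the AutoML settings, keeps the denominator identical across experiments so their normalized scores can be compared.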

### Save the best model

In [24]:
model_path = best_model.get_properties()['model_output_path']

In [31]:
# download the best model file to the local working directory
best_model.download_file('outputs/model.pkl')

# registering best model in the workspace
model = best_model.register_model(model_name = 'capstone_automl_best_model', model_path = model_path)
print(f"best model run id: {best_model.id}")
print(f"registered model name: {model.name}, id: {model.id}, version: {model.version}")

best model run id: AutoML_cf491250-efcb-4015-bd49-e5c11cd8fbfe_22
registered model name: capstone_automl_best_model, id: capstone_automl_best_model:2, version: 2


## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
