Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Forecasting Pipeline
---

In this notebook, we create a pipeline to forecast sales with the models we trained in the previous notebook. We will set up the pipeline to forecast over the desired forecasting horizon. As we did with training, we use the [ParallelRunStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallel_run_step.parallelrunstep) to parallelize the process. For more information about the data and models, refer to the previous notebooks.

### Prerequisites
At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](00_Setup_AML_Workspace.ipynb)
2. Run [01_Data_Preparation.ipynb](01_Data_Preparation.ipynb) to create the dataset
3. Run [02_Training_Pipeline.ipynb](02_Training_Pipeline.ipynb) to train the models

Please ensure you have the latest version of the Azure ML SDK and also install the Pipeline Steps package:

In [None]:
!pip install --upgrade azureml-sdk azureml-pipeline-steps

## 1.0 Connect to workspace and datastore

In [None]:
import azureml.core
from azureml.core import Workspace

# Connect to workspace
ws = Workspace.from_config()

# Get datastore
dstore = ws.get_default_datastore()

print('SDK version: ' + azureml.core.VERSION, 
      'Workspace Name: ' + ws.name,
      'Azure Region: ' + ws.location,
      'Subscription ID: ' + ws.subscription_id,
      'Resource Group: ' + ws.resource_group,
      sep='\n')

## 2.0 Get the test dataset

In the [data preparation notebook](01_Data_Preparation.ipynb), we registered a subset of the orange juice data for testing purposes. We will now get a reference to that dataset in our Datastore. We will use the models trained in the [modeling notebook](02_Training_Pipeline.ipynb) to generate forecasts over all rows in each test file.

You can choose to run the pipeline on the subset of files or the full dataset of 11,973 series. If you chose to use only a subset of the files, the test dataset name will be `oj_data_small_test`. Otherwise, the name you'll have to use is `oj_data_test`. We recommend starting with the small dataset.

In [None]:
dataset_name = 'oj_data_small_test'

In [None]:
from azureml.core.dataset import Dataset

dataset = Dataset.get_by_name(ws, name=dataset_name)
dataset_input = dataset.as_named_input(dataset_name)

## 3.0 Choose a compute target

This is the compute cluster you created in the [setup notebook](00_Setup_AML_Workspace.ipynb#3.0-Create-compute-cluster) and used for training in the [training notebook](02_Training_Pipeline.ipynb).

In [None]:
cpu_cluster_name = 'cpucluster'

In [None]:
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, cpu_cluster_name)

## 4.0 Build the forecasting pipeline

As we did with the training pipeline, we'll create a ParallelRunStep to parallelize our forecasting process. You'll notice this code is essentially the same as in the training notebook, except that we'll be parallelizing [the forecasting script](scripts/forecast.py) rather than the training script. Note that we still need to pass the timeseries schema (timestamp column name, timeseries ID column names, etc.) to the forecasting script. Unlike the training script, however, the forecasting script does not require the name of the target column; it is optional. In a true forecasting scenario, the actual values of the target are not available, so the forecasting pipeline returns only predictions. In a testing scenario, the forecasting pipeline can also return the actuals if the target column name is provided, so that we can evaluate the accuracy of the forecasts on a hold-out set.
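
As a minimal sketch of how a script could treat the target column as optional, the example below uses `argparse` with a flag that defaults to `None` when omitted. This is not the contents of `scripts/forecast.py`; the function name `parse_forecast_args` is hypothetical, and the flags simply mirror the custom-script arguments passed later in this notebook.

```python
import argparse


def parse_forecast_args(argv):
    """Parse a timeseries schema in which the target column is optional."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--timestamp_column', type=str, required=True)
    parser.add_argument('--timeseries_id_columns', type=str, nargs='*', required=True)
    # Optional: only supplied when the inference data contains actuals
    parser.add_argument('--target_column', type=str, default=None)
    parser.add_argument('--model_type', type=str, required=True)
    # parse_known_args ignores any extra arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args


# Testing scenario: actuals are available, so the target column is passed
test_args = parse_forecast_args([
    '--target_column', 'Quantity',
    '--timestamp_column', 'WeekStarting',
    '--timeseries_id_columns', 'Store', 'Brand',
    '--model_type', 'lr',
])

# True forecasting scenario: no actuals, so the target column is omitted
forecast_args = parse_forecast_args([
    '--timestamp_column', 'WeekStarting',
    '--timeseries_id_columns', 'Store', 'Brand',
    '--model_type', 'lr',
])

print(test_args.target_column)
print(forecast_args.target_column)
```

A script written this way can check `args.target_column is None` to decide whether to include actuals alongside the predictions.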

### 4.1 Configure environment for ParallelRunStep

[Option A] Environment for custom script:

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

forecast_env = Environment(name="many_models_environment_customscript")
forecast_conda_deps = CondaDependencies.create(
    pip_packages=['scikit-learn', 'pandas', 'joblib', 'azureml-core', 'azureml-dataprep[fuse]'])
forecast_env.python.conda_dependencies = forecast_conda_deps

[Option B] Environment for Automated ML:

In [None]:
from scripts.notebooks.modeling import get_automl_environment
forecast_env = get_automl_environment(name='many_models_environment_automl')

### 4.2 Set up ParallelRunConfig

[Option A] Configuration for custom script:

In [None]:
node_count = 1
process_count_per_node = 6
run_invocation_timeout = 180
forecasting_script = 'forecast.py'

[Option B] Configuration for Automated ML:

In [None]:
node_count = 3
process_count_per_node = 6
run_invocation_timeout = 300
forecasting_script = 'forecast_automl.py'

Now let's create the `ParallelRunConfig` object:

In [None]:
from azureml.pipeline.steps import ParallelRunConfig 

parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script=forecasting_script,
    mini_batch_size='1',
    run_invocation_timeout=run_invocation_timeout, 
    error_threshold=-1,
    output_action='append_row', 
    environment=forecast_env, 
    process_count_per_node=process_count_per_node, 
    compute_target=compute, 
    node_count=node_count
)

And validate it:

In [None]:
from scripts.notebooks.modeling import validate_parallel_run_config

validate_parallel_run_config(parallel_run_config)
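
To make the configuration above more concrete: a ParallelRunStep entry script defines an `init()` function, called once per worker process, and a `run(mini_batch)` function, called once per mini-batch. The sketch below simulates a single invocation locally, assuming the input is a file dataset so that `mini_batch_size='1'` hands `run()` a list containing one file path. It is not the actual `forecast.py`: the forecasting logic is replaced by a placeholder, and the sample file name and values are made up for illustration.

```python
import csv
import os
import tempfile


def init():
    """Called once per worker process; a real script would parse its arguments here."""
    pass


def run(mini_batch):
    """Called once per mini-batch; for file inputs, mini_batch is a list of file paths."""
    results = []
    for file_path in mini_batch:
        with open(file_path, newline='') as f:
            rows = list(csv.DictReader(f))
        # Placeholder for loading the series' model and forecasting:
        # here we only report how many rows the file holds.
        results.append(f'{os.path.basename(file_path)} {len(rows)}')
    # With output_action='append_row', each returned item becomes one output row
    return results


# Local simulation of one invocation with a single (made-up) file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'Store1001_tropicana.csv')
    with open(sample_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['WeekStarting', 'Store', 'Brand', 'Quantity'])
        writer.writerow(['1992-06-11', 1001, 'tropicana', 12000])
        writer.writerow(['1992-06-18', 1001, 'tropicana', 11500])
    init()
    output_lines = run([sample_path])

print(output_lines)
```

Because every `run()` call handles one file here, `run_invocation_timeout` effectively bounds the time allowed for forecasting a single series file.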

### 4.3 Set up ParallelRunStep

[Option A] Parameters for custom script:

In [None]:
step_arguments = [
    '--target_column', 'Quantity',  # Since this is a testing scenario, pass the target column name 
    '--timestamp_column', 'WeekStarting', 
    '--timeseries_id_columns', 'Store', 'Brand',
    '--model_type', 'lr'
]

[Option B] Parameters for Automated ML:

In [None]:
step_arguments = [
    '--group_column_names', 'Store', 'Brand',
    '--target_column_name', 'Quantity',  # This is optional, and needs to be passed only if inference data contains target column
    '--time_column_name', 'WeekStarting'  # This is needed for timeseries
]

Now create the `ParallelRunStep` object:

In [None]:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import ParallelRunStep 

output_dir = PipelineData(name='forecasting_output', datastore=dstore)

parallel_run_step = ParallelRunStep(
    name="many-models-parallel-forecasting",
    parallel_run_config=parallel_run_config,
    inputs=[dataset_input],
    output=output_dir,
    allow_reuse=False,
    arguments=step_arguments
)

### 4.4 Create step to copy predictions

The forecasting pipeline includes a second step that copies the predictions from *parallel_run_step.txt* to a CSV file in a separate container. While this step is simple, it demonstrates how a step can be added to the pipeline to upload the predictions to a separate datastore or make additional transformations to the output.

First, we create a datastore named **predictions** to hold the outputs of the pipeline and get a reference to it:

In [None]:
from azureml.core import Datastore
from azureml.data.data_reference import DataReference

output_dstore = Datastore.register_azure_blob_container(
    workspace=ws, 
    datastore_name='predictions',
    container_name='predictions',
    account_name=dstore.account_name,
    account_key=dstore.account_key,
    create_if_not_exists=True
)

output_dref = DataReference(output_dstore)

Next, we define the [PythonScriptStep](https://docs.microsoft.com/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep) and give it our newly created datastore as well as the location of *parallel_run_step.txt*. Note that the copy script also uses the timeseries schema: it creates a header row for the prediction data and therefore needs to know the column names. The target column is passed here only if it was also passed to the forecasting script.

In [None]:
from azureml.pipeline.steps import PythonScriptStep

upload_predictions_step = PythonScriptStep(
    name='copy_predictions',
    script_name='copy_predictions.py',
    compute_target=compute,
    source_directory='./scripts',
    inputs=[output_dref, output_dir],
    allow_reuse=False,
    arguments=['--parallel_run_step_output', output_dir,
               '--output_dir', output_dref,
               '--target_column', 'Quantity',
               '--timestamp_column', 'WeekStarting',
               '--timeseries_id_columns', 'Store', 'Brand']
)
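
The sketch below illustrates how a header row could be derived from the timeseries schema. It is not the actual `copy_predictions.py`. The helper name `build_header` and the `prediction_column` parameter are hypothetical, and the column order is an assumption taken from the column names used in section 6.2 when the output file is read back.

```python
def build_header(timeseries_id_columns, timestamp_column, target_column=None,
                 prediction_column='Predictions'):
    """Return the output column names implied by the timeseries schema."""
    columns = list(timeseries_id_columns) + [timestamp_column, prediction_column]
    # The actuals column is only present when the target column was supplied
    if target_column is not None:
        columns.append(target_column)
    return columns


# Testing scenario: the target column was passed to the forecasting script
header_with_actuals = build_header(['Store', 'Brand'], 'WeekStarting', 'Quantity')

# True forecasting scenario: predictions only
header_predictions_only = build_header(['Store', 'Brand'], 'WeekStarting')

print(','.join(header_with_actuals))
print(','.join(header_predictions_only))
```

This is why the target column should be passed to the copy step only when it was also passed to the forecasting step: otherwise the header would have one more column than the data rows.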

### 4.5 Create pipeline

Finally, we will create the forecasting pipeline, composed of the ParallelRunStep that will issue forecasts in parallel, and the PythonScriptStep that will copy the predictions into a different Azure Blob Container.

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step, upload_predictions_step])

## 5.0 Run the forecasting pipeline

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'many-models-forecasting')
print('Experiment name: ' + experiment.name)

run = experiment.submit(pipeline)
print('Run ID:', run.id)

As you did during training, you can run the following command if you'd like to monitor the forecasting process in the Jupyter notebook:

In [None]:
run.wait_for_completion(show_output=False, raise_on_error=True)

## 6.0 View the results of the forecasting pipeline

The forecasting pipeline forecasts the orange juice quantity for each store and brand. To see our forecasts, we download *parallel_run_step.txt*, read the results into a dataframe, and visualize the predictions. Note that we could also download the results from the predictions container we created above.

### 6.1 Download parallel_run_step.txt locally
You need to wait until the run that was submitted to the Azure Machine Learning Compute Cluster is complete. You can monitor the run status in the [Azure Machine Learning Portal](https://ml.azure.com).

If this notebook has been restarted since you launched forecasting, you might need to run the following to get the `run` object:

In [None]:
# from azureml.core import Experiment
# from azureml.pipeline.core import PipelineRun

# experiment = Experiment(ws, 'many-models-forecasting')
# run = PipelineRun(experiment, '<your-run-id>')

And then download the predictions:

In [None]:
import os
from pathlib import Path

def download_predictions(run, target_dir='output', step_name='many-models-parallel-forecasting', output_name='forecasting_output'):
    # Find the ParallelRunStep run within the pipeline run
    step_run = run.find_step_run(step_name)[0]
    # Download the step's output and return the local folder that holds it
    port_data = step_run.get_output_data(output_name)
    port_data.download(target_dir, show_progress=True, overwrite=True)
    return os.path.join(target_dir, 'azureml', step_run.id, output_name)

file_path = download_predictions(run, 'output')
file_path

### 6.2 Convert the file to a dataframe

In [None]:
import pandas as pd

df = pd.read_csv(os.path.join(file_path, 'parallel_run_step.txt'), 
                 sep=' ',
                 names=['Store', 'Brand', 'WeekStarting', 'Predictions', 'Quantity'], 
                 parse_dates=['WeekStarting'])

df.head()

### 6.3 Visualize the predictions
First, we look at the distribution of predicted quantities by brand:

In [None]:
import seaborn as sns
fig = sns.violinplot(x=df['Brand'], y=df['Predictions'], data=df)
fig.set_title('Predictions by Brand')

Then we look at those predictions over time:

In [None]:
import matplotlib.pyplot as plt

week = df.groupby(['WeekStarting', 'Brand'])
week = week['Predictions'].sum()
week = pd.DataFrame(week.unstack(level=1))

week.plot()
plt.title('Total Predictions by Brand')
plt.xticks(rotation=40)
plt.legend(loc='upper right')
plt.xlabel('Week')
plt.ylabel('Total Predictions')
plt.show()

From there, we can filter the results to look at the brands sold in an individual store:

In [None]:
store = 1001
df_1001 = df[df['Store'] == store]

brands = df_1001.groupby(['WeekStarting','Brand'])
brands = brands['Predictions'].sum()
brands = pd.DataFrame(brands.unstack(level=1))

brands.plot()
plt.legend(loc='upper right', labels=brands.columns.values)
plt.xticks(rotation=40)
plt.title('Predictions for Store 1001')
plt.xlabel('Week')
plt.ylabel('Predicted Quantity')
plt.show()

Since we produced forecasts on a test set, we can also examine forecast errors. The next plot displays the distribution of absolute percentage errors for each date in the testing period. 

In [None]:
import numpy as np

# Compute the absolute percentage error for each forecast at each date
# Warning: percentage errors are not defined if the actuals contain zero values
df['APE'] = 100*np.abs((df['Quantity'] - df['Predictions']) / df['Quantity'])

fig = sns.boxplot(x='WeekStarting', y='APE', data=df)
fig.set_title('Absolute Percentage Error Distributions Over All Stores and Brands')
plt.gcf().set_size_inches(15.0, 6.0)
plt.show()
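
As the warning in the cell above notes, percentage errors are undefined when the actuals contain zeros. One possible way to handle this, shown here as a sketch on made-up values rather than on the pipeline output, is to mask zero actuals so that the affected rows become `NaN` instead of infinite. The `toy` dataframe and its numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values; the second row has a zero actual
toy = pd.DataFrame({
    'Quantity': [100.0, 0.0, 50.0],
    'Predictions': [90.0, 5.0, 60.0],
})

# Keep non-zero actuals and turn zeros into NaN so the division yields NaN, not inf
actuals = toy['Quantity'].where(toy['Quantity'] != 0)
toy['APE'] = 100 * np.abs((toy['Quantity'] - toy['Predictions']) / actuals)

# pandas aggregations such as mean() skip NaN rows by default
mape = toy['APE'].mean()

print(toy)
print(mape)
```

Rows with `NaN` errors are left out of the mean, so it is worth reporting how many rows were excluded when summarizing accuracy this way.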

It is also useful to see the error distributions broken down by store and brand. Here, the boxplots show the distribution of errors over time in the forecast period for each series in the dataset.

In [None]:
fig = sns.boxplot(x='Store', y='APE', hue='Brand', data=df)
fig.set_title('Absolute Percentage Errors by Store and Brand')
plt.gcf().set_size_inches(15.0, 6.0)
plt.show()

## 7.0 *[Optional]* Publish and schedule the pipeline


### 7.1 Publish the pipeline
Once you have a pipeline you're happy with, you can publish it so that you can call it programmatically later on. See this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#publish-a-pipeline) for additional information on publishing and calling pipelines.

In [None]:
# published_pipeline = pipeline.publish(
#     name='many-models-forecasting',
#     description='Many Models forecasting pipeline',
#     version='1',
#     continue_on_step_failure=False
# )

### 7.2 Schedule the pipeline
You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. This could be used to automatically issue forecasts every week.

In [None]:
# from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
# forecasting_pipeline_id = published_pipeline.id

# recurrence = ScheduleRecurrence(frequency="Week", interval=1, start_time="2020-01-01T09:00:00")
# recurring_schedule = Schedule.create(
#     ws, 
#     name="forecasting_pipeline_recurring_schedule", 
#     description="Schedule Forecasting Pipeline to run on the first day of every week",
#     pipeline_id=forecasting_pipeline_id, 
#     experiment_name=experiment.name, 
#     recurrence=recurrence
# )