Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Forecasting Pipeline - Custom Script
---

In this notebook, we create a pipeline that batch-forecasts sales with the models we trained in the previous step. The forecasting pipeline is similar to the training pipeline, so we keep the documentation light. For more details on the steps and functions, refer to the [training notebook](02_CustomScript_Training_Pipeline.ipynb).

### Prerequisites
At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](../00_Setup_AML_Workspace.ipynb)
2. Run [01_Data_Preparation.ipynb](../01_Data_Preparation.ipynb) to set up your compute and create the dataset
3. Run [02_CustomScript_Training_Pipeline.ipynb](02_CustomScript_Training_Pipeline.ipynb) to train the models

Also, please make sure you have the latest version of the Azure ML SDK and install the Pipeline Steps package:

In [None]:
%pip install --upgrade azureml-sdk azureml-pipeline-steps

## 1.0 Connect to workspace and datastore

In [None]:
from azureml.core import Workspace
from azureml.core import Datastore

ws = Workspace.from_config()

# set up datastores
dstore = ws.get_default_datastore()

print('Workspace Name: ' + ws.name, 
      'Azure Region: ' + ws.location, 
      'Subscription Id: ' + ws.subscription_id, 
      'Resource Group: ' + ws.resource_group, sep='\n')

## 2.0 Create an experiment

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'oj_forecasting_customscript_notebook')

## 3.0 Get the dataset

In the [data preparation notebook](../01_Data_Preparation.ipynb), we registered a subset of the orange juice data for prediction purposes. We will now get a reference to that dataset in our datastore. We will use the models trained in the [modeling notebook](02_CustomScript_Training_Pipeline.ipynb) to generate forecasts over all rows in each inference file.

You can choose to run the pipeline on the subset of files or on the full dataset of 11,973 series. If you chose to use only a subset of the files, the inference dataset name is `oj_data_small_inference`. Otherwise, use `oj_data_inference`. We recommend starting with the small dataset.

In [None]:
dataset_name = 'oj_data_small_inference'

In [None]:
from azureml.core.dataset import Dataset

dataset = Dataset.get_by_name(ws, name=dataset_name)
dataset_input = dataset.as_named_input(dataset_name)

## 4.0 Create ParallelRunStep for the forecasting pipeline
As we did with the training pipeline, we'll create a ParallelRunStep to parallelize our forecasting process. You'll notice this code is essentially the same as the last step except that we'll be parallelizing [**batch_forecasting.py**](../../scripts/customscript/batch_forecasting.py) rather than train.py. Note that we still need to pass the timeseries schema (timestamp column name, timeseries ID column names, etc) to the forecasting script.

Unlike the training script, however, the forecasting script does not require the name of the target column. In a true forecasting scenario the actual values of the target are not available, so the forecasting pipeline would return only predictions. However, the forecasting pipeline can also return the actuals if they are present in the inference dataset.
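
To make the contract between ParallelRunStep and its entry script more concrete, here is a minimal sketch of how such a script could be laid out. It is not the actual batch_forecasting.py: the `predict` placeholder, the hard-coded column names, and the space-delimited output layout are assumptions for illustration. The sketch also assumes the input is a file dataset, so each mini-batch is a list of file paths, and it emits the actual value only when the target column is present.

```python
import csv

# Assumed column names for illustration; a real script would read them
# from the settings file rather than hard-coding them.
TIMESTAMP_COL = 'WeekStarting'
TARGET_COL = 'Quantity'


def predict(rows):
    """Placeholder for model inference: returns 0.0 for every row."""
    return [0.0 for _ in rows]


def init():
    """Called once per worker process before any mini-batch is handled."""
    pass


def run(mini_batch):
    """Called once per mini-batch (assumed here to be a list of file paths)."""
    results = []
    for file_path in mini_batch:
        with open(file_path, newline='') as f:
            rows = list(csv.DictReader(f))
        for row, prediction in zip(rows, predict(rows)):
            # Actuals are optional: fall back to NaN when the target is absent.
            actual = row.get(TARGET_COL, 'NaN')
            results.append(f"{row[TIMESTAMP_COL]} {prediction} {actual}")
    # With output_action='append_row', each returned element becomes
    # one row of the combined output file.
    return results
```

Returning one element per input row keeps the combined output file aligned with the inference data, whether or not actuals are available.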

### 4.1 Configure environment for ParallelRunStep

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

forecast_env = Environment(name="many_models_environment")
# 'scikit-learn' is the correct pip package name; the 'sklearn' alias on PyPI is deprecated
forecast_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn', 'pandas', 'joblib', 'azureml-defaults', 'azureml-core', 'azureml-dataprep[fuse]'])
forecast_env.python.conda_dependencies = forecast_conda_deps

### 4.2 Choose a compute target

This is the compute cluster you created in the [setup notebook](../00_Setup_AML_Workspace.ipynb#3.0-Create-compute-cluster).

In [None]:
cpu_cluster_name = "cpucluster"

In [None]:
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, cpu_cluster_name)

### 4.3 Specify forecasting script settings

Many Models batch forecasting will be performed by executing a custom forecasting script in the compute target we just chose.
That script is also located under the [`/scripts/customscript`](../../scripts/customscript/) directory in the repository, together with the training script we saw in the previous step and some others.

In [None]:
batchforecast_script_dir = '../../scripts/customscript/'
batchforecast_script_name = 'batch_forecasting.py'

The batch forecasting script uses a settings file that contains the names of the timestamp and ID columns, as well as a string identifying the model type. We will reuse the settings file we created during training, since the schema and contents are the same.

In [None]:
settings_file = 'customscript_settings.json'
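
As a rough illustration of how a script might consume such a settings file, the sketch below parses a `--settings-file` argument (the flag the pipeline steps in this notebook pass) and loads the JSON. The key names in `example_settings`, the `"lr"` model type value, and the helper names `parse_settings_path` and `load_settings` are hypothetical; the real customscript_settings.json may be organized differently.

```python
import argparse
import json

# Hypothetical example of the settings content; the real file may differ.
example_settings = {
    "timestamp_column": "WeekStarting",
    "timeseries_id_columns": ["Store", "Brand"],
    "model_type": "lr"
}

# Keys this sketch expects to find (an assumption, not the real schema).
REQUIRED_KEYS = ("timestamp_column", "timeseries_id_columns", "model_type")


def parse_settings_path(argv):
    """Extract the value of --settings-file from a list of arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--settings-file', type=str, required=True)
    # parse_known_args tolerates extra arguments the script does not define.
    args, _ = parser.parse_known_args(argv)
    return args.settings_file


def load_settings(path):
    """Load the JSON settings and fail early if an expected key is missing."""
    with open(path) as f:
        settings = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError(f"Settings file is missing keys: {missing}")
    return settings
```

Validating the keys up front makes a misconfigured settings file fail immediately rather than partway through a long batch run.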

### 4.4 Set up ParallelRunConfig

In [None]:
from azureml.pipeline.steps import ParallelRunConfig

process_count_per_node = 6
node_count = 1
timeout = 180

parallel_run_config = ParallelRunConfig(
    source_directory=batchforecast_script_dir,
    entry_script=batchforecast_script_name,
    mini_batch_size='1',
    run_invocation_timeout=timeout,
    error_threshold=10,
    output_action='append_row',
    environment=forecast_env,
    process_count_per_node=process_count_per_node,
    compute_target=compute,
    node_count=node_count
)

### 4.5 Set up ParallelRunStep

In [None]:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import ParallelRunStep 

output_dir = PipelineData(name='forecasting_output', datastore=dstore)

parallel_run_step = ParallelRunStep(
    name="many-models-forecasting",
    parallel_run_config=parallel_run_config,
    inputs=[dataset_input],
    output=output_dir,
    allow_reuse=False,
    arguments=['--settings-file', settings_file]
)

## 5.0 Create step to copy predictions

The forecasting pipeline includes a second step that copies the predictions from *parallel_run_step.txt* to a CSV file in a separate container. While this step is simple, it demonstrates how a step can be added to the pipeline to upload the predictions to a separate datastore or to apply additional transformations to the output.

### 5.1 Create a data reference
First, we create a datastore named **predictions** to hold the outputs of the pipeline and get a reference to it:

In [None]:
from azureml.data.data_reference import DataReference

output_dstore = Datastore.register_azure_blob_container(
    workspace=ws, 
    datastore_name="predictions",
    container_name="predictions",
    account_name=dstore.account_name,
    account_key=dstore.account_key,
    create_if_not_exists=True
)

output_dref = DataReference(output_dstore)

### 5.2 Create PythonScriptStep
Next, we define the [PythonScriptStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py) and give it our newly created datastore as well as the location of *parallel_run_step.txt*. Note that the copy script also uses the settings file: it creates a header row for the prediction data and therefore needs to know the column names.

In [None]:
from azureml.pipeline.steps import PythonScriptStep

upload_predictions_step = PythonScriptStep(
    name="copy_predictions",
    script_name="copy_predictions.py",
    compute_target=compute,
    source_directory=batchforecast_script_dir,
    inputs=[output_dref, output_dir],
    allow_reuse=False,
    arguments=['--parallel_run_step_output', output_dir,
               '--output_dir', output_dref,
               '--settings-file', settings_file]
)
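
The core of such a copy step can be sketched in a few lines. The function below is an illustration rather than the actual copy_predictions.py: the name `copy_predictions` and its parameters are hypothetical, and it assumes the ParallelRunStep output is space-delimited with no header row, which is how this notebook reads the file later on.

```python
import csv


def copy_predictions(src_path, dst_path, column_names):
    """Copy space-delimited, headerless rows to a CSV file with a header.

    Returns the number of data rows written.
    """
    row_count = 0
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter=' ')
        writer = csv.writer(dst)
        # The header comes from the caller, for example from the settings file.
        writer.writerow(column_names)
        for row in reader:
            if not row:
                continue  # skip blank lines
            writer.writerow(row)
            row_count += 1
    return row_count
```

In a pipeline step, `src_path` would point inside the folder passed as `--parallel_run_step_output` and `dst_path` inside the folder passed as `--output_dir`.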

## 6.0 Run the pipeline

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step, upload_predictions_step])
run = experiment.submit(pipeline)

In [None]:
# Uncomment the following code to get a reference to a previous forecasting run
#from azureml.pipeline.core import PipelineRun
#run = PipelineRun(experiment, '<pipeline run id>')

In [None]:
# Wait for the run to be completed
run.wait_for_completion(show_output=False, raise_on_error=True)

## 7.0 View the results of the forecasting pipeline
To see our forecasts, we download *parallel_run_step.txt*, read the results into a dataframe, and visualize the predictions. Note that we could also download the results from the predictions container we created above.

### 7.1 Download parallel_run_step.txt locally
You need to wait until the run submitted to the Azure Machine Learning compute cluster is complete. You can monitor the run status at https://ml.azure.com

In [None]:
import os
from pathlib import Path

def download_predictions(run, target_dir='.', step_name='many-models-forecasting', output_name='forecasting_output'):
    # Locate the ParallelRunStep run inside the pipeline run
    stitch_run = run.find_step_run(step_name)[0]
    # Download the output folder of that step
    port_data = stitch_run.get_output_data(output_name)
    port_data.download(target_dir, show_progress=True, overwrite=True)
    # Path of the downloaded output folder under target_dir
    return os.path.join(target_dir, 'azureml', stitch_run.id, output_name)

file_path = download_predictions(run, 'output')
file_path

### 7.2 Convert the file to a dataframe

In [None]:
import pandas as pd

df = pd.read_csv(os.path.join(file_path, 'parallel_run_step.txt'), sep=" ", header=None)
df.columns = ['WeekStarting', 'Predictions', 'Quantity', 'Store', 'Brand']
df['WeekStarting'] = pd.to_datetime(df['WeekStarting'])
df.head()
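
When the inference data includes actuals, the same dataframe layout makes it possible to compare them with the predictions. The sketch below works under assumptions: the sample rows are made up, the helper name `mean_absolute_error_by` is hypothetical, and mean absolute error is just one possible choice of metric.

```python
import io

import pandas as pd

# Made-up rows in the same space-delimited layout as parallel_run_step.txt.
sample = io.StringIO(
    "1990-06-14 100.0 110.0 1001 dominicks\n"
    "1990-06-21 120.0 115.0 1001 dominicks\n"
    "1990-06-14 200.0 180.0 1001 tropicana\n"
)


def mean_absolute_error_by(frame, group_col,
                           actual_col='Quantity', pred_col='Predictions'):
    """Mean absolute error between actuals and predictions per group."""
    abs_err = (frame[actual_col] - frame[pred_col]).abs()
    return abs_err.groupby(frame[group_col]).mean()


sample_df = pd.read_csv(sample, sep=" ", header=None)
sample_df.columns = ['WeekStarting', 'Predictions', 'Quantity', 'Store', 'Brand']

mae_by_brand = mean_absolute_error_by(sample_df, 'Brand')
print(mae_by_brand)
```

Passing `df` from the cell above instead of `sample_df` would apply the same comparison to the downloaded results, provided the `Quantity` column holds real actuals.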

### 7.3 Visualize the predictions
First, we look at the distribution of predicted quantities by brand:

In [None]:
import seaborn as sns

# violinplot returns a matplotlib Axes; column names are resolved against `data`
ax = sns.violinplot(x='Brand', y='Predictions', data=df)
ax.set_title('Predictions by Brand')

Then we look at those predictions over time:

In [None]:
import matplotlib.pyplot as plt

week = df.groupby(['WeekStarting', 'Brand'])
week = week['Predictions'].sum()
week = pd.DataFrame(week.unstack(level=1))

week.plot()
plt.title('Total Predictions by Brand')
plt.xticks(rotation=40)
plt.legend(loc='upper right')
plt.xlabel('Week')
plt.ylabel('Total Predictions')
plt.show()

From there, we can filter the results to a single store and look at its individual brands:

In [None]:
store = 1001
df_1001 = df[df['Store'] == store]

brands = df_1001.groupby(['WeekStarting', 'Brand'])
brands = brands['Predictions'].sum()
brands = pd.DataFrame(brands.unstack(level=1))

brands.plot()
plt.legend(loc='upper right', labels=brands.columns.values)
plt.xticks(rotation=40)
plt.title('Predictions for Store 1001')
plt.xlabel('Week')
plt.ylabel('Predicted Quantity')
plt.show()

## 8.0 Publish and schedule the pipeline (Optional)


### 8.1 Publish the pipeline
Once you have a pipeline you're happy with, you can publish it so you can call it programmatically later on. See this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#publish-a-pipeline) for additional information on publishing and calling pipelines.

In [None]:
# published_pipeline = pipeline.publish(name = 'forecast_many_models',
#                                      description = 'forecast many models',
#                                      version = '1',
#                                      continue_on_step_failure = False)

### 8.2 Schedule the pipeline
You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. For this pipeline, that could mean generating new forecasts every week or in response to another trigger, such as new data arriving in the datastore.

In [None]:
# from azureml.pipeline.core import Schedule, ScheduleRecurrence

# forecasting_pipeline_id = published_pipeline.id

# recurrence = ScheduleRecurrence(frequency="Week", interval=1, start_time="2020-01-01T09:00:00")
# recurring_schedule = Schedule.create(ws, name="forecasting_pipeline_recurring_schedule",
#                             description="Schedule Forecasting Pipeline to run on the first day of every week",
#                             pipeline_id=forecasting_pipeline_id,
#                             experiment_name=experiment.name,
#                             recurrence=recurrence)