# Automated Machine Learning - Bike Forecasting: Download the Model and Use It in a Notebook

- Initially, connect using `Kernel` -> `Python 3.10 SDK V2`
- Later, you will switch to the `project_environment` kernel

**BikeShare Demand Forecasting**

Recall: https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-automated-ml-forecast

# 1. Connect to Azure Machine Learning Workspace


## 1.1. Import the required libraries

In [None]:
#Required to set to your AutoML Job Name
job_name = 'careful_jelly_wk9dj1jssn'

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl
from azure.ai.ml import Input
import os
import json
import pandas as pd

## 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need three identifiers: a subscription, a resource group, and a workspace name. We pass these details to `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python). Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<>"
    resource_group = "<>"
    workspace = "<>"

    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
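
`MLClient.from_config` reads the workspace identifiers from a `config.json` file found in the starting directory or one of its parents. Below is a minimal sketch of creating such a file, assuming placeholder values (the `<...>` strings are hypothetical and must be replaced with your own identifiers). It writes to a temporary folder so nothing in your project is overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values: replace with your own workspace identifiers.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder here; in practice the file sits next to the notebook.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

print(config_path.read_text())
```

With a real file in place, the `try` branch above should succeed and the manual fallback is not needed.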

#### Show Azure ML Workspace information

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

# 5. Retrieve the Best Trial (Best Model's trial/run)
Use the MLflow `MlflowClient` to access the results (such as models, artifacts, and metrics) of a previously completed AutoML trial.

## 5.1 Initialize MLFlow Client
The models and artifacts produced by AutoML can be accessed through the MLflow interface.
Initialize the MLflow client here and set Azure ML as its tracking backend.

*IMPORTANT*, you need to have installed the latest MLFlow packages with:

    pip install azureml-mlflow

    pip install mlflow


#### Obtain the tracking URI for MLFlow

In [None]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

### Get the AutoML parent Job

- Set your job name

In [None]:
print(job_name)
# Example if providing a specific Job name/ID
# job_name = "591640e8-0f88-49c5-adaa-39b9b9d75531"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

In [None]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

### Get the AutoML best child run

In [None]:
# Get the best model's child run

best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)

## 5.2 Get best model run's validation metrics

Access the results (such as models, artifacts, metrics) of a previously completed AutoML Run.

In [None]:
pd.DataFrame(best_run.data.metrics, index=[0]).T

# 6. Model evaluation and deployment
## 6.1 Download the best model

Download the best run's `outputs` artifacts, which include the MLflow model, to a local folder.

In [None]:
# Create local folder
import os

local_dir = "./artifact_downloads/"
os.makedirs(local_dir, exist_ok=True)

In [None]:
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

In [None]:
# Show the contents of the MLFlow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")

### Featurization
We can look at the engineered feature names generated in time-series featurization via the JSON file named 'engineered_feature_names.json' under the run outputs.

In [None]:
with open(os.path.join(local_path, "engineered_feature_names.json"), "r") as f:
    records = json.load(f)

records

### View featurization summary
You can also see what featurization steps were performed on different raw features in the user data. For each raw feature in the user data, the following information is displayed:

+ Raw feature name
+ Number of engineered features formed out of this raw feature
+ Type detected
+ If feature was dropped
+ List of feature transformations for the raw feature

In [None]:
# Render the JSON as a pandas DataFrame
with open(os.path.join(local_path, "featurization_summary.json"), "r") as f:
    records = json.load(f)
fs = pd.DataFrame.from_records(records)

# View a summary of the featurization
fs[
    [
        "RawFeatureName",
        "TypeDetected",
        "Dropped",
        "EngineeredFeatureCount",
        "Transformations",
    ]
]

## Side Track - Loading and testing best model by locally downloading the model

In [None]:
import os

# Show the contents of the MLFlow model folder
os.listdir("./artifact_downloads/outputs/mlflow-model")

In [None]:
import pandas as pd

test_df = pd.read_csv(
    "./test_dataset/bike-no-test.csv"
).reset_index(drop=True)
y_actual = test_df.pop("cnt").values

test_df.shape, y_actual.shape

In [None]:
test_df.head()

In [None]:
test_df['date'].agg(['min', 'max'])

## Create an environment from the conda.yaml file and use it in the notebook

```
cd BlackHillsEnergy/BHE_automl-forecasting-task-bike-share/
cd 'artifact_downloads'
cd outputs/mlflow-model/
conda env create -f conda.yaml
conda activate project_environment
ipython kernel install --user --name project_environment --display-name "project_environment"
pip install azure-ai-ml
pip install mlflow
pip install azureml-mlflow
pip install --upgrade plotly
```

- After the environment has been made available to Jupyter, refresh this session (press F5 or use your browser's refresh button).

When you go to `Kernel` -> `Change Kernel`, the new environment will be available to select. You will have to rerun the notebook, but when you download the model, you will be using the correct versions of all libraries.

*Note: to remove a conda environment (here, one named `job_env`), run:*

```conda env remove -n job_env```

In [None]:
import mlflow.pyfunc
import mlflow.sklearn

In [None]:
type(test_df)

In [None]:
# Get the MLFlow model from the downloaded MLFlow model files

model = mlflow.sklearn.load_model("./artifact_downloads/outputs/mlflow-model")

In [None]:
# # Forecasting models predict with .forecast() or .forecast_quantiles(), not with .predict()

# y_preds = model.forecast(test_df, ignore_data_errors=True)

# y_preds

# # Original forecasting with .forecast_quantiles(X_test)
# # https://github.com/Azure/azureml-examples/blob/main/python-sdk/tutorials/automl-with-azureml/forecasting-energy-demand/forecasting_script.py

## Forecasting from trained model

[https://github.com/Azure/azureml-examples/blob/main/v1/python-sdk/tutorials/automl-with-azureml/forecasting-forecast-function/auto-ml-forecasting-function.ipynb]

There are two scenarios:
- 1. Forecasting right after the training data
- 2. More complex: forecasting when there is a gap between the training and testing data

### Scenario One

We have time to retrain the model every time we wish to forecast. This suits forecasts made at a daily or slower cadence.
Retraining the model every time benefits accuracy because the most recent data is often the most informative.


![image info](./predict_no_gap.png)

In [None]:
import pandas as pd
X_test = pd.read_csv("./test_dataset/bike-no-test.csv", parse_dates=[model.time_column_name])
y_test = X_test.pop("cnt").values

y_pred_no_gap, xy_nogap =  model.forecast(X_test)
xy_nogap

In [None]:
y_pred_no_gap

## Confidence intervals

A forecasting model can also produce prediction intervals through `forecast_quantiles()`. This method accepts the same parameters as `forecast()`.

In [None]:
## Confidence intervals

quantiles = model.forecast_quantiles(X_test)
quantiles

## Distribution forecasts

Often the figure of interest is not just the point prediction, but the prediction at some quantile of the distribution. This arises when the forecast is used to control some kind of inventory, for example of grocery items or virtual machines for a cloud service. In such case, the control point is usually something like "we want the item to be in stock and not run out 99% of the time". This is called a "service level". Here is how you get quantile forecasts.

In [None]:
# specify which quantiles you would like
model.quantiles = [0.01, 0.5, 0.95]
# use forecast_quantiles function, not the forecast() one
y_pred_quantiles = model.forecast_quantiles(X_test)

# quantile forecasts are returned in a DataFrame along with the time and time series id columns
y_pred_quantiles
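
To make the service-level idea concrete, here is a small sketch on synthetic data (not the bike-share model output). It treats the 0.95 quantile forecast as the stocking level and measures how often simulated demand stayed at or below it. The column layout, the normal noise model, and all numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 1000

# Synthetic "true" demand: mean 100, standard deviation 10 (assumed for illustration).
demand = rng.normal(loc=100.0, scale=10.0, size=n_days)

# Hypothetical quantile forecasts, laid out with one column per quantile.
# 1.645 is approximately the 0.95 quantile of the standard normal distribution.
quantile_forecasts = pd.DataFrame(
    {0.5: np.full(n_days, 100.0), 0.95: np.full(n_days, 100.0 + 1.645 * 10.0)}
)

# Stock to the 0.95 quantile and check how often demand was covered.
stock_level = quantile_forecasts[0.95]
achieved_service_level = float((demand <= stock_level).mean())
print(f"Achieved service level: {achieved_service_level:.3f}")
```

If the quantile forecast is well calibrated, the achieved service level should land close to the nominal 95%.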

## Destination-date forecast: "just do something"

In some scenarios, **X_test** is not known. The forecast is likely to be weak because it is missing contemporaneous predictors, which must be imputed. If you still wish to predict forward under the assumption that the last known values will be carried forward, you can forecast out to a "destination date". The destination date still needs to fit within the forecast horizon from training.

In [None]:
dest = max(X_test['date'])
print(dest)

y_pred_dest, xy_dest = model.forecast(forecast_destination=dest)

# This form also shows how we imputed the predictors which were not given. (Not so well! Use with caution!)
xy_dest
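
The "last known values carried forward" assumption can be pictured with plain pandas. This sketch is only an illustration on a made-up frame (the `temp` and `hum` columns and the dates are hypothetical); AutoML's internal imputation may differ in detail.

```python
import pandas as pd

# Made-up history with two predictor columns.
history = pd.DataFrame(
    {
        "date": pd.date_range("2012-08-29", periods=3, freq="D"),
        "temp": [0.70, 0.72, 0.68],
        "hum": [0.55, 0.60, 0.58],
    }
).set_index("date")

# Extend the index out to a destination date and carry the last row forward.
destination = pd.Timestamp("2012-09-03")
future_index = pd.date_range(history.index.min(), destination, freq="D")
extended = history.reindex(future_index).ffill()

print(extended)
```

Every future row repeats the final observed predictors, which is why such forecasts should be used with caution.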

## Forecasting away from training data

Suppose we trained a model, some time passed, and now we want to apply the model without re-training. If the model "looks back" -- uses previous values of the target -- then we somehow need to provide those values to the model.

**Won't cover right now, but good to know that it is an option**

[https://github.com/Azure/azureml-examples/blob/main/v1/python-sdk/tutorials/automl-with-azureml/forecasting-forecast-function/auto-ml-forecasting-function.ipynb]


![image info](./ForecastAwayfromTraining.png)

The notion of forecast origin comes into play: the forecast origin is **the last period for which we have seen the target value**. This applies per time-series, so each time-series can have a different forecast origin.

The part of data before the forecast origin is the **prediction context**. To provide the context values the model needs when it looks back, we pass definite values in `y_test` (aligned with corresponding times in `X_test`).
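
A small sketch of these two ideas with pandas, on a made-up frame holding two time series (the `store` id column and all values are hypothetical). The forecast origin is the last timestamp with a known target in each series, and the rows up to it form the prediction context.

```python
import numpy as np
import pandas as pd

# Two made-up series; NaN marks periods whose target is not yet known.
df = pd.DataFrame(
    {
        "store": ["A"] * 4 + ["B"] * 4,
        "date": list(pd.date_range("2012-09-01", periods=4, freq="D")) * 2,
        "cnt": [10.0, 12.0, np.nan, np.nan, 7.0, 8.0, 9.0, np.nan],
    }
)

# Forecast origin: last date with a known target, computed per series.
forecast_origin = df.dropna(subset=["cnt"]).groupby("store")["date"].max()

# Prediction context: rows at or before each series' forecast origin.
is_context = df["date"] <= df["store"].map(forecast_origin)

print(forecast_origin)
print(df[is_context])
```

Note that the two series end up with different forecast origins, matching the per-series definition above.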

# Rolling Forecast

 
[https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast#evaluating-model-accuracy-with-a-rolling-forecast]

A best practice procedure is a rolling evaluation that rolls the trained forecaster forward in time over the test set, averaging error metrics over several prediction windows. Ideally, the test set for the evaluation is long relative to the model's forecast horizon. Estimates of forecasting error may otherwise be statistically noisy and, therefore, less reliable.

[https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast#forecasting-with-a-trained-model]

### Often used for evaluation

We use **known actual values** of the target for our context data.

The step size for the **rolling forecast** is set to one which means that the forecaster is advanced one period, or one day in our demand prediction example, at each iteration. The total number of forecasts returned by rolling_forecast depends on the length of the test set and this step size.
    

In [None]:
# Make a rolling forecast, advancing the forecast origin by 1 period on each iteration through the test set
X_test = pd.read_csv("./test_dataset/bike-no-test.csv", parse_dates=[model.time_column_name])
y_test = X_test.pop("cnt").values
    
X_rf = model.rolling_forecast(X_test, y_test, step=1, ignore_data_errors=True)

In [None]:
X_rf

In [None]:
# Add predictions, actuals, and horizon relative to rolling origin to the test feature data
assign_dict = {
    model.forecast_origin_column_name: "forecast_origin",
    model.forecast_column_name: "predicted",
    model.actual_column_name: "cnt",
}
X_rf.rename(columns=assign_dict, inplace=True)
# Drop rows where the prediction or the actual is NaN; this happens because of missing actuals
# or at the edges of the time range due to lags/rolling windows
X_rf.dropna(inplace=True)
print(f"The predictions have {X_rf.shape[0]} rows and {X_rf.shape[1]} columns.")

In [None]:
X_rf
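
Once the rolling forecast has columns for the forecast origin, the prediction, and the actual value, error metrics can be averaged per horizon. Below is a minimal sketch on a made-up frame that reuses the renamed column names (`forecast_origin`, `predicted`, `cnt`) and assumes a `date` time column; the values are invented for illustration and the choice of MAPE is only one option.

```python
import pandas as pd

# Made-up rolling-forecast output: two origins, two-step horizon each.
rf = pd.DataFrame(
    {
        "forecast_origin": pd.to_datetime(
            ["2012-09-01", "2012-09-01", "2012-09-02", "2012-09-02"]
        ),
        "date": pd.to_datetime(
            ["2012-09-02", "2012-09-03", "2012-09-03", "2012-09-04"]
        ),
        "predicted": [110.0, 90.0, 100.0, 120.0],
        "cnt": [100.0, 100.0, 100.0, 100.0],
    }
)

# Horizon in days relative to the forecast origin.
rf["horizon"] = (rf["date"] - rf["forecast_origin"]).dt.days

# Absolute percentage error per row, then the mean per horizon (MAPE, in percent).
rf["ape"] = (rf["predicted"] - rf["cnt"]).abs() / rf["cnt"].abs() * 100
mape_by_horizon = rf.groupby("horizon")["ape"].mean()

print(mape_by_horizon)
```

Grouping by horizon shows whether accuracy degrades as the forecast reaches further from its origin.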

## Reviewing

This is equivalent to calling `forecast()` with `X_test`, since there is no gap in the data.

In [None]:
y_pred_no_gap, xy_nogap = model.forecast(X_test)

In [None]:
xy_nogap

In [None]:
y_pred_no_gap

## Prediction into the future


Confidence intervals and distributional forecasts

AutoML cannot currently estimate forecast errors beyond the forecast horizon set during training, so `forecast_quantiles()` returns missing values for quantiles other than 0.5 beyond the forecast horizon.


**forecast_quantiles()** generates forecasts for the given quantiles of the prediction distribution.



In [None]:
quantiles = [0.025, 0.5, 0.975]
predicted_column_name = "predicted"
PI = "prediction_interval"
model.quantiles = quantiles
pred_quantiles = model.forecast_quantiles(X_test, ignore_data_errors=True)
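# Use .iloc so the lower/upper bounds are picked by position, not by the float column labels
pred_quantiles[PI] = pred_quantiles[[min(quantiles), max(quantiles)]].apply(
    lambda x: "[{}, {}]".format(x.iloc[0], x.iloc[1]), axis=1
)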

In [None]:
target_column_name = "cnt"
X_test[target_column_name] = y_test
X_test[PI] = pred_quantiles[PI]
X_test[predicted_column_name] = pred_quantiles[0.5]
# drop rows where prediction or actuals are nan
# happens because of missing actuals
# or at edges of time due to lags/rolling windows
clean = X_test[X_test[[target_column_name, predicted_column_name]].notnull().all(axis=1)]
clean
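
One way to sanity-check a prediction interval is its empirical coverage: the share of actuals that fall between the lower and upper quantile forecasts. Here is a sketch on made-up numbers (the frame below is hypothetical and only mimics a quantile-forecast layout with one column per quantile).

```python
import pandas as pd

# Made-up quantile forecasts and actuals for five periods.
pred = pd.DataFrame(
    {
        0.025: [80.0, 85.0, 90.0, 70.0, 95.0],
        0.5: [100.0, 105.0, 110.0, 90.0, 115.0],
        0.975: [120.0, 125.0, 130.0, 110.0, 135.0],
    }
)
actual = pd.Series([101.0, 130.0, 95.0, 88.0, 96.0])

# An actual is covered when it lies inside [lower, upper].
covered = actual.between(pred[0.025], pred[0.975])
coverage = covered.mean()

print(f"Empirical coverage of the nominal 95% interval: {coverage:.0%}")
```

On a long enough test set, coverage far below the nominal level suggests the intervals are too narrow.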

## Single point Prediction

In [None]:
import numpy as np

X_test = pd.read_csv("./test_dataset/bike-no-test.csv", parse_dates=[model.time_column_name])
y_test = X_test.pop("cnt").values

# np.float was removed in NumPy 1.24; the built-in float is equivalent
label_query = y_test.copy().astype(float)
label_query.fill(np.nan)

#single point prediction
df = model.forecast_quantiles(forecast_destination=pd.Timestamp(2012, 9, 2))

# Get forecasts for the 5th, 50th, and 90th percentiles 
model.quantiles = [0.05, 0.5, 0.9]
df2 = model.forecast_quantiles(forecast_destination=pd.Timestamp(2013, 12, 1))

In [None]:
df

In [None]:
df2