# PRODUCTION phase: About this notebook
- Purpose: Creates 1 PIPELINE to serve the model.
    - `Batch scoring pipeline:` Fetches the best trained model and builds an `Azure Machine Learning pipeline` that batch scores the data on a schedule or trigger

## DETAILS - about this notebook and the pipeline it generates
- 1) Initiate ESMLPipelineFactory:
- 2) `AUTO-GENERATE code: a snapshot folder` via ESML, that generates Python scripts and the `ESML runtime`
    - azure-enterprise-scale-ml\2_A_aml_pipeline\4_inference\batch\\`M11`
        - Edit the feature engineering files if needed
            - azure-enterprise-scale-ml\2_A_aml_pipeline\4_inference\batch\\`M11\your_code\your_custom_code.py`
            - `your_custom_code.py` is referenced from all the `in_2_silver_...` files, such as: 2_A_aml_pipeline\4_inference\batch\M11\\`in2silver_ds01_diabetes.py`  and `silver_merged_2_gold`
- 3) `BUILDS the pipeline` of type `IN_2_GOLD_SCORING`
    - `An Azure Machine Learning pipeline` with steps will be auto-generated by ESML, based on your `lake_settings.json` dataset array.
    - 3b) BUILDS a `batch scoring pipeline` of ESML type `IN_2_GOLD_SCORING`
- 4) `EXECUTES the pipeline` (smoke testing purpose - see that it works...)
    - 4b) Batch scoring pipeline (`IN_2_GOLD_SCORING`)
        - Feature engineering of each in-data - via `IN_2_SILVER` step (sample data is needed here; otherwise you get a `StreamAccessException`)
        - Merges all SILVERS to `GOLD`
        - Score data: Fetches the best trained model (the leading model) to score with
        - Saves scored data to the datalake, and writes metadata about WHAT data was scored, WHEN it was scored, and WHAT model_version was used.
- 5) PUBLISH the pipeline
    - Purpose: Now that the pipeline is `smoke tested`, we can publish it, to get a `pipeline_id to use in Azure Data factory`
    - PRINT the pipeline ID after publish also
- DONE.
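
As a rough illustration of step 3, the sketch below derives the step order of an `IN_2_GOLD_SCORING` pipeline from a list of dataset names. It is not ESML code: the function `planned_step_names` is hypothetical, and the step names simply mirror those printed in the run log of cell 4a in this notebook.

```python
def planned_step_names(dataset_names):
    """Return the step names of an IN_2_GOLD_SCORING pipeline, in execution order.

    Hypothetical helper for illustration only; ESML generates the real steps.
    """
    # One IN 2 SILVER step per dataset in the dataset array
    steps = [f"IN 2 SILVER - {name}" for name in dataset_names]
    # All silver datasets are merged into one gold dataset
    steps.append("SILVER MERGED 2 GOLD")
    # The gold dataset is scored with the fetched model
    steps.append("SCORING GOLD")
    return steps


for step in planned_step_names(["ds01_diabetes", "ds02_other"]):
    print(step)
```

Adding a dataset to the array would add one more `IN 2 SILVER` step; the merge and scoring steps stay single.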
    

Note: This notebook is called: `M11_v143_esml_regression_batch_scoring.ipynb` in the notebook_templates folder
 

# TODO for you: CONFIGURATION
- 1) Change `p.active_model=11` to the correct model number, e.g. `1`, if your model has that number.
    - See  [lake_settings.json](./settings/project_specific/model/lake_settings.json) to find YOUR model number.
- 2) After you run the cell [2) AUTO-GENERATE code: a snapshot folder](#2_generate_snapshot_folder), you need to add YOUR feature engineering logic
    -  This code you probably already have, from the R&D phase, in this CUSTOMIZE cell in the notebook: [1_R&D_phase_M10_M11.ipynb](./1_quickstart/1_R&D_phase_M10_M11.ipynb)
        - You need to add this code to `your_custom_code.py` after you have generated the snapshot folder, so that it is reachable and uploaded at pipeline creation.
        - Tip: You can CREATE A CLASS, and add static methods, e.g. `ds01_process_in2silver(dataframe1)`  in the `your_custom_code.py` 
- 3) Once your code is in `your_custom_code.py`, you need to reference it from the auto-generated pipeline step files, such as `in2silver_ds01_diabetes.py`
    - Note: The snapshot folder will not exist until you have run the first 2 cells in this notebook, up to and including the cell [2) AUTO-GENERATE code: a snapshot folder](#2_generate_snapshot_folder)
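
A minimal sketch of the "class with static methods" tip, under stated assumptions: the class name `CustomFeatureEngineering` is hypothetical, `dataframe1` is assumed to be a pandas DataFrame, and dropping rows with missing values is only a placeholder for your own R&D logic. The method name `ds01_process_in2silver` comes from the tip above.

```python
class CustomFeatureEngineering:  # hypothetical class name, pick your own
    """Feature engineering shared by the auto-generated pipeline step scripts."""

    @staticmethod
    def ds01_process_in2silver(dataframe1):
        # Assumption: dataframe1 is a pandas DataFrame.
        # Placeholder logic: drop rows with missing values.
        # Replace this with the feature engineering from your R&D notebook.
        cleaned = dataframe1.dropna()
        return cleaned
```

A step script could then call `CustomFeatureEngineering.ds01_process_in2silver(df)` on the dataframe it has loaded. A static method needs no instance, so the call site stays a one-liner.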

## 1) Initiate ESMLPipelineFactory (Always run this CELL below)
- To attach ESML controlplane to your project
- To point at `template-data` for the pipeline to know the schema of data.
    - NB! Azure machine learning pipelines need sample data. You need to have sample-data underneath the datalake folder structure:
    - `1` is recommended for `model_version folder`
    - `1000-01-01 00:00:00.243860` is recommended for `date_folder`
    - Example: project002/11_diabetes_model_reg/inference/`1`/ds01_diabetes/in/dev/`1000/01/01/`
- To init the ESMLPipelineFactory
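
To make the sample-data folder convention concrete, here is a small sketch that assembles the example path above from its parts. The helper `sample_data_folder` and its parameter names are hypothetical (not part of ESML); the layout is inferred from the single example path given above, so verify it against your own datalake.

```python
from datetime import datetime


def sample_data_folder(project, model_folder, model_version, dataset, env, date_folder):
    """Build the 'in' folder path for sample data (hypothetical helper).

    date_folder uses the timestamp format from this notebook,
    e.g. '1000-01-01 00:00:00.243860'; only year/month/day end up in the path.
    """
    dt = datetime.strptime(date_folder, "%Y-%m-%d %H:%M:%S.%f")
    date_path = f"{dt.year:04d}/{dt.month:02d}/{dt.day:02d}"
    return f"{project}/{model_folder}/inference/{model_version}/{dataset}/in/{env}/{date_path}/"


print(sample_data_folder(
    project="project002",
    model_folder="11_diabetes_model_reg",
    model_version=1,
    dataset="ds01_diabetes",
    env="dev",
    date_folder="1000-01-01 00:00:00.243860",
))
# prints: project002/11_diabetes_model_reg/inference/1/ds01_diabetes/in/dev/1000/01/01/
```

If the pipeline raises a `StreamAccessException`, comparing a path built this way with the actual folder in the lake is a quick first check.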

In [1]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from esml import ESMLProject
from baselayer_azure_ml_pipeline import ESMLPipelineFactory, esml_pipeline_types
 
p = ESMLProject() # Will search in ROOT for your copied SETTINGS folder '../settings/model/active/active_scoring_in_folder.json',
p.inference_mode = True
p.active_model = 11 # 10=titanic , 11=Diabetes
p_factory = ESMLPipelineFactory(p)

# Azure Machine Learning pipelines need sample data to know the schema
# model_version=0 means that ESML will find the LATEST PROMOTED/best model, instead of scoring with a specific Model.version. It will then read data from the .../inference/0/... folder
model_version = 5 # 5 = DatabricksPipeline, 36=AutoMLStep, 16=PythonPipeline, 1-4 = AutoMLRun (No pipeline)
p_factory.batch_pipeline_parameters[0].default_value = model_version 

training_datefolder = '1000-01-01 10:35:01.243860' # Will override 'active_scoring_in_folder.json'
p_factory.batch_pipeline_parameters[1].default_value = training_datefolder # overrides ESMLProject.date_scoring_folder.

Using lake_settings.json with ESML version 1.4 - Models array support including LABEL


## "Once a day" - run the cell below to ensure Azure ML v1 legacy mode

print("NB! You only need to run the command below once a day. After that you can disable this cell by commenting out the code lines.")
print("")
# Set LEGACY mode - Azure ML v1 - since private link and DatabricksStep
p.ws = p.get_workspace_from_config()
p.ws.update(v1_legacy_mode=True) # If you happen to have a workspace in v2 mode, and want to change back to v1 legacy mode

# 2) `AUTO-GENERATE code: a snapshot folder`
<a id='2_generate_snapshot_folder'></a>

In [2]:
## Generate CODE - then edit it to get correct environments
p_factory.create_dataset_scripts_from_template(overwrite_if_exists=False) # Do this once, then edit them manually. overwrite_if_exists=False is DEFAULT

Did NOT overwrite script-files with template-files such as 'scoring_gold.py', since overwrite_if_exists=False


# 3) `BUILD the pipeline`

Take note of the `esml_pipeline_types` value below: esml_pipeline_types.`IN_2_GOLD_SCORING`

In [3]:
## BUILD
batch_pipeline = p_factory.create_batch_pipeline(esml_pipeline_types.IN_2_GOLD_SCORING) # Note the esml_pipeline_types

Using GEN2 as Datastore
use_project_sp_2_mount: True
Environment ESML-AzureML-144-AutoML_126 exists
Using Azure ML Environment: 'ESML-AzureML-144-AutoML_126' as primary environment for PythonScript Steps
ESML will auto-create a compute...
Note: OVERRIDING enterprise performance settings with project specifics. (to change, set flag in 'dev_test_prod_settings.json' -> override_enterprise_settings_with_model_specific=False)
Using a model specific cluster, per configuration in project specific settings, (the integer of 'model_number' is the base for the name)
Note: OVERRIDING enterprise performance settings with project specifics. (to change, set flag in 'dev_test_prod_settings.json' -> override_enterprise_settings_with_model_specific=False)
Found existing cluster prj001-m11-dev for project and environment, using it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
image_build_compute = prj001-m11-dev
Initiated DEFAULT compute - for

# 4a) `Execute the pipeline (smoke testing)`

In [4]:
## RUN for smoke testing purpose, to see that it works during runtime
pipeline_run = p_factory.execute_pipeline(batch_pipeline) # Tip: Pointing at the wrong folder for the sample data is the most common error "StreamAccessException"
pipeline_run.wait_for_completion(show_output=False)

Execute_pipeline (scoring): Inference_mode: 1
-Scoring data, default value 1000-01-01 10:35:01.243860
Adding pipeline parameters
Created step IN 2 SILVER - ds01_diabetes [846d43ed][0a355e02-dd47-4150-8bf0-aabb29f1d3f3], (This step will run and generate new outputs)
Created step IN 2 SILVER - ds02_other [6cd4cea6][0f53b9ca-97f7-408b-8030-3884205b6fdf], (This step will run and generate new outputs)
Created step SILVER MERGED 2 GOLD [8c095514][87d709b8-7d3a-4901-b7cb-d8e595afcba9], (This step will run and generate new outputs)
Created step SCORING GOLD [e4899d86][7fc1be0c-fc76-463c-958e-379d36566f44], (This step will run and generate new outputs)
Submitted PipelineRun d910a22e-a254-4e15-a103-2c045b41e568
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/d910a22e-a254-4e15-a103-2c045b41e568?wsid=/subscriptions/50ef5835-c45a-4c2e-a596-2a9e0e2a0a33/resourcegroups/dc-heroes-esml-project001-weu-DEV-001-rg/workspaces/aml-prj001-weu-DEV-001&tid=846f02b7-f92a-4053-9a99-094e5ba2e1a4

# 4b) See the RESULTS: Metadata about SCORING & actual SCORING

In [None]:
from azureml.core import Dataset
import pandas as pd

ds_name ="{}_GOLD_SCORED_RUNINFO".format(p.ModelAlias)
meta_ds= Dataset.get_by_name(workspace=p.ws,name=ds_name, version='latest')
pd.set_option('display.max_colwidth', None)
meta_ds.to_pandas_dataframe().head()

Unnamed: 0,pipeline_run_id,scored_gold_path,date_in_parameter,date_at_pipeline_run,model_version,used_model_version,used_model_name
0,34cdc542-6b8e-454d-afa5-47fb243e48e6,projects/project001/11_diabetes_model_reg/inference/0/scored/dev/1000/01/01/34cdc542-6b8e-454d-afa5-47fb243e48e6/,1000-01-01 10:35:01.243860,2023-01-13 15:29:47.611164,0,2,11_diabetes_model_reg
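
The run-info row above can be read as follows: `model_version` is what was requested and `used_model_version` is what ESML actually scored with. The sketch below parses a copy of that row with the standard library only (the `Unnamed: 0` index column is left out). It is illustrative, not part of ESML; the interpretation of `0` as "latest promoted model" follows the comment in cell 1.

```python
import csv
import io

# Copy of the run-info output above, without the pandas index column
run_info_csv = """pipeline_run_id,scored_gold_path,date_in_parameter,date_at_pipeline_run,model_version,used_model_version,used_model_name
34cdc542-6b8e-454d-afa5-47fb243e48e6,projects/project001/11_diabetes_model_reg/inference/0/scored/dev/1000/01/01/34cdc542-6b8e-454d-afa5-47fb243e48e6/,1000-01-01 10:35:01.243860,2023-01-13 15:29:47.611164,0,2,11_diabetes_model_reg
"""

row = next(csv.DictReader(io.StringIO(run_info_csv)))
requested = int(row["model_version"])
used = int(row["used_model_version"])

if requested == 0:
    # 0 = no specific version requested; the latest promoted model was looked up
    print("Requested latest promoted model, scored with version", used)
else:
    print("Requested version", requested, "and scored with version", used)

print("Scored data was written to:", row["scored_gold_path"])
```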


In [None]:
from azureml.core import Dataset
from azureml.data import FileDataset
import pandas as pd

ds_name = "{}_GOLD_SCORED".format(p.ModelAlias)
print("ds_name", ds_name)
scored_ds = Dataset.get_by_name(workspace=p.ws, name=ds_name, version='latest')

if isinstance(scored_ds, FileDataset):
    print("FileDataset = True")
    # download() returns the local paths of the downloaded files
    paths = scored_ds.take(1).download('./data_temp/', overwrite=True)
    #df = pd.DataFrame(scored_ds.to_path())
    df = pd.DataFrame(paths)
    print(df.head())  # print explicitly: a bare expression inside an if-block is not displayed
else:
    print("TabularDataset = True")
    print(scored_ds.to_pandas_dataframe().head())

ds_name M11_GOLD_SCORED
TabularDataset = True
        AGE       SEX       BMI        BP        S1        S2        S3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         S4        S5        S6  prediction  
0 -0.002592  0.019908 -0.017646  211.647599  
1 -0.039493 -0.068330 -0.092204   49.926834  
2 -0.002592  0.002864 -0.025930  224.845740  
3  0.034309  0.022692 -0.009362  160.014210  
4 -0.002592 -0.031991 -0.046641   91.652218  


# 5) PUBLISH the SCORING pipeline & PRINT its ID

In [7]:
# PUBLISH
published_pipeline, endpoint = p_factory.publish_pipeline(batch_pipeline,"_1") # "_1" is optional: it creates a NEW pipeline with 0 history, instead of ADDING a version to the existing pipeline & endpoint

# PRINT: Get info to use in Azure data factory
- `published_pipeline.id` (if private Azure ML workspace)

In [8]:
print("2) Fetch scored data: Below needed for Azure Data factory PIPELINE activity (Pipeline OR Endpoint. Choose the latter") 
print ("- Endpoint ID")
print("Endpoint ID:  {}".format(endpoint.id))
print("Endpoint Name:  {}".format(endpoint.name))
print("Experiment name:  {}".format(p_factory.experiment_name))

print("In AZURE DATA FACTORY - This is the ID you need, if using PRIVATE LINK, private Azure ML workspace.")
print("-You need PIPELINE id, not pipeline ENDPOINT ID ( since cannot be chosen in Azure data factory if private Azure ML)")
published_pipeline.id

2) Fetch scored data: Below needed for Azure Data factory PIPELINE activity (Pipeline OR Endpoint. Choose the latter
- Endpoint ID
Endpoint ID:  aedd1e4c-a3eb-40ae-b9c2-a93b6b576523
Endpoint Name:  11_diabetes_model_reg_pipe_IN_2_GOLD_SCORING_EP_1
Experiment name:  11_diabetes_model_reg_pipe_IN_2_GOLD_SCORING
In AZURE DATA FACTORY - This is the ID you need, if using PRIVATE LINK, private Azure ML workspace.
-You need PIPELINE id, not pipeline ENDPOINT ID ( since cannot be chosen in Azure data factory if private Azure ML)


'1930f46b-69ab-4aec-9e93-f6d5998c9e7c'

 # DONE! Next step would be

 - Q: `Next step in PRODUCTION phase after the 2a and 3a or 3b notebooks are done?`

1) Go to your ESMLProjects `Azure data factory`, and use the `ESML DataOps templates` (Azure data factory templates) for `IN_2_GOLD_SCORING`
    - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\adf\v1_3\PROJECT000\LakeOnly\`STEP03_IN_2_GOLD_SCORING.zip`
2) Go to the next notebook, in the `mlops` folder, to set up `CI/CD` in Azure DevOps
    - Import this in Azure DevOps
        azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\mlops\01_template_v14\azure-devops-build-pipeline-to-import\\`ESML-v14-project002_M11-DevTest.json`
    - Change the Azure DevOps `VARIABLES` for service principal, tenant, etc.
    - Change the parameters in the `inline Azure CLI script` to the model you want to work with, and the data you want to train with or score.
        - File: `31-deploy_and_smoketest_batch_scoring.py`
        - INLINE code: `--esml_model_number 11 --esml_date_utc "1000-01-01 10:35:01.243860"`
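
For orientation, the sketch below shows one way a script could parse the two inline arguments above. It is an assumption-laden illustration, not the actual content of `31-deploy_and_smoketest_batch_scoring.py`: only the flag names and example values come from the inline code; the function names and the timestamp parsing are hypothetical.

```python
import argparse
from datetime import datetime


def utc_timestamp(text):
    """Parse the ESML date folder format, e.g. '1000-01-01 10:35:01.243860'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_args(argv=None):
    # Hypothetical parser: flag names taken from the inline Azure CLI script above
    parser = argparse.ArgumentParser(description="Batch scoring smoke test (sketch)")
    parser.add_argument("--esml_model_number", type=int, required=True,
                        help="Model number, as in lake_settings.json")
    parser.add_argument("--esml_date_utc", type=utc_timestamp, required=True,
                        help="Date folder of the data to score, in UTC")
    return parser.parse_args(argv)


args = parse_args(["--esml_model_number", "11",
                   "--esml_date_utc", "1000-01-01 10:35:01.243860"])
print("model number:", args.esml_model_number)
print("date folder :", args.esml_date_utc)
```

Parsing the date up front means a malformed timestamp fails at argument parsing, before any pipeline is submitted.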