# PRODUCTION phase: About this notebook
- Purpose: Creates 1 of the 2 PIPELINES
    - `2a) training pipeline:` TRAINS a model with Azure AutoML on an Azure compute cluster, calculates test_set scoring, and automatically compares whether the newly trained model is better than the current one.
    
## DETAILS - about this notebook and the 2a pipeline, generated
- 1) Initiate ESMLPipelineFactory:
- 2) `AUTO-GENERATE code: a snapshot folder` via ESML, that generates Python scripts and the `ESML runtime`
    - 2_A_aml_pipeline\4_inference\batch\\`M11`
        - Edit the feature engineering files if needed
            - 2_A_aml_pipeline\4_inference\batch\\`M11\your_code\your_custom_code.py`
            - `your_custom_code.py` is referenced from all the `in_2_silver_...` files such as: 2_A_aml_pipeline\4_inference\batch\M11\\`in2silver_ds01_diabetes.py` and `silver_merged_2_gold`
        - Edit the AutoML train file; you need to add some configuration. See the instructions in the notebook cells below.
            - 2_A_aml_pipeline\4_inference\batch\\`M11\train_post_automl_step.py`
- 3) `BUILDS the pipeline` of type `IN_2_GOLD_TRAIN_AUTOML`
    - An `Azure machine learning pipeline` with steps is generated automatically, based on the dataset array in your `lake_settings.json`.
    - 3a) BUILDS a `training pipeline` of ESML type `IN_2_GOLD_TRAIN_AUTOML`
- 4) `EXECUTES the pipeline` (smoke testing purpose - see that it works...)
    - 4a) Training pipeline: (`IN_2_GOLD_TRAIN_AUTOML`) steps:
        - Feature engineering of each in-data - via `IN_2_SILVER` steps.
        - Merges all SILVERS to `GOLD`
        - Splits the `GOLD` to 3 buckets: `GOLD_TRAIN, GOLD_VALIDATE, GOLD_TEST`
        - Trains model
        - Registers the newly trained model, tags it as `newly_trained`
        - Calculates test_set scoring with the `ESMLTestScoringFactory`
        - `INNER LOOP MLOps:` Compares in current environment `DEV` if model should be promoted, based on `test_set_scoring`
        - `OUTER LOOP MLOps:` Compares in next environment `TEST` if model should be promoted, based on `test_set_scoring` 
            - E.g. compares best model in `DEV` with the leading model in `TEST`
- 5) PUBLISH the pipeline
    - Purpose: Now that the pipeline is `smoke tested`, we can publish it, to get a `pipeline_id to use in Azure Data factory`
    - We also want to PRINT the `pipeline ID` after publishing, for easy access from `Azure data factory` when retraining continuously on new data (DataOps & MLOps)
- DONE.
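
The promotion decision in the `INNER LOOP` and `OUTER LOOP` steps above can be pictured as a comparison of test_set scores. The sketch below is only an illustration of that idea, not ESML's implementation: the function name `should_promote`, the metric names, the score values and the `higher_is_better` flag are all assumptions made for this example.

```python
from typing import Dict, Optional


def should_promote(new_scores: Dict[str, float],
                   leading_scores: Optional[Dict[str, float]],
                   metric: str = "rmse",
                   higher_is_better: bool = False) -> bool:
    """Return True if the newly trained model beats the leading model on `metric`."""
    if leading_scores is None:
        # No model exists yet in the target environment: promote the first one.
        return True
    new_value = new_scores[metric]
    leading_value = leading_scores[metric]
    if higher_is_better:
        return new_value > leading_value
    return new_value < leading_value


# Hypothetical test_set scores for a regression model (lower RMSE is better).
newly_trained = {"rmse": 52.1, "r2": 0.48}
leading_in_dev = {"rmse": 55.7, "r2": 0.44}
leading_in_test = {"rmse": 50.3, "r2": 0.51}

promote_in_dev = should_promote(newly_trained, leading_in_dev)    # inner loop (DEV)
promote_to_test = should_promote(newly_trained, leading_in_test)  # outer loop (TEST)
print(promote_in_dev, promote_to_test)
```

With these made-up numbers the new model wins in `DEV` but does not replace the leading model in `TEST`, which is the situation the outer loop is meant to catch.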
    

Note: This notebook is called `M11_v143_esml_regression_batch_train_automl.ipynb` in the notebook_templates folder
 

## 1) Initiate ESMLPipelineFactory (Always run this CELL below)
- To attach the ESML controlplane to your project
- To point at `template-data`, so the pipeline knows the schema of the data
- To init the ESMLPipelineFactory

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from esml import ESMLProject
from baselayer_azure_ml_pipeline import ESMLPipelineFactory, esml_pipeline_types

p = ESMLProject()
p.inference_mode = False
p.active_model = 11 # 10=titanic , 11=Diabetes
p_factory = ESMLPipelineFactory(p)

training_datefolder = '1000-01-01 10:35:01.243860'
p_factory.batch_pipeline_parameters[0].default_value = 0
p_factory.batch_pipeline_parameters[1].default_value = training_datefolder # overrides ESMLProject.date_scoring_folder.
p_factory.describe()
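
The `training_datefolder` value above is a timestamp string with microsecond precision. A typo in it is easy to miss, so it can help to check the string before building the pipeline. The sketch below assumes the expected format is exactly the one used by the literal in the cell above; the constant `DATEFOLDER_FORMAT` and the helper `parse_datefolder` are hypothetical names, not part of ESML.

```python
from datetime import datetime

# Assumption: the format is inferred from the literal '1000-01-01 10:35:01.243860'.
DATEFOLDER_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_datefolder(value: str) -> datetime:
    """Parse a date-folder string; raises ValueError if the format does not match."""
    return datetime.strptime(value, DATEFOLDER_FORMAT)


training_datefolder = "1000-01-01 10:35:01.243860"
parsed = parse_datefolder(training_datefolder)

# Round-trip check: the parsed value prints back to the same string.
print(str(parsed) == training_datefolder)
```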


# The below cells for an IN_2_GOLD_TRAIN_AUTOML pipeline will:
- 1) Generate code files
- 2) Build the pipeline. ESML auto-builds this and uploads the snapshot folder together with the Azure ML pipeline.
- 3) Run the pipeline. Smoke testing, see that it works
- 4) IF it works, publish the pipeline. Otherwise, edit the code files or configuration and retry steps 2 and 3.
- 5) Print the pipeline_id, that is essential to use from Azure Data factory 
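
Steps 3 and 4 above amount to a gate: publish only if the smoke-test run succeeded. The sketch below shows that gate with a stand-in run object so it can be run without Azure. `FakePipelineRun`, `publish_if_succeeded` and the set of status strings treated as success are assumptions for illustration; Azure ML SDK v1 run objects do expose `get_status()`, but check which final status strings your runs actually report.

```python
from typing import Callable, Optional

# Assumption: final statuses that count as a successful smoke test.
SUCCESS_STATUSES = {"Finished", "Completed"}


class FakePipelineRun:
    """Stand-in for the run object returned by p_factory.execute_pipeline(...)."""

    def __init__(self, status: str) -> None:
        self._status = status

    def get_status(self) -> str:
        return self._status


def publish_if_succeeded(run, publish: Callable[[], str]) -> Optional[str]:
    """Call `publish` only when the run ended successfully; otherwise return None."""
    status = run.get_status()
    if status in SUCCESS_STATUSES:
        return publish()
    print(f"Run ended with status '{status}': edit code or configuration, then rebuild and rerun.")
    return None


# Hypothetical usage: the lambda stands in for the real publish call.
ok_id = publish_if_succeeded(FakePipelineRun("Finished"), lambda: "hypothetical-pipeline-id")
failed_id = publish_if_succeeded(FakePipelineRun("Failed"), lambda: "hypothetical-pipeline-id")
print(ok_id, failed_id)
```

In this notebook, `publish` would wrap the `p_factory.publish_pipeline(...)` call from step 5a.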

# 2) `AUTO-GENERATE code: a snapshot folder`

In [None]:
## Generate CODE - then edit it to get the correct environments
p_factory.create_dataset_scripts_from_template(overwrite_if_exists=True) # Run this once, then edit the scripts manually. overwrite_if_exists=True overwrites existing scripts, including manual edits; the DEFAULT is overwrite_if_exists=False

# TRAINING (3a,4a,5a)

# 3a) `BUILDS the TRAINING pipeline`
- esml_pipeline_types.IN_2_GOLD_TRAIN_AUTOML
- Note the `esml_pipeline_types` value used below: esml_pipeline_types.`IN_2_GOLD_TRAIN_AUTOML`

In [None]:
## BUILD (takes ~10-12 minutes)
batch_pipeline = p_factory.create_batch_pipeline(esml_pipeline_types.IN_2_GOLD_TRAIN_AUTOML)
# ...which trains a model on the data selected by the date_folder parameters.
# The build uploads the generated Python scripts, your custom code and the ESML runtime to Azure, embedded in the pipeline, using a Dockerized image.

## 4a) `EXECUTES the pipeline`

In [None]:
## RUN: trains on BIG data, since all steps, including SPLITTING the data, run 100% on Azure compute
pipeline_run = p_factory.execute_pipeline(batch_pipeline)
pipeline_run.wait_for_completion(show_output=False)
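
One of the steps this run executes on Azure compute is the split of `GOLD` into `GOLD_TRAIN`, `GOLD_VALIDATE` and `GOLD_TEST`. The sketch below shows the general idea of a reproducible three-way split on a plain Python list. It is not how ESML performs the split: the function name `split_gold`, the 60/20/20 fractions and the seed are assumptions chosen for this example.

```python
import random
from typing import List, Sequence, Tuple


def split_gold(rows: Sequence, train_frac: float = 0.6, validate_frac: float = 0.2,
               seed: int = 42) -> Tuple[List, List, List]:
    """Shuffle rows reproducibly and cut them into train, validate and test buckets."""
    if train_frac <= 0 or validate_frac <= 0 or train_frac + validate_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test bucket")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_validate = round(len(shuffled) * validate_frac)
    gold_train = shuffled[:n_train]
    gold_validate = shuffled[n_train:n_train + n_validate]
    gold_test = shuffled[n_train + n_validate:]  # whatever remains
    return gold_train, gold_validate, gold_test


gold = list(range(100))  # stand-in for the rows of the GOLD dataset
gold_train, gold_validate, gold_test = split_gold(gold)
print(len(gold_train), len(gold_validate), len(gold_test))
```

Every row lands in exactly one bucket, so the test_set scoring is computed on data the model was not trained on.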

# 5a) PUBLISH the TRAINING pipeline & PRINT its ID

In [None]:
# PUBLISH
published_pipeline, endpoint = p_factory.publish_pipeline(batch_pipeline,"_1") # "_1" is optional: it creates a NEW pipeline with no history, instead of adding a version to the existing pipeline & endpoint

# PRINT: Get info to use in Azure data factory
- `published_pipeline.id` (if private Azure ML workspace)

In [None]:
print("2) Fetch scored data: Below needed for Azure Data factory PIPELINE activity (Pipeline OR Endpoint. Choose the latter)")
print("- Endpoint ID")
print("Endpoint ID:  {}".format(endpoint.id))
print("Endpoint Name:  {}".format(endpoint.name))
print("Experiment name:  {}".format(p_factory.experiment_name))

print("In AZURE DATA FACTORY - This is the ID you need, if using PRIVATE LINK, private Azure ML workspace.")
print("- You need the PIPELINE id, not the pipeline ENDPOINT ID (since the endpoint cannot be chosen in Azure data factory if Azure ML is private)")
published_pipeline.id
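
A caller that triggers the published pipeline needs the values printed above plus the pipeline parameters. The sketch below only assembles the JSON body that the Azure ML (SDK v1) published-pipeline REST endpoint accepts, with the keys `ExperimentName` and `ParameterAssignments`; it makes no network call. The helper name `build_trigger_payload`, the experiment name and the parameter name `training_folder_date` are hypothetical. The real parameter names are whatever the entries in `p_factory.batch_pipeline_parameters` are called.

```python
import json
from typing import Any, Dict


def build_trigger_payload(experiment_name: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble the request body for triggering a published Azure ML pipeline."""
    return {
        "ExperimentName": experiment_name,
        "ParameterAssignments": dict(parameters),  # copy, so the caller's dict is not shared
    }


payload = build_trigger_payload(
    experiment_name="my_experiment",  # hypothetical; use p_factory.experiment_name
    parameters={"training_folder_date": "1000-01-01 10:35:01.243860"},  # hypothetical name
)
print(json.dumps(payload, indent=2))
```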

# DONE! Next Step - Deploy model, serve your model for INFERENCING purpose:
- For INFERENCE you need to DEPLOY the model in one of these ways:
    - a) ONLINE on AKS endpoint
        - Notebook: 
    - b) BATCH SCORING on an Azure machine learning pipeline
        - Notebook: [your_root]\notebook_templates_quickstart\\`3a_PRODUCTION_phase_BATCH_INFERENCE_Pipeline.ipynb`
    - c) STREAMING using Eventhubs and Azure Databricks structured streaming
        - Notebook: TBA

- Q: `Next step in PRODUCTION phase after the 2a and 3a or 3b notebooks are done?`
 
- 1) `DataOps+MLOps:` Go to your ESMLProject's `Azure data factory`, and use the `ESML DataOps templates` (Azure data factory templates) for `IN_2_GOLD_TRAIN` and `IN_2_GOLD_SCORING`
    - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\adf\v1_3\PROJECT000\LakeOnly\\`STEP03_IN_2_GOLD_TRAIN_v1_3.zip`
- 2) `MLOps CI/CD` Go to the next notebook, in the `mlops` folder, to set up `CI/CD` in Azure DevOps
    - Import this in Azure DevOps:
        - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\mlops\01_template_v14\azure-devops-build-pipeline-to-import\\`ESML-v14-project002_M11-DevTest.json`
    - Change the Azure DevOps `VARIABLES` for service principal, tenant, etc.
    - Change the parameters in the `inline Azure CLI script` to the correct model you want to work with, and the correct data you want to train with, or score.
        - Step `21-train_in_2_gold_train_pipeline`
        - INLINE code calls the file: `21-train_in_2_gold_train_pipeline.py`
        - INLINE parameters: `--esml_model_number 11 --esml_date_utc "1000-01-01 10:35:01.243860"`
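
The two INLINE parameters above are ordinary command-line flags. The sketch below shows how a script could read them with `argparse`. It is an assumption about the general shape of such a script, not the content of `21-train_in_2_gold_train_pipeline.py`; only the flag names and example values are taken from the line above.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two flags passed by the inline Azure CLI step."""
    parser = argparse.ArgumentParser(description="Train pipeline launcher (sketch)")
    parser.add_argument("--esml_model_number", type=int, required=True,
                        help="ESML model to work with, e.g. 11")
    parser.add_argument("--esml_date_utc", type=str, required=True,
                        help="date folder of the training data, as a UTC timestamp string")
    return parser.parse_args(argv)


# Same values as the INLINE parameters above, passed as a list instead of via the shell.
args = parse_args(["--esml_model_number", "11",
                   "--esml_date_utc", "1000-01-01 10:35:01.243860"])
print(args.esml_model_number, args.esml_date_utc)
```

The timestamp has to stay quoted on the command line because it contains a space.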