# PRODUCTION phase: About this notebook
- Purpose: Creates 1 of the 2 PIPELINES
    - `2a) training pipeline:` TRAINS a model with Azure AutoML on an Azure compute cluster, calculates test_set scoring, and automatically compares whether the newly trained model is better.
    
## DETAILS - about this notebook and the 2a pipeline, generated
- 1) Initiate ESMLPipelineFactory:
- 2) `AUTO-GENERATE code: a snapshot folder` via ESML, that generates Python scripts and the `ESML runtime`
    - 2_A_aml_pipeline\4_inference\batch\\`M11`
        - Edit the feature engineering files if needed
            - 2_A_aml_pipeline\4_inference\batch\\`M11\your_code\your_custom_code.py`
            - `your_custom_code.py` is referenced from all the `in_2_silver_...` files such as: 2_A_aml_pipeline\4_inference\batch\M11\\`in2silver_ds01_diabetes.py` and `silver_merged_2_gold`
        - Edit the AutoML train file, you need to add some configuration. See instructions in notebook cells below.
            - 2_A_aml_pipeline\4_inference\batch\\`M11\train_post_automl_step.py`
- 3) `BUILDS the pipeline` of a certain type (IN_2_GOLD_TRAIN_AUTOML)
    - An `Azure machine learning pipeline` with steps will be automatically generated, based on your `lake_settings.json` dataset array.
    - It is a `training pipeline` of ESML type `IN_2_GOLD_TRAIN_AUTOML`
- 4) `EXECUTES the pipeline` (smoke testing purpose - see that it works...)
    - 4a) The below happens in the training pipeline (`IN_2_GOLD_TRAIN_AUTOML`) steps:
        - Feature engineering of each in-data - via `IN_2_SILVER` steps.
        - Merges all SILVERS to `GOLD`
        - Splits the `GOLD` to 3 buckets: `GOLD_TRAIN, GOLD_VALIDATE, GOLD_TEST`
        - Trains model
        - Registers the newly trained model, tags it as `newly_trained`
        - Calculates test_set scoring with the `ESMLTestescoringFactory`
        - `INNER LOOP MLOps:` Compares in current environment `DEV` if model should be promoted, based on `test_set_scoring`
        - `OUTER LOOP MLOps:` Compares in next environment `TEST` if model should be promoted, based on `test_set_scoring` 
            - E.g. compares best model in `DEV` with the leading model in `TEST`
- 5) PUBLISH the pipeline
    - Purpose: Now that the pipeline is `smoke tested`, we can publish it, to get a `pipeline_id to use in Azure Data factory`
    - We also want to PRINT the `pipeline ID` after publishing, for easy access when using it in `Azure data factory` for continuous retraining on new data (DataOps & MLOps)
- DONE.
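
The INNER LOOP and OUTER LOOP comparisons described above can be sketched roughly as follows. This is a minimal illustration, not the ESML implementation: the function `should_promote`, the metric choice (lower is better, as for RMSE) and all score values are hypothetical.

```python
from typing import Optional


def should_promote(new_score: float,
                   leading_score: Optional[float],
                   lower_is_better: bool = True) -> bool:
    """Return True if the newly trained model beats the current leading model.

    Hypothetical helper: ESML does this comparison inside the pipeline,
    based on test_set scoring. If there is no leading model yet, promote.
    """
    if leading_score is None:
        return True
    if lower_is_better:
        return new_score < leading_score
    return new_score > leading_score


# Hypothetical test_set scores (e.g. RMSE for a regression model)
new_rmse = 52.3
leading_dev_rmse = 54.1
leading_test_rmse = 51.9

# INNER LOOP: compare against the leading model in the current environment (DEV)
promote_in_dev = should_promote(new_rmse, leading_dev_rmse)

# OUTER LOOP: only a model that wins in DEV is compared with the leader in TEST
promote_to_test = promote_in_dev and should_promote(new_rmse, leading_test_rmse)
```

With these example numbers the model would be promoted in `DEV`, but not to `TEST`, since the leading model in `TEST` has the lower error.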
    

Note: This notebook is called `M11_v143_esml_regression_batch_train_automl.ipynb` in the notebook_templates folder
 

# TODO for you: CONFIGURATION
- 1) Change `p.active_model=11` to your own model number, for example `1` if your model has that number.
    - See  [lake_settings.json](./settings/project_specific/model/lake_settings.json) to find YOUR model number.
- 2) After you run the cell [2) AUTO-GENERATE code: a snapshot folder](#2_generate_snapshot_folder), you need to add YOUR feature engineering logic
    -  This code you probably already have, from the R&D phase, in this CUSTOMIZE cell in the notebook: [1_R&D_phase_M10_M11.ipynb](./1_quickstart/1_R&D_phase_M10_M11.ipynb)
        - You need to add this code to `your_custom_code.py` after you have generated the snapshot folder, for it to be reachable and uploaded at pipeline creation.
        - Tip: You can CREATE A CLASS, and add static methods, e.g. `ds01_process_in2silver(dataframe1)`  in the `your_custom_code.py` 
- 3) Now that your code is in `your_custom_code.py`, you need to reference it from the auto-generated pipeline-step files such as `in2silver_ds01_diabetes.py`
    - Note: This snapshot folder will not exist, until you have run the first 2 cells in this notebook, or after this cell has run [2) AUTO-GENERATE code: a snapshot folder](#2_generate_snapshot_folder)
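
As a rough sketch of the tip above, `your_custom_code.py` could hold a class with one static method per dataset. The class name `In2GoldProcessor`, the cleaning rules and the sample columns are assumptions for illustration only; replace them with your own feature engineering logic from the R&D phase.

```python
import pandas as pd


class In2GoldProcessor:  # hypothetical class name, for illustration only
    """Feature engineering helpers intended to live in your_custom_code.py."""

    @staticmethod
    def ds01_process_in2silver(dataframe1: pd.DataFrame) -> pd.DataFrame:
        """Clean one raw IN dataset before it is saved as SILVER."""
        df = dataframe1.copy()        # do not mutate the caller's dataframe
        df = df.drop_duplicates()     # placeholder rule: drop exact duplicate rows
        df = df.dropna(how="all")     # placeholder rule: drop completely empty rows
        return df.reset_index(drop=True)


# Usage with a small, made-up sample
raw = pd.DataFrame({"AGE": [59, 59, 48], "BMI": [32.1, 32.1, 21.6]})
silver = In2GoldProcessor.ds01_process_in2silver(raw)
```

Keeping the logic in static methods means the generated step files only need to call one function per dataset, and the same code can be reused from the R&D notebook.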

## 1) Initiate ESMLPipelineFactory (Always run this CELL below)
- To attach ESML controlplane to your project
- To point at `template-data` for the pipeline to know the schema of data
- To init the ESMLPipelineFactory

In [1]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from esml import ESMLProject
from baselayer_azure_ml_pipeline import ESMLPipelineFactory, esml_pipeline_types

p = ESMLProject() # Will search in ROOT for your copied SETTINGS folder '../settings/model/active/active_in_folder.json',
p.inference_mode = False
p.active_model = 11 # 10=titanic , 11=Diabetes
p.ws = p.get_workspace_from_config()
p_factory = ESMLPipelineFactory(p)

training_datefolder = '1000-01-01 10:35:01.243860' # Will override active_in_folder.json
p_factory.batch_pipeline_parameters[0].default_value = 0 # Will override active_in_folder.json.model.version = 0 meaning that ESML will find LATEST PROMOTED, and not use a specific Model.version. It will read data from .../inference/0/... folder
p_factory.batch_pipeline_parameters[1].default_value = training_datefolder # overrides ESMLProject.date_scoring_folder.


Using lake_settings.json with ESML version 1.4 - Models array support including LABEL


## "Once a day" - run the cell below, to make sure the workspace is in Azure ML v1 legacy mode

print("NB! You only need to run the command below once a day - then you can disable this cell by commenting out the code lines.")
print("")
# Set LEGACY mode - Azure ML v1 - since private link and DatabricksStep
p.ws.update(v1_legacy_mode=True) # If you happen to have a workspace in v2 mode, and want to change back to v1 legacy mode

# The below cells for an IN_2_GOLD_TRAIN_AUTOML pipeline will:
- 1) Generate code files
- 2) Build pipeline. ESML auto-builds this, and will upload the snapshot folder together with the Azure ML pipeline.
- 3) Run the pipeline. Smoke testing, see that it works
- 4) IF it works, Publish the pipeline, or else, edit the code files or configuration, retry step 2 and 3.
- 5) Print the pipeline_id, that is essential to use from Azure Data factory 

# 2) `AUTO-GENERATE code: a snapshot folder`
<a id='2_generate_snapshot_folder'></a>

In [2]:
## Generate CODE - then edit it to get correct environments
p_factory.create_dataset_scripts_from_template(overwrite_if_exists=False) # Do this once, then edit them manually. overwrite_if_exists=False is DEFAULT

Did NOT overwrite script-files with template-files such as 'scoring_gold.py', since overwrite_if_exists=False


## Databricks TODO 4 YOU:
### 1) In Databricks: Create Databricks Access token, add to keyvault as secret `esml-project-dbx-token`
- Visit the project specific Azure keyvault, `kv-p001-...`, and create a new secret called `esml-project-dbx-token`. Keep this tab open, and open a new TAB in your web browser for Databricks.
- In Databricks: Click on `email/User Settings/Generate new token`
    - Comment: optional, example: `azure ml pipeline`
    - Lifetime: leave this empty, since we will set the expiration in Azure keyvault instead, to 2 years.
    - Go back to your other web browser tab and paste the token into the Azure keyvault secret value box. Note: Secret should start with `dapi...`

- VERIFY / TEST access to TOKEN like this: 
    ```python
       p.ws = p.get_workspace_from_config()
       dbx_token = p.ws.get_default_keyvault().get_secret(name='esml-project-dbx-token')
    ```
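
After fetching the secret, a quick sanity check can catch a wrongly pasted value early. The helper below is a hypothetical sketch that only tests the `dapi` prefix mentioned above; it does not verify that Databricks accepts the token.

```python
def looks_like_dbx_token(secret_value) -> bool:
    """Return True if the value has the shape of a Databricks access token.

    Hypothetical helper: checks only that the value is a string starting
    with 'dapi' followed by at least one more character.
    """
    if not isinstance(secret_value, str):
        return False
    token = secret_value.strip()
    return token.startswith("dapi") and len(token) > len("dapi")


# Usage in the notebook (dbx_token comes from the keyvault call above):
# assert looks_like_dbx_token(dbx_token), "Secret does not look like a Databricks token"
```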

### 2) In Databricks: Make sure you have the notebooks, M11
- Connect to REPO: The Azure devops repo. The M11 snapshot folder should be here
    - notebooks_databricks/esml/...

### 3) Create a FOLDER in the lake - if you are creating a manual ML model
- Create a folder called 'model', under train, to keep your pickle files
    - Example: ...11_diabetes_model_reg/train/model/

### 4) Configure the ESMLPipelineStepMap
Location is under your SNAPSHOT folder, after you generated files via `p_factory.create_dataset_scripts_from_template(overwrite_if_exists=True)`

Location: `01_pipelines\batch\M11\your_code\ESMLPipelineStepMap.py`

- 4a) Set your dev, test, prod values:

```python
        all_envs = {
        'dev': {'compute_name': None,'resource_group': 'abc-def-esml-project002-weu-dev-004-rg', 'workspace_name': 'z', 'access_token': 't'},
        'test': {'compute_name': None,'resource_group': 'abc-def-esml-project002-weu-test-004-rg', 'workspace_name': 'z', 'access_token': 't'},
        'prod': {'compute_name': None,'resource_group': 'abc-def-esml-project002-weu-prod-004-rg', 'workspace_name': 'z', 'access_token': 't'}
        }
```

- 4b) Configure the map, by implementing the method, under `01_pipelines\batch\M11\your_code\ESMLPipelineStepMap.py`

```python
         def your_train_map(self, dataset_folder_names):
```
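
To give an idea of what such a map can contain, the sketch below builds a list of step dictionaries with the same keys as the `train_map` output printed further down in this notebook. It is a standalone, hypothetical function (`build_train_map` and its parameters are not part of ESML), and it only covers the `in2silver_...` steps; it is not the body of the real `ESMLPipelineStepMap` class.

```python
from typing import Dict, List, Optional


def build_train_map(dataset_folder_names: List[str],
                    notebook_root: str,
                    compute_name: Optional[str],
                    cluster_id: Optional[str]) -> List[Dict[str, Optional[str]]]:
    """Build one Databricks step entry per dataset (hypothetical sketch)."""
    train_map = []
    for name in dataset_folder_names:
        train_map.append({
            "step_name": f"in2silver_{name}",
            # Path to the Databricks notebook that runs this step
            "code": f"{notebook_root}/10_in2silver_{name}",
            "compute_type": "dbx",
            "date_folder_or": None,
            "dataset_folder_names": name,
            "dataset_filename_ending": "*.csv",
            "compute_name": compute_name,
            "cluster_id": cluster_id,
        })
    return train_map


# Usage with placeholder values (replace with your own repo path and cluster)
example_map = build_train_map(
    ["ds01_diabetes", "ds02_other"],
    notebook_root="/Repos/your_user/your_repo/notebook_databricks/esml/dev/project/11_diabetes_model_reg/M11",
    compute_name=None,
    cluster_id=None,
)
```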

#### Note: Before running CELL below: You need to RESTART notebook - if you changed code in the `ESMLPipelineStepMap`
- Before running the notebook again, remember to set `overwrite_if_exists=False` in the cell above, so that your config code is not overwritten:
```python
    p_factory.create_dataset_scripts_from_template(overwrite_if_exists=False)
```

In [3]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/")
from esmlrt.interfaces.iESMLPipelineStepMap import IESMLPipelineStepMap
sys.path.insert(0, "../01_pipelines/M11/your_code/")
from ESMLPipelineStepMap import ESMLPipelineStepMap

map = ESMLPipelineStepMap() # TODO 4 YOU: You need to implement this class. See "your_code" folder
p_factory.use_advanced_compute_settings(map)

train_map = map.get_train_map(p.active_model['dataset_folder_names'])
train_map # prints  the map
#has_dbx_silver_merged_2_gold_step,step_name,map_step = map.get_dbx_map_step(train_map,"silver_merged_2_gold")

Azure ML Workspace:
Attached Databricks db_compute_name:
Compute target n1-p000-aml-91 already exists


[{'step_name': 'in2silver_ds01_diabetes',
  'code': '/Repos/jostrm@microsoft.com/esml-aifactory002-prj002/notebook_databricks/esml/dev/project/11_diabetes_model_reg/M11/10_in2silver_ds01_diabetes',
  'compute_type': 'dbx',
  'date_folder_or': None,
  'dataset_folder_names': 'ds01_diabetes',
  'dataset_filename_ending': '*.csv',
  'compute_name': 'n1-p000-aml-91',
  'cluster_id': '0111-230838-10wcl6d4'},
 {'step_name': 'in2silver_ds02_other',
  'code': '/Repos/jostrm@microsoft.com/esml-aifactory002-prj002/notebook_databricks/esml/dev/project/11_diabetes_model_reg/M11/10_in2silver_ds02_other',
  'compute_type': 'dbx',
  'date_folder_or': None,
  'dataset_folder_names': 'ds02_other',
  'dataset_filename_ending': '*.csv',
  'compute_name': 'n1-p000-aml-91',
  'cluster_id': '0111-230838-10wcl6d4'},
 {'step_name': 'silver_merged_2_gold',
  'code': '/Repos/jostrm@microsoft.com/esml-aifactory002-prj002/notebook_databricks/esml/dev/project/11_diabetes_model_reg/M11/20_merge_2_gold',
  'compute_

# TRAINING (3a,4a,5a)

# 3) `BUILDS the TRAINING pipeline`
- esml_pipeline_types.IN_2_GOLD_TRAIN_AUTOML
- Take note of the `esml_pipeline_types` below, of type: esml_pipeline_types.`IN_2_GOLD_TRAIN_AUTOML`

In [4]:
## BUILD (takes ~10-12minutes)
batch_pipeline = p_factory.create_batch_pipeline(esml_pipeline_types.IN_2_GOLD_TRAIN_MANUAL)
# ...which trains a model on data selected via the date_folder parameters, and uploads the generated Python scripts, your custom code and the ESML runtime to Azure, embedded in the pipeline, using a Dockerized image.

Using GEN2 as Datastore
use_project_sp_2_mount: True
Environment ESML-AzureML-144-AutoML_126 exists
Using Azure ML Environment: 'ESML-AzureML-144-AutoML_126' as primary environment for PythonScript Steps
Dataset: ds01_diabetes has advanced mapping - an Azure Databricks mapping
Dataset: ds02_other has advanced mapping - an Azure Databricks mapping
ESML advanced mode: with advanced compute mappings
 - Step: silver_merged_2_gold has advanced mapping - an Azure Databricks mapping
Found attached Databricks compute cluster
previous_step_is_databricks = 1
create_gold_train_step: inference_mode=False
par_date_utc: 1000-01-01 10:35:01.243860
Created Databricks step in pipeline
 - Step: train_split_and_register = train_split_and_register has advanced mapping - an Azure Databricks mapping
previous_step_is_databricks = 1
INPUT GOLD (p.GoldPathDatabricks) is: projects/project001/11_diabetes_model_reg/train/gold/dev/gold_dbx.parquet/*.parquet
ESML-train_path_out = projects/project001/11_diabetes_mod

## 4a) `EXECUTES the pipeline`

### NB! Run in v1 legacy mode
- You need to have your Azure Machine Learning workspace set to `v1_legacy_mode=True`
- HOW do I know if I run v1 or v2? 
  - If you see the error message below in `executionlogs.txt` (in Azure machine learning studio, `Output+logs` tab on the pipeline run) when executing the pipeline (cell below this), and its path contains the word `backendV2`, then the workspace is not in v1 legacy mode:
     - <i>Failed to start the job for runid: 33ff1e3a-1ca7-4de0-bcee-b851cd2bb89d because of exception_type: ServiceInvocationException, error: Failure in StartSnapshotRun while calling service Execution; HttpMethod: POST; Response StatusCode: BadRequest; Exception type: Microsoft.RelInfra.Extensions.HttpRequestDetailException|-Microsoft.RelInfra.Common.Exceptions.ErrorResponseException, stack trace:    at Microsoft.Aether.EsCloud.Common.Client.ExecutionServiceClient.StartSnapshotRunAsync(String jobId, RunDefinition runDefinition, String runId, WorkspaceIdentity workspaceIdentity, String experimentName, CreatedBy createdBy) in D:\a\_work\1\s\src\aether\platform\\`backendV2`\\Clouds\ESCloud\ESCloudCommon\Client\ExecutionServiceClient.cs:line 162
   at Microsoft.Aether.EsCloud.Common.JobProcessor.StartRunAsync(EsCloudJobMetadata job) in D:\a\_work\1\s\src\aether\platform\backendV2\Clouds\ESCloud\ESCloudCommon\JobProcessor.cs:line 605
   </i>
- WHY? 
    - Azure ML SDK v2 does not yet (as of writing, 2022-10) support Spark jobs in pipelines, nor private endpoints.
- TODO: To set the workspace in LEGACY v1 mode run this code 1 time, in a cell: `p.ws.update(v1_legacy_mode=True)`
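
If you have copied the content of `executionlogs.txt`, the check described above can be done with a simple string search. This is a minimal sketch; the function name is hypothetical, and the sample line is taken from the error message quoted above.

```python
def log_indicates_v2_backend(execution_log: str) -> bool:
    """Return True if the log text contains the 'backendV2' path marker."""
    return "backendV2" in execution_log


# One line from the error message above, as a raw string because of the backslashes
sample_log = (
    r"at Microsoft.Aether.EsCloud.Common.JobProcessor.StartRunAsync(EsCloudJobMetadata job) "
    r"in D:\a\_work\1\s\src\aether\platform\backendV2\Clouds\ESCloud\ESCloudCommon\JobProcessor.cs:line 605"
)

needs_v1_legacy_mode = log_indicates_v2_backend(sample_log)
```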

In [5]:
#p.ws = p.get_workspace_from_config()
#p.ws.update(v1_legacy_mode=True) # If you happen to have a workspace in v2 mode, and want to change back to v1 legacy mode

In [6]:
## RUN - it will train on BIG data, since it uses 100% Azure compute for all steps, including SPLITTING data
pipeline_run = p_factory.execute_pipeline(batch_pipeline) # If this gives an ERROR message, look at executionlogs.txt in Azure machine learning studio, Output+logs tab on the pipeline run
pipeline_run.wait_for_completion(show_output=False)

Execute_pipeline (scoring): Inference_mode: 0
-Scoring data, default value 1000-01-01 10:35:01.243860
Created step in2silver_ds01_diabetes [db16530b][fd86ad54-6397-4e46-bcb8-a775718be57c], (This step will run and generate new outputs)
Created step in2silver_ds02_other [7c0f8c6a][ff1d2f94-71ce-40da-a4c4-5a4f642024e7], (This step will run and generate new outputs)
Created step silver_merged_2_gold [a2239267][af83568c-a2ac-4426-b989-1b88e4993b83], (This step will run and generate new outputs)
Created step SPLIT AND REGISTER (0.6 % TRAIN) [d2244f74][7d277cf6-9e86-4575-aef4-50b98b17abcd], (This step will run and generate new outputs)
Created step TRAIN in  [dev], COMPARE & REGISTER model in [dev] & PROMOTE to [test] [66339d7c][ee010795-c88a-4604-a2e1-42e3f65e9636], (This step will run and generate new outputs)
Created data reference M11_ds01_diabetes_train_IN for StepId [abac3939][5ab5f509-18b9-4dfe-8bb7-661181fd46a3], (Consumers of this data will generate new runs.)Created data reference M

'Finished'
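
The `SPLIT AND REGISTER (0.6 % TRAIN)` step in the log above splits `GOLD` into `GOLD_TRAIN, GOLD_VALIDATE, GOLD_TEST`. The sketch below shows the idea on a plain list of rows. It assumes that `0.6` means a 60% share for training and that the remainder is divided equally between validate and test; the function `split_gold` is hypothetical and is not how ESML implements the step.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def split_gold(rows: Sequence[Row],
               train_share: float = 0.6,
               seed: int = 42) -> Tuple[List[Row], List[Row], List[Row]]:
    """Shuffle the rows and split them into train, validate and test buckets."""
    if not 0.0 < train_share < 1.0:
        raise ValueError("train_share must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)          # seeded, so the split is reproducible
    n_train = round(len(shuffled) * train_share)
    n_validate = (len(shuffled) - n_train) // 2    # assumption: equal validate/test shares
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test


gold_train, gold_validate, gold_test = split_gold(range(100))
```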

# 4b) View meta data about the training run
- What DATA was used, WHEN did the training occur, etc

In [11]:
from azureml.core import Dataset
import pandas as pd

ds_name ="{}_GOLD_SCORED_RUNINFO".format(p.ModelAlias)
meta_ds= Dataset.get_by_name(workspace=p.ws,name=ds_name, version='latest')
pd.set_option('display.max_colwidth', None)
meta_ds.to_pandas_dataframe().head()

| Unnamed: 0 | pipeline_run_id | scored_gold_path | date_in_parameter | date_at_pipeline_run | model_version | used_model_version | used_model_name |
|---|---|---|---|---|---|---|---|
| 0 | 34cdc542-6b8e-454d-afa5-47fb243e48e6 | projects/project001/11_diabetes_model_reg/inference/0/scored/dev/1000/01/01/34cdc542-6b8e-454d-afa5-47fb243e48e6/ | 1000-01-01 10:35:01.243860 | 2023-01-13 15:29:47.611164 | 0 | 2 | 11_diabetes_model_reg |
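
In the row above, the date parameter `1000-01-01 10:35:01.243860` shows up in `scored_gold_path` as the folder `1000/01/01`. A minimal sketch of that mapping, assuming the date string always has this format (the helper `date_to_folder` is hypothetical, not an ESML function):

```python
from datetime import datetime

# Format of the date strings used in this notebook, e.g. '1000-01-01 10:35:01.243860'
ESML_DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def date_to_folder(date_utc: str) -> str:
    """Convert a date string to a 'YYYY/MM/DD' folder path."""
    d = datetime.strptime(date_utc, ESML_DATE_FORMAT)
    return f"{d.year:04d}/{d.month:02d}/{d.day:02d}"


folder = date_to_folder("1000-01-01 10:35:01.243860")  # '1000/01/01'
```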


# 5a) PUBLISH the TRAINING pipeline & PRINT its ID

In [12]:
# PUBLISH
published_pipeline, endpoint = p_factory.publish_pipeline(batch_pipeline,"_1") # "_1" is optional, use it to create a NEW pipeline with 0 history, instead of ADDING a version to the existing pipeline & endpoint

# PRINT: Get info to use in Azure data factory
- `published_pipeline.id` (if private Azure ML workspace)

In [None]:
print("2) Fetch scored data: Below needed for Azure Data factory PIPELINE activity (Pipeline OR Endpoint. Choose the latter") 
print ("- Endpoint ID")
print("Endpoint ID:  {}".format(endpoint.id))
print("Endpoint Name:  {}".format(endpoint.name))
print("Experiment name:  {}".format(p_factory.experiment_name))

print("In AZURE DATA FACTORY - This is the ID you need, if using PRIVATE LINK, private Azure ML workspace.")
print("-You need PIPELINE id, not pipeline ENDPOINT ID ( since cannot be chosen in Azure data factory if private Azure ML)")
published_pipeline.id

2) Fetch scored data: Below needed for Azure Data factory PIPELINE activity (Pipeline OR Endpoint. Choose the latter
- Endpoint ID
Endpoint ID:  44be26e4-f92a-4f91-a028-56d1cf64be39
Endpoint Name:  11_diabetes_model_reg_pipe_IN_2_GOLD_TRAIN_EP_3_dbx
Experiment name:  11_diabetes_model_reg_pipe_IN_2_GOLD_TRAIN
In AZURE DATA FACTORY - This is the ID you need, if using PRIVATE LINK, private Azure ML workspace.
-You need PIPELINE id, not pipeline ENDPOINT ID ( since cannot be chosen in Azure data factory if private Azure ML)


'ecb206b1-59b7-4d53-8d82-a97811445566'

# DONE! Next Step - Deploy model, serve your model for INFERENCING purpose:
- For INFERENCE you may need either to DEPLOY the model 
    - a) ONLINE on AKS endpoint
        - Notebook: 
    - b) BATCH SCORING on an Azure machine learning pipeline
        - Notebook: [your_root]\notebook_templates_quickstart\\`3a_PRODUCTION_phase_BATCH_INFERENCE_Pipeline.ipynb`
    - c) STREAMING using Eventhubs and Azure Databricks structured streaming
        - Notebook: TBA

- Q: `Next step in PRODUCTION phase after the 2a and 3a or 3b notebooks are done?`
 
- 1) `DataOps+MLOps:` Go to your ESML project's `Azure data factory`, and use the `ESML DataOps templates` (Azure data factory templates) for `IN_2_GOLD_TRAIN` and `IN_2_GOLD_SCORING`
    - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\adf\v1_3\PROJECT000\LakeOnly\\`STEP03_IN_2_GOLD_TRAIN_v1_3.zip`
- 2) `MLOps CI/CD` Next, go to the `mlops` folder, to set up `CI/CD` in Azure Devops
    - Import this in Azure devops
        - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\mlops\01_template_v14\azure-devops-build-pipeline-to-import\\`ESML-v14-project002_M11-DevTest.json`
    - Change the Azure Devops `VARIABLES` for service principal, tenant, etc.
    - Change the parameters in the `inline Azure CLI script` to the correct model you want to work with, and the correct data you want to train with, or score.
        - Step `21-train_in_2_gold_train_pipeline`
        - INLINE code calls the file: `21-train_in_2_gold_train_pipeline.py`
        - INLINE parameters: `--esml_model_number 11 --esml_date_utc "1000-01-01 10:35:01.243860"`
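
As an illustration of how a script could read the INLINE parameters above, here is a minimal `argparse` sketch. It is not the content of `21-train_in_2_gold_train_pipeline.py`; only the two flag names are taken from this notebook, everything else is an assumption.

```python
import argparse
from datetime import datetime


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two inline parameters and validate the date format."""
    parser = argparse.ArgumentParser(description="Sketch: read ESML inline parameters")
    parser.add_argument("--esml_model_number", type=int, required=True,
                        help="Model number from lake_settings.json, e.g. 11")
    parser.add_argument("--esml_date_utc", type=str, required=True,
                        help="Date folder, e.g. '1000-01-01 10:35:01.243860'")
    args = parser.parse_args(argv)
    # Raises ValueError if the date does not match the expected format
    datetime.strptime(args.esml_date_utc, "%Y-%m-%d %H:%M:%S.%f")
    return args


# Usage with the same values as in the INLINE parameters above
args = parse_args(["--esml_model_number", "11",
                   "--esml_date_utc", "1000-01-01 10:35:01.243860"])
```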