# R&D or PRODUCTION phase: This will generate a PIPELINE with 1-M Databricks steps
- All Databricks steps, or mixed with Python steps
- Purpose: Creates 1 of the 2 PIPELINES
    - `2a) training pipeline:` TRAINS a model with Azure AutoML on an AZURE compute cluster, calculates test_set scoring, and automatically compares whether the newly trained model is better.

# Prerequisite - Databricks:
- You need to have an `ESML Databricks template snapshot folder` (M01,M11, ...) in your Databricks workspace
- Run the notebooks in that folder first, interactively, so you know they work. Then come back to THIS notebook to generate an Azure ML pipeline that points at those Databricks notebooks

# TODO for you: CONFIGURATION
- 1) Change `p.active_model=11` to YOUR model number, e.g. `p.active_model=1` if your model has number `1`.
    - See  [lake_settings.json](./settings/project_specific/model/lake_settings.json) to find YOUR model number.
- 2) After you run the cell [2) AUTO-GENERATE code: a snapshot folder](#2_generate_snapshot_folder), you need to add YOUR feature engineering logic
    -  This code you probably already have, from the R&D phase, in this CUSTOMIZE cell in the notebook: [1_R&D_phase_M10_M11.ipynb](./1_quickstart/1_R&D_phase_M10_M11.ipynb)
        - 2a) Add this code to `your_custom_code.py` after you have generated the snapshot folder, so that it is reachable and uploaded at pipeline creation.
            - Tip: You can CREATE A CLASS, and add static methods, e.g. `ds01_process_in2silver(dataframe1)` in the `your_custom_code.py`
        - 2b) Create the Databricks mapping in `ESMLPipelineStepMap.py`: define which steps are handled by a Databricks notebook and a Databricks Spark cluster.
- 3) Once your code is in `your_custom_code.py`, reference it from the auto-generated pipeline step files, such as `in2silver_ds01_diabetes.py`
    - Note: This snapshot folder will not exist until you have run the first 2 cells in this notebook, or until this cell has run: [2) AUTO-GENERATE code: a snapshot folder](#2_generate_snapshot_folder)

## 1) Initiate ESMLPipelineFactory (Always run this CELL below)
- To attach ESML controlplane to your project
- To point at `template-data` for the pipeline to know the schema of data
- To init the ESMLPipelineFactory

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from esml import ESMLProject
from baselayer_azure_ml_pipeline import ESMLPipelineFactory, esml_pipeline_types

p = ESMLProject() # Will search in ROOT for your copied SETTINGS folder '../settings/model/active/active_in_folder.json'
p.inference_mode = False
p.active_model = 11 # 10=titanic , 11=Diabetes
p.ws = p.get_workspace_from_config()
p_factory = ESMLPipelineFactory(p)

training_datefolder = '1000-01-01 10:35:01.243860' # Will override active_in_folder.json
p_factory.batch_pipeline_parameters[0].default_value = 0 # Will override active_in_folder.json.model.version = 0 meaning that ESML will find LATEST PROMOTED, and not use a specific Model.version. It will read data from .../inference/0/... folder
p_factory.batch_pipeline_parameters[1].default_value = training_datefolder # overrides ESMLProject.date_scoring_folder.

all_steps_databricks = False #Notebook parameter: Disabled CELL that includes all mapped steps as DatabricksSteps
simple_mode_but_separate_compute = False


## "Once a day" - run the cell below to ensure Azure ML v1 legacy mode

In [None]:
print("NB! You only need to run the command below once a day. After that you can disable this cell by commenting out the code lines.")
print("")
# Set LEGACY mode - Azure ML v1 - needed for private link and DatabricksStep
p.ws.update(v1_legacy_mode=True) # If you happen to have a workspace in v2 mode, and want to change back to v1 legacy mode

# The below cells for an IN_2_GOLD_TRAIN_AUTOML pipeline will:
- 1) Generate code files
- 2) Build the pipeline. ESML auto-builds it and uploads the snapshot folder together with the Azure ML pipeline.
- 3) Run the pipeline as a smoke test, to see that it works
- 4) IF it works, publish the pipeline. Otherwise, edit the code files or configuration, and retry steps 2 and 3.
- 5) Print the pipeline_id, which is needed when calling the pipeline from Azure Data Factory
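
The five steps above can be sketched as one control flow. This is a minimal sketch under stated assumptions, not ESML's implementation: `StubFactory` and `StubRun` are hypothetical stand-ins so the block runs without an Azure workspace, the status string `"Finished"` is assumed, and `generate_build_run_publish` is an illustrative helper. Only the method names mirror the calls used in the cells of this notebook.

```python
from types import SimpleNamespace


class StubRun:
    """Hypothetical stand-in for the object returned by execute_pipeline()."""

    def __init__(self, status):
        self._status = status

    def wait_for_completion(self, show_output=False):
        # Assumption: the final status comes back as a string.
        return self._status


class StubFactory:
    """Hypothetical stand-in that mimics the factory calls used in this notebook."""

    def __init__(self, run_status="Finished"):
        self.run_status = run_status
        self.calls = []  # records the order of the steps

    def create_dataset_scripts_from_template(self, overwrite_if_exists=False):
        self.calls.append("generate")

    def create_batch_pipeline(self, pipeline_type):
        self.calls.append("build")
        return SimpleNamespace(pipeline_type=pipeline_type)

    def execute_pipeline(self, pipeline):
        self.calls.append("run")
        return StubRun(self.run_status)

    def publish_pipeline(self, pipeline, name_suffix=""):
        self.calls.append("publish")
        published = SimpleNamespace(id="published-pipeline-id-placeholder")
        endpoint = SimpleNamespace(id="endpoint-id-placeholder")
        return published, endpoint


def generate_build_run_publish(factory, pipeline_type):
    """Generate code, build, smoke test, and publish only if the run succeeded."""
    factory.create_dataset_scripts_from_template(overwrite_if_exists=False)  # step 1
    pipeline = factory.create_batch_pipeline(pipeline_type)  # step 2
    run = factory.execute_pipeline(pipeline)  # step 3
    status = run.wait_for_completion(show_output=False)
    if status != "Finished":
        # Step 4, failure branch: edit code or configuration, then retry steps 2 and 3.
        return None
    published, _endpoint = factory.publish_pipeline(pipeline, "_1")  # step 4, success branch
    return published.id  # step 5: the ID to use from Azure Data Factory


factory = StubFactory(run_status="Finished")
print(generate_build_run_publish(factory, "IN_2_GOLD_TRAIN_AUTOML"))
print(factory.calls)
```

Publishing only after a successful smoke test keeps a broken pipeline from getting an ID that Azure Data Factory could call.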

# 2) `AUTO-GENERATE code: a snapshot folder`
<a id='2_generate_snapshot_folder'></a>

In [None]:
## Generate CODE - then edit it to get correct environments
p_factory.create_dataset_scripts_from_template(overwrite_if_exists=False) # Do this once, then edit them manually. overwrite_if_exists=False is DEFAULT

## Alternative A) - Filter out, use some steps, a whitelist
- Uses a whitelist filter to select 1-M of your mapped steps
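
A whitelist keeps only the mapped steps whose names it lists. The sketch below shows the idea in plain Python. The `"in2silver_{}"` template, the step name strings, the `ds02_other` dataset name, and the `filter_steps` helper are all illustrative assumptions, not the ESML `ESMLPipelineStepMap` implementation.

```python
IN2SILVER_TEMPLATE = "in2silver_{}"  # assumed naming pattern, for illustration only


def filter_steps(all_mapped_steps, whitelist=None):
    """Return the mapped steps to use; no whitelist means all of them."""
    if whitelist is None:
        return list(all_mapped_steps)
    allowed = set(whitelist)
    # Keep the original order of the mapped steps.
    return [name for name in all_mapped_steps if name in allowed]


# Hypothetical dataset folder names and step names.
dataset_folder_names = ["ds01_diabetes", "ds02_other"]
all_mapped_steps = [IN2SILVER_TEMPLATE.format(n) for n in dataset_folder_names] + [
    "silver_merged_2_gold",
    "train_split_and_register",
    "train_manual",
]

# Whitelist the two in2silver steps and the merge step; the training steps are left out.
whitelist = all_mapped_steps[:3]
selected = filter_steps(all_mapped_steps, whitelist)
print(selected)
```

Passing no whitelist corresponds to Alternative B, where every step defined in the map is used.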

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/")
from esmlrt.interfaces.iESMLPipelineStepMap import IESMLPipelineStepMap
from esmlrt.interfaces.iESMLPipelineStepMap import esml_snapshot_step_names
sys.path.insert(0, "../01_pipelines/M11/your_code/")
from ESMLPipelineStepMap import ESMLPipelineStepMap

dataset_folder_names = p.active_model['dataset_folder_names']
step1 = esml_snapshot_step_names.in2silver_template.format(dataset_folder_names[0])
step2 = esml_snapshot_step_names.in2silver_template.format(dataset_folder_names[1])
step3 = esml_snapshot_step_names.silver_merged_2_gold
step4 = esml_snapshot_step_names.train_split_and_register
step5 = esml_snapshot_step_names.train_manual

step_filter_whitelist = [step1,step2,step3]

my_map = ESMLPipelineStepMap(step_filter_whitelist) # TODO 4 YOU: You need to implement this class. See "your_code" folder 
#map = ESMLPipelineStepMap()
p_factory.use_advanced_compute_settings(my_map)

# Print the mapping
train_map = my_map.get_train_map(p.active_model['dataset_folder_names']) # Get the map
train_map # prints it

## Alternative B) - Use all possible steps you defined in the ESMLPipelineStepMap
- No whitelist filter

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/")
from esmlrt.interfaces.iESMLPipelineStepMap import IESMLPipelineStepMap
sys.path.insert(0, "../01_pipelines/batch/M11/your_code/")
from ESMLPipelineStepMap import ESMLPipelineStepMap

if all_steps_databricks:
    mapping = ESMLPipelineStepMap() # TODO 4 YOU: You need to implement this class. See "your_code" folder 
    p_factory.use_advanced_compute_settings(mapping)

    # Print the mapping
    train_map = mapping.get_train_map(p.active_model['dataset_folder_names']) # Get the map
    print(train_map) # a bare expression inside an if-block is not displayed by the notebook, so print it explicitly
else:
    print("This notebook CELL is disabled. Change 'all_steps_databricks=True' to enable it.")

#### View pipeline steps and their types

In [None]:
for s in p_factory.pipeline_steps_array:
    print(type(s))

# TRAINING (3a,4a,5a)

# 3) `BUILDS the TRAINING pipeline`
- Take note of the `esml_pipeline_types` value passed to `create_batch_pipeline` below. The AutoML type is esml_pipeline_types.`IN_2_GOLD_TRAIN_AUTOML`; the cell below, as written, passes esml_pipeline_types.`IN_2_GOLD_TRAIN_MANUAL`.

In [None]:
## BUILD (takes ~6-12 minutes)
if(simple_mode_but_separate_compute):
    p_factory.use_advanced_compute_settings(None)
    batch_pipeline = p_factory.create_batch_pipeline(esml_pipeline_types.IN_2_GOLD_TRAIN_MANUAL, same_compute_for_all=False, aml_compute=None, allow_reuse=True)
else:
    batch_pipeline = p_factory.create_batch_pipeline(esml_pipeline_types.IN_2_GOLD_TRAIN_MANUAL)
    #batch_pipeline = p_factory.create_batch_pipeline(esml_pipeline_types.IN_2_GOLD_TRAIN_MANUAL, same_compute_for_all=True, aml_compute=None, allow_reuse=True)

## 4a) `EXECUTES the pipeline`

### NB! Run in v1 legacy mode
- You need to have your Azure Machine Learning workspace set to `v1_legacy_mode=True`
- HOW do I know if I run v1 or v2? 
  - If, when executing the pipeline (cell below this), you see the error message below in `executionlogs.txt` (Azure Machine Learning studio, Output+logs tab on the pipeline run), with a path containing the word `backendV2`, the workspace is not in v1 legacy mode:
     - <i>Failed to start the job for runid: 33ff1e3a-1ca7-4de0-bcee-b851cd2bb89d because of exception_type: ServiceInvocationException, error: Failure in StartSnapshotRun while calling service Execution; HttpMethod: POST; Response StatusCode: BadRequest; Exception type: Microsoft.RelInfra.Extensions.HttpRequestDetailException|-Microsoft.RelInfra.Common.Exceptions.ErrorResponseException, stack trace:    at Microsoft.Aether.EsCloud.Common.Client.ExecutionServiceClient.StartSnapshotRunAsync(String jobId, RunDefinition runDefinition, String runId, WorkspaceIdentity workspaceIdentity, String experimentName, CreatedBy createdBy) in D:\a\_work\1\s\src\aether\platform\\`backendV2`\\Clouds\ESCloud\ESCloudCommon\Client\ExecutionServiceClient.cs:line 162
   at Microsoft.Aether.EsCloud.Common.JobProcessor.StartRunAsync(EsCloudJobMetadata job) in D:\a\_work\1\s\src\aether\platform\backendV2\Clouds\ESCloud\ESCloudCommon\JobProcessor.cs:line 605
   </i>
- WHY? 
    - Azure ML SDK v2 does not yet (as of 2022-10) support Spark jobs in pipelines, nor private endpoints.
- TODO: To set the workspace to LEGACY v1 mode, run this code once in a cell: `p.ws.update(v1_legacy_mode=True)`
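
The check described above can be sketched as a small helper that scans the text of `executionlogs.txt`. This is a heuristic for illustration only: `looks_like_v2_mode_failure` is a hypothetical function, not part of ESML or the Azure ML SDK, and it relies solely on the `backendV2` marker mentioned above.

```python
def looks_like_v2_mode_failure(execution_log_text):
    """Return True if the log text contains the 'backendV2' path marker."""
    return "backendV2" in execution_log_text


# One line taken from the error message quoted above (raw string, so backslashes stay literal).
sample_log = (
    r"at Microsoft.Aether.EsCloud.Common.JobProcessor.StartRunAsync(EsCloudJobMetadata job) "
    r"in D:\a\_work\1\s\src\aether\platform\backendV2\Clouds\ESCloud\ESCloudCommon\JobProcessor.cs:line 605"
)

if looks_like_v2_mode_failure(sample_log):
    print("Workspace is probably not in v1 legacy mode: run p.ws.update(v1_legacy_mode=True)")
```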

In [None]:
#p.ws.update(v1_legacy_mode=True) # If you happen to have a workspace in v2 mode, and want to change back to v1 legacy mode

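label = 'Y'
# NOTE: `aml` is not defined anywhere in this notebook. It must be an object that exposes
# to_pandas_dataframe() (e.g. an Azure ML tabular dataset) that you have loaded before running this cell.
train_df = aml.to_pandas_dataframe()
#y1 = train_df[label]
X = train_df.drop(label, axis=1) # features: all columns except the label
y = train_df.pop(label).to_frame() # label as a one-column DataFrame. pop() also removes the label from train_df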

#print(y1.head()) # no column
#print(type(y1)) # series

print(X.head())
print("")
print(y.head())

In [None]:
## RUN: trains on BIG data, since 100% Azure compute is used for all steps, including SPLITTING data
pipeline_run = p_factory.execute_pipeline(batch_pipeline) # If this gives an ERROR message, look at executionlogs.txt in Azure machine learning studio, Output+logs tab on the pipeline run
pipeline_run.wait_for_completion(show_output=False)

# 4b) View meta data about the training run
- What DATA was used, WHEN did the training occur, etc

In [None]:
from azureml.core import Dataset
ds_name ="{}_GOLD_TRAINED_RUNINFO".format(p.ModelAlias)
meta_ds= Dataset.get_by_name(workspace=p.ws,name=ds_name, version='latest')
meta_ds.to_pandas_dataframe().head()

# 5a) PUBLISH the TRAINING pipeline & PRINT its ID

In [None]:
# PUBLISH
published_pipeline, endpoint = p_factory.publish_pipeline(batch_pipeline,"_1") # "_1" is optional: it creates a NEW pipeline with 0 history, instead of ADDING a version to the existing pipeline & endpoint

# PRINT: Get info to use in Azure data factory
- `published_pipeline.id` (if private Azure ML workspace)

In [None]:
print("2) Fetch scored data: Below needed for Azure Data factory PIPELINE activity (Pipeline OR Endpoint. Choose the latter)")
print("- Endpoint ID")
print("Endpoint ID:  {}".format(endpoint.id))
print("Endpoint Name:  {}".format(endpoint.name))
print("Experiment name:  {}".format(p_factory.experiment_name))

print("In AZURE DATA FACTORY - This is the ID you need, if using PRIVATE LINK, private Azure ML workspace.")
print("- You need the PIPELINE id, not the pipeline ENDPOINT ID (since the endpoint cannot be chosen in Azure data factory if the Azure ML workspace is private)")
published_pipeline.id

# DONE! Next Step - Deploy model, serve your model for INFERENCING purpose:
- For INFERENCE you may need to DEPLOY the model in one of these ways:
    - a) ONLINE on AKS endpoint
        - Notebook: 
    - b) BATCH SCORING on an Azure machine learning pipeline
        - Notebook: [your_root]\notebook_templates_quickstart\\`3a_PRODUCTION_phase_BATCH_INFERENCE_Pipeline.ipynb`
    - c) STREAMING using Eventhubs and Azure Databricks structured streaming
        - Notebook: TBA

- Q: `Next step in PRODUCTION phase after the 2a and 3a or 3b notebooks are done?`
 
- 1) `DataOps+MLOps:` Go to your ESMLProjects `Azure data factory`, and use the `ESML DataOps templates` (Azure data factory templates) for `IN_2_GOLD_TRAIN` and `IN_2_GOLD_SCORING`
    - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\adf\v1_3\PROJECT000\LakeOnly\\`STEP03_IN_2_GOLD_TRAIN_v1_3.zip`
- 2) `MLOps CI/CD` Go to the next notebook `mlops` folder, to set up `CI/CD` in Azure Devops
    - Import this in Azure devops:
        - azure-enterprise-scale-ml\copy_my_subfolders_to_my_grandparent\mlops\01_template_v14\azure-devops-build-pipeline-to-import\\`ESML-v14-project002_M11-DevTest.json`
    - Change the Azure Devops `VARIABLES` for service principal, tenant, etc.
    - Change the parameters in the `inline Azure CLI script` to the correct model you want to work with, and the correct data you want to train with, or score.
        - Step `21-train_in_2_gold_train_pipeline`
        - INLINE code calls the file: `21-train_in_2_gold_train_pipeline.py`
        - INLINE parameters: `--esml_model_number 11 --esml_date_utc "1000-01-01 10:35:01.243860"`
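
How a script might read these inline parameters can be sketched with `argparse`. This is an assumption for illustration: the actual contents of `21-train_in_2_gold_train_pipeline.py` are not shown here, and `parse_esml_args` is a hypothetical helper. Only the two flag names and the example values come from the text above.

```python
import argparse
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # matches "1000-01-01 10:35:01.243860"


def parse_esml_args(argv):
    """Parse the two inline parameters passed by the Azure CLI script."""
    parser = argparse.ArgumentParser(description="Train pipeline parameters (sketch)")
    parser.add_argument("--esml_model_number", type=int, required=True)
    parser.add_argument(
        "--esml_date_utc",
        type=lambda text: datetime.strptime(text, DATE_FORMAT),
        required=True,
    )
    return parser.parse_args(argv)


args = parse_esml_args(
    ["--esml_model_number", "11", "--esml_date_utc", "1000-01-01 10:35:01.243860"]
)
print(args.esml_model_number, args.esml_date_utc.isoformat())
```

Parsing the date up front means a badly formatted `--esml_date_utc` value fails when the script starts, rather than later during the pipeline run.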

# StepMap - how to print & look at it?

In [None]:
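# `my_map` is the ESMLPipelineStepMap created in Alternative A (use `mapping` instead if you ran Alternative B).
# The name `map` is Python's builtin function here, since it is never assigned in this notebook.
train_map = my_map.get_train_map(p.active_model['dataset_folder_names'])
has_dbx,step_name,map_step = my_map.get_dbx_map_step(train_map,'ds01_diabetes')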
print(has_dbx)
print(step_name)

In [None]:
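#train_map = my_map.get_train_map(p.active_model['dataset_folder_names'])
# `my_map` is the ESMLPipelineStepMap created in Alternative A (use `mapping` instead if you ran Alternative B)
for d in p.Datasets:
    print(d.Name)
    has_dbx,step_name,map_step = my_map.get_dbx_map_step(train_map,d.Name)
    print("has_dbx:",has_dbx)
    print("step_name:",step_name)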
    print("")