# R&D phase: About this notebook
- This notebook supports both CLASSIFICATION and REGRESSION.
- Purpose: TRAINS a model with Azure AutoML on an Azure compute cluster, calculates test_set scoring, and automatically checks whether the newly trained model is better than the existing one.
    - Use it to iteratively try different ML algorithms to see which is best, change the performance settings, and train again.
    - Also use it to try different approaches, classification or regression, to see which suits the use case better.

- Q: `WHEN to move on from the R&D phase to the PRODUCTION phase notebook?`
    - When you are happy with the MODEL (or if you have a big dataset that requires a pipeline for training), go to the next notebook `2a_PRODUCTION_phase` to create PIPELINES:
        - PRODUCTION PHASE & MLOps requires 1 `training pipeline`, plus a `scoring pipeline` or `scoring online endpoint` for inference.
- This notebook - Details:
    - 1) Automaps data as Azure ML datasets, based on your `lake_settings.json`
    - 2) Splits the GOLD data into 3 buckets.
        - NB: this is done with local compute, not Azure. If the data is too big for local RAM, use one of:
             - Option 1: The `2a_PRODUCTION_phase` training pipeline.
             - Option 2: Stay in this notebook with the local split, but increase the RAM of this Azure developer VM (DSVM).
             - Option 3: Stay in this notebook with the local split, but reduce the data size: only use a sample .parquet (or .csv) file in the IN-folder.
    - 3) Trains model
    - 4) Registers model
    - 5) Calculates test_set scoring
    - 6) Deploys model - ONLINE endpoint to AKS
    - 7) Inference: Smoke testing using the ONLINE endpoint - gets the result back, and also saves the result to the datalake
    - DONE.
    
- This notebook is called: `M10_v143_esml_classification_1_train_env_dev.ipynb` in the notebook_templates folder
 

# Login / Switch DEV_TEST_PROD environment (1-timer)

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from azureml.core.authentication import InteractiveLoginAuthentication
from esml import ESMLProject

p = ESMLProject()
p.dev_test_prod="dev"

print(p.tenant)
print(p.workspace_name) # self.workspace_name,subscription_id = self.subscription_id,resource_group = self.resource_group
print(p.subscription_id)
print(p.resource_group)

auth = InteractiveLoginAuthentication(tenant_id = p.tenant)
#auth = InteractiveLoginAuthentication(force=True, tenant_id = p.tenant)
ws, config_name = p.authenticate_workspace_and_write_config(auth)

# 1) ESML - TRAIN Classification, TITANIC model, and DEPLOY with predict_proba scoring

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/")
from esmlfac.adapter import ESMLFactory
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from esml import ESMLProject
import pandas as pd

param_esml_env = "dev" 
param_inference_model_version = "1" # DATALAKE(my_model/inference/active) | settings/project_specific/active/active_scoring_in_folder.json
param_scoring_folder_date = "1000-01-01 00:00:01.243860" # DATALAKE(my_model/inference/active) | settings/project_specific/active/active_scoring_in_folder.json
param_train_in_folder_date = "1000-01-01 00:00:01.243860" # DATALAKE(my_model/train/active) | settings/project_specific/active/active_in_folder.json

p = ESMLProject(param_esml_env,param_inference_model_version,param_scoring_folder_date,param_train_in_folder_date)
#p = ESMLProject() # Alternatively use the empty constructor, which takes parameters from settings\project_specific\model\active\active_in_folder.json

p.active_model = 10
p.inference_mode = False
p.ws = p.get_workspace_from_config() #2) Load DEV or TEST or PROD Azure ML Studio workspace
p.verbose_logging = False

# Init a ESMLController from ESMLProject configuration: Needed for 
datastore = p.connect_to_lake() # Connects to the correct ADLS Gen2 storage account (DEV, TEST or PROD)
controller = ESMLFactory.get_esml_controller_from_notebook(p)
p.describe()

In [None]:
unregister_all_datasets=False
if(unregister_all_datasets):
    p.unregister_all_datasets(p.ws) # For DEMO purpose

## CUSTOMIZE - FEATURE ENGINEERING (classification or regression)

In [None]:
def feature_engieering_regression():
    # Feature engineering: Bronze 2 Gold - working with Azure ML Datasets with Bronze, Silver, Gold concept
    esml_dataset = p.DatasetByName("ds01_diabetes") # Get dataset
    df_bronze = esml_dataset.Bronze.to_pandas_dataframe()
    p.save_silver(esml_dataset,df_bronze) #Bronze -> Silver

    esml_dataset2 = p.DatasetByName("ds02_other") # Get dataset
    df_bronze2 = esml_dataset2.Bronze.to_pandas_dataframe()
    p.save_silver(esml_dataset2,df_bronze2) #Bronze -> Silver

    df = esml_dataset.Silver.to_pandas_dataframe() 
    df_filtered = df[df.AGE > 0.015] 
    gold = p.save_gold(df_filtered)  #Silver -> Gold
    return gold

def feature_engieering_classification():
    # R&D purpose: Try some data wrangling here...we will later incorporate this in an Azure ML Pipeline, as "steps"
    esml_dataset = p.DatasetByName("ds01_titanic") 
    df_bronze = esml_dataset.Bronze.to_pandas_dataframe()
    df_bronze.columns = df_bronze.columns.str.replace("[/]", "_", regex=True) # Rename weird column names ("[/]" is a regex, so pass regex=True explicitly)

    df_silver = p.save_silver(esml_dataset,df_bronze) #Bronze -> Silver

    esml_dataset2 = p.DatasetByName("ds02_haircolor")
    esml_dataset3 = p.DatasetByName("ds03_housing")
    esml_dataset4 = p.DatasetByName("ds04_lightsaber")

    p.save_silver(esml_dataset2,esml_dataset2.Bronze.to_pandas_dataframe()) #Bronze -> Silver
    p.save_silver(esml_dataset3,esml_dataset3.Bronze.to_pandas_dataframe()) #Bronze -> Silver
    p.save_silver(esml_dataset4,esml_dataset4.Bronze.to_pandas_dataframe()) #Bronze -> Silver

    gold = p.save_gold(esml_dataset.Silver.to_pandas_dataframe())  #Silver -> Gold STEP
    return gold
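
The column rename in the classification function replaces every `/` in a column name with `_`. A minimal standard-library sketch of the same cleanup on plain strings follows. The helper `clean_column_names` and the column names are made up for illustration and are not part of ESML.

```python
import re

def clean_column_names(columns):
    """Replace every '/' in each column name with '_'."""
    # "[/]" is a regex character class that matches a single slash
    return [re.sub(r"[/]", "_", name) for name in columns]

# Hypothetical column names, chosen only for illustration
raw_columns = ["Siblings/Spouses Aboard", "Parents/Children Aboard", "Age"]
cleaned = clean_column_names(raw_columns)
print(cleaned)
```

Names without a slash pass through unchanged, so the cleanup is safe to run on every column.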

## Connect to DATALAKE, call FEATURE ENGINEERING
- DEMO logic: If not the first time, then just CONNECT to lake. If first time, then also call feature engineering

In [None]:
datastore = None
try:
    datastore = p.connect_to_lake() # Connects to the correct ADLS Gen2 storage account (DEV, TEST or PROD)
    gold_train = p.GoldTrain
    gold_train.name # Touch the dataset; this fails if GoldTrain is not mapped yet
    print("Not 1st time. We have data mapped already...and split. Now connected to LAKE")
except Exception: # If 1st time....no Gold exists, nor any mapping
    print("1st time. Let's init, map what data we have in LAKE, as Azure ML Datasets")
    datastore = p.init() # 3) Automapping from datalake to Azure ML datasets
    if (p.active_model["ml_type"] == "classification"):
        gold = feature_engieering_classification()
    elif (p.active_model["ml_type"] == "regression"):
        gold = feature_engieering_regression()

In [None]:
p.Gold.to_pandas_dataframe().head()

## SUMMARY - step 1
- ESML has now `Automapped` and `Autoregistered` Azure ML Datasets as: `IN, SILVER, BRONZE, GOLD`
- ESML has read configuration for the correct environment (DEV, TEST, PROD).
    - Both small customers and large Enterprise customers often want DEV, TEST, PROD in `different Azure ML workspaces` (and different subscriptions)
- User has done feature engineering, and saved GOLD with `p.save_gold`

In [None]:
print("rows in GOLD {}".format(p.Gold.to_pandas_dataframe().shape[0]))

### SPLIT option A) ESML default split logic, which you can override

In [None]:
M10_GOLD_TRAIN, M10_GOLD_VALIDATE, M10_GOLD_TEST = p.split_gold_3(0.6,label=p.active_model["label"],stratified=False) # Splits and auto-registers as Azure ML Datasets
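
To make the idea of a 3-bucket split concrete, here is a minimal standard-library sketch. It is not the ESML implementation: the function `split_3`, the fixed seed, and the assumption that the rows left after the train share are divided equally between validate and test are all illustrative choices.

```python
import random

def split_3(rows, train_percentage=0.6, seed=42):
    """Shuffle rows and split them into train, validate and test lists."""
    if not 0.0 < train_percentage < 1.0:
        raise ValueError("train_percentage must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_train = round(len(shuffled) * train_percentage)
    # Assumption: the remainder is shared equally by validate and test
    n_validate = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test

train, validate, test = split_3(range(100), train_percentage=0.6)
print(len(train), len(validate), len(test))
```

A stratified variant would apply the same slicing per label value, so that each bucket keeps the label proportions of the full dataset.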

### SPLIT option B) Use YOUR split logic, override the default
- You need to create your own class (ESMLSplitter is just an example class) such as MySplitter(IESMLSplitter)

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/")

from esmlrt.interfaces.iESMLSplitter import IESMLSplitter # Just for reference to see where the abstract class exists
from esmlrt.runtime.ESMLSplitter import ESMLSplitter1 # Point at your own code/class here instead; it needs to implement the IESMLSplitter class

my_IESMLSplitter = ESMLSplitter1()
M10_GOLD_TRAIN, M10_GOLD_VALIDATE, M10_GOLD_TEST = p.split_gold_3(train_percentage=0.6,label=p.active_model["label"],stratified=False,override_with_custom_iESMLSplitter=my_IESMLSplitter) # Splits and auto-registers as Azure ML Datasets
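
The override pattern above (pass in an object that implements an abstract splitter class) can be sketched with Python's `abc` module. `SplitterBase`, `ChronologicalSplitter` and `split_with` below are hypothetical stand-ins; the real `IESMLSplitter` method names and signatures are not shown in this notebook and may differ.

```python
from abc import ABC, abstractmethod

class SplitterBase(ABC):
    """Hypothetical splitter interface (not ESML's IESMLSplitter)."""

    @abstractmethod
    def split(self, rows, train_percentage):
        """Return a (train, validate, test) tuple of lists."""

class ChronologicalSplitter(SplitterBase):
    """Keeps row order: the first rows train, the last rows test."""

    def split(self, rows, train_percentage):
        rows = list(rows)
        n_train = round(len(rows) * train_percentage)
        n_validate = (len(rows) - n_train) // 2
        return (rows[:n_train],
                rows[n_train:n_train + n_validate],
                rows[n_train + n_validate:])

def split_with(splitter, rows, train_percentage=0.6):
    """The caller depends only on the interface, so splitters are swappable."""
    if not isinstance(splitter, SplitterBase):
        raise TypeError("splitter must implement SplitterBase")
    return splitter.split(rows, train_percentage)

train, validate, test = split_with(ChronologicalSplitter(), range(10))
print(train, validate, test)
```

An order-preserving split like this one is a common choice for time series, where shuffling would leak future rows into the training set.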

# IN_2_GOLD
- If just wanting to refine data to GOLD, for a Power BI report (No ML involved)
- Scenario: You want to refine data from "IN_2_GOLD" with an easy way to READ/WRITE data (using the enterprise datalake via ESML AutoLake and ESML SDK)

In [None]:
p.GoldTrain.to_pandas_dataframe().head()  # Azure ML Dataset

# 2) `ESML` Train model in `5 codelines`

In [None]:
print("We are in environment {}".format(p.dev_test_prod))
# Let's look at our AutoML performance settings:
automl_performance_config = p.get_automl_performance_config() # 1)Get config, for active environment (dev,test or prod)
automl_performance_config

Let's look at our label and our machine learning task type:

In [None]:
print('Label is: {}'.format(p.active_model["label"]))
print('ml_type / task is: {}'.format(p.active_model["ml_type"]))

### Let's TRAIN with AutoML & Azure compute cluster (M11 demo takes ~ 10-15min)

In [None]:
from esml import ESMLProject
from baselayer_azure_ml import AutoMLFactory,azure_metric_regression,azure_metric_classification
from azureml.train.automl import AutoMLConfig

automl_performance_config = p.get_automl_performance_config() # 1)Get config, for active environment (dev,test or prod)
aml_compute = p.get_training_aml_compute(p.ws) # 2)Get compute, for active environment

automl_config = AutoMLConfig(task = p.active_model["ml_type"], # 4) Override the ENV config for this model (which inherits from the enterprise DEV_TEST_PROD config baseline)
                            primary_metric = p.active_model["ml_metric"], #  Note: Regression[MAE, RMSE,R2,Spearman] Classification[AUC,Accuracy,Precision,Precision_avg,Recall]
                            compute_target = aml_compute,
                            training_data = p.GoldTrain, # is 'train_6' pandas dataframe, but as an Azure ML Dataset
                            experiment_exit_score = p.active_model["ml_time_out_score"], # DEMO purpose. remove experiment_exit_score if you want to have good accuracy (put a comment # on this row to remove it)
                            label_column_name = p.active_model["label"],
                            **automl_performance_config
                        )

best_run, fitted_model, experiment = AutoMLFactory(p).train_as_run(automl_config)

## 2b) ESML Scoring Drift/Concept Drift: Compare with `1-codeline`: Promote model or not? If better, then `Register model`
- `IF` newly trained model in `current` environment (`DEV`, `TEST` or `PROD`) scores BETTER than existing model in `target` environment, then `new model` can be registered and promoted.
- Q: Do we have `SCORING DRIFT / CONCEPT DRIFT?`
- Q: Is a model trained on NEW data better? Is the one in production degraded (no longer fit for the data it scores, because the real world changed into another CONCEPT)?
- A: Let's check. Instead of `DataDrift`, let's look at `actual SCORING` on new data (and/or new code, feature engineering), and see if we should PROMOTE the newly trained model.
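
The promote-or-not decision described in the bullets above boils down to comparing one score per model while respecting the metric's direction. The sketch below is a simplified illustration, not the ESML comparer: `should_promote`, the `LOWER_IS_BETTER` set, and the rule that a first model is always promoted are assumptions.

```python
# Assumption: error metrics where a smaller value means a better model
LOWER_IS_BETTER = {"rmse", "mae"}

def should_promote(metric_name, new_score, current_score=None):
    """Return True if the new model should replace the current one."""
    if current_score is None:
        return True  # Assumption: nothing to compare against, so promote
    if metric_name.lower() in LOWER_IS_BETTER:
        return new_score < current_score
    return new_score > current_score  # e.g. AUC, accuracy, R2

print(should_promote("AUC", 0.91, 0.88))
print(should_promote("RMSE", 4.2, 3.9))
```

A real comparison would typically look at several metrics calculated on the same test set, since a single score on different data is not comparable.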

In [None]:
print("current AI Factory environment: '{}' - AML WS: '{}'".format(p.dev_test_prod, p.ws.name))

## Check if we already have a MODEL with a suitable NAME - to group your runs and model versions under.
- Purpose: Gets a consistent model name across many runs

In [None]:
from esmlrt.interfaces.iESMLController import IESMLController
current_model,run_id_tag, model_name = IESMLController.get_best_model_via_modeltags_only_DevTestProd(p.ws,controller.experiment_name)

if(current_model is None):
    print("No existing model with experiment name {}. The Model name will now be same as experiment name".format(controller.experiment_name))
    current_model = None
    run_id_tag = ""
    model_name = controller.experiment_name
else:
    print("Current BEST model is: {} from Model registry with experiment_name-TAG {}, run_id-TAG {}  model_name-TAG {}".format(current_model.name,controller.experiment_name,run_id_tag,model_name))
    if ("esml_time_updated" in current_model.tags):
        print("esml_time_updated: {}".format(current_model.tags.get("esml_time_updated")))
    print("status_code : {}".format(current_model.tags.get("status_code")))
    print("model_name  : {}".format(current_model.tags.get("model_name")))
    print("trained_in_workspace   : {}".format(current_model.tags.get("trained_in_workspace")))
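
The naming rule in the cell above (reuse the name of an existing tagged model, otherwise fall back to the experiment name) can be sketched over plain dictionaries. This is an illustration only: `resolve_model_name` and the in-memory `registry` list are hypothetical, and the tag keys simply mirror the ones printed above.

```python
def resolve_model_name(registered_models, experiment_name):
    """Reuse the name of the latest matching model, else use the experiment name."""
    matches = [m for m in registered_models
               if m["tags"].get("experiment_name") == experiment_name]
    if not matches:
        return experiment_name
    # Timestamps written as "YYYY-MM-DD HH:MM:SS..." sort chronologically as strings
    latest = max(matches, key=lambda m: m["tags"].get("esml_time_updated", ""))
    return latest["tags"].get("model_name", latest["name"])

# Hypothetical registry content, for illustration only
registry = [
    {"name": "AutoMLaaa", "tags": {"experiment_name": "exp_titanic",
                                   "model_name": "AutoMLaaa",
                                   "esml_time_updated": "2023-01-05 10:00:00.000001"}},
    {"name": "AutoMLbbb", "tags": {"experiment_name": "exp_titanic",
                                   "model_name": "AutoMLaaa",
                                   "esml_time_updated": "2023-02-01 09:30:00.000001"}},
]
print(resolve_model_name(registry, "exp_titanic"))
print(resolve_model_name(registry, "exp_other"))
```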

# Register new trained model, as NEW: not promoted.
 - Purpose: To be able to TAG scoring on it

In [None]:
from esmlrt.interfaces.iESMLController import IESMLController
import datetime

time_stamp = str(datetime.datetime.now())
ml_flow_stage = IESMLController._get_flow_equivalent(IESMLController.esml_status_new)

tags = {"esml_time_updated": time_stamp,"status_code": IESMLController.esml_status_new,"mflow_stage":ml_flow_stage, "run_id": best_run.id, "model_name": model_name, "trained_in_environment": controller.dev_test_prod, 
    "trained_in_workspace": p.ws.name, "experiment_name": controller.experiment_name, "trained_with": "AutoMLRun"}

model = best_run.register_model(model_name=model_name, tags=tags, description="", model_path=".")
print("model.name", model.name)
print("model.version", model.version)
#model_path = None
#model = controller._register_aml_model(model_path,model_name,tags,ws,"")

## TEST SET SCORING: Calculate test_set SCORING
- Is tagged on MODEL in Azure ML Studio

### Rehydrate RUN - to calculate test_scoring
- If you restarted the notebook and don't want to wait for TRAIN again, you can fetch RUN, FITTED_MODEL, AML_MODEL as below

from azureml.core import Model
from esmlrt.interfaces.iESMLController import IESMLController

if(p.active_model["ml_type"] == "regression"):
    your_model_id = "AutoMLd1093aff80" # See Azure ML Studio - Models registry, 1st column in table
    models_run_id = "AutoML_6cdd26f3-d7bb-4f0b-9051-8e15dac6536a" # Regression: See Azure ML Studio - Models registry, 2nd column in table. If empty, see JOBS id for run_id
elif(p.active_model["ml_type"] == "classification"):
    your_model_id = "AutoML3a56468360"
    models_run_id = "AutoML_3a564683-6824-4ca9-b07d-71652d445da6_0" # Classification

model = Model(p.ws, your_model_id)
run,best_run,fitted_model = IESMLController.init_run(p.ws,controller.experiment_name, models_run_id)


In [None]:
run = best_run.parent # Since AutoML. If manual, then keep best_run since no parent
model, val_1, val_2, val_3,val_4,val_5,reg_plt_6, val_7,class_plt_8 = controller.ESMLTestScoringFactory.get_test_scoring_8(
    p.ws,
    p.active_model["label"],
    p.GoldTest,
    fitted_model,
    run, # run or best_run
    model)

In [None]:
controller.ESMLTestScoringFactory.print_test_scoring(val_1, val_2, val_2, val_3,val_4,val_5,reg_plt_6,val_7)
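
For readers who want to see what test-set scoring means for a binary classifier, here is a minimal sketch that computes accuracy, precision and recall from true and predicted labels. It is not the ESML scoring factory. The function `classification_scores` and the sample labels are made up for illustration.

```python
def classification_scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing was predicted positive
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test-set labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_scores(y_true, y_pred)
print(scores)
```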

## Compare - INNER LOOP & Register with PROMOTED status, if better
 - Better than other in DEV?

## PROMOTE model - INNER LOOP

In [None]:
if(best_run is not None):
    print(best_run.parent.id)
if(run is not None):    
    print(run.id)

In [None]:
from esmlrt.interfaces.iESMLController import IESMLController

dev_ws = p.ws
esml_current_env = "dev"
next_environment="dev"
#target_ws = controller.get_target_workspace(current_environment = esml_current_env, current_ws = dev_ws, target_environment = esml_current_env)
target_ws = dev_ws

if(run is None):
    run_id = best_run.parent.id # This is set if you just ran the TRAIN cell in this notebook. AutoMLRun in notebook - we need its parent.
else:
    run_id = run.id # Rehydrated run (=parent), set in a cell above in this notebook. Use it if you don't have a fresh training run in RAM.

run_id = IESMLController.get_safe_automl_parent_run_id(run_id)
promote_new_model,source_model_name,source_run_id,source_best_run,source_model,leading_model = controller.ESMLComparer.compare_scoring_current_vs_new_model(
    new_run_id = run_id,#run_id_tag, #automl_step_run_id,
    new_model = None,
    model_name = model.name,
    current_ws = dev_ws,
    current_environment = esml_current_env,
    target_environment = next_environment,
    target_workspace = target_ws,
    experiment_name = controller.experiment_name)

if(source_best_run.id == run_id):
    print("Correct RUN found. Parent run.")

print("INNER LOOP (dev->dev) - PROMOTE?")
if promote_new_model: # Better than all in DEV/current environment?
    model_registered_in_target = controller.register_model(source_ws=p.ws, target_env=esml_current_env, source_model=model, run=source_best_run,esml_status=IESMLController.esml_status_promoted_2_dev) 
    print("Promoted model! in environment {}".format(esml_current_env))
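
The run IDs in the rehydration cell earlier have two shapes: a parent ID such as `AutoML_6cdd26f3-d7bb-4f0b-9051-8e15dac6536a` and a child ID with an iteration suffix such as `..._0`. What `get_safe_automl_parent_run_id` does internally is not shown here. The sketch below only assumes that child IDs append `_<iteration>` to the parent ID, and `parent_run_id` is a hypothetical helper.

```python
import re

def parent_run_id(run_id):
    """Strip a trailing '_<iteration>' so a child run ID maps to its parent."""
    return re.sub(r"_\d+$", "", run_id)

child = "AutoML_3a564683-6824-4ca9-b07d-71652d445da6_0"
parent = "AutoML_6cdd26f3-d7bb-4f0b-9051-8e15dac6536a"
print(parent_run_id(child))
print(parent_run_id(parent))  # already a parent ID, returned unchanged
```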

# END

#### DEBUG cell - before `PROMOTE model - INNER LOOP`
- Purpose: Rehydrate Run

from azureml.train.automl.run import AutoMLRun
from azureml.core import Experiment
id_1 = best_run.parent.id

print(controller.experiment_name)
print(id_1)
exp = Experiment(p.ws,controller.experiment_name)

run = AutoMLRun(experiment=exp, run_id=id_1)
best_run, fitted_model = run.get_output()

### DEBUG cell - after a TRAIN run
- The train run generates a `temporary model_name`, which Azure ML does not update when the model is renamed at registration. The tag that Azure ML writes, `best_run.properties['model_name']`, is therefore NOT the correct model_name after REGISTRATION.
- Why have your OWN model name? In ESML we want to look up whether a model already exists under the same MODEL NAME, so that all models are collected under one name, with versions.
    - AML may create a new random name under the same experiment after a couple of runs, so it is good to have your own known name. Example: stick with the 1st generated name AML creates for you.
    - ESML also collects all models under the same `experiment name TAG`, since you can TRAIN a model from a NOTEBOOK or from a PIPELINE, and these will have different EXPERIMENT NAMES, hence a TAG with a common name.
    - All in all, for MLOps: this makes it possible to determine the BEST promoted model and the LATEST challenger model.

from azureml.core import Model

my_model_name_that_exists_in_registry = "AutoMLd123123" # Look in Model registry for a model name that exists

print(best_run.id)
model_name1 = best_run.properties['model_name']
print(model_name1)

try:
    model_again = Model(p.ws, model_name1) # This will probably not be found: it is only the temporary name from the RUN, not the registered model's name, and it is NOT updated when the model is renamed.
    print(model_again.name)
except:
    print("Could not find a registered model with name {}. This is because we CUSTOMIZE the name when registering it, and run.properties is not updated.".format(model_name1))
    model_again = Model(p.ws, my_model_name_that_exists_in_registry) 
    print(model_again.name)  


## DEBUG cell - when SCORING pipeline runs, you pass MODEL VERSION, if VERSION=0...
...then LATEST PROMOTED model is used to score with

Below you can see how to HYDRATE the fitted_model, BEST_RUN and BEST_MODEL

model_version_in_int = 0
print("Fetching BEST MODEL that is promoted. To get its name")
current_model2,run_id_tag, model_name = IESMLController.get_best_model_via_modeltags_only_DevTestProd(p.ws,p.model_folder_name)

if(current_model2 is None):
    print("No existing model with experiment name {}. The Model name will now be same as experiment name".format(p.model_folder_name))
elif(model_version_in_int == 0): # Only hydrate when a promoted model was found; otherwise current_model2.tags would fail
    print("Initiating BEST MODEL - PROMOTED leading model (since model_version=0). Hydrating to get its run and fitted model.")

    run_id_2 = current_model2.tags.get("run_id")
    safe_run_id = IESMLController.get_safe_automl_parent_run_id(run_id_2)
    run2,best_run2,fitted_model2 = IESMLController.init_run(p.ws,p.model_folder_name, safe_run_id)
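
The rule "VERSION=0 means use the latest promoted model" can be sketched as a small resolver. This is an illustration, not ESML code: `resolve_model_version` is hypothetical, and it assumes the highest promoted version number is the latest one.

```python
def resolve_model_version(requested_version, promoted_versions):
    """Map version 0 to the latest promoted version; pass other versions through."""
    if requested_version == 0:
        if not promoted_versions:
            raise LookupError("no promoted model available")
        # Assumption: version numbers increase, so the max is the latest
        return max(promoted_versions)
    return requested_version

print(resolve_model_version(0, [1, 3, 4]))
print(resolve_model_version(2, [1, 3, 4]))
```

Raising when no promoted model exists makes the failure explicit, instead of failing later on a missing model object.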