# R&D phase: About this notebook
- Purpose: TRAINS a model with Azure AutoML on an Azure compute cluster, calculates test_set scoring, and automatically compares whether the newly trained model is better than the existing one.
    - Iterate: try different ML algorithms to see which is best, change the performance settings, and train again.
    - Also try different approaches, classification or regression, to see which suits the use case better.

- Q: `WHEN to move on from R&D phase to PRODUCTION phase notebook?`
    - When you are happy with the MODEL (or if you have a big dataset that requires a pipeline for training), go to the next notebook `2a_PRODUCTION_phase` to create PIPELINES:
        - PRODUCTION PHASE & MLOps requires 1 `training pipeline`, plus a `scoring pipeline` or a `scoring online endpoint` for inference
- This notebook - Details:
    - 1) Automaps data as Azure ML Datasets, based on your `lake_settings.json`
    - 2) Splits the GOLD data into 3 buckets.
        - NB: this is done with local compute, not Azure. If the data is too big for that, choose one of:
             - Option 1: Use the `2a_PRODUCTION_phase` training pipeline if the data is too big for local RAM.
             - Option 2: Stay in this notebook and split locally, but increase the RAM of this Azure VM developer (DSVM) computer.
             - Option 3: Stay in this notebook and split locally, but reduce the data size: use only a sample .parquet (or .csv) file in the IN-folder.
    - 3) Trains model
    - 4) Registers model
    - 5) Calculates test_set scoring
    - 6) Deploys model - ONLINE endpoint to AKS
    - 7) Inference: smoke tests the ONLINE endpoint, gets the result back, and also saves the result to the datalake
    - DONE.
    
- This notebook is called: `M10_v143_esml_classification_1_train_env_dev.ipynb` in the notebook_templates folder
 

# Login / Switch DEV_TEST_PROD environment (1-timer)

In [None]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from azureml.core.authentication import InteractiveLoginAuthentication
from esml import ESMLProject

p = ESMLProject()
p.dev_test_prod="dev"

print(p.tenant)
print(p.workspace_name) # self.workspace_name,subscription_id = self.subscription_id,resource_group = self.resource_group
print(p.subscription_id)
print(p.resource_group)

auth = InteractiveLoginAuthentication(tenant_id = p.tenant)
#auth = InteractiveLoginAuthentication(force=True, tenant_id = p.tenant)
ws, config_name = p.authenticate_workspace_and_write_config(auth)

# 1) ESML - TRAIN Classification, TITANIC model, and DEPLOY with predict_proba scoring

In [1]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/esml/common/")
from esml import ESMLProject
import pandas as pd

param_esml_env = "dev" 
param_inference_model_version = "1" # DATALAKE(my_model/inference/active) | settings/project_specific/active/active_scoring_in_folder.json
param_scoring_folder_date = "1000-01-01 00:00:01.243860" # DATALAKE(my_model/inference/active) | settings/project_specific/active/active_scoring_in_folder.json
param_train_in_folder_date = "1000-01-01 00:00:01.243860" # DATALAKE(my_model/train/active) | settings/project_specific/active/active_in_folder.json

p = ESMLProject(param_esml_env,param_inference_model_version,param_scoring_folder_date,param_train_in_folder_date)
#p = ESMLProject() # Alternatively use the empty constructor, which takes parameters from settings\project_specific\model\active\active_in_folder.json

p.active_model = 11
p.inference_mode = False
p.ws = p.get_workspace_from_config() #2) Load DEV or TEST or PROD Azure ML Studio workspace
p.verbose_logging = False
p.describe()

Using lake_settings.json with ESML version 1.4 - Models array support including LABEL
Environment: dev
Inference version: 1

 - ds01_diabetes
projects/project002/11_diabetes_model_reg/train/ds01_diabetes/in/dev/1000/01/01/
projects/project002/11_diabetes_model_reg/train/ds01_diabetes/out/bronze/dev/
projects/project002/11_diabetes_model_reg/train/ds01_diabetes/out/silver/dev/

 - ds02_other
projects/project002/11_diabetes_model_reg/train/ds02_other/in/dev/1000/01/01/
projects/project002/11_diabetes_model_reg/train/ds02_other/out/bronze/dev/
projects/project002/11_diabetes_model_reg/train/ds02_other/out/silver/dev/
 

Training GOLD (p.GoldPath)
projects/project002/11_diabetes_model_reg/train/gold/dev/
 

[A) USAGE]: to_score_folder, scored_folder, date_folder = p.get_gold_scored_unique_path()
A)INFERENCE ONLINE: GOLD to score (example if realtime - today)
projects/project002/11_diabetes_model_reg/inference/1/gold/dev/2022_10_14/dd9fe698a4784cb6b0bb0d169df7413a/
 

A)INFERENCE ONLINE: GO
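
The train in-folder paths printed above end in `in/dev/1000/01/01/`, which matches the environment and the date part of `param_train_in_folder_date`. A minimal sketch of how such a path could be derived, assuming exactly the layout shown in the output. `build_in_folder_path` is a hypothetical helper for illustration, not part of the ESML SDK:

```python
from datetime import datetime

def build_in_folder_path(project, model_folder, dataset, env, in_folder_date):
    """Hypothetical helper: mirrors the folder layout printed by p.describe()."""
    # Parse the same timestamp format used for param_train_in_folder_date
    dt = datetime.strptime(in_folder_date, "%Y-%m-%d %H:%M:%S.%f")
    # Only year/month/day end up in the path; the time part is dropped
    return "projects/{}/{}/train/{}/in/{}/{:04d}/{:02d}/{:02d}/".format(
        project, model_folder, dataset, env, dt.year, dt.month, dt.day)

path = build_in_folder_path("project002", "11_diabetes_model_reg", "ds01_diabetes",
                            "dev", "1000-01-01 00:00:01.243860")
print(path)
```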

In [None]:
p.ws

In [None]:
unregister_all_datasets=False
if(unregister_all_datasets):
    p.unregister_all_datasets(p.ws) # For DEMO purpose

In [2]:
def test_feature_engieering():
    # Feature engineering: Bronze 2 Gold - working with Azure ML Datasets with the Bronze, Silver, Gold concept
    esml_dataset = p.DatasetByName("ds01_diabetes") # Get dataset
    df_bronze = esml_dataset.Bronze.to_pandas_dataframe()
    p.save_silver(esml_dataset,df_bronze) #Bronze -> Silver

    esml_dataset2 = p.DatasetByName("ds02_other") # Get dataset
    df_bronze2 = esml_dataset2.Bronze.to_pandas_dataframe()
    p.save_silver(esml_dataset2,df_bronze2) #Bronze -> Silver

    df = esml_dataset.Silver.to_pandas_dataframe() 
    df_filtered = df[df.AGE > 0.015] 
    gold = p.save_gold(df_filtered)  #Silver -> Gold
    return gold

In [2]:
datastore = None
try:
    datastore = p.connect_to_lake() # Connects to the correct ADLS Gen2 storage account (DEV, TEST or PROD)
    gold_train = p.GoldTrain
    gold_train.name # Touch the dataset: fails the 1st time, when no GOLD_TRAIN exists yet
    print("Not 1st time. We have data mapped already...and splitted. Now connected to LAKE")
except Exception: # If 1st time....no Gold exists, nor any mapping
    print("1st time. Lets init, map what data we have in LAKE, as Azure ML Datasets")
    datastore = p.init() # 3) Automapping from datalake to Azure ML datasets
    gold = test_feature_engieering()

Using GEN2 as Datastore
Searching for setting in ESML datalake...
ESML in-folder settings override = TRUE 
 - Found settings in the ESML AutoLake  [active_in_folder.json,active_scoring_in_folder.json], to override ArgParse/GIT config with.
 - TRAIN in date:  1000/01/01
 - INFERENCE in date: 2021/06/08 and ModelVersion to score with: 1 (0=latest)
Not 1st time. We have data mapped already...and splitted. Now connected to LAKE


In [3]:
p.Gold.to_pandas_dataframe().head()

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


## SUMMARY - step 1
- ESML has now run `Automap` and `Autoregister`, creating Azure ML Datasets as: `IN, BRONZE, SILVER, GOLD`
- ESML has read the configuration for the correct environment (DEV, TEST, PROD).
    - Both small customers and large enterprise customers often want DEV, TEST, PROD in `different Azure ML workspaces` (and different subscriptions)
- User has done feature engineering, and saved GOLD with `p.save_gold`
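
The Silver -> Gold step in the feature engineering function is a simple row filter (`df[df.AGE > 0.015]`). A minimal sketch of the same predicate on plain Python dicts, as a stand-in for the pandas call. The sample rows borrow AGE and Y values from the `head()` output above purely as illustrative input, and `silver_to_gold` is a hypothetical name:

```python
# Illustrative input only: AGE / Y values borrowed from the head() output above
sample_rows = [
    {"AGE": 0.038076, "Y": 151.0},
    {"AGE": -0.001882, "Y": 75.0},
    {"AGE": 0.085299, "Y": 141.0},
    {"AGE": -0.089063, "Y": 206.0},
    {"AGE": 0.005383, "Y": 135.0},
]

def silver_to_gold(rows, min_age=0.015):
    """Keep only rows whose AGE exceeds min_age (same predicate as df[df.AGE > 0.015])."""
    return [row for row in rows if row["AGE"] > min_age]

gold_rows = silver_to_gold(sample_rows)
print(len(gold_rows), [row["Y"] for row in gold_rows])
```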

In [6]:
print("rows in GOLD {}".format(p.Gold.to_pandas_dataframe().shape[0]))

rows in GOLD 442


### SPLIT option A) ESML default split logic, which you can override

In [3]:
M10_GOLD_TRAIN, M10_GOLD_VALIDATE, M10_GOLD_TEST = p.split_gold_3(0.6,label=p.active_model["label"],stratified=False) # Splits and auto-registers as Azure ML Datasets

...
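
To make the 3-bucket split concrete, here is a minimal sketch of a shuffled train/validate/test split. It assumes the remainder after the train share is divided evenly between validate and test, which is an assumption for illustration and not necessarily what `p.split_gold_3` does internally. `split_3` is a hypothetical function:

```python
import random

def split_3(rows, train_percentage=0.6, seed=42):
    """Hypothetical 3-way split: train share first, remainder halved (assumption)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # Seeded shuffle, so the split is reproducible
    n_train = int(len(shuffled) * train_percentage)
    n_validate = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # Test gets the odd row, if any
    return train, validate, test

# 442 row indices, the same row count as GOLD above
train, validate, test = split_3(range(442), 0.6)
print(len(train), len(validate), len(test))
```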


### SPLIT option B) Use YOUR split logic, override the default
- You need to create your own class (ESMLSplitter is just an example class) such as MySplitter(IESMLSplitter)

In [4]:
import sys
sys.path.insert(0, "../azure-enterprise-scale-ml/")

from esmlrt.interfaces.iESMLSplitter import IESMLSplitter # Just for reference, to see where the abstract class lives
from esmlrt.runtime.ESMLSplitter import ESMLSplitter1 # Point at your own code/class here instead. It needs to implement the IESMLSplitter class

my_IESMLSplitter = ESMLSplitter1()
M10_GOLD_TRAIN, M10_GOLD_VALIDATE, M10_GOLD_TEST = p.split_gold_3(train_percentage=0.6,label=p.active_model["label"],stratified=False,override_with_custom_iESMLSplitter=my_IESMLSplitter) # Splits and auto-registers as Azure ML Datasets

...
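
The override pattern above is plain dependency injection: the caller hands in an object that implements an abstract splitter interface. A minimal sketch of that pattern follows. `ISplitter`, its `split` signature, `OrderedSplitter` and `split_gold_with` are all hypothetical stand-ins, and the real `IESMLSplitter` methods may look different:

```python
from abc import ABC, abstractmethod

class ISplitter(ABC):
    """Hypothetical abstract splitter; NOT the real IESMLSplitter signature."""
    @abstractmethod
    def split(self, rows, label, train_percentage):
        """Return (train, validate, test)."""

class OrderedSplitter(ISplitter):
    """Example override: keeps row order (no shuffle), e.g. for time-ordered data."""
    def split(self, rows, label, train_percentage):
        rows = list(rows)
        n_train = int(len(rows) * train_percentage)
        n_validate = (len(rows) - n_train) // 2
        return (rows[:n_train],
                rows[n_train:n_train + n_validate],
                rows[n_train + n_validate:])

def split_gold_with(splitter, rows, label, train_percentage=0.6):
    """Delegates the split to whatever splitter the caller injects."""
    if not isinstance(splitter, ISplitter):
        raise TypeError("splitter must implement ISplitter")
    return splitter.split(rows, label, train_percentage)

rows = [{"day": d, "Y": float(d)} for d in range(10)]
train, validate, test = split_gold_with(OrderedSplitter(), rows, label="Y")
print([r["day"] for r in train], [r["day"] for r in validate], [r["day"] for r in test])
```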


# IN_2_GOLD
- Use this if you only want to refine data to GOLD, for example for a Power BI report (no ML involved)
- Scenario: You want to refine data from "IN_2_GOLD" with an easy way to READ/WRITE data (using the enterprise datalake via ESML AutoLake and ESML SDK)

In [None]:
p.GoldTrain.to_pandas_dataframe().head()  # Azure ML Dataset

# 2) `ESML` Train model in `5 codelines`

In [None]:
print("We are in environment {}".format(p.dev_test_prod))

Let's look at our AutoML performance settings:

In [None]:
automl_performance_config = p.get_automl_performance_config() # 1)Get config, for active environment (dev,test or prod)
automl_performance_config

Let's look at our label, and our machine learning task type:

In [None]:
print('Label is: {}'.format(p.active_model["label"]))
print('ml_type / task is: {}'.format(p.active_model["ml_type"]))

### Let's TRAIN with AutoML & Azure compute cluster (M11 demo takes ~ 10-15min)

In [None]:
from esml import ESMLProject
from baselayer_azure_ml import AutoMLFactory,azure_metric_regression,azure_metric_classification
from azureml.train.automl import AutoMLConfig

automl_performance_config = p.get_automl_performance_config() # 1)Get config, for active environment (dev,test or prod)
aml_compute = p.get_training_aml_compute(p.ws) # 2)Get compute, for active environment

automl_config = AutoMLConfig(task = p.active_model["ml_type"], # 4) Override the ENV config for this model (which inherits from the enterprise DEV_TEST_PROD config baseline)
                            primary_metric = azure_metric_regression.MAE, #  Note: Regression[MAE, RMSE,R2,Spearman] Classification[AUC,Accuracy,Precision,Precision_avg,Recall]
                            compute_target = aml_compute,
                            training_data = p.GoldTrain, # is 'train_6' pandas dataframe, but as an Azure ML Dataset
                            experiment_exit_score = '0.308', # DEMO purpose. remove experiment_exit_score if you want to have good accuracy (put a comment # on this row to remove it)
                            label_column_name = p.active_model["label"],
                            **automl_performance_config
                        )

best_run, fitted_model, experiment = AutoMLFactory(p).train_as_run(automl_config)
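
The `**automl_performance_config` unpacking above passes every key of the environment settings as a keyword argument. One Python detail worth knowing: if a key is passed both explicitly and inside the unpacked dict, the call raises a `TypeError`. A minimal sketch of merging a baseline with per-model overrides before unpacking. `merge_automl_settings` and `fake_config` are hypothetical, and the keys and values are illustrative, not taken from the project's settings:

```python
def merge_automl_settings(env_baseline, model_overrides):
    """Return a new dict where model-level overrides win over the environment baseline."""
    merged = dict(env_baseline)  # Copy, so the baseline itself is left untouched
    merged.update(model_overrides)
    return merged

def fake_config(**kwargs):
    """Stand-in for a config constructor that accepts keyword arguments."""
    return kwargs

# Illustrative values only
env_baseline = {"iterations": 5, "experiment_timeout_hours": 0.5, "enable_early_stopping": True}
model_overrides = {"iterations": 20}

settings = merge_automl_settings(env_baseline, model_overrides)
config = fake_config(primary_metric="normalized_mean_absolute_error", **settings)
print(config["iterations"])

# Passing a key explicitly AND via ** fails at call time
try:
    fake_config(iterations=1, **settings)
    duplicate_error = None
except TypeError as err:
    duplicate_error = str(err)
print(duplicate_error)
```

Merging first keeps the override decision in one place and avoids the duplicate-keyword error.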

# 3) Production purpose: "once and only once": Wrap code
- 3 Callers: MLOps, AMLPipeline, and this notebook

In [None]:
import sys
sys.path.insert(0, "../../2_A_aml_pipeline/4_inference/batch/M10/your_code/")
from your_custom_code import M01In2GoldProcessor

#p.init()
esml_dataset1 = p.DatasetByName("ds01_titanic") # Get dataset 1
df_bronze = esml_dataset1.Bronze.to_pandas_dataframe()
silver1 = p.save_silver(esml_dataset1,df_bronze) #Bronze -> Silver

esml_dataset2 = p.DatasetByName("ds02_haircolor") # Get dataset 2
df_bronze2 = esml_dataset2.Bronze.to_pandas_dataframe()
silver2 = p.save_silver(esml_dataset2,df_bronze2) #Bronze -> Silver

df1 = M01In2GoldProcessor().M01_ds01_process_in2silver(silver1.to_pandas_dataframe())  # You can then copy this statement in your pipeline-step "in2silver_ds01...py"
df2 = M01In2GoldProcessor().M01_ds02_process_in2silver(silver2.to_pandas_dataframe())  # You can then copy this statement in your pipeline-step "in2silver_ds02...py"

merged_gold = M01In2GoldProcessor().M01_merge_silvers(df1,df2) # You can then copy this statement in your pipeline-step "silver_merged_2_gold.py"
p.save_gold(merged_gold).to_pandas_dataframe().head()
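
The "once and only once" idea is that the refinement logic lives in one class, and the notebook, the pipeline steps and MLOps all call that same class. A minimal sketch under stated assumptions: `In2GoldProcessor` and the columns `id`, `age` and `hair` are hypothetical and do not reflect the real `M01In2GoldProcessor` or the real datasets:

```python
class In2GoldProcessor:
    """Hypothetical processor; column names are made up for illustration."""

    @staticmethod
    def process_ds01(rows):
        # Drop rows where age is missing
        return [r for r in rows if r.get("age") is not None]

    @staticmethod
    def process_ds02(rows):
        # Normalize the free-text column
        return [{**r, "hair": r["hair"].strip().lower()} for r in rows]

    @staticmethod
    def merge_silvers(rows1, rows2):
        # Inner join on "id"
        by_id = {r["id"]: r for r in rows2}
        return [{**r, **by_id[r["id"]]} for r in rows1 if r["id"] in by_id]

def in2gold(ds01, ds02):
    """Single entry point that every caller (notebook, pipeline step, MLOps) reuses."""
    silver1 = In2GoldProcessor.process_ds01(ds01)
    silver2 = In2GoldProcessor.process_ds02(ds02)
    return In2GoldProcessor.merge_silvers(silver1, silver2)

ds01 = [{"id": 1, "age": 22}, {"id": 2, "age": None}, {"id": 3, "age": 35}]
ds02 = [{"id": 1, "hair": " Brown "}, {"id": 3, "hair": "BLOND"}]
gold = in2gold(ds01, ds02)
print(gold)
```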

## 2b) ESML Scoring Drift/Concept Drift: Compare with `1-codeline`: Promote model or not? If better, then `Register model`
- `IF` the newly trained model in the `current` environment (`DEV`, `TEST` or `PROD`) scores BETTER than the existing model in the `target` environment, then the `new model` can be registered and promoted.
- Q: Do we have `SCORING DRIFT / CONCEPT DRIFT?`
- Q: Is a model trained on NEW data better? Has the one in production degraded? (It no longer fits the data it scores: the real world changed, a different CONCEPT.)
- A: Let's check. Instead of `DataDrift`, let's look at `actual SCORING` on new data (and/or new code, feature engineering) to see if we should PROMOTE the newly trained model.

In [None]:
print("current AI Factory environment: '{}' - AML WS: '{}'".format(p.dev_test_prod, p.ws.name))

In [None]:
from baselayer_azure_ml_model import ESMLModelCompare

current_env = p.dev_test_prod # dev
target_env = "dev" # Does newly trained Model v3 in DEV, score better than Model v2 in TEST?
print("promote model in DEV to TEST? (move to other Azure ML Studio Workspace)")

compare = ESMLModelCompare(p)
promote,source_model_name,new_run_id,target_model_name, target_best_run_id,target_workspace,source_model = compare.compare_scoring_current_vs_new_model(target_env) # Compare DEV to TEST (or TEST to PROD)  (1min, 17sec VS 33sec)

print("SCORING DRIFT: If new model scores better in DEV (new data, or new code), we can promote this to TEST & PROD \n")
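print("New Model: {} in environment {}".format(source_model_name, p.dev_test_prod)) # source = newly trained model in the current environment
print("Existing Model: {} in environment {}".format(target_model_name,target_env)) # target = existing model in the target environment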

if (promote): # Can register="promote" a model in same workspace (test->test), or also register in OTHER Azure ML workspace (test->prod)
    if(p.dev_test_prod == target_env):
        compare.register_active_model(target_env,source_model) # if SAME workspace this brings more "metadata" faster to the model registration
    else:
        compare.register_model_in_correct_ws(target_env) # if REMOTE target workspace we get the same metadata, BUT it takes longer: more lookups to the "source Run"
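
The promote decision boils down to comparing one metric between the new and the existing model, where the direction depends on the metric: lower is better for error metrics, higher is better for the others. A minimal sketch of that rule. `should_promote` is hypothetical and does not describe how `ESMLModelCompare` is implemented, and promoting when no model exists in the target is an assumption:

```python
# Error metrics: a lower score is better
LOWER_IS_BETTER = {"MAE", "RMSE"}

def should_promote(metric_name, new_score, existing_score=None):
    """Hypothetical rule: promote if the new model beats the existing one on the metric."""
    if existing_score is None:
        return True  # Assumption: nothing registered in the target yet, so promote
    if metric_name in LOWER_IS_BETTER:
        return new_score < existing_score
    return new_score > existing_score  # e.g. R2, AUC, Accuracy

print(should_promote("MAE", 40.0, 45.0))   # New model has the lower error
print(should_promote("R2", 0.40, 0.50))    # New model explains less variance
print(should_promote("AUC", 0.90))         # No existing model in the target
```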

# TEST SET SCORING

# Test-set: Ensure we have a TEST_SET split

In [None]:
label = p.active_model["label"]
try:
    p.GoldTest.name
except Exception: # No GOLD_TEST available yet
    p.connect_to_lake() # p.init() + automap
    train_6, validate_set_2, test_set_2 = p.split_gold_3(0.6)

### NOW we can calculate scoring on TEST_SET

In [None]:
from baselayer_azure_ml import ESMLTestScoringFactory

label = p.active_model["label"]
rmse, r2, mean_abs_percent_error,mae,spearman_corr,plt = ESMLTestScoringFactory(p).get_test_scoring_4_regression(label)
print("RMSE:")
print(rmse)
print()
print("R2:")
print(r2)
print()
print("MAPE:")
print(mean_abs_percent_error)
print()
print("MAE:")
print(mae)
print()
print("Spearman:")
print(spearman_corr)
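
For reference, the five regression metrics printed above can be computed by hand. A minimal pure-Python sketch using the textbook definitions. It is not the `ESMLTestScoringFactory` implementation, MAPE is returned as a fraction rather than a percentage, and the Spearman formula used here assumes no tied values:

```python
import math

def regression_metrics(y_true, y_pred):
    """Textbook RMSE, R2, MAPE (as a fraction) and MAE."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "mae": sum(abs(e) for e in errors) / n,
    }

def spearman(y_true, y_pred):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result
    n = len(y_true)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(y_true), ranks(y_pred)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Tiny made-up example
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
rho = spearman(y_true, y_pred)
print(metrics, rho)
```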

# END