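# AutoML Credit Risk Model & Model Interpretation
- Importing the Python SDK
- Connecting to the Azure ML Workspace
- Creating an Experiment
- Preparing the data
- Configuring automated machine learning
- Training the model and reviewing the results
- Model interpretation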

## Importing the Python SDK

In [1]:
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

In [2]:
# Check the Python SDK version
print(azureml.core.VERSION)

1.0.74


## Connecting to the Azure ML Workspace

In [3]:
subscription_id = '9c0f91b8-eb2f-484c-979c-15848c098a6b'
resource_group = 'AML-HOL'
workspace_name = 'azureml'

ws = Workspace(subscription_id, resource_group, workspace_name)
print(ws.name, ws.location, ws.resource_group, ws.location, sep = '\t')

azureml	japaneast	AML-HOL	japaneast


## Creating an Experiment

In [4]:
# Choose a name for the experiment
experiment_name = 'automl-hmeq-ja-remote'
experiment = Experiment(ws, experiment_name)

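## Preparing the data
### Historical data on home loan repayment and default

The training data is [HMEQ_Data](https://www.kaggle.com/ajay1735/hmeq-data) from Kaggle. The registered dataset uses Japanese column names, given in parentheses after each HMEQ column code.

* BAD (不履行フラグ) : default flag (0: repaid, 1: defaulted)
* LOAN (融資依頼金額) : requested loan amount
* MORTDUE (未払担保金額) : amount due on the existing mortgage
* VALUE (現在資産価値) : current property value
* REASON (債務理由) : reason for the loan (`DebtCon` or `HomeImp`)
* JOB (職種) : occupation
* YOJ (勤務年数) : years at the current job
* DEROG (信用調査会社問い合わせ数) : number of major derogatory reports
* DELINQ (延滞トレードライン数) : number of delinquent credit lines
* CLAGE (最も古いトレードラインの月齢) : age of the oldest credit line, in months
* NINQ (最近のクレジットの問い合わせ数) : number of recent credit inquiries
* CLNO (トレード(クレジット)ラインの数) : number of credit lines
* DEBTINC (債務対所得の割合) : debt-to-income ratio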

In [5]:
dataset = Dataset.get_by_name(ws, name='hmeq_ja')
dataset.to_pandas_dataframe()



Unnamed: 0,不履行フラグ,融資依頼金額,未払担保金額,現在資産価値,債務理由,職種,勤務年数,信用調査会社問い合わせ数,延滞トレードライン数,最も古いトレードラインの月齢,最近のクレジットの問い合わせ数,トレード(クレジット)ラインの数,債務対所得の割合
0,1,1100,25860.00,39025.00,HomeImp,Other,10.50,0.00,0.00,94.37,1.00,9.00,
1,1,1300,70053.00,68400.00,HomeImp,Other,7.00,0.00,2.00,121.83,0.00,14.00,
2,1,1500,13500.00,16700.00,HomeImp,Other,4.00,0.00,0.00,149.47,1.00,10.00,
3,1,1500,,,,,,,,,,,
4,0,1700,97800.00,112000.00,HomeImp,Office,3.00,0.00,0.00,93.33,0.00,14.00,
5,1,1700,30548.00,40320.00,HomeImp,Other,9.00,0.00,0.00,101.47,1.00,8.00,37.11
6,1,1800,48649.00,57037.00,HomeImp,Other,5.00,3.00,2.00,77.10,1.00,17.00,
7,1,1800,28502.00,43034.00,HomeImp,Other,11.00,0.00,0.00,88.77,0.00,8.00,36.88
8,1,2000,32700.00,46740.00,HomeImp,Other,3.00,0.00,2.00,216.93,1.00,12.00,
9,1,2000,,62250.00,HomeImp,Sales,16.00,0.00,0.00,115.80,0.00,13.00,
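
The rows displayed above already contain many empty cells, most visibly in the debt-to-income column. As a minimal sketch of how that could be quantified, the snippet below counts empty cells per column with the standard `csv` module. The `SAMPLE` text is just the ten displayed rows retyped with the HMEQ column codes as headers, and `missing_counts` is a hypothetical helper, not part of the Azure ML SDK.

```python
import csv
import io

# The ten rows shown above, retyped with HMEQ column codes as headers
# (the index column is omitted). Empty cells are missing values.
SAMPLE = """BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
1,1100,25860.00,39025.00,HomeImp,Other,10.50,0.00,0.00,94.37,1.00,9.00,
1,1300,70053.00,68400.00,HomeImp,Other,7.00,0.00,2.00,121.83,0.00,14.00,
1,1500,13500.00,16700.00,HomeImp,Other,4.00,0.00,0.00,149.47,1.00,10.00,
1,1500,,,,,,,,,,,
0,1700,97800.00,112000.00,HomeImp,Office,3.00,0.00,0.00,93.33,0.00,14.00,
1,1700,30548.00,40320.00,HomeImp,Other,9.00,0.00,0.00,101.47,1.00,8.00,37.11
1,1800,48649.00,57037.00,HomeImp,Other,5.00,3.00,2.00,77.10,1.00,17.00,
1,1800,28502.00,43034.00,HomeImp,Other,11.00,0.00,0.00,88.77,0.00,8.00,36.88
1,2000,32700.00,46740.00,HomeImp,Other,3.00,0.00,2.00,216.93,1.00,12.00,
1,2000,,62250.00,HomeImp,Sales,16.00,0.00,0.00,115.80,0.00,13.00,
"""


def missing_counts(csv_text):
    """Return {column name: number of empty cells} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return counts


counts = missing_counts(SAMPLE)
# Show only the columns that have at least one missing value.
print({name: n for name, n in counts.items() if n})
```

In this ten-row sample, DEBTINC is empty in eight rows and MORTDUE in two. The sample is too small to say anything about the full dataset, but it shows why imputation matters in the later featurization step.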


In [6]:
label = '不履行フラグ'  # default flag (BAD), the prediction target

In [7]:
train_data, test_data = dataset.random_split(percentage=0.8, seed=1234)
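
`random_split` divides the dataset randomly into two parts of roughly 80% and 20%, and the fixed `seed` makes the split repeatable. The sketch below illustrates that idea in plain Python. It is not Azure ML's implementation, and it assumes each row is assigned independently, which is why the sizes are only approximately 80/20. The local function `random_split` and the integer "rows" are hypothetical stand-ins.

```python
import random


def random_split(rows, percentage, seed):
    """Assign each row to the first part with probability `percentage`."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second


rows = list(range(1000))  # stand-in for dataset rows
train_rows, test_rows = random_split(rows, percentage=0.8, seed=1234)
print(len(train_rows), len(test_rows))

# The same seed reproduces exactly the same split.
assert random_split(rows, 0.8, 1234) == (train_rows, test_rows)
```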

In [8]:
train_data.to_pandas_dataframe().head()

Unnamed: 0,不履行フラグ,融資依頼金額,未払担保金額,現在資産価値,債務理由,職種,勤務年数,信用調査会社問い合わせ数,延滞トレードライン数,最も古いトレードラインの月齢,最近のクレジットの問い合わせ数,トレード(クレジット)ラインの数,債務対所得の割合
0,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.37,1.0,9.0,
1,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.83,0.0,14.0,
2,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.47,1.0,10.0,
3,1,1500,,,,,,,,,,,
4,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.33,0.0,14.0,


## Configuring automated machine learning
### This time, the model is built on remote compute.

In [13]:
from azureml.core.compute import ComputeTarget
compute_target = ComputeTarget(ws, name = "cpucluster")

In [14]:
automl_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 10,
    "n_cross_validations": 3,
    "primary_metric": 'accuracy',
    "preprocess": True,
    "enable_voting_ensemble": False,
    "enable_stack_ensemble": False,
    #"model_explainability" : True,
}

automl_config = AutoMLConfig(task = 'classification',
                             training_data = train_data,
                             label_column_name= label,
                             compute_target = compute_target,
                             **automl_settings
                            )
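
With `n_cross_validations` set to 3 and `primary_metric` set to `'accuracy'`, each candidate pipeline is scored by its accuracy on held-out folds. The sketch below shows what a 3-fold cross-validated accuracy means, using a trivial majority-class "model" and contiguous folds. The helpers and labels are hypothetical, and AutoML's own fold construction may differ.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class(labels):
    """Most frequent label; serves as a trivial baseline model."""
    return max(set(labels), key=labels.count)


def cross_validated_accuracy(labels, k=3):
    """Mean validation accuracy of the majority-class baseline over k folds."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[i] for i in train_idx])
        y_valid = [labels[i] for i in valid_idx]
        scores.append(accuracy(y_valid, [prediction] * len(y_valid)))
    return sum(scores) / len(scores)


labels = [0, 0, 1, 0, 0, 1, 0, 0, 1]  # hypothetical default flags
mean_accuracy = cross_validated_accuracy(labels, k=3)
print(round(mean_accuracy, 4))
```

Here the baseline scores 2/3 in every fold without using any feature. When one class dominates, accuracy alone can look good for an uninformative model, which is worth keeping in mind when reading the METRIC column.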

In [15]:
remote_run = experiment.submit(automl_config, show_output = True)

Running on remote compute: cpucluster
Parent Run ID: AutoML_c962acc6-b5ba-4313-8b82-3cffcb5f47af

Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler SGD                               0:01:15       0.8813    0.8813
         1   MaxAbsScaler SGD                               0:01:06       0.8607    0.8813
         2   MaxAb
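
According to the legend in the log, BEST is the best score observed so far. For a metric where higher is better, such as accuracy, that is a running maximum of METRIC. A small sketch using the two complete iterations from the log:

```python
from itertools import accumulate

# METRIC values of iterations 0 and 1, copied from the log above.
metrics = [0.8813, 0.8607]

# BEST is the running maximum of METRIC.
best_so_far = list(accumulate(metrics, max))
print(best_so_far)  # → [0.8813, 0.8813]
```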

In [27]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [28]:
automl_run, fitted_model = remote_run.get_output()
automl_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-hmeq-ja-remote,AutoML_c962acc6-b5ba-4313-8b82-3cffcb5f47af_0,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


## Model interpretation

In [29]:
from azureml.train.automl.automl_explain_utilities import AutoMLExplainerSetupClass, automl_setup_model_explanations
from azureml.explain.model.mimic.models.lightgbm_model import LGBMExplainableModel
from azureml.explain.model.mimic_wrapper import MimicWrapper
from azureml.contrib.interpret.visualize import ExplanationDashboard

In [30]:
# Prepare the data used for model interpretation
X_train = train_data.drop_columns([label])
y_train = train_data.keep_columns([label])
X_test = test_data.drop_columns([label])
y_test = test_data.keep_columns([label])

In [31]:
automl_explainer_setup_obj = automl_setup_model_explanations(fitted_model, 'classification',
                                                             X=X_train, X_test=X_test,
                                                             y=y_train)

Current status: Setting up data for AutoML explanations
Current status: Setting up the AutoML featurizer
Current status: Setting up the AutoML featurization for explanations
Current status: Setting up the AutoML estimator
Current status: Generating a feature map for raw feature importance
Current status: Finding all classes from the dataset
Current status: Data for AutoML explanations successfully setup


In [32]:
import pandas as pd
# Preview the featurized test data under the engineered feature names
pd.DataFrame(automl_explainer_setup_obj.X_test_transform.toarray(),
             columns=automl_explainer_setup_obj.engineered_feature_names).head()

Unnamed: 0,融資依頼金額_MeanImputer,未払担保金額_MeanImputer,未払担保金額_ImputationMarker,現在資産価値_MeanImputer,現在資産価値_ImputationMarker,勤務年数_MeanImputer,勤務年数_ImputationMarker,信用調査会社問い合わせ数_MeanImputer,信用調査会社問い合わせ数_ImputationMarker,延滞トレードライン数_MeanImputer,...,債務理由_CharGramCountVectorizer,債務理由_CharGramCountVectorizer_debtcon,債務理由_CharGramCountVectorizer_homeimp,職種_CharGramCountVectorizer,職種_CharGramCountVectorizer_mgr,職種_CharGramCountVectorizer_office,職種_CharGramCountVectorizer_other,職種_CharGramCountVectorizer_profexe,職種_CharGramCountVectorizer_sales,職種_CharGramCountVectorizer_self
0,2300.0,28192.0,0.0,40150.0,0.0,4.5,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,2500.0,71408.0,0.0,78600.0,0.0,8.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,3000.0,73612.79,1.0,33000.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,3000.0,58000.0,0.0,71500.0,0.0,10.0,0.0,0.25,1.0,2.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,3100.0,39589.0,0.0,36100.0,0.0,1.5,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
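
The engineered column names suggest that each numeric column with missing values becomes two features: a `_MeanImputer` column holding the value, with gaps filled by the column mean, and an `_ImputationMarker` column flagging which rows were filled. In the third row above, for example, the marker for 未払担保金額 is 1.0. The sketch below illustrates that idea on a hypothetical four-value column. It is not AutoML's featurizer.

```python
def mean_impute_with_marker(values):
    """Fill None with the mean of observed values and flag the filled rows."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    marker = [1.0 if v is None else 0.0 for v in values]
    return imputed, marker


# Hypothetical column with one missing entry.
mortdue = [25860.0, 70053.0, None, 13500.0]
imputed, marker = mean_impute_with_marker(mortdue)
print(imputed)  # → [25860.0, 70053.0, 36471.0, 13500.0]
print(marker)   # → [0.0, 0.0, 1.0, 0.0]
```

Keeping the marker as a separate feature lets the model learn from the fact that a value was missing, which the imputed number alone would hide.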


### Engineered Explanation (features after data preprocessing)

In [33]:
# Global surrogate model
explainer = MimicWrapper(ws, automl_explainer_setup_obj.automl_estimator, LGBMExplainableModel,
                         init_dataset=automl_explainer_setup_obj.X_transform, run=automl_run,
                         features=automl_explainer_setup_obj.engineered_feature_names,
                         feature_maps=[automl_explainer_setup_obj.feature_map],
                         classes=automl_explainer_setup_obj.classes)
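
A global surrogate is an interpretable model trained to reproduce the predictions of the model being explained. The explanation is then read from the surrogate. The sketch below shows the idea with a one-split decision stump standing in for `LGBMExplainableModel`. The `black_box` rule, the data, and `fit_stump` are all hypothetical and unrelated to the fitted AutoML model.

```python
def black_box(row):
    """Hypothetical opaque model: flags high debt-to-income or repeated delinquency."""
    debtinc, delinq = row
    return 1 if debtinc > 40.0 or delinq >= 2 else 0


def fit_stump(rows, targets):
    """Find the (feature, threshold) split that best reproduces `targets`.

    Returns (fidelity, feature_index, threshold), where fidelity is the
    fraction of rows on which the stump agrees with the targets.
    """
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            preds = [1 if row[feature] > threshold else 0 for row in rows]
            agreement = sum(p == t for p, t in zip(preds, targets)) / len(rows)
            if best is None or agreement > best[0]:
                best = (agreement, feature, threshold)
    return best


# Hypothetical (debt-to-income, delinquent credit lines) rows.
rows = [(30.0, 0), (35.0, 0), (45.0, 0), (50.0, 1), (33.0, 2), (42.0, 0)]

# The surrogate is trained on the black box's predictions, not on true labels.
black_box_predictions = [black_box(r) for r in rows]
fidelity, feature, threshold = fit_stump(rows, black_box_predictions)
print(f"surrogate splits on feature {feature}, fidelity {fidelity:.3f}")
```

The stump agrees with the black box on five of the six rows. The explanation derived from a surrogate is only as trustworthy as this fidelity, so it is worth checking before reading feature importances from it.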

In [34]:
# Compute the engineered explanations
engineered_explanations = explainer.explain(['local', 'global'], get_raw=False,
                                            eval_dataset=automl_explainer_setup_obj.X_test_transform)

In [35]:
ExplanationDashboard(engineered_explanations,
                     automl_explainer_setup_obj.automl_estimator,
                     automl_explainer_setup_obj.X_test_transform,
                     y_test.to_pandas_dataframe().values)

ExplanationWidget(value={'predictedY': [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0…

<azureml.contrib.interpret.visualize.ExplanationDashboard.ExplanationDashboard at 0x1283bbc18>

### Raw Explanation (features before data preprocessing)

In [36]:
# Compute the raw explanations
raw_explanations = explainer.explain(['local', 'global'], get_raw=True,
                                     raw_feature_names=automl_explainer_setup_obj.raw_feature_names,
                                     eval_dataset=automl_explainer_setup_obj.X_test_transform)
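
Raw explanations attribute importance to the original columns rather than to the engineered ones. One plausible way to picture the role of the feature map is that the importances of all engineered features derived from the same raw column are summed. The sketch below assumes exactly that, and also assumes the raw column name is the part of the engineered name before the first underscore. The importance values are hypothetical and `to_raw_importance` is not an SDK function.

```python
from collections import defaultdict

# Hypothetical importances for a few engineered features seen above.
engineered_importance = {
    "未払担保金額_MeanImputer": 0.25,
    "未払担保金額_ImputationMarker": 0.125,
    "債務理由_CharGramCountVectorizer_debtcon": 0.0625,
    "債務理由_CharGramCountVectorizer_homeimp": 0.0625,
}


def to_raw_importance(engineered):
    """Sum engineered importances per raw column (assumed name prefix)."""
    raw = defaultdict(float)
    for name, value in engineered.items():
        raw_name = name.split("_", 1)[0]  # assumption: prefix is the raw column
        raw[raw_name] += value
    return dict(raw)


raw_importance = to_raw_importance(engineered_importance)
print(raw_importance)  # → {'未払担保金額': 0.375, '債務理由': 0.125}
```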

In [37]:
ExplanationDashboard(raw_explanations,
                     automl_explainer_setup_obj.automl_pipeline,
                     automl_explainer_setup_obj.X_test_raw,
                     y_test.to_pandas_dataframe().values)

ExplanationWidget(value={'predictedY': [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0…

<azureml.contrib.interpret.visualize.ExplanationDashboard.ExplanationDashboard at 0x12813f978>