# Factor Exploration with the Azure Machine Learning Interpretability SDK

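A machine learning model that predicts quality makes it possible to estimate the quality of manufactured products from manufacturing process data. Beyond prediction, understanding how the model works helps identify the explanatory variables (factors) that influence defects, which in turn helps locate their root cause. This notebook uses **Factory.csv** to build a model that predicts quality from manufacturing process data, then uses the **Azure Machine Learning Interpretability SDK** to analyze how strongly each factor influences quality.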

## 1. Importing the Python SDK
Import the Python SDK for Azure Machine Learning service.

In [1]:
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import os

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


In [2]:
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.23


### Connecting to the Azure ML workspace
Connect to Azure Machine Learning service. Authentication against Azure is required.

In [3]:
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, ws.location, sep = '\t')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Found the config file in: /Users/konabuta/Project/Manufacturing-ML/.azureml/config.json
azureml	eastus	dllab	eastus


## 2. Preparing the training data

In [4]:
import pandas as pd
#os.makedirs("./outputs", exist_ok=True)
df = pd.read_csv('./data/Factory.csv')

In [5]:
df.tail(10)

| | ID | Quality | ProcessA-Pressure | ProcessA-Humidity | ProcessA-Vibration | ProcessB-Light | ProcessB-Skill | ProcessB-Temp | ProcessB-Rotation | ProcessC-Density | ProcessC-PH | ProcessC-skewness | ProcessC-Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4888 | 4889 | 0 | 6.8 | 0.22 | 0.36 | 1.2 | 0.05 | 38.0 | 127.0 | 0.99 | 3.04 | 0.54 | 9.2 |
| 4889 | 4890 | 0 | 4.9 | 0.23 | 0.27 | 11.75 | 0.03 | 34.0 | 118.0 | 1.0 | 3.07 | 0.5 | 9.4 |
| 4890 | 4891 | 0 | 6.1 | 0.34 | 0.29 | 2.2 | 0.04 | 25.0 | 100.0 | 0.99 | 3.06 | 0.44 | 11.8 |
| 4891 | 4892 | 0 | 5.7 | 0.21 | 0.32 | 0.9 | 0.04 | 38.0 | 121.0 | 0.99 | 3.24 | 0.46 | 10.6 |
| 4892 | 4893 | 0 | 6.5 | 0.23 | 0.38 | 1.3 | 0.03 | 29.0 | 112.0 | 0.99 | 3.29 | 0.54 | 9.7 |
| 4893 | 4894 | 0 | 6.2 | 0.21 | 0.29 | 1.6 | 0.04 | 24.0 | 92.0 | 0.99 | 3.27 | 0.5 | 11.2 |
| 4894 | 4895 | 0 | 6.6 | 0.32 | 0.36 | 8.0 | 0.05 | 57.0 | 168.0 | 0.99 | 3.15 | 0.46 | 9.6 |
| 4895 | 4896 | 0 | 6.5 | 0.24 | 0.19 | 1.2 | 0.04 | 30.0 | 111.0 | 0.99 | 2.99 | 0.46 | 9.4 |
| 4896 | 4897 | 1 | 5.5 | 0.29 | 0.3 | 1.1 | 0.02 | 20.0 | 110.0 | 0.99 | 3.34 | 0.38 | 12.8 |
| 4897 | 4898 | 0 | 6.0 | 0.21 | 0.38 | 0.8 | 0.02 | 22.0 | 98.0 | 0.99 | 3.26 | 0.32 | 11.8 |


In [6]:
from sklearn.model_selection import train_test_split

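# Features: every column except the target (Quality) and the row identifier (ID)
X = df.drop(columns=["Quality", "ID"])
# Target: the quality label
y = df["Quality"].values

# Hold out 10% for testing; stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=100, stratify=y)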
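
Passing `stratify=y` keeps the share of each quality label roughly equal in the training and test sets, which matters when one label is rare. The idea can be sketched in plain Python as follows. `stratified_split` is a hypothetical helper written only for illustration, the toy labels are made up, and this is not how scikit-learn implements the split internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=100):
    """Return (train_idx, test_idx) so that each label keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 20 items labelled 1 out of 200, i.e. a 10% rate.
labels = [1] * 20 + [0] * 180
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Both rates come out equal here because the split is done per class rather than over the pooled rows.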

## 3. Configuration

In [7]:
Automl_config = AutoMLConfig(task = 'classification',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 10,
                             iterations = 5,
                             experiment_exit_score = 0.999,
                             blacklist_models = ['KNN'],
                             X = X_train,
                             y = y_train,
                             n_cross_validations = 3)
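
The configuration above ranks candidate pipelines by `AUC_weighted`. As a rough illustration of what an AUC score measures, the sketch below computes the binary AUC as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. `binary_auc` and the toy scores are assumptions made for this example. The weighted variant used by AutoML additionally averages per-class AUC by class frequency, which is not shown here.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example (made up): three of the four positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = binary_auc(y_true, scores)
print(auc)  # 0.75
```

A score of 0.5 corresponds to random ordering and 1.0 to a perfect one, which is why `experiment_exit_score = 0.999` effectively means "stop early only if a near-perfect model is found".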

## 4. Running the experiment and reviewing the results

In [8]:
experiment=Experiment(ws, "automlQC_explain")
local_run = experiment.submit(Automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_56074096-3569-4f95-96d2-b21e9eced330
****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper SGD                      0:00:18       0.7813    0.7813
         1   StandardScalerWrapper SGD                      0:00:18       0.7866    0.7866
         2   MinMaxScaler LightGBM                          0:00:18       0.8421    0.8421
         3   StandardScalerWrapper SGD                      0:00:18       0.7835    0.8421
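
The BEST column in the log is simply the running maximum of the METRIC column. A minimal sketch using the four values printed above:

```python
from itertools import accumulate

# METRIC column from the log above, one entry per iteration.
metrics = [0.7813, 0.7866, 0.8421, 0.7835]

# BEST is the running maximum, because a higher AUC_weighted is better.
best_so_far = list(accumulate(metrics, max))
print(best_so_far)  # [0.7813, 0.7866, 0.8421, 0.8421]

# Index of the iteration that produced the best score.
best_iteration = max(range(len(metrics)), key=metrics.__getitem__)
print(best_iteration)  # 2
```

Iteration 2, the `MinMaxScaler LightGBM` pipeline, holds the best score among the iterations shown.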
 

In [9]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [10]:
best_run, fitted_model = local_run.get_output()
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
automlQC_explain,AutoML_56074096-3569-4f95-96d2-b21e9eced330_2,,Completed,Link to Azure Portal,Link to Documentation


In [11]:
fitted_model

Pipeline(memory=None,
     steps=[('MinMaxScaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('LightGBMClassifier', <automl.client.core.common.model_wrappers.LightGBMClassifier object at 0x1189269e8>)])
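
The first step of the fitted pipeline is a `MinMaxScaler` with `feature_range=(0, 1)`, which rescales each feature linearly so that its minimum maps to 0 and its maximum maps to 1. The sketch below shows the formula in plain Python. `min_max_scale` is a hypothetical helper, and unlike the real scaler it derives the minimum and maximum from the values it is given rather than from a separately fitted training set.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values so min -> lower bound and max -> upper bound."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant column carries no information; map it to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]

# ProcessC-Time values from the first five rows shown by df.tail(10).
process_c_time = [9.2, 9.4, 11.8, 10.6, 9.7]
scaled = min_max_scale(process_c_time)
print([round(v, 3) for v in scaled])
```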

## 5. Azure Machine Learning Interpretability SDK

The [Azure Machine Learning Interpretability SDK](https://docs.microsoft.com/en-US/azure/machine-learning/service/machine-learning-interpretability-explainability?view=azuremgmtcompute-fluent-1.0.0) is a model interpretation framework that combines Microsoft's own libraries with major third-party libraries such as LIME and SHAP, and exposes them through a unified API.

<img src="https://docs.microsoft.com/en-US/azure/machine-learning/service/media/machine-learning-interpretability-explainability/interpretability-architecture.png#lightbox" width=800 align=left>
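
SHAP, one of the libraries named above, attributes a prediction to individual features using Shapley values. The sketch below computes exact Shapley values for a tiny hypothetical model to show the idea. `shapley_values` and `toy_model` are assumptions made for this example, and practical SHAP explainers use approximations rather than this exhaustive enumeration, whose cost doubles with every added feature.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(x):
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print([round(p, 6) for p in phi])  # [2.0, 3.0, 3.0]
```

The contributions add up to the difference between the prediction for the instance (8.0) and for the baseline (0.0), and the interaction term is split evenly between the two features involved.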

In [12]:
from azureml.explain.model.tabular_explainer import TabularExplainer
classes = ["false","true"]
tabular_explainer = TabularExplainer(fitted_model, X_train, features=X_train.columns, classes=classes)

In [13]:
global_explanation = tabular_explainer.explain_global(X_train[:100])

100%|██████████| 100/100 [00:15<00:00,  5.62it/s]
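
The global explanation here is computed from only the first 100 training rows, which keeps the run time short. A common convention for SHAP-style explainers is to summarize per-row contributions into a global score by averaging their absolute values. The sketch below assumes that convention and uses made-up numbers; it is not taken from the SDK's source.

```python
def global_importance(local_values, feature_names):
    """Mean absolute local contribution per feature, sorted in descending order."""
    n_rows = len(local_values)
    means = [
        sum(abs(row[j]) for row in local_values) / n_rows
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)

# Hypothetical local contributions for three items and three features.
local_values = [
    [0.02, -0.10, 0.30],
    [0.01, 0.05, -0.20],
    [-0.03, -0.15, 0.10],
]
names = ["ProcessA-Pressure", "ProcessC-Density", "ProcessC-Time"]

ranked = global_importance(local_values, names)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Taking absolute values before averaging means a feature that pushes some items up and others down still ranks as important, so the global score says nothing about direction.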


In [14]:
# Sorted SHAP values
print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))
# Corresponding feature names
print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))
# feature ranks (based on original order of features)
print('global importance rank: {}'.format(global_explanation.global_importance_rank))
# per class feature names
print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))
# per class feature importance values
print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))

ranked global importance values: [0.08100462174990572, 0.049235430651789355, 0.03312915013634618, 0.03274960037911625, 0.02957059064940822, 0.02917127940737956, 0.025856538312190404, 0.015239174419165245, 0.012119398148101097, 0.010296292474114035, 0.00836650050404257]
ranked global importance names: ['ProcessC-Time', 'ProcessC-Density', 'ProcessC-PH', 'ProcessB-Skill', 'ProcessB-Light', 'ProcessA-Humidity', 'ProcessB-Temp', 'ProcessA-Vibration', 'ProcessB-Rotation', 'ProcessA-Pressure', 'ProcessC-skewness']
global importance rank: [10, 7, 8, 4, 3, 1, 5, 2, 6, 0, 9]
ranked per class feature names: [['ProcessC-Time', 'ProcessC-Density', 'ProcessC-PH', 'ProcessB-Skill', 'ProcessB-Light', 'ProcessA-Humidity', 'ProcessB-Temp', 'ProcessA-Vibration', 'ProcessB-Rotation', 'ProcessA-Pressure', 'ProcessC-skewness'], ['ProcessC-Time', 'ProcessC-Density', 'ProcessC-PH', 'ProcessB-Skill', 'ProcessB-Light', 'ProcessA-Humidity', 'ProcessB-Temp', 'ProcessA-Vibration', 'ProcessB-Rotation', 'ProcessA-P
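
The values in `global_importance_rank` are positions in the original column order of `X_train`, listed from most to least important. The sketch below checks this against the printed output by mapping the indices back to feature names; both lists are copied from this notebook.

```python
# Feature columns in their original order (the columns of X_train).
feature_names = [
    'ProcessA-Pressure', 'ProcessA-Humidity', 'ProcessA-Vibration',
    'ProcessB-Light', 'ProcessB-Skill', 'ProcessB-Temp', 'ProcessB-Rotation',
    'ProcessC-Density', 'ProcessC-PH', 'ProcessC-skewness', 'ProcessC-Time',
]

# Values printed for global_importance_rank above.
global_importance_rank = [10, 7, 8, 4, 3, 1, 5, 2, 6, 0, 9]

# Index i in the rank list points at column i of the original feature order.
ranked_names = [feature_names[i] for i in global_importance_rank]
print(ranked_names[:3])  # ['ProcessC-Time', 'ProcessC-Density', 'ProcessC-PH']
```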

In [15]:
dict(zip(global_explanation.get_ranked_global_names(), global_explanation.get_ranked_global_values()))

{'ProcessC-Time': 0.08100462174990572,
 'ProcessC-Density': 0.049235430651789355,
 'ProcessC-PH': 0.03312915013634618,
 'ProcessB-Skill': 0.03274960037911625,
 'ProcessB-Light': 0.02957059064940822,
 'ProcessA-Humidity': 0.02917127940737956,
 'ProcessB-Temp': 0.025856538312190404,
 'ProcessA-Vibration': 0.015239174419165245,
 'ProcessB-Rotation': 0.012119398148101097,
 'ProcessA-Pressure': 0.010296292474114035,
 'ProcessC-skewness': 0.00836650050404257}
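
Because the feature names carry a process prefix, the global importance values can also be grouped by process to see which stage of the line deserves attention first. The sketch below sums the values copied from the output above. Summing importances across features is a rough heuristic used here for illustration, not a feature of the SDK.

```python
from collections import defaultdict

# Global importance values copied from the output above.
global_importance = {
    'ProcessC-Time': 0.08100462174990572,
    'ProcessC-Density': 0.049235430651789355,
    'ProcessC-PH': 0.03312915013634618,
    'ProcessB-Skill': 0.03274960037911625,
    'ProcessB-Light': 0.02957059064940822,
    'ProcessA-Humidity': 0.02917127940737956,
    'ProcessB-Temp': 0.025856538312190404,
    'ProcessA-Vibration': 0.015239174419165245,
    'ProcessB-Rotation': 0.012119398148101097,
    'ProcessA-Pressure': 0.010296292474114035,
    'ProcessC-skewness': 0.00836650050404257,
}

by_process = defaultdict(float)
for feature, value in global_importance.items():
    process = feature.split('-', 1)[0]  # 'ProcessC-Time' -> 'ProcessC'
    by_process[process] += value

total = sum(by_process.values())
ordered = sorted(by_process.items(), key=lambda kv: kv[1], reverse=True)
for process, value in ordered:
    print(f"{process}: {value:.4f} ({value / total:.1%})")
```

With these numbers, ProcessC accounts for roughly half of the total importance, followed by ProcessB and then ProcessA.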

In [17]:
local_explanation = tabular_explainer.explain_local(X_test[14:15])

100%|██████████| 1/1 [00:00<00:00,  6.06it/s]


In [18]:
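# local feature importance information
# local_importance_values is indexed as [class][row][feature]:
# [1] selects the "true" class and [0] the single row passed to explain_local
local_importance_values = local_explanation.local_importance_values
print('local importance for first instance: {}'.format(local_importance_values[1][0]))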

local importance for first instance: [-0.002806487672019617, 0.006808501815953091, -0.019988360369194935, -0.006448343242444671, 0.10223746338070536, -0.03051536385685906, 0.0006870998848662671, 0.11436859818925099, 0.007061757029632271, 0.0359866866866784, 0.32251324354528]


In [19]:
print('local importance feature names: {}'.format(list(local_explanation.features)))

local importance feature names: ['ProcessA-Pressure', 'ProcessA-Humidity', 'ProcessA-Vibration', 'ProcessB-Light', 'ProcessB-Skill', 'ProcessB-Temp', 'ProcessB-Rotation', 'ProcessC-Density', 'ProcessC-PH', 'ProcessC-skewness', 'ProcessC-Time']


In [20]:
dict(zip(local_explanation.features, local_explanation.local_importance_values[1][0]))

{'ProcessA-Pressure': -0.002806487672019617,
 'ProcessA-Humidity': 0.006808501815953091,
 'ProcessA-Vibration': -0.019988360369194935,
 'ProcessB-Light': -0.006448343242444671,
 'ProcessB-Skill': 0.10223746338070536,
 'ProcessB-Temp': -0.03051536385685906,
 'ProcessB-Rotation': 0.0006870998848662671,
 'ProcessC-Density': 0.11436859818925099,
 'ProcessC-PH': 0.007061757029632271,
 'ProcessC-skewness': 0.0359866866866784,
 'ProcessC-Time': 0.32251324354528}
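
Local importance values are signed, so sorting them by absolute value shows which factors drove this particular prediction and in which direction. The sketch below uses the values copied from the output above, which belong to the class at index 1 (`"true"` in the `classes` list).

```python
# Local importance values for the explained instance, copied from the output above.
local_importance = {
    'ProcessA-Pressure': -0.002806487672019617,
    'ProcessA-Humidity': 0.006808501815953091,
    'ProcessA-Vibration': -0.019988360369194935,
    'ProcessB-Light': -0.006448343242444671,
    'ProcessB-Skill': 0.10223746338070536,
    'ProcessB-Temp': -0.03051536385685906,
    'ProcessB-Rotation': 0.0006870998848662671,
    'ProcessC-Density': 0.11436859818925099,
    'ProcessC-PH': 0.007061757029632271,
    'ProcessC-skewness': 0.0359866866866784,
    'ProcessC-Time': 0.32251324354528,
}

# Largest magnitude first; the sign tells the direction of the push.
ranked = sorted(local_importance.items(), key=lambda kv: abs(kv[1]), reverse=True)
for feature, value in ranked[:5]:
    direction = "toward" if value > 0 else "away from"
    print(f"{feature:>20}: {value:+.4f} ({direction} class 'true')")
```

For this item, `ProcessC-Time` dominates, followed by `ProcessC-Density` and `ProcessB-Skill`, while `ProcessB-Temp` is the largest factor pushing the other way.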