diff --git a/notebooks/features/responsible_ai/Interpretability - ICE explainer.ipynb b/notebooks/features/responsible_ai/Interpretability - ICE explainer.ipynb
new file mode 100644
index 0000000000..4d3849c5bb
--- /dev/null
+++ b/notebooks/features/responsible_ai/Interpretability - ICE explainer.ipynb
@@ -0,0 +1,942 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "b7488bd3-b1a1-4b4b-a3be-52e447c4a46c",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "## Partial Dependence (PDP) and Individual Conditional Expectation (ICE) plots"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "9ad6341d-1f5c-4163-a676-29655d900045",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots are interpretation methods that describe how a classification or regression model's predictions depend on the values of individual features (PDP on average, ICE per instance). They are particularly useful when a model developer wants a general understanding of how the model depends on individual feature values, or wants to catch unexpected model behavior.\n",
+    "\n",
+    "The goal of this notebook is to show how these methods work for a pretrained model."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "6d7a6880-7982-41f0-b768-893b50a5fc96",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "In this example, we train a classification model with the Adult Census Income dataset. Then we treat the model as an opaque-box model and calculate the PDP and ICE plots for some selected categorical and numeric features.\n",
+    "\n",
+    "This dataset can be used to predict whether annual income exceeds $50,000/year or not based on demographic data from the 1994 U.S. Census. 
The dataset we're reading contains 32,561 rows and 14 columns/features.\n", + "\n", + "[More info on the dataset here](https://archive.ics.uci.edu/ml/datasets/Adult)\n", + "\n", + "We will train a classification model to predict >= 50K or < 50K based on our features.\n", + "\n", + "---\n", + "Python dependencies:\n", + "\n", + "matplotlib==3.2.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "90785380-3bea-4b75-81ea-60203b3daf8c", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "from pyspark.ml import Pipeline\n", + "from pyspark.ml.classification import GBTClassifier\n", + "from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder\n", + "import pyspark.sql.functions as F\n", + "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", + "\n", + "from synapse.ml.explainers import ICETransformer\n", + "\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "74344307-ac58-4e04-8d80-9fb9d24a5803", + "showTitle": false, + "title": "" + } + }, + "source": [ + "### Read and prepare the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "f42818e8-2361-4a46-aa6b-f94367f91dd1", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "df = spark.read.parquet(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet\")\n", + "display(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "81d6bb74-96e9-42ee-b3db-e42c0ea88376", + "showTitle": false, + "title": "" + } + }, + "source": [ + "### Fit the model and view the predictions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "29185989-6468-47f3-abaf-f362c74c092a", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "categorical_features = [\"race\", \"workclass\", \"marital-status\", \"education\", \"occupation\", \"relationship\", \"native-country\", \"sex\"]\n", + "numeric_features = [\"age\", \"education-num\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "073cae48-beea-4a24-9c3a-2dde9815a4fb", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "string_indexer_outputs = [feature + \"_idx\" for feature in categorical_features]\n", + "one_hot_encoder_outputs = [feature + \"_enc\" for feature in categorical_features]\n", + "\n", + "pipeline = Pipeline(stages=[\n", + " StringIndexer().setInputCol(\"income\").setOutputCol(\"label\").setStringOrderType(\"alphabetAsc\"),\n", + " StringIndexer().setInputCols(categorical_features).setOutputCols(string_indexer_outputs),\n", + " OneHotEncoder().setInputCols(string_indexer_outputs).setOutputCols(one_hot_encoder_outputs),\n", + " VectorAssembler(inputCols=one_hot_encoder_outputs+numeric_features, outputCol=\"features\"),\n", + " GBTClassifier(weightCol=\"fnlwgt\", maxDepth=7, maxIter=100)])\n", + "\n", + "model = pipeline.fit(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { 
+ "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "85fcf25c-b3ad-4afd-a455-cef1c9b21cb4", + "showTitle": false, + "title": "" + } + }, + "source": [ + "Check that model makes sense and has reasonable output. For this, we will check the model performance by calculating the ROC-AUC score." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "bd5b14c1-990d-48b4-b611-46787beed673", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "data = model.transform(df)\n", + "display(data.select('income', 'probability', 'prediction'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "45d0782a-dc16-4a37-977c-b3d426d650c8", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "eval_auc = BinaryClassificationEvaluator(labelCol=\"label\", rawPredictionCol=\"prediction\")\n", + "eval_auc.evaluate(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "8471ec63-9a87-4169-b35e-84f3a54c143a", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Partial Dependence Plots\n", + "\n", + "Partial dependence plots (PDP) show the dependence between the target response and a set of input features of interest, marginalizing over the values of all other input features. It can show whether the relationship between the target response and the input feature is linear, smooth, monotonic, or more complex. This is relevant when you want to have an overall understanding of model behavior. \n", + "\n", + "If you want to learn more please visit [this link](https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence-plots)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "422b70ee-483d-470b-9105-7da2cefb8fec", + "showTitle": false, + "title": "" + } + }, + "source": [ + "### Setup the transformer for PDP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "01ccdea4-3d42-4045-9d3a-e82db0182f17", + "showTitle": false, + "title": "" + } + }, + "source": [ + "To plot PDP we need to set up the instance of `ICETransformer` first and set the `kind` parameter to `average` and then call the `transform` function. For the setup we need to pass the pretrained model, specify the target column (\"probability\" in our case), and pass categorical and numeric feature names." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "df3b3215-5418-4c3e-8931-300cb2481884", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "pdp = ICETransformer(model=model, targetCol=\"probability\", kind=\"average\", targetClasses=[1],\n", + " categoricalFeatures=categorical_features, numericFeatures=numeric_features)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "68774ede-f4ee-4437-bc40-16c7a2192098", + "showTitle": false, + "title": "" + } + }, + "source": [ + "PDP transformer returns a dataframe of 1 row * {number features to explain} columns. 
Each column contains a map between the feature's values and the model's average dependence for that feature value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "d79ebaea-69ef-4c07-8bbb-157d6b86ee9b",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "output_pdp = pdp.transform(df)\n",
+    "display(output_pdp)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "1b025798-2650-4d68-b5cd-136596d343cd",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "### Visualization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "e160c8bf-c4ff-4442-9e6c-f7615212b151",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Helper functions for visualization\n",
+    "\n",
+    "def get_pandas_df_from_column(df, col_name):\n",
+    "    # Explode the map column into one pandas column per feature value.\n",
+    "    keys_df = df.select(F.explode(F.map_keys(F.col(col_name)))).distinct()\n",
+    "    keys = list(map(lambda row: row[0], keys_df.collect()))\n",
+    "    key_cols = list(map(lambda f: F.col(col_name).getItem(f).alias(str(f)), keys))\n",
+    "    return df.select(key_cols).toPandas()\n",
+    "\n",
+    "def plot_dependence_for_categorical(df, col, col_int=True, figsize=(20, 5)):\n",
+    "    # Bar plot of the average dependence for each category.\n",
+    "    dict_values = {}\n",
+    "    col_names = list(df.columns)\n",
+    "\n",
+    "    for col_name in col_names:\n",
+    "        dict_values[col_name] = df[col_name][0].toArray()[0]\n",
+    "    marklist = sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0])\n",
+    "    sortdict = dict(marklist)\n",
+    "\n",
+    "    fig = plt.figure(figsize=figsize)\n",
+    "    plt.bar(sortdict.keys(), sortdict.values())\n",
+    "\n",
+    "    plt.xlabel(col, size=13)\n",
+    "    plt.ylabel(\"Dependence\")\n",
+    "    plt.show()\n",
+    "\n",
+    "def plot_dependence_for_numeric(df, col, col_int=True, figsize=(20, 5)):\n",
+    "    # Line plot of the average dependence over the numeric value grid.\n",
+    "    dict_values = {}\n",
+    "    col_names = list(df.columns)\n",
+    "\n",
+    "    for col_name in col_names:\n",
+    "        dict_values[col_name] = df[col_name][0].toArray()[0]\n",
+    "    marklist = sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0])\n",
+    "    sortdict = dict(marklist)\n",
+    "\n",
+    "    fig = plt.figure(figsize=figsize)\n",
+    "    plt.plot(list(sortdict.keys()), list(sortdict.values()))\n",
+    "\n",
+    "    plt.xlabel(col, size=13)\n",
+    "    plt.ylabel(\"Dependence\")\n",
+    "    plt.ylim(0.0)\n",
+    "    plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "2ebe88c8-0176-47da-9c9a-88c1dcfef123",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "#### Example 1: \"age\"\n",
+    "\n",
+    "We can observe a non-linear dependency: the predicted probability of high income rises rapidly between the ages of 24 and 46, drops slightly after 46, and remains roughly stable from age 68 onward."
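+    ,
+    "\n",
+    "To check the numbers behind this curve, here is a minimal sketch (reusing the `get_pandas_df_from_column` helper defined above; the exact values depend on your trained model):\n",
+    "\n",
+    "```python\n",
+    "age_dep = get_pandas_df_from_column(output_pdp, \"age_dependence\")\n",
+    "# Map each grid value (column name) to the average dependence for target class 1.\n",
+    "age_values = {c: age_dep[c][0].toArray()[0] for c in age_dep.columns}\n",
+    "# Age with the highest average predicted probability of income > 50K.\n",
+    "peak_age = max(age_values, key=age_values.get)\n",
+    "print(peak_age, age_values[peak_age])\n",
+    "```"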
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "7f1386da-bd06-4dd6-9368-a031c982650c",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "df_age = get_pandas_df_from_column(output_pdp, 'age_dependence')\n",
+    "plot_dependence_for_numeric(df_age, 'age')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "fa81f2cb-8646-4392-bda6-53aff144c598",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Your results will look like:\n",
+    "\n",
+    "![pdp_age](https://mmlspark.blob.core.windows.net/graphics/explainers/pdp_age.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "d1da8c5b-b4cb-455f-9289-f646bff56c1d",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "#### Example 2: \"marital-status\"\n",
+    "\n",
+    "The model seems to treat \"Married-civ-spouse\" as one category with a higher average prediction, and all other marital statuses as a second category with a lower average prediction."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "4bbc72bf-307a-4d20-98f1-73673d81bb03",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "df_marital_status = get_pandas_df_from_column(output_pdp, 'marital-status_dependence')\n",
+    "plot_dependence_for_categorical(df_marital_status, 'marital-status', False, figsize=(30, 5))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "b29edf66-a088-4837-a9c7-60b20cb71862",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Your results will look like:\n",
+    "![pdp_marital-status](https://mmlspark.blob.core.windows.net/graphics/explainers/pdp_marital-status.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "16b524bf-7587-461d-aa50-a88972cf2bb3",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "#### Example 3: \"capital-gain\"\n",
+    "\n",
+    "In the first graph, we run PDP with the default parameters. This representation is not very useful because it is not granular enough: by default, the value range of a numeric feature is calculated dynamically from the data.\n",
+    "\n",
+    "In the second graph, we set rangeMin = 0 and rangeMax = 10000 to get a more granular view of the feature of interest. Now we can see more clearly how the model makes decisions in this smaller region."
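+    ,
+    "\n",
+    "For reference, each entry of `numericFeatures` can be a dict rather than relying on the defaults. Here is a sketch of the spec used in the second run below (only the keys used in this notebook are shown):\n",
+    "\n",
+    "```python\n",
+    "capital_gain_spec = {\n",
+    "    \"name\": \"capital-gain\",  # column to explain\n",
+    "    \"numSplits\": 20,         # number of splits in the value grid\n",
+    "    \"rangeMin\": 0.0,         # lower bound of the grid (otherwise inferred from the data)\n",
+    "    \"rangeMax\": 10000.0,     # upper bound of the grid\n",
+    "}\n",
+    "```"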
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "2d076473-9760-4634-8465-aa90cb2a40dc",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "df_cap_gain = get_pandas_df_from_column(output_pdp, 'capital-gain_dependence')\n",
+    "plot_dependence_for_numeric(df_cap_gain, 'capital-gain')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "9b60b248-3ff6-42b1-9e29-9f85678fb1f6",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Your results will look like:\n",
+    "\n",
+    "![pdp_capital-gain-first](https://mmlspark.blob.core.windows.net/graphics/explainers/pdp_capital-gain-first.png)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "36853ee2-2b55-4cd5-ab53-b2090d8cb2c5",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "pdp_cap_gain = ICETransformer(model=model, targetCol=\"probability\", kind=\"average\", targetClasses=[1],\n",
+    "                              numericFeatures=[{\"name\": \"capital-gain\", \"numSplits\": 20, \"rangeMin\": 0.0,\n",
+    "                                                \"rangeMax\": 10000.0}], numSamples=50)\n",
+    "output_pdp_cap_gain = pdp_cap_gain.transform(df)\n",
+    "df_cap_gain_granular = get_pandas_df_from_column(output_pdp_cap_gain, 'capital-gain_dependence')\n",
+    "plot_dependence_for_numeric(df_cap_gain_granular, 'capital-gain')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "873e7a8f-ce8a-44ad-81c4-dabf2b97c101",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Your results will look like:\n",
+    "\n",
+    "![pdp_capital-gain-second](https://mmlspark.blob.core.windows.net/graphics/explainers/pdp_capital-gain-second.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "773be71d-25ba-45ea-bf8f-98a9722e7da6",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "### Conclusions\n",
+    "\n",
+    "PDP can be used to show how features influence model predictions on average and to help the modeler catch unexpected model behavior."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "e65a4bca-e276-4ecd-b3ed-8171b4b41867",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "## Individual Conditional Expectation\n",
+    "\n",
+    "ICE plots display one line per instance, showing how that instance's prediction changes as a feature's value changes. Each line represents the predictions for one instance as we vary the feature of interest. This is relevant when you want to observe the model's predictions for individual instances in more detail.\n",
+    "\n",
+    "If you want to learn more, please visit [this link](https://scikit-learn.org/stable/modules/partial_dependence.html#individual-conditional-expectation-ice-plot)."
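+    ,
+    "\n",
+    "As a quick way to see the difference from PDP (a small sketch, assuming the `ice` transformer and its `output` dataframe created in the cells below), compare the output shapes: the PDP output has a single row, while the ICE output keeps one row per sampled instance.\n",
+    "\n",
+    "```python\n",
+    "print(output_pdp.count())  # 1 row: the average dependence per feature\n",
+    "print(output.count())      # one row per sampled instance (numSamples=50 below)\n",
+    "```"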
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "1fe87d15-5a3f-4c40-b28d-3e751da22900", + "showTitle": false, + "title": "" + } + }, + "source": [ + "### Setup the transformer for ICE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "6442fdcc-7553-4147-a65f-8b20a9162a27", + "showTitle": false, + "title": "" + } + }, + "source": [ + "To plot ICE we need to set up the instance of `ICETransformer` first and set the `kind` parameter to `individual` and then call the `transform` function. For the setup we need to pass the pretrained model, specify the target column (\"probability\" in our case), and pass categorical and numeric feature names. For better visualization we set the number of samples to 50." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b2c4d0a7-f7ab-4afa-b6fd-4eac4fc8b27b", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "ice = ICETransformer(model=model, targetCol=\"probability\", targetClasses=[1], \n", + " categoricalFeatures=categorical_features, numericFeatures=numeric_features, numSamples=50)\n", + "\n", + "output = ice.transform(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b6272501-44f2-4fff-9abe-b96c8129f943", + "showTitle": false, + "title": "" + } + }, + "source": [ + "### Visualization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "6f3dfaa4-a1e0-4308-a702-973325d4c58f", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "# Helper functions for visualization\n", + "from math import pi\n", + "\n", + "from collections import defaultdict\n", + "\n", + "def plot_ice_numeric(df, col, col_int=True, figsize=(20, 10)):\n", + " dict_values = defaultdict(list)\n", + " col_names = list(df.columns)\n", + " num_instances = df.shape[0]\n", + " \n", + " instances_y = {}\n", + " i = 0\n", + "\n", + " for col_name in col_names:\n", + " for i in range(num_instances):\n", + " dict_values[i].append(df[col_name][i].toArray()[0])\n", + " \n", + " fig = plt.figure(figsize = figsize)\n", + " for i in range(num_instances):\n", + " plt.plot(col_names, dict_values[i], \"k\")\n", + " \n", + " \n", + " plt.xlabel(col, size=13)\n", + " plt.ylabel(\"Dependence\")\n", + " plt.ylim(0.0)\n", + " \n", + " \n", + " \n", + "def plot_ice_categorical(df, col, col_int=True, figsize=(20, 10)):\n", + " dict_values = defaultdict(list)\n", + " col_names = list(df.columns)\n", + " num_instances = df.shape[0]\n", + " \n", + " angles = [n / float(df.shape[1]) * 2 * pi for n in range(df.shape[1])]\n", + " angles += angles [:1]\n", + " \n", + " instances_y = {}\n", + " i = 0\n", + "\n", + " for col_name in col_names:\n", + " for i in range(num_instances):\n", + " dict_values[i].append(df[col_name][i].toArray()[0])\n", + " \n", + " fig = plt.figure(figsize = figsize)\n", + " ax = plt.subplot(111, polar=True)\n", + " plt.xticks(angles[:-1], col_names)\n", + " \n", + " for i in range(num_instances):\n", + " values = dict_values[i]\n", + " values += values[:1]\n", + " ax.plot(angles, values, \"k\")\n", + " ax.fill(angles, values, 'teal', alpha=0.1)\n", + "\n", + " plt.xlabel(col, size=13)\n", + " 
plt.show()\n",
+    "\n",
+    "def overlay_ice_with_pdp(df_ice, df_pdp, col, col_int=True, figsize=(20, 5)):\n",
+    "    # Black lines: one ICE curve per sampled instance.\n",
+    "    dict_values = defaultdict(list)\n",
+    "    col_names_ice = list(df_ice.columns)\n",
+    "    num_instances = df_ice.shape[0]\n",
+    "\n",
+    "    for col_name in col_names_ice:\n",
+    "        for i in range(num_instances):\n",
+    "            dict_values[i].append(df_ice[col_name][i].toArray()[0])\n",
+    "\n",
+    "    fig = plt.figure(figsize=figsize)\n",
+    "    for i in range(num_instances):\n",
+    "        plt.plot(col_names_ice, dict_values[i], \"k\")\n",
+    "\n",
+    "    # Red line: the PDP (average dependence) drawn on top of the ICE curves.\n",
+    "    dict_values_pdp = {}\n",
+    "    col_names = list(df_pdp.columns)\n",
+    "\n",
+    "    for col_name in col_names:\n",
+    "        dict_values_pdp[col_name] = df_pdp[col_name][0].toArray()[0]\n",
+    "    marklist = sorted(dict_values_pdp.items(), key=lambda x: int(x[0]) if col_int else x[0])\n",
+    "    sortdict = dict(marklist)\n",
+    "\n",
+    "    plt.plot(col_names_ice, list(sortdict.values()), \"r\", linewidth=5)\n",
+    "\n",
+    "    plt.xlabel(col, size=13)\n",
+    "    plt.ylabel(\"Dependence\")\n",
+    "    plt.ylim(0.0)\n",
+    "    plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "856f3cbb-fc02-4279-9b16-059afa79e8d2",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "#### Example 1: Numeric feature: \"age\"\n",
+    "\n",
+    "We can overlay the PDP on top of the ICE plots. In the graph, the red line shows the PDP for the \"age\" feature, and the black lines show the ICE plots for 50 randomly selected observations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "8d794aca-2a9e-4791-a2a3-95f294731ebc",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "age_df_ice = get_pandas_df_from_column(output, 'age_dependence')\n",
+    "age_df_pdp = get_pandas_df_from_column(output_pdp, 'age_dependence')\n",
+    "\n",
+    "overlay_ice_with_pdp(age_df_ice, age_df_pdp, col='age', figsize=(30, 10))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "c7fe65ce-1729-4a09-904b-59b52c22bd8f",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Your results will look like:\n",
+    "![pdp_age_overlayed](https://mmlspark.blob.core.windows.net/graphics/explainers/pdp_age_overlayed.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "d0a80081-a3f6-41d0-8f6d-b686220dbeaa",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "#### Example 2: Categorical feature: \"occupation\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "b13e6704-0783-48e2-bc76-c327cf5fba00",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "For the visualization of categorical features, we use a star plot, as unpacked in the short sketch below.\n",
+    "\n",
+    "- The X-axis here is a circle which is split into equal parts, each representing a feature value.\n",
+    "- The Y-coordinate shows the dependence values. Each line represents a sample observation."
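+    ,
+    "\n",
+    "To see the raw numbers behind a single line on the star plot, here is a minimal sketch (reusing the helpers above; row 0 is just one sampled observation):\n",
+    "\n",
+    "```python\n",
+    "occ_dep = get_pandas_df_from_column(output, \"occupation_dependence\")\n",
+    "# Feature value -> predicted probability for this one observation (row 0).\n",
+    "row0 = {c: occ_dep[c][0].toArray()[0] for c in occ_dep.columns}\n",
+    "print(row0)\n",
+    "```"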
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "1fee7eb9-6e89-480c-b864-fed2cfbf33f5",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "outputs": [],
+   "source": [
+    "occupation_dep = get_pandas_df_from_column(output, 'occupation_dependence')\n",
+    "\n",
+    "plot_ice_categorical(occupation_dep, 'occupation', figsize=(30, 10))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "70090a1a-b979-422d-8870-4da6904afef1",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "Your results will look like:\n",
+    "\n",
+    "![pdp_occupation-star-plot](https://mmlspark.blob.core.windows.net/graphics/explainers/pdp_occupation-star-plot.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "4f6494ac-6bf5-408c-9ac2-c44b54afba38",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "### Conclusions\n",
+    "\n",
+    "ICE plots show model behavior for individual observations. Each line represents the model's predictions for one observation as we vary the feature of interest."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "application/vnd.databricks.v1+cell": {
+     "inputWidgets": {},
+     "nuid": "b974545e-371b-41b0-b9a4-adb3036d9eb8",
+     "showTitle": false,
+     "title": ""
+    }
+   },
+   "source": [
+    "## Overall conclusions\n",
+    "\n",
+    "Interpretation methods are very important responsible AI tools.\n",
+    "\n",
+    "Partial dependence plots (PDP) and Individual Conditional Expectation (ICE) plots can be used to visualize and analyze the interaction between the target response and a set of input features of interest.\n",
+    "\n",
+    "PDP shows the dependence on average, while ICE shows it at the individual-sample level.\n",
+    "\n",
+    "Using the examples above, we showed how to calculate and visualize such plots in a scalable manner, to understand how a classification or regression model makes predictions, which features heavily impact it, and how its predictions change when a feature's value changes."
+   ]
+  }
+ ],
+ "metadata": {
+  "application/vnd.databricks.v1+notebook": {
+   "dashboards": [],
+   "language": "python",
+   "notebookMetadata": {
+    "pythonIndentUnit": 2
+   },
+   "notebookName": "PDP-ICE-tutorial-new",
+   "notebookOrigID": 2416290700869370,
+   "widgets": {}
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}