## **Setting Things Up**

**1. If you haven't already, select:**

`File > Save a Copy in Drive`

**to copy this notebook to your Google Drive, and work on that copy. If you don't do this, your changes won't be saved!**


**2. To use a GPU with your notebook, open the:**

`Runtime > Change runtime type`

**menu, and then set the hardware accelerator dropdown to GPU. This can significantly speed up the training process.**

**3. To have enough memory for your notebook, open the:**

`Runtime > Change runtime type`

**menu, and then select High-RAM in the Runtime shape dropdown.**

To help you get started, we have included ready-to-use code on Google Colab for this problem, so you can begin immediately without installing any software libraries. If you prefer to set up your own programming environment instead of using Google Colab, the provided files and code will still be useful.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun Jun 25 12:54:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
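The `!nvidia-smi` line above is IPython shell syntax and only works inside a notebook. For readers running the code in their own environment, the following is a minimal sketch of the same check in plain Python. The helper name `query_gpu_info` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def query_gpu_info():
    """Return the nvidia-smi report as text, or None if no GPU tool is usable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not installed or not on PATH
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None  # the tool exists but could not be run
    if result.returncode != 0:
        return None  # the driver reported a failure
    return result.stdout


info = query_gpu_info()
print(info if info is not None else "Not connected to a GPU")
```

Checking the return code instead of searching the output for the word "failed" is a design choice of this sketch, not the notebook's behaviour.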

In [2]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


In [4]:
!pip install shap

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
# -*-coding:utf8 -*-
import tensorflow as tf
# print("TensorFlow version:", tf.__version__)

import os

import pandas as pd
import numpy as np

from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
from PIL import Image

import shap

from tensorflow import keras
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras.optimizers import RMSprop, Adam, SGD
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras import Model, Input, layers, regularizers, activations
from tensorflow.keras.models import load_model

import csv

In [6]:
from google.colab import drive
drive.mount('/content/drive')

# Change this to the path of your own project folder

%cd /content/drive/MyDrive/Colab Notebooks/Vehicle Rating Prediction

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/Vehicle Rating Prediction


## **1 Data Processing**

In [7]:
var = "total score"
# var = "safety score"
# var = "performance score"
# var = "interior score"
# var = "critics score"

In [8]:
# read info_data
file_name = "parametric data 2571 normalize " + var + ".csv"
info_data = pd.read_csv(file_name)
# Convert to a NumPy array that holds only the data rows (the header row is excluded)
info_data = np.array(info_data)
print(info_data.shape)  # (2571, 310)
print(len(info_data))
print(info_data.shape[1])

(2571, 310)
2571
310


In [None]:
# Input parametric data (2571 x 310)
# column 0: original index
# column 1: model name
# columns 2-303: parametric features (302 columns)
# column 304: total score
# column 305: critics score
# column 306: performance score
# column 307: interior score
# column 308: safety score
# column 309: data split index => 1: train data; 2: validation data; 3: test data
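The column layout above can be turned into slices and boolean masks. The sketch below uses a small toy array with the same 310-column layout, since the real CSV is not reproduced here. The names `FEATURES` and `SPLIT_COL` are illustrative. Selecting rows through column 309 is an alternative to relying on row order, and it is an assumption of this sketch rather than what the notebook does.

```python
import numpy as np

# Toy stand-in for info_data: 6 rows, 310 columns, same layout as described above.
rng = np.random.default_rng(0)
toy = np.empty((6, 310), dtype=object)
toy[:, 0] = np.arange(6)                        # column 0: original index
toy[:, 1] = [f"model_{i}" for i in range(6)]    # column 1: model name
toy[:, 2:309] = rng.random((6, 307))            # columns 2-308: features and scores
toy[:, 309] = [1, 1, 1, 2, 3, 3]                # column 309: split index

FEATURES = slice(2, 304)   # the 302 parametric features
SPLIT_COL = 309

split = toy[:, SPLIT_COL].astype(int)
x_train = toy[split == 1, FEATURES].astype(float)
x_val = toy[split == 2, FEATURES].astype(float)
x_test = toy[split == 3, FEATURES].astype(float)
print(x_train.shape, x_val.shape, x_test.shape)
```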

**To obtain the interpretability results of the model, we only need the input features ("X"). First, we need to obtain the name of each feature.**

In [9]:
with open(file_name,'r') as fr:
    reader = csv.DictReader(fr)
    headers = reader.fieldnames
    print(f"headers={headers}")
    del headers[0:2]      # drop the index and model-name columns
    del headers[302:308]  # drop the five score columns and the data split index
print(len(headers))
print(headers)

302
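As an illustration of the header trimming, here is a small sketch on a toy CSV with the same column order but only three feature columns. The column names are invented for the example. It uses a non-mutating slice in place of the two `del` statements.

```python
import csv
import io

# Toy CSV with the same column order as the real file, but only 3 feature columns.
toy_csv = io.StringIO(
    "index,model,feat_a,feat_b,feat_c,total,critics,performance,interior,safety,split\n"
    "0,demo,0.1,0.2,0.3,8.1,8.0,7.9,8.2,9.0,1\n"
)
reader = csv.DictReader(toy_csv)
all_names = list(reader.fieldnames)

# Drop the 2 leading identifier columns and the 6 trailing score/split columns.
feature_names = all_names[2:-6]
print(feature_names)
```

Slicing leaves `reader.fieldnames` untouched, which matters if the same reader is later used to iterate over rows.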


In [10]:
# Shuffled row indices for the training split
num1 = 2055
index1 = tf.random.shuffle(tf.range(num1)).numpy()

# Shuffled row indices for the validation split
num2 = 258
index2 = tf.random.shuffle(tf.range(num2)).numpy()

# Shuffled row indices for the test split
num3 = 258
index3 = tf.random.shuffle(tf.range(num3)).numpy()
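The shuffles above are unseeded, so the row order differs from run to run. If a repeatable order is wanted, one option is a seeded NumPy generator, sketched below. The `SEED` value is an arbitrary assumption, and this is a swapped-in technique (NumPy instead of `tf.random.shuffle`), not the notebook's method.

```python
import numpy as np

SEED = 0  # hypothetical value; any fixed integer gives a repeatable order
rng = np.random.default_rng(SEED)

num1, num2, num3 = 2055, 258, 258
index1 = rng.permutation(num1)  # shuffled training row indices
index2 = rng.permutation(num2)  # shuffled validation row indices
index3 = rng.permutation(num3)  # shuffled test row indices
print(index1[:5])
```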

In [11]:
# Build the input feature matrices; only the input features are needed to compute the model's interpretability results.
x_train_tab = np.zeros((num1, 302))
for i in range(num1):
    x_train_tab[i, :] = np.array(info_data[index1[i], 2:304], dtype=float)
x_train_tab = tf.convert_to_tensor(x_train_tab)
# print(x_train_tab)


x_validation_tab = np.zeros((num2, 302))
for i in range(num2):
    x_validation_tab[i, :] = np.array(info_data[index2[i] + num1, 2:304], dtype=float)
x_validation_tab = tf.convert_to_tensor(x_validation_tab)
# print(x_validation_tab)

x_test_tab = np.zeros((num3, 302))
for i in range(num3):
    x_test_tab[i, :] = np.array(info_data[index3[i] + num1 + num2, 2:304], dtype=float)
x_test_tab = tf.convert_to_tensor(x_test_tab)
# print(x_test_tab)

x_all = np.zeros((2571, 302))
for i in range(2571):
  x_all[i, :] = np.array(info_data[i, 2:304], dtype=float)
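The row-by-row loops above can also be written as a single fancy-indexing expression. The sketch below checks on a toy array that both forms give the same matrix. `toy_info`, `n_train` and `order` are illustrative stand-ins for `info_data`, `num1` and `index1`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_info = np.empty((10, 310), dtype=object)
toy_info[:, :2] = "id"                     # identifier columns
toy_info[:, 2:] = rng.random((10, 308))    # feature, score and split columns

n_train = 6
order = rng.permutation(n_train)

# Loop form, as in the cell above.
x_loop = np.zeros((n_train, 302))
for i in range(n_train):
    x_loop[i, :] = np.array(toy_info[order[i], 2:304], dtype=float)

# Vectorized form: pick all shuffled rows and the feature columns at once.
x_vec = toy_info[order, 2:304].astype(float)

print(np.array_equal(x_loop, x_vec))
```

For the validation and test splits the same pattern applies with an offset added to the index array, for example `toy_info[order + n_train, 2:304]`.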

In [12]:
num4 = num2 + num3
print(num4)
x_sample_tab = np.zeros((num4, 302))
for i in range(num4):
    x_sample_tab[i, :] = np.array(info_data[i+num1, 2:304], dtype=float)
x_sample_tab = tf.convert_to_tensor(x_sample_tab)
# print(x_sample_tab)

516


## **2 SHAP analysis**

**First of all, we need to load the parametric model.**

In [13]:
# MLP
MLPmodel = tf.keras.models.load_model('model weight/' + var + '_Par.h5')
for layer in MLPmodel.layers:
  layer._name = layer._name + "_b"
MLPmodel.summary()

Model: "Parametric_Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 tab_dense1_b (Dense)        (None, 302)               91506     
                                                                 
 tab_dropout1_b (Dropout)    (None, 302)               0         
                                                                 
 tab_dense2_b (Dense)        (None, 100)               30300     
                                                                 
 tab_dense3_b (Dense)        (None, 1)                 101       
                                                                 
Total params: 121,907
Trainable params: 121,907
Non-trainable params: 0
_________________________________________________________________


**DeepExplainer, an implementation of Deep SHAP, was developed based on [SHAP](https://arxiv.org/abs/1705.07874) and [DeepLIFT](https://arxiv.org/abs/1704.02685).**

In [14]:
explainer = shap.DeepExplainer(MLPmodel, np.array(x_train_tab))

keras is no longer supported, please use tf.keras instead.
Your TensorFlow version is newer than 2.4.0 and so graph support has been removed in eager mode and some static graphs may not be supported. See PR #1483 for discussion.
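SHAP values are additive: the base value (the mean prediction over the background data) plus the sum of the per-feature values equals the model output for the explained sample. DeepExplainer approximates such values for deep networks. The sketch below illustrates the property on a toy linear model, where the values have a closed form under the feature-independence assumption. It does not use the notebook's MLP, and all names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)            # toy linear weights
b = 0.3                           # toy bias
background = rng.random((100, 5)) # plays the role of the training data
x = rng.random(5)                 # sample to explain


def f(inputs):
    """Toy linear model standing in for a trained network."""
    return inputs @ w + b


# Base value: average model output over the background data.
base_value = f(background).mean()

# Closed-form SHAP values of a linear model with independent features.
phi = w * (x - background.mean(axis=0))

# Additivity: base value plus all feature contributions recovers f(x).
print(np.isclose(base_value + phi.sum(), f(x)))
```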


## **Obtain the average SHAP values for different vehicle brands**

In [15]:
average_value = np.zeros((55, len(headers)))

In [16]:
count_num = np.zeros((55,))

In [17]:
avg = np.zeros((1, len(headers)))

In [18]:
for i in range(num4):
  current_shap = explainer.shap_values(np.array(x_sample_tab[i:(i+1),:]))[0][0]
  for j in range(55):
    if x_sample_tab[i, j+17] == 1:
      average_value[j, :] = average_value[j, :] + current_shap
      count_num[j] = count_num[j] + 1

for i in range(55):
  average_value[i, :] = average_value[i, :]/count_num[i]

`tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
invalid value encountered in true_divide
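The `invalid value encountered in true_divide` warning appears because some groups have a count of zero, so their averages become `0/0`. The sketch below shows one way to compute per-group means from one-hot columns while leaving empty groups as `NaN` without triggering the warning. The toy sizes and names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_groups, n_features = 8, 4, 6

# Group 2 never appears, mimicking a brand with no samples.
group_of_sample = np.array([0, 0, 1, 1, 1, 3, 3, 3])
onehot = np.eye(n_groups)[group_of_sample]           # (samples, groups)
shap_matrix = rng.normal(size=(n_samples, n_features))

sums = onehot.T @ shap_matrix                        # per-group sums
counts = onehot.sum(axis=0)                          # samples per group

# Divide only where the count is positive; other rows keep the NaN fill value.
means = np.full_like(sums, np.nan)
np.divide(sums, counts[:, None], out=means, where=counts[:, None] > 0)
print(counts)
```

Rows that stay `NaN` mark groups that did not occur in the sample, which can then be filtered out before writing the CSV.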


In [20]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_brands_features.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(average_value)

## **Obtain the average SHAP values for each year (2007-2023)**

In [None]:
average_value = np.zeros((17, len(headers)))

In [None]:
count_num = np.zeros((17,))

In [None]:
avg = np.zeros((1, len(headers)))

In [None]:
for i in range(num4):
  current_shap = explainer.shap_values(np.array(x_sample_tab[i:(i+1),:]))[0][0]
  for j in range(17):
    if x_sample_tab[i, j] == 1:
      average_value[j, :] = average_value[j, :] + current_shap
      count_num[j] = count_num[j] + 1

for i in range(17):
  average_value[i, :] = average_value[i, :]/count_num[i]


invalid value encountered in true_divide


In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_years_features.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(average_value)
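The CSV written above has feature names as its header but no row labels, so a reader has to know the row order. The sketch below shows one way to prepend a label column. It writes to an in-memory buffer in place of a file. It assumes that one-hot column `j` corresponds to year `2007 + j`, which the source does not state explicitly.

```python
import csv
import io

import numpy as np

feature_names = ["feat_a", "feat_b", "feat_c"]             # toy feature names
average_value = np.arange(6, dtype=float).reshape(2, 3)    # toy averages

# ASSUMPTION: row j corresponds to model year 2007 + j.
row_labels = [2007 + j for j in range(average_value.shape[0])]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["year"] + feature_names)
for label, row in zip(row_labels, average_value):
    writer.writerow([label] + row.tolist())

print(buffer.getvalue())
```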

## **Obtain the absolute SHAP values and the SHAP values**

**5**

**Absolute SHAP values and SHAP values of the five feature categories of the parametric data**

**General Information, Exterior Information, Interior Information, Mechanical Information, Safety Information**

In [None]:
headers_new1 = ["General Information", "Exterior Information", "Interior Information", "Mechanical Information", "Safety Information"]

**Absolute SHAP values**

In [None]:
general_average_value = np.zeros((num4, 5))

In [None]:
for i in range(num4):
  shap_values = np.abs(explainer.shap_values(np.array(x_sample_tab[i:i+1]))[0][0])
  count = np.zeros((len(headers_new1),))
  for j in range(302):
    if 0 <= j <= 76 or 299 <= j <= 301:
      count[0] = count[0] + float(shap_values[j])
    elif 77 <= j <= 110:
      count[1] = count[1] + float(shap_values[j])
    elif 111 <= j <= 216:
      count[2] = count[2] + float(shap_values[j])
    elif 217 <= j <= 269:
      count[3] = count[3] + float(shap_values[j])
    elif 270 <= j <= 298:
      count[4] = count[4] + float(shap_values[j])
  for k in range(len(headers_new1)):
    general_average_value[i][k] = count[k]
  # print("Good")
avg1 = np.mean(general_average_value, axis=0, keepdims=True)

`tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.


In [None]:
print(avg1)

[[0.06578347 0.01706766 0.16072238 0.04583692 0.0343218 ]]
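The per-category sums can also be expressed as one matrix product with a feature-to-category membership matrix. The sketch below builds that matrix from the same inclusive column ranges used in the loop and applies it to a toy matrix standing in for the SHAP values. `CATEGORY_RANGES` and `membership` are illustrative names, not part of the notebook.

```python
import numpy as np

# Inclusive column ranges per category, copied from the loop above.
CATEGORY_RANGES = {
    "General Information": [(0, 76), (299, 301)],
    "Exterior Information": [(77, 110)],
    "Interior Information": [(111, 216)],
    "Mechanical Information": [(217, 269)],
    "Safety Information": [(270, 298)],
}
N_FEATURES = 302

# membership[j, k] is 1 when feature j belongs to category k.
membership = np.zeros((N_FEATURES, len(CATEGORY_RANGES)))
for k, spans in enumerate(CATEGORY_RANGES.values()):
    for start, end in spans:
        membership[start:end + 1, k] = 1.0

# Toy stand-in for a (samples x features) matrix of SHAP values.
rng = np.random.default_rng(0)
toy_shap = rng.normal(size=(4, N_FEATURES))

category_sums = toy_shap @ membership              # signed sums, shape (4, 5)
category_abs_sums = np.abs(toy_shap) @ membership  # absolute sums, shape (4, 5)
print(category_sums.shape)
```

Checking that every row of `membership` sums to one is a quick way to confirm that the ranges cover all 302 features exactly once.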


In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_abs_5 all.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers_new1)
    f_csv.writerows(general_average_value)

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_abs_5 avg.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers_new1)
    f_csv.writerows(avg1)

**SHAP values**

In [None]:
general_average_value = np.zeros((num4, 5))

In [None]:
for i in range(num4):
  shap_values = explainer.shap_values(np.array(x_sample_tab[i:i+1]))[0][0]
  count = np.zeros((len(headers_new1),))
  for j in range(302):
    if 0 <= j <= 76 or 299 <= j <= 301:
      count[0] = count[0] + float(shap_values[j])
    elif 77 <= j <= 110:
      count[1] = count[1] + float(shap_values[j])
    elif 111 <= j <= 216:
      count[2] = count[2] + float(shap_values[j])
    elif 217 <= j <= 269:
      count[3] = count[3] + float(shap_values[j])
    elif 270 <= j <= 298:
      count[4] = count[4] + float(shap_values[j])
  for k in range(len(headers_new1)):
    general_average_value[i][k] = count[k]
  # print("Good")
avg1 = np.mean(general_average_value, axis=0, keepdims=True)

In [None]:
print(avg1)

[[ 3.53249017e-04  1.97200447e-05  2.19449101e-03 -6.32397999e-04
   2.39171131e-04]]


In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_5 avg.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers_new1)
    f_csv.writerows(avg1)

**21**

**Absolute SHAP values and SHAP values of the 21 feature subcategories for the model prediction**

In [None]:
headers_new = ['years', 'brand', 'Drivetrain',
        'Exterior Body Style', 'Exterior Dimensions', 'Exterior Measurements',
        'Interior Convenience & Comfort', 'Interior Dimensions', 'Interior Entertainment', 'Interior Heating Cooling', 'Interior Navigation & Communication', 'Interior Seats',
        'Mechanical Transmission', 'Mechanical Fuel', 'Engine & Performance',
        'Safety Airbags', 'Safety Brakes', 'Safety Features',
        'MSRP', 'mpg_city', 'mpg_hwy']

**Absolute SHAP values**

In [None]:
average_value = np.zeros((num4, len(headers_new)))

In [None]:
# Inclusive (start, end) column ranges of the 21 subcategories, in the order of headers_new
group_bounds = [(0, 16), (17, 71), (72, 76),
                (77, 105), (106, 109), (110, 110),
                (111, 162), (163, 169), (170, 186), (187, 190), (191, 194), (195, 216),
                (217, 230), (231, 239), (240, 269),
                (270, 276), (277, 280), (281, 298),
                (299, 299), (300, 300), (301, 301)]

for i in range(num4):
  shap_values = np.abs(explainer.shap_values(np.array(x_sample_tab[i:i+1]))[0][0])
  # Sum the absolute SHAP values of all features in each subcategory
  for k, (start, end) in enumerate(group_bounds):
    average_value[i][k] = shap_values[start:end + 1].sum()

In [None]:
avg = np.mean(average_value, axis=0, keepdims=True)

In [None]:
print(avg)

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_abs_21 avg.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers_new)
    f_csv.writerows(avg)

**SHAP values**

In [None]:
average_value = np.zeros((num4, len(headers_new)))

In [None]:
# Inclusive (start, end) column ranges of the 21 subcategories, in the order of headers_new
group_bounds = [(0, 16), (17, 71), (72, 76),
                (77, 105), (106, 109), (110, 110),
                (111, 162), (163, 169), (170, 186), (187, 190), (191, 194), (195, 216),
                (217, 230), (231, 239), (240, 269),
                (270, 276), (277, 280), (281, 298),
                (299, 299), (300, 300), (301, 301)]

for i in range(num4):
  shap_values = explainer.shap_values(np.array(x_sample_tab[i:i+1]))[0][0]
  # Sum the SHAP values of all features in each subcategory
  for k, (start, end) in enumerate(group_bounds):
    average_value[i][k] = shap_values[start:end + 1].sum()

In [None]:
avg = np.mean(average_value, axis=0, keepdims=True)

In [None]:
print(avg)

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_21 avg.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers_new)
    f_csv.writerows(avg)

**302**

**Absolute SHAP values and SHAP values of each of the 302 individual features for the model prediction**

**Absolute SHAP values**

In [None]:
average_value = np.zeros((num4, len(headers)))

In [None]:
for i in range(num4):
  # get shap value
  average_value[i,:] = np.abs(explainer.shap_values(np.array(x_sample_tab[i:(i+1),:]))[0][0])

In [None]:
avg = np.mean(average_value, axis=0, keepdims=True)

In [None]:
print(avg)
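A common way to read a vector of mean absolute SHAP values is to rank the features by it. The sketch below does this on toy numbers. The feature names and values are invented, and `top_k` is an illustrative parameter.

```python
import numpy as np

feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]   # toy names
abs_shap = np.array([[0.1, 0.5, 0.0, 0.2],
                     [0.3, 0.7, 0.1, 0.1]])                # toy |SHAP| values

# Mean absolute SHAP value per feature across all samples.
mean_abs = abs_shap.mean(axis=0)

# Indices of the top_k features, largest mean |SHAP| first.
top_k = 2
order = np.argsort(-mean_abs, kind="stable")[:top_k]
ranking = [(feature_names[j], float(mean_abs[j])) for j in order]
print(ranking)
```

In the notebook's setting the same pattern would apply to the row vector `avg` and the list `headers`, after flattening `avg` to one dimension.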

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_abs_all_302.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(average_value)

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_abs_avg_302.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(avg)

**SHAP values**

In [None]:
average_value = np.zeros((num4, len(headers)))

In [None]:
for i in range(num4):
  # get shap value
  average_value[i,:] = explainer.shap_values(np.array(x_sample_tab[i:(i+1),:]))[0][0]

In [None]:
avg = np.mean(average_value, axis=0, keepdims=True)

In [None]:
print(avg)

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_all_302.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(average_value)

In [None]:
with open('SHAP/SHAP_Parametric/'+ var + '_SHAP_avg_302.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(avg)