# Interpret Explainability for Mobile Prices

This notebook analyzes the price range of mobile phones by observing how individual components are taken into account and how they relate to the overall price. Because the target is a price range rather than an exact value, the task is a classification rather than a regression. To find out which attributes contribute the most, a LIME explainability model is used.

### import common libs

Import the basic, widely used libraries pandas and numpy, and explicitly set the labels for each feature that should be displayed in the explainer model.

In [2]:
import pandas as pd
import numpy as np

In [None]:
# set the labels for the features; the target column "price_range" is not a feature and is left out
feature_names = ["battery_power","blue","clock_speed","dual_sim","fc","four_g","int_memory","m_dep","mobile_wt","n_cores","pc","px_height","px_width","ram","sc_h","sc_w","talk_time","three_g","touch_screen","wifi"]
class_names = ["very cheap", # label 0
               "cheap", # label 1 
               "expensive", # label 2
               "very expensive"] # label 3

### load training data
The mobile price dataset is available at https://www.kaggle.com/iabhishekofficial/mobile-price-classification . It is a simple CSV file that can be parsed by pandas. The training target is the price range; all other columns are used to determine this range. Cleaning the data is not necessary because the source already provides good data quality.

In [3]:
# load from the provided csv-file ( source: https://www.kaggle.com/iabhishekofficial/mobile-price-classification )
trainingsdata = pd.read_csv("../data/mobile-price-classification/train.csv")

In [4]:
# the training labels are provided by the column price_range
labels = trainingsdata["price_range"]

In [5]:
# drop the label column so that only the training features remain
trainingsdata = trainingsdata.drop("price_range", axis=1)
trainingsdata

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,842,0,2.2,0,1,0,7,0.6,188,2,2,20,756,2549,9,7,19,0,0,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,6,905,1988,2631,17,3,7,1,1,0
2,563,1,0.5,1,2,1,41,0.9,145,5,6,1263,1716,2603,11,2,9,1,1,0
3,615,1,2.5,0,0,0,10,0.8,131,6,9,1216,1786,2769,16,8,11,1,0,0
4,1821,1,1.2,0,13,1,44,0.6,141,2,14,1208,1212,1411,8,2,15,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,794,1,0.5,1,0,1,2,0.8,106,6,14,1222,1890,668,13,4,19,1,1,0
1996,1965,1,2.6,1,0,0,39,0.2,187,4,3,915,1965,2032,11,10,16,1,1,1
1997,1911,0,0.9,1,1,1,36,0.7,108,8,3,868,1632,3057,9,1,5,1,1,0
1998,1512,0,0.9,0,4,1,46,0.1,145,5,5,336,670,869,18,10,19,1,1,1


### load test data

In [6]:
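# load the additional test data; no labels are provided for it, so it cannot be used for validation
testdata = pd.read_csv("../data/mobile-price-classification/test.csv")
# drop returns a new DataFrame, so the result is assigned back before displaying it
testdata = testdata.drop("id", axis=1)
testdata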

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,1043,1,1.8,1,14,0,5,0.1,193,3,16,226,1412,3476,12,7,2,0,1,0
1,841,1,0.5,1,4,1,61,0.8,191,5,12,746,857,3895,6,0,7,1,0,0
2,1807,1,2.8,0,1,0,27,0.9,186,3,4,1270,1366,2396,17,10,10,0,1,1
3,1546,0,0.5,1,18,1,25,0.5,96,8,20,295,1752,3893,10,0,7,1,1,0
4,1434,0,1.4,0,11,1,49,0.5,108,6,18,749,810,1773,15,8,7,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1700,1,1.9,0,0,1,54,0.5,170,7,17,644,913,2121,14,8,15,1,1,0
996,609,0,1.8,1,0,0,13,0.9,186,4,2,1152,1632,1933,8,1,19,0,1,1
997,1185,0,1.4,0,1,1,8,0.5,80,1,12,477,825,1223,5,0,14,1,0,0
998,1533,1,0.5,1,0,0,50,0.4,171,2,12,38,832,2509,15,11,6,0,1,0


### Train the classifier, fit the data
Before creating the actual classifier, a linear model is used to estimate the complexity of the dataset.

In [14]:
# import the base package, the metrics module and the pipeline model
import sklearn
import sklearn.metrics  # imported explicitly because sklearn.metrics.f1_score is used below
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

# decomposition and scaling to preprocess the data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# simple regression model
from sklearn.linear_model import LinearRegression

# classifier models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [12]:
# get the range for the labels
print("min:", min(labels),"max:", max(labels))

min: 0 max: 3


### splitting the dataset
Although a test dataset exists, it comes without labels and therefore cannot be used for validation. Instead, the training dataset is split into 67% training data and 33% data for model validation. To handle the dataset more easily, the indices are replaced by a consecutive sequence.

In [9]:
# prepare the data for training by splitting the labeled data into two sets:
# 67% to train the model and 33% to validate how well it performs

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(trainingsdata, labels, test_size=0.33, random_state=101)

# to handle the data more easily, the index is replaced by a consecutive sequence 0..n-1
X_train = X_train.reset_index(drop=True)

#### Test linear regression as classifier

In [None]:
# first analyse the data with a linear regression
lm = LinearRegression()
lm.fit(X_train,y_train)
print("linear regression score for the training data:", lm.score(X_train,y_train))


In [None]:
print("linear regression score overall:", lm.score(trainingsdata,labels))

In [None]:
print("linear regression score for the testdata:", lm.score(X_test,y_test))

The coefficient of determination (R²) of 0.9132801488185275 indicates that the linear regression model fits the data well. A more complex model is therefore not strictly needed, but one is considered in the following sections anyway.
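Since `LinearRegression.score` reports R² on the numeric label rather than a classification metric, it can help to see how a regression output might be turned into class predictions. The following is a minimal sketch on synthetic data, not on the notebook's dataset; the names `X_demo`, `y_demo`, and `pred_classes` are hypothetical, and rounding plus clipping to the label range 0..3 is only one assumed way to do the conversion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# synthetic stand-in (assumption): one dominant feature drives a 4-level ordinal label
X_demo = rng.uniform(0, 4000, size=(500, 1))
y_demo = np.clip(X_demo[:, 0] // 1000, 0, 3).astype(int)

reg = LinearRegression().fit(X_demo, y_demo)

# score() returns the coefficient of determination R^2 of the numeric fit
r2 = reg.score(X_demo, y_demo)

# turn the continuous output into class labels: round, then clip to the valid range
pred_classes = np.clip(np.rint(reg.predict(X_demo)), 0, 3).astype(int)
acc = accuracy_score(y_demo, pred_classes)

print("R^2:", r2, "accuracy after rounding:", acc)
```

A high R² and a high accuracy usually go together for ordinal labels like these, but they measure different things, so both are worth printing when a regression is used as a classifier.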

## Train a blackbox classification model

As the classification model, a Random Forest Classifier preceded by a Principal Component Analysis (PCA) preprocessing step is used.

In [15]:
# analyse the data with an appropriate model, for instance a random forest classifier
pca = PCA() # preprocess the data
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model = Pipeline([('pca', pca), ('rf', rf)])
model.fit(X_train, y_train)

In [None]:
# get a prediction of the model for all data
pred = model.predict(trainingsdata)
print("Overall score for model: ", sklearn.metrics.f1_score(labels, pred, average='weighted'))

In [None]:
# get a prediction of the model for the training split
pred = model.predict(X_train)
print("Score for training data: ", sklearn.metrics.f1_score(y_train, pred, average='weighted'))

In [None]:
# get a prediction of the model for the held-out validation split
pred = model.predict(X_test)
print("Score for test data: ", sklearn.metrics.f1_score(y_test, pred, average='weighted'))
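The scores above use `average='weighted'`, which averages the per-class F1 scores weighted by how often each class occurs in the true labels. The following sketch recomputes that value by hand on a small made-up label vector and compares it with scikit-learn; `weighted_f1`, `y_true_demo`, and `y_pred_demo` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# made-up labels for illustration (assumption, not taken from the dataset)
y_true_demo = np.array([0, 0, 1, 1, 2, 3])
y_pred_demo = np.array([0, 1, 1, 1, 2, 3])

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with the class support in y_true as weight."""
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == c)           # weight by support
    return total / len(y_true)

manual = weighted_f1(y_true_demo, y_pred_demo)
reference = f1_score(y_true_demo, y_pred_demo, average='weighted')
print(manual, reference)
```

Because the four price classes may not be perfectly balanced after the random split, the weighted average is a reasonable default here; `average='macro'` would treat every class equally instead.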

In [None]:
# print the predicted class probabilities for each training instance
np.set_printoptions(threshold=30)
print(model.predict_proba(X_train).round(3))

## Lime Blackbox explainer

The Interpret LimeTabular explainer expects a pandas DataFrame so that the labels for each feature are set correctly.
Normally it expects a prediction function whose second output component is the value that actually labels the class.
In contrast, the scikit-learn module returns an array that holds, for every class, the probability that the item belongs to this class. This difference can be bridged by creating a (lambda) function that picks the class with the highest probability and returns the label of that class. This is not done here, in order to highlight this difference from aix360.
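The wrapper function mentioned above could look roughly like the following sketch. It only shows how a probability array is reduced to class labels; `label_predict_fn` and `toy_predict_proba` are hypothetical names, and whether a given explainer accepts such a function as `predict_fn` is an assumption that has not been verified here.

```python
import numpy as np

def label_predict_fn(predict_proba, classes=None):
    """Wrap a predict_proba-style function so that it returns class labels.

    `classes` optionally maps column indices to labels (for a scikit-learn
    model this would typically be `model.classes_`); without it the column
    index itself is used as the label.
    """
    def predict_label(X):
        # pick the column (class) with the highest probability per row
        idx = np.argmax(predict_proba(X), axis=1)
        return idx if classes is None else np.asarray(classes)[idx]
    return predict_label

def toy_predict_proba(X):
    # hypothetical stand-in for model.predict_proba with four classes
    return np.array([[0.1, 0.2, 0.6, 0.1],
                     [0.7, 0.1, 0.1, 0.1]])

predict_label = label_predict_fn(toy_predict_proba)
print(predict_label(None))  # → [2 0]
```

With the notebook's pipeline, the equivalent one-liner would be `lambda X: np.argmax(model.predict_proba(X), axis=1)`, which matches the labels 0 to 3 because they coincide with the column order.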

In [2]:
# import the LimeTabular explainer to explain which properties lead to the classification
from interpret.blackbox import LimeTabular
from interpret import show

# provide the prediction function (model.predict_proba) and the training data to the explainer
lime = LimeTabular(predict_fn=model.predict_proba, data=X_train, random_state=1)


# create the explanation in a graph for the first 30 instances of the test dataset
lime_local = lime.explain_local(X_test[:30], y_test[:30], name='LIME')

# key selects which of the explained instances is displayed
show(lime_local, key=20)



## Overall analysis with MorrisSensitivity

Interpret offers a range of global explainability models, which are applied in the following sections.
The results suggest that the mobile price dataset was generated synthetically and does not reflect real mobile phone prices.
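The idea behind a Morris-style sensitivity analysis can be illustrated with a strongly simplified sketch: each feature is perturbed one at a time and the average absolute change of the prediction is recorded. This is not the trajectory-based sampling that a full Morris implementation uses, and the function `elementary_effects`, its parameters, and the toy model are assumptions made purely for illustration.

```python
import numpy as np

def elementary_effects(predict, X, delta=0.1, n_samples=50, seed=0):
    """Simplified one-at-a-time sensitivity: mean absolute elementary effect
    per feature, using a step of `delta` times the feature's value range."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # avoid zero-width ranges
    base = X[rng.integers(0, len(X), size=n_samples)]
    f0 = predict(base)
    mu_star = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        shifted = base.copy()
        shifted[:, j] += delta * span[j]          # move only feature j
        mu_star[j] = np.mean(np.abs(predict(shifted) - f0)) / delta
    return mu_star

# toy model (assumption): the output depends on feature 0 only
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(200, 2))
toy_predict = lambda X: 3.0 * X[:, 0] + 0.0 * X[:, 1]

mu_star = elementary_effects(toy_predict, X_toy)
print(mu_star)
```

A feature that the model ignores ends up with an effect of zero, while influential features get large values, which is the kind of ranking the global sensitivity plot visualizes.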


In [3]:
from interpret.blackbox import MorrisSensitivity

sensitivity = MorrisSensitivity(predict_fn=model.predict_proba, data=X_train)
sensitivity_global = sensitivity.explain_global(name="Global Sensitivity")

show(sensitivity_global)




## Overall analysis with PartialDependence

In [None]:
from interpret.blackbox import PartialDependence

pdp = PartialDependence(predict_fn=model.predict_proba, data=X_train)
pdp_global = pdp.explain_global(name='Partial Dependence')

show(pdp_global)
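What a partial dependence curve measures can be sketched by hand: one feature is fixed to each value of a grid for all rows, and the model's predictions are averaged. The helper `partial_dependence_1d` and the toy model below are hypothetical and only illustrate the definition; they are not how Interpret computes its plots internally.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                 # overwrite the feature for every row
        averages.append(predict(X_mod).mean(axis=0))
    return np.array(averages)

# toy model (assumption): linear in both features
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 2))
toy_predict = lambda X: 2.0 * X[:, 0] + X[:, 1]

grid = np.linspace(0, 1, 5)
pd_curve = partial_dependence_1d(toy_predict, X_toy, feature=0, grid=grid)
print(pd_curve)
```

For a function such as `model.predict_proba`, which returns one column per class, the same helper yields one curve per class because the mean is taken over the rows only.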

## Summary

In [None]:
show([lime_local, sensitivity_global, pdp_global])

## Conclusion

As expected for this simple classification model, the mobile price range can be classified by considering only a few properties, for example the RAM.
For a better classification, the model should be adapted.

## Alternative Approach: Glassbox model with the Explainable Boosting Machine (EBM)

In [None]:
# glassbox model: only the Explainable Boosting Classifier is used in this notebook
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(random_state=1)
ebm.fit(X_train, y_train)   #Works on dataframes and numpy arrays

## Overall analysis of all data

In [None]:
ebm_global = ebm.explain_global(name='EBM')
show(ebm_global)

In [None]:
ebm_local = ebm.explain_local(X_test[:5], y_test[:5], name='EBM')
show(ebm_local)