<img src=images/gdd-logo.png align=right width=300px style='padding:20px'>


# Hackathon: Laptop Prices
Welcome to the hackathon! You'll get the opportunity to try out your chosen explainability technique(s) on a laptop price prediction dataset.


### Outline
1. [Problem Introduction](#intro)
1. [About the data](#data)
1. [Creating the model](#model) 
1. [Assignment](#assignment)

<a id = 'intro'></a>

## Problem Introduction

You are about to be moved into a brand new team and everyone will need to buy a new laptop this time next year. Everyone has submitted the specifications they'd like their laptop to have (weight, RAM, memory, GPU, manufacturer, etc.), and you want to estimate the cost of these new laptops.

You have data on a collection of laptops along with their prices. Your model should predict the price of a laptop from the information you have.

Since you want to keep costs down, you also want to be able to interpret your model. Knowing which specifications change a laptop's price the most tells you what to suggest people compromise on to reduce costs.

<img src="images/laptop.jpeg" style="display: block;margin-left: auto;margin-right: auto;height: 200px"/>

<a id = 'data'></a>

## About the Data 

The features in the dataset are described below:

|Column|Type|Description|
|---|---|---|
| company | String | Laptop Manufacturer |
| product | String | Brand and Model |
| type_name | String | Type (Notebook, Ultrabook, Gaming, etc.) |
| inches | Numeric | Screen Size |
| screen_resolution | String | Screen Resolution |
| screen_resolution_width | Numeric | Screen Resolution, width only |
| screen_resolution_height | Numeric | Screen Resolution, height only |
| cpu | String | Central Processing Unit (CPU) |
| ram | Numeric | Laptop RAM in GB |
| memory_disk | Numeric | Hard Disk Memory in GB |
| memory_ssd | String | Storage type (e.g. SSD, Hybrid, Storage) |
| gpu | String | Graphics Processing Unit (GPU) |
| op_sys | String | Operating System |
| weight | Numeric | Laptop Weight in kilograms |
| price | Numeric | Price (Euro) |

In [1]:
import pandas as pd

laptops = pd.read_csv('data/laptops.csv', encoding = "ISO-8859-1")
laptops.head()

|Unnamed: 0.1|Unnamed: 0|laptop_id|company|product|type_name|inches|screen_resolution|screen_resolution_width|screen_resolution_height|cpu|ram|memory_disk|memory_ssd|gpu|op_sys|weight|price|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|1|Apple|MacBook Pro|Ultrabook|13.3|2560x1600|2560|1600|Intel Core|8|128|SSD|Intel|macOS|1.37|1339.69|
|1|1|2|Apple|Macbook Air|Ultrabook|13.3|1440x900|1440|900|Intel Core|8|128|Storage|Intel|macOS|1.34|898.94|
|2|2|3|HP|250 G6|Notebook|15.6|1920x1080|1920|1080|Intel Core|8|256|SSD|Intel|No OS|1.86|575.0|
|3|3|4|Apple|MacBook Pro|Ultrabook|15.4|2880x1800|2880|1800|Intel Core|16|512|SSD|AMD|macOS|1.83|2537.45|
|4|4|5|Apple|MacBook Pro|Ultrabook|13.3|2560x1600|2560|1600|Intel Core|8|256|SSD|Intel|macOS|1.37|1803.6|


<a id = 'model'></a>

## Creating the model

Split the data into `X` and `y`, where `X` is the feature matrix and `y` is the target (`price`).

Exclude `company`, `product` and `screen_resolution` from the feature matrix because of their large number of unique values.
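
One way to spot such high-cardinality columns is to count the distinct values per column. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the real laptops data) purely to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "company": ["Apple", "HP", "Dell", "Lenovo", "Asus", "Acer"],
    "product": ["MacBook Pro", "250 G6", "XPS 13", "ThinkPad", "ZenBook", "Aspire 3"],
    "type_name": ["Ultrabook", "Notebook", "Ultrabook", "Notebook", "Ultrabook", "Notebook"],
})

# Number of distinct values per column, highest first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

On the real data, the same call on `laptops` would presumably show which columns yield an impractical number of one-hot columns.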

In [28]:
features = ['laptop_id', 'type_name', 'inches',
       'screen_resolution_width', 'screen_resolution_height', 
        'cpu', 'ram', 'memory_disk', 'memory_ssd', 'gpu', 'op_sys', 'weight']

X = laptops.loc[:, features]
y = laptops.loc[:, 'price']

Check the shape of `X` and `y`. 

In [30]:
X.shape, y.shape

((1302, 12), (1302,))

Perform the train-test split on the data to create `X_train`, `X_test`, `y_train` and `y_test`.

Use a `random_state` to ensure the split is the same each time it is run.
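
As a minimal sketch of why `random_state` matters, the example below splits a small made-up array twice with the same seed and checks that both splits are identical. The names `X_demo` and `y_demo` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, random_state=111)
split_b = train_test_split(X_demo, y_demo, random_state=111)

# Each call returns [X_train, X_test, y_train, y_test]; compare them pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Train/test rows:", split_a[0].shape[0], split_a[1].shape[0])
```

Without a fixed `random_state`, the rows are shuffled differently on each run, so model scores and explanations would not be reproducible. By default, a quarter of the rows is held out for testing.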

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=111)

print('Shape of X_train and y_train', X_train.shape, y_train.shape)
print('Shape of X_test and y_test', X_test.shape, y_test.shape)

It seems as though we have some categorical variables. 

In [24]:
categorical_columns = X.select_dtypes('object').columns
print(categorical_columns)

Index(['type_name', 'cpu', 'memory_ssd', 'gpu', 'op_sys'], dtype='object')


Since there are categorical columns, we will need to encode them. None are ordinal, so we will use `OneHotEncoder`.
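
To illustrate what one-hot encoding does, here is a small sketch on a made-up `op_sys` column (the DataFrame `toy` is hypothetical). It calls `.toarray()` on the encoder's default sparse output so that it does not depend on the name of the dense-output parameter, which differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with three distinct operating systems.
toy = pd.DataFrame({"op_sys": ["macOS", "No OS", "Windows 10", "macOS"]})

# drop="first" removes one level per feature to avoid perfectly collinear columns.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(toy).toarray()  # the default output is sparse

print(encoder.categories_[0])  # levels found in the column
print(encoded)                 # one column per remaining level
```

Three levels become two indicator columns. The dropped level is represented by a row of zeros.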

Now check to see if there is any missing data.

In [11]:
X.isnull().sum()

laptop_id                   0
type_name                   0
inches                      0
screen_resolution_width     0
screen_resolution_height    0
cpu                         0
ram                         0
memory_disk                 0
memory_ssd                  0
gpu                         0
op_sys                      0
weight                      0
dtype: int64

There are no missing values. 

Preprocessing needed:

- Since we have categorical features that have no ranking, we will need to use `OneHotEncoder()`
- Since we are building a `LinearRegression` model, we will want to scale the data so that the coefficients can be compared.
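
The effect of scaling on coefficient comparability can be sketched on synthetic data. Everything below (feature names, the price formula, the noise level) is made up for illustration and is not derived from the laptops dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales.
ram_gb = rng.choice([4, 8, 16, 32], size=200).astype(float)
width_px = rng.choice([1366, 1920, 2560, 3840], size=200).astype(float)

# Made-up price formula with some noise.
price = 40 * ram_gb + 0.2 * width_px + rng.normal(0, 10, size=200)

X_raw = np.column_stack([ram_gb, width_px])

# Coefficients on the raw features are expressed per GB and per pixel.
raw_coefs = LinearRegression().fit(X_raw, price).coef_

# After standardising, each coefficient is expressed per standard deviation.
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_coefs = LinearRegression().fit(X_scaled, price).coef_

print("raw:   ", raw_coefs)
print("scaled:", scaled_coefs)
```

The raw coefficients differ by orders of magnitude mainly because of the units. After standardising, each coefficient equals the raw coefficient multiplied by the feature's standard deviation, so the two can be compared directly.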

Now we need to build a column transformer so that only the categorical columns are encoded.

- Import the `ColumnTransformer` from `sklearn.compose`
- Instantiate the `ColumnTransformer()` with the `OneHotEncoder` on the categorical columns
- Use the parameter `remainder='passthrough'` for the rest
- Use `.fit_transform()` with the column transformer on the `X_train` data and save the result as `X_train_encoded`

In [14]:
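from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns (dropping the first level of each)
# and pass the remaining numeric columns through unchanged.
# Note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`.
column_transformer = ColumnTransformer(
    [
        ('onehot', OneHotEncoder(drop='first', sparse=False), categorical_columns)
    ], remainder='passthrough'
)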

X_train_encoded = column_transformer.fit_transform(X_train)
X_train_encoded.shape

(976, 24)

Now let's try out a scaler. **Only do this on the encoded data, since categorical features cannot be scaled directly!**

Choose one of the scalers below and import it from `sklearn.preprocessing`:

- `StandardScaler`
- `RobustScaler`
- `MinMaxScaler`

Instantiate your scaler (e.g. `scaler = RobustScaler()`) and try it out by running:

```python
pd.DataFrame(scaler.fit_transform(X_train_encoded), columns=column_transformer.get_feature_names())
```
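
To get a feel for how the three scalers differ before choosing one, the sketch below applies each of them to a single made-up column containing one outlier. The `weights` array is hypothetical and only meant to illustrate the behaviour.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical laptop weights in kg, with one extreme outlier.
weights = np.array([[1.2], [1.3], [1.4], [1.5], [1.6], [9.0]])

results = {}
for scaler_cls in (StandardScaler, RobustScaler, MinMaxScaler):
    scaled = scaler_cls().fit_transform(weights).ravel()
    results[scaler_cls.__name__] = scaled
    print(f"{scaler_cls.__name__:>14}: {np.round(scaled, 2)}")
```

`StandardScaler` centres on the mean, `RobustScaler` centres on the median and scales by the interquartile range, and `MinMaxScaler` maps the column onto the range 0 to 1. The outlier affects each of them differently, which can matter when you compare coefficients later.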

In [15]:
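from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Scale the encoded training data and label the columns with the transformer's feature names.
# Note: on scikit-learn >= 1.0, `get_feature_names_out()` replaces `get_feature_names()`.
pd.DataFrame(scaler.fit_transform(X_train_encoded), columns = column_transformer.get_feature_names())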

| |onehot__x0_Gaming|onehot__x0_Netbook|onehot__x0_Notebook|onehot__x0_Ultrabook|onehot__x0_Workstation|onehot__x1_Intel Core|onehot__x1_Intel Other|onehot__x2_Hybrid|onehot__x2_SSD|onehot__x2_Storage|...|onehot__x4_Windows 10|onehot__x4_Windows 7|onehot__x4_macOS|laptop_id|inches|screen_resolution_width|screen_resolution_height|ram|memory_disk|weight|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|-0.449411|-0.133142|-1.124500|-0.410929|6.585107|0.424464|-0.341621|-0.090909|1.037591|-0.244372|...|-2.177424|5.818689|-0.133142|1.352103|0.385114|0.050518|0.029779|-0.098553|-0.539908|0.790295|
|1|-0.449411|-0.133142|0.889284|-0.410929|-0.151858|0.424464|-0.341621|-0.090909|-0.963771|-0.244372|...|-2.177424|-0.171860|-0.133142|1.431424|0.385114|-1.066847|-1.062757|-0.844196|0.125709|0.056728|
|2|-0.449411|-0.133142|0.889284|-0.410929|-0.151858|0.424464|-0.341621|-0.090909|1.037591|-0.244372|...|0.459258|-0.171860|-0.133142|-1.566898|0.385114|0.050518|0.029779|-0.844196|-0.889084|-0.227716|
|3|-0.449411|-0.133142|0.889284|-0.410929|-0.151858|0.424464|-0.341621|-0.090909|-0.963771|-0.244372|...|0.459258|-0.171860|-0.133142|0.445204|0.385114|0.050518|0.029779|-0.844196|1.489678|0.505850|
|4|-0.449411|-0.133142|0.889284|-0.410929|-0.151858|0.424464|-0.341621|-0.090909|1.037591|-0.244372|...|0.459258|-0.171860|-0.133142|0.968720|0.385114|-1.066847|-1.062757|-0.844196|-0.889084|0.176494|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|971|-0.449411|-0.133142|0.889284|-0.410929|-0.151858|0.424464|-0.341621|-0.090909|-0.963771|-0.244372|...|0.459258|-0.171860|-0.133142|-1.397680|1.580140|0.050518|0.029779|-0.844196|1.489678|-0.092980|
|972|-0.449411|-0.133142|0.889284|-0.410929|-0.151858|-2.355915|2.927222|-0.090909|1.037591|-0.244372|...|0.459258|-0.171860|-0.133142|0.104125|0.385114|-1.066847|-1.062757|-0.844196|-0.889084|-0.302570|
|973|-0.449411|-0.133142|-1.124500|2.433513|-0.151858|0.424464|-0.341621|-0.090909|1.037591|-0.244372|...|0.459258|-0.171860|-0.133142|-1.484933|0.385114|0.050518|0.029779|-0.098553|-0.539908|-0.347483|
|974|2.225134|-0.133142|-1.124500|-0.410929|-0.151858|0.424464|-0.341621|-0.090909|-0.963771|-0.244372|...|0.459258|-0.171860|-0.133142|0.217818|1.580140|0.050518|0.029779|-0.098553|-0.889084|0.954973|


Now that we have a scaler chosen, we're ready to build a pipeline.

- Import `Pipeline` from `sklearn.pipeline` and `LinearRegression` from `sklearn.linear_model`.
- Instantiate the model with no parameters.
- Instantiate the pipeline with the column transformer, the scaler and the model as its three steps.
- Fit the pipeline to `X_train` and `y_train`.

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

model = LinearRegression()

pipeline = Pipeline(steps = [
    ('onehot', column_transformer),
    ('scaler', StandardScaler()),
    ('model', model)
])
pipeline.fit(X_train, y_train)
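
Once a pipeline like this is fitted, it is worth checking how well it predicts unseen data before explaining it. The sketch below rebuilds the same three-step structure on synthetic data. All names (`toy`, `toy_price`, `toy_pipeline`) and the price formula are hypothetical, and `sparse_threshold=0` is used here only to force a dense array regardless of scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data: one categorical and one numeric feature.
toy = pd.DataFrame({
    "type_name": rng.choice(["Notebook", "Ultrabook", "Gaming"], size=n),
    "ram": rng.choice([4, 8, 16], size=n),
})
premium = toy["type_name"].map({"Notebook": 0, "Ultrabook": 300, "Gaming": 500})
toy_price = 300 + 50 * toy["ram"] + premium + rng.normal(0, 20, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_price, random_state=111)

toy_pipeline = Pipeline(steps=[
    ("onehot", ColumnTransformer(
        [("onehot", OneHotEncoder(drop="first"), ["type_name"])],
        remainder="passthrough",
        sparse_threshold=0,  # always return a dense array so StandardScaler can centre it
    )),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
toy_pipeline.fit(X_tr, y_tr)

# Evaluate on the held-out rows.
r2 = toy_pipeline.score(X_te, y_te)
mae = mean_absolute_error(y_te, toy_pipeline.predict(X_te))
print(f"R^2 on test data: {r2:.3f}")
print(f"Mean absolute error on test data: {mae:.1f}")
```

The same two calls, `pipeline.score(X_test, y_test)` and `mean_absolute_error`, should apply to the laptops pipeline. Explanations of a model that predicts poorly are of limited use.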

<a id = 'assignment'></a>

# <mark>Assignment</mark>

### Theory Questions

1. Read the problem description. Which type of explainability method do you imagine would be most suitable for this problem: 
    - Local (explains one single prediction) or global (explains model behaviour)? 
    - Feature importance or feature sensitivity? 


2. Are there any inherently interpretable models that spring to mind that can help you address the need for explainability for this problem? The model implemented is Linear Regression. Was that a good choice?


3. What model-agnostic techniques would be appropriate to address the need for explainability for this problem?


#### Bonus 
4. Some explainability methods are less useful when features are highly correlated. Is that applicable to this dataset, and if so, what can you do to discover what features impact the laptop pricing the most?
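
A quick way to check for highly correlated features is a pairwise correlation matrix. The sketch below uses made-up data in which the screen height is derived from the width. Whether the real dataset behaves the same way is an assumption you would need to verify on `X_train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: height is a fixed fraction of width, ram is independent.
width = rng.choice([1366, 1920, 2560, 3840], size=300)
toy = pd.DataFrame({
    "screen_resolution_width": width,
    "screen_resolution_height": width * 9 // 16,
    "ram": rng.choice([4, 8, 16, 32], size=300),
})

corr = toy.corr()

# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack().loc[lambda s: s.abs() > 0.9]
print(strong_pairs)
```

Pairs flagged this way are candidates for dropping one feature or for being interpreted together rather than separately.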

### Do-it-yourself
The explainability techniques covered in the workshop were: 
* Ceteris Paribus (local sensitivity)
* Prediction Break-Down (local feature importance)
* Permutation Feature Importance (global feature importance)
* Partial Dependence Plots (global sensitivity)

1. Implement the technique that you deem most appropriate for this problem. Consider both the problem statement, as well as the advantages and disadvantages of each method. Refer back to the [slides](https://github.com/marysia/explainability-workshop/blob/master/presentation.pdf) if necessary. 

2. Create your own datapoint with a combination of laptop specifications. Use the pipeline to predict the price. 

3. Now imagine you want to cut the cost by €100. What change would you need to make to the laptop specifications to get that result?

4. Can you find out which laptop specifications, in general, contribute most to high price predictions?
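
As a starting point for the local questions above, a Ceteris Paribus profile can be computed by hand: copy one datapoint, vary a single feature over a grid, and predict each variant. The sketch below does this on synthetic data. The helper `ceteris_paribus`, the toy model and the price formula are all hypothetical and not part of the workshop material.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical toy data and model.
toy = pd.DataFrame({
    "ram": rng.choice([4, 8, 16, 32], size=200),
    "inches": rng.choice([13.3, 14.0, 15.6, 17.3], size=200),
})
toy_price = 200 + 45 * toy["ram"] + 20 * toy["inches"] + rng.normal(0, 15, size=200)
toy_model = LinearRegression().fit(toy, toy_price)

# A self-made datapoint and its predicted price.
my_laptop = pd.DataFrame({"ram": [16], "inches": [15.6]})
baseline = toy_model.predict(my_laptop)[0]


def ceteris_paribus(model, datapoint, feature, grid):
    """Predict for copies of `datapoint` in which only `feature` is varied."""
    variants = pd.concat([datapoint] * len(grid), ignore_index=True)
    variants[feature] = grid
    return pd.Series(model.predict(variants), index=grid, name="predicted_price")


profile = ceteris_paribus(toy_model, my_laptop, "ram", [4, 8, 16, 32])

# Positive values are savings relative to the original specification.
savings = baseline - profile
print(savings[savings >= 100])
```

Because the helper only relies on `.predict()`, a fitted pipeline could be passed in place of `toy_model`, with a datapoint that has the same columns as the training features.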




#### Bonus challenges: 
* Also try out the other explainability techniques and see if you can discover something interesting.
* Try out other models as well, and compare them.
* Extract the feature importance from the Linear Regression model using the coefficients. Does this match with the result of permutation feature importance? 
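
For the last bonus challenge, the comparison could be set up roughly as follows. This is a sketch on synthetic data with hypothetical names (`toy`, `toy_pipeline`). It uses `permutation_importance` from `sklearn.inspection` and, for brevity, scores on the training data. On the real problem a held-out set would be the better choice.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: `inches` has no effect on the made-up price.
toy = pd.DataFrame({
    "ram": rng.choice([4, 8, 16, 32], size=300),
    "weight": rng.uniform(1.0, 3.0, size=300),
    "inches": rng.choice([13.3, 15.6, 17.3], size=300),
})
toy_price = 200 + 45 * toy["ram"] + 100 * toy["weight"] + rng.normal(0, 20, size=300)

toy_pipeline = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
toy_pipeline.fit(toy, toy_price)

# Importance from coefficients: absolute value of each standardised coefficient.
coef_importance = pd.Series(
    np.abs(toy_pipeline.named_steps["model"].coef_), index=toy.columns
)

# Importance from permutation: mean drop in R^2 when a column is shuffled.
perm = permutation_importance(toy_pipeline, toy, toy_price, n_repeats=10, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=toy.columns)

comparison = pd.DataFrame({"abs_coef": coef_importance, "permutation": perm_importance})
print(comparison.sort_values("permutation", ascending=False))
```

The two columns are on different scales, so compare the ranking of the features rather than the raw numbers. With one-hot encoded features, the coefficients refer to individual levels while permutation importance refers to the original column, which is worth keeping in mind when the rankings disagree.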
