### Advanced Machine Learning Methods

#### Lab 8

The *dalex* package

GitHub
https://github.com/ModelOriented/DALEX

Documentation
https://dalex.drwhy.ai/python/api/

Usage example
https://dalex.drwhy.ai/python-dalex-titanic.html

In [1]:
import dalex as dx

### Task 1
---
Using the house price data, prepare:

- at least 3 different models (models with different structures, for example linear regression, a decision tree, and a model ensemble)

- using the `dalex` package, permutation variable importance, and PDP curves, examine variable importance and assess which variables are key to house valuation

- using the findings from the previous point, build a model with a limited number of variables; does the "smaller" model still achieve good predictive quality despite the substantial reduction in variables?

- test local methods on selected properties (ideally ones where the model is most wrong and ones where the prediction error is small)

Data: https://github.com/mini-pw/2023Z-DataVisualizationTechniques/tree/main/homeworks/hw1

### Data preparation
----

In [2]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mini-pw/2023Z-DataVisualizationTechniques/main/homeworks/hw1/house_data.csv")

In [3]:
df.head()

|   | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [4]:
drop_columns = ['id', 'date']
categorical_columns = ['waterfront']
numerical_columns = ['bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'zipcode']

In [5]:
df = df.drop(drop_columns, axis=1)

In [6]:
y = df.price
X = df.drop(["price"], axis = 1)

### Model preparation
----

In [7]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.model_selection import train_test_split

# preprocessing of numerical variables: fill missing values with the column mean
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy = 'mean'))
])

# preprocessing of categorical variables: one-hot encoding
categorical_transformer = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

preprocessor = ColumnTransformer([
    ('numerical', numerical_transformer, numerical_columns),
    ('categorical', categorical_transformer, categorical_columns)
])

# no random_state is set, so the split (and the row positions used later) differs between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# fit the preprocessing on the training part only, then apply it to the test part
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# restore the column names; the new frames get a fresh 0..n-1 index
X_train = pd.DataFrame(X_train, columns=preprocessor.get_feature_names_out())
X_test = pd.DataFrame(X_test, columns=preprocessor.get_feature_names_out())



In [8]:
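# linear regression model
lm = LinearRegression()
lm.fit(X_train, y_train)

# decision tree model
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

# extra-tree model
# note: ExtraTreeRegressor is a single randomized tree;
# the ensemble variant is sklearn.ensemble.ExtraTreesRegressor
xt = ExtraTreeRegressor()
xt.fit(X_train, y_train)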
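
The task asks for a model ensemble, while `ExtraTreeRegressor` used above is a single randomized tree. The sketch below compares it with `sklearn.ensemble.ExtraTreesRegressor` on synthetic data. The data, the `_demo` names, and `n_estimators=100` are assumptions made for illustration only, not part of the lab solution.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the lab dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A single randomized tree versus an ensemble of 100 such trees.
single_tree = ExtraTreeRegressor(random_state=0).fit(X_tr, y_tr)
ensemble = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# score() returns R^2 on the held-out part.
r2_single = single_tree.score(X_te, y_te)
r2_ensemble = ensemble.score(X_te, y_te)
print("R^2 single tree:", round(r2_single, 3))
print("R^2 ensemble:   ", round(r2_ensemble, 3))
```

An ensemble fitted this way can be passed to `dx.Explainer` in the same manner as the models above.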

### Preparing the explainers
----

In [9]:
# explainer objects

explainer_lm = dx.Explainer(lm, X_train, y_train, label = "Linear Regression")
explainer_dt = dx.Explainer(dt, X_train, y_train, label = "Decision Tree")
explainer_xt = dx.Explainer(xt, X_train, y_train, label = "Extra trees")

Preparation of a new explainer is initiated

  -> data              : 17290 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 17290 values
  -> model_class       : sklearn.linear_model._base.LinearRegression (default)
  -> label             : Linear Regression
  -> predict function  : <function yhat_default at 0x00000161BE7D51C0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = -5.21e+05, mean = 5.4e+05, max = 3.38e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.33e+06, mean = 1.03e-07, max = 4.32e+06
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 17290 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Co

### Model quality
----

In [10]:
from sklearn.metrics import mean_squared_error

# note: these errors are computed on the training data
print("MSE Linear Regression: ", np.round(mean_squared_error(y_train, lm.predict(X_train)), 4))
print("MSE Decision Tree: ", np.round(mean_squared_error(y_train, dt.predict(X_train)), 4))
print("MSE Extra Trees: ", np.round(mean_squared_error(y_train, xt.predict(X_train)), 4))

MSE Linear Regression:  41179774932.6451
MSE Decision Tree:  88423473.8991
MSE Extra Trees:  88423473.8991
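
The MSE values above are computed on the training data, where fully grown trees can fit almost every observation. A minimal sketch of the same comparison with a held-out set, on synthetic data, is given below. The data and the `_demo`, `_tr`, `_te` names are assumptions; in the notebook the analogous call would use `X_test` and `y_test`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noise (assumption, not the lab dataset).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-2, 2, size=(400, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# An unconstrained tree memorizes the training part.
tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

mse_train = mean_squared_error(y_tr, tree.predict(X_tr))
mse_test = mean_squared_error(y_te, tree.predict(X_te))
rmse_test = np.sqrt(mse_test)  # same units as the target

print("MSE train:", round(mse_train, 4))
print("MSE test: ", round(mse_test, 4))
print("RMSE test:", round(rmse_test, 4))
```

The gap between the two errors is what the training-set figures alone cannot show.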


### Permutation variable importance
----

In [11]:
fi_lm = explainer_lm.model_parts()
fi_dt = explainer_dt.model_parts()
fi_xt = explainer_xt.model_parts()


In [12]:
fi_lm

|   | variable | dropout_loss | label |
|---|---|---|---|
| 0 | _full_model_ | 198734.481226 | Linear Regression |
| 1 | numerical__sqft_lot | 198735.545419 | Linear Regression |
| 2 | numerical__floors | 198830.569494 | Linear Regression |
| 3 | numerical__yr_renovated | 199000.234735 | Linear Regression |
| 4 | numerical__sqft_living15 | 199105.762002 | Linear Regression |
| 5 | numerical__sqft_lot15 | 199559.883555 | Linear Regression |
| 6 | numerical__sqft_basement | 199640.681955 | Linear Regression |
| 7 | categorical__waterfront_1 | 200085.825001 | Linear Regression |
| 8 | categorical__waterfront_0 | 200191.634359 | Linear Regression |
| 9 | numerical__condition | 200389.571343 | Linear Regression |


In [13]:
fi_xt.plot([fi_dt, fi_lm])
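
To make the idea behind the `dropout_loss` column concrete, here is a from-scratch sketch of permutation importance using only NumPy. It is not the `dalex` implementation: the `predict` function, the synthetic data, the RMSE loss, and the number of repeats are all assumptions chosen for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return the baseline loss and the mean loss after permuting each column."""
    rng = np.random.default_rng(seed)
    baseline = rmse(y, predict(X))
    losses = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            losses[j] += rmse(y, predict(X_perm))
    return baseline, losses / n_repeats

# Hypothetical model: strong use of column 0, weak use of column 1, none of column 2.
def predict(X):
    return 5.0 * X[:, 0] + 0.5 * X[:, 1]

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = predict(X_demo) + rng.normal(scale=0.1, size=1000)

baseline, dropout_loss = permutation_importance(predict, X_demo, y_demo)
print("full model loss:", round(baseline, 4))
print("loss after permuting each column:", np.round(dropout_loss, 4))
```

A variable the model does not use leaves the loss at the baseline, which is why rows close to `_full_model_` in the table above indicate unimportant variables.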

### Model with fewer variables
---

In [14]:
X_train_small = X_train[['numerical__grade', 'numerical__lat', 'numerical__sqft_living', 'numerical__long']]

In [15]:
# linear regression model
lm_small = LinearRegression()
lm_small.fit(X_train_small, y_train)

# decision tree model
dt_small = DecisionTreeRegressor()
dt_small.fit(X_train_small, y_train)

# extra-tree model
xt_small = ExtraTreeRegressor()
xt_small.fit(X_train_small, y_train)

In [16]:
from sklearn.metrics import mean_squared_error

# note: these errors are computed on the training data
print("MSE Linear Regression: ", np.round(mean_squared_error(y_train, lm_small.predict(X_train_small)), 4))
print("MSE Decision Tree: ", np.round(mean_squared_error(y_train, dt_small.predict(X_train_small)), 4))
print("MSE Extra Trees: ", np.round(mean_squared_error(y_train, xt_small.predict(X_train_small)), 4))

MSE Linear Regression:  53688198180.8415
MSE Decision Tree:  93936346.8442
MSE Extra Trees:  94338775.994


### Partial dependence profiles (PDP)

In [17]:
pdp_lm = explainer_lm.model_profile(type = 'partial', label="LR")
pdp_dt = explainer_dt.model_profile(type = 'partial', label="DT")
pdp_xt = explainer_xt.model_profile(type="partial", label = "XT")

Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 67.39it/s]
Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 70.86it/s]
Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 68.27it/s]


In [18]:
pdp_xt.plot([pdp_dt, pdp_lm])
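
A partial dependence profile averages the model's predictions over the data while one variable is forced to each value on a grid. The NumPy sketch below shows that computation directly. The `predict` function, the data, and the grid are hypothetical, and `dalex` adds details (sampling of observations, grid selection) that are left out here.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction over the data with one column set to each grid value."""
    profile = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # every row gets the same value of this feature
        profile.append(predict(X_mod).mean())
    return np.array(profile)

# Hypothetical model: linear in column 0, quadratic in column 1.
def predict(X):
    return 2.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(500, 2))
grid = np.linspace(-1, 1, 5)

pdp_0 = partial_dependence(predict, X_demo, feature=0, grid=grid)
print("grid:", grid)
print("PDP for column 0:", np.round(pdp_0, 3))
```

For the linear term the profile is a straight line whose slope equals the coefficient, which is the pattern expected for the linear regression curves in the plot above.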

### Analysis of a single observation
----


In [20]:
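# roznica: difference between the prediction and the true price
roznica = xt.predict(X_train) - y_train
# positions of the observations with the largest absolute error
np.where(np.max(np.abs(roznica)) == np.abs(roznica))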

(array([4622, 7658], dtype=int64),)
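
The task asks for observations with both the largest and the smallest prediction error. A small sketch of how both positions could be picked is shown below. The `y_true` and `y_pred` arrays are made-up numbers, not values from the housing data.

```python
import numpy as np

# Made-up true values and predictions (assumption, for illustration only).
y_true = np.array([300.0, 450.0, 520.0, 610.0, 700.0])
y_pred = np.array([310.0, 400.0, 520.5, 900.0, 690.0])

abs_error = np.abs(y_pred - y_true)

# argmax/argmin return the first position in case of ties.
worst = int(np.argmax(abs_error))
best = int(np.argmin(abs_error))

# flatnonzero returns every position that ties for the largest error.
all_worst = np.flatnonzero(abs_error == abs_error.max())

print("largest error at position", worst, "with error", abs_error[worst])
print("smallest error at position", best, "with error", abs_error[best])
```

The resulting positions can be used with `.iloc`, as in the cell that follows.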

In [22]:
X_train.iloc[4622,:]

numerical__bedrooms              2.0000
numerical__bathrooms             1.0000
numerical__sqft_living        1080.0000
numerical__sqft_lot           4000.0000
numerical__floors                1.0000
numerical__view                  0.0000
numerical__condition             3.0000
numerical__grade                 7.0000
numerical__sqft_above         1080.0000
numerical__sqft_basement         0.0000
numerical__yr_built           1940.0000
numerical__yr_renovated          0.0000
numerical__lat                  47.6902
numerical__long               -122.3870
numerical__sqft_living15      1530.0000
numerical__sqft_lot15         4240.0000
numerical__zipcode           98117.0000
categorical__waterfront_0        1.0000
categorical__waterfront_1        0.0000
Name: 4622, dtype: float64

### Break Down plot

In [23]:
bd_xt = explainer_xt.predict_parts(X_train.iloc[4622,:], type = "break_down")

In [24]:
bd_xt

|   | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept |  | intercept | 539989.561828 | 539989.561828 | 1.0 | 20 | Extra trees |
| 1 | numerical__lat | 47.69 | numerical__lat = 47.69 | 644712.150578 | 104722.588751 | 1.0 | 19 | Extra trees |
| 2 | numerical__yr_built | 1940.0 | numerical__yr_built = 1940.0 | 702187.068652 | 57474.918074 | 1.0 | 18 | Extra trees |
| 3 | numerical__long | -122.4 | numerical__long = -122.4 | 732101.236466 | 29914.167814 | 1.0 | 17 | Extra trees |
| 4 | numerical__bedrooms | 2.0 | numerical__bedrooms = 2.0 | 744648.053181 | 12546.816715 | 1.0 | 16 | Extra trees |
| 5 | numerical__floors | 1.0 | numerical__floors = 1.0 | 732682.32166 | -11965.731521 | -1.0 | 15 | Extra trees |
| 6 | numerical__zipcode | 98120.0 | numerical__zipcode = 98120.0 | 712056.905581 | -20625.416079 | -1.0 | 14 | Extra trees |
| 7 | categorical__waterfront_0 | 1.0 | categorical__waterfront_0 = 1.0 | 710981.601938 | -1075.303644 | -1.0 | 13 | Extra trees |
| 8 | numerical__yr_renovated | 0.0 | numerical__yr_renovated = 0.0 | 709102.581579 | -1879.020359 | -1.0 | 12 | Extra trees |
| 9 | numerical__sqft_lot | 4000.0 | numerical__sqft_lot = 4000.0 | 707024.447802 | -2078.133777 | -1.0 | 11 | Extra trees |


In [25]:
bd_xt.plot()
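
The `cumulative` and `contribution` columns can be understood through a simplified version of the Break Down idea: start from the mean prediction, then fix the variables one by one to the values of the explained observation and record how the mean prediction changes. The sketch below is a simplification, not the `dalex` algorithm: the variable order is supplied by the caller, and the `predict` function, data, and `x_obs` are hypothetical.

```python
import numpy as np

def break_down(predict, X, x_obs, order):
    """Sequentially fix columns to x_obs and track changes of the mean prediction."""
    X_mod = X.copy()
    intercept = float(predict(X_mod).mean())  # mean prediction over the data
    contributions = {}
    previous = intercept
    for j in order:
        X_mod[:, j] = x_obs[j]  # fix column j for every row
        current = float(predict(X_mod).mean())
        contributions[j] = current - previous
        previous = current
    return intercept, contributions

# Hypothetical model with an interaction between columns 0 and 2.
def predict(X):
    return 10.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 0] * X[:, 2]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
x_obs = np.array([1.0, 2.0, -1.0])

intercept, contributions = break_down(predict, X_demo, x_obs, order=[0, 1, 2])
total = intercept + sum(contributions.values())

print("intercept:", round(intercept, 3))
print("contributions:", {j: round(c, 3) for j, c in contributions.items()})
print("sum:", round(total, 3), "prediction:", predict(x_obs[None, :])[0])
```

The intercept plus all contributions adds up to the model's prediction for the observation. With interactions in the model, the individual contributions depend on the chosen order.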

### Ceteris Paribus plot

In [26]:
cp_lm = explainer_lm.predict_profile(X_train.iloc[4622,:])
cp_dt = explainer_dt.predict_profile(X_train.iloc[4622,:])
cp_xt = explainer_xt.predict_profile(X_train.iloc[4622,:])

Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 111.12it/s]
Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 137.24it/s]
Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 157.29it/s]


In [29]:
cp_xt.plot([cp_dt, cp_lm], variables=['numerical__sqft_living', 'numerical__grade', 'numerical__lat', 'numerical__long'])
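
A Ceteris Paribus profile shows how the prediction for one observation changes when a single variable is varied and all others stay fixed. A minimal NumPy sketch of that computation follows. The `predict` function, the two-variable observation, and the grid are hypothetical and only loosely modelled on `sqft_living` and `grade`.

```python
import numpy as np

def ceteris_paribus(predict, x_obs, feature, grid):
    """Predictions for copies of x_obs that differ only in one feature."""
    X_grid = np.tile(x_obs, (len(grid), 1))  # one copy of the observation per grid value
    X_grid[:, feature] = grid
    return predict(X_grid)

# Hypothetical model: linear in living area, grade matters only above 7.
def predict(X):
    return 100.0 * X[:, 0] + 50.0 * np.maximum(X[:, 1] - 7.0, 0.0)

x_obs = np.array([1080.0, 7.0])  # hypothetical living area and grade
grid = np.array([5.0, 7.0, 9.0, 11.0])

cp_grade = ceteris_paribus(predict, x_obs, feature=1, grid=grid)
print("grade grid:", grid)
print("predictions:", cp_grade)
```

Averaging such profiles over many observations gives the partial dependence profile computed earlier in the notebook.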