# Anna Urbala PD5
## Loading the models

In [1]:
import dalex as dx
import pandas as pd
import numpy as np
import pickle

In [13]:
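# Load the pre-trained random forest; the context manager closes the file afterwards
with open("../../../../WB-XAI-Projekt/RF_model", "rb") as model_file:
    rf = pickle.load(model_file)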

In [4]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


# Load and prepare the data
full_data = pd.read_csv("hotel_bookings.csv")
full_data["agent"] = full_data["agent"].astype(str)
# Categories covering less than 0.5% of the rows are merged into "other"
threshold = 0.005 * len(full_data)
agent_counts = full_data["agent"].value_counts()
agents_to_change = agent_counts[agent_counts < threshold].index
full_data.loc[full_data["agent"].isin(agents_to_change), "agent"] = "other"

country_counts = full_data["country"].value_counts()
countries_to_change = country_counts[country_counts < threshold].index
full_data.loc[full_data["country"].isin(countries_to_change), "country"] = "other"


# Features used in the model
num_features = ["lead_time", "arrival_date_week_number",
                "stays_in_weekend_nights", "stays_in_week_nights", 
                "adults", "previous_cancellations",
                "previous_bookings_not_canceled",
                "required_car_parking_spaces", "total_of_special_requests", 
                "adr", "booking_changes"]

cat_features = ["hotel", "market_segment", "country", 
                "reserved_room_type",
                "customer_type", "agent"]

features = num_features + cat_features

# Split into explanatory variables and target
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

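# Impute missing categories, then label-encode each categorical column
categorical_names = {}
for feature in cat_features:
    col = X[[feature]]
    cat_transformer = SimpleImputer(strategy="constant", fill_value="Unknown")
    col = cat_transformer.fit_transform(col)
    X[feature] = col
    le = LabelEncoder()
    # X[[feature]] is a 2D column, which triggers the "column-vector y" warning in the output below
    le.fit(X[[feature]])
    X[[feature]] = le.transform(X[[feature]])
    # Keep the original category names for each encoded column
    categorical_names[feature] = le.classes_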

categorical_names
# Preprocessing
num_transformer = SimpleImputer(strategy="constant")

preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features)],
                                remainder = 'passthrough')

for feature in num_features:
    X[feature] = X[feature].astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2, random_state=42)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().





In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def train_model_pipe(model):
    model_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', model)])
    model_pipe.fit(X_train, y_train)
    return model_pipe

lr = train_model_pipe(LogisticRegression(random_state=42,n_jobs=-1))
dt = train_model_pipe(DecisionTreeClassifier(random_state=42))
xgb = train_model_pipe(XGBClassifier(random_state=42, n_jobs=-1))







In [15]:
explainer_rf = dx.Explainer(rf, X_train, y_train, label = "Random Forest")
explainer_lr = dx.Explainer(lr, X_train, y_train, label = "Logistic Regression")
explainer_dt = dx.Explainer(dt, X_train, y_train, label = "Decision Tree")
explainer_xgb = dx.Explainer(xgb, X_train, y_train, label = "XGBoost")

Preparation of a new explainer is initiated

  -> data              : 95512 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 95512 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Random Forest
  -> predict function  : <function yhat_proba_default at 0x7fb7a4faa9d8> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0, mean = 0.371, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.943, mean = -0.00218, max = 0.957
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 95512 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Serie

## Compute Partial Dependence Profiles (PDP) for selected variables from the dataset

In [16]:
pdp_rf = explainer_rf.model_profile()
pdp_lr = explainer_lr.model_profile()
pdp_dt = explainer_dt.model_profile()
pdp_xgb = explainer_xgb.model_profile()

Calculating ceteris paribus: 100%|██████████| 17/17 [00:03<00:00,  4.26it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 28.01it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 55.17it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 23.35it/s]


In [17]:
pdp_rf.plot([pdp_lr, pdp_dt, pdp_xgb])

Overall, the partial dependence profiles are comparable across all variables for every model except logistic regression (which performs worst and already diverged last time; this follows from the nature of that model). We can therefore treat the other three models as consistent. There are occasional differences (e.g. for `previous_cancellations` the XGBoost prediction is higher than the rest by about 0.2), but the profiles generally have a similar _shape_, so these differences mainly affect model performance.
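
For reference, a partial dependence profile is the average of ceteris paribus profiles: one variable is overwritten with a grid value in every row and the predictions are averaged. Below is a minimal dependency-free sketch of that averaging. The toy model `toy_predict`, the four rows and the grid are hypothetical and stand in for the hotel data and the fitted models.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Return [(grid value, average prediction)] for one feature."""
    profile = []
    for value in grid:
        # Ceteris paribus step: overwrite one feature in every row, keep the rest.
        modified = [{**row, feature: value} for row in rows]
        profile.append((value, mean(predict(r) for r in modified)))
    return profile


# Hypothetical toy model and data, for illustration only.
def toy_predict(row):
    return 0.1 * row["lead_time"] + 0.5 * row["previous_cancellations"]


rows = [
    {"lead_time": 1.0, "previous_cancellations": 0.0},
    {"lead_time": 2.0, "previous_cancellations": 1.0},
    {"lead_time": 3.0, "previous_cancellations": 0.0},
    {"lead_time": 4.0, "previous_cancellations": 1.0},
]

pdp = partial_dependence(toy_predict, rows, "lead_time", [0.0, 5.0, 10.0])
for value, yhat in pdp:
    print(value, round(yhat, 3))
```

Because the toy model is additive, the profile rises by the `lead_time` coefficient per unit regardless of the other variable; with a real model the averaging hides how individual rows differ.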

## Compute Accumulated Local Effects (ALE) for selected variables from the dataset

In [18]:
ale_rf = explainer_rf.model_profile(type = 'accumulated')
ale_lr = explainer_lr.model_profile(type = 'accumulated')
ale_dt = explainer_dt.model_profile(type = 'accumulated')
ale_xgb = explainer_xgb.model_profile(type = 'accumulated')

Calculating ceteris paribus: 100%|██████████| 17/17 [00:03<00:00,  4.27it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  8.03it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 27.59it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  8.10it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 55.51it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  8.18it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 25.54it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  7.73it/s]


In [20]:
ale_rf.plot([ale_lr, ale_dt, ale_xgb])
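
The idea behind an accumulated profile can be sketched without dalex: split the variable into bins, average the prediction change across each bin using only the rows that fall into it, accumulate those local effects, and center the result. The sketch below is a simplified first-order version with hand-picked bin edges. The toy model, the rows and the centering convention (count-weighted mean set to zero) are assumptions, and libraries may bin and shift the curve differently.

```python
def accumulated_local_effects(predict, rows, feature, edges):
    """First-order ALE of a numeric feature on the given bin edges.

    Returns [(upper bin edge, centered ALE value)].
    Assumes at least one row falls inside the range covered by `edges`.
    """
    effects, counts = [], []
    for lower, upper in zip(edges[:-1], edges[1:]):
        # Rows with the feature in (lower, upper]; the first bin also takes `lower`.
        in_bin = [r for r in rows
                  if lower < r[feature] <= upper
                  or (lower == edges[0] and r[feature] == lower)]
        # Local effect: prediction change across the bin, other features fixed.
        diffs = [predict({**r, feature: upper}) - predict({**r, feature: lower})
                 for r in in_bin]
        effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
        counts.append(len(in_bin))

    # Accumulate the local effects from left to right.
    accumulated, total = [], 0.0
    for effect in effects:
        total += effect
        accumulated.append(total)

    # Center so that the count-weighted mean effect is zero.
    shift = sum(a * n for a, n in zip(accumulated, counts)) / sum(counts)
    return [(upper, a - shift) for upper, a in zip(edges[1:], accumulated)]


# Hypothetical toy model with an interaction and correlated features.
def toy_predict(row):
    return row["lead_time"] * row["previous_cancellations"]


rows = [
    {"lead_time": 1.0, "previous_cancellations": 0.0},
    {"lead_time": 2.0, "previous_cancellations": 0.0},
    {"lead_time": 3.0, "previous_cancellations": 1.0},
    {"lead_time": 4.0, "previous_cancellations": 1.0},
]

ale = accumulated_local_effects(toy_predict, rows, "lead_time", [0.0, 2.0, 4.0])
print(ale)
```

In this toy setup the first bin contributes no effect because its rows all have `previous_cancellations` equal to 0, whereas a partial dependence profile would average over all rows and rise steadily. That is the kind of disagreement a PDP versus ALE comparison is meant to reveal.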

The shapes look broadly similar to the PDP profiles, which supports the hypothesis that our three models are consistent. Still, let us compare PDP and ALE for our main model.

In [21]:
ale_rf.result['_label_'] = "ALE"
pdp_rf.result['_label_'] = "PDP"

In [22]:
ale_rf.plot(pdp_rf)

The curves are parallel and lie very close to each other (the largest differences in prediction are for `country`, about 0.15, and `lead_time`, about 0.2). Overall there is no cause for concern: our profiles are correct and should give valid summaries.
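
The "parallel and close" reading can also be checked numerically rather than by eye. The sketch below compares two profiles on a shared grid and reports the largest absolute gap and the spread of the gap; a spread of zero means the curves differ only by a vertical shift. The helper `profile_gap` and the example values are hypothetical and are not taken from the notebook's output.

```python
def profile_gap(curve_a, curve_b):
    """Compare two profiles given as lists of (x, yhat) on the same grid.

    Returns (largest absolute gap, spread of the gap). A spread of 0 means
    the curves differ only by a vertical shift, i.e. they are parallel.
    """
    if [x for x, _ in curve_a] != [x for x, _ in curve_b]:
        raise ValueError("profiles must share the same grid")
    gaps = [ya - yb for (_, ya), (_, yb) in zip(curve_a, curve_b)]
    return max(abs(g) for g in gaps), max(gaps) - min(gaps)


# Hypothetical values, for illustration only.
pdp_curve = [(0, 0.30), (50, 0.40), (100, 0.55)]
ale_curve = [(0, 0.25), (50, 0.35), (100, 0.45)]

largest_gap, gap_spread = profile_gap(pdp_curve, ale_curve)
print(round(largest_gap, 3), round(gap_spread, 3))
```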


##### Note
With our model, calling `pdp.plot(geom="profiles")` turned out to be a very bad idea. Rendering took about 10 minutes, after which we only got images with a gray background, so I decided not to include the output :(.
![plot](plot.png)
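
One possible way around the slow `geom="profiles"` rendering described in the note is to draw only a small random subset of individual profiles. The sketch below assumes that the cost grows with the number of drawn lines, which was not verified here. The record layout (`id`, `x`, `yhat`) and the helper `sample_profiles` are hypothetical and are not part of dalex's API.

```python
import random


def sample_profiles(records, k, seed=42):
    """Keep the records of at most k randomly chosen observation ids."""
    ids = sorted({rec["id"] for rec in records})
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = set(rng.sample(ids, min(k, len(ids))))
    return [rec for rec in records if rec["id"] in kept]


# Hypothetical long-format profiles: 100 observations, 5 grid points each.
records = [{"id": i, "x": x, "yhat": 0.01 * i + 0.1 * x}
           for i in range(100) for x in range(5)]

subset = sample_profiles(records, k=10)
print(len(subset), "records kept out of", len(records))
```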