Dataset: https://www.kaggle.com/jessemostipak/hotel-booking-demand
# Building and training the model

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer

In [3]:
%%capture
full_data = pd.read_csv("hotel_bookings.csv")
num_features = ["lead_time","arrival_date_week_number","arrival_date_day_of_month",
                "stays_in_weekend_nights","stays_in_week_nights","adults","children",
                "babies","is_repeated_guest", "previous_cancellations",
                "previous_bookings_not_canceled",
                "required_car_parking_spaces", "total_of_special_requests", "adr"]

cat_features = ["hotel","arrival_date_month","meal","market_segment",
                "distribution_channel","reserved_room_type","deposit_type","customer_type", "agent","company"]

full_data["agent"] = full_data["agent"].astype(str)
full_data["company"] = full_data["company"].astype(str)

# Group agents that account for less than 0.5% of all bookings into "other"
threshold = 0.005 * len(full_data)
agent_counts = full_data["agent"].value_counts()
agents_to_change = agent_counts[agent_counts < threshold].index
full_data.loc[full_data["agent"].isin(agents_to_change), "agent"] = "other"

# Group companies with fewer than 50 bookings into "other"
threshold_2 = 50
company_counts = full_data["company"].value_counts()
companies_to_change = company_counts[company_counts < threshold_2].index
full_data.loc[full_data["company"].isin(companies_to_change), "company"] = "other"

# Separate features and predicted value
features = num_features + cat_features
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

# Preprocessing for numerical features:
# for most numeric columns, except the dates, 0 is the most logical fill value,
# and no dates are missing here.
num_transformer = SimpleImputer(strategy="constant", fill_value=0)

# Preprocessing for categorical features:
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("onehot", OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical features:
preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features),
                                               ("cat", cat_transformer, cat_features),
                                               ("log", FunctionTransformer(np.log1p), ["lead_time"])],
                                remainder = 'passthrough')
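
The grouping of rare `agent` and `company` values in the cell above can be sketched as a small reusable helper. This is only an illustration on made-up data: the name `collapse_rare` and its parameters are hypothetical, not part of the original notebook. Note that `astype(str)` turns missing values into the literal string `"nan"`, so they are treated as an ordinary category.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: float, other_label: str = "other") -> pd.Series:
    """Replace values that occur fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare_values = counts[counts < min_count].index
    # Series.where keeps a value where the condition is True and replaces it elsewhere
    return series.where(~series.isin(rare_values), other_label)


# Toy stand-in for the `agent` column after astype(str); the values are made up
agents = pd.Series(["9", "9", "9", "240", "240", "7", "nan"])
collapsed = collapse_rare(agents, min_count=2)
print(collapsed.tolist())
```

Returning a new Series instead of assigning through `.loc` leaves the input untouched, which makes the step easier to reuse for both columns.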

In [4]:
%%capture
rf_model_enh = RandomForestClassifier(n_estimators=160,
                               max_features=0.4,
                               min_samples_split=2,
                               n_jobs=-1,
                               random_state=0, 
                               verbose=3)

model_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', rf_model_enh)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2, random_state=42)

model_pipe.fit(X_train, y_train)

In [5]:
from sklearn.metrics import accuracy_score
y_predict = model_pipe.predict(X_test)
accuracy_score(y_test, y_predict)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 112 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 160 out of 160 | elapsed:    0.2s finished


0.8711784906608594

# Prediction for a selected observation

In [6]:
selected_X = X_train.iloc[[2137]]
selected_y = y_train.iloc[2137]
predicted_y = model_pipe.predict(selected_X)
print("True value:", selected_y)
print("Predicted value:", predicted_y)
selected_X.head()

True value: 0
Predicted value: [0]


[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 112 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 160 out of 160 | elapsed:    0.0s finished


| | lead_time | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | ... | hotel | arrival_date_month | meal | market_segment | distribution_channel | reserved_room_type | deposit_type | customer_type | agent | company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101424 | 92 | 46 | 11 | 0 | 1 | 2 | 0.0 | 0 | 0 | 0 | ... | City Hotel | November | BB | Offline TA/TO | TA/TO | A | No Deposit | Transient-Party | other | |


For the selected observation (drawn from the training set), the model correctly predicted that the booking would not be canceled.

# Model decomposition

In [7]:
%%capture
import dalex as dx
explainer = dx.Explainer(model_pipe, X_train, y_train)

In [10]:
%%capture
pp_bd = explainer.predict_parts(selected_X, type="break_down")
pp_shap = explainer.predict_parts(selected_X, type="shap")

### Break down

In [24]:
pp_bd.plot()

### Shapley

In [25]:
pp_shap.plot()

Observations:
- Break Down alone already takes quite a while on this model, and Shapley takes long enough to cook and eat dinner, so unfortunately I will not compute Shapley values again below
- Shapley brought the influential variables to the top, while Break Down clearly did not: the model has many variables, the plot lists a few with minimal impact, and the remaining ones together shift the mean response by 0.338, which is a huge amount when predicting a binary decision variable
- both methods agree that the number of special requests increases the probability of cancellation and that the month decreases it; they disagree about the arrival week, but that is expected, since it depends on, among other things, the order of the variables
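
The order dependence mentioned in the last point can be reproduced on a toy problem. The sketch below is a minimal illustration of the break-down idea (fix variables one at a time and record the change in the mean prediction); it is not dalex's implementation, and the model, the background data, and the helper name `break_down` are all made up for this example.

```python
import numpy as np


def break_down(predict, background, x, order):
    """Fix variables to the values in `x` one by one and record each change in mean prediction."""
    data = background.copy()
    previous = predict(data).mean()  # intercept: mean prediction over the background data
    contributions = {"intercept": previous}
    for j in order:
        data[:, j] = x[j]  # fix variable j in every background row
        current = predict(data).mean()
        contributions[j] = current - previous
        previous = current
    return contributions


# Hypothetical model with a pure interaction between the two variables
def predict(data):
    return data[:, 0] * data[:, 1]


background = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
x = np.array([1.0, 1.0])

first = break_down(predict, background, x, order=[0, 1])
second = break_down(predict, background, x, order=[1, 0])
print(first)
print(second)
```

In both orderings the contributions add up to the prediction for `x`, but whichever variable is fixed last receives the larger share, because the interaction only shows up once both values are known.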

# Two observations with different important variables

In [26]:
pp_bd

| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.371326 | 0.371326 | 1.0 | 25 | RandomForestClassifier |
| 1 | total_of_special_requests | 0.0 | total_of_special_requests = 0.0 | 0.464128 | 0.092802 | 1.0 | 24 | RandomForestClassifier |
| 2 | lead_time | 92.0 | lead_time = 92.0 | 0.500135 | 0.036007 | 1.0 | 23 | RandomForestClassifier |
| 3 | required_car_parking_spaces | 0.0 | required_car_parking_spaces = 0.0 | 0.5286 | 0.028465 | 1.0 | 22 | RandomForestClassifier |
| 4 | adults | 2.0 | adults = 2.0 | 0.534909 | 0.00631 | 1.0 | 21 | RandomForestClassifier |
| 5 | arrival_date_week_number | 46.0 | arrival_date_week_number = 46.0 | 0.542082 | 0.007172 | 1.0 | 20 | RandomForestClassifier |
| 6 | previous_bookings_not_canceled | 0.0 | previous_bookings_not_canceled = 0.0 | 0.54462 | 0.002538 | 1.0 | 19 | RandomForestClassifier |
| 7 | arrival_date_day_of_month | 11.0 | arrival_date_day_of_month = 11.0 | 0.550724 | 0.006104 | 1.0 | 18 | RandomForestClassifier |
| 8 | company | | company = nan | 0.55762 | 0.006897 | 1.0 | 17 | RandomForestClassifier |
| 9 | stays_in_week_nights | 1.0 | stays_in_week_nights = 1.0 | 0.571256 | 0.013636 | 1.0 | 16 | RandomForestClassifier |


For the previously selected observation, the most important variables are `customer_type` and `market_segment`.
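
One way to read "most important" off such a table is to sort the rows by absolute contribution. The sketch below uses only a few rows copied from the table printed above, so the resulting ranking is illustrative and need not match the ranking over the full table. With dalex, the full table should be available as `pp_bd.result`; treat that attribute name as an assumption to verify against your dalex version.

```python
import pandas as pd

# A few rows copied from the break-down table shown above (not the full table)
bd = pd.DataFrame({
    "variable_name": ["intercept", "total_of_special_requests", "lead_time",
                      "required_car_parking_spaces", "adults", "stays_in_week_nights"],
    "contribution": [0.371326, 0.092802, 0.036007, 0.028465, 0.00631, 0.013636],
})

ranked = (
    bd[bd["variable_name"] != "intercept"]  # the intercept is not a variable effect
    .assign(abs_contribution=lambda df: df["contribution"].abs())
    .sort_values("abs_contribution", ascending=False)
    .reset_index(drop=True)
)
print(ranked[["variable_name", "contribution"]])
```

Sorting by the absolute value keeps strongly negative contributions near the top as well, which a plain descending sort would push to the bottom.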

In [30]:
%%capture
selected_X2 = X_train.iloc[[420]]
pp_bd2 = explainer.predict_parts(selected_X2, type="break_down")

In [31]:
pp_bd2

| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.371326 | 0.371326 | 1.0 | 25 | RandomForestClassifier |
| 1 | market_segment | Online TA | market_segment = Online TA | 0.388508 | 0.017182 | 1.0 | 24 | RandomForestClassifier |
| 2 | lead_time | 29.0 | lead_time = 29.0 | 0.401357 | 0.01285 | 1.0 | 23 | RandomForestClassifier |
| 3 | required_car_parking_spaces | 0.0 | required_car_parking_spaces = 0.0 | 0.421798 | 0.020441 | 1.0 | 22 | RandomForestClassifier |
| 4 | customer_type | Transient | customer_type = Transient | 0.451421 | 0.029623 | 1.0 | 21 | RandomForestClassifier |
| 5 | arrival_date_week_number | 16.0 | arrival_date_week_number = 16.0 | 0.440556 | -0.010865 | -1.0 | 20 | RandomForestClassifier |
| 6 | arrival_date_month | April | arrival_date_month = April | 0.433321 | -0.007236 | -1.0 | 19 | RandomForestClassifier |
| 7 | previous_bookings_not_canceled | 0.0 | previous_bookings_not_canceled = 0.0 | 0.435017 | 0.001697 | 1.0 | 18 | RandomForestClassifier |
| 8 | company | | company = nan | 0.435088 | 7.1e-05 | 1.0 | 17 | RandomForestClassifier |
| 9 | is_repeated_guest | 0.0 | is_repeated_guest = 0.0 | 0.435558 | 0.00047 | 1.0 | 16 | RandomForestClassifier |


For the next example, on the other hand, the most important variables are `total_of_special_requests` and `deposit_type`. This may, however, be an effect of the variable order. Let's check by fixing the order of the variables.

In [36]:
%%capture
order = list(X_train.columns)
pp_bd_order = explainer.predict_parts(selected_X, type="break_down", order=order)
pp_bd2_order = explainer.predict_parts(selected_X2, type="break_down", order=order)

In [37]:
pp_bd_order

| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.371326 | 0.371326 | 1.0 | 25 | RandomForestClassifier |
| 1 | lead_time | 92.0 | lead_time = 92.0 | 0.403416 | 0.03209 | 1.0 | 24 | RandomForestClassifier |
| 2 | arrival_date_week_number | 46.0 | arrival_date_week_number = 46.0 | 0.411143 | 0.007727 | 1.0 | 23 | RandomForestClassifier |
| 3 | arrival_date_day_of_month | 11.0 | arrival_date_day_of_month = 11.0 | 0.415732 | 0.004589 | 1.0 | 22 | RandomForestClassifier |
| 4 | stays_in_weekend_nights | 0.0 | stays_in_weekend_nights = 0.0 | 0.417132 | 0.001401 | 1.0 | 21 | RandomForestClassifier |
| 5 | stays_in_week_nights | 1.0 | stays_in_week_nights = 1.0 | 0.427392 | 0.01026 | 1.0 | 20 | RandomForestClassifier |
| 6 | adults | 2.0 | adults = 2.0 | 0.429569 | 0.002177 | 1.0 | 19 | RandomForestClassifier |
| 7 | children | 0.0 | children = 0.0 | 0.427179 | -0.002391 | -1.0 | 18 | RandomForestClassifier |
| 8 | babies | 0.0 | babies = 0.0 | 0.427064 | -0.000115 | -1.0 | 17 | RandomForestClassifier |
| 9 | is_repeated_guest | 0.0 | is_repeated_guest = 0.0 | 0.427762 | 0.000698 | 1.0 | 16 | RandomForestClassifier |


In [40]:
pp_bd2_order

| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.371326 | 0.371326 | 1.0 | 25 | RandomForestClassifier |
| 1 | lead_time | 29.0 | lead_time = 29.0 | 0.386412 | 0.015086 | 1.0 | 24 | RandomForestClassifier |
| 2 | arrival_date_week_number | 16.0 | arrival_date_week_number = 16.0 | 0.378245 | -0.008168 | -1.0 | 23 | RandomForestClassifier |
| 3 | arrival_date_day_of_month | 14.0 | arrival_date_day_of_month = 14.0 | 0.373797 | -0.004448 | -1.0 | 22 | RandomForestClassifier |
| 4 | stays_in_weekend_nights | 0.0 | stays_in_weekend_nights = 0.0 | 0.378853 | 0.005056 | 1.0 | 21 | RandomForestClassifier |
| 5 | stays_in_week_nights | 3.0 | stays_in_week_nights = 3.0 | 0.371367 | -0.007486 | -1.0 | 20 | RandomForestClassifier |
| 6 | adults | 1.0 | adults = 1.0 | 0.371561 | 0.000195 | 1.0 | 19 | RandomForestClassifier |
| 7 | children | 0.0 | children = 0.0 | 0.369501 | -0.002061 | -1.0 | 18 | RandomForestClassifier |
| 8 | babies | 0.0 | babies = 0.0 | 0.369407 | -9.3e-05 | -1.0 | 17 | RandomForestClassifier |
| 9 | is_repeated_guest | 0.0 | is_repeated_guest = 0.0 | 0.370148 | 0.00074 | 1.0 | 16 | RandomForestClassifier |


With the fixed order, the most important variables for the first observation are `customer_type` and `total_of_special_requests` (`market_segment` still plays a fairly significant role, but less than before).
For the second observation they are `agent` and `deposit_type` (`total_of_special_requests` comes third, so it still plays a significant role).

So we found 2 observations whose important variables differ even after fixing the column order.

# A different effect for the same variables
The observations above already show that a missing `company` has a different effect in each case (the numbers are so tiny, though, that we will look for a better example).

In [42]:
%%capture
selected_X3 = X_train.iloc[[13]]
pp_bd3_order = explainer.predict_parts(selected_X3, type="break_down", order=order)

In [43]:
pp_bd3_order

| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.371326 | 0.371326 | 1.0 | 25 | RandomForestClassifier |
| 1 | lead_time | 41.0 | lead_time = 41.0 | 0.389585 | 0.018259 | 1.0 | 24 | RandomForestClassifier |
| 2 | arrival_date_week_number | 33.0 | arrival_date_week_number = 33.0 | 0.391454 | 0.001869 | 1.0 | 23 | RandomForestClassifier |
| 3 | arrival_date_day_of_month | 9.0 | arrival_date_day_of_month = 9.0 | 0.391112 | -0.000342 | -1.0 | 22 | RandomForestClassifier |
| 4 | stays_in_weekend_nights | 2.0 | stays_in_weekend_nights = 2.0 | 0.38867 | -0.002443 | -1.0 | 21 | RandomForestClassifier |
| 5 | stays_in_week_nights | 1.0 | stays_in_week_nights = 1.0 | 0.388553 | -0.000117 | -1.0 | 20 | RandomForestClassifier |
| 6 | adults | 2.0 | adults = 2.0 | 0.391214 | 0.002661 | 1.0 | 19 | RandomForestClassifier |
| 7 | children | 0.0 | children = 0.0 | 0.388851 | -0.002363 | -1.0 | 18 | RandomForestClassifier |
| 8 | babies | 0.0 | babies = 0.0 | 0.388756 | -9.5e-05 | -1.0 | 17 | RandomForestClassifier |
| 9 | is_repeated_guest | 0.0 | is_repeated_guest = 0.0 | 0.389561 | 0.000804 | 1.0 | 16 | RandomForestClassifier |


*Comparison of the second and third observations, i.e. observations number 420 and 13*

This time we found 2 shared variables that have a different impact despite having the same values. One is again the empty `company`. The other is `market_segment=Online TA`, which has a more noticeable effect on the final value.
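
This kind of comparison can also be done programmatically by joining two break-down tables on the `variable` column, which encodes both the name and the value. The sketch below uses only rows visible in the fixed-order tables for observations 420 and 13 printed earlier, so it does not cover `company` or `market_segment`; the frame names `bd_420` and `bd_13` are made up for this example.

```python
import pandas as pd

# Rows copied from the fixed-order tables for observations 420 and 13
bd_420 = pd.DataFrame({
    "variable": ["lead_time = 29.0", "adults = 1.0", "children = 0.0",
                 "babies = 0.0", "is_repeated_guest = 0.0"],
    "contribution": [0.015086, 0.000195, -0.002061, -9.3e-05, 0.00074],
})
bd_13 = pd.DataFrame({
    "variable": ["lead_time = 41.0", "adults = 2.0", "children = 0.0",
                 "babies = 0.0", "is_repeated_guest = 0.0"],
    "contribution": [0.018259, 0.002661, -0.002363, -9.5e-05, 0.000804],
})

# An inner join on "name = value" keeps only variables that take the same
# value in both observations.
shared = bd_420.merge(bd_13, on="variable", suffixes=("_420", "_13"))
shared["difference"] = shared["contribution_13"] - shared["contribution_420"]
print(shared)
```

Every shared row has a non-zero difference, which is the same effect as described above, only on a much smaller scale for these particular variables.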

### General conclusions
- the order of variables in Break Down matters
- the same value of a given variable can affect the prediction differently
- Shapley takes a long time to compute, but at least it brings the important variables to the top
- the tables are quite useful when analyzing large models
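
The first and third conclusions are two sides of the same thing: Shapley values remove the order dependence by averaging the per-ordering contributions over many orderings, which is also why they are slow. The sketch below enumerates all orderings for a made-up three-variable model. It illustrates the idea only and is not dalex's implementation, which, as far as I know, samples a limited number of random orderings instead of enumerating them.

```python
import itertools
import math

import numpy as np


def order_contributions(predict, background, x, order):
    """Break-down style contributions of each variable for one fixed ordering."""
    data = background.copy()
    previous = predict(data).mean()
    result = np.zeros(len(x))
    for j in order:
        data[:, j] = x[j]  # fix variable j in every background row
        current = predict(data).mean()
        result[j] = current - previous
        previous = current
    return result


def shapley_by_enumeration(predict, background, x):
    """Average the per-ordering contributions over every permutation of the variables."""
    orders = list(itertools.permutations(range(len(x))))
    total = sum(order_contributions(predict, background, x, order) for order in orders)
    return total / len(orders)


# Hypothetical model: an interaction between x0 and x1 plus an additive x2 term
def predict(data):
    return data[:, 0] * data[:, 1] + 2.0 * data[:, 2]


background = np.array(list(itertools.product([0.0, 1.0], repeat=3)))
x = np.array([1.0, 1.0, 1.0])

values = shapley_by_enumeration(predict, background, x)
print(values)

# Number of orderings for the 24 features used in this notebook (14 numeric + 10 categorical)
print(math.factorial(24))
```

The two interacting variables end up with equal shares regardless of which one would have come first, while the additive variable gets the same contribution in every ordering. Exact enumeration is only feasible for a handful of variables, which is why practical implementations approximate the average.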