# Homework 2

## Konstanty Kraszewski

Testing the `Random Forest` and `XGBoost` models on sample regression and classification tasks.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor

In [2]:
aps_train = pd.read_csv('../data/airline_passenger_satisfaction/train.csv', index_col=0)
aps_test = pd.read_csv('../data/airline_passenger_satisfaction/test.csv', index_col=0)
cpp = pd.read_csv('../data/car_prices_poland/Car_Prices_Poland.csv', index_col=0)

## Airline passenger satisfaction

In [3]:
aps_train.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [4]:
# Binary categorical columns become 0/1 indicators; 'Class' gets two indicator
# columns, with the remaining class acting as the reference level.
for d in (aps_train, aps_test):
    d['Gender'] = (d['Gender'] == 'Male') * 1
    d['Customer Type'] = (d['Customer Type'] == 'Loyal Customer') * 1
    d['Type of Travel'] = (d['Type of Travel'] == 'Personal Travel') * 1
    d['satisfaction'] = (d['satisfaction'] == 'satisfied') * 1
    d['Class_Eco_Plus'] = (d['Class'] == 'Eco Plus') * 1
    d['Class_Business'] = (d['Class'] == 'Business') * 1
    # Fill missing arrival delays with the mean of the same data set.
    d.loc[d['Arrival Delay in Minutes'].isna(), 'Arrival Delay in Minutes'] = d['Arrival Delay in Minutes'].mean()
    # 'id' is only an identifier; 'Class' is now covered by the indicator columns.
    d.drop(['id', 'Class'], axis=1, inplace=True)

In [5]:
# Hold out 25% of the training file for validation; the separate test file is used only for the final evaluation.
X_train_val = aps_train.drop(columns='satisfaction')
y_train_val = aps_train['satisfaction']
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=123)
X_test = aps_test.drop(columns='satisfaction')
y_test = aps_test['satisfaction']

### RandomForestClassifier

In [6]:
# best_parameters[0]: best (n_estimators, max_depth) by accuracy; best_parameters[1]: best by RMSE.
best_parameters = [[0, 0], [0, 0]]
best_accuracy = 0
best_rmse = float('inf')
for n_estimators in [5, 25, 50]:
    for max_depth in [2, 5, 8]:
        rfc = RandomForestClassifier(random_state=123, n_estimators=n_estimators, max_depth=max_depth)
        rfc.fit(X_train, y_train)
        y_pred = rfc.predict(X_val)  # predict once, score twice
        print(f"------------ (n_estimators={n_estimators:2d}, max_depth={max_depth:2d})")
        accuracy = accuracy_score(y_val, y_pred)
        print(f"Accuracy: {accuracy:.2f}")
        # Square root of the MSE; avoids the `squared` argument, which is deprecated in newer scikit-learn releases.
        rmse = mean_squared_error(y_val, y_pred) ** 0.5
        print(f"RMSE:     {rmse:.2f}\n")
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_parameters[0] = [n_estimators, max_depth]
        if rmse < best_rmse:
            best_rmse = rmse
            best_parameters[1] = [n_estimators, max_depth]

------------ (n_estimators= 5, max_depth= 2)
Accuracy: 0.84
RMSE:     0.40

------------ (n_estimators= 5, max_depth= 5)
Accuracy: 0.90
RMSE:     0.31

------------ (n_estimators= 5, max_depth= 8)
Accuracy: 0.94
RMSE:     0.25

------------ (n_estimators=25, max_depth= 2)
Accuracy: 0.88
RMSE:     0.35

------------ (n_estimators=25, max_depth= 5)
Accuracy: 0.92
RMSE:     0.28

------------ (n_estimators=25, max_depth= 8)
Accuracy: 0.94
RMSE:     0.25

------------ (n_estimators=50, max_depth= 2)
Accuracy: 0.87
RMSE:     0.35

------------ (n_estimators=50, max_depth= 5)
Accuracy: 0.92
RMSE:     0.28

------------ (n_estimators=50, max_depth= 8)
Accuracy: 0.94
RMSE:     0.25



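On this grid, a larger maximum depth improves validation accuracy for every forest size (for example from 0.84 to 0.94 with 5 estimators). Adding estimators helps mainly for shallow trees (0.84 to 0.88 at depth 2 when going from 5 to 25 trees); at depth 8 the accuracy is 0.94 for every forest size tested. These observations hold only for the sampled parameter values.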
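
The nested loops above follow a common pattern: evaluate every parameter combination and keep the one with the best validation score. A minimal sketch of that pattern with `itertools.product` is shown below. The names `grid_search` and `toy_score` are hypothetical and not part of the notebook; `toy_score` is a stand-in for "fit on the training set, score on the validation set".

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score), maximizing evaluate(**params)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        # Strict comparison: on ties the first combination found is kept.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(n_estimators, max_depth):
    # Hypothetical scoring function, used only to make the sketch runnable.
    return max_depth * 10 + n_estimators / 100


grid = {"n_estimators": [5, 25, 50], "max_depth": [2, 5, 8]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Starting from `float("-inf")` rather than `0` keeps the selection correct even for metrics that can be negative.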

In [7]:
print(f"Best parameters (accuracy={best_accuracy:.2f}): n_estimators={best_parameters[0][0]}, max_depth={best_parameters[0][1]}.")
print(f"Best parameters (RMSE={best_rmse:.2f}):     n_estimators={best_parameters[1][0]}, max_depth={best_parameters[1][1]}.")

Best parameters (accuracy=0.94): n_estimators=50, max_depth=8.
Best parameters (RMSE=0.25):     n_estimators=50, max_depth=8.


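Both metrics select the same parameters. This is expected: the predictions are hard 0/1 labels, so every squared error is either 0 or 1 and $\mathrm{RMSE} = \sqrt{1 - \mathrm{accuracy}}$, which means the two metrics always rank the models identically.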
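
For binary 0/1 labels and hard 0/1 predictions, the RMSE is fully determined by the accuracy, so the two metrics cannot disagree about which model is best. The sketch below checks this numerically in plain Python; `label_accuracy` and `label_rmse` are hypothetical helper names introduced only for this illustration.

```python
import math
import random


def label_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_rmse(y_true, y_pred):
    """Root mean squared error between two sequences of numeric labels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


random.seed(0)
y_true = [random.randint(0, 1) for _ in range(1000)]
y_pred = [random.randint(0, 1) for _ in range(1000)]

acc = label_accuracy(y_true, y_pred)
err = label_rmse(y_true, y_pred)
# Each squared error is 1 for a misclassification and 0 otherwise,
# so the mean squared error equals the error rate 1 - accuracy.
print(math.isclose(err, math.sqrt(1 - acc)))
```

This agrees with the reported figures: an accuracy of 0.94 corresponds to an RMSE of about 0.25.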

In [8]:
# Refit with the parameters selected on the validation set and evaluate once on the test set.
rfc = RandomForestClassifier(random_state=123, n_estimators=50, max_depth=8)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print("Final model results on the test set:")
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f},")
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"RMSE:     {rmse:.2f}.")

Final model results on the test set:
accuracy: 0.94,
RMSE:     0.25.


### XGBClassifier

In [9]:
# Early stopping on the validation error rate. In xgboost 1.6 and later,
# eval_metric and early_stopping_rounds are constructor arguments rather than fit() arguments.
xgbc = XGBClassifier(objective="binary:logistic", random_state=123, eval_metric="error", early_stopping_rounds=15)
xgbc.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
y_pred = xgbc.predict(X_test)
print("Final model results on the test set:")
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f},")
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"RMSE:     {rmse:.2f}.")

Final model results on the test set:
accuracy: 0.96,
RMSE:     0.19.


## Car prices Poland

In [10]:
cpp.head()

Unnamed: 0,mark,model,generation_name,year,mileage,vol_engine,fuel,city,province,price
0,opel,combo,gen-d-2011,2015,139568,1248,Diesel,Janki,Mazowieckie,35900
1,opel,combo,gen-d-2011,2018,31991,1499,Diesel,Katowice,Śląskie,78501
2,opel,combo,gen-d-2011,2015,278437,1598,Diesel,Brzeg,Opolskie,27000
3,opel,combo,gen-d-2011,2016,47600,1248,Diesel,Korfantów,Opolskie,30800
4,opel,combo,gen-d-2011,2014,103000,1400,CNG,Tarnowskie Góry,Śląskie,35900


In [11]:
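# Integer-code the high-cardinality 'mark' and 'province' columns; tree-based models can split on such codes.
cpp['mark'] = LabelEncoder().fit_transform(cpp['mark'])
cpp['province'] = LabelEncoder().fit_transform(cpp['province'])
# One-hot encode 'fuel', dropping 'fuel_Electric' as the reference level.
encoded = pd.get_dummies(cpp[["fuel"]].astype(str))
encoded = encoded.drop("fuel_Electric", axis=1)
cpp = pd.concat([cpp, encoded], axis=1)
# 'model', 'generation_name' and 'city' are left out of the feature set; 'fuel' is replaced by its dummies.
cpp.drop(['model', 'generation_name', 'fuel', 'city'], axis=1, inplace=True)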

In [12]:
# 60% train, 20% validation, 20% test: the second split takes 25% of the remaining 80%.
X = cpp.drop(columns='price')
y = cpp['price']
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=123)

### RandomForestRegressor

In [13]:
# best_parameters[0]: best (n_estimators, max_depth) by R^2; best_parameters[1]: best by RMSE.
best_parameters = [[0, 0], [0, 0]]
best_r2 = float('-inf')  # R^2 can be negative, so 0 is not a safe starting value
best_rmse = float('inf')
for n_estimators in [5, 25, 50]:
    for max_depth in [2, 5, 8]:
        rfr = RandomForestRegressor(random_state=123, n_estimators=n_estimators, max_depth=max_depth)
        rfr.fit(X_train, y_train)
        y_pred = rfr.predict(X_val)  # predict once, score twice
        print(f"------------ (n_estimators={n_estimators:2d}, max_depth={max_depth:2d})")
        r2 = r2_score(y_val, y_pred)
        print(f"R^2:   {r2:.2f}")
        # Square root of the MSE; avoids the `squared` argument, which is deprecated in newer scikit-learn releases.
        rmse = mean_squared_error(y_val, y_pred) ** 0.5
        print(f"RMSE:  {rmse:.2f}\n")
        if r2 > best_r2:
            best_r2 = r2
            best_parameters[0] = [n_estimators, max_depth]
        if rmse < best_rmse:
            best_rmse = rmse
            best_parameters[1] = [n_estimators, max_depth]

------------ (n_estimators= 5, max_depth= 2)
R^2:   0.60
RMSE:  53829.80

------------ (n_estimators= 5, max_depth= 5)
R^2:   0.80
RMSE:  37624.55

------------ (n_estimators= 5, max_depth= 8)
R^2:   0.84
RMSE:  33693.23

------------ (n_estimators=25, max_depth= 2)
R^2:   0.59
RMSE:  54459.37

------------ (n_estimators=25, max_depth= 5)
R^2:   0.80
RMSE:  37877.44

------------ (n_estimators=25, max_depth= 8)
R^2:   0.85
RMSE:  33383.63

------------ (n_estimators=50, max_depth= 2)
R^2:   0.59
RMSE:  54459.46

------------ (n_estimators=50, max_depth= 5)
R^2:   0.80
RMSE:  37728.68

------------ (n_estimators=50, max_depth= 8)
R^2:   0.85
RMSE:  33122.26



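In this case the maximum depth has the dominant effect: $R^2$ rises from about 0.60 at depth 2 to 0.80 at depth 5 and to 0.84-0.85 at depth 8. Increasing the number of estimators changes little: at depth 2 it does not help at all (0.60 to 0.59), and at depth 8 it lowers the RMSE only slightly (from 33693 to 33122).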

In [14]:
print(f"Best parameters (R^2={best_r2:.2f}):     n_estimators={best_parameters[0][0]}, max_depth={best_parameters[0][1]}.")
print(f"Best parameters (RMSE={best_rmse:.2f}):  n_estimators={best_parameters[1][0]}, max_depth={best_parameters[1][1]}.")

Best parameters (R^2=0.85):     n_estimators=50, max_depth=8.
Best parameters (RMSE=33122.26):  n_estimators=50, max_depth=8.


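Here too, both metrics select the same parameters. On a fixed validation set $R^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$, so a lower RMSE always means a higher $R^2$.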
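
The link between the two regression metrics can be checked on a small made-up example. The sketch below uses plain Python; the helper names `rmse_of` and `r2_of` and the toy values are hypothetical and are not taken from the car price data.

```python
import math


def rmse_of(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_of(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets and two competing sets of predictions (made-up values).
y_true = [10.0, 20.0, 30.0, 40.0]
pred_a = [12.0, 18.0, 33.0, 39.0]
pred_b = [15.0, 15.0, 35.0, 35.0]

# SS_tot depends only on y_true, so on the same data R^2 = 1 - n * RMSE^2 / SS_tot:
# the model with the lower RMSE necessarily has the higher R^2.
for name, pred in [("a", pred_a), ("b", pred_b)]:
    print(name, round(rmse_of(y_true, pred), 3), round(r2_of(y_true, pred), 3))
```

The identity holds only when both metrics are computed on the same set, as in the loop above, where both use `X_val` and `y_val`.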

In [15]:
# Refit with the parameters selected on the validation set and evaluate once on the test set.
rfr = RandomForestRegressor(random_state=123, n_estimators=50, max_depth=8)
rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)
print("Final model results on the test set:")
r2 = r2_score(y_test, y_pred)
print(f"R^2:    {r2:.2f},")
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"RMSE:   {rmse:.2f}.")

Final model results on the test set:
R^2:    0.87,
RMSE:   30545.16.


### XGBRegressor

In [16]:
# Early stopping on the validation set. In xgboost 1.6 and later,
# early_stopping_rounds is a constructor argument; use_label_encoder applies only to classifiers and is dropped.
xgbr = XGBRegressor(objective="reg:squarederror", random_state=123, early_stopping_rounds=15)
xgbr.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
y_pred = xgbr.predict(X_test)
print("Final model results on the test set:")
r2 = r2_score(y_test, y_pred)
print(f"R^2:   {r2:.2f},")
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"RMSE:  {rmse:.2f}.")

Final model results on the test set:
R^2:   0.91,
RMSE:  26385.20.


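## Summary

XGBoost outperformed the random forest on both tasks: test accuracy of 0.96 versus 0.94 for classification, and $R^2$ of 0.91 versus 0.87 (RMSE 26385 versus 30545) for regression. The random forest was tuned only over a small grid with depth capped at 8, while XGBoost ran with default parameters and early stopping, so the comparison is indicative rather than exhaustive.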