# Predictions, part I
- drop columns: no
- normalization: no
- hyperparameter tuning: no
- one-hot encoding: yes, the dataset was found encoded
- oversampling: no

In this session, I tested every model's performance on the unscaled dataset.\
**RandomForestClassifier appears to be the best performer**, both as a standalone model and as the base estimator for 2 ensemble methods.\
These are the best accuracy scores, in %, sorted in descending order:

- RandomForestClassifier = 86.31
- BaggingClassifier(RandomForestClassifier) = 86.18
- AdaBoostClassifier(RandomForestClassifier) = 86.01
- DecisionTreeClassifier = 82.35
- GradientBoostingClassifier = 81.49
- LogisticRegression = 79.94
- KNeighborsClassifier = 77.20

In the next session, I will scale the data and repeat these tests, to compare the scores.
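
The ranking above can be produced in one pass by looping over the estimators. This is a minimal sketch, not the notebook's actual code. It uses a synthetic dataset from `make_classification` as a stand-in for `../data/02.csv`. The `models` dictionary, the `ranking` name, the reduced `n_estimators`, the `max_iter` value and the fixed `random_state` values are all assumptions, added only to keep the example small and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data (assumption, not the real dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Hypothetical registry of the standalone models compared in this notebook.
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Fit each model and store its test accuracy in %.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test) * 100
    for name, model in models.items()
}

# Sort by accuracy, best first, and print in the same format as the list above.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranking:
    print(f"{name} = {accuracy:.2f}")
```

Fixing `random_state` on the stochastic models makes the scores repeatable between runs, which matters when the top results differ by only a few tenths of a point.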

# train_test_split

In [None]:
%run common_imports.py

df = pd.read_csv("../data/02.csv")
features = df.drop(columns=["is_canceled"])
target = df["is_canceled"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

display(X_train.head(), "")
display(y_train.head(), "")

## Models

In [None]:
# KNeighborsClassifier = 77.2

knn = KNeighborsClassifier()
display(knn.fit(X_train, y_train), "")

accuracy = knn.score(X_test, y_test) * 100
print(f"KNN accuracy is {accuracy:.2f}% \n")

pred = knn.predict(X_test)
print(classification_report(y_pred = pred, y_true = y_test), "\n")

In [None]:
"""
LogisticRegression = 79.94

- top 5 features by absolute coefficient
  - deposit_type_No_Deposit 1.97
  - deposit_type_Non_Refund 1.69
  - previous_cancellations 1.27
  - required_car_parking_spaces 0.98
  - market_segment_Offline_TA_TO 0.78
"""

lr = LogisticRegression()
display(lr.fit(X_train, y_train), "")

accuracy = lr.score(X_test, y_test) * 100
print(f"Logistic Regression score is {accuracy:.2f}% \n")

pred = lr.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

# individual coefficients
coefficients = lr.coef_[0]
feature_names = X_train.columns
coefficients_df = pd.DataFrame({"Feature": feature_names, "Coefficient": coefficients})
coefficients_df["Absolute Coefficient"] = abs(coefficients_df["Coefficient"])
coefficients_df[["Coefficient", "Absolute Coefficient"]] = coefficients_df[["Coefficient", "Absolute Coefficient"]].round(2)
coefficients_df = coefficients_df.sort_values(by="Absolute Coefficient", ascending=False)
display(coefficients_df)

In [None]:
"""
DecisionTreeClassifier = 82.35

- tree importance:
  - 'deposit_type_Non_Refund': '0.24',
  - 'lead_time': '0.16',
  - 'avg_daily_rate': '0.12',
  - 'arrival_date_day_of_month': '0.08',
  - 'arrival_date_week_number': '0.06',
  - 'total_of_special_requests': '0.06',
  - 'stays_in_week_nights': '0.04',
  - 'market_segment_Online_TA': '0.04',
  - 'stays_in_weekend_nights': '0.03',
  - 'previous_cancellations': '0.03',
  - 'required_car_parking_spaces': '0.02',
  - 'arrival_date_month': '0.01',
  - 'adults': '0.01',
  - 'children': '0.01',
  - 'previous_bookings_not_canceled': '0.01',
  - 'hotel_Resort': '0.01',
  - 'meal_BB': '0.01',
  - 'reserved_room_type_A': '0.01',
  - 'reserved_room_type_D': '0.01',
  - 'customer_type_Transient': '0.01'
"""

dt = DecisionTreeClassifier()
display(dt.fit(X_train, y_train), "")

accuracy = dt.score(X_test, y_test) * 100
print(f"Decision Tree accuracy is {accuracy:.2f}% \n")

pred = dt.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

# Sort by the numeric importance first, then format for display
# (sorting the already formatted strings would compare them as text).
sorted_importance = sorted(zip(X_train.columns, dt.feature_importances_), key=lambda item: item[1], reverse=True)
sorted_tree_importance = {feature: f"{importance:.2f}" for feature, importance in sorted_importance}
display(sorted_tree_importance, "")

# Shallow tree (max_depth=2), fitted only to get a readable visualization
dt_small = DecisionTreeClassifier(max_depth=2)
dt_small.fit(X_train, y_train)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_data = export_graphviz(dt_small, out_file=None, filled=True, rounded=True, feature_names=list(X_train.columns))
graphviz.Source(dot_data)

## Ensemble methods

I used the following 4 ensemble methods (with base estimators where possible); these are the best accuracy scores:
1. BaggingClassifier(RandomForestClassifier) = 86.18
2. **RandomForestClassifier = 86.31**
3. GradientBoostingClassifier = 81.49
4. AdaBoostClassifier(RandomForestClassifier) = 86.01

### BaggingClassifier

- #1 BaggingClassifier = 85.27
- #2 BaggingClassifier(KNeighborsClassifier) = 77.58
- #3 BaggingClassifier(LogisticRegression) = 79.94
- #4 BaggingClassifier(DecisionTreeClassifier) = 85.09
- #5 **BaggingClassifier(RandomForestClassifier) = 86.18**
- #6 BaggingClassifier(GradientBoostingClassifier) = 81.59
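
Entries #1 and #4 score almost the same. A likely reason is that `BaggingClassifier` falls back to a decision tree when no base estimator is passed, so both cells build the same kind of ensemble and differ only by random variation. The sketch below checks the default on a synthetic dataset. The data and the fixed `random_state` values are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data (assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# No base estimator given: scikit-learn uses a decision tree by default.
default_bagging = BaggingClassifier(random_state=0).fit(X_train, y_train)

# Base estimator passed positionally, as in the cells of this notebook.
tree_bagging = BaggingClassifier(DecisionTreeClassifier(), random_state=0).fit(X_train, y_train)

# Inspect what the default ensemble is actually made of.
member_type = type(default_bagging.estimators_[0]).__name__
n_members = len(default_bagging.estimators_)
print(f"default members: {n_members} x {member_type}")

# Print both test accuracies for a side-by-side comparison.
print(f"default bagging: {default_bagging.score(X_test, y_test) * 100:.2f}")
print(f"tree bagging:    {tree_bagging.score(X_test, y_test) * 100:.2f}")
```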

In [None]:
#1 BaggingClassifier = 85.27

bagging = BaggingClassifier()
display(bagging.fit(X_train, y_train), "")

accuracy = bagging.score(X_test, y_test) * 100
print(f"Bagging accuracy is {accuracy:.2f}% \n")

pred = bagging.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#2 BaggingClassifier(KNeighborsClassifier) = 77.58

bagging_knn = BaggingClassifier(KNeighborsClassifier())
display(bagging_knn.fit(X_train, y_train), "")

accuracy = bagging_knn.score(X_test, y_test) * 100
print(f"Bagging KNN accuracy is {accuracy:.2f}% \n")

pred = bagging_knn.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#3 BaggingClassifier(LogisticRegression) = 79.94

bagging_lr = BaggingClassifier(LogisticRegression())
display(bagging_lr.fit(X_train, y_train), "")

accuracy = bagging_lr.score(X_test, y_test) * 100
print(f"Bagging LR accuracy is {accuracy:.2f}% \n")

pred = bagging_lr.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#4 BaggingClassifier(DecisionTreeClassifier) = 85.09

bagging_dt = BaggingClassifier(DecisionTreeClassifier())
display(bagging_dt.fit(X_train, y_train), "")

accuracy = bagging_dt.score(X_test, y_test) * 100
print(f"Bagging DT accuracy is {accuracy:.2f}% \n")

pred = bagging_dt.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#5 BaggingClassifier(RandomForestClassifier) = 86.18

bagging_rf = BaggingClassifier(RandomForestClassifier())
display(bagging_rf.fit(X_train, y_train), "")

accuracy = bagging_rf.score(X_test, y_test) * 100
print(f"Bagging RF accuracy is {accuracy:.2f}% \n")

pred = bagging_rf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#6 BaggingClassifier(GradientBoostingClassifier) = 81.59

bagging_gb = BaggingClassifier(GradientBoostingClassifier())
display(bagging_gb.fit(X_train, y_train), "")

accuracy = bagging_gb.score(X_test, y_test) * 100
print(f"Bagging GB accuracy is {accuracy:.2f}% \n")

pred = bagging_gb.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
# RandomForestClassifier = 86.31

rf = RandomForestClassifier()
display(rf.fit(X_train, y_train), "")

accuracy = rf.score(X_test, y_test) * 100
print(f"RandomForest accuracy is {accuracy:.2f}% \n")

pred = rf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
# GradientBoostingClassifier = 81.49

gb = GradientBoostingClassifier()
display(gb.fit(X_train, y_train), "")

accuracy = gb.score(X_test, y_test) * 100
print(f"GradientBoosting accuracy is {accuracy:.2f}% \n")

pred = gb.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

### AdaBoostClassifier

- #1 AdaBoostClassifier = 81.39
- #2 AdaBoostClassifier(LogisticRegression) = 79.46
- #3 AdaBoostClassifier(DecisionTreeClassifier) = 82.25

Computationally expensive models:

- #4 **AdaBoostClassifier(RandomForestClassifier) = 86.01, runtime 10m**
- #5 AdaBoostClassifier(GradientBoostingClassifier) = 84.06, runtime 8m
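
The runtimes above can be measured in code instead of read off the notebook UI. The sketch below shows one way to do that with `time.perf_counter`. The `timed_fit` helper is hypothetical, and the synthetic data and deliberately small `n_estimators` values are assumptions chosen so the example finishes in seconds, not minutes.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Synthetic stand-in for the booking data (assumption, not the real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def timed_fit(model, X, y):
    """Fit the model and return (fitted_model, elapsed_seconds)."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# Small AdaBoost-over-RandomForest ensemble; the notebook uses the defaults,
# which are far more expensive.
small_ab_rf = AdaBoostClassifier(
    RandomForestClassifier(n_estimators=5, random_state=0),
    n_estimators=3,
    random_state=0,
)

model, seconds = timed_fit(small_ab_rf, X, y)
print(f"AdaBoost RF fit took {seconds:.2f}s")
```

Recording the time next to the accuracy makes it easier to judge whether the extra cost of a boosted forest is worth a result that is close to a plain `RandomForestClassifier`.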

In [None]:
#1 AdaBoostClassifier = 81.39

ab = AdaBoostClassifier()
display(ab.fit(X_train, y_train), "")

accuracy = ab.score(X_test, y_test) * 100
print(f"AdaBoost accuracy is {accuracy:.2f}% \n")

pred = ab.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#2 AdaBoostClassifier(LogisticRegression) = 79.46

ab_lr = AdaBoostClassifier(LogisticRegression())
display(ab_lr.fit(X_train, y_train), "")

accuracy = ab_lr.score(X_test, y_test) * 100
print(f"AdaBoost LR accuracy is {accuracy:.2f}% \n")

pred = ab_lr.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#3 AdaBoostClassifier(DecisionTreeClassifier) = 82.25

ab_dt = AdaBoostClassifier(DecisionTreeClassifier())
display(ab_dt.fit(X_train, y_train), "")

accuracy = ab_dt.score(X_test, y_test) * 100
print(f"AdaBoost DT accuracy is {accuracy:.2f}% \n")

pred = ab_dt.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#4 AdaBoostClassifier(RandomForestClassifier) = 86.01

ab_rf = AdaBoostClassifier(RandomForestClassifier())
display(ab_rf.fit(X_train, y_train), "")

accuracy = ab_rf.score(X_test, y_test) * 100
print(f"AdaBoost RF accuracy is {accuracy:.2f}% \n")

pred = ab_rf.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

In [None]:
#5 AdaBoostClassifier(GradientBoostingClassifier) = 84.06

ab_gb = AdaBoostClassifier(GradientBoostingClassifier())
display(ab_gb.fit(X_train, y_train), "")

accuracy = ab_gb.score(X_test, y_test) * 100
print(f"AdaBoost GB accuracy is {accuracy:.2f}% \n")

pred = ab_gb.predict(X_test)
print(classification_report(y_pred=pred, y_true=y_test), "\n")

Next: notebook_05_machine_learning_02_scaling