<h1><center>Canceled bookings at a hotel</center></h1>


You have been assigned the task of building a model that predicts whether a hotel customer will cancel their booking. The data for this assignment is in the CSV file `hotel_clf`.

<br> 
<div>
<img src="https://5.imimg.com/data5/PC/BL/MY-33192851/hotel-reservation-services-500x500.jpg" width="400"/>
</div>
<br> 
If the model predicts that a customer will cancel their booking, that customer will be sent a special deal to try to keep them from canceling. If the prediction is correct (a True Positive), the expected gain is 1000 SEK. However, if the prediction is wrong (a False Positive), the expected loss is 500 SEK. 

Your goal is to build the most profitable model possible.

<hr style="border:1px solid pink"> </hr>

## Q1 | Choose Metric

Reason about which metric you think will be best to optimize your model for.

- Recall?
- Precision?
- Accuracy?
- F1-score?

Decide which metric you think will lead to the most profitable model.

# Answer

If missing cancellations (False Negatives) is costly, we should emphasize recall.
If wrongly flagging cancellations (False Positives) is costly, we should emphasize precision.
Here a True Positive gains 1000 SEK and a False Positive loses 500 SEK, so both matter: higher recall captures more gains, while higher precision avoids losses.

The F1-score is the harmonic mean of precision and recall, so it is high only when both are reasonably high. That makes it useful when both types of errors are important.

Decision: F1-score. It balances the two error types. Note that it weights precision and recall equally, so it is a proxy for profit rather than a direct measure of it.
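
The payoffs can also be turned into a break-even point, which helps when reasoning about the metric. The following is a minimal sketch, assuming that False Negatives and True Negatives contribute 0 SEK (the task only states payoffs for True Positives and False Positives). The function names are illustrative and not part of the assignment.

```python
GAIN_TP = 1000  # SEK gained per correctly flagged cancellation (from the task)
LOSS_FP = 500   # SEK lost per wrongly flagged booking (from the task)


def expected_profit_per_offer(p_cancel: float) -> float:
    """Expected SEK from one offer sent to a customer who cancels with probability p_cancel."""
    return p_cancel * GAIN_TP - (1.0 - p_cancel) * LOSS_FP


def break_even_precision() -> float:
    """Precision at which the offers sent neither gain nor lose money on average."""
    # Solve p * GAIN_TP - (1 - p) * LOSS_FP = 0 for p.
    return LOSS_FP / (GAIN_TP + LOSS_FP)


print(round(break_even_precision(), 3))
print(round(expected_profit_per_offer(0.65), 1))
```

Under these assumptions, an offer has positive expected value whenever the chance of cancellation is above about one third.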

## Q2 | Data preparation

- Prepare your data so that you end up with a clean and preprocessed train and test set
    
    
- Instructions for train test split:    
    - Test size = 0.2
    - Random state = 42
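
As a rough illustration of the split instructions above, the sketch below drops incomplete rows, splits with `test_size=0.2` and `random_state=42`, and fits a scaler on the training rows only. The data frame is synthetic and its values are invented. The scaling step is an assumption about what "preprocessed" might include, not a requirement stated in the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hotel data; every value here is invented.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "lead_time": rng.integers(0, 300, size=100).astype(float),
    "adr": rng.normal(100, 30, size=100),
    "is_canceled": rng.integers(0, 2, size=100),
})
demo.loc[3, "adr"] = np.nan  # simulate one missing value

demo = demo.dropna()  # remove incomplete rows before splitting
X_demo = demo[["lead_time", "adr"]]
y_demo = demo["is_canceled"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets,
# so that no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```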

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("../data/hotel_clf.csv")

df = df.dropna(subset=["children"])

In [23]:
correlation_matrix = df.select_dtypes(include=['number']).corr()
correlation_matrix

| | is_canceled | lead_time | adults | children | booking_changes | adr |
|---|---|---|---|---|---|---|
| is_canceled | 1.0 | 0.315002 | 0.057276 | 0.002468 | -0.143197 | 0.033869 |
| lead_time | 0.315002 | 1.0 | 0.128026 | -0.038613 | -0.029949 | -0.054567 |
| adults | 0.057276 | 0.128026 | 1.0 | 0.037699 | -0.075726 | 0.177399 |
| children | 0.002468 | -0.038613 | 0.037699 | 1.0 | 0.052715 | 0.231473 |
| booking_changes | -0.143197 | -0.029949 | -0.075726 | 0.052715 | 1.0 | 0.009761 |
| adr | 0.033869 | -0.054567 | 0.177399 | 0.231473 | 0.009761 | 1.0 |


In [15]:
X = df[["lead_time", "adults", "children", "booking_changes", "adr"]]  # pick relevant numeric columns
y = df["is_canceled"]  # binary target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculate metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-Score:", f1)

# Decide which metric is most profitable based on the cost of false positives vs. false negatives.

Accuracy: 0.6815
Precision: 0.6518518518518519
Recall: 0.3473684210526316
F1-Score: 0.4532188841201717


In [16]:
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
profit = tp * 1000 - fp * 500

print("TP:", tp, "FP:", fp)
print("Estimated Profit (SEK):", profit)

TP: 264 FP: 141
Estimated Profit (SEK): 193500


## Q3 | Build a LogReg Model

Guidelines:
- Use a LogisticRegression model
    - Random state = 42
- Use the metric you decided on in the previous question

- You are not allowed to change the model after looking at the performance on test data
- Your model's predictions on test data will be translated into SEK, e.g.:
    - 10 TP = 10 * 1 000 SEK = +10 000 SEK
    - 10 FP = 10 * -500 SEK = -5 000 SEK
        - Expected Value from model = +5 000 SEK 
        
        
After you have trained your model, make predictions for your test data and calculate the profitability of the model.
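
The translation into SEK described in the guidelines can be checked with a few lines of arithmetic. This is a small sketch. The helper name `expected_value` is hypothetical, and the second call reuses the TP and FP counts printed for the logistic regression in Q2.

```python
def expected_value(tp: int, fp: int, gain: int = 1000, loss: int = 500) -> int:
    """Translate true-positive and false-positive counts into SEK."""
    return tp * gain - fp * loss


print(expected_value(tp=10, fp=10))    # 5000, the worked example from the guidelines
print(expected_value(tp=264, fp=141))  # 193500, the LogReg counts from Q2
```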

In [17]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf_cv = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf_cv, X, y, cv=5, scoring='f1')  # or 'precision', 'recall', etc.

print("Cross-validation F1 scores:", scores)
print("Mean F1 score:", scores.mean())

Cross-validation F1 scores: [0.5975522  0.62848752 0.62093352 0.60448808 0.58479532]
Mean F1 score: 0.6072513272564232


In [None]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)
prec_rf = precision_score(y_test, y_pred_rf)
rec_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

print("RFC - Accuracy:", acc_rf)
print("RFC - Precision:", prec_rf)
print("RFC - Recall:", rec_rf)
print("RFC - F1-Score:", f1_rf)

# Profit
tn_rf, fp_rf, fn_rf, tp_rf = confusion_matrix(y_test, y_pred_rf).ravel()
profit_rf = tp_rf * 1000 - fp_rf * 500
print("RFC - Profit (SEK):", profit_rf)

RFC - Accuracy: 0.72
RFC - Precision: 0.6533742331288344
RFC - Recall: 0.5605263157894737
RFC - F1-Score: 0.603399433427762
RFC - Profit (SEK): 313000


## Q4 | Build a RandomForestClassifier model

- Use a RandomForestClassifier model:
    - random_state = 42


- After you have trained your model, make predictions for your test data and calculate the profitability of the model

- Which model was more profitable, the LogReg or the RandomForestClassifier?

In [22]:
print("More profitable model:", "LogReg" if profit > profit_rf else "RandomForest")

More profitable model: RandomForest


## Q5 | Did you choose the right metric? 

Calculate the profitability of the RandomForestClassifier for all 4 different metrics. Then rank the outcomes, e.g.:

- RFC (precision) = 1
- RFC (accuracy) = 2
- ...
- ...


***Note:*** You don't have to use a param_grid for this question, just run the RandomForest with default settings
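
A metric can only change the outcome if it is used to select between candidate models. With one fixed configuration, every metric leads to the same fitted forest. The sketch below therefore assumes a tiny hypothetical grid (`demo_grid`) and synthetic data from `make_classification`. Both are stand-ins and not part of the assignment. It passes each metric name to `GridSearchCV` via `scoring` and converts the resulting test predictions into SEK.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; swap in the prepared hotel train/test sets.
X_syn, y_syn = make_classification(n_samples=400, n_features=5, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)


def profit_from_predictions(y_true, y_hat):
    """SEK profit: +1000 per true positive, -500 per false positive."""
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return int(tp) * 1000 - int(fp) * 500


# Hypothetical grid, only so that the metric has something to choose between.
demo_grid = {"max_depth": [2, None]}

profit_by_metric = {}
for metric_name in ["accuracy", "precision", "recall", "f1"]:
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=42),
        param_grid=demo_grid,
        scoring=metric_name,  # here the metric drives model selection
        cv=3,
    )
    search.fit(Xs_train, ys_train)
    profit_by_metric[metric_name] = profit_from_predictions(
        ys_test, search.predict(Xs_test)
    )

ranking = sorted(profit_by_metric.items(), key=lambda kv: kv[1], reverse=True)
for rank, (metric_name, sek) in enumerate(ranking, start=1):
    print(f"{rank}. RFC ({metric_name}): {sek} SEK")
```

Ties remain possible whenever two metrics select the same hyperparameters.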

In [None]:
from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None]
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=params,
    scoring='f1',
    cv=5
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best score:", grid.best_score_)

Best parameters: {'max_depth': None, 'n_estimators': 50}
Best score: 0.6012536503927767


In [20]:
from sklearn.metrics import make_scorer

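# Helper function
def evaluate_profit(y_true, y_pred):
    """Return profit in SEK: +1000 per true positive, -500 per false positive."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp * 1000 - fp * 500

# Evaluate RFC under each metric
metrics = {
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'f1': f1_score
}

# NOTE: `scorer` is never used below, so every iteration fits the same default
# RandomForestClassifier and the four profits are identical by construction.
profits = {}
for name, scorer in metrics.items():
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train, y_train)
    # Loop-local name, so the LogReg `y_pred` from Q2 is not overwritten
    y_pred_metric = rf.predict(X_test)
    profits[name] = evaluate_profit(y_test, y_pred_metric)

# Sort and print (`metric_profit` avoids overwriting the LogReg `profit`)
sorted_profits = sorted(profits.items(), key=lambda x: x[1], reverse=True)
for rank, (metric, metric_profit) in enumerate(sorted_profits, start=1):
    print(f"{rank}. RFC ({metric}) → Profit: {metric_profit} SEK")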

1. RFC (accuracy) → Profit: 313000 SEK
2. RFC (precision) → Profit: 313000 SEK
3. RFC (recall) → Profit: 313000 SEK
4. RFC (f1) → Profit: 313000 SEK
