# table of contents
1. [predictions, part I](#predictions,-part-I)
2. [preprocessing](#preprocessing)
3. [calculations](#calculations)
   1. [batch model evaluation](#batch-model-evaluation)
   2. [single model evaluation](#single-model-evaluation)
   3. [LogisticRegression, coefficients](#LogisticRegression,-coefficients)
   4. [DecisionTreeClassifier, tree importance](#DecisionTreeClassifier,-tree-importance) 

# predictions, part I
- drop columns: no
- **scaling: yes**
- hyperparameter tuning: no
- one-hot encoding: yes, the dataset was received encoded
- resampling: no

**The main takeaways from this session:**
- I'm testing every model's performance on unscaled and scaled data.
- RandomForestClassifier is the best performer, both as a standalone model and as a base estimator for 2 ensemble methods.
- Scaling significantly improves accuracy only for KNN and for Bagging with KNN as the base estimator.
- AdaBoostClassifier has an extremely high computational cost when paired with 2 of the base estimators.

# preprocessing

In [2]:
# import libraries
%run common_imports.py

# load and split data
%run load_and_split_data.py
X_train, X_test, y_train, y_test = load_and_split_data()

# scale data
%run minmaxscaler.py
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

Unnamed: 0,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,...,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
104182,23,2,11,1,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110320,102,17,24,4,1,3,2,0,0,0,...,0,0,0,1,0,0,0,0,0,1
60388,489,46,10,11,0,2,2,0,0,0,...,0,0,0,0,1,0,0,0,1,0
105591,36,7,12,2,2,1,2,0,0,0,...,0,0,0,1,0,0,0,0,1,0
73207,101,33,17,8,1,3,2,0,0,0,...,0,0,0,1,0,0,0,0,1,0







104182    0
110320    0
60388     1
105591    0
73207     1
Name: is_canceled, dtype: int64

Unnamed: 0,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,...,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
0,0.031208,0.019231,0.333333,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.138399,0.307692,0.766667,0.272727,0.052632,0.06,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.663501,0.865385,0.3,0.909091,0.0,0.04,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.048847,0.115385,0.366667,0.090909,0.105263,0.02,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.137042,0.615385,0.533333,0.636364,0.052632,0.06,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0







Unnamed: 0,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,...,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
0,0.028494,0.423077,0.1,0.454545,0.105263,0.08,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.005427,0.019231,0.433333,0.0,0.0,0.02,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.02578,0.769231,0.166667,0.818182,0.0,0.06,0.018182,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.005427,0.019231,0.3,0.0,0.0,0.04,0.018182,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.187246,0.480769,0.866667,0.454545,0.0,0.06,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
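
The helper script `minmaxscaler.py` is not shown in this notebook. Below is a minimal sketch of what `scale_data` might look like, assuming it wraps scikit-learn's `MinMaxScaler`, fits it on the training split only, and returns DataFrames with a fresh 0-based index (as the scaled previews above suggest). The function body and the toy data are assumptions, not the actual helper.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_data(X_train, X_test):
    """Hypothetical stand-in for the helper defined in minmaxscaler.py."""
    scaler = MinMaxScaler()
    # Fit on the training split only so no test-set statistics leak into training.
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    # Reuse the training minima and maxima for the test split.
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    return X_train_scaled, X_test_scaled


# Toy data, not the hotel booking dataset.
toy_train = pd.DataFrame({"lead_time": [0, 50, 100], "adults": [1, 2, 3]}, index=[7, 3, 9])
toy_test = pd.DataFrame({"lead_time": [25, 150], "adults": [2, 1]}, index=[4, 5])
toy_train_scaled, toy_test_scaled = scale_data(toy_train, toy_test)
print(toy_train_scaled)
print(toy_test_scaled)
```

Because the scaler only sees the training split, test values beyond the training range can land outside [0, 1], as the `lead_time` of 150 does in the toy test set.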


# calculations

There are 2 approaches, depending on the needs and computing capacities:


1. **batch:** the #model_accuracy function calculates accuracy scores and computation time for the given models and saves both into a DataFrame.\
This approach is recommended because the results can be easily collected and compared in the next iterations: unscaled vs scaled data, etc.\
Caution: computation time for this dataset is around 20 minutes.

2. **single model:** the #train_evaluate_runtime function calculates the accuracy score, classification report, and runtime for any given model.\
This approach saves time by evaluating only the desired individual model.
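
Both approaches rely on a `@timer` decorator that is loaded from the helper scripts and not shown here. The sketch below is one way such a decorator could work, assuming it returns a `(result, seconds)` tuple, which is how the cells below unpack it, and rounds to whole seconds like the runtimes in the results table. The implementation and the `slow_sum` workload are hypothetical.

```python
import time
from functools import wraps


def timer(func):
    """Hypothetical version of the @timer decorator; returns (result, seconds)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # Whole seconds, matching the integer runtimes in the results table.
        elapsed = round(time.perf_counter() - start)
        return result, elapsed
    return wrapper


@timer
def slow_sum(n):
    # Stand-in workload; any function can be wrapped the same way.
    return sum(range(n))


total, seconds = slow_sum(1_000)
print(total, seconds)
```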

## batch model evaluation

In [15]:
# calculate accuracy score and computation cost for a group of models, save the results into a DataFrame

@timer
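def model_accuracy(model, scaled):
    # Pick the scaled or unscaled split for this run
    X_train_use = X_train_scaled if scaled else X_train
    X_test_use = X_test_scaled if scaled else X_test

    display(model.fit(X_train_use, y_train))

    model_name = type(model).__name__
    # Resolve the base estimator after fitting: an ensemble created without an explicit
    # estimator only exposes its default one (as `estimator_`) once it is fitted
    fitted_estimator = getattr(model, "estimator_", None)
    if fitted_estimator is None:
        # Attribute name used by older scikit-learn releases
        fitted_estimator = getattr(model, "base_estimator_", None)
    estimator_name = type(fitted_estimator).__name__ if fitted_estimator is not None else model_name

    accuracy = round(model.score(X_test_use, y_test) * 100, 2)
    return model_name, estimator_name, accuracy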

# List of models
models = [
    KNeighborsClassifier(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    BaggingClassifier(),
    BaggingClassifier(KNeighborsClassifier()),
    BaggingClassifier(LogisticRegression()),
    BaggingClassifier(RandomForestClassifier()),
    BaggingClassifier(GradientBoostingClassifier()),
    AdaBoostClassifier(),
    AdaBoostClassifier(LogisticRegression()),
    AdaBoostClassifier(RandomForestClassifier()),
    AdaBoostClassifier(GradientBoostingClassifier())
]

# Empty lists to store results
model_names = []
estimator_names = []
accuracies = []
times = []
sources = []

# Calculate accuracy and time for each model with both scaled and unscaled data
for model in models:
    for scaled in [False, True]:
        (model_name, estimator_name, accuracy), time_taken = model_accuracy(model, scaled=scaled)
        model_names.append(model_name)
        estimator_names.append(estimator_name)
        accuracies.append(accuracy)
        times.append(time_taken)
        sources.append("scaled" if scaled else "unscaled")

# Create DataFrame
accuracies_without_parameters = pd.DataFrame({
    "model": model_names,
    "estimator": estimator_names,
    "accuracy_in_%": accuracies,
    "runtime_in_seconds": times,
    "source": sources
})

# Save the DataFrame to a CSV file
accuracies_without_parameters.to_csv("../data/accuracies_without_parameters.csv", index=False)

# Display the DataFrame
accuracies_without_parameters

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
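
The convergence warning above comes from LogisticRegression and names two remedies: raising `max_iter` or scaling the data. The sketch below combines both in a pipeline on synthetic data. The value `max_iter=1000` is an assumption (the scikit-learn default is 100), and whether it is enough for the booking dataset was not checked here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, not the hotel booking dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Scale inside the pipeline and give the solver a larger iteration budget.
clf = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
log_reg = clf.named_steps["logisticregression"]
print("iterations used:", log_reg.n_iter_[0], "of", log_reg.max_iter)
```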


|  | model | estimator | accuracy_in_% | runtime_in_seconds | source |
|---|---|---|---|---|---|
| 0 | KNeighborsClassifier | KNeighborsClassifier | 77.2 | 2 | unscaled |
| 1 | KNeighborsClassifier | KNeighborsClassifier | 80.77 | 2 | scaled |
| 2 | LogisticRegression | LogisticRegression | 79.94 | 0 | unscaled |
| 3 | LogisticRegression | LogisticRegression | 79.75 | 0 | scaled |
| 4 | DecisionTreeClassifier | DecisionTreeClassifier | 82.15 | 0 | unscaled |
| 5 | DecisionTreeClassifier | DecisionTreeClassifier | 82.41 | 0 | scaled |
| 6 | RandomForestClassifier | DecisionTreeClassifier | 86.41 | 7 | unscaled |
| 7 | RandomForestClassifier | DecisionTreeClassifier | 86.57 | 7 | scaled |
| 8 | GradientBoostingClassifier | GradientBoostingClassifier | 81.49 | 9 | unscaled |
| 9 | GradientBoostingClassifier | GradientBoostingClassifier | 81.49 | 9 | scaled |
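
To read the effect of scaling directly from these results, the long table can be pivoted so that each model has one row with an unscaled and a scaled column. The sketch below uses a subset of the rows shown above; the column name `scaling_gain` is introduced here for illustration.

```python
import pandas as pd

# A subset of rows copied from the results table above.
results = pd.DataFrame({
    "model": ["KNeighborsClassifier", "KNeighborsClassifier",
              "LogisticRegression", "LogisticRegression",
              "RandomForestClassifier", "RandomForestClassifier"],
    "estimator": ["KNeighborsClassifier", "KNeighborsClassifier",
                  "LogisticRegression", "LogisticRegression",
                  "DecisionTreeClassifier", "DecisionTreeClassifier"],
    "accuracy_in_%": [77.2, 80.77, 79.94, 79.75, 86.41, 86.57],
    "source": ["unscaled", "scaled"] * 3,
})

# One row per (model, estimator) pair, one column per data source.
comparison = results.pivot(index=["model", "estimator"], columns="source", values="accuracy_in_%")
# Positive values mean scaling improved accuracy.
comparison["scaling_gain"] = (comparison["scaled"] - comparison["unscaled"]).round(2)
print(comparison.sort_values("scaling_gain", ascending=False))
```

Indexing on both `model` and `estimator` keeps the pivot unambiguous once the Bagging and AdaBoost variants, which share a model name, are included.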


## single model evaluation

In [None]:
# calculate accuracy score, classification report, runtime for a given model

@timer
def train_evaluate_model(model):
    model.fit(X_train, y_train)
    accuracy = round(model.score(X_test, y_test) * 100, 2)
    pred = model.predict(X_test)
    return accuracy, pred

def train_evaluate_runtime(model_class, *args, **kwargs):
    model = model_class(*args, **kwargs)
    (result, runtime) = train_evaluate_model(model)
    accuracy, pred = result  

    print(f"The accuracy of the model is {accuracy}%")
    print(f"Runtime (seconds): {runtime}\n")
    print(classification_report(y_true=y_test, y_pred=pred))

# Example: train_evaluate_runtime(KNeighborsClassifier)
# Example: train_evaluate_runtime(KNeighborsClassifier, n_neighbors=5)

## LogisticRegression, coefficients
Below is the #log_reg_coefficients function, which calculates and displays the coefficients of the features in a logistic regression model, sorted by absolute value in descending order as an indication of their impact on the model's predictions.

**Top 5 features by absolute coefficient:**
1. deposit_type_No_Deposit 1.97
2. deposit_type_Non_Refund 1.69
3. previous_cancellations 1.27
4. required_car_parking_spaces 0.98
5. market_segment_Offline_TA_TO 0.78

In [None]:
def log_reg_coefficients():
    coefficients = LogisticRegression().fit(X_train, y_train).coef_[0]
    feature_names = X_train.columns
    coefficients_df = pd.DataFrame({"Feature": feature_names, "Coefficient": coefficients})
    coefficients_df["Absolute Coefficient"] = abs(coefficients_df["Coefficient"])
    coefficients_df[["Coefficient", "Absolute Coefficient"]] = coefficients_df[["Coefficient", "Absolute Coefficient"]].round(2)
    coefficients_df = coefficients_df.sort_values(by="Absolute Coefficient", ascending=False)
    display(coefficients_df)

log_reg_coefficients()
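
One caveat when ranking features by absolute coefficient: the magnitude of a logistic regression coefficient depends on the units of its feature, and `log_reg_coefficients` fits on the unscaled `X_train`. The sketch below illustrates the effect on synthetic data with two equally informative features, one of them stored in units 100 times larger. The data, the variable names, and `max_iter=1000` are assumptions made for the illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: two equally informative features, one stored in units 100x larger.
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = rng.normal(size=2000)
y_demo = (a + b + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_demo = pd.DataFrame({"a": a, "b_times_100": b * 100})

# Coefficients from the raw features: the large-unit feature gets a much smaller coefficient.
raw_coef = LogisticRegression(max_iter=1000).fit(X_demo, y_demo).coef_[0]

# Coefficients after min-max scaling: both features share the [0, 1] range.
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)
scaled_coef = LogisticRegression(max_iter=1000).fit(X_demo_scaled, y_demo).coef_[0]

print(pd.DataFrame({"raw": raw_coef, "scaled": scaled_coef}, index=X_demo.columns).round(3))
```

In this notebook the same comparison could be made by fitting on `X_train_scaled` instead of `X_train`.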

## DecisionTreeClassifier, tree importance
Below is the #dt_tree_importance function, which calculates the importance of each feature in the decision tree, sorts the features in descending order of importance, displays them as a DataFrame, and generates, displays, and saves a visualisation of the tree.

With the max_depth=2 parameter, the tree yields these top 3 features (with their importance scores):
1. **deposit_type_Non_Refund (0.88):** there are more refundable canceled bookings in absolute terms, but 99% of non-refundable ones were canceled.
2. lead_time (0.12): refundable bookings with lead times <= 14.5 days are less likely to be canceled.
3. previous_bookings_not_canceled (almost 0): a history of not canceling previous bookings is a strong predictor of not canceling the current booking.

In [None]:
def dt_tree_importance():
    dt = DecisionTreeClassifier(max_depth=2)
    dt.fit(X_train, y_train)
    
    # Sort on the numeric importance values; format to 2 decimals only for display
    sorted_importance = sorted(zip(X_train.columns, dt.feature_importances_), key=lambda item: item[1], reverse=True)
    sorted_tree_importance = {feature: f"{importance:.2f}" for feature, importance in sorted_importance}

    df = pd.DataFrame(sorted_tree_importance.items(), columns=["Feature", "Importance"])
    display(df)

    dot_data = export_graphviz(dt, out_file=None, filled=True, rounded=True, feature_names=X_train.columns)
    graph = graphviz.Source(dot_data)
    graph.format = "png"
    graph.render("decision_tree_unscaled")
    display(graph)

dt_tree_importance()

Next: notebook_05_machine_learning_02_hyperparameter_tuning