# Predictions, part III
- drop columns: no
- scaler: yes
- **hyperparameter tuning: yes**
- one-hot encoding: yes, the dataset was already encoded
- oversampling: no

In this session, I fine-tuned the hyperparameters of 7 algorithms.\
I ran each tuning on unscaled and scaled data, using 2 tuning techniques: GridSearchCV and RandomizedSearchCV.\
About half the time, there was negligible to no difference in results and runtime between unscaled and scaled data, or between GridSearchCV and RandomizedSearchCV.\
Only 1 algorithm benefited greatly from scaling, and 2 algorithms were too computationally expensive for GridSearchCV but performed very well with RandomizedSearchCV.
___
**RandomForestClassifier** is the algorithm that ticked all the boxes:
- it didn't require scaling,
- it performed the same with unscaled and scaled data, with both tuning techniques,
- and it's the top performer:
  - best parameters: {'bootstrap': True, 'max_features': 0.5, 'n_estimators': 100},
  - accuracy 86.53%,
  - runtime 3 minutes.

# train_test_split

In [2]:
%run "common_imports.py"

df = pd.read_csv("../data/02.csv")
features = df.drop(columns=["is_canceled"])
target = df["is_canceled"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

display(X_train.head(), "")
display(y_train.head(), "")

Unnamed: 0,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,...,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
104182,23,2,11,1,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
110320,102,17,24,4,1,3,2,0,0,0,...,0,0,0,1,0,0,0,0,0,1
60388,489,46,10,11,0,2,2,0,0,0,...,0,0,0,0,1,0,0,0,1,0
105591,36,7,12,2,2,1,2,0,0,0,...,0,0,0,1,0,0,0,0,1,0
73207,101,33,17,8,1,3,2,0,0,0,...,0,0,0,1,0,0,0,0,1,0


''

104182    0
110320    0
60388     1
105591    0
73207     1
Name: is_canceled, dtype: int64

''

# scaler

In [3]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_train_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)

X_test_scaled = scaler.transform(X_test)
X_test_df = pd.DataFrame(X_test_scaled, columns=X_test.columns)

display(X_train_df.head(), "")
display(X_test_df.head(), "")

Unnamed: 0,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,...,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
0,0.031208,0.019231,0.333333,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.138399,0.307692,0.766667,0.272727,0.052632,0.06,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.663501,0.865385,0.3,0.909091,0.0,0.04,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.048847,0.115385,0.366667,0.090909,0.105263,0.02,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.137042,0.615385,0.533333,0.636364,0.052632,0.06,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


''

Unnamed: 0,lead_time,arrival_date_week_number,arrival_date_day_of_month,arrival_date_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,...,reserved_room_type_G,reserved_room_type_H,reserved_room_type_L,deposit_type_No_Deposit,deposit_type_Non_Refund,deposit_type_Refundable,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
0,0.028494,0.423077,0.1,0.454545,0.105263,0.08,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.005427,0.019231,0.433333,0.0,0.0,0.02,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.02578,0.769231,0.166667,0.818182,0.0,0.06,0.018182,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.005427,0.019231,0.3,0.0,0.0,0.04,0.018182,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.187246,0.480769,0.866667,0.454545,0.0,0.06,0.036364,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


''
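A side note on the scaling step above: the scaler is fitted once on all of `X_train`, and the cross-validation later runs on `X_train_scaled`, so every validation fold has already contributed to the scaler's min and max. One common alternative is to put the scaler inside a `Pipeline`, so that it is refitted on the training part of each fold. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` instead of the booking dataset, a deliberately small grid, and hypothetical variable names (`demo_X`, `pipe`, `search`).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (assumption, for illustration only).
demo_X, demo_y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.20, random_state=0)

# The scaler is a pipeline step, so it is refitted inside every CV fold.
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("knn", KNeighborsClassifier())])

# Parameters are addressed as <step name>__<parameter name>.
grid = {"knn__n_neighbors": [3, 5, 7],
        "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {search.score(X_te, y_te) * 100:.2f}%")
```

With MinMaxScaler the practical difference is often small, but the pipeline keeps the cross-validated scores free of information from the validation folds.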

# Hyperparameter tuning

## GridSearchCV

4 algorithms were unaffected by scaling; training gave the same best parameters (BP), accuracy (A, in %) and runtime (RT) for unscaled and scaled data:
- DecisionTreeClassifier: BP {"max_depth": 15, "min_samples_leaf": 2, "min_samples_split": 2}, A 82.79, RT 15s
- BaggingClassifier: BP {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 100}, A 86.46, RT 12m
- **RandomForestClassifier: BP {'bootstrap': True, 'max_features': 0.5, 'n_estimators': 100}, A 86.53, RT 3m**
- AdaBoostClassifier: BP {'algorithm': 'SAMME', 'learning_rate': 1.0, 'n_estimators': 50}, A 80.53, RT 1m
  
2 algorithms are computationally expensive, so I interrupted them after 1h, without completion:
- KNeighborsClassifier,
- GradientBoostingClassifier.

LogisticRegression is the 1 algorithm that benefited greatly from scaling: accuracy is practically unchanged, but runtime drops from 40m to 1m:
- unscaled, BP {'C': 10, 'max_iter': 6000, 'penalty': 'l1', 'solver': 'liblinear'}, A 80.93, RT 40m
- **scaled**, BP {'C': 10, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'sag'}, A 80.82, RT 1m

### KNeighborsClassifier
- I interrupted the execution for both unscaled and scaled data after 1h, without completion.
- Training completed when I removed some parameters from the grid, but I'm interested in the results for the full grid.
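To get a feel for why the full grid is so slow, it can help to count the fits before starting the search. The sketch below uses `ParameterGrid` from scikit-learn on the same KNeighborsClassifier grid as in the cells of this section; the variable names (`knn_grid`, `n_candidates`, `n_fits`) are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the KNeighborsClassifier cells.
knn_grid = {"n_neighbors": range(1, 10),
            "weights": ["uniform", "distance"],
            "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
            "leaf_size": [10, 20, 30],
            "p": [1, 2]}

cv = 5
n_candidates = len(ParameterGrid(knn_grid))  # 9 * 2 * 4 * 3 * 2 = 432
n_fits = n_candidates * cv                   # 432 * 5 = 2160

print(f"{n_candidates} candidates, {n_fits} fits")
```

Each of those fits also has to score a validation fold, which for a nearest-neighbours model means distance computations against the training fold, so the cost adds up quickly.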

In [None]:
#unscaled

grid = {"n_neighbors": range(1, 10),
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "leaf_size": [10, 20, 30],
        "p": [1, 2]}

model = GridSearchCV(KNeighborsClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3)

display(model.fit(X_train, y_train), "")

display("Best parameters:", model.best_params_ , "")

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

In [None]:
#scaled

grid = {"n_neighbors": range(1, 10),
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "leaf_size": [10, 20, 30],
        "p": [1, 2]}

model = GridSearchCV(KNeighborsClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3)

display(model.fit(X_train_scaled, y_train), "")

display("Best parameters:", model.best_params_ , "")

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

### LogisticRegression
- unscaled, BP {'C': 10, 'max_iter': 6000, 'penalty': 'l1', 'random_state': 42, 'solver': 'liblinear'}, A 80.93, RT 40m
- **scaled, BP {'C': 10, 'max_iter': 1000, 'penalty': 'l2', 'random_state': 42, 'solver': 'sag'}, A 80.82, RT 1m**

In [None]:
#unscaled

grid_outer = {"solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
              "max_iter": [5000, 6000, 7000, 8000, 9000, 10000]}

model_outer = GridSearchCV(LogisticRegression(), 
                           grid_outer, 
                           cv=5,
                           scoring="accuracy", 
                           verbose=3, 
                           n_jobs=-1)

display(model_outer.fit(X_train, y_train),"")

best_params_outer = model_outer.best_params_
best_solver = best_params_outer["solver"]
best_max_iter = best_params_outer["max_iter"]

grid_inner = {"C": [0.001, 0.01, 0.1, 1, 10],
              "penalty": ["l1", "l2"],
              "solver": [best_solver],
              "max_iter": [best_max_iter]}

model_inner = GridSearchCV(LogisticRegression(),
                           grid_inner,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model_inner.fit(X_train, y_train),"")

best_params_inner = model_inner.best_params_
display("Best parameters from outer grid search:", best_params_outer ,"")
display("Best parameters from inner grid search:", best_params_inner ,"")

final_model = LogisticRegression(**best_params_inner)
display(final_model.fit(X_train, y_train),"")

final_model_accuracy = final_model.score(X_test, y_test) * 100
print(f"Accuracy of the final model: {final_model_accuracy:.2f}% \n")

pred = final_model.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

In [None]:
#scaled

grid_outer = {"solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
              "max_iter": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]}

model_outer = GridSearchCV(LogisticRegression(), 
                           grid_outer, 
                           cv=5,
                           scoring="accuracy", 
                           verbose=3, 
                           n_jobs=-1)

display(model_outer.fit(X_train_scaled, y_train),"")

best_params_outer = model_outer.best_params_
best_solver = best_params_outer["solver"]
best_max_iter = best_params_outer["max_iter"]

grid_inner = {"C": [0.001, 0.01, 0.1, 1, 10],
              "penalty": ["l1", "l2"],
              "solver": [best_solver],
              "max_iter": [best_max_iter]}

model_inner = GridSearchCV(LogisticRegression(),
                           grid_inner,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model_inner.fit(X_train_scaled, y_train),"")

best_params_inner = model_inner.best_params_
display("Best parameters from outer grid search:", best_params_outer ,"")
display("Best parameters from inner grid search:", best_params_inner ,"")

final_model = LogisticRegression(**best_params_inner)
display(final_model.fit(X_train_scaled, y_train),"")

final_model_accuracy = final_model.score(X_test_scaled, y_test) * 100
print(f"Accuracy of the final model: {final_model_accuracy:.2f}% \n")

pred = final_model.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

### DecisionTreeClassifier
- unaffected by scaling, BP {"max_depth": 15, "min_samples_leaf": 2, "min_samples_split": 2}, A 82.79, RT 15s
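The identical results are what one would expect: a decision tree splits on thresholds, and min-max scaling preserves the order of the values within each feature, so the same partitions remain available. The sketch below illustrates this on synthetic data (an assumption; it does not use the booking dataset), with hypothetical names such as `tree_raw`, `tree_scaled` and `agreement`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption, for illustration only).
demo_X, demo_y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.20, random_state=0)

demo_scaler = MinMaxScaler().fit(X_tr)

# Same tree settings and seed, once on raw and once on scaled features.
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=42).fit(
    demo_scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(demo_scaler.transform(X_te))

# Share of test points on which the two trees agree.
agreement = (pred_raw == pred_scaled).mean()
print(f"Share of identical predictions: {agreement:.3f}")
```

The two trees should agree on all or nearly all test points; any rare disagreement would come from floating-point rounding near a split threshold, not from the scaling itself.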

In [44]:
#unscaled

grid = {"max_depth": [None, 5, 10, 15],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4]}

model = GridSearchCV(DecisionTreeClassifier(random_state=42),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train, y_train), "")
display("Best parameters:", model.best_params_ , "")

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 36 candidates, totalling 180 fits


''

'Best parameters:'

{'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 2}

''

The accuracy of the tuned model is 82.79% 

              precision    recall  f1-score   support

           0       0.83      0.91      0.87     14940
           1       0.82      0.69      0.75      8902

    accuracy                           0.83     23842
   macro avg       0.83      0.80      0.81     23842
weighted avg       0.83      0.83      0.82     23842



In [46]:
#scaled

grid = {"max_depth": [None, 5, 10, 15],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4]}

model = GridSearchCV(DecisionTreeClassifier(random_state=42),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
display("Best parameters:", model.best_params_ , "")

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 36 candidates, totalling 180 fits


''

'Best parameters:'

{'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 2}

''

The accuracy of the tuned model is 82.79% 

              precision    recall  f1-score   support

           0       0.83      0.91      0.87     14940
           1       0.82      0.69      0.75      8902

    accuracy                           0.83     23842
   macro avg       0.83      0.80      0.81     23842
weighted avg       0.83      0.83      0.82     23842

[CV 2/5] END max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=42;, score=0.815 total time=   1.5s
[CV 4/5] END max_depth=None, min_samples_leaf=1, min_samples_split=5, random_state=42;, score=0.817 total time=   1.3s
[CV 2/5] END max_depth=None, min_samples_leaf=2, min_samples_split=2, random_state=42;, score=0.816 total time=   1.2s
[CV 5/5] END max_depth=None, min_samples_leaf=2, min_samples_split=5, random_state=42;, score=0.817 total time=   1.2s
[CV 3/5] END max_depth=None, min_samples_leaf=4, min_samples_split=2, random_state=42;, score=0.820 total time=   1.4s
[CV 2/5] END max_depth=None, min_s

### BaggingClassifier
- unaffected by scaling, BP {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 100}, A 86.46, RT 12m

In [None]:
#unscaled

grid = {"bootstrap": [True, False],
       "bootstrap_features": [True, False],
       "max_features": [0.5, 1.0],
       "max_samples": [0.5, 1.0],
       "n_estimators": [10, 50, 100],
       "random_state": [42]}

model = GridSearchCV(BaggingClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train, y_train),"")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

In [None]:
#scaled

grid = {"bootstrap": [True, False],
       "bootstrap_features": [True, False],
       "max_features": [0.5, 1.0],
       "max_samples": [0.5, 1.0],
       "n_estimators": [10, 50, 100],
       "random_state": [42]}

model = GridSearchCV(BaggingClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train_scaled, y_train),"")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

### RandomForestClassifier
- unaffected by scaling, BP {'bootstrap': True, 'max_features': 0.5, 'n_estimators': 100}, A 86.53, RT 3m

In [None]:
#unscaled

grid = {"bootstrap": [True, False],
        "max_features": [0.5, 1.0],
        "n_estimators": [10, 50, 100],
        "random_state": [42]}

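# Note: grid["bootstrap"][0] is True, so this branch never runs
# and max_samples keeps its default value (None).
if not grid["bootstrap"][0]:
    grid["max_samples"] = [None]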

model = GridSearchCV(RandomForestClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

In [None]:
#scaled

grid = {"bootstrap": [True, False],
        "max_features": [0.5, 1.0],
        "n_estimators": [10, 50, 100],
        "random_state": [42]}

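# Note: grid["bootstrap"][0] is True, so this branch never runs
# and max_samples keeps its default value (None).
if not grid["bootstrap"][0]:
    grid["max_samples"] = [None]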

model = GridSearchCV(RandomForestClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

### AdaBoostClassifier
- unaffected by scaling, BP {'algorithm': 'SAMME', 'learning_rate': 1.0, 'n_estimators': 50}, A 80.53, RT 1m

In [None]:
#unscaled

grid = {"algorithm": ["SAMME"],
        "n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "random_state": [42]}

model = GridSearchCV(AdaBoostClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

In [None]:
#scaled

grid = {"algorithm": ["SAMME"],
        "n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "random_state": [42]}

model = GridSearchCV(AdaBoostClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

### GradientBoostingClassifier
- unscaled, BP {'learning_rate': 0.5, 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 100, 'subsample': 1.0}, A 84.37, RT 8h
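- Out of curiosity, I left the unscaled run going overnight; it took 8h. The grid has 3^6 = 729 candidates, so with cv=5 and n_jobs=1 that is 3,645 sequential fits. I also tried the scaled data and manually interrupted it after 2h.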

In [None]:
#unscaled

grid = {"n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": [3, 5, 7],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "subsample": [0.5, 0.75, 1.0],
        "random_state": [42]}

model = GridSearchCV(GradientBoostingClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=1)

display(model.fit(X_train, y_train),"")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

In [None]:
#scaled

grid = {"n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": [3, 5, 7],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "subsample": [0.5, 0.75, 1.0],
        "random_state": [42]}

model = GridSearchCV(GradientBoostingClassifier(),
                     grid,
                     cv=5,
                     scoring="accuracy",
                     verbose=3,
                     n_jobs=1)

display(model.fit(X_train_scaled, y_train),"")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

## RandomizedSearchCV
- KNeighborsClassifier: unscaled data trains in 1 minute, but the scaled data has higher accuracy:
  - **unscaled, BP {'weights': 'distance', 'p': 1, 'n_neighbors': 4, 'leaf_size': 10, 'algorithm': 'kd_tree'}, A 79.96, RT 1m**
  - scaled, BP {'weights': 'distance', 'p': 1, 'n_neighbors': 8, 'leaf_size': 30, 'algorithm': 'kd_tree'}, A 83.42, RT 8m
- LogisticRegression: scaling improves training time, but has no effect on accuracy:
  - unscaled, BP {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 5000, 'C': 10}, A 80.93, RT 25m
  - **scaled, BP {'solver': 'sag', 'penalty': 'l2', 'max_iter': 9000, 'C': 10}, A 80.82, RT 1m**
- DecisionTreeClassifier:
  - unscaled: BP {'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 15}, A 82.65, RT 15s
  - **scaled: BP {'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 15}, A 82.72, RT 15s**
- GradientBoostingClassifier:
  - unscaled, BP {'subsample': 1.0, 'n_estimators': 50, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_depth': 5, 'learning_rate': 0.5}, A 83.23, RT 2m
  - **scaled, BP {'subsample': 1.0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_depth': 7, 'learning_rate': 0.5}, A 84.34, RT 2m**

3 ensemble methods are unaffected by scaling:
- BaggingClassifier: BP {'n_estimators': 100, 'max_samples': 1.0, 'max_features': 1.0, 'bootstrap_features': True, 'bootstrap': True}, A 86.40, RT 3m
- RandomForestClassifier: BP {'bootstrap': True, 'max_features': 0.5, 'n_estimators': 100}, A 86.53, RT 3m
- AdaBoostClassifier: BP {'n_estimators': 50, 'learning_rate': 1.0, 'algorithm': 'SAMME'}, A 80.53, RT 1m
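RandomizedSearchCV draws its candidates at random, so two runs of the same cell can test different combinations and report different best parameters. This is a plausible reason why a summary line and a re-run cell output may not match exactly. Passing `random_state` makes the draw repeatable. The sketch below shows the effect with `ParameterSampler`, the helper that performs the sampling; the seed value `0` and the names `draw_a`, `draw_b` and `search` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterSampler, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Same search space as the KNeighborsClassifier cells.
knn_space = {"n_neighbors": list(range(1, 10)),
             "weights": ["uniform", "distance"],
             "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
             "leaf_size": [10, 20, 30],
             "p": [1, 2]}

# Two draws with the same seed yield the same 10 candidates.
draw_a = list(ParameterSampler(knn_space, n_iter=10, random_state=0))
draw_b = list(ParameterSampler(knn_space, n_iter=10, random_state=0))
print("Identical candidate lists:", draw_a == draw_b)

# The same seed can be passed to the search itself (not fitted here).
search = RandomizedSearchCV(KNeighborsClassifier(),
                            knn_space,
                            n_iter=10,
                            cv=5,
                            scoring="accuracy",
                            random_state=0)
```

A fixed seed only makes the sampled candidates repeatable; it does not make 10 draws representative of all 432 combinations in this space.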

### KNeighborsClassifier
Unscaled data trains in 1 minute, but the scaled data has higher accuracy:
- unscaled, BP {'weights': 'distance', 'p': 1, 'n_neighbors': 4, 'leaf_size': 10, 'algorithm': 'kd_tree'}, A 79.96, RT 1m
- scaled, BP {'weights': 'distance', 'p': 1, 'n_neighbors': 8, 'leaf_size': 30, 'algorithm': 'kd_tree'}, A 83.42, RT 8m

In [30]:
#unscaled

grid = {"n_neighbors": range(1, 10),
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "leaf_size": [10, 20, 30],
        "p": [1, 2]}

model = RandomizedSearchCV(KNeighborsClassifier(),
                           grid,
                           n_iter=10,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train, y_train), "")

display("Best parameters:", model.best_params_, "")

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

'Best parameters:'

{'weights': 'distance',
 'p': 1,
 'n_neighbors': 8,
 'leaf_size': 10,
 'algorithm': 'brute'}

''

The accuracy of the tuned model is 80.90% 

              precision    recall  f1-score   support

           0       0.83      0.88      0.85     14940
           1       0.77      0.70      0.73      8902

    accuracy                           0.81     23842
   macro avg       0.80      0.79      0.79     23842
weighted avg       0.81      0.81      0.81     23842



In [None]:
#scaled

grid = {"n_neighbors": range(1, 10),
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "leaf_size": [10, 20, 30],
        "p": [1, 2]}

model = RandomizedSearchCV(KNeighborsClassifier(),
                           grid,
                           n_iter=10,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")

display("Best parameters:", model.best_params_, "")

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

### LogisticRegression
Scaled data trains 25 times faster, but offers no improvement in accuracy.
- unscaled, BP {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 5000, 'C': 10}, A 80.93, RT 25m
- **scaled, BP {'solver': 'sag', 'penalty': 'l2', 'max_iter': 9000, 'C': 10}, A 80.82, RT 1m**

In [51]:
#unscaled

grid_outer = {"solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
              "max_iter": [5000, 6000, 7000, 8000, 9000, 10000]}

model_outer = RandomizedSearchCV(LogisticRegression(), 
                                 grid_outer, 
                                 cv=5,
                                 scoring="accuracy", 
                                 verbose=3, 
                                 n_jobs=-1)

display(model_outer.fit(X_train, y_train),"")

best_params_outer = model_outer.best_params_
best_solver = best_params_outer["solver"]
best_max_iter = best_params_outer["max_iter"]

grid_inner = {"C": [0.001, 0.01, 0.1, 1, 10],
              "penalty": ["l1", "l2"],
              "solver": [best_solver],
              "max_iter": [best_max_iter]}

model_inner = RandomizedSearchCV(LogisticRegression(),
                                 grid_inner,
                                 cv=5,
                                 scoring="accuracy",
                                 verbose=3,
                                 n_jobs=-1)

display(model_inner.fit(X_train, y_train),"")

best_params_inner = model_inner.best_params_
display("Best parameters from outer grid search:", best_params_outer ,"")
display("Best parameters from inner grid search:", best_params_inner ,"")

final_model = LogisticRegression(**best_params_inner)
display(final_model.fit(X_train, y_train),"")

final_model_accuracy = final_model.score(X_test, y_test) * 100
print(f"Accuracy of the final model: {final_model_accuracy:.2f}% \n")

pred = final_model.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

'Best parameters from outer grid search:'

{'solver': 'liblinear', 'max_iter': 10000}

''

'Best parameters from inner grid search:'

{'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 10000, 'C': 0.1}

''

''

Accuracy of the final model: 80.89% 

              precision    recall  f1-score   support

           0       0.79      0.94      0.86     14940
           1       0.85      0.59      0.70      8902

    accuracy                           0.81     23842
   macro avg       0.82      0.77      0.78     23842
weighted avg       0.82      0.81      0.80     23842

[CV 1/5] END .........max_iter=7000, solver=sag;, score=0.791 total time= 7.0min
[CV 5/5] END .........max_iter=5000, solver=sag;, score=0.793 total time= 7.4min
[CV 4/5] END ........max_iter=10000, solver=sag;, score=0.813 total time= 4.4min
[CV 1/5] END C=0.001, max_iter=10000, penalty=l1, solver=liblinear;, score=0.770 total time=   1.1s
[CV 2/5] END C=0.01, max_iter=10000, penalty=l1, solver=liblinear;, score=0.807 total time=   3.3s
[CV 4/5] END C=0.01, max_iter=10000, penalty=l2, solver=liblinear;, score=0.810 total time=   2.8s
[CV 3/5] END C=0.1, max_iter=10000, penalty=l2, solver=liblinear;, score=0.805 total time=   3

In [47]:
#scaled

grid_outer = {"solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
              "max_iter": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]}

model_outer = RandomizedSearchCV(LogisticRegression(), 
                                 grid_outer, 
                                 cv=5,
                                 scoring="accuracy", 
                                 verbose=3, 
                                 n_jobs=-1)

display(model_outer.fit(X_train_scaled, y_train),"")

best_params_outer = model_outer.best_params_
best_solver = best_params_outer["solver"]
best_max_iter = best_params_outer["max_iter"]

grid_inner = {"C": [0.001, 0.01, 0.1, 1, 10],
              "penalty": ["l1", "l2"],
              "solver": [best_solver],
              "max_iter": [best_max_iter]}

model_inner = RandomizedSearchCV(LogisticRegression(),
                                 grid_inner,
                                 cv=5,
                                 scoring="accuracy",
                                 verbose=3,
                                 n_jobs=-1)

display(model_inner.fit(X_train_scaled, y_train),"")

best_params_inner = model_inner.best_params_
display("Best parameters from outer grid search:", best_params_outer ,"")
display("Best parameters from inner grid search:", best_params_inner ,"")

final_model = LogisticRegression(**best_params_inner)
display(final_model.fit(X_train_scaled, y_train),"")

final_model_accuracy = final_model.score(X_test_scaled, y_test) * 100
print(f"Accuracy of the final model: {final_model_accuracy:.2f}% \n")

pred = final_model.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Fitting 5 folds for each of 10 candidates, totalling 50 fits


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/milenko/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/milenko/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/milenko/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^

''

'Best parameters from outer grid search:'

{'solver': 'sag', 'max_iter': 9000}

''

'Best parameters from inner grid search:'

{'solver': 'sag', 'penalty': 'l2', 'max_iter': 9000, 'C': 10}

''

''

Accuracy of the final model: 80.82% 

              precision    recall  f1-score   support

           0       0.79      0.94      0.86     14940
           1       0.85      0.59      0.70      8902

    accuracy                           0.81     23842
   macro avg       0.82      0.76      0.78     23842
weighted avg       0.81      0.81      0.80     23842



### DecisionTreeClassifier
- unscaled: BP {'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 15}, A 82.65, RT 15s
- scaled: BP {'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 15}, A 82.72, RT 15s

In [52]:
#unscaled

grid = {"max_depth": [None, 5, 10, 15],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4]}

model = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), 
                           grid, 
                           n_iter=10,
                           cv=5, 
                           scoring="accuracy", 
                           verbose=3, 
                           n_jobs=-1)

display(model.fit(X_train, y_train), "")
display("Best parameters:", model.best_params_ , "")

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

'Best parameters:'

{'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 15}

''

The accuracy of the tuned model is 82.65% 

              precision    recall  f1-score   support

           0       0.83      0.91      0.87     14940
           1       0.82      0.69      0.75      8902

    accuracy                           0.83     23842
   macro avg       0.82      0.80      0.81     23842
weighted avg       0.83      0.83      0.82     23842



In [53]:
#scaled

grid = {"max_depth": [None, 5, 10, 15],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4]}

model = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), 
                           grid, 
                           n_iter=10,
                           cv=5, 
                           scoring="accuracy", 
                           verbose=3, 
                           n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
display("Best parameters:", model.best_params_ , "")

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

'Best parameters:'

{'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 15}

''

The accuracy of the tuned model is 82.72% 

              precision    recall  f1-score   support

           0       0.83      0.91      0.87     14940
           1       0.82      0.69      0.75      8902

    accuracy                           0.83     23842
   macro avg       0.82      0.80      0.81     23842
weighted avg       0.83      0.83      0.82     23842

[CV 2/5] END max_depth=10, min_samples_leaf=2, min_samples_split=10;, score=0.819 total time=   0.9s
[CV 1/5] END max_depth=5, min_samples_leaf=4, min_samples_split=10;, score=0.802 total time=   0.5s
[CV 2/5] END max_depth=15, min_samples_leaf=1, min_samples_split=5;, score=0.822 total time=   1.1s
[CV 5/5] END max_depth=15, min_samples_leaf=4, min_samples_split=2;, score=0.826 total time=   1.3s
[CV 4/5] END max_depth=5, min_samples_leaf=2, min_samples_split=2;, score=0.805 total time=   0.6s
[CV 2/5] END max_depth=15, min_samples_leaf=1, min_samples_split=2;, score=0.822 total time=   1.3s
[CV 2/5] END max_depth=10, mi

### BaggingClassifier
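- practically unaffected by scaling
- unscaled: BP {'n_estimators': 100, 'max_samples': 1.0, 'max_features': 1.0, 'bootstrap_features': True, 'bootstrap': True}, A 86.40, RT 3m
- scaled: BP {'n_estimators': 50, 'max_samples': 0.5, 'max_features': 1.0, 'bootstrap_features': True, 'bootstrap': False}, A 86.36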

In [56]:
#unscaled

grid = {"bootstrap": [True, False],
       "bootstrap_features": [True, False],
       "max_features": [0.5, 1.0],
       "max_samples": [0.5, 1.0],
       "n_estimators": [10, 50, 100],
       "random_state": [42]}

model = RandomizedSearchCV(BaggingClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train, y_train),"")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'random_state': 42, 'n_estimators': 100, 'max_samples': 1.0, 'max_features': 1.0, 'bootstrap_features': True, 'bootstrap': True}
The accuracy of the tuned model is 86.40% 

              precision    recall  f1-score   support

           0       0.86      0.94      0.90     14940
           1       0.88      0.73      0.80      8902

    accuracy                           0.86     23842
   macro avg       0.87      0.84      0.85     23842
weighted avg       0.87      0.86      0.86     23842



In [58]:
#scaled

grid = {"bootstrap": [True, False],
       "bootstrap_features": [True, False],
       "max_features": [0.5, 1.0],
       "max_samples": [0.5, 1.0],
       "n_estimators": [10, 50, 100],
       "random_state": [42]}

model = RandomizedSearchCV(BaggingClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train_scaled, y_train),"")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'random_state': 42, 'n_estimators': 50, 'max_samples': 0.5, 'max_features': 1.0, 'bootstrap_features': True, 'bootstrap': False}
The accuracy of the tuned model is 86.36% 

              precision    recall  f1-score   support

           0       0.86      0.94      0.90     14940
           1       0.88      0.73      0.80      8902

    accuracy                           0.86     23842
   macro avg       0.87      0.84      0.85     23842
weighted avg       0.87      0.86      0.86     23842

[CV 3/5] END bootstrap=False, bootstrap_features=True, max_features=0.5, max_samples=0.5, n_estimators=50, random_state=42;, score=0.840 total time=   9.4s
[CV 1/5] END bootstrap=True, bootstrap_features=False, max_features=0.5, max_samples=1.0, n_estimators=100, random_state=42;, score=0.851 total time=  28.0s
[CV 4/5] END bootstrap=False, bootstrap_features=False, max_features=0.5, max_samples=0.5, n_estimators=100, random_state=42;, score=0.854 total time=  24.7s
[CV 3/5] EN

### RandomForestClassifier
- unaffected by scaling, BP {'bootstrap': True, 'max_features': 0.5, 'n_estimators': 100}, A 86.53 (86.58 scaled), RT 3m
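
If `max_samples` should only be searched together with `bootstrap=True`, one option is to pass a list of dictionaries as `param_distributions`, which `RandomizedSearchCV` accepts. The sketch below is an illustration on synthetic data; the `max_samples` values, the reduced `n_estimators` list, and the `random_state` on the search are my assumptions, not settings used in this session.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Two sub-grids: max_samples is only varied where bootstrap=True.
# The 0.5 / None values for max_samples are illustrative assumptions.
param_distributions = [
    {"bootstrap": [True], "max_samples": [0.5, None],
     "max_features": [0.5, 1.0], "n_estimators": [10, 50]},
    {"bootstrap": [False],
     "max_features": [0.5, 1.0], "n_estimators": [10, 50]},
]

# Enumerate every valid combination to see what the search can draw from.
candidates = list(ParameterGrid(param_distributions))
print(len(candidates), "candidate combinations")

# Small synthetic dataset so the demo finishes quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions,
                            n_iter=5,
                            cv=3,
                            scoring="accuracy",
                            random_state=0)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
```

Because no sub-grid pairs `bootstrap=False` with a non-default `max_samples`, the search never builds a combination that `RandomForestClassifier` would reject.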

In [60]:
#unscaled

grid = {"bootstrap": [True, False],
        "max_features": [0.5, 1.0],
        "n_estimators": [10, 50, 100],
        "random_state": [42]}

# max_samples stays at its default (None); RandomForestClassifier only uses it when bootstrap=True.

model = RandomizedSearchCV(RandomForestClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'random_state': 42, 'n_estimators': 100, 'max_features': 0.5, 'bootstrap': True}
The accuracy of the tuned model is 86.53% 

              precision    recall  f1-score   support

           0       0.87      0.92      0.90     14940
           1       0.86      0.77      0.81      8902

    accuracy                           0.87     23842
   macro avg       0.86      0.85      0.85     23842
weighted avg       0.86      0.87      0.86     23842



In [61]:
#scaled

grid = {"bootstrap": [True, False],
        "max_features": [0.5, 1.0],
        "n_estimators": [10, 50, 100],
        "random_state": [42]}

# max_samples stays at its default (None); RandomForestClassifier only uses it when bootstrap=True.

model = RandomizedSearchCV(RandomForestClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'random_state': 42, 'n_estimators': 100, 'max_features': 0.5, 'bootstrap': True}
The accuracy of the tuned model is 86.58% 

              precision    recall  f1-score   support

           0       0.87      0.93      0.90     14940
           1       0.86      0.77      0.81      8902

    accuracy                           0.87     23842
   macro avg       0.86      0.85      0.85     23842
weighted avg       0.87      0.87      0.86     23842

[CV 4/5] END bootstrap=False, max_features=1.0, n_estimators=100, random_state=42;, score=0.817 total time= 1.3min
[CV 2/5] END bootstrap=False, max_features=0.5, n_estimators=50, random_state=42;, score=0.856 total time=  20.4s
[CV 1/5] END bootstrap=False, max_features=1.0, n_estimators=50, random_state=42;, score=0.812 total time=  38.8s
[CV 4/5] END bootstrap=False, max_features=0.5, n_estimators=100, random_state=42;, score=0.856 total time=  33.9s
[CV 2/5] END bootstrap=False, max_features=0.5, n_estimators=10, random_

### AdaBoostClassifier
- unaffected by scaling, BP {'n_estimators': 50, 'learning_rate': 1.0, 'algorithm': 'SAMME'}, A 80.53, RT 1m

In [62]:
#unscaled

grid = {"algorithm": ["SAMME"],
        "n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "random_state": [42]}

model = RandomizedSearchCV(AdaBoostClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'random_state': 42, 'n_estimators': 50, 'learning_rate': 1.0, 'algorithm': 'SAMME'}
The accuracy of the tuned model is 80.53% 

              precision    recall  f1-score   support

           0       0.79      0.95      0.86     14940
           1       0.87      0.57      0.68      8902

    accuracy                           0.81     23842
   macro avg       0.83      0.76      0.77     23842
weighted avg       0.82      0.81      0.79     23842



In [63]:
#scaled

grid = {"algorithm": ["SAMME"],
        "n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5, 1.0],
        "random_state": [42]}

model = RandomizedSearchCV(AdaBoostClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'random_state': 42, 'n_estimators': 50, 'learning_rate': 1.0, 'algorithm': 'SAMME'}
The accuracy of the tuned model is 80.53% 

              precision    recall  f1-score   support

           0       0.79      0.95      0.86     14940
           1       0.87      0.57      0.68      8902

    accuracy                           0.81     23842
   macro avg       0.83      0.76      0.77     23842
weighted avg       0.82      0.81      0.79     23842

[CV 2/5] END algorithm=SAMME, learning_rate=0.5, n_estimators=100, random_state=42;, score=0.769 total time=   6.7s
[CV 3/5] END algorithm=SAMME, learning_rate=0.1, n_estimators=100, random_state=42;, score=0.749 total time=   6.9s
[CV 1/5] END algorithm=SAMME, learning_rate=0.01, n_estimators=100, random_state=42;, score=0.750 total time=   6.9s
[CV 2/5] END algorithm=SAMME, learning_rate=1.0, n_estimators=100, random_state=42;, score=0.811 total time=   5.6s
[CV 2/5] END algorithm=SAMME, learning_rate=0.5, n_estimators=

### GradientBoostingClassifier
- unscaled: BP {'subsample': 1.0, 'n_estimators': 50, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_depth': 5, 'learning_rate': 0.5}, A 83.23, RT 2m
- scaled: BP {'subsample': 1.0, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_depth': 7, 'learning_rate': 0.5}, A 84.34, RT 2m
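
The grid for this algorithm has 3^6 = 729 combinations, and a search with the default `n_iter=10` evaluates only 10 of them. Since no `random_state` is passed to the search object, the unscaled and scaled runs most likely scored different candidates, which may explain part of the gap between the two results. The sketch below uses `ParameterSampler`, the scikit-learn class that draws these candidates, to show how a seed fixes the draw; the seed values 0 and 1 are arbitrary assumptions.

```python
from sklearn.model_selection import ParameterSampler

# Same search space as the GradientBoostingClassifier cells (without the fixed random_state entry).
grid = {"n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": [3, 5, 7],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "subsample": [0.5, 0.75, 1.0]}

# Two draws with the same seed and one draw with a different seed.
draw_a = list(ParameterSampler(grid, n_iter=10, random_state=0))
draw_b = list(ParameterSampler(grid, n_iter=10, random_state=0))
draw_c = list(ParameterSampler(grid, n_iter=10, random_state=1))

print("Same seed, same candidates:", draw_a == draw_b)
print("Candidates shared between seeds 0 and 1:", sum(p in draw_c for p in draw_a))
```

Passing the same `random_state` to `RandomizedSearchCV` itself (not only to the estimator) should make the unscaled and scaled runs evaluate the same candidates, so any remaining difference can be attributed to scaling.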

In [65]:
#unscaled

grid = {"n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": [3, 5, 7],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "subsample": [0.5, 0.75, 1.0],
        "random_state": [42]}

model = RandomizedSearchCV(GradientBoostingClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'subsample': 1.0, 'random_state': 42, 'n_estimators': 50, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_depth': 5, 'learning_rate': 0.5}
The accuracy of the tuned model is 83.23% 

              precision    recall  f1-score   support

           0       0.83      0.92      0.87     14940
           1       0.84      0.68      0.75      8902

    accuracy                           0.83     23842
   macro avg       0.83      0.80      0.81     23842
weighted avg       0.83      0.83      0.83     23842



In [66]:
#scaled

grid = {"n_estimators": [10, 50, 100],
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": [3, 5, 7],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "subsample": [0.5, 0.75, 1.0],
        "random_state": [42]}

model = RandomizedSearchCV(GradientBoostingClassifier(),
                           grid,
                           cv=5,
                           scoring="accuracy",
                           verbose=3,
                           n_jobs=-1)

display(model.fit(X_train_scaled, y_train), "")
print("Best parameters:", model.best_params_)

accuracy = model.score(X_test_scaled, y_test) * 100
print(f"The accuracy of the tuned model is {accuracy:.2f}% \n")

pred = model.best_estimator_.predict(X_test_scaled)
print(classification_report(y_true=y_test, y_pred=pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


''

Best parameters: {'subsample': 1.0, 'random_state': 42, 'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_depth': 7, 'learning_rate': 0.5}
The accuracy of the tuned model is 84.34% 

              precision    recall  f1-score   support

           0       0.85      0.91      0.88     14940
           1       0.83      0.73      0.78      8902

    accuracy                           0.84     23842
   macro avg       0.84      0.82      0.83     23842
weighted avg       0.84      0.84      0.84     23842

[CV 4/5] END learning_rate=0.1, max_depth=5, min_samples_leaf=4, min_samples_split=10, n_estimators=100, random_state=42, subsample=1.0;, score=0.827 total time=  22.0s
[CV 3/5] END learning_rate=0.5, max_depth=5, min_samples_leaf=1, min_samples_split=10, n_estimators=50, random_state=42, subsample=1.0;, score=0.830 total time=  11.7s
[CV 5/5] END learning_rate=0.01, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=50, random_state=42, subsample

In [None]:
drop_columns = [""]

In [None]:
corr = np.abs(df.corr())

# Set up a mask that hides the upper triangle (the matrix is symmetric)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask, the custom colormap and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, square=True, linewidths=.5,
            cbar_kws={"shrink": .5}, annot=True, ax=ax)

plt.show()
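
With many one-hot columns, a table of the strongest pairs may be easier to scan than the annotated heatmap when looking for `drop_columns` candidates. The helper below is a sketch only: the name `top_correlated_pairs`, the 0.9 threshold and the demo frame are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds `threshold` (hypothetical helper)."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Masked cells are NaN; dropna() removes them before the threshold filter.
    pairs = upper.stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Tiny made-up frame for illustration only (not the booking data).
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 6, 8, 11],
                     "c": [5, 3, 4, 1, 2]})
result = top_correlated_pairs(demo, threshold=0.9)
print(result)
```

On the real data the call would be `top_correlated_pairs(df)`; which column of a flagged pair to drop remains a judgment call.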