# table of contents
1. [predictions, part II](#predictions,-part-II)
2. [preprocessing](#preprocessing)
3. [hyperparameter tuning](#hyperparameter-tuning)
   1. [GridSearchCV](#GridSearchCV)
      1. [unscaled](#unscaled)
      2. [scaled](#scaled)
   2. [RandomizedSearchCV](#RandomizedSearchCV)
   3. [hyperparameter_tuning_results](#hyperparameter_tuning_results)
4. [retrain the models with the best parameters](#retrain-the-models-with-the-best-parameters)
   1. [compare accuracies](#compare-accuracies)

# predictions, part II
- drop columns: no
- scaling: yes
- **hyperparameter tuning: yes**
- one-hot encoding: yes, the dataset was received encoded
- resampling: no

In the 1st part of this session, I fine-tune the hyperparameters of 7 models, each with unscaled and scaled data, using 2 tuning techniques: GridSearchCV and RandomizedSearchCV.\
The main takeaways:
- **RandomForestClassifier** remains the best performer in all 4 scenarios.
- The accuracy of each model stays stable across the scenarios, in the range 80-86%.
- The time a search takes is essentially unaffected for 5 models.
- Only 2 models are strongly affected by scaling, 1 negatively and 1 positively.


In the 2nd part of this session, I retrain the 7 models with the best parameters, and compare the accuracies to those from the previous session, when the models ran without tuned parameters.\
The main takeaways:
- The effect of the best parameters on the accuracy is negligible for 4/7 models; 2 models improve by 3%, and 1 model drops by 6%.
- **RandomForestClassifier** remains the best performer.

# preprocessing

In [None]:
# import libraries
%run common_imports.py

# load and split data
%run load_and_split_data.py
X_train, X_test, y_train, y_test = load_and_split_data()

# scale data
%run minmaxscaler.py
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)
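
The contents of `minmaxscaler.py` are not shown in this notebook. A minimal sketch of what a `scale_data` helper could look like, assuming it wraps scikit-learn's `MinMaxScaler` and fits on the training split only (the function body and the demo arrays are illustrative, not the actual helper):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def scale_data(X_train, X_test):
    """Hypothetical stand-in for the helper defined in minmaxscaler.py."""
    scaler = MinMaxScaler()
    # Fit on the training split only, so no test-set statistics leak into training.
    X_train_scaled = scaler.fit_transform(X_train)
    # Reuse the training min/max for the test split; values may fall outside [0, 1].
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled


# Made-up demo data, only to show the behaviour
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])
X_train_demo_scaled, X_test_demo_scaled = scale_data(X_train_demo, X_test_demo)
print(X_train_demo_scaled)
print(X_test_demo_scaled)
```

Fitting the scaler on the training data alone keeps the test set unseen until evaluation.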

# hyperparameter tuning
The code below performs hyperparameter tuning on the given models, with given parameters.\
It can use both unscaled and scaled data, and it saves the results to a csv file.\
It contains 3 functions:
  1. Perform grid search by using the param_grid/s and model/s
     - the function will use different variables and parameters, depending on the choice of unscaled or scaled data, and search model.
  2. Save results to a new (or append to an existing) csv.
     - Results are:
       - model
       - best_parameters
       - accuracy_in_%
       - runtime_in_seconds (the time taken to do the search)
       - source (identifies whether the calculation used unscaled or scaled data, grid_search or randomized_search)
       - classification_report
  3. Helper function to perform search, save and print results.
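
The module `hyperparameter_tuning` is not reproduced here. The steps above can be sketched roughly as follows, assuming the search wraps scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The name `search_sketch`, the 3-fold CV, `n_iter=5`, the format of the `source` label, and the synthetic data are all assumptions made for illustration; saving to csv is left out.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def search_sketch(model_class, param_grid, search_type, X_train, y_train, X_test, y_test, scaled):
    """Hypothetical stand-in for hyperparameter_tuning.search (no csv output)."""
    if search_type == "grid":
        searcher = GridSearchCV(model_class(), param_grid, cv=3, scoring="accuracy")
    else:
        # n_iter and random_state are assumptions, not values taken from the notebook
        searcher = RandomizedSearchCV(
            model_class(), param_grid, n_iter=5, cv=3, scoring="accuracy", random_state=0
        )

    # Time only the search itself
    start = time.perf_counter()
    searcher.fit(X_train, y_train)
    runtime = time.perf_counter() - start

    # Evaluate the refitted best estimator on the held-out test split
    y_pred = searcher.predict(X_test)
    return {
        "model": model_class.__name__,
        "best_parameters": searcher.best_params_,
        "accuracy_in_%": round(accuracy_score(y_test, y_pred) * 100, 2),
        "runtime_in_seconds": round(runtime, 2),
        "source": f"{'scaled' if scaled else 'unscaled'}_{search_type}_search",
        "classification_report": classification_report(y_test, y_pred),
    }


# Synthetic data, only to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 3]}
demo_result = search_sketch(DecisionTreeClassifier, demo_grid, "grid", Xtr, ytr, Xte, yte, scaled=False)
print(demo_result["best_parameters"], demo_result["accuracy_in_%"])
```

Each returned dictionary corresponds to one row of the results csv described above.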

In [None]:
from hyperparameter_tuning import search

# Define the grids
param_grids = {
    "KNeighborsClassifier": {
        "algorithm": ["auto", "ball_tree", "brute", "kd_tree"],
        "leaf_size": range(10, 31, 10),
        "n_neighbors": range(1, 10),
        "p": [1, 2],
        "weights": ["uniform", "distance"]
    },
    "LogisticRegression": [
        {"C": [0.001, 0.01, 0.1, 1, 10], "max_iter": range(7000, 10001, 1000), "penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "sag"]},
        {"C": [0.001, 0.01, 0.1, 1, 10], "max_iter": range(7000, 10001, 1000), "penalty": ["l1", "l2"], "solver": ["liblinear", "saga"]}
    ],
    "DecisionTreeClassifier": {
        "max_depth": range(1, 21),
        "min_samples_split": range(2, 11),
        "min_samples_leaf": range(1, 5)
    },
    "BaggingClassifier": {
        "bootstrap": [True, False],
        "bootstrap_features": [True, False],
        "max_features": [0.5, 1.0],
        "max_samples": [0.5, 1.0],
        "n_estimators": range(10, 110, 10)
    },
    "RandomForestClassifier": {
        "bootstrap": [True, False],
        "max_features": [0.5, 1.0],
        "n_estimators": range(10, 110, 10)
    },
    "AdaBoostClassifier": {
        "algorithm": ["SAMME"],
        "n_estimators": range(10, 110, 10),
        "learning_rate": [0.01, 0.1, 0.5, 1.0]
    },
    "GradientBoostingClassifier": {
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": range(3, 8),
        "min_samples_leaf": range(1, 5),
        "min_samples_split": range(2, 11),
        "n_estimators": range(10, 110, 10),
        "subsample": [0.5, 0.75, 1.0]
    }
}

# List of models
models = [
    KNeighborsClassifier,
    LogisticRegression,
    DecisionTreeClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    AdaBoostClassifier
]

## GridSearchCV
I run single-model evaluations on the unscaled data, followed by a batch evaluation of all models on the scaled data.\
I do this to test for a runtime difference between the approaches, on the assumption that the scaled data runs faster.\
Conclusion: the scaled data ends up taking 2h longer than the unscaled data, with no improvement in accuracy.
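
The runtime of a grid search grows with the number of parameter combinations multiplied by the number of cross-validation folds. A quick way to count the combinations is scikit-learn's `ParameterGrid`. The sketch below copies three of the grids defined above; the fold count of 5 is an assumption (it is `GridSearchCV`'s default when `cv` is not set, and the value used inside `search` is not shown here).

```python
from sklearn.model_selection import ParameterGrid

# Three of the grids from the cell above, copied so the sketch is self-contained
grids = {
    "KNeighborsClassifier": {
        "algorithm": ["auto", "ball_tree", "brute", "kd_tree"],
        "leaf_size": range(10, 31, 10),
        "n_neighbors": range(1, 10),
        "p": [1, 2],
        "weights": ["uniform", "distance"],
    },
    "LogisticRegression": [
        {"C": [0.001, 0.01, 0.1, 1, 10], "max_iter": range(7000, 10001, 1000),
         "penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "sag"]},
        {"C": [0.001, 0.01, 0.1, 1, 10], "max_iter": range(7000, 10001, 1000),
         "penalty": ["l1", "l2"], "solver": ["liblinear", "saga"]},
    ],
    "GradientBoostingClassifier": {
        "learning_rate": [0.01, 0.1, 0.5],
        "max_depth": range(3, 8),
        "min_samples_leaf": range(1, 5),
        "min_samples_split": range(2, 11),
        "n_estimators": range(10, 110, 10),
        "subsample": [0.5, 0.75, 1.0],
    },
}

CV_FOLDS = 5  # assumption: the GridSearchCV default

# len(ParameterGrid(...)) is the number of candidate combinations
candidate_counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
for name, count in candidate_counts.items():
    print(f"{name}: {count} candidates, {count * CV_FOLDS} fits")
```

With these grids, GradientBoostingClassifier has 16200 candidates against 432 for KNeighborsClassifier and 140 for LogisticRegression.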

### unscaled

In [None]:
# KNeighborsClassifier
search(KNeighborsClassifier, "KNeighborsClassifier", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

In [None]:
# LogisticRegression
search(LogisticRegression, "LogisticRegression", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

In [None]:
# DecisionTreeClassifier
search(DecisionTreeClassifier, "DecisionTreeClassifier", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

In [None]:
# BaggingClassifier
search(BaggingClassifier, "BaggingClassifier", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

In [None]:
# RandomForestClassifier
search(RandomForestClassifier, "RandomForestClassifier", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

In [None]:
# AdaBoostClassifier
search(AdaBoostClassifier, "AdaBoostClassifier", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

In [None]:
# GradientBoostingClassifier
search(GradientBoostingClassifier, "GradientBoostingClassifier", param_grids, search_type='grid', X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, scaled=False)

### scaled

In [None]:
# Loop through models and their corresponding parameter grids
for model_class in models:
    model_name = model_class.__name__
    # Perform grid search with scaled data
    search(model_class, model_name, param_grids, search_type='grid', 
           X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled, y_test=y_test, 
           scaled=True)

## RandomizedSearchCV

In [None]:
# Perform randomized search with unscaled and scaled data
for model_class in models:
    model_name = model_class.__name__
    search(model_class, model_name, param_grids, search_type='random', 
           X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, 
           scaled=False)
    search(model_class, model_name, param_grids, search_type='random', 
           X_train=X_train_scaled, y_train=y_train, X_test=X_test_scaled, y_test=y_test, 
           scaled=True)

## hyperparameter_tuning_results

In [None]:
hyperparameter_tuning_results = pd.read_csv("../data/hyperparameter_tuning_results.csv")
hyperparameter_tuning_results

In [None]:
# Look for the duplicated best_parameters
hyperparameter_tuning_results["best_parameters"].duplicated()

In [None]:
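# Drop rows that repeat a best_parameters value, so each configuration is retrained only once
hyperparameter_tuning_results = hyperparameter_tuning_results.drop_duplicates(subset=["best_parameters"])
hyperparameter_tuning_results.to_csv("../data/hyperparameter_tuning_results.csv", index=False)

# Confirm that no duplicates remain (last expression in the cell, so the notebook displays it)
hyperparameter_tuning_results["best_parameters"].duplicated()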

# retrain the models with the best parameters

In [None]:
from train_with_best_hyperparameters import process_hyperparameter_tuning_results

process_hyperparameter_tuning_results(
    input_file="../data/hyperparameter_tuning_results.csv", 
    output_file="../data/accuracies_with_parameters.csv",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    X_train_scaled=X_train_scaled,
    X_test_scaled=X_test_scaled,
    y_test=y_test
)
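
The module `train_with_best_hyperparameters` is not shown in this notebook. One step it presumably needs is turning the `best_parameters` text stored in the csv back into keyword arguments. A minimal sketch of that step, assuming the column holds the string form of a Python dict; `MODEL_CLASSES`, `build_model` and the parameter values are hypothetical:

```python
import ast

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical lookup from the "model" column to the estimator class
MODEL_CLASSES = {
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "KNeighborsClassifier": KNeighborsClassifier,
}


def build_model(model_name, best_parameters_text):
    """Rebuild an estimator from a csv row (illustrative helper, not the author's)."""
    # literal_eval parses Python literals only, so it is safer than eval
    params = ast.literal_eval(best_parameters_text)
    return MODEL_CLASSES[model_name](**params)


# Made-up row content, in the assumed csv format
demo_model = build_model(
    "DecisionTreeClassifier",
    "{'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 4}",
)
print(demo_model)
```

The rebuilt estimator can then be fitted on the unscaled or scaled split, depending on the `source` column of the row.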

## compare accuracies
The effect of the best parameters on the accuracy is negligible for 4/7 models; 2 models improve by 3%, and 1 model drops by 6%.

In [None]:
# Load the data
accuracies_without_parameters = pd.read_csv("../data/accuracies_without_parameters.csv")
accuracies_with_parameters = pd.read_csv("../data/accuracies_with_parameters.csv")

# Identify the best accuracy for each model in each dataframe
idx_without_params = accuracies_without_parameters.groupby("model")["accuracy_in_%"].idxmax()
best_without_params = accuracies_without_parameters.loc[idx_without_params, ["model", "estimator", "accuracy_in_%", "source"]].reset_index(drop=True)
best_without_params["dataset"] = "original"

idx_with_params = accuracies_with_parameters.groupby("model")["accuracy_in_%"].idxmax()
best_with_params = accuracies_with_parameters.loc[idx_with_params, ["model", "best_parameters", "accuracy_in_%", "source"]].reset_index(drop=True)
best_with_params["dataset"] = "parametered"

# Combine the results into a single dataframe
combined_results = pd.concat([best_without_params, best_with_params], ignore_index=True)

# Sort the dataframe by top performers
accuracies_sorted = combined_results.sort_values(by=["accuracy_in_%", "model", "dataset"], ascending=[False, True, True])

# Save the sorted dataframe
accuracies_sorted.to_csv("../data/accuracies_combined.csv", index=False)

# Display the dataframe
accuracies_sorted
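
To read the per-model change directly rather than from the sorted rows, the combined table can be pivoted so that each model has one row with both accuracies side by side. A minimal sketch, using made-up numbers and placeholder model names in place of `combined_results`; the column `change_in_points` is an illustrative name:

```python
import pandas as pd

# Made-up values in the same shape as combined_results (model, accuracy_in_%, dataset)
demo_results = pd.DataFrame({
    "model": ["ModelA", "ModelB", "ModelA", "ModelB"],
    "accuracy_in_%": [80.0, 85.0, 83.0, 84.5],
    "dataset": ["original", "original", "parametered", "parametered"],
})

# One row per model, one column per dataset label
comparison = demo_results.pivot(index="model", columns="dataset", values="accuracy_in_%")

# Difference in percentage points between tuned and untuned accuracy
comparison["change_in_points"] = comparison["parametered"] - comparison["original"]
comparison = comparison.sort_values("change_in_points", ascending=False)
print(comparison)
```

`pivot` requires each (model, dataset) pair to appear once, which holds after the `idxmax` selection above.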

Next: notebook_06_machine_learning_03_resampling