## Preprocessing function

The `DataPreprocessor` class is designed to prepare data for analysis. It takes two key arguments:

* **dataset**: The dataset to be processed.
* **target_column_name**: The name of the column containing the target variable.

The `preprocess()` method performs the following steps in sequence:

* **Extracting date-related components** (`FeatureTypeExtractor().separate_datetime(dataset)`): Splits datetime features into separate day, month, and year components.
* **Removing redundant features** (`RedundantFeaturesHandler(dataset).fit_transform()`): Drops redundant columns from the dataset. A feature is considered redundant if its type is identified as INDEX, UNKNOWN, or TEXT; the library supports text columns only when they are explicitly marked as categorical. Features with more than 90% missing values are also removed, as they carry little or no useful information for analysis.
* **Handling missing values** (`MissingValuesHandler(dataset).fit_transform()`): Missing data is handled based on the type of the feature. For categorical features, missing values are replaced with the most frequent value. For discrete and continuous features, the median value is used to fill the gaps.
* **Handling outliers** (`OutliersHandler(dataset).fit_transform()`): Detects and handles outliers in numerical features using the Z-score method.
* **Encoding categorical features** (`FeatureTypeExtractor().encode_categorical(dataset, target_column_name)`): This process transforms categorical variables into numerical formats suitable for machine learning models. Depending on the feature type, either one-hot encoding or label encoding is applied. For features classified as one-hot encodable, each unique value is converted into a separate binary column, replacing the original feature with the new set of columns. For features classified as label encodable, each unique value is replaced by an integer code directly within the dataset. The target column is excluded from encoding.
* **Transforming boolean features** (`FeatureTypeExtractor().bool_to_int(dataset)`): Converts the boolean features to integers.
* **Encoding the target variable** (`FeatureTypeExtractor().encode_target(dataset, target_column_name)`): Encodes the target variable.
* **Removing highly correlated features** (`CorrelationFeaturesHandler().fit_transform(X)`): Identifies and removes features with high correlation.
* **Handling class imbalance** (`ClassBalanceHandler().fit_resample(X, y)`): Balances the dataset using resampling techniques to ensure fair representation of all classes. If the data is moderately imbalanced, it applies oversampling of the minority class; for extreme imbalances, it uses SMOTE; for large datasets, it opts for undersampling of the majority class. If the class distribution is already balanced, no resampling is applied.
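
As a rough illustration of the first four steps in the list above, the following pandas sketch splits datetime columns, drops sparse columns, imputes missing values, and treats Z-score outliers. It is not the library's implementation: the function name `clean_columns`, the Z-score threshold of 3, and the choice to cap outliers rather than remove them are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def clean_columns(df: pd.DataFrame, missing_threshold: float = 0.9,
                  z_threshold: float = 3.0) -> pd.DataFrame:
    """Hypothetical stand-in for the first four preprocessing steps."""
    df = df.copy()

    # 1. Split each datetime column into day, month, and year components.
    for col in df.select_dtypes(include="datetime").columns:
        df[f"{col}_day"] = df[col].dt.day
        df[f"{col}_month"] = df[col].dt.month
        df[f"{col}_year"] = df[col].dt.year
        df = df.drop(columns=col)

    # 2. Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= missing_threshold]

    # 3. Impute: median for numeric columns, most frequent value otherwise.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 4. Z-score outliers: cap values beyond z_threshold standard deviations.
    #    (Assumption: the source does not say whether outliers are capped or removed.)
    for col in df.select_dtypes(include="number").columns:
        mean, std = df[col].mean(), df[col].std()
        if std > 0:
            df[col] = df[col].clip(mean - z_threshold * std,
                                   mean + z_threshold * std)
    return df


# Small made-up dataset to exercise each step.
n = 20
raw = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-15"] * n),
    "age": [30.0] * 18 + [np.nan, 1000.0],            # one gap, one outlier
    "city": ["Paris"] * 10 + ["Rome"] * 9 + [None],   # one gap
    "mostly_empty": [1.0] + [np.nan] * 19,            # 95% missing
})
cleaned = clean_columns(raw)
print(cleaned.head())
```

In this example `mostly_empty` is dropped, the missing `age` is filled with the median, the missing `city` is filled with the most frequent value, and the extreme `age` value is pulled back towards the rest of the distribution.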

Finally, the method returns the processed data as a Pandas DataFrame, ready for model training and evaluation, with the target column renamed to 'target'.
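
The encoding, correlation, and renaming steps that lead to this final DataFrame could look roughly like the sketch below. It is a simplified illustration, not the library's code: the function name `encode_and_prune` and the correlation threshold of 0.95 are assumptions, and only one-hot encoding is shown (label encoding is omitted for brevity).

```python
import numpy as np
import pandas as pd


def encode_and_prune(df: pd.DataFrame, target_column_name: str,
                     corr_threshold: float = 0.95) -> pd.DataFrame:
    """Hypothetical sketch of the encoding and correlation steps."""
    features = df.drop(columns=target_column_name)
    target = df[target_column_name]

    # One-hot encode every non-numeric feature column (target excluded).
    cat_cols = [c for c in features.columns
                if not pd.api.types.is_numeric_dtype(features[c])]
    features = pd.get_dummies(features, columns=cat_cols, dtype=int)

    # Convert boolean features to integers.
    bool_cols = features.select_dtypes(include="bool").columns
    features = features.astype({c: int for c in bool_cols})

    # Encode the target as integer codes (sorted class labels -> 0, 1, ...).
    target_codes, _ = pd.factorize(target, sort=True)

    # Drop one feature from each pair whose absolute correlation is too high.
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    features = features.drop(columns=to_drop)

    # Return a single frame with the target column named 'target'.
    features["target"] = target_codes
    return features


# Made-up data: height_in duplicates height_cm in different units.
heights = [150.0, 160.0, 170.0, 180.0, 190.0, 175.0]
raw = pd.DataFrame({
    "height_cm": heights,
    "height_in": [h / 2.54 for h in heights],
    "is_member": [True, False, True, False, False, True],
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "label": ["no", "yes", "no", "yes", "yes", "no"],
})
processed = encode_and_prune(raw, target_column_name="label")
print(processed)
```

Here `height_in` is removed because it is perfectly correlated with `height_cm`, `colour` is expanded into three binary columns, and `label` becomes the integer column `target`.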

**Important note**: our package does not scale numerical data because one of its primary goals is interpretability. Scaling could make it harder to see how each feature affects the model's decisions. Decision Tree and Random Forest are not sensitive to feature scaling, so scaling is unnecessary for them. Scaling can be beneficial for XGBoost but is not required, and we prioritize interpretability over accuracy to keep the model's predictions easy to explain.
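
The scale-insensitivity of tree models can be checked with a short experiment. The sketch below uses scikit-learn directly on synthetic data (it does not use this package's classes) and fits the same decision tree on raw and standardized features. Because tree splits depend only on the ordering of values within each feature, the two trees should make the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same model configuration, fitted on raw and on standardized features.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare predictions on the training data.
same_predictions = np.array_equal(tree_raw.predict(X),
                                  tree_scaled.predict(X_scaled))
print("Identical predictions:", same_predictions)
```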

## Model selection and optimisation function

The `OptimizerAllModels` class is responsible for performing hyperparameter optimization for multiple machine learning models (Random Forest, Decision Tree, and XGBoost) using random search. It accepts the following arguments:
* **dataset**: The preprocessed dataset (Pandas DataFrame).
* **random_state**: Random seed for reproducibility.
* **n_iter**: The number of iterations to perform random search for each model.
* **cv**: Number of cross-validation splits.
* **n_repeats**: The number of times to repeat cross-validation for stability.
* **metric_to_eval**: The evaluation metric used to assess the models, with possible values such as 'roc_auc', 'f1', or 'accuracy'.

The class splits the dataset into features and target variables, then applies hyperparameter optimization for each model using the `DecisionTreeRandomSearch`, `RandomForestRandomSearch`, and `XGBoostRandomSearch` classes. Each of them fits and evaluates the model using default parameters, and then performs a random search over a predefined hyperparameter grid.
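
The general pattern (a default-parameter baseline followed by random search with repeated cross-validation) can be sketched with scikit-learn alone. This is not the code of the classes named above: the hyperparameter grid is hypothetical, the use of `RepeatedStratifiedKFold` is an assumption about how `cv` and `n_repeats` are combined, and the dataset is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import (RandomizedSearchCV,
                                     RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a preprocessed dataset with a 'target' column.
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=0)
dataset = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
dataset["target"] = y_arr

# Settings mirroring the arguments described above (values are examples).
random_state, n_iter, cv, n_repeats, metric_to_eval = 42, 5, 3, 2, "roc_auc"

# Split into features and target.
X = dataset.drop(columns="target")
y = dataset["target"]

# Repeated stratified k-fold: cv splits, repeated n_repeats times (assumption).
splitter = RepeatedStratifiedKFold(n_splits=cv, n_repeats=n_repeats,
                                   random_state=random_state)

# Step 1: evaluate the model with default parameters as a baseline.
baseline = cross_val_score(DecisionTreeClassifier(random_state=random_state),
                           X, y, cv=splitter, scoring=metric_to_eval).mean()

# Step 2: random search over a (hypothetical) hyperparameter grid.
param_distributions = {
    "max_depth": [2, 3, 5, 8, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=random_state),
                            param_distributions, n_iter=n_iter, cv=splitter,
                            scoring=metric_to_eval, random_state=random_state)
search.fit(X, y)

print(f"baseline {metric_to_eval}: {baseline:.3f}")
print(f"best {metric_to_eval}: {search.best_score_:.3f}")
print("best params:", search.best_params_)
```

The same pattern would apply to the Random Forest and XGBoost models, each with its own grid.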

The `tune_hyperparameters()` method performs the whole optimization process based on the given settings.

The `get_best_results()` method retrieves and prints the best hyperparameters and metrics for each model. It returns a DataFrame containing the best results for all three models.

Finally, the `perform_analysis()` method combines the tuning and result extraction processes, providing a comprehensive analysis of the best models and their hyperparameters. This class automates model selection and optimization for the three classifiers and reports the best configuration found for the given dataset.

## Report

The report begins with an **exploratory data analysis**, providing an overview of the dataset by examining the non-null counts and data types of its features. It includes descriptive statistics to summarize key properties of the data and explores the distribution of features through visualizations. Numerical columns are analyzed with histograms, while categorical columns are represented using bar charts.

The next section describes the **evaluation metrics** used in the analysis, including Accuracy, F1 Score, and ROC AUC, providing a clear understanding of how model performance is assessed.
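
For reference, the three metrics can be computed with scikit-learn as in the short example below. The labels and scores are made up for illustration. Accuracy and F1 are computed from hard class predictions, while ROC AUC is computed from predicted scores.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up ground truth, predicted scores, and hard predictions.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probability of class 1
y_pred = [0, 1, 1, 1]             # predicted class labels

accuracy = accuracy_score(y_true, y_pred)   # 3 of 4 correct -> 0.75
f1 = f1_score(y_true, y_pred)               # precision 2/3, recall 1 -> 0.8
roc_auc = roc_auc_score(y_true, y_score)    # 3 of 4 pos/neg pairs ranked correctly -> 0.75

print(accuracy, f1, roc_auc)
```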

Following this, the **results of model optimization** are presented. Detailed tables display the outcomes of hyperparameter tuning, accompanied by boxplots that compare accuracy, F1 score, and ROC AUC across models. Additionally, bar plots highlight the maximum metric values achieved by each model, offering insights into their performance.

The report concludes with an analysis of the **interpretability of the best models**. This section presents the following plots for the best XGBoost model:
* the SHAP values
* the feature importance plot
* the violin SHAP plot illustrating the impact of features on predictions.

These visualisations help to clarify the key factors influencing the decisions made by the model.