# Mamut vs Existing AutoML Solutions

The Mamut package is an AutoML tool designed for automatic binary or multiclass classification of tabular data. It automates various stages of the machine learning process, including:

- **Preprocessing**: Customizable preprocessing through the `Preprocessor` class, handling tasks such as data cleaning, feature engineering, and PCA loadings.
- **Model Selection**: Utilizes the `ModelSelector` class to compare multiple models, perform hyperparameter optimization (integrated with Optuna for both random search and Bayesian optimization), and select the best-performing model.
- **Ensemble Methods**: Supports the creation of both standard and greedy ensembles. The greedy ensemble method iteratively adds models to the ensemble based on their performance and diversity.
- **Imbalance Handling**: Detects and handles imbalanced datasets, ensuring robust model performance on skewed data distributions.
- **Evaluation and Reporting**: Generates detailed HTML reports with comprehensive metrics, visualizations, and summaries of the training process. This includes ROC curves, confusion matrices, feature importances, and SHAP values.
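
The greedy ensemble idea from the list above can be sketched in plain Python. This is a minimal illustration of forward selection with replacement, not Mamut's actual implementation: the function names (`accuracy`, `greedy_ensemble`), the `max_size` parameter, the stop-on-no-improvement rule, and the toy validation predictions are all assumptions made for the example. It also scores candidates on performance only and does not model diversity explicitly.

```python
from typing import Dict, List, Optional, Sequence


def accuracy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p >= 0.5) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def greedy_ensemble(
    val_probs: Dict[str, List[float]],
    labels: List[int],
    max_size: int = 5,
) -> List[str]:
    """Greedy forward selection (with replacement) on validation predictions."""
    selected: List[str] = []
    running_sum = [0.0] * len(labels)  # sum of probabilities of selected models
    best_score = float("-inf")

    for _ in range(max_size):
        round_name: Optional[str] = None
        round_score = float("-inf")
        size = len(selected) + 1
        # Try adding each candidate and score the averaged prediction.
        for name, probs in val_probs.items():
            averaged = [(s + p) / size for s, p in zip(running_sum, probs)]
            score = accuracy(averaged, labels)
            if score > round_score:
                round_name, round_score = name, score
        # Simplification: stop as soon as no candidate improves the ensemble.
        if round_name is None or round_score <= best_score:
            break
        selected.append(round_name)
        running_sum = [s + p for s, p in zip(running_sum, val_probs[round_name])]
        best_score = round_score
    return selected


# Hypothetical validation-set probabilities for the positive class.
labels = [1, 0, 1, 0, 1, 0]
val_probs = {
    "model_a": [0.9, 0.2, 0.4, 0.1, 0.8, 0.6],
    "model_b": [0.6, 0.4, 0.9, 0.3, 0.3, 0.2],
    "model_c": [0.3, 0.7, 0.6, 0.6, 0.9, 0.3],
}
chosen = greedy_ensemble(val_probs, labels)
print(chosen)
```

In this toy data no single model is perfect on the validation set, but averaging `model_b` with `model_a` corrects the errors of both, which is the effect greedy selection tries to exploit.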

## Comparison with Existing AutoML Solutions
### AutoSklearn

**AutoSklearn** is an automated machine learning toolkit built on top of scikit-learn. It automates model selection, hyperparameter tuning, and ensemble creation. Here are its key features and workflow:

#### Key Features

- **Model Selection**: Automatically selects the best model from a wide range of algorithms.
- **Hyperparameter Optimization**: Uses Bayesian optimization to find the best hyperparameters.
- **Ensemble Building**: Creates an ensemble of the best-performing models.
- **Preprocessing**: Includes various preprocessing techniques.
- **Meta-Learning**: Utilizes meta-learning to warm-start the optimization process.
- **Time Management**: Supports limiting the time for fitting tasks.

#### Workflow
Load the dataset and split it into training and testing sets:
```python
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
```
Create an AutoSklearn classifier and fit it to the training data:
```python
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")
```
View the best models found by AutoSklearn that were included in the ensemble:
```python
print(automl.leaderboard())
```
View the final ensemble:
```python
from pprint import pprint
pprint(automl.show_models(), indent=4)
```

#### Setup

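AutoSklearn can be challenging to set up and install due to its dependencies. Its installation guide lists Linux as a system requirement along with a C++ compiler, and SWIG may be needed to build some of its dependencies.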

#### Comparison with Mamut

##### Similarities

- **Model Selection**: Both tools automatically select the best model.
- **Hyperparameter Optimization**: Both run an automated hyperparameter search: Bayesian optimization in AutoSklearn, and Optuna-based random search or Bayesian optimization in Mamut.
- **Ensemble Methods**: Both support ensemble creation.

##### Differences

- **Preprocessing**:
  - **AutoSklearn**: Limited customization.
  - **Mamut**: Customizable through the `Preprocessor` class.

- **Reporting**:
  - **AutoSklearn**: Basic evaluation metrics.
  - **Mamut**: Detailed HTML reports with comprehensive metrics and visualizations.

- **Ensemble Creation**:
  - **AutoSklearn**: Builds a greedy ensemble based on [Caruana et al. (2004)](https://dl.acm.org/doi/abs/10.1145/1015330.1015432).
  - **Mamut**: Greedy ensembles follow the same paper; standard ensembles are also supported.

- **Imbalance Handling**:
  - **AutoSklearn**: Basic handling.
  - **Mamut**: Built-in functionality for imbalanced datasets.

- **Time Management**:
  - **AutoSklearn**: Supports limiting fitting time.
  - **Mamut**: Does not support limiting fitting time.

Overall, while both tools aim to automate the machine learning process, Mamut offers more flexibility and detailed reporting, whereas AutoSklearn provides time management features and can be harder to set up.
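
To make the imbalance-handling item more concrete, here is a small sketch of one common way to detect a skewed label distribution and derive class weights. The source does not describe how Mamut does this internally, so everything here is an assumption for illustration: the helper names, the `IMBALANCE_THRESHOLD` cut-off, and the toy labels. The weight formula is the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Rarest class count divided by most common class count (1.0 = balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-class weights n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 90 negatives and 10 positives.
y = [0] * 90 + [1] * 10
IMBALANCE_THRESHOLD = 0.2  # assumed cut-off, not a documented Mamut value

ratio = imbalance_ratio(y)
is_imbalanced = ratio < IMBALANCE_THRESHOLD
weights = balanced_class_weights(y)
print(f"ratio={ratio:.3f}, imbalanced={is_imbalanced}, weights={weights}")
```

Weights like these can be passed to estimators that accept per-class weights; resampling the training data is another common option.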

### PyCaret
The table below compares Mamut with PyCaret, another popular AutoML solution. While preparing the Mamut package, we tried to take the best features from PyCaret and improve upon them. The key advantages of packages like PyCaret and Mamut are that they are lightweight and very easy to set up and use. They are designed for users who want to quickly build and compare multiple machine learning models without writing extensive code.

| Feature                | Mamut                                                                 | PyCaret                                                                                     |
|------------------------|----------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| Preprocessing          | Customizable via the `Preprocessor` class (data cleaning, feature engineering, PCA) | Automated with data transformation pipelines (handling scaling, encoding, imputation, etc.) |
| Model Selection        | Hyperparameter optimization using Optuna with random and Bayesian search | Cross-validation with all models in the library, comparison via `compare_models()`           |
| Ensemble Methods       | Supports standard and greedy ensemble creation                       | Supports bagging, boosting, and stacking ensembles                                          |
| Imbalance Handling     | Detects and handles imbalanced datasets                              | Built-in support for imbalanced data handling                                               |
| Evaluation and Reporting | Generates detailed HTML reports (ROC, SHAP values, confusion matrix) | Provides model evaluation and visualizations like ROC, confusion matrix through `plot_model()` |

#### Workflow

The typical PyCaret workflow takes only a few steps. Install PyCaret if needed:
```bash
pip install pycaret
```
**Load PyCaret and Dataset**

First, import the necessary modules from PyCaret and load the dataset:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, plot_model, predict_model

# Load example dataset
data = get_data('diabetes')
```
**Set Up the PyCaret Environment**

Set up the PyCaret environment for classification:
```python
clf = setup(data, target='Class variable', session_id=123)
```

**Compare and Select the Best Model**

Compare multiple models and select the best one:
```python
best_model = compare_models()
```

**Evaluate the Model**

Evaluate the model using various metrics and visualizations:
```python
plot_model(best_model, plot='auc')
plot_model(best_model, plot='confusion_matrix')
```

**Make Predictions**

Make predictions on the hold-out set that `setup` reserved (this is what `predict_model` uses when no `data` argument is passed):
```python
predictions = predict_model(best_model)
```