# AutoML & Hyperparameter Tuning

Learn how to automatically find the best model and optimize hyperparameters.

**Topics covered:**
- Automatic model comparison
- Hyperparameter optimization
- Grid search vs Bayesian optimization
- Parallel training

In [1]:
import mkyz

mkyz package initialized. Version: 0.2.1


## 1. Prepare Data

In [2]:
# Load and prepare the Titanic dataset
data = mkyz.prepare_data(
    'data/titanic.csv',
    target_column='Survived',
    test_size=0.2,
    random_state=42
)

X_train, X_test, y_train, y_test, df, target, num_cols, cat_cols = data

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples")

INFO:mkyz.data_processing:First 5 rows of the dataset:
INFO:mkyz.data_processing:   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803 

Training set: 576 samples, 1420 features
Test set: 145 samples
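
The 1420 features reported above are far more than the 12 raw columns in the file. That is the pattern one would expect if near-unique text columns such as `Name`, `Ticket`, or `Cabin` are one-hot encoded, but this cause is an assumption and the log does not confirm it. The toy pandas sketch below (the names `toy`, `encoded`, and `without_name` are illustrative) shows how one-hot encoding a near-unique column inflates the feature count:

```python
import pandas as pd

# Five rows copied from the log above (Name shown as displayed, including truncation)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
})

# One-hot encoding creates one column per distinct value,
# so a column with a unique value per row adds one feature per row.
encoded = pd.get_dummies(toy, columns=["Sex", "Name"])
print(f"With Name encoded:    {encoded.shape[1]} features")

# Dropping the near-unique column first keeps the feature count small.
without_name = pd.get_dummies(toy.drop(columns="Name"), columns=["Sex"])
print(f"Without Name encoded: {without_name.shape[1]} features")
```

If the feature count matters for your models, it may be worth checking which columns `prepare_data` encodes before training.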


## 2. Quick Model Comparison (AutoML)

Let MKYZ automatically train and compare multiple models.

In [3]:
# Auto-train: Compare all available models
best_model = mkyz.auto_train(
    data,
    task='classification',
    n_threads=2,              # Parallel training
    optimize_models=False     # Quick comparison without tuning
)

print(f"\nBest model: {type(best_model).__name__}")


Best model: LogisticRegression
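
Conceptually, a quick comparison like this fits several candidate models on the same training split and keeps the one with the best validation score. The sketch below illustrates that idea with scikit-learn and a thread pool on synthetic data. It is not MKYZ's implementation: the candidate list, the `fit_and_score` helper, and the accuracy criterion are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical candidate list; auto_train() may compare a different set
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "dt": DecisionTreeClassifier(random_state=42),
}


def fit_and_score(item):
    """Fit one candidate and return (name, fitted model, validation accuracy)."""
    name, model = item
    model.fit(X_tr, y_tr)
    return name, model, model.score(X_val, y_val)


# Two worker threads, mirroring n_threads=2 in the cell above
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fit_and_score, candidates.items()))

# Keep the candidate with the highest validation accuracy
best_name, best_candidate, best_score = max(results, key=lambda r: r[2])
print(f"Best candidate: {best_name} (accuracy={best_score:.4f})")
```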


In [4]:
# Evaluate the best model
predictions = best_model.predict(X_test)
metrics = mkyz.classification_metrics(y_test, predictions)

print("\nBest Model Performance:")
print("=" * 40)
for k, v in metrics.items():
    print(f"  {k}: {v:.4f}")


Best Model Performance:
========================================
  accuracy: 0.8138
  precision: 0.8117
  recall: 0.8138
  f1_score: 0.8124
  mcc: 0.5826
  cohen_kappa: 0.5820
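
The recall above equals the accuracy (0.8138). Support-weighted averaging produces exactly this, because weighted recall is always identical to accuracy. Whether `classification_metrics` uses weighted averaging is an assumption here, not something the output confirms. A minimal scikit-learn sketch on toy labels shows the relationship:

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

# Toy labels for illustration only (not the Titanic predictions)
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

toy_metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # average="weighted" is an assumption about how MKYZ aggregates per-class scores
    "precision": precision_score(y_true, y_pred, average="weighted"),
    "recall": recall_score(y_true, y_pred, average="weighted"),
    "f1_score": f1_score(y_true, y_pred, average="weighted"),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "cohen_kappa": cohen_kappa_score(y_true, y_pred),
}

for k, v in toy_metrics.items():
    print(f"  {k}: {v:.4f}")
```

Because weighted recall carries no information beyond accuracy, per-class or macro-averaged scores can be more informative when the classes are imbalanced.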


## 3. Training Specific Models

Train individual models with custom parameters.

In [5]:
# Train Random Forest with custom parameters
rf_model = mkyz.train(
    data,
    task='classification',
    model='rf',
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    random_state=42
)

rf_predictions = rf_model.predict(X_test)
rf_metrics = mkyz.classification_metrics(y_test, rf_predictions)
print(f"Random Forest Accuracy: {rf_metrics['accuracy']:.4f}")

Random Forest Accuracy: 0.7793


In [6]:
# Train Logistic Regression
lr_model = mkyz.train(
    data,
    task='classification',
    model='lr',
    C=1.0,
    max_iter=1000
)

lr_predictions = lr_model.predict(X_test)
lr_metrics = mkyz.classification_metrics(y_test, lr_predictions)
print(f"Logistic Regression Accuracy: {lr_metrics['accuracy']:.4f}")

Logistic Regression Accuracy: 0.8138


In [7]:
# Train Gradient Boosting
gb_model = mkyz.train(
    data,
    task='classification',
    model='gb',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5
)

gb_predictions = gb_model.predict(X_test)
gb_metrics = mkyz.classification_metrics(y_test, gb_predictions)
print(f"Gradient Boosting Accuracy: {gb_metrics['accuracy']:.4f}")

Gradient Boosting Accuracy: 0.8000
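
Hand-picking values as in the cells above can be replaced by a systematic search. The sketch below uses scikit-learn's `GridSearchCV` (every combination) and `RandomizedSearchCV` (a fixed budget of sampled configurations) on synthetic data. Randomized search appears here only as a simple budgeted alternative to a grid. It is not Bayesian optimization, and none of this reflects how MKYZ tunes models internally. The parameter ranges are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Grid search: evaluates every combination (2 x 2 = 4 configurations)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 15]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

# Randomized search: samples n_iter configurations from the given distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 150),
        "max_depth": randint(3, 16),
        "min_samples_split": randint(2, 11),
    },
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
rand.fit(X, y)

print("Grid best params:  ", grid.best_params_, f"score={grid.best_score_:.4f}")
print("Random best params:", rand.best_params_, f"score={rand.best_score_:.4f}")
```

The grid's cost grows multiplicatively with every added parameter, while the randomized search stays at `n_iter` fits per fold regardless of how many parameters are sampled.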


## 4. Model Comparison Table

In [8]:
# Compare all trained models
models = [
    ('Random Forest', rf_model, rf_metrics),
    ('Logistic Regression', lr_model, lr_metrics),
    ('Gradient Boosting', gb_model, gb_metrics)
]

print("Model Comparison:")
print("=" * 70)
print(f"{'Model':<25} {'Accuracy':>10} {'F1':>10} {'Precision':>10} {'Recall':>10}")
print("-" * 70)

# Use distinct loop names so the `metrics` dict from cell 4 is not overwritten
for name, _, m in models:
    print(
        f"{name:<25} {m['accuracy']:>10.4f} {m['f1_score']:>10.4f} "
        f"{m['precision']:>10.4f} {m['recall']:>10.4f}"
    )

Model Comparison:
======================================================================
Model                       Accuracy         F1  Precision     Recall
----------------------------------------------------------------------
Random Forest                 0.7793     0.7568     0.7928     0.7793
Logistic Regression           0.8138     0.8124     0.8117     0.8138
Gradient Boosting             0.8000     0.7960     0.7962     0.8000
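
The same comparison can be held in a pandas DataFrame, which makes sorting and further analysis easier. The sketch below copies the values printed above into a literal so that it runs on its own. In the notebook you would build the rows from the `models` list instead.

```python
import pandas as pd

# Values copied from the comparison table above
rows = [
    {"model": "Random Forest", "accuracy": 0.7793, "f1_score": 0.7568,
     "precision": 0.7928, "recall": 0.7793},
    {"model": "Logistic Regression", "accuracy": 0.8138, "f1_score": 0.8124,
     "precision": 0.8117, "recall": 0.8138},
    {"model": "Gradient Boosting", "accuracy": 0.8000, "f1_score": 0.7960,
     "precision": 0.7962, "recall": 0.8000},
]

# Rank models by F1 score, best first
comparison = (
    pd.DataFrame(rows)
    .set_index("model")
    .sort_values("f1_score", ascending=False)
)
print(comparison.to_string(float_format="{:.4f}".format))
```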


## 5. Cross-Validation for Fair Comparison

A single train/test split can favor one model by chance. Cross-validation scores each model on several splits of the training data and reports the mean and spread of those scores.

In [9]:
# Cross-validate each model for fair comparison
print("Cross-Validation Comparison (5-fold):")
print("=" * 50)

for name, model, _ in models:
    cv_results = mkyz.cross_validate(
        model, X_train, y_train,
        cv=mkyz.CVStrategy.STRATIFIED,
        n_splits=5
    )
    print(f"{name:<25}: {cv_results['mean_test_score']:.4f} ± {cv_results['std_test_score']:.4f}")

Cross-Validation Comparison (5-fold):
==================================================
Random Forest            : 0.7952 ± 0.0084
Logistic Regression      : 0.8090 ± 0.0372
Gradient Boosting        : 0.8056 ± 0.0243
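
Stratified K-fold keeps the class ratio roughly constant in every fold, and the mean ± standard deviation reported above summarizes the per-fold scores. A minimal scikit-learn sketch of the same idea follows, on synthetic data. It assumes accuracy as the score and does not describe what `mkyz.cross_validate` does internally.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Mildly imbalanced synthetic data (about 60% / 40%)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.6, 0.4], random_state=42
)

# Each of the 5 folds preserves the overall class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=skf, scoring="accuracy"
)

print(f"Per-fold accuracy: {scores.round(4)}")
print(f"Mean ± std: {scores.mean():.4f} ± {scores.std():.4f}")
```

A small standard deviation means the score is stable across folds. A large one suggests that a single test-set number may be misleading.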


## 6. Available Model Types

Reference of available models in MKYZ.

In [10]:
# Available models reference
print("Available Classification Models:")
print("=" * 50)
classification_models = {
    'rf': 'Random Forest',
    'lr': 'Logistic Regression',
    'svm': 'Support Vector Machine',
    'knn': 'K-Nearest Neighbors',
    'dt': 'Decision Tree',
    'nb': 'Naive Bayes',
    'gb': 'Gradient Boosting',
    'xgb': 'XGBoost (if installed)',
    'lgbm': 'LightGBM (if installed)',
    'catboost': 'CatBoost (if installed)'
}

for key, name in classification_models.items():
    print(f"  '{key}': {name}")

Available Classification Models:
==================================================
  'rf': Random Forest
  'lr': Logistic Regression
  'svm': Support Vector Machine
  'knn': K-Nearest Neighbors
  'dt': Decision Tree
  'nb': Naive Bayes
  'gb': Gradient Boosting
  'xgb': XGBoost (if installed)
  'lgbm': LightGBM (if installed)
  'catboost': CatBoost (if installed)


## Summary

In this notebook, we learned:

1. **AutoML** - Automatically compare multiple models with `auto_train()`
2. **Custom Training** - Train specific models with custom hyperparameters
3. **Model Comparison** - Compare models using test metrics and cross-validation
4. **Available Models** - Reference of supported model types

### Tips

- Start with `auto_train(optimize_models=False)` for quick comparison
- Use cross-validation for fair model comparison
- Try different hyperparameters for the best-performing model
- Consider ensemble methods for final predictions
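
As one way to act on the last tip, the three model families trained in this notebook can be combined with soft voting, which averages their predicted class probabilities. The sketch uses scikit-learn's `VotingClassifier` on synthetic data. Whether MKYZ offers its own ensemble helper is not covered here, so plain scikit-learn is assumed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameters mirror the values used in Section 3
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42)),
        ("lr", LogisticRegression(C=1.0, max_iter=1000)),
        ("gb", GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42
        )),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
)
ensemble.fit(X_tr, y_tr)
print(f"Ensemble accuracy: {ensemble.score(X_te, y_te):.4f}")
```

An ensemble is not guaranteed to beat its best member, so it is worth comparing it against the individual models with the same cross-validation setup.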