<a href="https://colab.research.google.com/github/kritika966/AutoML-Using-PyCaret/blob/main/AutoML_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Fit a Model with AutoML

## What is AutoML?

AutoML stands for Automated Machine Learning. It refers to techniques and tools that automatically build and optimize machine learning models without requiring deep expertise in data science.

**Key Features of AutoML:**

- Automatic data preprocessing (handling missing values, encoding)

- Feature engineering (creation & selection)

- Model training & selection (tries multiple models)

- Hyperparameter tuning (automatically adjusts model settings)

- Evaluation & comparison of models

- Model interpretability (e.g., SHAP, feature importance)


## Install Required Packages

In [None]:
# Install PyCaret; uncomment the next line on a fresh Colab runtime
#!pip install --pre pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting numpy<1.27,>=1.21 (from pycaret)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas<2.2.0 (from pycaret)
  Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret)
  Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-2.0.5-py3-none-any.whl.metadata (46 kB)

In [None]:
# Import libraries
import pandas as pd
from pycaret.classification import (
    compare_models, evaluate_model, finalize_model,
    predict_model, save_model, setup,
)

In [None]:
# Load the dataset

from google.colab import files
uploaded = files.upload()

Saving application_train.csv to application_train (1).csv


In [None]:
df = pd.read_csv("application_train.csv")
df.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df['TARGET'].value_counts(normalize=True)

TARGET,proportion
0,0.919271
1,0.080729


In [None]:
df.isnull().sum().sort_values(ascending=False).head(10)

Column,Missing values
COMMONAREA_MEDI,214865
COMMONAREA_AVG,214865
COMMONAREA_MODE,214865
NONLIVINGAPARTMENTS_MODE,213514
NONLIVINGAPARTMENTS_AVG,213514
NONLIVINGAPARTMENTS_MEDI,213514
FONDKAPREMONT_MODE,210295
LIVINGAPARTMENTS_MODE,210199
LIVINGAPARTMENTS_AVG,210199
LIVINGAPARTMENTS_MEDI,210199


In [None]:
df.select_dtypes('object').nunique().sort_values(ascending=False).head(10)

Column,Unique values
ORGANIZATION_TYPE,58
OCCUPATION_TYPE,18
NAME_INCOME_TYPE,8
NAME_TYPE_SUITE,7
WEEKDAY_APPR_PROCESS_START,7
WALLSMATERIAL_MODE,7
NAME_FAMILY_STATUS,6
NAME_HOUSING_TYPE,6
NAME_EDUCATION_TYPE,5
FONDKAPREMONT_MODE,4


## Initialize AutoML with PyCaret

In [None]:
# Initialize PyCaret (AutoML)
clf = setup(data=df, target='TARGET', session_id=123, use_gpu=False)

,Description,Value
0,Session id,123
1,Target,TARGET
2,Target type,Binary
3,Original data shape,"(307511, 122)"
4,Transformed data shape,"(307511, 185)"
5,Transformed train set shape,"(215257, 185)"
6,Transformed test set shape,"(92254, 185)"
7,Numeric features,105
8,Categorical features,16
9,Rows with missing values,97.2%


## Compare Models

In [None]:
# This will train and compare multiple models automatically
best_model = compare_models()


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.9194,0.712,0.0015,0.7717,0.003,0.0027,0.0315,125.696
lr,Logistic Regression,0.9193,0.6289,0.0,0.0,0.0,-0.0,-0.0004,77.814
ridge,Ridge Classifier,0.9193,0.7464,0.0,0.0,0.0,-0.0,-0.0002,8.506
nb,Naive Bayes,0.914,0.603,0.0056,0.0653,0.0103,-0.0011,-0.0023,8.785
knn,K Neighbors Classifier,0.9138,0.5303,0.0132,0.1402,0.0241,0.0104,0.0191,150.195
svm,SVM - Linear Kernel,0.9121,0.5707,0.0128,0.098,0.0179,0.0052,0.0087,15.962
dt,Decision Tree Classifier,0.8509,0.5391,0.1672,0.1416,0.1533,0.0722,0.0725,31.011
qda,Quadratic Discriminant Analysis,0.1063,0.529,0.9801,0.0815,0.1504,0.0016,0.0164,16.606


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.9195,0.7524,0.0147,0.5527,0.0286,0.0244,0.0799,232.843
rf,Random Forest Classifier,0.9194,0.712,0.0015,0.7717,0.003,0.0027,0.0315,125.696
lr,Logistic Regression,0.9193,0.6289,0.0,0.0,0.0,-0.0,-0.0004,77.814
ridge,Ridge Classifier,0.9193,0.7464,0.0,0.0,0.0,-0.0,-0.0002,8.506
et,Extra Trees Classifier,0.9193,0.7048,0.0006,0.4917,0.0013,0.0011,0.0152,115.511
dummy,Dummy Classifier,0.9193,0.5,0.0,0.0,0.0,0.0,0.0,7.023
ada,Ada Boost Classifier,0.9191,0.744,0.0203,0.4802,0.0389,0.0326,0.0856,54.308
lda,Linear Discriminant Analysis,0.919,0.7464,0.0252,0.4711,0.0478,0.0399,0.0942,16.5
xgboost,Extreme Gradient Boosting,0.9187,0.7445,0.0332,0.4491,0.0618,0.0512,0.1046,19.266
nb,Naive Bayes,0.914,0.603,0.0056,0.0653,0.0103,-0.0011,-0.0023,8.785


**Best Accuracy:**

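The top models are nearly tied: Gradient Boosting (0.9195) and Random Forest (0.9194) are within 0.0002 of the Dummy Classifier (0.9193), which always predicts the majority class.

Because about 92% of the rows are class 0, high accuracy here says almost nothing about how well a model finds defaulters.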
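To make the imbalance point concrete, the short sketch below compares the cross-validated accuracies against a majority-class baseline. It uses only the class proportions and accuracy values printed earlier in this notebook; the variable names are illustrative.

```python
# Class proportions reported by df['TARGET'].value_counts(normalize=True) above.
class_share = {0: 0.919271, 1: 0.080729}

# A classifier that always predicts the majority class is right exactly
# as often as that class occurs.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# Cross-validated accuracies copied from the comparison table.
cv_accuracy = {"gbc": 0.9195, "rf": 0.9194, "lr": 0.9193, "dummy": 0.9193}

for name, acc in cv_accuracy.items():
    gain = acc - baseline_accuracy
    print(f"{name:>5}: accuracy={acc:.4f}  gain over baseline={gain:+.4f}")
```

Every gain is a fraction of a percentage point, which is why accuracy cannot separate these models.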

**Best AUC (Class Separation):**

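Gradient Boosting (0.7524) is highest. AUC measures how well the model ranks positives above negatives across all thresholds (0.5 is random, 1.0 is perfect).

It is the metric to watch when the ranking of predictions matters more than a fixed yes/no cutoff.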

**Best Recall (Catch Positives):**

QDA (0.9801) has the highest recall: it catches almost all positives.

But its accuracy is very poor (0.1063) and its precision is only 0.0815, because it labels most negatives as positive.
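The link between QDA's recall, precision and accuracy can be checked with a little confusion-matrix arithmetic. The sketch below is an approximation: it plugs in the rounded, fold-averaged values from the table and the class-1 share printed earlier, so the result only roughly matches the reported accuracy. The function name is illustrative.

```python
def accuracy_from_rates(prevalence, recall, precision):
    """Reconstruct accuracy from prevalence, recall and precision.

    All returned quantities are fractions of the whole dataset.
    """
    tp = recall * prevalence                # positives that are caught
    fp = tp * (1 - precision) / precision   # false alarms implied by the precision
    tn = (1 - prevalence) - fp              # negatives left correctly classified
    return tp + tn, fp

# QDA row from the comparison table, class-1 share from value_counts().
acc, fp = accuracy_from_rates(prevalence=0.080729, recall=0.9801, precision=0.0815)
print(f"false positives: {fp:.1%} of all rows, accuracy: {acc:.4f}")
```

The reconstructed accuracy comes out at about 0.107, close to the 0.1063 in the table: nearly nine out of ten rows end up as false positives.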

**Best Precision:**

Random Forest (0.7717) is highest: when it predicts positive it is usually correct, but it catches almost no positives (recall = 0.0015).

**Best Balance (F1 Score):**

Decision Tree (0.1533) and QDA (0.1504) have the highest F1 scores; Gradient Boosting reaches only 0.0286. All F1 scores are low.

A low F1 means the models fail to achieve good precision and good recall at the same time.
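F1 is the harmonic mean of precision and recall, so a near-zero recall drags it down no matter how high the precision is. The sketch below recomputes F1 from the precision and recall columns of the table. Because the table reports fold averages, the recomputed values agree with the F1 column only approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the comparison tables.
rows = {
    "gbc": (0.5527, 0.0147),
    "rf": (0.7717, 0.0015),
    "dt": (0.1416, 0.1672),
    "qda": (0.0815, 0.9801),
}

for name, (precision, recall) in rows.items():
    print(f"{name:>4}: F1 = {f1_score(precision, recall):.4f}")
```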

**Fastest Models:**

Ridge Classifier (8.5 sec) and Naive Bayes (8.8 sec) are among the quickest to train.

They are handy for quick testing. Ridge even reaches a competitive AUC (0.7464), but both have an F1 near zero.

**Summary:**

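Use AUC, not accuracy, to compare models on imbalanced data.

If catching positives is critical (as in fraud or disease detection), prioritize recall. QDA has the highest recall here, but its AUC of 0.529 is close to random, so it gets there by flagging almost everyone.

If false alarms are costly, look for high precision (like Random Forest), keeping in mind that its recall is near zero.

For a balanced starting point, look at F1 and AUC together. No model here has a good F1 at the default threshold; Gradient Boosting has the best AUC, which makes it a reasonable baseline.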
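The ranking changes with the metric. The sketch below re-sorts a subset of the results above by AUC and by F1 in plain Python, with values copied from the two comparison tables. In PyCaret itself, passing `sort='AUC'` to `compare_models()` should give the same ordering directly, though that call is not run here.

```python
# (model id, AUC, F1) copied from the comparison tables above.
cv_results = [
    ("gbc", 0.7524, 0.0286),
    ("ridge", 0.7464, 0.0),
    ("lda", 0.7464, 0.0478),
    ("xgboost", 0.7445, 0.0618),
    ("ada", 0.744, 0.0389),
    ("rf", 0.712, 0.003),
    ("dt", 0.5391, 0.1533),
    ("qda", 0.529, 0.1504),
    ("dummy", 0.5, 0.0),
]

# sorted() is stable, so models with equal scores keep their listed order.
by_auc = sorted(cv_results, key=lambda row: row[1], reverse=True)
by_f1 = sorted(cv_results, key=lambda row: row[2], reverse=True)

print("Top 3 by AUC:", [name for name, _, _ in by_auc[:3]])
print("Top 3 by F1: ", [name for name, _, _ in by_f1[:3]])
```

The two leaders by F1 (Decision Tree and QDA) sit near the bottom by AUC, which is a hint that their F1 comes from predicting many positives rather than from ranking them well.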

## Create & Evaluate the Best Model

In [None]:
# Finalize model: refit the best pipeline on the full dataset, including the hold-out set
final_model = finalize_model(best_model)

# Evaluate model
evaluate_model(final_model)



## Make Predictions

In [None]:
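# Score the hold-out set (no `data` argument). Note: finalize_model refit
# final_model on all rows, so this split is no longer unseen by it.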
predictions = predict_model(final_model)
predictions.head()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.9197,0.7581,0.0126,0.6225,0.0247,0.0216,0.0805


,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,TARGET,prediction_label,prediction_score
237042,374557,Cash loans,M,N,N,0,225000.0,714154.5,21780.0,616500.0,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.7988
226763,362659,Cash loans,F,N,Y,0,135000.0,1040985.0,30438.0,909000.0,...,0,0.0,0.0,0.0,0.0,0.0,4.0,0,0,0.9656
182044,311004,Cash loans,M,Y,Y,3,252000.0,900000.0,31887.0,900000.0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.968
5214,106100,Cash loans,F,N,Y,0,72000.0,113760.0,6480.0,90000.0,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.9692
290026,435994,Cash loans,F,N,N,0,90000.0,414792.0,18270.0,315000.0,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.7841


The first row (customer ID 374557) has a credit amount (`AMT_CREDIT`) of 714,154.5 and an income (`AMT_INCOME_TOTAL`) of 225,000.0.

The actual label (`TARGET`) is 0, meaning the person did not default.

The model predicted 0 (no default) with a score of 0.7988 (about 79.9%).

The prediction is correct: the model predicted "no default", and that was true.


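The `prediction_score` is the model's estimated probability for the class shown in `prediction_label`, not the probability of class 1 (default). That is why every row above has label 0 together with a score well above 0.5.

For a row labeled 0, a score closer to 1 = more confident in "no default", and 1 - score is the implied probability of default.

A score closer to 0.5 = a borderline case.

Use `prediction_label` to make yes/no decisions, and use `prediction_score` to understand the model's confidence.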
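The sketch below turns the label and score of the five rows shown above into an explicit probability of default. It assumes `prediction_score` is the probability of the predicted label, which is what the rows suggest (label 0 with scores up to 0.97). The 0.2 review threshold and the helper name are hypothetical and chosen only for illustration.

```python
def default_probability(prediction_label, prediction_score):
    """Probability of class 1, assuming the score refers to the predicted label."""
    return prediction_score if prediction_label == 1 else 1.0 - prediction_score

# (SK_ID_CURR, prediction_label, prediction_score) from the output above.
rows = [
    (374557, 0, 0.7988),
    (362659, 0, 0.9656),
    (311004, 0, 0.968),
    (106100, 0, 0.9692),
    (435994, 0, 0.7841),
]

REVIEW_THRESHOLD = 0.2  # hypothetical cutoff for manual review

flagged = []
for customer_id, label, score in rows:
    p_default = default_probability(label, score)
    if p_default >= REVIEW_THRESHOLD:
        flagged.append(customer_id)
    print(f"{customer_id}: P(default) = {p_default:.4f}")

print("Flagged for review:", flagged)
```

In PyCaret, `predict_model` also accepts `raw_score=True` to return one score column per class and `probability_threshold` to change the cutoff for the positive label, which avoids this manual conversion.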



## Save Model

In [None]:
# Save the model to use later
save_model(final_model, 'home_credit_best_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['SK_ID_CURR', 'CNT_CHILDREN',
                                              'AMT_INCOME_TOTAL', 'AMT_CREDIT',
                                              'AMT_ANNUITY', 'AMT_GOODS_PRICE',
                                              'REGION_POPULATION_RELATIVE',
                                              'DAYS_BIRTH', 'DAYS_EMPLOYED',
                                              'DAYS_REGISTRATION',
                                              'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
                                              'FLAG_MOBIL', 'FLAG_EMP_PHONE',
                                              'FLAG_WORK_...
                                             criterion='friedman_mse', init=None,
                                             learning_rate=0.1, loss='log_loss',
                                

## Automated Feature Engineering with PyCaret

PyCaret handles much of the feature engineering through its `setup()` function. Some steps run by default, while others are opt-in and must be switched on with a `setup()` argument:

* **Missing Value Treatment** (on by default): PyCaret imputes missing data, by default with the mean for numerical columns and the most frequent value for categorical ones.

* **Categorical Encoding** (on by default): It converts categorical variables into numerical format, using one-hot encoding for columns with few categories and a different encoder for high-cardinality columns. This is why the 122 original columns became 185 after transformation.

* **Outlier Handling** (opt-in, `remove_outliers=True`): It detects outliers and removes them from the training data to make the model more robust.

* **Normalization and Scaling** (opt-in, `normalize=True`): It rescales numerical features, for example by standardization, which mainly helps scale-sensitive models such as KNN, SVM, and linear models.

* **Feature Interaction Creation** (opt-in, `polynomial_features=True`): It generates new features from polynomial combinations of existing ones to capture non-linear effects.

* **Reducing Dimensionality** (opt-in, `pca=True`): It reduces the number of input features while retaining most of the variance, using methods like PCA (Principal Component Analysis).

* **Feature Selection** (opt-in, `feature_selection=True`): It drops features that a selection estimator considers irrelevant or redundant; the method is chosen with `feature_selection_method`.
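The preprocessing steps listed above are controlled through keyword arguments of `setup()`. The sketch below collects a few of them in a dictionary, assuming the PyCaret 3.x classification API. The chosen values are illustrative rather than tuned, and `run_setup` is a hypothetical helper that imports PyCaret only when it is called.

```python
# Preprocessing switches for pycaret.classification.setup (PyCaret 3.x assumed).
# The specific values are illustrative assumptions, not tuned settings.
PREPROCESSING_OPTIONS = dict(
    numeric_imputation="median",    # instead of the default mean
    categorical_imputation="mode",  # most frequent value
    max_encoding_ohe=25,            # one-hot encode columns with at most 25 categories
    remove_outliers=True,           # drop detected outliers from the training data
    normalize=True,                 # rescale numerical features
    normalize_method="zscore",      # standardization
    polynomial_features=False,      # keep interaction features off for speed
    pca=False,                      # no dimensionality reduction
    feature_selection=True,         # keep only a subset of features
    n_features_to_select=0.5,       # fraction of features to keep
)

def run_setup(df, target="TARGET", session_id=123):
    """Hypothetical helper: initialize PyCaret with the options above."""
    from pycaret.classification import setup  # lazy import; requires pycaret
    return setup(data=df, target=target, session_id=session_id,
                 **PREPROCESSING_OPTIONS)
```

Calling `run_setup(df)` in place of the earlier `setup(...)` call would rebuild the experiment with these options. On a dataset of this size, outlier removal and feature selection can add noticeable run time.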




## Conclusion

In this tutorial, we used PyCaret’s AutoML to quickly build and compare several classification models for predicting loan default risk. The Gradient Boosting Classifier was selected as the best model. On the hold-out set it reached an accuracy of 91.97%, barely above the roughly 91.9% obtained by always predicting "no default", and a precision of 62.25%, but its recall was very low (1.26%), meaning it missed almost all actual defaulters.

We also explored individual predictions, showing how the model outputs both a prediction label (default or not) and a prediction score (model confidence). While the model is confident in predicting "no default", it struggles to correctly identify customers who actually default.

**This highlights the importance of:**

- Evaluating multiple metrics, not just accuracy.

- Understanding the business context: in credit risk, recall is critical to catch high-risk customers.

- Using prediction scores to flag borderline or high-risk customers even if the model predicts no default.

With AutoML, we were able to build a baseline model quickly, interpret its performance, and inspect individual predictions, which makes it a useful tool for rapid experimentation on real-world business problems.