## Step 1: Setup & Install PyCaret

This first step ensures that PyCaret and its dependencies are available in the environment. PyCaret is a low-code library that simplifies the entire machine learning workflow, from data preprocessing to model deployment. The installation process brings in a number of core ML and data science libraries (pandas, scikit-learn, matplotlib, lightgbm, etc.) required for typical ML pipelines.

**Why show the install?**
Installing PyCaret in a fresh environment shows how little is needed to get started. In production or on your own workstation, you would usually do this once when setting up your tools.

**Note:** The verbose output lists all dependencies PyCaret brings in—demonstrating how comprehensive the package is out-of-the-box.
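
After installing, it can help to confirm which PyCaret version is active, since function defaults may differ between releases. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of PyCaret.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pycaret")
print(version if version else "pycaret is not installed in this environment")
```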


In [14]:
!pip install pycaret



## Step 2: Data Overview, Automatic Setup, and Initial Diagnostics

In this cell, we demonstrate how PyCaret’s `setup()` function automates several major steps of the ML workflow:

- **Dataset loading and preview:** We use the diabetes dataset, a classic binary classification problem: predict the onset of diabetes from eight health indicators.
- **Automatic preprocessing and summary:** PyCaret handles missing-value imputation, categorical encoding, and the train-test split (here 537 training rows and 231 test rows, roughly 70/30), then reports the configuration in a single output table.
- **Key insights from the table output** (the excerpt shown below stops after its first ten rows, so not every setting is visible in it):
  - The target type, the data shapes before and after transformation, and the number of features of each type.
  - Imputation strategy (e.g., mean for numeric, mode for categorical).
  - Whether the data was normalized and how folds are set up for cross-validation.
  - The session seed, CPU/job settings, and similar run details.
  
**Teaching Point:**  
Notice how much preprocessing and setup PyCaret handles with no manual coding—making it an ideal tool for fast prototyping or educational demos.  
**Limitation Highlight:**  
For truly complex datasets (with highly non-standard features, text, images, or where custom preprocessing is vital), PyCaret’s automation may not cover all the intricacies required for optimal results.
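
The train and test shapes that `setup()` reports for this dataset follow from simple arithmetic. This is a minimal sketch, assuming a training fraction of 0.7 and a test share rounded up to a whole row (the convention scikit-learn's `train_test_split` uses); the helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows: int, train_size: float = 0.7) -> tuple:
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = math.ceil(n_rows * (1 - train_size))
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(768)
print(n_train, n_test)  # 537 231
```

For the 768-row diabetes dataset this gives 537 training rows and 231 test rows, matching the shapes in the setup summary.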


In [26]:
import pycaret
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |


In [27]:
from pycaret.classification import *
experiment = setup(data = diabetes, target = 'Class variable', session_id=123)

| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | Class variable |
| 2 | Target type | Binary |
| 3 | Original data shape | (768, 9) |
| 4 | Transformed data shape | (768, 9) |
| 5 | Transformed train set shape | (537, 9) |
| 6 | Transformed test set shape | (231, 9) |
| 7 | Numeric features | 8 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


## Step 3: Model Training & Automated Comparison

PyCaret’s `compare_models()` trains a variety of classification algorithms on the preprocessed dataset and benchmarks them using standard metrics like Accuracy, AUC, Recall, Precision, and F1 Score, all in just a single line of code.

**Key things happening:**
- Multiple models (Logistic Regression, Ridge Classifier, Random Forest, Naive Bayes, Gradient Boosting, etc.) are trained and validated using cross-validation (here, StratifiedKFold with 10 folds).
- Results are clearly summarized, so you immediately see which model performs best for your dataset and objective.
- Each row in the table represents a different model’s average cross-validated performance.

**Teaching Point:**  
This level of automation allows you to move rapidly from data to insight, skipping hours of manual model coding and baseline testing.

**Limitation Highlight:**  
- While convenient, this approach may gloss over nuances—such as precise handling of class imbalance, custom validation splits, or advanced hyperparameter tuning needed for production-grade solutions.
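
To make the metric columns concrete, the sketch below recomputes several of them from confusion-matrix counts. The counts used here (9 true positives, 1 false positive, 9 false negatives, 34 true negatives) are an assumption: they are one set of counts consistent with fold 7 of the Random Forest results in Step 4, not values that PyCaret prints. The helper `classification_metrics` is hypothetical.

```python
import math


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted and one
    actual example of each class).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Assumed counts for a single 53-row validation fold (not PyCaret output).
metrics = classification_metrics(tp=9, fp=1, fn=9, tn=34)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 43/53, precision 9/10, and recall 9/18, which shows how a model can look accurate overall while still missing half of the positive cases.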


In [28]:
best_model = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.7689 | 0.8047 | 0.5602 | 0.7208 | 0.6279 | 0.4641 | 0.4736 | 0.004 |
| ridge | Ridge Classifier | 0.767 | 0.806 | 0.5497 | 0.7235 | 0.6221 | 0.4581 | 0.469 | 0.003 |
| lda | Linear Discriminant Analysis | 0.767 | 0.8055 | 0.555 | 0.7202 | 0.6243 | 0.4594 | 0.4695 | 0.003 |
| rf | Random Forest Classifier | 0.7466 | 0.792 | 0.5284 | 0.6795 | 0.5908 | 0.4117 | 0.421 | 0.014 |
| nb | Naive Bayes | 0.7427 | 0.7955 | 0.5702 | 0.6543 | 0.6043 | 0.4156 | 0.4215 | 0.002 |
| gbc | Gradient Boosting Classifier | 0.7373 | 0.7909 | 0.555 | 0.6445 | 0.5931 | 0.4013 | 0.4059 | 0.009 |
| ada | Ada Boost Classifier | 0.7372 | 0.7799 | 0.5275 | 0.6585 | 0.5796 | 0.3926 | 0.4017 | 0.005 |
| qda | Quadratic Discriminant Analysis | 0.7282 | 0.7894 | 0.5281 | 0.6558 | 0.5736 | 0.3785 | 0.391 | 0.003 |
| et | Extra Trees Classifier | 0.7243 | 0.7793 | 0.4857 | 0.6419 | 0.5487 | 0.3565 | 0.3663 | 0.011 |
| lightgbm | Light Gradient Boosting Machine | 0.7133 | 0.7645 | 0.5398 | 0.6036 | 0.565 | 0.3534 | 0.358 | 0.285 |


## Step 4: Create a Random Forest Model

Here we use PyCaret's `create_model()` function to quickly instantiate and train a **Random Forest classifier** on the preprocessed training data.

- **Purpose:** This function builds the machine learning model of your choice (here, `'rf'` for Random Forest).
- **How it works:** Under the hood, PyCaret:
  - Uses the estimator's default hyperparameters (unless you specify custom ones)
  - Runs 10-fold cross-validation (by default) on the training set for a more reliable initial estimate
  - Prints key performance metrics for each fold: Accuracy, AUC, Recall, Precision, F1, etc.
- **Key advantage:** With a single line, you train and cross-validate a baseline model, ready for diagnostics or further tuning.

> **Tip:** You can replace `'rf'` with any supported PyCaret model ID (e.g., `'lr'`, `'gbc'`, `'lightgbm'`) to try different machine learning algorithms just as easily.
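
The 537 training rows do not divide evenly into 10 folds, so the validation folds differ slightly in size. The sketch below shows the usual k-fold convention (the first `n % k` folds receive one extra row, as in scikit-learn's `KFold`). Stratified splitting can shift individual fold sizes, so treat this as an approximation; `fold_sizes` is a hypothetical helper.

```python
def fold_sizes(n_rows: int, n_folds: int = 10) -> list:
    """Split n_rows into n_folds sizes; the first (n_rows % n_folds) folds get one extra row."""
    base, extra = divmod(n_rows, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(537)
print(sizes)  # [54, 54, 54, 54, 54, 54, 54, 53, 53, 53]
print(f"one prediction changes fold accuracy by about {1 / sizes[0]:.3f}")
```

With only about 54 rows per validation fold, a single prediction moves accuracy by almost two percentage points, which helps explain the spread between folds. For example, an accuracy of 0.7593 corresponds to 41 correct predictions out of 54.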


In [29]:
model = create_model('rf')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.7593 | 0.882 | 0.5789 | 0.6875 | 0.6286 | 0.4524 | 0.4561 |
| 1 | 0.7593 | 0.8098 | 0.6316 | 0.6667 | 0.6486 | 0.4658 | 0.4661 |
| 2 | 0.7593 | 0.8782 | 0.4737 | 0.75 | 0.5806 | 0.4236 | 0.4456 |
| 3 | 0.7037 | 0.7632 | 0.4737 | 0.6 | 0.5294 | 0.3175 | 0.3223 |
| 4 | 0.8148 | 0.8504 | 0.6842 | 0.7647 | 0.7222 | 0.584 | 0.586 |
| 5 | 0.6852 | 0.6729 | 0.4211 | 0.5714 | 0.4848 | 0.2656 | 0.272 |
| 6 | 0.7963 | 0.7872 | 0.6316 | 0.75 | 0.6857 | 0.5367 | 0.541 |
| 7 | 0.8113 | 0.8635 | 0.5 | 0.9 | 0.6429 | 0.5285 | 0.5706 |
| 8 | 0.6792 | 0.6833 | 0.4444 | 0.5333 | 0.4848 | 0.2548 | 0.257 |
| 9 | 0.6981 | 0.7294 | 0.4444 | 0.5714 | 0.5 | 0.2886 | 0.2933 |


## Step 5: Tune Your Model for Better Performance

Now, we use `tune_model()` to **automatically search for the best hyperparameters** of the Random Forest model we just created.

- **Purpose:** Hyperparameter tuning is essential for extracting maximum performance from ML models by optimizing settings like tree depth, number of trees, etc.
- **How it works:** PyCaret:
  - Runs an automated hyperparameter search (a randomized search by default)
  - Evaluates a sample of candidate combinations (10 in this run, each scored with 10-fold cross-validation)
  - Returns the tuned model, or the original model if tuning did not improve the cross-validated score
  - Prints the per-fold performance of the tuned candidate, so you can compare it with the untuned model

**Why show this step?**
- In professional ML workflows, tuning can significantly boost accuracy or recall. 
- Here, it's made simple—just one function call instead of writing custom tuning loops.

> **Limitation:** For highly complex pipelines or non-standard hyperparameters (e.g., custom cost functions), you may need to extend or leave PyCaret's default tuning system.
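
A randomized search scores only a random subset of a parameter grid, not every combination. The sketch below illustrates the sampling step using only the standard library. The grid is hypothetical (it is not PyCaret's actual Random Forest search space), and the candidate count of 10 is taken from the log output of this run.

```python
import itertools
import random

# Hypothetical search space for illustration; not PyCaret's internal grid.
param_grid = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 4],
}


def sample_candidates(grid: dict, n_iter: int, seed: int) -> list:
    """Draw n_iter distinct parameter combinations from the full grid."""
    keys = list(grid)
    all_combos = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]
    rng = random.Random(seed)
    return rng.sample(all_combos, n_iter)


candidates = sample_candidates(param_grid, n_iter=10, seed=123)
n_folds = 10
print(f"{len(candidates)} candidates x {n_folds} folds = {len(candidates) * n_folds} fits")
```

Here the full grid has 48 combinations, but only 10 are evaluated. Each candidate is then cross-validated on every fold, which is why the number of model fits is the product of the two counts.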


In [30]:
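# Caution: this assignment rebinds the name `tune_model` from PyCaret's function to the
# returned model, so the function cannot be called again without re-importing it.
# Later cells refer to the model by this name.
tune_model = tune_model(model)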

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.7593 | 0.8827 | 0.6842 | 0.65 | 0.6667 | 0.4785 | 0.4788 |
| 1 | 0.7407 | 0.8391 | 0.8947 | 0.5862 | 0.7083 | 0.4926 | 0.5286 |
| 2 | 0.7778 | 0.9008 | 0.7895 | 0.6522 | 0.7143 | 0.5352 | 0.5417 |
| 3 | 0.7407 | 0.7729 | 0.6316 | 0.6316 | 0.6316 | 0.4316 | 0.4316 |
| 4 | 0.8333 | 0.8872 | 0.8421 | 0.7273 | 0.7805 | 0.6473 | 0.6518 |
| 5 | 0.6296 | 0.7203 | 0.5263 | 0.4762 | 0.5 | 0.207 | 0.2077 |
| 6 | 0.7593 | 0.788 | 0.7368 | 0.6364 | 0.6829 | 0.4906 | 0.494 |
| 7 | 0.8113 | 0.8556 | 0.7222 | 0.7222 | 0.7222 | 0.5794 | 0.5794 |
| 8 | 0.6415 | 0.7095 | 0.5556 | 0.4762 | 0.5128 | 0.2319 | 0.2336 |
| 9 | 0.6604 | 0.7286 | 0.5556 | 0.5 | 0.5263 | 0.2628 | 0.2636 |


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
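
The log message above can be checked by averaging the per-fold scores from the Step 4 and Step 5 tables. The sketch below does this for Accuracy and Recall. It assumes that the comparison between the original and tuned model is made on mean cross-validated Accuracy, which is our understanding of the default and not something the output states.

```python
from statistics import mean

# Per-fold scores copied from the create_model (original) and tune_model (tuned) tables.
original_accuracy = [0.7593, 0.7593, 0.7593, 0.7037, 0.8148, 0.6852, 0.7963, 0.8113, 0.6792, 0.6981]
tuned_accuracy = [0.7593, 0.7407, 0.7778, 0.7407, 0.8333, 0.6296, 0.7593, 0.8113, 0.6415, 0.6604]
original_recall = [0.5789, 0.6316, 0.4737, 0.4737, 0.6842, 0.4211, 0.6316, 0.5, 0.4444, 0.4444]
tuned_recall = [0.6842, 0.8947, 0.7895, 0.6316, 0.8421, 0.5263, 0.7368, 0.7222, 0.5556, 0.5556]

print(f"accuracy: original {mean(original_accuracy):.3f} vs tuned {mean(tuned_accuracy):.3f}")
print(f"recall:   original {mean(original_recall):.3f} vs tuned {mean(tuned_recall):.3f}")
```

Mean accuracy drops from about 0.747 to about 0.735 after tuning, which is consistent with the original model being returned. Mean recall, however, rises from about 0.528 to about 0.694, so the tuned candidate may still be preferable when missed positive cases are the bigger concern. In that situation it may be worth tuning against a different metric.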


## Step 6: Visual Evaluation of Model Performance

Let's use `evaluate_model()` to **interactively visualize and interpret the performance** of the Random Forest model returned by the tuning step (in this run, the original model, since tuning did not improve on it).

- **What it does:** Launches a suite of diagnostic plots:
  - ROC Curve, Precision-Recall, Confusion Matrix, Feature Importance, Learning Curve, and more
- **Purpose:** These visualizations help you:
  - Spot overfitting or underfitting (from curves and matrices)
  - Understand which features are most impactful for predictions
  - Assess class-wise errors and prediction probabilities
- **User Experience:** The function renders an interactive widget inside the notebook, so you can switch between the key visual reports without writing extra plotting code.
- **Key takeaway:** With a single function, PyCaret wraps up model analysis and debugging, making it easy for audiences or stakeholders to understand results.

> **Pro Tip:** Serious production projects should still include custom error analysis—but these visuals are the fastest way to communicate model strengths and weaknesses in a demo or prototype.


In [None]:
evaluate_model(tune_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

## Step 7: Saving the Tuned Model for Future Use

This step uses the `save_model()` function to **save your best model to disk** with a chosen name (here, `"diabetes_prediction"`).

- This makes it easy to reuse the trained and tuned model later, either in another notebook, a script, or even in production.
- The preprocessing pipeline and the model are saved together in a single pickle file (`diabetes_prediction.pkl`), as the printed `Pipeline` object below shows.

**Key point:** No need to retrain your model every time—simply reload it when needed!


In [32]:
save_model(tune_model, 'diabetes_prediction')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Number of times pregnant',
                                              'Plasma glucose concentration a 2 '
                                              'hours in an oral glucose '
                                              'tolerance test',
                                              'Diastolic blood pressure (mm Hg)',
                                              'Triceps skin fold thickness (mm)',
                                              '2-Hour serum insulin (mu U/ml)',
                                              'Body mass index (weight in '
                                              'kg/(height in m)^2)',
                                              'Diabetes pedigre...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                         class_w

## Step 8: Loading the Saved Model

With `load_model()`, you can **load your previously saved model checkpoint** (here, `"diabetes_prediction"`) back into memory.

- This restores the model to exactly how it was saved—including all preprocessing steps and hyperparameters.
- Useful for making predictions in a new session, handing off work, or deploying in an app.

**Key point:** Loading a model is fast; no need for retraining or repeated setup!
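
The save-and-load round trip can be illustrated with the standard library alone. The sketch below is a toy stand-in, not PyCaret's implementation: the "pipeline" is a plain dictionary, the fill value and threshold are made-up numbers with no medical meaning, and `predict` is a hypothetical helper. The point is only that preprocessing settings and model parameters travel together in one file.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline: preprocessing and model parameters as plain data.
bundle = {
    "preprocessing": {"glucose_fill_value": 120.0},  # made-up imputation value
    "model": {"glucose_threshold": 140.0},           # made-up decision threshold
}


def predict(bundle: dict, glucose) -> int:
    """Impute a missing glucose value, then apply a simple threshold rule."""
    if glucose is None:
        glucose = bundle["preprocessing"]["glucose_fill_value"]
    return int(glucose >= bundle["model"]["glucose_threshold"])


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_bundle.pkl")
    with open(path, "wb") as f:
        pickle.dump(bundle, f)       # loosely analogous to save_model()
    with open(path, "rb") as f:
        restored = pickle.load(f)    # loosely analogous to load_model()

print(restored == bundle)                               # True
print(predict(restored, 148), predict(restored, None))  # 1 0
```

Because the restored object carries its own preprocessing settings, the caller can pass raw inputs (including a missing value) without repeating any setup.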


In [33]:
loaded_model = load_model('diabetes_prediction')

Transformation Pipeline and Model Successfully Loaded


## Step 9: Making Predictions on New Data

Finally, use `predict_model()` with your loaded model and any new data to **generate predictions**.

- Pass in your loaded model and the new dataset (e.g., `new_data`).
- PyCaret will handle all necessary preprocessing automatically and return predictions in a single step.
- This is how you use your trained model for real-world inference!

**Key point:** Easily use your ML model on new, unseen data with just one function call.
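
New data needs the same feature columns, with the same names, as the training data. A quick check before calling `predict_model()` can catch typos in these long column names. The sketch below uses only the standard library; `check_columns` is a hypothetical helper, and how PyCaret itself reacts to mismatched columns is not covered here.

```python
# Feature names as they appear in the diabetes dataset preview (target column excluded).
EXPECTED_FEATURES = [
    "Number of times pregnant",
    "Plasma glucose concentration a 2 hours in an oral glucose tolerance test",
    "Diastolic blood pressure (mm Hg)",
    "Triceps skin fold thickness (mm)",
    "2-Hour serum insulin (mu U/ml)",
    "Body mass index (weight in kg/(height in m)^2)",
    "Diabetes pedigree function",
    "Age (years)",
]


def check_columns(columns) -> dict:
    """Report feature columns that are missing from, or unexpected in, the new data."""
    expected, given = set(EXPECTED_FEATURES), set(columns)
    return {"missing": sorted(expected - given), "unexpected": sorted(given - expected)}


report = check_columns(["Age (years)", "Diabetes pedigree function", "BMI"])
print(report["unexpected"])    # ['BMI']
print(len(report["missing"]))  # 6
```

With a pandas DataFrame, the same check could be run as `check_columns(new_data.columns)`.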


In [34]:
import pandas as pd
new_data = pd.DataFrame({
    'Number of times pregnant': [4],
    'Plasma glucose concentration a 2 hours in an oral glucose tolerance test': [76],
    'Diastolic blood pressure (mm Hg)': [35],
    'Triceps skin fold thickness (mm)': [0],
    '2-Hour serum insulin (mu U/ml)': [20],
    'Body mass index (weight in kg/(height in m)^2)': [34.3],
    'Diabetes pedigree function': [0.777],
    'Age (years)': [45],
})

In [35]:
from pycaret.classification import predict_model
predictions = predict_model(loaded_model, data=new_data)

In [36]:
print(predictions)

   Number of times pregnant  \
0                         4   

   Plasma glucose concentration a 2 hours in an oral glucose tolerance test  \
0                                                 76                          

   Diastolic blood pressure (mm Hg)  Triceps skin fold thickness (mm)  \
0                                35                                 0   

   2-Hour serum insulin (mu U/ml)  \
0                              20   

   Body mass index (weight in kg/(height in m)^2)  Diabetes pedigree function  \
0                                       34.299999                       0.777   

   Age (years)  prediction_label  prediction_score  
0           45                 0              0.66  


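The person is 45 years old.

The model predicts 0, i.e., no diabetes.

The `prediction_score` of 0.66 is the model's estimated probability for the predicted class (0, no diabetes), not the probability of having diabetes. The implied probability of diabetes is therefore about 0.34 (34%), so this is a moderately confident negative prediction.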
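
When reading this output, it helps to convert the label and score into a probability for the positive class. The sketch below assumes that `prediction_score` is the probability of the predicted label. That assumption matches the output here: a label of 0 at a 0.5 threshold cannot go with a 0.66 probability of class 1. The helper `positive_class_probability` is hypothetical.

```python
def positive_class_probability(prediction_label: int, prediction_score: float) -> float:
    """Convert (label, score of the predicted label) into P(class == 1) for a binary model."""
    return prediction_score if prediction_label == 1 else 1.0 - prediction_score


p_diabetes = positive_class_probability(prediction_label=0, prediction_score=0.66)
print(f"{p_diabetes:.2f}")  # 0.34
```

If per-class probabilities are needed directly, `predict_model()` has a `raw_score` option that may be worth checking in the PyCaret documentation for your version.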