#  **<span style="color:orange">Binary Classification  Tutorial (CLF101) - Level Beginner</span>**

---

### **🧪 Lab Overview** 
This beginner-friendly lab introduces **binary classification with PyCaret's classification module**. You'll use a credit scoring dataset to build a model that predicts whether an applicant is likely to default (Yes/No). You will also learn **how to deploy the model with Gradio and explore deployment on Azure and Google Cloud Platform (GCP)**.

---


### 📘 Lab Scenario
You are working with a dataset from the finance domain. Your objective is to **predict if a customer is a credit risk or not**. By using PyCaret, you’ll:

* Load and preprocess the data automatically.

* Compare multiple classification models.

* Finalize and deploy the best-performing model.

---

### 🎯 Lab Goals
📥 **Load and prepare a dataset** for binary classification.

⚙️ Use **PyCaret’s Functional API** to automate ML workflows.

🧠 **Compare and select the best model** using compare_models().

💾 **Finalize** and save your trained model.

🌐 Create a **Gradio web app** for real-time predictions.

☁️ Explore deployment using **Azure and GCP**.

---

### **🛠 Step 1: Install & Import Required Libraries**

We’ll start by importing the essential libraries to build and deploy our classification model.

#### 🔍 Explanation
- **pycaret.classification** – Simplifies the process of training and comparing classification models.

- **joblib** – Saves and loads trained models for later use.

### **🧠 Step 2: Import PyCaret’s Classification Module**

We import the necessary PyCaret module to work on classification problems.

#### **🔍 Explanation**:

Before we can start building classification models, we need to import the classification module from PyCaret.

```pycaret.classification```: This module provides all the tools needed for classification tasks.

It includes functions to:

- **Load and prepare datasets**.

- Set up the **machine learning environment**.

- **Train, compare, tune, and evaluate models**.

- **Save and deploy** trained models easily.

✅ This import ensures we have access to all classification functionalities from PyCaret with a single line of code.

In [1]:
from pycaret.classification import *


### **Step 3: Downgrade NumPy to a Compatible Version**

We now downgrade to a version that works well with the rest of our machine learning stack.

#### **🔍 Updated Explanation**

- **PyCaret 3.3.2** and many other packages require **NumPy < 2.0**.

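- So we pin **NumPy 1.26.4**, the last release in the 1.x series. If NumPy was already imported in this session, restart the kernel after reinstalling so the pinned version is the one that gets loaded.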

- This fixes version conflicts that may break model training or evaluation.
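Before importing PyCaret, it can help to confirm which NumPy version is actually installed. The following is a minimal sketch that uses only the standard library; the helper names `is_numpy_1x` and `installed_version` are illustrative and are not part of PyCaret or NumPy.

```python
from importlib import metadata


def is_numpy_1x(version: str) -> bool:
    """Return True when a version string such as '1.26.4' has a major version below 2."""
    major = version.split(".")[0]
    return major.isdigit() and int(major) < 2


def installed_version(package: str):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


numpy_version = installed_version("numpy")
if numpy_version is None:
    print("NumPy is not installed")
elif is_numpy_1x(numpy_version):
    print(f"NumPy {numpy_version} is below 2.0")
else:
    print(f"NumPy {numpy_version} is 2.0 or newer; pin 1.26.4 and restart the kernel")
```

The check reads package metadata rather than importing NumPy, so it also works in a kernel where an incompatible version would fail on import.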

In [7]:
!pip install numpy==1.26.4 --force-reinstall --upgrade --no-cache-dir


Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-win_amd64.whl (15.5 MB)

### **Step 4: Install PyCaret with Compatible NumPy Version**

To avoid compatibility issues, we first downgrade NumPy and then install PyCaret.

#### **🔍 Explanation**

- ```!pip install numpy==1.26.4 --force-reinstall --upgrade --no-cache-dir```

     - **Pins NumPy** to version 1.26.4, which is compatible with PyCaret.

     - ```--force-reinstall```: Reinstalls the package even if NumPy is already installed.

     - ```--no-cache-dir```: Downloads a fresh copy instead of reusing pip's cache.

- ```!pip install pycaret```

     - Installs the **PyCaret library**, which automates tasks like preprocessing, model selection, tuning, and evaluation.

     - Useful for rapid experimentation and building production-ready ML models with minimal code.

- ```from pycaret.classification import *```

     - Imports **PyCaret’s classification module**, giving access to setup, compare_models, create_model, tune_model, etc.

✅ After this step, we’re ready to load the dataset and start the machine learning workflow.

In [9]:
!pip install numpy==1.26.4 --force-reinstall --upgrade --no-cache-dir
!pip install pycaret
from pycaret.classification import *



### **📊 Step 5: Load Sample Dataset from PyCaret**

We will now load a sample classification dataset provided by PyCaret to work with.

#### **🔍 Explanation**

- ```from pycaret.datasets import get_data```:
Imports the function to fetch sample datasets available in PyCaret.

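- ```get_data('credit')```:
Loads the **"credit"** dataset, a sample dataset used to demonstrate credit risk classification.
It has 23 feature columns, including ```LIMIT_BAL``` (credit limit), ```AGE```, the repayment-status columns ```PAY_1``` to ```PAY_6```, and the bill and payment amounts ```BILL_AMT1``` to ```BILL_AMT6``` and ```PAY_AMT1``` to ```PAY_AMT6```, plus the target variable ```default```.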

✅ This step provides a ready-to-use dataset for model training and evaluation.

In [10]:
from pycaret.datasets import get_data
dataset = get_data('credit')

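| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 2 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 3 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
| 4 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 19394.0 | 19619.0 | 20024.0 | 2500.0 | 1815.0 | 657.0 | 1000.0 | 1000.0 | 800.0 | 0 |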


### **🔄 Step 6: Split the Dataset into Training and Unseen Data**

We will now split the dataset into two parts: one for model training and the other for making predictions later.

#### **🔍 Explanation**

- ```dataset.sample(frac=0.95, random_state=786)```:

     - This randomly selects **95%** of the dataset for training.

     - ```frac=0.95``` ensures 95% of the data is used.

     - ```random_state=786``` ensures the selection is reproducible.

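- ```dataset.drop(data.index)```:
The remaining 5% of the rows become the unseen data for predictions. The drop relies on the original row labels of the sample, so it must run before the sample's index is reset. If the index were reset first, the sample would be relabelled 0 to 22799 and the drop would simply remove the first 22,800 rows, leaving "unseen" rows that may also be in the training sample.

- ```.reset_index(drop=True)```:
Resets the index of each resulting DataFrame so it is clean (without the old index values). It is applied only after the drop.

- ```print('Data for Modeling: ' + str(data.shape))```:
Prints the dimensions (rows and columns) of the data being used for training.

- ```print('Unseen Data For Predictions: ' + str(data_unseen.shape))```:
Prints the dimensions of the unseen data that will be used for testing and making predictions.

✅ This step divides the data into training data and unseen data to evaluate the model after training.

In [11]:
# Sample 95% for modeling, keeping the original row labels for now
data = dataset.sample(frac=0.95, random_state=786)
# Drop the sampled rows by their original labels, then reset both indexes
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)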

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (22800, 24)
Unseen Data For Predictions: (1200, 24)
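The order of `drop` and `reset_index` matters in this kind of split. The sketch below shows why on a toy DataFrame; the 20-row frame and the `row_id` column are made up for illustration and stand in for the credit dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset: 20 rows with a visible row identifier.
toy = pd.DataFrame({"row_id": range(20)})

# Safe order: sample, drop by the ORIGINAL labels, then reset both indexes.
train = toy.sample(frac=0.8, random_state=786)
unseen = toy.drop(train.index).reset_index(drop=True)
train = train.reset_index(drop=True)
shared = set(train["row_id"]) & set(unseen["row_id"])
print("rows shared by train and unseen:", len(shared))

# Risky order: resetting first relabels the sample as 0..15, so drop() removes
# the first 16 rows of `toy` and always leaves rows 16..19, sampled or not.
train_reset = toy.sample(frac=0.8, random_state=786).reset_index(drop=True)
leftover = toy.drop(train_reset.index)
print("rows left by the risky order:", leftover["row_id"].tolist())
```

With the risky order the shapes still look right (16 and 4 rows here), which is why the problem is easy to miss: only a check on the actual rows reveals whether the two sets overlap.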


### **⚙️ Step 7: Initialize PyCaret Setup**

In this step, we initialize the PyCaret setup with the training data. This step prepares the environment for model training.

#### **🔍 Explanation**

- ```setup(data = data, target = 'default', session_id=123)```:

  - ```data = data```: Specifies the training dataset that we created in Step 6 (95% of the original dataset).

  - ```target = 'default'```: Specifies the column we are trying to predict. Here, the target column is ```'default'```, a binary flag where 1 means the customer defaulted and 0 means they did not.

  - ```session_id=123```: Sets the random seed (session ID) so that random steps, such as the train/test split, give the same results each time you run the setup.

✅ This step initializes the PyCaret environment, performs preprocessing, and splits the data into a training set and a hold-out test set (70% and 30% by default), ready for model training.

In [13]:
exp_clf101 = setup(data = data, target = 'default', session_id=123)

| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | default |
| 2 | Target type | Binary |
| 3 | Original data shape | (22800, 24) |
| 4 | Transformed data shape | (22800, 24) |
| 5 | Transformed train set shape | (15959, 24) |
| 6 | Transformed test set shape | (6841, 24) |
| 7 | Numeric features | 23 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


### **🤖 Step 8: Create the Random Forest Model**

In this step, we create the Random Forest model using the PyCaret library.

#### **🔍 Explanation**

- ```create_model('rf')```:

  - ```'rf'```: Specifies the model type you want to create. Here, ```'rf'``` stands for **Random Forest**, which is an ensemble learning method that combines multiple decision trees to make predictions.

  - ```create_model```: This function in PyCaret creates and trains the specified model. In this case, it trains a Random Forest classifier on the training set prepared by ```setup``` and evaluates it with 10-fold cross-validation, reporting one row of metrics per fold.

✅ This step quickly sets up the Random Forest classifier for your classification task.

In [14]:
rf = create_model('rf')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8227 | 0.7717 | 0.4108 | 0.6591 | 0.5061 | 0.4051 | 0.4219 |
| 1 | 0.8271 | 0.7726 | 0.4023 | 0.686 | 0.5071 | 0.4108 | 0.4323 |
| 2 | 0.8233 | 0.8079 | 0.4108 | 0.6621 | 0.507 | 0.4065 | 0.4237 |
| 3 | 0.812 | 0.7501 | 0.3711 | 0.6268 | 0.4662 | 0.3611 | 0.3794 |
| 4 | 0.8139 | 0.7623 | 0.3541 | 0.6443 | 0.457 | 0.356 | 0.3793 |
| 5 | 0.8221 | 0.7827 | 0.3909 | 0.6667 | 0.4929 | 0.3937 | 0.4144 |
| 6 | 0.812 | 0.7607 | 0.3711 | 0.6268 | 0.4662 | 0.3611 | 0.3794 |
| 7 | 0.8258 | 0.7959 | 0.4023 | 0.6794 | 0.5053 | 0.4079 | 0.4286 |
| 8 | 0.8208 | 0.7602 | 0.3739 | 0.6701 | 0.48 | 0.3821 | 0.4058 |
| 9 | 0.8138 | 0.7839 | 0.3824 | 0.6308 | 0.4762 | 0.3711 | 0.3883 |


### **🤖 Step 9: Tune the Random Forest Model**

In this step, we tune the Random Forest model to optimize its hyperparameters for better performance.

#### **🔍 Explanation**

- ```tune_model(rf)```:

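  - ```tune_model```: This function in PyCaret performs hyperparameter tuning on the model provided as an argument. By default it runs a **randomized search**: it samples a limited number of hyperparameter combinations (10 in this run) and scores each one with 10-fold cross-validation. Tuning is not guaranteed to improve every metric, so compare the tuned results with the original ones.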

  - ```rf```: Refers to the Random Forest model you previously created. The function will optimize the hyperparameters of this model to improve its performance on the dataset.

✅ This step helps improve the model's performance by adjusting parameters like the number of trees (estimators), maximum depth, and others, so that it can make more accurate predictions.
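To make the idea of a randomized search concrete, here is a minimal sketch that uses scikit-learn's `RandomizedSearchCV` directly on synthetic data. It illustrates the technique only and is not PyCaret's internal code; the search space, sample size, and seeds are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary problem so the search finishes quickly.
X, y = make_classification(n_samples=300, n_features=10, random_state=123)

# Hypothetical search space: 3 x 4 = 12 possible combinations.
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [2, 4, 6, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=123),
    param_distributions=param_distributions,
    n_iter=10,           # try 10 randomly chosen combinations
    cv=10,               # score each one with 10-fold cross-validation
    scoring="accuracy",
    random_state=123,
)
search.fit(X, y)

print("candidates tried:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 4))
```

Ten candidates scored on ten folds each give 100 model fits in total, which is why a randomized search is much cheaper than trying every combination in a large grid.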

In [None]:
# Tune the Random Forest model created in Step 8
tuned_rf = tune_model(rf)

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8202 | 0.7591 | 0.3881 | 0.6587 | 0.4884 | 0.388 | 0.408 |
| 1 | 0.8233 | 0.7609 | 0.3711 | 0.6859 | 0.4816 | 0.3863 | 0.4128 |
| 2 | 0.8233 | 0.7875 | 0.3598 | 0.694 | 0.4739 | 0.3803 | 0.41 |
| 3 | 0.817 | 0.7373 | 0.3598 | 0.658 | 0.4652 | 0.3661 | 0.3904 |
| 4 | 0.8189 | 0.7613 | 0.3598 | 0.6684 | 0.4678 | 0.3703 | 0.3961 |
| 5 | 0.8189 | 0.768 | 0.3456 | 0.6778 | 0.4578 | 0.3626 | 0.3922 |
| 6 | 0.8139 | 0.7321 | 0.3456 | 0.6489 | 0.451 | 0.3513 | 0.3766 |
| 7 | 0.8315 | 0.7566 | 0.3428 | 0.7658 | 0.4736 | 0.3902 | 0.435 |
| 8 | 0.8195 | 0.7315 | 0.3711 | 0.665 | 0.4764 | 0.3778 | 0.4013 |
| 9 | 0.8163 | 0.7466 | 0.3994 | 0.6351 | 0.4904 | 0.3854 | 0.4008 |



Fitting 10 folds for each of 10 candidates, totalling 100 fits
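The per-fold tables are easier to compare once they are averaged. The sketch below copies the Accuracy and Recall columns from the untuned and tuned Random Forest cross-validation tables in this tutorial and computes their means with the standard library; the variable names are illustrative.

```python
from statistics import mean

# Per-fold scores copied from the untuned and tuned cross-validation tables.
base_accuracy = [0.8227, 0.8271, 0.8233, 0.8120, 0.8139,
                 0.8221, 0.8120, 0.8258, 0.8208, 0.8138]
tuned_accuracy = [0.8202, 0.8233, 0.8233, 0.8170, 0.8189,
                  0.8189, 0.8139, 0.8315, 0.8195, 0.8163]
base_recall = [0.4108, 0.4023, 0.4108, 0.3711, 0.3541,
               0.3909, 0.3711, 0.4023, 0.3739, 0.3824]
tuned_recall = [0.3881, 0.3711, 0.3598, 0.3598, 0.3598,
                0.3456, 0.3456, 0.3428, 0.3711, 0.3994]

# Print the mean of each metric before and after tuning.
for name, base, tuned in [
    ("Accuracy", base_accuracy, tuned_accuracy),
    ("Recall", base_recall, tuned_recall),
]:
    print(f"{name}: base={mean(base):.4f} tuned={mean(tuned):.4f}")
```

On these numbers, mean accuracy rises slightly after tuning (about 0.819 to 0.820) while mean recall falls (about 0.387 to 0.364), so whether the tuned model is "better" depends on which metric matters for the credit-risk use case.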


### **🤖 Step 10: Predict with Tuned Random Forest Model**

Now that we have a fine-tuned Random Forest model, it’s time to make predictions using it.

#### **🔍 Explanation**

- ```predict_model(tuned_rf)```:

  - ```predict_model```: When called without a ```data``` argument, this PyCaret function scores the model on the hold-out (test) set that ```setup``` set aside, and reports metrics such as Accuracy, AUC, Recall, and F1.

  - ```tuned_rf```: The Random Forest model tuned with ```tune_model``` in Step 9.

The function returns a DataFrame of the hold-out rows with the predictions added as new columns (```prediction_label``` and ```prediction_score``` in PyCaret 3).

✅ **Result**: Calling ```predict_model(tuned_rf)``` shows how the tuned model performs on data it was not trained on, which is a fairer check than the cross-validation scores alone.

In [17]:
# Evaluate the tuned model from Step 9 on the hold-out set created by setup()
predictions = predict_model(tuned_rf)


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.8139 | 0.745 | 0.3457 | 0.6489 | 0.4511 | 0.3513 | 0.3766 |
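The metric columns in these tables all derive from four counts: true positives, true negatives, false positives, and false negatives. The sketch below computes Accuracy, Recall, Precision, and F1 from scratch on a tiny made-up example; `binary_metrics` is an illustrative helper, not a PyCaret function.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision and F1 for 0/1 labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-defaults cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults missed

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Made-up labels: 4 actual defaults, 4 actual non-defaults.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

A model can have high accuracy and low recall at the same time when defaults are the minority class, which is the pattern visible in the Random Forest results above: most predictions are right overall, but many actual defaults are missed.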


### **🤖 Step 11: Rerun the Workflow and Inspect the Predictions**

In this step, we rerun the create, tune, and predict calls in a single cell and print the resulting predictions.

#### **🔍 Explanation**

- ```create_model('rf')``` and ```tune_model(rf)```: Rebuild the Random Forest model and its tuned version, as in Steps 8 and 9.

- ```predict_model(tuned_rf)```: Scores the tuned model on the hold-out set and returns a DataFrame with the predictions added as new columns.

- ```print(predictions)```: Displays that DataFrame so you can compare the predicted labels with the actual ```default``` values row by row.

✅ **Result**: You see the hold-out metrics for the tuned model together with the individual predictions behind them.

In [53]:
# Step 1: Create the Random Forest model
rf = create_model('rf')

# Step 2: Tune the Random Forest model
tuned_rf = tune_model(rf)

# Step 3: Use the tuned model to make predictions
predictions = predict_model(tuned_rf)

# Optionally, display the predictions
print(predictions)





### **🚀 Step 12: Finalize the Tuned Model**

In this step, we finalize the tuned model to make it ready for use and deployment.

#### **🔍 Explanation**

- ```finalize_model(tuned_rf)```:

  - This function takes the model you have just tuned and retrains it on all of the data passed to ```setup```, that is, the training set plus the hold-out set.

  - The hyperparameters stay the same; the model simply gets to learn from every available row before it is saved or deployed.

✅ **Result**: The model is now trained on the full modeling dataset and ready for saving, sharing, or deploying. The hold-out set can no longer serve as an independent test for this finalized model.

In [54]:
final_rf = finalize_model(tuned_rf)

### **📦 Step 13: View Final Random Forest Model Parameters**

In this step, we check the final tuned Random Forest model's parameters before deployment.

#### **🔍 Explanation**

- **final_rf**: This is the finalized version of the tuned model, ready for deployment or saving.

- **print()**: Displays the model’s details, including all the hyperparameters chosen during tuning.

✅ **Result**:
You’ll see the complete configuration of the final Random Forest model—like the number of trees, max depth, and other settings—so you know exactly what version you're deploying.

In [21]:
#Final Random Forest model parameters for deployment
print(final_rf)

Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['LIMIT_BAL', 'SEX', 'EDUCATION',
                                             'MARRIAGE', 'AGE', 'PAY_1',
                                             'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
                                             'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
                                             'BILL_AMT3', 'BILL_AMT4',
                                             'BILL_AMT5', 'BILL_AMT6',
                                             'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
                                             'PAY_AMT4', 'PAY_AMT5',
                                             'PAY_AMT6'],
                                    transform...
                 RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                                        class_weight={}, criterion='entropy',
            

### **🧮 Step 14: Predict Using Finalized Model**

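In this step, we use the finalized Random Forest model to make predictions on the hold-out data.

#### **🔍 Explanation**

**predict_model()**: Called without a ```data``` argument, it scores the model on the hold-out set from ```setup```. To score the 5% of rows set aside in Step 6 instead, pass them explicitly: ```predict_model(final_rf, data=data_unseen)```.

**final_rf**: This is your trained and finalized model with the tuned hyperparameters.

✅ **Result**:
You get predictions and performance metrics (like Accuracy, AUC, etc.) on the hold-out data. Because ```finalize_model``` has already trained ```final_rf``` on those rows, these metrics are optimistic; the unseen data from Step 6 gives a fairer estimate.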

In [55]:
predict_model(final_rf);

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Random Forest Classifier | 0.9048 | 0.9824 | 0.9753 | 0.798 | 0.8778 | 0.801 | 0.8119 |
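Scores on the hold-out set tend to jump once a model has been retrained on that same hold-out data. The sketch below reproduces the effect with plain scikit-learn on synthetic data; it is an illustration of the principle under assumed settings (sample size, noise level, seeds), not PyCaret's `finalize_model` implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 20% label noise so the task is not trivial.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2,
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Honest estimate: the model never sees the test rows during training.
model = RandomForestClassifier(n_estimators=100, random_state=123)
model.fit(X_train, y_train)
honest = model.score(X_test, y_test)

# "Finalized" model: retrained on train + test, then scored on the same test rows.
final_model = RandomForestClassifier(n_estimators=100, random_state=123)
final_model.fit(X, y)
optimistic = final_model.score(X_test, y_test)

print(f"accuracy before retraining on test rows: {honest:.3f}")
print(f"accuracy after retraining on test rows:  {optimistic:.3f}")
```

The second number is higher because the forest has memorized the rows it is being scored on, so it says little about performance on new applicants. Data the model has never seen, such as `data_unseen` from Step 6, is the appropriate check at this stage.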


### **☁️ Step 15: Install Cloud Storage Libraries**

In this step, we install the necessary Python packages to connect and work with cloud platforms like Azure and AWS.

#### **🔍 Explanation**

**azure-storage-blob**: Allows your notebook to upload/download files to and from Azure Blob Storage.

**awscli** (*commented out*): Installs the AWS Command Line Interface, useful for interacting with AWS services like S3. Uncomment the line only if you also plan to use AWS.

**✅ Why it matters**:
These tools are essential for saving models or data to the cloud for storage or deployment.

In [56]:
! pip install azure-storage-blob
# ! pip install awscli





### **🔐 Step 16: Set Azure Storage Connection String**

In this step, we define the Azure Blob Storage connection string needed to authenticate and connect to your storage account.

#### **🔍 Explanation**

**connect_str**: This is your secure connection string, which contains credentials like the account name and key.

It allows your code to access Azure Blob Storage so you can upload or download files.

**✅ Why it's needed**:
Without this string, Python won't be able to talk to your Azure storage account securely.

> **🔒 Note**: Keep your connection string secret—never share it in public notebooks or with unauthorized users.
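If you need to log or display a connection string while debugging, one option is to mask the key first. The helper below is a minimal sketch using only string operations; `mask_connection_string` and the example values are made up for illustration and are not part of the Azure SDK.

```python
def mask_connection_string(conn_str: str) -> str:
    """Return the connection string with the AccountKey value hidden."""
    masked_parts = []
    for part in conn_str.split(";"):
        # partition() splits on the FIRST '=' only, so '=' padding in the key is kept intact.
        name, sep, value = part.partition("=")
        if name == "AccountKey" and value:
            part = f"{name}{sep}***"
        masked_parts.append(part)
    return ";".join(masked_parts)


# Fake example value, not a real credential.
example = (
    "DefaultEndpointsProtocol=https;AccountName=demoaccount;"
    "AccountKey=ZmFrZS1rZXk=;EndpointSuffix=core.windows.net"
)
print(mask_connection_string(example))
```

The account name and endpoint stay visible, which is usually enough to confirm that the right storage account is configured without exposing the secret in notebook output.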

In [58]:
# Enter your own connection string when running in Google Colab
connect_str = 'DefaultEndpointsProtocol=https;AccountName=<your-account-name>;AccountKey=<your-account-key>;EndpointSuffix=core.windows.net' #@param {type:"string"}
# Do not print the full string: it contains the account key
print('Connection string set:', bool(connect_str))

Connection string set: True


### **🔐 Step 17: Set Azure Connection String as an Environment Variable**

We store the Azure Blob Storage connection string in an environment variable so that other functions and tools in this session can read it.

#### **🔍 Explanation**

```import os```: Loads Python’s built-in module for interacting with the operating system.

```os.environ[...] = ...```: Sets a temporary environment variable (in this case, your Azure Storage connection string).

```'AZURE_STORAGE_CONNECTION_STRING'```: This is the key used by Azure SDKs to find your storage credentials.

```connect_str```: Your actual connection string value.

**✅ Why we do this**: Code that needs the credentials can read them from the environment instead of hardcoding the string, and the variable lasts only for this session.

In [61]:
import os
os.environ['AZURE_STORAGE_CONNECTION_STRING']= connect_str

### **🔐 Step 18: Verify Azure Connection String (Windows Version)**

In this step, we check whether the Azure connection string has been successfully set in our environment variables.

#### **🔍 Explanation**

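```os.environ.get()```: Retrieves the value of the environment variable, or ```None``` if it is not set.

```'AZURE_STORAGE_CONNECTION_STRING'```: The key name of the environment variable we previously set.

**✅ Result**: This prints ```True``` if the variable was set correctly, confirming that your Python environment is ready to interact with Azure Blob Storage, without echoing the secret itself. The shell form ```! echo $AZURE_STORAGE_CONNECTION_STRING``` expands the variable only on Linux and macOS; on Windows it prints the literal text ```$AZURE_STORAGE_CONNECTION_STRING```.

In [62]:
print(os.environ.get('AZURE_STORAGE_CONNECTION_STRING') is not None)

True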


### **🔑 Step 19: Retrieve Azure Storage Connection String**

In this step, we retrieve the Azure Storage Connection String from the environment variables to ensure secure access to Azure Blob Storage.

#### **🔍 Explanation**

```os.getenv```: This function retrieves the value of the environment variable AZURE_STORAGE_CONNECTION_STRING set previously. It helps in securely storing sensitive information like connection strings.

```connect_str```: The retrieved connection string is stored in this variable, which can then be used to interact with Azure Blob Storage.

**✅ Result**: This step ensures that the connection string is correctly stored and can be used to connect to Azure Blob Storage, making the process more secure by keeping the sensitive connection string out of the code.

In [64]:
# Store the value in connect_str instead of displaying it
connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

### **🔧 Step 20: Install Azure Storage Blob Package**

In this step, we install the azure-storage-blob package, which is required for interacting with Azure Blob Storage in Python.

#### **🔍 Explanation**

**!pip install azure-storage-blob**: This command installs the Azure Storage Blob SDK for Python, which allows you to interact with Azure Blob Storage. This package provides methods for uploading, downloading, and managing blobs in Azure Storage.

The installation ensures that you have all the necessary dependencies and tools to connect and interact with Azure Blob Storage through your Python code.

**✅ Result**: After running this command, you can easily access and manage files within Azure Blob Storage using Python.

In [65]:
!pip install azure-storage-blob




### **🔧 Step 21: Fixing Dependency Conflicts for Azure Blob Storage Access**

To avoid errors related to incompatible versions of cryptography, pyopenssl, and azure-storage-blob, we will uninstall the existing versions and reinstall specific compatible ones.

#### **🧠 Explanation**
- ```uninstall -y```: Removes old versions without asking for confirmation.

- ```--no-cache-dir```: Ensures pip installs fresh versions, avoiding old cached versions.

- ```cryptography==43.0.3```: Satisfies the ```cryptography<44,>=41.0.5``` requirement declared by ```pyopenssl 24.2.1```.

- These versions ensure the ```BlobServiceClient``` works without errors like *"AzureSigningError: Incorrect padding"*.
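An "Incorrect padding" message is what Python's Base64 decoder reports for a malformed string, so before changing package versions it may be worth checking that the `AccountKey` in the connection string was copied in full. That possible cause is an assumption, not something this notebook confirms. The sketch below uses only the standard library; `account_key_is_valid_base64` and the example values are illustrative.

```python
import base64
import binascii


def account_key_is_valid_base64(conn_str: str) -> bool:
    """Check that the AccountKey field of a connection string decodes as Base64."""
    # Split "Name=Value" pairs on the first '=' only, since keys end in '=' padding.
    fields = dict(
        part.partition("=")[::2] for part in conn_str.split(";") if part
    )
    key = fields.get("AccountKey", "")
    if not key:
        return False
    try:
        base64.b64decode(key, validate=True)
    except binascii.Error:
        return False
    return True


# Fake example values, not real credentials.
good = "AccountName=demoaccount;AccountKey=ZmFrZS1rZXk="
truncated = "AccountName=demoaccount;AccountKey=ZmFrZS1rZXk"  # padding lost

print("good key decodes:", account_key_is_valid_base64(good))
print("truncated key decodes:", account_key_is_valid_base64(truncated))
```

A check like this only confirms that the key is well-formed Base64; it cannot tell whether the key is the right one for the storage account.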

In [3]:
# Uninstall previous versions
!pip uninstall -y pyopenssl azure-storage-blob

# Reinstall with compatible versions (and skip cache)
# pyopenssl 24.2.1 requires cryptography<44,>=41.0.5
!pip install --no-cache-dir cryptography==43.0.3
!pip install --no-cache-dir pyopenssl==24.2.1
!pip install --no-cache-dir azure-storage-blob==12.19.1


Found existing installation: pyOpenSSL 24.2.1
Uninstalling pyOpenSSL-24.2.1:
  Successfully uninstalled pyOpenSSL-24.2.1
Found existing installation: azure-storage-blob 12.19.1
Uninstalling azure-storage-blob-12.19.1:
  Successfully uninstalled azure-storage-blob-12.19.1
Collecting pyopenssl==24.2.1
  Downloading pyOpenSSL-24.2.1-py3-none-any.whl.metadata (13 kB)
Collecting cryptography<44,>=41.0.5 (from pyopenssl==24.2.1)
  Downloading cryptography-43.0.3-cp39-abi3-win_amd64.whl.metadata (5.4 kB)
Downloading pyOpenSSL-24.2.1-py3-none-any.whl (58 kB)
Downloading cryptography-43.0.3-cp39-abi3-win_amd64.whl (3.1 MB)

### **🔧 Step 22: Verify Installed Package Versions**

In this step, we check whether the correct versions of ```cryptography``` and ```pyopenssl``` are installed, ensuring compatibility with the Azure SDK.

#### **🔍 Explanation**

- ```!pip show <package_name>``` is used to display details about the installed Python packages.

- This helps confirm whether the versions you installed earlier are active in your environment.

- It's important because version mismatches (especially for ```cryptography``` and ```pyopenssl```) can cause errors while connecting securely to Azure Blob Storage.

**✅ Result**
You should see output like:

- ```Name: cryptography```, ```Version: 43.0.3```

- ```Name: pyopenssl```, ```Version: 24.2.1```

In [4]:
!pip show cryptography
!pip show pyopenssl


Name: cryptography
Version: 43.0.3
Summary: cryptography is a package which provides cryptographic recipes and primitives to Python developers.
Home-page: https://github.com/pyca/cryptography
Author: The cryptography developers <cryptography-dev@python.org>
Author-email: The Python Cryptographic Authority and individual contributors <cryptography-dev@python.org>
License: Apache-2.0 OR BSD-3-Clause
Location: C:\Users\Hello\anaconda3\Lib\site-packages
Requires: cffi
Required-by: anaconda-cloud-auth, azure-storage-blob, conda-content-trust, evidently, paramiko, pyOpenSSL, Scrapy, service-identity
Name: pyOpenSSL
Version: 24.2.1
Summary: Python wrapper module around the OpenSSL library
Home-page: https://pyopenssl.org/
Author: The pyOpenSSL developers
Author-email: cryptography-dev@python.org
License: Apache License, Version 2.0
Location: C:\Users\Hello\anaconda3\Lib\site-packages
Requires: cryptography
Required-by: Scrapy
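
The same check can also be done from Python instead of reading the `pip show` output by eye. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is made up for illustration and is not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages this step cares about.
for name in ("cryptography", "pyOpenSSL"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Returning `None` for a missing package keeps the check usable in a notebook cell without a try/except around every call.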


### **🧪 Step 23: Test Azure Blob Storage Connection and List Containers**

This step checks if your connection to Azure Blob Storage is working and lists any existing containers.

#### **🔍 Explanation**

- ```from azure.storage.blob import BlobServiceClient```: Imports the client to interact with Blob storage.

- ```connect_str```: Your connection string copied from the Azure portal.

- ```BlobServiceClient.from_connection_string()```: Creates a client to access your Azure storage account.

- ```list_containers()```: Lists all the blob containers inside the specified storage account.

- Error handling is added to catch common issues:

   - ```ResourceNotFoundError```: Triggered if the storage account or container doesn't exist.

   - ```ServiceRequestError```: Happens when there’s a network or connection string issue.

   - ```Exception```: Catches anything else unexpected.

#### **📘 Concepts to Know**

**Azure Blob Storage**: Service for storing large amounts of unstructured data.

**BlobServiceClient**: Python client to interact with Blob storage.

**Connection String**: A unique string used to authenticate and connect to your Azure Storage.

**Try-Except Block**: Used to handle potential errors in a safe way.


**✅ Result**

If successful, it will print container names like ```📦 Container found: mycontainer```.
If no containers exist or an issue occurs, it will show an appropriate error message.

In [7]:
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceNotFoundError, ServiceRequestError

# ✅ Replace the placeholders with your own connection string from the Azure portal.
# Never commit a real AccountKey to a notebook or repository.
connect_str = "DefaultEndpointsProtocol=https;AccountName=<your-storage-account>;AccountKey=<your-account-key>;EndpointSuffix=core.windows.net"

try:
    # Create the BlobServiceClient
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    
    # Try to list containers
    containers = blob_service_client.list_containers()
    
    # Flag to check if any containers exist
    container_found = False
    for container in containers:
        container_found = True
        print(f"📦 Container found: {container.name}")
    
    # If no containers found
    if not container_found:
        print("❗ No containers found in the account.")
    
except ResourceNotFoundError:
    print("🚫 The storage account or container does not exist. Make sure the resources are created.")
except ServiceRequestError:
    print("🌐 Connection error. Check your network or connection string.")
except Exception as e:
    print(f"⚠️ An unexpected error occurred: {str(e)}")


🌐 Connection error. Check your network or connection string.
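
A connection error like the one above is often caused by an incomplete or mistyped connection string. The sketch below reads the string from an environment variable instead of hard-coding it, and checks that the required fields are present before any network call is made. It assumes the variable is named `AZURE_STORAGE_CONNECTION_STRING`; the helpers `parse_connection_string`, `missing_keys`, and `list_container_names` are hypothetical names chosen for this example, and the last one imports the Azure SDK lazily so the checks run without it.

```python
import os

REQUIRED_KEYS = {"AccountName", "AccountKey"}


def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' pairs into a dict.

    Only the first '=' separates key from value, because account keys
    are base64 strings that often end in '='.
    """
    parts = (p for p in conn_str.split(";") if "=" in p)
    return dict(p.split("=", 1) for p in parts)


def missing_keys(conn_str):
    """Return the required keys that are absent or empty, sorted by name."""
    fields = parse_connection_string(conn_str)
    return sorted(k for k in REQUIRED_KEYS if not fields.get(k))


def list_container_names(conn_str):
    """List container names; needs azure-storage-blob and network access."""
    from azure.storage.blob import BlobServiceClient  # imported lazily

    client = BlobServiceClient.from_connection_string(conn_str)
    return [container.name for container in client.list_containers()]


connect_str = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "")
problems = missing_keys(connect_str)
if problems:
    print(f"Connection string is missing: {', '.join(problems)}")
else:
    print("Connection string looks complete; call list_container_names(connect_str) to test it.")
```

Keeping the key out of the notebook also means the notebook can be shared without exposing the storage account.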


### **Step 24: 🔧 Fixing Azure Storage Blob Dependencies 🔧**

#### **Explanation:**
In this step, we resolve **compatibility issues** between the Azure Storage Blob library and its supporting packages. Python libraries often require specific dependency versions, and Azure's blob storage functionality depends on compatible versions of the cryptographic libraries ```cryptography``` and ```pyOpenSSL```. The cell below force-reinstalls ```azure-storage-blob``` and then confirms that the library can be imported.

#### **Key Concepts: 📚**

- **Version compatibility** is crucial when working with cloud storage libraries
- **Dependency management** is a common challenge in Python development
- The lab targets the downgraded packages **cryptography 39.0.1 and pyOpenSSL 21.0.0**. Note that ```--force-reinstall``` also reinstalls dependencies (the log below shows pip collecting cryptography 44.0.2), so re-run ```pip show``` afterwards to confirm which versions are active
- **Warning messages** about other packages (conda-content-trust and evidently) can be safely ignored for this lab

#### **Important Note**: ⚠️ 

If you see dependency conflict warnings after running this code, don't worry! These are just informational messages. As long as the final import statement works without errors, you're good to continue with the lab. The Azure Storage Blob functionality will work properly despite these warnings.

In [17]:
# Force-reinstall azure-storage-blob; pip also reinstalls its dependencies,
# so re-check the cryptography/pyOpenSSL versions afterwards with `pip show`
!pip install --upgrade --force-reinstall azure-storage-blob

# Test basic Azure blob functionality
import sys
try:
    import azure.storage.blob  # binds the name `azure` so __version__ can be read below
    from azure.storage.blob import BlobServiceClient
    print(f"Azure Storage Blob client library version: {azure.storage.blob.__version__}")
    print("Import successful")
except Exception as e:
    print(f"Error: {e}")
    print(f"Python version: {sys.version}")

Collecting azure-storage-blob
  Using cached azure_storage_blob-12.25.1-py3-none-any.whl.metadata (26 kB)
Collecting azure-core>=1.30.0 (from azure-storage-blob)
  Using cached azure_core-1.33.0-py3-none-any.whl.metadata (42 kB)
Collecting cryptography>=2.1.4 (from azure-storage-blob)
  Using cached cryptography-44.0.2-cp39-abi3-win_amd64.whl.metadata (5.7 kB)
Collecting typing-extensions>=4.6.0 (from azure-storage-blob)
  Using cached typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Collecting isodate>=0.6.1 (from azure-storage-blob)
  Using cached isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Collecting requests>=2.21.0 (from azure-core>=1.30.0->azure-storage-blob)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting six>=1.11.0 (from azure-core>=1.30.0->azure-storage-blob)
  Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting cffi>=1.12 (from cryptography>=2.1.4->azure-storage-blob)
  Using cached cffi-1.17.1-cp312-cp312-win

### **Step 25: Loading the Diabetes Dataset using PyCaret 📊**

This step loads the diabetes dataset from PyCaret's built-in dataset library for further analysis.

#### **Explanation:**

- The ```get_data()``` function from PyCaret is used to load built-in datasets.

- In this case, the ```'diabetes'``` dataset is being loaded, which is often used for classification tasks, where the goal is to predict whether a person has diabetes based on various features.

#### **Concepts You Should Know:**

- ```get_data()``` **Function**: Fetches datasets from PyCaret’s library for quick access.

- **Dataset Inspection**: Checking column names or first few rows of a dataset is important before performing any analysis.

- **Classification Dataset**: The diabetes dataset typically includes features like age, BMI, and glucose level to predict the presence of diabetes.

In [18]:
from pycaret.datasets import get_data

# Load diabetes dataset
diabetes = get_data('diabetes')

# Print the column names to check
print(diabetes.columns)



Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Index(['Number of times pregnant',
       'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
       'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age (years)', 'Class variable'],
      dtype='object')


### **Step 26: Loading and Setting Up the Diabetes Dataset 🍩**

This step demonstrates how to load a dataset and set it up for classification in PyCaret.

#### **Explanation with 🍩:**

- ```from pycaret.classification import *```: Imports all necessary functions and modules for classification tasks using PyCaret.

- ```from pycaret.datasets import get_data```: Imports the get_data function to load datasets available within PyCaret.

- ```diabetes = get_data('diabetes')```: Loads the built-in 'diabetes' dataset, which contains data about diabetic and non-diabetic patients, into a variable named ```diabetes```.

- ```setup(data=diabetes, target='Class variable')```: Initializes the PyCaret setup, specifying the dataset (```diabetes```) and the target column (```Class variable```) which we aim to predict (whether a patient has diabetes or not).

#### **Concepts to know**:

- **PyCaret 🧠**: A Python library used for easy machine learning automation.

- **Dataset setup ⚙️**: Prepares the data for further machine learning steps like model training and evaluation.

#### **Result 🚀**: 

This step loads the data and prepares it for model training in PyCaret, ensuring the target column is correctly identified.

In [19]:
from pycaret.classification import *
from pycaret.datasets import get_data

# Load and set up the diabetes dataset
diabetes = get_data('diabetes')

# Run setup and specify the correct target column
setup(data=diabetes, target='Class variable')



Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Unnamed: 0,Description,Value
0,Session id,5681
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


<pycaret.classification.oop.ClassificationExperiment at 0x24a82baa500>
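
The train and test shapes in the setup summary follow from the train/test split. Here is a small sketch of the arithmetic, assuming a 70/30 split (a `train_size` of 0.7) and that the test-set size is rounded up, which matches the numbers in the summary above. `split_sizes` is a hypothetical helper, not a PyCaret function.

```python
import math


def split_sizes(n_rows, train_size=0.7):
    """Return (n_train, n_test) with the test share rounded up."""
    n_test = math.ceil(n_rows * (1 - train_size))
    return n_rows - n_test, n_test


print(split_sizes(768))  # (537, 231)
```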

### **Step 27: Installing the Azure Storage Blob Library 📦**

This step demonstrates how to install the azure-storage-blob package, which is used to interact with Azure Blob Storage.

#### **Explanation with 📦**:

- ```pip install azure-storage-blob```: Installs the ```azure-storage-blob``` library using ```pip```, which allows Python applications to interact with Azure Blob Storage. This library helps in uploading, downloading, and managing blobs (files) in the cloud.

#### **Concepts to know**:

- **pip 🔧**: Python's package installer, used to install external libraries.

- **Azure Blob Storage ☁️**: A cloud storage service provided by Microsoft Azure, used for storing large amounts of unstructured data such as text and binary data.

#### **Result 🚀**:

After running this command, the ```azure-storage-blob``` package is installed and ready to be used in the project for managing Azure Blob Storage operations.

In [20]:
pip install azure-storage-blob


Note: you may need to restart the kernel to use updated packages.


### **Step 28: Model Creation, Tuning, Finalization, and Saving 🛠️**

This step walks through creating, tuning, finalizing, and saving a Random Forest model using PyCaret.

#### **Explanation with 🛠️:**

- ```from pycaret.datasets import get_data```: Imports the ```get_data``` function to load datasets.

- ```from pycaret.classification import *```: Imports all necessary functions for classification tasks from PyCaret.

- ```diabetes = get_data('diabetes')```: Loads the diabetes dataset.

- ```setup(data=diabetes, target='Class variable')```: Initializes the PyCaret setup, specifying the diabetes dataset and target column.

- ```rf = create_model('rf')```: Creates a Random Forest classifier model (```rf```).

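- ```tuned_rf = tune_model(rf)```: Tunes the hyperparameters of the ```rf``` model to improve performance. As the output below shows, PyCaret returns the original model when the tuned one does not outperform it.

- ```final_rf = finalize_model(tuned_rf)```: Finalizes the model by refitting it on the entire dataset (training set plus hold-out set), so it is ready for deployment and predictions.

- ```save_model(final_rf, 'rf_clf_model')```: Saves the finalized pipeline and model to a local file named ```rf_clf_model.pkl``` (PyCaret appends the ```.pkl``` extension).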

- ```model_local = load_model('rf_clf_model')```: (Optional) Loads the saved model back into the environment for further use.

#### **Concepts to know**:

- **Random Forest (RF) 🌲**: An ensemble learning method using multiple decision trees to improve classification accuracy.

- **Model Tuning ⚙️**: The process of adjusting hyperparameters to enhance the model's performance.

- **Model Finalization 🏁**: The process of preparing a model for deployment by refitting it on all available data, including the hold-out set.

- **Saving and Loading Models 💾**: Storing models for later use and loading them for predictions or further analysis.

#### **Result 🚀**: 

After this step, the Random Forest model is created, tuned, finalized, saved locally, and can be loaded back whenever needed for future predictions or analysis.

In [21]:
from pycaret.datasets import get_data
from pycaret.classification import *

# Load diabetes dataset
diabetes = get_data('diabetes')

# Set up the dataset for modeling
setup(data=diabetes, target='Class variable')  # Replace with correct target column if needed

# Create, tune, and finalize the model
rf = create_model('rf')  # Random Forest model
tuned_rf = tune_model(rf)  # Tune the model
final_rf = finalize_model(tuned_rf)  # Finalize the model

# Save the model to a local file
save_model(final_rf, 'rf_clf_model')

# Load the model back (optional)
model_local = load_model('rf_clf_model')


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Unnamed: 0,Description,Value
0,Session id,1916
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple


Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7963,0.8451,0.4737,0.9,0.6207,0.4992,0.5472
1,0.7593,0.8053,0.6842,0.65,0.6667,0.4785,0.4788
2,0.6852,0.7932,0.4737,0.5625,0.5143,0.2839,0.2862
3,0.7407,0.7992,0.5789,0.6471,0.6111,0.4176,0.419
4,0.7778,0.8158,0.6842,0.6842,0.6842,0.5128,0.5128
5,0.7222,0.8383,0.6316,0.6,0.6154,0.3982,0.3985
6,0.7407,0.8038,0.5789,0.6471,0.6111,0.4176,0.419
7,0.717,0.7825,0.5556,0.5882,0.5714,0.3604,0.3607
8,0.8491,0.8556,0.6667,0.8571,0.75,0.6443,0.6547
9,0.7736,0.8571,0.5556,0.7143,0.625,0.4664,0.474


Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7222,0.8466,0.3158,0.75,0.4444,0.2981,0.3477
1,0.7407,0.8466,0.6316,0.6316,0.6316,0.4316,0.4316
2,0.6852,0.791,0.4737,0.5625,0.5143,0.2839,0.2862
3,0.7407,0.7955,0.4737,0.6923,0.5625,0.3874,0.4014
4,0.7407,0.8075,0.6316,0.6316,0.6316,0.4316,0.4316
5,0.7222,0.8195,0.5263,0.625,0.5714,0.3682,0.3711
6,0.7778,0.8391,0.5263,0.7692,0.625,0.4749,0.4921
7,0.717,0.7762,0.5,0.6,0.5455,0.3424,0.3454
8,0.8491,0.8683,0.6111,0.9167,0.7333,0.6339,0.6592
9,0.7547,0.8222,0.5556,0.6667,0.6061,0.4301,0.4339


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Loaded
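
The message "Original model was better than the tuned model" can be checked against the two fold tables. The sketch below averages the Accuracy column of each table (values copied from the output above), assuming Accuracy is the metric being compared. The variable names are chosen for this example only.

```python
from statistics import mean

# Per-fold Accuracy for the original Random Forest (first table above)
original_acc = [0.7963, 0.7593, 0.6852, 0.7407, 0.7778,
                0.7222, 0.7407, 0.7170, 0.8491, 0.7736]

# Per-fold Accuracy for the tuned Random Forest (second table above)
tuned_acc = [0.7222, 0.7407, 0.6852, 0.7407, 0.7407,
             0.7222, 0.7778, 0.7170, 0.8491, 0.7547]

print(f"original: {mean(original_acc):.4f}")  # original: 0.7562
print(f"tuned:    {mean(tuned_acc):.4f}")  # tuned:    0.7450
```

The original model's mean accuracy is about one percentage point higher, which is consistent with the original model being returned.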


### **Step 29: Saving the Model Locally 💾**

This step demonstrates how to save a trained model to a local file for future use.

#### **Explanation with 💾**:

- ```from pycaret.classification import save_model```: Imports the ```save_model``` function from PyCaret, which is used to save trained models to a file.

- ```save_model(final_rf, 'rf_clf_model')```: Saves ```final_rf``` (the finalized Random Forest model together with its preprocessing pipeline) to a file named ```rf_clf_model.pkl```; PyCaret appends the ```.pkl``` extension. This allows you to store the model for future predictions or deployment without retraining it.

#### **Concepts to know**:

- **Model Saving 💾**: The process of storing a trained machine learning model to a file so that it can be loaded and used later.

- **File Storage 📂**: In this case, the model is saved to a local file, making it portable and easily accessible.

#### **Result 🚀**: 

The trained Random Forest model is saved as ```rf_clf_model.pkl``` on your local machine and can now be loaded and reused for predictions or further analysis without retraining.

In [22]:
from pycaret.classification import save_model
save_model(final_rf, 'rf_clf_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Number of times pregnant',
                                              'Plasma glucose concentration a 2 '
                                              'hours in an oral glucose '
                                              'tolerance test',
                                              'Diastolic blood pressure (mm Hg)',
                                              'Triceps skin fold thickness (mm)',
                                              '2-Hour serum insulin (mu U/ml)',
                                              'Body mass index (weight in '
                                              'kg/(height in m)^2)',
                                              'Diabetes pedigre...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                         class_w

### **Step 30: Loading the Saved Model 🔄**

This step demonstrates how to load a previously saved model from a local file for further use.

#### **Explanation with 🔄**:

- ```from pycaret.classification import load_model```: Imports the ```load_model``` function from PyCaret, which is used to load a saved model.

- ```model_local = load_model('rf_clf_model')```: Loads the saved model named ```rf_clf_model``` back into the environment for further use, such as making predictions or evaluations.

#### **Concepts to know**:

- **Model Loading 🔄**: The process of retrieving a saved model from storage and loading it into memory for reuse.

- **Reusability ♻️**: Once the model is saved, it can be loaded multiple times, eliminating the need to retrain it.

#### **Result 🚀**:

The saved model rf_clf_model is loaded back into the environment, ready for use in making predictions or further analysis without retraining the model.

In [23]:
from pycaret.classification import load_model
model_local = load_model('rf_clf_model')

Transformation Pipeline and Model Successfully Loaded
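
To make the save/load round trip concrete without needing a trained model, here is a standard-library sketch. It uses `pickle` and a plain dictionary as a stand-in for the model, purely to illustrate the idea of writing an object to `<name>.pkl` and reading it back; it is not how PyCaret stores models internally, and `save_object` and `load_object` are hypothetical helpers.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, name, folder):
    """Write obj to '<folder>/<name>.pkl' and return the path."""
    path = Path(folder) / f"{name}.pkl"
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_object(name, folder):
    """Read back the object stored at '<folder>/<name>.pkl'."""
    with (Path(folder) / f"{name}.pkl").open("rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    stand_in = {"model": "rf", "n_estimators": 100}  # placeholder, not a real model
    saved_path = save_object(stand_in, "rf_clf_model", tmp)
    restored = load_object("rf_clf_model", tmp)
    print(saved_path.name, restored == stand_in)
```

As with any pickle-based file, only load model files that come from a source you trust.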


### **Step 31: Making Predictions on Unseen Data 🔮**

This step shows how to use the trained model to make predictions on new, unseen data.

#### **Explanation with 🔮**:

- ```data_unseen = diabetes.sample(10)```: Randomly selects 10 rows from the diabetes dataset to stand in for new data. Because ```finalize_model``` has already refit the model on the full dataset, these rows are not truly unseen.

- ```unseen_predictions = predict_model(final_rf, data=data_unseen, verbose=True)```: Uses the trained and finalized Random Forest model (```final_rf```) to make predictions on ```data_unseen```. The ```verbose=True``` argument displays the score grid for these predictions.

#### **Concepts to know**:

- **Unseen Data 🌱**: Data that the model has not encountered during training, used to evaluate how well the model generalizes. For a fair estimate, such rows must be set aside before calling ```setup()```.

- **Prediction 🚀**: The process of using a trained model to make predictions based on new input data.

#### **Result 🚀**: 

The model makes predictions on the 10 randomly selected rows from the diabetes dataset, and a score grid is displayed. The perfect scores (1.0) are expected here because the sampled rows were part of the training data, so they do not measure how well the model generalizes.

In [24]:
# Sample rows from the diabetes dataset to stand in for new data
data_unseen = diabetes.sample(10)  # Select 10 random rows for prediction

# Now use the model to make predictions
unseen_predictions = predict_model(final_rf, data=data_unseen, verbose=True)


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0
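
The perfect scores above come from predicting rows that the finalized model was trained on. One way to get a fairer check is to set rows aside before calling `setup()`. The sketch below shows the idea on row positions using only the standard library; with pandas, a comparable split would be `data = df.sample(frac=0.9, random_state=...)` followed by `data_unseen = df.drop(data.index)`. `holdout_split`, the 10% hold-out share, and the seed are assumptions made for this example.

```python
import random


def holdout_split(n_rows, holdout_frac=0.1, seed=123):
    """Return (modeling_idx, unseen_idx): disjoint, sorted row positions."""
    rng = random.Random(seed)
    n_unseen = round(n_rows * holdout_frac)
    unseen = set(rng.sample(range(n_rows), n_unseen))
    modeling = [i for i in range(n_rows) if i not in unseen]
    return modeling, sorted(unseen)


modeling_idx, unseen_idx = holdout_split(768)
print(len(modeling_idx), len(unseen_idx))  # 691 77
```

Only the modeling rows would then be passed to `setup()`, and the held-out rows would be used with `predict_model` afterwards.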


### **Step 32: Viewing Unseen Predictions 🔍**

This step shows how to view the predictions made on unseen data.

#### **Explanation with 🔍**:

- ```unseen_predictions```: This variable holds the predictions made by the model on ```data_unseen```. It is the input DataFrame with two added columns: ```prediction_label``` (the predicted class, that is, whether a patient is diabetic or not) and ```prediction_score``` (the model's confidence in that label).

#### **Concepts to know**:

- **Prediction Output 📊**: The output of **predict_model** contains both the predicted labels and the prediction scores.

- **Model Evaluation 🧮**: By comparing ```prediction_label``` with the actual ```Class variable``` column in **unseen_predictions**, you can check how often the model is right on these rows.

#### **Result 🚀**:

The predicted class labels and scores are displayed for the 10 randomly selected rows from the diabetes dataset, so you can compare each prediction with the actual class.

In [25]:
unseen_predictions

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
461,1,71,62,0,0,21.799999,0.416,26,0,0,1.0
366,6,124,72,0,0,27.6,0.368,29,1,1,0.78
767,1,93,70,31,0,30.4,0.315,23,0,0,0.99
666,4,145,82,18,0,32.5,0.235,70,1,1,0.77
742,1,109,58,18,116,28.5,0.219,22,0,0,1.0
333,12,106,80,0,0,23.6,0.137,44,0,0,0.91
757,0,123,72,0,0,36.299999,0.258,52,1,1,0.8
83,0,101,65,28,0,24.6,0.237,22,0,0,1.0
469,6,154,78,41,140,46.099998,0.571,27,0,0,0.69
167,4,120,68,0,0,29.6,0.709,34,0,0,0.79
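
The table above can be checked by hand: compare each row's `Class variable` with its `prediction_label`. A small sketch with the values copied from the table (row index, actual class, predicted label, prediction score):

```python
# (row index, actual class, predicted label, prediction score) from the table above
rows = [
    (461, 0, 0, 1.00), (366, 1, 1, 0.78), (767, 0, 0, 0.99), (666, 1, 1, 0.77),
    (742, 0, 0, 1.00), (333, 0, 0, 0.91), (757, 1, 1, 0.80), (83, 0, 0, 1.00),
    (469, 0, 0, 0.69), (167, 0, 0, 0.79),
]

correct = sum(actual == predicted for _, actual, predicted, _ in rows)
accuracy = correct / len(rows)
print(f"{correct}/{len(rows)} correct, accuracy = {accuracy:.2f}")  # 10/10 correct, accuracy = 1.00

# The row with the lowest score is the one the model was least sure about.
least_sure = min(rows, key=lambda row: row[3])
print(f"least confident row: {least_sure[0]} (score {least_sure[3]})")  # least confident row: 469 (score 0.69)
```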


### **Step 33: Creating the Gradio Interface for Predictions 🎛️**

This step involves creating a Gradio interface to allow users to input data and receive predictions, while logging each prediction to Azure storage.

#### **Explanation with 🎛️**:

- ```from pycaret.classification import *```: Imports necessary PyCaret functions for classification tasks.

- ```import gradio as gr```: Imports the Gradio library, which is used to build interactive UIs for machine learning models.

- ```import pandas as pd```: Imports Pandas for handling data structures, like DataFrames.

- ```from azure.storage.blob import BlobServiceClient```: Imports the Azure Blob service client to interact with Azure Blob storage.

#### **Key Functions**:

- ```log_prediction```: Logs input data, predictions, and confidence scores to Azure Blob Storage, including a timestamp.

- ```predict_default```: Takes input data from the user, runs the trained model to predict credit card default, logs the prediction to Azure, and returns the result with confidence score.

#### **Gradio Interface**:

- ```gr.Interface```: Creates an interactive web interface that allows users to input customer details (like credit limit, age, marital status, etc.) and predict whether a credit card default will occur.

- **Inputs and Outputs**: The interface contains various input fields for each customer feature and outputs the prediction along with the confidence score.
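
The interface in this step takes 23 numeric inputs that follow a regular pattern (five customer fields, then six months each of repayment status, bill amount, and payment amount). One option is to describe them as data first and build the widgets from that description. The sketch below is an illustration under that assumption; `build_input_specs` and `to_gradio_inputs` are hypothetical helpers, and the second one imports Gradio lazily so the specs can be built and inspected without it.

```python
def build_input_specs():
    """Return (label, default, minimum, maximum) tuples in model column order."""
    specs = [
        ("Credit Limit Balance (NT dollar)", 200000, None, None),
        ("Sex (1=male, 2=female)", 2, 1, 2),
        ("Education (1=graduate, 2=university, 3=high school, 4=others)", 2, 1, 4),
        ("Marital Status (1=married, 2=single, 3=others)", 2, 1, 3),
        ("Age (years)", 30, None, None),
    ]
    months = range(1, 7)
    specs += [(f"Repayment Status Month {m} (-2 to 8)", 0, -2, 8) for m in months]
    specs += [(f"Bill Amount Month {m} (NT dollar)", 50000, None, None) for m in months]
    specs += [(f"Payment Amount Month {m} (NT dollar)", 5000, None, None) for m in months]
    return specs


def to_gradio_inputs(specs):
    """Turn specs into gr.Number components (requires gradio to be installed)."""
    import gradio as gr  # imported lazily so the specs can be built without it

    return [gr.Number(label=label, value=default, minimum=low, maximum=high)
            for label, default, low, high in specs]


specs = build_input_specs()
print(len(specs))  # 23
```

The order of the specs matters: Gradio passes the values to the prediction function positionally, so it has to match the function's parameter order.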

#### **Concepts to know**:

- **Gradio 🎨**: A Python library that allows easy creation of user interfaces for machine learning models.

- **Azure Blob Storage ☁️**: A cloud storage service that stores large amounts of unstructured data, used here to log predictions.

- **Prediction Logging 📈**: The process of storing prediction details (inputs, results, confidence) for future reference or analysis.

#### **Result 🚀**:

The Gradio interface launches, allowing users to input credit card data and receive predictions about whether the customer will default, while also logging these predictions to Azure storage for future tracking.

In [26]:
import gradio as gr
from pycaret.classification import *
import pandas as pd
import json
from datetime import datetime
from azure.storage.blob import BlobServiceClient

def log_prediction(input_data, prediction, score):
    # Create log entry
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "inputs": input_data,
        "prediction": prediction,
        "confidence_score": float(score)
    }

    # Connect to Azure storage
    blob_service_client = BlobServiceClient.from_connection_string(connect_str)
    container_client = blob_service_client.get_container_client(authentication['container'])

    # Create blob name with timestamp
    blob_name = f"predictions/prediction_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"

    # Upload log as JSON
    blob_client = container_client.get_blob_client(blob_name)
    blob_client.upload_blob(json.dumps(log_entry))

def predict_default(limit_bal, sex, education, marriage, age, pay_1, pay_2, pay_3, pay_4, pay_5, pay_6,
                   bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt6,
                   pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6):

    # Create input data dictionary
    input_data = {
        'LIMIT_BAL': limit_bal, 'SEX': sex, 'EDUCATION': education, 'MARRIAGE': marriage,
        'AGE': age, 'PAY_1': pay_1, 'PAY_2': pay_2, 'PAY_3': pay_3, 'PAY_4': pay_4,
        'PAY_5': pay_5, 'PAY_6': pay_6, 'BILL_AMT1': bill_amt1, 'BILL_AMT2': bill_amt2,
        'BILL_AMT3': bill_amt3, 'BILL_AMT4': bill_amt4, 'BILL_AMT5': bill_amt5,
        'BILL_AMT6': bill_amt6, 'PAY_AMT1': pay_amt1, 'PAY_AMT2': pay_amt2,
        'PAY_AMT3': pay_amt3, 'PAY_AMT4': pay_amt4, 'PAY_AMT5': pay_amt5,
        'PAY_AMT6': pay_amt6
    }

    # Convert to dataframe
    data = pd.DataFrame([input_data])

    # Get prediction
    prediction = predict_model(model_azure, data=data)
    pred_label = prediction['prediction_label'].iloc[0]
    pred_score = prediction['prediction_score'].iloc[0]

    # Log prediction
    log_prediction(input_data, int(pred_label), pred_score)

    # Return result with confidence score
    result = "Default" if pred_label == 1 else "No Default"
    return f"{result} (Confidence: {pred_score:.2%})"

# Create interface
iface = gr.Interface(
    fn=predict_default,
    inputs=[
        gr.Number(label="Credit Limit Balance (NT dollar)", value=200000),
        gr.Number(label="Sex (1=male, 2=female)", value=2, minimum=1, maximum=2),
        gr.Number(label="Education (1=graduate, 2=university, 3=high school, 4=others)", value=2, minimum=1, maximum=4),
        gr.Number(label="Marital Status (1=married, 2=single, 3=others)", value=2, minimum=1, maximum=3),
        gr.Number(label="Age (years)", value=30),
        gr.Number(label="Repayment Status Month 1 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 2 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 3 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 4 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 5 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 6 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Bill Amount Month 1 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 2 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 3 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 4 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 5 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 6 (NT dollar)", value=50000),
        gr.Number(label="Payment Amount Month 1 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 2 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 3 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 4 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 5 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 6 (NT dollar)", value=5000)
    ],
    outputs=gr.Text(label="Credit Card Default Prediction"),
    title="Credit Card Default Prediction with Azure Logging",
    description="Enter customer details to predict credit card default probability. All predictions are logged to Azure storage."
)

iface.launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
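
The logging function above names each blob with a timestamp that is precise to the second, so two predictions made within the same second would map to the same blob name (and `upload_blob` does not overwrite an existing blob unless `overwrite=True` is passed). The sketch below builds the log entry and a collision-resistant name without touching Azure, so it can be tested locally. `build_log_blob` is a hypothetical helper, and using UTC timestamps plus a random suffix is a choice made for this example rather than something the original code does.

```python
import json
import uuid
from datetime import datetime, timezone


def build_log_blob(input_data, prediction, score):
    """Return (blob_name, json_payload) for one prediction log entry."""
    now = datetime.now(timezone.utc)  # one timestamp reused for name and body
    entry = {
        "timestamp": now.isoformat(),
        "inputs": input_data,
        "prediction": int(prediction),
        "confidence_score": float(score),
    }
    # Microseconds plus a short random suffix keep names unique even when
    # several predictions arrive within the same second.
    suffix = uuid.uuid4().hex[:8]
    blob_name = f"predictions/prediction_{now.strftime('%Y%m%d_%H%M%S_%f')}_{suffix}.json"
    return blob_name, json.dumps(entry)


name_a, payload = build_log_blob({"LIMIT_BAL": 200000, "AGE": 30}, 0, 0.87)
name_b, _ = build_log_blob({"LIMIT_BAL": 200000, "AGE": 30}, 0, 0.87)
print(name_a != name_b)
print(json.loads(payload)["prediction"])
```

The returned name and payload could then be handed to `get_blob_client(blob_name).upload_blob(payload)` inside the logging function.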




## **The following prompt was used in GitHub Copilot within the notebook to generate the app below:**

#### **Prompt:**
Help me create a gradio app now that connects to model from blob storage and gives prediction on new data that will be entered by user in the application. put a default value in the box and also at the title of the box show which value is expected

### **Step 34: Building the Gradio Interface for Credit Card Default Prediction 🎮**

This step shows how to create a Gradio interface that allows users to input customer data and get credit card default predictions from the trained model.

#### **Explanation with 🎮**:

- ```def predict_default(...)```: This function receives customer details as input, prepares the data, loads the trained model from Azure Blob Storage, and returns the prediction ("Default" or "No Default").

- ```data = pd.DataFrame(data)```: Converts the input data dictionary into a Pandas DataFrame, which is the required format for model prediction.

- ```model = load_model('rf-clf-101', platform='azure', authentication=authentication)```: Loads the trained Random Forest model from Azure Blob Storage using the specified credentials.

- ```prediction = predict_model(model, data=data)```: Uses the loaded model to make a prediction on the input data.

##### **Gradio Interface:**

- ```gr.Interface```: Creates an interactive UI for the user. The interface takes in multiple input fields, such as credit limit, repayment status, bill amounts, etc.

- **Inputs**: The user is asked to enter various financial data points, such as credit limit balance, sex, age, repayment statuses, and payment amounts.

- **Outputs**: The result, either "Default" or "No Default", is displayed as text.

#### **Concepts to know:**

- **Gradio Interface 🎨**: A simple and powerful library to create interactive UIs for machine learning models.

- **Azure Blob Storage ☁️**: Used to store the trained model so it can be accessed from anywhere.

- **Model Deployment 🚀**: Making the model accessible for predictions in a user-friendly interface.

#### **Result 🚀**: 

The Gradio interface is created and launched, allowing users to input customer data and receive predictions on whether the customer will default on their credit card. The interface is interactive and user-friendly.
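
Because the prediction function described above loads the model from Blob Storage on every call, each prediction pays the download cost again. A common remedy is to cache the loaded model so it is fetched once per process. The sketch below shows the caching pattern with a stand-in loader so that it runs without Azure access; `make_cached_loader` and `fake_loader` are hypothetical names, and in the app the loader would wrap the real `load_model('rf-clf-101', platform='azure', authentication=authentication)` call.

```python
from functools import lru_cache


def make_cached_loader(loader):
    """Wrap a zero-argument loader so its result is computed only once."""
    @lru_cache(maxsize=1)
    def cached():
        return loader()
    return cached


calls = {"count": 0}


def fake_loader():
    # Stand-in for the expensive model download; counts how often it runs.
    calls["count"] += 1
    return {"name": "rf-clf-101"}


get_model = make_cached_loader(fake_loader)
first = get_model()
second = get_model()
print(calls["count"], first is second)  # 1 True
```

Inside `predict_default`, calling `get_model()` instead of `load_model(...)` would then reuse the same object across requests.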

In [27]:
import gradio as gr
from pycaret.classification import *
import pandas as pd

def predict_default(limit_bal, sex, education, marriage, age, pay_1, pay_2, pay_3, pay_4, pay_5, pay_6,
                   bill_amt1, bill_amt2, bill_amt3, bill_amt4, bill_amt5, bill_amt6,
                   pay_amt1, pay_amt2, pay_amt3, pay_amt4, pay_amt5, pay_amt6):

    # Create data dictionary
    data = {
        'LIMIT_BAL': [limit_bal], 'SEX': [sex], 'EDUCATION': [education], 'MARRIAGE': [marriage],
        'AGE': [age], 'PAY_1': [pay_1], 'PAY_2': [pay_2], 'PAY_3': [pay_3], 'PAY_4': [pay_4],
        'PAY_5': [pay_5], 'PAY_6': [pay_6], 'BILL_AMT1': [bill_amt1], 'BILL_AMT2': [bill_amt2],
        'BILL_AMT3': [bill_amt3], 'BILL_AMT4': [bill_amt4], 'BILL_AMT5': [bill_amt5],
        'BILL_AMT6': [bill_amt6], 'PAY_AMT1': [pay_amt1], 'PAY_AMT2': [pay_amt2],
        'PAY_AMT3': [pay_amt3], 'PAY_AMT4': [pay_amt4], 'PAY_AMT5': [pay_amt5],
        'PAY_AMT6': [pay_amt6]
    }

    # Convert to dataframe
    data = pd.DataFrame(data)

    # Load model from azure blob
    authentication = {'container': 'pycaret-cls-10111'}
    model = load_model('rf-clf-101', platform='azure', authentication=authentication)

    # Make prediction
    prediction = predict_model(model, data=data)

    return "Default" if prediction['prediction_label'][0] == 1 else "No Default"

# Create the interface
iface = gr.Interface(
    fn=predict_default,
    inputs=[
        gr.Number(label="Credit Limit Balance (NT dollar)", value=200000),
        gr.Number(label="Sex (1=male, 2=female)", value=2, minimum=1, maximum=2),
        gr.Number(label="Education (1=graduate, 2=university, 3=high school, 4=others)", value=2, minimum=1, maximum=4),
        gr.Number(label="Marital Status (1=married, 2=single, 3=others)", value=2, minimum=1, maximum=3),
        gr.Number(label="Age (years)", value=30),
        gr.Number(label="Repayment Status Month 1 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 2 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 3 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 4 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 5 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Repayment Status Month 6 (-2 to 8)", value=0, minimum=-2, maximum=8),
        gr.Number(label="Bill Amount Month 1 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 2 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 3 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 4 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 5 (NT dollar)", value=50000),
        gr.Number(label="Bill Amount Month 6 (NT dollar)", value=50000),
        gr.Number(label="Payment Amount Month 1 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 2 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 3 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 4 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 5 (NT dollar)", value=5000),
        gr.Number(label="Payment Amount Month 6 (NT dollar)", value=5000)
    ],
    outputs=gr.Text(label="Credit Card Default Prediction"),
    title="Credit Card Default Prediction",
    description="Enter customer details to predict whether the customer will default"
)

iface.launch()

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
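
The 23-key dictionary in `predict_default` can also be built from a list of column names, which keeps the argument order and the column order in one place. The sketch below is a minimal illustration, not the notebook's own code: `FEATURES`, `build_row`, and `label_to_text` are hypothetical helper names, and the column names are copied from the cell above.

```python
# Column names copied from the notebook cell; the helper names are hypothetical.
FEATURES = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + [f"PAY_{i}" for i in range(1, 7)]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
)


def build_row(*values):
    """Map positional Gradio inputs to a one-row, column-oriented dict."""
    if len(values) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} values, got {len(values)}")
    return {name: [value] for name, value in zip(FEATURES, values)}


def label_to_text(label):
    """Translate the model's 0/1 label into the text shown in the UI."""
    return "Default" if int(label) == 1 else "No Default"


# Same default values as the Gradio inputs in the cell above.
row = build_row(200000, 2, 2, 2, 30, *[0] * 6, *[50000] * 6, *[5000] * 6)
print(len(row), row["LIMIT_BAL"], label_to_text(0))
```

Passing the result to `pd.DataFrame(...)` would give the same one-row frame that the cell hands to `predict_model()`.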




### **Step 35: Upgrading the google-cloud-storage Package 📦**

This step involves upgrading the google-cloud-storage package to ensure you have the latest version for interacting with Google Cloud Storage.

#### **Explanation with 📦**:

- ```pip install --upgrade google-cloud-storage```: This command upgrades the ```google-cloud-storage``` package to its latest version. It ensures that you are using the newest features, bug fixes, and improvements.

- **Why upgrade? 🔄**: Keeping packages up-to-date is important to avoid compatibility issues and to take advantage of new features and optimizations provided by the package maintainers.

#### **Concepts to know:**

- **Google Cloud Storage ☁️**: A service for storing and retrieving large amounts of data in Google Cloud, often used for saving models, datasets, and logs.

- ```pip``` **Package Manager 🛠️**: A tool for installing and managing Python packages from the Python Package Index (PyPI).

**Result 🚀**: 

The ```google-cloud-storage``` package is now updated to the latest version, enabling you to use the newest features and maintain compatibility with Google Cloud services.

In [28]:
pip install --upgrade google-cloud-storage

Note: you may need to restart the kernel to use updated packages.
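
To confirm which version is actually installed after the upgrade (and after restarting the kernel), the version can be read from the package metadata. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("google-cloud-storage:", installed_version("google-cloud-storage"))
```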


### **Step 36: Set Google Application Credentials for Authentication 🌐**

Set up the environment variable for Google Cloud authentication.

#### **Explanation: 🌟**

- We use the ```os.environ``` mapping to set an environment variable that tells Google Cloud libraries where to find the service account credentials file.

- The environment variable ```GOOGLE_APPLICATION_CREDENTIALS``` is set to the path of the service account JSON file.

- This step is crucial for authenticating and authorizing your program to interact with Google Cloud services using the credentials in the JSON file.

#### **Result: 🏁**

The environment variable is set successfully, enabling authentication with Google Cloud services.

In [34]:
import os

# Set the Google Application Credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\Users\Hello\Downloads\keen-bucksaw-457308-v7-e20312101698.json'

# Verify if the environment variable is set correctly
print(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS'))


C:\Users\Hello\Downloads\keen-bucksaw-457308-v7-e20312101698.json
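
A mistyped path only surfaces later, when the first Google Cloud call fails. The sketch below checks the key file before exporting the variable. It is an illustration under stated assumptions: `set_gcp_credentials` is a hypothetical helper, and the check relies on service account key files being JSON documents with a `type` field equal to `service_account` and a `project_id` field.

```python
import json
import os
from pathlib import Path


def set_gcp_credentials(key_path):
    """Validate a service account key file, then export its path.

    Returns the project ID stored in the key file.
    """
    path = Path(key_path)
    if not path.is_file():
        raise FileNotFoundError(f"Key file not found: {path}")

    with path.open(encoding="utf-8") as fh:
        info = json.load(fh)

    # Service account key files carry a "type" field set to "service_account".
    if info.get("type") != "service_account":
        raise ValueError("Not a service account key file")

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return info.get("project_id")


# Usage (placeholder path, replace with your own key file):
# print(set_gcp_credentials(r"C:\path\to\service-account-key.json"))
```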


### **Step 37: List Google Cloud Storage Buckets 📦**

Retrieve and list all storage buckets in your Google Cloud project.

#### **Explanation: 🌟**

- Use the ```google.cloud``` library to interact with Google Cloud Storage.

- Set the ```GOOGLE_APPLICATION_CREDENTIALS``` environment variable to authenticate using the service account credentials.

- Initialize the ```storage.Client()``` to interact with Google Cloud Storage.

- List all available storage buckets using the ```list_buckets()``` method.

- If no buckets are found, display a relevant message.

#### **Result: 🏁**

The program will attempt to connect to your Google Cloud project, authenticate using the provided credentials, and list all available storage buckets. If no buckets are found, a message will indicate that no buckets exist in the project.



In [37]:
import os
from google.cloud import storage

# Set the Google Application Credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\Users\Hello\Downloads\keen-bucksaw-457308-v7-e20312101698.json'

# Verify if the environment variable is set correctly
print(f"Using credentials: {os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')}")

try:
    # Initialize the client
    storage_client = storage.Client()

    # Verify the project the client is connected to
    print(f"Connected to project: {storage_client.project}")

    # List all buckets in your project
    buckets = list(storage_client.list_buckets())

    # Check if buckets exist and print them
    if buckets:
        for bucket in buckets:
            print(f"Bucket name: {bucket.name}")
    else:
        print("No buckets found in the project.")
        
except Exception as e:
    print(f"Error occurred: {e}")


Using credentials: C:\Users\Hello\Downloads\keen-bucksaw-457308-v7-e20312101698.json
Connected to project: keen-bucksaw-457308-v7
No buckets found in the project.
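
Since the project has no buckets yet, one has to be created before a model can be stored. The following is a minimal sketch, not part of the original notebook: `ensure_bucket` is a hypothetical helper that expects an object behaving like `google.cloud.storage.Client`, using its `lookup_bucket()` and `create_bucket()` methods. The bucket name in the usage comment is a placeholder.

```python
def ensure_bucket(client, name):
    """Return the bucket called `name`, creating it if it does not exist.

    `client` is expected to behave like google.cloud.storage.Client:
    lookup_bucket() returns None for a missing bucket and
    create_bucket() creates one.
    """
    bucket = client.lookup_bucket(name)
    if bucket is None:
        bucket = client.create_bucket(name)
        print(f"Created bucket: {bucket.name}")
    else:
        print(f"Bucket already exists: {bucket.name}")
    return bucket


# Usage with real credentials (the bucket name is a placeholder):
# from google.cloud import storage
# ensure_bucket(storage.Client(), "my-unique-bucket-name")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before touching a real project.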


### **Step 38: Install AWS CLI 🛠️**

Install the AWS Command Line Interface (CLI) to interact with AWS services directly from the command line.

#### **Explanation: 🌟**

- Use the ```pip``` package manager to install the AWS CLI.

- The AWS CLI allows you to manage AWS resources and services like S3, EC2, Lambda, etc., from your terminal.

- After installation, you can configure the AWS CLI using your AWS credentials to access and manage your AWS account.

#### **Concepts to know: 💡**

- **AWS CLI**: A command-line tool for managing AWS resources.

- **pip**: A package manager for Python that helps install and manage software packages.

- **AWS Credentials**: Your access keys that authenticate your requests to AWS services.

#### **Result: 🏁**

Once the installation is complete, you will be able to use the AWS CLI to manage your AWS resources. You can then configure it by running ```aws configure``` to set your AWS access keys and region.

In [38]:
! pip install awscli
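
After installing, it helps to check that the command is actually reachable from the notebook's environment. This sketch uses only the standard library; `cli_version` is a hypothetical helper name, and the same function works for other command-line tools such as `gcloud`.

```python
import shutil
import subprocess


def cli_version(executable):
    """Return the first line of `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some CLIs print their version to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print("aws:", cli_version("aws"))
```

A result of `None` means the executable is not on the `PATH` seen by the kernel, which often just requires a kernel restart.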



### **Step 39: 🔍 Verifying Google Cloud SDK Version 🔍**

#### **Explanation:**
In this step, we're **checking the currently installed version** of the Google Cloud SDK and its components. This verification ensures that you have the necessary tools available before proceeding with Google Cloud operations in subsequent steps.

#### **Key Concepts: 📚**

- **Google Cloud SDK** provides the command-line tools for interacting with Google Cloud Platform
- **Version checking** confirms your environment is properly set up
- The command shows versions of multiple components including **bq** (BigQuery), **gsutil** (Storage), and core SDK
- **Update notifications** are normal and don't indicate errors

#### **Important Note: ⚠️**

![image.png](attachment:b9ab7745-0602-4d47-bd48-f07c38028f47.png)

You may see a message stating "Updates are available for some Google Cloud CLI components" after running this command. This is **NOT an error** - it's simply informing you that newer versions exist. You can continue with the lab using your current version without any issues. If you want to update, you would need to run ```!gcloud components update``` in a separate cell, but this is completely optional and not required for completing the lab exercises.

In [44]:
!gcloud --version

Google Cloud SDK 518.0.0
bq 2.1.15
core 2025.04.11
gcloud-crc32c 1.0.0
gsutil 5.33


Updates are available for some Google Cloud CLI components.  To install them,
please run:
  $ gcloud components update
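
If later steps depend on a particular component, the version listing can be turned into a dictionary and checked in code. The sketch below parses the output shown above; `parse_component_versions` is a hypothetical helper, and it assumes the component list ends at the first blank line, as it does in this output.

```python
def parse_component_versions(text):
    """Parse `gcloud --version` output into {component: version}."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            break  # the component list ends at the first blank line
        name, _, version = line.rpartition(" ")
        if name:
            versions[name] = version
    return versions


# Output copied from the cell above.
sample = """Google Cloud SDK 518.0.0
bq 2.1.15
core 2025.04.11
gcloud-crc32c 1.0.0
gsutil 5.33

Updates are available for some Google Cloud CLI components.
"""
versions = parse_component_versions(sample)
print(versions["Google Cloud SDK"], versions["gsutil"])
```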


### **Step 40: 🔄 Setting Your Google Cloud Project 🔄**

#### **Explanation:**

In this step, we're **configuring the Google Cloud environment** by setting the active project for all subsequent gcloud commands. This ensures that any resources you create or access will be associated with your specific project. We're also defining the bucket name format that will be used later for model storage.

#### **Key Concepts: 📚**

- **Project configuration** is essential when working with Google Cloud resources
- The ```gcloud config set project``` command changes your active project
- **Python variables** like ```CLOUD_PROJECT``` can be passed to shell commands with a ```$``` prefix, which helps maintain consistency across commands
- **Bucket names** are being prepared for later use in model storage and deployment

#### **Important Note: ⚠️**

![image.png](attachment:f148d95b-6f76-4c34-a4e1-069e06df2544.png)

The message "Updated property [core/project]" is a **success confirmation**, not an error. It indicates that your project has been successfully set as the active project for all subsequent gcloud commands. This is the expected and correct output for this command.

In [45]:
# GCP project ID; change this to match your own GCP project.
CLOUD_PROJECT = 'gcpessentials-rz'
bucket_name = 'pycaret-clf1011-test1'  # bucket name for storing your model
BUCKET = f'gs://{CLOUD_PROJECT}-{bucket_name}'
# Set the active gcloud project to the value of CLOUD_PROJECT
!gcloud config set project $CLOUD_PROJECT

Updated property [core/project].
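
The bucket URI is built by joining the project ID and the bucket name, so a typo in either one produces a name that Cloud Storage will reject. The sketch below rebuilds the same URI and applies a simplified subset of the bucket naming rules (3 to 63 characters; lowercase letters, digits, hyphens, underscores, and dots; starting and ending with a letter or digit). `bucket_uri` is a hypothetical helper, and the check is not a full implementation of the official rules.

```python
import re

# Simplified check: 3-63 characters, lowercase letters, digits, '-', '_' or '.',
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def bucket_uri(project, bucket_name):
    """Build the gs:// URI used in this step and sanity-check the name."""
    name = f"{project}-{bucket_name}"
    if not _BUCKET_RE.match(name):
        raise ValueError(f"Invalid bucket name: {name!r}")
    return f"gs://{name}"


print(bucket_uri("gcpessentials-rz", "pycaret-clf1011-test1"))
# prints gs://gcpessentials-rz-pycaret-clf1011-test1
```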


### **🧪 Step 41: Train and Finalize Best Classification Model 🔍**

Train, compare, and finalize the best model using PyCaret's automation power.

#### **✅ Explanation**:

- **Load Data**: Uses PyCaret’s built-in ```credit``` dataset.

- **setup()**: Prepares the ML environment (target column = ```'default'```).

- **compare_models()**: Trains multiple algorithms with cross-validation and returns the one with the highest accuracy, the default ranking metric.

- **finalize_model()**: Retrains the best model on the entire dataset, including the hold-out test set, for deployment.

In [52]:
from pycaret.classification import *
import pandas as pd

# Load sample dataset
from pycaret.datasets import get_data
data = get_data('credit')

# Initialize setup
clf = setup(data=data, target='default', session_id=123, verbose=True)

# Compare models
best_model = compare_models()

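# Finalize the best model (the variable keeps the name final_rf for later steps,
# even though the best model here is not necessarily a random forest)
final_rf = finalize_model(best_model)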


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,90000,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
2,50000,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
3,50000,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
4,50000,1,1,2,37,0,0,0,0,0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0


Unnamed: 0,Description,Value
0,Session id,123
1,Target,default
2,Target type,Binary
3,Original data shape,"(24000, 24)"
4,Transformed data shape,"(24000, 24)"
5,Transformed train set shape,"(16800, 24)"
6,Transformed test set shape,"(7200, 24)"
7,Numeric features,23
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8215,0.784,0.3751,0.6729,0.4813,0.3839,0.408,6.681
lightgbm,Light Gradient Boosting Machine,0.8203,0.7768,0.3756,0.6664,0.4802,0.3816,0.4047,0.63
ada,Ada Boost Classifier,0.8186,0.7748,0.3325,0.685,0.4473,0.3541,0.3874,1.498
lda,Linear Discriminant Analysis,0.8129,0.7226,0.2719,0.6985,0.391,0.3049,0.3527,0.142
rf,Random Forest Classifier,0.8124,0.7645,0.367,0.6313,0.4637,0.3596,0.3793,4.885
et,Extra Trees Classifier,0.8089,0.7516,0.3667,0.6142,0.459,0.3518,0.3692,2.094
lr,Logistic Regression,0.8062,0.7108,0.2267,0.6897,0.3403,0.2595,0.3154,3.588
ridge,Ridge Classifier,0.8014,0.7226,0.1621,0.7296,0.265,0.2008,0.2782,0.067
dummy,Dummy Classifier,0.7789,0.5,0.0,0.0,0.0,0.0,0.0,0.056
knn,K Neighbors Classifier,0.7507,0.5971,0.1826,0.3705,0.2443,0.1154,0.1259,0.752


### **🧾 Step 42: Save the Final Model Locally 💾**

Save your trained and finalized model to a ```.pkl``` file on your system.

- This step is useful when you want to reuse or deploy your model later.

#### **🔍 Explanation:**

- ```save_model()```: Saves the model pipeline as a ```.pkl``` file.

- ```'rf_clf_101'```: The filename (without extension); PyCaret adds ```.pkl```.

- You’ll find ```rf_clf_101.pkl``` in your current working directory.

#### **✅ Result**:

The model is successfully saved locally as ```rf_clf_101.pkl``` and ready for deployment or sharing.

In [58]:
save_model(final_rf, 'rf_clf_101')
print("✅ Model saved locally as 'rf_clf_101.pkl'")


Transformation Pipeline and Model Successfully Saved
✅ Model saved locally as 'rf_clf_101.pkl'
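
When a saved model is shared or uploaded, recording its size and checksum makes it possible to confirm later that the file arrived intact. This is a minimal standard-library sketch; `describe_artifact` is a hypothetical helper, and the file name is the one used in this step.

```python
import hashlib
from pathlib import Path


def describe_artifact(path, chunk_size=1 << 20):
    """Return (size in bytes, SHA-256 hex digest) for a saved model file."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large model files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return path.stat().st_size, digest.hexdigest()


model_file = Path("rf_clf_101.pkl")  # file name used in this step
if model_file.exists():
    size, sha256 = describe_artifact(model_file)
    print(f"{model_file}: {size} bytes, sha256={sha256}")
```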


### **🌱 Step 43: Train and Finalize Model on Iris Dataset 🌸**

Use the Iris dataset to train and finalize the best classification model.

#### **🔍 Explanation:**

- ```get_data('iris')```: Loads the popular iris flower dataset.

- ```setup()```: Prepares the dataset for modeling.

- ```compare_models()```: Automatically compares several models and selects the best.

- ```finalize_model()```: Retrains the best model on the entire dataset, including the hold-out test set, so it is ready for deployment.

#### **✅ Result:**

The best classification model for the Iris dataset is selected and finalized, ready for saving or deployment.

In [48]:
from pycaret.datasets import get_data
from pycaret.classification import *

# Step 1: Load sample dataset
data = get_data('iris')  # You can also use your own CSV with pd.read_csv()

# Step 2: Setup PyCaret
clf1 = setup(data, target='species')  # 'species' is the target column in iris

# Step 3: Compare models and finalize the best one
best_model = compare_models()
final_rf = finalize_model(best_model)


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Unnamed: 0,Description,Value
0,Session id,1743
1,Target,species
2,Target type,Multiclass
3,Target mapping,"Iris-setosa: 0, Iris-versicolor: 1, Iris-virginica: 2"
4,Original data shape,"(150, 5)"
5,Transformed data shape,"(150, 5)"
6,Transformed train set shape,"(105, 5)"
7,Transformed test set shape,"(45, 5)"
8,Numeric features,4
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.9718,0.0,0.9718,0.9789,0.9718,0.9579,0.9614,0.059
lda,Linear Discriminant Analysis,0.9718,0.0,0.9718,0.9789,0.9718,0.9579,0.9614,0.038
lr,Logistic Regression,0.9527,0.0,0.9527,0.9591,0.9523,0.9286,0.9319,1.175
knn,K Neighbors Classifier,0.9427,0.9849,0.9427,0.9561,0.9419,0.9137,0.9209,0.08
nb,Naive Bayes,0.9427,0.9883,0.9427,0.947,0.9423,0.9133,0.9158,0.053
et,Extra Trees Classifier,0.9327,0.9815,0.9327,0.9436,0.9319,0.8983,0.9042,0.288
gbc,Gradient Boosting Classifier,0.9245,0.0,0.9245,0.937,0.9228,0.8854,0.8927,0.387
rf,Random Forest Classifier,0.9236,0.9858,0.9236,0.9322,0.9228,0.8844,0.8892,0.247
lightgbm,Light Gradient Boosting Machine,0.9164,0.9806,0.9164,0.9191,0.9143,0.8726,0.8756,0.557
ada,Ada Boost Classifier,0.9055,0.0,0.9055,0.9136,0.9042,0.8565,0.8615,0.2


### **💾 Step 44: Save Finalized Model Using Joblib 🗄️**

Save the finalized model locally as a .pkl file using joblib.

#### **🔍 Explanation:**

- ```finalize_model()```: Retrains the best model on the entire dataset so it is ready for deployment.

- ```joblib.dump()```: Saves the model to a file (in this case, 'rf_model.pkl') for later use.

#### **✅ Result:**

The best model is saved locally as ```'rf_model.pkl'```, which can be loaded and used for future predictions.

In [59]:
from pycaret.classification import *
import joblib  # Import joblib to save the model

# Finalize the best model from the previous step (retrains it on the full dataset)
final_rf = finalize_model(best_model)

# Save the model using joblib
joblib.dump(final_rf, 'rf_model.pkl')  # Save the model as a .pkl file

print("Model saved locally as 'rf_model.pkl'")


Model saved locally as 'rf_model.pkl'


### **💾 Step 45: Save and Load Model Using Joblib 🗄️**

This step demonstrates how to save the finalized model using joblib and load it for future predictions.

#### **🔍 Explanation:**

- ```joblib.dump()```: Saves the trained and finalized model to a ```.pkl``` file, enabling reuse.

- ```joblib.load()```: Loads the saved model from the ```.pkl``` file for further predictions or testing.

#### **✅ Result:**

The model is saved locally as ```'rf_model.pkl'```.

The model is successfully loaded back for future use and predictions.

In [60]:
from pycaret.classification import *
import joblib  # For saving and loading the model

# Load your dataset (you can replace this with your own dataset)
from pycaret.datasets import get_data
data = get_data('iris')  # Sample dataset

# Set up PyCaret
clf1 = setup(data, target='species')  # Replace 'species' with your target column name

# Compare models and finalize the best one
best_model = compare_models()
final_rf = finalize_model(best_model)

# Save the model using joblib (local deployment)
joblib.dump(final_rf, 'rf_model.pkl')

print("Model saved locally as 'rf_model.pkl'")

# Load the saved model for further predictions
loaded_model = joblib.load('rf_model.pkl')
print("Model loaded successfully for prediction")


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Unnamed: 0,Description,Value
0,Session id,6503
1,Target,species
2,Target type,Multiclass
3,Target mapping,"Iris-setosa: 0, Iris-versicolor: 1, Iris-virginica: 2"
4,Original data shape,"(150, 5)"
5,Transformed data shape,"(150, 5)"
6,Transformed train set shape,"(105, 5)"
7,Transformed test set shape,"(45, 5)"
8,Numeric features,4
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.9718,0.0,0.9718,0.9789,0.9718,0.9579,0.9614,0.041
lda,Linear Discriminant Analysis,0.9718,0.0,0.9718,0.9743,0.9718,0.9576,0.9589,0.038
nb,Naive Bayes,0.9618,0.992,0.9618,0.9739,0.9606,0.9431,0.9497,0.06
lr,Logistic Regression,0.9527,0.0,0.9527,0.963,0.9519,0.9295,0.935,1.385
ada,Ada Boost Classifier,0.9527,0.0,0.9527,0.963,0.9519,0.9295,0.935,0.187
gbc,Gradient Boosting Classifier,0.9527,0.0,0.9527,0.963,0.9519,0.9295,0.935,0.425
knn,K Neighbors Classifier,0.9518,0.9898,0.9518,0.9664,0.9505,0.928,0.936,0.09
dt,Decision Tree Classifier,0.9436,0.9598,0.9436,0.9554,0.9428,0.9163,0.9225,0.039
rf,Random Forest Classifier,0.9427,0.9976,0.9427,0.955,0.9414,0.9141,0.921,0.267
et,Extra Trees Classifier,0.9427,0.9976,0.9427,0.955,0.9414,0.9141,0.921,0.22


Model saved locally as 'rf_model.pkl'
Model loaded successfully for prediction
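
A successful load does not by itself show that the restored object behaves like the original, so it is worth comparing predictions before and after the round trip. The sketch below illustrates that check with the standard-library `pickle` module and a toy threshold "model" stored as a dictionary. Both are stand-ins chosen so the example runs without PyCaret; with the notebook's objects, the same comparison would use `joblib.dump()`, `joblib.load()`, and the finalized pipeline. `predict` and `round_trip_matches` are hypothetical helpers.

```python
import pickle
from pathlib import Path


def predict(model, values):
    """Toy stand-in for a model: predict 1 when a value exceeds the threshold."""
    return [int(v > model["threshold"]) for v in values]


def round_trip_matches(model, samples, path):
    """Save `model`, load it back, and compare predictions on `samples`."""
    path = Path(path)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    with path.open("rb") as fh:
        restored = pickle.load(fh)
    return predict(restored, samples) == predict(model, samples)


toy_model = {"threshold": 2.5}
ok = round_trip_matches(toy_model, [1.0, 2.5, 4.2], "toy_model.pkl")
print("Round trip preserved predictions:", ok)
Path("toy_model.pkl").unlink()  # clean up the demo file
```

Pickle-based files, including those written by joblib, can execute code when loaded, so only load model files from sources you trust.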


### **📊 Step 46: Save, Load, and Predict with a Finalized Model Using Joblib 🧑‍💻**

In this step, we load the diabetes dataset, train the model, save it, load it back, and make predictions on unseen data.

#### **🔍 Explanation:**

- **Loading and Preparing the Dataset**: The ```diabetes``` dataset is loaded, and its columns are printed.

- **Model Training**: PyCaret is used to set up the dataset and train the best model.

- **Saving and Loading the Model**: The finalized model is saved using ```joblib.dump()``` and loaded with ```joblib.load()```.

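- **Prediction**: The loaded model predicts on the feature columns only (the target column is removed). These rows were also part of the training data, so this step confirms that the saved model works rather than measuring its accuracy.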

#### **✅ Result:**

- **Model Saved**: The model is saved locally as ```'diabetes_model.pkl'```.

- **Model Loaded**: The saved model is successfully loaded.

- **Predictions**: Predictions are made on unseen data, showing the first few results.

In [61]:
from pycaret.classification import *
from pycaret.datasets import get_data
import joblib

# Load the diabetes dataset (PIMA)
data = get_data('diabetes')

# Show the columns
print("✅ Available columns:", data.columns.tolist())

# Setup PyCaret with correct target column
clf = setup(data, target='Class variable', verbose=False)

# Train and finalize best model
best_model = compare_models()
final_model = finalize_model(best_model)

# Save the model locally
joblib.dump(final_model, 'diabetes_model.pkl')
print("✅ Model saved as 'diabetes_model.pkl'")

# Load the model
loaded_model = joblib.load('diabetes_model.pkl')
print("✅ Model loaded successfully")

# Create unseen data (remove target column)
data_unseen = data.drop('Class variable', axis=1)

# Predict using the loaded model
predictions = predict_model(loaded_model, data=data_unseen)
print("✅ Predictions on unseen data:")
print(predictions.head())


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


✅ Available columns: ['Number of times pregnant', 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test', 'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)', '2-Hour serum insulin (mu U/ml)', 'Body mass index (weight in kg/(height in m)^2)', 'Diabetes pedigree function', 'Age (years)', 'Class variable']


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.7803,0.8353,0.619,0.7222,0.661,0.5005,0.5077,0.327
lr,Logistic Regression,0.767,0.8445,0.5699,0.7074,0.6221,0.4588,0.4702,0.074
ridge,Ridge Classifier,0.7669,0.8452,0.5646,0.7095,0.6204,0.4576,0.4689,0.039
lda,Linear Discriminant Analysis,0.7651,0.8449,0.5646,0.7056,0.6186,0.4541,0.4653,0.044
nb,Naive Bayes,0.7577,0.8172,0.624,0.6696,0.6366,0.457,0.4645,0.043
et,Extra Trees Classifier,0.7577,0.8336,0.5541,0.6965,0.6085,0.438,0.4495,0.239
gbc,Gradient Boosting Classifier,0.7504,0.8138,0.5816,0.6699,0.618,0.4346,0.4403,0.241
qda,Quadratic Discriminant Analysis,0.7502,0.821,0.5646,0.6732,0.6027,0.4254,0.4362,0.039
knn,K Neighbors Classifier,0.7318,0.7679,0.5442,0.6514,0.5844,0.3898,0.399,0.064
lightgbm,Light Gradient Boosting Machine,0.73,0.8121,0.5871,0.6242,0.6006,0.3978,0.4014,0.257


✅ Model saved as 'diabetes_model.pkl'
✅ Model loaded successfully


✅ Predictions on unseen data:
   Number of times pregnant  \
0                         6   
1                         1   
2                         8   
3                         1   
4                         0   

   Plasma glucose concentration a 2 hours in an oral glucose tolerance test  \
0                                                148                          
1                                                 85                          
2                                                183                          
3                                                 89                          
4                                                137                          

   Diastolic blood pressure (mm Hg)  Triceps skin fold thickness (mm)  \
0                                72                                35   
1                                66                                29   
2                                64                                 0   
3               

### **📊 Step 47: Predict on Unseen Data Using the Loaded Model**

In this step, we will use the loaded model to make predictions on unseen data, and display the predicted results.

#### **🔍 Explanation:**

- **Unseen Data**: This is the data without the target column (```'Class variable'```).

- **Prediction**: The ```predict_model()``` function is used to make predictions on the unseen data using the loaded model.

- **Displaying Results**: The predictions are displayed as a table.

#### **✅ Result**:

- **Predictions**: You will see the predicted values for the displayed rows.

   - The output adds two columns: ```prediction_label``` (the predicted class) and ```prediction_score``` (the probability of the predicted class).

In [62]:
unseen_predictions

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable,prediction_label,prediction_score
461,1,71,62,0,0,21.799999,0.416,26,0,0,1.0
366,6,124,72,0,0,27.6,0.368,29,1,1,0.78
767,1,93,70,31,0,30.4,0.315,23,0,0,0.99
666,4,145,82,18,0,32.5,0.235,70,1,1,0.77
742,1,109,58,18,116,28.5,0.219,22,0,0,1.0
333,12,106,80,0,0,23.6,0.137,44,0,0,0.91
757,0,123,72,0,0,36.299999,0.258,52,1,1,0.8
83,0,101,65,28,0,24.6,0.237,22,0,0,1.0
469,6,154,78,41,140,46.099998,0.571,27,0,0,0.69
167,4,120,68,0,0,29.6,0.709,34,0,0,0.79


### **🎉 Conclusion:**

In this tutorial, we walked through the entire process of building, saving, and deploying a machine learning model using PyCaret. Here's a recap of what we accomplished:

- **Data Preprocessing**: We loaded datasets and performed initial setup using PyCaret.

- **Model Building**: We trained a classification model and selected the best model through comparison.

- **Model Finalization**: We finalized the best-performing model for predictions.

- **Model Saving and Loading**: We saved the model locally using both ```joblib``` and PyCaret's ```save_model()``` function, then demonstrated loading the model for making predictions.

- **Predictions on Unseen Data**: We used the saved model to predict outcomes for new, unseen data.

This streamlined process highlights how PyCaret simplifies machine learning workflows, making it easier to develop, save, and deploy models efficiently. You can apply this workflow to various datasets and problems, improving your data science productivity. 🚀