<a href="https://colab.research.google.com/github/psrana/Machine-Learning-using-PyCaret/blob/main/02_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **PyCaret for Classification**
---
- PyCaret bundles many machine learning algorithms in one library.
- Three lines of code are enough to compare around 20 ML models.
- PyCaret modules include:
    - Classification
    - Regression
    - Clustering

---

### **Self-learning resources**
1. PyCaret tutorials: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html" target="_blank"> Click Here</a>**

2. PyCaret classification tutorials: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html#classification" target="_blank"> Click Here </a>**

---

### **In this tutorial we will learn:**

- Getting Data
- Setting up Environment
- Create Model
- Tune Model
- Plot Model
- Finalize Model
- Predict Model
- Save / Load Model
---



### **(a) Install PyCaret**

In [None]:
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!!")

### **(b) Get the PyCaret version**

In [None]:
from pycaret.utils import version
version()

---
# **1. Classification: Basics**
---
### **1.1 Get the list of datasets available in PyCaret (Total Datasets = 56)**




In [None]:
from pycaret.datasets import get_data
dataSets = get_data('index')

---
### **1.2 Get the "diabetes" dataset (Step-I)**
---

In [None]:
diabetesDataSet = get_data("diabetes")    # SN is 7

# This is a supervised dataset.
# "Class variable" is the target column.

---
### **1.3 Save and Download the dataset to local system**
---

In [None]:
# Save the dataset to a file (index=False avoids writing the row index as an extra column)
diabetesDataSet.to_csv("diabetesDataSet.csv", index=False)

# Download from "Files Folder"

---
### **1.4 Parameter setting for all models (Step-II)**
- ##### **Train/Test division, applying data pre-processing** {Sampling, Normalization, Transformation, PCA, Handling of Outliers, Feature Selection}
---

In [None]:
from pycaret.classification import *
s = setup(data=diabetesDataSet, target='Class variable')

# Other Parameters:
# train_size = 0.7
# data_split_shuffle = False, data_split_stratify = False
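
To make the `train_size` parameter concrete, here is a minimal standard-library sketch of a shuffled 70/30 split. It only illustrates the idea and is not PyCaret's implementation; the helper name `split_rows`, the seed, and the row count of 768 are assumptions chosen for the example.

```python
import random

def split_rows(rows, train_size=0.7, shuffle=True, seed=42):
    """Split a sequence of rows into train and test parts (illustrative helper)."""
    rows = list(rows)                      # copy so the caller's data is untouched
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded shuffle for repeatability
    cut = int(len(rows) * train_size)      # number of rows that go to training
    return rows[:cut], rows[cut:]

# Row indices stand in for the actual records.
train, test = split_rows(range(768), train_size=0.7)
print(len(train), len(test))
```

Setting `shuffle=False` in this sketch keeps the rows in file order, which is the analogue of `data_split_shuffle = False`.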

---
### **1.5 Run all models (Step-III)**
---

In [None]:
cm = compare_models()
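
The scores reported by `compare_models()` are cross-validated averages (PyCaret's documented default is 10 folds, stratified by class). The sketch below shows only the basic fold bookkeeping, using a plain, unstratified, unshuffled k-fold split; `kfold_indices` is a hypothetical helper, not a PyCaret function.

```python
def kfold_indices(n_samples, n_folds=10):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold CV."""
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 537 is an assumed training-set size, used only for illustration.
folds = list(kfold_indices(537, n_folds=10))
print([len(val) for _, val in folds])
```

Each model is fitted once per fold on `train_idx` and scored on `val_idx`; the table shows the mean of those scores.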

---
### **1.6 "Three lines of code" for model comparison for "Diabetes" dataset**
---



In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

diabetesDataSet = get_data("diabetes", verbose=False)
setup(data=diabetesDataSet, target='Class variable', train_size = 0.7, verbose=False)
cm = compare_models()

---
### **1.7 "Three lines of code" for model comparison for "Cancer" dataset**
---



In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

cancerDataSet = get_data("cancer", verbose=False)
setup(data = cancerDataSet, target='Class', train_size = 0.7, verbose=False)
cm = compare_models()

---
# **2. Classification: Working with a user dataset**
---


In [None]:
from pycaret.classification import *
import pandas as pd

# First upload the user file (myDataSet.csv) to Colab

# myDataSet = pd.read_csv("myDataSet.csv")                    # Read the file; change the file name accordingly
# s = setup(data = myDataSet, target='target', verbose=False)   # Change the target name accordingly
# cm = compare_models()                                       # Uncomment and execute

---
# **3. Classification: Apply "Data Preprocessing"**
---

### **3.1 Model performance using "Normalization"**

In [None]:
diabetesDataSet = get_data("diabetes", verbose=False)
diabetesDataSet.columns = diabetesDataSet.columns.str.replace(' ', '_')
setup(data=diabetesDataSet, target='Class_variable',
      normalize = True, normalize_method = 'zscore',
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

# Re-run the code with different parameters
# normalize_method = {zscore, minmax, maxabs, robust}
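
As a rough illustration of what two of these methods do to a single numeric column, here is a standard-library sketch of `zscore` and `minmax` scaling. The sample values are made up, and the functions are hand-written stand-ins rather than the scalers PyCaret uses internally.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def minmax(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

glucose = [85, 89, 137, 148, 183]   # made-up sample values
print([round(z, 2) for z in zscore(glucose)])
print([round(m, 2) for m in minmax(glucose)])
```

Scaling mainly matters for distance-based and linear models such as `knn`, `svm`, and `lr`; tree-based models are largely insensitive to it.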

---
### **3.2 Model performance using "Feature Selection"**
---

In [None]:
setup(data=diabetesDataSet, target='Class_variable',
      feature_selection = True, feature_selection_method = 'classic',
      n_features_to_select = 0.2,
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

# Re-run the code with different parameters
# feature_selection_method = {classic, univariate, sequential}
# n_features_to_select = {0.1, 0.2, 0.3, 0.4, 0.5, ..... }

---
### **3.3 Model performance using "Outlier Removal"**
---

In [None]:
setup(data=diabetesDataSet, target='Class_variable',
      remove_outliers = True, outliers_method = "iforest", outliers_threshold = 0.05,
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

# Re-run the code with different parameters
# outliers_threshold = {0.04, 0.05, 0.06, 0.07, 0.08, ....}
# outliers_method = {iforest, ee, lof}

---
### **3.4 Model performance using "Transformation"**
---

In [None]:
setup(data=diabetesDataSet, target='Class_variable',
      transformation = True, transformation_method = 'yeo-johnson',
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

---
### **3.5 Model performance using "PCA"**
---

In [None]:
setup(data=diabetesDataSet, target='Class_variable',
      pca = True, pca_method = 'linear',
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

# Re-run the code with different parameters
# pca_method = {linear, kernel, incremental}

---
### **3.6 Model performance using "Outlier Removal" + "Normalization"**
---

In [None]:
setup(data=diabetesDataSet, target='Class_variable',
      remove_outliers = True, outliers_threshold = 0.05,
      normalize = True, normalize_method = 'zscore',
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

---
### **3.7 Model performance using "Outlier Removal" +  "Normalization" + "Transformation"**
---

In [None]:
setup(data=diabetesDataSet, target='Class_variable',
      remove_outliers = True, outliers_threshold = 0.05,
      normalize = True, normalize_method = 'zscore',
      transformation = True, transformation_method = 'yeo-johnson',
      data_split_shuffle = False, data_split_stratify = False, verbose=False)
cm = compare_models()

---
### **3.8 Explore more parameters of "setup()" in PyCaret**
---
- Explore setup() parameters in **Step 1.4**
- **<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html" target="_blank"> Click Here</a>** for more

---
# **4. Classification: More Operations**
---
### **4.1 Build a single model - "RandomForest"**

In [None]:
from pycaret.datasets import get_data
from pycaret.classification import *

diabetesDataSet = get_data("diabetes", verbose=False)
setup(data=diabetesDataSet, target='Class variable', verbose=False)

rfModel = create_model('rf')
# Explore more parameters

---
### **4.2 Other available classification models**
---
-	'ada' -	Ada Boost Classifier
-	'dt' -	Decision Tree Classifier
-	'et' -	Extra Trees Classifier
-	'gbc' -	Gradient Boosting Classifier
-	'knn' -	K Neighbors Classifier
-	'lightgbm' -	Light Gradient Boosting Machine
-	'lda' -	Linear Discriminant Analysis
-	'lr' -	Logistic Regression
-	'nb' -	Naive Bayes
-	'qda' -	Quadratic Discriminant Analysis
-	'rf' -	Random Forest Classifier
-	'ridge' -	Ridge Classifier
-	'svm' -	SVM - Linear Kernel

---
### **4.3 Explore more parameters of "create_model()" in PyCaret**
---

**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.create_model" target="_blank"> Click Here</a>**

---
### **4.4 Make prediction on the "new unseen dataset"**
---
#### **Get the "new unseen dataset"**



In [None]:
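# Select the first 10 rows from the diabetes dataset.
# These rows come from the same data used for training, so this only demonstrates the API.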
newDataSet = get_data("diabetes").iloc[:10]

#### **Make prediction on "new unseen dataset"**

In [None]:
newPredictions = predict_model(rfModel, data = newDataSet)
newPredictions

---
### **4.5 "Save" and "Download" the prediction result**
---

In [None]:
from google.colab import files

#newPredictions.to_csv("NewPredictions.csv", index=False)       # Uncomment and run again
#files.download('NewPredictions.csv')                           # Uncomment and run again

---
### **4.6 "Save" the trained model**
---

In [None]:
#sm = save_model(rfModel, 'rfModelFile')      # Uncomment and run again

---
### **4.7 Download the "trained model file" to user local system**
---

In [None]:
from google.colab import files
#files.download('rfModelFile.pkl')             # Uncomment and run again

---
### **4.8  "Upload the trained model" --> "Load the model"  --> "Make the prediction" on "new unseen dataset"**
---
### **4.8.1 Upload the  "Trained Model"**


In [None]:
from google.colab import files
#files.upload()    # Uncomment and run again

---
### **4.8.2 Load the "Model"**
---

In [None]:
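# Pass the file name without the ".pkl" extension. Colab may append " (1)" to an uploaded
# file when one with the same name already exists; adjust the name to match.
#rfModel = load_model('rfModelFile (1)') # Uncomment and run again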
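
The save, download, upload, and load steps amount to writing an object to disk and reading it back. The sketch below shows that round trip with Python's built-in `pickle` module as a stand-in; the dictionary `model_like` is a placeholder for a trained model, and this is not how `save_model()` is implemented.

```python
import os
import pickle
import tempfile

model_like = {"name": "rf", "n_estimators": 100}   # placeholder for a trained model

# Write to a temporary folder so the example leaves the working directory clean.
path = os.path.join(tempfile.mkdtemp(), "rfModelFile.pkl")
with open(path, "wb") as fh:
    pickle.dump(model_like, fh)       # "save"

with open(path, "rb") as fh:
    restored = pickle.load(fh)        # "load"

print(restored == model_like)
```

As with any pickled file, only load model files that come from a source you trust.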

---
### **4.8.3 Make the prediction on "new unseen dataset"**
---

In [None]:
#newPredictions = predict_model(rfModel, data = newDataSet)    # Uncomment and run again
#newPredictions            # Uncomment and run again

---
# **5. Plot the trained model**
---
**The following plots are available for a trained model**
*   Area Under the Curve         - 'auc'
*   Discrimination Threshold     - 'threshold'
*   Precision Recall Curve       - 'pr'
*   Confusion Matrix             - 'confusion_matrix'
*   Class Prediction Error       - 'error'
*   Classification Report        - 'class_report'
*   Decision Boundary            - 'boundary'
*   Recursive Feat. Selection    - 'rfe'
*   Learning Curve               - 'learning'
*   Manifold Learning            - 'manifold'
*   Calibration Curve            - 'calibration'
*   Validation Curve             - 'vc'
*   Dimension Learning           - 'dimension'
*   Feature Importance           - 'feature'
*   Model Hyperparameter         - 'parameter'

---
### **5.1 Create RandomForest model or any other model**
---

In [None]:
rfModel = create_model('rf')

---
### **5.2 Create "Confusion Matrix"**
---

In [None]:
plot_model(rfModel, plot='confusion_matrix')
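
For reference, the four cells of a binary confusion matrix can be counted by hand. The sketch below uses made-up labels and predictions; `confusion_counts` is a hypothetical helper written for this example.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions
c = confusion_counts(y_true, y_pred)
accuracy = (c["TP"] + c["TN"]) / len(y_true)
print(dict(c), accuracy)
```

Accuracy is the sum of the diagonal cells (TP and TN) divided by the total number of samples.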

---
### **5.3 Plot the "learning curve"**
---

In [None]:
plot_model(rfModel, plot='learning')

---
### **5.4 Plot the "AUC Curve" (Area Under the Curve)**
---

In [None]:
plot_model(rfModel, plot='auc')
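
One way to read the AUC value: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up scores; it is a quadratic-time illustration, not the routine used by the plot.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]               # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]      # made-up predicted probabilities
print(roc_auc(y_true, scores))      # 3 of 4 pairs are ordered correctly -> 0.75
```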

---
### **5.5 Plot the "Decision Boundary"**
---

In [None]:
plot_model(rfModel, plot='boundary')

---
### **5.6 Get the model "parameters"**
---

In [None]:
plot_model(rfModel, plot='parameter')

---
### **5.7 Explore more parameters of "plot_model()" in PyCaret**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.plot_model" target="_blank"> Click Here </a>**

---
# **6. Feature Importance**
---
### **6.1 Feature Importance using "Random Forest"**


In [None]:
rfModel = create_model('rf', verbose=False)
plot_model(rfModel, plot='feature')

---
### **6.2 Feature Importance using "Extra Trees Classifier"**
---

In [None]:
etModel = create_model('et', verbose=False)
plot_model(etModel, plot='feature')

---
### **6.3 Feature Importance using "Decision Tree"**
---

In [None]:
dtModel = create_model('dt', verbose=False)
plot_model(dtModel, plot='feature')

---
# **7. Tune/Optimize the model performance**
---
### **7.1 Train "Decision Tree" with default parameters**


In [None]:
dtModel = create_model('dt')

#### **Get the "parameters" of Decision Tree**

In [None]:
plot_model(dtModel, plot='parameter')

---
### **7.2 Tune "Decision Tree" model**
---

In [None]:
dtModelTuned = tune_model(dtModel, n_iter=50)
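
The `n_iter` argument sets how many candidate configurations are tried. Assuming the default random search, the idea can be sketched as below. The search space and the toy score function are hypothetical and chosen only for illustration; in practice the score would be a cross-validated metric.

```python
import random

def random_search(param_space, score_fn, n_iter=50, seed=0):
    """Sample n_iter random configurations and return the best (score, params)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently.
        params = {name: rng.choice(options) for name, options in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical search space (not PyCaret's actual grid for 'dt').
space = {"max_depth": [2, 4, 6, 8, 10], "min_samples_leaf": [1, 2, 5, 10]}

def toy_score(params):
    """Toy objective that peaks at max_depth=6, min_samples_leaf=5."""
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_leaf"] - 5)

best_score, best_params = random_search(space, toy_score, n_iter=50)
print(best_score, best_params)
```

A larger `n_iter` covers more of the space at the cost of a proportionally longer run.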

#### **Get the "tuned parameters" of Decision Tree**

In [None]:
plot_model(dtModelTuned, plot='parameter')

---
### **7.3 Explore more parameters of "tune_model()" in PyCaret**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.tune_model" target="_blank"> Click Here </a>**

---
# **8. AutoML - Advanced Machine Learning**
---

- Select n Best Models:
  - Ensemble, Stacking, Bagging, Blending
  - Auto tune the best n models

**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.automl" target="_blank">Click Here</a>**


---
# **9. Deploy the model on AWS / Azure**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.deploy_model" target="_blank">Click Here</a>**