<a href="https://colab.research.google.com/github/psrana/Machine-Learning-using-PyCaret/blob/main/03_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **PyCaret for Clustering**
---
- It is a bundle of many Machine Learning algorithms.
- Only three lines of code are required to compare 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self learning resource**
1. Documentation on Pycaret-Clustering: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html#clustering"> Click Here </a>**

---


### **(a) Install Pycaret**

In [None]:
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!!")

### **(b) Get the PyCaret version**

In [None]:
from pycaret.utils import version
version()

---
# **1. Clustering - Part 1 (K-Means Clustering)**
---
### **1.1 Get the list of datasets available in pycaret (56)**

In [None]:
from pycaret.datasets import get_data
get_data('index')

---
### **1.2 Get the "jewellery" dataset**
---

In [None]:
myDataSet = get_data("jewellery")    # SN is 30

# This is an unsupervised dataset.
# No target column is defined.

---
### **1.3 Save and Download the dataset to local system**
---

In [None]:
myDataSet.to_csv("myDataSet.csv")

# Explore the "Files" Folder

---
### **1.4 "Parameter setting" for clustering model**
- ##### **Train/Test division, applying data pre-processing** {Sampling, Normalization, Transformation, PCA, Handling of Outliers, Feature Selection}
---

In [None]:
from pycaret.clustering import *
s = setup(myDataSet)

# Re-run the code if any error occurs

---
### **1.5 Building "KMean" clustering model**
---

In [None]:
KMeanClusteringModel = create_model('kmeans', num_clusters=4)

---
### **1.6 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

---
### **1.7 Clustering in "Three lines of code"**
---

In [None]:
from pycaret.clustering import *

myDataSet = get_data("jewellery", verbose=False)
s = setup(myDataSet, verbose=False)

KMeanClusteringModel = create_model('kmeans', num_clusters=4)
kMeanPrediction = assign_model(KMeanClusteringModel)

kMeanPrediction

---
### **1.8 "Saving" the result**
---



In [None]:
kMeanPrediction.to_csv("kMeanPrediction.csv")

# Explore the "Files" Folder

---
# **2. Clustering: Save, Upload and Load the Model**
---
### **2.1 Save the "trained model"**
---

In [None]:
s = save_model(KMeanClusteringModel, 'kMeanClusteringModelFile')

# Explore the "Files" Folder and download it.

---
### **2.2 Load the model**
---
##### Use this when working in **"Anaconda/Jupyter notebook"** on a local machine

In [None]:
kMeanClusteringModel = load_model('kMeanClusteringModelFile')

---
# **3. Clustering: Cluster the new dataset (Unseen Data)**
---
### **3.1 Select some data or upload user dataset file**

In [None]:
# Select top 10 rows from jewellery dataset

newData = get_data("jewellery").iloc[:10]

---
### **3.2 Make prediction on the new dataset (Unseen Data)**
---

In [None]:
newPredictions = predict_model(kMeanClusteringModel, data = newData)
newPredictions

---
### **3.3 Save the prediction result to csv**
---

In [None]:
newPredictions.to_csv("NewPredictions.csv")


# Explore the "Files" Folder and download it.

---
# **4. Clustering: Plotting the Cluster**
---
#### The following plots can be generated for clusters
```
1. Cluster PCA Plot (2d)          'cluster'
2. Cluster t-SNE (3d)             'tsne'
3. Elbow Plot                     'elbow'
4. Silhouette Plot                'silhouette'
5. Distance Plot                  'distance'
6. Distribution Plot              'distribution'
```

---
### **4.1 Evaluate Cluster Model**
---

In [None]:
evaluate_model(kMeanClusteringModel)

---
### **4.2 2D-plot for Cluster**
---

In [None]:
plot_model(kMeanClusteringModel, plot='cluster')

---
### **4.3 3D-plot for Cluster**
---

In [None]:
plot_model(kMeanClusteringModel, plot = 'tsne')

---
### **4.4 Elbow Plot**
---

In [None]:
plot_model(kMeanClusteringModel, plot = 'elbow')
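An elbow plot charts a measure of cluster tightness against the number of clusters. A common choice for that measure is the within-cluster sum of squared distances to each cluster's centre. The sketch below computes that quantity with the standard library only, on made-up 1-D points and hand-picked cluster assignments rather than a fitted K-Means model. The helper name `wcss` and the sample data are assumptions for illustration, and whether PyCaret's plot uses exactly this score is also an assumption.

```python
from collections import defaultdict


def wcss(points, labels):
    """Within-cluster sum of squared distances for 1-D points."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    total = 0.0
    for members in groups.values():
        center = sum(members) / len(members)  # cluster mean
        total += sum((m - center) ** 2 for m in members)
    return total


# Hypothetical data with three obvious groups
points = [1, 2, 3, 11, 12, 13, 21, 22, 23]

# Hand-picked assignments for k = 1..4 (not the result of a real fit)
assignments = {
    1: [0] * 9,
    2: [0] * 6 + [1] * 3,
    3: [0] * 3 + [1] * 3 + [2] * 3,
    4: [0] * 3 + [1] * 3 + [2] + [3] * 2,
}

scores = {k: wcss(points, labels) for k, labels in assignments.items()}
for k, score in scores.items():
    print(f"k = {k}: WCSS = {score}")
```

On this toy data the score falls sharply up to k = 3 and only slightly afterwards, which is the "elbow" one looks for in the plot.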

---
### **4.5 Silhouette Plot**
---

In [None]:
plot_model(kMeanClusteringModel, plot = 'silhouette')
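For readers unfamiliar with the silhouette coefficient, the following is a minimal standard-library sketch of how it is computed for each point: `a` is the mean distance to the other members of the point's own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The function name `silhouette_scores`, the 1-D sample points, and the labels are made up for this example. It assumes at least two clusters and no cluster made up entirely of identical points; PyCaret computes its own silhouette values internally.

```python
from statistics import mean


def silhouette_scores(points, labels):
    """Return the silhouette coefficient of every 1-D point."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(abs(point - q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [1, 2, 3, 11, 12, 13]            # hypothetical data
good = silhouette_scores(points, [0, 0, 0, 1, 1, 1])
mixed = silhouette_scores(points, [0, 1, 0, 1, 0, 1])

print("well separated:", round(mean(good), 3))
print("mixed labels:  ", round(mean(mixed), 3))
```

Scores close to 1 indicate points that sit well inside their cluster, while scores near 0 or below suggest overlapping or wrongly assigned points.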

---
### **4.6 Distribution Plot**
---

In [None]:
plot_model(kMeanClusteringModel, plot = 'distribution')

---
### **4.7 Distance Plot**
---

In [None]:
plot_model(kMeanClusteringModel, plot = 'distance') # Rerun the code

---
# **5. Complete Code for Clustering (KMean)**
---
### **5.1 For Cluster = 3, 4, 5, 6**

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

myDataSet = get_data('jewellery', verbose=False)
setup(data = myDataSet, verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)
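The four near-identical calls above can be folded into a loop. The sketch below takes the model-building function as an argument so that it can be tried without PyCaret installed. `fit_for_each_k` and its parameter names are made up for this example; the only PyCaret call it relies on is the `create_model('kmeans', num_clusters=...)` form already used in this notebook.

```python
def fit_for_each_k(create_model_fn, ks=(3, 4, 5, 6), model_id="kmeans"):
    """Build one model per cluster count and return them keyed by k.

    create_model_fn is expected to behave like PyCaret's create_model,
    i.e. accept a model ID and a num_clusters keyword argument.
    """
    models = {}
    for k in ks:
        print(f"For Cluster = {k}")
        models[k] = create_model_fn(model_id, num_clusters=k)
    return models


# Intended use in this notebook, after setup(...) has run:
# models = fit_for_each_k(create_model)
```

The returned dictionary keeps every fitted model, so `models[5]` could later be passed to `plot_model` or `assign_model` instead of refitting.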

---
### **5.2 Other Clustering Algorithms**
---
```
1. K-Means clustering                 'kmeans'
2. Affinity Propagation               'ap'
3. Mean shift clustering              'meanshift'
4. Spectral Clustering                'sc'
5. Agglomerative Clustering           'hclust'
6. Density-Based Spatial Clustering   'dbscan'
7. OPTICS Clustering                  'optics'
8. Birch Clustering                   'birch'
9. K-Modes clustering                 'kmodes'
```

---
# **6. Clustering: Apply "Data Preprocessing"**
---
### **Read the Dataset**

In [None]:
from pycaret.clustering import *
from pycaret.datasets import get_data

myDataSet = get_data('jewellery')

---
### **6.1 Model Performance using "Normalization"**
---
### **6.1.1 Elbow Plot**


In [None]:
setup(data = myDataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans', verbose=False)
plot_model(x, plot = 'elbow')

# Re-run the code again for different parameters
# normalize_method = {zscore, minmax, maxabs, robust}
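To make two of the `normalize_method` options more concrete, here is a small standard-library sketch of what z-score and min-max scaling do to a single numeric column. It only illustrates the arithmetic: the helper names `zscore` and `minmax` and the sample values are made up for this example and are not part of PyCaret, which applies the scaling internally during `setup`.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical column

print([round(z, 3) for z in zscore(ages)])
print(minmax(ages))
```

Both helpers assume the column is not constant, since a zero spread would cause a division by zero.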

---
### **6.1.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = myDataSet, normalize = True, normalize_method = 'zscore', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)


---
### **6.1.3 3D Plot for Cluster = 5 (Most important and interesting step)**
---

In [None]:
setup(data = myDataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans', num_clusters = 5)
plot_model(x, plot = 'tsne')

---
### **6.2 Model Performance using "Transformation"**
---

### **6.2.1 Elbow Plot**


In [None]:
setup(data = myDataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans', verbose=False)
plot_model(x, plot = 'elbow')

# transformation_method = {yeo-johnson, quantile}

---
### **6.2.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = myDataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.3 Model Performance using "PCA"**
---
### **6.3.1 Elbow Plot**

In [None]:
setup(data = myDataSet, pca = True, pca_method = 'linear', verbose=False)
x = create_model('kmeans', verbose=False)
plot_model(x, plot = 'elbow')


# pca_method = {linear, kernel, incremental}

---
### **6.3.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = myDataSet, pca = True, pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.4 Model Performance using "Transformation" + "Normalization"**
---
### **6.4.1 Elbow Plot**

In [None]:
setup(data = myDataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans', verbose=False)
plot_model(x, plot = 'elbow')

---
### **6.4.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = myDataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.5 Model Performance using "Transformation" + "Normalization" + "PCA"**
---
### **6.5.1 Elbow Plot**

In [None]:
setup(data = myDataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

x = create_model('kmeans', verbose=False)
plot_model(x, plot = 'elbow')

---
### **6.5.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = myDataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
# **7. Other Clustering Techniques**
---
```
1. K-Means clustering                 'kmeans'
2. Affinity Propagation               'ap'
3. Mean shift clustering              'meanshift'
4. Spectral Clustering                'sc'
5. Agglomerative Clustering           'hclust'
6. Density-Based Spatial Clustering   'dbscan'
7. OPTICS Clustering                  'optics'
8. Birch Clustering                   'birch'
9. K-Modes clustering                 'kmodes'
```

---
### **7.1 Building an Agglomerative (Hierarchical) clustering model**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

myDataSet = get_data('jewellery', verbose=False)
setup(data = myDataSet, verbose=False)

x = create_model('hclust')
plot_model(x, plot = 'elbow')

---
### **7.1.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
hierarchicalModel = create_model('hclust', num_clusters=3)
hierarchicalModelPrediction = assign_model(hierarchicalModel)
hierarchicalModelPrediction

---
### **7.1.2 Evaluate Agglomerative (Hierarchical) Clustering**
---

In [None]:
evaluate_model(hierarchicalModel)

---
### **7.2 Density-Based Spatial Clustering**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

myDataSet = get_data('jewellery', verbose=False)
setup(data = myDataSet, verbose=False)
dbscanModel = create_model('dbscan')

---
### **7.2.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
dbscanModelPrediction = assign_model(dbscanModel)
dbscanModelPrediction

# Noisy samples are given the label -1 i.e. 'Cluster -1'

### **Key Points**

- num_clusters is not required for some of the clustering algorithms: Affinity Propagation ('ap'), Mean shift
  clustering ('meanshift'), Density-Based Spatial Clustering ('dbscan') and OPTICS Clustering ('optics').
- For these models, the number of clusters is determined automatically.

- When the fit does not converge in the Affinity Propagation ('ap') model, all data points are labelled -1.

- Noisy samples are given the label -1 when using Density-Based Spatial Clustering ('dbscan') or OPTICS Clustering ('optics').

- OPTICS ('optics') clustering may take longer training times on large datasets.
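
Because noisy samples carry their own label, it is often useful to separate them from the real clusters before summarising the result. The sketch below does this with the standard library on a made-up list of labels. It assumes the labels are strings of the form `'Cluster 0'` and `'Cluster -1'`, as noted in section 7.2.1; the helper name `summarize_labels` and the sample labels are hypothetical.

```python
from collections import Counter

NOISE_LABEL = "Cluster -1"  # assumed label for noisy samples


def summarize_labels(labels):
    """Return (cluster sizes without noise, number of noisy samples)."""
    counts = Counter(labels)
    noise = counts.pop(NOISE_LABEL, 0)  # 0 if no sample was flagged as noise
    return dict(counts), noise


# Hypothetical labels, e.g. taken from the label column of an assign_model result
labels = [
    "Cluster 0", "Cluster 1", "Cluster -1", "Cluster 0",
    "Cluster -1", "Cluster 1", "Cluster 0",
]

sizes, noise = summarize_labels(labels)
print("cluster sizes:", sizes)
print("noisy samples:", noise)
```

With a DataFrame returned by `assign_model`, the list could be obtained with something like `dbscanModelPrediction["Cluster"].tolist()`, assuming the label column is named `Cluster`.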


---
# **8. Deploy the model on AWS**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html#pycaret.clustering.deploy_model">Click Here</a>**