# Clustering Exploration

This document explores several clustering algorithms from scikit-learn and related libraries (pyclustering, mst_clustering) in a Jupyter Notebook. It was written by Muhammad Hilmi Asyrofi (13515083) as one of the small assignments (Tugas Kecil) for the course IF4071 Pembelajaran Mesin (<i>Machine Learning</i>). The algorithms covered are K-Means, Agglomerative, DBSCAN, Gaussian Mixture, K-Medoids, MST, and grid-based clustering. For some algorithms, the trained model is saved to an <i>external file</i> so that it can be reused later. The main pipeline of the experiments is:

- Import the dataset
- One-hot encoding (categorical data only)
- Run the clustering algorithm
- Predict labels

### Import Dataset

In this step the datasets are read from CSV files into variables that the later steps use. Based on its columns, each dataset is split in two: one frame containing the features and one frame containing the class label. Two datasets are used in this experiment: the iris dataset and the play tennis dataset.

In [48]:
import pandas as pd 
import numpy as np
np.seterr(over='ignore')
%matplotlib inline
import matplotlib.pyplot as plt
import pickle

# read iris data
df = pd.read_csv('dataset/iris.csv')
iris_features = df.drop('variety', axis = 1)
iris_labels = df.drop(list(iris_features), axis = 1)

# read tennis data
df = pd.read_csv('dataset/tennis.csv')
df = df.drop('day', axis = 1)
tennis_features = df.drop('play', axis = 1)
tennis_labels = df.drop(list(tennis_features), axis = 1)

In [3]:
iris_features.head(5)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [4]:
tennis_features.head(5)

Unnamed: 0,outlook,temp,humidity,wind
0,Sunny,Hot,High,Weak
1,Sunny,Hot,High,Strong
2,Overcast,Hot,High,Weak
3,Rain,Mild,High,Weak
4,Rain,Cool,Normal,Weak


### One Hot Encoder 

Convert the categorical data into one-hot indicator columns (a mostly-zero, sparse representation).

In [55]:
# one-hot encode every categorical column; dtype=int keeps the values as 0/1
tennis_features = pd.get_dummies(tennis_features, dtype=int)

In [56]:
tennis_features.head()

Unnamed: 0,outlook_Overcast,outlook_Rain,outlook_Sunny,temp_Cool,temp_Hot,temp_Mild,humidity_High,humidity_Normal,wind_Strong,wind_Weak
0,0,0,1,0,1,0,1,0,0,1
1,0,0,1,0,1,0,1,0,1,0
2,1,0,0,0,1,0,1,0,0,1
3,0,1,0,0,0,1,1,0,0,1
4,0,1,0,1,0,0,0,1,0,1


### K-Means

In [57]:
from sklearn.cluster import KMeans

# iris dataset
kmeans = KMeans(n_clusters = 3, random_state = 0)

kmeans.fit(iris_features)

# save the model to disk (the with-block closes the file handle)
filename = 'kmeans_iris_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(kmeans, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

iris_result = loaded_model.predict(iris_features)
print(iris_result)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]


In [58]:
# tennis dataset
kmeans = KMeans(n_clusters = 2, random_state = 0)

kmeans.fit(tennis_features)

# save the model to disk (the with-block closes the file handle)
filename = 'kmeans_tennis_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(kmeans, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

tennis_result = loaded_model.predict(tennis_features)

print(tennis_result)

[1 1 1 1 0 0 0 1 0 0 0 1 0 1]
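
The save-and-load step can be checked by confirming that the restored model predicts exactly what the original does. This is a minimal sketch under assumptions: it uses synthetic `make_blobs` data as a stand-in for the notebook's datasets, and it serializes to bytes in memory rather than to a `.sav` file.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; the notebook uses iris and tennis)
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Serialize to bytes and restore; pickle.dump/pickle.load on a file work the same way
restored = pickle.loads(pickle.dumps(kmeans))

same_labels = np.array_equal(kmeans.predict(X), restored.predict(X))
same_centers = np.allclose(kmeans.cluster_centers_, restored.cluster_centers_)
print(same_labels, same_centers)
```

Pickle files can execute arbitrary code when loaded, so a saved model should only be loaded from a source you trust.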


### Agglomerative Clustering

In [9]:
from sklearn.cluster import AgglomerativeClustering

# iris dataset
clustering = AgglomerativeClustering()
iris_result = clustering.fit_predict(iris_features)

print(iris_result)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]


In [10]:
# tennis dataset
clustering = AgglomerativeClustering()
tennis_result = clustering.fit_predict(tennis_features)

print(tennis_result)

[0 0 0 0 1 1 1 0 1 1 0 0 0 0]
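
Both calls above use `AgglomerativeClustering()` with its defaults, and the default `n_clusters` is 2, which is why the iris result contains only the labels 0 and 1. The sketch below shows the effect of setting `n_clusters` explicitly. The data is a synthetic stand-in with three well-separated groups, not the iris CSV.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in with three well-separated groups (an assumption)
X, _ = make_blobs(n_samples=90, centers=[(0, 0), (10, 10), (-10, 10)],
                  cluster_std=0.5, random_state=0)

# Default constructor: n_clusters=2, so two of the groups are merged
default_labels = AgglomerativeClustering().fit_predict(X)

# Asking for three clusters recovers the three groups
three_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(sorted(set(default_labels)))
print(sorted(set(three_labels)))
```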


### DBSCAN

In [11]:
from sklearn.cluster import DBSCAN

# iris dataset
clustering = DBSCAN(eps=1, min_samples=4)
iris_result = clustering.fit_predict(iris_features)

print(iris_result)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]


In [12]:
# tennis dataset
clustering = DBSCAN(eps=4, min_samples=2)
tennis_result = clustering.fit_predict(tennis_features)

print(tennis_result)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0]
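
The single cluster in the tennis result follows from the geometry of one-hot data. Two rows that differ in one attribute differ in exactly two indicator columns, so each differing attribute adds 2 to the squared Euclidean distance. With four attributes, no pair of rows can be farther apart than sqrt(8), about 2.83, which is below `eps=4`. The sketch below checks this on the five tennis rows shown earlier; the alternative radius `eps=1.5` is a hypothetical value chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# The five play tennis rows shown earlier in this document
sample = pd.DataFrame({
    "outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "temp": ["Hot", "Hot", "Hot", "Mild", "Cool"],
    "humidity": ["High", "High", "High", "High", "Normal"],
    "wind": ["Weak", "Strong", "Weak", "Weak", "Weak"],
})
encoded = pd.get_dummies(sample, dtype=int).to_numpy()

# Largest pairwise distance versus the theoretical bound sqrt(2 * 4)
distances = pairwise_distances(encoded)
print(distances.max(), np.sqrt(8))

# eps=4 exceeds the bound: every row neighbours every other row
all_in_one = DBSCAN(eps=4, min_samples=2).fit_predict(encoded)
print(all_in_one)

# A smaller radius (hypothetical) only links rows that differ in one attribute;
# rows without such a neighbour are labelled -1 (noise)
tighter = DBSCAN(eps=1.5, min_samples=2).fit_predict(encoded)
print(tighter)
```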


### Gaussian Mixture

In [59]:
from sklearn.mixture import GaussianMixture

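# iris dataset
# n_components defaults to 1, so every sample is assigned to component 0
model = GaussianMixture()
model.fit(iris_features)

# save the model to disk (the with-block closes the file handle)
filename = 'gmm_iris_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)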

iris_result = loaded_model.predict(iris_features)

print(iris_result)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]


In [60]:
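# tennis dataset
# n_components defaults to 1, so every sample is assigned to component 0
model = GaussianMixture()
model.fit(tennis_features)

# save the model to disk (the with-block closes the file handle)
filename = 'gmm_tennis_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)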

tennis_result = loaded_model.predict(tennis_features)

print(tennis_result)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0]
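
Both results above are all zeros because `GaussianMixture()` defaults to `n_components=1`, so there is only one component to assign samples to. The sketch below contrasts the default with an explicit `n_components=3`. It is an illustration on synthetic, well-separated `make_blobs` data, not a rerun of the notebook's experiments.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in with three well-separated groups (an assumption)
X, _ = make_blobs(n_samples=150, centers=[(0, 0), (10, 10), (-10, 10)],
                  cluster_std=0.5, random_state=0)

# Default constructor: a single Gaussian component
single = GaussianMixture().fit(X)

# Three components, with a fixed seed for reproducibility
mixture = GaussianMixture(n_components=3, random_state=0).fit(X)

single_labels = single.predict(X)
mixture_labels = mixture.predict(X)
print(sorted(set(single_labels)), sorted(set(mixture_labels)))
```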


### K-Medoids

In [61]:
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric

# iris dataset
list_features = np.array(iris_features)

metric = distance_metric(type_metric.MINKOWSKI, degree = 4)
# metric = distance_metric(type_metric.CHEBYSHEV)

# set initial medoids (row indices into the data)
initial_medoids = [1, 6]

# create instance of K-Medoids algorithm
kmedoids_instance = kmedoids(list_features, initial_medoids, metric = metric)

# run cluster analysis and obtain results
kmedoids_instance.process()

clusters = kmedoids_instance.get_clusters()

# show allocated clusters
print(clusters)


In [47]:
# tennis dataset
list_features = np.array(tennis_features)

metric = distance_metric(type_metric.MINKOWSKI, degree = 4)
# metric = distance_metric(type_metric.CHEBYSHEV)

# set initial medoids (row indices into the data)
initial_medoids = [1, 6]

# create instance of K-Medoids algorithm
kmedoids_instance = kmedoids(list_features, initial_medoids, metric = metric)

# run cluster analysis and obtain results
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()

# show allocated clusters
print(clusters)

[[7, 0, 1, 2, 3, 10, 11, 13], [4, 5, 6, 8, 9, 12]]
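
The K-Medoids cells use `type_metric.MINKOWSKI` with `degree = 4`. Under the standard definition, the Minkowski distance of order p is the p-th root of the sum of the absolute coordinate differences raised to the power p. The sketch below implements that formula in plain NumPy; `minkowski` is a hypothetical helper written for illustration and is not pyclustering's own code.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

# One-hot rows 0 and 1 of the tennis features (they differ only in wind)
row0 = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]
row1 = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

d4 = minkowski(row0, row1, p=4)   # two coordinates differ by 1: 2 ** (1/4)
d2 = minkowski(row0, row1, p=2)   # Euclidean special case: 2 ** (1/2)
print(d4, d2)
```

On 0/1 data every coordinate difference is 0 or 1, so a higher order only shrinks the distance between rows that differ in several attributes; the ordering of pairs by distance stays the same.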


### MST

In [17]:
from mst_clustering import MSTClustering

# iris dataset
model = MSTClustering(cutoff_scale=0.7, approximate=False)
iris_result = model.fit_predict(iris_features)

print(iris_result)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]


In [18]:
# tennis dataset
# 1.414214 ~ sqrt(2), the Euclidean distance between two one-hot rows
# that differ in exactly one attribute
model = MSTClustering(cutoff_scale=1.414214, approximate=False)
tennis_result = model.fit_predict(tennis_features)

print(tennis_result)

[0 0 0 0 0 0 0 0 0 0 1 0 0 0]


### Grid Clustering

In [46]:
from pyclustering.cluster.bang import bang

# Read the n-dimensional data (the one-hot encoded tennis features).
data = np.array(tennis_features)

# Prepare algorithm's parameters.
levels = 2

# Create instance of BANG algorithm.
bang_instance = bang(data, levels)
bang_instance.process()

# Obtain clustering results.
clusters = bang_instance.get_clusters()
print(clusters)

[[0, 13]]


