# K-Means Demo


## K-Means algorithm

K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that an observation in a given group is more similar to another observation in the same group than to another observation in a different group.

![kmeans](https://media0.giphy.com/media/12vVAGkaqHUqCQ/giphy.gif?cid=790b7611178aaedddb5b58de2ef94d55dc6c3feecd2d02f2&rid=giphy.gif)

A nice YouTube video explanation: https://www.youtube.com/watch?v=4b5d3muPQmA
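
The animation above alternates two steps: assign every point to its nearest center, then move every center to the mean of its points. A minimal NumPy sketch of that loop follows. The function name `lloyd_kmeans`, the toy data, and the choice to keep an empty cluster's old center are assumptions made for illustration; this is not the implementation used by sklearn or h2o.

```python
import numpy as np


def lloyd_kmeans(X, init_centers, max_iter=100):
    """Plain K-means (Lloyd's algorithm) starting from the given centers."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)  # copy, the caller's array is untouched
    labels = np.full(len(X), -1)
    history = []  # within-cluster sum of squares at each assignment step
    for _ in range(max_iter):
        # assignment step: squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        history.append(float(d2[np.arange(len(X)), new_labels].sum()))
        if np.array_equal(new_labels, labels):
            break  # no point changed its cluster
        labels = new_labels
        # update step: move each center to the mean of its points
        for k in range(len(centers)):
            if np.any(labels == k):  # assumption: an empty cluster keeps its old center
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels, history


# toy data: two well-separated blobs (synthetic, not the Iris data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
centers, labels, history = lloyd_kmeans(X, init_centers=X[[0, 20]])
print(labels)
print(history)
```

The recorded `history` should never increase from one iteration to the next, which is the same pattern visible in the `within_cluster_sum_of_squares` column of the H2O scoring history.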

## Data - Iris dataset

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. 

(Source: https://en.wikipedia.org/wiki/Iris_flower_data_set)

![iris](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png)

In [1]:
import sys
sys.path.append("/home/mori/Documents/h2o/env/h2o-env/lib/python3.7/site-packages")
sys.path.append("/home/mori/Documents/h2o/code/h2o-3/h2o-py/build")

import plotly.express as px
iris = px.data.iris()

# look into data
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,3
146,6.3,2.5,5.0,1.9,virginica,3
147,6.5,3.0,5.2,2.0,virginica,3
148,6.2,3.4,5.4,2.3,virginica,3


In [40]:
iris.iloc[41]

sepal_length       4.5
sepal_width        2.3
petal_length       1.3
petal_width        0.3
species         setosa
species_id           1
Name: 41, dtype: object

In [41]:
iris.iloc[133]

sepal_length          6.3
sepal_width           2.8
petal_length          5.1
petal_width           1.5
species         virginica
species_id              3
Name: 133, dtype: object

In [21]:
import pandas as pd
import numpy as np
# reduced iris dataset
# 0, 5, 10, 18, 20 - setosa
# 51, 56, 65, 74, 87 - versicolor
# 105, 117, 118, 122, 131 - virginica
iris = iris.loc[[0, 5, 10, 18, 20, 51, 56, 65, 74, 87, 105, 117, 118, 122, 131, 2, 52, 102]]

# add centroids
iris.loc[2,"species"] = "centroid"
iris.loc[52,"species"] = "centroid"
iris.loc[102,"species"] = "centroid"
iris.loc[2,"species_id"] = 4
iris.loc[52,"species_id"] = 4
iris.loc[102,"species_id"] = 4
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,centroid,4
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,3
146,6.3,2.5,5.0,1.9,virginica,3
147,6.5,3.0,5.2,2.0,virginica,3
148,6.2,3.4,5.4,2.3,virginica,3


In [28]:
import pandas as pd
import numpy as np

# add centroids
iris.loc[147,"species"] = "centroid"
iris.loc[148,"species"] = "centroid"
iris.loc[149,"species"] = "centroid"
iris.loc[147,"species_id"] = 4
iris.loc[148,"species_id"] = 4
iris.loc[149,"species_id"] = 4
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,3
146,6.3,2.5,5.0,1.9,virginica,3
147,6.5,3.0,5.2,2.0,centroid,4
148,6.2,3.4,5.4,2.3,centroid,4


In [29]:
# plot the data in 3D
# you can rotate the graph with the mouse

# plot all data
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', title="Iris data")
fig.show()

# plot all data labeled by species
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', color='species', title="Iris data labeled by species")
fig.show()


## scikit-learn

Open-source machine learning library for Python

Installation: pip install scikit-learn

Implemented algorithms:
- Support Vector Machines
- Nearest Neighbours
- Random Forest
- Epsilon-Support Vector Regression
- Ridge regression
- K-means
- Spectral Clustering
- Principal Component Analysis
- Feature selection
- Grid Search
- Cross validation
- others ...

### SKlearn K-means

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html




In [31]:
# Import sklearn library
import sklearn

# Import only Kmeans algorithm
from sklearn.cluster import KMeans
from h2o.estimators.kmeans import H2OKMeansEstimator

# run sklearn Kmeans
#sci_km = KMeans(n_clusters=3, init=np.asarray(points), n_init=1)
sci_km = KMeans(n_clusters=3, init=np.asarray(iris.iloc[147:150, 0:4]), n_init=1)
sci_km.fit(iris.iloc[0:148,0:4])

KMeans(algorithm='auto', copy_x=True,
    init=array([[6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]]),
    max_iter=300, n_clusters=3, n_init=1, n_jobs=None,
    precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

## H2O

Machine Learning Library - backend Java, REST API, Python/R Bindings - Open Source

Installation: pip install h2o

Implemented algorithms:

- Cox Proportional Hazards (CoxPH)
- Deep Learning (Neural Networks)
- Distributed Random Forest (DRF)
- Generalized Linear Model (GLM)
- Gradient Boosting Machine (GBM)
- Naïve Bayes Classifier
- Stacked Ensembles
- Support Vector Machine (SVM)
- XGBoost
- K-means
- Isolation Forest
- Generalized Low Rank Models (GLRM)
- Principal Component Analysis (PCA)
- Word2vec
- others



### H2O K-means

Documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html

In [2]:
# run h2o Kmeans

# Import h2o library
import h2o
from h2o.estimators import H2OKMeansEstimator

# init h2o cluster
h2o.init(strict_version_check=False)

# transform data to h2o frame structure
iris_h2o = h2o.H2OFrame(iris.iloc[0:148, 0:4])

# transform start cluster points to h2o frame structure
start_clusters = h2o.H2OFrame(iris.iloc[147:150, 0:4])

versionFromGradle='3.29.0',projectVersion='3.29.0.99999',branch='maurever_PUBDEV-6447_constrained_kmeans_improvement',lastCommitHash='6523aa259d948d1ccc186195b7c23479db4945dd',gitDescribe='jenkins-master-4911-1-g6523aa259d-dirty',compiledOn='2020-01-14 12:49:33',compiledBy='mori'
Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O cluster uptime:,4 hours 20 mins
H2O cluster timezone:,Europe/Berlin
H2O data parsing timezone:,UTC
H2O cluster version:,3.29.0.99999
H2O cluster version age:,1 day
H2O cluster name:,H2O_from_python_mori_r3i73x
H2O cluster total nodes:,1
H2O cluster free memory:,5.661 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [3]:
# run h2o Kmeans standardize true
h2o_km_co_t = H2OKMeansEstimator(k=3, user_points=start_clusters, standardize=True, cluster_size_constraints=[2, 5, 8])
h2o_km_co_t.train(x=list(range(4)),training_frame=iris_h2o)

# show details
h2o_km_co_t.show()

# run h2o Kmeans standardize false
h2o_km_co_f = H2OKMeansEstimator(k=3, user_points=start_clusters, standardize=False, cluster_size_constraints=[5, 5, 5])
h2o_km_co_f.train(x=list(range(4)),training_frame=iris_h2o)

# show details
h2o_km_co_f.show()

kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1579081201160_8


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,148.0,3.0,0.0,10.0,137.052343,588.0,450.947657




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 137.05234282392894
Total Sum of Square Error to Grand Mean: 587.9999999999992
Between Cluster Sum of Square Error: 450.94765717607027

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,54.0,45.218016
1,,2.0,44.0,44.123736
2,,3.0,50.0,47.710591



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-01-15 15:00:14,0.000 sec,0.0,,
1,,2020-01-15 15:00:14,0.167 sec,1.0,148.0,713.78292
2,,2020-01-15 15:00:14,0.269 sec,2.0,26.0,302.48785
3,,2020-01-15 15:00:14,0.435 sec,3.0,28.0,242.323034
4,,2020-01-15 15:00:14,0.605 sec,4.0,22.0,171.357006
5,,2020-01-15 15:00:14,0.743 sec,5.0,8.0,142.357957
6,,2020-01-15 15:00:15,0.823 sec,6.0,4.0,138.829644
7,,2020-01-15 15:00:15,0.951 sec,7.0,1.0,137.616588
8,,2020-01-15 15:00:15,1.047 sec,8.0,2.0,137.423798
9,,2020-01-15 15:00:15,1.140 sec,9.0,1.0,137.118768


kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1579081201160_9


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,148.0,3.0,0.0,7.0,77.515677,674.42223,596.906553




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 77.51567691626053
Total Sum of Square Error to Grand Mean: 674.4222297297295
Between Cluster Sum of Square Error: 596.9065528134689

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,61.0,39.113115
1,,2.0,37.0,23.162162
2,,3.0,50.0,15.2404



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-01-15 15:00:15,0.002 sec,0.0,,
1,,2020-01-15 15:00:15,0.123 sec,1.0,148.0,964.3
2,,2020-01-15 15:00:16,0.207 sec,2.0,37.0,271.109283
3,,2020-01-15 15:00:16,0.292 sec,3.0,26.0,147.128302
4,,2020-01-15 15:00:16,0.370 sec,4.0,12.0,90.625952
5,,2020-01-15 15:00:16,0.458 sec,5.0,7.0,80.218033
6,,2020-01-15 15:00:16,0.549 sec,6.0,3.0,77.876
7,,2020-01-15 15:00:16,0.636 sec,7.0,0.0,77.515677


In [4]:
# run h2o Kmeans standardize true
h2o_km_t = H2OKMeansEstimator(k=3, user_points=start_clusters, standardize=True)
h2o_km_t.train(x=list(range(4)),training_frame=iris_h2o)

# show details
h2o_km_t.show()

# run h2o Kmeans standardize false
h2o_km_f = H2OKMeansEstimator(k=3, user_points=start_clusters, standardize=False)
h2o_km_f.train(x=list(range(4)),training_frame=iris_h2o)

# show details
h2o_km_f.show()

kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1579081201160_10


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,148.0,3.0,0.0,10.0,137.052343,588.0,450.947657




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 136.8970176108704
Total Sum of Square Error to Grand Mean: 588.0000091055117
Between Cluster Sum of Square Error: 451.1029914946413

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,52.0,43.015356
1,,2.0,46.0,46.171071
2,,3.0,50.0,47.710591



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-01-15 15:00:21,0.002 sec,0.0,,
1,,2020-01-15 15:00:21,0.003 sec,1.0,148.0,713.78292
2,,2020-01-15 15:00:21,0.003 sec,2.0,26.0,302.48785
3,,2020-01-15 15:00:21,0.003 sec,3.0,28.0,242.323034
4,,2020-01-15 15:00:21,0.003 sec,4.0,22.0,171.357006
5,,2020-01-15 15:00:21,0.003 sec,5.0,8.0,142.357957
6,,2020-01-15 15:00:21,0.004 sec,6.0,4.0,138.829644
7,,2020-01-15 15:00:21,0.004 sec,7.0,1.0,137.616588
8,,2020-01-15 15:00:21,0.004 sec,8.0,2.0,137.423798
9,,2020-01-15 15:00:21,0.004 sec,9.0,1.0,137.118768


kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1579081201160_11


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,148.0,3.0,0.0,7.0,77.515677,674.42223,596.906553




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 77.51567517475033
Total Sum of Square Error to Grand Mean: 674.422191499639
Between Cluster Sum of Square Error: 596.9065163248887

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,61.0,39.113115
1,,2.0,37.0,23.16216
2,,3.0,50.0,15.2404



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-01-15 15:00:21,0.000 sec,0.0,,
1,,2020-01-15 15:00:21,0.001 sec,1.0,148.0,964.3
2,,2020-01-15 15:00:21,0.001 sec,2.0,36.0,270.890823
3,,2020-01-15 15:00:21,0.002 sec,3.0,26.0,147.128302
4,,2020-01-15 15:00:21,0.002 sec,4.0,12.0,90.625952
5,,2020-01-15 15:00:21,0.003 sec,5.0,7.0,80.218033
6,,2020-01-15 15:00:21,0.003 sec,6.0,3.0,77.876
7,,2020-01-15 15:00:21,0.003 sec,7.0,0.0,77.515677
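
Two numbers in the summaries above can be checked by hand. In every run the within-cluster and between-cluster sums of squares add up to the total (137.052343 + 450.947657 = 588.0 with `standardize=True`, 77.515677 + 596.906553 = 674.42223 with `standardize=False`). The total of 588.0 is also consistent with each of the 4 columns being scaled to unit sample variance, since (148 - 1) * 4 = 588. The sketch below reproduces both facts on synthetic data; the use of `ddof=1` is an assumption inferred from that arithmetic, not something taken from the H2O documentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 148, 4, 3
X = rng.normal(loc=5.0, scale=2.0, size=(n, p))  # synthetic stand-in, not the Iris values

# standardize each column to zero mean and unit sample variance (ddof=1 is an assumption)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# total sum of squares to the grand mean
grand_mean = Z.mean(axis=0)
tss = float(((Z - grand_mean) ** 2).sum())

# any partition splits the total into a within-cluster and a between-cluster part
labels = np.arange(n) % k  # arbitrary assignment, every cluster is non-empty
wss = 0.0
bss = 0.0
for c in range(k):
    members = Z[labels == c]
    center = members.mean(axis=0)
    wss += float(((members - center) ** 2).sum())
    bss += len(members) * float(((center - grand_mean) ** 2).sum())

print(tss, (n - 1) * p)
print(wss + bss)
```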


## Prediction

Once a model is trained, we can use it to 'predict'. You can score the data that were used for training, but you can also pass completely new data (in the same format) and get cluster assignments based on the trained model.

Machine learning libraries usually implement a 'predict' or 'score' method on the model. In both sklearn and h2o, the model has a 'predict' method.

The result of the predict method is a new frame (in sklearn, an array) that holds the cluster assignment for each row. You can then compare the resulting cluster assignment with the original labeled data.
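
For K-means, prediction amounts to finding the nearest trained center for each row. A minimal sketch under stated assumptions: `predict_nearest` is a hypothetical helper, the centers are made-up values rather than the ones from the trained models, and no standardization is applied (a model trained with `standardize=True` would scale the rows first).

```python
import numpy as np


def predict_nearest(centers, rows):
    """Return, for each row, the index of the closest center (squared Euclidean distance)."""
    centers = np.asarray(centers, dtype=float)
    rows = np.asarray(rows, dtype=float)
    d2 = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# hypothetical centers, columns: sepal_length, sepal_width, petal_length, petal_width
centers = np.array([[5.0, 3.4, 1.5, 0.2],
                    [5.9, 2.8, 4.3, 1.3],
                    [6.6, 3.0, 5.6, 2.0]])

# two rows taken from the Iris frame shown earlier (index 0 and index 133)
new_rows = np.array([[5.1, 3.5, 1.4, 0.2],
                     [6.3, 2.8, 5.1, 1.5]])

assignment = predict_nearest(centers, new_rows)
print(assignment)
```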

In [8]:
# predict sklearn
prediction_sci = sci_km.predict(iris.iloc[0:15, 0:4])
iris["sci_prediction"] = np.append(prediction_sci, np.array([3, 3, 3]))


# predict h2o
prediction_h2o_t = h2o_km_t.predict(iris_h2o)
iris["h2o_prediction_t"] = np.append(prediction_h2o_t.as_data_frame().to_numpy().reshape(15), np.array([3, 3, 3]))

prediction_h2o_f = h2o_km_f.predict(iris_h2o)
iris["h2o_prediction_f"] = np.append(prediction_h2o_f.as_data_frame().to_numpy().reshape(15), np.array([3, 3, 3]))

# predict h2o con
prediction_h2o_co_t = h2o_km_co_t.predict(iris_h2o)
iris["h2o_prediction_co_t"] = np.append(prediction_h2o_co_t.as_data_frame().to_numpy().reshape(15), np.array([3, 3, 3]))
iris

prediction_h2o_co_f = h2o_km_co_f.predict(iris_h2o)
iris["h2o_prediction_co_f"] = np.append(prediction_h2o_co_f.as_data_frame().to_numpy().reshape(15), np.array([3, 3, 3]))
iris


kmeans prediction progress: |█████████████████████████████████████████████| 100%
kmeans prediction progress: |█████████████████████████████████████████████| 100%
kmeans prediction progress: |█████████████████████████████████████████████| 100%
kmeans prediction progress: |█████████████████████████████████████████████| 100%


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id,sci_prediction,h2o_prediction_t,h2o_prediction_f,h2o_prediction_co_t,h2o_prediction_co_f
0,5.1,3.5,1.4,0.2,setosa,1,1,1,1,1,1
5,5.4,3.9,1.7,0.4,setosa,1,1,1,1,1,1
10,5.4,3.7,1.5,0.2,setosa,1,1,1,1,1,1
18,5.7,3.8,1.7,0.3,setosa,1,1,1,1,1,1
20,5.4,3.4,1.7,0.2,setosa,1,1,1,1,1,1
51,6.4,3.2,4.5,1.5,versicolor,2,2,2,2,2,2
56,6.3,3.3,4.7,1.6,versicolor,2,2,2,2,2,2
65,6.7,3.1,4.4,1.4,versicolor,2,2,2,2,2,2
74,6.4,2.9,4.3,1.3,versicolor,2,2,2,2,2,2
87,6.3,2.3,4.4,1.3,versicolor,2,2,2,2,2,2


In [12]:
# prepare data to compare
# iris id is from 1 to 3, prediction is from 0 to 2
iris['species_id'] = iris['species_id']-1
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id,sci_prediction,h2o_prediction_t,h2o_prediction_f,h2o_prediction_co_t,h2o_prediction_co_f
0,5.1,3.5,1.4,0.2,setosa,0,1,1,1,1,1
5,5.4,3.9,1.7,0.4,setosa,0,1,1,1,1,1
10,5.4,3.7,1.5,0.2,setosa,0,1,1,1,1,1
18,5.7,3.8,1.7,0.3,setosa,0,1,1,1,1,1
20,5.4,3.4,1.7,0.2,setosa,0,1,1,1,1,1
51,6.4,3.2,4.5,1.5,versicolor,1,2,2,2,2,2
56,6.3,3.3,4.7,1.6,versicolor,1,2,2,2,2,2
65,6.7,3.1,4.4,1.4,versicolor,1,2,2,2,2,2
74,6.4,2.9,4.3,1.3,versicolor,1,2,2,2,2,2
87,6.3,2.3,4.4,1.3,versicolor,1,2,2,2,2,2


In [36]:
iris.to_csv("/home/mori/Documents/h2o/code/test/constrained_kmeans/result_iris_prediction.csv")

In [177]:
# all predictions are the same for h2o and sklearn
all(iris['h2o_prediction_f'] == iris['sci_prediction'])

True

In [141]:
# h2o kmeans vs h2o constrained kmeans prediction
all(iris['h2o_prediction'] == iris['h2o_prediction_co'])

False

In [142]:
# at least one prediction differs from the original class
all(iris['species_id'] == iris['sci_prediction'])

False
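
A direct comparison like the one above can fail even when the clusters match the species, because cluster ids are arbitrary: cluster 1 may well contain the rows labeled 0. One way to compare them is to map every cluster to the label that is most common among its members first. The helper `align_clusters` and the small arrays below are hypothetical and only illustrate the idea.

```python
import numpy as np


def align_clusters(true_ids, cluster_ids):
    """Map each cluster id to the true label that is most common among its members."""
    true_ids = np.asarray(true_ids)
    cluster_ids = np.asarray(cluster_ids)
    mapping = {}
    for c in np.unique(cluster_ids):
        values, counts = np.unique(true_ids[cluster_ids == c], return_counts=True)
        mapping[int(c)] = int(values[counts.argmax()])
    aligned = np.array([mapping[int(c)] for c in cluster_ids])
    return aligned, mapping


# made-up labels: the clusters are right for 8 of 9 rows, but numbered differently
true_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

aligned, mapping = align_clusters(true_ids, cluster_ids)
accuracy = float((aligned == true_ids).mean())
print(mapping)
print(accuracy)
```

Majority voting can map two clusters to the same label when the clustering is poor, so the resulting accuracy is best read as a rough check rather than a strict metric.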

In [31]:
iris['error'] = iris['species_id'] - iris['h2o_prediction']

In [32]:
iris['error_co'] = iris['species_id'] - iris['h2o_prediction_co_n']

In [13]:
iris.loc[iris.h2o_prediction_t == 0, 'h2o_prediction_t'] = 4 
iris.loc[iris.h2o_prediction_t == 1, 'h2o_prediction_t'] = 0
iris.loc[iris.h2o_prediction_t == 2, 'h2o_prediction_t'] = 1
iris.loc[iris.h2o_prediction_t == 4, 'h2o_prediction_t'] = 2 
iris.loc[iris.h2o_prediction_t == 3, 'h2o_prediction_t'] = 3

iris.loc[iris.h2o_prediction_f == 0, 'h2o_prediction_f'] = 4 
iris.loc[iris.h2o_prediction_f == 1, 'h2o_prediction_f'] = 0
iris.loc[iris.h2o_prediction_f == 2, 'h2o_prediction_f'] = 1
iris.loc[iris.h2o_prediction_f == 4, 'h2o_prediction_f'] = 2 
iris.loc[iris.h2o_prediction_f == 3, 'h2o_prediction_f'] = 3

iris


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id,sci_prediction,h2o_prediction_t,h2o_prediction_f,h2o_prediction_co_t,h2o_prediction_co_f
0,5.1,3.5,1.4,0.2,setosa,0,1,0,0,1,1
5,5.4,3.9,1.7,0.4,setosa,0,1,0,0,1,1
10,5.4,3.7,1.5,0.2,setosa,0,1,0,0,1,1
18,5.7,3.8,1.7,0.3,setosa,0,1,0,0,1,1
20,5.4,3.4,1.7,0.2,setosa,0,1,0,0,1,1
51,6.4,3.2,4.5,1.5,versicolor,1,2,1,1,2,2
56,6.3,3.3,4.7,1.6,versicolor,1,2,1,1,2,2
65,6.7,3.1,4.4,1.4,versicolor,1,2,1,1,2,2
74,6.4,2.9,4.3,1.3,versicolor,1,2,1,1,2,2
87,6.3,2.3,4.4,1.3,versicolor,1,2,1,1,2,2


In [14]:
iris.loc[iris.h2o_prediction_co_t == 2, 'h2o_prediction_co_t'] = 4 
iris.loc[iris.h2o_prediction_co_t == 0, 'h2o_prediction_co_t'] = 2
iris.loc[iris.h2o_prediction_co_t == 4, 'h2o_prediction_co_t'] = 0 
iris.loc[iris.h2o_prediction_co_t == 3, 'h2o_prediction_co_t'] = 3

iris.loc[iris.h2o_prediction_co_f == 2, 'h2o_prediction_co_f'] = 4 
iris.loc[iris.h2o_prediction_co_f == 0, 'h2o_prediction_co_f'] = 2
iris.loc[iris.h2o_prediction_co_f == 4, 'h2o_prediction_co_f'] = 0 
iris.loc[iris.h2o_prediction_co_f == 3, 'h2o_prediction_co_f'] = 3

iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id,sci_prediction,h2o_prediction_t,h2o_prediction_f,h2o_prediction_co_t,h2o_prediction_co_f
0,5.1,3.5,1.4,0.2,setosa,0,1,0,0,1,1
5,5.4,3.9,1.7,0.4,setosa,0,1,0,0,1,1
10,5.4,3.7,1.5,0.2,setosa,0,1,0,0,1,1
18,5.7,3.8,1.7,0.3,setosa,0,1,0,0,1,1
20,5.4,3.4,1.7,0.2,setosa,0,1,0,0,1,1
51,6.4,3.2,4.5,1.5,versicolor,1,2,1,1,0,0
56,6.3,3.3,4.7,1.6,versicolor,1,2,1,1,0,0
65,6.7,3.1,4.4,1.4,versicolor,1,2,1,1,0,0
74,6.4,2.9,4.3,1.3,versicolor,1,2,1,1,0,0
87,6.3,2.3,4.4,1.3,versicolor,1,2,1,1,0,0
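
The two cells above renumber the cluster ids one value at a time, using 4 as a temporary placeholder so that already renumbered rows are not overwritten. The same renumbering can be sketched as a single lookup, where position `old` in the array holds the new id. The array names are hypothetical, and the sample ids are made up.

```python
import numpy as np

# read as lookup[old_id] -> new_id; id 3 marks the centroid rows and stays 3
lookup_t_f = np.array([2, 0, 1, 3])  # same effect as the cell for h2o_prediction_t / _f
lookup_co = np.array([2, 1, 0, 3])   # same effect as the cell for h2o_prediction_co_t / _co_f

old_ids = np.array([1, 1, 2, 2, 0, 0, 3])  # made-up cluster ids

new_t_f = lookup_t_f[old_ids]
new_co = lookup_co[old_ids]
print(new_t_f)
print(new_co)
```

Because all ids are translated at once, no placeholder value is needed and the order of the individual replacements no longer matters.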


In [15]:
# labeled result cluster assignment

iris.loc[iris.h2o_prediction_t == 0, 'h2o_prediction_t_label'] = 'setosa' 
iris.loc[iris.h2o_prediction_t == 1, 'h2o_prediction_t_label'] = 'versicolor'
iris.loc[iris.h2o_prediction_t == 2, 'h2o_prediction_t_label'] = 'virginica'
iris.loc[iris.h2o_prediction_t == 3, 'h2o_prediction_t_label'] = 'centroid'

iris.loc[iris.h2o_prediction_f == 0, 'h2o_prediction_f_label'] = 'setosa' 
iris.loc[iris.h2o_prediction_f == 1, 'h2o_prediction_f_label'] = 'versicolor'
iris.loc[iris.h2o_prediction_f == 2, 'h2o_prediction_f_label'] = 'virginica'
iris.loc[iris.h2o_prediction_f == 3, 'h2o_prediction_f_label'] = 'centroid'

iris.loc[iris.h2o_prediction_co_t == 0, 'h2o_prediction_co_t_label'] = 'setosa' 
iris.loc[iris.h2o_prediction_co_t == 1, 'h2o_prediction_co_t_label'] = 'versicolor'
iris.loc[iris.h2o_prediction_co_t == 2, 'h2o_prediction_co_t_label'] = 'virginica'
iris.loc[iris.h2o_prediction_co_t == 3, 'h2o_prediction_co_t_label'] = 'centroid'

iris.loc[iris.h2o_prediction_co_f == 0, 'h2o_prediction_co_f_label'] = 'setosa' 
iris.loc[iris.h2o_prediction_co_f == 1, 'h2o_prediction_co_f_label'] = 'versicolor'
iris.loc[iris.h2o_prediction_co_f == 2, 'h2o_prediction_co_f_label'] = 'virginica'
iris.loc[iris.h2o_prediction_co_f == 3, 'h2o_prediction_co_f_label'] = 'centroid'
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id,sci_prediction,h2o_prediction_t,h2o_prediction_f,h2o_prediction_co_t,h2o_prediction_co_f,h2o_prediction_t_label,h2o_prediction_f_label,h2o_prediction_co_t_label,h2o_prediction_co_f_label
0,5.1,3.5,1.4,0.2,setosa,0,1,0,0,1,1,setosa,setosa,versicolor,versicolor
5,5.4,3.9,1.7,0.4,setosa,0,1,0,0,1,1,setosa,setosa,versicolor,versicolor
10,5.4,3.7,1.5,0.2,setosa,0,1,0,0,1,1,setosa,setosa,versicolor,versicolor
18,5.7,3.8,1.7,0.3,setosa,0,1,0,0,1,1,setosa,setosa,versicolor,versicolor
20,5.4,3.4,1.7,0.2,setosa,0,1,0,0,1,1,setosa,setosa,versicolor,versicolor
51,6.4,3.2,4.5,1.5,versicolor,1,2,1,1,0,0,versicolor,versicolor,setosa,setosa
56,6.3,3.3,4.7,1.6,versicolor,1,2,1,1,0,0,versicolor,versicolor,setosa,setosa
65,6.7,3.1,4.4,1.4,versicolor,1,2,1,1,0,0,versicolor,versicolor,setosa,setosa
74,6.4,2.9,4.3,1.3,versicolor,1,2,1,1,0,0,versicolor,versicolor,setosa,setosa
87,6.3,2.3,4.4,1.3,versicolor,1,2,1,1,0,0,versicolor,versicolor,setosa,setosa


In [14]:
# plot original data vs result cluster assignment

fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', color='species', title='Original data')
fig.show()

In [16]:
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', color='h2o_prediction_t_label', title='Result cluster assignment')
fig.show()

fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', color='h2o_prediction_co_t_label', title='Result cluster assignment')
fig.show()

In [17]:
fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', color='h2o_prediction_f_label', title='Result cluster assignment')
fig.show()

fig = px.scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_width', color='h2o_prediction_co_f_label', title='Result cluster assignment')
fig.show()

In [18]:
centers = pd.concat([pd.DataFrame(h2o_km_t.centers()), pd.DataFrame(h2o_km_f.centers()), pd.DataFrame(h2o_km_co_t.centers()), pd.DataFrame(h2o_km_co_f.centers())])
centers['type'] = ['LKT', 'LKT','LKT','LKF', 'LKF','LKF','CKT', 'CKT','CKT','CKF','CKF','CKF']


centers.columns = ["sepal_length","sepal_width","petal_length","petal_width","species"]
centers

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,7.8,3.8,6.55,2.1,LKT
1,5.4,3.66,1.6,0.26,LKT
2,6.8875,2.9,5.3125,1.6875,LKT
0,7.72,3.2,6.66,2.12,LKF
1,5.4,3.66,1.6,0.26,LKF
2,6.42,2.96,4.46,1.42,LKF
0,7.8,3.8,6.55,2.1,CKT
1,5.4,3.66,1.6,0.26,CKT
2,6.8875,2.9,5.3125,1.6875,CKT
0,7.72,3.2,6.66,2.12,CKF


In [104]:
fig = px.scatter_3d(centers, x='sepal_length', y='sepal_width', z='petal_width', color='species', title='Result centers')
fig.show()

In [105]:
centers_iris = pd.concat([iris.iloc[:,0:5], centers], ignore_index=True)
centers_iris['size'] = 3
centers_iris.iloc[15:19, 5] = 2
centers_iris.iloc[18:30, 5] = 1
centers_iris


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,size
0,5.1,3.5,1.4,0.2,setosa,3
1,5.4,3.9,1.7,0.4,setosa,3
2,5.4,3.7,1.5,0.2,setosa,3
3,5.7,3.8,1.7,0.3,setosa,3
4,5.4,3.4,1.7,0.2,setosa,3
5,6.4,3.2,4.5,1.5,versicolor,3
6,6.3,3.3,4.7,1.6,versicolor,3
7,6.7,3.1,4.4,1.4,versicolor,3
8,6.4,2.9,4.3,1.3,versicolor,3
9,6.3,2.3,4.4,1.3,versicolor,3


In [106]:
fig = px.scatter_3d(centers_iris, x='sepal_length', y='sepal_width', z='petal_width', color='species', title='Result cluster assignment', size='size')
fig.show()

In [22]:
#
# Author: Stanislaw Adaszewski, 2015
#

import networkx as nx
import numpy as np
import time


def constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9, points=None):
    data = np.array(data)
    
    min_ = np.min(data, axis = 0)
    max_ = np.max(data, axis = 0)
    
    if points is None:
        C = min_ + np.random.random((len(demand), data.shape[1])) * (max_ - min_)
        print(C)
    else:
        C = np.array(points, dtype=float)  # copy, so the caller's array is not modified in place
    M = np.array([-1] * len(data), dtype=int)  # np.int was removed in NumPy 1.24

    itercnt = 0
    while True:
        itercnt += 1
        print("Iteration:", itercnt)
        # memberships
        g = nx.DiGraph()
        g.add_nodes_from(range(0, data.shape[0]), demand=-1) # points
        for i in range(0, len(C)):
            g.add_node(len(data) + i, demand=demand[i])

        # Calculating cost...
        cost = np.array([np.linalg.norm(np.tile(data.T, len(C)).T - np.tile(C, len(data)).reshape(len(C) * len(data), C.shape[1]), axis=1)])
        # Preparing data_to_C_edges...
        data_to_C_edges = np.concatenate((np.tile([range(0, data.shape[0])], len(C)).T, np.tile(np.array([range(data.shape[0], data.shape[0] + C.shape[0])]).T, len(data)).reshape(len(C) * len(data), 1), cost.T * fixedprec), axis=1).astype(np.uint64)
        # Adding to graph
        g.add_weighted_edges_from(data_to_C_edges)
        

        a = len(data) + len(C)
        g.add_node(a, demand=len(data)-np.sum(demand))
        C_to_a_edges = np.concatenate((np.array([range(len(data), len(data) + len(C))]).T, np.tile([[a]], len(C)).T), axis=1)
        g.add_edges_from(C_to_a_edges)
        

        # Calculating min cost flow...
        f = nx.min_cost_flow(g)
        print("calculate ", itercnt)
        
        # assign
        M_new = np.ones(len(data), dtype=int) * -1
        for i in range(len(data)):
            p = sorted(f[i].items(), key=lambda x: x[1])[-1][0]
            M_new[i] = p - len(data)
            
        # stop condition
        if np.all(M_new == M):
            # Stop
            return (C, M, f)
            
        M = M_new
        print(M.tolist().count(0), M.tolist().count(1), M.tolist().count(2))
        
        # compute new centers
        for i in range(len(C)):
            C[i, :] = np.mean(data[M==i, :], axis=0)
                
        if maxiter is not None and itercnt >= maxiter:
            # Max iterations reached
            return (C, M, f)
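
The function above solves the assignment step as a min-cost flow: every point supplies one unit, cluster `i` must absorb at least `demand[i]` units, and the edge costs are the point-to-center distances. To see what that step computes, here is a brute-force version for tiny inputs. It is a sketch that replaces the flow solver with exhaustive search, so it is not the method used above; `constrained_assign` and the toy points are hypothetical.

```python
import itertools

import numpy as np


def constrained_assign(data, centers, min_sizes):
    """Cheapest assignment in which cluster i receives at least min_sizes[i] points."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    min_sizes = np.asarray(min_sizes)
    # Euclidean distance from every point to every center
    cost = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    rows = np.arange(len(data))
    best, best_cost = None, np.inf
    # brute force: try every possible assignment (only feasible for tiny inputs)
    for candidate in itertools.product(range(len(centers)), repeat=len(data)):
        candidate = np.array(candidate)
        sizes = np.bincount(candidate, minlength=len(centers))
        if np.any(sizes < min_sizes):
            continue  # violates a minimum cluster size
        total = cost[rows, candidate].sum()
        if total < best_cost:
            best, best_cost = candidate, total
    return best, best_cost


# four points near the first center, one point on the second center
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [3.0, 0.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])

free, free_cost = constrained_assign(data, centers, [0, 0])
forced, forced_cost = constrained_assign(data, centers, [2, 2])
print(free, free_cost)
print(forced, forced_cost)
```

With a minimum size of 2 per cluster, the point at 0.9 is pushed to the second cluster because moving it adds the least extra distance. The search visits every one of the k^n assignments, which is why the flow formulation is needed for real data.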

In [19]:
# import time to measure elapsed time
from timeit import default_timer as timer
from datetime import timedelta
import time

start = timer()
end = timer()
print("Time:", timedelta(seconds=end-start))

Time: 0:00:00.000044


In [8]:
import plotly.express as px
iris = px.data.iris()

# look into data
iris
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1


In [15]:
data = iris.iloc[:,[0, 1, 2, 3]]
data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [36]:
# run h2o Kmeans standardize true
start_clusters=h2o.H2OFrame([[4.07405796, 3.8763009,  2.06770276, 1.17116832],
 [5.61650778, 3.50729223, 2.99255861, 0.14215857],
 [5.72002172, 3.92998087, 5.51366063, 0.10292584]])
h2o_km_co_t = H2OKMeansEstimator(k=3, user_points=start_clusters, standardize=True, cluster_size_constraints=[2, 5, 8])
h2o_km_co_t.train(x=list(range(4)),training_frame=iris_h2o)
h2o_km_co_t.show()

Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1579081201160_13


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,148.0,3.0,0.0,7.0,136.750269,588.0,451.249731




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 136.75026938898384
Total Sum of Square Error to Grand Mean: 587.9999999999992
Between Cluster Sum of Square Error: 451.24973061101537

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,50.0,47.710591
1,,2.0,52.0,42.945062
2,,3.0,46.0,46.094616



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-01-15 15:25:21,0.001 sec,0.0,,
1,,2020-01-15 15:25:21,0.105 sec,1.0,148.0,1049.975406
2,,2020-01-15 15:25:21,0.182 sec,2.0,73.0,225.119695
3,,2020-01-15 15:25:21,0.260 sec,3.0,4.0,138.544641
4,,2020-01-15 15:25:21,0.337 sec,4.0,2.0,137.60738
5,,2020-01-15 15:25:21,0.413 sec,5.0,2.0,137.153231
6,,2020-01-15 15:25:21,0.512 sec,6.0,2.0,136.897018
7,,2020-01-15 15:25:22,0.601 sec,7.0,0.0,136.750269


In [51]:
h2o_km_co_t = H2OKMeansEstimator(k=3, user_points=start_clusters, standardize=True)
h2o_km_co_t.train(x=list(range(4)),training_frame=iris_h2o)
h2o_km_co_t.show()

kmeans Model Build progress: |████████████████████████████████████████████| 100%
Model Details
H2OKMeansEstimator :  K-means
Model Key:  KMeans_model_python_1579081201160_14


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_rows,number_of_clusters,number_of_categorical_columns,number_of_iterations,within_cluster_sum_of_squares,total_sum_of_squares,between_cluster_sum_of_squares
0,,148.0,3.0,0.0,6.0,136.750269,588.0,451.249731




ModelMetricsClustering: kmeans
** Reported on train data. **

MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 136.75026928920195
Total Sum of Square Error to Grand Mean: 588.0000091055117
Between Cluster Sum of Square Error: 451.2497398163098

Centroid Statistics: 


Unnamed: 0,Unnamed: 1,centroid,size,within_cluster_sum_of_squares
0,,1.0,50.0,47.710591
1,,2.0,52.0,42.945062
2,,3.0,46.0,46.094616



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,iterations,number_of_reassigned_observations,within_cluster_sum_of_squares
0,,2020-01-15 16:18:09,0.002 sec,0.0,,
1,,2020-01-15 16:18:09,0.004 sec,1.0,148.0,1049.332806
2,,2020-01-15 16:18:09,0.005 sec,2.0,73.0,234.873967
3,,2020-01-15 16:18:09,0.007 sec,3.0,6.0,138.619064
4,,2020-01-15 16:18:09,0.007 sec,4.0,3.0,137.335711
5,,2020-01-15 16:18:09,0.008 sec,5.0,2.0,136.897018
6,,2020-01-15 16:18:09,0.010 sec,6.0,0.0,136.750269


In [62]:
h2o_km_co_f.end_time - h2o_km_co_f.end_time

0

In [83]:
t = time.time()
points = np.array([[4.07405796, 3.8763009,  2.06770276, 1.17116832],
 [5.61650778, 3.50729223, 2.99255861, 0.14215857],
 [5.72002172, 3.92998087, 5.51366063, 0.10292584]])
data_np = data.to_numpy()
(C, M, f) = constrained_kmeans(data_np, [2, 5, 8],fixedprec=1e12, points=points, maxiter=7)
print('Elapsed:', (timedelta(seconds=time.time() - t)), 's')
print('C:', C)
print('M:', M)
M.tolist().count(0), M.tolist().count(1), M.tolist().count(2)

Iteration: 1
i:18 p:6 q:150
i:33 p:11 q:150
i:61 p:20 q:151
i:73 p:24 q:151
i:94 p:31 q:151
i:118 p:39 q:151
i:133 p:44 q:151
i:170 p:56 q:152
i:191 p:63 q:152
i:200 p:66 q:152
i:221 p:73 q:152
i:257 p:85 q:152
i:275 p:91 q:152
i:287 p:95 q:152
i:311 p:103 q:152
i:383 p:127 q:152
i:404 p:134 q:152
i:451 p:151 q:153
i:16 p:5 q:151
i:31 p:10 q:151
i:55 p:18 q:151
i:79 p:26 q:151
i:97 p:32 q:151
i:139 p:46 q:151
i:145 p:48 q:151
i:172 p:57 q:151
i:193 p:64 q:151
i:208 p:69 q:151
i:238 p:79 q:151
i:265 p:88 q:151
i:280 p:93 q:151
i:295 p:98 q:151
i:319 p:106 q:151
i:358 p:119 q:151
i:379 p:126 q:151
i:400 p:133 q:151
i:415 p:138 q:151
i:448 p:149 q:151
i:450 p:150 q:153
i:21 p:7 q:150
i:57 p:19 q:150
i:69 p:23 q:150
i:87 p:29 q:150
i:126 p:42 q:150
i:129 p:43 q:150
i:160 p:53 q:151
i:178 p:59 q:151
i:202 p:67 q:151
i:223 p:74 q:151
i:244 p:81 q:151
i:277 p:92 q:151
i:286 p:95 q:151
i:311 p:103 q:152
i:340 p:113 q:151
i:364 p:121 q:151
i:382 p:127 q:151
i:412 p:137 q:151
i:427 p:142 q:151
...
i:332 p:110 q:152
i:362 p:120 q:152
i:371 p:123 q:152
i:398 p:132 q:152
i:419 p:139 q:152
i:0 p:0 q:150
i:12 p:4 q:150
i:36 p:12 q:150
i:63 p:21 q:150
i:81 p:27 q:150
i:105 p:35 q:150
i:120 p:40 q:150
i:144 p:48 q:150
i:167 p:55 q:152
i:200 p:66 q:152
i:221 p:73 q:152
i:236 p:78 q:152
i:254 p:84 q:152
i:290 p:96 q:152
i:308 p:102 q:152
i:326 p:108 q:152
i:338 p:112 q:152
i:377 p:125 q:152
i:383 p:127 q:152
i:401 p:133 q:152
i:437 p:145 q:152
i:6 p:2 q:150
i:18 p:6 q:150
i:57 p:19 q:150
i:75 p:25 q:150
i:87 p:29 q:150
i:102 p:34 q:150
i:138 p:46 q:150
i:154 p:51 q:151
i:185 p:61 q:152
i:203 p:67 q:152
i:224 p:74 q:152
i:248 p:82 q:152
i:272 p:90 q:152
i:287 p:95 q:152
i:317 p:105 q:152
i:341 p:113 q:152
i:344 p:114 q:152
i:380 p:126 q:152
i:404 p:134 q:152
i:425 p:141 q:152
i:440 p:146 q:152
i:3 p:1 q:150
i:24 p:8 q:150
i:66 p:22 q:150
i:69 p:23 q:150
i:90 p:30 q:150
i:111 p:37 q:150
i:141 p:47 q:150
i:172 p:57 q:151
i:188 p:62 q:152
i:215 p:71 q:152
i:242 p:80 q:152
...

In [87]:
M.tolist().count(0), M.tolist().count(1), M.tolist().count(2)

(50, 50, 50)