K-Prototype clustering in an excel.

K-Prototype Clustering algorithm in an excel.

Clustering is very important statistical technique which is used to cluster the data based on different methods. While dealing with real time data, we will have both continuous variables and categorical variables. If we have only continuous variables, then we can go for k-means clustering algorithm to cluster the data. If we have only categorical variables in your data, then we can go for k-modes clustering algorithm to cluster the data. If we have both the variables types in the data, then k-means and k-modes doesn't applicable. We should use different algorithm to cluster such type of data. We have a good technique to tackle with such type of data called 'k-prototype clustering'. K-prototype clustering is nothing but the combination of both k-means and k-modes clustering.

In the above formula, first part deals with continuous variables.

And the second part deals with categorical variables.

Here, is a weight for categorical attributes for cluster l.

In k-prototype clustering, we are calculating the distance based on different methods such as Euclidean distance, Manhattan distance, Mahalanobis distance, Chebechevs and minkowski distance.

Ex: Let us apply the above equation on the following data.

A	B	C	D
10	22	1	1
15	23	1	2
25	21	1	2
33	33	2	1
45	35	2	2
50	31	3	3
24	15	2	2
70	10	1	3

In the above data, A and B are continuous variables, C and D are categorical variables.

Step1:

Initially, we have to select 'k' number of records randomly from the data. Those selected records will be called as clusters.

						C1					C2
				A	B	C	D	C1_T	A	B	C	D	C2_T	New_Cls	C1_T	C2_T
A	B	C	D	10	22	1	1		70	10	1	3
10	22	1	1	0.0	0.0	0.0	0.0	0.00	3600.00	144.00	0.00	1.00	62.19	1	0.00	62.19
15	23	1	2	25.0	1.0	0.0	1.0	6.10	3025.00	169.00	0.00	1.00	57.52	1	6.10	57.52
25	21	1	2	225.0	1.0	0.0	1.0	16.03	2025.00	121.00	0.00	1.00	47.32	1	16.03	47.32
33	33	2	1	529.0	121.0	1.0	0.0	26.50	1369.00	529.00	1.00	1.00	45.57	1	26.50	45.57
45	35	2	2	1225.0	169.0	1.0	1.0	39.34	625.00	625.00	1.00	1.00	37.36	2	39.34	37.36
50	31	3	3	1600.0	81.0	1.0	1.0	43.00	400.00	441.00	1.00	0.00	30.00	2	43.00	30.00
24	15	2	2	196.0	49.0	1.0	1.0	17.65	2116.00	25.00	1.00	1.00	48.27	1	17.65	48.27
70	10	1	3	3600.0	144.0	0.0	1.0	62.19	0.00	0.00	0.00	0.00	0.00	2	62.19	0.00

In the above table, I selected first and last records as initial clusters. Base on those clusters, I calculated distance by using Euclidean and maximum distances.

Here A and B are continuous variables. So, we should use any one of above specified distance measure to calculate the distance.

C and D are categorical variables, So, we should calculate difference between the numbers. If they are same, we should flag them as 0 otherwise 1.

C1_T is the sum of all the distances in cluster 1(C1).

Similarly, we have to caluculate distances for Cluster 2.

Sno.	A	B	C	D
1	10	22	1	1
2	15	23	1	2
3	25	21	1	2
4	33	33	2	1
5	45	35	2	2
6	50	31	3	3
7	24	15	2	2
8	70	10	1	3
C1 updates	21.4	22.8	1	2
C2 Updates	55	25.33333	#N/A	3

Here 1,2,3,4 and 7th records are belonging to cluster 1 and 5,6 and 8th records are belonging to cluster 2. Base on above table, we have updated C1 values. But corresponding to C2 updates, we have #N/A under C column. Because, corresponding to cluster 2 records under column C are distinct. No value is repeated for more the one time. So, we should not update that value.

My updated clusters are

C1 updates	21.4	22.8	1	2
C2 Updates	55	25.33333	1	3

Iteration 2: Let us update the cluster and repeat the same process.

						C1					C2
				A	B	C	D	C1_T	A	B	C	D	C2_T	New_Cls	C1_T	C2_T
A	B	C	D	21.4	22.8	1	2		55	25.333	1	3
10	22	1	1	129.96	0.64	0	1	12.428	2025	11.111	0	1	46.1	1	12.42804	46.12329
15	23	1	2	40.96	0.04	0	1	7.4031	1600	5.4444	0	1	41.1	1	7.403124	41.068
25	21	1	2	12.96	3.24	0	1	5.0249	900	18.778	0	1	31.3	1	5.024922	31.31135
33	33	2	1	134.56	104.04	1	0	16.447	484	58.778	1	1	25.3	1	16.44668	25.29759
45	35	2	2	556.96	148.84	1	1	28.567	100	93.444	1	1	15.9	2	28.5669	15.90843
50	31	3	3	817.96	67.24	1	1	31.752	25	32.111	1	0	8.56	2	31.75231	8.557189
24	15	2	2	6.76	60.84	1	1	10.222	961	106.78	1	1	34.7	1	10.22192	34.67687
70	10	1	3	2362	163.84	0	1	51.257	225	235.11	0	0	21.5	2	51.25734	21.4502

Update cluster information is shown below.

A	B	C	D	Iter 1	Iter 2
10	22	1	1	1	1
15	23	1	2	1	1
25	21	1	2	1	1
33	33	2	1	1	1
45	35	2	2	2	2
50	31	3	3	2	2
24	15	2	2	1	1
70	10	1	3	2	2

In the above table, we can see that there is no difference in clusters between first iteration and second iteration.

So, we can stop the process and conclude that 1,2,3,4 and 7 records belongs to first cluster. Record numbers 5,6 and 8 belongs to second cluster.

This functionality is available in R. If you want to use this functionality install a package called "clustMixType". In this package, there is a function called "kproto" hich can provide the k-prototype functionality.

library(clustMixType)

model = kproto(data,k=2)

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
README.md		README.md
k-prototype clustering.xlsx		k-prototype clustering.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

K-Prototype clustering in an excel.

About

Uh oh!

Releases

Packages

A	B	C	D	Iter 1	Iter 2
10	22	1	1	1	1
15	23	1	2	1	1
25	21	1	2	1	1
33	33	2	1	1	1
45	35	2	2	2	2
50	31	3	3	2	2
24	15	2	2	1	1
70	10	1	3	2	2

A	B	C	D	Iter 1	Iter 2
10	22	1	1	1	1
15	23	1	2	1	1
25	21	1	2	1	1
33	33	2	1	1	1
45	35	2	2	2	2
50	31	3	3	2	2
24	15	2	2	1	1
70	10	1	3	2	2

krskalla/Cluster-algorithms

Folders and files

Latest commit

History

Repository files navigation

K-Prototype clustering in an excel.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages

A	B	C	D	Iter 1	Iter 2
10	22	1	1	1	1
15	23	1	2	1	1
25	21	1	2	1	1
33	33	2	1	1	1
45	35	2	2	2	2
50	31	3	3	2	2
24	15	2	2	1	1
70	10	1	3	2	2