# K-Prototype-Clustering

## Why K-Prototype Clustering?

K-means-based methods process large data sets efficiently, but they are largely limited to numeric data.

K-means optimizes a cost function defined on the Euclidean distance between data points and cluster means.

Because the cost function is minimized by computing means, these methods cannot be applied directly to categorical data.

This is where K-Prototype shines. When applied to purely numeric data, the algorithm is identical to k-means.

For categorical data, the algorithm uses a simple matching dissimilarity measure, replaces cluster means with modes, and uses a frequency-based method to update the modes during clustering so that the cost function is minimized.
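
To make the description above concrete, here is a minimal sketch of a mixed dissimilarity and a frequency-based mode update. It assumes the usual k-prototypes formulation, in which the numeric part is a squared Euclidean distance and the categorical part is a count of mismatches scaled by a weight (called `gamma` here). The function names `mixed_dissimilarity` and `column_modes` are hypothetical and are not part of the `kmodes` package, which has its own implementation.

```python
from collections import Counter

import numpy as np


def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on the numeric attributes plus
    gamma times the number of mismatched categorical attributes."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    # Simple matching: each differing category contributes 1
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


def column_modes(rows):
    """Most frequent value in each categorical column of a cluster."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]


# Numeric part: 1 + 4 = 5; categorical part: one mismatch weighted by 0.5
d = mixed_dissimilarity([1.0, 2.0], ["Basic", "Urban"],
                        [0.0, 0.0], ["Basic", "Rural"], gamma=0.5)

# Mode of each column over three illustrative cluster members
modes = column_modes([["Basic", "Urban"],
                      ["Basic", "Rural"],
                      ["Premium", "Urban"]])
```

With `gamma` set to 0 the categorical part vanishes and only the k-means distance remains, which matches the statement that the algorithm reduces to k-means on numeric data.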

## Installing

In [0]:
! pip install -q kmodes

## Applying to Data

In [0]:
import numpy as np
import pandas as pd
from kmodes.kprototypes import KPrototypes

In [0]:
marketing_df = pd.read_csv('https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/marketing_cva_f.csv')

In [4]:
marketing_df.head()

Unnamed: 0,Customer,State,CLV,Coverage,Income,loc_type,monthly_premium,months_last_claim,Months_Since_Policy_Inception,Total_Claim_Amount,Vehicle_Class,avg_vehicle_age
0,BU79786,Washington,2763.519279,Basic,56274,Suburban,69,32,5,384.811147,Two-Door Car,40.696695
1,AI49188,Nevada,12887.43165,Premium,48767,Suburban,108,18,38,566.472247,Two-Door Car,48.755298
2,HB64268,Washington,2813.692575,Basic,43836,Rural,73,12,44,138.130879,Four-Door Car,70.394474
3,OC83172,Oregon,8256.2978,Basic,62902,Rural,69,14,94,159.383042,Two-Door Car,53.460212
4,XZ87318,Oregon,5380.898636,Basic,55350,Suburban,67,0,13,321.6,Four-Door Car,32.811507


In [0]:
# Removing some unnecessary columns
marketing_df = marketing_df.drop(['Customer','Vehicle_Class','avg_vehicle_age','months_last_claim','Total_Claim_Amount'],axis=1)

In [8]:
# The remaining columns mix categorical (State, Coverage, loc_type) and numeric features
marketing_df.head()

Unnamed: 0,State,CLV,Coverage,Income,loc_type,monthly_premium,Months_Since_Policy_Inception
0,Washington,2763.519279,Basic,56274,Suburban,69,5
1,Nevada,12887.43165,Premium,48767,Suburban,108,38
2,Washington,2813.692575,Basic,43836,Rural,73,44
3,Oregon,8256.2978,Basic,62902,Rural,69,94
4,Oregon,5380.898636,Basic,55350,Suburban,67,13


In [0]:
# Creating a numpy array
mark_array = marketing_df.values

In [10]:
print(mark_array.shape)

(6817, 7)


In [0]:
# Cast the numeric columns (CLV, Income, monthly_premium,
# Months_Since_Policy_Inception) to float
for col in (1, 3, 5, 6):
    mark_array[:, col] = mark_array[:, col].astype(float)

In [17]:
# Fit K-Prototypes with 3 clusters; columns 0, 2 and 4
# (State, Coverage, loc_type) are treated as categorical
kproto = KPrototypes(n_clusters=3, verbose=1, max_iter=20)
clusters = kproto.fit_predict(mark_array, categorical=[0, 2, 4])

Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/20, moves: 1751, ncost: 791401986804.418
Run: 1, iteration: 2/20, moves: 335, ncost: 785619626647.1771
Run: 1, iteration: 3/20, moves: 41, ncost: 785541835808.0706
Run: 1, iteration: 4/20, moves: 2, ncost: 785541255349.355
Run: 1, iteration: 5/20, moves: 0, ncost: 785541255349.355
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 2, iteration: 1/20, moves: 2077, ncost: 860274609656.4653
Run: 2, iteration: 2/20, moves: 638, ncost: 813095270310.2484
Run: 2, iteration: 3/20, moves: 396, ncost: 794167173703.8223
Run: 2, iteration: 4/20, moves: 201, ncost: 788858059321.6635
Run: 2, iteration: 5/20, moves: 123, ncost: 786870228602.3741
Run: 2, iteration: 6/20, moves: 87, ncost: 785903097925.9777
Run: 2, iteration: 7/20, moves: 51, ncost: 785606912443.4153
Run: 2, iteration: 8/20, moves: 18, ncost: 7855
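
The column positions passed to `categorical` and used for the float casts are hard-coded in the cells above. As an alternative, they could be derived from the DataFrame dtypes. The sketch below assumes a small hypothetical frame `df` that mirrors the column layout of `marketing_df`; it is not the notebook's own code.

```python
import pandas as pd

# Hypothetical mini-frame with the same column order as marketing_df
df = pd.DataFrame({
    "State": ["Washington", "Nevada"],
    "CLV": [2763.519279, 12887.43165],
    "Coverage": ["Basic", "Premium"],
    "Income": [56274, 48767],
    "loc_type": ["Suburban", "Suburban"],
    "monthly_premium": [69, 108],
    "Months_Since_Policy_Inception": [5, 38],
})

# Columns pandas considers numeric; everything else is treated as categorical
numeric_cols = list(df.select_dtypes(include="number").columns)
categorical_idx = [i for i, col in enumerate(df.columns)
                   if col not in numeric_cols]

# Cast all numeric columns to float in one step
df = df.astype({col: float for col in numeric_cols})
```

The resulting `categorical_idx` could then be passed as `categorical=categorical_idx` to `fit_predict`, so the indices stay correct if columns are added or dropped.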

In [18]:
print(kproto.cluster_centroids_)

[array([[8.05507106e+03, 8.29810393e+04, 9.09618363e+01, 4.76133850e+01],
       [8.31971978e+03, 5.47364709e+04, 9.50943747e+01, 4.91492329e+01],
       [8.03516014e+03, 2.66601263e+04, 9.26249125e+01, 4.69804059e+01]]), array([['California', 'Basic', 'Suburban'],
       ['California', 'Basic', 'Suburban'],
       ['California', 'Basic', 'Suburban']], dtype='<U10')]
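
The numeric centroids printed above differ mainly in their second column (Income), while the categorical modes are the same for all three clusters. One plausible reason is that Income has by far the largest magnitude and therefore dominates the Euclidean part of the cost. A common remedy, which this notebook does not apply, is to standardize the numeric columns before clustering. A minimal NumPy sketch, using the numeric values of the first five rows shown earlier as sample data:

```python
import numpy as np

# Numeric columns of the first five rows:
# CLV, Income, monthly_premium, Months_Since_Policy_Inception
numeric = np.array([
    [2763.519279, 56274.0, 69.0, 5.0],
    [12887.43165, 48767.0, 108.0, 38.0],
    [2813.692575, 43836.0, 73.0, 44.0],
    [8256.2978, 62902.0, 69.0, 94.0],
    [5380.898636, 55350.0, 67.0, 13.0],
])

# Z-score each column so that every feature has mean 0 and unit variance
col_mean = numeric.mean(axis=0)
col_std = numeric.std(axis=0)
scaled = (numeric - col_mean) / col_std
```

Whether scaling gives more useful clusters depends on the goal; if segmenting by income is the intent, the unscaled result may already be what is wanted.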


In [19]:
print(clusters)

[1 1 1 ... 0 2 2]


In [0]:
# Convert the array of cluster labels to a plain list
cluster_l = list(clusters)

In [0]:
marketing_df['cluster'] = cluster_l

In [23]:
marketing_df.head()

Unnamed: 0,State,CLV,Coverage,Income,loc_type,monthly_premium,Months_Since_Policy_Inception,cluster
0,Washington,2763.519279,Basic,56274,Suburban,69,5,1
1,Nevada,12887.43165,Premium,48767,Suburban,108,38,1
2,Washington,2813.692575,Basic,43836,Rural,73,44,1
3,Oregon,8256.2978,Basic,62902,Rural,69,94,1
4,Oregon,5380.898636,Basic,55350,Suburban,67,13,1


In [24]:
marketing_df[marketing_df['cluster'] == 0].head(10)

Unnamed: 0,State,CLV,Coverage,Income,loc_type,monthly_premium,Months_Since_Policy_Inception,cluster
7,California,8798.797003,Premium,77026,Urban,110,82,0
8,Arizona,8819.018934,Basic,99845,Suburban,110,25,0
9,California,5384.431665,Basic,83689,Urban,70,10,0
19,Oregon,5802.065978,Basic,97541,Suburban,72,1,0
21,Arizona,12902.56014,Premium,86584,Suburban,111,54,0
22,Oregon,3235.360468,Extended,75690,Suburban,80,44,0
27,Arizona,5744.229745,Basic,68987,Urban,71,40,0
34,Washington,2443.665166,Basic,92834,Suburban,61,56,0
53,Oregon,22643.83478,Basic,93011,Rural,113,12,0
56,Oregon,4974.801539,Basic,75644,Suburban,65,68,0


In [25]:
marketing_df[marketing_df['cluster'] == 1].head(10)

Unnamed: 0,State,CLV,Coverage,Income,loc_type,monthly_premium,Months_Since_Policy_Inception,cluster
0,Washington,2763.519279,Basic,56274,Suburban,69,5,1
1,Nevada,12887.43165,Premium,48767,Suburban,108,38,1
2,Washington,2813.692575,Basic,43836,Rural,73,44,1
3,Oregon,8256.2978,Basic,62902,Rural,69,94,1
4,Oregon,5380.898636,Basic,55350,Suburban,67,13,1
13,Oregon,5710.333115,Basic,51148,Urban,72,1,1
14,California,8162.617053,Premium,66140,Suburban,101,21,1
15,Oregon,2872.051273,Basic,57749,Suburban,74,21,1
24,Nevada,18975.45611,Extended,65999,Urban,237,14,1
25,Washington,5018.885233,Basic,54500,Suburban,63,17,1


In [26]:
marketing_df[marketing_df['cluster'] == 2].head(10)

Unnamed: 0,State,CLV,Coverage,Income,loc_type,monthly_premium,Months_Since_Policy_Inception,cluster
5,Oregon,24127.50402,Basic,14072,Suburban,71,3,2
6,Oregon,7388.178085,Extended,28812,Urban,93,7,2
10,Oregon,7463.139377,Basic,24599,Rural,64,50,2
11,Nevada,2566.867823,Basic,25049,Suburban,67,7,2
12,California,3945.241604,Basic,28855,Suburban,101,59,2
16,Washington,3041.791561,Extended,13789,Suburban,79,49,2
17,Arizona,24127.50402,Basic,14072,Suburban,71,3,2
18,California,2392.10789,Basic,17870,Suburban,61,91,2
20,Washington,5346.916576,Extended,10511,Urban,139,64,2
23,Arizona,2454.58354,Basic,23158,Suburban,63,6,2
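
Instead of reading the first ten rows of each cluster, the clusters can be compared with a per-cluster summary. The sketch below is an illustration on a hypothetical six-row frame built from rows shown above (two per cluster); the output column names `n_customers`, `mean_income` and `top_state` are made up for this example.

```python
import pandas as pd

# Two rows per cluster, taken from the tables above
df = pd.DataFrame({
    "State": ["California", "Arizona", "Washington", "Nevada", "Oregon", "Oregon"],
    "Income": [77026, 99845, 56274, 48767, 14072, 28812],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Size, mean of a numeric column, and mode of a categorical column per cluster.
# On ties, mode() returns the values sorted, so iloc[0] takes the first one.
summary = df.groupby("cluster").agg(
    n_customers=("Income", "size"),
    mean_income=("Income", "mean"),
    top_state=("State", lambda s: s.mode().iloc[0]),
)
```

Applied to the full `marketing_df`, the same pattern would give one row per cluster and make differences between clusters easier to see than row samples do.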
