## K-Prototypes Clustering Algorithm Implementation

We use the K-Means clustering algorithm for numeric data and the K-Modes algorithm for categorical data. But what if the data is mixed, i.e. it contains both numeric and categorical columns? In that case we use K-Prototypes. It is designed to treat numeric columns the way K-Means does and categorical columns the way K-Modes does, combining both into a single clustering cost.
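To make the combination concrete, here is a minimal sketch of the mixed dissimilarity usually associated with K-Prototypes: squared Euclidean distance on the numeric features plus a weight `gamma` times the number of mismatched categorical features. The function name `mixed_dissimilarity`, the `gamma` value, and the sample record are illustrative assumptions, not code taken from the `kmodes` library.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma * categorical mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


# Hypothetical record and cluster prototype (two numeric, two categorical features)
record_num, record_cat = [0.5, -1.0], ["Basic", "Rural"]
proto_num, proto_cat = [0.0, 0.0], ["Basic", "Suburban"]

distance = mixed_dissimilarity(record_num, record_cat, proto_num, proto_cat, gamma=0.5)
print(distance)  # 0.25 + 1.0 numeric, plus 0.5 * 1 mismatch = 1.75
```

A larger `gamma` makes categorical mismatches count for more relative to the numeric distance.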

### Load Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from kmodes.kprototypes import KPrototypes

### Load Data 

In [2]:
marketing_data = pd.read_csv("marketing.csv",sep=",")

In [3]:
marketing_data.head()

| | Customer | State | CLV | Coverage | Income | loc_type | monthly_premium | months_last_claim | Months_Since_Policy_Inception | Total_Claim_Amount | Vehicle_Class | avg_vehicle_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BU79786 | Washington | 2763.519279 | Basic | 56274 | Suburban | 69 | 32 | 5 | 384.811147 | Two-Door Car | 40.696695 |
| 1 | AI49188 | Nevada | 12887.43165 | Premium | 48767 | Suburban | 108 | 18 | 38 | 566.472247 | Two-Door Car | 48.755298 |
| 2 | HB64268 | Washington | 2813.692575 | Basic | 43836 | Rural | 73 | 12 | 44 | 138.130879 | Four-Door Car | 70.394474 |
| 3 | OC83172 | Oregon | 8256.2978 | Basic | 62902 | Rural | 69 | 14 | 94 | 159.383042 | Two-Door Car | 53.460212 |
| 4 | XZ87318 | Oregon | 5380.898636 | Basic | 55350 | Suburban | 67 | 0 | 13 | 321.6 | Four-Door Car | 32.811507 |


As we can see, the data mixes numeric and categorical columns, so neither K-Means nor K-Modes alone is enough. We need a combination of the two, which is exactly what the K-Prototypes clustering algorithm provides.
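A quick way to confirm which columns are numeric and which are not is to split them by dtype. The snippet below is a small sketch on a toy frame (`df`, with a few values copied from the preview above); on the real data, identifier columns such as `Customer` would also land in the non-numeric list.

```python
import pandas as pd

# Toy frame with a few columns and values copied from the preview above
df = pd.DataFrame({
    "State": ["Washington", "Nevada"],
    "CLV": [2763.519279, 12887.43165],
    "Coverage": ["Basic", "Premium"],
    "Income": [56274, 48767],
})

# Split column names by dtype: numbers on one side, everything else on the other
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['CLV', 'Income']
print(categorical_cols)  # ['State', 'Coverage']
```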

### Data Cleaning 

Let's drop the 'Customer' column: it is only an identifier and is of no use for clustering.

In [5]:
marketing_data.drop('Customer',axis=1,inplace=True)

### Scale Numeric Columns

In [10]:
Numeric_Cols = ['CLV','Income','monthly_premium','months_last_claim','Months_Since_Policy_Inception','Total_Claim_Amount','avg_vehicle_age']

In [11]:
cat_data = marketing_data.drop(Numeric_Cols,axis=1)
cat_data.head()

| | State | Coverage | loc_type | Vehicle_Class |
|---|---|---|---|---|
| 0 | Washington | Basic | Suburban | Two-Door Car |
| 1 | Nevada | Premium | Suburban | Two-Door Car |
| 2 | Washington | Basic | Rural | Four-Door Car |
| 3 | Oregon | Basic | Rural | Two-Door Car |
| 4 | Oregon | Basic | Suburban | Four-Door Car |


In [14]:
numeric_data = marketing_data[Numeric_Cols]
numeric_data.head()

| | CLV | Income | monthly_premium | months_last_claim | Months_Since_Policy_Inception | Total_Claim_Amount | avg_vehicle_age |
|---|---|---|---|---|---|---|---|
| 0 | 2763.519279 | 56274 | 69 | 32 | 5 | 384.811147 | 40.696695 |
| 1 | 12887.43165 | 48767 | 108 | 18 | 38 | 566.472247 | 48.755298 |
| 2 | 2813.692575 | 43836 | 73 | 12 | 44 | 138.130879 | 70.394474 |
| 3 | 8256.2978 | 62902 | 69 | 14 | 94 | 159.383042 | 53.460212 |
| 4 | 5380.898636 | 55350 | 67 | 0 | 13 | 321.6 | 32.811507 |


In [15]:
# Standardize the numeric columns (zero mean, unit variance)

scaler = StandardScaler()

In [16]:
numeric_data_scaled = scaler.fit_transform(numeric_data)

In [19]:
numeric_data_scaled_df = pd.DataFrame(data=numeric_data_scaled,columns=Numeric_Cols)
numeric_data_scaled_df.head()

| | CLV | Income | monthly_premium | months_last_claim | Months_Since_Policy_Inception | Total_Claim_Amount | avg_vehicle_age |
|---|---|---|---|---|---|---|---|
| 0 | -0.772349 | 0.239359 | -0.692321 | 1.684521 | -1.535588 | 0.027591 | -0.950653 |
| 1 | 0.684632 | -0.069518 | 0.434436 | 0.293427 | -0.352508 | 0.74667 | -0.138215 |
| 2 | -0.765128 | -0.272405 | -0.576756 | -0.302756 | -0.137403 | -0.948856 | 2.043365 |
| 3 | 0.018143 | 0.512069 | -0.692321 | -0.104028 | 1.655142 | -0.864733 | 0.336117 |
| 4 | -0.395669 | 0.201341 | -0.750103 | -1.495122 | -1.248781 | -0.222621 | -1.745608 |
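For intuition, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that by hand with NumPy on the five `Income` values from the preview. Since the notebook scales the full column rather than five rows, these numbers are not expected to match the table above.

```python
import numpy as np

# First five Income values from the preview; the full column has many more rows
income = np.array([56274, 48767, 43836, 62902, 55350], dtype=float)

# z-score with the population standard deviation (ddof=0), as StandardScaler does
z = (income - income.mean()) / income.std(ddof=0)

print(z.round(3))
print(z.mean().round(6), z.std(ddof=0).round(6))
```

After scaling, every numeric column has mean 0 and standard deviation 1, so no single column (such as `Income`, with its large raw values) dominates the distance computation.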


In [20]:
# Now let's combine the categorical data with the scaled numeric data

In [21]:
final_data = pd.concat([cat_data,numeric_data_scaled_df],axis=1)
final_data.head()

| | State | Coverage | loc_type | Vehicle_Class | CLV | Income | monthly_premium | months_last_claim | Months_Since_Policy_Inception | Total_Claim_Amount | avg_vehicle_age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Washington | Basic | Suburban | Two-Door Car | -0.772349 | 0.239359 | -0.692321 | 1.684521 | -1.535588 | 0.027591 | -0.950653 |
| 1 | Nevada | Premium | Suburban | Two-Door Car | 0.684632 | -0.069518 | 0.434436 | 0.293427 | -0.352508 | 0.74667 | -0.138215 |
| 2 | Washington | Basic | Rural | Four-Door Car | -0.765128 | -0.272405 | -0.576756 | -0.302756 | -0.137403 | -0.948856 | 2.043365 |
| 3 | Oregon | Basic | Rural | Two-Door Car | 0.018143 | 0.512069 | -0.692321 | -0.104028 | 1.655142 | -0.864733 | 0.336117 |
| 4 | Oregon | Basic | Suburban | Four-Door Car | -0.395669 | 0.201341 | -0.750103 | -1.495122 | -1.248781 | -0.222621 | -1.745608 |


In [23]:
# Convert the dataframe to Numpy array before training the model

final_data_array = final_data.values
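Because the model receives a plain array, the categorical columns have to be identified by position. Instead of hard-coding the positions, they can be derived from the column names. The sketch below uses a toy frame (`toy_final`, with values copied from the preview above) that follows the same layout as `final_data`: categorical columns first, then scaled numeric columns.

```python
import pandas as pd

cat_cols = ["State", "Coverage", "loc_type", "Vehicle_Class"]

# Toy stand-in for final_data: categorical columns first, then numeric ones
toy_final = pd.DataFrame({
    "State": ["Washington", "Nevada"],
    "Coverage": ["Basic", "Premium"],
    "loc_type": ["Suburban", "Suburban"],
    "Vehicle_Class": ["Two-Door Car", "Two-Door Car"],
    "CLV": [-0.772349, 0.684632],
    "Income": [0.239359, -0.069518],
})

# Look up the position of each categorical column by name
categorical_idx = [toy_final.columns.get_loc(c) for c in cat_cols]
print(categorical_idx)  # [0, 1, 2, 3]
```

Deriving the indices this way keeps them correct even if the column order of the combined frame changes later.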

### Train the Model 

In [22]:
# Three clusters; verbose=2 prints the progress of every iteration of every run
kproto = KPrototypes(n_clusters=3,verbose=2)

In [27]:
# Columns 0-3 of the array are categorical: State, Coverage, loc_type, Vehicle_Class
kproto.fit_predict(final_data_array,categorical=[0,1,2,3])

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/100, moves: 1684, ncost: 43961.88599677503
Run: 1, iteration: 2/100, moves: 1044, ncost: 42727.25320856664
Run: 1, iteration: 3/100, moves: 491, ncost: 42513.888709012106
Run: 1, iteration: 4/100, moves: 213, ncost: 42474.6955836033
Run: 1, iteration: 5/100, moves: 138, ncost: 42455.958910534646
Run: 1, iteration: 6/100, moves: 156, ncost: 42402.820714246474
Run: 1, iteration: 7/100, moves: 105, ncost: 42379.19358381165
Run: 1, iteration: 8/100, moves: 58, ncost: 42372.9477994903
Run: 1, iteration: 9/100, moves: 19, ncost: 42372.36140413257
Run: 1, iteration: 10/100, moves: 8, ncost: 42372.19091052687
Run: 1, iteration: 11/100, moves: 5, ncost: 42372.116424505504
Run: 1, iteration: 12/100, moves: 2, ncost: 42372.10264999943
Run: 1, iteration: 13/100, moves: 1, ncost: 42372.1000756721
Run: 1, iteration: 14/100, moves: 0, ncost: 42372.1000756721
Init: initializing centroids
Init: initiali

Run: 8, iteration: 8/100, moves: 114, ncost: 42000.60539527798
Run: 8, iteration: 9/100, moves: 92, ncost: 41990.77458475859
Run: 8, iteration: 10/100, moves: 39, ncost: 41988.42789926502
Run: 8, iteration: 11/100, moves: 37, ncost: 41985.97249469417
Run: 8, iteration: 12/100, moves: 15, ncost: 41985.66987788474
Run: 8, iteration: 13/100, moves: 17, ncost: 41985.21467575553
Run: 8, iteration: 14/100, moves: 10, ncost: 41985.05512206572
Run: 8, iteration: 15/100, moves: 1, ncost: 41985.045903069724
Run: 8, iteration: 16/100, moves: 0, ncost: 41985.045903069724
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 9, iteration: 1/100, moves: 1572, ncost: 45250.2468638323
Run: 9, iteration: 2/100, moves: 1187, ncost: 43306.023793413246
Run: 9, iteration: 3/100, moves: 712, ncost: 42688.35717897862
Run: 9, iteration: 4/100, moves: 540, ncost: 42524.88917206295
Run: 9, iteration: 5/100, moves: 257, ncost: 42436.692081033456
Run: 9, iteration: 6/100, moves: 173

array([2, 2, 2, ..., 1, 2, 2], dtype=uint16)
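The returned array holds one cluster label per row. A quick sanity check is to count how many rows fall into each cluster. The sketch below uses a short hypothetical `labels` array in place of the real output, since the full array is elided above.

```python
import numpy as np

# Hypothetical labels standing in for the array returned by fit_predict
labels = np.array([2, 2, 2, 1, 2, 0, 1, 2], dtype=np.uint16)

# Count the rows assigned to each cluster id
cluster_ids, counts = np.unique(labels, return_counts=True)
sizes = {int(cid): int(n) for cid, n in zip(cluster_ids, counts)}
print(sizes)  # {0: 1, 1: 2, 2: 5}
```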

In [28]:
# Cluster centroids (numeric values are in standardized units)

kproto.cluster_centroids_

array([['1.4892479757271928', '-0.17866403872290276',
        '1.8528232498282782', '0.022514279683447786',
        '-0.0044803090434404995', '1.5534283318919457',
        '0.05841242762031066', 'California', 'Extended', 'Suburban',
        'SUV'],
       ['-0.14719282878528323', '0.964370037557572',
        '-0.24110631112290754', '-0.06802247735658976',
        '0.0794618380306117', '-0.5022677261053352',
        '-0.022226919467376642', 'California', 'Basic', 'Rural',
        'Four-Door Car'],
       ['-0.23807796696267414', '-0.7465400492202009',
        '-0.24873173020243483', '0.05027100686665932',
        '-0.06397908602880217', '0.03717920157933059',
        '0.004133005118793087', 'California', 'Basic', 'Suburban',
        'Four-Door Car']], dtype='<U32')
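The centroid array is returned as strings (`dtype='<U32'`), with the seven numeric values first and the four categorical values last. To make it easier to read, we can load it into a labeled DataFrame and cast the numeric part back to floats. The sketch below copies the centroid rows from the output above and assumes the numeric values follow the order of `Numeric_Cols`; the names `centroids` and `centroid_df` are illustrative.

```python
import pandas as pd

numeric_cols = ['CLV', 'Income', 'monthly_premium', 'months_last_claim',
                'Months_Since_Policy_Inception', 'Total_Claim_Amount', 'avg_vehicle_age']
cat_cols = ['State', 'Coverage', 'loc_type', 'Vehicle_Class']

# Centroid rows copied from the output above (numeric values first, then categorical)
centroids = [
    ['1.4892479757271928', '-0.17866403872290276', '1.8528232498282782',
     '0.022514279683447786', '-0.0044803090434404995', '1.5534283318919457',
     '0.05841242762031066', 'California', 'Extended', 'Suburban', 'SUV'],
    ['-0.14719282878528323', '0.964370037557572', '-0.24110631112290754',
     '-0.06802247735658976', '0.0794618380306117', '-0.5022677261053352',
     '-0.022226919467376642', 'California', 'Basic', 'Rural', 'Four-Door Car'],
    ['-0.23807796696267414', '-0.7465400492202009', '-0.24873173020243483',
     '0.05027100686665932', '-0.06397908602880217', '0.03717920157933059',
     '0.004133005118793087', 'California', 'Basic', 'Suburban', 'Four-Door Car'],
]

centroid_df = pd.DataFrame(centroids, columns=numeric_cols + cat_cols)
# Cast the numeric part from strings to floats (assumed column order, see above)
centroid_df = centroid_df.astype({col: float for col in numeric_cols})
print(centroid_df)
```

Read this way, cluster 0 stands out with high standardized `CLV`, `monthly_premium`, and `Total_Claim_Amount`, while clusters 1 and 2 differ mainly in `Income`.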

### Assign Cluster Labels to the Original DataFrame

In [31]:
marketing_data['cluster'] = kproto.labels_

In [32]:
marketing_data.head()

| | State | CLV | Coverage | Income | loc_type | monthly_premium | months_last_claim | Months_Since_Policy_Inception | Total_Claim_Amount | Vehicle_Class | avg_vehicle_age | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Washington | 2763.519279 | Basic | 56274 | Suburban | 69 | 32 | 5 | 384.811147 | Two-Door Car | 40.696695 | 2 |
| 1 | Nevada | 12887.43165 | Premium | 48767 | Suburban | 108 | 18 | 38 | 566.472247 | Two-Door Car | 48.755298 | 2 |
| 2 | Washington | 2813.692575 | Basic | 43836 | Rural | 73 | 12 | 44 | 138.130879 | Four-Door Car | 70.394474 | 2 |
| 3 | Oregon | 8256.2978 | Basic | 62902 | Rural | 69 | 14 | 94 | 159.383042 | Two-Door Car | 53.460212 | 1 |
| 4 | Oregon | 5380.898636 | Basic | 55350 | Suburban | 67 | 0 | 13 | 321.6 | Four-Door Car | 32.811507 | 2 |
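With the labels attached, a natural next step is to profile each cluster in the original units: means for numeric columns and the most frequent value for categorical ones. The sketch below does this on the five preview rows only (a subset of columns, copied from the table above), so the numbers are purely illustrative; the names `preview` and `profile` are assumptions.

```python
import pandas as pd

# Five rows copied from the preview above (subset of columns)
preview = pd.DataFrame({
    "Coverage": ["Basic", "Premium", "Basic", "Basic", "Basic"],
    "Income": [56274, 48767, 43836, 62902, 55350],
    "monthly_premium": [69, 108, 73, 69, 67],
    "cluster": [2, 2, 2, 1, 2],
})

# One row per cluster: size, numeric means, and the most common category
profile = preview.groupby("cluster").agg(
    size=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_premium=("monthly_premium", "mean"),
    top_coverage=("Coverage", lambda s: s.mode().iloc[0]),
)
print(profile)
```

The same pattern applied to the full `marketing_data` frame gives a compact, readable summary of what distinguishes each cluster.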
