# Customer Segmentation Using K-means

This dataset contains historical records for different types of customers. The task is to segment this historical 
data by customer type. Customer segmentation means partitioning a
customer base into groups whose members share similar characteristics.

In [2]:
import pandas as pd 
import numpy as np

In [3]:
df = pd.read_csv('CustSegmentation.csv') #reading the customer file
df

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,Address,DebtIncomeRatio
0,1,41,2,6,19,0.124,1.073,0.0,NBA001,6.3
1,2,47,1,26,100,4.582,8.218,0.0,NBA021,12.8
2,3,33,2,10,57,6.111,5.802,1.0,NBA013,20.9
3,4,29,2,4,19,0.681,0.516,0.0,NBA009,6.3
4,5,47,1,31,253,9.308,8.908,0.0,NBA008,7.2
...,...,...,...,...,...,...,...,...,...,...
845,846,27,1,5,26,0.548,1.220,,NBA007,6.8
846,847,28,2,7,34,0.359,2.021,0.0,NBA002,7.0
847,848,25,4,0,18,2.802,3.210,1.0,NBA001,33.4
848,849,32,1,12,28,0.116,0.696,0.0,NBA012,2.9


### Data Pre-processing and Selection

Dropping the <b> Address </b> column because it contains categorical data. The K-Means algorithm cannot be 
applied to categorical data directly, since the Euclidean distance function isn't really meaningful for discrete variables. So, let's drop this feature and run clustering.

In [6]:
df = df.drop('Address',axis=1)
df

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio
0,1,41,2,6,19,0.124,1.073,0.0,6.3
1,2,47,1,26,100,4.582,8.218,0.0,12.8
2,3,33,2,10,57,6.111,5.802,1.0,20.9
3,4,29,2,4,19,0.681,0.516,0.0,6.3
4,5,47,1,31,253,9.308,8.908,0.0,7.2
...,...,...,...,...,...,...,...,...,...
845,846,27,1,5,26,0.548,1.220,,6.8
846,847,28,2,7,34,0.359,2.021,0.0,7.0
847,848,25,4,0,18,2.802,3.210,1.0,33.4
848,849,32,1,12,28,0.116,0.696,0.0,2.9
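
To illustrate why a raw categorical column does not fit Euclidean distance, here is a minimal sketch using three made-up address codes (the `toy` frame is hypothetical, not the real data). Integer-coding the categories imposes an arbitrary ordering, so some pairs look "closer" than others, whereas one-hot encoding with `pd.get_dummies` makes every pair of distinct categories equidistant. One-hot encoding is shown only as a common alternative; this notebook simply drops the column.

```python
import numpy as np
import pandas as pd

# Hypothetical address codes, for illustration only
toy = pd.DataFrame({"Address": ["NBA001", "NBA002", "NBA009"]})

# Integer codes impose an arbitrary order: NBA001 -> 0, NBA002 -> 1, NBA009 -> 2
codes = toy["Address"].astype("category").cat.codes.to_numpy(dtype=float)
dist_codes_01 = abs(codes[0] - codes[1])
dist_codes_02 = abs(codes[0] - codes[2])

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(toy["Address"]).to_numpy(dtype=float)
dist_hot_01 = np.linalg.norm(one_hot[0] - one_hot[1])
dist_hot_02 = np.linalg.norm(one_hot[0] - one_hot[2])

print(dist_codes_01, dist_codes_02)  # distances differ under integer coding
print(dist_hot_01, dist_hot_02)      # distances are equal under one-hot coding
```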


Now, <b> Normalizing </b> the Dataset

In [7]:
from sklearn.preprocessing import StandardScaler

In [9]:
x = df.values[:,1:]
#getting all rows and all columns from no. 1 to 8, i.e. everything except the 'Customer Id' column
x = np.nan_to_num(x) 
#using np.nan_to_num to replace NaNs with zero and infinities with large finite numbers
x

array([[41.   ,  2.   ,  6.   , ...,  1.073,  0.   ,  6.3  ],
       [47.   ,  1.   , 26.   , ...,  8.218,  0.   , 12.8  ],
       [33.   ,  2.   , 10.   , ...,  5.802,  1.   , 20.9  ],
       ...,
       [25.   ,  4.   ,  0.   , ...,  3.21 ,  1.   , 33.4  ],
       [32.   ,  1.   , 12.   , ...,  0.696,  0.   ,  2.9  ],
       [52.   ,  1.   , 16.   , ...,  3.638,  0.   ,  8.6  ]])
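
A small sketch of what `np.nan_to_num` does to a column with gaps may help here. The tables above show blank `Defaulted` entries for some customers, so after this step those customers are treated as if they had not defaulted. The `defaulted` array below is a toy stand-in, not the real column.

```python
import numpy as np

# Toy column mimicking 'Defaulted', with one missing entry
defaulted = np.array([0.0, 1.0, np.nan, 0.0])

filled = np.nan_to_num(defaulted)  # NaN -> 0.0 by default
print(filled)

# Counting NaNs beforehand shows how many values were imputed
n_missing = int(np.isnan(defaulted).sum())
print(n_missing)
```

Whether zero is a sensible fill value depends on the column; it is worth counting the missing entries before overwriting them.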

In [10]:
#now normalizing the data
clus_data = StandardScaler().fit_transform(x)
clus_data

array([[ 0.74291541,  0.31212243, -0.37878978, ..., -0.59048916,
        -0.52379654, -0.57652509],
       [ 1.48949049, -0.76634938,  2.5737211 , ...,  1.51296181,
        -0.52379654,  0.39138677],
       [-0.25251804,  0.31212243,  0.2117124 , ...,  0.80170393,
         1.90913822,  1.59755385],
       ...,
       [-1.24795149,  2.46906604, -1.26454304, ...,  0.03863257,
         1.90913822,  3.45892281],
       [-0.37694723, -0.76634938,  0.50696349, ..., -0.70147601,
        -0.52379654, -1.08281745],
       [ 2.1116364 , -0.76634938,  1.09746566, ...,  0.16463355,
        -0.52379654, -0.2340332 ]])
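
As a sanity check on what `StandardScaler` computes, the sketch below compares it with the manual formula (subtract the column mean, divide by the column standard deviation). It uses only three rows of (Age, Income) copied from the table above, so the numbers will not match the full-dataset output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three rows of (Age, Income) copied from the table above
sample = np.array([[41.0, 19.0], [47.0, 100.0], [33.0, 57.0]])

scaled = StandardScaler().fit_transform(sample)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so a large-valued feature such as Income no longer dominates the distance computation.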

### Applying the K-Means Algorithm

In [11]:
from sklearn.cluster import KMeans

In [12]:
k_means = KMeans(init='k-means++',n_clusters = 3,n_init=12)

### K-Means parameters
- ‘k-means++’ : selects the initial cluster centers for k-means clustering in a smart way to speed up convergence.
- ‘n_clusters’ : the number of clusters to form, which is also the number of centroids to generate.
- ‘n_init’ : the number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of the n_init runs in terms of inertia.
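
The notebook fixes `n_clusters = 3` without showing how that value was chosen. One common heuristic is to compare the inertia for several values of k and look for an "elbow". The sketch below does this on synthetic data from `make_blobs`, used as a stand-in for `clus_data`; the range of k and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for clus_data: 300 points around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(init="k-means++", n_clusters=k, n_init=12, random_state=0)
    model.fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia keeps shrinking as k grows, so the useful signal is where the improvement levels off rather than the minimum itself.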

In [13]:
k_means.fit(clus_data)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=12, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

Now let's grab the labels for each point in the model using KMeans' .labels_ attribute and save them as k_means_labels

In [14]:
k_means_labels = k_means.labels_
k_means_labels

array([0, 2, 1, 0, 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0,
       1, 2, 2, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2, 1, 2, 0, 2,
       0, 2, 0, 0, 0, 0, 2, 2, 1, 0, 1, 1, 1, 0, 0, 0, 2, 0, 2, 2, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 2, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 1, 2, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 2, 1, 0,
       0, 0, 0, 2, 1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 1,
       0, 0, 1, 2, 1, 0, 0, 2, 1, 2, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 2,
       2, 0, 2, 0, 1, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0,
       1, 2, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 0, 0,
       0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 2, 0, 1,
       2, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 0, 0, 2, 0,
       2, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0,

In [16]:
'Total labels are ==> {}'.format(set(k_means_labels))

'Total labels are ==> {0, 1, 2}'
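
Besides listing the distinct labels, it can be useful to count how many points fall in each cluster. A minimal sketch with `np.unique`, applied here only to the first ten labels from the output above:

```python
import numpy as np

# First ten labels from the output above
labels = np.array([0, 2, 1, 0, 2, 2, 0, 0, 0, 2])

values, counts = np.unique(labels, return_counts=True)
sizes = {int(v): int(c) for v, c in zip(values, counts)}
print(sizes)
```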

### Insights

We assign the labels to each row in the dataframe.

In [17]:
df['clus_labels'] = k_means_labels 
#creating a new column of 'clus_labels', which consists of labels for each row

In [18]:
df.head()

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
0,1,41,2,6,19,0.124,1.073,0.0,6.3,0
1,2,47,1,26,100,4.582,8.218,0.0,12.8,2
2,3,33,2,10,57,6.111,5.802,1.0,20.9,1
3,4,29,2,4,19,0.681,0.516,0.0,6.3,0
4,5,47,1,31,253,9.308,8.908,0.0,7.2,2


We can easily check the centroid of each cluster, expressed in the original feature units, by averaging the features in each cluster.

In [19]:
df.groupby('clus_labels').mean()

clus_labels,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio
0,426.122905,33.817505,1.603352,7.625698,36.143389,0.853128,1.816855,0.0,7.964991
1,424.451807,31.891566,1.861446,3.963855,31.789157,1.576675,2.843355,0.993939,13.994578
2,424.408163,43.0,1.931973,17.197279,101.959184,4.220673,7.954483,0.162393,13.915646
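
The fitted model's `cluster_centers_` are expressed in standardized units, so they are not directly comparable to the per-cluster means above. The sketch below, on synthetic data, maps the centers back with `StandardScaler.inverse_transform` and checks that they agree with the group means. The `toy` frame and its column names `f1` and `f2` are hypothetical. In this notebook the match would not be exact for `Defaulted`, because missing values were replaced with 0 before clustering while `groupby(...).mean()` skips them.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with two features and three well-separated groups
rng = np.random.default_rng(0)
raw = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 10.0, 20.0)])
toy = pd.DataFrame(raw, columns=["f1", "f2"])

scaler = StandardScaler().fit(raw)
model = KMeans(n_clusters=3, n_init=12, random_state=0).fit(scaler.transform(raw))
toy["clus_labels"] = model.labels_

# Centroids live in standardized space; map them back to the original units
centers_original = scaler.inverse_transform(model.cluster_centers_)
group_means = toy.groupby("clus_labels").mean().to_numpy()

print(np.allclose(centers_original, group_means, atol=1e-6))
```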


### Examine the Clusters

## Cluster 0

In [20]:
c0 = df[df['clus_labels']==0]
c0

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
0,1,41,2,6,19,0.124,1.073,0.0,6.3,0
3,4,29,2,4,19,0.681,0.516,0.0,6.3,0
6,7,38,2,4,56,0.442,0.454,0.0,1.6,0
7,8,42,3,0,64,0.279,3.945,0.0,6.6,0
8,9,26,1,5,18,0.575,2.215,,15.5,0
...,...,...,...,...,...,...,...,...,...,...
843,844,32,2,8,45,0.982,0.683,0.0,3.7,0
844,845,41,1,7,43,0.694,1.198,0.0,4.4,0
845,846,27,1,5,26,0.548,1.220,,6.8,0
846,847,28,2,7,34,0.359,2.021,0.0,7.0,0


In [21]:
c0.shape

(537, 10)

In [22]:
c0.describe()

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
count,537.0,537.0,537.0,537.0,537.0,537.0,537.0,418.0,537.0,537.0
mean,426.122905,33.817505,1.603352,7.625698,36.143389,0.853128,1.816855,0.0,7.964991,0.0
std,250.158472,7.053912,0.873014,5.341293,17.499358,0.778158,1.312747,0.0,4.927747,0.0
min,1.0,20.0,1.0,0.0,13.0,0.012,0.046,0.0,0.1,0.0
25%,209.0,29.0,1.0,4.0,24.0,0.291,0.867,0.0,4.4,0.0
50%,412.0,34.0,1.0,7.0,32.0,0.606,1.481,0.0,7.0,0.0
75%,656.0,39.0,2.0,11.0,44.0,1.168,2.47,0.0,10.5,0.0
max,849.0,56.0,5.0,23.0,120.0,4.881,7.286,0.0,24.6,0.0


##### Here we can see that Cluster 0 consists of customers with the following attributes
- Age between 20 - 56
- Years Employed between 0 - 23
- Income between 13 - 120

## Cluster 1

In [23]:
c1 = df[df['clus_labels']==1]
c1

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
2,3,33,2,10,57,6.111,5.802,1.0,20.9,1
10,11,44,3,8,88,0.285,5.083,1.0,6.1,1
14,15,28,3,2,20,0.233,1.647,1.0,9.4,1
22,23,28,3,6,47,5.574,3.732,1.0,19.8,1
32,33,23,2,0,42,1.019,0.619,1.0,3.9,1
...,...,...,...,...,...,...,...,...,...,...
816,817,36,2,6,27,0.262,0.980,1.0,4.6,1
823,824,27,4,0,25,1.419,1.756,1.0,12.7,1
824,825,41,2,4,26,1.473,3.519,1.0,19.2,1
830,831,33,1,13,52,2.714,8.362,1.0,21.3,1


In [24]:
c1.shape

(166, 10)

In [26]:
c1.describe()

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
count,166.0,166.0,166.0,166.0,166.0,166.0,166.0,165.0,166.0,166.0
mean,424.451807,31.891566,1.861446,3.963855,31.789157,1.576675,2.843355,0.993939,13.994578,1.0
std,234.246091,8.031019,0.952869,3.807316,15.785229,1.3943,2.323803,0.07785,7.465137,0.0
min,3.0,20.0,1.0,0.0,14.0,0.073,0.161,0.0,0.9,1.0
25%,223.25,26.0,1.0,1.0,20.0,0.46225,1.255,1.0,8.4,1.0
50%,439.5,29.5,2.0,3.0,27.0,1.2095,2.3355,1.0,13.2,1.0
75%,607.5,36.75,2.0,6.0,40.0,2.142,3.77175,1.0,18.55,1.0
max,848.0,55.0,5.0,16.0,94.0,6.912,15.405,1.0,35.3,1.0


##### Here we can see that Cluster 1 consists of customers with the following attributes
- Age between 20 - 55
- Years Employed between 0 - 16
- Income between 14 - 94

## Cluster 2

In [32]:
c2 = df[df['clus_labels']==2]
c2

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
1,2,47,1,26,100,4.582,8.218,0.0,12.8,2
4,5,47,1,31,253,9.308,8.908,0.0,7.2,2
5,6,40,1,23,81,0.998,7.831,,10.9,2
9,10,47,3,23,115,0.653,3.947,0.0,4.0,2
18,19,44,1,18,61,2.806,3.782,,10.8,2
...,...,...,...,...,...,...,...,...,...,...
801,802,48,1,30,101,1.875,4.589,0.0,6.4,2
808,809,45,1,17,62,2.437,6.863,0.0,15.0,2
825,826,32,2,12,116,4.027,2.585,,5.7,2
826,827,48,1,13,50,6.114,9.286,1.0,30.8,2


In [33]:
c2.shape

(147, 10)

In [34]:
c2.describe()

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,clus_labels
count,147.0,147.0,147.0,147.0,147.0,147.0,147.0,117.0,147.0,147.0
mean,424.408163,43.0,1.931973,17.197279,101.959184,4.220673,7.954483,0.162393,13.915646,2.0
std,242.422273,6.31697,1.031423,6.609084,59.124188,3.590995,4.988961,0.370397,7.860493,0.0
min,2.0,26.0,1.0,1.0,30.0,0.288,1.003,0.0,2.0,2.0
25%,220.0,39.0,1.0,12.0,64.0,1.6395,4.691,0.0,7.7,2.0
50%,444.0,43.0,2.0,17.0,83.0,3.176,7.036,0.0,13.1,2.0
75%,639.0,47.0,3.0,22.0,119.0,5.3265,9.6765,0.0,17.85,2.0
max,850.0,56.0,5.0,33.0,446.0,20.561,35.197,1.0,41.3,2.0


##### Here we can see that Cluster 2 consists of customers with the following attributes
- Age between 26 - 56
- Years Employed between 1 - 33
- Income between 30 - 446
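
The per-cluster ranges listed for each cluster can also be produced in one step with a grouped aggregation instead of reading them off three separate `describe()` tables. A minimal sketch, using six rows copied from the tables above rather than the full dataframe:

```python
import pandas as pd

# A few rows copied from the tables above (Age, Years Employed, Income, clus_labels)
sample = pd.DataFrame(
    {
        "Age": [41, 29, 33, 44, 47, 47],
        "Years Employed": [6, 4, 10, 8, 26, 31],
        "Income": [19, 19, 57, 88, 100, 253],
        "clus_labels": [0, 0, 1, 1, 2, 2],
    }
)

# One row per cluster, with a (min, max) pair for each selected feature
ranges = sample.groupby("clus_labels")[["Age", "Years Employed", "Income"]].agg(["min", "max"])
print(ranges)
```

Since the min-max ranges of the three clusters overlap heavily, comparing means or quartiles per cluster is likely to be more informative than the ranges alone.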