# Clustering: K - Means

The dataset contains both information on marketing newsletters/e-mail campaigns (e-mail offers sent) and transaction level data from customers (which offer customers responded to and what they bought).

The dataset contains both information on marketing newsletters/e-mail campaigns


Offer: Is the offer ID  - the offer id which a company has given

Campain : the month in which the campaign was run

Varietal : Product variety. The company has ran an offer on certain veriety of products.

Minimum Qty: the minimum quantity of the product need to be purchased.

Discount: the amount of discount given to the customer

Origin: The location in which the campaign was run

Past peak: where the sales high or low in the past.

In [1]:
import pandas as pd

df_offers = pd.read_excel("WineKMC.xlsx", sheetname=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()

Unnamed: 0,offer_id,campaign,varietal,min_qty,discount,origin,past_peak
0,1,January,Malbec,72,56,France,False
1,2,January,Pinot Noir,72,17,France,False
2,3,February,Espumante,144,32,Oregon,True
3,4,February,Champagne,72,48,France,True
4,5,February,Cabernet Sauvignon,144,44,New Zealand,True


In [2]:
df_transactions = pd.read_excel("./WineKMC.xlsx", sheetname=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()

Unnamed: 0,customer_name,offer_id,n
0,Smith,2,1
1,Smith,24,1
2,Johnson,17,1
3,Johnson,24,1
4,Johnson,26,1


In [3]:
# join the offers and transactions table
df = pd.merge(df_offers, df_transactions)
# create a "pivot table" which will give us the number of times each customer responded to a given offer
matrix = df.pivot_table(index=['customer_name'], columns=['offer_id'], values='n')
# a little tidying up. fill NA values with 0 and make the index into a column
matrix = matrix.fillna(0).reset_index()
# save a list of the 0/1 columns. we'll use these a bit later
x_cols = matrix.columns[1:]

In [4]:
matrix.columns[2:]

Index([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
      dtype='object', name='offer_id')

In [5]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=5,)
# slice matrix so we only include the 0/1 indicator columns in the clustering
matrix['cluster'] = cluster.fit_predict(matrix[matrix.columns[2:]])
matrix.cluster.value_counts()

1    33
0    31
3    17
2    15
4     4
Name: cluster, dtype: int64

In [6]:
matrix

offer_id,customer_name,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,cluster
0,Adams,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1
1,Allen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
2,Anderson,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,Bailey,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
4,Baker,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3
5,Barnes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3
6,Bell,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2
7,Bennett,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
8,Brooks,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
9,Brown,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1


In [12]:
#cluster.fit_predict(df)

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in df.columns:
    if df[i].dtypes == 'object':
        df[i] = le.fit_transform(df[i])
clusters = cluster.fit_predict(df)        

In [14]:
model = cluster.fit(df)
model.cluster_centers_
model.labels_
model.n_clusters

5

In [15]:
df['clusters']=clusters

In [16]:
from sklearn.metrics import silhouette_score
ss = silhouette_score(df,clusters)
print('Silhouette Score: {}'.format(ss))
# The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.
# The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
# Selecting the number of clusters or K's with silhouette analysis 


# these metrics need labeled data
#metrics.adjusted_rand_score,
#metrics.v_measure_score,
#metrics.mutual_info_score,

Silhouette Score: 0.4120552014964031


In [70]:
data = pd.read_csv("train.csv")
data.isnull().sum()
value = data.Age.mean()
data.Age.fillna(value=value,inplace=True)
cdata = data[["Pclass","Sex","Age","SibSp","Parch","Ticket","Fare","Embarked","Survived"]]

In [71]:
def get_position(mylist,position):
    return mylist[position]
cdata['firstName'] = data.Name.str.split().apply(get_position,position = 0)
cdata['designation'] = data.Name.str.split().apply(get_position,position = 1)
cdata['lastname'] = data.Name.str.split().apply(get_position,position = 2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [76]:
cdata.dropna(inplace=True)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in cdata.columns:
    if cdata[i].dtypes == 'object':
        cdata[i] = le.fit_transform(cdata[i])

cdata.dropna(inplace=True)        
print(cdata.isnull().sum())
cdata.head()

Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Survived       0
firstName      0
designation    0
lastname       0
dtype: int64


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived,firstName,designation,lastname
0,3,1,22.0,1,0,522,7.25,2,0,73,17,339
1,1,0,38.0,1,0,595,71.2833,0,1,136,18,223
2,3,0,26.0,0,0,668,7.925,2,1,251,14,249
3,1,0,35.0,1,0,48,53.1,2,1,198,18,208
4,3,1,35.0,0,0,471,8.05,2,0,11,17,430


In [78]:
from sklearn.cluster import KMeans
cdata1 = cdata[["Pclass","Sex","Age","SibSp","Parch","Ticket","Fare","Embarked"]]
cluster = KMeans(n_clusters=2)
# slice matrix so we only include the 0/1 indicator columns in the clustering
cdata1['cluster'] = cluster.fit_predict(cdata1)
cdata1.cluster.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0    459
1    430
Name: cluster, dtype: int64

In [80]:
data.dropna(inplace=True)
from sklearn.metrics import homogeneity_completeness_v_measure
homogeneity_completeness_v_measure(cdata1.cluster,cdata.Survived)
# homogeneity : score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling
# completeness : score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
# v_measure : harmonic mean of the first two (homogeneity, completeness)

(0.017448694657994979, 0.01816639548313433, 0.017800313662861281)

To learn more about Segmentation

- Customer Analysis and Segmentation

http://inseaddataanalytics.github.io/INSEADAnalytics/Report_s45.html

- Customer Segmentation @Bain and Company

Assignment dataset

http://archive.ics.uci.edu/ml/machine-learning-databases/sponge/sponge.info

http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant