# Clustering Practice - Customer Segmentation

Loading and Cleansing the Dataset

In [2]:
import pandas as pd
import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

retail_df=pd.read_excel(io='Online Retail.xlsx')
retail_df.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom


In [3]:
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [4]:
# Keep only rows with a positive Quantity, a positive UnitPrice, and a non-null CustomerID
retail_df=retail_df[retail_df['Quantity']>0]
retail_df=retail_df[retail_df['UnitPrice']>0]
retail_df=retail_df[retail_df['CustomerID'].notnull()]
print(retail_df.shape)
retail_df.isnull().sum()

(397884, 8)


InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
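
The three filters in the cell above can also be combined into a single boolean mask. The following is a minimal sketch on a made-up frame; `toy_df`, `mask`, `clean_df`, and all of the values are hypothetical and are not taken from Online Retail.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from Online Retail.xlsx
toy_df = pd.DataFrame({
    'Quantity':   [6, -1, 8, 3],
    'UnitPrice':  [2.55, 3.39, 0.0, 1.25],
    'CustomerID': [17850.0, 17850.0, 13047.0, np.nan],
})

# Keep only rows with a positive quantity, a positive price, and a known customer
mask = (
    (toy_df['Quantity'] > 0)
    & (toy_df['UnitPrice'] > 0)
    & toy_df['CustomerID'].notnull()
)
clean_df = toy_df[mask]
print(clean_df.shape)
```

Only the first toy row satisfies all three conditions, so one row remains.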

In [None]:
retail_df['Country'].value_counts()[:5]

In [None]:
# .copy() so that later column assignments do not raise SettingWithCopyWarning
retail_df=retail_df[retail_df['Country']=='United Kingdom'].copy()
print(retail_df.shape)

RFM-Based Data Processing

In [None]:
retail_df['sale_amount']=retail_df['Quantity']*retail_df['UnitPrice']
retail_df['CustomerID']=retail_df['CustomerID'].astype(int)
print(retail_df['CustomerID'].value_counts().head(5))
print(retail_df.groupby('CustomerID')['sale_amount'].sum().sort_values(ascending=False)[:5])

In [None]:
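# Average row count per (InvoiceNo, StockCode) pair; a value close to 1 means the pair is nearly unique
retail_df.groupby(["InvoiceNo","StockCode"])["InvoiceNo"].count().mean()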

In [None]:

# Per-customer aggregation: last invoice date, number of invoice rows, total sale amount
aggregations={
    'InvoiceDate':'max',
    'InvoiceNo':'count',
    'sale_amount':'sum'
}
cust_df=retail_df.groupby('CustomerID').agg(aggregations)
cust_df=cust_df.rename(columns={
    'InvoiceDate':'Recency',
    'InvoiceNo':'Frequency',
    'sale_amount':'Monetary'
})
cust_df=cust_df.reset_index()
cust_df.head(3)

In [None]:
import datetime as dt
# Recency: days from the reference date (2011-12-10) to the last purchase, plus one
cust_df['Recency']=dt.datetime(2011,12,10)-cust_df['Recency']
cust_df['Recency']=cust_df['Recency'].dt.days+1
print('cust_df shape (rows, columns):',cust_df.shape)
cust_df.head(3)
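
To make the RFM construction easier to check by hand, here is a minimal sketch of the same idea on three made-up transactions. It uses pandas named aggregation instead of the dict-plus-rename approach above; `toy_df`, `rfm_df`, `last_purchase`, `reference_date`, and the values are all hypothetical.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy transactions, not taken from Online Retail.xlsx
toy_df = pd.DataFrame({
    'CustomerID':  [1, 1, 2],
    'InvoiceNo':   ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-09', '2011-11-10']),
    'sale_amount': [10.0, 5.0, 20.0],
})

# Named aggregation builds the per-customer columns in one step
rfm_df = toy_df.groupby('CustomerID').agg(
    last_purchase=('InvoiceDate', 'max'),
    Frequency=('InvoiceNo', 'count'),
    Monetary=('sale_amount', 'sum'),
).reset_index()

# Recency = days between the reference date and the last purchase, plus one
reference_date = dt.datetime(2011, 12, 10)
rfm_df['Recency'] = (reference_date - rfm_df['last_purchase']).dt.days + 1
print(rfm_df[['CustomerID', 'Recency', 'Frequency', 'Monetary']])
```

Customer 1 last bought one day before the reference date, so Recency is 2; customer 2 last bought 30 days before, so Recency is 31.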

RFM-Based Customer Segmentation

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(figsize=(12,4),nrows=1,ncols=3)
ax1.set_title('Recency Histogram')
ax1.hist(cust_df['Recency'])
ax2.set_title('Frequency Histogram')
ax2.hist(cust_df['Frequency'])
ax3.set_title('Monetary Histogram')
ax3.hist(cust_df['Monetary'])

In [None]:
cust_df[['Recency','Frequency','Monetary']].describe()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,silhouette_samples

x_features=cust_df[['Recency','Frequency','Monetary']].values
x_features_scaled=StandardScaler().fit_transform(x_features)

kmeans=KMeans(n_clusters=3,random_state=0)
labels=kmeans.fit_predict(x_features_scaled)
cust_df['cluster_label']=labels
print('Silhouette score: {0:.3f}'.format(silhouette_score(x_features_scaled,labels)))
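
The cell above fixes `n_clusters=3`. One possible way to sanity-check such a choice is to compare silhouette scores across several candidate cluster counts. The sketch below does this on synthetic blobs rather than on the retail customers; `centers`, `x_demo`, `scores`, and `best_k` are hypothetical names, and `n_init=10` is set explicitly only to keep behavior the same across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: three well-separated blobs, not the retail customers
centers = [[0, 0], [10, 10], [-10, 10]]
x_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
x_demo_scaled = StandardScaler().fit_transform(x_demo)

# Fit KMeans for several candidate cluster counts and record each silhouette score
scores = {}
for k in range(2, 6):
    demo_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x_demo_scaled)
    scores[k] = silhouette_score(x_demo_scaled, demo_labels)
    print('k={0}: silhouette score {1:.3f}'.format(k, scores[k]))

# The candidate with the highest average silhouette score
best_k = max(scores, key=scores.get)
print('best k:', best_k)
```

On real, skewed data such as the RFM features, the highest average score does not by itself guarantee a useful segmentation, so the per-cluster sizes are worth inspecting as well.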

In [None]:
# Log-transform the RFM columns with log1p, i.e. log(1 + x), before scaling
cust_df['Recency_log']=np.log1p(cust_df['Recency'])
cust_df['Frequency_log']=np.log1p(cust_df['Frequency'])
cust_df['Monetary_log']=np.log1p(cust_df['Monetary'])

x_features=cust_df[['Recency_log','Frequency_log','Monetary_log']].values
x_features_scaled=StandardScaler().fit_transform(x_features)
kmeans=KMeans(n_clusters=3,random_state=0)
labels=kmeans.fit_predict(x_features_scaled)
cust_df['cluster_label']=labels
print('Silhouette score: {0:.3f}'.format(silhouette_score(x_features_scaled,labels)))
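
As a rough illustration of why a log transform is applied before clustering, the sketch below measures skewness before and after `np.log1p` on a synthetic right-skewed sample. The sample is a hypothetical stand-in for a column such as Monetary; `rng`, `raw`, and `logged` are illustrative names, and the log-normal parameters are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a column such as Monetary
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=5.0, sigma=1.5, size=5000))

# log1p computes log(1 + x), so it stays defined at x = 0
logged = np.log1p(raw)

print('skew before: {0:.3f}'.format(raw.skew()))
print('skew after : {0:.3f}'.format(logged.skew()))
```

The transformed sample is far less skewed, which keeps a handful of very large values from dominating the distance calculations in KMeans.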