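- Grouping samples into clusters requires a notion of distance or similarity
- Distance and similarity measures are defined on numeric variables, so categorical variables must first be converted to numeric ones
- Converting a categorical variable to numbers = dummy (one-hot) encoding = pd.get_dummies
    - Categorical variables stored as numbers (hour, month, numerically coded labels, etc.) are left untouched by pd.get_dummies by default, so cast them to str with astype(str) before encoding
- Distance/similarity measures
    1. Euclidean distance: the most commonly used distance measure, defined as the straight-line distance between two points (the path light would take)
    2. Manhattan distance: suited to integer-valued data (Likert scales, etc.); defined as the total distance travelled when moving only horizontally and vertically
    3. Cosine similarity: measures similarity of direction while ignoring scale; used in settings such as product recommendation
    4. Matching similarity: suited to binary data; the proportion of all features on which the two samples agree
    5. Jaccard similarity: suited to binary data, especially sparse binary data; among the features where at least one of the two samples has a 1, the proportion where both have a 1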
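The five measures listed above can be written out in a few lines of NumPy. This is a minimal sketch for illustration only; the function names and the toy vectors are assumptions, not part of the original notebook.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; ignores their lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matching_similarity(a, b):
    # Share of all features on which the two binary vectors agree
    return float(np.mean(a == b))

def jaccard_similarity(a, b):
    # Among features where at least one vector is 1, the share where both are 1
    either = (a == 1) | (b == 1)
    both = (a == 1) & (b == 1)
    # Returning 0.0 when neither vector has a 1 is a convention chosen for this sketch
    return float(both.sum() / either.sum()) if either.any() else 0.0

# Toy numeric vectors: y points in the same direction as x but is twice as long
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Toy sparse binary vectors: they agree on 7 of 8 features, mostly shared zeros
u = np.array([1, 0, 0, 1, 0, 0, 0, 0])
v = np.array([1, 0, 0, 0, 0, 0, 0, 0])

print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
print(matching_similarity(u, v), jaccard_similarity(u, v))
```

On the sparse pair, matching similarity is high because shared zeros count as agreement, while Jaccard similarity ignores them. That is why Jaccard is usually preferred for sparse binary data.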

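# 1. Hierarchical clustering

- Treats each sample as its own cluster, then repeatedly merges the two closest clusters to build progressively larger clusters
- Ways to measure the distance between clusters
    1. Single linkage: sensitive to outliers, computationally heavy
    2. Complete linkage: sensitive to outliers, computationally heavy
    3. Average linkage: robust to outliers, computationally heavy
    4. Centroid linkage: robust to outliers, computationally light
    5. Ward linkage: very robust to outliers, computationally very heavy, tends to produce clusters of similar size

- Advantages
    1. The merging process can be inspected with a dendrogram
    2. Only a distance/similarity matrix is needed to cluster
    3. Works with a variety of distance measures
    4. Produces the same result every time it is run
- Disadvantages
    1. Relatively heavy computation
    2. The choice of the number of clusters is constrained (a k-cluster solution can only be obtained by merging two clusters of the (k+1)-cluster solution)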
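To make the linkage options concrete, the sketch below runs several of them on the same precomputed distances using SciPy rather than the scikit-learn class used later in these notes. The synthetic blobs and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 20 points each
points = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(20, 2))
    for c in ((0, 0), (5, 5), (0, 5))
])

# Condensed vector of pairwise Euclidean distances; this is all the algorithm needs
condensed = pdist(points, metric="euclidean")

labels_by_method = {}
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(condensed, method=method)          # merge history, shape (n - 1, 4)
    labels_by_method[method] = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters

for method, labels in labels_by_method.items():
    print(method, np.bincount(labels)[1:])         # cluster sizes (labels start at 1)
```

On data this clean the methods should all recover the same three groups. Their differences show up mainly with outliers, elongated shapes, or unequal cluster sizes. Note that centroid and Ward linkage assume the distances are Euclidean.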

## Dendrogram

- A graph that shows the hierarchical clustering process as a tree
- With many samples, a dendrogram becomes too cluttered to interpret
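One common way to keep a dendrogram readable with many samples is to draw only the top of the tree. The sketch below uses SciPy's `dendrogram` with `truncate_mode="lastp"`; the random data is an assumption, and `no_plot=True` is used so the example runs without matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 4))     # 200 synthetic samples, 4 features
Z = linkage(samples, method="ward")

# Full tree: one leaf per sample, hard to read when drawn
full = dendrogram(Z, no_plot=True)

# Truncated tree: everything below the last p merges is collapsed into a single leaf
truncated = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

print(len(full["leaves"]), len(truncated["leaves"]))
```

Dropping `no_plot=True` (with matplotlib installed) draws the tree instead of only returning its layout.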

In [1]:
import pandas as pd

In [5]:
df = pd.read_csv('Telco_customer_info.csv')
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [6]:
df.set_index('customerID', inplace = True)
df.head()

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [7]:
# Dummy-encode the categorical variables

df = pd.get_dummies(df, drop_first = True)
df.head()

customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
7590-VHVEG,0,1,29.85,29.85,0,1,0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
5575-GNVDE,0,34,56.95,1889.5,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
3668-QPYBK,0,2,53.85,108.15,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1
7795-CFOCW,0,45,42.3,1840.75,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
9237-HQITU,0,2,70.7,151.65,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0
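In the output above, `SeniorCitizen` was not expanded because it is stored as an integer. That is harmless for a 0/1 flag, but a numerically coded category with more than two levels needs the `astype(str)` step first. A minimal sketch on a made-up frame (the `toy` data and column names are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 2, 12, 2],          # categorical, but stored as integers
    "plan": ["A", "B", "A", "C"],    # categorical, stored as strings
})

# By default only object/string/category columns are encoded; `month` stays one numeric column.
# dtype=int gives 0/1 columns; newer pandas versions default to True/False.
as_is = pd.get_dummies(toy, dtype=int)

# Casting to str first makes get_dummies treat `month` as categorical
encoded = pd.get_dummies(toy.assign(month=toy["month"].astype(str)), dtype=int)

print(list(as_is.columns))
print(list(encoded.columns))
```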


In [8]:
df.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No internet service', 'StreamingMovies_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
      dtype='object')

In [9]:
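# Clustering

from sklearn.cluster import AgglomerativeClustering

# Note: scikit-learn 1.2 renamed `affinity` to `metric` and 1.4 removed `affinity`; use `metric = ...` on those versions
clusters = AgglomerativeClustering(n_clusters = 3,     # number of clusters = 3
                                  affinity = 'euclidean',     # distance measure: euclidean, manhattan, cosine, precomputed
                                                              # precomputed: set this when the input is a precomputed distance matrix
                                  linkage = 'ward').fit(df)  # linkage: distance between clusters; 'ward' only works with euclidean
                                            # linkage options: ward, complete (complete linkage), average (average linkage), single (single linkage)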
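The `precomputed` option mentioned in the comments can be sketched as follows: build a square distance matrix yourself and pass it to `fit` in place of the feature table. The synthetic data is an assumption. The keyword lookup is a workaround so the sketch runs whether the installed scikit-learn calls the parameter `metric` or `affinity`.

```python
import inspect

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two synthetic blobs of 15 points each
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(15, 2))
    for c in ((0, 0), (4, 4))
])

# Square (n x n) matrix of Manhattan distances
D = squareform(pdist(X, metric="cityblock"))

# Pick whichever keyword this scikit-learn version accepts for the distance measure
params = inspect.signature(AgglomerativeClustering).parameters
metric_kw = "metric" if "metric" in params else "affinity"

model = AgglomerativeClustering(n_clusters=2, linkage="average",
                                **{metric_kw: "precomputed"})
labels = model.fit_predict(D)   # D is treated as distances, not as features
print(labels)
```

`linkage="ward"` cannot be combined with a precomputed matrix, since Ward needs the raw Euclidean coordinates.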

In [10]:
df['군집정보'] = clusters.labels_     # labels_: cluster label of each fitted sample (ndarray); the column name '군집정보' means "cluster info"
df['군집정보'].head()

customerID
7590-VHVEG    2
5575-GNVDE    1
3668-QPYBK    2
7795-CFOCW    1
9237-HQITU    2
Name: 군집정보, dtype: int64

In [11]:
df.head()

customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,군집정보
7590-VHVEG,0,1,29.85,29.85,0,1,0,0,1,0,...,0,0,0,0,0,1,0,1,0,2
5575-GNVDE,0,34,56.95,1889.5,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,1
3668-QPYBK,0,2,53.85,108.15,1,0,0,1,0,0,...,0,0,0,0,0,1,0,0,1,2
7795-CFOCW,0,45,42.3,1840.75,1,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
9237-HQITU,0,2,70.7,151.65,0,0,0,1,0,0,...,0,0,0,0,0,1,0,1,0,2


In [12]:
df.groupby(['군집정보'])[['MonthlyCharges', 'TotalCharges']].mean()

군집정보,MonthlyCharges,TotalCharges
0,92.846384,5615.733243
1,65.558747,2192.519269
2,48.428814,444.712194


In [13]:
df.groupby(['군집정보'])[['SeniorCitizen', 'StreamingTV_Yes']].mean()

군집정보,SeniorCitizen,StreamingTV_Yes
0,0.213708,0.744738
1,0.181723,0.390078
2,0.121936,0.176471
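The clusters above differ most sharply in `TotalCharges`. That is what one would expect when Euclidean distance is computed on unscaled columns, because the column with the largest numeric range dominates. The sketch below is not part of the original analysis. It uses the `tenure` and `TotalCharges` values of the five customers shown earlier to illustrate the effect and one possible remedy, standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tenure and TotalCharges for the five customers shown in df.head() earlier
X = np.array([
    [1.0, 29.85],
    [34.0, 1889.5],
    [2.0, 108.15],
    [45.0, 1840.75],
    [2.0, 151.65],
])

def share_of_last_column(a, b):
    """Fraction of the squared Euclidean distance contributed by the last column."""
    sq = (a - b) ** 2
    return float(sq[-1] / sq.sum())

# On the raw values, TotalCharges accounts for nearly all of the distance
raw_share = share_of_last_column(X[0], X[1])

# After standardizing each column to mean 0 and standard deviation 1, both columns contribute
X_scaled = StandardScaler().fit_transform(X)
scaled_share = share_of_last_column(X_scaled[0], X_scaled[1])

print(round(raw_share, 4), round(scaled_share, 4))
```

Whether to scale is a modelling decision. If spending level is meant to drive the segmentation, leaving the columns unscaled may be intentional.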


#### Clustering customers by purchase history

In [15]:
df = pd.read_csv('베스트셀러_도서구매기록.txt', sep = '\t', encoding = 'cp949')
df.head()

,회원번호,책제목
0,75111,정글만리 1
1,48022,정글만리 1
2,3063,해커스 토익 Reading
3,84128,뱃살부터 빼셔야겠습니다
4,77611,장하준의 경제학 강의


In [16]:
matrix_df = pd.crosstab(index = df['회원번호'],
                       columns = df['책제목'])
matrix_df.head()

회원번호,1cm+,21세기 자본,EBS FM 라디오 고교 영어듣기 (2014년),EBS N제 국어영역 국어 270제 A형 (2014년),EBS N제 국어영역 국어 270제 B형 (2014년),EBS N제 영어영역 영어 280제 (2014년),EBS 수능완성 국어영역 국어 A형 유형편+실전편 (2014년),EBS 수능완성 국어영역 국어 B형 유형편+실전편 (2014년),EBS 수능완성 수학영역 기하와 벡터 (2014년),EBS 수능완성 수학영역 미적분과 통계 기본 유형편+실전편 A형 (2014년),...,정글만리 3,제3인류 3,창문 넘어 도망친 100세 노인,책은 도끼다,"총, 균, 쇠",칼 비테의 자녀교육 불변의 법칙,코스모스,해커스 토익 Listening,해커스 토익 Reading,해커스 토익 보카 Vocabulary
808,0,0,1,0,1,1,0,3,0,1,...,0,0,1,0,0,0,0,0,0,0
1101,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1479,0,0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1805,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2011,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Clustering

from sklearn.cluster import AgglomerativeClustering as AC

# Note: on scikit-learn 1.2 or later the `affinity` argument is named `metric`
clustering_model = AC(n_clusters = 10,
                     affinity = 'jaccard',
                     linkage = 'average')
clustering_model.fit(matrix_df)

AgglomerativeClustering(affinity='jaccard', compute_full_tree='auto',
                        connectivity=None, distance_threshold=None,
                        linkage='average', memory=None, n_clusters=10)
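`matrix_df` holds purchase counts (some cells are greater than 1), whereas Jaccard is defined for binary data. One option is to reduce the counts to bought / not bought before computing distances. The sketch below shows the idea on a made-up purchase log, using SciPy for the clustering step. The `purchases` frame and its column names are assumptions.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical purchase log: member 1 bought title B twice
purchases = pd.DataFrame({
    "member": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    "title":  ["A", "B", "B", "A", "B", "C", "D", "C", "D", "E"],
})

counts = pd.crosstab(purchases["member"], purchases["title"])  # purchase counts
bought = counts > 0                                            # True where the member bought the title

# Jaccard distance = 1 - Jaccard similarity, computed on the boolean matrix
dist = pdist(bought.to_numpy(), metric="jaccard")
Z = linkage(dist, method="average")
toy_labels = fcluster(Z, t=2, criterion="maxclust")

print(dict(zip(bought.index, toy_labels)))
```

Members 1 and 2 bought the same set of titles (distance 0), and members 3 and 4 share two of three titles (distance 1/3), so the two-cluster cut groups them pairwise.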

In [19]:
cluster_labels = clustering_model.labels_
cluster_labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [20]:
# Check which cluster each member belongs to

cluster_info = pd.DataFrame({'회원ID': matrix_df.index, '소속군집': cluster_labels})
cluster_info.head()

,회원ID,소속군집
0,808,0
1,1101,0
2,1479,0
3,1805,0
4,2011,0


In [21]:
cluster_info['소속군집'].value_counts()

0    665
1     12
2      6
3      4
4      2
8      2
7      1
9      1
5      1
6      1
Name: 소속군집, dtype: int64

- The vast majority of members (665 of 695) fall into cluster 0

In [22]:
# Merge on the index of matrix_df and the 회원ID (member ID) column of cluster_info

matrix_df_with_cluster_info = pd.merge(matrix_df, cluster_info, left_index = True, right_on = '회원ID')
matrix_df_with_cluster_info.head()

,1cm+,21세기 자본,EBS FM 라디오 고교 영어듣기 (2014년),EBS N제 국어영역 국어 270제 A형 (2014년),EBS N제 국어영역 국어 270제 B형 (2014년),EBS N제 영어영역 영어 280제 (2014년),EBS 수능완성 국어영역 국어 A형 유형편+실전편 (2014년),EBS 수능완성 국어영역 국어 B형 유형편+실전편 (2014년),EBS 수능완성 수학영역 기하와 벡터 (2014년),EBS 수능완성 수학영역 미적분과 통계 기본 유형편+실전편 A형 (2014년),...,창문 넘어 도망친 100세 노인,책은 도끼다,"총, 균, 쇠",칼 비테의 자녀교육 불변의 법칙,코스모스,해커스 토익 Listening,해커스 토익 Reading,해커스 토익 보카 Vocabulary,회원ID,소속군집
0,0,0,1,0,1,1,0,3,0,1,...,1,0,0,0,0,0,0,0,808,0
1,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1101,0
2,0,0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1479,0
3,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1805,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2011,0


In [23]:
matrix_df_with_cluster_info.groupby(['소속군집'])[matrix_df.columns].mean()

소속군집,1cm+,21세기 자본,EBS FM 라디오 고교 영어듣기 (2014년),EBS N제 국어영역 국어 270제 A형 (2014년),EBS N제 국어영역 국어 270제 B형 (2014년),EBS N제 영어영역 영어 280제 (2014년),EBS 수능완성 국어영역 국어 A형 유형편+실전편 (2014년),EBS 수능완성 국어영역 국어 B형 유형편+실전편 (2014년),EBS 수능완성 수학영역 기하와 벡터 (2014년),EBS 수능완성 수학영역 미적분과 통계 기본 유형편+실전편 A형 (2014년),...,정글만리 3,제3인류 3,창문 넘어 도망친 100세 노인,책은 도끼다,"총, 균, 쇠",칼 비테의 자녀교육 불변의 법칙,코스모스,해커스 토익 Listening,해커스 토익 Reading,해커스 토익 보카 Vocabulary
0,0.063158,0.039098,0.440602,0.264662,0.354887,0.619549,0.312782,0.372932,0.288722,0.324812,...,0.099248,0.033083,0.156391,0.042105,0.097744,0.034586,0.010526,0.03609,0.039098,0.04812
1,0.5,0.5,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,1.0,0.333333,1.166667,0.5,0.0,0.0,0.0,0.0,0.0,0.25
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0
3,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
# Find the most-purchased book in each cluster

matrix_df_with_cluster_info.groupby(['소속군집'])[matrix_df.columns].mean().idxmax(axis = 1)

소속군집
0    EBS 수능특강 영어영역 영어 (2014년)
1           창문 넘어 도망친 100세 노인
2                 나미야 잡화점의 기적
3              어떻게 원하는 것을 얻는가
4    EBS 수능특강 영어영역 영어 (2014년)
5                   강신주의 감정수업
6                    마법천자문 27
7                 꾸뻬 씨의 행복 여행
8              나는 까칠하게 살기로 했다
9                    월급쟁이 부자들
dtype: object
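`idxmax(axis = 1)` returns only the single top column for each row. If the top few titles per cluster are wanted, `nlargest` can be applied row by row. A small sketch on a made-up table of per-cluster means (the `means` frame and its values are assumptions):

```python
import pandas as pd

# Hypothetical mean purchase counts per cluster (rows) and title (columns)
means = pd.DataFrame(
    {"A": [0.9, 0.1], "B": [0.5, 0.8], "C": [0.2, 0.7]},
    index=pd.Index([0, 1], name="cluster"),
)

# Single best-selling title per cluster
top1 = means.idxmax(axis=1)

# Two best-selling titles per cluster, highest first
top2 = {cluster: row.nlargest(2).index.tolist() for cluster, row in means.iterrows()}

print(top1.to_dict())
print(top2)
```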

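# 2. k-means clustering

- Picks k center points, assigns each sample to its nearest center, then recomputes the centers; repeating the assign/update steps produces k clusters
- Advantages
    1. Relatively light computation
    2. The number of clusters can be set freely and easily

- Disadvantages
    1. Results can differ from run to run depending on the initial centers
    2. Struggles when the data distribution is unusual or when clusters differ in density
    3. Only Euclidean distance can be used
    4. May not converge within the allowed number of iterations (max_iter), and a converged solution is only a local optimum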
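The assign/update loop described above can be sketched directly in NumPy. This is a bare-bones illustration with random initial centers, not scikit-learn's implementation; the function name, defaults, and toy data are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=50, seed=0):
    """Plain assign/update k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initial centers: k distinct samples chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break   # centers stopped moving: converged
        centers = new_centers
    return labels, centers

# Two synthetic, well-separated blobs of 20 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

labels, centers = kmeans(X, k=2)
print(labels)
print(centers.round(2))
```

Changing `seed` changes the initial centers, which is the source of the run-to-run variation listed under disadvantages.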

In [26]:
import pandas as pd
df = pd.read_csv('Telco_customer_info.csv')
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [27]:
df.set_index('customerID', inplace = True)
df.head()

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65


In [28]:
# Dummy-encode the categorical variables

df = pd.get_dummies(df, drop_first = True)
df.head()

customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
7590-VHVEG,0,1,29.85,29.85,0,1,0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
5575-GNVDE,0,34,56.95,1889.5,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
3668-QPYBK,0,2,53.85,108.15,1,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1
7795-CFOCW,0,45,42.3,1840.75,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
9237-HQITU,0,2,70.7,151.65,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,1,0


In [29]:
df.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
       'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
       'MultipleLines_No phone service', 'MultipleLines_Yes',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No internet service', 'DeviceProtection_Yes',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No internet service', 'StreamingMovies_Yes',
       'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
      dtype='object')

In [30]:
# k-means clustering

from sklearn.cluster import KMeans

clusters = KMeans(n_clusters = 3,     # n_clusters = number of clusters
                 max_iter = 50).fit(df)     # max_iter = maximum number of iterations

In [31]:
clusters.labels_     # cluster label of each fitted sample (ndarray)

array([0, 0, 0, ..., 0, 0, 2])
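Because k-means starts from random centers, the labels above may change between runs. Fixing `random_state` makes a run repeatable, and `n_init` controls how many random starts are tried, with the best one kept. A sketch on synthetic data (the blobs and parameter values are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs of 30 points each
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(30, 2))
    for c in ((0, 0), (4, 0), (2, 4))
])

# Same seed, same settings: the two fits should give identical labels
run_a = KMeans(n_clusters=3, n_init=10, max_iter=50, random_state=42).fit(X)
run_b = KMeans(n_clusters=3, n_init=10, max_iter=50, random_state=42).fit(X)

print(np.array_equal(run_a.labels_, run_b.labels_))
print(run_a.n_iter_)    # iterations used by the best start; compare with max_iter
print(run_a.inertia_)   # within-cluster sum of squared distances
```

If `n_iter_` equals `max_iter`, the run may have been cut off before converging, and raising `max_iter` is worth trying.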

In [32]:
# Inspect the cluster centers and label each row with its cluster number

pd.DataFrame(clusters.cluster_centers_,     # cluster_centers_: coordinates of each cluster's center (ndarray)
            columns = df.columns,
            index = range(3))

,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.128661,18.264522,49.767331,688.070319,0.5012,0.37614,0.285166,0.891263,0.108737,0.269803,...,0.3643783,0.194911,0.3643783,0.195391,0.145943,0.165627,0.526644,0.164186,0.346375,0.324532
1,0.207325,44.132216,77.818963,3281.164184,0.517691,0.556797,0.303538,0.860335,0.139665,0.497827,...,0.001241465,0.545003,0.001241465,0.553073,0.301676,0.214773,0.671633,0.260708,0.34761,0.116077
2,0.216733,64.384861,97.979243,6297.778685,0.499602,0.740239,0.336255,0.998406,0.001594,0.829482,...,4.524159e-15,0.807171,4.524159e-15,0.81753,0.301195,0.517131,0.710757,0.332271,0.288446,0.051793
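Cluster numbers are arbitrary, so it can help to give each cluster a descriptive name based on its center. The sketch below ranks clusters by their `TotalCharges` center, using the rounded center values from the table above. The names and the short `labels` array are hypothetical.

```python
import numpy as np
import pandas as pd

# Two columns of the cluster centers shown above, rounded
centers = pd.DataFrame(
    {"tenure": [18.3, 44.1, 64.4], "TotalCharges": [688.1, 3281.2, 6297.8]},
    index=range(3),
)

# Hypothetical names, assigned from lowest to highest TotalCharges center
names = ["low spend", "mid spend", "high spend"]
order = centers["TotalCharges"].sort_values().index
name_of = dict(zip(order, names))

named_centers = centers.rename(index=name_of)

# Illustrative labels standing in for clusters.labels_
labels = np.array([0, 0, 0, 1, 2])
named_labels = pd.Series(labels).map(name_of)

print(named_centers)
print(named_labels.tolist())
```

Ranking by a center value rather than hard-coding `0`, `1`, `2` keeps the names correct even if a rerun numbers the clusters differently.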
