# Identifying Card-Farming Groups from Login Activity via Clustering

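### Silhouette Coefficient
* Silhouette coefficient <br>
The silhouette coefficient applies when the true class labels are unknown. For a single sample, let $a$ be the mean distance to the other samples in its own cluster (a measure of intra-cluster distance), and let $b$ be the mean distance to the samples in the nearest cluster it does not belong to (a measure of inter-cluster distance).<br>
* $$s = \frac{b-a}{\max(a,b)}$$
<br>
For a set of samples, the silhouette coefficient is the mean of the per-sample coefficients. It ranges over $[-1, 1]$: the closer samples are to others in their own cluster and the farther they are from samples in other clusters, the higher the score.<br>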
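
The definition can be checked by hand on a tiny dataset. The sketch below uses four made-up 1-D points (illustrative only, not the project data) and a hypothetical helper `silhouette_of`. It assumes every cluster holds at least two samples, and it compares the result with scikit-learn's `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 1-D data (illustrative only): two well-separated clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_of(i, X, labels):
    """Silhouette coefficient of sample i, straight from the definition.

    Assumes the cluster of sample i contains at least one other sample.
    """
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance to every sample
    same = labels == labels[i]
    same[i] = False                           # exclude the sample itself
    a = dists[same].mean()                    # mean intra-cluster distance
    # mean distance to each other cluster; keep the nearest one
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_of(i, X, labels) for i in range(len(X))])
print(manual)                        # per-sample coefficients
print(silhouette_samples(X, labels)) # scikit-learn's per-sample values
print(manual.mean(), silhouette_score(X, labels))
```

The per-sample values and their mean should agree with scikit-learn up to floating-point error.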


In [None]:
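import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics

# X is not defined in the original cell; synthetic blobs stand in for the real feature matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans_model = KMeans(n_clusters=3, random_state=0).fit(X)
labels = kmeans_model.labels_
# Mean silhouette coefficient over all samples, using Euclidean distance
print(metrics.silhouette_score(X, labels, metric='euclidean'))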

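### Calinski-Harabasz Index
This index is simple and cheap to compute. In short, a higher Calinski-Harabasz score indicates better-separated clusters.<br>
$$s(k)=\frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)}\times\frac{m-k}{k-1}$$
<br>where $m$ is the number of samples in the dataset, $k$ is the number of clusters, $B_k$ is the between-cluster dispersion matrix, $W_k$ is the within-cluster dispersion matrix, and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. In other words, the score is high when the within-cluster dispersion is small and the between-cluster dispersion is large. In scikit-learn the index is computed by metrics.calinski_harabasz_score (older releases spelled it metrics.calinski_harabaz_score).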

$$W_k = \sum\limits_{q=1}^k{\sum\limits_{x \in C_q}(x-c_q)(x-c_q)^T}$$
<br>
$$B_k=\sum\limits_{q=1}^k n_q(c_q-c)(c_q-c)^T$$
<br>
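$C_q$ is the set of points in cluster $q$, $c_q$ is the center of cluster $q$, $c$ is the center of the whole dataset $E$, and $n_q$ is the number of points in cluster $q$.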
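
As a sanity check on the formulas, the sketch below computes $\mathrm{tr}(W_k)$ and $\mathrm{tr}(B_k)$ directly and compares the result with scikit-learn. The data are synthetic Gaussian blobs and `calinski_harabasz_manual` is a hypothetical helper name; both are assumptions made for illustration. It relies on the fact that the trace of $(x-c_q)(x-c_q)^T$ equals the squared Euclidean norm of $x-c_q$.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Illustrative data: three Gaussian blobs in 2-D, 50 points each
centers = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=ctr, scale=0.5, size=(50, 2)) for ctr in centers])
labels = np.repeat([0, 1, 2], 50)


def calinski_harabasz_manual(X, labels):
    """Calinski-Harabasz score computed from tr(B_k) and tr(W_k)."""
    m = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    c = X.mean(axis=0)                      # center of the whole dataset
    tr_W = 0.0
    tr_B = 0.0
    for q in clusters:
        X_q = X[labels == q]                # points in cluster q
        c_q = X_q.mean(axis=0)              # center of cluster q
        tr_W += ((X_q - c_q) ** 2).sum()    # trace of the within-cluster term
        tr_B += len(X_q) * ((c_q - c) ** 2).sum()  # trace of the between-cluster term
    return (tr_B / tr_W) * (m - k) / (k - 1)


manual_ch = calinski_harabasz_manual(X, labels)
sklearn_ch = calinski_harabasz_score(X, labels)
print(manual_ch, sklearn_ch)
```

The two printed values should match up to floating-point error.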

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import MultipleLocator,FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## Connecting to the Hive database

In [None]:
from pyhive import hive

# Connect to HiveServer2 (the keyword is `port`, not `post`) and fetch every row of the table
cursor = hive.connect(host='192.168.1.106', port=10000,
                username='root', database='zjyd3').cursor()
cursor.execute('select * from temp_wqf_log_info_04')
a = cursor.fetchall()
alldata = pd.DataFrame(a, columns=['phonenumber','sl','ips_pro','imeis_pro','timecv','fail_pro'])
del a  # free the raw row list once the DataFrame exists

## Data normalization

In [None]:
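from sklearn.preprocessing import MinMaxScaler

# Select the clustering features; fillna returns a new frame, which avoids
# modifying a slice of alldata in place (pandas' SettingWithCopyWarning)
train_data = alldata[['sl','ips_pro','imeis_pro','timecv']].fillna(0)
# Rescale every feature to the range [0, 1]
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)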
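
For reference, with its default settings `MinMaxScaler` maps each column to $[0, 1]$ via $(x - x_{min}) / (x_{max} - x_{min})$. Scaling matters here because K-means relies on Euclidean distance, so an unscaled feature with a large range would dominate the clustering. The sketch below uses a small made-up array (not the Hive data) to show the equivalence.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: two columns on very different scales
features = np.array([[1.0, 100.0],
                     [2.0, 300.0],
                     [4.0, 200.0]])

scaled = MinMaxScaler().fit_transform(features)

# Same transformation written out column by column
col_min = features.min(axis=0)
col_max = features.max(axis=0)
manual_scaled = (features - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual_scaled))
```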

## Model selection and tuning

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# calinski_harabaz_score was renamed calinski_harabasz_score in newer scikit-learn releases
from sklearn.metrics import calinski_harabasz_score

n_class = []
ch_values = []
# Try k = 2..11 and record the Calinski-Harabasz score for each k
for n_cluster in range(2, 12):
    print(n_cluster)  # progress indicator
    clf = KMeans(n_clusters=n_cluster, random_state=0)
    label = clf.fit_predict(train_data)
    ch_values.append(calinski_harabasz_score(train_data, label))
    n_class.append(n_cluster)

# Create the axes first so the tick settings apply to the plotted line
ax = plt.subplot(111)
ax.plot(n_class, ch_values, label='First line', linewidth=3,
        color='r', marker='o', markerfacecolor='blue',
        markersize=12)
# One major tick per integer number of clusters
x_major_Locator = MultipleLocator(1)
x_major_Formatter = FormatStrFormatter('%5.0f')
ax.xaxis.set_major_locator(x_major_Locator)
ax.xaxis.set_major_formatter(x_major_Formatter)
ax.set_xlabel('n_class')
ax.set_ylabel('ch values')
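
The cell above only plots the Calinski-Harabasz score. A possible complement, sketched below, is to record the silhouette coefficient as well and read off the `k` that maximizes each metric. The data here are synthetic blobs from `make_blobs`, and the names `X_demo`, `scores`, `best_k_ch`, and `best_k_sil` are illustrative assumptions, not part of the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for the scaled feature matrix (illustrative only)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}  # k -> (Calinski-Harabasz score, silhouette coefficient)
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = (calinski_harabasz_score(X_demo, labels_k),
                 silhouette_score(X_demo, labels_k))

# Higher is better for both metrics
best_k_ch = max(scores, key=lambda k: scores[k][0])
best_k_sil = max(scores, key=lambda k: scores[k][1])

for k, (ch, sil) in scores.items():
    print(f"k={k}: CH={ch:.1f}, silhouette={sil:.3f}")
print("best k by CH:", best_k_ch, "| best k by silhouette:", best_k_sil)
```

When the two metrics disagree, the per-`k` table is usually more informative than either maximum alone. Note that `silhouette_score` computes pairwise distances, so on a large table it may be worth passing its `sample_size` argument.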