# K-means (clustering by finding centroids)
ref: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
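
To make "clustering by finding centroids" concrete, here is a minimal NumPy sketch of the usual assign-then-update loop (Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative Lloyd-style K-means; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Toy data: two well-separated groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_centroids, toy_labels = kmeans_sketch(X_toy, k=2)
print(toy_centroids)
print(toy_labels)
```

Which group receives cluster ID 0 depends on the random initialization, which is why cluster IDs should be treated as arbitrary names.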

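- Iris dataset source (the User Guide describes each column):
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
1. Classes: 3 (three iris species)
2. Samples per class: 50
3. Samples total: 150
4. Dimensionality: 4 (each sample has four features: sepal length and width, petal length and width)
5. Goal: use these four feature columns to determine which species each flower belongs to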
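
The figures in the list above can be checked directly against the loaded dataset. This is a small sketch using `load_iris` and the attributes of the object it returns; the use of `np.bincount` to count samples per class is a choice made for this example.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (number of samples, number of features)
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # names of the three species
print(np.bincount(y))      # number of samples in each class
```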

In [15]:
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

kmeans = KMeans(n_clusters=3)  # ask for 3 clusters
kmeans = kmeans.fit(X)
labels = kmeans.predict(X)  # cluster ID assigned to each sample

centroids = kmeans.cluster_centers_  # coordinates of each cluster's centroid

print('centroids:',centroids)
print('prediction on each data:',labels)


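# Unsupervised learning never sees the labels, so the fitting above does not use y.
# The dataset does ship with the true classes, though, so we can peek at how well K-means recovers them.
# Caveat: cluster IDs are arbitrary, so accuracy_score is only meaningful when the
# cluster IDs happen to line up with the class IDs, as they do in the run shown below.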
accuracy = accuracy_score(y, labels)
num_correct_samples = accuracy_score(y, labels, normalize=False)

print('accuracy:',accuracy)
print('number of correct sample:',num_correct_samples)

centroids: [[5.006      3.428      1.462      0.246     ]
 [5.9016129  2.7483871  4.39354839 1.43387097]
 [6.85       3.07368421 5.74210526 2.07105263]]
prediction on each data: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
 2 1]
accuracy: 0.8933333333333333
number of correct sample: 134
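
Because K-means numbers its clusters arbitrarily, a different run can return the same grouping with the IDs permuted, and a direct `accuracy_score(y, labels)` would then look far worse than the clustering really is. One possible remedy, sketched below, is to pair clusters with classes through the confusion matrix before scoring, or to use a permutation-invariant measure such as `adjusted_rand_score`. The Hungarian matching via `scipy.optimize.linear_sum_assignment`, the `random_state=0` setting, and the variable names are assumptions of this example, not part of the original notebook.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix

iris = datasets.load_iris()
X, y = iris.data, iris.target

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows are true classes, columns are cluster IDs.
cm = confusion_matrix(y, labels)

# Pick the one-to-one class/cluster pairing with the most matching samples.
class_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)
cluster_to_class = {int(c): int(k) for k, c in zip(class_idx, cluster_idx)}
aligned = np.array([cluster_to_class[int(c)] for c in labels])

aligned_accuracy = accuracy_score(y, aligned)
ari = adjusted_rand_score(y, labels)  # unaffected by how clusters are numbered

print('accuracy after aligning cluster IDs:', aligned_accuracy)
print('adjusted Rand index:', ari)
```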


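# DBSCAN (density-based clustering)
- ref: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- Characteristics of the DBSCAN algorithm:
1. The number of clusters is not known in advance
2. Outliers (noise points) are left unassigned to any cluster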
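
Both characteristics follow from how DBSCAN defines density: a point is a "core" point when enough samples fall within a given radius of it, and clusters grow outward from core points. The NumPy sketch below only counts neighbors to show that idea; it does not build the clusters. It assumes the convention that a point counts as its own neighbor, and the values `eps = 0.3` and `min_samples = 5` are simply the ones used in the cell further down.

```python
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data
eps, min_samples = 0.3, 5

# Pairwise Euclidean distances between all samples, shape (150, 150).
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Number of samples within eps of each point (the point itself included).
neighbor_counts = (dists <= eps).sum(axis=1)

# A point in a dense enough neighborhood is a core point.
is_core = neighbor_counts >= min_samples

print('core points:', int(is_core.sum()), 'of', len(X))
```

Points that are neither core points nor within `eps` of a core point end up as noise, which is why a small radius leaves many samples unassigned.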


In [22]:
from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

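# eps: neighborhood radius; min_samples (MinPts): minimum number of points inside that radius
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # cluster label of each sample


# Reminder: -1 marks a noise point that belongs to no cluster.
# This result contains too many -1 labels, so eps or min_samples can be tuned to get a better result.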
print('cluster on X',labels)


# Count the clusters; noise (-1) is not a cluster, so subtract 1 when -1 is present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print('number of clusters:',n_clusters)

cluster on X [ 0  0  0  0  0 -1  0  0  0  0  0  0  0  0 -1 -1 -1  0 -1  0 -1  0 -1  0
  0  0  0  0  0  0  0 -1 -1 -1  0  0 -1  0  0  0  0 -1  0  0 -1  0  0  0
  0  0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1  1  2 -1
 -1 -1 -1 -1 -1 -1 -1 -1  1  1  1 -1 -1 -1 -1 -1  1  1 -1 -1  1 -1  1  1
  1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1  2  2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  2 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1  2]
number of clusters: 3
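
Most samples above are labeled -1, which suggests that `eps=0.3` is too small a radius for this data. A simple way to explore the tuning mentioned in the code comments is to sweep a few radii and count clusters and noise points for each. The candidate values below are illustrative assumptions, not recommendations.

```python
from sklearn import datasets
from sklearn.cluster import DBSCAN

X = datasets.load_iris().data

results = []
for eps in (0.3, 0.5, 0.8):  # illustrative candidate radii
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results.append((eps, n_clusters, n_noise))
    print(f'eps={eps}: clusters={n_clusters}, noise points={n_noise}')
```

With `min_samples` fixed, a larger `eps` can only shrink the set of noise points, but it may also merge clusters that should stay apart, so both counts are worth watching.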


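# EM (GMM): the Expectation–Maximization algorithm
- EM fits a Gaussian mixture model (GMM) by alternating an expectation step and a maximization step
- Assumption: the data within each cluster follow a Gaussian (normal) distribution, so each component's mean can be estimated from the sample mean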
- ref: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
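
The two alternating steps can be sketched in a few lines of NumPy for the simplest case: one-dimensional data and two components. This is an illustration under stated assumptions (synthetic data, a fixed 50 iterations, hand-picked starting values), not what `GaussianMixture` does internally. Note how the M-step mean is a sample mean weighted by each point's responsibility.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative only).
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Starting guesses for the two components.
weights = np.array([0.5, 0.5])
means = np.array([x.min(), x.max()])
variances = np.array([x.var(), x.var()])

def gaussian_pdf(values, mean, var):
    return np.exp(-(values - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each point, shape (n, 2).
    dens = weights * gaussian_pdf(x[:, None], means, variances)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and variances from the responsibilities.
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk

print('weights:', weights)
print('means:', means)
print('variances:', variances)
```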

In [10]:
from sklearn import datasets
from sklearn import mixture

iris = datasets.load_iris()
X = iris.data
y = iris.target

# GaussianMixture: Gaussian mixture model; n_components: number of clusters
gmm = mixture.GaussianMixture(n_components=3).fit(X)
X_pred = gmm.predict(X)

print(X_pred)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2
 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]
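
Unlike K-means, a Gaussian mixture also provides a soft assignment: for each sample, the probability of belonging to each component. The sketch below reads those probabilities with `predict_proba` and inspects the fitted `weights_` and `means_`. The `random_state=0` setting is an assumption added here for repeatability and is not part of the original cell.

```python
import numpy as np
from sklearn import datasets
from sklearn import mixture

X = datasets.load_iris().data

gmm = mixture.GaussianMixture(n_components=3, random_state=0).fit(X)

proba = gmm.predict_proba(X)  # shape (150, 3); each row sums to 1
hard = gmm.predict(X)         # most probable component for each sample

print(proba[:3].round(3))     # soft assignments of the first three samples
print(gmm.weights_.round(3))  # mixing proportion of each component
print(gmm.means_.round(2))    # mean vector of each component
```

As with K-means, the component numbering is arbitrary, so component 1 in one run may be component 2 in another.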
