#Unsupervised learning: seeking representations of the data

##1.Clustering: grouping observations together
###K-means clustering

In [7]:
from sklearn import cluster,datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)  # run the k-means computation
print(k_means.labels_[::10])  # every tenth k-means cluster label
print(y_iris[::10])  # every tenth ground-truth label

[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
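The two printed rows differ only in how the groups are numbered: k-means assigns arbitrary cluster ids, so cluster `1` here corresponds to class `0`. A minimal sketch of aligning the two, using only the fifteen sampled labels printed above (the `mapping` and `agreement` names are illustrative, not part of scikit-learn):

```python
import numpy as np

# Every tenth label, copied from the output printed above.
cluster_labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2])
true_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

# Map each cluster id to the most frequent true class among its members.
mapping = {}
for c in np.unique(cluster_labels):
    members = true_labels[cluster_labels == c]
    mapping[int(c)] = int(np.bincount(members).argmax())

# Relabel the clusters and measure how often they match the ground truth.
relabelled = np.array([mapping[int(c)] for c in cluster_labels])
agreement = float(np.mean(relabelled == true_labels))
print(mapping)
print(agreement)
```

On this small sample the relabelled clusters match the ground truth exactly. That should not be read as a guarantee for the full dataset, and the numbering itself can change from one run to the next.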


Application example: vector quantization

Clustering in general, and KMeans in particular, can be seen as a way of choosing a small number of exemplars to compress the information. This problem is sometimes known as vector quantization. For instance, it can be used to posterize an image:

In [21]:
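import scipy as sp
import numpy as np
# lena() has moved between SciPy versions and was dropped from later releases;
# on such a release, load any other 2-D grayscale array into `lena` instead.
try:
    lena = sp.lena()
except AttributeError:
    from scipy import misc
    lena = misc.lena()
# Flatten the image into a single column of shape (n_pixels, 1), without hard-coding its size
X = lena.reshape((-1, 1))
k_means = cluster.KMeans(n_clusters=5, n_init=1)
k_means.fit(X)
values = k_means.cluster_centers_.squeeze()  # cluster centers (one gray level per cluster)
labels = k_means.labels_  # cluster index of each pixel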

In [28]:
# labels follows the pixel order, so picking each pixel's cluster center
# from values gives the compressed image
lena_compressed = np.choose(labels, values)
lena_compressed.shape = lena.shape
import matplotlib.pyplot as plt
plt.imshow(lena_compressed)
plt.show()
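Since `lena()` is not available in recent SciPy releases, the posterization idea can also be tried on a synthetic image. The following is a minimal NumPy-only sketch of k-means (Lloyd's iterations) on gray levels. The gradient image, the quantile initialisation and the fixed 20 iterations are assumptions made for illustration, not what `cluster.KMeans` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in image: a 64x64 horizontal gradient with a little noise.
image = np.tile(np.linspace(0.0, 255.0, 64), (64, 1))
image = image + rng.normal(scale=5.0, size=image.shape)

n_colors = 5
pixels = image.reshape(-1)  # one gray value per pixel

# Start from evenly spaced quantiles of the gray levels.
centers = np.quantile(pixels, np.linspace(0.1, 0.9, n_colors))
for _ in range(20):
    # Assignment step: each pixel goes to its nearest center.
    labels = np.abs(pixels[:, None] - centers[None, :]).argmin(axis=1)
    # Update step: each center moves to the mean of its pixels.
    for k in range(n_colors):
        if np.any(labels == k):
            centers[k] = pixels[labels == k].mean()

# Recompute the assignment so that labels match the final centers.
labels = np.abs(pixels[:, None] - centers[None, :]).argmin(axis=1)

# Replace every pixel by the gray level of its cluster center.
posterized = centers[labels].reshape(image.shape)
print(np.unique(posterized).size)
```

Indexing with `centers[labels]` plays the same role as `np.choose(labels, values)` in the cell above, and it avoids the cap that `np.choose` places on the number of choices.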

###Hierarchical agglomerative clustering: Ward
- Agglomerative - bottom-up approaches: each observation starts in its own cluster, and clusters are iteratively merged in such a way as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest are made of only a few observations. When the number of clusters is large, it is much more computationally efficient than k-means.

- Divisive - top-down approaches: all observations start in one cluster, which is iteratively split as one moves down the hierarchy. For estimating large numbers of clusters, this approach is both slow (due to all observations starting as one cluster, which it splits recursively) and statistically ill-posed.
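The bottom-up procedure can be sketched in a few lines of NumPy. This is a deliberately naive version for illustration: the `ward_agglomerate` helper is hypothetical (it is not a scikit-learn function), it uses Ward's merge cost as the linkage criterion, and it rescans every pair of clusters at each step, which real implementations avoid.

```python
import numpy as np

def ward_agglomerate(X, n_clusters):
    """Naive bottom-up clustering with Ward's merge cost (illustrative only)."""
    clusters = [[i] for i in range(len(X))]  # every observation starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pa, pb = X[clusters[a]], X[clusters[b]]
                diff = pa.mean(axis=0) - pb.mean(axis=0)
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(pa) * len(pb) / (len(pa) + len(pb)) * diff.dot(diff)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]  # b > a, so index a is unaffected
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

rng = np.random.default_rng(0)
# Two well-separated hypothetical blobs of 10 points each.
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
labels = ward_agglomerate(X, n_clusters=2)
print(labels)
```

Stopping the loop at a different `n_clusters` simply cuts the same merge hierarchy at a different level.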


In [42]:
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
import time
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))
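# Sparse adjacency matrix of the pixel grid: it records which pixels are
# neighbours; the entries are plain ones rather than learned weights
connectivity = grid_to_graph(*lena.shape)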

# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters,
        linkage='ward', connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)

print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)

Compute structured hierarchical clustering...
('Elapsed time: ', 6.3459999561309814)
('Number of pixels: ', 65536)
('Number of clusters: ', 15)


In [45]:
plt.imshow(label)
plt.show()
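The same connectivity-constrained Ward clustering can be checked on a tiny synthetic image, where the expected segmentation is obvious. This is a sketch under assumptions: the 8x8 two-tone image and the noise level are made up for the example, and it presumes a scikit-learn version that provides `AgglomerativeClustering` and `grid_to_graph` as imported in the cell above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

rng = np.random.default_rng(0)
# Hypothetical 8x8 test image: dark left half, bright right half, slight noise.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
image += rng.normal(scale=0.01, size=image.shape)

X = image.reshape(-1, 1)  # one row per pixel, one feature (gray level)
# Sparse (n_pixels, n_pixels) matrix linking each pixel to its grid neighbours.
connectivity = grid_to_graph(*image.shape)

ward = AgglomerativeClustering(n_clusters=2, linkage="ward",
                               connectivity=connectivity).fit(X)
label = ward.labels_.reshape(image.shape)
print(connectivity.shape)
print(label)
```

Because merges are only allowed between neighbouring pixels, each resulting region is spatially contiguous, which is what makes the approach suitable for image segmentation.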

###Feature agglomeration?

##2.Decompositions: from a signal to components and loadings
###Principal component analysis: PCA

In [47]:
# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]

from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X)
print(pca.explained_variance_)  

# As we can see, only the first two components are useful
pca.n_components = 2
X_reduced = pca.fit_transform(X)
X_reduced.shape

[  3.26567928e+00   9.02041397e-01   2.59205265e-31]


(100L, 2L)
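The near-zero third value in the output follows from `x3 = x1 + x2`: the data lie in a two-dimensional plane, so no variance is left for a third component. A minimal NumPy sketch of the same computation through the singular value decomposition of the centered data (the seed and the `Xc` and `X_restored` names are assumptions for the example; `decomposition.PCA` may differ in details such as the sign of the components):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.c_[x1, x2, x1 + x2]  # the third column is redundant

Xc = X - X.mean(axis=0)  # PCA works on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance of the data along each principal direction.
explained_variance = S ** 2 / (len(X) - 1)

# Project onto the first two components, then map back to 3-D.
X_reduced = Xc @ Vt[:2].T
X_restored = X_reduced @ Vt[:2]
print(explained_variance)
print(X_reduced.shape)
```

Here `X_restored` matches `Xc` up to floating-point error, which confirms that dropping the third component loses no information for this signal.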

###Independent Component Analysis: ICA?

In [1]:
import numpy as np  # this cell ran in a fresh session, so numpy must be imported again
np.c_[[8],[33]].T