####PCA for Reduced Dimensionality in Clustering [Dataset: segmentation_data.zip]
For this problem you will use an image segmentation data set for clustering. You will experiment with using PCA as an approach to reduce dimensionality and noise in the data. You will compare the results of clustering the data with and without PCA using the provided image class assignments as the ground truth. The data set is divided into three files. The file "segmentation_data.txt" contains data about images with each line corresponding to one image. Each image is represented by 19 features (these are the columns in the data and correspond to the feature names in the file "segmentation_names.txt"). The file "segmentation_classes.txt" contains the class labels (the type of image) and a numeric class label for each of the corresponding images in the data file. After clustering the image data, you will use the class labels to measure completeness and homogeneity of the generated clusters. The data set used in this problem is based on the Image Segmentation data set at the UCI Machine Learning Repository.

####1a) Load in the image data matrix (with rows as images and columns as features). Also load in the numeric class labels from the segmentation class file. Using your favorite method (e.g., sklearn's min-max scaler), perform min-max normalization on the data matrix so that each feature is scaled to [0,1] range.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing

In [6]:
seData=pd.read_csv('./segmentation_data/segmentation_data.txt',header=None)
seClass=pd.read_csv('./segmentation_data/segmentation_classes.txt',sep='\t',header=None)
seName=pd.read_csv('./segmentation_data/segmentation_names.txt',header=None)

In [7]:
print(seData.shape)
seData.head(5)

(2100, 19)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,110.0,189.0,9,0.0,0.0,1.0,0.666667,1.222222,1.186342,12.925926,10.888889,9.222222,18.666668,-6.111111,-11.111111,17.222221,18.666668,0.508139,1.910864
1,86.0,187.0,9,0.0,0.0,1.111111,0.720082,1.444444,0.750309,13.740741,11.666667,10.333334,19.222221,-6.222222,-10.222222,16.444445,19.222221,0.463329,1.941465
2,225.0,244.0,9,0.0,0.0,3.388889,2.195113,3.0,1.520234,12.259259,10.333334,9.333334,17.11111,-5.777778,-8.777778,14.555555,17.11111,0.480149,1.987902
3,47.0,232.0,9,0.0,0.0,1.277778,1.254621,1.0,0.894427,12.703704,11.0,9.0,18.11111,-5.111111,-11.111111,16.222221,18.11111,0.500966,1.875362
4,97.0,186.0,9,0.0,0.0,1.166667,0.691215,1.166667,1.00554,15.592592,13.888889,11.777778,21.11111,-5.111111,-11.444445,16.555555,21.11111,0.442661,1.863654


In [8]:
print(seClass.shape)
seClass.head(5)

(2100, 2)


Unnamed: 0,0,1
0,GRASS,0
1,GRASS,0
2,GRASS,0
3,GRASS,0
4,GRASS,0


In [9]:
print(seName.shape)
seName.head(5)

(19, 1)


Unnamed: 0,0
0,REGION-CENTROID-COL
1,REGION-CENTROID-ROW
2,REGION-PIXEL-COUNT
3,SHORT-LINE-DENSITY-5
4,SHORT-LINE-DENSITY-2


In [10]:
np.set_printoptions(precision=2, linewidth=120, suppress=True)

#Use MinMaxScaler to rescale each feature to the [0,1] range.
min_max_scaler=preprocessing.MinMaxScaler().fit(seData)
scData_norm=min_max_scaler.transform(seData)
print(scData_norm.shape)
scData_norm[0:5]

(2100, 19)


array([[ 0.43,  0.74,  0.  ,  0.  ,  0.  ,  0.03,  0.  ,  0.03,  0.  ,  0.09,  0.08,  0.06,  0.13,  0.73,  0.01,  0.87,
         0.12,  0.51,  0.83],
       [ 0.34,  0.73,  0.  ,  0.  ,  0.  ,  0.04,  0.  ,  0.03,  0.  ,  0.1 ,  0.09,  0.07,  0.13,  0.73,  0.02,  0.86,
         0.13,  0.46,  0.84],
       [ 0.89,  0.97,  0.  ,  0.  ,  0.  ,  0.12,  0.  ,  0.07,  0.  ,  0.09,  0.08,  0.06,  0.12,  0.74,  0.04,  0.83,
         0.11,  0.48,  0.84],
       [ 0.18,  0.92,  0.  ,  0.  ,  0.  ,  0.04,  0.  ,  0.02,  0.  ,  0.09,  0.08,  0.06,  0.13,  0.75,  0.01,  0.86,
         0.12,  0.5 ,  0.83],
       [ 0.38,  0.73,  0.  ,  0.  ,  0.  ,  0.04,  0.  ,  0.03,  0.  ,  0.11,  0.1 ,  0.08,  0.15,  0.75,  0.01,  0.86,
         0.14,  0.44,  0.82]])
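
To make the scaling step concrete, here is a minimal sketch of min-max normalization done by hand and checked against `MinMaxScaler`. The small matrix `X` is a hypothetical stand-in built from the first three columns of the first four rows shown above; the helper names (`col_min`, `safe_range`, `X_manual`) are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the image feature matrix (rows = images, columns = features).
X = np.array([[110.0, 189.0, 9.0],
              [ 86.0, 187.0, 9.0],
              [225.0, 244.0, 9.0],
              [ 47.0, 232.0, 9.0]])

# Min-max normalization: (x - min) / (max - min), computed per column.
col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min

# A constant column has zero range; dividing by 1 instead maps it to 0,
# which mirrors how MinMaxScaler treats such columns.
safe_range = np.where(col_range == 0, 1.0, col_range)
X_manual = (X - col_min) / safe_range

X_sklearn = MinMaxScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
print(X_manual)
```

The third column is constant in this toy matrix, so it scales to 0 in every row; a constant feature would show up the same way in the normalized output.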

####1b) Next, perform Kmeans clustering (for this problem, use the Kmeans implementation in scikit-learn) on the image data (since there are a total of 7 pre-assigned image classes, you should use K = 7 in your clustering). Use Euclidean distance as your distance measure for the clustering. Print the cluster centroids (use some formatting so that they are visually understandable). Compare your 7 clusters to the 7 pre-assigned classes by computing the Completeness and Homogeneity values of the generated clusters.

In [11]:
from sklearn.cluster import KMeans

In [12]:
kmeans=KMeans(n_clusters=7,max_iter=500,verbose=1) #initialize clusters parameters
#Fit the K-means model to the normalized data
kmeans.fit(scData_norm)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 422.148596285
start iteration
done sorting
end inner loop
Iteration 1, inertia 396.994995473
start iteration
done sorting
end inner loop
Iteration 2, inertia 391.819994747
start iteration
done sorting
end inner loop
Iteration 3, inertia 387.729886358
start iteration
done sorting
end inner loop
Iteration 4, inertia 382.326221358
start iteration
done sorting
end inner loop
Iteration 5, inertia 375.311176533
start iteration
done sorting
end inner loop
Iteration 6, inertia 371.580994732
start iteration
done sorting
end inner loop
Iteration 7, inertia 370.887452334
start iteration
done sorting
end inner loop
Iteration 8, inertia 370.759093143
start iteration
done sorting
end inner loop
Iteration 9, inertia 370.673169002
start iteration
done sorting
end inner loop
Iteration 10, inertia 370.639209634
start iteration
done sorting
end inner loop
Iteration 11, inertia 370.634115984
start iteration
done sorti

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 452.945079383
start iteration
done sorting
end inner loop
Iteration 1, inertia 422.464949846
start iteration
done sorting
end inner loop
Iteration 2, inertia 412.620256579
start iteration
done sorting
end inner loop
Iteration 3, inertia 403.467993513
start iteration
done sorting
end inner loop
Iteration 4, inertia 396.022443619
start iteration
done sorting
end inner loop
Iteration 5, inertia 390.129252289
start iteration
done sorting
end inner loop
Iteration 6, inertia 387.705284133
start iteration
done sorting
end inner loop
Iteration 7, inertia 386.221684217
start iteration
done sorting
end inner loop
Iteration 8, inertia 385.010260861
start iteration
done sorting
end inner loop
Iteration 9, inertia 382.911233497
start iteration
done sorting
end inner loop
Iteration 10, inertia 380.534258771
start iteration
done sorting
end inner loop
Iteration 11, inertia 378.59037556
start iteration
done sortin

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)
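
The estimator above uses `random_state=None` with `n_init=10`, so the centroid initialization, and therefore the cluster numbering and the scores, can change from run to run. A minimal sketch of a reproducible setup follows; the random matrix `X` and the seed value 42 are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 5)  # hypothetical stand-in for the normalized data matrix

# Fixing random_state makes the k-means++ initializations repeatable;
# verbose is left at its default (0) to avoid the long iteration log.
km_a = KMeans(n_clusters=7, n_init=10, max_iter=500, random_state=42).fit(X)
km_b = KMeans(n_clusters=7, n_init=10, max_iter=500, random_state=42).fit(X)

print(np.array_equal(km_a.labels_, km_b.labels_))
print(km_a.inertia_)
```

With a fixed seed, the two fits should produce the same labels, which makes later comparisons (for example, with and without PCA) easier to interpret.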

In [13]:
#Assign each data point to its nearest cluster centroid
clusters=kmeans.predict(scData_norm)
clusters

array([4, 4, 4, ..., 1, 1, 3], dtype=int32)

In [14]:
clusters.shape

(2100,)

In [15]:
a=seName.T
c=kmeans.cluster_centers_
ac=np.concatenate([a,c]) #Concatenate the clusters centers coordinates and attribute names

In [18]:
c

array([[ 0.3 ,  0.53,  0.  ,  0.05,  0.05,  0.1 ,  0.01,  0.08,  0.01,  0.4 ,  0.37,  0.47,  0.35,  0.5 ,  0.57,  0.21,
         0.47,  0.3 ,  0.16],
       [ 0.77,  0.43,  0.  ,  0.01,  0.02,  0.04,  0.  ,  0.02,  0.  ,  0.04,  0.04,  0.06,  0.03,  0.78,  0.22,  0.49,
         0.06,  0.54,  0.24],
       [ 0.54,  0.15,  0.  ,  0.03,  0.  ,  0.03,  0.  ,  0.03,  0.  ,  0.82,  0.78,  0.89,  0.79,  0.27,  0.67,  0.29,
         0.89,  0.21,  0.13],
       [ 0.26,  0.39,  0.  ,  0.07,  0.02,  0.08,  0.  ,  0.06,  0.  ,  0.15,  0.14,  0.19,  0.12,  0.72,  0.34,  0.36,
         0.19,  0.41,  0.2 ],
       [ 0.51,  0.81,  0.  ,  0.08,  0.01,  0.05,  0.  ,  0.05,  0.  ,  0.11,  0.09,  0.09,  0.14,  0.68,  0.08,  0.82,
         0.13,  0.41,  0.89],
       [ 0.75,  0.53,  0.  ,  0.04,  0.04,  0.11,  0.02,  0.11,  0.02,  0.3 ,  0.28,  0.35,  0.27,  0.59,  0.45,  0.31,
         0.35,  0.3 ,  0.16],
       [ 0.25,  0.46,  0.  ,  0.03,  0.01,  0.04,  0.  ,  0.03,  0.  ,  0.03,  0.02,  0.04,  0.02,  

In [19]:
#Show the Centroid of the 7 Clusters
pd.DataFrame(ac[1:].T, index=ac[0])

Unnamed: 0,0,1,2,3,4,5,6
REGION-CENTROID-COL,0.302506,0.770674,0.535099,0.256103,0.513994,0.750696,0.254169
REGION-CENTROID-ROW,0.530862,0.425215,0.150167,0.393468,0.808937,0.534564,0.459974
REGION-PIXEL-COUNT,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SHORT-LINE-DENSITY-5,0.0522599,0.0139785,0.0277778,0.0745098,0.0774411,0.04,0.0262557
SHORT-LINE-DENSITY-2,0.0466102,0.0225806,0.00166667,0.0191176,0.00505051,0.0384615,0.0136986
VEDGE-MEAN,0.100817,0.0402367,0.0302281,0.0773429,0.0544738,0.114419,0.0372741
VEDGE-SD,0.00942022,0.00298876,0.000542888,0.00410042,0.00140719,0.0193006,0.00236373
HEDGE-MEAN,0.083972,0.0231216,0.026766,0.0605736,0.046335,0.10924,0.0278737
HEDGE-SD,0.0110433,0.0020884,0.000586662,0.00496749,0.00140097,0.0179989,0.00201698
INTENSITY-MEAN,0.400608,0.0411384,0.823246,0.148187,0.10879,0.300955,0.0260125
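
Concatenating the feature names with the centroid coordinates produces a mixed string/float array, which is why the table above prints with uneven precision. One alternative, sketched below under the assumption that the names are available as a plain list, is to pass them as labels to a numeric `DataFrame`. The three feature names and two centroids are copied from the table above purely as sample values.

```python
import numpy as np
import pandas as pd

# Stand-ins: three of the feature names and the matching coordinates
# of clusters 0 and 1 from the centroid table.
feature_names = ["REGION-CENTROID-COL", "REGION-CENTROID-ROW", "INTENSITY-MEAN"]
centers = np.array([[0.302506, 0.530862, 0.400608],
                    [0.770674, 0.425215, 0.041138]])

# Rows = features, columns = clusters; values stay floating point,
# so they can be rounded or sorted like any numeric table.
centroid_table = pd.DataFrame(centers, columns=feature_names).T
centroid_table.columns = ["cluster %d" % i for i in range(centers.shape[0])]
print(centroid_table.round(2))
```

In the notebook itself, `centers` would correspond to `kmeans.cluster_centers_` and `feature_names` to the single column of `seName`.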


In [21]:
from sklearn.metrics import completeness_score, homogeneity_score

#Compute the completeness and homogeneity scores.
#Both functions take (labels_true, labels_pred), so the class labels come first.
print('Completeness Score is',completeness_score(np.array(seClass[1]),clusters))
print('Homogeneity Score is ',homogeneity_score(np.array(seClass[1]),clusters))

Completeness Score is 0.613187012485
Homogeneity Score is  0.611502116337


Neither the completeness score nor the homogeneity score is very high (both are about 0.61), but the two values are well balanced, which is a good sign for the analysis.
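
As a small illustration of what the two metrics measure, the sketch below scores a hypothetical labeling in which every cluster contains a single class (perfect homogeneity) but one class is split over two clusters (imperfect completeness). Both scikit-learn functions take their arguments in the order `(labels_true, labels_pred)`; the toy label lists are assumptions made up for this example.

```python
from sklearn.metrics import completeness_score, homogeneity_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 2, 2, 2]  # class 0 is split across clusters 0 and 1

# Homogeneity: does each cluster contain members of only one class?
h = homogeneity_score(labels_true, labels_pred)
# Completeness: are all members of a class assigned to the same cluster?
c = completeness_score(labels_true, labels_pred)
print(h, c)

# Swapping the two arguments swaps the two metrics.
print(completeness_score(labels_pred, labels_true))
```

Because the metrics mirror each other under an argument swap, passing the cluster labels first would report each score under the other's name.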

####1c) Perform PCA on the normalized image data matrix. You may use the linear algebra package in Numpy or the Decomposition module in scikit-learn (the latter is much more efficient). Analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of variance in the data. Then use these r components as features to transform the data into a reduced dimension space. [See the PCA Clustering Notebook from class for an example of how these steps are performed.]

In [22]:
#import the decomposition module from sklearn for PCA analysis
from sklearn import decomposition

In [23]:
#Set the PCA to the first 7 components
n_items=7
pca=decomposition.PCA(n_components=n_items)
scData_norm_TD=pca.fit(scData_norm).transform(scData_norm)

In [24]:
print(scData_norm_TD.shape)

(2100, 7)


In [26]:
np.set_printoptions(precision=2, linewidth=120, suppress=True)

#Let's check the explained variance captured by the first 7 components
print(pca.explained_variance_ratio_)
#sum(pca.explained_variance_ratio_)
print('''The first %d components capture %0.2f percent of variance in the data''' %(n_items,sum(pca.explained_variance_ratio_)*100))

[ 0.61  0.13  0.1   0.05  0.04  0.02  0.02]
The first 7 components capture 96.01 percent of variance in the data
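
The cell above fixes the number of components at 7 and then checks the captured variance. One way to derive r directly from the 95% threshold is to fit PCA with all components and take the cumulative sum of the explained variance ratios. This is a sketch on synthetic data; the matrix `X`, its column scales, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 300 samples, 10 features with decreasing spread.
X = rng.randn(300, 10) * np.array([5, 3, 2, 1, 1, 0.5, 0.5, 0.2, 0.1, 0.1])

# Fit with all components, then accumulate the variance ratios.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# argmax on a boolean array returns the first True position,
# i.e. the smallest index where the cumulative ratio reaches 95%.
r = int(np.argmax(cumulative >= 0.95)) + 1
print(r, cumulative[r - 1])

# Project the data onto the first r principal components.
X_reduced = PCA(n_components=r).fit_transform(X)
print(X_reduced.shape)
```

Applied to the notebook's data, the same steps would use `scData_norm` in place of `X`.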


In [27]:
#Display the transformed 7 components from the original dataset
scData_norm_TD[0:5]

array([[-0.69,  0.53,  0.25, -0.2 , -0.08,  0.05, -0.05],
       [-0.67,  0.51,  0.34, -0.17, -0.04,  0.06, -0.04],
       [-0.71,  0.77, -0.16, -0.01, -0.17,  0.04, -0.06],
       [-0.73,  0.51,  0.5 , -0.06, -0.14,  0.03, -0.1 ],
       [-0.64,  0.53,  0.3 , -0.18, -0.02,  0.05, -0.06]])

####1d) Perform Kmeans again, but this time on the lower dimensional transformed data. Then, compute the Completeness and Homogeneity values of the new clusters.

In [29]:
kmeans.fit(scData_norm_TD)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 344.791210072
start iteration
done sorting
end inner loop
Iteration 1, inertia 333.661757566
start iteration
done sorting
end inner loop
Iteration 2, inertia 331.015720231
start iteration
done sorting
end inner loop
Iteration 3, inertia 329.811354999
start iteration
done sorting
end inner loop
Iteration 4, inertia 328.879968132
start iteration
done sorting
end inner loop
Iteration 5, inertia 328.266231846
start iteration
done sorting
end inner loop
Iteration 6, inertia 328.042568558
start iteration
done sorting
end inner loop
Iteration 7, inertia 327.978412564
start iteration
done sorting
end inner loop
Iteration 8, inertia 327.919118519
start iteration
done sorting
end inner loop
Iteration 9, inertia 327.883068487
start iteration
done sorting
end inner loop
Iteration 10, inertia 327.814552919
start iteration
done sorting
end inner loop
Iteration 11, inertia 327.69357173
start iteration
done sortin

Iteration 5, inertia 317.958871815
start iteration
done sorting
end inner loop
Iteration 6, inertia 316.54017609
start iteration
done sorting
end inner loop
Iteration 7, inertia 315.703592176
start iteration
done sorting
end inner loop
Iteration 8, inertia 314.545078871
start iteration
done sorting
end inner loop
Iteration 9, inertia 313.417268114
start iteration
done sorting
end inner loop
Iteration 10, inertia 311.736003372
start iteration
done sorting
end inner loop
Iteration 11, inertia 310.448033676
start iteration
done sorting
end inner loop
Iteration 12, inertia 309.426663269
start iteration
done sorting
end inner loop
Iteration 13, inertia 308.52746527
start iteration
done sorting
end inner loop
Iteration 14, inertia 307.93568216
start iteration
done sorting
end inner loop
Iteration 15, inertia 307.610009473
start iteration
done sorting
end inner loop
Iteration 16, inertia 307.52265344
start iteration
done sorting
end inner loop
Iteration 17, inertia 307.460642208
start iterati

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)

In [30]:
clusters=kmeans.predict(scData_norm_TD)
clusters

array([3, 3, 3, ..., 0, 0, 4], dtype=int32)

In [31]:
#Compute the completeness and homogeneity scores.
#Both functions take (labels_true, labels_pred), so the class labels come first.
print('Completeness Score is',completeness_score(np.array(seClass[1]),clusters))
print('Homogeneity Score is ',homogeneity_score(np.array(seClass[1]),clusters))

Completeness Score is 0.610795506369
Homogeneity Score is  0.609136404973


####1e) Discuss your observations based on the comparison of the two clustering results.

The completeness and homogeneity scores of the clusters built on the PCA-reduced data (both about 0.61) are very similar to those obtained on the original 19-feature data. In this case, we could use the reduced 7-component representation for future data sets with the same structure, since clustering on fewer features saves computation and resources with almost no loss in cluster quality.
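
The comparison can also be packaged as one small experiment with a fixed seed, so that differences between the two runs reflect the PCA step rather than random initialization. The sketch below uses synthetic blobs from `make_blobs` as a hypothetical stand-in for the segmentation data, and the helper `score` is an illustrative name; its numbers say nothing about the real data set.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, homogeneity_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 7 classes, 19 features, scaled to [0, 1].
X, y = make_blobs(n_samples=700, n_features=19, centers=7, random_state=0)
X = MinMaxScaler().fit_transform(X)

def score(features, labels_true):
    """Cluster with K = 7 and return (completeness, homogeneity)."""
    km = KMeans(n_clusters=7, n_init=10, random_state=0)
    pred = km.fit_predict(features)
    return (completeness_score(labels_true, pred),
            homogeneity_score(labels_true, pred))

full_scores = score(X, y)

# A float n_components keeps the fewest components reaching that variance share.
X_pca = PCA(n_components=0.95, svd_solver="full").fit_transform(X)
pca_scores = score(X_pca, y)

print("all features:", full_scores)
print("%d components:" % X_pca.shape[1], pca_scores)
```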

In [4]:
import apriori2 as ap