####PCA for Reduced Dimensionality in Clustering [Dataset: segmentation_data.zip]
For this problem you will use an image segmentation data set for clustering. You will experiment with PCA as a way to reduce dimensionality and noise in the data, and compare the results of clustering the data with and without PCA, using the provided image class assignments as the ground truth. The data set is divided into three files. The file "segmentation_data.txt" contains data about images, with each line corresponding to one image. Each image is represented by 19 features (these are the columns in the data and correspond to the feature names in the file "segmentation_names.txt"). The file "segmentation_classes.txt" contains the class labels (the type of image) and a numeric class label for each of the corresponding images in the data file. After clustering the image data, you will use the class labels to measure completeness and homogeneity of the generated clusters. The data set used in this problem is based on the Image Segmentation data set at the UCI Machine Learning Repository.

####1a) Load in the image data matrix (with rows as images and columns as features). Also load in the numeric class labels from the segmentation class file. Using your favorite method (e.g., sklearn's min-max scaler), perform min-max normalization on the data matrix so that each feature is scaled to the [0,1] range.

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing

In [24]:
# Load the feature matrix, the class labels (tab-separated: class name, numeric label), and the feature names.
seData=pd.read_csv('./segmentation_data/segmentation_data.txt',header=None)
seClass=pd.read_csv('./segmentation_data/segmentation_classes.txt',sep='\t',header=None)
seName=pd.read_csv('./segmentation_data/segmentation_names.txt',header=None)

In [25]:
print(seData.shape)
seData.head(5)

(2100, 19)


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 110.0 | 189.0 | 9 | 0.0 | 0.0 | 1.0 | 0.666667 | 1.222222 | 1.186342 | 12.925926 | 10.888889 | 9.222222 | 18.666668 | -6.111111 | -11.111111 | 17.222221 | 18.666668 | 0.508139 | 1.910864 |
| 1 | 86.0 | 187.0 | 9 | 0.0 | 0.0 | 1.111111 | 0.720082 | 1.444444 | 0.750309 | 13.740741 | 11.666667 | 10.333334 | 19.222221 | -6.222222 | -10.222222 | 16.444445 | 19.222221 | 0.463329 | 1.941465 |
| 2 | 225.0 | 244.0 | 9 | 0.0 | 0.0 | 3.388889 | 2.195113 | 3.0 | 1.520234 | 12.259259 | 10.333334 | 9.333334 | 17.11111 | -5.777778 | -8.777778 | 14.555555 | 17.11111 | 0.480149 | 1.987902 |
| 3 | 47.0 | 232.0 | 9 | 0.0 | 0.0 | 1.277778 | 1.254621 | 1.0 | 0.894427 | 12.703704 | 11.0 | 9.0 | 18.11111 | -5.111111 | -11.111111 | 16.222221 | 18.11111 | 0.500966 | 1.875362 |
| 4 | 97.0 | 186.0 | 9 | 0.0 | 0.0 | 1.166667 | 0.691215 | 1.166667 | 1.00554 | 15.592592 | 13.888889 | 11.777778 | 21.11111 | -5.111111 | -11.444445 | 16.555555 | 21.11111 | 0.442661 | 1.863654 |


In [26]:
print(seClass.shape)
seClass.head(5)

(2100, 2)


| | 0 | 1 |
|---|---|---|
| 0 | GRASS | 0 |
| 1 | GRASS | 0 |
| 2 | GRASS | 0 |
| 3 | GRASS | 0 |
| 4 | GRASS | 0 |


In [27]:
print(seName.shape)
seName.head(5)

(19, 1)


| | 0 |
|---|---|
| 0 | REGION-CENTROID-COL |
| 1 | REGION-CENTROID-ROW |
| 2 | REGION-PIXEL-COUNT |
| 3 | SHORT-LINE-DENSITY-5 |
| 4 | SHORT-LINE-DENSITY-2 |


In [35]:
np.set_printoptions(precision=2, linewidth=120, suppress=True)

# Use MinMaxScaler to normalize each feature to the [0,1] range.
min_max_scaler=preprocessing.MinMaxScaler().fit(seData)
scData_norm=min_max_scaler.transform(seData)
print(scData_norm.shape)
scData_norm[0:5]

(2100, 19)


array([[ 0.43,  0.74,  0.  ,  0.  ,  0.  ,  0.03,  0.  ,  0.03,  0.  ,  0.09,  0.08,  0.06,  0.13,  0.73,  0.01,  0.87,
         0.12,  0.51,  0.83],
       [ 0.34,  0.73,  0.  ,  0.  ,  0.  ,  0.04,  0.  ,  0.03,  0.  ,  0.1 ,  0.09,  0.07,  0.13,  0.73,  0.02,  0.86,
         0.13,  0.46,  0.84],
       [ 0.89,  0.97,  0.  ,  0.  ,  0.  ,  0.12,  0.  ,  0.07,  0.  ,  0.09,  0.08,  0.06,  0.12,  0.74,  0.04,  0.83,
         0.11,  0.48,  0.84],
       [ 0.18,  0.92,  0.  ,  0.  ,  0.  ,  0.04,  0.  ,  0.02,  0.  ,  0.09,  0.08,  0.06,  0.13,  0.75,  0.01,  0.86,
         0.12,  0.5 ,  0.83],
       [ 0.38,  0.73,  0.  ,  0.  ,  0.  ,  0.04,  0.  ,  0.03,  0.  ,  0.11,  0.1 ,  0.08,  0.15,  0.75,  0.01,  0.86,
         0.14,  0.44,  0.82]])
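
As a cross-check on what the scaler is doing, the following is a minimal sketch of min-max normalization written directly in NumPy and compared against `MinMaxScaler`. The small matrix `X` is made up for illustration (it is not the segmentation data), and the names `X_manual` and `safe_range` are hypothetical. It also shows why a constant column such as REGION-PIXEL-COUNT ends up as all zeros.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up matrix standing in for the image data (rows: samples, columns: features).
# The last column is constant, like REGION-PIXEL-COUNT in the rows shown above.
X = np.array([[110.0, 189.0, 9.0],
              [ 86.0, 187.0, 9.0],
              [225.0, 244.0, 9.0],
              [ 47.0, 232.0, 9.0]])

col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min

# A constant column has zero range; dividing by 1 instead maps it to 0.
safe_range = np.where(col_range == 0, 1.0, col_range)
X_manual = (X - col_min) / safe_range

# The scikit-learn scaler should give the same result.
X_sklearn = MinMaxScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```

Each feature is shifted by its column minimum and divided by its column range, so the smallest value in a column becomes 0 and the largest becomes 1.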

####1b) Next, perform K-means clustering (for this problem, use the KMeans implementation in scikit-learn) on the image data (since there are a total of 7 pre-assigned image classes, you should use K = 7 in your clustering). Use Euclidean distance as your distance measure for the clustering. Print the cluster centroids (use some formatting so that they are visually understandable). Compare your 7 clusters to the 7 pre-assigned classes by computing the Completeness and Homogeneity values of the generated clusters.

In [36]:
from sklearn.cluster import KMeans

In [38]:
kmeans=KMeans(n_clusters=7,max_iter=500,verbose=1) # configure K-means with 7 clusters; verbose=1 prints the inertia at each iteration

In [39]:
kmeans.fit(scData_norm)

Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 395.5731939
start iteration
done sorting
end inner loop
Iteration 1, inertia 373.054526929
start iteration
done sorting
end inner loop
Iteration 2, inertia 370.703204752
start iteration
done sorting
end inner loop
Iteration 3, inertia 369.903944657
start iteration
done sorting
end inner loop
Iteration 4, inertia 369.710605028
start iteration
done sorting
end inner loop
Iteration 5, inertia 369.654450949
start iteration
done sorting
end inner loop
Iteration 6, inertia 369.648618323
start iteration
done sorting
end inner loop
Iteration 7, inertia 369.646726618
start iteration
done sorting
end inner loop
Iteration 8, inertia 369.644171171
start iteration
done sorting
end inner loop
Iteration 9, inertia 369.644171171
center shift 0.000000e+00 within tolerance 4.150157e-06
Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 364.000927367
start iteration
done sorting


end inner loop
Iteration 5, inertia 374.696382823
start iteration
done sorting
end inner loop
Iteration 6, inertia 374.606777006
start iteration
done sorting
end inner loop
Iteration 7, inertia 374.500847753
start iteration
done sorting
end inner loop
Iteration 8, inertia 374.489939998
start iteration
done sorting
end inner loop
Iteration 9, inertia 374.489939998
center shift 0.000000e+00 within tolerance 4.150157e-06
Initialization complete
start iteration
done sorting
end inner loop
Iteration 0, inertia 428.485232608
start iteration
done sorting
end inner loop
Iteration 1, inertia 407.002913005
start iteration
done sorting
end inner loop
Iteration 2, inertia 401.16341963
start iteration
done sorting
end inner loop
Iteration 3, inertia 396.322813784
start iteration
done sorting
end inner loop
Iteration 4, inertia 394.819516438
start iteration
done sorting
end inner loop
Iteration 5, inertia 392.42242501
start iteration
done sorting
end inner loop
Iteration 6, inertia 387.815528471
...

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)

In [40]:
clusters=kmeans.predict(scData_norm)

In [46]:
clusters.shape

(2100,)

In [71]:
# Stack the feature names (first row) on top of the 7 centroid rows.
a=seName.T
c=kmeans.cluster_centers_
ac=np.concatenate([a,c])

In [89]:
c

array([[ 0.51,  0.81,  0.  ,  0.08,  0.01,  0.05,  0.  ,  0.05,  0.  ,  0.11,  0.09,  0.09,  0.14,  0.68,  0.08,  0.82,
         0.13,  0.41,  0.89],
       [ 0.25,  0.39,  0.  ,  0.08,  0.02,  0.08,  0.  ,  0.06,  0.01,  0.15,  0.14,  0.18,  0.12,  0.72,  0.34,  0.35,
         0.18,  0.41,  0.2 ],
       [ 0.54,  0.15,  0.  ,  0.03,  0.  ,  0.03,  0.  ,  0.03,  0.  ,  0.82,  0.78,  0.89,  0.79,  0.27,  0.67,  0.29,
         0.89,  0.21,  0.13],
       [ 0.75,  0.53,  0.  ,  0.04,  0.04,  0.11,  0.02,  0.11,  0.02,  0.3 ,  0.28,  0.35,  0.26,  0.59,  0.45,  0.31,
         0.35,  0.3 ,  0.16],
       [ 0.25,  0.46,  0.  ,  0.03,  0.01,  0.04,  0.  ,  0.03,  0.  ,  0.03,  0.02,  0.04,  0.02,  0.77,  0.22,  0.51,
         0.04,  0.8 ,  0.18],
       [ 0.3 ,  0.53,  0.  ,  0.05,  0.05,  0.1 ,  0.01,  0.08,  0.01,  0.4 ,  0.37,  0.47,  0.35,  0.5 ,  0.57,  0.21,
         0.47,  0.3 ,  0.16],
       [ 0.77,  0.43,  0.  ,  0.01,  0.02,  0.04,  0.  ,  0.02,  0.  ,  0.04,  0.03,  0.06,  0.03,  

In [88]:
# Show the centroids of the 7 clusters (rows: features, columns: clusters)
pd.DataFrame(ac[1:].T, index=ac[0])

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| REGION-CENTROID-COL | 0.513994 | 0.251678 | 0.535099 | 0.748274 | 0.253603 | 0.302506 | 0.769063 |
| REGION-CENTROID-ROW | 0.808937 | 0.392749 | 0.150167 | 0.532041 | 0.459865 | 0.530862 | 0.42593 |
| REGION-PIXEL-COUNT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SHORT-LINE-DENSITY-5 | 0.0774411 | 0.0756219 | 0.0277778 | 0.0391566 | 0.0263459 | 0.0522599 | 0.0140237 |
| SHORT-LINE-DENSITY-2 | 0.00505051 | 0.019403 | 0.00166667 | 0.0376506 | 0.0137457 | 0.0466102 | 0.0226537 |
| VEDGE-MEAN | 0.0544738 | 0.0776573 | 0.0302281 | 0.11353 | 0.0373368 | 0.100817 | 0.0397025 |
| VEDGE-SD | 0.00140719 | 0.00414943 | 0.000542888 | 0.0189224 | 0.0023699 | 0.00942022 | 0.00298261 |
| HEDGE-MEAN | 0.046335 | 0.0612404 | 0.026766 | 0.107311 | 0.0279012 | 0.083972 | 0.023116 |
| HEDGE-SD | 0.00140097 | 0.00503684 | 0.000586662 | 0.017627 | 0.00202174 | 0.0110433 | 0.00209423 |
| INTENSITY-MEAN | 0.10879 | 0.147428 | 0.823246 | 0.298573 | 0.0259422 | 0.400608 | 0.040385 |
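
Part 1b also asks for the Completeness and Homogeneity of the clusters. One way to compute them is with `completeness_score` and `homogeneity_score` from `sklearn.metrics`. The sketch below uses small made-up label arrays so that it runs on its own; in this notebook the inputs would presumably be the numeric labels in `seClass[1]` and the `clusters` array from `kmeans.predict`, which is an assumption about how the cells above fit together.

```python
import numpy as np
from sklearn.metrics import completeness_score, homogeneity_score

# Made-up labels for illustration only.
# Assumption: in the notebook these would be seClass[1] and clusters.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_labels = np.array([1, 1, 1, 0, 0, 2, 2, 2, 2])

# Homogeneity: does each cluster contain members of a single class?
homogeneity = homogeneity_score(true_labels, cluster_labels)
# Completeness: are all members of a class assigned to the same cluster?
completeness = completeness_score(true_labels, cluster_labels)

print("Homogeneity:  %.3f" % homogeneity)
print("Completeness: %.3f" % completeness)
```

Both scores lie in [0, 1] and do not depend on how the cluster IDs are numbered, so cluster 3 does not need to correspond to class 3 for the comparison to be meaningful.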


####1c) Perform PCA on the normalized image data matrix. You may use the linear algebra package in NumPy or the decomposition module in scikit-learn (the latter is much more efficient). Analyze the principal components to determine the number, r, of PCs needed to capture at least 95% of the variance in the data. Then use these r components as features to transform the data into a reduced dimension space. [See the PCA Clustering Notebook from class for an example of how these steps are performed.]
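
The steps in this part can be sketched with scikit-learn's `PCA` as follows. This is a minimal illustration on synthetic data, not the notebook's own solution: the matrix `X` is randomly generated, and the names `cum_var`, `r`, and `X_reduced` are hypothetical. In this notebook the input would presumably be `scData_norm`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic stand-in for the normalized data:
# 200 samples, 10 features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# r is the smallest number of components whose cumulative ratio reaches 95%.
r = int(np.searchsorted(cum_var, 0.95) + 1)

# Project the data onto the first r principal components.
X_reduced = PCA(n_components=r).fit_transform(X)
print(r, X_reduced.shape)
```

`PCA` centers the data before decomposing it, so no separate mean subtraction is needed here. The reduced matrix can then be passed to `KMeans` in place of the full feature matrix.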