# K-Means Clustering with scikit-learn

We use the scikit-learn implementation of k-means; see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit) for details.

In [1]:
from sklearn.cluster import KMeans

When using k-means from scikit-learn, we recommend storing your data as a NumPy array. You can create the array directly or convert existing data into one as follows.

In [2]:
import numpy as np

# Create a NumPy array directly
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Convert a list of points to a NumPy array
a = []
for i in range(10):
    p = [i, 2 * i]
    a.append(p)

Y = np.array(a, dtype='float32')
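The conversion above can also be written more compactly. The following is a minimal sketch, not part of the original exercise; the name `Y_alt` is introduced here only for illustration.

```python
import numpy as np

# Same points as the loop above: (i, 2*i) for i = 0..9.
a = [[i, 2 * i] for i in range(10)]
Y = np.array(a, dtype="float32")

# Equivalent construction without an intermediate Python list
# (Y_alt is a hypothetical name used only in this sketch).
i = np.arange(10, dtype="float32")
Y_alt = np.column_stack((i, 2 * i))

print(Y.shape, Y.dtype)
```

Both arrays have one row per point and one column per coordinate, which is the layout `KMeans.fit` expects.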


The following code executes the k-means algorithm on the points in X. Make sure you understand the parameters; see the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit).

In [5]:
kmeans = KMeans(init='random', n_clusters=2, max_iter=10000, n_init=100).fit(X)
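As a reading aid for the call above, here is a sketch of the same fit with each parameter commented. The `random_state` argument and the name `model` are assumptions added only to make the sketch reproducible; they are not part of the original cell.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

model = KMeans(
    n_clusters=2,     # number of clusters (and centroids) to form
    init="random",    # initial centroids are data points chosen at random
    n_init=100,       # independent restarts; the run with the lowest inertia is kept
    max_iter=10000,   # upper bound on iterations within a single run
    random_state=0,   # assumption: fixed seed, added only for reproducibility
).fit(X)

print(model.n_iter_)   # iterations used by the best run
print(model.inertia_)  # sum of squared distances of points to their closest centroid
```

On a data set this small each run converges in a handful of iterations, so `max_iter=10000` acts only as a safety bound.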

The following code shows the cluster (0 or 1) assigned to each data point.

In [6]:
kmeans.labels_

array([1, 1, 1, 0, 0, 0], dtype=int32)

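The following code computes the clusters for the points [0,0] and [4,4]. In this case, [0,0] is placed in the cluster labeled 1 and [4,4] in the cluster labeled 0. Which cluster receives which label depends on the random initialization, so the numbers may be swapped in another run.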

In [7]:
kmeans.predict([[0, 0], [4, 4]])

array([1, 0], dtype=int32)
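To make explicit what `predict` does, the sketch below recomputes the assignment by hand: each new point goes to the centroid with the smallest Euclidean distance. The names `new_points`, `manual_labels` and `predicted_labels`, and the fixed `random_state`, are assumptions introduced for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
kmeans = KMeans(init="random", n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[0, 0], [4, 4]], dtype=float)

# Squared Euclidean distance from every new point to every centroid,
# shape (n_points, n_clusters), computed by broadcasting.
diffs = new_points[:, None, :] - kmeans.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# The nearest centroid is the one with the smallest squared distance.
manual_labels = sq_dists.argmin(axis=1)
predicted_labels = kmeans.predict(new_points)

print(manual_labels, predicted_labels)
```

The two arrays should agree, since `predict` assigns each sample to its closest cluster center.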

The following code shows the centroids (called cluster centers in scikit-learn) of the two clusters.

In [8]:
kmeans.cluster_centers_

array([[4., 2.],
       [1., 2.]])

# Question 1

In [5]:
import pandas 

data_frame = pandas.read_csv('./data.csv')

data = np.array(data_frame.iloc[:,1:26])

kmeans2 = KMeans(init='random',n_clusters = 8,max_iter=10000,n_init=100).fit(data)

data_frame

Unnamed: 0,StockName,1/28/2011,4/29/2011,5/20/2011,4/1/2011,5/27/2011,6/17/2011,4/15/2011,2/18/2011,3/18/2011,...,1/14/2011,4/8/2011,4/21/2011,3/4/2011,3/25/2011,2/4/2011,1/7/2011,2/25/2011,5/13/2011,1/21/2011
0,American Express,-4.7557,4.00509,3.58155,-0.395257,0.768624,1.12594,-0.237274,-1.91728,0.706794,...,4.63801,1.46898,2.74809,-0.022868,1.87709,-0.70247,2.44804,-3.13752,-1.13863,-0.065175
1,Boeing,-3.2019,5.65488,-1.44928,0.693878,0.574788,1.50561,-1.42566,0.467675,-2.90853,...,0.93633,0.122649,3.74037,-0.92452,4.33917,3.06093,4.88284,-0.069109,-0.353045,1.15721
2,Chevron,-0.55384,1.92791,0.529256,1.80451,2.05676,-0.869652,-3.18936,3.37173,3.67084,...,2.06707,1.0505,3.03001,1.43723,2.81148,3.47363,-0.512765,2.89227,-0.83293,0.903809
3,Cisco Systems,0.431862,3.48494,-1.72414,-1.84332,0.304692,-1.12285,-3.83964,0.053079,-3.76193,...,1.2894,3.76249,0.35545,-1.18153,-0.346021,5.35117,2.54279,-0.480513,-3.70793,-2.35627
4,DuPont,3.81916,1.97522,0.0,1.99593,1.56522,-0.919448,-1.00992,2.8288,-0.357277,...,3.10559,-0.18018,3.00295,-0.645518,0.557621,4.74576,-0.579421,-1.60146,-3.69494,-2.38239
5,Kraft,-2.73973,0.418535,1.1194,0.829346,-0.629111,1.65094,5.13709,0.782524,-1.65027,...,1.39114,-0.126143,2.01711,-0.754243,1.55945,2.16181,-1.79471,3.22266,3.19432,-0.191022
6,Caterpillar,3.20354,5.64811,-1.45461,3.26821,3.25765,-1.01104,-2.55408,2.22093,2.40764,...,0.858277,-3.45495,3.63705,0.311526,2.04864,3.59929,-0.688705,-2.72745,-4.0343,-1.49745
7,Bank of America,-4.5614,-0.324675,-2.60723,-0.372578,1.91805,-1.92837,-5.03704,-0.13541,-1.54278,...,7.62174,0.597015,-2.22399,-1.05116,-6.05634,4.23049,2.88809,-1.25174,-2.85016,-5.50398
8,Verizon,2.15023,1.9153,-0.295223,2.75107,-0.217687,0.794777,0.264901,0.853759,0.279799,...,-2.98222,-2.10226,-1.83511,0.055463,1.71849,1.87991,-0.36051,-0.745033,-0.053648,-1.21538
9,Microsoft,-0.963597,1.40845,-1.88301,-0.701481,2.27179,1.97562,-3.13097,-0.514706,-2.70694,...,0.35461,2.43615,1.67331,-2.77257,1.74742,0.0,1.96078,-0.85885,-2.9845,-0.497159


In [10]:
labels = kmeans2.labels_
labels

array([1, 1, 3, 5, 4, 0, 4, 6, 3, 1, 7, 7, 1, 0, 4, 1, 2, 7, 7, 7, 3, 3,
       1, 1, 7, 6, 7, 7, 1, 0], dtype=int32)

In [11]:
centers = kmeans2.cluster_centers_
centers

array([[ -3.13183   ,   2.18530833,   0.36177667,   1.11041533,
          0.29878833,   0.5558588 ,   3.44432333,   0.23432033,
         -1.305863  ,  -0.89022733,  -1.30307233,  -0.24790133,
         -0.499212  ,   0.1628367 ,  -1.71756333,   1.187734  ,
         -0.20922097,   2.459097  ,  -0.03160767,   0.65219833,
          0.71405667,  -0.5572588 ,   0.03741   ,   2.82863333,
          0.309458  ],
       [ -1.15227563,   3.70385875,  -0.13408487,  -0.13648712,
          0.70085739,   0.428377  ,  -1.442583  ,   0.69754985,
         -1.98951638,  -1.66958151,  -0.84906491,   2.25623166,
         -1.426021  ,  -1.11346488,  -3.973015  ,   1.98446562,
          0.52460175,   4.09113125,  -0.60342695,   2.6314125 ,
          1.59164763,   1.74438625,  -0.97728331,  -1.14265225,
          0.70651289],
       [ -2.52731   ,  -1.68047   , -10.4975    ,  -3.39463   ,
          3.87858   ,  -0.370054  ,  -1.10538   ,   0.454076  ,
          0.0242072 ,  -2.72727   ,   0.316183  ,   1.9492

In [12]:
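# SSE: sum of squared Euclidean distances between each point and its assigned centroid

SSE = 0

for i in range(len(data)):
    j = labels[i]  # index of the cluster assigned to point i
    SSE += np.linalg.norm(data[i] - centers[j])**2
SSE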

1627.9144098107424
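The SSE computed by the loop is the quantity scikit-learn exposes as `inertia_`. The sketch below checks this on synthetic data, since `data.csv` is not reproduced here; the array `demo_data` (30 rows, 25 columns of random values) and the other `demo_` names are assumptions standing in for the stock data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the stock data (assumption: 30 rows, 25 columns).
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(30, 25))

demo_kmeans = KMeans(init="random", n_clusters=8, n_init=10, random_state=0).fit(demo_data)

# Centroid assigned to each point, selected by indexing the centers with the labels.
assigned_centers = demo_kmeans.cluster_centers_[demo_kmeans.labels_]

# Sum of squared Euclidean distances between points and their centroids.
demo_sse = ((demo_data - assigned_centers) ** 2).sum()

print(demo_sse, demo_kmeans.inertia_)
```

The vectorized expression replaces the explicit loop and does not depend on a hard-coded number of rows.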

# Question 2 

We want to decrease the SSE.
We can change the parameters n_init and tol (the tolerance). Increasing n_init runs the k-means algorithm more times from different initializations, which gives more chances of finding a better result in terms of inertia. Decreasing the tolerance makes the convergence test stricter: a run stops only once the Frobenius norm of the difference between the cluster centers of two consecutive iterations falls below a smaller threshold. After some tests, however, we notice that lowering the tolerance has little effect on the final SSE.


In [None]:
kmeans3 = KMeans(n_init=1000, tol=1e-9).fit(data)  # n_clusters defaults to 8
label2 = kmeans3.labels_
centers2 = kmeans3.cluster_centers_

sse = 0

for i in range(len(data)):
    j = label2[i]
    sse += np.linalg.norm(data[i] - centers2[j])**2
sse
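Why more restarts help can be illustrated with a small experiment. The sketch below runs k-means once per seed on synthetic data and compares the resulting inertias; `demo_data` is an assumed stand-in for the stock data, and the names `run_inertias` and `best_inertia` are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the stock data (assumption: 30 rows, 25 columns).
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(30, 25))

# One single-initialization run per seed; each may end in a different local optimum.
run_inertias = [
    KMeans(init="random", n_clusters=8, n_init=1, random_state=seed).fit(demo_data).inertia_
    for seed in range(20)
]

# Keeping the best of several runs is what a larger n_init does internally.
best_inertia = min(run_inertias)
print(f"worst run: {max(run_inertias):.2f}, best run: {best_inertia:.2f}")
```

The spread between the worst and best run gives a rough idea of how much a single unlucky initialization can cost.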

In [9]:
kmeans2.labels_

array([1, 1, 0, 3, 7, 2, 7, 4, 7, 7, 6, 6, 1, 2, 5, 6, 1, 6, 6, 2, 2, 0,
       6, 0, 6, 6, 2, 2, 6, 2], dtype=int32)

Clusters:

C_0 : {2 : Chevron, 21 : Pfizer, 23 : ExxonMobil} -> oil industry; Pfizer may use oil products
C_1 : {0 : American Express, 1 : Boeing, 12 : Disney, 16 : Hewlett-Packard} -> possibly travel, with Boeing and Disney; furthermore, American Express is often used by tourists
C_2 : {5 : Kraft, 13 : Procter & Gamble, 19 : AT&T, 20 : Merck, 26 : McDonalds, 27 : Coca-Cola, 29 : Johnson&Johnson} -> food industry (and cosmetics)
C_3 : {3 : Cisco} -> should logically be in tech
C_4 : {7 : Bank of America} -> bank
C_5 : {14 : Alcoa} -> metals (aluminum) company
C_6 : {10 : IBM, 11 : Home Depot, 15 : Intel, 17 : Wal-Mart, 18 : General Electric, 22 : United Tech, 24 : Travelers, 25 : JP Morgan Chase, 28 : 3M} -> technology
C_7 : {4 : DuPont, 6 : Caterpillar, 8 : Verizon, 9 : Microsoft} -> construction industry

We can notice that some clusters are odd: for example, cluster C_1 contains Boeing and The Walt Disney Company, which do not operate in the same field at all.
Also, some clusters contain companies that do not fit the cluster's theme, for example Pfizer in C_0.
Finally, some companies end up alone in their cluster, such as Cisco, which is a tech company, and Bank of America, which should logically be grouped with American Express.
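The cluster listing above can also be produced programmatically rather than by hand. The sketch below groups names by label using only the first ten stocks shown in the table and the first ten labels of the run above; the variable names are assumptions introduced for this illustration.

```python
from collections import defaultdict

# First ten stock names from the table and their labels from the run above.
stock_names = [
    "American Express", "Boeing", "Chevron", "Cisco Systems", "DuPont",
    "Kraft", "Caterpillar", "Bank of America", "Verizon", "Microsoft",
]
labels = [1, 1, 0, 3, 7, 2, 7, 4, 7, 7]

# Map each cluster label to the list of stocks assigned to it.
clusters = defaultdict(list)
for name, label in zip(stock_names, labels):
    clusters[label].append(name)

for label in sorted(clusters):
    print(f"C_{label}: {clusters[label]}")
```

With the full data set, the same loop could take the names from the `StockName` column and the labels from `kmeans2.labels_`.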