# Learning WORD2VEC: K-Means Clustering
----
Goal: Because the dataset contains variable-length reviews, we need to **transform individual word vectors into a feature set that is the same length for every review.**

WORD2VEC creates clusters of semantically related words, so another possible approach is to exploit the similarity of words within a cluster. Grouping vectors in this way is known as **vector quantization**. To accomplish vector quantization, we first need to find the centers of the word clusters, which we can do using a [clustering algorithm](http://scikit-learn.org/stable/modules/clustering.html) such as the [K-Means clustering algorithm](http://en.wikipedia.org/wiki/K-means_clustering).
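
As a rough illustration of vector quantization, the sketch below maps each vector to the index of its nearest centroid. The toy 2-D vectors, the two centroids, and the helper name `quantize` are made up for this example; in practice the vectors would come from the Word2Vec model and the centroids from K-Means.

```python
import numpy as np

def quantize(vectors, centroids):
    """Return, for each row of `vectors`, the index of the closest centroid."""
    # Pairwise squared Euclidean distances, shape (n_vectors, n_centroids)
    diffs = vectors[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

# Hypothetical 2-D "word vectors" forming two loose groups
vectors = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.1, 0.8]])
# Hypothetical cluster centers (K-Means would normally learn these)
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

codes = quantize(vectors, centroids)
print(codes.tolist())  # [0, 0, 1, 1]
```

Each vector is replaced by a single integer code, which is what makes it possible to count cluster memberships later on.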

## [K-Means Clustering](http://en.wikipedia.org/wiki/K-means_clustering)
In K-Means, the one parameter we need to set is **K --the number of clusters**. 

How do we decide how many clusters to create? Trial and error suggested that **small clusters, with an average of 5 words per cluster,** give better results than large clusters with many words.
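
The arithmetic behind that rule of thumb can be sketched as follows. The helper name `choose_num_clusters` is hypothetical, and 16490 is the vocabulary size reported by the model loaded later in this notebook; other models will give a different K.

```python
def choose_num_clusters(vocab_size, avg_words_per_cluster=5):
    """Hypothetical helper: pick K so clusters average the requested size."""
    # Floor division keeps K an integer; never return fewer than 1 cluster
    return max(1, vocab_size // avg_words_per_cluster)

print(choose_num_clusters(16490))  # 3298
```

Floor division (`//`) is used because the number of clusters has to be an integer.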

We use [scikit-learn to perform our K-Means clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

K-Means clustering with a large K can be very slow; the following code can take ~40 minutes. A timer around the K-Means call shows how long it takes.

In [2]:
# Load the model that we created in Part 2
from gensim.models import Word2Vec

model = Word2Vec.load("300features_40minwords_10context")

print type(model.syn0)
print model.syn0.shape

<type 'numpy.ndarray'>
(16490, 300)


### Initialize a k-means object and use it to extract centroids

In [8]:
from sklearn.cluster import KMeans
import time

# Start time --to demonstrate how long K-Means takes
start_time   = time.time()

# Set "k" (num_clusters) to be 1/5th of the vocabulary size
# or an average of 5 words per cluster
word_vectors = model.syn0
num_clusters = word_vectors.shape[0] // 5  # integer division: n_clusters must be an int

kmeans_clustering = KMeans(n_clusters=num_clusters)
idx               = kmeans_clustering.fit_predict(word_vectors)

# Get end time
end_time     = time.time()
elapsed_time = end_time - start_time
print "Time taken for K-Means Clustering: %d seconds" % (elapsed_time)

Time taken for K-Means Clustering: 537 seconds


The **cluster assignment** for each word is stored in `idx`, and the vocabulary from our original Word2Vec model is still stored in `model.index2word`.

For convenience, we zip the cluster assignments and the vocabulary into one dictionary:

| Function | Description |
|----|----|
| [**zip**](https://docs.python.org/3/library/functions.html#zip) | Makes an iterator that aggregates the items from each input iterable *(built-in)* |
| [**dict**](https://docs.python.org/3/library/stdtypes.html#typesmapping) | Makes a dictionary that maps hashable values to arbitrary objects |

In [16]:
# Create a Word / Index dictionary --mapping each vocabulary word to a cluster number
word_centroid_map = dict(zip(model.index2word, idx))
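
To see what `dict(zip(...))` produces, here is a tiny sketch with a made-up vocabulary and made-up cluster labels (the names `vocab`, `labels`, and `toy_map` are illustrative only). The two sequences must be in the same order, so that position `i` of the labels describes word `i` of the vocabulary.

```python
# Made-up vocabulary and cluster labels, in matching order
vocab = ["good", "great", "bad", "awful"]
labels = [0, 0, 1, 1]

# Pair each word with its label, then build a word -> cluster lookup
toy_map = dict(zip(vocab, labels))

print(toy_map["great"])  # 0
print(toy_map["awful"])  # 1
```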

This is a little abstract, so let's take a closer look at what our clusters contain. Here is a loop that prints out the words for clusters 0 through 9:

In [22]:
# Note: Your clusters may differ, as `Word2Vec` relies on a random number seed.


# For the first 10 clusters
for cluster in xrange(0, 10):

    # Print the cluster number
    print "\nCluster %d" % cluster

    # Find all words assigned to that cluster number and print them out
    for word, centroid in word_centroid_map.items():
        if centroid == cluster:
            print "\t" + word


Cluster 0
	altogether
	setup
	potentially

Cluster 1
	coin
	beside

Cluster 2
	caron
	nielsen
	leslie

Cluster 3
	devout
	occult
	pervert
	buddhist

Cluster 4
	purists

Cluster 5
	chong
	cheech
	marin

Cluster 6
	tossed
	sped
	creeps
	chalk
	messes
	ticks
	slapped
	pumped

Cluster 7
	speedy
	canned
	droning
	trendy
	ghastly
	drenched
	kitsch
	saturated

Cluster 8
	revisit
	distribute
	executives
	execs
	compete
	acquire
	dared
	promotion
	financing
	releasing
	companies
	distributors
	collectors
	access
	marketing
	promote
	publicity
	profit

Cluster 9
	luzhin
	franchot
	befuddled
	consummate
	dyan
	jacob
	willed
	loveable
	foil
	scheider
	stoic
	amiable
	bravado
	devilish
	affable
	macgregor
	mcgregor
	hilt
	shrewd
	channing


We can see that the clusters are of varying quality. Some make sense:

- Clusters 2 and 5 mostly contain names
- Cluster 7 mostly contains related adjectives
- Cluster 1, however, is mystifying. What do "coin" and "beside" have in common?

Perhaps our algorithm works best on adjectives.

At any rate, we now have a cluster (or *centroid*) assignment for each word, so we can define a function to convert reviews into **bags-of-centroids**. This works just like Bag-of-Words but uses semantically related clusters instead of individual words:

In [None]:
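import numpy as np

def create_bag_of_centroids(wordlist, word_centroid_map):
    # The number of clusters is equal to the highest cluster index
    # in the word/centroid map, plus one (indices start at 0)
    num_centroids = max(word_centroid_map.values()) + 1

    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros(num_centroids, dtype="float32")

    # Assumed completion of the truncated cell: count each in-vocabulary
    # word toward its cluster, as in Bag-of-Words
    for word in wordlist:
        if word in word_centroid_map:
            bag_of_centroids[word_centroid_map[word]] += 1

    return bag_of_centroids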
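
To make the bag-of-centroids idea concrete, here is a small sketch on made-up data. For one tokenized review it counts how many words land in each cluster, so every review becomes a vector of length K no matter how many words it contains. The function name `bag_of_centroids_bincount`, the toy mapping, and the use of `np.bincount` are choices made for this illustration, not necessarily what the notebook itself does; skipping out-of-vocabulary words is likewise an assumption.

```python
import numpy as np

def bag_of_centroids_bincount(wordlist, word_centroid_map):
    """Hypothetical variant: per-cluster word counts via np.bincount."""
    num_centroids = max(word_centroid_map.values()) + 1
    # Cluster index of every in-vocabulary word; unknown words are skipped
    indices = np.array(
        [word_centroid_map[w] for w in wordlist if w in word_centroid_map],
        dtype=np.intp,
    )
    # minlength keeps the vector the same length for every review
    return np.bincount(indices, minlength=num_centroids).astype("float32")

# Made-up map: cluster 0 = positive words, 1 = negative words, 2 = film words
toy_map = {"good": 0, "great": 0, "bad": 1, "awful": 1, "movie": 2}
review = ["great", "movie", "good", "plot", "bad", "movie"]

bag = bag_of_centroids_bincount(review, toy_map)
print(bag.tolist())  # [2.0, 1.0, 2.0]
```

The word "plot" is not in the toy vocabulary, so it does not contribute to any count.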
    

[1596,
 132,
 67,
 3287,
 1240,
 861,
 240,
 3076,
 1317,
 910,
 116,
 968,
 2904,
 1230,
 749,
 1363,
 285,
 945,
 1911,
 1913,
 2504,
 225,
 2679,
 2045,
 1669,
 1578,
 212,
 165,
 204,
 1073,
 2831,
 2768,
 2518,
 1774,
 2049,
 524,
 2160,
 278,
 183,
 920,
 254,
 1857,
 748,
 1341,
 443,
 1279,
 291,
 1204,
 2537,
 669,
 37,
 830,
 655,
 1219,
 254,
 1147,
 994,
 210,
 2459,
 1068,
 527,
 1219,
 3290,
 1320,
 1708,
 1978,
 3118,
 3077,
 763,
 2065,
 695,
 2043,
 867,
 2400,
 164,
 3171,
 613,
 1065,
 1419,
 947,
 612,
 933,
 989,
 220,
 2429,
 783,
 546,
 1119,
 2404,
 2029,
 2519,
 2466,
 1445,
 230,
 921,
 175,
 2197,
 2511,
 937,
 1686,
 1624,
 816,
 1694,
 2071,
 1983,
 749,
 724,
 26,
 3135,
 447,
 3101,
 378,
 3169,
 865,
 2379,
 2756,
 188,
 942,
 317,
 102,
 386,
 892,
 441,
 759,
 208,
 1219,
 990,
 1202,
 962,
 1668,
 358,
 1749,
 3037,
 1324,
 358,
 232,
 2013,
 1503,
 672,
 217,
 2226,
 2160,
 14,
 2402,
 2042,
 1853,
 918,
 1829,
 240,
 546,
 1915,
 152,
 65,
 1596,
 2