# Discovering Groups

Building on the earlier ideas about making recommendations, this section introduces data clustering, a method for discovering and visualizing groups of things, people, or ideas that are closely related.

## Goals:
1. Preparing data from a variety of sources.
2. Two different clustering algorithms.
3. More on distance metrics.
4. Simple graphical visualization code for viewing the generated groups.
5. A method for projecting very complicated datasets into two dimensions.

## Word Vectors
- In the first dataset, the items that will be clustered are a set of 120 of the top blogs, and the data they’ll be clustered on is the number of times a particular set of words appears in each blog’s feed.

- By clustering blogs based on word frequencies, it might be possible to determine if there are groups of blogs that frequently write about similar subjects or write in similar styles. Such a result could be very useful in searching, cataloging, and discovering the huge number of blogs that are currently online.
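
To make the idea of a word vector concrete, here is a minimal sketch of turning raw feed text into a matrix with one row per blog and one column per word. The names `word_counts` and `build_matrix`, the toy feeds, and the vocabulary are all hypothetical; this is not the code that produced `blogdata.txt`.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase alphabetic words in a piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def build_matrix(feeds, vocabulary):
    """Return one row of counts per blog, one column per vocabulary word."""
    rows = []
    for name in sorted(feeds):
        counts = word_counts(feeds[name])
        # A Counter returns 0 for words that never appear
        rows.append([counts[w] for w in vocabulary])
    return rows

# Hypothetical toy data, for illustration only
feeds = {
    "blog_a": "Python code and more Python",
    "blog_b": "Cooking with code",
}
vocabulary = ["python", "code", "cooking"]
matrix = build_matrix(feeds, vocabulary)
```

Each row of `matrix` is the word vector for one blog, which is the form of input the clustering steps operate on.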

## Hierarchical Clustering
- Hierarchical clustering builds up a hierarchy of groups by repeatedly merging the two most similar groups.
- Each of these groups starts as a single item, in this case an individual blog.
- In each iteration this method calculates the distance between every pair of groups, and the closest pair is merged to form a new group.
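
The merge loop described above can be sketched in a few lines. This is a simplified illustration, not the implementation of `clusters.hcluster`: it assumes Euclidean distance (chosen here for simplicity) and represents a merged group by the average of the two vectors it joins. The function names and the toy rows are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(rows, distance=euclidean):
    """Merge the two closest clusters until only one remains.

    Returns a nested tuple describing the merge order, with row
    indices as leaves.
    """
    # Each cluster is (vector, tree); a leaf's tree is its row index
    clusters = [(list(row), i) for i, row in enumerate(rows)]
    while len(clusters) > 1:
        best = None
        # Compare every pair of current clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        vec_i, tree_i = clusters[i]
        vec_j, tree_j = clusters[j]
        # Assumption: the merged cluster sits at the average of the pair
        merged_vec = [(x + y) / 2.0 for x, y in zip(vec_i, vec_j)]
        # Delete j first (j > i) so index i stays valid
        del clusters[j]
        del clusters[i]
        clusters.append((merged_vec, (tree_i, tree_j)))
    return clusters[0][1]

# Hypothetical toy data: two obvious groups
rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
tree = agglomerate(rows)
```

The nested tuple in `tree` records which items were merged first, which is the same information a dendrogram displays.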

In [3]:
'''
Read the blog data and build the hierarchical clusters.
'''
import clusters
blognames, words, data = clusters.readfile('blogdata.txt')
clust = clusters.hcluster(data)
# Uncomment to print the cluster hierarchy as text:
#clusters.printclust(clust, labels=blognames)

### Drawing a Dendrogram
- After hierarchical clustering is completed, you usually view the results in a type of graph called a dendrogram, which displays the nodes arranged into their hierarchy.
- A dendrogram not only uses connections to show which items ended up in each cluster, it also uses distance to show how far apart the items were.
- Rendering a dendrogram can help us determine how similar the items within a cluster are, which can be interpreted as the tightness of the cluster.

In [7]:
clusters.drawdendogram(clust, blognames, jpeg="visualizations/blogclust.jpg")

### Column Clustering
- In the blog dataset, the columns represent words, and it's potentially interesting to see which words are commonly used together.
- The easiest way to do this is to rotate the entire dataset so that the columns (the words) become rows, each with a list of numbers indicating how many times that particular word appears in each of the blogs.
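
Rotating the dataset is a matrix transpose. A minimal sketch, assuming the data is a list of equal-length rows (the helper name `rotate` and the toy matrix are hypothetical, and `clusters.rotatematrix` may be written differently):

```python
def rotate(matrix):
    """Transpose a list-of-lists matrix so that columns become rows."""
    return [list(col) for col in zip(*matrix)]

# Two blogs (rows) by three words (columns), hypothetical counts
data = [[2, 1, 0],
        [0, 1, 1]]
rotated = rotate(data)
```

After rotation each row holds the counts for one word across all blogs, so the same clustering code can group words instead of blogs.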

In [5]:
rdata = clusters.rotatematrix(data)
wordclust = clusters.hcluster(rdata)
clusters.drawdendogram(wordclust, labels=words, jpeg='visualizations/wordclust.jpg')

### Drawbacks
1. The tree view doesn't break the data into distinct groups without additional work.
2. The algorithm is computationally intensive: the relationship between every pair of items (the Pearson score in our case) must be calculated and then recalculated whenever items are merged, so it runs very slowly on large datasets.
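
To give a feel for the cost, here is a sketch of a Pearson-based distance and of how many pairs must be compared in the first pass. It assumes the common convention of using 1 minus the correlation as the distance, so that similar items score close to 0. The function names are hypothetical and the version in `clusters.py` may differ in detail.

```python
import math

def pearson_distance(v1, v2):
    """Return 1 - r, where r is the Pearson correlation of v1 and v2."""
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))
    num = p_sum - (sum1 * sum2) / n
    den = math.sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        # Assumption: a constant vector is treated as uncorrelated (r = 0)
        return 1.0
    return 1.0 - num / den

def pair_count(n):
    """Number of distinct pairs among n items."""
    return n * (n - 1) // 2

# 120 blogs means thousands of comparisons before the first merge
first_pass = pair_count(120)
```

The pair count grows quadratically with the number of items, and the comparisons are repeated after every merge, which is why the approach struggles on large datasets.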

## K-Means Clustering
- Given the drawbacks mentioned above, let's try another approach.
- K-means is told in advance how many distinct clusters to generate. It then determines the size of each cluster based on the structure of the data.

**So, how does it work?**
1. It begins with 'k' randomly placed *centroids* and assigns every item to the nearest one.
2. After the assignment, each centroid is moved to the average location of all the items assigned to it, and the assignments are redone.
3. This process repeats until the assignments stop changing.
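
The three steps can be sketched as follows. This is an illustration under stated assumptions rather than the code behind `clusters.kcluster`: it uses squared Euclidean distance, and it starts the centroids on `k` randomly chosen rows instead of at arbitrary random positions. All names and the toy rows are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a simple k-means run."""
    rng = random.Random(seed)
    # Assumption: initial centroids are copies of k distinct random rows
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign every row to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Step 3: stop once the assignments no longer change
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each centroid to the mean of its members
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / float(len(members))
                                for col in zip(*members)]
    return assignments, centroids

# Hypothetical toy data: two obvious groups
rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignments, centroids = kmeans(rows, k=2)
```

Because the starting centroids are random, the cluster numbering can differ between runs even when the grouping itself is the same.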

In [6]:
reload(clusters)
kclust = clusters.kcluster(data, k=10)

Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6


## Clustering of real data
We will go through the process of creating a dataset from the Zebo website and carrying out k-means clustering on it.<br />
*The code is in clusters.py.*