# Discovering Groups

This section builds on the earlier ideas about making recommendations and introduces data clustering, a method for discovering and visualizing groups of things, people, or ideas that are closely related.

## Goals:
1. Preparing data from a variety of sources.
2. Two different clustering algorithms.
3. More on distance metrics.
4. Simple graphical visualization code for viewing the generated groups.
5. A method for projecting very complicated datasets into two dimensions.

## Word Vectors
- In the first dataset, the items to be clustered are a set of 120 of the top blogs, and the data they’ll be clustered on is the number of times each word from a chosen set appears in each blog’s feed.

- By clustering blogs based on word frequencies, it might be possible to determine if there are groups of blogs that frequently write about similar subjects or write in similar styles. Such a result could be very useful in searching, cataloging, and discovering the huge number of blogs that are currently online.
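
As a rough illustration of the word-vector idea, the sketch below turns a few short texts into rows of word counts. The names `word_counts`, `build_matrix`, `docs`, and `vocabulary` are hypothetical and the sample texts are made up; the real dataset would be built from blog feeds instead.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase alphabetic words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_matrix(docs, vocabulary):
    """Return one row of counts per document, with columns ordered as in vocabulary."""
    rows = {}
    for name, text in docs.items():
        counts = word_counts(text)
        rows[name] = [counts[word] for word in vocabulary]
    return rows

# Hypothetical stand-ins for the text of two blog feeds.
docs = {
    "blog_a": "Python code and more Python code",
    "blog_b": "Cooking pasta, cooking rice",
}
vocabulary = ["python", "code", "cooking"]

matrix = build_matrix(docs, vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row is the vector that a clustering algorithm would compare against the others; blogs with similar rows use the chosen words in similar proportions.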

## Hierarchical Clustering
- Hierarchical clustering builds up a hierarchy of groups by repeatedly merging the two most similar groups.
- Each of these groups starts as a single item, in this case an individual blog.
- In each iteration, this method calculates the distance between every pair of groups and merges the closest pair to form a new group. This repeats until only one group remains.
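
The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the `clusters.hcluster` implementation: it assumes Euclidean distance and represents a merged group by the average of its two members' vectors, and the function name `hcluster_sketch` is hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster_sketch(rows, distance=euclidean):
    """Merge the closest pair of groups until one remains; return the merge tree."""
    # Each group is (vector, tree). A leaf's tree is just its row index.
    groups = [(list(vec), i) for i, vec in enumerate(rows)]
    while len(groups) > 1:
        # Compare every pair of groups and pick the closest one.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: distance(groups[pair[0]][0], groups[pair[1]][0]),
        )
        (vec_a, tree_a), (vec_b, tree_b) = groups[i], groups[j]
        # Assumption: the merged group's vector is the average of the two.
        merged = ([(x + y) / 2 for x, y in zip(vec_a, vec_b)], (tree_a, tree_b))
        # Delete the higher index first so the lower index stays valid.
        del groups[j]
        del groups[i]
        groups.append(merged)
    return groups[0][1]

rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
tree = hcluster_sketch(rows)
print(tree)
```

Recomputing every pairwise distance on each pass keeps the sketch short but is slow for large datasets; caching distances between passes is a common optimization.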

In [4]:
# Load the blog word-count data and build the cluster hierarchy.
import clusters
blognames, words, data = clusters.readfile('blogdata.txt')
clust = clusters.hcluster(data)
# Uncomment to print the hierarchy as indented text:
# clusters.printclust(clust, labels=blognames)

### Drawing a Dendrogram
- After hierarchical clustering is completed, you usually view the results in a type of graph called a dendrogram, which displays the nodes arranged into their hierarchy.
- A dendrogram not only uses connections to show which items ended up in each cluster, it also uses distance to show how far apart the items were.
- Rendering the graph this way can help you determine how similar the items within a cluster are, which could be interpreted as the tightness of the cluster.

In [7]:
clusters.drawdendogram(clust, blognames, jpeg="visualizations/blogclust.jpg")
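
To make the link between merge distance and cluster tightness concrete, here is a small text-only stand-in for a dendrogram. It assumes a hypothetical tree format in which each merge is a `(left, right, distance)` tuple and each leaf is a label; the labels and distances are invented for illustration and are not results from `blogdata.txt`.

```python
def render(node, depth=0):
    """Return the lines of an indented text view of a merge tree."""
    indent = "  " * depth
    if isinstance(node, str):
        # A leaf: just show its label.
        return [f"{indent}{node}"]
    left, right, dist = node
    # A merge: show the distance at which its two children were joined.
    lines = [f"{indent}+ merged at distance {dist:.2f}"]
    lines += render(left, depth + 1)
    lines += render(right, depth + 1)
    return lines

# Hypothetical tree: blog_a and blog_b merge early, blog_c joins much later.
tree = (("blog_a", "blog_b", 0.10), "blog_c", 0.75)

lines = render(tree)
print("\n".join(lines))
```

A small merge distance, such as the one joining `blog_a` and `blog_b` here, marks a tight cluster, while a large one marks items that were grouped only because nothing closer remained.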