Skip to content
/ bfr Public

BFR(Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.

Notifications You must be signed in to change notification settings

laiola/bfr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 

Repository files navigation

[Download data for testing] (https://drive.google.com/file/d/0BzNv2kZu4HnebXNBQnVqNHVPdG8/view?usp=sharing)

bfr

BFR(Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.

Here's BFR “Galaxies” Picture

alt text

Three Classes of Points

  • Discard set (DS):
    • Points close enough to a centroid to be summarized
  • Compression set (CS):
    • Groups of points that are close together but not close to any existing centroid. These points are summarized, but not assigned to a cluster
  • Retained set (RS):
    • Isolated points waiting to be assigned to a compression set

Sufficient Statistic

  • The number of points, N
  • The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension
  • The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension

Average in each dimension (the centroid) can be calculated as SUMi / N, where SUMi = ith component of SUM. Variance of a cluster’s discard set in dimension i is: (SUMSQi / N) – (SUMi / N)^2. And standard deviation is the square root of that.

Processing of points

  1. Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS
  • These points are so close to the centroid that they can be summarized and then discarded
  1. Use any main-memory clustering algorithm to cluster the remaining points and the old RS
  • Clusters go to the CS; outlying points to the RS
  1. DS set: Adjust statistics of the clusters to account for the new points
  • Add Ns, SUMs, SUMSQs
  1. Consider merging compressed sets in the CS
  2. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

[Read more] (http://infolab.stanford.edu/~ullman/mmds/book.pdf)

About

BFR(Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages