bfr

[Download data for testing] (https://drive.google.com/file/d/0BzNv2kZu4HnebXNBQnVqNHVPdG8/view?usp=sharing)

bfr

BFR(Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.

Here's BFR “Galaxies” Picture

Three Classes of Points

Discard set (DS):
- Points close enough to a centroid to be summarized
Compression set (CS):
- Groups of points that are close together but not close to any existing centroid. These points are summarized, but not assigned to a cluster
Retained set (RS):
- Isolated points waiting to be assigned to a compression set

Sufficient Statistic

The number of points, N
The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension
The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension

Average in each dimension (the centroid) can be calculated as SUMi / N, where SUMi = ith component of SUM. Variance of a cluster’s discard set in dimension i is: (SUMSQi / N) – (SUMi / N)^2. And standard deviation is the square root of that.

Processing of points

Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS

These points are so close to the centroid that they can be summarized and then discarded

Use any main-memory clustering algorithm to cluster the remaining points and the old RS

Clusters go to the CS; outlying points to the RS

DS set: Adjust statistics of the clusters to account for the new points

Add Ns, SUMs, SUMSQs

Consider merging compressed sets in the CS
If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

[Read more] (http://infolab.stanford.edu/~ullman/mmds/book.pdf)

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
src		src
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

.gitignore

.gitignore

README.md

README.md

pom.xml

pom.xml

Repository files navigation

bfr

Three Classes of Points

Sufficient Statistic

Processing of points

About

Releases

Packages

Languages

laiola/bfr

Folders and files

Latest commit

History

Repository files navigation

bfr

Three Classes of Points

Sufficient Statistic

Processing of points

About

Resources

Stars

Watchers

Forks

Languages