This projekt visualizes the k-means clustering algorithm and hierarchical clustering in p5, clustering an arbitrary amount of 2D-Points.
The sketch has been bootstrapped with the generator-p5-webpack for Yeoman.
In data mining clustering (also cluster analysis) techniques are used to group together similar data points in a set of data. In this specific project, data points are represented as points in a 2D context.
The k-means algorithms esentially works this way:
- Generate data points or load data points from any source.
- Select initial center point of cluster for every cluster you want to make.
- For each data point:
- Calculate the eudclidian distance from the point to every cluster's center
- Assign the point to the cluster with the lowest distance to the center.
- Update the center of each cluster to the mean of the positions of all of the points in the cluster.
- If at least one point has been assigned to a new cluster, jump to step 3.
- Point creation
- Initially the number of points you specified as
NUM_POINTSis randomly generated. - You can always add new points to the system by clicking in the canvas or dragging the mouse over the canvas.
- By clicking "Finish creating points" the cluster creation mode is entered.
- Initially the number of points you specified as
- Cluster creation
- Initially the nuber of initial clusters you specified as
NUM_CLUSTERSare generated by randomly picking a point as the cluster's center. - You can click on any point to create additional clusters (if you want to set all of the clusters manually, set
NUM_CLUSTERSto0). - By clicking "Start clustering" the k-means algorithm starts.
- Initially the nuber of initial clusters you specified as
- Clustering
- As the points are assigned to clusters their color changes to the color of the cluster they have been assigned to.
- You can tweak the speed of the algorithm by changing the
ACTIONS_PER_FRAMEparameter. - During clustering you can still add new points and see how the clusters change.
| Name | Description |
|---|---|
NUM_CLUSTERS |
The number of clusters to be automatically created |
NUM_POINTS |
The number of points to be randomly created |
NUM_HOTSPOTS |
The number of hotspots for random point creation |
HOTSPOT_NOISE |
The noise value for points outside a hotspot |
ACTIONS_PER_FRAME |
The number of points being assigned to a cluster per animation frame |
CANVAS_SIZE |
The canvas' width and height in pixels |
Wikipedia: Hierarchical clustering
The hierarchical clustering algorithm essentially works this way:
- Generate data points or load data points from any source.
- Generate a cluster for each point and put them in a set of top-level clusters.
- While the number of top-level clusters is higher than the desired number of clusters.
- Calculate the proximities of all clusters to each other according to the specified proximity method.
- Take the pair of clusters with the lowest distance (highest proximity).
- Merge the two clusters into a new one.
- Insert the new merged cluster into the set of top-level clusters and remove clusters than have been merged from the set of top-level clusters.
- Point creation
- Initially the number of points you specified as
NUM_POINTSis randomly generated (for hierarchical clustering the default value ofNUM_POINTSis0). - You can add new points to the system by clicking in the canvas or dragging the mouse over the canvas.
- By clicking "Start clustering" the hierarchical clustering algorithm stars.
- Initially the number of points you specified as
- Clustering
- The algorithm merges the two clusters with the lowest distance and visualizes them according to the
DRAW_MODE. - The number above the canvas indicates the number of top-level clusters currently existing.
- The algorithm stops when the number of top-level clusters is equal to the value of the
NUM_CLUSTERSparameter.
- The algorithm merges the two clusters with the lowest distance and visualizes them according to the
| Name | Description |
|---|---|
DRAW_MODE |
The mode to draw the clusters |
PROXIMITY_METHOD |
The proximity method to use for proximity calculation |
NUM_CLUSTERS |
The number of target clusters |
NUM_POINTS |
The number of points to be randomly created |
NUM_HOTSPOTS |
The number of hotspots for random point creation |
HOTSPOT_NOISE |
The noise value for points outside a hotspot |
CANVAS_SIZE |
The canvas' width and height in pixels |
Two draw modes are available to draw the generated clusters during the clustering process:
| Name | Description |
|---|---|
CIRCLES |
All points are drawn as regular points with their initial color. All intermediate clusters and top-level clusters are drawn as semi-opaque circles^1. |
COLORS |
All points are drawn as regular points. The points' colors change according to which top-level cluster they belong. When the clustering algorithm has finished all top-level clusters are sorrounded by a circle^1. |
^1 The circle drawn around a cluster have their center at the cluster's center and the radius equals the distance from the center of the cluster to the position of the point in the cluster's sub-tree that is furthest away from the cluster's center. This somtimes lets the circles overlap, though the actual clusters do not actually overlap themselves.
| Name | Description |
|---|---|
| Single link | The distance of two clusteres is determined by the lowest distance of all pairs^1 of points in the two clusters. |
| Complete link | The distance of two clusteres is determined by the highest distance of all pairs^1 of points in the two clusters. |
| Average | The distance of two clusters is determined by the averagy distance of each pair^1 of points in the two clusters. |
| Centroid | The distance of two clusters is determined by the distance of the centers of the two clusters |
^1 A pair always consists of one point from the first and one from the second cluster. There must not be two points from the same cluster in a pair.
For the creation of random points three parameters are relevant: NUM_POINTS, NUM_HOTSPOTS, HOTSPOT_NOISE.
The points are randomly generated in the following way:
- If
NUM_HOTSPOTSis greater than0, the specified number of hotspots are generated with a random location and a random radius.
The radiuses of the hotspots are smaller the more hotspots there are. NUM_POINTSpoints are generated- If there are no hotspots, each point will be assigned a random location.
- If there are hotspots, points will be assigned a location close to a random hotspot.
HOTSPOT_NOISEdefines the probability at which a point is assigned a random location regardless of the number of hotspots. When set to0there won't be any points outside a hotspot radius, whereas when set to1all of the points will be assigned random locations (same as settingNUM_HOTSPOTSto0).
- Clone the repository from GitHub
- run
yarn install - run
yarn start
Instead of yarn you can for sure use npm
GPL-3.0 © Jannik Portz