Hopkins-Statistic-Clustering-Tendency

A Python implementation of Hopkins' statistic (Lawson and Jurs, 1990) for measuring the clustering tendency of data.

Clustering Tendency

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

However, clustering algorithms will locate and specify clusters in data even if none are present. It is therefore appropriate to measure the clustering tendency or randomness of a data set before subjecting it to a clustering algorithm.

Measuring clustering tendency means measuring the degree to which clusters exist in the data. It can be performed as an initial test before attempting clustering.

Hopkins' Statistic

Hopkins' statistic is a simple measure of clustering tendency. It compares the distance from a real data point to its nearest neighbor, W, with the distance from a randomly chosen point within the data space to the nearest real data point, U.

Algorithm

Let X be the set of n data points.
Consider a random sample (without replacement) of $m \ll n$ data points with members $x_{i}$. Lawson and Jurs (1990) suggest choosing 5% of the data points so that the nearest-neighbor distances are approximately independent and H approximately follows a Beta distribution.
Generate a set Y of m data points, with members $y_{i}$, distributed uniformly at random within the data space.
Define two distance measures,
- $u_{i}$ the distance of $y_{i}$ in Y from its nearest neighbour in X, and
- $w_{i}$ the distance of the sampled point $x_{i}$ from its nearest neighbour in X (other than $x_{i}$ itself).
If the data is d-dimensional, then the Hopkins statistic is defined as:

$H = \frac{\sum_{i=1}^{m} u_{i}^{d}}{\sum_{i=1}^{m} u_{i}^{d} + \sum_{i=1}^{m} w_{i}^{d}}$
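The steps above can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not necessarily the code in this repository's notebook. It assumes Euclidean distance and takes the "data space" to be the axis-aligned bounding box of X. The names `nearest_distance` and `hopkins_statistic`, and the `m` and `seed` parameters, are chosen here for illustration.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbour in `data`.

    `skip_index` excludes one row of `data`, so that a sampled real point
    is not matched with itself.
    """
    return min(
        math.dist(point, other)
        for j, other in enumerate(data)
        if j != skip_index
    )


def hopkins_statistic(X, m=None, seed=None):
    """Hopkins statistic H for a list X of equal-length numeric tuples."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    if m is None:
        # 5% sample, following the suggestion of Lawson and Jurs (1990).
        m = max(1, round(0.05 * n))

    # Assumption: the data space is the bounding box of X.
    lows = [min(row[k] for row in X) for k in range(d)]
    highs = [max(row[k] for row in X) for k in range(d)]

    # w_i: sampled real point -> nearest *other* real point.
    sample_idx = rng.sample(range(n), m)
    w = [nearest_distance(X[i], X, skip_index=i) for i in sample_idx]

    # u_i: uniformly random artificial point -> nearest real point.
    Y = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in range(m)]
    u = [nearest_distance(y, X) for y in Y]

    sum_u = sum(dist ** d for dist in u)
    sum_w = sum(dist ** d for dist in w)
    return sum_u / (sum_u + sum_w)
```

Because both the sample and the artificial points are random, H varies from run to run. Averaging `hopkins_statistic(X)` over several seeds gives a steadier estimate. The brute-force neighbour search is quadratic in the worst case, so a spatial index would be a sensible substitute for large data sets.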

Measuring Clustering Tendency with Hopkins' Statistic

If X were uniformly distributed, then $\sum_{i=1}^{m}u_{i}^{d}$ and $\sum_{i=1}^{m}w_{i}^{d}$ would be close to each other, and thus H would be about 0.5. However, if clusters are present in X, then the distances for the artificial points would be substantially larger in expectation than those for the real ones, and thus H would move toward 1.

A value for H higher than 0.75 indicates a clustering tendency at the 90% confidence level.

The null and alternative hypotheses are defined as follows:

Null hypothesis: the data set X is uniformly distributed (i.e., no meaningful clusters)
Alternative hypothesis: the data set X is not uniformly distributed (i.e., contains meaningful clusters)

Therefore, we can interpret Hopkins' statistic in the following manner:

If the value is between 0.01 and 0.3, the data is regularly spaced.
If the value is around 0.5, the data is random.
If the value is between 0.7 and 0.99, the data has a high tendency to cluster.
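As a rough illustration, the reading above can be written as a small helper. The function name `interpret_hopkins` is hypothetical. The text does not say how wide "around 0.5" is, so the 0.4 to 0.6 band used here is an assumption, and values falling between the stated ranges are reported as inconclusive.

```python
def interpret_hopkins(h):
    """Map a Hopkins statistic to the rough reading given in the text."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("H must lie in [0, 1]")
    if h <= 0.3:
        return "regularly spaced"
    if h >= 0.7:
        return "high tendency to cluster"
    if 0.4 <= h <= 0.6:  # assumed width for "around 0.5"
        return "random"
    return "inconclusive"  # gaps the text leaves unspecified
```

These labels are only a rule of thumb. The formal decision is the hypothesis test described above.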

Clustering Tendency for the Iris Dataset

Hopkins' statistic was calculated for the iris dataset and averaged about 0.83 (> 0.75). Thus, we reject the null hypothesis and conclude that the iris dataset has a significant clustering tendency.

