# Measuring Distance Between Clusters

Distance between clusters is computed in two steps: a linkage criterion selects which members of the two clusters to compare, and a distance metric (here, one from the Minkowski family) measures the actual distance between those members.

- **Single-Linkage** considers the distance between clusters using the shortest distance from any member (example) of one cluster to any member of the other cluster.

    ![](https://github.com/harperd/machine-learning/blob/master/images/single-linkage.jpg?raw=1)

- **Complete-Linkage** considers the distance between clusters using the greatest distance from any member (example) of one cluster to any member of the other cluster.

     ![](https://github.com/harperd/machine-learning/blob/master/images/complete-linkage.jpg?raw=1)

- **Average-Linkage** considers the distance between clusters using the average of the distances between every pair of members (examples), taking one member from each cluster.

    ![](https://github.com/harperd/machine-learning/blob/master/images/average-linkage.jpg?raw=1)
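
The three linkage criteria above can be sketched in a few lines. This is a minimal illustration, not a full hierarchical clustering routine: the helper `linkage_distances` and the two small clusters are hypothetical, and the pairwise distances come from SciPy's `cdist` with the Euclidean metric.

```python
import numpy as np
from scipy.spatial.distance import cdist


def linkage_distances(cluster_a, cluster_b, metric="euclidean"):
    """Return single, complete, and average linkage distances between two clusters."""
    # Pairwise distance matrix: entry [i, j] is the distance from
    # member i of cluster_a to member j of cluster_b.
    pairwise = cdist(cluster_a, cluster_b, metric=metric)
    return {
        "single": float(pairwise.min()),     # shortest pairwise distance
        "complete": float(pairwise.max()),   # greatest pairwise distance
        "average": float(pairwise.mean()),   # mean over all pairs
    }


# Two small hypothetical clusters on the x-axis, chosen so the
# pairwise distances (3, 5, 2, 4) are easy to check by hand.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

result = linkage_distances(cluster_a, cluster_b)
print(result)
```

All three criteria read from the same pairwise distance matrix and differ only in how they reduce it to a single number.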

# Minkowski Metric

The Minkowski metric is the basis for the two distances most often used between Features (or Examples): *Euclidean*, which is the most common, and *Manhattan*:

   ![](https://github.com/harperd/machine-learning/blob/master/images/minkowski.jpg?raw=1)
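
For reference, and assuming the figure shows the standard definition, the Minkowski distance of order $p$ between two $n$-dimensional points $x$ and $y$ can be written as follows (the symbols $x$, $y$, $n$ and $d_p$ are notation chosen here, not taken from the figure):

$$
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
$$

As $p \to \infty$ this expression approaches $\max_i \lvert x_i - y_i \rvert$, which is the Chebyshev distance.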

The equation is used in two ways. For *Euclidean Distance* (the straight line between two points) use p = 2; for *Manhattan Distance* (travel along the axes only) use p = 1.

  ![](https://github.com/harperd/machine-learning/blob/master/images/manhattan-euclidean.jpg?raw=1)
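
The effect of the two choices of p can be checked numerically. The sketch below assumes the standard Minkowski definition; the helper `minkowski` and the points `a` and `b` are hypothetical, and the results are cross-checked against SciPy's own `cityblock`, `euclidean`, and `minkowski` functions.

```python
import numpy as np
from scipy.spatial import distance as dist


def minkowski(x, y, p):
    """Minkowski distance of order p, written out from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))


# Hypothetical points forming a 3-4-5 right triangle.
a = [0.0, 0.0]
b = [3.0, 4.0]

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print("Manhattan (p = 1):", manhattan)
print("Euclidean (p = 2):", euclidean)

# Cross-check against SciPy's implementations.
assert np.isclose(manhattan, dist.cityblock(a, b))
assert np.isclose(euclidean, dist.euclidean(a, b))
assert np.isclose(minkowski(a, b, p=3), dist.minkowski(a, b, p=3))
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, because it sums the legs of the triangle rather than measuring its hypotenuse.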

> **NOTE** Euclidean is used most often. Manhattan is sometimes used when different Features (or Dimensions) are not directly comparable, or when the data is high-dimensional. An example of features that are not comparable is age alongside other Features that are binary in nature, such as “wears glasses”.

**Other distance metrics:** Chebyshev, Cosine, Canberra

In [6]:
# For computing distance between vectors and clusters
import scipy.spatial.distance as dist

# For creating random clusters
from sklearn.datasets import make_blobs

# For K-Means clustering
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10, centers=3, n_features=2, random_state=0)

predictor = KMeans(
        # The number of clusters to form as well as the number of centroids to generate.
        n_clusters = 2, 
        # Method for initialization, defaults to 'k-means++':
        # 'k-means++': selects initial cluster centers for k-means clustering in a smart way to
        #              speed up convergence.
        # 'random': choose k observations (rows) at random from data for the initial centroids.
        init = 'k-means++',
        # Maximum number of iterations of the k-means algorithm for a single run.
        max_iter = 300,
        # Number of times the k-means algorithm will be run with different centroid seeds.
        # The final result will be the best output of n_init consecutive runs in terms of inertia.
        n_init = 10,
        # Determines random number generation for centroid initialization. Use an int to 
        # make the randomness deterministic. 
        random_state = 0)
predictor.fit(X)

# Use the two fitted centroids as representative points for the clusters
cluster_A = predictor.cluster_centers_[0]
cluster_B = predictor.cluster_centers_[1]
dimensions = len(cluster_A)

print(f'Distance measurements with {dimensions}-dimensional vectors')
print()
print('Euclidean distance is', dist.euclidean(cluster_A, cluster_B))
print('Manhattan distance is', dist.cityblock(cluster_A, cluster_B))
print('Chebyshev distance is', dist.chebyshev(cluster_A, cluster_B))
print('Canberra distance is', dist.canberra(cluster_A, cluster_B))
print('Cosine distance is', dist.cosine(cluster_A, cluster_B))

Distance measurements with 2-dimensional vectors

Euclidean distance is 3.9109690600209945
Manhattan distance is 5.412665104904623
Chebyshev distance is 3.2751942313846394
Canberra distance is 1.531430912244757
Cosine distance is 0.614950275085963
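
To make the three less familiar metrics concrete, the sketch below writes each one out from its usual definition and compares it with the SciPy function used in the cell above. The vectors `u` and `v` are hypothetical (they are not the centroids from that cell) and were picked so that `v` is an exact multiple of `u`.

```python
import numpy as np
from scipy.spatial import distance as dist

# Hypothetical vectors; v is exactly 3 * u, so both point in the same direction.
u = np.array([1.0, 2.0])
v = np.array([3.0, 6.0])

# Chebyshev: largest absolute difference along any single dimension.
chebyshev = float(np.max(np.abs(u - v)))

# Canberra: sum of |u_i - v_i| / (|u_i| + |v_i|) over all dimensions.
canberra = float(np.sum(np.abs(u - v) / (np.abs(u) + np.abs(v))))

# Cosine distance: 1 minus the cosine of the angle between the vectors.
cosine = float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print("Chebyshev:", chebyshev)
print("Canberra:", canberra)
print("Cosine:", cosine)

# Cross-check against SciPy's implementations.
assert np.isclose(chebyshev, dist.chebyshev(u, v))
assert np.isclose(canberra, dist.canberra(u, v))
assert np.isclose(cosine, dist.cosine(u, v))
```

Because `v` points in the same direction as `u`, the cosine distance is zero up to floating-point rounding even though the other metrics report a clear separation. Cosine distance compares orientation only and ignores magnitude.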
