# Lab 9 - Cluster Validation

In this last notebook of the Winter 2025 series, we consider cluster validation measures. These can be classified as *internal* validation measures, such as the measures of fit we already considered, as well as *external* validation measures, which provide a way to compare cluster solutions to a given (known, or assumed known) reference.

In addition to measures of fit, we also consider indicators of the balance of cluster solutions, i.e., the evenness of the number of observations in each cluster. Such measures include *entropy* and *Simpson's index*. Other measures, introduced in Anselin (2024), are based on the *spatial* properties of the clusters. These include the *join count ratio*, an indicator of how many *neighbors* of each observation in a cluster are also members of the cluster. For a spatially compact cluster solution, this measure should equal one (except for boundary effects). For clusters that are not spatially constrained, it indicates how closely the solution approximates a spatial one.

For spatially constrained cluster solutions, compactness is a key characteristic. This can be quantified by means of the *isoperimeter quotient (IPQ)*, the ratio of the area of a cluster shape to that of a circle with equal perimeter. A final measure of compactness introduced in Anselin (2024) is the *diameter* of the unweighted graph representation of the spatial weights matrix. To obtain a relative measure, the diameter is rescaled by the number of observations in the cluster. These last two measures apply only to spatially constrained clusters.

In addition, we also consider two classic indicators of external validity, i.e., the *Adjusted Rand Index (ARI)*, based on counting pairs, and the *Normalized Information Distance (NID)*, derived from measures of entropy.

The material is part of the Spatial Cluster Analysis course taught at the University of Chicago in the Winter Quarter of 2025.

Prepared by: Luc Anselin (anselin@uchicago.edu) and Pedro Amaral (pedroamaral@uchicago.edu)

## Preliminaries
The empirical illustration presented here is based on the material in Chapter 12 of Anselin (2024). However, it is adapted to reflect the max-p solution of p=13 obtained with `pygeoda`, which differs from the p=12 reported in the book.

### Required packages

The conda environment used for this exercise was created from a yml file with the same specification as in the previous notebooks.

In addition to the usual `numpy`, `pandas` and `geopandas`, we also import several specialized packages from scikit-learn and `pygeoda` to carry out the cluster analysis. Specifically, to carry out variable standardization we import `StandardScaler` from `sklearn.preprocessing`. The specific clustering methods are `AgglomerativeClustering` and `KMeans` from `sklearn.cluster`. The other clustering solutions are obtained with `pygeoda`. The external validation measures are contained in `sklearn.metrics.adjusted_rand_score` and `sklearn.metrics.adjusted_mutual_info_score`.

The new internal validation measures are based on `pygeoda.spatial_validation`. Several helper functions contained in the `spatial_cluster_course` module extract the relevant information and present it as a pandas data frame: `cluster_fragmentation`, `cluster_joincount`, `cluster_compactness` and `cluster_diameter`.


For the spatially constrained clustering methods, make sure the latest version of `pygeoda` is installed by using `pip install -U pygeoda`.

In addition to the helper functions mentioned above, we also import:

- `cluster_stats`
- `cluster_fit`

In [2]:
import geopandas as gpd
import pandas as pd
import numpy as np

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

from spatial_cluster_course import cluster_stats, cluster_fit, cluster_fragmentation
from spatial_cluster_course import cluster_joincount, cluster_compactness, cluster_diameter

import pygeoda

### Load data

For this final exercise, we will again use a data set on Zika and Microcephaly infections and socio-economic profiles for 2013-2016 in municipalities in the State of Ceará, Brazil. This is also a GeoDa sample data set. Detailed source information is available at https://geodacenter.github.io/data-and-lab/Ceara-Zika/

The following files will be used:
- **ceara.shp,shx,dbf,prj**: shape file (four files) for 184 municipalities

We follow the usual practice of setting a path (if needed), reading the data from the shape file and quickly checking its contents (`head`).

In [None]:
# Setting working folder:
#path = "/your/path/to/data/"
path = ""

# Load the Ceará data:
dfs = gpd.read_file(path+"ceara/ceara.shp")
print(dfs.shape)
dfs.head(3)

(184, 36)


| | code7 | mun_name | state_init | area_km2 | state_code | micro_code | micro_name | inc_mic_4q | inc_zik_3q | inc_zik_2q | ... | gdp | pop | gdpcap | popdens | zik_1q | ziq_2q | ziq_3q | zika_d | mic_d | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2300101.0 | Abaiara | CE | 180.833 | 23 | 23019 | 19ª Região Brejo Santo | 0.0 | 0.0 | 0.0 | ... | 35974.0 | 10496.0 | 3.427 | 58.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | POLYGON ((5433729.65 9186242.97, 5433688.546 9... |
| 1 | 2300150.0 | Acarape | CE | 130.002 | 23 | 23003 | 3ª Região Maracanaú | 6.380399 | 0.0 | 0.0 | ... | 68314.0 | 15338.0 | 4.454 | 117.983 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | POLYGON ((5476916.288 9533405.667, 5476798.561... |
| 2 | 2300200.0 | Acaraú | CE | 842.471 | 23 | 23012 | 12ª Região Acaraú | 0.0 | 0.0 | 1.63 | ... | 309490.0 | 57551.0 | 5.378 | 68.312 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | POLYGON ((5294389.783 9689469.144, 5294494.499... |


#### Selecting variables and checking their correlation

Following Chapter 12 of Anselin (2024) (https://lanselin.github.io/introbook_vol2/CHclustervalidation.html), we select the following variables from the Ceará sample data set.

List of variables:
| Column Name | Description                                      |
|-------------|--------------------------------------------------|
| mobility    | Mobility index                                  |
| environ     | Environment index                               |
| housing     | Housing index                                  |
| sanitation  | Sanitation index                               |
| infra       | Infrastructure index                           |
| gdpcap      | GDP per capita                                  | 

We carry out the by now familiar manipulation to create the required input data for the cluster routines.

In [4]:
varlist = ['mobility', 'environ', 'housing', 'sanitation', 'infra', 'gdpcap']
ceara_g = pygeoda.open(dfs)
queen_w = pygeoda.queen_weights(ceara_g)
print(queen_w)
data = dfs[varlist]
data_g = ceara_g[varlist]
n_clusters = 13

Weights Meta-data:
 number of observations:                  184
           is symmetric:                 True
               sparsity:  0.02953686200378072
        # min neighbors:                    1
        # max neighbors:                   13
       # mean neighbors:    5.434782608695652
     # median neighbors:                  5.0
           has isolates:                False



## Cluster Solutions

Before considering the validation measures, we compute the cluster solutions for hierarchical clustering (using `sklearn.cluster.AgglomerativeClustering`), K-Means (using `sklearn.cluster.KMeans`), and the spatially constrained clustering methods using `pygeoda`. For details on the arguments and helper functions, see the previous notebooks.

For each cluster solution, we extract the `labels` and the `fit` using the respective helper functions.

To illustrate the internal validity measures, we will focus on Ward's agglomerative clustering as an example of a standard method and on AZP with an SCHC initial solution as an example of a spatially constrained cluster solution. For these two, we also generate the cluster cardinalities. This is not pursued for the other cluster solutions, but can be readily implemented.
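The cardinalities are simply counts of observations per label. The internals of the `cluster_stats` helper are not shown in this notebook, so the following is only a rough stand-in, using the standard library and a hypothetical function name `cluster_cardinalities`, to show what is being tabulated.

```python
from collections import Counter

def cluster_cardinalities(labels):
    """Return (label, count) pairs sorted by label.

    Hypothetical stand-in for the course helper `cluster_stats`,
    which returns the same information as a pandas data frame.
    """
    counts = Counter(labels)
    return sorted(counts.items())

# Toy labels for six observations in three clusters
example_labels = (0, 2, 1, 0, 2, 2)
for label, size in cluster_cardinalities(example_labels):
    print(label, size)
```

The same call works on any of the label tuples created below, e.g., `cluster_cardinalities(agg_labels)`.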

### Hierarchical clustering

In [5]:
method = 'ward'
X = StandardScaler().fit_transform(data)

agg_clusters = AgglomerativeClustering(n_clusters=n_clusters, linkage=method, compute_distances=True)
agg_clusters.fit(X)
agg_labels = tuple(int(label) for label in agg_clusters.labels_)

agg_clusters_fit = cluster_fit(data=data,clustlabels=agg_clusters.labels_,
                 n_clusters=n_clusters, printopt=False)
agg_stats = cluster_stats(agg_labels)
print(agg_stats.to_string(index=False))

 Labels  Cardinality
      0           16
      1           26
      2           29
      3            9
      4           14
      5            3
      6           26
      7           11
      8           20
      9            4
     10           15
     11           10
     12            1


### K-means clustering

In [6]:
kmeans_clusters = KMeans(n_clusters=n_clusters, n_init=150, random_state=1234567).fit(X) 
kmeans_labels = tuple(int(label) for label in kmeans_clusters.labels_)
kmeans_clusters_fit = cluster_fit(data=data,clustlabels=kmeans_clusters.labels_,
                 n_clusters=n_clusters, printopt=False)

### SCHC with Ward’s linkage

In [7]:
schc_clusters = pygeoda.schc(n_clusters, queen_w, data_g, "ward")
schc_labels = schc_clusters['Clusters']

### SKATER

In [8]:
skater_clusters = pygeoda.skater(n_clusters, queen_w, data_g)
skater_labels = skater_clusters['Clusters']

### REDCAP

In [9]:
redcap_clusters = pygeoda.redcap(n_clusters, queen_w, data_g, method='fullorder-wardlinkage')
redcap_labels = redcap_clusters['Clusters']

### AZP with simulated annealing

In [10]:
azp_sa_clusters = pygeoda.azp_sa(n_clusters, queen_w, data_g, cooling_rate=0.8, sa_maxit=5)
azp_sa_labels = azp_sa_clusters['Clusters']

### AZP with SCHC as initial solution

In [11]:
azp_schc_clusters = pygeoda.azp_sa(n_clusters, queen_w, data_g, cooling_rate=0.8, sa_maxit=5,
                                init_regions=schc_labels)
azp_schc_labels = azp_schc_clusters['Clusters']
azp_schc_stats = cluster_stats(azp_schc_labels)
print(azp_schc_stats.to_string(index=False))

 Labels  Cardinality
      1           89
      2           43
      3           15
      4           14
      5            6
      6            4
      7            4
      8            3
      9            2
     10            1
     11            1
     12            1
     13            1


### Max-p Regions

In [12]:
maxp_sa_clusters = pygeoda.maxp_sa(queen_w, data_g, 
                                      bound_variable=dfs['pop'], 
                                      min_bound=dfs['pop'].sum()*0.05,
                                      iterations=9999,
                                      cooling_rate=0.9,
                                      sa_maxit=5)
maxp_sa_labels = maxp_sa_clusters['Clusters']

## Internal Validation Measures

As mentioned, in addition to the classic measures of fit, we also consider *fragmentation*, the *join count ratio*, and, for spatially constrained cluster solutions, the *compactness* and *diameter*.

These measures are provided as attributes in the solution object created by `pygeoda.spatial_validation`. This requires the `pygeoda` data set, the cluster labels and the spatial weights as arguments.

We illustrate this for Ward's agglomerative clustering, with `agg_labels` as the cluster labels, and for AZP-SCHC, with `azp_schc_labels` as the cluster labels. For both, the data set is `ceara_g` and the spatial weights are contained in `queen_w`.

We store the results in, respectively, `agg_validation` and `azp_schc_validation`. These objects will then be used as arguments to the helper functions.

In [13]:
agg_validation = pygeoda.spatial_validation(ceara_g, agg_labels, queen_w)
azp_schc_validation = pygeoda.spatial_validation(ceara_g, azp_schc_labels, queen_w)

### Fragmentation

The fragmentation measures are computed from the sizes of the components that make up a cluster solution. A solution is ideally balanced when each component has the same number of observations. Balance is quantified by means of entropy and its standardized counterpart, as well as by Simpson's index and its standardized counterpart. For entropy, larger values indicate greater balance, whereas for Simpson's index, smaller values do.
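To make the balance measures concrete, here is a minimal sketch that computes them from a list of cluster cardinalities. The function name `fragmentation_measures` is hypothetical, and the two standardizations (entropy divided by ln k, Simpson's index multiplied by k, with k the number of clusters) are assumptions inferred from the totals printed in this notebook rather than taken from the `pygeoda` source.

```python
import math

def fragmentation_measures(cardinalities):
    """Entropy and Simpson's index (raw and standardized) for cluster sizes.

    Assumed standardizations: entropy / ln(k) and simpson * k,
    where k is the number of clusters.
    """
    n = sum(cardinalities)
    k = len(cardinalities)
    shares = [c / n for c in cardinalities if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    simpson = sum(p * p for p in shares)
    std_entropy = entropy / math.log(k) if k > 1 else 0.0
    std_simpson = simpson * k
    return {
        "entropy": entropy,
        "std_entropy": std_entropy,
        "simpson": simpson,
        "std_simpson": std_simpson,
    }

# Cardinalities of the AZP-SCHC solution listed earlier in this notebook
azp_schc_sizes = [89, 43, 15, 14, 6, 4, 4, 3, 2, 1, 1, 1, 1]
for name, value in fragmentation_measures(azp_schc_sizes).items():
    print(name, round(value, 4))
```

With 13 equally sized clusters, the standardized entropy and the standardized Simpson's index would both equal one, which is why the latter grows as the solution becomes more unbalanced.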

The `pygeoda.spatial_validation` return object includes the fragmentation information in two attributes: `fragmentation` and `cluster_fragmentation`. The first contains the overall measures, as well as the number of clusters, in `fragmentation.n`, `fragmentation.entropy`, `fragmentation.std_entropy`, `fragmentation.simpson` and `fragmentation.std_simpson`. The second has the same information, organized as a list by cluster. It shows the within-cluster fragmentation for clusters that are not spatially constrained.

The `cluster_fragmentation` helper function takes a data frame with the labels and cardinalities (created by `cluster_stats`), and the `cluster_fragmentation`, `fragmentation` and `spatially_constrained` attributes from the validation object. For spatially constrained clusters, the `spatially_constrained` flag limits the fragmentation output to the totals.

This is illustrated for Ward's agglomerative clustering and AZP-SCHC. Note that for the latter, only the totals are given, since it is a spatially constrained solution.

The other cluster solutions can be analyzed in the same way.

#### Agglomerative clustering

In [14]:
agg_frag = cluster_fragmentation(agg_stats,agg_validation.cluster_fragmentation,
                   agg_validation.fragmentation,agg_validation.spatially_constrained)

Fragmentation
Label   N Sub  Entropy  Entropy*  Simpson  Simpson*
    0  16   9 1.751176  0.796994 0.250892  2.258026
    1  26  14 2.397937  0.908634 0.115385  1.615385
    2  29  14 2.425806  0.919194 0.109467  1.532544
    3   9  11 2.250260  0.938431 0.120000  1.320000
    4  14   9 1.923066  0.875225 0.187500  1.687500
    5   3   8 1.933810  0.929966 0.164444  1.315556
    6  26  10 2.168223  0.941647 0.132653  1.326531
    7  11   9 2.145842  0.976615 0.123967  1.115702
    8  20   6 1.609438  0.898244 0.240000  1.440000
    9   4   7 1.831020  0.940958 0.185185  1.296296
   10  15   0 0.000000  0.000000 0.000000  0.000000
   11  10   3 1.098612  1.000000 0.333333  1.000000
   12   1   0 0.000000  0.000000 0.000000  0.000000
  All 184     2.351159  0.916649 0.106274  1.381557


#### AZP-SCHC

In [15]:
azp_schc_frag = cluster_fragmentation(azp_schc_stats,azp_schc_validation.cluster_fragmentation,
                   azp_schc_validation.fragmentation,azp_schc_validation.spatially_constrained)

Fragmentation
Label   N Sub  Entropy  Entropy*  Simpson  Simpson*
  All 184     1.599116  0.623449 0.303521   3.94577


### Join Count Ratio

The join count ratio is a *spatial* measure of the degree of internal connectedness in a cluster solution. It is computed for each cluster separately as well as for the cluster solution as a whole. It is the share of the neighbors of the observations in a cluster that are also members of that cluster, i.e., the join count divided by the total number of neighbors.
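As an illustration of the mechanics, the sketch below computes the ratio for a toy graph stored as a plain dictionary of neighbor lists. The function name `join_count_ratio` and the data layout are assumptions made for this example; they are not the `pygeoda` interface. Each join is counted from both of its endpoints, which matches the even join counts in the output tables of this notebook.

```python
def join_count_ratio(neighbors, labels):
    """Per-cluster and overall join count ratios.

    neighbors: dict mapping each observation to a list of its neighbors
    labels:    dict mapping each observation to its cluster label
    """
    per_cluster = {}
    for node, nbrs in neighbors.items():
        stats = per_cluster.setdefault(
            labels[node], {"n": 0, "neighbors": 0, "join_count": 0}
        )
        stats["n"] += 1
        stats["neighbors"] += len(nbrs)
        # A join is a neighbor that carries the same cluster label
        stats["join_count"] += sum(1 for j in nbrs if labels[j] == labels[node])
    for stats in per_cluster.values():
        total = stats["neighbors"]
        stats["ratio"] = stats["join_count"] / total if total else 0.0
    all_neighbors = sum(s["neighbors"] for s in per_cluster.values())
    all_joins = sum(s["join_count"] for s in per_cluster.values())
    overall = all_joins / all_neighbors if all_neighbors else 0.0
    return per_cluster, overall

# Four observations on a line (0-1-2-3), split into two clusters
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
chain_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
per_cluster, overall = join_count_ratio(chain, chain_labels)
print(per_cluster["a"], round(overall, 3))
```

In this toy case each cluster has three neighbor links, two of which stay inside the cluster, so both ratios are 2/3; the link between observations 1 and 2 is the boundary effect that keeps the ratio below one.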

The relevant measures are included in the `joincount_ratio` and `all_joincount_ratio` attributes of the `pygeoda.spatial_validation` solution object. The former is a list with k entries, which are themselves objects, containing attributes `n`, `neighbors`, `join_count` and `ratio`. The latter is the same for the overall cluster solution.

The result is provided by the `cluster_joincount` helper function. It takes the data frame with cluster cardinalities and the `joincount_ratio` and `all_joincount_ratio` attributes from the validation solution.

We use the same examples in the illustration below.

#### Agglomerative clustering

In [16]:
agg_jc = cluster_joincount(agg_stats,agg_validation.joincount_ratio,
                   agg_validation.all_joincount_ratio)

Join Count Ratio
Label   N  Neighbors  Join Count  Ratio
    0  16         96          20  0.208
    1  26        134          26  0.194
    2  29        158          64  0.405
    3   9         47           6  0.128
    4  14         80           8  0.100
    5   3         15           0  0.000
    6  26        158          30  0.190
    7  11         43           4  0.093
    8  20        116          18  0.155
    9   4         21          10  0.476
   10  15         83          18  0.217
   11  10         46           8  0.174
   12   1          3           0  0.000
  All 184       1000         212  0.212


#### AZP-SCHC

In [17]:
azp_schc_jc = cluster_joincount(azp_schc_stats,azp_schc_validation.joincount_ratio,
                   azp_schc_validation.all_joincount_ratio)

Join Count Ratio
Label   N  Neighbors  Join Count  Ratio
    0  89        474         342  0.722
    1  43        225         128  0.569
    2  15         96          42  0.438
    3  14         78          44  0.564
    4   6         37          10  0.270
    5   4         17           6  0.353
    6   4         21          10  0.476
    7   3         22           4  0.182
    8   2         12           2  0.167
    9   1          3           0  0.000
   10   1          3           0  0.000
   11   1          7           0  0.000
   12   1          5           0  0.000
  All 184       1000         588  0.588


### Compactness

Compactness is a criterion that is only applicable to spatially constrained cluster solutions. It is measured by the isoperimeter quotient (IPQ), the ratio of the area of the cluster to the area of a circle with the same perimeter. The `compactness` attribute of the `spatial_validation` object is a list with k items, each an object with attributes `area`, `perimeter` and `isoperimeter_quotient`. The closer the IPQ is to one, the more compact the cluster shape.

The helper function `cluster_compactness` extracts this information. Its arguments are the data frame with cluster cardinalities (from `cluster_stats`), the `compactness` attribute, and the `spatially_constrained` attribute. Compactness is not relevant for clusters that are not spatially constrained, and the helper function therefore prints an error message in that case.

We continue with the same two examples. Note that for Ward's agglomerative clustering, an error message is generated.

#### Agglomerative clustering

In [18]:
agg_compactness = cluster_compactness(agg_stats,agg_validation.compactness,
                                      agg_validation.spatially_constrained)

Error: Compactness is only applicable to spatially constrained clusters


#### AZP-SCHC

In [19]:
azp_schc_compactness = cluster_compactness(azp_schc_stats,azp_schc_validation.compactness,
                                      azp_schc_validation.spatially_constrained)

Compactness
 Label  N         Area    Perimeter      IPQ
     0 89 8.672968e+10 1.536222e+07 0.004618
     1 43 2.797625e+10 6.198588e+06 0.009150
     2 15 1.053291e+10 2.220946e+06 0.026834
     3 14 1.138704e+10 2.122643e+06 0.031759
     4  6 4.038275e+09 9.257796e+05 0.059209
     5  4 3.085985e+09 5.499120e+05 0.128238
     6  4 1.779537e+09 4.240068e+05 0.124386
     7  3 9.969357e+08 3.168568e+05 0.124782
     8  2 1.474872e+09 3.229358e+05 0.177718
     9  1 6.171934e+08 1.199243e+05 0.539283
    10  1 7.908164e+07 5.056877e+04 0.388616
    11  1 8.449201e+08 1.739887e+05 0.350738
    12  1 4.124364e+08 1.255397e+05 0.328855
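The IPQ column can be checked by hand from the area and perimeter columns. A minimal sketch, assuming the quotient is computed as 4πA/P² (the area of the shape divided by the area of a circle with the same perimeter); `isoperimeter_quotient` is a hypothetical function name used only for this example.

```python
import math

def isoperimeter_quotient(area, perimeter):
    """Area of a shape relative to the area of a circle with equal perimeter."""
    return 4.0 * math.pi * area / perimeter ** 2

# Area and perimeter of the singleton cluster with label 9 in the table above
ipq = isoperimeter_quotient(6.171934e8, 1.199243e5)
print(round(ipq, 4))

# A circle is the most compact shape: its IPQ equals one
radius = 2.0
print(isoperimeter_quotient(math.pi * radius ** 2, 2.0 * math.pi * radius))
```

Because the perimeter enters squared, large clusters with long, irregular boundaries obtain very small values, as in the first rows of the table.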


### Diameter

The diameter of a spatially constrained cluster is an alternative measure of compactness, based on the network structure reflected in the spatial weights. The diameter of a cluster is the number of steps in the spatial weights graph that corresponds with the longest shortest path between any pair of observations (Newman 2018). Since this number will increase with cluster size, it is also standardized by dividing by the number of cluster members. Note that when a cluster is a singleton, the diameter will be zero.
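The longest shortest path can be found with a breadth-first search from every cluster member. The sketch below is an illustration only: the function name `graph_diameter` and the dictionary of neighbor lists are assumptions for this example, and paths are restricted to links between members of the same cluster.

```python
from collections import deque

def graph_diameter(neighbors, members):
    """Return (steps, ratio) for the subgraph induced by `members`.

    steps: longest shortest path (in links) between any pair of members
    ratio: steps divided by the number of members
    """
    members = set(members)
    longest = 0
    for source in members:
        # Breadth-first search gives shortest path lengths from `source`
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in neighbors[node]:
                if nbr in members and nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        longest = max(longest, max(dist.values()))
    return longest, longest / len(members)

# Four observations on a line (0-1-2-3)
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(graph_diameter(chain, [0, 1, 2, 3]))  # elongated cluster
print(graph_diameter(chain, [2]))           # singleton
```

A chain is the least compact arrangement, since its diameter is always one less than the number of members; a singleton has a diameter of zero.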

The `diameter` attribute of the `pygeoda.spatial_validation` object is a list with k items, one for each cluster, as an object with attributes `steps` and `ratio`. The helper function `cluster_diameter` extracts this information as a data frame. It takes as arguments the cluster cardinalities (from `cluster_stats`), and the `diameter` and `spatially_constrained` attributes from the `pygeoda.spatial_validation` object.

As in the case of compactness, an error message is generated for clusters that are not spatially constrained.

Again, we use the same two examples.

#### Agglomerative clustering

In [20]:
agg_diam = cluster_diameter(agg_stats,agg_validation.diameter,
                            agg_validation.spatially_constrained)

Error: Diameter is only applicable to spatially constrained clusters


#### AZP-SCHC

In [21]:
azp_schc_diam = cluster_diameter(azp_schc_stats,azp_schc_validation.diameter,
                            azp_schc_validation.spatially_constrained)

Diameter
 Label  N  Steps    Ratio
     0 89     22 0.247191
     1 43     17 0.395349
     2 15      6 0.400000
     3 14      7 0.500000
     4  6      3 0.500000
     5  4      3 0.750000
     6  4      2 0.500000
     7  3      2 0.666667
     8  2      1 0.500000
     9  1      0 0.000000
    10  1      0 0.000000
    11  1      0 0.000000
    12  1      0 0.000000


### Overall Comparison of Internal Validation Measures

We conclude our discussion of internal validation measures with an overview of the main overall measures for all the cluster solutions considered above.

In [22]:
clusters = [
    agg_clusters_fit, kmeans_clusters_fit, schc_clusters, skater_clusters,
    redcap_clusters, azp_sa_clusters, azp_schc_clusters, maxp_sa_clusters
]
labels = [
    agg_labels, kmeans_labels, schc_labels, skater_labels,
    redcap_labels, azp_sa_labels, azp_schc_labels, maxp_sa_labels
]
label_names = [
    'Hierarchical', 'K-Means', 'SCHC', 'SKATER',
    'REDCAP', 'AZP', 'AZP_Initial', 'Max-p'
]

results = []

# Run pygeoda.spatial_validation for each label set
for cluster, label, name in zip(clusters, labels, label_names):
    result = pygeoda.spatial_validation(ceara_g, label, queen_w)
    
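    # The fit information is stored under different keys depending on the
    # source object, so we try the long key names first and then the short ones.
    try:
        wss = np.round(cluster['Total within-cluster sum of squares'],4)
        bss_tss = np.round(cluster['The ratio of between to total sum of squares'],4)
    except Exception:
        try:
            wss = np.round(cluster["WSS"],2)
            bss_tss = np.round(cluster["Ratio"],2)
        except Exception:
            # No fit information available for this solution
            wss = None
            bss_tss = None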

    spatially_constrained = result.spatially_constrained
    all_join_count_ratio = np.round(result.all_joincount_ratio.ratio,4)
    entropy = np.round(result.fragmentation.entropy,4)
    simpson = np.round(result.fragmentation.simpson,4)
    
    results.append({
        'Method': name,
        'Spat. Const.': spatially_constrained,
        'WSS': wss,
        'BSS/TSS': bss_tss,
        'Join Count': all_join_count_ratio,
        'Entropy': entropy,
        'Simpson': simpson
    })

validation = pd.DataFrame(results)
print(validation.to_string(index=False))


      Method  Spat. Const.      WSS  BSS/TSS  Join Count  Entropy  Simpson
Hierarchical         False 349.0000   0.6800       0.212   2.3512   0.1063
     K-Means         False 335.7500   0.7000       0.210   2.3712   0.1015
        SCHC          True 568.0173   0.4827       0.668   1.6394   0.2730
      SKATER          True 604.2209   0.4497       0.784   1.3661   0.3901
      REDCAP          True 562.7003   0.4875       0.660   1.6109   0.2769
         AZP          True 617.0010   0.4381       0.526   1.9521   0.2016
 AZP_Initial          True 538.3477   0.5097       0.588   1.5991   0.3035
       Max-p          True 745.2474   0.3213       0.496   2.4175   0.0959


## External Validation Measures

External validation measures are designed to compare a cluster solution to a *truth*, but they can also be employed to compare several cluster solutions to each other. The validation indices reveal how close the cluster solutions are. We consider two measures, the *adjusted Rand index* and the *normalized information distance*.

### Adjusted Rand Index (ARI)

The adjusted Rand index is based on counting how many pairs of observations are in the same grouping for two cluster solutions. It can be computed by `sklearn.metrics.adjusted_rand_score`. The two arguments are numpy arrays of the labels of the *reference* (first argument) and the labels of the cluster solution to be compared.

We convert our `labels` solutions to numpy arrays before passing them to the function.

For example, the ARI between Ward's agglomerative solution and K-Means is found by passing numpy arrays for `agg_labels` and `kmeans_labels`. The result of 0.414 suggests only low correspondence.

In [23]:
ari = adjusted_rand_score(np.array(agg_labels),np.array(kmeans_labels))
print(np.round(ari,3))

0.414
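To show what the pair counting amounts to, here is a small standard-library sketch of the adjusted Rand index in its usual contingency-table form. It is meant as an illustration of the logic, not as a replacement for `sklearn.metrics.adjusted_rand_score`; the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index from counts of co-clustered pairs."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Pairs placed together in both solutions, and in each solution separately
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / total_pairs   # agreement expected by chance
    maximum = 0.5 * (same_a + same_b)          # largest attainable agreement
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)

# The index ignores how the clusters are numbered
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Splitting one cluster in two lowers the agreement
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]), 3))
```

The adjustment for chance is what distinguishes ARI from the plain Rand index: two unrelated solutions score close to zero rather than some positive baseline.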


We can now compute all pairwise indices with a simple loop. We first recreate the `labels` and `label_names` lists from above (so this can be run without the internal validation measures).

In [24]:
labels = [
    agg_labels, kmeans_labels, schc_labels, skater_labels,
    redcap_labels, azp_sa_labels, azp_schc_labels, maxp_sa_labels
]
label_names = [
    'Hierarchical', 'K-Means', 'SCHC', 'SKATER',
    'REDCAP', 'AZP', 'AZP_Initial', 'Max-p'
]

In a simple loop, we compute all pairwise indices and populate a matrix. This is then turned into a data frame and printed. Note that the matrix is symmetric and the diagonal values of 1.0 can be ignored.

In [25]:
h = len(labels)
allari = np.zeros((h,h))
for i in range(h):
    labi = np.array(labels[i])
    for j in range(h):
        labj = np.array(labels[j])
        allari[i,j] = adjusted_rand_score(labi,labj)
dfari = pd.DataFrame(allari, columns = label_names, index = label_names)
print(np.round(dfari,3))

              Hierarchical  K-Means   SCHC  SKATER  REDCAP    AZP  \
Hierarchical         1.000    0.414  0.131   0.079   0.145  0.160   
K-Means              0.414    1.000  0.151   0.073   0.155  0.166   
SCHC                 0.131    0.151  1.000   0.424   0.918  0.504   
SKATER               0.079    0.073  0.424   1.000   0.384  0.258   
REDCAP               0.145    0.155  0.918   0.384   1.000  0.516   
AZP                  0.160    0.166  0.504   0.258   0.516  1.000   
AZP_Initial          0.155    0.147  0.735   0.452   0.727  0.577   
Max-p                0.089    0.100  0.194   0.173   0.185  0.218   

              AZP_Initial  Max-p  
Hierarchical        0.155  0.089  
K-Means             0.147  0.100  
SCHC                0.735  0.194  
SKATER              0.452  0.173  
REDCAP              0.727  0.185  
AZP                 0.577  0.218  
AZP_Initial         1.000  0.163  
Max-p               0.163  1.000  


As in Chapter 12, the matrix reveals a much closer correspondence within the group of non-spatial solutions and within the group of spatial solutions than between the two groups. The greatest correspondence is between SCHC and REDCAP, with an ARI of 0.918.

### Normalized Information Distance (NID)

The second external validation measure is based on information-theoretic concepts, such as entropy. Chapter 12 introduces the normalized information distance (NID). A close counterpart can be computed by means of `sklearn.metrics.adjusted_mutual_info_score`, which takes the same arguments as the ARI function. However, the interpretation is reversed: NID as presented in Chapter 12 is a distance, so smaller values indicate closer similarity, whereas a higher adjusted mutual information score indicates closer similarity.

We first illustrate this for Ward's agglomerative clustering and K-Means, passing numpy arrays of `agg_labels` and `kmeans_labels`.

In [26]:
nid = adjusted_mutual_info_score(np.array(agg_labels),np.array(kmeans_labels))
print(np.round(nid,3))

0.601
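For readers who want a distance rather than a similarity score, the sketch below computes an NID directly from the labels. It assumes the common definition NID = 1 - I(U,V) / max(H(U), H(V)), with I the mutual information and H the entropy; this should be checked against the definition in Chapter 12. The function names are hypothetical, and the result is not expected to equal one minus the adjusted mutual information score, since the latter includes a correction for chance.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(labels_a, labels_b):
    """Mutual information between two labelings of the same observations."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )

def normalized_information_distance(labels_a, labels_b):
    """Assumed form: 1 - I(U,V) / max(H(U), H(V)); zero means identical."""
    h_max = max(entropy(labels_a), entropy(labels_b))
    if h_max == 0.0:
        return 0.0
    return 1.0 - mutual_information(labels_a, labels_b) / h_max

# Same partition with different numbering: distance close to zero
print(normalized_information_distance([0, 0, 1, 1], [1, 1, 0, 0]))
# Partitions that share no information: distance of one
print(normalized_information_distance([0, 0, 1, 1], [0, 1, 0, 1]))
```

It could be applied to the solutions in this notebook as, e.g., `normalized_information_distance(agg_labels, kmeans_labels)`.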


Finally, we run the same loop as for ARI to compute all pairwise adjusted mutual information scores, our counterpart to NID.

In [27]:
allnid = np.zeros((h,h))
for i in range(h):
    labi = np.array(labels[i])
    for j in range(h):
        labj = np.array(labels[j])
        allnid[i,j] = adjusted_mutual_info_score(labi,labj)
dfnid = pd.DataFrame(allnid, columns = label_names, index = label_names)
print(np.round(dfnid,3))

              Hierarchical  K-Means   SCHC  SKATER  REDCAP    AZP  \
Hierarchical         1.000    0.601  0.257   0.219   0.267  0.306   
K-Means              0.601    1.000  0.299   0.220   0.297  0.299   
SCHC                 0.257    0.299  1.000   0.548   0.897  0.581   
SKATER               0.219    0.220  0.548   1.000   0.496  0.425   
REDCAP               0.267    0.297  0.897   0.496   1.000  0.595   
AZP                  0.306    0.299  0.581   0.425   0.595  1.000   
AZP_Initial          0.317    0.320  0.727   0.545   0.727  0.646   
Max-p                0.177    0.207  0.388   0.423   0.372  0.445   

              AZP_Initial  Max-p  
Hierarchical        0.317  0.177  
K-Means             0.320  0.207  
SCHC                0.727  0.388  
SKATER              0.545  0.423  
REDCAP              0.727  0.372  
AZP                 0.646  0.445  
AZP_Initial         1.000  0.367  
Max-p               0.367  1.000  


As for ARI, we find the closest correspondence between SCHC and REDCAP.