# Methodological Foundation of a Numerical Taxonomy of Urban Form

## Reproducible Python code for validation

This notebook contains the code used to validate the clustering against additional data sources.

Validation files are expected to contain polygon geometries with an attribute column representing the target variable.

The reproducible computational environment can be created from the Docker image `darribas/gds_py:5.0`.

The same code has been used to analyse all cases.

In [None]:
import pandas as pd
import geopandas as gpd
import scipy.stats as ss
import numpy as np

We load all data and perform a spatial join based on building centroids.

In [None]:
clusters = pd.read_csv('files/200218_clusters_complete_n20.csv', index_col=0)  # cluster labels

In [None]:
validation = gpd.read_file("validation_file_path")  # validation data

In [None]:
buildings = gpd.read_file('files/geometry.gpkg', layer='buildings')  # building geometry

In [None]:
buildings['cent'] = buildings.centroid
buildings = buildings.set_geometry('cent')

In [None]:
buildings = buildings.to_crs(validation.crs)

In [None]:
joined = gpd.sjoin(buildings, validation, how='left')

In [None]:
joined = joined.merge(clusters, how='left', on='uID')

In [None]:
joined = joined.set_geometry('geometry')
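
Before cross-tabulating, it may be worth checking how the join behaved. With `how='left'`, a building whose centroid falls outside every validation polygon keeps a missing target value, and a centroid covered by overlapping polygons produces repeated rows. The sketch below is one possible check, not part of the original workflow: `join_report` and `toy_joined` are hypothetical names, and the toy frame only stands in for `joined`, reusing the `uID`, `cluster` and `validation_data` column names from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for `joined`; the values are made up for illustration.
toy_joined = pd.DataFrame({
    "uID": [1, 2, 2, 3, 4],
    "cluster": [0, 1, 1, None, 2],
    "validation_data": ["a", "b", "c", "a", None],
})


def join_report(df, id_col="uID", label_col="cluster", target_col="validation_data"):
    """Count rows that may need attention after the spatial join and merge."""
    return {
        "rows": len(df),
        # the same building matched by more than one validation polygon
        "duplicated_ids": int(df[id_col].duplicated().sum()),
        # buildings without a cluster label after the merge
        "missing_labels": int(df[label_col].isna().sum()),
        # buildings whose centroid fell outside all validation polygons
        "missing_target": int(df[target_col].isna().sum()),
    }


report = join_report(toy_joined)
print(report)
```

On the real data the same function could be called as `join_report(joined)`, assuming the target column is named as in the cells of this notebook.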

The resulting DataFrame contains attribute columns with the cluster labels and the target variable. We can now measure Cramér's V and the chi-squared statistic.

In [None]:
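def cramers_v(x, y):
    """Return the bias-corrected Cramér's V for two categorical series."""
    confusion_matrix = pd.crosstab(x, y)  # contingency table of observed counts
    chi2 = ss.chi2_contingency(confusion_matrix)[0]  # chi-squared statistic
    n = confusion_matrix.sum().sum()  # total number of observations
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # bias correction of phi2 and of the table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))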
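
Read term by term, the function above computes the quantities below, where $\chi^2$ is the chi-squared statistic of the $r \times k$ contingency table and $n$ is the total number of observations. This is a transcription of the code into notation (the tilde symbols are introduced here for the corrected values), not an independent derivation:

$$
\tilde{\varphi}^2 = \max\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}
$$

$$
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
$$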

In [None]:
cramers_v(joined.cluster, joined["validation_data"])
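
To get a feel for the range of the statistic, the same arithmetic can be reproduced on toy tables without any of the notebook's data. The sketch below uses only the standard library. `chi2_statistic`, `corrected_cramers_v` and both tables are hypothetical and exist only for illustration. It computes the plain Pearson statistic, so it should agree with `ss.chi2_contingency` for tables larger than 2×2; for 2×2 tables SciPy applies Yates' continuity correction by default, which this sketch does not.

```python
import math


def chi2_statistic(table):
    """Pearson chi-squared statistic and total count for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def corrected_cramers_v(table):
    """Same corrections as `cramers_v`, applied to a plain nested list."""
    chi2, n = chi2_statistic(table)
    r, k = len(table), len(table[0])
    phi2corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Made-up 3x3 tables: rows play the role of clusters, columns of target classes.
perfect = [[30, 0, 0], [0, 30, 0], [0, 0, 30]]            # each cluster maps to one class
independent = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]  # clusters carry no information

v_perfect = corrected_cramers_v(perfect)
v_independent = corrected_cramers_v(independent)
print(v_perfect, v_independent)
```

The perfectly aligned table gives a value at the top of the range and the uniform table gives zero, which brackets the values one would expect from real validation data.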

In [None]:
confusion_matrix = pd.crosstab(joined.cluster, joined["validation_data"])
chi, p, dof, exp = ss.chi2_contingency(confusion_matrix)
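
The four values returned by `ss.chi2_contingency` are the statistic, the p-value, the degrees of freedom and the array of expected frequencies. A common rule of thumb treats the chi-squared approximation with caution when many expected frequencies fall below 5. The sketch below shows one way to inspect this on a made-up frame; `toy`, `table` and `share_small` are hypothetical names, and the threshold is an assumption rather than something set by the original analysis.

```python
import pandas as pd
import scipy.stats as ss

# Made-up labels standing in for `joined.cluster` and `joined["validation_data"]`.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    "validation_data": ["a", "a", "b", "b", "b", "a", "c", "c", "c", "a"],
})

table = pd.crosstab(toy["cluster"], toy["validation_data"])
chi, p, dof, expected = ss.chi2_contingency(table)

# Share of cells whose expected frequency is below the rule-of-thumb threshold.
share_small = (expected < 5).mean()

print(f"chi2={chi:.3f}, p={p:.3f}, dof={dof}, share of small cells={share_small:.2f}")
```

With only ten observations every expected frequency here is small, so the toy p-value should not be read as meaningful; the point is only how the returned values can be inspected.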