# Analysis of the distribution of shape statistics

This notebook uses indices that capture the relationship between shape metrics and the area of polygons across all functional urban areas (FUAs). It identifies peaks and valleys in the distribution of each index using kernel density estimation (KDE) and assesses how well each shape metric distinguishes between face polygons and face artifacts.

## Whole dataset

In [1]:
import os

# must be set before geopandas is imported so that Shapely is used instead of PyGEOS
os.environ["USE_PYGEOS"] = "0"

import pickle
from collections import Counter

import geopandas
import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn as sns
from palettable.cartocolors.qualitative import Bold_6
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde



Set default plotting theme.

In [2]:
sns.set_theme(
    context="paper",
    style="ticks",
    rc={
        "patch.force_edgecolor": False,
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.grid": True,
    },
    palette=Bold_6.hex_colors,
)

Load the data and combine it into a single GeoDataFrame.

In [4]:
sample = geopandas.read_parquet("../data/sample.parquet")

all_poly = []
for _, row in sample.iterrows():
    # read the polygons of a single FUA and attach its metadata
    fua = geopandas.read_parquet(f"../data/{int(row.eFUA_ID)}/polygons/")
    fua["continent"] = row.continent
    fua["country"] = row.Cntry_name
    fua["name"] = row.eFUA_name
    # drop the CRS before concatenating the FUAs
    fua.crs = None
    all_poly.append(fua)
all_poly_data = pandas.concat(all_poly).reset_index(drop=True)


Set colors for continents.

In [5]:
# Map each continent to a fixed index into the seaborn palette so that
# the same continent always gets the same color
# (numpy.unique returns the continents in sorted order).
continents = numpy.unique(all_poly_data["continent"])
coldict = {cont: i for i, cont in enumerate(continents)}
# to access a color:
# sns.color_palette(n_colors=6)[coldict["Africa"]]
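One possible way to make the lookup more direct, sketched here under assumptions, is to map each continent name straight to a color instead of to a palette index. The continent labels and hex codes below are illustrative placeholders rather than the notebook's data, and `continent_colors` is a hypothetical name.

```python
# Illustrative continent labels and hex colors (assumptions, not the notebook's data)
continent_column = ["Europe", "Africa", "Asia", "Africa", "Europe"]
palette = ["#7F3C8D", "#11A579", "#3969AC"]

# Sorting the unique labels keeps the continent-to-color assignment stable between runs
continents = sorted(set(continent_column))
continent_colors = dict(zip(continents, palette))
```

With such a mapping, a plot call could pass `continent_colors["Africa"]` directly as its color argument.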

In [6]:
# Get a list of cities and a list of options.

cities = numpy.unique(all_poly_data.name)

options = [
    "circular_compactness_index",
    "isoperimetric_quotient_index",
    "isoareal_quotient_index",
    "radii_ratio_index",
    "diameter_ratio_index",
]

Now,
- Find peaks in frequency distribution of each `option` for each `city`
- Plot all options for each city for comparison
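The detection step listed above can be tried on synthetic data before running it on every city. The following is a minimal sketch, assuming a bimodal sample stands in for one log-transformed shape index. The sample, the random seed, and the names `grid`, `valleys` and `deepest` are assumptions made for illustration. The bandwidth method and the `find_peaks` arguments follow the ones used in this notebook.

```python
import numpy
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# Synthetic bimodal sample (assumption: stands in for one log-transformed index)
rng = numpy.random.default_rng(0)
data = numpy.concatenate([rng.normal(-1.0, 0.5, 2000), rng.normal(1.0, 0.5, 2000)])

# Evaluation grid with 10x fewer points than observations
grid = numpy.linspace(data.min(), data.max(), len(data) // 10)

# Gaussian KDE with Silverman's bandwidth
kde = gaussian_kde(data, bw_method="silverman")
pdf = kde.pdf(grid)

# Valleys of the density are peaks of the inverted curve
valleys, props = find_peaks(-pdf + 1, height=(0, 0.995), prominence=0.0005, width=1)

# Grid value of the most prominent valley
deepest = grid[valleys[numpy.argmax(props["prominences"])]]
```

Note that `find_peaks` returns positions as indices into `pdf`, so they need to be looked up in `grid` to obtain values on the scale of the data.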

In [160]:
# LINSPACE ADJUSTED

# Fit a Gaussian kernel density estimate using Silverman's bandwidth method.
# Find minima (valleys) in the estimated probability density function
# by searching for peaks in the inverted curve.
# Parameters are kept as unrestrictive as possible.
# The linspace is adjusted to the number of data points in the input data.

# initiate dict to store results
results_linspace_adjusted = {}

for city in cities:

    print("finding peaks for", city)

    # to store results
    results_linspace_adjusted[city] = {}

    # initiate plot
    fig, ax = plt.subplots(1,5,figsize = (20,5))
    
    # find kde pdf and peaks for each of the options
    for i, option in enumerate(options):
        
        results_linspace_adjusted[city][option] = {}
        
        # get data
        fua = all_poly_data[all_poly_data.name == city]
        data = numpy.log(fua[option])

        # adjust linspace to be sparser than the observations (10x fewer points than the original data)
        n = int(len(fua)/10)
        mylinspace = numpy.linspace(data.min(), data.max(), n)

        # fit Gaussian KDE
        kde = gaussian_kde(data, bw_method="silverman")
        pdf = kde.pdf(mylinspace)

        # find minima of the pdf as peaks of the inverted curve (-pdf + 1)
        peaks, d = find_peaks(
            x = -pdf +1,
            height = (0,.995),
            threshold = None,
            distance = None,
            prominence = 0.0005,
            width = 1,
            plateau_size = None)
        
        # store results
        results_linspace_adjusted[city][option]["mylinspace"] = mylinspace
        results_linspace_adjusted[city][option]["pdf"] = pdf
        results_linspace_adjusted[city][option]["peaks"] = peaks
        results_linspace_adjusted[city][option]["d"] = d

        # add subplot

        ax[i].plot(pdf, color = "grey")
        ax[i].scatter(
            x = peaks, 
            y = pdf[peaks], 
            color = sns.color_palette(n_colors = 6)[coldict[numpy.max(fua["continent"])]], 
            s = 8, 
            alpha = 0.7);
        ax[i].set_xlabel(option)
        ax[i].set_ylim([0,1])
    
    plt.suptitle(city)

    # store plot for this city
    fig.savefig(f"../results/linspace_adjusted/allmins_{city}.png", dpi = 400)
    plt.close()

finding peaks for Abbottabad
finding peaks for Abidjan
finding peaks for Abuja
finding peaks for Accra
finding peaks for Addis Ababa
finding peaks for Adelaide
finding peaks for Agadir
finding peaks for Agra
finding peaks for Al-Zaqaziq‎
finding peaks for Aleppo
finding peaks for Amaigbo
finding peaks for Amsterdam
finding peaks for Athens
finding peaks for Auckland
finding peaks for Barcelona
finding peaks for Basra
finding peaks for Belgrade
finding peaks for Belo Horizonte
finding peaks for Belém
finding peaks for Brisbane
finding peaks for Bucaramanga
finding peaks for Buenos Aires
finding peaks for Cali
finding peaks for Cape Town
finding peaks for Cardiff
finding peaks for Chelyabinsk
finding peaks for Chongqing
finding peaks for Cincinnati
finding peaks for Cochabamba
finding peaks for Cologne
finding peaks for Comilla
finding peaks for Conakry
finding peaks for Curitiba
finding peaks for Dallas
finding peaks for Dhaka
finding peaks for Dortmund
finding peaks for Douala
finding 

In [162]:
# save results as pickle
with open('../results/results_linspace_adjusted.pickle', 'wb') as handle:
    pickle.dump(results_linspace_adjusted, handle, protocol=pickle.HIGHEST_PROTOCOL)
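Reading the stored results back in a later session follows the same pattern in reverse. This is a minimal sketch that writes to a temporary directory instead of `../results/`. The miniature `results` dictionary is a hypothetical stand-in for the nested city, option, values structure, and its single value is made up.

```python
import os
import pickle
import tempfile

# Hypothetical miniature of the nested results dictionary (city -> option -> values)
results = {"Amsterdam": {"circular_compactness_index": {"peaks": [42]}}}

# Write to a temporary location (assumption: replaces the ../results/ path)
path = os.path.join(tempfile.mkdtemp(), "results.pickle")
with open(path, "wb") as handle:
    pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load the dictionary back from disk
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```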

Evaluating which option finds the most banana minima:

In [161]:
for option in options:
    print(option)
    print(Counter([len(results_linspace_adjusted[city][option]["peaks"]) for city in cities]))

circular_compactness_index
Counter({1: 113, 0: 12, 2: 6})
isoperimetric_quotient_index
Counter({1: 102, 0: 21, 2: 7, 3: 1})
isoareal_quotient_index
Counter({1: 75, 0: 45, 2: 11})
radii_ratio_index
Counter({1: 81, 0: 41, 2: 9})
diameter_ratio_index
Counter({1: 100, 0: 19, 2: 12})
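To make the counts easier to compare, they can be reduced to one number per option. The sketch below copies the counters printed above and computes the share of cities with exactly one detected minimum. Using that share as the ranking criterion is an assumption made for illustration, and `share_single` and `best` are hypothetical names.

```python
from collections import Counter

# Counters copied from the output above (number of minima -> number of cities)
counts = {
    "circular_compactness_index": Counter({1: 113, 0: 12, 2: 6}),
    "isoperimetric_quotient_index": Counter({1: 102, 0: 21, 2: 7, 3: 1}),
    "isoareal_quotient_index": Counter({1: 75, 0: 45, 2: 11}),
    "radii_ratio_index": Counter({1: 81, 0: 41, 2: 9}),
    "diameter_ratio_index": Counter({1: 100, 0: 19, 2: 12}),
}

# Share of cities for which exactly one minimum was found
share_single = {option: c[1] / sum(c.values()) for option, c in counts.items()}

# Option with the highest share under this (assumed) criterion
best = max(share_single, key=share_single.get)
```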


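**Conclusion: the circular compactness index performs best. It yields exactly one minimum for 113 of the 131 cities and finds no minimum for only 12, fewer than any other index.**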