# 3_Building footprints classification

Notebook for the third pipeline: building footprint classification. In this notebook, we test the following 4-step process:
1. Generate additional features for clustering (rectangularity, polygon turning functions, proximity matrix)
1. Apply Tobler's first law of geography to footprint clustering (objects that are close together tend to share the same function; near things are more related than distant things)
    1. Group footprints into building blocks by proximity (DBSCAN)
    1. Apply the geography law: find footprints with similar shape (turning function, rectangularity) and size (total_area), then assign the majority type
        1. Within the same building block
        1. Within the same area
    1. Apply the statistical analysis results to categorize the left-over footprints
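
As an illustration of the first step, the sketch below computes one of the shape features named above. It assumes that rectangularity means the ratio of a footprint's area to the area of its minimum-area bounding rectangle, which is a common definition but is not confirmed by this notebook. The function names are hypothetical, the code uses only the standard library, and footprints are assumed to be simple polygons given as lists of `(x, y)` vertices in a projected coordinate system.

```python
import math

def polygon_area(pts):
    """Shoelace formula; pts are (x, y) vertices without a repeated closing point."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_bounding_rect_area(pts):
    """Area of the smallest enclosing rectangle (any orientation).

    The optimal rectangle has one side collinear with a convex hull edge,
    so it is enough to try the direction of every hull edge.
    """
    hull = convex_hull(pts)
    best = math.inf
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        along = [x * ux + y * uy for x, y in hull]      # projection on the edge
        across = [-x * uy + y * ux for x, y in hull]    # projection on its normal
        area = (max(along) - min(along)) * (max(across) - min(across))
        best = min(best, area)
    return best

def rectangularity(pts):
    """Assumed definition: polygon area / minimum bounding rectangle area."""
    return polygon_area(pts) / min_bounding_rect_area(pts)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(round(rectangularity(square), 3))   # 1.0
print(round(rectangularity(l_shape), 3))  # 0.75
```

Values close to 1 indicate box-like footprints, while L-shaped or irregular footprints score lower. Degenerate inputs (fewer than three distinct, non-collinear vertices) are not handled in this sketch.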

## Initialization

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)


In [2]:
import pandas as pd
import numpy as np
import sys
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

import igraph

from geopandas import GeoDataFrame
from pyrosm import OSM

In [3]:
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [4]:
# Self-made modules
import helpers as hp
import gemeindeverz

In [5]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

## Inputs
Define the input paths.

In [12]:
buildings_int_path = '../data/02_intermediate/buildings_data/'
plz_ags_csv = '../data/01_raw/zuordnung_plz_ort_landkreis.csv'

# Demographics
ags_living_csv = '../data/01_raw/de_living_2019.csv'
ags_population_csv = '../data/01_raw/de_population_2019.csv'

# Separation by population density (rural / suburban / urban)
# https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html

ags_urban_rural_csv = '../data/01_raw/de_rural_urban_2019.csv'

Demographics data

In [15]:
ags_living = pd.read_csv(ags_living_csv, 
                         sep = ';', 
                         encoding = 'cp1250', 
                         dtype= {'1_Auspraegung_Code':str},
                         low_memory = False)
ags_population = pd.read_csv(ags_population_csv, 
                             sep = ';', 
                             dtype= {'1_Auspraegung_Code':str},
                             encoding = 'cp1250', 
                             low_memory = False)


Geographic data

In [8]:
# Contains local AGS codes only (no regional ones)
plz_ags = pd.read_csv(plz_ags_csv, dtype= {'plz': str, 'ags': str})

In [16]:
ags_rural_urban = pd.read_csv(ags_urban_rural_csv,
                             sep = ';',
                             dtype = {'AGS':str})
ags_rural_urban.head()

| | AGS | Area type code | Description |
|---|---|---|---|
| 0 | 1001000 | 1 | dicht besiedelt |
| 1 | 1002000 | 1 | dicht besiedelt |
| 2 | 1003000 | 1 | dicht besiedelt |
| 3 | 1004000 | 1 | dicht besiedelt |
| 4 | 1051001 | 3 | gering besiedelt |
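
The AGS values shown in this notebook have seven digits, while an official AGS key has eight, so the leading zero of the state code appears to have been dropped somewhere upstream. If other tables carry the full eight-digit key, joins on AGS would silently miss rows. A minimal sketch of a normalization step follows; `normalize_ags` is a hypothetical helper that is not part of this notebook or its `helpers` module.

```python
def normalize_ags(code):
    """Return an AGS key as an 8-character, zero-padded string."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 8:
        raise ValueError(f"not a valid AGS key: {code!r}")
    return text.zfill(8)

print(normalize_ags("1001000"))  # 01001000
print(normalize_ags(9184123))    # 09184123
```

On a pandas string column the same padding can be applied with `Series.str.zfill(8)`.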


In [10]:
# Community directory dataframe
GV_path = '../data/01_raw/GV/GV100AD_301120.asc'

# Use this file to manually look up the AGS code of a region that is available on Geofabrik (inside a state)
com_dir_df = gemeindeverz.einlesen(GV_path)

In [11]:
com_dir_df[com_dir_df.plz == '85540']

| | satzart | stand | ags | gemeinde_verb | gemeinde_bez | schluesselfelder | flaeche_ha | bevoelkerung_ges | bevoelkerung_maennl | plz | plz_eindeutig | finanzamts_bezirk | gerichtsbarkeit | arbeitsagentur_bezirk | bundestagswahlkreise_von | bundestagswahlkreise_bis | bemerkungen | ars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9232 | 60 | 2020-11-30T00:00:00.000000000 | 9184123 | 123 | Haar | 64.0 | 1290.0 | 21476.0 | 10570.0 | 85540 | False | 9147 | 2601 | 84301 | | | | 91840123123 |


## Building blocks segmentation (DBSCAN)

In the paper ["Proximity-based grouping of buildings in urban blocks"](https://www.researchgate.net/publication/271901065_Proximity-based_grouping_of_buildings_in_urban_blocks_A_comparison_of_four_algorithms), the authors used two different approaches to evaluate four algorithms for clustering buildings into urban blocks. They conclude that DBSCAN (density-based spatial clustering of applications with noise) and ASCDT (an adaptive spatial clustering algorithm based on Delaunay triangulation) performed best, and that neither is hard to implement. Thus, in this project, I implemented DBSCAN to cluster our OSM footprints into segments.

### DBSCAN

> It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

More details can be found in this [nice article](https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html) from KDnuggets.

### Algorithm inputs
For our project, we need to generate a **proximity matrix** for all building footprints in the area. Since we have already gathered and cleaned building object data from ~10k municipalities in Germany (refer to the previous article), it is better to keep the building block grouping at the same level of granularity, i.e. the municipality level (AGS key).
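
To make the idea of a proximity matrix concrete, here is a minimal sketch that builds a symmetric matrix of pairwise distances. It assumes each footprint is represented by its centroid and that coordinates are in a projected (planar) coordinate system; the notebook may instead measure the distance between polygon outlines. `proximity_matrix` is a hypothetical name used for illustration only.

```python
import math

def proximity_matrix(centroids):
    """Pairwise Euclidean distances between (x, y) centroids as a list of lists."""
    n = len(centroids)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(centroids[i], centroids[j])
            matrix[i][j] = matrix[j][i] = d  # distances are symmetric
    return matrix

# Three hypothetical footprint centroids
centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = proximity_matrix(centroids)
for row in matrix:
    print(row)
```

A full matrix needs memory proportional to the square of the number of footprints, which is one practical reason to build it per municipality rather than for the whole country at once.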

DBSCAN has two parameters that we need to set: *Epsilon* and *MinPts*. We start with the baseline from the paper, *Epsilon* = 3 and *MinPts* = 2, then try to optimize them by splitting municipalities into **urban and rural**.
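
The role of the two parameters can be seen in a small, self-contained sketch of DBSCAN running on a precomputed distance matrix. This is a plain-Python illustration, not the implementation used in this project; the toy distances are made up, and the unit of *Epsilon* is whatever unit the matrix uses. As in the original formulation, a point counts as its own neighbour when checking *MinPts*.

```python
def dbscan(dist, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    dist is a symmetric square distance matrix given as a list of lists.
    """
    n = len(dist)
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * n

    def neighbours(i):
        # All points within eps of point i, including i itself
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nearby)
    return labels

# Toy example: six footprints placed along a line at these positions
positions = [0, 2, 4, 20, 22, 50]
dist = [[abs(a - b) for b in positions] for a in positions]
print(dbscan(dist, eps=3, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```

With *Epsilon* = 3 and *MinPts* = 2, the first three footprints chain into one block, the next two form a second block, and the isolated footprint is marked as noise. Raising *Epsilon* merges blocks, and raising *MinPts* turns small groups into noise, which is why the two settings may need to differ between dense and sparse municipalities.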

### References
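Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231. [online] Available at: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.121.9220 [Accessed 2 Jan. 2021].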