# Data preparation

## Methodology

For each postal code:

1. Manually classify buildings into types
1. Calculate the total area of all objects
1. Group buildings by type and, for each type (residential, industrial, etc.), compute:
    1. Rectangularity (area of the polygon / area of the polygon's minimum bounding box)
    1. Total area
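
The rectangularity ratio above can be sketched in plain Python. This is a minimal illustration, not the code in `data_preparation`: the function names are hypothetical, "minimum bounding box" is assumed to mean the smallest-area rotated rectangle, and the input is assumed to be planar (projected) coordinates rather than raw lat/lon.

```python
import math

def polygon_area(points):
    """Shoelace formula for a simple polygon given as (x, y) tuples."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # Drop points that would make a clockwise or collinear turn
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half_hull(pts) + half_hull(reversed(pts))

def min_bounding_box_area(points):
    """Smallest enclosing rectangle; one of its sides lies along a hull edge."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    best = math.inf
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        # Project hull vertices onto the edge direction and its normal
        along = [px * ux + py * uy for px, py in hull]
        across = [py * ux - px * uy for px, py in hull]
        best = min(best, (max(along) - min(along)) * (max(across) - min(across)))
    return best

def rectangularity(points):
    """Polygon area divided by the area of its minimum bounding box."""
    box_area = min_bounding_box_area(points)
    return polygon_area(points) / box_area if box_area > 0 else 0.0
```

With this definition a rectangle scores 1 regardless of its orientation, while an L-shaped footprint scores lower because part of its bounding box is empty.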

# Initialization

In [1]:
import pandas as pd
import numpy as np
import sys

from datetime import datetime
timestamp = datetime.now().strftime("%d%m%y")

## Load custom modules

In [2]:
pkg_path = '../src/'

sys.path.append(pkg_path)
import data_preparation as dp
import graphics as gp

In [3]:
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)
# pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

## Read data

In [4]:
plz_csv = '../data/01_raw/plz_einwohner.csv'

In [5]:
# Postal code (PLZ) list with population data (2011)
plz_de = pd.read_csv(plz_csv,
                    dtype={'plz': str, 'einwohner': int}) # specify column types

In [6]:
# List the PLZs whose building data has already been crawled
import os
name_list = os.listdir('../data/01_raw/buildings_plz/')

done_plz = [x.split('.')[0].split('_')[1] for x in name_list if 'buildings' in x]
print(f'Completed crawling for {len(done_plz)} PLZs')

Completed crawling for 8165 PLZs


In [7]:
plz_de = plz_de[~plz_de.plz.isin(done_plz)].reset_index(drop=True)

1. Manually classify buildings into 7 categories: residential, public, commercial, accessory:storage, accessory:supply, industrial and other.
1. Calculate the surface area of every building (floor area * building levels).
1. Extract the "residential" and "other" types into a separate dataset for ML modeling (e.g. predicting building types).
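
The three steps above can be sketched without pandas as follows. This is only an illustration: the tag-to-category mapping, the function names and the sample records are assumptions made up for the example, and the actual rules live in `dp.manual_classify_building` and `dp.get_total_area`.

```python
import math

# Hypothetical mapping from OSM `building` tag values to the 7 categories.
# Illustrative only; the real rules are in dp.manual_classify_building.
CATEGORY_BY_TAG = {
    "house": "residential", "apartments": "residential", "residential": "residential",
    "school": "public", "hospital": "public",
    "retail": "commercial", "office": "commercial",
    "garage": "accessory:storage", "shed": "accessory:storage",
    "transformer_tower": "accessory:supply",
    "industrial": "industrial", "warehouse": "industrial",
}

def classify_building(tag):
    """Return the category for a tag; unknown or missing tags become 'other'."""
    if not isinstance(tag, str):
        return "other"
    return CATEGORY_BY_TAG.get(tag.strip().lower(), "other")

def parse_levels(value, default=1.0):
    """Convert a raw levels value to a number, falling back to `default`."""
    try:
        levels = float(value)
    except (TypeError, ValueError):
        return default
    return levels if math.isfinite(levels) and levels > 0 else default

def surface_area(floor_area, levels):
    """Surface area = floor area * building levels."""
    return floor_area * parse_levels(levels)

# Made-up sample records standing in for rows of the buildings table
buildings = [
    {"tag": "house", "floor_area": 120.0, "levels": "2"},
    {"tag": "yes", "floor_area": 80.0, "levels": None},
    {"tag": "warehouse", "floor_area": 500.0, "levels": "1"},
]
for b in buildings:
    b["building_type"] = classify_building(b["tag"])
    b["total_area"] = surface_area(b["floor_area"], b["levels"])

# Keep only the types intended for the ML dataset
ml_subset = [b for b in buildings if b["building_type"] in ("residential", "other")]
```

Treating a missing or unparsable level count as one level mirrors the `fillna(1)` default used when the building file is loaded.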

In [8]:
plz = done_plz[10]

In [9]:
plz_path = f'../data/01_raw/buildings_plz/buildings_{plz}.csv'
plz_path

'../data/01_raw/buildings_plz/buildings_01157.csv'

In [10]:
df = pd.read_csv(plz_path,
                 dtype={'tags.addr:suburb': 'object',
                        'tags.building:levels': 'object',
                        'tags.source' :str,
                        'tags.addr:postcode':str},
                converters={"nodes": lambda x: x.strip("[]").split(", ")}) # read column as list

# Remove elements without coordinates (no lat/lon)
df = df[df['center.lat'].notna()].reset_index(drop=True)
# Replace missing building levels with 1
df['tags.building:levels'] = df['tags.building:levels'].fillna(1)

# Classify into building types
df['building_types'] = df['tags.building'].apply(dp.manual_classify_building)

In [20]:
idx = 0
# df['total_area'] = None

In [22]:
# Calculate total area of the building

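# Collect the results in a list and assign the column once; writing through
# df['total_area'].iloc[idx] is chained assignment and may not update df
total_areas = []
for idx, (nodes, levels) in enumerate(zip(df['nodes'], df['tags.building:levels'])):
    total_areas.append(dp.get_total_area(nodes, levels))
    if idx % 100 == 0:
        print(f'{idx}/{len(df)}')
df['total_area'] = total_areas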

AttributeError: 'float' object has no attribute 'isna'

In [23]:
df[df.total_area == 0].shape

(276, 15)

In [24]:
gp.plot_buildings_plz(df, plz)