# Data Exploration

This notebook describes the data exploration steps.

## Install dependencies

In [16]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Download the tree data

### WFS Typenames
A WFS endpoint can provide multiple datasets, each accessible by a typename.
In this case the WFS endpoint provides only a single dataset.

To access it, we first query the endpoint's `GetCapabilities` operation to obtain the typename of the dataset we want.

In [1]:
import requests
import geopandas as gpd
from xml.etree import ElementTree as ET

# Get the typename for the trees dataset

trees_url = "https://fbinter.stadt-berlin.de/fb/wfs/data/senstadt/s_wfs_baumbestand_an"
# Define the parameters for the GetCapabilities request
params = {"service": "WFS", "version": "2.0.0", "request": "GetCapabilities"}
# Send request to get the Capabilities document
response = requests.get(trees_url, params=params)

# Check if the request was successful
if response.status_code != 200:
    raise Exception(f"Request failed with status code: {response.status_code}")

# Parse the XML response
root = ET.fromstring(response.content)
# Find all FeatureType elements
namespaces = {"wfs": "http://www.opengis.net/wfs/2.0"}

for feature_type in root.findall(".//wfs:FeatureType", namespaces=namespaces):
    # Find the Name element within each FeatureType
    name = feature_type.find("wfs:Name", namespaces=namespaces)
    if name is not None:
        print(name.text)

fis:s_wfs_baumbestand_an


### WFS Data
With this typename we can now query the WFS endpoint to receive the data.

In [2]:
# Get the data for the trees dataset by its typename
typename = "fis:s_wfs_baumbestand_an"
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typenames": typename,
    "outputFormat": "application/geo+json",
}

response = requests.get(trees_url, params=params)

# Check if the request was successful
if response.status_code != 200:
    raise Exception(
        f"Request failed with status code: {response.status_code}: {response.text}"
    )

### Accessing the WFS API response
To work with the returned data we need to parse it into a useful format. For this we use GeoPandas, which provides a convenient way to work with geodata.
We then store the data as GeoJSON for caching and further processing.

In [3]:
import io

# Create a file-like object from the response content
data = io.BytesIO(response.content)
gpd_dataframe = gpd.read_file(data)
# save to geojson
gpd_dataframe.to_file("trees.geojson", driver="GeoJSON")

### Look at the first rows

In [4]:
gpd_dataframe.head()

Unnamed: 0,id,baumid,standortnr,kennzeich,namenr,art_dtsch,art_bot,gattung_deutsch,gattung,pflanzjahr,standalter,kronedurch,stammumfg,baumhoehe,bezirk,eigentuemer,geometry
0,00008100:000be4d0,00008100:000be4d0,66,411.232,Weigandufer,Eingriffliger Weissdorn,Crataegus monogyna,WEIßDORN,CRATAEGUS,1989,34.0,,60,,Neukölln,Land Berlin,POINT (394621.783 5815731.410)
1,00008100:000be4d2,00008100:000be4d2,59,411.232,Weigandufer,Hahnensporn-Weissdorn,Crataegus crus-galli,WEIßDORN,CRATAEGUS,1994,29.0,,37,,Neukölln,Land Berlin,POINT (394660.283 5815777.205)
2,00008100:000be4f2,00008100:000be4f2,62,411.232,Weigandufer,Pflaumenblättriger Weiss-Dorn,Crataegus prunifolia,WEIßDORN,CRATAEGUS,1987,36.0,,65,,Neukölln,Land Berlin,POINT (394643.319 5815756.320)
3,00008100:000bf296,00008100:000bf296,14,221.068,Roetepfuhl-Grünanlage,Gemeine Rosskastanie,Aesculus hippocastanum,ROSSKASTANIE,AESCULUS,1985,38.0,,128,,Neukölln,Land Berlin,POINT (393199.923 5811316.447)
4,00008100:000bf297,00008100:000bf297,13,221.068,Roetepfuhl-Grünanlage,Gemeine Rosskastanie,Aesculus hippocastanum,ROSSKASTANIE,AESCULUS,1985,38.0,,112,,Neukölln,Land Berlin,POINT (393205.164 5811322.008)


### Refactoring
To keep this notebook short and exploratory, we refactor the code into `pipeline.py`.
From here on we assume that `data/streets.geojson` and `data/trees.geojson` exist.

In [6]:
import os
# check if data/trees.geojson exists

if not os.path.exists("data/trees.geojson"):
    raise Exception("data/trees.geojson not found")
if not os.path.exists("data/streets.geojson"):
    raise Exception("data/streets.geojson not found")
# load data
streets_gdf = gpd.read_file("data/streets.geojson")
trees_gdf = gpd.read_file("data/trees.geojson")

### Data exploration
To make use of all the data we need to map each tree to a street.

First trials with naive approaches to map trees to streets were too slow to finish: comparing every tree with every street scales with the product of the two dataset sizes.
To speed up the mapping we use a spatial index (an R-tree) over the streets.
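
For comparison, the naive approach can be sketched on toy data as follows. This is a minimal illustration, not the code from the first trials: the helper names `point_segment_distance` and `nearest_street_naive` and the sample coordinates are hypothetical, and streets are assumed to be simple polylines given as lists of `(x, y)` tuples.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: a and b are the same point
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_street_naive(tree, streets):
    """Return the id of the street polyline closest to the tree point.

    Every street is checked for every tree, so mapping all trees costs
    (number of trees) x (number of street segments) distance computations.
    """
    best_id, best_dist = None, math.inf
    for street_id, coords in streets.items():
        dist = min(
            point_segment_distance(tree, a, b) for a, b in zip(coords, coords[1:])
        )
        if dist < best_dist:
            best_id, best_dist = street_id, dist
    return best_id


# Hypothetical toy data: two horizontal streets and two trees
streets = {
    "s1": [(0.0, 0.0), (10.0, 0.0)],
    "s2": [(0.0, 5.0), (10.0, 5.0)],
}
trees = {"t1": (2.0, 1.0), "t2": (7.0, 4.5)}

mapping = {tree_id: nearest_street_naive(pt, streets) for tree_id, pt in trees.items()}
print(mapping)  # {'t1': 's1', 't2': 's2'}
```

Because it computes exact distances, a brute-force check like this can still be useful on a small sample to spot-check the results of the spatial index, which only compares bounding boxes.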

In [12]:
from rtree import index

# Create an R-tree index of the streets
idx = index.Index()
for street_index, street in streets_gdf.iterrows():
    idx.insert(street_index, street["geometry"].bounds)

# Temporary map of tree index to street index
tree_index_street_index_map = {}

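length = len(trees_gdf)
# Report progress roughly every 1%; max() avoids a zero step for small datasets
step = max(1, length // 100)
percentile = 0
# For each tree, find the closest street
for tree_index, tree in trees_gdf.iterrows():
    # Keep track of progress because this takes a while
    if tree_index % step == 0:
        print(f"Done with {percentile}%")
        percentile += 1

    # Without objects=True, nearest() yields the ids that were inserted above.
    # The R-tree compares bounding boxes, so the match is approximate for long streets.
    closest_street_index = next(idx.nearest(tree.geometry.bounds, 1))
    tree_index_street_index_map[tree_index] = closest_street_index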

Done with 0%
Done with 1%
Done with 2%
...
Done with 71%
[output truncated]

In [13]:
# Now we have a map of tree INDEX to street INDEX, but for our analysis we need the tree ID and street ID

# Resulting Map of tree id to street id
tree_id_street_id_map = {}
# loop through all pairs and store a map with their IDs
for tree_index, street_index in tree_index_street_index_map.items():
    tree = trees_gdf.loc[tree_index]
    street = streets_gdf.loc[street_index]
    tree_id_street_id_map[tree["id"]] = street["id"]

In [14]:
import json
# finally store the map in a json file
with open("tree_id_street_id_map.json", "w") as f:
    json.dump(tree_id_street_id_map, f)
    
# and add the street id to the trees dataframe
trees_gdf["street_id"] = trees_gdf["id"].map(tree_id_street_id_map)

### Storing the data in a database
To make the data more accessible we store it in a database. GeoPandas provides a convenient way to do this, but storing geo data in a database is not trivial.
After a lot of trial and error I decided not to store the geometry data in the database.

In [15]:
import sqlite3

trees_gdf_noGeo = trees_gdf.drop(columns=["geometry"])
streets_gdf_noGeo = streets_gdf.drop(columns=["geometry"])

db_path = "data.sqlite"

with sqlite3.connect(db_path) as conn:
    trees_gdf_noGeo.to_sql("trees", conn, if_exists="replace", index=False)
    streets_gdf_noGeo.to_sql("streets", conn, if_exists="replace", index=False)
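
Once both tables exist, the `street_id` column makes it possible to join trees to streets in plain SQL. The following is a minimal sketch on an in-memory database with hypothetical toy rows. It assumes only the `id` and `street_id` columns used above, while the real tables contain many more columns.

```python
import sqlite3

# In-memory stand-in for data.sqlite, filled with hypothetical toy rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (id TEXT)")
conn.execute("CREATE TABLE trees (id TEXT, street_id TEXT)")
conn.executemany("INSERT INTO streets VALUES (?)", [("s1",), ("s2",)])
conn.executemany(
    "INSERT INTO trees VALUES (?, ?)",
    [("t1", "s1"), ("t2", "s1"), ("t3", "s2")],
)

# Count the trees mapped to each street; LEFT JOIN keeps streets without trees
rows = conn.execute(
    """
    SELECT s.id, COUNT(t.id) AS tree_count
    FROM streets AS s
    LEFT JOIN trees AS t ON t.street_id = s.id
    GROUP BY s.id
    ORDER BY tree_count DESC, s.id
    """
).fetchall()
conn.close()

print(rows)  # [('s1', 2), ('s2', 1)]
```

The same query should work against `data.sqlite` by replacing `":memory:"` with `db_path` and skipping the table setup.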