# Sign processing

This notebook starts by fetching observations from the layer defined on the WFS service. The main idea is to convert the SPSS syntax (`.sps`) scripts to Python code, which can then be executed directly via GitHub and/or run without having SPSS installed.
The notebook requires `pandas`, as per the instructions in the requirements file.

In [None]:
from owslib.wfs import WebFeatureService
from datetime import datetime
import time
import json
import requests

import numpy as np
import pandas as pd
import pyproj
import folium
from pandas_geojson import to_geojson, write_geojson
from datetime import date

## Configuration
Configuration variables are defined here. This is only temporary, since this code will eventually be converted to scripts.

In [None]:
wfs_url = 'https://opendata.apps.mow.vlaanderen.be/opendata-geoserver/awv/wfs'
vb_type_name = "awv:Verkeersborden.Vlaanderen_Borden"

# Configuration
# Output file where the raw WFS results are stored
feature_output_file = "../python_output/feature_output.csv"
# Output file where the processed signs are stored as CSV
signs_csv_output_file = "../python_output/signs_output.csv"
# Output file where the GeoJSON result is stored
geojson_output_file = "../python_output/geojson_output.json"
# Date of the previous processing run, used to filter out already processed data
previous_processed_date = "2022-07-31"
# Traffic sign metadata, joined to the features by bordcode
traffic_signs_info = "../find-interesting-signs/road_signs_cleaned.csv"

## Fetch number of features
Fetch the total number of features in the required layer from the WFS service. This count is used later on to request all of the features in one query.
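
As a standalone illustration of the counting step, the sketch below builds the query parameters and reads `totalFeatures` from a response body without touching the network. The helper names `build_count_params` and `parse_total_features` are hypothetical, and the sample payload is made up; only the `totalFeatures` key is taken from the code in this notebook.

```python
import json

def build_count_params(feature_type):
    """Query parameters that request a single feature as JSON.

    Only one feature is requested because the value of interest is the
    'totalFeatures' entry of the response, not the feature itself.
    """
    return {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typename": feature_type,
        "outputFormat": "json",
        "count": 1,
    }

def parse_total_features(payload):
    """Extract the layer-wide feature count from a JSON response body."""
    return int(json.loads(payload)["totalFeatures"])

# Made-up payload, for illustration only; a real response carries more fields.
sample_payload = '{"type": "FeatureCollection", "totalFeatures": 12345, "features": []}'
total = parse_total_features(sample_payload)
print(total)
```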

In [None]:
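def get_total_features_by_type(feature_type):
    """Return the number of features the WFS reports for `feature_type`."""
    # Ask for a single feature: the JSON response still carries the layer total.
    response = requests.get(wfs_url, params={
        'service': 'WFS',
        'version': '2.0.0',
        'request': 'GetFeature',
        'typename': feature_type,
        'outputFormat': 'json',
        'count': 1
    })
    # Fail early on HTTP errors instead of parsing an error page as JSON.
    response.raise_for_status()
    return response.json()['totalFeatures']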

## Obtain and store the signs
Fetch the data from the WFS service, remove line breaks and store it in the configured CSV file.
The capabilities of the service are defined at https://opendata.apps.mow.vlaanderen.be/opendata-geoserver/awv/wfs?version=2.0.0&request=GetCapabilities; by default `GetFeature` has a count of 10M features. Fetching the whole dataset from the WFS service sometimes fails. The URL used to fetch the features is https://opendata.apps.mow.vlaanderen.be/opendata-geoserver/awv/wfs?version=2.0.0&service=WFS&version=2.0.0&request=GetFeature&typeName=awv%3AVerkeersborden.Vlaanderen_Borden&outputFormat=csv and the code below retries until the saved CSV contains the required number of features from the WFS service.

The exception below indicates that the underlying datastore was updated while data was being pulled from the WFS server.

```
<ows:Exception exceptionCode="NoApplicableCode">
<ows:ExceptionText>java.lang.RuntimeException: org.postgresql.util.PSQLException: ERROR: canceling statement due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.
org.postgresql.util.PSQLException: ERROR: canceling statement due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.
ERROR: canceling statement due to conflict with recovery
  Detail: User query might have needed to see row versions that must be removed.</ows:ExceptionText>
</ows:Exception>
</ows:ExceptionReport>
```
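
The loop in this notebook retries without a limit and without pausing. A minimal sketch of a bounded variant with exponential backoff is shown here, under stated assumptions: `fetch_until_complete` and its parameters are hypothetical names, and the download is simulated by a list of row counts so that the example runs without the WFS service.

```python
import time

def fetch_until_complete(fetch, expected_rows, max_attempts=5,
                         base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns at least expected_rows rows.

    Returns the number of attempts used and raises RuntimeError when
    max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        rows = fetch()
        if rows >= expected_rows:
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"Only partial data after {max_attempts} attempts")

# Simulated downloads: two truncated results, then a complete one.
simulated_row_counts = iter([10, 50, 100])
delays = []  # records the requested pauses instead of actually sleeping
attempts = fetch_until_complete(lambda: next(simulated_row_counts),
                                expected_rows=100, sleep=delays.append)
print(attempts, delays)
```

In the notebook, `fetch` would wrap the download and the row count of the stored CSV. Passing `sleep` as a parameter is only a convenience so the sketch can be exercised without waiting.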

In [None]:
def get_and_store_features(file_name, feature_type, max_features):
    response = requests.get(wfs_url, params={
     'service': 'WFS',
     'version': '2.0.0',
     'request': 'GetFeature',
     'typename': feature_type,
     'outputFormat': 'csv',
     'count': max_features
    })
    with open(file=file_name, encoding='UTF-8', mode='w', newline='') as csvfile:
        csvfile.write(response.content.decode('UTF-8'))
        
pending_processing = True
try_count = 0
while pending_processing:
    try_count += 1
    print("{}: {} WFS get feature.".format(datetime.now(), try_count))
    total_features = get_total_features_by_type(vb_type_name)
    print("{}: #features = {}".format(datetime.now(), total_features))
    print("{}: Starting fetching data from WFS service, total features {}".format(datetime.now(), total_features))
    get_and_store_features(feature_output_file, vb_type_name, total_features)
    print("{}: WFS data stored in {}".format(datetime.now(), feature_output_file))
    stored_features_df = pd.read_csv(feature_output_file)
    total_stored_features = len(stored_features_df.index)
    print("{}: Stored {} features in the csv.".format(datetime.now(), total_stored_features))
    pending_processing = total_stored_features < total_features

## Process data

Load the signs data into `pandas` dataframes. This data is filtered by the `previous_processed_date` and joined with the sign metadata on `bordcode`.

**Note:** All this code is dataset specific, ideally this should be abstracted away, including column definitions.

In [None]:
feature_df = pd.read_csv(feature_output_file)

### Date filtering

Filter the dataframe for all signs with date greater than the `previous_processed_date` configuration value. This is done by: 1) converting the `datum_plaatsing` to date in the `date` column, and 2) filtering the dataframe.

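**Note:** The filter has a lower bound (`previous_processed_date`) and an upper bound (tomorrow), so placement dates in the future are excluded as well.
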
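**Note:** `datum_plaatsing` is parsed with the day-first (European) pattern `%d/%m/%Y`, not the US month-first one. Values that do not match become `NaT` and are filtered out.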
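
The effect of the date filter can be checked on a small, made-up dataframe. The sketch below uses the same parsing pattern and the same three conditions as the notebook; the sample values and the `cutoff` and `toy` names are assumptions for illustration.

```python
import pandas as pd

cutoff = pd.Timestamp("2022-07-31")
# A date 30 days from now, formatted day-first like the source column.
future = (pd.Timestamp.today() + pd.Timedelta(days=30)).strftime("%d/%m/%Y")

toy = pd.DataFrame({
    "datum_plaatsing": ["31/07/2022", "01/08/2022", "15/03/2023", "not a date", future]
})

# Unparseable values become NaT because of errors="coerce".
toy["date"] = pd.to_datetime(toy["datum_plaatsing"], errors="coerce", format="%d/%m/%Y")

mask = (
    toy["date"].notna()
    & (toy["date"] > cutoff)  # strictly after the previous run
    & (toy["date"] < pd.Timestamp.today() + pd.Timedelta("1D"))  # not in the future
)
kept = toy.loc[mask, "datum_plaatsing"].tolist()
print(kept)
```

The row dated exactly on the cutoff is dropped because the comparison is strict, as are the unparseable row and the future-dated row.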

In [None]:
feature_df['date'] = pd.to_datetime(feature_df['datum_plaatsing'], errors = 'coerce', format='%d/%m/%Y')
filter_mask = feature_df['date'].notna() \
    & (feature_df["date"] > previous_processed_date) \
    & (feature_df['date'] < (pd.Timestamp.today() + pd.Timedelta('1D')))
filtered_df = feature_df[filter_mask]
display(filtered_df)

In [None]:
print(f"The file contains {len(feature_df)} features before filtering by date.")
feature_df['date'] = pd.to_datetime(feature_df['datum_plaatsing'], errors = 'coerce', format='%d/%m/%Y')
filter_mask = feature_df['date'].notna() \
    & (feature_df["date"] > previous_processed_date) \
    & (feature_df['date'] < (pd.Timestamp.today() + pd.Timedelta('1D')))
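# Copy the slice so the column assignments further down do not trigger SettingWithCopyWarning.
filtered_df = feature_df[filter_mask].copy()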
print(f"The file contains {len(filtered_df)} features after filtering by date greater than {previous_processed_date}.")

### Data parsing and conversion

Some small conversions are applied to the `bordcode` field, as per the SPSS syntax. This code also creates the identifier by removing the layer prefix from the `FID` value. The coordinates are converted from [EPSG:31370](https://epsg.io/31370) (Belgian Lambert 72) to [EPSG:4326](https://epsg.io/4326), aka WGS84, which yields longitude and latitude.
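
The two string conversions can be illustrated in isolation with plain Python. This is a sketch, not the notebook's code: the function names `normalise_bordcode` and `strip_fid_prefix` are hypothetical, and the sample codes are only illustrative inputs.

```python
FID_PREFIX = "Verkeersborden.Vlaanderen_Borden."

def normalise_bordcode(code):
    """Turn a leading 'Z' into a ' (zone)' suffix and drop any '/'."""
    label = f"{code[1:]} (zone)" if code.startswith("Z") else code
    return label.replace("/", "")

def strip_fid_prefix(fid, prefix=FID_PREFIX):
    """Remove the layer prefix from a feature id, matching it literally."""
    return fid[len(prefix):] if fid.startswith(prefix) else fid

# Illustrative inputs only.
for code in ["ZC43", "E9a", "X1/2"]:
    print(code, "->", normalise_bordcode(code))
print(strip_fid_prefix(FID_PREFIX + "123456"))
```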

In [None]:
from pyproj import Transformer
transformer = Transformer.from_crs("epsg:31370", "epsg:4326", always_xy=True)

def convertCoords(row):
    # locatie_x / locatie_y are the projected x / y values; with always_xy=True
    # the transformer returns (longitude, latitude).
    longitude, latitude = transformer.transform(row['locatie_x'], row['locatie_y'])
    return pd.Series({'longitude': longitude, 'latitude': latitude})

# Convert coordinates
filtered_df[['longitude', 'latitude']] = filtered_df.apply(convertCoords, axis=1)
# Bordcode processing: remove the leading Z and append the "(zone)" description, then drop any "/".
filtered_df['bordcode'] = filtered_df.apply(lambda row: (f"{row['bordcode'][1:]} (zone)" if row['bordcode'].startswith('Z') else row['bordcode']).replace("/", ""), axis=1)
# Strip the layer prefix from FID; regex=False so the dots are matched literally.
filtered_df['id'] = filtered_df['FID'].str.replace('Verkeersborden.Vlaanderen_Borden.', '', regex=False)
# drop() returns a new dataframe, so the result has to be assigned back.
filtered_df = filtered_df.drop(columns=['FID'])
# The parameters column will also require some cleaning, probably best done before saving.

In [None]:
sign_metadata = pd.read_csv(traffic_signs_info, sep=";", encoding = "ISO-8859-1")
sign_metadata.dtypes

### Join and grouping

Merge the sign metadata with the current dataset on the `bordcode` field, then group by `id_aanzicht` to identify clustered signs. After that, select the required values and store them in the file given by the `signs_csv_output_file` configuration value.
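
The join and grouping logic can be followed on a toy example. The sketch below reuses the column names from this notebook (`id_aanzicht`, `bordcode`, `opinion`, `name`), but all values are made up, and it keeps only a subset of the aggregated columns.

```python
import pandas as pd

# Three signs on two sign faces (id_aanzicht); values are illustrative.
signs = pd.DataFrame({
    "id_aanzicht": [1, 1, 2],
    "bordcode": ["C43", "GVII", "E1"],
})
# Metadata with no entry for GVII, so the join leaves NaN for that row.
metadata = pd.DataFrame({
    "bordcode": ["C43", "E1"],
    "opinion": [1, 0],
    "name": ["Speed limit", "No parking"],
})

joined = signs.join(metadata.set_index("bordcode"), on="bordcode")
joined["name"] = joined["name"].fillna("")

grouped = joined.groupby("id_aanzicht", as_index=False).agg({
    "opinion": "max",            # NaN values are skipped by max
    "bordcode": " | ".join,      # all codes on the same sign face
    "name": lambda x: "|".join(y for y in x if y != ""),
})
# Keep only sign faces where at least one sign has a positive opinion.
grouped = grouped[grouped["opinion"] > 0]
print(grouped)
```

Only sign face 1 survives: its codes are combined into `C43 | GVII`, and the empty name of the unmatched code is left out of the joined description.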

In [None]:
# Join both datasets by the bordcode
joined_df = filtered_df.join(sign_metadata.set_index("bordcode"), on='bordcode')
# Replace NaN parameters and names with empty strings
joined_df[['parameters', 'name']] = joined_df[['parameters','name']].fillna('')
joined_df.dtypes
display(joined_df)

In [None]:
grouped_df = joined_df.groupby('id_aanzicht', as_index=False).agg({
     'opinion': 'max', 
     'bordcode': ' | '.join,
     'latitude': 'max',
     'longitude': 'max',
     'parameters': lambda x : '|'.join(y for y in x if y != ''),
     'name': lambda x : '|'.join(y for y in x if y != ''),
     'datum_plaatsing': 'max',
     'id': 'max'})
grouped_df = grouped_df[grouped_df['opinion'] > 0]
print(f"Found {len(grouped_df)} signs after grouping by id_aanzicht")
display(grouped_df)
grouped_df.to_csv(signs_csv_output_file, sep=";")

In [None]:
result = grouped_df.rename(columns = {
    "bordcode": "traffic_sign_code", 
    "parameters": "extra_text",
    "datum_plaatsing": "date_installed",
    "name": "traffic_sign_description"
})[['id', 'traffic_sign_code', 'extra_text', 'traffic_sign_description', 'date_installed', 'longitude', 'latitude']]
display(result)

## Store results
Store the processing results in GeoJSON format using `pandas_geojson`.
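
For orientation, the target structure can be sketched with the standard library alone. This is not how `pandas_geojson` is implemented; it only shows the shape of a GeoJSON `FeatureCollection` of points, where coordinates are ordered longitude first, then latitude. The helper name `row_to_feature` and the sample row are assumptions.

```python
import json

def row_to_feature(row, lat="latitude", lon="longitude"):
    """Build one GeoJSON Point feature from a dict-like row."""
    properties = {k: v for k, v in row.items() if k not in (lat, lon)}
    return {
        "type": "Feature",
        # GeoJSON positions are [longitude, latitude], in that order.
        "geometry": {"type": "Point", "coordinates": [row[lon], row[lat]]},
        "properties": properties,
    }

# Made-up row using the column names of the result dataframe.
sample_row = {
    "id": "123456",
    "traffic_sign_code": "C43",
    "extra_text": "50",
    "traffic_sign_description": "Speed limit",
    "date_installed": "01/08/2022",
    "longitude": 4.3572,
    "latitude": 50.8476,
}
collection = {"type": "FeatureCollection", "features": [row_to_feature(sample_row)]}
print(json.dumps(collection, indent=2))
```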

In [None]:
geo_json = to_geojson(df=result, lat='latitude', lon='longitude',
                 properties=['id','traffic_sign_code','extra_text','traffic_sign_description', 'date_installed' ])
write_geojson(geo_json, filename=geojson_output_file)

## Visualize results
A simple visualization of the GeoJSON results in folium. There is no custom popup for the time being.

In [None]:
folium_map = folium.Map(
    location=[50.8476, 4.3572],
    zoom_start=8,
)
folium.GeoJson(data=geo_json).add_to(folium_map)
folium_map