# Create all datasets required for creating a Version 3 ML pipeline

## This notebook sets up the datasets required to build features and models from PayByPhone frontend data, as well as the data needed to measure performance in Seattle

## Data preparation for version 3: street-booking assignment 
We perform the following steps for each city in version 3; in this notebook they are done for Seattle only:

1. Extract the street network in Seattle using the OpenStreetMap API. Each street is assigned a unique id
2. Read booking events from the PayByPhone (PBP) frontend data
3. Assign each booking event to a street (infer where the user is parking)
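
As a rough illustration of step 3, the sketch below assigns a booking location to the closest street segment. It is not the implementation of `merge_df_on_nearest_geometries`; it assumes planar coordinates and straight two-point segments, and the names `point_segment_distance`, `assign_to_nearest_street`, and `streets_demo` are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to segment a-b in a planar coordinate system."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def assign_to_nearest_street(point, streets):
    """Return the id of the street whose segment is closest to the point."""
    return min(streets, key=lambda sid: point_segment_distance(point, *streets[sid]))

# Hypothetical toy data: two parallel streets and one booking location
streets_demo = {
    "street_a": ((0.0, 0.0), (10.0, 0.0)),
    "street_b": ((0.0, 5.0), (10.0, 5.0)),
}
booking_location = (3.0, 1.0)
nearest = assign_to_nearest_street(booking_location, streets_demo)
print(nearest)
```

Real street geometries are polylines in latitude/longitude, so a production version would project to a metric CRS first.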


## Seattle data preparation for evaluation
We perform additional Seattle-specific steps for evaluation purposes:

1. Load geometries from PBP location identifiers
2. Create mapping between PBP locations and streets (usually one street covers 1-4 PBP locations)
3. Evaluate our assignment rule from step 3 of the data preparation for version 3 (above) using the identifiers
4. Load the Seattle ground-truth dataset sampled by the Department of Transportation
5. Create ground-truth labels for the street network based on the Seattle ground-truth data (a street is occupied only if all corresponding PBP locations are occupied; if any location is free, the street is free)
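
The labelling rule in step 5 can be sketched in a few lines. This is only an illustration of the rule as stated, not the code of `compute_gt_labels`; `street_is_occupied` and the toy mapping are hypothetical.

```python
def street_is_occupied(location_occupied):
    """A street counts as occupied only if every mapped PBP location is occupied."""
    if not location_occupied:
        raise ValueError("street has no mapped PBP locations")
    return all(location_occupied)

# Hypothetical occupancy flags per PBP location, grouped by street
street_to_locations = {
    "street_1": [True, True, True],   # all locations occupied -> street occupied
    "street_2": [True, False],        # one location free -> street free
}
labels = {sid: street_is_occupied(flags) for sid, flags in street_to_locations.items()}
print(labels)
```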


### NOTE

As of August 26th, 2021, this notebook has been partially re-run to generate data for more radii (adding 25 and 50 to the previously added 100, 150, and 250), in contrast to the single 500 radius used before. Because it was only partially re-run, note the following points:

   1. In this run we queried OpenStreetMap again for all 6 radii and found that the number of POIs within a given radius had changed greatly (e.g. residential POIs within a 500 m radius went from tens to thousands, which does not closely reflect the real-world situation). We could not find out why, so we take the current data as is, keeping in mind that OpenStreetMap might have changed the tags of the POIs. One solution could be to install a fixed version of OpenStreetMap; there have been two releases in May 2021 since the POIs were first queried.
   
   2. Some sections were run only partially (first section only), just to merge the old data with the newly queried results for the new radii, so please do not run those sections when generating new data from scratch.
   
   3. We decided to run the notebook partially and merge the new data with the old data for several reasons:
       1) to save time
       2) at the line *label_df = label_df.merge(streets[different_radius_cols].astype(str).drop_duplicates(), on='street_id')*, the geometry currently queried from OSM was found to overlap even less with the geometry of the ground-truth data
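
One way to catch the kind of POI count jump described in point 1 is to compare per-category counts between two query runs. The sketch below is a hypothetical sanity check, not part of the original pipeline; `flag_count_changes`, the threshold, and the example counts are made up for illustration.

```python
def flag_count_changes(old_counts, new_counts, max_ratio=10.0):
    """Return categories whose POI count grew or shrank by more than max_ratio."""
    flagged = {}
    for category in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(category, 0)
        new = new_counts.get(category, 0)
        low, high = sorted((old, new))
        # A category appearing or vanishing entirely is treated as a change, too
        if (low == 0 and high > 0) or (low > 0 and high / low > max_ratio):
            flagged[category] = (old, new)
    return flagged

# Illustrative numbers only ("tens" versus "thousands"), not measured values
old_run = {"residential": 40, "schools": 12}
new_run = {"residential": 3200, "schools": 14}
suspicious = flag_count_changes(old_run, new_run)
print(suspicious)
```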

## Set up

In [None]:
!pip install geopandas
!pip install OSMNX
!pip install folium
!pip install pyprobar  # progressbar

import boto3
import pandas as pd
import io
import geopandas as gpd
import osmnx as ox 
import numpy as np
from shapely.geometry import Polygon, Point
import matplotlib.pyplot as plt
from shapely import wkt

import warnings
from pyprobar import bar, probar
import json
import importlib


from groundtruth_helper import  compute_gt_labels

import geolocation_helper
import openstreetmap_helper #import print_pbp_locations_on_map, retrieve_pois, cluster_pois, merge_pois_with_street_network

importlib.reload(openstreetmap_helper) # reload only module
importlib.reload(geolocation_helper)

from openstreetmap_helper import *
from geolocation_helper import merge_df_on_nearest_geometries, define_circle_around_point

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
# read global configuration details
with open('sagemaker_config_v3.json') as config_file:
    config = json.load(config_file)

In [None]:
client= boto3.client('s3')

## Merge the Newly Queried Radii with the Previous Dataset (as an Alternative to the Sections Below)

NOTE: OpenStreetMap seems to have changed its tags: for a radius of 500, the number of POIs in each category has increased greatly, and as of August 16, 2021, OpenStreetMap has had 2 major releases (in May) since the original query. To avoid the implausible POI counts, we might want to install a fixed version of OpenStreetMap from before the May releases.

In [None]:
original_final = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different_radius_seattle_groundtruth_labels_with_openstreetmap_features.csv', index_col =0)

In [None]:
original_final['geometry'] = original_final['geometry'].apply(wkt.loads)
original_final=gpd.GeoDataFrame(original_final, geometry = original_final.geometry).set_crs(epsg=4326)

In [None]:
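# NOTE: `pois_seattle` is created by `retrieve_pois` in the "Get POIs from Open Street Map" section further down; run that cell first
pois_seattle_filtered = cluster_pois(pois_seattle)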
include_radius_25 = merge_pois_with_final_result(street_network=original_final, pois=pois_seattle_filtered, radius=25)
include_radius_25.to_csv('include_radius_25_raw.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
include_radius_50 = merge_pois_with_final_result(street_network=original_final, pois=pois_seattle_filtered, radius=50)
include_radius_50.to_csv('include_radius_50_raw.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
include_radius_100 = merge_pois_with_final_result(street_network=original_final, pois=pois_seattle_filtered, radius=100)
include_radius_100.to_csv('include_radius_100_raw.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
include_radius_150 = merge_pois_with_final_result(street_network=original_final, pois=pois_seattle_filtered, radius=150)
include_radius_150.to_csv('include_radius_150_raw.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
include_radius_250 = merge_pois_with_final_result(street_network=original_final, pois=pois_seattle_filtered, radius=250)
include_radius_250.to_csv('include_radius_250_raw.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
include_radius_500 = merge_pois_with_final_result(street_network=original_final, pois=pois_seattle_filtered, radius=500)
include_radius_500.to_csv('include_radius_500_raw.csv')

#### Upload the Raw Queried Data to S3
The raw query result is the dataframe that comes directly from OpenStreetMap, with the column names unchanged; we save it in case a bug turns up later.

In [None]:
files = ['include_radius_25_raw.csv', 'include_radius_50_raw.csv', 
         'include_radius_100_raw.csv', 'include_radius_150_raw.csv',
         'include_radius_250_raw.csv', 'include_radius_500_raw.csv']

for file in files:
    print(f'uploading {file} to object input/processed/frontend/different-radius-intermediate/{file} in s3')
    client.upload_file(f'{file}', 'bucket-vwfs-pred-park-global-model-serving-dev', f'input/processed/frontend/different-radius-intermediate/{file}')
    print(f'{file} upload finished')

In [None]:
include_radius_25 = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different-radius-intermediate/include_radius_25_raw.csv', index_col=0)
include_radius_50 = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different-radius-intermediate/include_radius_50_raw.csv', index_col=0)
include_radius_100 = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different-radius-intermediate/include_radius_100_raw.csv', index_col=0)
include_radius_150 = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different-radius-intermediate/include_radius_150_raw.csv', index_col=0)
include_radius_250 = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different-radius-intermediate/include_radius_250_raw.csv',index_col=0)
include_radius_500 = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/processed/frontend/different-radius-intermediate/include_radius_500_raw.csv', index_col=0)

In [None]:
dict_df_radius = {'include_radius_25': include_radius_25,
          'include_radius_50': include_radius_50, 
          'include_radius_100': include_radius_100, 
          'include_radius_150':include_radius_150, 
          'include_radius_250':include_radius_250, 
          'include_radius_500':include_radius_500}

In [None]:
# drop the original radius 100, 150, 250, 500
for key, value in dict_df_radius.items():
    value.drop(["commercial_100", "residential_100", "transportation_100", "schools_100", "eventsites_100",
                "commercial_150", "residential_150", "transportation_150", "schools_150", "eventsites_150", 
                "commercial_250", "residential_250", "transportation_250", "schools_250", "eventsites_250",
                 "commercial_500", "residential_500", "transportation_500", "schools_500", "eventsites_500"], inplace=True, axis = 1)

In [None]:
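def rename_cols(df, name):
    # DataFrame.rename returns a new frame by default, so return its result
    # (the original discarded it and returned the unchanged df)
    return df.rename(
        columns={
            'commercial': f'commercial_{name}',
            'residential': f'residential_{name}',
            'transportation': f'transportation_{name}',
            'schools': f'schools_{name}',
            'eventsites': f'eventsites_{name}'
        }
    )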
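
The per-radius renaming follows one pattern: append the radius to each POI category name. A minimal sketch of building that mapping programmatically is shown below; `POI_CATEGORIES` and `radius_rename_map` are hypothetical names, and the resulting dict is the kind of mapping one could pass to `DataFrame.rename(columns=...)`.

```python
# POI categories as they appear in the column names of this notebook
POI_CATEGORIES = ["commercial", "residential", "transportation", "schools", "eventsites"]

def radius_rename_map(radius):
    """Map each unsuffixed POI column to its radius-suffixed name."""
    return {category: f"{category}_{radius}" for category in POI_CATEGORIES}

rename_map_25 = radius_rename_map(25)
print(rename_map_25)
```

Note that `DataFrame.rename` returns a new frame unless `inplace=True` is passed, so the result has to be assigned or returned.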

In [None]:
dict_df_radius.keys()

In [None]:
dict_df_radius['include_radius_25'] = rename_cols(dict_df_radius['include_radius_25'], 25)
dict_df_radius['include_radius_50'] = rename_cols(dict_df_radius['include_radius_50'], 50)
dict_df_radius['include_radius_100'] = rename_cols(dict_df_radius['include_radius_100'], 100)
dict_df_radius['include_radius_150'] = rename_cols(dict_df_radius['include_radius_150'], 150)
dict_df_radius['include_radius_250'] = rename_cols(dict_df_radius['include_radius_250'], 250)
dict_df_radius['include_radius_500'] = rename_cols(dict_df_radius['include_radius_500'], 500)

In [None]:
for key, value in dict_df_radius.items():
    value.to_csv(key + '.csv')

In [None]:
# upload separate files to s3
files = ['include_radius_25.csv', 'include_radius_50.csv', 
         'include_radius_100.csv', 'include_radius_150.csv',
         'include_radius_250.csv', 'include_radius_500.csv']

for file in files:
    print(f'uploading {file} to object input/processed/frontend/different-radius-intermediate/{file} in s3')
    client.upload_file(f'{file}', 'bucket-vwfs-pred-park-global-model-serving-dev', f'input/processed/frontend/different-radius-intermediate/{file}')
    print(f'{file} upload finished')

In [None]:
# load the street data from S3 for the different radii
ls_files = ['include_radius_25', 'include_radius_50', 
          'include_radius_100', 'include_radius_150',
          'include_radius_250', 'include_radius_500']
include_radius_data = {}
for file in ls_files:
    csv_obj = client.get_object(Bucket=config.get("global").get("s3_bucket"), 
                                Key=f'input/processed/frontend/different-radius-intermediate/{file}.csv')
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    read_include_radius = pd.read_csv(io.StringIO(csv_string), index_col=0)
    include_radius_data[file] = read_include_radius

In [None]:
base = include_radius_data['include_radius_25'].drop(["commercial_25", "residential_25", "transportation_25", "schools_25", "eventsites_25"], axis = 1)

In [None]:
# adding it all together
list_df = [base,
           include_radius_data['include_radius_25'][["commercial_25", "residential_25", "transportation_25", "schools_25", "eventsites_25"]], 
           include_radius_data['include_radius_50'][["commercial_50", "residential_50", "transportation_50", "schools_50", "eventsites_50"]], 
           include_radius_data['include_radius_100'][["commercial_100", "residential_100", "transportation_100", "schools_100", "eventsites_100"]], 
           include_radius_data['include_radius_150'][["commercial_150", "residential_150", "transportation_150", "schools_150", "eventsites_150"]], 
           include_radius_data['include_radius_250'][["commercial_250", "residential_250", "transportation_250", "schools_250", "eventsites_250"]], 
           include_radius_data['include_radius_500'][["commercial_500", "residential_500", "transportation_500", "schools_500", "eventsites_500"]], 
          ]
different_radius_6_radius_seattle_groundtruth_labels_with_openstreetmap_features = pd.concat(list_df, axis=1)

In [None]:
different_radius_6_radius_seattle_groundtruth_labels_with_openstreetmap_features.to_csv('different_radius_6_radius_seattle_groundtruth_labels_with_openstreetmap_features.csv')

In [None]:
# upload to s3 -- this is now the final file from this notebook

client.upload_file('different_radius_6_radius_seattle_groundtruth_labels_with_openstreetmap_features.csv', 'bucket-vwfs-pred-park-global-model-serving-dev',
                   'input/processed/frontend/different_radius_6_radius_seattle_groundtruth_labels_with_openstreetmap_features.csv')

## Get the Seattle Street Network from OpenStreetMap (Query Everything Anew)

In [None]:
place_name = "Seattle, US"
graph = ox.graph_from_place(place_name, network_type='drive')
streets = ox.utils_graph.graph_to_gdfs(ox.get_undirected(graph), nodes=False)

# if nodes=True: streets[0]: intersections (=nodes), streets[1]: streets (=edges)

### Get POIs from Open Street Map 

This retrieves fresh data from OpenStreetMap, as an alternative to the section above, where we merge with the previously queried data.

In [None]:
# retrieve POIs--only run when needed!
pois_seattle = retrieve_pois(place_name=place_name)

In [None]:
# cluster POIs--only run when needed! one radius generates about 5000 API calls
pois_seattle_filtered = cluster_pois(pois_seattle) # this line needs to be run for each query
# merge POIs with street network and calculate the number of POIs in a given radius around every street
streets_25 = merge_pois_with_street_network(street_network=streets, pois=pois_seattle_filtered, radius=25)
# save the data--only run when needed!
streets_25.to_csv('different_radius_streets_25.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
streets_50 = merge_pois_with_street_network(street_network=streets, pois=pois_seattle_filtered, radius=50)
streets_50.to_csv('different_radius_streets_50.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
streets_100 = merge_pois_with_street_network(street_network=streets, pois=pois_seattle_filtered, radius=100)
streets_100.to_csv('different_radius_streets_100.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
streets_150 = merge_pois_with_street_network(street_network=streets, pois=pois_seattle_filtered, radius=150)
streets_150.to_csv('different_radius_streets_150.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
streets_250 = merge_pois_with_street_network(street_network=streets, pois=pois_seattle_filtered, radius=250)
streets_250.to_csv('different_radius_streets_250.csv')

In [None]:
pois_seattle_filtered = cluster_pois(pois_seattle)
streets_500 = merge_pois_with_street_network(street_network=streets, pois=pois_seattle_filtered, radius=500)
streets_500.to_csv('different_radius_streets_500.csv')
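
To make the idea of "number of POIs within a given radius" concrete, the sketch below counts points within several radii of a single reference point using the haversine distance. How `merge_pois_with_street_network` measures distance internally is not shown in this notebook, so this is an assumption-laden illustration; `haversine_m`, `count_pois_within`, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_pois_within(center, pois, radius_m):
    """Count POIs whose distance to the center is at most radius_m."""
    return sum(1 for lat, lon in pois
               if haversine_m(center[0], center[1], lat, lon) <= radius_m)

# Hypothetical reference point and POIs (roughly 0 m, 33 m and 423 m away)
street_midpoint = (47.6062, -122.3321)
pois_demo = [(47.6062, -122.3321), (47.6065, -122.3321), (47.6100, -122.3321)]
counts = {r: count_pois_within(street_midpoint, pois_demo, r)
          for r in (25, 50, 100, 150, 250, 500)}
print(counts)
```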

In [None]:
# upload separate files to s3
files = ['different_radius_streets_25.csv', 'different_radius_streets_50.csv', 
         'different_radius_streets_100.csv', 'different_radius_streets_150.csv',
         'different_radius_streets_250.csv', 'different_radius_streets_500.csv']

for file in files:
    print(f'uploading {file} to object input/processed/frontend/{file} in s3')
    client.upload_file(f'{file}', 'bucket-vwfs-pred-park-global-model-serving-dev', f'input/processed/frontend/{file}')
    print(f'{file} upload finished')

In [None]:
# load the street data from S3 for the different radii
client = boto3.client('s3')
ls_files = ['radius_25_filename','radius_50_filename',
            'radius_100_filename', 'radius_150_filename',
            'radius_250_filename', 'radius_500_filename']
radius_data = {}
for file in ls_files:
    csv_obj = client.get_object(Bucket=config.get("global").get("s3_bucket"), 
                                Key=config.get("development").get(file))
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    read_radius = pd.read_csv(io.StringIO(csv_string))
    radius_data[file] = read_radius

In [None]:
train_data_with_trans = pd.read_csv('s3://bucket-vwfs-pred-park-global-model-serving-dev/input/open_data/seattle/train_data_with_trans.csv', index_col=0)

### Assign unique id to each street

In [None]:
# here we merge the data from first run to current df to get all the columns
for key, df in radius_data.items():
    radius_data[key] = pd.merge(df, train_data_with_trans[['geometry', 'street_id', 'study_area', 'ongoing_trans']].astype(str) ,on ='geometry', how='inner')

### Generate Data

In [None]:
# make a copy of the radius-100 streets and use it as the base `streets` frame
# (.copy() keeps the in-place rename below from modifying radius_data)
streets = radius_data['radius_100_filename'].copy()

In [None]:
# rename of the columns
streets.rename(
    columns={"commercial": "commercial_100", 
             "residential": "residential_100",
             "transportation":"transportation_100",
             "schools": "schools_100",
             "eventsites": "eventsites_100"
            },
    inplace=True
)

In [None]:
# get the data
streets_150 = radius_data['radius_150_filename'][["commercial", "residential", "transportation", "schools", "eventsites"]]
streets_250 = radius_data['radius_250_filename'][["commercial", "residential", "transportation", "schools", "eventsites"]]
streets_500 = radius_data['radius_500_filename'][["commercial", "residential", "transportation", "schools", "eventsites"]]
streets_25 = radius_data['radius_25_filename'][["commercial", "residential", "transportation", "schools", "eventsites"]]
streets_50 = radius_data['radius_50_filename'][["commercial", "residential", "transportation", "schools", "eventsites"]]

In [None]:
# rename other columns for other radius
streets_150 = streets_150.rename(
    columns={"commercial": "commercial_150", 
             "residential": "residential_150",
             "transportation":"transportation_150",
             "schools": "schools_150",
             "eventsites": "eventsites_150"
            }
)

streets_250 = streets_250.rename(
    columns={"commercial": "commercial_250", 
             "residential": "residential_250",
             "transportation":"transportation_250",
             "schools": "schools_250",
             "eventsites": "eventsites_250"
            }
)

streets_500 = streets_500.rename(
    columns={"commercial": "commercial_500", 
             "residential": "residential_500",
             "transportation":"transportation_500",
             "schools": "schools_500",
             "eventsites": "eventsites_500"
            }
)

streets_25 = streets_25.rename(
    columns={"commercial": "commercial_25", 
             "residential": "residential_25",
             "transportation":"transportation_25",
             "schools": "schools_25",
             "eventsites": "eventsites_25"
            }
)


streets_50 = streets_50.rename(
    columns={"commercial": "commercial_50", 
             "residential": "residential_50",
             "transportation":"transportation_50",
             "schools": "schools_50",
             "eventsites": "eventsites_50"
            }
)

In [None]:
# Add all the different radius together
list_df = [streets, 
           streets_150[["commercial_150", "residential_150", "transportation_150", "schools_150", "eventsites_150"]], 
           streets_250[["commercial_250", "residential_250", "transportation_250", "schools_250", "eventsites_250"]], 
           streets_500[["commercial_500", "residential_500", "transportation_500", "schools_500", "eventsites_500"]],
           streets_25[["commercial_25", "residential_25", "transportation_25", "schools_25", "eventsites_25"]], 
           streets_50[["commercial_50", "residential_50", "transportation_50", "schools_50", "eventsites_50"]], 
          ]
streets = pd.concat(list_df, axis=1)

### Prep streets

In [None]:
# data for different radius
# use the (u, v) node pair as the index
streets.set_index(['u', 'v'], inplace=True)

# geo_df
streets['geometry'] = streets['geometry'].apply(wkt.loads)
streets = gpd.GeoDataFrame(streets, geometry=streets.geometry)

In [None]:
# data for only 500 meter radius

streets_500_only = streets.drop(["commercial_100", "residential_100", "transportation_100", "schools_100", "eventsites_100",
                                "commercial_150", "residential_150", "transportation_150", "schools_150", "eventsites_150", 
                                "commercial_250", "residential_250", "transportation_250", "schools_250", "eventsites_250",
                                 "commercial_25", "residential_25", "transportation_25", "schools_25", "eventsites_25",
                                 "commercial_50", "residential_50", "transportation_50", "schools_50", "eventsites_50"], axis = 1)

## Read frontend data from PBP

In [None]:
client= boto3.client('s3')
csv_obj = client.get_object(Bucket=config.get("global").get("s3_bucket"), Key=config.get("development").get("paybyphone_frontend_data_filename"))
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

seattle = pd.read_csv(io.StringIO(csv_string),dtype={'vendorId': str, 'advLocationId':str})

### Use only vendorId 4661 (=Seattle) and parking start and extend events

In [None]:
seattle = seattle[seattle.vendorId=="4661"]

In [None]:
parking = gpd.GeoDataFrame(
    seattle, geometry=gpd.points_from_xy(seattle.long, seattle.lat))
parking = parking.loc[parking.action.isin(['PARKING_START','PARKING_EXTEND'])]
parking = parking.set_crs(epsg=4326)

### Convert types

In [None]:
parking.advLocationId = parking.advLocationId.astype(str)
#use utc times
parking.created = pd.to_datetime(parking.created, unit='ms')
parking.expires = pd.to_datetime(parking.expires).dt.tz_localize(None)
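
The two conversions above can be mirrored with the standard library: `created` is treated as epoch milliseconds, and `expires` has its timezone information dropped. The sketch below is only an illustration of those semantics; `epoch_ms_to_utc_naive` and `strip_timezone` are hypothetical helpers, and the sample values are made up.

```python
from datetime import datetime, timezone

def epoch_ms_to_utc_naive(epoch_ms):
    """Interpret epoch milliseconds as UTC and return a naive datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).replace(tzinfo=None)

def strip_timezone(iso_string):
    """Parse an ISO 8601 string and drop its offset, keeping the wall-clock time."""
    return datetime.fromisoformat(iso_string).replace(tzinfo=None)

created = epoch_ms_to_utc_naive(1_600_000_000_000)
expires = strip_timezone("2021-05-03T09:30:00+00:00")
print(created, expires)
```

Dropping the offset keeps the wall-clock time, so the result is only UTC if the input was already expressed in UTC.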

### Append assigned street id and information about that street to the events

In [None]:
# assign for the data with different radius
parking_with_assign = merge_df_on_nearest_geometries(parking, streets, gdfB_cols=config.get("development").get("different_radius_street_columns"))

In [None]:
# assign for the data with only 500 meters as radius 
# rename the column back
streets_500_only = streets_500_only.rename(
    columns={"commercial_500": "commercial", 
             "residential_500": "residential",
             "transportation_500": "transportation",
              "schools_500": "schools",
             "eventsites_500": "eventsites" 
            }
)
parking_with_assign_500_only = merge_df_on_nearest_geometries(parking, streets_500_only, gdfB_cols=config.get("development").get("street_columns"))

## We have now assigned each transaction to a street. The following code enables Seattle-specific evaluation

### To check the accuracy of the assignment, we need to know whether the assigned street id corresponds to the locationId of the ticket

### Read the geometry of the location identifiers

In [None]:
pbp_location_geometries = gpd.read_file('s3://{}/{}'.format(config.get("global").get("s3_bucket"), config.get("development").get("paybyphone_location_geometries_filename"))).set_crs(epsg=4326)

In [None]:
mapping = merge_df_on_nearest_geometries(pbp_location_geometries, streets)
mapping = mapping.groupby('street_id').apply(lambda x: x.advLocationId.unique()).reset_index(name='assigned_gt_location_ids')
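
The group-by step above collects, for every street, the unique PBP location ids that were matched to it. A dependency-free sketch of the same idea follows; `build_street_to_locations` and the ids are hypothetical.

```python
from collections import defaultdict

def build_street_to_locations(pairs):
    """Group (street_id, location_id) pairs into street_id -> unique location ids."""
    mapping = defaultdict(list)
    for street_id, location_id in pairs:
        # keep each location once, in first-seen order
        if location_id not in mapping[street_id]:
            mapping[street_id].append(location_id)
    return dict(mapping)

# Hypothetical nearest-street matches for four PBP location geometries
nearest_pairs = [("s1", "1001"), ("s1", "1002"), ("s2", "2001"), ("s1", "1001")]
street_to_location_ids = build_street_to_locations(nearest_pairs)
print(street_to_location_ids)
```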

In [None]:
# different radius merge mapping
parking_with_assign = parking_with_assign.merge(mapping, on='street_id')

In [None]:
# original 500 radius merge mapping
parking_with_assign_500_only = parking_with_assign_500_only.merge(mapping, on='street_id')

### Save data for modelling 

In [None]:
# Save the result for different radius
parking_with_assign.to_csv("different_radius_" + config.get("development").get("parking_and_streets_filename"))

In [None]:
# Save result for original 500 radius only
parking_with_assign_500_only.to_csv("new_" + config.get("development").get("parking_and_streets_filename"))

In [None]:
# upload the different radius
client.upload_file("different_radius_" + config.get("development").get("parking_and_streets_filename"), config.get("global").get("s3_bucket"), "input/processed/frontend/different_radius_{}".format(config.get("development").get("parking_and_streets_filename")))

In [None]:
# upload the 500 radius
client.upload_file("new_" + config.get("development").get("parking_and_streets_filename"), config.get("global").get("s3_bucket"), "input/processed/frontend/new_{}".format(config.get("development").get("parking_and_streets_filename")))

### Check accuracy of assignment

In [None]:
# check accuracy for different radius
np.mean([x.advLocationId in x.assigned_gt_location_ids for _,x in parking_with_assign.iterrows() if x.action=='PARKING_START'])

In [None]:
# check accuracy for 500 radius
np.mean([x.advLocationId in x.assigned_gt_location_ids for _,x in parking_with_assign_500_only.iterrows() if x.action=='PARKING_START'])
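
The accuracy figure above is the share of `PARKING_START` events whose ticket location id is among the location ids mapped to the assigned street. The sketch below restates that computation on plain dictionaries; `assignment_accuracy` and the events are hypothetical.

```python
def assignment_accuracy(events):
    """Share of PARKING_START events whose location id belongs to the assigned street."""
    starts = [e for e in events if e["action"] == "PARKING_START"]
    if not starts:
        return float("nan")
    hits = sum(e["advLocationId"] in e["assigned_gt_location_ids"] for e in starts)
    return hits / len(starts)

# Hypothetical events: two correct starts, one wrong start, one ignored extension
events_demo = [
    {"action": "PARKING_START", "advLocationId": "1001", "assigned_gt_location_ids": ["1001", "1002"]},
    {"action": "PARKING_START", "advLocationId": "2001", "assigned_gt_location_ids": ["2001"]},
    {"action": "PARKING_START", "advLocationId": "3001", "assigned_gt_location_ids": ["1001"]},
    {"action": "PARKING_EXTEND", "advLocationId": "9999", "assigned_gt_location_ids": ["1001"]},
]
accuracy = assignment_accuracy(events_demo)
print(accuracy)
```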

## Prepare ground-truth data from Seattle Department of Transportation

In [None]:
client = boto3.client('s3')
csv_obj = client.get_object(Bucket=config.get("global").get("s3_bucket"), Key=config.get("development").get("seattle_groundtruth_filename"))
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

parking_study = pd.read_csv(io.StringIO(csv_string))

In [None]:
parking_study.rename({'elmntkey':'advLocationId'},axis=1, inplace=True)
parking_study.advLocationId = parking_study.advLocationId.astype(str)

### Use only observations of times for which we have PayByPhone transactions

In [None]:
parking_study.time_stamp = pd.to_datetime(parking_study.time_stamp)
parking_study.date_time = pd.to_datetime(parking_study.date_time)
time_stamp_date_only_mask = parking_study.time_stamp.map(lambda x: x.strftime("%H:%M"))=="00:00"
# use .loc to avoid chained assignment, which may silently write to a copy
parking_study.loc[time_stamp_date_only_mask, 'time_stamp'] = parking_study.loc[time_stamp_date_only_mask, 'date_time']
parking_study = parking_study[parking_study.time_stamp>=parking.created.min()]
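
The cell above does two things: timestamps that carry only a date (time of day 00:00) are replaced by `date_time`, and observations older than the first PayByPhone transaction are dropped. A small stand-alone sketch of that logic, with hypothetical names and made-up dates:

```python
from datetime import datetime

def resolve_timestamp(time_stamp, date_time):
    """Fall back to date_time when time_stamp carries no time of day (00:00)."""
    if (time_stamp.hour, time_stamp.minute) == (0, 0):
        return date_time
    return time_stamp

# Hypothetical (time_stamp, date_time) pairs from the parking study
rows = [
    (datetime(2021, 5, 3, 0, 0), datetime(2021, 5, 3, 9, 30)),    # date only -> use date_time
    (datetime(2021, 5, 3, 14, 15), datetime(2021, 5, 3, 14, 0)),  # has a time -> keep
    (datetime(2019, 1, 1, 8, 0), datetime(2019, 1, 1, 8, 0)),     # before first transaction
]
first_transaction = datetime(2021, 1, 1)  # stands in for parking.created.min()

resolved = [resolve_timestamp(ts, dt) for ts, dt in rows]
kept = [ts for ts in resolved if ts >= first_transaction]
print(kept)
```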

### Compute labels based on our own parking zones

In [None]:
# keep the merged frame under the name used by compute_gt_labels below
parking_study_merged_geo = parking_study.merge(pbp_location_geometries[["advLocationId", "geometry"]], on="advLocationId")

In [None]:
# create label df; note that we use streets for both the different-radius case and the 500-radius case, because they generate the same result
label_df = compute_gt_labels(pred_geom=streets, groundtruth_data=gpd.GeoDataFrame(parking_study_merged_geo))

In [None]:
label_df = label_df.droplevel(2).reset_index()

### Add street features (we do it here since we do not want to reload the street df later) 

In [None]:
# columns for different radius
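# copy the list so that inserting "geometry" does not mutate the config (and re-running the cell does not insert it twice)
different_radius_cols = list(config.get("development").get("different_radius_street_columns"))
different_radius_cols.insert(4, "geometry")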

In [None]:
# columns for 500 radius
# copy the list so that inserting "geometry" does not mutate the config
cols = list(config.get("development").get("street_columns"))
cols.insert(4, "geometry")

NOTE: the merge below needs to be checked if we run this notebook again, as it currently yields inconsistent merge results.

In [None]:
# TODO: should we really use "astype(str)" here as some variables are numbers?
label_df = label_df.merge(streets[different_radius_cols].astype(str).drop_duplicates(), on='street_id')

In [None]:
label_df.maxspeed = [ x.split(' ')[0] for x in label_df.maxspeed.tolist() ]  # extract speedlimit as number

In [None]:
# add weekday and hour
label_df_new = label_df.copy()  # label_df_new was not defined before this cell
label_df_new['hour'] = label_df_new.observation_interval_start.dt.hour
label_df_new['weekday'] = label_df_new.observation_interval_start.dt.weekday

In [None]:
# get the 500 radius only and rename the column for consistency
label_df_500_only = label_df_new.drop(["commercial_100", "residential_100", "transportation_100", "schools_100", "eventsites_100",
                                "commercial_150", "residential_150", "transportation_150", "schools_150", "eventsites_150", 
                                "commercial_250", "residential_250", "transportation_250", "schools_250", "eventsites_250",
                                "commercial_25", "residential_25", "transportation_25", "schools_25", "eventsites_25",
                                "commercial_50", "residential_50", "transportation_50", "schools_50", "eventsites_50"], axis = 1)

label_df_500_only = label_df_500_only.rename(
    columns={"commercial_500": "commercial", 
             "residential_500": "residential",
             "transportation_500": "transportation",
              "schools_500": "schools",
             "eventsites_500": "eventsites" 
            }
)

### Save results

In [None]:
# save the result of different radius
csv_buffer = io.StringIO()
label_df_new.to_csv(csv_buffer)
response = client.put_object( 
    Bucket=config.get("global").get("s3_bucket"),
    Body=csv_buffer.getvalue(),
    Key='input/processed/frontend/different_radius_seattle_groundtruth_labels_with_openstreetmap_features.csv'
)

In [None]:
# save the result for one radius 500
csv_buffer = io.StringIO()
label_df_500_only.to_csv(csv_buffer)
response = client.put_object( 
    Bucket=config.get("global").get("s3_bucket"),
    Body=csv_buffer.getvalue(),
    Key='input/processed/frontend/new_seattle_groundtruth_labels_with_openstreetmap_features.csv'
)

## Catboost - might be moved to new notebook later

In [None]:
!pip install catboost
!pip install pulearn

In [None]:
from sklearn.model_selection import train_test_split
from pulearn import ElkanotoPuClassifier
from catboost import CatBoostClassifier

In [None]:
feature_names = ['length', 'highway', 'commercial', 'residential', 'transportation', 'schools', 'eventsites','hour','weekday']
cat_features = ['highway', 'hour', 'weekday'] #specify which of the features from above are categorical
cat_feat_pos = np.where([feat in cat_features for feat in feature_names])[0] #position of categorical features 
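
The positions computed above matter because the model is later fitted on `x_train.values`, a plain array without column names, so the categorical columns have to be identified by index. The same lookup without NumPy, on a shortened, hypothetical feature list:

```python
# Shortened, hypothetical feature list for illustration
feature_names_demo = ["length", "highway", "commercial", "hour", "weekday"]
cat_features_demo = {"highway", "hour", "weekday"}

# Indices of the categorical features within the feature list
cat_positions = [i for i, name in enumerate(feature_names_demo) if name in cat_features_demo]
print(cat_positions)
```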

In [None]:
# use the 500 m frame here: it carries the unsuffixed POI columns plus 'hour' and 'weekday' listed in feature_names
label_df_500_only[cat_features] = label_df_500_only[cat_features].astype("str")
num_features = [feat for feat in feature_names if feat not in cat_features]
label_df_500_only[num_features] = label_df_500_only[num_features].astype(float)

In [None]:
# TODO: some of the features are lists, we should investigate if that's what we want

In [None]:
x_train, x_test, y_train, y_test = train_test_split(label_df_500_only[feature_names], label_df_500_only['availability'].astype("int"), test_size=0.25)

### Train

In [None]:
base_elkan_model = CatBoostClassifier(cat_features=cat_feat_pos, metric_period=100)
# elkan_model = ElkanotoPuClassifier(base_elkan_model,hold_out_ratio=0.2)
base_elkan_model.fit(x_train.values,y_train.values)

### Evaluate

In [None]:
from sklearn.metrics import recall_score, f1_score, roc_auc_score, precision_score, accuracy_score, matthews_corrcoef

In [None]:
pred=base_elkan_model.predict(x_test)

In [None]:
print(f'recall {recall_score(y_pred=pred,y_true=y_test)}')
print(f'precision {precision_score(y_pred=pred,y_true=y_test)}')
print(f'accuracy {accuracy_score(y_pred=pred,y_true=y_test)}')
# note: AUC is computed on hard class predictions here; predicted probabilities would give a threshold-independent score
print(f'auc {roc_auc_score(y_score=pred, y_true=y_test)}')
print(f'F1-Score: {f1_score(y_true=y_test, y_pred=pred)}')
print(f'Matthews Correlation: {matthews_corrcoef(y_true=y_test, y_pred=pred)}')
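
For reference, the threshold-based metrics printed above can all be derived from the four confusion-matrix counts. The sketch below computes them by hand for a binary problem; it is an illustration of the standard definitions, not a replacement for scikit-learn, and `binary_metrics` and the toy labels are hypothetical.

```python
import math

def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy, F1 and MCC from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}

# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```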

In [None]:
# TODO: what we can do to tweak the model:
# - change the clustering of the POIs
# - change the radius (currently 500m) for the POIs
# - include more/other OpenStreetMap data
# - use another classifier than CatBoost
# - hyperparameter tuning
# - make the problem a regression problem (how many free spaces)