# Generate historical, training and test sets

In this notebook we create a total of 6 sets of time periods as follows:

1. Two **historical sets** (for scenarios 1 and 2) from the year **2017**:
    - historical_set_scenario1: 100 low-demand time periods (weekdays, 8am-4pm)
    - historical_set_scenario2: 100 high-demand time periods (weekends, midnight-8am)

In the multi-objective GA, these sets are used to select the subset of all historical crimes that fall within these 100 time periods, in order to:
    - measure the density of historical incidents/crimes on each street (used to calculate deterrence score)
    - identify the 5 hottest streets to patrol in each beat as part of the patrol route
    - inform the design of configurations targeted towards historically hot beats
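
The first two uses above (street-level density and the hottest streets per beat) can be sketched as follows. This is only an illustration under stated assumptions: `Date_Time` is the column used in this notebook, but `beat` and `street_id` are hypothetical column names, the data is made up, and the deterrence score itself is not computed here.

```python
import pandas as pd

# Hypothetical incident table: `Date_Time` matches the notebook's column,
# while `beat` and `street_id` are assumed names for illustration only.
incidents = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2017-03-06 09:15", "2017-03-06 10:40", "2017-03-06 15:59",
        "2017-03-06 17:05", "2017-03-07 08:30",
    ]),
    "beat": ["A", "A", "A", "A", "B"],
    "street_id": [11, 11, 12, 11, 21],
})

# One historical time period as a (start, end) pair.
shifts = [(pd.Timestamp("2017-03-06 08:00"), pd.Timestamp("2017-03-06 16:00"))]


def incidents_in_shifts(df, shifts):
    """Keep incidents whose timestamp falls in any [start, end) time period."""
    mask = pd.Series(False, index=df.index)
    for start, end in shifts:
        mask |= (df["Date_Time"] >= start) & (df["Date_Time"] < end)
    return df[mask]


def hottest_streets(df, n=5):
    """Count incidents per (beat, street) and keep the n busiest streets per beat."""
    counts = (df.groupby(["beat", "street_id"]).size()
                .rename("n_incidents").reset_index())
    counts = counts.sort_values(["beat", "n_incidents"], ascending=[True, False])
    return counts.groupby("beat").head(n).reset_index(drop=True)


subset = incidents_in_shifts(incidents, shifts)
hot = hottest_streets(subset, n=5)
print(hot)
```

Only the three incidents inside the 8am-4pm window on 2017-03-06 survive the filter, so beat B has no hot streets in this toy example.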

The single-objective GA does not measure deterrence, so it has no need for historical incidents/crimes.

2. Two **training sets** (for scenarios 1 and 2) from the year **2018**:
    - training_set_scenario1: 100 low-demand time periods (weekdays, 8am-4pm)
    - training_set_scenario2: 100 high-demand time periods (weekends, midnight-8am)

In the ABM experiments, this is used as a 'historical set' but for the year 2018. The goal is to select a subset of all historical CFS incidents or crimes for these 100 time periods in order to:
    - measure the density of historical incidents/crimes on each street (used to calculate deterrence score)
    - identify the 5 hottest streets to patrol in each beat as part of the patrol route
    - inform the design of configurations targeted towards historically hot beats

In the single-objective and multi-objective GAs, this is used to train the GA, $k$ time periods at a time, where $k$ is the RSS value. In other words, at each generation we evaluate each individual in the population by running $k$ ABMs, one for each of $k$ time periods randomly sampled at that generation.
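
A minimal sketch of this per-generation sampling is given below. It rests on several assumptions not stated in this notebook: the same $k$ time periods are shared by all individuals within a generation, fitness is the mean over the $k$ runs, and `evaluate_individual` is a hypothetical placeholder for one ABM run.

```python
import random
from datetime import datetime, timedelta

# Stand-in for a training set: 100 (start, end) pairs.
# Consecutive days are used purely for illustration.
training_set = [
    (datetime(2018, 1, 1, 8) + timedelta(days=d),
     datetime(2018, 1, 1, 16) + timedelta(days=d))
    for d in range(100)
]


def evaluate_individual(individual, shift):
    """Hypothetical placeholder for one ABM run; returns a dummy fitness."""
    return 0.0


def evaluate_generation(population, training_set, k, rng):
    """Score each individual as its mean fitness over k sampled time periods."""
    # k distinct time periods, redrawn every time this function is called.
    shifts = rng.sample(training_set, k)
    scores = [
        sum(evaluate_individual(ind, s) for s in shifts) / k
        for ind in population
    ]
    return shifts, scores


rng = random.Random(0)
population = ["config_a", "config_b"]  # hypothetical individuals
for generation in range(3):
    shifts, scores = evaluate_generation(population, training_set, k=5, rng=rng)
```

Sharing the sampled time periods within a generation keeps the comparison between individuals fair, at the cost of fitness values that are not directly comparable across generations.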


3. Two **test sets** (for scenarios 1 and 2) from the year **2019**:
    - testing_set_scenario1: 100 low-demand time periods (weekdays, 8am-4pm)
    - testing_set_scenario2: 100 high-demand time periods (weekends, midnight-8am)

In the ABM experiments, this is used to run a simulation for each of these 100 time periods, and present aggregated performance metrics.

In the single-objective and multi-objective GAs, this is used to run a final evaluation of the 'best' solutions identified by the GA on time periods unseen during training. This provides a fair and equal evaluation of all the individuals on the same 100 time periods.


NB: by design, all three types of set contain different time periods as they belong to different years.

In [17]:
import os

import osmnx as ox
import pandas as pd

from datetime import datetime, timedelta, time
import numpy as np
import scipy as sp

import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns
import matplotlib as mpl
import pickle

import random

import warnings

In [18]:
def getYearData(YEAR) :
    data = pd.read_csv("../data/incidents.csv")
    data.Date_Time = pd.to_datetime(data.Date_Time)
    data.Date_Time = data.Date_Time.dt.tz_localize(None)
   
    data = data[(data['Date_Time'].dt.year == YEAR)]
    return data

In [19]:
def get_start_end_shift(date, start_time):
    # Every time period is an 8-hour shift starting on the hour `start_time`.
    duration_hours = 8
    start_time = time(start_time, 0)

    SHIFT_START_DT = datetime.combine(date, start_time)
    SHIFT_END_DT = SHIFT_START_DT + timedelta(hours=duration_hours)

    return SHIFT_START_DT, SHIFT_END_DT
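
As a quick illustration of what this helper returns for the two scenarios, the snippet below restates its logic so that it runs on its own. The two dates are arbitrary examples (a Monday and a Saturday in 2019).

```python
from datetime import date, datetime, time, timedelta


def get_start_end_shift(day, start_hour):
    """Same logic as the helper above, restated so the snippet is standalone."""
    start = datetime.combine(day, time(start_hour, 0))
    return start, start + timedelta(hours=8)


# Scenario 1: weekday shift, 8am-4pm.
weekday_start, weekday_end = get_start_end_shift(date(2019, 3, 4), 8)
# Scenario 2: weekend shift, midnight-8am.
weekend_start, weekend_end = get_start_end_shift(date(2019, 3, 9), 0)

print(weekday_start, weekday_end)
print(weekend_start, weekend_end)
```

Each time period is therefore a `(start, end)` pair of `datetime` objects, which is the form stored in the pickled sets.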

In [20]:
def getSetForScenario(data, scenario_num) :
    print('Scenario: {}'.format(scenario_num))
    
    ## WEEKDAYS
    if scenario_num == 1 :
        
        # Select the weekdays (Monday to Friday)
        df_weekdays = data[data.Date_Time.dt.weekday // 5 == 0]
        df_weekdays = df_weekdays[(df_weekdays.Date_Time.dt.hour >= 8) & 
                                        (df_weekdays.Date_Time.dt.hour < 16)]
        
        dates_uniques_weekdays = df_weekdays['Date_Time'].sort_values().dt.date.unique()
        print('There are {} unique dates in dataset'.format(len(dates_uniques_weekdays)))

            
        ## get all weekday shifts
        all_shifts = [get_start_end_shift(date, 8) for date in  dates_uniques_weekdays]

    ## WEEKENDS
    else :
        # Select the weekend days (Saturday and Sunday)
        df_weekends = data[data.Date_Time.dt.weekday // 5 == 1]
        df_weekends = df_weekends[(df_weekends.Date_Time.dt.hour >= 0) & 
                                        (df_weekends.Date_Time.dt.hour < 8)]
        
        dates_uniques_weekends = df_weekends['Date_Time'].sort_values().dt.date.unique()
        print('There are {} unique dates in dataset'.format(len(dates_uniques_weekends)))


        # get all weekend shifts
        all_shifts = [get_start_end_shift(date, 0) for date in  dates_uniques_weekends]


   
    # select 100 shifts at random
    random.seed(222)
    set_for_scenario = random.sample(all_shifts, 100)
    return set_for_scenario
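
The `dt.weekday // 5` expression used above to split weekdays from weekends can be checked on a small example. The week below is arbitrary and only illustrates the integer-division trick.

```python
import pandas as pd

# One timestamp per day for a full week, Monday 2019-03-04 to Sunday 2019-03-10.
stamps = pd.Series(pd.date_range("2019-03-04 09:00", periods=7, freq="D"))

# dt.weekday is 0 (Monday) to 6 (Sunday); integer division by 5 gives
# 0 for Monday-Friday and 1 for Saturday-Sunday.
is_weekend = stamps.dt.weekday // 5 == 1

weekday_dates = stamps[~is_weekend].dt.date.tolist()
weekend_dates = stamps[is_weekend].dt.date.tolist()

print(weekday_dates)
print(weekend_dates)
```

An equivalent and arguably more explicit test is `stamps.dt.weekday >= 5` for weekends.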

## Test set (2019)

In [15]:
data = getYearData(2019)

In [16]:
testing_set_scenario1 = getSetForScenario(data, 1)
testing_set_scenario2 = getSetForScenario(data, 2)

Scenario: 1
There are 261 unique dates in dataset
Scenario: 2
There are 104 unique dates in dataset


In [7]:
import pickle
with open('../data/testing_set_scenario1.pkl', 'wb') as f:
    pickle.dump(testing_set_scenario1, f)
with open('../data/testing_set_scenario2.pkl', 'wb') as f:
    pickle.dump(testing_set_scenario2, f)

## Training set (2018)

In [13]:
data = getYearData(2018)

In [14]:
training_set_scenario1 = getSetForScenario(data, 1)
training_set_scenario2 = getSetForScenario(data, 2)

Scenario: 1
There are 261 unique dates in dataset
Scenario: 2
There are 104 unique dates in dataset


In [53]:
import pickle
with open('../data/training_set_scenario1.pkl', 'wb') as f:
    pickle.dump(training_set_scenario1, f)
with open('../data/training_set_scenario2.pkl', 'wb') as f:
    pickle.dump(training_set_scenario2, f)

## Historical set (2017)

In [6]:
data = getYearData(2017)

In [12]:
historical_set_scenario1 = getSetForScenario(data, 1)
historical_set_scenario2 = getSetForScenario(data, 2)

Scenario: 1
There are 260 unique dates in dataset
Scenario: 2
There are 105 unique dates in dataset


In [27]:
import pickle
# Save the 2017 historical sets (not the 2018 training sets)
with open('../data/historical_set_scenario1.pkl', 'wb') as f:
    pickle.dump(historical_set_scenario1, f)
with open('../data/historical_set_scenario2.pkl', 'wb') as f:
    pickle.dump(historical_set_scenario2, f)
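
Since the three types of set are meant to come from different years, a small sanity check after saving can confirm that a pickled set matches its description. The sketch below is an assumption-laden illustration: `check_shift_set` is a hypothetical helper, the example set is rebuilt from the calendar rather than from `incidents.csv`, and the file is written to a temporary directory rather than `../data`.

```python
import pickle
import random
import tempfile
from datetime import date, datetime, time, timedelta
from pathlib import Path


def check_shift_set(shifts, year, size=100, hours=8):
    """Raise AssertionError if the set does not match its description."""
    assert len(shifts) == size, "unexpected number of time periods"
    assert len(set(shifts)) == size, "duplicate time periods"
    for start, end in shifts:
        assert start.year == year, f"{start} is not in {year}"
        assert end - start == timedelta(hours=hours), "unexpected shift length"


# Stand-in for a 2017 weekend set: 100 midnight shifts sampled from all
# Saturdays and Sundays of 2017.
days = [date(2017, 1, 1) + timedelta(days=d) for d in range(365)]
starts = [datetime.combine(d, time(0, 0)) for d in days if d.weekday() >= 5]
example_set = random.Random(222).sample(
    [(s, s + timedelta(hours=8)) for s in starts], 100)

# Round-trip through a temporary file, then validate what was reloaded.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "historical_set_scenario2.pkl"
    with open(path, "wb") as f:
        pickle.dump(example_set, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

check_shift_set(reloaded, year=2017)
```

A check of this kind would flag a file whose contents belong to a different year than its name suggests.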