In [1]:
import pandas as pd
from sktime.utils.data_processing import from_multi_index_to_nested

In [2]:
data = pd.read_csv('most_0400_0600_1_5.csv', sep=';')

First, let's have an overview of our data.

In [3]:
data.describe()

| | timestep_time | vehicle_angle | vehicle_pos | vehicle_slope | vehicle_speed | vehicle_x | vehicle_y | vehicle_z | person_angle | person_pos | person_slope | person_speed | person_x | person_y | person_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2067350.0 | 1228876.0 | 1228876.0 | 1228876.0 | 1228876.0 | 1228876.0 | 1228876.0 | 1228876.0 | 838473.0 | 837976.0 | 838473.0 | 837977.0 | 838473.0 | 838473.0 | 838473.0 |
| mean | 20211.32 | 170.2282 | 183.2031 | -0.6081239 | 5.87874 | 4394.096 | 2036.82 | 97.45693 | 161.218636 | 181.459266 | -0.35345 | 7.413234 | 4313.035494 | 2069.419362 | 109.380626 |
| std | 977.9872 | 99.15551 | 578.1648 | 5.173104 | 6.913619 | 1581.162 | 1031.879 | 115.775 | 100.731614 | 355.82489 | 6.752039 | 6.224017 | 1748.461993 | 1075.116513 | 120.228287 |
| min | 14400.0 | 0.0 | 0.0 | -88.73 | 0.0 | -0.21 | 93.46 | 0.0 | 0.0 | 0.0 | -90.0 | 0.0 | -0.21 | 93.46 | -0.08 |
| 25% | 19525.0 | 81.79 | 15.53 | -2.75 | 0.03 | 3935.52 | 1313.255 | 20.73 | 69.47 | 13.38 | -2.25 | 1.19 | 3548.03 | 1338.41 | 26.25 |
| 50% | 20355.0 | 196.88 | 43.6 | 0.0 | 2.18 | 4513.73 | 1859.05 | 61.08 | 157.55 | 51.95 | 0.0 | 6.9 | 4461.74 | 1935.2 | 67.8 |
| 75% | 21025.0 | 232.31 | 133.5425 | 1.09 | 12.33 | 5155.55 | 2665.06 | 106.98 | 240.1 | 159.62 | 1.53 | 12.87 | 5136.77 | 2852.4 | 126.25 |
| max | 21595.0 | 359.99 | 13234.21 | 90.0 | 55.47 | 9976.57 | 6359.29 | 597.21 | 359.99 | 4183.48 | 90.0 | 46.09 | 9976.57 | 6356.52 | 575.2 |


There are several columns that we won't need. For our problem, the relevant features are `vehicle_speed` and `vehicle_angle`: our initial assumption is that buses follow more erratic trajectories than cars, since they have to cover several different routes, whereas private vehicles usually go from one point to another. We also keep `vehicle_id` and `timestep_time` to index each series, and `vehicle_type` as the label. Let's drop the remaining columns.

In [4]:
# keeps only relevant columns
data = data[['vehicle_id', 'timestep_time', 'vehicle_speed', 'vehicle_angle', 'vehicle_type']]

Now let's see how the vehicle categories are defined.

In [5]:
data['vehicle_type'].unique()

array([nan, 'bus', 'train', 'hw_trailer', 'fastbicycle', 'motorcycle',
       'passenger4', 'passenger2b', 'truck', 'delivery', 'coach',
       'passenger2a', 'passenger3', 'passenger1', 'moped', 'trailer',
       'avgbicycle', 'hw_motorcycle', 'hw_passenger1', 'taxi',
       'emergency', 'slowbicycle', 'authority', 'uber', 'hw_coach',
       'hw_passenger2a', 'hw_truck', 'hw_passenger3', 'hw_passenger4',
       'hw_delivery', 'hw_passenger2b', 'army'], dtype=object)

According to the documentation, all the vehicle types that start with `passenger` are subclasses of cars. Therefore, we will unify these labels after we drop all categories that are neither cars nor buses.

In [6]:
# drops all entries that are neither cars nor buses
# (na=False treats missing types as non-matches; .copy() avoids modifying a view)
data = data[(data['vehicle_type'] == 'bus') | (data['vehicle_type'].str.startswith('passenger', na=False))].copy()

# unifies car label for all entries
data['vehicle_type'] = data['vehicle_type'].map(lambda x: 'car' if x.startswith('passenger') else x)
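
As a quick sanity check of the filter-and-relabel step, the sketch below applies the same logic to a small made-up frame (the `toy` frame and its values are hypothetical, not taken from the dataset). It also shows that prefixed types such as `hw_passenger1` do not match `startswith('passenger')`, so they are dropped along with the rows whose type is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are invented for illustration only.
toy = pd.DataFrame({
    'vehicle_id': ['a', 'b', 'c', 'd', 'e'],
    'vehicle_type': ['bus', 'passenger1', 'hw_passenger1', np.nan, 'truck'],
})

# na=False treats missing types as "no match" instead of propagating NaN.
is_bus = toy['vehicle_type'] == 'bus'
is_car = toy['vehicle_type'].str.startswith('passenger', na=False)
kept = toy[is_bus | is_car].copy()

# Collapse every passenger* subtype into a single 'car' label.
kept['vehicle_type'] = kept['vehicle_type'].map(
    lambda t: 'car' if t.startswith('passenger') else t
)

print(kept)
```

If the `hw_` variants should also count as cars, the prefix test would need to be widened; that is a modelling decision the original filter does not make.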

Now let's check whether there are any null values in our modified dataset.

In [7]:
data.isna().sum()

vehicle_id       0
timestep_time    0
vehicle_speed    0
vehicle_angle    0
vehicle_type     0
dtype: int64

To work with the time-series algorithms in the `sktime` package, we need to transform our dataset into a *nested* one, in which each cell holds an entire series. Since a nested dataframe cannot be saved to disk directly, we transform our dataframe into a multi-indexed one and save that instead. A multi-indexed dataframe can later be converted to the nested format with the `from_multi_index_to_nested` function from `sktime`.

In [8]:
# multi-index dataframe: sorts by vehicle and time, then moves both
# columns into the index (set_index drops them from the columns)
data = data.sort_values(by=['vehicle_id', 'timestep_time']).set_index(['vehicle_id', 'timestep_time'])
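
To make the target layout concrete: in a nested frame there is one row per instance (here, one per vehicle) and each cell holds a whole `pd.Series`. The following is a pandas-only illustration of that idea on invented data. It is not sktime's implementation, and the helper name `to_nested_sketch` is hypothetical.

```python
import pandas as pd

# Hypothetical multi-indexed frame with the same index levels as above.
toy = pd.DataFrame({
    'vehicle_id': ['a', 'a', 'a', 'b', 'b'],
    'timestep_time': [0.0, 1.0, 2.0, 0.0, 1.0],
    'vehicle_speed': [0.0, 1.5, 3.0, 10.0, 9.5],
    'vehicle_angle': [90.0, 91.0, 95.0, 180.0, 180.0],
}).set_index(['vehicle_id', 'timestep_time'])


def to_nested_sketch(frame, instance_level='vehicle_id'):
    """Return one row per instance; each cell is a Series indexed by time."""
    # Split by instance and drop that level so only the time index remains.
    groups = {
        key: group.droplevel(instance_level)
        for key, group in frame.groupby(level=instance_level)
    }
    ids = list(groups)
    # dtype=object keeps each per-vehicle Series intact as a single cell.
    nested_columns = {
        col: pd.Series([groups[i][col] for i in ids], index=ids, dtype=object)
        for col in frame.columns
    }
    return pd.DataFrame(nested_columns)


nested = to_nested_sketch(toy)
print(nested.shape)
```

Series of different lengths fit in the same column, which is why vehicles observed for different numbers of timesteps can share one frame.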

In [9]:
data.to_csv('cleaned_data.csv')
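
When the file is loaded again, the two index columns have to be named explicitly, otherwise they come back as ordinary columns. Below is a minimal sketch of the round trip, using an in-memory buffer and invented values in place of `cleaned_data.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned, multi-indexed dataframe.
toy = pd.DataFrame({
    'vehicle_id': ['a', 'a', 'b'],
    'timestep_time': [0.0, 1.0, 0.0],
    'vehicle_speed': [0.0, 1.5, 10.0],
    'vehicle_angle': [90.0, 91.0, 180.0],
    'vehicle_type': ['bus', 'bus', 'car'],
}).set_index(['vehicle_id', 'timestep_time'])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col rebuilds the MultiIndex from the two leading columns.
restored = pd.read_csv(buffer, index_col=['vehicle_id', 'timestep_time'])
print(restored.index.names)
```

The restored frame has the multi-indexed shape that `from_multi_index_to_nested` is meant to receive.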