# Data selection

This notebook explains how and why we selected the data used in the classification tasks. All the code below is summarised in the `get_selected_data` function of the `data_handler.py` module.

Let's load the metadata and list the classes it contains.

In [1]:
import numpy as np
from data_handler import load_trajs_metadata, load_trajs_data

metadata = load_trajs_metadata("trajectories/metadata.json")
all_classes = set(traj_md["class"] for traj_md in metadata)
print(all_classes)

{'run', 'motorcycle', 'boat', 'train', 'airplane', 'bike', 'bus', 'taxi', 'subway', 'car', 'walk'}


For this project we will classify only among the following classes: `car`, `taxi`, `bus`, `run`, `walk`, `bike`, `subway`, `train`.

Also, some classes can be treated as the same (e.g., `car` and `taxi`, `train` and `subway`).

Let's remove the unnecessary trajectories and merge the classes that can be treated as the same type.

In [2]:
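classes = {"car", "taxi", "run", "bus", "walk", "bike", "subway", "train"}
# Keep only the trajectories whose class is of interest
data = [traj_md for traj_md in metadata if traj_md["class"] in classes]
# Merge equivalent classes (note: this edits the metadata dicts in place)
for traj_md in data:
    if traj_md["class"] == "taxi":
        traj_md["class"] = "car"
    elif traj_md["class"] == "subway":
        traj_md["class"] = "train"
# The merged labels no longer exist as separate classes
classes.remove("taxi")
classes.remove("subway")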
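
The relabelling above can also be written with a lookup table, which keeps the merge rules in one place and leaves the original metadata untouched. This is only a minimal sketch, assuming each metadata entry is a dict with a `"class"` key as in the cells above; `CLASS_ALIASES`, `KEPT_CLASSES` and `select_and_merge` are illustrative names and are not part of `data_handler.py`.

```python
# Hypothetical helper for illustration; not part of data_handler.py.
CLASS_ALIASES = {"taxi": "car", "subway": "train"}
KEPT_CLASSES = {"car", "bus", "run", "walk", "bike", "train"}


def select_and_merge(metadata):
    """Return copies of the entries whose merged class is in KEPT_CLASSES."""
    selected = []
    for traj_md in metadata:
        # Map an alias to its target class, or keep the class unchanged
        merged = CLASS_ALIASES.get(traj_md["class"], traj_md["class"])
        if merged in KEPT_CLASSES:
            # Shallow copy, so the input entry keeps its original label
            selected.append({**traj_md, "class": merged})
    return selected


# Toy metadata, made up for this example
toy = [{"class": "taxi"}, {"class": "boat"}, {"class": "subway"}, {"class": "walk"}]
merged_toy = select_and_merge(toy)
print([md["class"] for md in merged_toy])
```

Because the entries are copied, the `metadata` list could be reused later with a different set of merge rules.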

Let's see how many trajectories are left and how many there are of each class.

In [3]:
from collections import Counter

def data_info(data):
    # Print the total number of trajectories and the count per class
    print("Total trajectories:", len(data))
    classes_count = Counter(traj_md["class"] for traj_md in data)
    for clsf, amount in classes_count.items():
        print(f"{clsf}: {amount}")

data_info(data)

Total trajectories: 9458
train: 790
car: 1291
walk: 3979
bus: 1846
bike: 1548
run: 4


As we can see, the `run` class has only 4 trajectories. This is not enough data for a classifier to learn from, so we remove it as well.

In [4]:
data = [traj_md for traj_md in data if traj_md["class"] != "run"]
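
The same removal could be driven by a count threshold instead of a hard-coded class name. The sketch below assumes a hypothetical `min_count` parameter and a hypothetical `drop_rare_classes` helper; the notebook itself does not define a threshold, it simply drops `run`.

```python
from collections import Counter


def drop_rare_classes(data, min_count):
    """Keep only entries whose class occurs at least min_count times (hypothetical helper)."""
    counts = Counter(md["class"] for md in data)
    return [md for md in data if counts[md["class"]] >= min_count]


# Toy metadata, made up for this example: 5 "walk" entries and 2 "run" entries
toy = [{"class": "walk"} for _ in range(5)] + [{"class": "run"} for _ in range(2)]
kept = drop_rare_classes(toy, min_count=3)
print(Counter(md["class"] for md in kept))
```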

Now, let's add the data (GPS points and time) of each trajectory using the `load_trajs_data` function.

In [5]:
data = load_trajs_data(data)

100.00%

Let's look at some basic statistics for the first 10 trajectories.

In [6]:
def traj_mean_dt(traj):
    # Mean time step between consecutive points (column 2 holds the time)
    return np.mean(np.diff(traj[:, 2]))

for tmd in data[:10]:
    traj_data = tmd["traj_data"]
    print(
        "--------------------",
        f"Mean dt:  {traj_mean_dt(traj_data):.2f} s",
        f"Traj len: {traj_data.shape[0]}",
        sep="\n"
    )

--------------------
Mean dt:  56.79 s
Traj len: 69
--------------------
Mean dt:  57.35 s
Traj len: 379
--------------------
Mean dt:  65.02 s
Traj len: 801
--------------------
Mean dt:  120.74 s
Traj len: 716
--------------------
Mean dt:  124.10 s
Traj len: 316
--------------------
Mean dt:  80.24 s
Traj len: 502
--------------------
Mean dt:  59.22 s
Traj len: 10
--------------------
Mean dt:  58.73 s
Traj len: 439
--------------------
Mean dt:  59.18 s
Traj len: 12
--------------------
Mean dt:  58.88 s
Traj len: 9
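
Ten trajectories are only a small sample. To judge the whole dataset, the same two quantities can be summarised with percentiles. The sketch below runs on synthetic trajectories and assumes each one is an `(n, 3)` array with the time in column 2, as `traj[:, 2]` in the cell above suggests; `summarize` and `make_traj` are hypothetical helpers.

```python
import numpy as np


def summarize(trajs, qs=(10, 50, 90)):
    """Percentiles of the mean time step and of the length over a list of trajectories."""
    mean_dts = np.array([np.mean(np.diff(t[:, 2])) for t in trajs])
    lengths = np.array([t.shape[0] for t in trajs])
    return np.percentile(mean_dts, qs), np.percentile(lengths, qs)


def make_traj(n, dt):
    """Synthetic trajectory: n points at a constant time step dt, coordinates set to zero."""
    t = np.arange(n) * dt
    return np.column_stack([np.zeros(n), np.zeros(n), t])


# Synthetic data, made up for this example
toy_trajs = [make_traj(200, 1.0), make_traj(50, 60.0), make_traj(10, 120.0)]
dt_q, len_q = summarize(toy_trajs)
print("mean dt percentiles (10/50/90):", dt_q)
print("length percentiles (10/50/90): ", len_q)
```

On the real data, `summarize([tmd["traj_data"] for tmd in data])` would give a rough idea of how many trajectories a given threshold would discard.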


This shows that some trajectories have a very long time spacing between points and others have very few points. So, let's keep only the trajectories with a mean `dt <= 3 s` and `len >= 100`.

In [10]:
data = [
    tmd for tmd in data
    if traj_mean_dt(tmd["traj_data"]) <= 3 and tmd["traj_data"].shape[0] >= 100
]
data_info(data)

Total trajectories: 4004
walk: 1208
car: 410
train: 303
bus: 989
bike: 1094
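
The filter above can be wrapped in a small predicate with named thresholds, which makes the two criteria easier to change and test. This is a sketch on synthetic data: `passes_filter`, `MAX_MEAN_DT`, `MIN_LEN` and `make_traj` are hypothetical names, and the `(n, 3)` layout with time in column 2 (in seconds) is assumed from the cells above.

```python
import numpy as np

MAX_MEAN_DT = 3.0  # seconds, same value as in the cell above
MIN_LEN = 100      # points, same value as in the cell above


def passes_filter(traj, max_mean_dt=MAX_MEAN_DT, min_len=MIN_LEN):
    """True if the trajectory is long enough and densely sampled on average."""
    # Check the length first: with min_len >= 2 this also avoids taking
    # the mean of an empty np.diff for single-point trajectories
    if traj.shape[0] < min_len:
        return False
    return float(np.mean(np.diff(traj[:, 2]))) <= max_mean_dt


def make_traj(n, dt):
    """Synthetic trajectory: n points at a constant time step dt, coordinates set to zero."""
    t = np.arange(n) * dt
    return np.column_stack([np.zeros(n), np.zeros(n), t])


# Synthetic data, made up for this example
toy = [make_traj(150, 2.0), make_traj(150, 5.0), make_traj(20, 1.0)]
results = [passes_filter(t) for t in toy]
print(results)
```

Note that the criterion uses the mean time step, so a trajectory with a few long gaps can still pass if most of its points are closely spaced.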


As the output shows, the final total is 4004 trajectories. The distribution across classes is not homogeneous (e.g., 303 `train` versus 1208 `walk` trajectories), but this will be taken into account in the classification.
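
This notebook does not say how the imbalance is handled downstream. One common option, shown here only as an illustration, is to weight each class inversely to its frequency. The counts are copied from the output above; the weighting formula `total / (n_classes * count)` is an assumption about what could be done, not the method used in this project.

```python
# Class counts copied from the output of the final filtering step
counts = {"walk": 1208, "car": 410, "train": 303, "bus": 989, "bike": 1094}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: rarer classes get larger weights
class_weights = {c: total / (n_classes * n) for c, n in counts.items()}

for c, w in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{c}: {w:.2f}")
```

With these weights, every class contributes the same total weight to a weighted loss, so the rare `train` class is not dominated by `walk`.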