# **Build Time-Series Dataset (Sprint 2 CA2)**

### *Objective*
Prepare a reusable dataset for **time-series models** (e.g., HMM, LSTM, 1D CNN) from the cleaned Sprint 1 datasets.

We will create:
1. **Base time-series dataset**: sensor sequences per trip (bookingID), labelled by `is_dangerous_trip`.
2. **Hybrid time-series dataset (optional)**: base dataset + **static context** (driver attributes and, if the feature file exists, trip-level engineered features) appended to every timestep.

Outputs are saved to:
- `Datasets/time_series_data/`

### *Why this is separated from modelling*
Time-series formatting (grouping by trip, sorting by time, splitting by bookingID) is **shared** across multiple time-series models.
Keeping it in one place avoids:
- data leakage (splitting by rows instead of by trip)
- inconsistent preprocessing across models
- duplicated logic across notebooks
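
The leakage point can be illustrated on synthetic data. The sketch below is not part of the pipeline: the `toy` frame and its values are made up, and it only contrasts a row-based split with a trip-based split by counting how many trips end up on both sides.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy long-format data: 10 trips, 20 timesteps each (synthetic values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bookingID": np.repeat(np.arange(10), 20),
    "second": np.tile(np.arange(20), 10),
    "speed": rng.normal(size=200),
})

# Row-based split: timesteps of the same trip can land on both sides.
row_train, row_test = train_test_split(toy, test_size=0.2, random_state=42)
row_overlap = set(row_train["bookingID"]) & set(row_test["bookingID"])

# Trip-based split: split the unique IDs first, then select rows by ID.
trip_train_ids, trip_test_ids = train_test_split(
    toy["bookingID"].unique(), test_size=0.2, random_state=42
)
trip_train = toy[toy["bookingID"].isin(trip_train_ids)]
trip_test = toy[toy["bookingID"].isin(trip_test_ids)]
trip_overlap = set(trip_train["bookingID"]) & set(trip_test["bookingID"])

print("trips shared by row-based split:", len(row_overlap))
print("trips shared by trip-based split:", len(trip_overlap))
```

With the row-based split, a model can see part of a test trip during training, which is why the split is done on `bookingID` in this notebook.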

#

# **Imports + Paths**
---

In [8]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# base directory = parent of the current working directory, i.e. the "Sprint 2" folder
# (assumes the notebook runs from Sprint 2/Advanced Data Processing/)
BASE_DIR = os.path.dirname(os.getcwd())

CLEAN_DIR = os.path.join(BASE_DIR, "Datasets", "cleaned_datasets")
FEAT_DIR  = os.path.join(BASE_DIR, "Datasets", "ca2_features")  # engineered features (from a teammate) live here
OUT_DIR   = os.path.join(BASE_DIR, "Datasets", "time_series_data")

os.makedirs(OUT_DIR, exist_ok=True)

SENSOR_PATH = os.path.join(CLEAN_DIR, "sensor_data_cleaned.csv")
SAFETY_PATH = os.path.join(CLEAN_DIR, "safety_data_cleaned.csv")
DRIVER_PATH = os.path.join(CLEAN_DIR, "driver_data_cleaned.csv")

# (optional) feature-engineered datasets
FINAL_FEAT_PATH = os.path.join(FEAT_DIR, "final_selected_features.csv")  # most useful if exists
COMBINED_FEAT_PATH = os.path.join(FEAT_DIR, "combined_features.csv")     # fallback if needed

print("BASE_DIR:", BASE_DIR)
print("SENSOR_PATH:", SENSOR_PATH)
print("SAFETY_PATH:", SAFETY_PATH)
print("DRIVER_PATH:", DRIVER_PATH)
print("OUT_DIR:", OUT_DIR)

BASE_DIR: c:\PAI-GoBest-Project\Sprint 2
SENSOR_PATH: c:\PAI-GoBest-Project\Sprint 2\Datasets\cleaned_datasets\sensor_data_cleaned.csv
SAFETY_PATH: c:\PAI-GoBest-Project\Sprint 2\Datasets\cleaned_datasets\safety_data_cleaned.csv
DRIVER_PATH: c:\PAI-GoBest-Project\Sprint 2\Datasets\cleaned_datasets\driver_data_cleaned.csv
OUT_DIR: c:\PAI-GoBest-Project\Sprint 2\Datasets\time_series_data


#

# **Input datasets (from cleaned_datasets)**

##### `sensor_data_cleaned.csv`...
Contains high-frequency telematics sensor records, one row per timestep per trip.
Key fields:
- `bookingID` (trip id)
- `second` (time index)
- sensor channels (acceleration, gyro, speed, etc.)

##### `safety_data_cleaned.csv`...
Contains trip-level labels:
- `bookingID`
- `driver_id` (links each trip to `driver_data_cleaned.csv`)
- `label` (`True`/`False`, or possibly already numeric depending on cleaning)

##### `driver_data_cleaned.csv`...
Driver metadata (static per driver). This is **not time-series**, but can be merged as optional static context later.

---

### Target label used in Sprint 2
We will create:  
`is_dangerous_trip` where:
- 0 = safe
- 1 = dangerous

#

# **Load Cleaned Datasets**
---

In [2]:
sensor = pd.read_csv(SENSOR_PATH)
safety = pd.read_csv(SAFETY_PATH)
driver = pd.read_csv(DRIVER_PATH)

print("sensor shape:", sensor.shape)
print("safety shape:", safety.shape)
print("driver shape:", driver.shape)

display(sensor.head(3))
display(safety.head(3))
display(driver.head(3))

sensor shape: (7236579, 12)
safety shape: (20000, 3)
driver shape: (500, 8)


|   | record_id | bookingID | accuracy | bearing | acceleration_x | acceleration_y | acceleration_z | gyro_x | gyro_y | gyro_z | second | speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3394724 | 0 | 8.0 | 143.298294 | -1.706207 | -9.270792 | -1.209448 | -0.028965 | -0.032652 | 0.01539 | 2.0 | 0.228454 |
| 1 | 436148 | 0 | 8.0 | 143.298294 | -1.416705 | -9.548032 | -1.860977 | -0.022413 | 0.005049 | -0.025753 | 3.0 | 0.228454 |
| 2 | 5786266 | 0 | 8.0 | 143.298294 | -0.346924 | -9.532629 | -1.204663 | 0.014962 | -0.050033 | 0.025118 | 9.0 | 0.228454 |


|   | bookingID | driver_id | label |
|---|---|---|---|
| 0 | 0 | 359 | False |
| 1 | 1 | 313 | True |
| 2 | 2 | 27 | True |


|   | id | name | date_of_birth | no_of_years_driving_exp | gender | car_make | car_model_year | rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sinclair Birmingham | 1982-10-17 | 10 | Male | Audi | 2010 | 3.8 |
| 1 | 2 | Juline Faulks | 1977-11-30 | 14 | Female | BMW | 2000 | 2.8 |
| 2 | 3 | Germayne Stit | 1976-09-18 | 13 | Male | Mercedes-Benz | 1999 | 3.8 |


#

# **Label normalization**

The cleaned `safety_data_cleaned.csv` may store labels as:
- `true/false` strings, or
- boolean, or
- already numeric.

For consistency across modelling and GUI, we standardize to:

`is_dangerous_trip` ∈ {0, 1}

---

In [3]:
# safety schema example: bookingID, driver_id, label
# normalize label -> is_dangerous_trip (0/1)

safety = safety.copy()

if "is_dangerous_trip" in safety.columns:
    # already normalized
    safety["is_dangerous_trip"] = safety["is_dangerous_trip"].astype(int)
else:
    if "label" not in safety.columns:
        raise ValueError("Expected safety_data_cleaned.csv to contain 'label' or 'is_dangerous_trip'")

    # handle common encodings
    if safety["label"].dtype == "bool":
        safety["is_dangerous_trip"] = safety["label"].astype(int)
    else:
        # string true/false or mixed
        safety["label_norm"] = safety["label"].astype(str).str.lower().str.strip()
        safety["is_dangerous_trip"] = safety["label_norm"].map({
            "true": 1, "1": 1, "yes": 1,
            "false": 0, "0": 0, "no": 0
        })

        # if any unmapped values remain, surface them early
        bad = safety[safety["is_dangerous_trip"].isna()]["label"].unique()
        if len(bad) > 0:
            raise ValueError(f"Unrecognized label values in safety data: {bad}")

        safety["is_dangerous_trip"] = safety["is_dangerous_trip"].astype(int)

print(safety[["bookingID", "is_dangerous_trip"]].head())
print("label distribution:\n", safety["is_dangerous_trip"].value_counts())

   bookingID  is_dangerous_trip
0          0                  0
1          1                  1
2          2                  1
3          4                  1
4          6                  0
label distribution:
 is_dangerous_trip
0    15007
1     4993
Name: count, dtype: int64


#

# **Merge trip labels into sensor rows**

We join `sensor_data_cleaned` with `safety_data_cleaned` by `bookingID` so that every sensor record inherits the trip label.

This is required because time-series models learn from the **sequence**, but the label is defined at the **trip level**.
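
As a complement to the join described above, the sketch below shows one way to make the merge self-checking on toy data. It uses a left join with pandas' `validate` and `indicator` options, which is an alternative to the inner join used in this notebook, not a description of it; `sensor_toy` and `labels_toy` are made-up frames.

```python
import pandas as pd

# Toy stand-ins for the sensor and label tables (synthetic values).
sensor_toy = pd.DataFrame({
    "bookingID": [0, 0, 1, 1, 2],
    "second": [0.0, 1.0, 0.0, 1.0, 0.0],
    "speed": [0.2, 0.4, 3.1, 3.3, 1.0],
})
labels_toy = pd.DataFrame({"bookingID": [0, 1], "is_dangerous_trip": [0, 1]})

merged = sensor_toy.merge(
    labels_toy,
    on="bookingID",
    how="left",
    validate="many_to_one",  # raises MergeError if a bookingID has several label rows
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Trips that have sensor rows but no label (an inner join would drop them silently).
unlabelled_trips = merged.loc[merged["_merge"] == "left_only", "bookingID"].unique()

# Keep labelled rows only and restore the integer label dtype.
labelled = merged[merged["_merge"] == "both"].drop(columns="_merge")
labelled = labelled.astype({"is_dangerous_trip": int})

print("unlabelled trips:", unlabelled_trips.tolist())
```

The design choice here is to surface dropped or duplicated trips explicitly instead of inferring them from a change in row count.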

---

In [4]:
df = sensor.merge(
    safety[["bookingID", "is_dangerous_trip"]],
    on="bookingID",
    how="inner"
)

print("after merge shape:", df.shape)

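# sanity: count NaN labels (always 0 after an inner join, because sensor rows
# without a label are dropped; compare this shape with sensor.shape to spot drops)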
missing = df["is_dangerous_trip"].isna().sum()
print("missing labels:", missing)

# check time column exists
if "second" not in df.columns:
    raise ValueError("Expected sensor data to contain 'second' for time ordering.")

# sort for correct sequencing
df = df.sort_values(["bookingID", "second"]).reset_index(drop=True)

display(df.head(5))

after merge shape: (7236579, 13)
missing labels: 0


|   | record_id | bookingID | accuracy | bearing | acceleration_x | acceleration_y | acceleration_z | gyro_x | gyro_y | gyro_z | second | speed | is_dangerous_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3394724 | 0 | 8.0 | 143.298294 | -1.706207 | -9.270792 | -1.209448 | -0.028965 | -0.032652 | 0.01539 | 2.0 | 0.228454 | 0 |
| 1 | 436148 | 0 | 8.0 | 143.298294 | -1.416705 | -9.548032 | -1.860977 | -0.022413 | 0.005049 | -0.025753 | 3.0 | 0.228454 | 0 |
| 2 | 5786266 | 0 | 8.0 | 143.298294 | -0.346924 | -9.532629 | -1.204663 | 0.014962 | -0.050033 | 0.025118 | 9.0 | 0.228454 | 0 |
| 3 | 4176046 | 0 | 8.0 | 143.298294 | -0.600986 | -9.452029 | -2.157507 | 0.004548 | -0.011713 | -0.004078 | 11.0 | 0.228454 | 0 |
| 4 | 4528581 | 0 | 8.0 | 143.298294 | -0.597546 | -9.863403 | -1.672711 | -0.000401 | 0.000315 | -0.00983 | 12.0 | 0.228454 | 0 |


#

# **Feature selection for time-series (base channels)**

For time-series models, we keep raw sensor channels as **sequence observations**.
We intentionally avoid rolling statistics or aggregations here, because those belong to feature engineering and/or tabular models.

### Base observation channels (recommended)
- `speed`
- `acceleration_x`, `acceleration_y`, `acceleration_z`
- `gyro_x`, `gyro_y`, `gyro_z`

Optionally:
- `bearing`, `accuracy` (GPS quality signals)

We will:
1. choose a stable column list
2. drop rows with missing values in those columns
3. keep the dataset in **long format** (one row = one timestep)

---

In [5]:
# choose observation columns that likely exist
candidate_cols = [
    "speed",
    "acceleration_x", "acceleration_y", "acceleration_z",
    "gyro_x", "gyro_y", "gyro_z",
    "bearing", "accuracy"
]

obs_cols = [c for c in candidate_cols if c in df.columns]

if len(obs_cols) < 4:
    raise ValueError(f"Too few observation columns found. Available obs cols: {obs_cols}")

print("Using obs_cols:", obs_cols)

before = len(df)
df = df.dropna(subset=obs_cols)
after = len(df)
print(f"Dropped {before-after} rows due to NaNs in obs cols ({before} -> {after})")

Using obs_cols: ['speed', 'acceleration_x', 'acceleration_y', 'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z', 'bearing', 'accuracy']
Dropped 0 rows due to NaNs in obs cols (7236579 -> 7236579)


#

# **Optional static context features (if applicable)**

Time-series models can optionally use **static features** (per trip) such as:
- driver profile (birth year as an age proxy, driving experience, car model year, rating)
- engineered per-trip features (tsfresh/manual aggregates)

### Important note
- For **HMM**, adding many static features is usually NOT helpful because HMM assumes emissions are generated per timestep.
- For neural time-series models (LSTM/CNN), static features can be appended as extra channels repeated over time (constant across timesteps).
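
A minimal sketch of "static features repeated over time" for a single trip, using synthetic numbers. The two static values stand in for something like rating and years of experience; they are illustrative only.

```python
import numpy as np

# One trip: T timesteps, C sensor channels (synthetic values).
T, C = 5, 3
sequence = np.arange(T * C, dtype=float).reshape(T, C)

# Static per-trip context, e.g. [rating, years_of_experience] (illustrative values).
static = np.array([3.8, 10.0])

# Repeat the static vector on every timestep and append it as extra channels.
static_tiled = np.broadcast_to(static, (T, static.shape[0]))
hybrid = np.concatenate([sequence, static_tiled], axis=1)

print(hybrid.shape)  # (5, 5): 3 sensor channels + 2 static channels per timestep
```

Every row of `hybrid` carries the same static values, which is exactly what the long-format hybrid CSV stores after the merges below.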

So we will build:
1. **Base dataset**: time-series sensor channels only (best for HMM baseline).
2. **Hybrid dataset (optional)**: base + engineered static features repeated per timestep (for neural models later).

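If the engineered feature file is missing, the hybrid output contains only the driver statics; it is skipped entirely only when neither engineered nor driver features are available.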

---

In [6]:
df_enriched = df.copy()

# ---- Optional driver info (static per driver_id) ----
# safety has driver_id, sensor doesn't. So merge driver_id first.
if "driver_id" in safety.columns:
    df_enriched = df_enriched.merge(
        safety[["bookingID", "driver_id"]],
        on="bookingID",
        how="left"
    )

    # derive driver-level features that are safe to include as static context
    driver2 = driver.copy()
    driver2["date_of_birth"] = pd.to_datetime(driver2["date_of_birth"], errors="coerce")

    # trip dates are not available, so an exact age cannot be computed;
    # keep birth year as a crude age proxy (alternatively omit it entirely)
    driver2["birth_year"] = driver2["date_of_birth"].dt.year

    # categorical encoding left for modelling stage; here we keep raw columns
    driver_keep = ["id", "no_of_years_driving_exp", "gender", "car_make", "car_model_year", "rating", "birth_year"]
    driver_keep = [c for c in driver_keep if c in driver2.columns]

    df_enriched = df_enriched.merge(
        driver2[driver_keep].rename(columns={"id": "driver_id"}),
        on="driver_id",
        how="left"
    )
else:
    print("No driver_id found in safety. Skipping driver enrichment.")

# ---- Optional engineered trip-level features ----
feat_path = None
if os.path.exists(FINAL_FEAT_PATH):
    feat_path = FINAL_FEAT_PATH
elif os.path.exists(COMBINED_FEAT_PATH):
    feat_path = COMBINED_FEAT_PATH

engineered_cols = []
if feat_path:
    feats = pd.read_csv(feat_path)
    if "bookingID" not in feats.columns:
        print(f"Engineered feature file found but no bookingID column: {feat_path}. Skipping.")
    else:
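        # Keep only numeric engineered features to avoid exploding the dataset with strings
        numeric_feats = feats.select_dtypes(include=[np.number]).copy()
        numeric_feats["bookingID"] = feats["bookingID"]

        # skip columns already present (e.g. a label copy) so the merge adds no _x/_y suffixes
        engineered_cols = [c for c in numeric_feats.columns
                           if c != "bookingID" and c not in df_enriched.columns]
        print(f"Loaded engineered features from: {feat_path}")
        print(f"Engineered numeric feature count: {len(engineered_cols)}")

        # validate: fail loudly if the feature file has more than one row per trip
        df_enriched = df_enriched.merge(
            numeric_feats[["bookingID"] + engineered_cols],
            on="bookingID", how="left", validate="many_to_one"
        )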
else:
    print("No engineered feature CSV found. Hybrid dataset will not include engineered features.")

Loaded engineered features from: c:\PAI-GoBest-Project\Sprint 2\Datasets\ca2_features\final_selected_features.csv
Engineered numeric feature count: 11


#

# **Output design (CSV, long format)**

We save time-series data in **long format**:
- one row = one timestep record
- grouped by `bookingID`
- ordered by `second`

### Files saved
- `timeseries_v1_base.csv`
  - columns: `bookingID`, `second`, `is_dangerous_trip`, obs_cols
- `timeseries_v1_hybrid.csv` (only if engineered or driver static features are available)
  - columns: base columns + static context (numeric driver statics + engineered numeric features)

### Why long format?
It stays compatible with:
- HMM (groupby bookingID -> sequences)
- LSTM/CNN (same grouping)
- batch inference in GUI (CSV upload)

We avoid “sequence-in-one-cell” formats because they are fragile and hard to debug.
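
To make the compatibility claim concrete, here is a hedged sketch of how a modelling notebook might turn the long format into sequences. The toy frame and the `channels` list are made up; the two layouts shown (concatenated array plus per-trip lengths, and a zero-padded 3D array) are common conventions for HMM-style and neural models respectively, not something this notebook produces.

```python
import numpy as np
import pandas as pd

# Toy long-format data: two trips with unsorted rows (synthetic values).
long_df = pd.DataFrame({
    "bookingID": [7, 7, 7, 9, 9],
    "second": [2.0, 0.0, 1.0, 0.0, 1.0],
    "is_dangerous_trip": [1, 1, 1, 0, 0],
    "speed": [2.0, 0.0, 1.0, 5.0, 6.0],
    "gyro_x": [0.2, 0.0, 0.1, 0.5, 0.6],
})
channels = ["speed", "gyro_x"]

# Order timesteps within each trip before building sequences.
long_df = long_df.sort_values(["bookingID", "second"])

sequences, labels = [], []
for booking_id, trip in long_df.groupby("bookingID", sort=True):
    sequences.append(trip[channels].to_numpy())            # shape (T_i, n_channels)
    labels.append(int(trip["is_dangerous_trip"].iloc[0]))  # one label per trip

# Layout 1: concatenated observations plus the length of each sequence.
X_concat = np.concatenate(sequences, axis=0)
lengths = [len(s) for s in sequences]

# Layout 2: zero-padded array of shape (n_trips, max_len, n_channels).
max_len = max(lengths)
X_padded = np.zeros((len(sequences), max_len, len(channels)))
for i, seq in enumerate(sequences):
    X_padded[i, : len(seq)] = seq

print(X_concat.shape, lengths, X_padded.shape)
```

Both layouts are derived from the same CSV with a single `groupby`, which is the main argument for storing the data in long format.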

---

In [7]:
VERSION = "v1"

base_cols = ["bookingID", "second", "is_dangerous_trip"] + obs_cols
base_out = df[base_cols].copy()

base_path = os.path.join(OUT_DIR, f"timeseries_{VERSION}_base.csv")
base_out.to_csv(base_path, index=False)
print("Saved:", base_path, "| shape:", base_out.shape)

# hybrid output only if we actually added anything meaningful
hybrid_path = None

# we consider hybrid "valid" if it includes engineered features OR driver features
has_engineered = len(engineered_cols) > 0
has_driver = any(c in df_enriched.columns for c in ["no_of_years_driving_exp", "rating", "car_model_year", "birth_year"])

if has_engineered or has_driver:
    # keep a controlled set of columns to prevent huge file bloat
    hybrid_keep = ["bookingID", "second", "is_dangerous_trip"] + obs_cols

    # include driver statics if present
    driver_statics = ["no_of_years_driving_exp", "car_model_year", "rating", "birth_year"]
    driver_statics = [c for c in driver_statics if c in df_enriched.columns]

    # include engineered numeric
    engineered_keep = engineered_cols[:]  # already numeric

    hybrid_keep += driver_statics + engineered_keep

    hybrid_out = df_enriched[hybrid_keep].copy()
    hybrid_path = os.path.join(OUT_DIR, f"timeseries_{VERSION}_hybrid.csv")
    hybrid_out.to_csv(hybrid_path, index=False)
    print("Saved:", hybrid_path, "| shape:", hybrid_out.shape)
else:
    print("Hybrid output skipped (no engineered/driver static features available).")

Saved: c:\PAI-GoBest-Project\Sprint 2\Datasets\time_series_data\timeseries_v1_base.csv | shape: (7236579, 12)
Saved: c:\PAI-GoBest-Project\Sprint 2\Datasets\time_series_data\timeseries_v1_hybrid.csv | shape: (7236579, 27)


#

# **Train/Test split strategy (trip-based)**

Time-series leakage happens when we split **by rows**:
- The same trip contributes timesteps to both train and test
- Models appear unrealistically strong

To prevent leakage, we split by:
- unique `bookingID` (trip)

We save:
- `timeseries_v1_trip_split.csv`

This ensures **all time-series models** (HMM, LSTM, CNN) use the same split for fair comparison.
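
A downstream notebook could apply the saved split roughly as sketched below. The two toy frames stand in for `timeseries_v1_base.csv` and `timeseries_v1_trip_split.csv` (in practice they would be loaded with `pd.read_csv`); the values are synthetic.

```python
import pandas as pd

# Toy stand-ins for the base time-series file and the trip split file.
base_toy = pd.DataFrame({
    "bookingID": [0, 0, 1, 1, 2, 2],
    "second": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "is_dangerous_trip": [0, 0, 1, 1, 0, 0],
    "speed": [0.2, 0.4, 3.1, 3.3, 1.0, 1.2],
})
split_toy = pd.DataFrame({
    "bookingID": [0, 1, 2],
    "split": ["train", "train", "test"],
})

# Attach the split to every timestep; each trip must appear once in the split file.
with_split = base_toy.merge(
    split_toy, on="bookingID", how="left", validate="many_to_one"
)
assert with_split["split"].notna().all(), "some trips are missing from the split file"

train_df = with_split[with_split["split"] == "train"].drop(columns="split")
test_df = with_split[with_split["split"] == "test"].drop(columns="split")

print("train trips:", train_df["bookingID"].nunique())
print("test trips:", test_df["bookingID"].nunique())
```

Because every model reads the same split file, differences in test scores reflect the models rather than different test sets.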

---

In [9]:
split_seed = 42
test_size = 0.2

trip_labels = base_out[["bookingID", "is_dangerous_trip"]].drop_duplicates()

train_ids, test_ids = train_test_split(
    trip_labels["bookingID"].values,
    test_size=test_size,
    random_state=split_seed,
    stratify=trip_labels["is_dangerous_trip"].values
)

split_df = pd.DataFrame({
    "bookingID": np.concatenate([train_ids, test_ids]),
    "split": (["train"] * len(train_ids)) + (["test"] * len(test_ids))
})

split_path = os.path.join(OUT_DIR, f"timeseries_{VERSION}_trip_split.csv")
split_df.to_csv(split_path, index=False)

print("Saved:", split_path)
print(split_df["split"].value_counts())

Saved: c:\PAI-GoBest-Project\Sprint 2\Datasets\time_series_data\timeseries_v1_trip_split.csv
split
train    15977
test      3995
Name: count, dtype: int64


#

# **Verification checks**

We validate:
1. Every `bookingID` has exactly one label
2. Time ordering is correct (`second` is non-decreasing within each trip; spot-checked on 5 sampled trips)
3. Split integrity: no overlap between train and test trip IDs
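
If a full check is preferred over sampling, time ordering can be verified for every trip with a vectorized per-trip difference. This is a sketch on toy data (the `ts` frame is made up), shown as an optional alternative to the sampled check used in this notebook.

```python
import pandas as pd

# Toy data: trip 1 is deliberately out of order (synthetic values).
ts = pd.DataFrame({
    "bookingID": [0, 0, 0, 1, 1, 1],
    "second": [0.0, 1.0, 2.0, 0.0, 2.0, 1.0],
})

# Difference between consecutive timesteps within each trip (NaN on a trip's first row).
step = ts.groupby("bookingID")["second"].diff()

# A negative step means time moved backwards somewhere inside that trip.
backward = step < 0
unsorted_trips = ts.loc[backward, "bookingID"].unique().tolist()

# Repeated (bookingID, second) pairs are worth counting as well.
dup_steps = int(ts.duplicated(subset=["bookingID", "second"]).sum())

print("trips with unsorted time:", unsorted_trips)
print("duplicate timesteps:", dup_steps)
```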

---

In [10]:
# 1) label uniqueness
label_check = base_out.groupby("bookingID")["is_dangerous_trip"].nunique()
bad_trips = label_check[label_check > 1]
print("Trips with >1 label:", len(bad_trips))
if len(bad_trips) > 0:
    display(bad_trips.head())

# 2) time monotonicity (sample a few trips)
sample_trips = base_out["bookingID"].drop_duplicates().sample(5, random_state=split_seed).tolist()
for bid in sample_trips:
    seconds = base_out.loc[base_out["bookingID"] == bid, "second"].values
    ok = np.all(np.diff(seconds) >= 0)
    print(f"bookingID={bid} | time sorted:", ok)

# 3) split overlap
train_set = set(train_ids)
test_set = set(test_ids)
overlap = train_set.intersection(test_set)
print("Split overlap count:", len(overlap))

Trips with >1 label: 0
bookingID=962072674469 | time sorted: True
bookingID=231928234145 | time sorted: True
bookingID=1597727834251 | time sorted: True
bookingID=755914244122 | time sorted: True
bookingID=1391569403968 | time sorted: True
Split overlap count: 0


#

# **Summary of outputs**

Created time-series datasets stored in `Datasets/time_series_data/`:

- `timeseries_v1_base.csv`  
  Long-format sensor time-series data, labelled per trip. Suitable for HMM baseline and all time-series models.

- `timeseries_v1_hybrid.csv` (optional)  
  Base columns plus static context repeated per timestep: numeric driver attributes and, if the feature file exists, engineered trip-level numeric features.
  This is mainly intended for neural sequence models (LSTM/CNN) rather than HMM.

- `timeseries_v1_trip_split.csv`  
  Trip-based train/test split (by bookingID) to prevent data leakage and ensure fair comparison across time-series models.

### Key design choices and justification
- Long-format CSV is readable, consistent with the rest of the project, and compatible with batch + real-time processing later.
- Train/test split is performed at trip-level (bookingID) to avoid leakage.
- Engineered trip-level features are included only as optional static context because they are not true sequential observations.
