### Case - [Pierre-Antoine Denarié]

### TL;DR
Objective: predict how long a bus takes to travel from its first stop to its final stop, for any given point in the future.

Points to work out:
- Correlation between check-in data and scheduled times (trip length/delta) in the prediction --> create a delta time feature
- Join check-in data with scheduling data --> PCA
- Correlation between stop area code and the joined number of check-ins during that journey

Plots:
- Plot check-ins over time

Features to create:
- Split check-in data
- Delta time feature


### Housekeeping & Imports

In [1]:
# Clear all variables from the workspace
%reset -f

In [2]:
### Package Imports
# System
from __future__ import annotations
import os
from pathlib import Path
import re
import logging
from typing import Dict
from functools import partial

# Data
import numpy as np
import pandas as pd

# Own packages
from src.myfunctions import *

# Import visualisation libraries
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport

# Pretty display for notebooks
%matplotlib inline

# Setup Logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


INFO: Pandas backend loaded 2.3.1
INFO: Numpy backend loaded 2.1.3
INFO: Pyspark backend NOT loaded
INFO: Python backend loaded


In [3]:
# Parameters
DATA_DIR: str = "Data"
DATA_RAW_DIR: str = DATA_DIR + "/Raw"
PROCESSED_DIR: str = DATA_DIR + "/Processed"

REPORTS: str = "Reports"
PROFILING: str = REPORTS + "/Profiling"

### Data Loading

In [4]:
# Load all raw CSV files from the raw data folder (one entry per file)
dfs = load_csv_folder(DATA_RAW_DIR)
print(dfs.keys())

INFO: Loaded bus_trips from bus_trips.csv | shape=(591766, 11)
INFO: Loaded check_ins from check_ins.csv | shape=(41923, 2)
INFO: Loaded stops from stops.csv | shape=(5636, 3)


dict_keys(['bus_trips', 'check_ins', 'stops'])


### Data Exploration

In [5]:
# Generate profiling reports
df_profiles = generate_profiling_reports(dfs, title="Data Profiling Report")

INFO: Generated profiling report for bus_trips
INFO: Generated profiling report for check_ins
INFO: Generated profiling report for stops


In [6]:
# View profiling reports
display_profiling_reports_web(profiles=df_profiles)

INFO: Opened profiling report for bus_trips
INFO: Opened profiling report for check_ins
INFO: Opened profiling report for stops


In [26]:
# Display time series data
pipeline = {
    "check_ins": [
        Step(
            "add_feature",
            when=has_cols("id"),
            fn=partial(add_feature, src="id", dest="id_Datetime", function=parse_time_column),
        )
    ]
}

dfs_processed = run_pipeline(dfs, pipeline)

display_profiling_reports_web(
    generate_profiling_reports(
        dfs_processed,
        title="Data Profiling Processed Report",
        timeseries=True,
        sortby="id_Datetime",
    ),
    refresh_results=False,
)


INFO: ▶ bus_trips: 0 steps
INFO: ▶ check_ins: 1 steps
INFO:   - add_feature: (41923, 2) -> (41923, 3) (165 ms)
INFO: ▶ stops: 0 steps


TypeError: generate_profiling_reports() got an unexpected keyword argument 'sortby'
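
The call fails because `generate_profiling_reports` does not accept a `sortby` argument. One possible workaround, sketched below, is to sort each frame by the datetime column before handing it to the helper. `sort_frames` is a hypothetical helper that is not part of `src.myfunctions`, and the `stops` column name is made up for illustration; the check-in values mirror the first three rows of `check_ins`.

```python
import pandas as pd


def sort_frames(frames: dict[str, pd.DataFrame], column: str) -> dict[str, pd.DataFrame]:
    """Sort every frame that contains `column`; return the others unchanged."""
    return {
        name: df.sort_values(column).reset_index(drop=True) if column in df.columns else df
        for name, df in frames.items()
    }


# Small stand-in for dfs_processed, deliberately out of order.
frames = {
    "check_ins": pd.DataFrame({
        "id_Datetime": pd.to_datetime(
            ["2020-01-01 02:00", "2020-01-01 00:00", "2020-01-01 01:00"]
        ),
        "number_of_check_ins": [7.7, 3.1, 6.6],
    }),
    "stops": pd.DataFrame({"stop": ["a", "b"]}),  # hypothetical column name
}

sorted_frames = sort_frames(frames, "id_Datetime")
print(sorted_frames["check_ins"])
```

The sorted dict could then be passed to `generate_profiling_reports` without the `sortby` keyword. Whether the helper's `timeseries` flag is enough on its own depends on how it wraps `ProfileReport`, which is not shown in this notebook.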

In [27]:
# Closer inspection of duplicates: select all rows identical to the first bus_trips row
dfs['bus_trips'][(dfs['bus_trips'] == dfs['bus_trips'].iloc[0]).all(axis=1)]

Unnamed: 0,lineplanningnumber,journeynumber,vehiclenumber,userstopcode_start,userstopcode_end,messagetype_begin,messagetype_end,operatingday,departure_time,realized_time,planned_time
0,1,1001,5106,54261502,53490410,DEPARTURE,ARRIVAL,2023-12-12,2023-12-12 06:16:39,00:27:33,00:30:55
70919,1,1001,5106,54261502,53490410,DEPARTURE,ARRIVAL,2023-12-12,2023-12-12 06:16:39,00:27:33,00:30:55
141838,1,1001,5106,54261502,53490410,DEPARTURE,ARRIVAL,2023-12-12,2023-12-12 06:16:39,00:27:33,00:30:55
212757,1,1001,5106,54261502,53490410,DEPARTURE,ARRIVAL,2023-12-12,2023-12-12 06:16:39,00:27:33,00:30:55


In [44]:
dfs['check_ins'].pipe(add_feature, src='id', dest='id_Parsed', function=parse_time_column, overwrite=True)

Unnamed: 0,id,number_of_check_ins,id_Datetime,id_Parsed
0,2020_1_1_0,3.1,2020-01-01 00:00:00,2020-01-01 00:00:00
1,2020_1_1_1,6.6,2020-01-01 01:00:00,2020-01-01 01:00:00
2,2020_1_1_2,7.7,2020-01-01 02:00:00,2020-01-01 02:00:00
3,2020_1_1_3,6.7,2020-01-01 03:00:00,2020-01-01 03:00:00
4,2020_1_1_4,6.6,2020-01-01 04:00:00,2020-01-01 04:00:00
...,...,...,...,...
41918,2024_10_12_19,130.1,2024-10-12 19:00:00,2024-10-12 19:00:00
41919,2024_10_12_20,104.1,2024-10-12 20:00:00,2024-10-12 20:00:00
41920,2024_10_12_21,87.5,2024-10-12 21:00:00,2024-10-12 21:00:00
41921,2024_10_12_22,95.4,2024-10-12 22:00:00,2024-10-12 22:00:00
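
The output shows that each `id` encodes `year_month_day_hour`, for example `2024_10_12_19`. `parse_time_column` lives in `src.myfunctions` and its implementation is not shown here, so the function below is only a minimal stand-in for a single value, with the hypothetical name `parse_check_in_id`.

```python
from datetime import datetime


def parse_check_in_id(raw_id: str) -> datetime:
    """Parse an id such as '2020_1_1_0' (year_month_day_hour) into a datetime."""
    year, month, day, hour = (int(part) for part in raw_id.split("_"))
    return datetime(year, month, day, hour)


parsed = parse_check_in_id("2024_10_12_19")
print(parsed)
```

Splitting on the underscore avoids relying on zero-padded fields, which the ids in this dataset do not use.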


### Data Cleaning
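
The inspection of `bus_trips` showed the first row repeated four times, so dropping exact duplicates is a likely first cleaning step. The sketch below uses a small illustrative frame with a subset of the `bus_trips` columns; the third row is hypothetical, and whether all duplicates should really be dropped is an assumption to verify against the data source.

```python
import pandas as pd

# First two rows repeat values from the duplicated bus_trips row; the third is hypothetical.
trips = pd.DataFrame({
    "journeynumber": [1001, 1001, 1002],
    "vehiclenumber": [5106, 5106, 5106],
    "departure_time": [
        "2023-12-12 06:16:39",
        "2023-12-12 06:16:39",
        "2023-12-12 07:16:39",
    ],
})

# Count rows that repeat an earlier row, then keep only the first occurrence.
n_duplicates = int(trips.duplicated().sum())
deduplicated = trips.drop_duplicates().reset_index(drop=True)

print(f"Dropped {n_duplicates} duplicate row(s); {len(deduplicated)} remain")
```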

### Feature Engineering
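
The TL;DR lists a delta time feature. A minimal sketch, assuming the delta is defined as realized minus planned travel time, in seconds, using the `realized_time` and `planned_time` columns of `bus_trips`. The function name `add_delta_time` and the column name `delta_seconds` are hypothetical; the first row mirrors the sample shown earlier and the second row is made up.

```python
import pandas as pd

trips = pd.DataFrame({
    "realized_time": ["00:27:33", "00:31:10"],  # second value is hypothetical
    "planned_time": ["00:30:55", "00:30:55"],
})


def add_delta_time(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with delta_seconds = realized - planned (assumed definition)."""
    out = df.copy()
    realized = pd.to_timedelta(out["realized_time"])
    planned = pd.to_timedelta(out["planned_time"])
    out["delta_seconds"] = (realized - planned).dt.total_seconds()
    return out


trips_with_delta = add_delta_time(trips)
print(trips_with_delta)
```

Under this definition a negative value means the bus was faster than planned: the first row gives -202 seconds.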

### Data Staging