RETRIEVING THE GEOGRAPHIC DATA

In [64]:
from pathlib import Path
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os
import googlemaps
import time
from tqdm import tqdm
import plotly.express as px
from plotly.subplots import make_subplots
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

In [65]:
load_dotenv()
api_key = os.getenv('GOOGLE_API_KEY')

In [66]:
gmaps = googlemaps.Client(key=api_key)

In [67]:
INTERIM_PATH = Path("data/interim")
df = pd.read_parquet(INTERIM_PATH / "train_data_cleaned.parquet")

In [68]:
print(df.shape)

(28064057, 15)


In [69]:
unique_stops = df['stop_name'].unique()
print(f"Number of unique stops: {len(unique_stops)}")

Number of unique stops: 2364


In [6]:
def get_coordinates_from_google(stop_name):
    try:
        geocode_result = gmaps.geocode(stop_name, region="it")
        if geocode_result:
            lat = geocode_result[0]['geometry']['location']['lat']
            lon = geocode_result[0]['geometry']['location']['lng']
            return lat, lon
        return None, None
    except Exception as e:
        print(f"Error retrieving for {stop_name}: {e}")
        return None, None

In [7]:
coordinates_list_google = []

for stop in tqdm(unique_stops, desc="Retrieving coordinates", unit="stop"):
    lat, lon = get_coordinates_from_google(stop)
    coordinates_list_google.append({'stop_name': stop, 'latitude': lat, 'longitude': lon})
    
    # Pause between requests to avoid exceeding the request limit
    # time.sleep(1)

Retrieving coordinates: 100%|██████████| 2364/2364 [03:43<00:00, 10.56stop/s]


In [8]:
coordinates_df_google = pd.DataFrame(coordinates_list_google)
print(coordinates_df_google.head())

               stop_name   latitude  longitude
0           BOLOGNA C.LE  44.494887  11.342616
1    S.LAZZARO DI SAVENA  44.468974  11.421816
2     OZZANO DELL'EMILIA  44.446347  11.472402
3  CASTEL S.PIETRO TERME  44.399624  11.589728
4                  IMOLA  44.351305  11.712926


Create a map of the train stop distribution using density_map

In [10]:
fig_stops = px.density_map(
    coordinates_df_google,
    lat='latitude',
    lon='longitude',
    hover_name="stop_name",
    title="Train Stops Distribution",
    radius=10,
    opacity=0.6,
    zoom=6,
    map_style="carto-positron")
fig_stops.update_layout(height=900)
fig_stops.update_layout(width=1200)

fig_stops.show()

There are quite a few errors...

Let's try Nominatim to see whether it is more accurate

In [12]:
geolocator = Nominatim(user_agent="train_stops_locator")

def get_coordinates_from_nominatim(stop_name):
    try:
        location = geolocator.geocode(stop_name, country_codes="it", timeout=10)
        if location:
            return location.latitude, location.longitude
        return None, None
    except GeocoderTimedOut:
        print(f"Timeout for {stop_name}")
        return None, None
    except Exception as e:
        print(f"Error retrieving for {stop_name}: {e}")
        return None, None


In [13]:
coordinates_list_nominatim = []

for stop in tqdm(unique_stops, desc="Retrieving coordinates", unit="stop"):
    lat, lon = get_coordinates_from_nominatim(stop)
    coordinates_list_nominatim.append({'stop_name': stop, 'latitude': lat, 'longitude': lon})
    
    # To avoid overloading Nominatim (respect its usage policy)
    # time.sleep(1)


Retrieving coordinates: 100%|██████████| 2364/2364 [39:24<00:00,  1.00s/stop]
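Nominatim's public usage policy caps traffic at one request per second, and the `time.sleep(1)` call in the loop above is commented out. A minimal sketch of a throttling wrapper follows. `throttle` and its parameters are hypothetical names, not part of geopy; geopy itself ships `geopy.extra.rate_limiter.RateLimiter` for the same purpose.

```python
import time


def throttle(func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so that consecutive calls start at least min_delay_seconds apart.

    clock and sleep are injectable so the wrapper can be tested without waiting.
    """
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            wait = min_delay_seconds - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        return func(*args, **kwargs)

    return wrapper


# Hypothetical usage with the geolocator defined earlier:
# geocode = throttle(geolocator.geocode, min_delay_seconds=1.0)
# location = geocode("IMOLA", country_codes="it", timeout=10)
```

Wrapping the geocoder once keeps the rate limit in one place instead of relying on a `sleep` inside every loop.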


In [14]:
coordinates_df_nominatim = pd.DataFrame(coordinates_list_nominatim)     # Nominatim
print(coordinates_df_nominatim.head())

               stop_name   latitude  longitude
0           BOLOGNA C.LE  44.505878  11.343343
1    S.LAZZARO DI SAVENA  44.471567  11.404859
2     OZZANO DELL'EMILIA  44.444980  11.476050
3  CASTEL S.PIETRO TERME  44.401270  11.585499
4                  IMOLA  44.353515  11.714123
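To quantify how far the two geocoders disagree, rather than judging from the maps alone, the great-circle distance between the Google and Nominatim results can be computed per stop. The sketch below hard-codes the five rows printed by the two `head()` calls; `haversine_km` and the `_google`/`_nominatim` suffixes are illustrative names. A large distance only signals disagreement; it does not say which geocoder is wrong.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


names = ["BOLOGNA C.LE", "S.LAZZARO DI SAVENA", "OZZANO DELL'EMILIA",
         "CASTEL S.PIETRO TERME", "IMOLA"]
google = pd.DataFrame({
    "stop_name": names,
    "latitude": [44.494887, 44.468974, 44.446347, 44.399624, 44.351305],
    "longitude": [11.342616, 11.421816, 11.472402, 11.589728, 11.712926],
})
nominatim = pd.DataFrame({
    "stop_name": names,
    "latitude": [44.505878, 44.471567, 44.444980, 44.401270, 44.353515],
    "longitude": [11.343343, 11.404859, 11.476050, 11.585499, 11.714123],
})

comparison = google.merge(nominatim, on="stop_name", suffixes=("_google", "_nominatim"))
comparison["distance_km"] = haversine_km(
    comparison["latitude_google"], comparison["longitude_google"],
    comparison["latitude_nominatim"], comparison["longitude_nominatim"],
)
# Largest disagreements first: these are the stops worth checking by hand
print(comparison[["stop_name", "distance_km"]].sort_values("distance_km", ascending=False))
```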


In [None]:
INTERIM_PATH = Path("data/interim")
INTERIM_PATH.mkdir(parents=True, exist_ok=True)

coordinates_df_nominatim.to_parquet(INTERIM_PATH / "coordinates_df_nominatim.parquet", index=False)

print("Coordinates datasets successfully saved in 'data/interim'")


In [15]:
fig_stops = px.density_map(
    coordinates_df_nominatim,
    lat='latitude',
    lon='longitude',
    hover_name="stop_name",
    title="Train Stops Distribution",
    radius=10,
    opacity=0.6,
    zoom=6,
    map_style="carto-positron")
fig_stops.update_layout(height=900)
fig_stops.update_layout(width=1200)

fig_stops.show()

Much more accurate. We merge the coordinates dataset with the original dataset and save it.

In [71]:
df_with_coordinates = pd.merge(df, coordinates_df_nominatim, on='stop_name', how='left')
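Because this is a left merge, stops that Nominatim could not geocode keep their rows but get `NaN` coordinates (the `describe()` output further down shows fewer non-null `latitude` values than rows). A sketch of how to list them, on sample data where `NOWHERE HALT` is a hypothetical stop name with no geocoding result:

```python
import pandas as pd

trips = pd.DataFrame({
    "stop_name": ["BOLOGNA C.LE", "IMOLA", "NOWHERE HALT", "NOWHERE HALT"],
})
coords = pd.DataFrame({
    "stop_name": ["BOLOGNA C.LE", "IMOLA", "NOWHERE HALT"],
    "latitude": [44.505878, 44.353515, None],
    "longitude": [11.343343, 11.714123, None],
})

# validate="many_to_one" raises if a stop appears twice in coords,
# which would silently duplicate rows in the merged frame
merged = trips.merge(coords, on="stop_name", how="left", validate="many_to_one")

# Rows per stop that ended up without coordinates
missing = merged.loc[merged["latitude"].isna(), "stop_name"].value_counts()
print(missing)
```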

In [72]:
df_with_coordinates.head()

Unnamed: 0,train_id,train_number,departure_station,train_departure_delay,arrival_station,train_arrival_delay,train_class,scheduled_departure_time,scheduled_arrival_time,stop_name,stop_arrival_delay,stop_departure_delay,stop_arrival_time,stop_departure_time,is_terminal_stop,latitude,longitude
0,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,BOLOGNA C.LE,0.0,1.0,NaT,2024-09-30 22:38:00,True,44.505878,11.343343
1,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,S.LAZZARO DI SAVENA,2.0,2.0,2024-09-30 22:45:00,2024-09-30 22:46:00,False,44.471567,11.404859
2,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,OZZANO DELL'EMILIA,3.0,2.0,2024-09-30 22:51:00,2024-09-30 22:52:00,False,44.44498,11.47605
3,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,CASTEL S.PIETRO TERME,3.0,4.0,2024-09-30 22:58:00,2024-09-30 22:59:00,False,44.40127,11.585499
4,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,IMOLA,3.0,4.0,2024-09-30 23:07:00,2024-09-30 23:08:00,False,44.353515,11.714123


In [73]:
print(df_with_coordinates.columns)

Index(['train_id', 'train_number', 'departure_station',
       'train_departure_delay', 'arrival_station', 'train_arrival_delay',
       'train_class', 'scheduled_departure_time', 'scheduled_arrival_time',
       'stop_name', 'stop_arrival_delay', 'stop_departure_delay',
       'stop_arrival_time', 'stop_departure_time', 'is_terminal_stop',
       'latitude', 'longitude'],
      dtype='object')


In [74]:
df_with_coordinates.to_parquet(INTERIM_PATH / "train_data_with_coordinates.parquet", index=False)

print("Datasets successfully saved in 'data/interim'")

Datasets successfully saved in 'data/interim'


Let's move on to the delay map

In [75]:
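# Mean arrival delay per stop (stops without coordinates are dropped, since groupby skips NaN keys)
stop_delays = df_with_coordinates.groupby(["stop_name", "latitude", "longitude"])["stop_arrival_delay"].mean().reset_index()

fig_delays = px.density_map(
    stop_delays,
    lat="latitude",
    lon="longitude",
    z="stop_arrival_delay",  # weight each stop by its mean delay; without z the map only shows stop density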
    hover_name="stop_name", 
    title="Average Train Delay Distribution by Stop", 
    radius=10, 
    opacity=0.6, 
    zoom=6,
    map_style="carto-positron",
)
fig_delays.update_layout(height=900)
fig_delays.update_layout(width=1200)

fig_delays.show()

**Time-Based Features**

In [76]:
df_with_coordinates = pd.read_parquet(INTERIM_PATH / "train_data_with_coordinates.parquet")

In [77]:
df_with_coordinates["hour"] = df_with_coordinates["scheduled_departure_time"].dt.hour
df_with_coordinates["day_of_week"] = df_with_coordinates["scheduled_departure_time"].dt.dayofweek  # Monday=0, Sunday=6
df_with_coordinates["is_weekend"] = df_with_coordinates["day_of_week"].isin([5, 6]).astype(int)
df_with_coordinates["month"] = df_with_coordinates["scheduled_departure_time"].dt.month

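# Define rush hours: the 07:00-09:59 and 17:00-19:59 windows
# (hour comes from the train's scheduled departure, not from the time at each stop)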
df_with_coordinates["is_rush_hour"] = df_with_coordinates["hour"].isin([7, 8, 9, 17, 18, 19]).astype(int)

In [78]:
df_with_coordinates.head()

Unnamed: 0,train_id,train_number,departure_station,train_departure_delay,arrival_station,train_arrival_delay,train_class,scheduled_departure_time,scheduled_arrival_time,stop_name,...,stop_arrival_time,stop_departure_time,is_terminal_stop,latitude,longitude,hour,day_of_week,is_weekend,month,is_rush_hour
0,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,BOLOGNA C.LE,...,NaT,2024-09-30 22:38:00,True,44.505878,11.343343,22,0,0,9,0
1,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,S.LAZZARO DI SAVENA,...,2024-09-30 22:45:00,2024-09-30 22:46:00,False,44.471567,11.404859,22,0,0,9,0
2,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,OZZANO DELL'EMILIA,...,2024-09-30 22:51:00,2024-09-30 22:52:00,False,44.44498,11.47605,22,0,0,9,0
3,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,CASTEL S.PIETRO TERME,...,2024-09-30 22:58:00,2024-09-30 22:59:00,False,44.40127,11.585499,22,0,0,9,0
4,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,IMOLA,...,2024-09-30 23:07:00,2024-09-30 23:08:00,False,44.353515,11.714123,22,0,0,9,0
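As a quick check of the conventions used above (`dayofweek` counts from Monday = 0), the first row's `scheduled_departure_time` can be run through the same expressions in isolation. The values match the `head()` output: 30 September 2024 was a Monday, and 22:38 falls outside the rush-hour set.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2024-09-30 22:38:00"]))

features = pd.DataFrame({
    "hour": ts.dt.hour,
    "day_of_week": ts.dt.dayofweek,                      # Monday=0, Sunday=6
    "is_weekend": ts.dt.dayofweek.isin([5, 6]).astype(int),
    "month": ts.dt.month,
    "is_rush_hour": ts.dt.hour.isin([7, 8, 9, 17, 18, 19]).astype(int),
})
print(features)
```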


**Station-Specific Features**

In [79]:
# Count how often each station appears (proxy for congestion)
station_counts = df_with_coordinates["stop_name"].value_counts()
df_with_coordinates["station_traffic"] = df_with_coordinates["stop_name"].map(station_counts)

# Define high-traffic stations (count at or above the median taken over all rows, not over stations)
median_traffic = df_with_coordinates["station_traffic"].median()
df_with_coordinates["is_high_traffic_station"] = (df_with_coordinates["station_traffic"] >= median_traffic).astype(int)

In [80]:
df_with_coordinates.head()

Unnamed: 0,train_id,train_number,departure_station,train_departure_delay,arrival_station,train_arrival_delay,train_class,scheduled_departure_time,scheduled_arrival_time,stop_name,...,is_terminal_stop,latitude,longitude,hour,day_of_week,is_weekend,month,is_rush_hour,station_traffic,is_high_traffic_station
0,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,BOLOGNA C.LE,...,True,44.505878,11.343343,22,0,0,9,0,161611,1
1,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,S.LAZZARO DI SAVENA,...,False,44.471567,11.404859,22,0,0,9,0,14415,0
2,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,OZZANO DELL'EMILIA,...,False,44.44498,11.47605,22,0,0,9,0,14957,0
3,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,CASTEL S.PIETRO TERME,...,False,44.40127,11.585499,22,0,0,9,0,27446,1
4,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,IMOLA,...,False,44.353515,11.714123,22,0,0,9,0,42632,1
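Note that `median_traffic` above is taken over rows, so a busy station counts once per recorded stop; by construction about half of the rows end up flagged. A per-station median would split the stations in half instead. The toy data below (hypothetical stop names `A`, `B`, `C`) shows the two thresholds diverging:

```python
import pandas as pd

stops = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1, name="stop_name")
station_counts = stops.value_counts()                    # A: 6, B: 3, C: 1

# Median over the 10 rows: busy stations dominate
row_level_median = stops.map(station_counts).median()
# Median over the 3 stations: each station counts once
station_level_median = station_counts.median()

print(row_level_median, station_level_median)
```

Which of the two is appropriate depends on whether the flag should describe rows or stations.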


**Delay Propagation Features**

In [81]:
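# Arrival delay minus departure delay at the same stop
# (negative: the delay grew while the train was stopped; positive: the train recovered time)
df_with_coordinates["delay_change"] = df_with_coordinates["stop_arrival_delay"] - df_with_coordinates["stop_departure_delay"]

# Flag stops where delay_change > 0. With the sign convention above this marks stops where the
# train recovered time, which is the opposite of what the name is_delay_increasing suggests.
df_with_coordinates["is_delay_increasing"] = (df_with_coordinates["delay_change"] > 0).astype(int)

# Rolling mean over the three previous stops (captures delay trends within a train's route).
# Relies on each train's rows being contiguous and in stop order.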
df_with_coordinates["rolling_arrival_delay"] = df_with_coordinates.groupby("train_id")["stop_arrival_delay"].shift(1).rolling(3).mean()
df_with_coordinates["rolling_departure_delay"] = df_with_coordinates.groupby("train_id")["stop_departure_delay"].shift(1).rolling(3).mean()

In [82]:
df_with_coordinates.head()

Unnamed: 0,train_id,train_number,departure_station,train_departure_delay,arrival_station,train_arrival_delay,train_class,scheduled_departure_time,scheduled_arrival_time,stop_name,...,day_of_week,is_weekend,month,is_rush_hour,station_traffic,is_high_traffic_station,delay_change,is_delay_increasing,rolling_arrival_delay,rolling_departure_delay
0,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,BOLOGNA C.LE,...,0,0,9,0,161611,1,-1.0,0,,
1,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,S.LAZZARO DI SAVENA,...,0,0,9,0,14415,0,0.0,0,,
2,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,OZZANO DELL'EMILIA,...,0,0,9,0,14957,0,1.0,1,,
3,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,CASTEL S.PIETRO TERME,...,0,0,9,0,27446,1,-1.0,0,1.666667,1.666667
4,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,IMOLA,...,0,0,9,0,42632,1,-1.0,0,2.666667,2.666667
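The chained `.shift(1).rolling(3)` above applies the rolling window to the whole column, not per group. It stays within one train only because each train's rows are contiguous and the leading `NaN` from `shift(1)` blanks the first three windows of every train. A sketch of a variant that does not depend on trains being contiguous, checked against the five stops shown above (arrival delays 0, 2, 3, 3, 3); `T1` and `T2` are hypothetical train ids:

```python
import pandas as pd

sample = pd.DataFrame({
    "train_id": ["T1"] * 5 + ["T2"] * 2,
    "stop_arrival_delay": [0.0, 2.0, 3.0, 3.0, 3.0, 10.0, 12.0],
})

# Mean of the three previous stops of the same train (NaN until three exist)
sample["rolling_arrival_delay"] = (
    sample.groupby("train_id")["stop_arrival_delay"]
    .transform(lambda s: s.shift(1).rolling(3).mean())
)
print(sample)
```

A Python lambda per group is slow on tens of millions of rows, so this form is mainly useful as a correctness check on a sample.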


**Historical Delay Trends**

In [83]:
df_with_coordinates["historical_avg_delay"] = df_with_coordinates.groupby(["stop_name", "hour"])["stop_arrival_delay"].transform("mean")

In [84]:
df_with_coordinates.head()

Unnamed: 0,train_id,train_number,departure_station,train_departure_delay,arrival_station,train_arrival_delay,train_class,scheduled_departure_time,scheduled_arrival_time,stop_name,...,is_weekend,month,is_rush_hour,station_traffic,is_high_traffic_station,delay_change,is_delay_increasing,rolling_arrival_delay,rolling_departure_delay,historical_avg_delay
0,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,BOLOGNA C.LE,...,0,9,0,161611,1,-1.0,0,,,0.297143
1,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,S.LAZZARO DI SAVENA,...,0,9,0,14415,0,0.0,0,,,11.009852
2,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,OZZANO DELL'EMILIA,...,0,9,0,14957,0,1.0,1,,,11.737624
3,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,CASTEL S.PIETRO TERME,...,0,9,0,27446,1,-1.0,0,1.666667,1.666667,10.546798
4,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,IMOLA,...,0,9,0,42632,1,-1.0,0,2.666667,2.666667,9.186275
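`transform("mean")` averages over every row in the group, including the row itself and later dates, so `historical_avg_delay` as computed above partly encodes the value being predicted. If that leakage matters for the model, one option is an expanding mean over earlier observations only. A sketch, assuming the frame is sorted chronologically; the delay values are made up:

```python
import pandas as pd

sample = pd.DataFrame({
    "stop_name": ["IMOLA"] * 4,
    "hour": [22] * 4,
    "stop_arrival_delay": [2.0, 4.0, 6.0, 0.0],
})

# Mean of strictly earlier rows in the same (stop, hour) group;
# the first observation of each group has no history and stays NaN
sample["historical_avg_delay"] = (
    sample.groupby(["stop_name", "hour"])["stop_arrival_delay"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(sample)
```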


**Length of stay by individual station**

In [None]:
# Planned stop duration
df_with_coordinates["planned_stop_duration"] = (df_with_coordinates["stop_departure_time"] - df_with_coordinates["stop_arrival_time"]).dt.total_seconds() / 60

# Convert delays (minutes) to timedelta
df_with_coordinates["stop_departure_delay_timedelta"] = pd.to_timedelta(df_with_coordinates["stop_departure_delay"], unit="m")
df_with_coordinates["stop_arrival_delay_timedelta"] = pd.to_timedelta(df_with_coordinates["stop_arrival_delay"], unit="m")

df_with_coordinates["actual_stop_duration"] = (
    (df_with_coordinates["stop_departure_time"] + df_with_coordinates["stop_departure_delay_timedelta"]) - 
    (df_with_coordinates["stop_arrival_time"] + df_with_coordinates["stop_arrival_delay_timedelta"])
).dt.total_seconds() / 60

df_with_coordinates["planned_vs_actual_stop_duration_ratio"] = df_with_coordinates["actual_stop_duration"] / df_with_coordinates["planned_stop_duration"]

# Handle division by zero or NaN values (avoid infinities)
df_with_coordinates["planned_vs_actual_stop_duration_ratio"] = df_with_coordinates["planned_vs_actual_stop_duration_ratio"].replace([np.inf, -np.inf], np.nan)
df_with_coordinates["planned_vs_actual_stop_duration_ratio"] = df_with_coordinates["planned_vs_actual_stop_duration_ratio"].fillna(1)  # Default to 1 when missing data

# Drop temporary columns
df_with_coordinates = df_with_coordinates.drop(columns=["stop_departure_delay_timedelta", "stop_arrival_delay_timedelta"])

In [None]:
df_with_coordinates.head()

Unnamed: 0,train_id,train_number,departure_station,train_departure_delay,arrival_station,train_arrival_delay,train_class,scheduled_departure_time,scheduled_arrival_time,stop_name,...,station_traffic,is_high_traffic_station,delay_change,is_delay_increasing,rolling_arrival_delay,rolling_departure_delay,historical_avg_delay,planned_stop_duration,actual_stop_duration,planned_vs_actual_stop_duration_ratio
0,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,BOLOGNA C.LE,...,161611,1,-1.0,0,,,0.297143,,,1.0
1,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,S.LAZZARO DI SAVENA,...,14415,0,0.0,0,,,11.009852,1.0,1.0,1.0
2,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,OZZANO DELL'EMILIA,...,14957,0,1.0,1,,,11.737624,1.0,0.0,0.0
3,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,CASTEL S.PIETRO TERME,...,27446,1,-1.0,0,1.666667,1.666667,10.546798,1.0,2.0,2.0
4,17431-1727735880-Qk9MT0dOQSBDLkxF,17431,BOLOGNA C.LE,1.0,FAENZA,-1.0,REG,2024-09-30 22:38:00,2024-09-30 23:25:00,IMOLA,...,42632,1,-1.0,0,2.666667,2.666667,9.186275,1.0,2.0,2.0
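The actual duration simplifies algebraically: (departure + departure delay) - (arrival + arrival delay) = planned duration + departure delay - arrival delay. Working in minutes throughout avoids the temporary timedelta columns. A sketch on the OZZANO DELL'EMILIA and IMOLA rows shown above:

```python
import pandas as pd

stops = pd.DataFrame({
    "stop_name": ["OZZANO DELL'EMILIA", "IMOLA"],
    "stop_arrival_time": pd.to_datetime(["2024-09-30 22:51:00", "2024-09-30 23:07:00"]),
    "stop_departure_time": pd.to_datetime(["2024-09-30 22:52:00", "2024-09-30 23:08:00"]),
    "stop_arrival_delay": [3.0, 3.0],      # minutes
    "stop_departure_delay": [2.0, 4.0],    # minutes
})

# Scheduled dwell time in minutes
planned = (stops["stop_departure_time"] - stops["stop_arrival_time"]).dt.total_seconds() / 60
# Actual dwell time: scheduled dwell plus the delay gained or lost at the stop
actual = planned + stops["stop_departure_delay"] - stops["stop_arrival_delay"]

print(pd.DataFrame({"planned": planned, "actual": actual, "ratio": actual / planned}))
```

The results (0 and 2 minutes) agree with the `actual_stop_duration` column in the output above.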


In [89]:
df_with_coordinates.describe()

Unnamed: 0,train_departure_delay,train_arrival_delay,scheduled_departure_time,scheduled_arrival_time,stop_arrival_delay,stop_departure_delay,stop_arrival_time,stop_departure_time,latitude,longitude,...,station_traffic,is_high_traffic_station,delay_change,is_delay_increasing,rolling_arrival_delay,rolling_departure_delay,historical_avg_delay,planned_stop_duration,actual_stop_duration,planned_vs_actual_stop_duration_ratio
count,28064060.0,28064060.0,28064057,28064057,28064060.0,28064060.0,25298337,25289139,27103840.0,27103840.0,...,28064060.0,28064060.0,28064060.0,28064060.0,19737160.0,19737160.0,28064060.0,22523430.0,22523430.0,28064060.0
mean,3695575.0,2.397261,2024-06-29 18:43:58.267582720,2024-06-29 20:33:41.845865216,2.905763,3.994929,2024-06-29 19:39:49.019669504,2024-06-29 19:32:54.912667904,43.69523,11.23444,...,41263.65,0.5002298,-1.089165,0.1271088,3.261271,4.613413,2.905763,1.399346,2.703894,1.962393
min,-40.0,-75.0,2023-12-31 23:02:00,2023-12-31 23:28:00,-10.0,-10.0,2023-12-31 23:13:00,2023-12-31 23:02:00,36.7204,6.703205,...,1.0,0.0,-284.0,0.0,-10.0,-10.0,-10.0,-437799.0,-437799.0,-371.0
25%,1.0,-2.0,2024-03-29 05:50:00,2024-03-29 07:35:00,0.0,1.0,2024-03-29 06:21:00,2024-03-29 06:15:00,41.92489,9.187344,...,11811.0,0.0,-2.0,0.0,0.0,1.333333,1.636632,1.0,1.0,1.0
50%,1.0,1.0,2024-06-27 04:56:00,2024-06-27 06:43:00,1.0,2.0,2024-06-27 05:29:30,2024-06-27 05:23:00,44.48262,11.1498,...,23942.0,1.0,-1.0,0.0,1.666667,2.666667,2.704198,1.0,2.0,2.0
75%,3.0,3.0,2024-10-01 05:05:00,2024-10-01 06:54:00,4.0,5.0,2024-10-01 06:51:00,2024-10-01 06:41:00,45.48588,12.62163,...,51902.0,1.0,0.0,0.0,4.0,5.333333,3.954041,1.0,3.0,3.0
max,8541205000000.0,602.0,2024-12-31 22:55:00,2025-01-01 13:40:00,300.0,300.0,2025-01-01 13:40:00,2025-01-01 13:34:00,47.00374,18.36933,...,257156.0,1.0,300.0,1.0,296.6667,297.0,295.0,482408.0,482409.0,351.0
std,5585687000.0,9.577877,,,7.471119,7.189786,,,2.247084,2.363644,...,44244.75,0.5,3.871933,0.3330948,7.167194,7.342167,1.950032,283.1935,283.2021,1.830402
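The summary above exposes values that cannot be real delays or dwell times: `train_departure_delay` peaks at about 8.5e12 minutes and `planned_stop_duration` ranges from -437799 to 482408 minutes. A sketch of counting such rows before deciding how to treat them; the bounds in `plausible_ranges` are assumptions chosen for illustration, not values from any dataset documentation:

```python
import pandas as pd

sample = pd.DataFrame({
    "train_departure_delay": [1.0, 3.0, 8541204628196.0],
    "planned_stop_duration": [1.0, -437799.0, 2.0],
})

# Assumed plausibility bounds in minutes (hypothetical; tune them to the data)
plausible_ranges = {
    "train_departure_delay": (-60, 24 * 60),
    "planned_stop_duration": (0, 24 * 60),
}

# between() is False for NaN, so missing values are counted as out of range too
outlier_counts = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in plausible_ranges.items()
}
print(outlier_counts)
```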


**Total Travel Duration Features**

In [87]:
# Planned travel duration for the entire journey, in minutes
df_with_coordinates["scheduled_total_duration"] = (
    df_with_coordinates["scheduled_arrival_time"] - df_with_coordinates["scheduled_departure_time"]
).dt.total_seconds() / 60

# train_departure_delay contains corrupt values (max about 8.5e12 minutes in describe() above).
# Converting them raised: OutOfBoundsDatetime: cannot convert input 8541204628196.0 with the unit 'm'
# Assumption: delays beyond +/- one day are data errors, so they are masked to NaN before conversion.
MAX_DELAY_MINUTES = 24 * 60
dep_delay = df_with_coordinates["train_departure_delay"]
arr_delay = df_with_coordinates["train_arrival_delay"]
dep_delay = dep_delay.where(dep_delay.abs() <= MAX_DELAY_MINUTES)
arr_delay = arr_delay.where(arr_delay.abs() <= MAX_DELAY_MINUTES)

# Actual duration = actual arrival minus actual departure, in minutes
df_with_coordinates["actual_total_duration"] = (
    (df_with_coordinates["scheduled_arrival_time"] + pd.to_timedelta(arr_delay, unit="m"))
    - (df_with_coordinates["scheduled_departure_time"] + pd.to_timedelta(dep_delay, unit="m"))
).dt.total_seconds() / 60

# Ratio between actual and planned total duration
ratio = df_with_coordinates["actual_total_duration"] / df_with_coordinates["scheduled_total_duration"]

# Replace infinities (zero planned duration) and missing values with the neutral ratio 1
df_with_coordinates["planned_vs_actual_total_ratio"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(1)

**Extreme Delay Flag**

In [None]:
# Define extreme delay threshold (e.g., top 5% of delays)
extreme_delay_threshold = df_with_coordinates["stop_arrival_delay"].quantile(0.95)
df_with_coordinates["is_extreme_delay"] = (df_with_coordinates["stop_arrival_delay"] >= extreme_delay_threshold).astype(int)
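Because the flag uses `>=` against the 95th percentile, all ties at the threshold are flagged, so the flagged share can exceed 5% when delays are recorded in whole minutes. A small made-up example:

```python
import pandas as pd

# 90 punctual stops and 10 stops with the same 5-minute delay
delays = pd.Series([0.0] * 90 + [5.0] * 10, name="stop_arrival_delay")

threshold = delays.quantile(0.95)
is_extreme = (delays >= threshold).astype(int)

# Share of rows flagged as extreme
print(threshold, is_extreme.mean())
```

Checking `is_extreme.mean()` on the real data shows how far the flag drifts from the nominal 5%.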

In [None]:
# =====================================================
# 📌 3. BUILD KNOWLEDGE GRAPH
# =====================================================

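import networkx as nx

# Create a directed graph with one node per station
G = nx.DiGraph()
G.add_nodes_from(df["stop_name"].unique())

# Pair each stop with the previous stop of the same train
# (rows are assumed to be in stop order within each train_id)
edges = pd.DataFrame({
    "from_stop": df.groupby("train_id")["stop_name"].shift(1),
    "to_stop": df["stop_name"],
    "delay": df["stop_arrival_delay"],
}).dropna(subset=["from_stop"])

# One edge per consecutive station pair, weighted by the mean arrival delay on that link.
# Aggregating first avoids iterating over all rows and gives a true mean instead of a running half-sum.
edge_weights = edges.groupby(["from_stop", "to_stop"])["delay"].mean()
for (from_stop, to_stop), mean_delay in edge_weights.items():
    G.add_edge(from_stop, to_stop, weight=mean_delay)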


In [None]:
# =====================================================
# 📌 4. COMPUTE GRAPH-BASED FEATURES
# =====================================================

# Centrality measures (PageRank is left unweighted: delay-based weights can be negative)
degree_centrality = nx.degree_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85, weight=None)

# Shortest paths to the main hub (station with the highest degree centrality).
# Only distances to the hub are needed, so the all-pairs computation is avoided. Hop counts are
# used because delay-based edge weights can be negative, which Dijkstra does not support.
main_hub = max(degree_centrality, key=degree_centrality.get)
hops_to_hub = nx.shortest_path_length(G, target=main_hub)

# Assign graph features to dataset (stations with no path to the hub get NaN)
df["degree_centrality"] = df["stop_name"].map(degree_centrality)
df["pagerank"] = df["stop_name"].map(pagerank)
df["shortest_path_to_hub"] = df["stop_name"].map(hops_to_hub)

# Normalize graph features to a maximum of 1
df["degree_centrality"] /= df["degree_centrality"].max()
df["pagerank"] /= df["pagerank"].max()
df["shortest_path_to_hub"] /= df["shortest_path_to_hub"].max()

In [None]:
# =====================================================
# 📌 5. OPTIONAL: WEATHER DATA INTEGRATION
# =====================================================

try:
    weather_df = pd.read_csv("data/external/weather_data.csv", parse_dates=["date"])
    # Derive the daily merge key: df has no "date" column of its own
    df["date"] = df["scheduled_departure_time"].dt.normalize()
    df = df.merge(weather_df, on=["date", "stop_name"], how="left")
    print("✅ Weather data successfully merged!")
except FileNotFoundError:
    print("⚠️ Weather data file not found. Skipping integration.")

In [None]:
# =====================================================
# 📌 6. DEFINE TARGET VARIABLE (y) FOR ML
# =====================================================

df["next_stop_arrival_delay"] = df.groupby("train_id")["stop_arrival_delay"].shift(-1)
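`shift(-1)` within each `train_id` leaves the final stop of every train without a target, so those rows need to be dropped (or masked) before training. A sketch on made-up delays for two hypothetical trains:

```python
import pandas as pd

sample = pd.DataFrame({
    "train_id": ["T1", "T1", "T1", "T2", "T2"],
    "stop_arrival_delay": [0.0, 2.0, 3.0, 5.0, 4.0],
})

# Target: arrival delay at the next stop of the same train
sample["next_stop_arrival_delay"] = sample.groupby("train_id")["stop_arrival_delay"].shift(-1)

# The last stop of each train has no next stop, hence no target
trainable = sample.dropna(subset=["next_stop_arrival_delay"])
print(trainable)
```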


In [None]:
# =====================================================
# 📌 7. FEATURE SELECTION & SAVE FINAL DATASET
# =====================================================

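# Drop identifier and per-stop timestamp columns (the scheduled_* timestamps are kept for time-series models)
drop_cols = ["train_id", "train_number", "stop_departure_time", "stop_arrival_time"]
df.drop(columns=drop_cols, inplace=True)

# Save dataset (create the output directory if it does not exist yet)
PROCESSED_PATH = Path("data/processed")
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)
df.to_csv(PROCESSED_PATH / "train_dataset_with_knowledge_graph.csv", index=False)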

print("✅ Feature extraction with Knowledge Graph completed! Processed dataset saved.")

Your feature extraction pipeline is well-structured and includes a comprehensive set of features covering temporal, spatial, delay propagation, and travel duration aspects. However, considering the **objectives of your project** and **the best practices from the literature**, I have some **suggestions to improve it**.  

---

## ✅ **What Works Well**
✔ **Time-Based Features:**  
   - Useful for identifying trends (rush hour, weekdays, seasons).  
   - **Good addition**: `is_rush_hour` and `season` features.  

✔ **Station-Specific Features:**  
   - **Congestion proxy (`station_traffic`) is a good indicator** of potential delays.  

✔ **Delay Propagation Features:**  
   - **Rolling averages (`rolling_arrival_delay`) are critical for LSTM/RNN models.**  
   - **`is_delay_increasing` helps classify increasing delay events.**  

✔ **Extreme Delay Flag:**  
   - Helps distinguish normal vs. **extreme** delays.  

✔ **Next Stop Delay Prediction (`next_stop_arrival_delay`)**  
   - **Great choice for supervised learning (LSTM, RF, GBM).**  

---

## ❌ **What Could Be Improved or Removed**
### 🔻 **1. K-Means Clustering for Station Groups**  
- **Potential Issue**: Are latitude/longitude available in your dataset? If not, this clustering is not feasible.  
- **Alternative**: Instead of K-Means, you could **use a Knowledge Graph-based station connectivity model** (inspired by your Phase 5 idea).  

### 🔻 **2. Travel Duration Features**
- **`actual_travel_duration` may be redundant**: Since we are predicting **arrival delay**, the delay itself captures unexpected variations in travel time.  
- **Instead**, you can use:  
  ✅ **`planned_vs_actual_duration_ratio`** = `actual_travel_duration / scheduled_travel_duration`  
  - A ratio above 1 suggests delays in travel time.  

### 🔻 **3. Drop Unnecessary Features**
You dropped:  
```python
drop_cols = ["train_id", "train_number", "scheduled_departure_time", 
             "scheduled_arrival_time", "stop_departure_time", "stop_arrival_time", 
             "total_journey_start"]
```
- **Possible issue**:  
  - If you're using **LSTM or RNN**, you might **need timestamps** (`scheduled_departure_time`) for time-series modeling.  
  - Instead of dropping `train_id`, you could use it for **cross-validation grouping** (e.g., ensuring train sequences stay in the same fold).  
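One way to keep each train's stops in a single fold is to derive the fold from `train_id` itself. The sketch below uses a stable hash; `fold_for` and `n_folds` are illustrative names, the sample ids are made up, and scikit-learn's `GroupKFold` offers a ready-made alternative.

```python
import zlib

import pandas as pd


def fold_for(train_id: str, n_folds: int = 5) -> int:
    """Map a train_id to a fold deterministically (same id, same fold)."""
    return zlib.crc32(train_id.encode("utf-8")) % n_folds


sample = pd.DataFrame({
    "train_id": ["17431-a", "17431-a", "17431-a", "2071-b", "2071-b", "9604-c"],
    "stop_arrival_delay": [0.0, 2.0, 3.0, 1.0, 1.0, 7.0],
})
sample["fold"] = sample["train_id"].map(fold_for)

# Number of distinct folds per train: must be 1 for every train
folds_per_train = sample.groupby("train_id")["fold"].nunique()
print(folds_per_train)
```

Hash-based folds are not balanced in size, which is the price for not needing a fitted splitter.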

---

## 🔥 **Features to Consider Adding**
### 🟢 **1. Historical Delay Trends (Inspired by Literature)**
🔹 **Idea from *Real-Time Passenger Train Delay Prediction* (Amtrak study) [11]**  
- **Why?** If a train was delayed at `t-1`, it is more likely to be delayed at `t`.  
- **How?** Compute **past mean delays** per train, per station, per hour.  
```python
df["historical_avg_delay"] = df.groupby(["stop_name", "hour"])["stop_arrival_delay"].transform("mean")
```

### 🟢 **2. Weather Data Integration (Inspired by the Dynamic Delay Predictions study [12])**  
🔹 **Why?** Weather (rain, snow) affects train delays.  
🔹 **How?** If you have external data, join with historical weather features:  
```python
df = df.merge(weather_df, on=["date", "stop_name"], how="left")
```
If `weather_df` contains features like `"rain_mm"`, `"temperature"`, etc., these could be useful.  
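The merge key `date` has to exist on both sides; the train data only carries full timestamps, so the key must be derived first. A sketch with a made-up `weather_df` (the `rain_mm` value is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "stop_name": ["IMOLA", "IMOLA"],
    "scheduled_departure_time": pd.to_datetime(["2024-09-30 22:38:00", "2024-10-01 06:10:00"]),
})
weather_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-09-30"]),
    "stop_name": ["IMOLA"],
    "rain_mm": [4.2],
})

# Truncate the timestamp to midnight so it matches the daily weather key
df["date"] = df["scheduled_departure_time"].dt.normalize()

# Left merge: rows without a weather record keep NaN in the weather columns
df = df.merge(weather_df, on=["date", "stop_name"], how="left")
print(df)
```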

### 🟢 **3. Interaction Features**
Instead of raw congestion counts, express each station's traffic as a **ratio** of the busiest station's traffic:  
```python
df["relative_congestion"] = df["station_traffic"] / df["station_traffic"].max()
```
This rescales congestion to the range (0, 1], with the busiest station at 1.  

---

## 🎯 **Final Feature Engineering Plan**
✔ **Keep**:  
- `hour`, `day_of_week`, `is_weekend`, `is_rush_hour`, `season`  
- `station_traffic`, `is_high_traffic_station`, `delay_change`, `is_delay_increasing`, `rolling_arrival_delay`  
- `historical_avg_delay` ✅ (New!)  
- `relative_congestion` ✅ (New!)  
- `planned_vs_actual_duration_ratio` ✅ (Modified!)  
- `weather_features` (if available) ✅ (New!)  

❌ **Drop or Modify**:  
- `actual_travel_duration` (Use `planned_vs_actual_duration_ratio` instead)  
- `station_cluster` (unless you confirm lat/lon data availability)  

---

## **Final Thoughts**
Your feature extraction is **already very strong**, but adding **historical delay features, weather impact, and congestion ratios** will make it even **more aligned with recent research**.  


In [None]:
# WEATHER DATASET
#
# Temperature
# Rainfall/precipitation
# Wind speed
# Visibility (fog, storms, etc.)
# Severe weather event indicators (e.g., storms, snowfall, heatwaves)