# **Baseline**  

Our main goal is to **predict train delays** at the next stop. This means:  
- **Input (`X`)**: Features related to train schedules, past delays, station characteristics, congestion, and network centrality (if using `df_graph`).  
- **Target (`y`)**: Next stop arrival delay (`stop_arrival_delay`).  

The `df_graph` dataset additionally includes network centrality measures (PageRank, betweenness, closeness, and degree centrality).
Since we expect that the features extracted from the graph could help model how delays propagate across the railway network, we first train the models on the dataset without these features and then on the complete dataset. This allows us to evaluate their actual contribution to the prediction performance.

In this notebook we will only train and analyze the baseline model. **ElasticNet** was chosen for this analysis due to its ability to combine both L1 (Lasso) and L2 (Ridge) regularization. This makes it a flexible choice when dealing with datasets where some features may be highly correlated. The L1 penalty encourages sparsity, automatically selecting the most relevant features by driving less important coefficients to zero, while the L2 penalty helps to handle multicollinearity by penalizing large coefficients.

ElasticNet is particularly advantageous when there are highly correlated features (as we will see between `stop_departure_delay` and `train_departure_delay` in the dataset) because it allows both feature selection and regularization. By tuning the `alpha` (regularization strength) and `l1_ratio` (balance between Lasso and Ridge penalties), ElasticNet offers a balance between eliminating irrelevant variables and stabilizing the estimates of correlated variables.
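For reference, the objective minimized by scikit-learn's `ElasticNet` can be written in terms of the same two parameters. Here $n$ is the number of samples, $w$ the coefficient vector, and $\rho$ stands for `l1_ratio`; the intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the Lasso (L1) term, while $\rho = 0$ leaves only the Ridge (L2) term.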

## **Preprocessing**

In the original dataset, the feature `stop_departure_delay` represents the departure delay at the current station. However, using this feature to predict `stop_arrival_delay` of the same row introduces **data leakage**, as it provides information from the future that would not be available in a real-world prediction scenario.  

To ensure a realistic and valid modeling approach, we **shift** the departure delay of a train's previous stop to the current stop. This means that `prev_stop_departure_delay` now correctly represents the departure delay from the previous station, which can naturally influence the arrival delay at the current station.  

Additionally, rows where no previous stop exists (i.e., the first stop of each train) are removed to maintain dataset integrity. Finally, the original `stop_departure_delay` column is dropped to prevent unintended information leakage.  
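The shift can be illustrated on a small toy table. This is only a sketch: the values are made up, and the rows are assumed to be already in travel order within each train.

```python
import pandas as pd

# Toy data: two trains, stops already in travel order (values are made up).
toy = pd.DataFrame({
    "train_number": [10, 10, 10, 20, 20],
    "stop_departure_delay": [2.0, 1.0, 4.0, 0.0, 3.0],
    "stop_arrival_delay": [3.0, 0.0, 5.0, 1.0, 2.0],
})

# Shift within each train so a row only sees the previous stop's departure delay.
toy["prev_stop_departure_delay"] = (
    toy.groupby("train_number")["stop_departure_delay"].shift(1)
)

# The first stop of each train has no predecessor: its value is NaN and the row is dropped.
toy = (
    toy.dropna(subset=["prev_stop_departure_delay"])
       .drop(columns=["stop_departure_delay"])
       .reset_index(drop=True)
)
print(toy)
```

Three of the five rows survive, one for every stop that has a predecessor on the same train.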

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [18]:
PROCESSED_PATH = Path("data/processed")
df = pd.read_parquet(PROCESSED_PATH / "train_data_fe.parquet")
df_graph = pd.read_parquet(PROCESSED_PATH / "train_data_fe_graph.parquet")

In [19]:
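# Group each train's rows together in approximate time order
df = df.sort_values(by=["train_number", "month", "day_of_week", "hour"])

# Previous row's departure delay within the same train_number.
# NOTE: a train_number recurs across days, so this can pair the first stop of
# one run with the last stop of an earlier run.
df["prev_stop_departure_delay"] = df.groupby("train_number")["stop_departure_delay"].shift(1)

# Redundant safeguard: shift(1) already leaves NaN on each train's first row
df.loc[df.groupby("train_number").head(1).index, "prev_stop_departure_delay"] = np.nan

# Drop the leaky same-row departure delay
df = df.drop(columns=["stop_departure_delay"])

# Drop rows with any missing value, including the first row of each train
df = df.dropna().reset_index(drop=True)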


In [20]:
# Same steps for the graph-augmented dataset
df_graph = df_graph.sort_values(by=["train_number", "month", "day_of_week", "hour"])

df_graph["prev_stop_departure_delay"] = df_graph.groupby("train_number")["stop_departure_delay"].shift(1)

df_graph.loc[df_graph.groupby("train_number").head(1).index, "prev_stop_departure_delay"] = np.nan

df_graph = df_graph.drop(columns=["stop_departure_delay"])

df_graph = df_graph.dropna().reset_index(drop=True)

In [21]:
drop_cols = [
    "scheduled_departure_time", 
    "scheduled_arrival_time",
    "stop_departure_time",
    "departure_station", "arrival_station",
    "stop_name"
]

df = df.drop(columns=drop_cols)
df_graph = df_graph.drop(columns=drop_cols)

In [22]:
df.head()

Unnamed: 0,train_number,train_departure_delay,stop_arrival_delay,is_terminal_stop,latitude,longitude,hour,day_of_week,is_weekend,month,is_rush_hour,is_high_traffic_station,prev_stop_departure_delay
0,10,2.0,3.0,False,45.808946,9.072331,6,0,0,12,0,1,2.0
1,10,1.0,0.0,True,45.486347,9.204528,6,0,0,12,0,1,0.0
2,10,1.0,-5.0,False,45.808946,9.072331,6,0,0,12,0,1,1.0
3,10,1.0,0.0,True,45.486347,9.204528,6,0,0,12,0,1,0.0
4,10,1.0,-7.0,False,45.808946,9.072331,6,0,0,12,0,1,1.0


In [23]:
df.columns

Index(['train_number', 'train_departure_delay', 'stop_arrival_delay',
       'is_terminal_stop', 'latitude', 'longitude', 'hour', 'day_of_week',
       'is_weekend', 'month', 'is_rush_hour', 'is_high_traffic_station',
       'prev_stop_departure_delay'],
      dtype='object')

In the end we keep:
- `train_number` (identifies the service)
- `train_departure_delay` (initial conditions)
- `stop_arrival_delay` (target variable)
- `prev_stop_departure_delay` (useful for delay propagation)
- `is_terminal_stop` (terminal-stop flag)
- `latitude`, `longitude` (in case geospatial patterns exist)
- `hour`, `day_of_week`, `is_weekend`, `month`, `is_rush_hour` (time effects)
- `is_high_traffic_station` (congestion indicator)

In particular, `train_number` identifies a specific sequence of stops. Even though there is some variability, the same train usually follows a consistent route.
This allows us to capture recurring delay patterns. In fact, some train numbers might systematically experience delays due to factors like priority, maintenance schedules, or traffic congestion.

The problem is that `train_number` is a categorical variable with many unique values (16585). It cannot be used directly as a raw category in models like Linear Regression or XGBoost, and it can cause overfitting if some train numbers are rarely seen in training.

Instead of using `train_number` as a raw category, one solution is to replace it with the average arrival delay of that train across the dataset (target encoding, also known as mean encoding). Because this average is derived from the target itself, computing it over the full dataset lets information from the evaluation rows leak into the feature, which should be kept in mind when reading the results.

In [24]:
train_delay_map = df.groupby("train_number")["stop_arrival_delay"].mean()

df["train_avg_delay"] = df["train_number"].map(train_delay_map)
df_graph["train_avg_delay"] = df_graph["train_number"].map(train_delay_map)

df.drop(columns=["train_number"], inplace=True)
df_graph.drop(columns=["train_number"], inplace=True)
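The mapping above is computed over every row, including rows that later end up in the validation and test splits. A leakage-free variant would learn the per-train average from the training rows only and fall back to a global mean for unseen trains. The sketch below uses hypothetical toy frames named `train` and `test`; it is not the code that produced the results in this notebook.

```python
import pandas as pd

# Hypothetical toy split; in practice these would be the training and test rows.
train = pd.DataFrame({
    "train_number": [10, 10, 20, 20],
    "stop_arrival_delay": [2.0, 4.0, 0.0, 10.0],
})
test = pd.DataFrame({"train_number": [10, 30]})

# Learn the per-train mean delay from the training rows only.
delay_map = train.groupby("train_number")["stop_arrival_delay"].mean()
global_mean = train["stop_arrival_delay"].mean()

# Apply the mapping; trains never seen in training fall back to the global mean.
train["train_avg_delay"] = train["train_number"].map(delay_map)
test["train_avg_delay"] = test["train_number"].map(delay_map).fillna(global_mean)
print(test)
```

Train 10 receives its training average, while train 30, which never appears in `train`, receives the overall training mean.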


In [25]:
df.head(10)

Unnamed: 0,train_departure_delay,stop_arrival_delay,is_terminal_stop,latitude,longitude,hour,day_of_week,is_weekend,month,is_rush_hour,is_high_traffic_station,prev_stop_departure_delay,train_avg_delay
0,2.0,3.0,False,45.808946,9.072331,6,0,0,12,0,1,2.0,-2.0
1,1.0,0.0,True,45.486347,9.204528,6,0,0,12,0,1,0.0,-2.0
2,1.0,-5.0,False,45.808946,9.072331,6,0,0,12,0,1,1.0,-2.0
3,1.0,0.0,True,45.486347,9.204528,6,0,0,12,0,1,0.0,-2.0
4,1.0,-7.0,False,45.808946,9.072331,6,0,0,12,0,1,1.0,-2.0
5,1.0,0.0,True,45.486347,9.204528,6,1,0,12,0,1,0.0,-2.0
6,1.0,3.0,False,45.808946,9.072331,6,1,0,12,0,1,1.0,-2.0
7,4.0,0.0,True,45.486347,9.204528,6,1,0,12,0,1,0.0,-2.0
8,4.0,-7.0,False,45.808946,9.072331,6,1,0,12,0,1,4.0,-2.0
9,3.0,0.0,True,45.486347,9.204528,6,1,0,12,0,1,0.0,-2.0


In [26]:
df.describe()

Unnamed: 0,train_departure_delay,stop_arrival_delay,latitude,longitude,hour,day_of_week,is_weekend,month,is_rush_hour,is_high_traffic_station,prev_stop_departure_delay,train_avg_delay
count,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0,21868270.0
mean,2.732117,3.053886,43.52349,11.40381,11.11157,2.752494,0.2242911,6.439963,0.2813572,0.4670946,4.018756,3.053886
std,4.825292,7.552311,2.255552,2.361062,5.025524,1.909132,0.4171147,3.472697,0.4496614,0.4989161,7.275789,2.53276
min,-9.0,-10.0,36.73188,6.709959,0.0,0.0,0.0,1.0,0.0,0.0,-10.0,-10.0
25%,1.0,0.0,41.89614,9.23926,6.0,1.0,0.0,3.0,0.0,0.0,1.0,1.49863
50%,2.0,1.0,44.33214,11.34227,11.0,3.0,0.0,6.0,0.0,0.0,2.0,2.607118
75%,3.0,4.0,45.43387,12.77914,15.0,4.0,0.0,10.0,1.0,1.0,5.0,4.031311
max,289.0,299.0,47.00412,18.16572,23.0,6.0,1.0,12.0,1.0,1.0,299.0,257.0


In [27]:
df.dropna(inplace=True)
df_graph.dropna(inplace=True)

In [28]:
correlation_matrix = df.corr()
print(correlation_matrix["stop_arrival_delay"].sort_values(ascending=False))

stop_arrival_delay           1.000000
prev_stop_departure_delay    0.936630
train_departure_delay        0.522963
train_avg_delay              0.335362
latitude                     0.044421
month                        0.042656
hour                         0.027057
is_rush_hour                 0.016957
day_of_week                 -0.015679
is_weekend                  -0.020547
is_high_traffic_station     -0.032144
longitude                   -0.033651
is_terminal_stop            -0.145491
Name: stop_arrival_delay, dtype: float64


In [29]:
correlation_matrix = df_graph.corr()
print(correlation_matrix["stop_arrival_delay"].sort_values(ascending=False))

stop_arrival_delay           1.000000
prev_stop_departure_delay    0.936630
train_departure_delay        0.522963
train_avg_delay              0.335362
latitude                     0.044421
month                        0.042656
hour                         0.027057
is_rush_hour                 0.016957
pagerank                     0.001079
betweenness_centrality      -0.013184
day_of_week                 -0.015679
is_weekend                  -0.020547
closeness_centrality        -0.021745
is_high_traffic_station     -0.032144
longitude                   -0.033651
degree_centrality           -0.035561
is_terminal_stop            -0.145491
Name: stop_arrival_delay, dtype: float64


The feature `prev_stop_departure_delay` exhibits a very high correlation (0.937) with the target variable `stop_arrival_delay`. This is expected, since train delays tend to propagate along the route, but it raises a critical question for real-world deployment: will this value actually be available at prediction time?

In an ideal scenario, we could assume real-time access to the previous stop’s departure delay, but this assumption might not hold in practice. For example, when predicting delays for future trips, `prev_stop_departure_delay` would be unknown, making our model impractical for many real-world applications.  

To ensure our model is robust and applicable in real-world conditions, we adopt a prudent approach:  
- We exclude `prev_stop_departure_delay` from our feature set and evaluate the model’s performance without it.  
- This forces the model to rely on alternative features like `train_departure_delay`, `train_avg_delay`, and contextual data (`hour`, `day_of_week`, etc.).  
- If performance degradation is minimal, this version of the model is preferable for deployment.  

This project is not only focused on achieving the best predictive performance but also on fostering good machine learning practices in a real-world scenario. By removing `prev_stop_departure_delay`, we explore:  
- How alternative features contribute to delay prediction.  
- The trade-off between predictive power and real-world usability.
- The necessity of feature engineering to compensate for missing high-correlation variables. 

In [30]:
df = df.drop(columns=["prev_stop_departure_delay"])
df_graph = df_graph.drop(columns=["prev_stop_departure_delay"])

df.to_parquet(PROCESSED_PATH / "final_data.parquet")
df_graph.to_parquet(PROCESSED_PATH / "final_data_graph.parquet")

## **Training**

To fine-tune the regularization parameters `alpha` and `l1_ratio`, **GridSearchCV** was applied to the validation set. This allows us to find the optimal combination of regularization strength and the balance between Lasso and Ridge penalties. 

K-Fold Cross-Validation evaluates model performance more reliably by splitting the data into multiple train-test subsets, but it needs care with time-ordered data such as train delays.

Standard K-Fold CV assigns rows to K equal-sized folds without regard to time order. With time series, a model can therefore be trained on future data and then tested on past data, which breaks the time-dependent relationships.

A solution is time-series-aware cross-validation (`TimeSeriesSplit` in scikit-learn).
This technique preserves the time order by ensuring that the training set always precedes the test set.
In this way we avoid data leakage by training only on past data and testing on future data, simulating real-world predictions where we use historical data to forecast upcoming delays.

![image](figures/tscv.png)
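As a small illustration of how the folds are laid out, the sketch below runs `TimeSeriesSplit` on twelve placeholder observations that are assumed to be sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder observations, assumed sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # In every fold, all training indices come before all test indices.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows from fold to fold while each test window covers the block of rows immediately after it. This guarantee only holds if the rows passed to the splitter are themselves in time order.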


In [32]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd
import numpy as np
from pathlib import Path

In [33]:
PROCESSED_PATH = Path("data/processed")
RESULTS_PATH = Path("results")
RESULTS_PATH.mkdir(parents=True, exist_ok=True)

In [34]:
def train_and_evaluate_elasticnet(dataset_name):
    print(f"Training and evaluating ElasticNet on {dataset_name} dataset...")

    df = pd.read_parquet(PROCESSED_PATH / f"final_data{'_graph' if dataset_name == 'Graph' else ''}.parquet")
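    # Approximate chronological order (the saved data has no full timestamp column)
    df = df.sort_values(by=["month", "day_of_week", "hour"])

    target_col = "stop_arrival_delay"
    X = df.drop(columns=[target_col])
    y = df[target_col]

    # 60/20/20 train/validation/test split.
    # NOTE: train_test_split shuffles by default, so the sort above is not preserved
    # in these splits; pass shuffle=False for a chronological split.
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)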

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    X_test_scaled = scaler.transform(X_test)

    elasticnet = ElasticNet()

    param_grid = {
        'alpha': [0.1, 0.5, 1.0, 5.0],  # Regularization strength
        'l1_ratio': [0.1, 0.5, 0.9]     # Mix between Lasso (L1) and Ridge (L2)
    }

    # TimeSeriesSplit for cross-validation
    tscv = TimeSeriesSplit(n_splits=5)

    grid_search = GridSearchCV(estimator=elasticnet, param_grid=param_grid, 
                               cv=tscv, scoring='neg_mean_squared_error', n_jobs=-1)

    grid_search.fit(X_val_scaled, y_val)

    best_params = grid_search.best_params_
    print(f"Best parameters found for {dataset_name}: {best_params}")

    # Train the model with the best parameters on the training set
    best_model = ElasticNet(alpha=best_params['alpha'], l1_ratio=best_params['l1_ratio'])
    best_model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    y_pred = best_model.predict(X_test_scaled)

    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"{dataset_name} ElasticNet Results: MAE: {mae:.4f}, RMSE: {rmse:.4f}, R^2: {r2:.4f}")

    results_filename = f"elasticnet_gridsearch_results_{dataset_name.lower()}.csv"
    pd.DataFrame({
        "MAE": [mae],
        "RMSE": [rmse],
        "R^2": [r2],
        "Best Alpha": [best_params['alpha']],
        "Best L1 Ratio": [best_params['l1_ratio']]
    }).to_csv(RESULTS_PATH / results_filename, index=False)

    return best_model
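Because `train_test_split` shuffles rows by default, the splits in the function above are not chronological even though the frame is sorted first. The following is a minimal sketch of one alternative, run on synthetic data: it keeps row order with `shuffle=False` and puts the scaler inside a `Pipeline` so that it is refit within each cross-validation fold. The feature names `f1` to `f3` and the reduced grid are placeholders, and this is not the procedure that produced the results reported below.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data, assumed already sorted oldest to newest.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = 2.0 * X["f1"] + rng.normal(scale=0.1, size=200)

# shuffle=False keeps row order, so the test rows are the most recent ones.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# The scaler lives inside the pipeline and is refit on the training part of each fold.
pipe = Pipeline([("scale", StandardScaler()), ("model", ElasticNet())])
param_grid = {"model__alpha": [0.1, 1.0], "model__l1_ratio": [0.1, 0.9]}
search = GridSearchCV(
    pipe, param_grid, cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("negative MSE on held-out rows:", search.score(X_test, y_test))
```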

In [35]:
datasets = ["Base", "Graph"]
for dataset in datasets:
    train_and_evaluate_elasticnet(dataset)
    print("\n")

Training and evaluating ElasticNet on Base dataset...
Best parameters found for Base: {'alpha': 0.1, 'l1_ratio': 0.9}
Base ElasticNet Results: MAE: 3.0381, RMSE: 6.1340, R^2: 0.3444


Training and evaluating ElasticNet on Graph dataset...
Best parameters found for Graph: {'alpha': 0.1, 'l1_ratio': 0.9}
Graph ElasticNet Results: MAE: 3.0301, RMSE: 6.1297, R^2: 0.3453




### **Evaluation**

As expected, performance degrades once `prev_stop_departure_delay` is removed.

The **Base** and **Graph** datasets perform almost identically (MAE 3.0381 vs 3.0301, R^2 0.3444 vs 0.3453). Contrary to our expectations, the graph-based features add very little predictive power for this model, and the gain is too small to distinguish from noise. We will see whether later models confirm this behavior.

**ElasticNet** with the best parameters found (`alpha=0.1` and `l1_ratio=0.9`) yields an R^2 of around 0.34, meaning the model explains approximately 34% of the variance in the target variable. Although these results are discouraging in absolute terms, the purpose of these experiments is to compare different datasets and techniques, and there is still room for improvement. In particular, `alpha=0.1` is the smallest value in the grid, so smaller values may be worth exploring.
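To make the three reported metrics concrete, the sketch below computes them on a handful of made-up delays and checks the R^2 value against its definition, one minus the ratio of residual to total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up delays in minutes, purely for illustration.
y_true = np.array([0.0, 2.0, 4.0, 6.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2 = 1 - SS_res / SS_tot: the share of target variance the model accounts for.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}, R^2: {r2_manual:.4f}")
```

An R^2 of 0.34 therefore means that the residual sum of squares is still about two thirds of what a constant prediction at the mean delay would give.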