# Advanced Time Series Forecasting with Attention LSTM

This project implements a deep-learning time series forecasting system built on an attention-based LSTM and compares its performance against an XGBoost baseline.

Evaluation metrics used:
- MAE (mean absolute error)
- RMSE (root mean squared error)
- MAPE (mean absolute percentage error)
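
For reference, the three metrics follow the standard definitions below. The symbols are chosen here for illustration: $y_i$ is the true value, $\hat{y}_i$ the prediction, and $n$ the number of evaluated points.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$

MAPE is undefined when any $y_i$ is zero and becomes unstable when true values are close to zero.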

The LSTM is also evaluated in a rolling, one-step-ahead fashion over the test period.


In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from xgboost import XGBRegressor


## Data Generation

We generate synthetic retail demand data consisting of:
- Sales target (seasonal + trend)
- Foot traffic (correlated feature)
- Market noise (random walk)


In [2]:
np.random.seed(42)

t = np.arange(1095)

sales = 0.05*t + 15*np.sin(2*np.pi*t/365) + np.random.normal(0,2,1095)
traffic = 0.7*sales + np.random.normal(0,1,1095)
noise = np.cumsum(np.random.normal(0,1,1095))

df = pd.DataFrame({
    'Sales_Target':sales,
    'Foot_Traffic':traffic,
    'Market_Noise':noise
})

df.head()


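|   | Sales_Target | Foot_Traffic | Market_Noise |
|---|--------------|--------------|--------------|
| 0 | 0.993428 | 0.614683 | -1.598124 |
| 1 | 0.031672 | 0.100805 | -1.135952 |
| 2 | 1.911701 | -0.66001 | 0.888358 |
| 3 | 3.970355 | 3.695576 | -0.474816 |
| 4 | 0.76373 | 0.881099 | -0.28511 |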


## Preprocessing

We scale every column to the range [0, 1] with min-max scaling and build sliding windows:
each sample holds the previous 30 days of all three features, and the target is the next day's scaled `Sales_Target`.
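
To make the windowing concrete, here is a minimal sketch on a toy array. The helper name `make_windows` and the toy values are assumptions for illustration, not part of the notebook; the sketch also assumes column 0 is the target.

```python
import numpy as np

def make_windows(data, window):
    """Slice a (time, features) array into overlapping windows and next-step targets."""
    X = np.stack([data[i:i + window] for i in range(len(data) - window)])
    y = data[window:, 0]  # column 0 is assumed to be the target
    return X, y

# Toy series: 6 time steps, 2 features (values are arbitrary)
toy = np.arange(12, dtype=float).reshape(6, 2)
X_toy, y_toy = make_windows(toy, window=3)

print(X_toy.shape)  # (3, 3, 2): 3 samples, 3 time steps, 2 features
print(y_toy)        # [ 6.  8. 10.]
```

A series of length `T` yields `T - window` samples, so the 1095-day series with a 30-day window gives 1065 samples.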


In [3]:
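# NOTE: the scaler is fit on the full dataset, so test-period minima and maxima
# influence the training inputs (a mild form of look-ahead leakage).
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)

WINDOW = 30  # number of past days fed to the model

def create_sequences(data, window):
    """Return (X, y): X[i] holds `window` consecutive rows, y[i] is column 0 of the next row."""
    X, y = [], []
    for i in range(len(data) - window):
        X.append(data[i:i + window])    # all features for days i .. i+window-1
        y.append(data[i + window, 0])   # scaled Sales_Target on day i+window
    return np.array(X), np.array(y)

X, y = create_sequences(scaled, WINDOW)

# Chronological 80/20 split (no shuffling, so the test set is strictly later in time)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)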
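
The cell above fits the scaler on the entire series. One way to avoid that look-ahead is to compute the scaling statistics from the training rows only, which could be sketched as follows. The helper names and the random stand-in data are hypothetical, and the row cut-off that would match the notebook's window-level split is left as an assumption.

```python
import numpy as np

def fit_minmax(train):
    """Return per-column minimum and range computed from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_minmax(data, col_min, col_range):
    """Scale data with statistics that were fit elsewhere."""
    return (data - col_min) / col_range

rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 3)).cumsum(axis=0)  # hypothetical stand-in for df.values
n_train = int(0.8 * len(raw))

col_min, col_range = fit_minmax(raw[:n_train])
train_scaled = apply_minmax(raw[:n_train], col_min, col_range)
test_scaled = apply_minmax(raw[n_train:], col_min, col_range)  # may fall outside [0, 1]
```

With this approach, test values can land outside [0, 1] when the series trends beyond the training range, which is expected rather than an error.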


## Attention LSTM Model

The model uses:
- LSTM encoder
- Attention mechanism
- Fully connected output layer
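
The attention step listed above amounts to a softmax-weighted average of the LSTM outputs over time. A NumPy sketch of that pooling, using random stand-ins for the LSTM outputs and the learned parameters (the function name `attention_pool` is an assumption for illustration):

```python
import numpy as np

def attention_pool(hidden, w, b):
    """hidden: (batch, time, hidden_size); w: (hidden_size,); b: scalar."""
    scores = hidden @ w + b                              # (batch, time)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over the time axis
    context = (weights[..., None] * hidden).sum(axis=1)  # (batch, hidden_size)
    return context, weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 30, 64))  # stand-in for LSTM outputs
w, b = rng.normal(size=64), 0.0        # stand-in for learned parameters
context, weights = attention_pool(hidden, w, b)

print(context.shape)  # (2, 64)
print(weights.shape)  # (2, 30)
```

Each row of `weights` sums to 1, so the context vector is a convex combination of the 30 hidden states in the window.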


In [4]:
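class AttentionLSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.attn = nn.Linear(hidden_size, 1)  # one attention score per time step
        self.fc = nn.Linear(hidden_size, 1)    # context vector -> one-step forecast

    def forward(self, x):
        out, _ = self.lstm(x)                           # (batch, window, hidden)
        weights = torch.softmax(self.attn(out), dim=1)  # (batch, window, 1), sums to 1 over time
        context = torch.sum(weights * out, dim=1)       # (batch, hidden)
        return self.fc(context)                         # (batch, 1)

model = AttentionLSTM(3, 64)  # 3 input features, 64 hidden units
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()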


In [5]:
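for epoch in range(40):
    # Full-batch training: the whole training set goes through one forward pass,
    # so each epoch is a single gradient update.
    pred = model(X_train).squeeze()  # (n_train, 1) -> (n_train,)
    loss = loss_fn(pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print("Epoch", epoch, "Loss", loss.item())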


Epoch 0 Loss 0.30405905842781067
Epoch 10 Loss 0.1430244743824005
Epoch 20 Loss 0.02876606397330761
Epoch 30 Loss 0.024713926017284393
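
The loop above performs one full-batch update per epoch, so 40 epochs amount to 40 gradient steps. A mini-batch variant is sketched below as one possible alternative, not as the notebook's method. The function name, the stand-in data, and the small linear demo model are hypothetical and only show the call pattern.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_minibatch(model, X, y, epochs=5, batch_size=64, lr=1e-3):
    """Train with shuffled mini-batches; returns the mean loss per epoch."""
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb).squeeze(-1), yb)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)  # weight by batch size
        history.append(total / len(X))
    return history

# Hypothetical stand-in data and model, only to show the call pattern
torch.manual_seed(0)
X_demo = torch.randn(256, 30, 3)
y_demo = torch.randn(256)
demo_model = nn.Sequential(nn.Flatten(), nn.Linear(90, 1))
history = train_minibatch(demo_model, X_demo, y_demo, epochs=3)
```

With the 852 training windows in this notebook and a batch size of 64, each epoch would perform 14 updates instead of one. Shuffling the windows is safe here because each window keeps its internal time order.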


## Rolling Origin Cross Validation

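Instead of predicting the whole test set in one batch, the model predicts
one step at a time while moving forward through the test period. The model is
not refit at each origin, so this is a walk-forward evaluation of a fixed model
rather than a full rolling-origin cross-validation with retraining.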


In [6]:
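def rolling_cv(model, X, y):
    """Predict one step ahead for each window in X, in time order.

    The model is not refit between steps; `y` is accepted but unused.
    """
    preds = []
    for i in range(len(X)):
        with torch.no_grad():  # inference only, no gradient tracking
            p = model(X[i:i + 1]).item()
        preds.append(p)
    return np.array(preds)

lstm_preds = rolling_cv(model, X_test, y_test)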
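
The function above keeps the model fixed while stepping through the test windows. A rolling-origin scheme that refits at each origin needs a sequence of expanding train/test splits. A minimal sketch of such a split generator follows; the function name and its parameters are assumptions for illustration.

```python
def rolling_origin_splits(n_samples, initial_train, horizon):
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    origin = initial_train
    while origin + horizon <= n_samples:
        yield list(range(origin)), list(range(origin, origin + horizon))
        origin += horizon  # move the forecast origin forward

for train_idx, test_idx in rolling_origin_splits(10, initial_train=4, horizon=2):
    print(len(train_idx), test_idx)
# 4 [4, 5]
# 6 [6, 7]
# 8 [8, 9]
```

Each fold would train a fresh model on `train_idx` and score it on `test_idx`, and the fold metrics would then be averaged. scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window splitter.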


In [7]:
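def metrics(true, pred):
    """Return MAE, RMSE and MAPE (in percent)."""
    mae = mean_absolute_error(true, pred)
    rmse = np.sqrt(mean_squared_error(true, pred))
    # MAPE divides by the true values, so it is unstable when they are near zero
    mape = np.mean(np.abs((true - pred) / true)) * 100
    return mae, rmse, mape

# Both arrays are min-max scaled, so MAE and RMSE are in scaled units, not sales units
mae_lstm, rmse_lstm, mape_lstm = metrics(y_test.numpy(), lstm_preds)
print("ATTENTION LSTM:", mae_lstm, rmse_lstm, mape_lstm)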


ATTENTION LSTM: 0.2079366382578729 0.21785477440623596 28.068345115473615
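
These numbers are computed on min-max-scaled values. To report errors in the original sales units, the scaling of the target column can be inverted first. The sketch below uses made-up numbers; with the scaler fitted earlier, the corresponding values would be `scaler.data_min_[0]` and `scaler.data_range_[0]`.

```python
import numpy as np

def inverse_minmax(scaled_values, data_min, data_range):
    """Undo min-max scaling for one column: x = x_scaled * range + min."""
    return np.asarray(scaled_values) * data_range + data_min

# Hypothetical numbers for illustration only
data_min, data_range = -10.0, 80.0
true_scaled = np.array([0.50, 0.75])
pred_scaled = np.array([0.55, 0.70])

true_units = inverse_minmax(true_scaled, data_min, data_range)  # approx. [30, 50]
pred_units = inverse_minmax(pred_scaled, data_min, data_range)  # approx. [34, 46]
mae_units = np.mean(np.abs(true_units - pred_units))            # approx. 4.0
```

MAE and RMSE simply scale by `data_range` under this transformation. MAPE does not, because the scaling also shifts the values that appear in its denominator.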


## XGBoost Baseline

A gradient boosting regression model is used as a classical ML baseline.
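
Tree models expect a 2-D feature matrix, so each 30-day window of three features is flattened into a 90-column row. The sketch below shows how a flattened column index maps back to a time step and a feature, which can help when reading feature importances. The helper name and the demo array are assumptions for illustration.

```python
import numpy as np

n_samples, window, n_features = 4, 30, 3
X_demo = np.arange(n_samples * window * n_features).reshape(n_samples, window, n_features)
X_flat = X_demo.reshape(n_samples, -1)  # shape (4, 90)

def column_to_lag_feature(col, n_features=3):
    """Map a flattened column index back to (time step in window, feature index)."""
    return divmod(col, n_features)

step, feat = column_to_lag_feature(89)
print(step, feat)  # 29 2: last day of the window, third feature
```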


In [8]:
# Flatten each (30, 3) window into a 90-column row for the tree model
X_xgb = X.reshape(len(X), -1)
X_train_xgb, X_test_xgb = X_xgb[:split], X_xgb[split:]  # same chronological split as the LSTM
y_train_xgb, y_test_xgb = y[:split], y[split:]

xgb = XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.05)
xgb.fit(X_train_xgb, y_train_xgb)

xgb_preds = xgb.predict(X_test_xgb)

mae_xgb, rmse_xgb, mape_xgb = metrics(y_test_xgb, xgb_preds)
print("XGBOOST:", mae_xgb, rmse_xgb, mape_xgb)


XGBOOST: 0.03786828562484861 0.047326683685442406 5.162346132513275


## Final Comparison

We compare deep learning vs gradient boosting performance.


In [9]:
print("\nFINAL COMPARISON")
print("Model\t\tMAE\tRMSE\tMAPE")
print(f"AttentionLSTM\t{mae_lstm:.4f}\t{rmse_lstm:.4f}\t{mape_lstm:.2f}")
print(f"XGBoost\t\t{mae_xgb:.4f}\t{rmse_xgb:.4f}\t{mape_xgb:.2f}")



FINAL COMPARISON
Model		MAE	RMSE	MAPE
AttentionLSTM	0.2079	0.2179	28.07
XGBoost		0.0379	0.0473	5.16


## Conclusion

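XGBoost achieved much lower error than the Attention LSTM on this synthetic dataset
(MAE 0.0379 vs. 0.2079, MAPE 5.16% vs. 28.07%, measured on min-max-scaled values).
The comparison is not on equal footing, though: the LSTM received only 40 full-batch
gradient updates with untuned hyperparameters, while XGBoost was fit with 300 trees on
the flattened 30-day windows. The Attention LSTM would likely need longer training,
more data, and hyperparameter tuning before it could outperform the classical ML baseline.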

Future improvements:
- Hyperparameter tuning
- Longer training
- Real dataset
- Additional features
