# Store Sales Time Series Forecasting with Neural Networks

## Project Overview

This project investigates the application of neural networks for demand forecasting in retail environments. The primary objective is to answer whether neural models – from simple dense networks (MLP) to more advanced architectures like LSTM – can effectively predict product demand and thereby support optimization of supply chain costs and inventory management.

**Key Research Question**: Can neural networks outperform traditional time series forecasting methods in predicting retail sales, and what is the optimal architecture for this specific domain?

---

## Project Objectives

### Primary Objectives
1. **Develop and Compare Neural Network Architectures** for time series forecasting in retail sales
2. **Evaluate Performance** against traditional baseline methods (Linear Regression, Prophet, Moving Averages)
3. **Optimize Model Architecture** for different product categories and store types
4. **Provide Practical Insights** for inventory management and demand planning

### Secondary Objectives
1. **Feature Engineering**: Identify and create relevant features from temporal, promotional, and external data
2. **Model Interpretability**: Understand what patterns neural networks learn from sales data
3. **Scalability Assessment**: Evaluate computational requirements for real-world deployment
4. **Cross-validation Strategy**: Develop appropriate time series validation methodology

---

## Dataset Description

### Source
**Kaggle Competition**: [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data)

**Corporación Favorita Grocery Sales Dataset**
- **Time Period**: 2013-01-01 to 2017-08-15 for training (1,684 days); the test set covers 2017-08-16 to 2017-08-31
- **Stores**: 54 retail stores across Ecuador
- **Product Families**: 33 different product categories
- **Total Records**: ~3 million daily sales records
- **Coverage**: Complete grid format (store × product × date)

### Data Files
1. **train.csv** - Historical sales data
   - `date`: Date of sale
   - `store_nbr`: Store identifier (1-54)
   - `family`: Product family/category
   - `sales`: Total sales for a product family at a store on a given date (target variable; values can be fractional)
   - `onpromotion`: Number of items on promotion

2. **test.csv** - Test set for predictions (the 16 days after the training period, 2017-08-16 to 2017-08-31)

3. **stores.csv** - Store metadata
   - `store_nbr`: Store identifier
   - `city`: Store location
   - `state`: State/province
   - `type`: Store type (A, B, C, D, E)
   - `cluster`: Store cluster (1-17)

4. **oil.csv** - Daily oil prices (Ecuador's economy is oil-dependent)
   - `date`: Date
   - `dcoilwtico`: Oil price

5. **holidays_events.csv** - Holiday and event information
   - `date`: Date
   - `type`: Holiday type (Holiday, Event, etc.)
   - `locale`: Geographic scope (National, Regional, Local)
   - `transferred`: Whether holiday was transferred

6. **transactions.csv** - Daily transaction counts by store
   - `date`: Date
   - `store_nbr`: Store identifier
   - `transactions`: Number of transactions

### Data Characteristics
- **Temporal Granularity**: Daily sales records
- **Zero Sales**: ~31% of records have zero sales; these are recorded as zeros rather than as missing values
- **Sales Range**: 0.01 to 3,502 per day per store-product (unit sales, not currency)
- **Seasonality**: Clear weekly and monthly patterns
- **External Factors**: Oil prices, holidays, promotions impact sales

---

## Neural Network Architectures

### 1. Baseline Models
- **Linear Regression**: Simple linear relationship modeling
- **Prophet**: Facebook's time series forecasting tool
- **Moving Averages**: Simple moving average and exponential smoothing

### 2. Multi-Layer Perceptron (MLP)
- **Architecture**: Dense feedforward network
- **Input**: Windowed time series features + categorical embeddings
- **Hidden Layers**: 2-4 layers with 128-512 neurons
- **Activation**: ReLU for hidden layers, Linear for output
- **Regularization**: Dropout (0.2-0.5), L2 regularization

### 3. Long Short-Term Memory (LSTM)
- **Architecture**: Recurrent neural network with memory cells
- **Input**: Sequential time series data (lookback window)
- **LSTM Layers**: 1-3 layers with 64-256 units
- **Output**: Dense layer for final prediction
- **Variants**: Vanilla LSTM, Bidirectional LSTM, Stacked LSTM

### 4. Advanced Architectures (Optional)
- **CNN-LSTM**: Convolutional layers for feature extraction + LSTM for temporal modeling
- **Transformer**: Attention-based architecture for sequence modeling
- **GRU**: Gated Recurrent Unit as LSTM alternative

### 5. TabPFN (Tabular Prior-Data Fitted Networks)
- **Architecture**: Pre-trained Transformer for small tabular datasets
- **Approach**: Prior-Data Fitted Networks using synthetic datasets
- **Advantages**: 
  - No hyperparameter tuning required
  - Extremely fast inference (seconds)
  - Strong performance on small tabular problems
  - No training needed - uses pre-trained weights
- **Application**: Feature-based forecasting (convert time series to tabular format)
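- **Limitations**: Designed for small tables; this project assumes a cap of 100 features and 10,000 samples per fit (actual limits depend on the TabPFN version), so the ~3 million training rows must be subsampled or fitted per series. Regression requires a release that ships a regressor, since the original TabPFN was classification-only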
- **Use Case**: Benchmark against traditional tabular ML approaches

---

## Data Input Specifications

### Feature Engineering Pipeline

#### 1. Temporal Features
- **Lag Features**: Sales from previous 1, 7, 14, 30 days
- **Rolling Statistics**: Mean, std, min, max over 7, 14, 30-day windows
- **Seasonal Features**: Day of week, month, quarter, year
- **Calendar Features**: Is weekend, is month start/end, days from holiday
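
The lag, rolling, and calendar features above can be sketched with pandas as follows. This is a minimal sketch, assuming a long-format frame with the `train.csv` columns and a datetime `date` column; the function name `add_temporal_features` and the output column names are illustrative, not part of the project code. Rolling statistics are computed on values shifted by one day, an assumption made here so that a row's features never include its own target.

```python
import pandas as pd

def add_temporal_features(df, lags=(1, 7, 14, 30), windows=(7, 14, 30)):
    """Add per-series lag, rolling, and calendar features (illustrative helper)."""
    df = df.sort_values(["store_nbr", "family", "date"]).copy()
    grouped = df.groupby(["store_nbr", "family"])["sales"]

    # Lag features: sales from `lag` days earlier within the same series
    for lag in lags:
        df[f"sales_lag_{lag}"] = grouped.shift(lag)

    # Rolling statistics over the days *before* the current one (shift by 1)
    for window in windows:
        df[f"rolling_mean_{window}"] = grouped.transform(
            lambda s: s.shift(1).rolling(window).mean()
        )
        df[f"rolling_std_{window}"] = grouped.transform(
            lambda s: s.shift(1).rolling(window).std()
        )

    # Calendar features derived from the date
    df["day_of_week"] = df["date"].dt.dayofweek  # Monday = 0
    df["month"] = df["date"].dt.month
    df["is_weekend"] = df["day_of_week"] >= 5
    return df
```

The first rows of every series have `NaN` lags and rolling values; whether to drop or impute them is left open here.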

#### 2. Categorical Features
- **Store Information**: Store number, type, cluster, city, state
- **Product Information**: Product family
- **Encoding Method**: Embedding layers for neural networks

#### 3. External Features
- **Oil Prices**: Current and lagged oil prices
- **Holidays**: Holiday indicators (national, regional, local)
- **Promotions**: Number of items on promotion
- **Transactions**: Store transaction counts
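
Before merging, an external series such as oil prices needs one value per calendar day, because commodity prices are typically not quoted on weekends. The sketch below is one possible approach, assuming the `oil.csv` columns with a datetime `date`; forward-filling the gaps is an assumption made for illustration, and `daily_oil_prices` is a hypothetical helper name.

```python
import pandas as pd

def daily_oil_prices(oil, start, end):
    """Reindex oil prices to every calendar day, carrying the last known price forward."""
    daily = (
        oil.set_index("date")["dcoilwtico"]
        .reindex(pd.date_range(start, end, freq="D"))
        .ffill()
        .bfill()  # covers a missing value at the very start of the range
    )
    return daily.rename_axis("date").reset_index()

# Toy example: prices on a Friday and the following Monday only
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-11", "2017-08-14"]),
    "dcoilwtico": [48.8, 47.6],
})
daily = daily_oil_prices(oil, "2017-08-11", "2017-08-15")
daily["oil_lag_7"] = daily["dcoilwtico"].shift(7)  # lagged price feature
```

The resulting frame can then be joined onto the sales data with `sales.merge(daily, on="date", how="left")`.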

### Input Data Format

#### For MLP Models
```python
# Feature vector per sample; input shape: (batch_size, n_features)
feature_names = [
    # Temporal features (10)
    "sales_lag_1", "sales_lag_7", "sales_lag_14", "sales_lag_30",
    "rolling_mean_7", "rolling_std_7", "rolling_mean_30", "rolling_std_30",
    "day_of_week", "month",

    # Store embeddings (embedded to 8 dimensions)
    *[f"store_embedding_{i}" for i in range(1, 9)],

    # Product embeddings (embedded to 6 dimensions)
    *[f"family_embedding_{i}" for i in range(1, 7)],

    # External features (4)
    "oil_price", "is_holiday", "onpromotion", "transactions",
]

n_features = len(feature_names)  # 28 dimensions
```
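
To make the 28-dimensional layout concrete, the sketch below assembles one MLP input batch in NumPy by looking up embedding rows and concatenating them with the numeric features. It is an illustration under stated assumptions: the lookup tables are randomly initialised here (a real model would learn them), and `build_mlp_input`, `store_table`, and `family_table` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STORES, N_FAMILIES = 54, 33
STORE_DIM, FAMILY_DIM = 8, 6
N_NUMERIC = 14  # 10 temporal + 4 external features

# Randomly initialised lookup tables; a trained model would learn these weights
store_table = rng.normal(size=(N_STORES, STORE_DIM))
family_table = rng.normal(size=(N_FAMILIES, FAMILY_DIM))

def build_mlp_input(numeric, store_idx, family_idx):
    """Concatenate numeric features with the embedding rows for each sample."""
    return np.concatenate(
        [numeric, store_table[store_idx], family_table[family_idx]], axis=1
    )

batch_size = 4
x = build_mlp_input(
    numeric=rng.normal(size=(batch_size, N_NUMERIC)),
    store_idx=np.array([0, 0, 1, 53]),   # store_nbr - 1
    family_idx=np.array([0, 1, 0, 32]),  # integer-encoded product family
)
print(x.shape)  # (4, 28)
```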

#### For LSTM Models
```python
# Sequential data with lookback window
# Input shape: (batch_size, sequence_length, n_features)

# Example: 30-day lookback window
sequence_length = 30
n_features = 8  # [sales, oil_price, onpromotion, transactions, 
                #  day_of_week, month, is_holiday, is_weekend]

# Each sample contains 30 consecutive days of data
# Target: sales value for day 31
```
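
The windowing described above can be sketched in NumPy as follows. This is a minimal sketch, assuming the rows of one (store, family) series are already sorted by date; `make_sequences` is an illustrative name, and windows should be built per series so that no window spans two different series.

```python
import numpy as np

def make_sequences(features, target, sequence_length=30):
    """Turn one series into (n_samples, sequence_length, n_features) windows.

    features: array of shape (n_days, n_features), ordered by date
    target:   array of shape (n_days,), the sales value for each day
    """
    features = np.asarray(features)
    target = np.asarray(target)
    n_samples = len(features) - sequence_length
    if n_samples <= 0:
        raise ValueError("series is shorter than the lookback window")
    X = np.stack([features[i:i + sequence_length] for i in range(n_samples)])
    y = target[sequence_length:]  # the day right after each window
    return X, y

# Toy series: 40 days with 8 features each
X, y = make_sequences(np.zeros((40, 8)), np.arange(40.0))
print(X.shape, y.shape)  # (10, 30, 8) (10,)
```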

#### For TabPFN Models
```python
# Tabular format with engineered features
# Input shape: (n_samples, n_features); assumed caps of 100 features and 10,000 samples

# Example feature vector for TabPFN:
[
    # Statistical features from last 7/14/30 days (60 features)
    sales_mean_7d, sales_std_7d, sales_min_7d, sales_max_7d, sales_last_7d,
    sales_mean_14d, sales_std_14d, sales_min_14d, sales_max_14d, sales_last_14d,
    sales_mean_30d, sales_std_30d, sales_min_30d, sales_max_30d, sales_last_30d,
    # ... similar for oil_price, onpromotion, transactions (45 more features)
    
    # Categorical and binary features (one-hot encoded, 39 features)
    store_type_A, store_type_B, store_type_C, store_type_D, store_type_E,
    family_automotive, family_baby_care, ..., family_other,  # 33 families
    is_weekend,
    
    # Total: 99 features (within the 100-feature limit)
]

# Note: Time series is converted to cross-sectional tabular format
# Each row represents one store-product-date combination
# Features capture temporal patterns through statistical aggregations
```
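
The statistical aggregations can be sketched as a small helper that summarises the trailing days of one series. This is an illustration only: `window_aggregates` and the feature naming are assumptions, and `np.std` is used with its default population formula (`ddof=0`).

```python
import numpy as np

AGGREGATIONS = {
    "mean": np.mean,
    "std": np.std,
    "min": np.min,
    "max": np.max,
    "last": lambda values: values[-1],
}

def window_aggregates(history, name, windows=(7, 14, 30)):
    """Summarise the trailing days of one series into a flat feature dict."""
    history = np.asarray(history, dtype=float)
    features = {}
    for window in windows:
        recent = history[-window:]
        for agg_name, agg in AGGREGATIONS.items():
            features[f"{name}_{agg_name}_{window}d"] = float(agg(recent))
    return features

# 5 aggregations x 3 windows = 15 features per variable
sales_features = window_aggregates(range(1, 31), "sales")
print(len(sales_features))  # 15
```

Under this reading, the `last` aggregate is identical for every window, so it may be worth keeping it once or replacing it with the value at the start of each window.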

### Data Preprocessing Steps

1. **Handling Zero Sales**
   - Option 1: Log transformation with offset: log(sales + 1)
   - Option 2: Separate binary classifier for zero vs non-zero
   - Option 3: Focus on positive sales only

2. **Normalization**
   - Numerical features: StandardScaler or MinMaxScaler
   - Sales target: Log transformation or standardization

3. **Time Series Split**
   - Training: 2013-01-01 to 2017-07-31
   - Validation: 2017-08-01 to 2017-08-15
   - Test: 2017-08-16 to 2017-08-31

4. **Cross-Validation Strategy**
   - Time Series Split: Expanding window validation
   - No random shuffling (preserves temporal order)
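
A minimal sketch of the log transformation and the chronological split is shown below. It assumes a frame with a datetime `date` column; the helper names are illustrative. Any scaler should be fitted on the training portion only, so that validation statistics do not leak into training.

```python
import numpy as np
import pandas as pd

def chronological_split(df, val_start="2017-08-01", date_col="date"):
    """Split on a date boundary so validation rows are strictly later than training rows."""
    boundary = pd.Timestamp(val_start)
    train = df[df[date_col] < boundary]
    val = df[df[date_col] >= boundary]
    return train, val

def to_log_scale(sales):
    """Option 1 for zero sales: log(sales + 1), which maps 0 to 0."""
    return np.log1p(sales)

def from_log_scale(log_sales):
    """Invert the log transform and clip small negative results to zero."""
    return np.clip(np.expm1(log_sales), 0.0, None)
```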

---

## Output Format Specifications

### Model Predictions

#### Single-Step Forecasting
```python
# Output for each model prediction
# Output shape: (n_samples, 1)

# Example output:
predictions = [
    [12.45],    # Predicted sales for store 1, family A, date t+1
    [8.73],     # Predicted sales for store 1, family B, date t+1
    [23.12],    # Predicted sales for store 2, family A, date t+1
    ...
]

# Post-processing (in this order):
# - Inverse transform if log scaling was applied: expm1(prediction)
# - Ensure non-negative predictions: max(0, prediction)
# - Round to an appropriate precision (e.g. 0.01)
```

#### Multi-Step Forecasting (Optional)
```python
# Output for multiple days ahead
# Output shape: (n_samples, forecast_horizon)

# Example: 15-day forecast
forecasts = [
    [12.45, 11.23, 13.67, ..., 14.56],  # 15-day forecast for sample 1
    [8.73, 9.12, 8.45, ..., 9.23],      # 15-day forecast for sample 2
    ...
]
```
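
One common way to obtain a multi-day forecast from a single-step model is the recursive strategy, where each prediction is fed back as input for the next day. The sketch below illustrates that idea only; `predict_next` is a hypothetical callable standing in for any one-step model, and the project may instead use a direct multi-output model.

```python
import numpy as np

def recursive_forecast(history, predict_next, horizon=15, lookback=30):
    """Roll a one-step model forward `horizon` days, feeding predictions back in."""
    window = [float(v) for v in history[-lookback:]]
    forecasts = []
    for _ in range(horizon):
        # Clip at zero so a negative prediction is never fed back as input
        next_value = max(0.0, float(predict_next(np.asarray(window))))
        forecasts.append(next_value)
        window = window[1:] + [next_value]
    return np.asarray(forecasts)

# Toy one-step "model": the mean of the last 7 days in the window
forecast = recursive_forecast([10.0] * 30, lambda w: w[-7:].mean())
print(forecast.shape)  # (15,)
```

Errors accumulate over the horizon with this strategy, which is worth checking against a direct multi-step model.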

### Submission Format
```python
import pandas as pd

# Kaggle competition format: reuse the integer `id` column from test.csv
test_df = pd.read_csv('test.csv')
submission_df = pd.DataFrame({
    'id': test_df['id'],    # Row identifier supplied in test.csv
    'sales': predictions    # 1-D array of predicted sales, in test.csv row order
})
submission_df.to_csv('submission.csv', index=False)

# Example:
#         id  sales
# 0  3000888  12.45
# 1  3000889   8.73
# 2  3000890  23.12
# ...
```

### Model Outputs Structure
```python
# Comprehensive model output dictionary
model_output = {
    'predictions': predictions,           # Raw predictions
    'confidence_intervals': intervals,    # Prediction intervals (if available)
    'feature_importance': importance,     # Feature importance scores
    'training_history': history,          # Training loss/metrics history
    'validation_metrics': val_metrics,    # Validation performance
    'model_metadata': {
        'architecture': 'LSTM',
        'parameters': model_params,
        'training_time': training_time,
        'inference_time': inference_time
    }
}
```

---

## Evaluation Metrics

### Primary Metrics

#### 1. Root Mean Square Logarithmic Error (RMSLE)
```python
import numpy as np

# Primary metric for Kaggle competition
def rmsle(actual, predicted):
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Advantages:
# - Penalizes underestimation more than overestimation
# - Handles zero values well
# - Measures relative error, so high-volume series do not dominate
```

#### 2. Mean Absolute Error (MAE)
```python
import numpy as np

# Interpretable metric in original units
def mae(actual, predicted):
    return np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))

# Advantages:
# - Easy to interpret (average prediction error)
# - Less sensitive to outliers than RMSE
# - Same units as target variable
```

#### 3. Root Mean Square Error (RMSE)
```python
import numpy as np

# Standard regression metric
def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

# Advantages:
# - Penalizes large errors more heavily
# - Commonly used benchmark
# - Same units as target variable
```

### Secondary Metrics

#### 4. Mean Absolute Percentage Error (MAPE)
```python
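import numpy as np

# Percentage-based metric; undefined where actual == 0, so those rows are excluded
def mape(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    return np.mean(np.abs(actual[mask] - predicted[mask]) / np.abs(actual[mask])) * 100

# Advantages:
# - Scale-independent
# - Easy to interpret as percentage
# - Good for business understanding
# Caveat: with ~31% zero-sales records, MAPE covers only part of the data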
```

#### 5. Weighted MAPE (wMAPE)
```python
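import numpy as np

# Weighted version to handle zero sales (defined as long as total actual sales > 0)
def wmape(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sum(np.abs(actual - predicted)) / np.sum(np.abs(actual)) * 100

# Advantages:
# - Better handling of zero/small values
# - More stable than standard MAPE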
```

#### 6. Directional Accuracy
```python
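import numpy as np

# Share of correct trend predictions, measured against the previous actual value
def directional_accuracy(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    actual_move = np.sign(actual[1:] - actual[:-1])
    predicted_move = np.sign(predicted[1:] - actual[:-1])
    return np.mean(actual_move == predicted_move)

# Advantages:
# - Measures trend prediction accuracy
# - Important for inventory planning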
```

### Evaluation Strategy

#### Time Series Cross-Validation
```python
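# Expanding window validation over rows sorted by date
# (initial_train_size, step_size, and validation_window are row counts set by the user)
metrics = {}
for fold in range(n_folds):
    train_end = initial_train_size + fold * step_size
    val_start = train_end  # slices are end-exclusive, so validation starts here
    val_end = val_start + validation_window

    # Train on expanding window
    train_data = data[:train_end]
    val_data = data[val_start:val_end]

    # Evaluate and store metrics (evaluate_model is a project-specific helper)
    metrics[fold] = evaluate_model(model, train_data, val_data)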
```

#### Evaluation by Segments
1. **By Product Family**: Performance for each of 33 product categories
2. **By Store Type**: Performance across different store types (A, B, C, D, E)
3. **By Store Cluster**: Performance across 17 store clusters
4. **By Time Period**: Performance during different seasons/months
5. **By Sales Volume**: Performance for high vs low-volume products
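
Segment-level evaluation can be sketched with a pandas group-by. The example assumes a frame that holds actual sales, predictions, and the segment column side by side; the column name `prediction` and the helper `rmsle_by_segment` are illustrative.

```python
import numpy as np
import pandas as pd

def rmsle_by_segment(df, segment_col, actual_col="sales", pred_col="prediction"):
    """Compute RMSLE per segment, worst-performing segments first."""
    squared_log_error = (np.log1p(df[pred_col]) - np.log1p(df[actual_col])) ** 2
    per_segment = np.sqrt(squared_log_error.groupby(df[segment_col]).mean())
    return per_segment.sort_values(ascending=False)
```

The same helper works for `family`, store `type`, `cluster`, or a derived month or volume-bucket column.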

#### Statistical Significance Testing
```python
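# Diebold-Mariano test for forecast accuracy comparison
# Note: diebold_mariano_test is a project helper (not part of SciPy) that takes
# two aligned forecast-error series and returns (statistic, p_value)

# Test whether the accuracy of Model A and Model B differs significantly
dm_statistic, p_value = diebold_mariano_test(errors_A, errors_B)
print(f"DM statistic: {dm_statistic:.4f}, p-value: {p_value:.4f}")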
```
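
One possible implementation of such a helper is sketched below, using only NumPy and the standard library. It assumes a power loss (squared error by default), a two-sided test with a normal approximation, and a long-run variance built from autocovariances up to lag `horizon - 1`; small-sample corrections are not included. The error series should be aligned in time, for example errors aggregated per date rather than pooled across unrelated series.

```python
import math
import numpy as np

def diebold_mariano_test(errors_a, errors_b, horizon=1, power=2):
    """Two-sided Diebold-Mariano test on two aligned forecast-error series."""
    errors_a = np.asarray(errors_a, dtype=float)
    errors_b = np.asarray(errors_b, dtype=float)

    # Loss differential: negative values mean model A has the smaller loss
    d = np.abs(errors_a) ** power - np.abs(errors_b) ** power
    n = len(d)
    d_centered = d - d.mean()

    # Long-run variance from autocovariances up to lag horizon - 1
    long_run_var = np.dot(d_centered, d_centered) / n
    for lag in range(1, horizon):
        long_run_var += 2.0 * np.dot(d_centered[lag:], d_centered[:-lag]) / n
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; the test is not defined")

    dm_statistic = d.mean() / math.sqrt(long_run_var / n)
    # Normal approximation: p = 2 * (1 - Phi(|DM|)) = erfc(|DM| / sqrt(2))
    p_value = math.erfc(abs(dm_statistic) / math.sqrt(2.0))
    return dm_statistic, p_value
```

A negative statistic with a small p-value suggests that Model A's loss is significantly lower than Model B's.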

---

## Experimental Design

### Model Training Pipeline

#### 1. Data Preparation
```python
# Pipeline steps
# 1. Load raw data files
# 2. Feature engineering (lags, rolling stats, embeddings)
# 3. Handle zero sales (transformation or separate modeling)
# 4. Split data chronologically
# 5. Scale/normalize features
# 6. Create sequences for LSTM models
```

#### 2. Hyperparameter Optimization
```python
# Grid search parameters
mlp_params = {
    'hidden_layers': [2, 3, 4],
    'hidden_units': [128, 256, 512],
    'dropout_rate': [0.2, 0.3, 0.5],
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [64, 128, 256]
}

lstm_params = {
    'lstm_units': [64, 128, 256],
    'num_layers': [1, 2, 3],
    'sequence_length': [14, 30, 60],
    'dropout_rate': [0.2, 0.3, 0.5],
    'learning_rate': [0.001, 0.01]
}

tabpfn_params = {
    'feature_window': [7, 14, 30],  # Number of days to create features from
    'aggregation_methods': ['mean', 'std', 'min', 'max', 'last'],
    'max_features': 100,  # TabPFN limitation
    'ensemble_size': [1, 4, 16],  # Number of ensemble members
    'preprocessing': ['standard', 'none']  # Feature scaling options
}
```
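
A full grid over these dictionaries grows quickly, which matters for the compute budget. The sketch below expands the MLP grid with `itertools.product` and draws a random subset; sampling a subset is a suggestion made here for illustration, not the project's stated search strategy, and `expand_grid` is a hypothetical helper.

```python
import itertools
import random

mlp_params = {
    'hidden_layers': [2, 3, 4],
    'hidden_units': [128, 256, 512],
    'dropout_rate': [0.2, 0.3, 0.5],
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [64, 128, 256],
}

def expand_grid(param_grid):
    """Return every combination in the grid as a list of config dicts."""
    keys = list(param_grid)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*param_grid.values())
    ]

all_configs = expand_grid(mlp_params)
print(len(all_configs))  # 243 = 3 ** 5

# Optionally train only a random subset of the grid
random.seed(0)
sampled_configs = random.sample(all_configs, k=20)
```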

#### 3. Training Configuration
```python
# Training settings
training_config = {
    'epochs': 100,
    'early_stopping': {
        'patience': 10,
        'monitor': 'val_loss',
        'restore_best_weights': True
    },
    'callbacks': [
        'EarlyStopping',
        'ReduceLROnPlateau',
        'ModelCheckpoint'
    ],
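    # In Keras, validation_split holds out the last 20% of rows (before shuffling),
    # so rows must be sorted by date for this to be a chronological hold-out
    'validation_split': 0.2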
}
```

### Implementation Checklist

#### Phase 1: Data Exploration ✅
- [x] Load and explore dataset
- [x] Analyze sales patterns and distributions
- [x] Identify data quality issues
- [x] Visualize temporal patterns
- [x] Analyze store and product performance

#### Phase 2: Data Preprocessing
- [ ] Implement feature engineering pipeline
- [ ] Handle zero sales appropriately
- [ ] Create categorical embeddings
- [ ] Implement time series data splitting
- [ ] Create data loaders for different model types

#### Phase 3: Baseline Models
- [ ] Implement linear regression baseline
- [ ] Implement Prophet model
- [ ] Implement moving averages
- [ ] Establish baseline performance metrics

#### Phase 4: Neural Network Implementation
- [ ] Implement MLP architecture
- [ ] Implement LSTM architecture
- [ ] Implement TabPFN approach (feature engineering + pre-trained model)
- [ ] Hyperparameter tuning
- [ ] Model training and validation

#### Phase 5: Model Comparison and Analysis
- [ ] Compare all models using established metrics
- [ ] Statistical significance testing
- [ ] Error analysis by segments
- [ ] Feature importance analysis
- [ ] Computational efficiency comparison

---

## Expected Outcomes

### Research Questions to Answer

1. **Model Performance Hierarchy**
   - Do neural networks outperform traditional methods?
   - Which architecture performs best for different product categories?
   - How does performance vary across stores and time periods?
   - Can TabPFN compete with specialized time series methods?

2. **Feature Importance**
   - Which features are most predictive of sales?
   - How important are external factors (oil prices, holidays)?
   - Do categorical embeddings improve performance?
   - What feature engineering works best for TabPFN?

3. **Practical Considerations**
   - What is the computational cost vs accuracy trade-off?
   - How stable are the models across different time periods?
   - Can the models handle seasonal variations effectively?
   - Is TabPFN's zero-shot approach viable for retail forecasting?

### Success Criteria

#### Quantitative Targets
- **RMSLE < 0.5**: Competitive performance on Kaggle leaderboard
- **MAE < 5 sales units**: Practical accuracy for inventory planning
- **MAPE < 15%**: Industry-standard forecasting accuracy
- **Neural Network Improvement**: >10% improvement over baselines

#### Qualitative Targets
- Clear understanding of when neural networks excel vs traditional methods
- Actionable insights for retail demand forecasting
- Robust model that generalizes across different product categories
- Interpretable results that can guide business decisions

### Deliverables

1. **Jupyter Notebooks**
   - Data exploration and analysis
   - Model implementation and training
   - Results comparison and visualization

2. **Source Code**
   - Modular, reusable implementations
   - Data preprocessing pipeline
   - Model training and evaluation scripts

3. **Technical Report**
   - Methodology and experimental design
   - Results analysis and interpretation
   - Recommendations for practical implementation

4. **Model Artifacts**
   - Trained model weights
   - Feature importance rankings
   - Performance metrics by segment

---

## Project Timeline

| Phase | Duration | Deliverables |
|-------|----------|-------------|
| **Data Exploration** | Week 1 | ✅ Complete analysis notebook |
| **Data Preprocessing** | Week 2 | Feature engineering pipeline |
| **Baseline Models** | Week 3 | Traditional forecasting benchmarks |
| **Neural Networks** | Week 4-5 | MLP, LSTM, and TabPFN implementations |
| **Evaluation & Analysis** | Week 6 | Model comparison and insights |
| **Documentation** | Week 7 | Final report and presentation |

---

*This project aims to provide both theoretical insights into neural network applications in time series forecasting and practical tools for retail demand prediction.*