# ARIMAX Modeling for Time Series Analysis

This notebook demonstrates how to implement and use ARIMAX (AutoRegressive Integrated Moving Average with eXogenous variables) for time series forecasting. ARIMAX extends the traditional ARIMA model by incorporating external variables that might influence the target variable.
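
As a reading aid, the model can be written out explicitly. statsmodels' `SARIMAX` treats the exogenous variables as a regression whose errors follow an ARIMA process. The symbols below are notation introduced here for illustration, not taken from the notebook: $y_t$ is the target, $x_t$ the vector of exogenous variables, $L$ the lag operator, and $\varepsilon_t$ white noise.

$$
y_t = \beta^\top x_t + u_t, \qquad \phi(L)\,(1 - L)^d\, u_t = \theta(L)\,\varepsilon_t
$$

$$
\phi(L) = 1 - \phi_1 L - \dots - \phi_p L^p, \qquad \theta(L) = 1 + \theta_1 L + \dots + \theta_q L^q
$$

In this notebook $y_t$ is `débit_insitu` and $x_t$ holds `P_cumul_7j` and `débit_mgb`.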

In [None]:
!pip install -q statsmodels seaborn scikit-learn

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
from pathlib import Path

# Set style for better visualizations
#plt.style.use('seaborn')
sns.set_palette('deep')

In [None]:
DATA_PATH = Path("../../data")

## Load data

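Load the data for the ARIMAX model. We need:
1. A target variable: `débit_insitu` (the flow measured in situ)
2. Two exogenous variables: `P_cumul_7j` (7-day cumulative rainfall) and `débit_mgb` (the flow predicted by the MGB model)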

In [None]:
# Read the semicolon-separated file and use the parsed 'time' column as the index
data = pd.read_csv(
    DATA_PATH / 'data_cumul.csv',
    sep=';',
    usecols=['time', 'P_cumul_7j', 'débit_insitu', 'débit_mgb'],
    index_col='time',
    parse_dates=['time'],
)
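
Before modeling, it can help to confirm that the time index is sorted and regular and that no values are missing, since gaps affect how the lags are interpreted. The following is a minimal sketch: `summarize_time_index` is a hypothetical helper, and the small frame is a synthetic stand-in for `data_cumul.csv` with made-up values.

```python
import numpy as np
import pandas as pd


def summarize_time_index(df: pd.DataFrame) -> dict:
    """Return basic integrity facts about a datetime-indexed frame."""
    idx = df.index
    steps = idx.to_series().diff().dropna()           # time gaps between rows
    common = steps.mode().iloc[0] if len(steps) else None
    return {
        "is_sorted": bool(idx.is_monotonic_increasing),
        "n_duplicate_timestamps": int(idx.duplicated().sum()),
        "n_missing_per_column": df.isna().sum().to_dict(),
        "most_common_step": common,
        "n_irregular_steps": int((steps != common).sum()) if len(steps) else 0,
    }


# Synthetic stand-in (values are invented): nine daily rows with one day removed
idx = pd.date_range("2020-01-01", periods=10, freq="D").delete(4)
demo = pd.DataFrame(
    {
        "P_cumul_7j": np.arange(9.0),
        "débit_insitu": np.arange(9.0),
        "débit_mgb": np.arange(9.0),
    },
    index=idx,
)
demo.iloc[2, 1] = np.nan                              # one missing target value

report = summarize_time_index(demo)
print(report)
```

On the real data the call would be `summarize_time_index(data)`. How to treat any gaps or missing values it reports (interpolation, dropping rows, reindexing) is a modeling decision this sketch does not make.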

## Data Visualization

Let's visualize our time series data to understand the patterns and relationships.

In [None]:
# Plot the three time series on a shared layout
fig, axes = plt.subplots(3, 1, figsize=(15, 12))
fig.suptitle('Time Series Components')

# Target: flow measured in situ
data['débit_insitu'].plot(ax=axes[0], title='Water flow over time')
axes[0].set_xlabel('')
axes[0].set_ylabel('débit_insitu')

# Exogenous variable 1: 7-day cumulative rainfall
data['P_cumul_7j'].plot(ax=axes[1], title='7-day cumulative rainfall over time')
axes[1].set_xlabel('')
axes[1].set_ylabel('P_cumul_7j')

# Exogenous variable 2: flow predicted by the MGB model
data['débit_mgb'].plot(ax=axes[2], title='MGB model prediction over time')
axes[2].set_xlabel('Date')
axes[2].set_ylabel('débit_mgb')

plt.tight_layout()
plt.show()

## Prepare Data for ARIMAX

We'll split our data into training and testing sets, and prepare the exogenous variables.

In [None]:
# Chronological 80-20 split: the first 80% of rows train, the last 20% test
train_size = int(len(data) * 0.8)
train_data = data.iloc[:train_size]
test_data = data.iloc[train_size:]

# Prepare exogenous variables
exog_train = train_data[['P_cumul_7j', 'débit_mgb']]
exog_test = test_data[['P_cumul_7j', 'débit_mgb']]

print(f"Training set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

## Build and Train ARIMAX Model

We'll use the `SARIMAX` class from statsmodels; with no seasonal order specified it reduces to an ARIMAX model. The order (p, d, q) is set to (1, 1, 1) for this example. In practice, you should compare candidate orders with an information criterion such as AIC or BIC, for instance through a grid search.

In [None]:
# Initialize and train ARIMAX model
model = SARIMAX(train_data['débit_insitu'],
                exog=exog_train,
                order=(1, 1, 1),
                enforce_stationarity=False,
                enforce_invertibility=False)

model_fit = model.fit(disp=False)
print(model_fit.summary())

## Make Predictions and Evaluate Model

In [None]:
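# Forecast the full test period out of sample (multi-step ahead),
# supplying the observed exogenous values for that period
predictions = model_fit.predict(start=len(train_data),
                              end=len(data)-1,
                              exog=exog_test)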

# Calculate error metrics
mse = mean_squared_error(test_data['débit_insitu'], predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(test_data['débit_insitu'], predictions)

print(f'Mean Squared Error: {mse:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
print(f'Mean Absolute Error: {mae:.2f}')

# Plot actual vs predicted values
plt.figure(figsize=(15, 6))
plt.plot(test_data.index, test_data['débit_insitu'], label='Actual')
plt.plot(test_data.index, predictions, label='Predicted')
plt.title('ARIMAX: Actual vs Predicted Flow')
plt.xlabel('Date')
plt.ylabel('Flow')
plt.legend()
plt.show()
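
The error metrics above are easier to interpret next to a naive reference. The sketch below is an illustration added here, not part of the original workflow: it computes RMSE and MAE by hand and compares a model forecast with a persistence baseline that repeats the last training value. All function names are hypothetical and the numbers are invented for the demonstration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


def persistence_forecast(last_train_value: float, horizon: int) -> np.ndarray:
    """Naive baseline: repeat the last observed training value."""
    return np.full(horizon, last_train_value, dtype=float)


def skill_score(y_true, y_model, y_baseline) -> float:
    """1 - RMSE_model / RMSE_baseline; positive means the model beats the baseline."""
    return 1.0 - rmse(y_true, y_model) / rmse(y_true, y_baseline)


# Invented numbers, for illustration only
y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_model = np.array([10.5, 11.5, 11.5, 12.5])
y_base = persistence_forecast(last_train_value=10.0, horizon=len(y_true))

model_rmse = rmse(y_true, y_model)
base_rmse = rmse(y_true, y_base)
skill = skill_score(y_true, y_model, y_base)
print(f"model RMSE={model_rmse:.3f}, baseline RMSE={base_rmse:.3f}, skill={skill:.3f}")
```

With the notebook's variables, the baseline would be `persistence_forecast(train_data['débit_insitu'].iloc[-1], len(test_data))`. Because `débit_mgb` is itself a flow prediction, comparing the ARIMAX errors with those of `débit_mgb` alone is another natural reference.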

## Model Diagnostics

Let's examine the model residuals to check if our model assumptions are met.

In [None]:
# Get model residuals
residuals = pd.DataFrame(model_fit.resid)

# Plot residuals
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Model Diagnostics')

# Residuals over time
residuals.plot(ax=axes[0,0], title='Residuals over Time')
axes[0,0].set_xlabel('Date')
axes[0,0].set_ylabel('Residual')

# Residuals histogram
residuals.hist(ax=axes[0,1], bins=30)
axes[0,1].set_title('Residuals Distribution')

# Q-Q plot
from scipy import stats
stats.probplot(residuals.iloc[:,0], dist="norm", plot=axes[1,0])
axes[1,0].set_title('Q-Q Plot')

# Autocorrelation plot
pd.plotting.autocorrelation_plot(residuals.iloc[:,0], ax=axes[1,1])
axes[1,1].set_title('Residuals Autocorrelation')

plt.tight_layout()
plt.show()

## Conclusions

The ARIMAX model we built demonstrates how to:
1. Incorporate exogenous variables (7-day cumulative rainfall and the MGB model's flow prediction) into time series forecasting
2. Make predictions on test data
3. Evaluate model performance using various metrics
4. Perform model diagnostics

To improve the model, you could:
1. Tune the ARIMAX parameters (p,d,q) using grid search or AIC/BIC criteria
2. Add seasonal components (SARIMAX)
3. Include more relevant exogenous variables
4. Handle any seasonality or trends in the data preprocessing step