# Notebook 2: Time Series Analysis

Welcome to the second notebook in our advanced machine learning series under **Part_3_Advanced_Topics**. In this notebook, we will explore **Time Series Analysis**, a set of techniques for analyzing and forecasting data collected over time, which is crucial in fields like finance, weather prediction, and sales forecasting.

We'll cover the following topics:
- What is Time Series Analysis?
- Key concepts: Trend, Seasonality, and Noise
- How Time Series models work
- Implementation using pandas and statsmodels
- Advantages and limitations

## What is Time Series Analysis?

Time Series Analysis involves studying data points collected or recorded at specific time intervals to identify patterns, trends, and seasonal variations. The goal is often to forecast future values based on historical data.

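Unlike the observations in a standard regression problem, which are usually assumed to be independent of one another, time series observations are ordered and temporally dependent: past values influence future ones. This dependency calls for specialized models and techniques.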
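
To make the idea of temporal dependency concrete, here is a minimal sketch that compares a series with a shuffled copy of itself. The random walk and the variable names are illustrative assumptions, not part of the dataset used later in this notebook. Shuffling keeps the same values but destroys their order, so the correlation between each value and its predecessor should collapse toward zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical series: a random walk, where each value is the previous value plus noise
steps = rng.normal(0, 1, 500)
ordered = pd.Series(np.cumsum(steps))

# Same values in random order: the temporal structure is destroyed
shuffled = pd.Series(rng.permutation(ordered.to_numpy()))

# Lag-1 autocorrelation: correlation between each value and its predecessor
ordered_lag1 = ordered.autocorr(lag=1)
shuffled_lag1 = shuffled.autocorr(lag=1)

print(f'Lag-1 autocorrelation (ordered):  {ordered_lag1:.3f}')
print(f'Lag-1 autocorrelation (shuffled): {shuffled_lag1:.3f}')
```

This is also why time series data is usually split chronologically for evaluation rather than shuffled at random.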

## Key Concepts

- **Trend:** The long-term movement or direction in the data (e.g., increasing, decreasing, or constant).
- **Seasonality:** Regular, periodic fluctuations in the data due to seasonal effects (e.g., higher sales during holidays every year).
- **Noise:** Random variations or irregularities in the data that cannot be attributed to trend or seasonality.
- **Stationarity:** A property of a time series whose statistical properties, such as mean, variance, and autocorrelation structure, do not change over time. Many models assume or require stationarity.
- **Autocorrelation:** The correlation of a time series with its own past values, used to identify repeating patterns or dependencies.
- **ARIMA Model:** AutoRegressive Integrated Moving Average, a popular model that combines autoregression (AR, order p), differencing (I for "integrated", order d), and moving average (MA, order q) components, written ARIMA(p, d, q).
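
The link between trend, stationarity, and the differencing step in ARIMA can be illustrated with a small sketch. The series below is a hypothetical linear trend plus noise, and `half_means` is a helper introduced only for this illustration. Comparing the mean of the first and second half is a rough visual check, not a formal stationarity test.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical series: linear trend plus noise, so the mean drifts upward over time
t = np.arange(200)
series = 0.5 * t + rng.normal(0, 2, len(t))

# First difference: y[t] - y[t-1], which removes a linear trend
diffed = np.diff(series)

def half_means(x):
    """Return the mean of the first and second half of x."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

raw_first, raw_second = half_means(series)
diff_first, diff_second = half_means(diffed)

print(f'Original:    first-half mean {raw_first:.2f}, second-half mean {raw_second:.2f}')
print(f'Differenced: first-half mean {diff_first:.2f}, second-half mean {diff_second:.2f}')
```

The original series has a clearly shifting mean, while the differenced series has a roughly constant mean close to the slope of the trend.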

## How Time Series Models Work

Time series analysis typically involves the following steps:

1. **Data Collection and Visualization:** Gather time series data and plot it to visually inspect for trends, seasonality, and anomalies.
2. **Decomposition:** Break down the series into trend, seasonality, and residual (noise) components.
3. **Stationarity Testing and Transformation:** Check if the data is stationary using tests like the Augmented Dickey-Fuller (ADF) test. If not, apply transformations like differencing or detrending.
4. **Model Selection and Fitting:** Choose a model (e.g., ARIMA) based on autocorrelation and partial autocorrelation plots, and fit it to the data.
5. **Forecasting:** Use the fitted model to predict future values.
6. **Evaluation:** Assess the model's performance using metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).
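
Step 4 relies on autocorrelation plots, so it may help to see the numbers behind such a plot. The following is a minimal sketch of a sample autocorrelation function written with NumPy only. `sample_acf` is a hypothetical helper, and the series is an assumed monthly signal with a 12-step cycle rather than real data.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

rng = np.random.default_rng(2)

# Hypothetical monthly series: linear trend + 12-month cycle + noise
t = np.arange(240)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, len(t))

# Difference once to remove the trend before inspecting autocorrelations
acf_values = sample_acf(np.diff(series), max_lag=24)

for lag in (1, 6, 12, 24):
    print(f'lag {lag:2d}: {acf_values[lag]:+.2f}')
```

A positive peak at lag 12 and a negative value at lag 6 point to a 12-step seasonal pattern. In practice, `plot_acf` and `plot_pacf` from `statsmodels.graphics.tsaplots` draw these values together with confidence bands.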

## Implementation Using pandas and statsmodels

Let's implement a basic time series analysis using Python. We'll use pandas for data manipulation and statsmodels for modeling and forecasting with an ARIMA model on a synthetic dataset.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Generate a synthetic time series dataset with trend and seasonality
np.random.seed(42)
t = np.arange(0, 100, 1)
trend = 0.5 * t
seasonality = 10 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 2, len(t))
data = trend + seasonality + noise

# Create a DataFrame with a monthly time index
# ('MS' = month start; the 'M' alias is deprecated in recent pandas releases)
dates = pd.date_range(start='2020-01-01', periods=len(t), freq='MS')
ts_df = pd.DataFrame(data, index=dates, columns=['value'])

# Visualize the time series
plt.figure(figsize=(10, 6))
plt.plot(ts_df, label='Time Series Data')
plt.title('Synthetic Time Series with Trend and Seasonality')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Decompose the time series
decomposition = seasonal_decompose(ts_df['value'], model='additive', period=12)
fig = decomposition.plot()
fig.set_size_inches(10, 8)
plt.show()

# Test for stationarity using the Augmented Dickey-Fuller test
# (null hypothesis: the series has a unit root, i.e. it is non-stationary)
result = adfuller(ts_df['value'])
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
print('Critical Values:', result[4])
if result[1] < 0.05:
    print('Unit-root null rejected at the 5% level: the series appears stationary')
else:
    print('Unit-root null not rejected: the series appears non-stationary')

# Since the series may not be stationary due to trend, let's difference it
ts_diff = ts_df['value'].diff().dropna()
result_diff = adfuller(ts_diff)
print(f'ADF Statistic after differencing: {result_diff[0]}')
print(f'p-value after differencing: {result_diff[1]}')
if result_diff[1] < 0.05:
    print('The differenced series is stationary')
else:
    print('The differenced series is not stationary')

# Split the data chronologically (no shuffling): first 80% for training, last 20% for testing
train_size = int(len(ts_df) * 0.8)
train, test = ts_df['value'].iloc[:train_size], ts_df['value'].iloc[train_size:]

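# Fit an ARIMA model. Orders are normally chosen from ACF/PACF plots or information criteria;
# ARIMA(1,1,1) is used here for simplicity. d=1 differences away the trend, but this
# non-seasonal model has no term for the 12-month seasonality, so the forecast will not
# reproduce the seasonal swings.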
model = ARIMA(train, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())

# Forecast future values
forecast = model_fit.forecast(steps=len(test))
forecast_index = test.index

# Plot the forecast against actual values
plt.figure(figsize=(10, 6))
plt.plot(train.index, train, label='Training Data')
plt.plot(test.index, test, label='Actual Test Data')
plt.plot(forecast_index, forecast, label='Forecast', color='red')
plt.title('ARIMA Forecast vs Actual Values')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Evaluate the forecast
mae = mean_absolute_error(test, forecast)
rmse = np.sqrt(mean_squared_error(test, forecast))
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
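
The MAE and RMSE printed above are easier to judge next to a simple baseline. The sketch below computes a seasonal-naive forecast, which repeats the last observed 12-month cycle, on the same synthetic series. The baseline itself and the names `season_length`, `last_cycle`, and `baseline` are additions for illustration and are not part of the ARIMA workflow above.

```python
import numpy as np

# Rebuild the same synthetic series as above so this block runs on its own
np.random.seed(42)
t = np.arange(0, 100, 1)
values = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 2, len(t))

train_size = int(len(values) * 0.8)
train_values, test_values = values[:train_size], values[train_size:]

# Seasonal-naive forecast: repeat the last observed 12-month cycle
season_length = 12
last_cycle = train_values[-season_length:]
baseline = np.array([last_cycle[h % season_length] for h in range(len(test_values))])

baseline_mae = np.mean(np.abs(test_values - baseline))
baseline_rmse = np.sqrt(np.mean((test_values - baseline) ** 2))
print(f'Seasonal-naive MAE: {baseline_mae:.2f}')
print(f'Seasonal-naive RMSE: {baseline_rmse:.2f}')
```

If a fitted model does not beat this baseline, that is a hint that it is missing structure in the data, such as the seasonal component.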

## Advantages and Limitations

**Advantages:**
- Captures temporal dependencies and patterns like trends and seasonality, which are critical for forecasting.
- Provides interpretable components (trend, seasonality) through decomposition.
- Models like ARIMA are well-established and can be effective for linear time series.

**Limitations:**
- Assumes certain properties like stationarity, which may require preprocessing (differencing, detrending).
- Traditional models like ARIMA struggle with non-linear patterns; advanced methods (e.g., neural networks) may be needed.
- Requires careful selection of model parameters (e.g., ARIMA orders), often through trial and error or diagnostic plots.
- Sensitive to missing data or outliers, which can distort analysis and forecasts.
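
As a small illustration of the missing-data point, the sketch below fills gaps in a hypothetical daily series with pandas' time-based interpolation. This is one common option among several, and whether interpolation is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two missing observations
index = pd.date_range(start='2020-01-01', periods=5, freq='D')
series = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=index)

# Time-weighted linear interpolation between the neighbouring observations
filled = series.interpolate(method='time')
print(filled)
```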

## Conclusion

Time Series Analysis is a powerful approach for understanding and predicting data that evolves over time. By decomposing data into trend, seasonality, and noise, and applying models like ARIMA, you can generate meaningful forecasts for various applications. While traditional methods have limitations with complex or non-linear data, they form a solid foundation before exploring advanced techniques like LSTM or Prophet.

In the next notebook, we will explore another advanced topic to further enhance our machine learning toolkit.