# Assignment Week 7 - Time Series Analysis and Neural Network Forecasting

In this assignment, you'll work through a complete time series workflow, from data preprocessing to building neural network forecasting models. You'll use the **Air Quality Dataset** from the UCI Machine Learning Repository, which contains hourly measurements of air pollutants and weather conditions.

## Learning Objectives:
- Analyze and preprocess time series data
- Engineer relevant features for time series forecasting
- Build and compare neural network architectures for forecasting
- Evaluate model performance and understand hyperparameter impact

**Dataset:** Air Quality Dataset (More than 9,000 hourly observations from March 2004 to February 2005)
- More manageable size than the Jena Climate dataset
- Real-world complexity with missing values and multiple variables
- Suitable for learning neural network concepts without excessive training time

## Task 1: Data Download and Initial Analysis

Download and explore the Air Quality dataset to understand its structure and characteristics.

In [None]:
# Download the Air Quality dataset
!wget https://archive.ics.uci.edu/static/public/360/air+quality.zip
!unzip -o 'air+quality.zip'

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('AirQualityUCI.csv', sep=';', decimal=',')

# Display basic information
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# 1. Analyze the dataset structure and identify data quality issues
print("Dataset Info:")
df.info()
print("\nDataset Description:")
df.describe()

In [None]:
# 2. Examine missing values and data quality
print("Missing values per column:")
print(df.isnull().sum())

print("\nUnique values in key columns:")
for col in df.columns[:5]:
    print(f"{col}: {df[col].nunique()} unique values")

### YOUR CODE: Identify and analyze any anomalies or data quality issues ###
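
One quality issue worth checking is the `-200` sentinel that this dataset uses for missing readings (it is replaced with NaN in Task 2). The sketch below uses a small made-up frame, not the real data, to show why `isnull()` alone can under-report missing values; the `toy` frame and its numbers are purely illustrative.

```python
import pandas as pd

# Toy frame standing in for the raw data; -200 marks a missing reading.
toy = pd.DataFrame({
    "CO(GT)": [2.6, -200.0, 2.2, 1.6],
    "T": [13.6, 13.3, -200.0, -200.0],
})

# isnull() sees no missing values because the sentinel is a valid number.
nan_counts = toy.isnull().sum()

# Counting the sentinel directly reveals the gaps per column.
sentinel_counts = (toy == -200).sum()

# Fraction of each column affected, useful when deciding which columns to keep.
sentinel_share = (toy == -200).mean()

print(nan_counts)
print(sentinel_counts)
print(sentinel_share)
```

Running the same two counts on `df` shows which columns are mostly sentinel values and may not be worth keeping as features.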

## Task 2: Data Cleaning and Preprocessing

Clean the dataset and prepare it for time series analysis.

In [None]:
# 1. Create proper datetime index
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H.%M.%S')
df = df.set_index('DateTime')

# 2. Remove unnecessary columns and empty rows
# Drop the original Date/Time columns, which are now encoded in the index
df_clean = df.drop(['Date', 'Time'], axis=1)
df_clean = df_clean.dropna(how='all', axis=1)  # Remove columns that are entirely NaN
df_clean = df_clean.dropna(how='all')  # Remove rows that are entirely NaN (e.g. blank trailing lines)

# 3. Handle missing values (marked as -200.0 in this dataset)
df_clean = df_clean.replace(-200.0, np.nan)

print("Cleaned dataset shape:", df_clean.shape)
print("Missing values after cleaning:")
print(df_clean.isnull().sum())

### YOUR CODE: Apply interpolation to handle remaining missing values ###
# Hint: Use appropriate interpolation method for time series data
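
As a rough illustration of the hint above, the sketch below applies time-weighted linear interpolation to a tiny made-up hourly series. The values and the `limit=2` cap are assumptions chosen for the example, not recommended settings for the assignment.

```python
import numpy as np
import pandas as pd

# Hourly toy index (hypothetical timestamps).
start = pd.Timestamp("2004-03-10 18:00")
idx = start + pd.to_timedelta(np.arange(6), unit="h")

s = pd.Series([2.0, np.nan, np.nan, 5.0, np.nan, 6.0], index=idx)

# method="time" weights by the actual time gap between observations;
# limit caps how many consecutive NaNs are filled in each gap.
filled = s.interpolate(method="time", limit=2)
print(filled)

# A gap longer than the limit is only partly filled, so long outages stay visible.
s_long = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0], index=idx)
filled_long = s_long.interpolate(method="time", limit=2)
print(filled_long)
```

Capping the fill length is one way to avoid inventing long stretches of data; whether that is appropriate depends on how long the gaps in each column are.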

In [None]:
# 4. Select target variable and features
# We'll focus on predicting CO(GT) - Carbon Monoxide concentration
target_var = 'CO(GT)'
feature_cols = ['PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'T', 'RH', 'AH']

# Create final dataset
df_final = df_clean[[target_var] + feature_cols].copy()
df_final = df_final.dropna()  # Remove any remaining NaN values

print(f"Final dataset shape: {df_final.shape}")
print(f"Date range: {df_final.index.min()} to {df_final.index.max()}")
df_final.head()

## Task 3: Exploratory Data Analysis

Analyze the time series characteristics of your target variable and key features.

In [None]:
# 1. Plot the target variable over time
plt.figure(figsize=(15, 8))

plt.subplot(2, 2, 1)
plt.plot(df_final.index, df_final[target_var])
plt.title(f'{target_var} Over Time')
plt.xlabel('Date')
plt.ylabel('CO Concentration (mg/m^3)')

# 2. Plot seasonal patterns
plt.subplot(2, 2, 2)
df_final.groupby(df_final.index.month)[target_var].mean().plot(kind='bar')
plt.title('Monthly Average CO Levels')
plt.xlabel('Month')

plt.subplot(2, 2, 3)
df_final.groupby(df_final.index.hour)[target_var].mean().plot(kind='bar')
plt.title('Average CO Levels by Hour of Day')
plt.xlabel('Hour of Day')

plt.subplot(2, 2, 4)
df_final.groupby(df_final.index.dayofweek)[target_var].mean().plot(kind='bar')
plt.title('Average CO Levels by Day of Week')
plt.xlabel('Day of Week (0=Monday)')

plt.tight_layout()
plt.show()

### YOUR CODE: Analyze and describe the patterns you observe ###
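
One way to put numbers on the patterns in these plots is the autocorrelation at a few lags. The sketch below uses a synthetic series with a 24-hour cycle (not the real CO data) to show what a daily pattern looks like in those terms; the noise level and the chosen lags are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic two-week hourly series with a clean daily cycle plus a little noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
synthetic = pd.Series(np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.1, hours.size))

# Pearson autocorrelation at 1 hour, half a day, and one full day.
acf = {lag: synthetic.autocorr(lag=lag) for lag in (1, 12, 24)}
for lag, value in acf.items():
    print(f"lag {lag:>2}: {value:+.3f}")
```

A strong positive value at lag 24 together with a negative value at lag 12 is the signature of a daily cycle. The same call on `df_final[target_var]` indicates how much of that structure the real series has.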

## Task 4: Feature Engineering

Create temporal features and apply cyclical encoding where appropriate.

In [None]:
# 1. Create basic temporal features
df_features = df_final.copy()

df_features['Hour'] = df_features.index.hour
df_features['DayOfWeek'] = df_features.index.dayofweek
df_features['Month'] = df_features.index.month
df_features['DayOfYear'] = df_features.index.dayofyear
df_features['Weekend'] = (df_features.index.dayofweek >= 5).astype(int)

### YOUR CODE: Implement cyclical encoding for temporal features ###
# Hint: Use sine/cosine transformations for cyclical features like hour, day of week, etc.
# Example: df_features['Hour_sin'] = np.sin(2 * np.pi * df_features['Hour'] / 24)

print("Features created:")
print(df_features.columns.tolist())
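
The hint above can be sketched as a small helper. `add_cyclical` is a hypothetical name introduced here for illustration, and the toy frame stands in for `df_features`.

```python
import numpy as np
import pandas as pd

def add_cyclical(frame, column, period):
    """Return a copy of frame with sine/cosine columns for a periodic feature."""
    out = frame.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out

toy = pd.DataFrame({"Hour": [0, 6, 12, 23]})
encoded = add_cyclical(toy, "Hour", 24)
print(encoded)

# As raw integers, 23:00 and 00:00 are 23 units apart; in the encoded space
# they are neighbours on the unit circle.
p23 = encoded.loc[3, ["Hour_sin", "Hour_cos"]].to_numpy(dtype=float)
p00 = encoded.loc[0, ["Hour_sin", "Hour_cos"]].to_numpy(dtype=float)
dist_encoded = float(np.linalg.norm(p23 - p00))
print(f"Encoded distance between 23:00 and 00:00: {dist_encoded:.3f}")
```

The period should match the range of the feature: 24 for `Hour`, 7 for `DayOfWeek`, 12 for `Month`. Using both sine and cosine matters, because either one alone maps two different hours to the same value.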

## Task 5: Data Preparation for Neural Networks

Prepare the dataset for training neural network models.

In [None]:
from sklearn.preprocessing import StandardScaler

# 1. Normalize the features
# Fit the scaler on the first 70% of rows only (roughly the training portion of the
# temporal split used later) so validation/test statistics do not leak into training.
scaler = StandardScaler()
feature_columns = [col for col in df_features.columns if col != target_var]
n_fit = int(0.7 * len(df_features))
scaler.fit(df_features[feature_columns].iloc[:n_fit])
df_scaled = df_features.copy()
df_scaled[feature_columns] = scaler.transform(df_features[feature_columns])

# 2. Build input/target sequences
def create_sequences(data, target_col, sequence_length, forecast_horizon=1):
    """
    Create sequences for time series forecasting

    Args:
        data: DataFrame with time series data
        target_col: Name of target column
        sequence_length: Length of input sequences
        forecast_horizon: How many steps ahead to predict

    Returns:
        X: Input sequences
        y: Target values
    """
    X, y = [], []

    for i in range(len(data) - sequence_length - forecast_horizon + 1):
        # Input sequence (all columns, including past values of the target)
        X.append(data.iloc[i:(i + sequence_length)].values)
        # Target value (target column only, forecast_horizon steps after the window)
        y.append(data.iloc[i + sequence_length + forecast_horizon - 1][target_col])

    return np.array(X), np.array(y)

# Window configuration
SEQUENCE_LENGTH = 24  # Use 24 hours of data to predict the next hour
FORECAST_HORIZON = 1  # Predict 1 hour ahead

# Create sequences from the scaled data
X, y = create_sequences(df_scaled, target_var, SEQUENCE_LENGTH, FORECAST_HORIZON)

print(f"Sequences shape: X={X.shape}, y={y.shape}")
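
To make the windowing logic concrete, here is a sketch on a tiny array with the same indexing as `create_sequences`. It uses NumPy's `sliding_window_view` (NumPy 1.20 or later) instead of a Python loop. `make_windows` is a hypothetical helper, and putting the target in column 0 is an assumption of the toy data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values, target_index, sequence_length, forecast_horizon=1):
    """Vectorized windowing over a 2-D array of shape (n_steps, n_columns)."""
    n_samples = len(values) - sequence_length - forecast_horizon + 1
    # sliding_window_view puts the window axis last: (n_windows, n_columns, sequence_length)
    windows = sliding_window_view(values, sequence_length, axis=0)
    X = windows[:n_samples].transpose(0, 2, 1)  # -> (n_samples, sequence_length, n_columns)
    y = values[sequence_length + forecast_horizon - 1:, target_index]
    return np.ascontiguousarray(X), y

# 10 time steps, 2 columns; column 0 plays the role of the target.
values = np.arange(20, dtype=float).reshape(10, 2)
X_toy, y_toy = make_windows(values, target_index=0, sequence_length=3, forecast_horizon=1)

print(X_toy.shape, y_toy.shape)  # (7, 3, 2) (7,)
print(X_toy[0])                  # rows 0..2 of the toy array
print(y_toy[0])                  # target value at row 3
```

With 10 rows, a window of 3 and a horizon of 1, there are 10 - 3 - 1 + 1 = 7 samples. The same count formula explains the shape printed for the real `X` and `y`.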

In [None]:
# 3. Split data into train/validation/test sets (70% / 20% / 10%)
# Use a temporal split (no shuffling) so the model is always evaluated on later data

train_size = int(0.7 * len(X))
val_size = int(0.2 * len(X))

X_train, X_val, X_test = X[:train_size], X[train_size:train_size+val_size], X[train_size+val_size:]
y_train, y_val, y_test = y[:train_size], y[train_size:train_size+val_size], y[train_size+val_size:]

print(f"Training set: X={X_train.shape}, y={y_train.shape}")
print(f"Validation set: X={X_val.shape}, y={y_val.shape}")
print(f"Test set: X={X_test.shape}, y={y_test.shape}")

## Task 6: Neural Network Model Implementation

Implement and train two different neural network architectures with various hyperparameter combinations.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, GRU, Dropout, Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
import time

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

def build_lstm_model(input_shape, lstm_units=50, dropout_rate=0.2, learning_rate=0.001):
    """
    Build LSTM model for time series forecasting
    """
    model = Sequential([
        LSTM(lstm_units, return_sequences=True, input_shape=input_shape),
        Dropout(dropout_rate),
        LSTM(lstm_units//2, return_sequences=False),
        Dropout(dropout_rate),
        Dense(25, activation='relu'),
        Dense(1, activation='linear')
    ])

    model.compile(
        optimizer=Adam(learning_rate=learning_rate),
        loss='mse',
        metrics=['mae']
    )

    return model

def build_cnn_model(input_shape, filters=64, kernel_size=3, learning_rate=0.001):
    """
    Build CNN model for time series forecasting
    """
    ### YOUR CODE: Implement CNN architecture ###
    # Hint: Use Conv1D layers followed by MaxPooling1D, then Dense layers
    model = Sequential([
        # Add your CNN layers here
    ])

    model.compile(
        optimizer=Adam(learning_rate=learning_rate),
        loss='mse',
        metrics=['mae']
    )

    return model

# Define hyperparameter combinations to test
lstm_configs = [
    {'lstm_units': 32, 'dropout_rate': 0.1, 'learning_rate': 0.001},
    {'lstm_units': 64, 'dropout_rate': 0.2, 'learning_rate': 0.001},
    {'lstm_units': 50, 'dropout_rate': 0.3, 'learning_rate': 0.0005}
]

cnn_configs = [
    {'filters': 32, 'kernel_size': 3, 'learning_rate': 0.001},
    {'filters': 64, 'kernel_size': 5, 'learning_rate': 0.001},
    {'filters': 64, 'kernel_size': 3, 'learning_rate': 0.0005}
]

print("Model architectures and hyperparameters defined.")
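
When choosing `kernel_size` and the number of pooling layers for the CNN, it helps to check how quickly the 24-step window shrinks. The sketch below is plain arithmetic, not Keras code. It assumes a hypothetical stack of Conv1D, MaxPooling1D, Conv1D, MaxPooling1D with `'valid'` padding, stride 1 convolutions, and `pool_size=2`, which is only one possible design.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_output_length(length, pool_size=2):
    """Output length of 1-D max pooling with stride equal to pool_size."""
    return length // pool_size

def trace_lengths(sequence_length, kernel_size):
    """Sequence length after each layer of the assumed conv/pool/conv/pool stack."""
    lengths = [sequence_length]
    for _ in range(2):
        lengths.append(conv1d_output_length(lengths[-1], kernel_size))
        lengths.append(pool1d_output_length(lengths[-1]))
    return lengths

for k in (3, 5):
    print(f"kernel_size={k}: {trace_lengths(24, k)}")
```

If the length reaches zero or goes negative, the corresponding Keras model will fail to build. In that case use `padding='same'`, fewer pooling layers, or a smaller kernel.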

In [None]:

# Train LSTM models with different hyperparameters
lstm_results = []
input_shape = (X_train.shape[1], X_train.shape[2])  # (sequence_length, n_features)

for i, config in enumerate(lstm_configs):
    print(f"\nTraining LSTM Model {i+1} with config: {config}")

    # Build model
    model = build_lstm_model(input_shape, **config)

    # Callbacks
    callbacks = [
        EarlyStopping(patience=10, restore_best_weights=True),
        ReduceLROnPlateau(patience=5, factor=0.5)
    ]

    # Train model
    start_time = time.time()
    history = model.fit(
        X_train, y_train,
        epochs=50,
        batch_size=32,
        validation_data=(X_val, y_val),
        callbacks=callbacks,
        verbose=1
    )

    training_time = time.time() - start_time

    # Evaluate model
    train_loss = model.evaluate(X_train, y_train, verbose=0)
    val_loss = model.evaluate(X_val, y_val, verbose=0)
    test_loss = model.evaluate(X_test, y_test, verbose=0)

    # Store results
    result = {
        'model_type': 'LSTM',
        'config': config,
        'train_loss': train_loss[0],
        'val_loss': val_loss[0],
        'test_loss': test_loss[0],
        'train_mae': train_loss[1],
        'val_mae': val_loss[1],
        'test_mae': test_loss[1],
        'training_time': training_time,
        'history': history.history
    }

    lstm_results.append(result)

    print(f"Training completed in {training_time:.2f} seconds")
    print(f"Test MSE: {test_loss[0]:.4f}, Test MAE: {test_loss[1]:.4f}")

print("\nAll LSTM models trained successfully!")
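
The `EarlyStopping(patience=10, restore_best_weights=True)` callback used above stops training once the validation loss has not improved for `patience` consecutive epochs, then restores the weights from the best epoch. The sketch below is a simplified pure-Python illustration of that rule on a made-up loss curve. It is not the Keras implementation and ignores options such as `min_delta`.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember this epoch and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through all epochs without triggering the stop.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3)
print(f"Stopped after epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

When reading the training curves, remember that the reported losses of a model trained this way come from the restored best epoch, not from the last epoch that ran.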

In [None]:
# Train CNN models with different hyperparameters
cnn_results = []

### YOUR CODE: Implement CNN model training similar to LSTM training above ###
# Follow the same pattern as LSTM training but use CNN models

print("CNN model training completed!")

## Task 7: Model Comparison and Analysis

Compare the performance of different models and analyze the impact of hyperparameters.

In [None]:
# Compare model performance
all_results = lstm_results + cnn_results

# Create comparison DataFrame
comparison_data = []
for result in all_results:
    comparison_data.append({
        'Model': result['model_type'],
        'Config': str(result['config']),
        'Test MSE': result['test_loss'],
        'Test MAE': result['test_mae'],
        'Training Time (s)': result['training_time']
    })

comparison_df = pd.DataFrame(comparison_data)
print("Model Performance Comparison:")
print(comparison_df.to_string(index=False))

# Plot performance comparison
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.bar(range(len(all_results)), [r['test_loss'] for r in all_results])
plt.title('Test MSE Comparison')
plt.xlabel('Model Index')
plt.ylabel('MSE')

plt.subplot(2, 2, 2)
plt.bar(range(len(all_results)), [r['test_mae'] for r in all_results])
plt.title('Test MAE Comparison')
plt.xlabel('Model Index')
plt.ylabel('MAE')

plt.subplot(2, 2, 3)
plt.bar(range(len(all_results)), [r['training_time'] for r in all_results])
plt.title('Training Time Comparison')
plt.xlabel('Model Index')
plt.ylabel('Time (seconds)')

plt.subplot(2, 2, 4)
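# Plot training history for the best model (selected on validation loss, not test loss,
# so the test set stays an unbiased estimate of performance)
best_model = min(all_results, key=lambda x: x['val_loss'])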
plt.plot(best_model['history']['loss'], label='Training Loss')
plt.plot(best_model['history']['val_loss'], label='Validation Loss')
plt.title(f'Best Model Training History ({best_model["model_type"]})')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

### YOUR CODE: Analyze and discuss the results ###
# Write your analysis of:
# 1. Which model performed best and why?
# 2. How did hyperparameters affect performance?
# 3. What patterns do you observe in the training curves?
# 4. What recommendations would you make for further improvements?
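
When judging the models, it can help to compare them against a naive persistence baseline that predicts the next hour as the last observed value. The sketch below assumes the target sits in column 0 of each input window and is unscaled, which matches how `df_final` and `df_scaled` are built in this notebook. `persistence_baseline` is a hypothetical helper, and the toy data is made up.

```python
import numpy as np

def persistence_baseline(X, y, target_index=0):
    """MSE and MAE of predicting y with the last target value in each window."""
    y_pred = X[:, -1, target_index]
    mse = float(np.mean((y - y_pred) ** 2))
    mae = float(np.mean(np.abs(y - y_pred)))
    return mse, mae

# Toy data: a ramp that increases by 1 each step, windows of length 3.
series = np.arange(10, dtype=float)
X_toy = np.stack([series[i:i + 3] for i in range(7)])[..., None]  # (7, 3, 1)
y_toy = series[3:]                                                # (7,)

baseline_mse, baseline_mae = persistence_baseline(X_toy, y_toy)
print(f"Persistence baseline: MSE={baseline_mse:.2f}, MAE={baseline_mae:.2f}")
```

Calling `persistence_baseline(X_test, y_test)` gives a reference point for the comparison table. A network that does not clearly beat it has mostly learned to copy the previous hour.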

## Task 8: Reflection and Next Steps

**Write your analysis and conclusions here:**

### Model Performance Analysis
[Analyze which model performed best and why]

### Hyperparameter Impact
[Discuss how different hyperparameters affected model performance]

### Key Insights
[What did you learn about time series forecasting with neural networks?]

### Recommendations for Improvement
[What would you do differently or what additional techniques would you try?]

### Optional Extensions (if time permits)
- Implement multivariate forecasting using all available features
- Try other architectures (GRU, Transformer, etc.)
- Implement ensemble methods
- Add seasonal decomposition
- Experiment with different sequence lengths and forecast horizons