# Exploratory Data Analysis with RNN and LSTM (time-series)

## 📊 Electricity Consumption Forecasting with Deep Learning

This notebook performs **Exploratory Data Analysis (EDA)** on electricity consumption data and prepares it for forecasting with **Recurrent Neural Networks (RNN)** and **Long Short-Term Memory (LSTM)** networks in **PyTorch**.

### **Overview**
- **Dataset**: Italian electricity consumption (2024) + Temperature data from Lazio region
- **Goal**: Forecast daily electricity demand using deep learning time-series models
- **Approach**: Multi-feature RNN/LSTM with weather integration
- **Region**: Center-North Italy, focusing on Lazio (Roma area)

### **Analysis**
1. **Data Quality Assessment** - Missing values, outliers, data consistency
2. **Temperature-Load Correlation** - Understanding the U-shaped relationship
3. **Seasonal Pattern Analysis** - Winter heating vs summer cooling effects
4. **Feature Engineering** - Creating temporal and seasonal features for deep learning
5. **Data Preprocessing** - Normalization and sequence preparation for RNN/LSTM

### 🧠 **Deep Learning pipeline**
```
Raw Data → EDA → Feature Engineering → Sequence Creation → RNN/LSTM Training → Forecasting
```

---

## 1. Setup Python Virtual Environment

Setting up an isolated Python environment for our electricity forecasting project to ensure consistent dependencies and avoid conflicts.

In [None]:
# Virtual Environment Setup Commands
# These commands should be run in the terminal before starting Jupyter

import os
import sys

print("🔧 Virtual Environment Setup:")
print("=" * 50)
print("1. Create virtual environment:")
print("   python3 -m venv electricity_forecast_env")
print()
print("2. Activate virtual environment:")
print("   Linux/Mac: source electricity_forecast_env/bin/activate")
print("   Windows:   electricity_forecast_env\\Scripts\\activate")
print()
print("3. Install dependencies:")
print("   pip install -r requirements.txt")
print()
print("4. Start Jupyter notebook:")
print("   jupyter notebook")
print("=" * 50)

# Check current Python environment
print(f"📍 Current Python executable: {sys.executable}")
print(f"📁 Current working directory: {os.getcwd()}")
print(f"🔢 Python version: {sys.version}")

# Verify we're in a virtual environment: inside a venv, sys.prefix points at the
# environment while sys.base_prefix points at the base interpreter
if sys.prefix != sys.base_prefix:
    print("✅ Virtual environment is ACTIVE")
else:
    print("⚠️ Virtual environment might not be active")
    print("   Please ensure you activated the virtual environment before starting Jupyter")

🔧 Virtual Environment Setup:
1. Create virtual environment:
   python3 -m venv electricity_forecast_env

2. Activate virtual environment:
   Linux/Mac: source electricity_forecast_env/bin/activate
   Windows:   electricity_forecast_env\Scripts\activate

3. Install dependencies:
   pip install -r requirements.txt

4. Start Jupyter notebook:
   jupyter notebook
📍 Current Python executable: /home/laptop/EXERCISES/DEEP-LEARNING/PYTORCH/Deep-Learning-and-PyTorch/Recurrent-Neural-Networks/electricity_forecast_env/bin/python
📁 Current working directory: /home/laptop/EXERCISES/DEEP-LEARNING/PYTORCH/Deep-Learning-and-PyTorch/Recurrent-Neural-Networks
🔢 Python version: 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]
✅ Virtual environment is ACTIVE


## 2. Install required dependencies

Installing all necessary Python packages for deep learning, data analysis, and visualization.

In [None]:
# Install and verify required packages
import subprocess
import importlib

# List of required packages for electricity forecasting
required_packages = [
    'torch',
    'numpy', 
    'pandas',
    'matplotlib',
    'seaborn',
    'scikit-learn',
    'scipy',
    'plotly',
    'flask',
    'requests'
]

print("Checking Required Dependencies for RNN/LSTM Electricity Forecasting:")
print("=" * 70)

missing_packages = []
installed_packages = []

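import importlib.metadata

# pip names and import names can differ: scikit-learn is imported as 'sklearn',
# so importing it by its pip name reports it as missing even when it is
# installed (as in the recorded output below)
import_names = {'scikit-learn': 'sklearn'}

for package in required_packages:
    try:
        # Try to import the package under its import name
        importlib.import_module(import_names.get(package, package))
        # Read the version from package metadata; not every module exposes __version__
        try:
            version = importlib.metadata.version(package)
        except importlib.metadata.PackageNotFoundError:
            version = 'Unknown'
        print(f"✅ {package:<15} - Version: {version}")
        installed_packages.append(package)
    except ImportError:
        print(f"❌ {package:<15} - NOT INSTALLED")
        missing_packages.append(package)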

print("=" * 70)
print(f"📊 Summary: {len(installed_packages)}/{len(required_packages)} packages installed")

if missing_packages:
    print(f"\n⚠️  Missing packages: {', '.join(missing_packages)}")
    print("🔧 Install missing packages with:")
    print(f"   pip install {' '.join(missing_packages)}")
else:
    print("\nAll required packages are installed!")
    print("Ready for electricity consumption analysis with RNN/LSTM!")

🚀 Checking Required Dependencies for RNN/LSTM Electricity Forecasting:
✅ torch           - Version: 2.7.1+cu126
✅ numpy           - Version: 2.3.1
✅ pandas          - Version: 2.3.1
✅ matplotlib      - Version: 3.10.3
✅ seaborn         - Version: 0.13.2
❌ scikit-learn    - NOT INSTALLED
✅ scipy           - Version: 1.16.0
✅ plotly          - Version: 6.2.0
✅ flask           - Version: 3.1.1
✅ requests        - Version: 2.32.4
📊 Summary: 9/10 packages installed

⚠️  Missing packages: scikit-learn
🔧 Install missing packages with:
   pip install scikit-learn




## 3. Import Python libraries

Importing all required Python libraries for data analysis, visualization, and machine learning preprocessing.

In [None]:
# Core Data Analysis Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime, timedelta
import os

# Deep Learning Libraries
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Statistical Analysis
from scipy import stats
from scipy.stats import pearsonr, spearmanr

# Interactive Plotting
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Matplotlib settings for better plots
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

# Check if CUDA is available for PyTorch
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("Using CPU for PyTorch operations")

print("\nReady for Electricity Consumption Analysis with RNN/LSTM!")

🎯 Libraries imported successfully!
🔥 PyTorch version: 2.7.1+cu126
📊 Pandas version: 2.3.1
🔢 NumPy version: 2.3.1
📈 Matplotlib version: 3.10.3
🎨 Seaborn version: 0.13.2
🚀 CUDA available: NVIDIA GeForce MX450

✅ Ready for Electricity Consumption Analysis with RNN/LSTM!


## 4. ⚡ Load electricity consumption dataset

Loading and examining the Italian electricity consumption data from the Center-North region (2024).

In [None]:
# Load electricity consumption dataset
electricity_file_path = 'Dataset/electrical-consumption-2024.csv'

try:
    # Load the dataset
    electricity_df = pd.read_csv(electricity_file_path)
    
    print("⚡ ELECTRICITY CONSUMPTION DATASET LOADED SUCCESSFULLY!")
    print("=" * 60)
    print(f"📁 File: {electricity_file_path}")
    print(f"Shape: {electricity_df.shape} (rows × columns)")
    print(f"Data period: 2024 (Center-North Italy)")
    print(f"Region: Center-North Italy (including Lazio)")
    print(f"Frequency: 15-minute intervals (will aggregate to daily)")
    
    print("\nDATASET STRUCTURE:")
    print("=" * 30)
    print(electricity_df.info())
    
    print("\n📋 COLUMN NAMES:")
    print("=" * 20)
    for i, col in enumerate(electricity_df.columns, 1):
        print(f"{i:2d}. {col}")
    
    print("\n👀 FIRST 5 ROWS:")
    print("=" * 20)
    display(electricity_df.head())
    
    print("\nDATA TYPES:")
    print("=" * 15)
    print(electricity_df.dtypes)
    
    print("\nKEY COLUMN FOR ANALYSIS:")
    if 'Total Load [MW]' in electricity_df.columns:
        print("'Total Load [MW]' column found - this is our target variable")
        print(f"Load range: {electricity_df['Total Load [MW]'].min():.2f} - {electricity_df['Total Load [MW]'].max():.2f} MW")
    else:
        print("⚠️  'Total Load [MW]' column not found. Available columns:")
        print(list(electricity_df.columns))
        
except FileNotFoundError:
    print(f"❌ ERROR: File '{electricity_file_path}' not found!")
    print("📁 Please ensure the Dataset folder contains 'electrical-consumption-2024.csv'")
    print("📂 Expected structure:")
    print("   Dataset/")
    print("   ├── electrical-consumption-2024.csv")
    print("   └── temperature-2024.csv")
except Exception as e:
    print(f"❌ ERROR loading electricity dataset: {e}")
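
The load data arrives at 15-minute resolution while the temperature data is daily, so the load has to be aggregated before the two can be compared. The following is a minimal sketch of that aggregation, not the notebook's own code. It assumes a timestamp column named `Date` (a hypothetical name, since the actual column is not shown above) and uses a small synthetic frame in place of the CSV.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 15-minute load file: two full days, 96 readings each.
# 'Date' is a hypothetical timestamp column name; 'Total Load [MW]' matches the notebook.
timestamps = pd.date_range("2024-01-01", periods=2 * 96, freq="15min")
raw = pd.DataFrame({
    "Date": timestamps,
    "Total Load [MW]": np.r_[np.full(96, 15000.0), np.full(96, 16000.0)],
})

# Parse timestamps, index by time, then summarise the readings of each calendar day.
raw["Date"] = pd.to_datetime(raw["Date"])
daily_load = (
    raw.set_index("Date")["Total Load [MW]"]
       .resample("D")
       .agg(["mean", "max", "count"])
)

# A complete day has 96 quarter-hour readings; fewer means gaps worth inspecting.
incomplete_days = daily_load[daily_load["count"] < 96]
print(daily_load)
```

Keeping the per-day `count` next to the mean makes it easy to spot days with missing readings before they distort the daily average.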

## 5. 🌡️ Load temperature dataset

Loading and exploring the temperature data from Roma, Lazio (representative of Center-North region).

In [None]:
# Load temperature dataset
temperature_file_path = 'Dataset/temperature-2024.csv'

try:
    # Load the temperature dataset
    temperature_df = pd.read_csv(temperature_file_path)
    
    print("TEMPERATURE DATASET LOADED SUCCESSFULLY!")
    print("=" * 50)
    print(f"📁 File: {temperature_file_path}")
    print(f"Shape: {temperature_df.shape} (rows × columns)")
    print(f"Data period: 2024")
    print(f"Location: Roma, Lazio, Italia")
    print(f"Frequency: Daily")
    
    print("\nDATASET STRUCTURE:")
    print("=" * 30)
    print(temperature_df.info())
    
    print("\n📋 COLUMN NAMES:")
    print("=" * 20)
    for i, col in enumerate(temperature_df.columns, 1):
        print(f"{i:2d}. {col}")
    
    print("\n👀 FIRST 5 ROWS:")
    print("=" * 20)
    display(temperature_df.head())
    
    print("\nDATA TYPES:")
    print("=" * 15)
    print(temperature_df.dtypes)
    
    print("\nKEY COLUMN FOR ANALYSIS:")
    if 'temp' in temperature_df.columns:
        print("'temp' column found - this is our temperature predictor")
        print(f"Temperature range: {temperature_df['temp'].min():.1f}°C - {temperature_df['temp'].max():.1f}°C")
        print(f"Average temperature: {temperature_df['temp'].mean():.1f}°C")
    else:
        print("⚠️  'temp' column not found. Available columns:")
        print(list(temperature_df.columns))
    
    # Check for date column
    date_columns = [col for col in temperature_df.columns if 'date' in col.lower() or 'time' in col.lower()]
    if date_columns:
        print(f"\n📅 Date column(s) found: {date_columns}")
    else:
        print("\n⚠️  No obvious date column found")
        
except FileNotFoundError:
    print(f"❌ ERROR: File '{temperature_file_path}' not found!")
    print("📁 Please ensure the Dataset folder contains 'temperature-2024.csv'")
    print("📂 Expected structure:")
    print("   Dataset/")
    print("   ├── electrical-consumption-2024.csv")
    print("   └── temperature-2024.csv")
except Exception as e:
    print(f"❌ ERROR loading temperature dataset: {e}")

## 6. Data cleaning and preprocessing

Handling missing values, outliers, and data quality issues in both electricity and temperature datasets.

In [None]:
# Data Cleaning and Quality Assessment
print("🧹 DATA CLEANING AND PREPROCESSING")
print("=" * 50)

# Function to analyze data quality
def analyze_data_quality(df, dataset_name):
    print(f"\n📊 {dataset_name} DATA QUALITY ANALYSIS:")
    print("-" * 40)
    
    # Missing values
    missing_values = df.isnull().sum()
    print(f"❓ Missing values:")
    if missing_values.sum() == 0:
        print("   ✅ No missing values found")
    else:
        for col, missing in missing_values.items():
            if missing > 0:
                percentage = (missing / len(df)) * 100
                print(f"   ⚠️  {col}: {missing} ({percentage:.1f}%)")
    
    # Duplicated rows
    duplicates = df.duplicated().sum()
    print(f"🔄 Duplicated rows: {duplicates}")
    if duplicates == 0:
        print("   ✅ No duplicated rows found")
    else:
        print(f"   ⚠️  Found {duplicates} duplicated rows")
    
    # Data types
    print(f"📏 Data types:")
    for col, dtype in df.dtypes.items():
        print(f"   {col}: {dtype}")
    
    return missing_values, duplicates

# Analyze electricity dataset
if 'electricity_df' in locals():
    elec_missing, elec_duplicates = analyze_data_quality(electricity_df, "ELECTRICITY")
    
    # Check for outliers in Total Load
    if 'Total Load [MW]' in electricity_df.columns:
        load_col = 'Total Load [MW]'
        Q1 = electricity_df[load_col].quantile(0.25)
        Q3 = electricity_df[load_col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        outliers = electricity_df[(electricity_df[load_col] < lower_bound) | 
                                 (electricity_df[load_col] > upper_bound)]
        
        print(f"\n🎯 ELECTRICITY LOAD OUTLIER ANALYSIS:")
        print(f"   📊 Q1: {Q1:.2f} MW")
        print(f"   📊 Q3: {Q3:.2f} MW")
        print(f"   📊 IQR: {IQR:.2f} MW")
        print(f"   📊 Lower bound: {lower_bound:.2f} MW")
        print(f"   📊 Upper bound: {upper_bound:.2f} MW")
        print(f"   🔍 Outliers found: {len(outliers)} ({len(outliers)/len(electricity_df)*100:.1f}%)")

# Analyze temperature dataset
if 'temperature_df' in locals():
    temp_missing, temp_duplicates = analyze_data_quality(temperature_df, "TEMPERATURE")
    
    # Check for outliers in temperature
    if 'temp' in temperature_df.columns:
        temp_col = 'temp'
        Q1_temp = temperature_df[temp_col].quantile(0.25)
        Q3_temp = temperature_df[temp_col].quantile(0.75)
        IQR_temp = Q3_temp - Q1_temp
        lower_bound_temp = Q1_temp - 1.5 * IQR_temp
        upper_bound_temp = Q3_temp + 1.5 * IQR_temp
        
        temp_outliers = temperature_df[(temperature_df[temp_col] < lower_bound_temp) | 
                                      (temperature_df[temp_col] > upper_bound_temp)]
        
        print(f"\n🌡️ TEMPERATURE OUTLIER ANALYSIS:")
        print(f"   📊 Q1: {Q1_temp:.1f}°C")
        print(f"   📊 Q3: {Q3_temp:.1f}°C")
        print(f"   📊 IQR: {IQR_temp:.1f}°C")
        print(f"   📊 Lower bound: {lower_bound_temp:.1f}°C")
        print(f"   📊 Upper bound: {upper_bound_temp:.1f}°C")
        print(f"   🔍 Outliers found: {len(temp_outliers)} ({len(temp_outliers)/len(temperature_df)*100:.1f}%)")

print(f"\n✅ Data quality analysis completed!")
print("🎯 Next step: Merge datasets by date for correlation analysis")
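
The merge announced at the end of the cell above could look like the following sketch. The column names `date` and `load_mw` and all values are made up for illustration; only `temp` comes from the notebook. An outer merge with `indicator=True` shows which dates are missing from either dataset before they are dropped.

```python
import pandas as pd

# Hypothetical daily frames with made-up values; real column names may differ.
load_daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "load_mw": [15200.0, 15850.0, 14900.0],
})
temp_daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
    "temp": [8.1, 7.4, 9.0],
})

# Outer merge with an indicator column exposes dates present in only one dataset.
# validate="one_to_one" raises if either frame contains duplicate dates.
merged = load_daily.merge(
    temp_daily, on="date", how="outer", validate="one_to_one", indicator=True
)
unmatched = merged[merged["_merge"] != "both"]

# Keep only days where both load and temperature are available.
aligned = (
    merged[merged["_merge"] == "both"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(aligned)
```

Inspecting `unmatched` before discarding it avoids silently losing days, which would break the regular daily spacing that the sliding windows rely on.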

## 10. 📈 Correlation analysis between temperature and load

**The Core of RNN/LSTM forecasting: Understanding temperature-electricity relationships**

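The cell below uses synthetic data to demonstrate the structure of the analysis. It illustrates the U-shaped relationship between temperature and load (demand rises in both cold and hot weather) that makes temperature a useful predictor for electricity demand forecasting. Because Pearson's r measures only linear association, correlations are also computed per season and per temperature range.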

In [None]:
# Temperature-Load Correlation Analysis for RNN/LSTM
print("TEMPERATURE-LOAD CORRELATION ANALYSIS")
print("=" * 50)
print("This analysis is crucial for understanding why RNN/LSTM models")
print("   use both electricity and temperature data together")
print()

# Simulate correlation analysis (replace with actual data when available)
# Create sample data to demonstrate the analysis structure
print("CREATING SAMPLE DATA FOR DEMONSTRATION:")
print("   (Replace this with actual dataset loading)")

# Sample data generation for demonstration
np.random.seed(42)
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')
n_days = len(dates)

# Simulate temperature with seasonal pattern
day_of_year = np.array([d.timetuple().tm_yday for d in dates])
temp_base = 15 + 10 * np.sin(2 * np.pi * (day_of_year - 80) / 365)  # Seasonal pattern
temp_noise = np.random.normal(0, 3, n_days)
temperature = temp_base + temp_noise

# Simulate electricity load with U-shaped temperature dependency
comfort_temp = 20  # Comfort temperature in Celsius
temp_effect = 0.5 * (temperature - comfort_temp) ** 2  # U-shaped relationship
base_load = 15000  # Base load in MW
seasonal_pattern = 1000 * np.sin(2 * np.pi * (day_of_year - 80) / 365)
load_noise = np.random.normal(0, 200, n_days)
electricity_load = base_load + temp_effect * 30 + seasonal_pattern + load_noise

# Create demonstration dataframe
demo_df = pd.DataFrame({
    'date': dates,
    'temperature': temperature,
    'electricity_load': electricity_load,
    'day_of_year': day_of_year
})

print("Sample data created for correlation analysis")
print(f"Date range: {demo_df['date'].min()} to {demo_df['date'].max()}")
print(f"🌡️  Temperature range: {demo_df['temperature'].min():.1f}°C to {demo_df['temperature'].max():.1f}°C")
print(f"⚡ Load range: {demo_df['electricity_load'].min():.0f} to {demo_df['electricity_load'].max():.0f} MW")

# Correlation Analysis
print(f"\nCORRELATION ANALYSIS:")
print("-" * 30)

# Overall correlation
overall_corr = demo_df['temperature'].corr(demo_df['electricity_load'])
print(f"📊 Overall correlation (Pearson): {overall_corr:.3f}")

# Seasonal correlations
seasons = {
    'Winter': [12, 1, 2],
    'Spring': [3, 4, 5], 
    'Summer': [6, 7, 8],
    'Autumn': [9, 10, 11]
}

print(f"\nSEASONAL CORRELATION PATTERNS:")
for season, months in seasons.items():
    season_data = demo_df[demo_df['date'].dt.month.isin(months)]
    if len(season_data) > 0:
        season_corr = season_data['temperature'].corr(season_data['electricity_load'])
        avg_temp = season_data['temperature'].mean()
        avg_load = season_data['electricity_load'].mean()
        print(f"   {season:8s}: r={season_corr:6.3f} | Avg Temp: {avg_temp:5.1f}°C | Avg Load: {avg_load:6.0f} MW")

# Temperature ranges analysis
print(f"\n🌡️ CORRELATION BY TEMPERATURE RANGES:")
temp_ranges = [
    ('Cold (< 10°C)', demo_df['temperature'] < 10),
    ('Cool (10-15°C)', (demo_df['temperature'] >= 10) & (demo_df['temperature'] < 15)),
    ('Comfort (15-25°C)', (demo_df['temperature'] >= 15) & (demo_df['temperature'] < 25)),
    ('Warm (25-30°C)', (demo_df['temperature'] >= 25) & (demo_df['temperature'] < 30)),
    ('Hot (> 30°C)', demo_df['temperature'] >= 30)
]

for range_name, condition in temp_ranges:
    range_data = demo_df[condition]
    if len(range_data) > 5:  # Need minimum data points
        range_corr = range_data['temperature'].corr(range_data['electricity_load'])
        count = len(range_data)
        avg_load = range_data['electricity_load'].mean()
        print(f"   {range_name:15s}: r={range_corr:6.3f} | Days: {count:3d} | Avg Load: {avg_load:6.0f} MW")

print(f"\nKEY INSIGHTS FOR RNN/LSTM MODELING (simulated data):")
print("   • U-shaped relationship: both cold and hot weather increase electricity demand")
print("   • Seasonal patterns: winter heating vs summer cooling")
print("   • Comfort zone (15-25°C): minimal correlation, baseline consumption")
print("   • Temperature is a strong predictor for electricity demand")
print("   • RNN/LSTM can learn these complex non-linear relationships")

print(f"\n📈 WHY THIS MATTERS FOR DEEP LEARNING:")
print("   Using temperature as input feature")
print("   Explains multi-dataset approach in RNN/LSTM")
print("   Shows importance of seasonal feature engineering")
print("   Validates weather integration in forecasting API")
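
A Pearson coefficient measures only linear association, so a symmetric U-shaped dependence can produce an r close to zero even when temperature determines load almost exactly. The sketch below illustrates this under stated assumptions: it reuses the functional form of the simulated load above (comfort point at 20 °C, quadratic coefficient 0.5 × 30 = 15), spreads the temperatures symmetrically around the comfort point, and fits a quadratic with `np.polyfit` to recover that point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Temperatures spread symmetrically around an assumed comfort point of 20 °C.
temperature = np.linspace(5.0, 35.0, 301)
# Same functional form as the simulated load above: base + 15 * (T - 20)^2, plus noise.
load = 15000.0 + 15.0 * (temperature - 20.0) ** 2 + rng.normal(0.0, 50.0, temperature.size)

# Linear (Pearson) correlation over the full range is close to zero.
pearson_r = np.corrcoef(temperature, load)[0, 1]

# A quadratic fit captures the relationship; its vertex estimates the comfort temperature.
a, b, c = np.polyfit(temperature, load, deg=2)
comfort_estimate = -b / (2.0 * a)

# R^2 of the quadratic fit.
fitted = np.polyval([a, b, c], temperature)
r_squared = 1.0 - np.sum((load - fitted) ** 2) / np.sum((load - load.mean()) ** 2)

print(f"Pearson r: {pearson_r:.3f}")
print(f"Estimated comfort temperature: {comfort_estimate:.1f} °C")
print(f"Quadratic fit R^2: {r_squared:.3f}")
```

On real data the sign and size of the overall r depend on how the observed temperatures are distributed around the comfort point, which is why the per-season and per-range breakdown is more informative than the single overall coefficient.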

## 12. Feature engineering for RNN/LSTM

**Creating Multi-Dimensional input features for Deep Learning models**

This section prepares the feature vectors that RNN/LSTM models will use to learn complex patterns.

In [None]:
# Feature Engineering for RNN/LSTM Deep Learning Models
print("FEATURE ENGINEERING FOR RNN/LSTM MODELS")
print("=" * 55)
print("Creating multi-dimensional input features for deep learning")
print("This demonstrates how RNN/LSTM processes multiple datasets")
print()

# Use the demo dataframe from previous section
df = demo_df.copy()

print("CREATING TEMPORAL AND SEASONAL FEATURES:")
print("-" * 45)

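# 1. Day of year normalization (0-1 scale for neural networks)
# Divide by the actual year length: 2024 is a leap year, so dividing by 365 would exceed 1
days_in_year = np.where(df['date'].dt.is_leap_year, 366, 365)
df['day_of_year_norm'] = df['date'].dt.dayofyear / days_in_year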
print(f"Day of year normalization: {df['day_of_year_norm'].min():.3f} - {df['day_of_year_norm'].max():.3f}")

# 2. Month encoding
df['month'] = df['date'].dt.month
print(f"Month extraction: {df['month'].min()} - {df['month'].max()}")

# 3. Season one-hot encoding
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

df['season'] = df['month'].apply(get_season)

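# One-hot encode seasons (dtype=int: recent pandas versions default to bool columns,
# which describe() would leave out of the feature statistics)
season_dummies = pd.get_dummies(df['season'], prefix='season', dtype=int)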
df = pd.concat([df, season_dummies], axis=1)

print(f"Season encoding: {list(season_dummies.columns)}")

# 4. Weekend/weekday encoding
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
print(f"Weekend encoding: {df['is_weekend'].sum()} weekend days")

# 5. Cyclical encoding for day of year (sine/cosine)
df['day_sin'] = np.sin(2 * np.pi * df['day_of_year_norm'])
df['day_cos'] = np.cos(2 * np.pi * df['day_of_year_norm'])
print(f"Cyclical encoding: sine and cosine transformations")

print(f"\n📋 FEATURE MATRIX FOR RNN/LSTM:")
print("-" * 35)

# Create the feature matrix that RNN/LSTM will use
feature_columns = [
    'electricity_load',    # Target variable (lagged)
    'temperature',         # Weather predictor
    'day_of_year_norm',   # Seasonal cycling
    'season_Winter',      # One-hot encoded seasons
    'season_Spring',
    'season_Summer',
    'season_Autumn',
    'is_weekend',         # Weekend effect
    'day_sin',           # Cyclical day encoding
    'day_cos'
]

# Create feature matrix
feature_matrix = df[feature_columns].copy()

print(f"Feature matrix shape: {feature_matrix.shape}")
print(f"Number of features per time step: {len(feature_columns)}")
print(f"Number of time steps (days): {len(feature_matrix)}")

print(f"\nFEATURE DESCRIPTION:")
for i, col in enumerate(feature_columns, 1):
    print(f"   {i:2d}. {col:20s} - {feature_matrix[col].dtype}")

# Display first few rows of feature matrix
print(f"\nSAMPLE FEATURE MATRIX (First 5 days):")
print("-" * 40)
display(feature_matrix.head())

# Feature statistics
print(f"\nFEATURE STATISTICS:")
print("-" * 25)
print(feature_matrix.describe().round(3))

print(f"\nHOW RNN/LSTM USES THESE FEATURES:")
print("-" * 40)
print("Input shape for RNN/LSTM: (batch_size, sequence_length, num_features)")
print(f"   • batch_size: 32 (training batch size)")
print(f"   • sequence_length: 24 (24-day lookback window)")
print(f"   • num_features: {len(feature_columns)} (features per day)")
print(f"   • Example tensor shape: (32, 24, {len(feature_columns)})")

print(f"\nSEQUENCE CREATION PROCESS:")
print("   1. Take 24 consecutive days of features")
print("   2. Use them to predict electricity load for day 25")
print("   3. Slide window by 1 day and repeat")
print("   4. Creates overlapping sequences for training")

print(f"\nMULTI-DATASET INTEGRATION ACHIEVED:")
print("   Electricity consumption (target variable)")
print("   Temperature data (primary predictor)")
print("   Temporal patterns (day of year, seasons)")
print("   Weekly patterns (weekend effects)")
print("   Cyclical encoding (sine/cosine)")

print(f"\n📈 READY FOR DEEP LEARNING:")
print("   Feature matrix prepared for RNN/LSTM training")
print("   Temporal features encoded; scaling is applied during sequence creation")
print("   Multi-dimensional time series ready")
print("   Next step: Create sliding window sequences")
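
The sine/cosine pair is included because a linear day-of-year value places 31 December and 1 January at opposite ends of the scale, although they are adjacent days. A short sketch of the difference, using a hypothetical helper `cyclical_encode` that is not part of the notebook:

```python
import numpy as np

def cyclical_encode(day_of_year, days_in_year):
    """Map a day of year onto the unit circle as (sin, cos)."""
    angle = 2.0 * np.pi * day_of_year / days_in_year
    return np.array([np.sin(angle), np.cos(angle)])

DAYS_2024 = 366  # 2024 is a leap year

dec_31 = cyclical_encode(366, DAYS_2024)
jan_01 = cyclical_encode(1, DAYS_2024)
jul_01 = cyclical_encode(183, DAYS_2024)  # 1 July is day 183 in a leap year

# Linear encoding puts 31 Dec and 1 Jan at opposite ends of the 0-1 scale.
linear_gap = abs(366 / DAYS_2024 - 1 / DAYS_2024)

# On the circle they are neighbours, while 1 July sits on the far side.
circular_gap = np.linalg.norm(dec_31 - jan_01)
half_year_gap = np.linalg.norm(jul_01 - jan_01)

print(f"Linear gap 31 Dec vs 1 Jan:   {linear_gap:.3f}")
print(f"Circular gap 31 Dec vs 1 Jan: {circular_gap:.3f}")
print(f"Circular gap 1 Jul vs 1 Jan:  {half_year_gap:.3f}")
```

Both components are needed: sine alone cannot distinguish, for example, early spring from early autumn, whereas the (sin, cos) pair identifies each day uniquely.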

## 14. Create Sequence Windows for Deep Learning

**Preparing Sequential Data for RNN/LSTM Training**

This is the crucial step where we transform our multi-feature time series into the sliding window format that RNN/LSTM models require.

In [None]:
# Create Sequence Windows for RNN/LSTM Training
print("CREATING SEQUENCE WINDOWS FOR RNN/LSTM")
print("=" * 50)
print("Converting time series data into sliding window sequences")
print("This is how RNN/LSTM learns temporal patterns")
print()

# Parameters for sequence creation
SEQUENCE_LENGTH = 24  # 24-day lookback window
print(f"Sequence length: {SEQUENCE_LENGTH} days")
print(f"Prediction target: Next day electricity consumption")

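def create_sequences(data, target_col, sequence_length, train_fraction=0.8):
    """
    Create sliding window sequences for RNN/LSTM training
    
    Args:
        data: DataFrame with features
        target_col: Column name for target variable  
        sequence_length: Number of days in each sequence
        train_fraction: Share of sequences (in time order) used to fit the scalers;
            keep it equal to the train/test split ratio used later
    
    Returns:
        X: Input sequences (all columns, including the lagged target)
        y: Target values (electricity load)
        scaler_features: MinMaxScaler fitted on the feature columns
        scaler_target: MinMaxScaler fitted on the target column
    """
    
    # Normalize features and target for neural network training
    feature_cols = [col for col in data.columns if col != target_col]
    scaler_features = MinMaxScaler()
    scaler_target = MinMaxScaler()
    
    # Fit scalers only on the rows covered by the training sequences, so that
    # minima/maxima from the test period do not leak into training
    n_sequences = len(data) - sequence_length
    n_fit_rows = sequence_length + int(train_fraction * n_sequences)
    fit_data = data.iloc[:n_fit_rows]
    scaler_features.fit(fit_data[feature_cols])
    scaler_target.fit(fit_data[[target_col]])
    
    # Apply the fitted scalers to the whole series
    data_scaled = data.copy()
    data_scaled[feature_cols] = scaler_features.transform(data[feature_cols])
    data_scaled[[target_col]] = scaler_target.transform(data[[target_col]])
    
    print(f"Features normalized using MinMaxScaler (fitted on first {n_fit_rows} rows)")
    print(f"Feature columns: {len(feature_cols)}")
    
    # Create sequences from a plain float array (faster than repeated .iloc slicing)
    values = data_scaled.to_numpy(dtype=float)
    target_idx = data_scaled.columns.get_loc(target_col)
    X, y = [], []
    
    for i in range(sequence_length, len(values)):
        # Input sequence: previous 'sequence_length' days
        X.append(values[i-sequence_length:i])
        
        # Target: electricity load for current day
        y.append(values[i, target_idx])
    
    X = np.array(X)
    y = np.array(y)
    
    print(f"Sequences created:")
    print(f"   • Input shape (X): {X.shape}")
    print(f"   • Target shape (y): {y.shape}")
    print(f"   • Interpretation:")
    print(f"     - {X.shape[0]} sequences")
    print(f"     - {X.shape[1]} days per sequence")  
    print(f"     - {X.shape[2]} features per day")
    
    return X, y, scaler_features, scaler_target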

# Create sequences from our feature matrix
print(f"\nBUILDING SEQUENCES FROM FEATURE MATRIX:")
print("-" * 40)

X_sequences, y_targets, feature_scaler, target_scaler = create_sequences(
    feature_matrix, 
    'electricity_load', 
    SEQUENCE_LENGTH
)

print(f"\n📊 SEQUENCE DATASET OVERVIEW:")
print("-" * 30)
print(f"Original data points: {len(feature_matrix)}")
print(f"Usable sequences: {len(X_sequences)}")
print(f"Lost points: {len(feature_matrix) - len(X_sequences)} (first {SEQUENCE_LENGTH} days)")

print(f"\nEXAMPLE SEQUENCE STRUCTURE:")
print("-" * 35)
print(f"Sequence 0 (first trainable sequence):")
print(f"   • Input: Days 1-{SEQUENCE_LENGTH} → Day {SEQUENCE_LENGTH+1}")
print(f"   • Shape: {X_sequences[0].shape}")
print(f"   • Target: {y_targets[0]:.4f} (normalized)")

# Show sample of first sequence
print(f"\n👀 SAMPLE INPUT SEQUENCE (First 3 days of sequence 0):")
print("-" * 50)
sample_seq = pd.DataFrame(
    X_sequences[0][:3], 
    columns=feature_matrix.columns
)
display(sample_seq)

# Convert to PyTorch tensors for deep learning
X_tensor = torch.FloatTensor(X_sequences)
y_tensor = torch.FloatTensor(y_targets)

print(f"\nPYTORCH TENSORS CREATED:")
print("-" * 30)
print(f"Input tensor shape: {X_tensor.shape}")
print(f"Target tensor shape: {y_tensor.shape}")
print(f"Memory usage: {X_tensor.numel() * 4 / 1024 / 1024:.2f} MB")

# Train/test split
train_size = int(0.8 * len(X_sequences))
X_train, X_test = X_tensor[:train_size], X_tensor[train_size:]
y_train, y_test = y_tensor[:train_size], y_tensor[train_size:]

print(f"\nTRAIN/TEST SPLIT:")
print("-" * 20)
print(f"Training sequences: {len(X_train)} ({len(X_train)/len(X_sequences)*100:.1f}%)")
print(f"Testing sequences: {len(X_test)} ({len(X_test)/len(X_sequences)*100:.1f}%)")

print(f"\nREADY FOR RNN/LSTM TRAINING:")
print("-" * 35)
print("Sequences created with proper sliding windows")
print("Features normalized for neural network training")  
print("Data converted to PyTorch tensors")
print("Train/test split completed")
print("Multi-dataset integration achieved")

print(f"\nHOW RNN/LSTM WILL PROCESS THIS DATA:")
print(f"   1. Take sequence of {X_tensor.shape[1]} days × {X_tensor.shape[2]} features")
print("   2. Process through RNN/LSTM layers sequentially")
print("   3. Hidden states capture temporal patterns")
print("   4. Final output predicts next day electricity load")
print("   5. Backpropagate errors to learn optimal weights")

print(f"\nNEXT STEPS:")
print("   Define RNN/LSTM model architecture")
print("   Train models with different hyperparameters")
print("   Compare RNN vs LSTM performance")
print("   Evaluate on test set")
print("   Deploy for real-time forecasting")
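
The processing steps listed above can be sketched as a small PyTorch module. This is an illustration only, not the architecture used in `train_models.py`: the class name `LoadForecaster`, the hidden size of 32, and the random stand-in batch are all assumptions. It takes a `(batch, sequence, features)` tensor, reads the hidden state after the last day, and maps it to one load value per sequence.

```python
import torch
import torch.nn as nn

class LoadForecaster(nn.Module):
    """Hypothetical single-layer LSTM regressor; sizes are illustrative, not tuned."""

    def __init__(self, num_features, hidden_size=32):
        super().__init__()
        # batch_first=True matches the (batch, sequence, features) layout used above.
        self.lstm = nn.LSTM(num_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        output, _ = self.lstm(x)                 # (batch, sequence, hidden)
        last_step = output[:, -1, :]             # hidden state after the final day
        return self.head(last_step).squeeze(-1)  # (batch,)

torch.manual_seed(0)
model = LoadForecaster(num_features=10)

# Random stand-in for one batch: 32 sequences of 24 days with 10 features each.
x_batch = torch.randn(32, 24, 10)
y_batch = torch.rand(32)

prediction = model(x_batch)
loss = nn.MSELoss()(prediction, y_batch)
loss.backward()  # gradients flow back through all 24 time steps

print(prediction.shape, loss.item())
```

Replacing `nn.LSTM` with `nn.RNN` (same constructor arguments here) gives the plain recurrent variant, which makes a like-for-like comparison of the two straightforward.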

## EDA summary & next steps

### **Insights for RNN/LSTM models**

- **Temperature-Load Correlation**: The simulated data illustrates a U-shaped relationship; confirming it on the merged load and temperature data is what would justify the multi-dataset approach
- **Seasonal Patterns**: Winter heating vs summer cooling effects captured
- **Feature Matrix**: 10 features per time step combining electricity, weather, and temporal data
- **Sequence Structure**: 24-day lookback windows used as a starting point; the window length has not been tuned
- **Data Quality**: Missing values, duplicates, and IQR outliers checked for both datasets

### **Next**

1. **Model Development**: Implement RNN and LSTM architectures
2. **Training Pipeline**: Compare model performances and hyperparameters  
3. **Real-time Integration**: Connect with weather API for live forecasting
4. **API Deployment**: RESTful service for electricity demand predictions
5. **Performance Evaluation**: Test on unseen data and validate accuracy
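
For the performance evaluation step, a simple reference point helps to interpret model errors. One common choice, shown here as a sketch and not something the notebook itself implements, is the persistence forecast, where tomorrow's load is predicted to equal today's. The function name `persistence_mae` and the toy values are hypothetical.

```python
import numpy as np

def persistence_mae(load, train_fraction=0.8):
    """MAE of the 'tomorrow equals today' forecast on the chronological test tail."""
    load = np.asarray(load, dtype=float)
    split = int(train_fraction * len(load))
    # Forecast for day t is the observed load on day t-1.
    actual = load[split:]
    forecast = load[split - 1:-1]
    return float(np.mean(np.abs(actual - forecast)))

# Toy series in MW (made-up values for illustration only).
toy_load = np.array([100.0, 110.0, 105.0, 120.0, 115.0, 125.0, 130.0, 128.0, 135.0, 140.0])
baseline = persistence_mae(toy_load)
print(f"Persistence MAE: {baseline:.2f} MW")
```

An RNN or LSTM whose test MAE (converted back to MW with the target scaler) does not beat this baseline is not yet adding value over the trivial forecast.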

---

*Continue to `train_models.py` to build and train the RNN/LSTM models, or `api_server.py` to deploy the forecasting system.*