## Python Example

In [2]:
from utils.dataframes import *

meter_usage

Unnamed: 0,service_point_id,meter_id,interval_end_datetime,meter_channel,kwh,account_number
0,2300822246,L108605388,10/1/2022 12:00:00 AM,10,0.594,30010320353
1,2300822246,L108605388,10/1/2022 12:15:00 AM,10,0.101,30010320353
2,2300822246,L108605388,10/1/2022 12:30:00 AM,10,0.104,30010320353
3,2300822246,L108605388,10/1/2022 12:45:00 AM,10,0.106,30010320353
4,2300822246,L108605388,10/1/2022 1:00:00 AM,10,0.099,30010320353
...,...,...,...,...,...,...
500275,2300588897,L108607371,9/30/2021 7:00:00 PM,10,1.242,35012790198
500276,2300588897,L108607371,9/30/2021 8:00:00 PM,10,1.202,35012790198
500277,2300588897,L108607371,9/30/2021 9:00:00 PM,10,1.186,35012790198
500278,2300588897,L108607371,9/30/2021 10:00:00 PM,10,1.150,35012790198


## SQL Example

In [3]:
from utils.runtime import connect_to_db

electric_brew = connect_to_db()

electric_brew.execute("SELECT * FROM meter_usage").fetchdf()

Unnamed: 0,service_point_id,meter_id,interval_end_datetime,meter_channel,kwh,account_number
0,2300588853,L123057647,10/1/2022 12:00:00 AM,10,0.043,35012787756
1,2300588853,L123057647,10/1/2022 12:15:00 AM,10,0.040,35012787756
2,2300588853,L123057647,10/1/2022 12:30:00 AM,10,0.045,35012787756
3,2300588853,L123057647,10/1/2022 12:45:00 AM,10,0.040,35012787756
4,2300588853,L123057647,10/1/2022 1:00:00 AM,10,0.045,35012787756
...,...,...,...,...,...,...
500275,2300588897,L108607371,9/30/2021 7:00:00 PM,10,1.242,35012790198
500276,2300588897,L108607371,9/30/2021 8:00:00 PM,10,1.202,35012790198
500277,2300588897,L108607371,9/30/2021 9:00:00 PM,10,1.186,35012790198
500278,2300588897,L108607371,9/30/2021 10:00:00 PM,10,1.150,35012790198


In [4]:
electric_brew.close()

## Advanced Analysis Ideas

#### Ideas

1. ~~**Model Validation the Right Way: Holdout Sets**~~
   - ~~Develop a model validation strategy using holdout sets extracted from the `fct_electric_brew` dataset. These sets should focus on different billing intervals, various meters, and distinct operational periods (like peak and off-peak hours). The objective is to assess the model's predictive accuracy in real-world scenarios, especially for forecasting energy consumption and associated costs during different periods.~~

   - ~~Plot residuals for each of the models' holdout sets~~  
<br>

1. ~~**Model Validation via Cross-Validation**~~
   - ~~Implement a robust cross-validation framework on `dim_datetimes` and `fct_electric_brew` datasets. This framework should assess the consistency and reliability of predictive models across various temporal segments, such as different months, weeks, or specific operational hours, to enhance the accuracy of peak hour energy consumption predictions.~~   
<br>

1. **Learning Curves**
   - Analyze learning curves by incrementally increasing the volume of training data from the `fct_electric_brew` dataset, which includes varied energy consumption patterns. Evaluate how different machine learning models improve or stabilize in performance as more data is fed into them. This analysis aims to determine the point of diminishing returns in terms of data volume and its effect on model performance.  
<br>

1. **Bayesian Classification**
   - Utilize Bayesian classification techniques on the `fct_electric_brew` dataset to make probabilistic predictions about energy usage patterns. This approach will be particularly beneficial in assessing the likelihood of different consumption patterns and their impact on cost, which is crucial in ROI analyses and risk assessment.  
<br>

1.  **Polynomial Basis Functions**
    - Investigate and model non-linear relationships within the `dim_datetimes` and `fct_electric_brew` datasets. This exploration should focus on uncovering complex patterns in energy consumption during peak hours, using polynomial basis functions to represent these non-linear relationships in the predictive models.  
<br>

1.  **Regularization**
    - Implement regularization techniques in predictive models that combine features from `dim_datetimes`, `dim_meters`, `dim_bills`, and `fct_electric_brew`. The goal is to control model complexity, particularly in models predicting various aspects of energy consumption, and prevent overfitting by penalizing large or irrelevant model coefficients.  
<br>

1.  **Feature Ranking, PCR, and Lasso Regression (hw07)**
    - Apply Lasso regression across all dimensions, including `dim_datetimes`, `dim_meters`, `dim_bills`, and `fct_electric_brew`, to conduct feature selection. This method will identify the most impactful predictors across different datasets, revealing the variables that significantly influence peak hour energy consumption and cost patterns.  
<br>

1.  **Principal Component Analysis**
    - Perform Principal Component Analysis (PCA) on the `dim_bills` dataset to reduce its dimensionality. This technique will help in identifying the most significant billing factors and underlying patterns, simplifying complex billing data into principal components that retain the most important information for predictive modeling.  
<br>

1.  **Anomaly Detection**
    - Deploy anomaly detection algorithms to identify unusual patterns or outliers in cost and consumption data within the `fct_electric_brew` dataset. This analysis aims to pinpoint inefficiencies, potential errors, or opportunities for cost savings, especially in peak hour energy usage.  
<br>

1.  **Time-Based Forecasting**
    - Use advanced time series forecasting methods on `dim_datetimes` and `dim_bills` to predict future energy costs and consumption patterns. This forecasting will help in strategic planning and decision-making, particularly in identifying potential benefits of shifting energy usage to off-peak times.  
<br>

1.  **K-Means Clustering**
    - Apply K-Means Clustering to the `fct_electric_brew` dataset to segment and analyze different patterns of energy consumption and associated costs. This clustering will assist in identifying distinct consumption groups or patterns within the brewery, facilitating targeted strategies for cost reduction and energy management.  
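
As a starting point for the learning-curve idea, the sketch below uses scikit-learn's `learning_curve` on synthetic data. The data, feature count, and model settings are illustrative assumptions and are not drawn from `fct_electric_brew`; the point is only to show how validation scores can be tracked as the training volume grows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real features and target (hypothetical data)
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=400)

# Refit the model on 10% .. 100% of each training fold and score it on the held-out fold
train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestRegressor(n_estimators=25, random_state=42),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="r2",
)

# A flattening validation curve suggests diminishing returns from additional data
for size, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"{size:>4} training rows -> mean validation R²: {score:.3f}")
```

The same call should accept one of the pipelines returned by `setup_analysis` in place of the bare regressor, since `learning_curve` works with any scikit-learn estimator.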


In [25]:
from utils.dataframes import fct_electric_brew, dim_datetimes, dim_meters, dim_bills
from utils.runtime    import setup_plot_params

from sklearn.compose       import ColumnTransformer
from sklearn.ensemble      import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model  import LinearRegression
from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def setup_analysis(temporal_only : bool = False) -> dict:
    '''
    This utility function consolidates the data preparation and model setup steps for various analyses of the `fct_electric_brew` dataset,
    in the spirit of DRY.
     
    It merges related data tables, extracts features, sets up preprocessing pipelines, and initializes a set of models for further analysis. 
    The function is designed to standardize these steps across different analyses, ensuring consistency and efficiency.

    Methodology:
        1. Data Merging: Combine `fct_electric_brew` with `dim_datetimes`, `dim_meters`, and `dim_bills` for a comprehensive dataset.
        2. Feature Engineering: Select relevant features for analysis, focusing on key aspects like time, meter readings, and supplier.
        3. Preprocessing Setup: Standardize numerical features and encode categorical features for machine learning algorithms.
        4. Model Initialization: Set up pipelines for various regression models, including Linear Regression, Random Forest, and Gradient Boosting.
        5. Output Packaging: Return a dictionary containing prepared data, features, target variable, and model pipelines.

    Note: For Step 2, the `temporal_only` argument allows for the selection of only time-related features if set to True.

    Mathematical Concepts:
        • Standardization (Z-Score Normalization):
            - Formula: z = (x - μ) / σ
            - Purpose: Scales features to have a mean (μ) of zero and a standard deviation (σ) of one.

        • One-Hot Encoding:
            - Purpose: Transforms categorical variables into a binary matrix, enabling easier processing by machine learning models.

        • Regression Modeling:
            - Concept: Techniques used to predict a continuous target variable based on one or more input features.

        • Pipeline Creation:
            - Purpose: Streamlines preprocessing and modeling steps into a single, unified process for more efficient analysis.

    Produces:
        - A comprehensive dataset ready for analysis.
        - A set of preprocessed features and a target variable.
        - A dictionary of machine learning model pipelines, primed for training and evaluation.
    '''

    setup_plot_params()

    # Joining fact and dimension tables
    data = fct_electric_brew.merge(dim_datetimes, how = 'left', left_on = 'dim_datetimes_id', right_on = 'id', suffixes = ('', '_dd')) \
                            .merge(dim_meters,    how = 'left', left_on = 'dim_meters_id',    right_on = 'id', suffixes = ('', '_dm')) \
                            .merge(dim_bills,     how = 'left', left_on = 'dim_bills_id',     right_on = 'id', suffixes = ('', '_db'))

    # Feature selection and engineering
    # `non_temporal` holds the meter and supplier columns, which are dropped when `temporal_only` is True
    non_temporal = [] if temporal_only else ['meter_id', 'supplier']
    features     = data[['hour', 'week', 'month', 'quarter', 'year', 'period', 'kwh'] + non_temporal]
    target       = data['total_cost']

    # Preprocessing for numerical and categorical features
    categorical = ['period'] + non_temporal
    numerical   = [col for col in features.columns if col not in categorical]

    # `handle_unknown = 'ignore'` keeps prediction from failing on categories absent from the training split
    preprocessor = ColumnTransformer(transformers = [('num', StandardScaler(), numerical),
                                                     ('cat', OneHotEncoder(handle_unknown = 'ignore'), categorical)])

    # Define models within a pipeline to ensure consistent preprocessing
    models = {'Linear Regression' : Pipeline([('preprocessor', preprocessor), 
                                              ('regressor',    LinearRegression())]),

              'Random Forest'     : Pipeline([('preprocessor', preprocessor), 
                                              ('regressor',    RandomForestRegressor())]),

              'Gradient Boosting' : Pipeline([('preprocessor', preprocessor), 
                                              ('regressor',    GradientBoostingRegressor())])}

    return {'data'     : data,
            'features' : features,
            'target'   : target,
            'models'   : models}
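
The standardization and one-hot encoding steps described in the `setup_analysis` docstring can be checked on a toy frame. This is a minimal sketch: the `kwh` values and the `Peak` / `Off-peak` labels are made up for illustration and are not taken from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature
toy = pd.DataFrame({"kwh": [0.5, 1.0, 1.5, 2.0],
                    "period": ["Peak", "Off-peak", "Peak", "Off-peak"]})

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["kwh"]),
    ("cat", OneHotEncoder(),  ["period"]),
])

transformed = preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; densify it for inspection
transformed = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)

kwh_scaled = transformed[:, 0]   # z = (x - μ) / σ
one_hot    = transformed[:, 1:]  # one binary column per `period` category

print("scaled kwh mean:", round(float(kwh_scaled.mean()), 6))
print("scaled kwh std :", round(float(kwh_scaled.std()), 6))
print(one_hot)
```

After scaling, `kwh` has a mean of zero and a standard deviation of one, and each row of the one-hot block contains exactly one `1`.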

## Thoughts after DRY Function

- Left-joining all three dimension tables inside `setup_analysis` adds processing time to both analyses that call it
- How should nulls be handled in the dataset?
- It is unclear why the Temporal Consistency Analysis (TCA) now takes longer; one possible cause is that `kwh` is included as a standardized feature in this version
- Should the time components be sorted into a particular order before TCA? `TimeSeriesSplit` splits by row position, so it assumes the rows are already in chronological order
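
One possible answer to the null-handling question is to impute inside the preprocessing pipeline, so the same rule is learned from the training data and reapplied at prediction time. The sketch below is an assumption about how that could look, not the approach used in this notebook; the toy columns, the median strategy, and the `Unknown` fill value are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({"kwh": [0.5, np.nan, 1.5, 2.0],
                    "supplier": pd.Series(["A", "B", np.nan, "A"], dtype=object)})
target = pd.Series([1.0, 2.0, 3.0, 4.0])

# Numerical branch: fill gaps with the median, then standardize
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale",  StandardScaler())])

# Categorical branch: fill gaps with an explicit label, then one-hot encode
categorical_steps = Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
                              ("encode", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense array, which is easier to inspect
preprocessor = ColumnTransformer(transformers=[("num", numeric_steps,     ["kwh"]),
                                               ("cat", categorical_steps, ["supplier"])],
                                 sparse_threshold=0)

model = Pipeline([("preprocessor", preprocessor),
                  ("regressor",    LinearRegression())])
model.fit(toy, target)

prepared    = preprocessor.fit_transform(toy)
predictions = model.predict(toy)
print(prepared)
print(predictions)
```

Rows with a missing `total_cost` would still need to be dropped or handled separately, since imputation in the pipeline covers only the features.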

In [28]:
'''
Model Validation Using Holdout Sets

This script develops a model validation strategy using holdout sets from the `fct_electric_brew` dataset. The focus is
on forecasting energy consumption and associated costs across different periods, and the initial interest is in determining
which regressor might be best suited for a dataset of this size with this many key features, using standard parameters like
a holdout size of 30%.

Methodology:
    1. Analysis Setup: Call `setup_analysis` to prepare the underlying data and initialize the model pipelines.
    2. Dataset Splitting: Randomly split the data into training (70%) and holdout (30%) sets to validate each model against unseen data.
    3. Model Training: Fit models on training data, enabling them to uncover underlying data patterns.
    4. Model Evaluation & Visualization: Assess models on the holdout set, then visualize prediction accuracy and residuals.

Mathematical Concepts:
    • Linear Regression:
        - Formula: y = β₀ + β₁x₁ + ... + βₙxₙ
        - Purpose: Predicts a dependent variable value (y) based on independent variables (x).

    • Random Forest:
        - Concept: An ensemble learning method that constructs multiple decision trees at training and outputs the mode of 
                   classes for classification or mean prediction for regression.

    • Gradient Boosting:
        - Concept: Boosting method combining weak predictive models (typically decision trees) to create a strong predictor.

    • R² (Coefficient of Determination):
        - Formula: R² = 1 - Σ(yᵢ - f(xᵢ))² / Σ(yᵢ - ȳ)²
        - Purpose: Measures the proportion of the variance for the dependent variable explained by the independent variables.

Produces:
    - Trained and validated models with performance insights.
    - Visualizations highlighting model predictions and performance.
'''

import numpy   as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics         import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from utils.runtime           import find_project_root

# Step 1: Analysis Setup
setup    = setup_analysis()
features = setup['features']
target   = setup['target']
models   = setup['models']

# Step 2: Dataset Splitting
X_train, X_holdout, y_train, y_holdout = train_test_split(features, 
                                                          target, 
                                                          test_size    = 0.3, 
                                                          random_state = 42)

# Step 3: Training and Validation
for name, pipeline in models.items():
    pipeline.fit(X_train, y_train)

# Step 4: Model Evaluation & Visualization
plt.figure(figsize = (20, 10))

for i, (name, pipeline) in enumerate(models.items()):

    y_pred    = pipeline.predict(X_holdout)
    residuals = abs(y_holdout - y_pred)
    r2        = r2_score(y_holdout, y_pred)

    # Print model performance metrics
    print(f"{name}",
          f"    MSE : {mean_squared_error(y_holdout,  y_pred):.4f}",
          f"    MAE : {mean_absolute_error(y_holdout, y_pred):.4f}",    
          f"    R²  : {r2:.4f}",
          sep='\n')

    # Model visualization
    plt.subplot(1, len(models), i + 1)
    plt.plot(y_holdout, y_holdout, color = '1', linewidth = 2, linestyle = ':') # Perfect prediction line
    sns.scatterplot(x = y_holdout, 
                    y = y_pred, 
                    c = np.power(residuals, 0.3), # More aggressive colormapping (i.e. creates more distance from 0)
                    cmap      = 'cividis_r',                  
                    alpha     = 0.6, 
                    edgecolor = None)
    
    plt.xlabel('Actual Total Cost'    if i == 1 else '')
    plt.ylabel('Predicted Total Cost' if i == 0 else '')
    plt.title(f'{name} ($R^2$: {r2:.4f})')

plt.suptitle('Model Validation Using Holdout Sets', weight = 'bold', fontsize = 15)
plt.tight_layout()
plt.savefig(find_project_root('./fig/analysis/01 - Model Validation Using Holdout Sets.png'))
plt.show()
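
The three metrics printed for each model can be reproduced by hand. The sketch below uses four made-up values (not results from this notebook) and checks the manual formulas against scikit-learn's implementations.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted costs
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

errors = y_true - y_pred                         # [0.5, -0.5, 1.0, -0.5]

mse = np.mean(errors ** 2)                       # (0.25 + 0.25 + 1.0 + 0.25) / 4 = 0.4375
mae = np.mean(np.abs(errors))                    # (0.5 + 0.5 + 1.0 + 0.5) / 4 = 0.625
ss_res = np.sum(errors ** 2)                     # 1.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 1.75 / 20 = 0.9125

print(f"MSE : {mse:.4f} (sklearn: {mean_squared_error(y_true, y_pred):.4f})")
print(f"MAE : {mae:.4f} (sklearn: {mean_absolute_error(y_true, y_pred):.4f})")
print(f"R²  : {r2:.4f} (sklearn: {r2_score(y_true, y_pred):.4f})")
```

Because MSE squares each error, the single error of 1.0 contributes more than half of the total, which is why MSE and MAE can rank models differently when a few predictions are far off.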

In [27]:
'''
Temporal Consistency Analysis via Cross-Validation

This script conducts a detailed temporal consistency analysis using the RandomForestRegressor on the merged `fct_electric_brew` 
dataset with `dim_datetimes` features included. 

It aims to evaluate the model's performance over different time segments, focusing on its ability to forecast energy consumption 
and costs reliably across these periods.

Methodology:
    1. Analysis Setup: Call `setup_analysis` with temporal features only to prepare underlying data and initialize the pipeline.
    2. Cross-Validation Setup: Implement TimeSeriesSplit to assess model performance over successive time splits.
    3. RandomForest Evaluation: Train the RandomForestRegressor on temporal features only and evaluate using cross-validation.
    4. Annotated Visualization: Plot cross-validation scores with annotations for data volume in each split.

Mathematical Concepts:
    • Time Series Analysis:
        - Purpose: Analyze time-ordered data points to identify trends, patterns, and seasonal variations.
    
    • Cross-Validation (TimeSeriesSplit):
        - Purpose: Evaluate model's prediction reliability over time, ensuring that training occurs only on past data.
    
    • RandomForestRegressor:
        - Concept: An ensemble method using multiple decision trees to improve prediction accuracy and control over-fitting.
    
    • Mean Squared Error (MSE):
        - Formula: MSE = 1/n Σ(yᵢ - ŷᵢ)²
        - Purpose: Quantify the average squared difference between predicted and actual values, which penalizes large errors more heavily.

Produces:
    - Visualization of model performance over time, highlighting any variances and data volume.
'''

import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from utils.runtime           import find_project_root

# Step 1: Analysis Setup
setup    = setup_analysis(True) # Temporal features only
features = setup['features']
target   = setup['target']
model    = setup['models']['Random Forest']

# Step 2: Cross-Validation Setup
tscv  = TimeSeriesSplit(n_splits = 10)

# Step 3: RandomForest Training & Evaluation
cv_scores = cross_val_score(model, 
                            features, 
                            target, 
                            cv      = tscv, 
                            scoring = 'neg_mean_squared_error')

cv_norm = np.abs(cv_scores) / np.max(np.abs(cv_scores))

# Step 4: Annotated Visualization
bars = plt.bar(range(tscv.n_splits), cv_scores, color = plt.cm.cividis_r(cv_norm))

# Formatted Size Annotations
# `tscv.split` yields (train, test) index pairs, so each size is the training volume of that split
split_sizes = [len(train_index) for train_index, _ in tscv.split(features)]
for bar, size in zip(bars, split_sizes):
    plt.text(bar.get_x() + bar.get_width() / 2, 
             0, 
             f'Size: {size:,}',
             va   = 'bottom', 
             ha   = 'center')
    
plt.xlabel('Time Split')
plt.ylabel('Cross-Validation Score ($-MSE$)')
plt.title('Temporal Consistency of the Random Forest Regressor')
plt.xticks(range(tscv.n_splits), [f'Split {i + 1}' for i in range(tscv.n_splits)])

plt.tight_layout()
plt.savefig(find_project_root('./fig/analysis/02 - Temporal Consistency of the Random Forest Regressor.png'))
plt.show()

KeyboardInterrupt: 
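
To make the behavior of `TimeSeriesSplit` concrete, the sketch below runs it on twelve placeholder rows (an assumption for illustration, far smaller than the real dataset) and prints which row positions land in each training and test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder rows, assumed to be already sorted chronologically
X = np.arange(12).reshape(-1, 1)

tscv   = TimeSeriesSplit(n_splits=3)
splits = [(train_index, test_index) for train_index, test_index in tscv.split(X)]

for i, (train_index, test_index) in enumerate(splits, start=1):
    print(f"Split {i}: "
          f"train rows {train_index.min()}-{train_index.max()} ({len(train_index)} rows), "
          f"test rows {test_index.min()}-{test_index.max()} ({len(test_index)} rows)")

train_sizes = [len(train_index) for train_index, _ in splits]
```

Each training window is an expanding prefix of the data and each test window comes strictly after it. The splitter works on row positions rather than timestamps, so the "past only" guarantee holds only if the rows are sorted by time before splitting.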