# Linear Regression Model

Based on the previous analysis, it seems a linear regression model can be built here in which Demand is related to Temperature. I think it would be best to have an __ensemble__ algorithm where:

1) __RANSAC Linear Regression__: Look at air temperature vs. Demand for given days of the year. Since I have multiple grids, I would take the median prediction across them, which should be robust to any single unrepresentative grid.
2) __Machine Learning__: Use an ML model to tidy up the prediction and look for micro trends.

In the LR model I would like to run a regression for every grid_id for each day of the year where y = demand and x = airTemperature_min. The result will look something like this: 

```python
# The outer key is the day of the year; each inner key is a grid ID.
{1: {'52.00000_-0.50000': {'slope': -0.9802276889431906, 'intercept': 19.459949154023224},
     '52.00000_0.00000': {'slope': -0.9310295962378231, 'intercept': 19.345573208158548},
     '52.00000_0.50000': {'slope': -0.9317602754566605, 'intercept': 19.110430006500174},
     '51.50000_-0.50000': {'slope': -0.9569986102825089, 'intercept': 19.978916812381136},
     ...},
 ...}
```

Each grid and day has its own gradient (slope) and intercept. To make the regression more robust, I will use RANSAC, which is resistant to outliers. I will then run a prediction for each day in my training and test data and use the LR output as a proxy prediction. These proxy predictions are then fed into the second-step machine learning model.
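
To illustrate why RANSAC copes with outliers, here is a minimal pure-Python sketch of the idea for a single grid and day. It is not the `InitialLinearRegression` implementation (whose internals are not shown here); the function names `fit_line` and `ransac_line`, the residual threshold, and the toy temperature/demand values are all assumptions made for this example. In practice a library implementation such as scikit-learn's `RANSACRegressor` would likely be used instead.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, threshold=1.0, n_trials=200, seed=0):
    """Hypothetical helper: keep the largest consensus set, then refit on it."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        # Propose a candidate line through two randomly chosen points
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate, cannot be written as y = m*x + c
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Points whose residual is within the threshold support this candidate
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no usable consensus set found")
    # Final least-squares fit using only the consensus set
    return fit_line(best_inliers)


# Toy data (assumed): demand falls linearly with temperature, plus two outliers
temperatures = [-2, 0, 1, 3, 4, 6, 8, 10, 12, 15]
points = [(t, 19.5 - 0.95 * t) for t in temperatures]
points += [(5, 40.0), (7, 2.0)]  # outliers, e.g. bad readings

slope, intercept = ransac_line(points)
ols_slope, ols_intercept = fit_line(points)
print(f"RANSAC: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"Plain least squares: slope={ols_slope:.3f}, intercept={ols_intercept:.3f}")
```

The plain least-squares fit is pulled away from the underlying line by the two outliers, whereas the RANSAC fit ignores them because they never join the largest consensus set.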

In [1]:
import sys
import pandas as pd
import warnings
import os
warnings.filterwarnings('ignore')
sys.path.append('C:/projects/python/time-series-analysis')

Filter the locations to only include GDUK_EAST_ANGLIA

In [2]:
grid_locations = pd.read_csv("../data/raw/network_gridlocations_&_population.csv")
demand_data = pd.read_csv("../data/raw/demand_data/GDUK_EAST_ANGLIA.csv")
location = "GDUK EAST ANGLIA"
grid_locations = grid_locations[grid_locations['PLANT']==location]

# Linear Regression RANSAC

Now we need a function that takes this list of grid IDs and runs a linear regression of demand vs. air temperature. We do this for all the grids and then take the median value as an initial prediction. We import our pre-built class, which already does this.

In [3]:
from src.util.Initial_Linear_Regression import InitialLinearRegression

In [4]:
model = InitialLinearRegression(grid_locations, demand_data, "../data/raw/noaa")
model.preprocess_demand_data()
model.load_and_preprocess_weather_data()
model.fit_models()


Demand Data Pivot:
Weather Data for GRID ID 52.00000_-0.50000 Processed
Weather Data for GRID ID 52.00000_0.00000 Processed
Weather Data for GRID ID 52.00000_0.50000 Processed
Weather Data for GRID ID 51.50000_-0.50000 Processed
Weather Data for GRID ID 52.50000_0.00000 Processed
Weather Data for GRID ID 52.00000_1.00000 Processed
Weather Data for GRID ID 52.50000_1.50000 Processed
Weather Data for GRID ID 51.50000_0.00000 Processed
Weather Data for GRID ID 52.50000_0.50000 Processed
Weather Data for GRID ID 51.50000_0.50000 Processed
Weather Data for GRID ID 52.00000_1.50000 Processed
Weather Data for GRID ID 52.50000_1.00000 Processed
Weather Data for GRID ID 53.00000_1.00000 Processed
Weather Data for GRID ID 53.00000_1.50000 Processed
Weather Data for GRID ID 52.50000_-0.50000 Processed
Weather Data for GRID ID 53.00000_0.50000 Processed
Weather Data for GRID ID 51.50000_1.00000 Processed
Weather Data for GRID ID 53.50000_-2.00000 Processed
Weather Data for GRID ID 52.50000_-2.5000

Fitting models:  99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 363/365 [01:04<00:00,  7.30it/s]

No valid data or insufficient data points for day 362, GRID ID 52.50000_-0.50000
No valid data or insufficient data points for day 363, GRID ID 52.50000_-0.50000


Fitting models: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 365/365 [01:04<00:00,  5.66it/s]

No valid data or insufficient data points for day 364, GRID ID 52.50000_-0.50000
No valid data or insufficient data points for day 365, GRID ID 52.50000_-0.50000
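
Once the per-day, per-grid coefficients exist, the initial prediction is the median of the individual grid predictions. The sketch below shows that step on the nested dictionary layout described earlier. The function name `median_prediction`, the temperature of 5.0, and the choice to skip grids that have no fitted model for a given day (as happens for some days in the log above) are assumptions for this example, not necessarily how `InitialLinearRegression` behaves.

```python
from statistics import median

# Day of year -> grid ID -> fitted coefficients (values copied from the example above)
coefficients = {
    1: {
        "52.00000_-0.50000": {"slope": -0.9802276889431906, "intercept": 19.459949154023224},
        "52.00000_0.00000": {"slope": -0.9310295962378231, "intercept": 19.345573208158548},
        "52.00000_0.50000": {"slope": -0.9317602754566605, "intercept": 19.110430006500174},
        "51.50000_-0.50000": {"slope": -0.9569986102825089, "intercept": 19.978916812381136},
    }
}


def median_prediction(coefficients, day_of_year, temperatures):
    """Median of the per-grid linear predictions for one day.

    `temperatures` maps grid ID -> airTemperature_min for that day.
    Grids without a fitted model for the day are skipped (an assumption).
    """
    day_models = coefficients.get(day_of_year, {})
    predictions = [
        day_models[grid_id]["slope"] * temp + day_models[grid_id]["intercept"]
        for grid_id, temp in temperatures.items()
        if grid_id in day_models
    ]
    if not predictions:
        return float("nan")  # nothing to aggregate for this day
    return median(predictions)


# Assumed minimum air temperatures for day 1; the last grid has no model in
# this toy dictionary and is therefore ignored
temperatures = {
    "52.00000_-0.50000": 5.0,
    "52.00000_0.00000": 5.0,
    "52.00000_0.50000": 5.0,
    "51.50000_-0.50000": 5.0,
    "52.50000_-0.50000": 5.0,
}

prediction = median_prediction(coefficients, 1, temperatures)
print(f"Median proxy prediction for day 1: {prediction:.4f}")
```

Using the median rather than the mean means that one grid with an unusual fit shifts the combined prediction very little.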





### Data Pre-Processing

First we need to process the data. I will use a standard scaler and, as a first attempt, feed as much information as I can into the model.

In [8]:
import os
import numpy as np
import pickle

def apply_prediction(row, model, grid_id):
    day_of_year = None  # Defined up front so the error message can always reference it
    try:
        day_of_year = pd.to_datetime(row['date']).dayofyear
        temperature = row['airTemperature_min']
        return model.predict(grid_id, day_of_year, temperature)
    except Exception as e:
        print(f"Error with prediction for grid {grid_id} on day {day_of_year}: {e}")
        return np.nan  # Return NaN to handle errors gracefully

def process_weather_data(grid_locations, demand_data, test_period, model, save_path="../data/processed/machine_learning"):
    dataframe_list = []

    # Loop through each grid ID, load the data, rename columns, and append to list
    for grid_id in grid_locations['GRID_ID']:
        df = pd.read_csv(f"../data/raw/noaa/{grid_id}.csv")
        df['date'] = pd.to_datetime(df['date'])  # Ensure 'date' is in datetime format
        df['linear_prediction'] = df.apply(apply_prediction, axis=1, args=(model, grid_id))
        df.rename(columns={col: f"{col}_{grid_id.replace('.', '')}" if col != 'date' else col for col in df.columns}, inplace=True)
        dataframe_list.append(df)

    # Concatenate all DataFrames along columns
    combined_df = pd.concat(dataframe_list, axis=1)
    combined_df = combined_df.loc[:,~combined_df.columns.duplicated()]

    # Convert 'date' to datetime and ensure no duplicated date columns
    combined_df['date'] = pd.to_datetime(combined_df['date'].drop_duplicates())
    
    # Merge with demand data, ensuring both 'date' columns are datetime
    demand_data['date'] = pd.to_datetime(demand_data['Applicable For'])
    demand_data = demand_data[['date', 'Value']]
    combined_df = pd.merge(combined_df, demand_data, on='date', how='left')

    # Create new time-based features
    combined_df['day_of_year'] = combined_df['date'].dt.dayofyear
    combined_df['day_of_week'] = combined_df['date'].dt.dayofweek
    combined_df['day_of_month'] = combined_df['date'].dt.day
    combined_df['month'] = combined_df['date'].dt.month
    combined_df['year'] = combined_df['date'].dt.year
    combined_df.dropna(inplace=True)

    # Convert all relevant columns to float
    combined_df = combined_df.astype({col: 'float' for col in combined_df.columns if col not in ['date']})
    
    # Sort combined dataframe by date
    combined_df = combined_df.sort_values('date')

    # Manually split the data into training and testing sets based on the last 'test_period' days
    split_point = combined_df.shape[0] - test_period
    split_point_dates = combined_df.iloc[split_point:]['date'].tolist()  # Get split point dates

    X = combined_df.drop(columns=['Value', 'date'])  # Assuming 'date' should also be dropped from features
    y = combined_df['Value'].astype(float)  # Ensure target is also float
    X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
    y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

    # Convert DataFrames to numpy arrays
    processed_data = {
        'X_train': X_train.to_numpy(),
        'X_test': X_test.to_numpy(),
        'y_train': y_train.to_numpy(),
        'y_test': y_test.to_numpy(),
        'split_point_dates': split_point_dates
    }

    # Make sure the output directory exists before writing anything to it
    os.makedirs(save_path, exist_ok=True)

    # Save column names
    column_names = X.columns.tolist()
    with open(os.path.join(save_path, "column_names.pkl"), 'wb') as f:
        pickle.dump(column_names, f)
    
    # Save the processed data as NumPy arrays
    # combined_df.to_csv(os.path.join(save_path, "processed_data.csv"), index=False)
    np.save(os.path.join(save_path, "X_train.npy"), X_train)
    np.save(os.path.join(save_path, "X_test.npy"), X_test)
    np.save(os.path.join(save_path, "y_train.npy"), y_train)
    np.save(os.path.join(save_path, "y_test.npy"), y_test)
    np.save(os.path.join(save_path, "split_point_dates.npy"), np.array(split_point_dates))
    
    return processed_data


In [9]:
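# process_weather_data returns a dict, so pull the arrays out by key
processed = process_weather_data(grid_locations, demand_data, 360, model)
X_train, X_test = processed['X_train'], processed['X_test']
y_train, y_test = processed['y_train'], processed['y_test']
y_test_dates = processed['split_point_dates']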
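
The standard-scaler step mentioned under Data Pre-Processing does not appear in `process_weather_data` itself. A minimal sketch of how it could follow the chronological split is shown below: the mean and standard deviation are computed from the training rows only and then reused for the test rows, so no information from the test period leaks into the scaling. The helper names `fit_scaler` and `transform` and the toy feature values are assumptions for this example. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Return per-column means and standard deviations for the given rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardise each value as (value - column mean) / column std."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy rows ordered by date (assumed): [airTemperature_min, linear_prediction]
features = [
    [2.0, 17.5], [4.0, 15.6], [6.0, 13.7],
    [8.0, 11.9], [10.0, 10.0], [12.0, 8.1],
]

# Chronological split: the last `test_period` rows form the test set
test_period = 2
split_point = len(features) - test_period
train_rows, test_rows = features[:split_point], features[split_point:]

# Fit on the training rows only, then apply the same statistics to both sets
means, stds = fit_scaler(train_rows)
train_scaled = transform(train_rows, means, stds)
test_scaled = transform(test_rows, means, stds)

print("Training means:", means)
print("First scaled test row:", test_scaled[0])
```

Because the test rows are scaled with training statistics, their standardised values need not have zero mean or unit variance, which is expected.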