# Grid Consumption Forecasting with PySpark

This notebook demonstrates how to forecast grid consumption using PySpark. It involves:
- Loading and preprocessing weather and consumption data.
- Feature engineering, including temporal and lagged features.
- Training a Random Forest Regressor for prediction.
- Evaluating the model's performance.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lag, month, dayofweek, hour, unix_timestamp, from_unixtime
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

## Initialize Spark Session

In [2]:
spark = SparkSession.builder \
    .appName("Grid Consumption Forecasting") \
    .getOrCreate()


## Load Weather and Consumption Data

Load the weather and grid consumption data from CSV files.

In [3]:
weather_file = "../dataset/weather_data.csv"
consumption_file = "../dataset/grid_consumption.csv"

weather_data = spark.read.csv(weather_file, header=True, inferSchema=True)
consumption_data = spark.read.csv(consumption_file, header=True, inferSchema=True)

Alternatively, when running in Google Colab, mount Google Drive and load the CSV files from the mounted drive. Run only one of the two loading cells.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# File paths to point to the mounted drive
weather_file = "/content/drive/MyDrive/Weather_Data/weather_data.csv"
consumption_file = "/content/drive/MyDrive/Weather_Data/grid_consumption.csv"

weather_data = spark.read.csv(weather_file, header=True, inferSchema=True)
consumption_data = spark.read.csv(consumption_file, header=True, inferSchema=True)

## Data Preprocessing

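Cast the `Date` columns to Spark timestamps, then truncate each weather timestamp down to the start of its hour so it can be matched against the consumption records. After this cell, the weather `Date` column holds epoch seconds as an integer rather than a timestamp.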

In [4]:
# Parse the Date columns as Spark timestamps
weather_data = weather_data.withColumn("Date", col("Date").cast("timestamp"))
consumption_data = consumption_data.withColumn("Date", col("Date").cast("timestamp"))

# Floor weather timestamps to the hour; Date now holds epoch seconds (integer)
weather_data = weather_data.withColumn("Date", (unix_timestamp("Date") / 3600).cast("int") * 3600)
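
The flooring step relies on integer division of epoch seconds. A minimal plain-Python sketch of the same arithmetic, independent of Spark, is shown here. The helper name `floor_to_hour` and the sample timestamp are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime, timezone


def floor_to_hour(ts: datetime) -> datetime:
    """Truncate a timezone-aware datetime to the start of its hour (UTC)."""
    epoch_seconds = int(ts.timestamp())
    # Integer-divide by 3600 to drop minutes and seconds, then scale back up
    floored_seconds = (epoch_seconds // 3600) * 3600
    return datetime.fromtimestamp(floored_seconds, tz=timezone.utc)


# Hypothetical weather reading taken at 14:47:10 UTC
sample = datetime(2024, 12, 4, 14, 47, 10, tzinfo=timezone.utc)
floored_sample = floor_to_hour(sample)
print(floored_sample.isoformat())
```

Note that this truncates rather than rounds: a reading at 14:47 is assigned to 14:00, not 15:00.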

## Merge Datasets

Join the weather and consumption datasets on `City` and `Date`.

In [5]:
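# Convert weather_data.Date from epoch seconds back to a timestamp.
# from_unixtime returns a string, so cast the result explicitly to match consumption_data.Date.
weather_data = weather_data.withColumn("Date", from_unixtime(col("Date")).cast("timestamp"))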

# Perform the join
merged_data = consumption_data.join(weather_data, on=["City", "Date"], how="inner")
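
An inner join keeps only key pairs present on both sides, and it emits one output row per matching pair. The plain-Python sketch below illustrates both effects on toy data. The `inner_join` helper and all values are hypothetical; whether the real weather file contains more than one reading per city and hour is an assumption worth checking, since the hourly flooring would then produce duplicate rows.

```python
def inner_join(left, right, keys):
    """Join two lists of dicts on the given key names, keeping only matches."""
    index = {}
    for right_row in right:
        index.setdefault(tuple(right_row[k] for k in keys), []).append(right_row)

    joined = []
    for left_row in left:
        for right_row in index.get(tuple(left_row[k] for k in keys), []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical data: Date is epoch seconds already floored to the hour
toy_consumption = [
    {"City": "A", "Date": 0, "Consumption": 10.0},
    {"City": "A", "Date": 3600, "Consumption": 12.0},
    {"City": "A", "Date": 7200, "Consumption": 11.0},
]
toy_weather = [
    {"City": "A", "Date": 0, "Temperature": 5.0},
    {"City": "A", "Date": 3600, "Temperature": 6.0},
    {"City": "A", "Date": 3600, "Temperature": 6.5},  # second reading in the same hour
]

joined = inner_join(toy_consumption, toy_weather, keys=["City", "Date"])
for row in joined:
    print(row)
```

In this toy case the hour at 7200 disappears because it has no weather reading, and the hour at 3600 appears twice. Comparing row counts before and after the join is one way to spot either effect in the real data.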

## Feature Engineering

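Add temporal features (hour, day of week, month) and the previous 24 consumption observations per city as lag features. Spark's `dayofweek` returns 1 for Sunday through 7 for Saturday. Rows containing any null value, including the first 24 rows of each city that lack a full lag history, are dropped.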

In [6]:
merged_data = merged_data \
    .withColumn("Hour", hour(col("Date"))) \
    .withColumn("DayOfWeek", dayofweek(col("Date"))) \
    .withColumn("Month", month(col("Date")))

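# Lag_1..Lag_24: consumption from the previous 24 rows of the same city, ordered by Date
window_spec = Window.partitionBy("City").orderBy("Date")
for lag_val in range(1, 25):
    merged_data = merged_data.withColumn(f"Lag_{lag_val}", lag("Consumption (MW)", lag_val).over(window_spec))

# Drop rows containing a null in any column (e.g. rows without a full lag history)
merged_data = merged_data.dropna()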
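
To make the behavior of the per-city lag window concrete, here is a minimal plain-Python sketch of the same idea on toy data. The `add_lags` helper, the simplified column name `Consumption`, and the values are illustrative assumptions; the notebook itself uses Spark's `lag` over a window.

```python
from collections import defaultdict


def add_lags(rows, n_lags):
    """Attach Lag_1..Lag_n to each row, computed per city in Date order.

    Missing history is represented as None, mirroring Spark's null.
    """
    by_city = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["City"], r["Date"])):
        by_city[row["City"]].append(row)

    out = []
    for city_rows in by_city.values():
        for i, row in enumerate(city_rows):
            enriched = dict(row)
            for k in range(1, n_lags + 1):
                # Value k rows earlier within the same city, if it exists
                enriched[f"Lag_{k}"] = (
                    city_rows[i - k]["Consumption"] if i - k >= 0 else None
                )
            out.append(enriched)
    return out


# Hypothetical toy data: an hourly index stands in for the timestamp
toy_rows = [
    {"City": "A", "Date": 0, "Consumption": 10.0},
    {"City": "A", "Date": 1, "Consumption": 12.0},
    {"City": "A", "Date": 2, "Consumption": 11.0},
    {"City": "B", "Date": 0, "Consumption": 20.0},
    {"City": "B", "Date": 1, "Consumption": 21.0},
]

lagged = add_lags(toy_rows, n_lags=2)
# Equivalent of dropna(): keep only rows with a complete lag history
complete = [r for r in lagged if all(v is not None for v in r.values())]
print(complete)
```

With two lags, only the third row of city A survives, because every other row is missing at least one earlier value. Lags never cross city boundaries.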

## Prepare Data for Modeling

Use a `VectorAssembler` to combine features into a single vector for model training.

In [7]:
feature_columns = [
    "Temperature (C)", "Feels Like (C)", "Humidity (%)", "Pressure (hPa)",
    "Wind Speed (m/s)", "Cloudiness (%)", "Rain (1h mm)", "Hour", "DayOfWeek", "Month"
] + [f"Lag_{lag_val}" for lag_val in range(1, 25)]

assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(merged_data).select("features", col("Consumption (MW)").alias("label"))
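
Conceptually, the assembler places each row's values into one vector in the order given by `feature_columns`. The plain-Python sketch below mirrors that ordering on a single hypothetical row. The `assemble` helper and the placeholder values are assumptions for illustration only.

```python
feature_columns = [
    "Temperature (C)", "Feels Like (C)", "Humidity (%)", "Pressure (hPa)",
    "Wind Speed (m/s)", "Cloudiness (%)", "Rain (1h mm)", "Hour", "DayOfWeek", "Month"
] + [f"Lag_{lag_val}" for lag_val in range(1, 25)]


def assemble(row, columns):
    """Return the row's values as a list of floats, in the order of `columns`."""
    return [float(row[name]) for name in columns]


# Hypothetical row: every feature set to a placeholder, two set to sample values
sample_row = {name: 0.0 for name in feature_columns}
sample_row["Hour"] = 14
sample_row["Lag_1"] = 512.3

vector = assemble(sample_row, feature_columns)
print(len(vector))  # 10 weather/temporal features + 24 lags = 34
```

By default `VectorAssembler` raises an error when it meets a null input, which is one reason the earlier `dropna()` step matters.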

## Split Data into Training and Testing Sets

In [8]:
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)
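
A random split mixes earlier and later observations, so the training set can contain hours that come after some test hours. For forecasting, a chronological split is a common alternative. The sketch below shows the idea in plain Python. The `chronological_split` helper and toy rows are hypothetical, and applying this in the notebook would require keeping the `Date` column alongside `features` and `label`.

```python
def chronological_split(rows, train_fraction=0.8, time_key="Date"):
    """Split rows so every training row precedes every test row in time."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical toy data: ten hourly observations
toy = [{"Date": t, "label": float(t)} for t in range(10)]
train_rows, test_rows = chronological_split(toy)

print(len(train_rows), len(test_rows))
```

The evaluation then measures how well the model predicts hours it has not seen and that lie strictly in the future of its training data.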

## Train a Random Forest Regressor

In [None]:
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=100, maxDepth=10)
model = rf.fit(train_data)

## Evaluate the Model

In [None]:
test_predictions = model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(test_predictions)

print(f"Root Mean Squared Error (RMSE): {rmse}")
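
RMSE is the square root of the mean squared difference between labels and predictions, expressed in the same unit as the label (MW here). A minimal plain-Python sketch of the formula on hypothetical values:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical consumption values in MW
toy_labels = [100.0, 110.0, 120.0]
toy_predictions = [103.0, 106.0, 120.0]

toy_rmse = rmse(toy_labels, toy_predictions)
print(f"RMSE: {toy_rmse:.2f}")
```

Here the squared errors are 9, 16, and 0, so the result is the square root of 25/3, roughly 2.89 MW. Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.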

## Save the Model for Future Use

In [None]:
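# write().overwrite() replaces an existing directory; plain save() raises an error if the path already exists
model.write().overwrite().save("grid_consumption_rf_model")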