XGBoost (Extreme Gradient Boosting): XGBoost is a popular machine learning algorithm that is particularly effective for modeling structured (tabular) data. It builds an ensemble of decision trees with gradient boosting: each new tree is fit to correct the errors left by the trees before it.
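
To make the boosting idea concrete, here is a minimal sketch of plain gradient boosting with squared-error loss, built from scikit-learn decision trees. It is not XGBoost's implementation (XGBoost adds regularization and second-order gradient information, among other things); the synthetic data, `learning_rate`, and `n_rounds` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made-up data, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
pred = np.full_like(y, y.mean())
mse_start = np.mean((y - pred) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is the residual
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a shrunken copy of the new tree's output to the ensemble prediction
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

mse_end = np.mean((y - pred) ** 2)
print(f"training MSE before boosting: {mse_start:.4f}")
print(f"training MSE after {n_rounds} rounds: {mse_end:.4f}")
```

Each round fits a small tree to the current residuals, so the training error shrinks step by step; the learning rate controls how much of each tree's correction is applied.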

This notebook uses XGBoost to analyze the dataset, fitting one regression model per target column.

1. Import the necessary libraries.

In [51]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

2. Load the dataset into a Pandas DataFrame and preprocess it as needed:

In [52]:
# Load the CSV file into a pandas DataFrame
df = pd.read_csv("data.csv")
df = df.dropna() # remove any rows with missing data
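
Before dropping rows, it can help to see how much data `dropna()` removes. The sketch below uses a small made-up DataFrame (the values are hypothetical; only the column names come from this notebook) to count missing values per column and the number of rows lost.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    "GR_solar_generation_actual": [0.0, np.nan, 15.0, 16.0],
    "GR_wind_onshore_generation_actual": [6.7, 7.3, np.nan, 4.1],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```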

3. Define the model

In [53]:
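# Define the XGBoost model and its hyperparameters
model = xgb.XGBRegressor(
    objective='reg:squarederror',  # squared-error loss for regression
    n_estimators=100,              # number of boosted trees
    max_depth=6,                   # maximum depth of each tree
    learning_rate=0.1,             # shrinkage applied to each tree's contribution
    subsample=0.5                  # fraction of training rows sampled per tree
)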

4. Predict GR_load_actual_entsoe_transparency

In [54]:
# Split the data into features (X) and target (y)
df['utc_timestamp'] = pd.to_datetime(df['utc_timestamp'])
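# Note: X still contains the target column (see X_train.head() below),
# so the model sees the value it is asked to predict (target leakage)
X = df.drop('utc_timestamp', axis=1)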
y = df['GR_load_actual_entsoe_transparency']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
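
As a point of comparison, the sketch below builds a feature matrix that excludes the target column and splits the rows chronologically rather than at random. The helper `make_xy` and the toy DataFrame are hypothetical (only the column names come from this notebook), and whether a time-ordered split is the right choice depends on how the model is meant to be used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's DataFrame (values are made up)
rng = np.random.default_rng(0)
n = 10
toy = pd.DataFrame({
    "utc_timestamp": pd.date_range("2020-01-01", periods=n, freq="D"),
    "GR_load_actual_entsoe_transparency": rng.integers(40, 80, n),
    "GR_load_forecast_entsoe_transparency": rng.integers(40, 80, n),
    "GR_solar_generation_actual": rng.uniform(0, 20, n),
    "GR_wind_onshore_generation_actual": rng.uniform(0, 20, n),
})

def make_xy(frame, target, extra_drop=("utc_timestamp",)):
    """Return (X, y) with the target and any extra columns removed from X."""
    X = frame.drop(columns=[target, *extra_drop])
    y = frame[target]
    return X, y

target = "GR_load_actual_entsoe_transparency"
X, y = make_xy(toy.sort_values("utc_timestamp"), target)

# Chronological split: first 80% of rows for training, last 20% for testing
cut = int(len(X) * 0.8)
X_train, X_test = X.iloc[:cut], X.iloc[cut:]
y_train, y_test = y.iloc[:cut], y.iloc[cut:]

print(list(X.columns))
print(len(X_train), len(X_test))
```

With the real data, `extra_drop` could also include an index-like column such as `Unnamed: 0`. Passing `shuffle=False` to `train_test_split` on time-sorted data gives a similar ordered split, and passing `random_state` makes a shuffled split reproducible.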

In [55]:
X_train.head()

Unnamed: 0,GR_load_actual_entsoe_transparency,GR_load_forecast_entsoe_transparency,GR_solar_generation_actual,GR_wind_onshore_generation_actual
4307,63612300,66780000,15960000.0,6700000.0
22942,60766200,60908800,0.0,7340000.0
9403,72354900,73960000,10000.0,1740000.0
10513,50146100,49477100,0.0,4120000.0
14096,60238600,59380100,9180000.0,16690000.0


Train and predict

In [56]:
# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

In [57]:
y_pred

array([66201820., 59507024., 41082864., ..., 48974040., 66303360.,
       63670044.], dtype=float32)

In [58]:
y_test

1023     66171500
6492     59467600
10733    41086100
13745    74282200
12932    70842200
           ...   
12167    47630700
18353    59089100
5449     48956900
23313    66414500
3586     63641800
Name: GR_load_actual_entsoe_transparency, Length: 4815, dtype: int64

In [59]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 4179604559.307996
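
A mean squared error of this size is hard to read because it is in squared units of the target. Taking the square root gives the RMSE, which is in the same units as the target and comes to about 64,650 here. The sketch below does that conversion for the value printed above; `typical_load` is an assumed round figure based on the `y_test` values shown earlier, not a computed statistic.

```python
import math

mse = 4179604559.307996  # value printed by the evaluation cell above

# RMSE is in the same units as the target column
rmse = math.sqrt(mse)

# Assumption: a rough magnitude for the target, read off the y_test output
typical_load = 60_000_000

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE relative to a typical value: {rmse / typical_load:.2%}")
```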


5. Predict GR_load_forecast_entsoe_transparency

In [60]:
# Split the data into features (X) and target (y)
df['utc_timestamp'] = pd.to_datetime(df['utc_timestamp'])
# Note: X still contains the target column (target leakage)
X = df.drop('utc_timestamp', axis=1)
y = df['GR_load_forecast_entsoe_transparency']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [61]:
# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

In [62]:
y_pred

array([51595084., 40720100., 76391744., ..., 60933364., 55472148.,
       63189272.], dtype=float32)

In [63]:
y_test

23022    51611600
15097    40674700
540      76450000
18544    66851900
723      49580000
           ...   
11302    56961700
14268    61179300
18301    60950300
5192     55570000
7123     63150000
Name: GR_load_forecast_entsoe_transparency, Length: 4815, dtype: int64

In [64]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 3971697205.2060227


6. Predict GR_solar_generation_actual

In [65]:
# Split the data into features (X) and target (y)
df['utc_timestamp'] = pd.to_datetime(df['utc_timestamp'])
# Note: X still contains the target column (target leakage)
X = df.drop('utc_timestamp', axis=1)
y = df['GR_solar_generation_actual']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [66]:
# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

In [67]:
y_pred

array([1.6871027e+02, 1.6562167e+07, 1.5606325e+07, ..., 1.6871027e+02,
       1.5405820e+07, 1.6871027e+02], dtype=float32)

In [68]:
y_test

1397            0.0
14245    16590000.0
5509     15590000.0
17098     1500000.0
7570     11510000.0
            ...    
5628     17580000.0
11764           0.0
71              0.0
13069    15390000.0
14781           0.0
Name: GR_solar_generation_actual, Length: 4815, dtype: float64

In [69]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 178525867.4920639


7. Predict GR_wind_onshore_generation_actual

In [70]:
# Split the data into features (X) and target (y)
df['utc_timestamp'] = pd.to_datetime(df['utc_timestamp'])
# Note: X still contains the target column (target leakage)
X = df.drop('utc_timestamp', axis=1)
y = df['GR_wind_onshore_generation_actual']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [71]:
# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

In [72]:
y_pred

array([17332950. ,  1454575.5, 11891878. , ...,  4290031.5, 13257710. ,
       19425660. ], dtype=float32)

In [73]:
y_test

20216    17390000.0
875       1460000.0
18064    11890000.0
8562      1310000.0
7945     10260000.0
            ...    
5634      6500000.0
18510     8960000.0
12785     4290000.0
16126    13260000.0
19828    19420000.0
Name: GR_wind_onshore_generation_actual, Length: 4815, dtype: float64

In [74]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 227438983.1501955
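
The four sections above repeat the same split, fit, and evaluate steps with a different target each time. One way to express that as a loop is sketched below. The helper `evaluate_targets` and the synthetic data are hypothetical, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in so the sketch runs without XGBoost installed; passing `lambda: xgb.XGBRegressor(...)` as `make_model` would use the model from this notebook instead. The sketch assumes non-feature columns such as `utc_timestamp` have already been dropped, and it excludes each target from its own feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

TARGETS = [
    "GR_load_actual_entsoe_transparency",
    "GR_load_forecast_entsoe_transparency",
    "GR_solar_generation_actual",
    "GR_wind_onshore_generation_actual",
]

def evaluate_targets(frame, targets, make_model, seed=0):
    """Fit one fresh model per target and return {target: test MSE}."""
    scores = {}
    for target in targets:
        # Exclude the target itself so it cannot leak into the features
        X = frame.drop(columns=[target])
        y = frame[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = make_model()  # new, unfitted model for every target
        model.fit(X_train, y_train)
        scores[target] = mean_squared_error(y_test, model.predict(X_test))
    return scores

# Synthetic data with the notebook's column names (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(0, 100, size=(200, 4)), columns=TARGETS)

scores = evaluate_targets(
    toy,
    TARGETS,
    make_model=lambda: GradientBoostingRegressor(n_estimators=20, random_state=0),
)
for name, mse in scores.items():
    print(f"{name}: MSE = {mse:.2f}")
```

Creating the model inside the loop keeps each target's fit independent, and the fixed `random_state` makes the split repeatable across runs.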
