# Overview

The goal of the competition is to create an energy prediction model of prosumers to reduce energy imbalance costs.

This competition aims to tackle the issue of energy imbalance, a situation where the energy expected to be used doesn't line up with the actual energy used or produced. Prosumers, who both consume and generate energy, contribute a large part of the energy imbalance. Despite being only a small part of all consumers, their unpredictable energy use causes logistical and financial problems for the energy companies.

# Business Understanding

The number of prosumers is rapidly increasing, and solving the problems of energy imbalance and their rising costs is vital. If left unaddressed, this could lead to increased operational costs, potential grid instability, and inefficient use of energy resources. If this problem were effectively solved, it would significantly reduce the imbalance costs, improve the reliability of the grid, and make the integration of prosumers into the energy system more efficient and sustainable. Moreover, it could potentially incentivize more consumers to become prosumers, knowing that their energy behavior can be adequately managed, thus promoting renewable energy production and use.

Enefit is one of the biggest energy companies in the Baltic region. As experts in the field of energy, we help customers plan their green journey in a personal and flexible manner as well as implement it by using environmentally friendly energy solutions.

At present, Enefit is attempting to solve the imbalance problem by developing internal predictive models and relying on third-party forecasts. However, these methods have proven to be insufficient due to their low accuracy in forecasting the energy behavior of prosumers. The shortcomings of these current methods lie in their inability to accurately account for the wide range of variables that influence prosumer behavior, leading to high imbalance costs. By opening up the challenge to the world's best data scientists through the Kaggle platform, Enefit aims to leverage a broader pool of expertise and novel approaches to improve the accuracy of these predictions and consequently reduce the imbalance and associated costs.

# Data Identification
Your challenge in this competition is to predict the amount of electricity produced and consumed by Estonian energy customers who have installed solar panels. You'll have access to weather data, the relevant energy prices, and records of the installed photovoltaic capacity.

This is a forecasting competition using the time series API. The private leaderboard will be determined using real data gathered after the submission period closes.

**train.csv**

* `county` - An ID code for the county.
* `is_business` - Boolean for whether or not the prosumer is a business.
* `product_type` - ID code with the following mapping of codes to contract types: `{0: "Combined", 1: "Fixed", 2: "General service", 3: "Spot"}`.
* `target` - The consumption or production amount for the relevant segment for the hour. The segments are defined by the `county`, `is_business`, and `product_type`.
* `is_consumption` - Boolean for whether or not this row's target is consumption or production.
* `datetime` - The Estonian time in EET (UTC+2) / EEST (UTC+3).
* `data_block_id` - All rows sharing the same `data_block_id` will be available at the same forecast time. This is a function of what information is available when forecasts are actually made, at 11 AM each morning. For example, if the forecast weather `data_block_id` for predictions made on October 31st is 100, then the historic weather `data_block_id` for October 31st will be 101, as the historic weather data is only actually available the next day.
* `row_id` - A unique identifier for the row.
* `prediction_unit_id` - A unique identifier for the `county`, `is_business`, and `product_type` combination. New prediction units can appear or disappear in the test set.
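
The `data_block_id` rule above suggests using it as a join key, so that each target row only sees side information that was available at the same forecast time. The following is a minimal sketch with toy frames: the column names come from the file descriptions here, but every value is invented for illustration, and whether this join is sufficient to avoid leakage in a real pipeline is an assumption to verify.

```python
import pandas as pd

# Toy stand-ins for train.csv and client.csv; all values are invented.
train = pd.DataFrame({
    "county": [0, 0],
    "is_business": [0, 0],
    "product_type": [1, 1],
    "data_block_id": [2, 3],
    "target": [0.7, 0.9],
})
client = pd.DataFrame({
    "county": [0, 0],
    "is_business": [0, 0],
    "product_type": [1, 1],
    "data_block_id": [2, 3],
    "installed_capacity": [950.0, 960.0],
})

# Join on the segment keys plus data_block_id, so a row only picks up
# client records that share its availability block.
keys = ["county", "is_business", "product_type", "data_block_id"]
merged = train.merge(client, on=keys, how="left")
print(merged)
```

A left join keeps every target row even when no client record exists for that block, which leaves `installed_capacity` missing rather than silently dropping the row.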

**gas_prices.csv**

* `origin_date` - The date when the day-ahead prices became available.
* `forecast_date` - The date when the forecast prices should be relevant.
* `[lowest/highest]_price_per_mwh` - The lowest/highest price of natural gas on the day-ahead market that trading day, in euros per megawatt-hour equivalent.
* `data_block_id`

**client.csv**

* `product_type`
* `county` - An ID code for the county. See `county_id_to_name_map.json` for the mapping of ID codes to county names.
* `eic_count` - The aggregated number of consumption points (EICs - European Identifier Code).
* `installed_capacity` - Installed photovoltaic solar panel capacity in kilowatts.
* `is_business` - Boolean for whether or not the prosumer is a business.
* `date`
* `data_block_id`

**electricity_prices.csv**

* `origin_date`
* `forecast_date`
* `euros_per_mwh` - The price of electricity on the day ahead markets in euros per megawatt hour.
* `data_block_id`

**forecast_weather.csv** Weather forecasts that would have been available at prediction time. Sourced from the [European Centre for Medium-Range Weather Forecasts](https://codes.ecmwf.int/grib/param-db/?filter=grib2).

* `[latitude/longitude]` - The coordinates of the weather forecast.
* `origin_datetime` - The timestamp of when the forecast was generated.
* `hours_ahead` - The number of hours between the forecast generation and the forecast weather. Each forecast covers 48 hours in total.
* `temperature` - The air temperature at 2 meters above ground in degrees Celsius.
* `dewpoint` - The dew point temperature at 2 meters above ground in degrees Celsius.
* `cloudcover_[low/mid/high/total]` - The percentage of the sky covered by clouds in the following altitude bands: 0-2 km, 2-6, 6+, and total.
* `10_metre_[u/v]_wind_component` - The [eastward/northward] component of wind speed measured 10 meters above surface in meters per second.
* `data_block_id`
* `forecast_datetime` - The timestamp of the predicted weather. Generated from `origin_datetime` plus `hours_ahead`.
* `direct_solar_radiation` - The direct solar radiation reaching the surface on a plane perpendicular to the direction of the Sun accumulated during the preceding hour, in watt-hours per square meter.
* `surface_solar_radiation_downwards` - The solar radiation, both direct and diffuse, that reaches a horizontal plane at the surface of the Earth, in watt-hours per square meter.
* `snowfall` - Snowfall over the previous hour in units of meters of water equivalent.
* `total_precipitation` - The accumulated liquid, comprising rain and snow that falls on Earth's surface over the preceding hour, in units of meters.
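
As described above, `forecast_datetime` is `origin_datetime` plus `hours_ahead`. A minimal sketch of that relationship in pandas, using invented timestamps rather than rows from the actual file:

```python
import pandas as pd

# Toy forecast rows; the timestamps are invented for illustration.
forecast = pd.DataFrame({
    "origin_datetime": pd.to_datetime(["2021-09-01 02:00:00"] * 3),
    "hours_ahead": [1, 24, 48],
})

# Convert the integer offset to a timedelta and add it to the origin time.
forecast["forecast_datetime"] = (
    forecast["origin_datetime"]
    + pd.to_timedelta(forecast["hours_ahead"], unit="h")
)
print(forecast)
```

Recomputing the column this way can serve as a sanity check against the `forecast_datetime` values shipped in the file.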

**historical_weather.csv** [Historic weather data](https://www.kaggle.com/code/sohier/enefit-basic-submission-demo/notebook)

* `datetime`
* `temperature`
* `dewpoint`
* `rain` - Different from the forecast conventions. The rain from large scale weather systems of the preceding hour in millimeters.
* `snowfall` - Different from the forecast conventions. Snowfall over the preceding hour in centimeters.
* `surface_pressure` - The air pressure at surface in hectopascals.
* `cloudcover_[low/mid/high/total]` - Different from the forecast conventions. Cloud cover at 0-3 km, 3-8, 8+, and total.
* `windspeed_10m` - Different from the forecast conventions. The wind speed at 10 meters above ground in meters per second.
* `winddirection_10m` - Different from the forecast conventions. The wind direction at 10 meters above ground in degrees.
* `shortwave_radiation` - Different from the forecast conventions. The global horizontal irradiation in watt-hours per square meter.
* `direct_solar_radiation`
* `diffuse_radiation` - Different from the forecast conventions. The diffuse solar irradiation in watt-hours per square meter.
* `[latitude/longitude]` - The coordinates of the weather station.
* `data_block_id`
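
Because the forecast file reports wind as eastward/northward components while the historical file reports speed and direction, comparing the two needs a conversion. The sketch below uses the standard formulas; `wind_speed_direction` is a hypothetical helper name, and the meteorological direction convention (degrees the wind blows from, 0 = north, 90 = east) is an assumption about `winddirection_10m` that should be checked against the data source.

```python
import numpy as np

def wind_speed_direction(u, v):
    """Convert eastward (u) and northward (v) wind components to speed and direction.

    Direction follows the meteorological convention (degrees the wind blows
    FROM, clockwise from north). That convention is assumed, not confirmed.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    speed = np.hypot(u, v)  # magnitude of the (u, v) vector
    # Negating both components turns "blowing toward" into "blowing from".
    direction = np.degrees(np.arctan2(-u, -v)) % 360.0
    return speed, direction

# A wind blowing toward the south (from the north), and one from the east.
speed, direction = wind_speed_direction([0.0, -3.0], [-4.0, 0.0])
print(speed, direction)
```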

**public_timeseries_testing_util.py** An optional file intended to make it easier to run custom offline API tests. See the script's docstring for details. You will need to edit this file before using it.

**example_test_files/** Data intended to illustrate how the API functions. Includes the same files and columns delivered by the API. The first three `data_block_ids` are repeats of the last three `data_block_ids` in the train set.

**example_test_files/sample_submission.csv** A valid sample submission, delivered by the API. See [this notebook](https://www.kaggle.com/code/sohier/enefit-basic-submission-demo/notebook) for a very simple example of how to use the sample submission.

**example_test_files/revealed_targets.csv** The actual target values from the day before the forecast time. This amounts to two days of lag relative to the prediction times in the **test.csv**.
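
The two-day lag described above can be turned into a feature by shifting timestamps forward and joining back on the prediction unit. This is a minimal sketch on invented data; the column name `target_lag_2d` is hypothetical, and adding two calendar days to naive local timestamps ignores daylight-saving transitions.

```python
import pandas as pd

# Toy targets for a single prediction unit; all values are invented.
targets = pd.DataFrame({
    "prediction_unit_id": [0, 0, 0],
    "datetime": pd.to_datetime(
        ["2023-01-01 10:00", "2023-01-02 10:00", "2023-01-03 10:00"]
    ),
    "target": [1.0, 2.0, 3.0],
})

# Shift each observation two days forward so it lines up with the row
# it is allowed to inform.
lagged = targets.rename(columns={"target": "target_lag_2d"}).copy()
lagged["datetime"] = lagged["datetime"] + pd.Timedelta(days=2)

features = targets.merge(
    lagged, on=["prediction_unit_id", "datetime"], how="left"
)
print(features)
```

Rows without an observation two days earlier end up with a missing `target_lag_2d`, which a model or an imputation step then has to handle.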

**enefit/** Files that enable the API. Expect the API to deliver all rows in under 15 minutes and to reserve less than 0.5 GB of memory. The copy of the API that you can download serves the data from **example_test_files/**. You must make predictions for those dates in order to advance the API but those predictions are not scored. Expect to see roughly three months of data delivered initially and up to ten months of data by the end of the forecasting period.

**Submissions are evaluated on the Mean Absolute Error (MAE) between the predicted value and the observed target.**
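
For reference, MAE is the average of the absolute differences between predictions and observed targets. A minimal NumPy sketch of the metric, useful for local validation (the helper name `mean_absolute_error` is chosen here for illustration and is not part of the competition API):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Absolute errors are 2, 1 and 0, so the mean is 1.0.
score = mean_absolute_error([10.0, 0.0, 5.0], [8.0, 1.0, 5.0])
print(score)
```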

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly import tools
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
from pprint import pprint
import warnings
warnings.filterwarnings("ignore")
pd.options.display.float_format = '{:.2f}'.format

# train.csv

In [None]:
df_train = pd.read_csv("/kaggle/input/predict-energy-behavior-of-prosumers/train.csv")
df_train["datetime"] = pd.to_datetime(df_train["datetime"])
df_train_consumption = df_train[df_train["is_consumption"]==1]
monthly_consumption = df_train_consumption.groupby(pd.Grouper(key="datetime", freq='M')).mean()
weekly_consumption = df_train_consumption.groupby(pd.Grouper(key="datetime", freq='W')).mean()
daily_consumption = df_train_consumption.groupby(pd.Grouper(key="datetime", freq='D')).mean()
mean_consumption = df_train_consumption.target.mean()
df_train_production = df_train[df_train["is_consumption"]==0]
monthly_production = df_train_production.groupby(pd.Grouper(key="datetime", freq='M')).mean()
weekly_production = df_train_production.groupby(pd.Grouper(key="datetime", freq='W')).mean()
daily_production = df_train_production.groupby(pd.Grouper(key="datetime", freq='D')).mean()
mean_production = df_train_production.target.mean()

In [None]:
def plot_one(df, mean, color, title, annotation, yaxis_title, y="target", line_shape="linear"):
    fig = px.area(df, x=df.index, 
                  y=y, title=title,
                  line_shape=line_shape)
    fig.add_hline(y=mean, line_dash="dot", 
                  annotation_text=annotation, 
                  annotation_position="bottom right")
    fig.update_traces(line_color=color)
    fig.update_layout(xaxis_title="Date",
                      yaxis_title=yaxis_title)
    return fig

In [None]:
df_train.info()

In [None]:
df_train.isna().sum()

In [None]:
df_train = df_train.dropna(how="any")

In [None]:
df_train.describe().T

In [None]:
plot_one(daily_consumption, mean_consumption, 
         "#FA163F", "Daily Consumption", 
         "Average Consumption", "Consumption")

In [None]:
plot_one(daily_production, mean_production, 
         "#427D9D", "Daily Production", 
         "Average Production", "Production")

In [None]:
net_consumption = daily_consumption["target"] - daily_production["target"]
plot_one(net_consumption, net_consumption.mean(), 
         "#EC8F5E", "Net Consumption (Consumption-Production)", 
         "Average Net Consumption",
         "Net Consumption")

In [None]:
parallel_diagram = df_train[['county', 'product_type', 'is_business', 'is_consumption']]
fig = px.parallel_categories(parallel_diagram, color_continuous_scale=px.colors.sequential.Inferno)
fig.update_layout(title='Parallel category diagram on Train Data set')
fig.show()

In [None]:
def plot_pie(df, col1, col2):
    df_ = df.groupby([col1, col2])[col1].count().reset_index(name='counts')
    df_[col1+","+col2] = df_[col1].astype(str) + ',' + df_[col2].astype(str) 
    fig = px.pie(df_, values='counts', names=col1+","+col2, title=col1+' | '+col2)
    fig.update_layout(autosize=True,width=700, height=650, 
                      margin=dict(l=50,r=50, b=60, t=50, pad=4),
                      paper_bgcolor="LightSteelBlue", showlegend=True)
    fig.show()

In [None]:
plot_pie(df_train, "product_type", "is_business")

In [None]:
plot_pie(df_train, "product_type", "is_consumption")

In [None]:
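def plot_dist(df, col1, col2, color):
    # Distribution of row counts per (col1, col2) group
    df_ = df.groupby([col1, col2])[col1].count().reset_index(name='counts')
    plt.figure(figsize=(5,5))
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(df_['counts'], kde=True, stat='density', label='counts', color=color)
    plt.legend()  # called after plotting so the labelled artist exists
    plt.show()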

In [None]:
plot_dist(df_train, 'product_type', 'is_business', 'red')

In [None]:
plot_dist(df_train, 'product_type', 'is_consumption', 'red')

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.histplot(data=df_train, x='county', hue='is_business', multiple='stack', bins=30, palette='viridis', alpha=0.7)
plt.xlabel('County')
plt.ylabel('Count')
plt.title('Prosumer Distribution by County')
plt.legend(title='Is Business', loc='upper right')
plt.show()

In [None]:
desc_columns = ['county','is_business','product_type','is_consumption']

fig, axs = plt.subplots(1, len(desc_columns), figsize=(5*len(desc_columns), 3))

for i, column in enumerate(desc_columns):
    _ = sns.countplot(df_train, x=column, ax=axs[i])

_ = fig.tight_layout()

In [None]:
train_avgd = (
    df_train
    .groupby(['datetime','is_consumption'])
    ['target'].mean()
    .unstack()
    .rename({0: 'produced', 1:'consumed'}, axis=1)
)

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
_ = train_avgd.plot(ax=ax, alpha=0.5)
_ = ax.set_ylabel('Energy consumed / produced')

In [None]:
fig,ax = plt.subplots(1,1,figsize=(6,4))
_ = train_avgd.resample('M').mean().plot(ax=ax, marker='.')
_ = ax.set_ylabel('Average monthly')

In [None]:
fig,ax = plt.subplots(1,1,figsize=(6,4))
train_avgd.groupby(train_avgd.index.hour).mean().plot(ax=ax, marker='.')
_ = ax.set_xlabel('Hour')

# gas_prices.csv

In [None]:
df_gas = pd.read_csv("/kaggle/input/predict-energy-behavior-of-prosumers/gas_prices.csv")
df_gas.drop(["origin_date"], inplace=True, axis=1)
df_gas["forecast_date"] = pd.to_datetime(df_gas["forecast_date"])
monthly_gas = df_gas.groupby(pd.Grouper(key="forecast_date", freq='M')).mean()
weekly_gas = df_gas.groupby(pd.Grouper(key="forecast_date", freq='W')).mean()
daily_gas = df_gas.groupby(pd.Grouper(key="forecast_date", freq='D')).mean()
mean_gas_low = df_gas.lowest_price_per_mwh.mean()
mean_gas_high = df_gas.highest_price_per_mwh.mean()

In [None]:
def plot_two(df, 
             mean1, mean2, 
             color1, color2, 
             title, 
             annotation1, annotation2, 
             yaxis_title, 
             y1,y2,
             line_shape="linear"):
    fig = px.area(df, x=df.index, y=[y1, y2],
                 title=title,
                 color_discrete_map={y1: color1,
                                     y2: color2},
                 line_shape=line_shape)
    fig.add_hline(y=mean1, line_dash="dot",
                 annotation_text=annotation1, annotation_position="top right")
    fig.add_hline(y=mean2, line_dash="dot",
                 annotation_text=annotation2, annotation_position="bottom right")
    fig.update_layout(xaxis_title="Date", yaxis_title=yaxis_title,
                     legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
    return fig

In [None]:
df_gas.info()

In [None]:
df_gas.describe().T

In [None]:
plot_two(daily_gas,
         mean_gas_low, mean_gas_high,
         "#EFB74F", "#247881",
         "Daily Price/MWh",
         "Average Low Price/MWh", "Average High Price/MWh", 
         "Euros/MWh",
         "lowest_price_per_mwh", "highest_price_per_mwh")

In [None]:
plot_two(weekly_gas,
         mean_gas_low, mean_gas_high,
         "#6499E9", "#6930C3",
         "Weekly Price/MWh",
         "Average Low Price/MWh", "Average High Price/MWh", 
         "Euros/MWh",
         "lowest_price_per_mwh", "highest_price_per_mwh")

# historical_weather.csv

In [None]:
df_historical = pd.read_csv("/kaggle/input/predict-energy-behavior-of-prosumers/historical_weather.csv")
df_historical["datetime"] = pd.to_datetime(df_historical["datetime"])
monthly_historical = df_historical.groupby(pd.Grouper(key="datetime", freq='M')).mean()
weekly_historical = df_historical.groupby(pd.Grouper(key="datetime", freq='W')).mean()
daily_historical = df_historical.groupby(pd.Grouper(key="datetime", freq='D')).mean()
mean_historical_temp = df_historical.temperature.mean()
mean_historical_solar = df_historical.direct_solar_radiation.mean()
mean_historical_rain = df_historical.rain.mean()
mean_historical_snow = df_historical.snowfall.mean()
mean_historical_windspeed = df_historical.windspeed_10m.mean()
mean_historical_surface_pressure = df_historical.surface_pressure.mean()

In [None]:
df_historical.info()

In [None]:
df_historical.describe().T

In [None]:
plot_one(daily_historical,
         mean_historical_temp, 
         "#C59279", "Daily Historical Temperature", 
         "Average Historical Temperature",
         "Temperature",y="temperature")

In [None]:
plot_two(daily_historical,
         mean_historical_solar, df_historical["shortwave_radiation"].mean(),
         "#C70A80","#6499E9",
         "Daily Radiation Level",
         "Average Direct Solar Radiation", "Average Shortwave Radiation",
         "Radiation Level",
         "direct_solar_radiation","shortwave_radiation")

In [None]:
plot_two(weekly_historical,
         mean_historical_solar, df_historical["shortwave_radiation"].mean(),
         "#C70A80","#6499E9",
         "Weekly Radiation Level",
         "Average Direct Solar Radiation", "Average Shortwave Radiation",
         "Radiation Level",
         "direct_solar_radiation","shortwave_radiation",
         line_shape="spline")

In [None]:
plot_two(weekly_historical,
         mean_historical_rain, mean_historical_snow,
         "#325288","#565D47",
         "Weekly Rain/Snowfall",
         "Average Rainfall", "Average Snowfall",
         "Rain/Snow Level",
         "rain","snowfall",
         line_shape="spline")

In [None]:
plot_one(daily_historical,
         mean_historical_windspeed, 
         "#69C98D", "Daily Windspeed", 
         "Average Windspeed",
         "Windspeed (in m/s)",y="windspeed_10m",
          line_shape="spline")

In [None]:
plot_one(weekly_historical,
         mean_historical_windspeed, 
         "#1640D6", "Weekly Windspeed", 
         "Average Windspeed",
         "Windspeed (in m/s)",y="windspeed_10m",
          line_shape="spline")

In [None]:
fig = plot_one(daily_historical,
         mean_historical_surface_pressure, 
         "#D61640", "Daily Surface Pressure", 
         "Average Surface Pressure",
         "Pressure (in Hectopascals)",y="surface_pressure",
          line_shape="spline")
fig.update_layout(yaxis_range=[900,1100])

In [None]:
fig = plot_one(weekly_historical,
         mean_historical_surface_pressure, 
         "#26A620", "Weekly Surface Pressure", 
         "Average Surface Pressure",
         "Pressure (in Hectopascals)",y="surface_pressure",
          line_shape="spline")
fig.update_layout(yaxis_range=[900,1100])

In [None]:
df_historical["date"] = np.array(df_historical["datetime"], dtype="datetime64[D]")
df_historical

In [None]:
print(len(df_historical) - len(df_historical.drop_duplicates()))
location = df_historical[df_historical.duplicated()][["latitude", "longitude"]]
location.drop_duplicates()

In [None]:
import folium
fmap = folium.Map((58.5, 25), zoom_start=7)
fmap.add_child(folium.LatLngPopup())
for i, (lat, lon) in df_historical[['latitude', 'longitude']].drop_duplicates().iterrows():
    popup = folium.Popup(f'({lat}, {lon})', max_width=200)
    marker = folium.CircleMarker((lat, lon), radius=5, popup=popup, fill_color='#EC4074')
    marker.add_to(fmap)
fmap

In [None]:
sum_list = []
sum_list2 = []
for i, (latitude, longitude) in df_historical[['latitude', 'longitude']].drop_duplicates().iterrows():
    mask1 = df_historical['latitude'] == latitude
    mask2 = df_historical['longitude'] == longitude
    location = df_historical[mask1 & mask2].reset_index(drop=True)
    df =  location.groupby('datetime')[['temperature', 'dewpoint']].mean()
    sum_list.append(sum(df['temperature'] <= df['dewpoint']))
    sum_list2.append(sum(df['temperature'] <= df['dewpoint']+1))
    
width = 14
print(f'total hour: {len(df)}')
print('temperature <= dewpoint')
for i in range(0, len(sum_list), width):
    print(sum_list[i:i+width])
print()
for i in range(0, len(sum_list), width):
    print(sum_list2[i:i+width])

In [None]:
def plot_historical_column(location, col):
    plt.figure(figsize=(10, 4))
    plt.plot(location.groupby('datetime')[[col]].mean(), label='hourly')
    plt.plot(location.groupby('date')[[col]].mean(), label='daily')
    plt.title(col)
    plt.xticks(rotation=25)
    plt.legend()
    plt.show()

In [None]:
weather_gen = df_historical[['latitude', 'longitude']].drop_duplicates().iterrows()

In [None]:
i, (latitude, longitude) = next(weather_gen)
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
mask1 = df_historical['latitude'] == latitude
mask2 = df_historical['longitude'] == longitude
location = df_historical[mask1 & mask2].reset_index(drop=True)

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plt.figure(figsize=(10, 4))
plt.plot(location.groupby('datetime')[['temperature']].mean(), label='temperature')
plt.plot(location.groupby('datetime')[['dewpoint']].mean(), label='dewpoint')
plt.title('hourly temperature vs dewpoint')
plt.xticks(rotation=25)
plt.legend()
plt.show()

plt.figure(figsize=(10, 4))
plt.plot(location.groupby('date')[['temperature']].mean(), label='temperature')
plt.plot(location.groupby('date')[['dewpoint']].mean(), label='dewpoint')
plt.title('daily temperature vs dewpoint')
plt.xticks(rotation=25)
plt.legend()
plt.show()

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plot_historical_column(location, 'rain')
plot_historical_column(location, 'snowfall')

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plot_historical_column(location, 'cloudcover_high')
plot_historical_column(location, 'cloudcover_mid')

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plot_historical_column(location, 'cloudcover_low')
plot_historical_column(location, 'cloudcover_total')

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plot_historical_column(location, 'windspeed_10m')
plot_historical_column(location, 'winddirection_10m')

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plot_historical_column(location, 'shortwave_radiation')
plot_historical_column(location, 'direct_solar_radiation')

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plot_historical_column(location, 'diffuse_radiation')
plot_historical_column(location, 'surface_pressure')

# client.csv

In [None]:
df_client = pd.read_csv("/kaggle/input/predict-energy-behavior-of-prosumers/client.csv")
df_client

In [None]:
df_client.info()

In [None]:
df_client.describe().T

In [None]:
df_train_prediction_unit_id = df_train[["product_type", "county", "is_business"]].drop_duplicates()
df_client_prediction_unit_id = df_client[["product_type", "county", "is_business"]].drop_duplicates()
display(df_train_prediction_unit_id, df_client_prediction_unit_id)

In [None]:
df_train_prediction_unit_id = df_train[["product_type", "county", "is_business", "prediction_unit_id"]].drop_duplicates()
display(df_train_prediction_unit_id)

In [None]:
client_gen = df_client[["product_type", "county", "is_business"]].drop_duplicates().iterrows()

In [None]:
n_rows, n_cols = 3, 3
fig, axes = plt.subplots(n_rows, n_cols)
gen = zip([next(client_gen) for _ in range(n_rows * n_cols)], axes.ravel())
indexes = []
for (i, (product_type, county, is_business)), ax in gen:
    indexes.append(i)
    mask1 = df_client["product_type"] == product_type
    mask2 = df_client["county"] == county
    mask3 = df_client["is_business"] == is_business
    temp = df_client[mask1 & mask2 & mask3]
    
    ax.plot(temp["eic_count"].reset_index(drop=True))
    ax.set_ylabel("eic_count", color="blue")
    ax.set_xlabel("date")
    ax2 = ax.twinx()
    ax2.plot(temp["installed_capacity"].reset_index(drop=True), color="orange")
    ax2.set_ylabel("installed_capacity", color="orange")
print(f"prediction_unit_id: {indexes}")
plt.tight_layout()
plt.show()

In [None]:
parallel_diagram = df_client[['county', 'product_type', 'is_business']]
fig = px.parallel_categories(parallel_diagram, color_continuous_scale=px.colors.sequential.Inferno)
fig.update_layout(title='Parallel category diagram on client Data set')
fig.show()

In [None]:
plot_pie(df_client, "product_type", "is_business")

In [None]:
plot_dist(df_client, 'product_type', 'is_business', 'red')

# electricity_prices.csv

In [None]:
df_electricity = pd.read_csv("/kaggle/input/predict-energy-behavior-of-prosumers/electricity_prices.csv")
df_electricity["forecast_date"] = np.array(df_electricity["forecast_date"], dtype="datetime64")
df_electricity["date"] = np.array(df_electricity["forecast_date"], dtype="datetime64[D]")
df_electricity

In [None]:
df_electricity.info()

In [None]:
df_electricity.describe().T

In [None]:
hourly_electricity = df_electricity[["forecast_date","euros_per_mwh"]].set_index("forecast_date")
daily_electricity = df_electricity[["date","euros_per_mwh"]].groupby("date")["euros_per_mwh"].mean()
plt.figure(figsize=(11, 6))
plt.plot(hourly_electricity, label="hourly price")
plt.plot(daily_electricity, label="daily price")
plt.title("electricity price")
plt.xticks(rotation=25)
plt.grid()
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(11, 6))
plt.plot(hourly_electricity[:48], label='hourly price')
plt.plot(daily_electricity[:3], label='daily price')
plt.title('electricity price')
plt.ylabel('euros_per_mwh')
plt.xticks(rotation=25)
plt.grid()
plt.legend()
plt.show()

In [None]:
df_electricity["time"] = df_electricity["forecast_date"].dt.strftime("%H:%M:%S")
fig, axs = plt.subplots(1, 2, figsize=(9, 4), gridspec_kw={'width_ratios': [8, 1]}, sharey=True)
_ = sns.lineplot(df_electricity, x='forecast_date', y='euros_per_mwh', ax=axs[0])
_ = sns.boxplot(df_electricity, y='euros_per_mwh', ax=axs[1])
#_ = axs[1].get_yaxis().set_visible(False)
fig.tight_layout()

In [None]:
daily_elec_prices = (
    df_electricity[['forecast_date', 'euros_per_mwh']]
    .set_index('forecast_date')
    .resample('D')
    .mean()
)

fig, axs = plt.subplots(1, 2, figsize=(9, 4), gridspec_kw={'width_ratios': [8, 1]}, sharey=True)
_ = sns.lineplot(daily_elec_prices, x='forecast_date', y='euros_per_mwh', ax=axs[0])
_ = sns.boxplot(daily_elec_prices, y='euros_per_mwh', ax=axs[1])
#_ = axs[1].get_yaxis().set_visible(False)
fig.tight_layout()

# forecast_weather.csv

In [None]:
df_forecast = pd.read_csv("/kaggle/input/predict-energy-behavior-of-prosumers/forecast_weather.csv")
df_forecast["forecast_datetime"] = np.array(df_forecast["forecast_datetime"], dtype="datetime64")
df_forecast["date"] = np.array(df_forecast["forecast_datetime"], dtype="datetime64[D]")
df_forecast

In [None]:
df_forecast.info()

In [None]:
df_forecast.describe().T

In [None]:
import folium
fmap = folium.Map((58.8, 25), zoom_start=7)
fmap.add_child(folium.LatLngPopup())
for i, (lat, lon) in df_forecast[["latitude", "longitude"]].drop_duplicates().iterrows():
    popup = folium.Popup(f"({lat}, {lon})", max_width=200)
    marker = folium.CircleMarker((lat, lon), radius=5, popup=popup, fill_color='#EC4074')
    marker.add_to(fmap)
fmap

In [None]:
sum_list = []
sum_list2 = []
for i, (latitude, longitude) in df_forecast[["latitude", "longitude"]].drop_duplicates().iterrows():
    mask1 = df_forecast["latitude"] == latitude
    mask2 = df_forecast["longitude"] == longitude
    location = df_forecast[mask1 & mask2].reset_index(drop=True)
    df = location.groupby('forecast_datetime')[["temperature", "dewpoint"]].mean()
    sum_list.append(sum(df["temperature"] <= df["dewpoint"]))
    sum_list2.append(sum(df["temperature"] <= df["dewpoint"] + 1))
    
width = 14
print(f"total: {len(df)} hours = 24 hours/day x 638 forecast days")
print()
print("[Number of times the expression 'temperature <= dewpoint' is satisfied]")
for i in range(0, len(sum_list), width):
    print(sum_list[i:i+width])
print("<The position of the number is equal to the position of the observation point>")
print()
print('temperature <= dewpoint + 1')
for i in range(0, len(sum_list), width):
    print(sum_list2[i:i+width])

In [None]:
def plot_forecast_column(location, col):
    plt.figure(figsize=(10, 4))
    plt.plot(location.groupby('forecast_datetime')[[col]].mean(), label="hourly")
    plt.plot(location.groupby('date')[[col]].mean(), label="daily")
    plt.title(col)
    plt.xticks(rotation=25)
    plt.legend()
    plt.show()

In [None]:
df_forecast_gen = df_forecast[["latitude", "longitude"]].drop_duplicates().iterrows()

In [None]:
i, (latitude, longitude) = next(df_forecast_gen)
print(f"[{i}] latitude: {latitude}, longitude: {longitude}")
mask1 = df_forecast["latitude"] == latitude
mask2 = df_forecast["longitude"] == longitude
location = df_forecast[mask1 & mask2].reset_index(drop=True)

In [None]:
print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plt.figure(figsize=(10, 4))
plt.plot(location.groupby("forecast_datetime")[["temperature"]].mean(), label="temperature")
plt.plot(location.groupby("forecast_datetime")[["dewpoint"]].mean(), label="dewpoint")
plt.title("hourly temperature vs dewpoint")
plt.xticks(rotation=25)
plt.legend()
plt.show()

print(f'[{i}] latitude: {latitude}, longitude: {longitude}')
plt.figure(figsize=(10, 4))
plt.plot(location.groupby("date")[["temperature"]].mean(), label="temperature")
plt.plot(location.groupby("date")[["dewpoint"]].mean(), label="dewpoint")
plt.title("daily temperature vs dewpoint")
plt.xticks(rotation=25)
plt.legend()
plt.show()

In [None]:
print(f"[{i}] latitude: {latitude}, longitude: {longitude}")
plot_forecast_column(location[:3000], "cloudcover_high")
plot_forecast_column(location[:3000], "cloudcover_mid")

In [None]:
print(f"[{i}] latitude: {latitude}, longitude: {longitude}")
plot_forecast_column(location[:3000], "cloudcover_low")
plot_forecast_column(location[:3000], "cloudcover_total")

In [None]:
print(f"[{i}] latitude: {latitude}, longitude: {longitude}")
plot_forecast_column(location[:3000], "10_metre_u_wind_component")
plot_forecast_column(location[:3000], "10_metre_v_wind_component")

In [None]:
print(f"[{i}] latitude: {latitude}, longitude: {longitude}")
plot_forecast_column(location, "direct_solar_radiation")
plot_forecast_column(location, "surface_solar_radiation_downwards")

In [None]:
print(f"[{i}] latitude: {latitude}, longitude: {longitude}")
plot_forecast_column(location, "snowfall")
plot_forecast_column(location, "total_precipitation")

# county_id_to_name_map.json

In [None]:
import json
with open('/kaggle/input/predict-energy-behavior-of-prosumers/county_id_to_name_map.json', 'r') as f:
    json_data = json.load(f)
county_id_to_name_map = json_data  # json.load already returns a dict, so no eval is needed
for key, value in county_id_to_name_map.items():
    print(key, value)

# revealed_targets

In [None]:
df_revealed = df_train[["county", "is_business", "product_type", "target", "is_consumption", "datetime", "prediction_unit_id"]].copy()

# Display Datasets

In [None]:
from colorama import Fore, Style

def print_color(text: str, color=Fore.BLUE, style=Style.BRIGHT):
    '''Print a text string in color using colorama.'''
    print(style + color + text + Style.RESET_ALL)
    
def display_df(df, name):
    '''Display df shape and first row '''
    print_color(text = f'{name} data has {df.shape[0]} rows and {df.shape[1]} columns. \n ===> First row:')
    display(df.head(1))

In [None]:
display_df(df_train, 'train')

In [None]:
display_df(df_client, 'client')

In [None]:
display_df(df_electricity, 'electricity')

In [None]:
display_df(df_forecast, 'forecast_weather')

In [None]:
display_df(df_gas, 'gas_prices')

In [None]:
display_df(df_historical, 'historical_weather')

In [None]:
display_df(df_revealed, 'revealed_targets')

https://www.kaggle.com/code/syerramilli/enefit-eda-catboost-baseline