# **Data Processing**

After loading the data, we must process it before it can be used in our model. We will follow these steps:
1. Remove unnecessary columns
2. Handle the Date column in both datasets and unify the format
3. Join the datasets
4. Handle any missing values or rows (dates that are not present in both datasets)
5. Handle the categorical values

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

In [2]:
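# Load the datasets.
# na_filter=False disables pandas' missing-value detection, so empty cells are read as empty strings rather than NaN
df_energy = pd.read_csv('datasets/energy.csv', na_filter=False, encoding="latin")
df_meteorology = pd.read_csv('datasets/meteorology.csv', na_filter=False, encoding="latin")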

### **Remove unnecessary columns**

In [3]:
# We can drop city_name, sea_level and grnd_level as they only have one unique value
df_meteorology = df_meteorology.drop(['city_name', 'sea_level', 'grnd_level'], axis=1)
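
The claim that a column has only one unique value can be checked programmatically rather than by inspection. The following is a minimal sketch on a toy frame; the frame `toy`, its values, and the name `constant_cols` are illustrative assumptions, not taken from the real datasets.

```python
import pandas as pd

# Toy frame standing in for df_meteorology (values are illustrative only)
toy = pd.DataFrame({
    "city_name": ["local", "local", "local"],
    "sea_level": ["", "", ""],
    "temp": [14.0, 15.2, 13.8],
})

# A column with at most one distinct value carries no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)

# Drop every constant column in one step
toy = toy.drop(columns=constant_cols)
```

Passing `dropna=False` makes a column that is entirely missing count as constant as well.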

### **Handle the Date column in both datasets and unify the format**

Energy dataset:

In [4]:
# Combine the 'Data' (date) and 'Hora' (hour) columns into a single datetime column
df_energy['datetime'] = pd.to_datetime(df_energy['Data'] + ' ' + df_energy['Hora'].astype(str) + ':00:00', format='%Y-%m-%d %H:%M:%S')

# Drop the original 'Data' and 'Hora' columns
df_energy = df_energy.drop(['Data', 'Hora'], axis=1)

# Show one sample row of the updated DataFrame
df_energy.iloc[901].to_frame().T

| | Normal (kWh) | Horário Económico (kWh) | Autoconsumo (kWh) | Injeção na rede (kWh) | datetime |
|---|---|---|---|---|---|
| 901 | 0.0 | 0.0 | 0.274 | Very High | 2021-11-05 13:00:00 |
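
As an alternative to building the timestamp through string concatenation, the date and the hour can be combined arithmetically. This is only a sketch, assuming `Hora` holds integer hours (0 to 23), which the source does not state explicitly; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the energy dataset's 'Data' (date) and 'Hora' (hour) columns
toy = pd.DataFrame({"Data": ["2021-11-05", "2021-11-05"], "Hora": [9, 13]})

# Parse the date, then add the hour as a time offset
toy["datetime"] = (
    pd.to_datetime(toy["Data"], format="%Y-%m-%d")
    + pd.to_timedelta(toy["Hora"], unit="h")
)

# The source columns are no longer needed
toy = toy.drop(columns=["Data", "Hora"])
print(toy)
```

This avoids any dependence on how the hour is formatted as text, at the cost of requiring a numeric `Hora` column.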


Meteorology dataset:

In [5]:
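# Parse the timestamp strings (which include a UTC offset) into datetimes
df_meteorology['dt_iso'] = pd.to_datetime(df_meteorology['dt_iso'], format='%Y-%m-%d %H:%M:%S %z UTC')
# Drop the timezone so the values are naive datetimes, like those in the energy dataset
df_meteorology['dt_iso'] = df_meteorology['dt_iso'].dt.tz_localize(None)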

# Rename the column to 'datetime'
df_meteorology = df_meteorology.rename(columns={"dt_iso": "datetime"})

# We can also drop the 'dt' column as it is redundant
df_meteorology = df_meteorology.drop(['dt'], axis=1)

# Print the updated DataFrame
df_meteorology.iloc[801].to_frame().T

| | datetime | temp | feels_like | temp_min | temp_max | pressure | humidity | wind_speed | rain_1h | clouds_all | weather_description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 801 | 2021-10-04 09:00:00 | 14.03 | 13.84 | 13.34 | 14.54 | 1023 | 90 | 2.07 | | 69 | broken clouds |
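
The timezone handling above can be illustrated in isolation. The sample strings below are an assumption about the raw `dt_iso` layout, inferred from the format string used in the cell; the names `raw`, `parsed`, and `naive` are illustrative.

```python
import pandas as pd

# Hypothetical raw values in the layout implied by '%Y-%m-%d %H:%M:%S %z UTC'
raw = pd.Series([
    "2021-10-04 09:00:00 +0000 UTC",
    "2021-10-04 10:00:00 +0000 UTC",
])

# Parsing with %z yields timezone-aware datetimes
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z UTC")

# tz_localize(None) removes the timezone and keeps the wall-clock time as is
naive = parsed.dt.tz_localize(None)
print(naive)
```

`tz_localize(None)` keeps the wall-clock time, whereas `tz_convert(None)` would first convert to UTC. With a `+0000` offset the two give the same result, but they would differ for any other offset.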


Validating the dates:

In [6]:
# Order the dataframes by datetime so we can detect any time skips
df_energy = df_energy.sort_values(by=['datetime'])
df_meteorology = df_meteorology.sort_values(by=['datetime'])

In [7]:
time_diff_en = df_energy['datetime'].diff()
time_diff_me = df_meteorology['datetime'].diff()

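# Keep only the intervals that differ from exactly one hour.
# The first row always shows up because diff() has no previous value to compare with (NaT).
one_hour = pd.Timedelta(hours=1)
irregularities_en = time_diff_en[time_diff_en != one_hour]
irregularities_me = time_diff_me[time_diff_me != one_hour]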
print("Irregular time intervals in df_en:")
print(irregularities_en)
print("\n")
print("Irregular time intervals in df_me:")
print(irregularities_me)

Irregular time intervals in df_en:
0   NaT
Name: datetime, dtype: timedelta64[ns]


Irregular time intervals in df_me:
0   NaT
Name: datetime, dtype: timedelta64[ns]
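
In both outputs the only entry is the first row, whose difference is `NaT` simply because it has no predecessor, so neither dataset contains a time skip. A complementary check, sketched below on toy data, compares the timestamps against a complete hourly range and also looks for duplicates. The series `stamps` and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy hourly series with one missing hour (03:00), purely illustrative
stamps = pd.Series(pd.to_datetime([
    "2021-09-29 00:00", "2021-09-29 01:00", "2021-09-29 02:00", "2021-09-29 04:00",
]))

# Gaps: consecutive differences that are not exactly one hour
diffs = stamps.diff().dropna()  # drop the leading NaT
gaps = diffs[diffs != pd.Timedelta(hours=1)]

# Missing timestamps: what a complete hourly range has that the data lacks
expected = pd.date_range(stamps.min(), stamps.max(), freq=pd.Timedelta(hours=1))
missing = expected.difference(pd.DatetimeIndex(stamps))

# Duplicated timestamps would also break a one-to-one join on datetime
n_duplicates = int(stamps.duplicated().sum())

print(gaps)
print(missing)
print(n_duplicates)
```

On regular hourly data, `gaps` and `missing` are both empty and `n_duplicates` is zero.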


### **Rename columns**

In [8]:
# Rename Injeção na rede (kWh) to Injection
df_energy = df_energy.rename(columns={'Injeção na rede (kWh)': 'Injection'})

df_energy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11016 entries, 0 to 11015
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Normal (kWh)             11016 non-null  float64       
 1   Horário Económico (kWh)  11016 non-null  float64       
 2   Autoconsumo (kWh)        11016 non-null  float64       
 3   Injection                11016 non-null  object        
 4   datetime                 11016 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 430.4+ KB


### **Join the datasets**

Outer join:

In [9]:
outer_join_merged_df = pd.merge(df_energy, df_meteorology, on='datetime', how='outer')

In [10]:
outer_join_merged_df.isna().sum()

Normal (kWh)               672
Horário Económico (kWh)    672
Autoconsumo (kWh)          672
Injection                  672
datetime                     0
temp                         0
feels_like                   0
temp_min                     0
temp_max                     0
pressure                     0
humidity                     0
wind_speed                   0
rain_1h                      0
clouds_all                   0
weather_description          0
dtype: int64
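
To find out which dataset the unmatched rows come from without inspecting the data by hand, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses two small toy frames; `energy`, `weather`, and their values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins: the weather frame starts one hour before the energy frame
energy = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-09-29 00:00", "2021-09-29 01:00"]),
    "Autoconsumo (kWh)": [0.0, 0.274],
})
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-09-28 23:00", "2021-09-29 00:00", "2021-09-29 01:00"]),
    "temp": [15.1, 14.8, 14.5],
})

# indicator=True records the origin of every row in a '_merge' column
merged = pd.merge(energy, weather, on="datetime", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Time span of the rows that exist only in the weather frame
right_only = merged.loc[merged["_merge"] == "right_only", "datetime"]
print(right_only.min(), right_only.max())
```

Applied to the real data, the minimum and maximum of the `right_only` timestamps would show directly whether the extra rows lie before or after the energy dataset's range.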

Since the datetime columns showed no irregularities, the rows present in the meteorology dataset but absent from the energy dataset must fall before the energy dataset's first entry or after its last one. Inspecting the data shows that the meteorology dataset starts on 2021-09-01, while the energy dataset starts on 2021-09-29. The 28 days in between correspond to 28 × 24 = 672 hourly rows, which is consistent with the 672 missing values reported above. There are therefore two paths we can take:
- The first is to discard the meteorology entries that precede 2021-09-29, since they have no matching energy data and will not contribute to the modeling process. To do this, we perform an inner join between the two datasets on the datetime column and use the resulting dataset going forward.
- The alternative is to keep the meteorology entries that precede 2021-09-29 and fill the missing energy values with the mean of the values from the previous day. This lets us use all of the meteorology data in the modeling process, but it also introduces a bias into the data. To do this, we keep the outer join and make the necessary modifications.

We will therefore train the models on two different datasets: one built with the inner join and one built with the outer join.
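
The difference between the two joins can be sketched on toy data. The frames `energy` and `weather` and their values are illustrative assumptions. The fill step of the outer-join path is left out on purpose, since the text does not specify it in enough detail to reproduce here.

```python
import pandas as pd

# Toy stand-ins: the weather frame starts two hours before the energy frame
energy = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-09-29 00:00", "2021-09-29 01:00"]),
    "Autoconsumo (kWh)": [0.0, 0.274],
})
weather = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-09-28 22:00", "2021-09-28 23:00", "2021-09-29 00:00", "2021-09-29 01:00",
    ]),
    "temp": [15.4, 15.1, 14.8, 14.5],
})

# Inner join: only timestamps present in both frames survive
inner = pd.merge(energy, weather, on="datetime", how="inner")

# Outer join: every timestamp survives, with NaN where one side has no row
outer = pd.merge(energy, weather, on="datetime", how="outer")

# Rows that only the outer join keeps, and the gaps they introduce
extra_rows = len(outer) - len(inner)
missing_energy = int(outer["Autoconsumo (kWh)"].isna().sum())

print(len(inner), len(outer), extra_rows, missing_energy)
```

When every energy timestamp has a matching meteorology row, the inner join has exactly as many rows as the energy frame, which is a quick sanity check to run on the real data.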