# Notebook for cleaning the electricity production data

Read the data from the CSV files. The original data came as .xls files, but their format was broken: the files appear to be HTML documents rather than proper Excel files. We manually converted the .xls files to CSV files so that they could be read.
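
A quick way to confirm this kind of mismatch is to look at the first bytes of a file. The sketch below is only an illustration: `looks_like_html` is a hypothetical helper, not part of the original notebook, and the list of markers it checks is an assumption that may need extending.

```python
from pathlib import Path

# Hypothetical helper: guesses whether a file is HTML by inspecting its first bytes.
def looks_like_html(path, probe_bytes=1024):
    head = Path(path).read_bytes()[:probe_bytes]
    # Drop a UTF-8 byte order mark and leading whitespace before comparing.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip().lower()
    # Assumed set of typical opening tags for an HTML table export.
    return head.startswith((b"<!doctype", b"<html", b"<table", b"<meta"))
```

If the check returns `True` for a file with an .xls extension, `pandas.read_html` may be able to parse the table directly, although that route was not tried here.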

In [1]:
import pandas as pd

# forward slashes work on Windows as well as on Linux and macOS
production_2016 = pd.read_csv('./data/production-per-country_2016_hourly.csv', delimiter=';')
production_2017 = pd.read_csv('./data/production-per-country_2017_hourly.csv', delimiter=';')
production_2018 = pd.read_csv('./data/production-per-country_2018_hourly.csv', delimiter=';')
production_2019 = pd.read_csv('./data/production-per-country_2019_hourly.csv', delimiter=';')
production_2020 = pd.read_csv('./data/production-per-country_2020_hourly.csv', delimiter=';')
production_2021 = pd.read_csv('./data/production-per-country_2021_hourly.csv', delimiter=';')

print(production_2016.head(1))
print(production_2017.head(1))
print(production_2018.head(1))
print(production_2019.head(1))
print(production_2020.head(1))
print(production_2021.head(1))


  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic      EE  \
0   1.1.2016  00�-�01  16764.0  18054.0  7964.0  2914.0  45697.0  1041.0   

      LV     LT  Baltic  
0  436.0  252.0  1729.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2017  00�-�01  12316.0  16898.0  7079.0  4425.0  40717.0  841.0   

      LV     LT  Baltic  
0  240.0  396.0  1477.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2018  00�-�01  14131.0  18029.0  8050.0  3583.0  43793.0  944.0   

      LV     LT  Baltic  
0  496.0  617.0  2057.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2019  00�-�01  11408.0  17979.0  8345.0  4344.0  42077.0  852.0   

      LV     LT  Baltic  
0  271.0  593.0  1716.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2020  00�-�01  16935.0  18666.0  8078.0  3532.0  47211.0  405.0   

      LV     LT  Baltic  
0  211.0 

Keep the date and time information and drop everything except the Finnish production data.

In [2]:
from datetime import datetime as dt

# rename date information column
production_2016 = production_2016.rename(columns={'Unnamed: 0':'Date'})
production_2017 = production_2017.rename(columns={'Unnamed: 0':'Date'})
production_2018 = production_2018.rename(columns={'Unnamed: 0':'Date'})
production_2019 = production_2019.rename(columns={'Unnamed: 0':'Date'})
production_2020 = production_2020.rename(columns={'Unnamed: 0':'Date'})
production_2021 = production_2021.rename(columns={'Unnamed: 0':'Date'})


# keep only columns that are needed
production_2016 = production_2016[['Date', 'Hours', 'FI']]
production_2017 = production_2017[['Date', 'Hours', 'FI']]
production_2018 = production_2018[['Date', 'Hours', 'FI']]
production_2019 = production_2019[['Date', 'Hours', 'FI']]
production_2020 = production_2020[['Date', 'Hours', 'FI']]
production_2021 = production_2021[['Date', 'Hours', 'FI']]

# concat data
df = pd.concat([
    production_2016, 
    production_2017, 
    production_2018, 
    production_2019, 
    production_2020, 
    production_2021]
)

# rename the production column
df = df.rename(columns={'FI':'PRODUCTION (MWh)'})

# drop rows dated after 31.8.2021
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
end = dt.strptime('01.09.2021', '%d.%m.%Y')
df = df[df['Date'] < end]

print(df.head(1))
print(df.shape)

        Date    Hours  PRODUCTION (MWh)
0 2016-01-01  00�-�01            7964.0
(49685, 3)
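
The same steps are repeated for each yearly file, so they could also be written as a loop. The following is a minimal sketch under stated assumptions, not the code that produced the output above: `load_finnish_production` is a hypothetical helper, the in-memory samples contain made-up values, and `ignore_index=True` is an addition that gives the combined frame a continuous index, which the cell above does not do.

```python
import io

import pandas as pd

# Hypothetical helper: applies the rename/select/concat/filter steps to any number of sources.
def load_finnish_production(sources, end="2021-09-01"):
    frames = []
    for source in sources:
        frame = pd.read_csv(source, delimiter=";")
        frame = frame.rename(columns={"Unnamed: 0": "Date", "FI": "PRODUCTION (MWh)"})
        frames.append(frame[["Date", "Hours", "PRODUCTION (MWh)"]])
    # ignore_index=True is an assumption; it renumbers rows instead of restarting at 0 per file.
    df = pd.concat(frames, ignore_index=True)
    df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
    # Keep only rows strictly before the end date.
    return df[df["Date"] < pd.Timestamp(end)]

# Tiny in-memory stand-ins for the yearly files (made-up values, for illustration only).
sample_a = io.StringIO(";Hours;NO;FI\n1.1.2016;00 - 01;100;10\n")
sample_b = io.StringIO(";Hours;NO;FI\n31.8.2021;23 - 00;200;20\n1.9.2021;00 - 01;300;30\n")
demo = load_finnish_production([sample_a, sample_b])
print(demo)
```

With the real data, `sources` would be the list of the six CSV paths read in the first cell.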


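Clean the Hours column. The values are one-hour intervals in hh-hh form (for example 00-01), but we want a single hour number in the range 0-23, so we keep only the starting hour of each interval. The characters around the hyphen show up as � in the output above, which is why only the first two characters of each value are read.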

In [3]:
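# only convert while the original column exists, so the cell can be re-run safely
if 'Hours' in df.columns:
    # keep the first two characters (the starting hour) and convert them to int
    df['Hours'] = df['Hours'].map(lambda hours_str: int(hours_str[0:2]))
    df = df.rename(columns={'Hours':'Hour'})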
    
print(df.head(1))
print(df.shape)

        Date  Hour  PRODUCTION (MWh)
0 2016-01-01     0            7964.0
(49685, 3)
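
Slicing the first two characters works for this data, but it silently accepts anything that happens to start with two digits. As a possible alternative, the sketch below extracts the starting hour with a regular expression and validates the range. `start_hour` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

# Hypothetical helper: returns the starting hour of an interval string such as "00 - 01".
def start_hour(interval):
    match = re.match(r"\s*(\d{1,2})", interval)
    if match is None:
        raise ValueError(f"no leading hour found in {interval!r}")
    hour = int(match.group(1))
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range in {interval!r}")
    return hour

print(start_hour("00 - 01"))
print(start_hour("23 - 00"))
```

It could be applied with `df['Hours'].map(start_hour)`, and it raises an error on unexpected values instead of passing them through.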


Write the cleaned data to a CSV file.

In [4]:
df.to_csv('electricity-production-FI_2016-2021_hourly.csv', sep=';', encoding='utf-8', index=False)
print('Success')

Success
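
A simple sanity check on the written data is to compare the row count with the number of hours in the covered period. The sketch below assumes the data runs from 1.1.2016 to 31.8.2021 with one row per hour; `expected_hourly_rows` is a hypothetical helper.

```python
from datetime import date

# Hypothetical helper: number of hourly rows if every calendar day had exactly 24 rows.
def expected_hourly_rows(first_day, last_day):
    days = (last_day - first_day).days + 1  # both end points included
    return days * 24

expected = expected_hourly_rows(date(2016, 1, 1), date(2021, 8, 31))
print(expected)  # 49680
```

The shape printed earlier is (49685, 3), five rows more than this estimate. Daylight-saving changeovers are one possible explanation, but that is a guess and would need to be checked against the data, for example by counting rows per date.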
