# Notebook for cleaning the electricity consumption data

Read the data from the CSV files. The original data came as .xls files, but the file format was wrong: the files appear to be HTML documents rather than proper Excel files. We manually converted the .xls files to CSV files so that they could be read.

In [1]:
import pandas as pd

# forward slashes work on Windows as well as on Linux and macOS
consumption_2016 = pd.read_csv('./data/consumption-per-country_2016_hourly.csv', delimiter=';')
consumption_2017 = pd.read_csv('./data/consumption-per-country_2017_hourly.csv', delimiter=';')
consumption_2018 = pd.read_csv('./data/consumption-per-country_2018_hourly.csv', delimiter=';')
consumption_2019 = pd.read_csv('./data/consumption-per-country_2019_hourly.csv', delimiter=';')
consumption_2020 = pd.read_csv('./data/consumption-per-country_2020_hourly.csv', delimiter=';')
consumption_2021 = pd.read_csv('./data/consumption-per-country_2021_hourly.csv', delimiter=';')

print(consumption_2016.head(1))
print(consumption_2017.head(1))
print(consumption_2018.head(1))
print(consumption_2019.head(1))
print(consumption_2020.head(1))
print(consumption_2021.head(1))


  Unnamed: 0    Hours       NO       SE       FI      DK   Nordic     EE  \
0   1.1.2016  00 - 01  15418.0  15432.0  10005.0  3159.0  44015.0  911.0   

      LV      LT  Baltic  
0  741.0  1029.0  2681.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2017  00 - 01  14912.0  14208.0  9565.0  2815.0  41498.0  753.0   

      LV     LT  Baltic  
0  660.0  874.0  2287.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2018  00 - 01  16989.0  15564.0  9715.0  3420.0  45688.0  827.0   

      LV      LT  Baltic  
0  663.0  1102.0  2592.0  
  Unnamed: 0    Hours       NO       SE       FI      DK   Nordic     EE  \
0   1.1.2019  00 - 01  15724.0  14597.0  10467.0  3258.0  44046.0  842.0   

      LV      LT  Baltic  
0  678.0  1171.0  2691.0  
  Unnamed: 0    Hours       NO       SE      FI      DK   Nordic     EE  \
0   1.1.2020  00 - 01  16151.0  14957.0  9548.0  3313.0  43970.0  805.0   

      LV      LT  Baltic  
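
The file-format problem described at the start of this notebook can be detected before attempting to parse a file. The following is a minimal sketch using only the standard library; the helper name `looks_like_html`, the probe size and the list of markers are assumptions made for illustration and were not part of the original workflow.

```python
import tempfile
from pathlib import Path


def looks_like_html(path, probe_bytes=1024):
    """Return True if the first bytes of the file contain typical HTML markup."""
    head = Path(path).read_bytes()[:probe_bytes].lower()
    markers = (b"<!doctype html", b"<html", b"<table")
    return any(marker in head for marker in markers)


# Demo on a throwaway file that mimics an HTML table saved with an .xls extension.
with tempfile.TemporaryDirectory() as tmp:
    fake_xls = Path(tmp) / "fake.xls"
    fake_xls.write_text("<html><body><table><tr><td>1.1.2016</td></tr></table></body></html>")
    is_html = looks_like_html(fake_xls)

print(is_html)
```

If the check is positive, `pandas.read_html` may be able to parse the file directly, provided an HTML parser such as lxml is installed. This was not tried on the original files, so the manual conversion to CSV remains the approach used here.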


Keep the date and hour information, drop all columns except the Finnish (FI) consumption data, and limit the data to dates up to 31.8.2021.

In [2]:
from datetime import datetime as dt

# concat data
df = pd.concat([
    consumption_2016, 
    consumption_2017, 
    consumption_2018, 
    consumption_2019, 
    consumption_2020, 
    consumption_2021]
)

# drop the other countries' columns; errors='ignore' skips names that are missing
df = df.drop(['NO','SE','DK','Nordic','EE','LV','LT','Baltic'], axis=1, errors='ignore')

# name the unnamed first column 'Date' and rename the consumption column
df = df.rename(columns={df.columns[0]: 'Date', 'FI': 'CONSUMP (MWh)'})

# keep only rows dated 31.8.2021 or earlier
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
end = dt.strptime('01.09.2021', '%d.%m.%Y')
df = df[df['Date'] < end]

print(df.head(1))
print(df.shape)

        Date    Hours  CONSUMP (MWh)
0 2016-01-01  00 - 01        10005.0
(49685, 3)
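
An alternative to listing every unwanted column is to select the wanted columns by name. The sketch below shows the idea on a toy frame; the values are made up and `raw` and `fi` are hypothetical names, so it illustrates the filtering step rather than reproducing the real data.

```python
import pandas as pd

# Toy stand-in for the concatenated yearly files (values are made up).
raw = pd.DataFrame({
    "Unnamed: 0": ["31.8.2021", "1.9.2021"],
    "Hours": ["23 - 00", "00 - 01"],
    "NO": [1.0, 2.0],
    "FI": [7000.0, 7100.0],
})

# Select the wanted columns by name, then give them descriptive names.
fi = raw[["Unnamed: 0", "Hours", "FI"]].rename(
    columns={"Unnamed: 0": "Date", "FI": "CONSUMP (MWh)"}
)

# Parse the day.month.year strings and keep rows before 1 September 2021.
fi["Date"] = pd.to_datetime(fi["Date"], format="%d.%m.%Y")
fi = fi[fi["Date"] < pd.Timestamp("2021-09-01")]

print(fi)
```

Selecting by name fails loudly if an expected column is missing, whereas `drop(..., errors='ignore')` silently tolerates it. Which behaviour is preferable depends on how much the input files are trusted.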


Clean the Hours column. Each value is a one-hour interval in hh - hh format, but we want a single hour in the range 0-23, so we keep only the starting hour of each interval.

In [3]:
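# only convert if the column has not been renamed yet (makes the cell safe to re-run)
if 'Hours' in df.columns:
    # the first two characters of the interval string are the starting hour
    df['Hours'] = df['Hours'].str[:2].astype(int)
    df = df.rename(columns={'Hours':'Hour'})
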
print(df.head(1))
print(df.shape)

        Date  Hour  CONSUMP (MWh)
0 2016-01-01     0        10005.0
(49685, 3)
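
The starting hour can also be extracted with a regular expression, which makes the intent explicit and does not depend on which separator characters follow the first two digits. The sketch below uses made-up sample strings. The final step, combining date and hour into one timestamp, is an optional extra that this notebook does not perform, and the names `start_hour` and `timestamps` are hypothetical.

```python
import pandas as pd

# Sample interval strings; the second one contains replacement characters
# to mimic a separator that was damaged by an encoding problem.
hours = pd.Series(["00 - 01", "09\ufffd-\ufffd10", "23 - 00"])

# Capture the two leading digits and convert them to integers in the range 0-23.
start_hour = hours.str.extract(r"^(\d{2})", expand=False).astype(int)

# Optional: combine the date and the starting hour into a single timestamp.
dates = pd.to_datetime(pd.Series(["1.1.2016"] * 3), format="%d.%m.%Y")
timestamps = dates + pd.to_timedelta(start_hour, unit="h")

print(start_hour.tolist())
print(timestamps.tolist())
```

A single timestamp column can be convenient for resampling or plotting, while separate `Date` and `Hour` columns, as kept in this notebook, are easier to use as plain features.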


Write the cleaned data to a CSV file.

In [4]:
df.to_csv('electricity-consumption-FI_2016-2021_hourly.csv', sep=';', encoding='utf-8', index=False)
print('Success')

Success
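
Printing a fixed message does not confirm that the file can be read back as intended. A possible check is a round trip: write with the same separator, read the result again and compare. The sketch below does this in memory with `io.StringIO` on a toy frame (the second consumption value is made up); `clean` and `restored` are hypothetical names.

```python
import io

import pandas as pd

# Toy frame with the same columns as the cleaned data.
clean = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01", "2016-01-01"]),
    "Hour": [0, 1],
    "CONSUMP (MWh)": [10005.0, 9900.0],
})

# Write to an in-memory buffer using the same separator as the notebook.
buffer = io.StringIO()
clean.to_csv(buffer, sep=";", index=False)

# Read it back; the Date column has to be parsed explicitly.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", parse_dates=["Date"])

print(restored.dtypes)
print(restored.shape == clean.shape)
```

Anyone loading the exported file later needs the same two details: `sep=';'` and explicit date parsing, since the CSV stores the dates as plain text.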
