# ETL of the Generation and Emission files

## Table of Contents
<ul><li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Cleaning: Missing Values</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li></ul>

<a id='intro'></a>
## Introduction

Before it can be analyzed, the data in the Generation and Emission files needs several transformations. This notebook performs those transformations and explains them step by step. Because both kinds of files share the same structure, the same transformations are applied to both.

<a id='wrangling'></a>
## Structure and characteristics of the dataset

The Generation and Emission data sets each consist of 730 files, one per day. Each file covers from 21:00 of the previous day to 02:55 of the following day, so consecutive files overlap and produce many duplicate rows that have to be removed. Apart from the 'Generation' column in the generation files and the 'Emission' column in the emission files, both types of files share the same columns:
*  The first column is the date and time, recorded every 5 minutes.
*  Eólica: the energy generated, or the CO2 emitted, by wind power.
*  Nuclear: the energy generated, or the CO2 emitted, by nuclear power.
*  Carbón: the energy generated, or the CO2 emitted, by coal-fired power (named `charcoal` in the code below).
*  Ciclo combinado: the energy generated, or the CO2 emitted, by combined-cycle plants. This process combines two thermodynamic cycles in one system: the working fluid is steam in one and gas in the other. The heat generated is used to heat the gas and the steam, which then drive turbines connected to an electric generator.
*  Hidráulica: the energy generated, or the CO2 emitted, by hydropower.
*  Solar fotovoltaica: the energy generated, or the CO2 emitted, by photovoltaic solar power.
*  Solar térmica: the energy generated, or the CO2 emitted, by thermal solar power, which uses solar radiation to produce mechanical energy and, from it, electricity.
*  Térmica renovable: the energy generated, or the CO2 emitted, by renewable thermal power.
*  Motores diésel: the energy generated, or the CO2 emitted, by diesel engines.
*  Turbina de gas: the energy generated, or the CO2 emitted, by gas turbines.
*  Turbina de vapor: the energy generated, or the CO2 emitted, by steam turbines.
*  Generación auxiliar: the energy generated, or the CO2 emitted, by auxiliary generation.
*  Cogeneración y residuos: the energy generated, or the CO2 emitted, by cogeneration and waste. Cogeneration is the simultaneous generation of electric and thermal energy. Solid urban waste is recycled and used as an energy source.
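
To illustrate why overlapping daily files create duplicates, here is a minimal sketch with two toy frames. The values and the reduced set of columns are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Two hypothetical daily files whose time windows overlap (toy values, not real data)
day_1 = pd.DataFrame({
    "Hora": ["2021-01-01 23:50", "2021-01-01 23:55", "2021-01-02 00:00"],
    "Eólica": [10.0, 11.0, 12.0],
})
day_2 = pd.DataFrame({
    "Hora": ["2021-01-01 23:55", "2021-01-02 00:00", "2021-01-02 00:05"],
    "Eólica": [11.0, 12.0, 13.0],
})

# Stack the files, then keep a single copy of every repeated row
combined = pd.concat([day_1, day_2], ignore_index=True)
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), len(deduplicated))
```

Note that `drop_duplicates` only removes rows that match in every column, so two rows with the same timestamp but different readings would both be kept.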

### General Properties

#### Import libraries

In [None]:
import pandas as pd
import numpy as np
import glob
from datetime import datetime

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Import and print dataset

In [None]:
# Create a list to do the ETL of the two types of files simultaneously.
path = ["/content/drive/MyDrive/Data REE/XLSX - GENERACIÓN","/content/drive/MyDrive/Data REE/XLSX - EMISIONES"]

# The glob module finds all the pathnames matching a specified pattern; results are returned in arbitrary order,
# which is acceptable here because the rows are sorted by time further down
lista = [1,2] # Placeholders, each overwritten with one DataFrame per file type
for i in range(len(lista)):
  lista[i] = pd.concat(map(pd.read_excel, glob.glob(path[i] + "/*.xlsx")))
  # Take the column names from the second row of the data
  not_column_g = list(lista[i].columns)
  column_g = lista[i].values.tolist()[1]
  lista[i] = lista[i].rename(columns={not_column_g[k]: column_g[k] for k in range(len(not_column_g))})
  lista[i] = lista[i].drop_duplicates() # Remove the rows repeated across overlapping files
  lista[i] = lista[i].drop([lista[i].index[0], lista[i].index[1]]) # Remove the first two rows
  lista[i]["Hora"] = lista[i]["Hora"].astype(str) # Convert into str
  lista[i].reset_index(inplace = True) # Give every row a unique index; the old one becomes the "index" column
  ind_b = lista[i].loc[lista[i]['Hora'].str.contains('B', na=False), :].index # Rows whose time contains a 'B'
  lista[i].drop("index", axis = 1, inplace = True)
  lista[i].drop(ind_b, inplace = True) # Remove the 2B terms
  lista[i]["Hora"] = lista[i]["Hora"].str.replace("2A","02") # Replace the 2A by 02
  lista[i]["Hora"] = lista[i]["Hora"].astype("datetime64[ns]") # Convert into datetime again
  lista[i].sort_values(by = "Hora", inplace = True)
  lista[i]["time"] = pd.to_datetime(lista[i]["Hora"])
  lista[i]["date"] = lista[i]["time"].dt.date # Create the columns date and hour
  lista[i]["hour"] = lista[i]["time"].dt.time
  lista[i].drop(["Hora","time"], axis = 1, inplace = True) # Drop the columns that are no longer needed

Once the "date" and "hour" columns have been created, the original "Hora" column and the intermediate "time" column are no longer needed, so the last line of the cell above drops them.
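
The handling of the "2A" and "2B" labels and the split into date and hour can be sketched on a toy frame. The timestamp layout below is an assumption, as is the reading that "2A" and "2B" mark the two passes through the repeated hour on a clock-change day; the real files may differ.

```python
import pandas as pd

# Hypothetical timestamps; the exact label layout in the real files is an assumption
toy = pd.DataFrame({
    "Hora": ["2021-10-31 01:55", "2021-10-31 2A:00", "2021-10-31 2B:00", "2021-10-31 03:00"],
    "Eólica": [1.0, 2.0, 3.0, 4.0],
})

# Drop the rows labelled 2B and relabel 2A as 02
toy = toy[~toy["Hora"].str.contains("B", na=False)].copy()
toy["Hora"] = toy["Hora"].str.replace("2A", "02", regex=False)

# Parse once, then split into separate date and hour columns
toy["time"] = pd.to_datetime(toy["Hora"])
toy["date"] = toy["time"].dt.date
toy["hour"] = toy["time"].dt.time
toy = toy.drop(columns=["Hora", "time"]).reset_index(drop=True)

print(toy)
```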

In [None]:
numbers = list(lista[0].columns[0:14]) # The 14 columns with numeric values
time = list(lista[0].columns[14:]) # The date and hour columns
new_order = time + numbers # Time columns first
for i in range(len(lista)):
  lista[i][numbers] = lista[i][numbers].astype(str)
  lista[i][numbers] = lista[i][numbers].astype(float) # Convert the numeric column values into float
  lista[i] = lista[i][new_order] # Change the order of the columns
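
As a small illustration of the cast and reordering above, the sketch below applies the same two-step conversion to a toy frame. The mixed string and integer values are an assumption about what an Excel import can produce, not a description of the real files.

```python
import pandas as pd

# Toy frame with object columns holding a mix of strings and numbers (assumed, not real data)
toy = pd.DataFrame({
    "Eólica": ["1.5", 2, "3"],
    "Nuclear": [4, "5.25", 6],
    "date": ["2021-01-01"] * 3,
    "hour": ["00:00", "00:05", "00:10"],
})

numbers = ["Eólica", "Nuclear"]
time_cols = ["date", "hour"]

# Same two-step cast as in the cell above: to str first, then to float
toy[numbers] = toy[numbers].astype(str).astype(float)

# Put the time columns first
toy = toy[time_cols + numbers]

print(toy.dtypes)
```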

The next step is to translate the column names into English.

In [None]:
not_translated = numbers 
translated = ["wind","nuclear","charcoal","combined cycle","hydropower","international","photovoltaic solar","thermal solar","renewable thermal","diesel","gas turbine","steam turbine","auxiliar generation","cogeneration and residues"]
for i in range(len(lista)):
  lista[i] = lista[i].rename(columns={not_translated[k]: translated[k] for k in range(len(translated))})
  lista[i] = lista[i].reset_index()
  lista[i] = lista[i].drop(["index"],axis = 1)
  lista[i].drop(["international"],axis = 1, inplace = True) # Remove the international column, since it is not a generation source

In [None]:
translated.remove('international') # Remove 'international' from the list
lista_names = ["generation","emission"]
for i in range(len(lista)): # Convert from wide format into long format
  lista[i] = pd.melt(lista[i], id_vars=['date','hour'], value_vars=translated, value_name = lista_names[i], var_name = 'energy_source')
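
For readers unfamiliar with `pd.melt`, this minimal sketch shows the wide-to-long conversion on a toy frame with only two energy sources. The values are made up for illustration.

```python
import pandas as pd

# Wide format: one column per energy source (toy values)
wide = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01"],
    "hour": ["00:00", "00:05"],
    "wind": [10.0, 11.0],
    "nuclear": [20.0, 21.0],
})

# Long format: one row per (date, hour, energy source) combination
long = pd.melt(
    wide,
    id_vars=["date", "hour"],
    value_vars=["wind", "nuclear"],
    var_name="energy_source",
    value_name="generation",
)

print(long)
```

Each of the two original rows becomes one row per energy source, so the result has four rows and a single value column.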


In [None]:
values = list(np.arange(13))
for i in range(len(lista)): # Create energy source id and remove energy source
  # Build the conditions inside the loop so they are evaluated against the current DataFrame,
  # not against whichever one `i` pointed to after the previous cell
  conditions = [(lista[i]["energy_source"] == name) for name in translated]
  lista[i]['energy_source_id'] = np.select(conditions, values)
  lista[i] = lista[i].drop(["energy_source"], axis = 1)
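
An alternative way to assign the ids is a dictionary lookup with `Series.map`, sketched below on a toy frame. This is not the approach used in this notebook; the shortened `sources` list and the name `source_to_id` are hypothetical.

```python
import pandas as pd

# Shortened, hypothetical subset of the energy sources; the position in the list is the id
sources = ["wind", "nuclear", "charcoal"]
source_to_id = {name: k for k, name in enumerate(sources)}

toy = pd.DataFrame({
    "energy_source": ["nuclear", "wind", "charcoal", "wind"],
    "generation": [20.0, 10.0, 5.0, 11.0],
})

# Look up the id of each row's energy source, then drop the text column
toy["energy_source_id"] = toy["energy_source"].map(source_to_id)
toy = toy.drop(columns=["energy_source"])

print(toy)
```

One difference worth knowing: `map` leaves a missing value for any name that is not in the dictionary, whereas `np.select` falls back to its default of 0, which is also the id of the first source.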

In [None]:
# Write each dataset to its own CSV file
lista_name = ['generacion','emision']
for i in range(len(lista)):
  lista[i].to_csv(f"output_{lista_name[i]}.csv", index = False)