# Data Preprocessing and Cleaning

This notebook prepares the dataset for modeling by cleaning and standardizing the data. Numeric variables are standardized with `StandardScaler`, and the categorical variable `Country` is one-hot encoded. The result is a clean dataset saved in `data/procesados/`.

# Contents

1. [Environment Setup and Paths](#1-environment-setup-and-paths)

2. [Standardization and Normalization of Variables](#2-standardization-and-normalization-of-variables)

3. [Encoding Categorical Variables](#3-encoding-categorical-variables)

4. [Validating the New DataFrame](#4-validating-the-new-dataframe)

## 1. Environment Setup and Paths

* Library Imports

In [29]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

* Custom Library Imports

In [30]:
import sys
import os
sys.path.append(os.path.abspath('..'))
import src.preprocesamiento as pre

* Global Configuration

In [31]:
# Pandas configuration: show all columns, two-decimal floats, at most 50 rows
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)
pd.set_option('display.max_rows', 50)

# NumPy configuration: two decimals, no scientific notation for small values
np.set_printoptions(precision=2, suppress=True)

# Scikit-learn configuration: scaler and encoder reused in the sections below
scaler = StandardScaler()
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

* Paths

In [32]:
data_path = os.path.join('..', 'data', 'original', 'global_energy_consumption.csv')
output_data = os.path.join('..', 'data', 'procesados', 'global_energy_consumption_clean.csv')

* Loading the Raw Data

In [33]:
df = pre.cargar_datos(data_path)

* Creating a Copy of the DataFrame with the Numeric Columns

In [34]:
columnas = [
    'Total Energy Consumption (TWh)', 'Per Capita Energy Use (kWh)',
    'Renewable Energy Share (%)', 'Fossil Fuel Dependency (%)',
    'Carbon Emissions (Million Tons)', 'Energy Price Index (USD/kWh)',
    'Industrial Energy Use (%)', 'Household Energy Use (%)',
]

df_base = df[columnas].copy()

## 2. Standardization and Normalization of Variables

* Apply `StandardScaler` to the numeric columns.

In [35]:
df_escalados = pre.estandarizacion(df_base, columnas, scaler)
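
The source of `pre.estandarizacion` is not shown in this notebook, so the following is only a sketch of what such a helper might do. It assumes the helper fits the scaler on the selected columns and returns the z-scores, z = (x - mean) / std, as a new DataFrame. The name `estandarizacion_sketch` and the toy data are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def estandarizacion_sketch(df: pd.DataFrame, columnas: list, scaler) -> pd.DataFrame:
    """Hypothetical stand-in for pre.estandarizacion (an assumption, not the original code)."""
    # fit_transform learns each column's mean and std, then returns the z-scores.
    valores = scaler.fit_transform(df[columnas])
    # Rebuild a DataFrame so column names and the row index are preserved.
    return pd.DataFrame(valores, columns=columnas, index=df.index)


toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})
escalado = estandarizacion_sketch(toy, ['a', 'b'], StandardScaler())

print(escalado.mean().round(6))        # each column is centered at 0
print(escalado.std(ddof=0).round(6))   # each column has population std 1
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), which is why the check uses `ddof=0`. `DataFrame.describe()` reports the sample standard deviation instead, so its `std` is only approximately 1 on scaled data.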

* Merge with the Original DataFrame

In [36]:
df = pre.combinar_df(df, df_escalados, columnas)
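
`pre.combinar_df` is also defined outside this notebook. Judging by the column order in the outputs that follow, it appears to put the scaled values back in place of the original columns while leaving `Country` and `Year` untouched. A hypothetical helper with that behavior (the name `combinar_df_sketch` and the toy data are assumptions) could look like this:

```python
import pandas as pd


def combinar_df_sketch(df_original: pd.DataFrame,
                       df_escalados: pd.DataFrame,
                       columnas: list) -> pd.DataFrame:
    """Hypothetical stand-in for pre.combinar_df (an assumption, not the original code)."""
    # Work on a copy so the caller's DataFrame is not modified.
    combinado = df_original.copy()
    # Overwrite only the listed columns; rows are aligned on the index.
    combinado[columnas] = df_escalados[columnas]
    return combinado


original = pd.DataFrame({'Country': ['Canada', 'Brazil'],
                         'Year': [2018, 2010],
                         'x': [100.0, 300.0]})
escalados = pd.DataFrame({'x': [-1.0, 1.0]}, index=original.index)

combinado = combinar_df_sketch(original, escalados, ['x'])
print(combinado)
```

Overwriting in place keeps the original column order, so downstream code that selects columns by position or name keeps working.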

* Dataset Statistics

In [37]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 10000.0 | 2012.15 | 7.16 | 2000.0 | 2006.0 | 2012.0 | 2018.0 | 2024.0 |
| Total Energy Consumption (TWh) | 10000.0 | 0.0 | 1.0 | -1.77 | -0.85 | 0.02 | 0.86 | 1.7 |
| Per Capita Energy Use (kWh) | 10000.0 | -0.0 | 1.0 | -1.73 | -0.87 | 0.0 | 0.85 | 1.76 |
| Renewable Energy Share (%) | 10000.0 | 0.0 | 1.0 | -1.72 | -0.86 | -0.01 | 0.87 | 1.73 |
| Fossil Fuel Dependency (%) | 10000.0 | 0.0 | 1.0 | -1.73 | -0.87 | 0.01 | 0.87 | 1.74 |
| Carbon Emissions (Million Tons) | 10000.0 | -0.0 | 1.0 | -1.75 | -0.87 | 0.02 | 0.86 | 1.73 |
| Energy Price Index (USD/kWh) | 10000.0 | 0.0 | 1.0 | -1.71 | -0.87 | -0.03 | 0.89 | 1.73 |
| Industrial Energy Use (%) | 10000.0 | -0.0 | 1.0 | -1.74 | -0.85 | -0.01 | 0.87 | 1.73 |
| Household Energy Use (%) | 10000.0 | -0.0 | 1.0 | -1.75 | -0.86 | 0.01 | 0.88 | 1.74 |


## 3. Encoding Categorical Variables

This section converts the `Country` column into a numeric format that machine learning algorithms can consume.

* Encoding the Categorical Variable with `OneHotEncoder`

In [38]:
df_codificado = pre.code_categori(df, 'Country', encoder)

# Check the result
print(df_codificado.columns)
df_codificado.head(10)


Index(['Country', 'Year', 'Total Energy Consumption (TWh)',
       'Per Capita Energy Use (kWh)', 'Renewable Energy Share (%)',
       'Fossil Fuel Dependency (%)', 'Carbon Emissions (Million Tons)',
       'Energy Price Index (USD/kWh)', 'Industrial Energy Use (%)',
       'Household Energy Use (%)', 'Country_Australia', 'Country_Brazil',
       'Country_Canada', 'Country_China', 'Country_Germany', 'Country_India',
       'Country_Japan', 'Country_Russia', 'Country_UK', 'Country_USA'],
      dtype='object')


| | Country | Year | Total Energy Consumption (TWh) | Per Capita Energy Use (kWh) | Renewable Energy Share (%) | Fossil Fuel Dependency (%) | Carbon Emissions (Million Tons) | Energy Price Index (USD/kWh) | Industrial Energy Use (%) | Household Energy Use (%) | Country_Australia | Country_Brazil | Country_Canada | Country_China | Country_Germany | Country_India | Country_Japan | Country_Russia | Country_UK | Country_USA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Canada | 2018 | 1.54 | 1.22 | -1.37 | 1.26 | 0.86 | -1.17 | 0.44 | -0.59 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Germany | 2020 | 0.98 | 0.81 | -0.56 | -0.15 | 0.12 | -1.48 | -0.5 | -0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Russia | 2002 | 0.52 | 1.17 | -1.48 | -0.28 | -1.16 | -0.1 | 1.18 | 0.16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | Brazil | 2010 | 1.21 | -0.99 | 1.05 | -1.4 | -0.98 | 1.5 | -0.82 | 0.3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Canada | 2006 | -1.51 | 0.5 | 1.07 | 1.48 | -1.19 | 1.58 | 0.2 | -0.19 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | UK | 2016 | -0.16 | -1.59 | -0.28 | 1.19 | -0.21 | 0.59 | -1.73 | -0.19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | India | 2024 | 1.67 | -1.49 | -0.82 | -0.66 | 0.77 | -1.1 | 1.52 | -0.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | Canada | 2008 | 0.9 | 0.76 | -1.31 | 0.88 | 0.89 | -1.71 | 1.55 | -0.41 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | Russia | 2020 | 1.57 | -1.64 | 1.22 | 0.91 | -1.41 | -0.56 | 0.72 | -0.12 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | Brazil | 2008 | 0.94 | 0.29 | 1.31 | -0.94 | -0.02 | -1.25 | -0.27 | 0.34 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## 4. Validating the New DataFrame

* Verify that there are no missing values or duplicate rows.

In [39]:
df_codificado.isnull().sum()

Country                            0
Year                               0
Total Energy Consumption (TWh)     0
Per Capita Energy Use (kWh)        0
Renewable Energy Share (%)         0
Fossil Fuel Dependency (%)         0
Carbon Emissions (Million Tons)    0
Energy Price Index (USD/kWh)       0
Industrial Energy Use (%)          0
Household Energy Use (%)           0
Country_Australia                  0
Country_Brazil                     0
Country_Canada                     0
Country_China                      0
Country_Germany                    0
Country_India                      0
Country_Japan                      0
Country_Russia                     0
Country_UK                         0
Country_USA                        0
dtype: int64

In [40]:
df_codificado.duplicated().sum()

np.int64(0)
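
The two checks above can also be bundled into a single reusable function. This is a minimal sketch rather than part of the notebook: the helper name `validar_sketch` is hypothetical, and the third check (each row activates exactly one `Country_` indicator) is an extra assumption about what a valid one-hot encoding should look like here:

```python
import pandas as pd


def validar_sketch(df: pd.DataFrame, prefijo: str = 'Country_') -> dict:
    """Hypothetical validation helper; not part of the original notebook."""
    # Select only the indicator columns, e.g. 'Country_Brazil'.
    dummies = df.filter(regex=f'^{prefijo}')
    return {
        'faltantes': int(df.isnull().sum().sum()),    # total missing cells
        'duplicados': int(df.duplicated().sum()),     # fully duplicated rows
        # True only if every row has exactly one indicator set to 1.
        'one_hot_valido': bool((dummies.sum(axis=1) == 1).all()),
    }


toy = pd.DataFrame({'Country': ['Canada', 'Brazil'],
                    'Country_Brazil': [0.0, 1.0],
                    'Country_Canada': [1.0, 0.0]})
reporte = validar_sketch(toy)
print(reporte)
```

Returning a dictionary instead of printing makes it easy to turn the checks into assertions at the end of the notebook.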

* Checking Numeric and Categorical Ranges with a Final Statistical Summary

In [41]:
df_codificado.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 10000.0 | 2012.15 | 7.16 | 2000.0 | 2006.0 | 2012.0 | 2018.0 | 2024.0 |
| Total Energy Consumption (TWh) | 10000.0 | 0.0 | 1.0 | -1.77 | -0.85 | 0.02 | 0.86 | 1.7 |
| Per Capita Energy Use (kWh) | 10000.0 | -0.0 | 1.0 | -1.73 | -0.87 | 0.0 | 0.85 | 1.76 |
| Renewable Energy Share (%) | 10000.0 | 0.0 | 1.0 | -1.72 | -0.86 | -0.01 | 0.87 | 1.73 |
| Fossil Fuel Dependency (%) | 10000.0 | 0.0 | 1.0 | -1.73 | -0.87 | 0.01 | 0.87 | 1.74 |
| Carbon Emissions (Million Tons) | 10000.0 | -0.0 | 1.0 | -1.75 | -0.87 | 0.02 | 0.86 | 1.73 |
| Energy Price Index (USD/kWh) | 10000.0 | 0.0 | 1.0 | -1.71 | -0.87 | -0.03 | 0.89 | 1.73 |
| Industrial Energy Use (%) | 10000.0 | -0.0 | 1.0 | -1.74 | -0.85 | -0.01 | 0.87 | 1.73 |
| Household Energy Use (%) | 10000.0 | -0.0 | 1.0 | -1.75 | -0.86 | 0.01 | 0.88 | 1.74 |
| Country_Australia | 10000.0 | 0.1 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


* Saving the Preprocessed DataFrame

In [42]:
pre.guardar_dataframe(df_codificado, output_data)

DataFrame guardado en ..\data\procesados\global_energy_consumption_clean.csv
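
`pre.guardar_dataframe` is defined in the project's own module and its code is not shown. As a sketch only, a helper that produces a message like the one above might create the target folder if needed, write the CSV, and print the path. The name `guardar_dataframe_sketch`, the `index=False` choice, and the temporary path are all assumptions made for this example:

```python
import os
import tempfile

import pandas as pd


def guardar_dataframe_sketch(df: pd.DataFrame, ruta: str) -> None:
    """Hypothetical stand-in for pre.guardar_dataframe (an assumption, not the original code)."""
    carpeta = os.path.dirname(ruta)
    if carpeta:
        # Create the output folder (and parents) if it does not exist yet.
        os.makedirs(carpeta, exist_ok=True)
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(ruta, index=False)
    print(f'DataFrame guardado en {ruta}')


toy = pd.DataFrame({'Country': ['Canada'], 'Year': [2018], 'x': [1.5]})

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    ruta = os.path.join(tmp, 'procesados', 'demo_clean.csv')
    guardar_dataframe_sketch(toy, ruta)
    recargado = pd.read_csv(ruta)

print(recargado)
```

Reading the file back, as done here, is a cheap way to confirm that the saved CSV has the expected columns before the modeling notebook depends on it.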
