# Electricity production by source

### About Dataset

This dataset provides information on changes in electricity generation by country and the type of energy source used for electricity production, including coal, natural gas, petroleum, nuclear power, and various renewable sources.

### The contents of the dataset

The dataset has 8,263 rows and 12 columns.
Column keys:

- Entity - name of the country or territory, or the world (total for all countries);
- Code - the ISO country code;
- Year - year, from 1965 to 2023.

### Electricity produced per year (in terawatt-hours), by source:

- Coal;
- Gas;
- Nuclear;
- Hydro;
- Solar;
- Oil;
- Wind;
- Bioenergy;
- Other renewables (includes waste, geothermal, wave, and tidal energy).

## Importing libraries

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

%matplotlib inline

from sklearn import tree, datasets, metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score, cross_validate 
from sklearn.model_selection import KFold, StratifiedKFold, RepeatedKFold, LeaveOneOut 
import math

import warnings
warnings.filterwarnings('ignore')

## Loading the dataset

In [2]:
df = pd.read_csv("/Users/kamiro/proyecto-1-innovacion-tecnologica-IA/Electricity production by source.csv")

## Data understanding

In [15]:
# View the first rows
df.head()

Unnamed: 0,Entity,Code,Year,Coal,Gas,Nuclear,Hydro,Solar,Oil,Wind,Bioenergy,Other renewables
0,Afghanistan,AFG,2000,0.0,0.0,0.0,0.31,0.0,0.17,0.0,0.0,0.0
1,Afghanistan,AFG,2001,0.04,0.0,0.0,0.5,0.0,0.15,0.0,0.0,0.0
2,Afghanistan,AFG,2002,0.04,0.0,0.0,0.56,0.0,0.11,0.0,0.0,0.0
3,Afghanistan,AFG,2003,0.09,0.0,0.0,0.63,0.0,0.19,0.0,0.0,0.0
4,Afghanistan,AFG,2004,0.06,0.0,0.0,0.56,0.0,0.17,0.0,0.0,0.0


In [16]:
# View the last rows
df.tail()

Unnamed: 0,Entity,Code,Year,Coal,Gas,Nuclear,Hydro,Solar,Oil,Wind,Bioenergy,Other renewables
8258,Zimbabwe,ZWE,1996,,,0.0,,,,,,
8259,Zimbabwe,ZWE,1997,,,0.0,,,,,,
8260,Zimbabwe,ZWE,1998,,,0.0,,,,,,
8261,Zimbabwe,ZWE,1999,,,0.0,,,,,,
8262,Zimbabwe,ZWE,2023,,,0.0,,,,,,


In [17]:
# Check the size of the dataset
df.shape

(8263, 12)

In [18]:
# General information
print("\nGeneral information about the dataset:")
df.info()


General information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8263 entries, 0 to 8262
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Entity            8263 non-null   object 
 1   Code              8263 non-null   object 
 2   Year              8263 non-null   int64  
 3   Coal              5692 non-null   float64
 4   Gas               5626 non-null   float64
 5   Nuclear           8156 non-null   float64
 6   Hydro             7462 non-null   float64
 7   Solar             6668 non-null   float64
 8   Oil               5736 non-null   float64
 9   Wind              6697 non-null   float64
 10  Bioenergy         5317 non-null   float64
 11  Other renewables  4753 non-null   float64
dtypes: float64(9), int64(1), object(2)
memory usage: 774.8+ KB


In [19]:
# Descriptive statistics
print("\n📊 Descriptive statistics:")
print(df.describe(include="all"))  # include="all" also covers the categorical columns


📊 Descriptive statistics:
           Entity  Code         Year          Coal          Gas      Nuclear  \
count        8263  8263  8263.000000   5692.000000  5626.000000  8156.000000   
unique        219   219          NaN           NaN          NaN          NaN   
top     Singapore   SGP          NaN           NaN          NaN          NaN   
freq           59    59          NaN           NaN          NaN          NaN   
mean          NaN   NaN  2000.502965     95.788839    50.917572    25.246391   
std           NaN   NaN    15.911960    668.019549   353.908374   181.624466   
min           NaN   NaN  1965.000000      0.000000     0.000000     0.000000   
25%           NaN   NaN  1989.000000      0.000000     0.000000     0.000000   
50%           NaN   NaN  2004.000000      0.000000     0.500000     0.000000   
75%           NaN   NaN  2013.000000      7.622500    14.487500     0.000000   
max           NaN   NaN  2023.000000  10467.930000  6622.930000  2762.240000   

         

In [20]:
# Null values
print("\n🔍 Null values per column:")
print(df.isnull().sum())


🔍 Null values per column:
Entity                 0
Code                   0
Year                   0
Coal                2571
Gas                 2637
Nuclear              107
Hydro                801
Solar               1595
Oil                 2527
Wind                1566
Bioenergy           2946
Other renewables    3510
dtype: int64
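
Raw counts are easier to compare as a share of the rows. The sketch below is only illustrative: it re-enters the null counts printed above by hand and divides them by the 8263 rows reported by `df.shape`. The names `null_counts`, `n_rows`, and `null_pct` are introduced here and are not part of the original notebook.

```python
import pandas as pd

# Null counts copied from the output above; 8263 is the row count from df.shape.
null_counts = pd.Series({
    "Coal": 2571, "Gas": 2637, "Nuclear": 107, "Hydro": 801, "Solar": 1595,
    "Oil": 2527, "Wind": 1566, "Bioenergy": 2946, "Other renewables": 3510,
})
n_rows = 8263

# Percentage of missing values per column, largest first.
null_pct = (null_counts / n_rows * 100).round(1).sort_values(ascending=False)
print(null_pct)
```

On the loaded frame itself, `df.isnull().mean().mul(100)` should give the same percentages without retyping the counts.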


In [22]:
# Duplicate rows
print(f"\n🔍 Duplicate rows: {df.duplicated().sum()}")


🔍 Duplicate rows: 0
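
`df.duplicated()` only flags rows that match on every column, so two rows for the same country and year with slightly different values would pass unnoticed. A possible complementary check, sketched here on a small made-up frame (the `toy` values are illustrative, not taken from the dataset), is to look for repeats on the `Entity` and `Year` key alone:

```python
import pandas as pd

# Made-up rows: the last two share Entity and Year but differ in Hydro.
toy = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Year": [2000, 2001, 2001],
    "Hydro": [0.31, 0.50, 0.51],
})

# Full-row comparison: no two rows are identical in every column.
full_row_dupes = toy.duplicated().sum()

# Key-only comparison: the second 2001 row repeats an (Entity, Year) pair.
key_dupes = toy.duplicated(subset=["Entity", "Year"]).sum()

print(full_row_dupes, key_dupes)
```

Whether such key-level repeats exist in the real data is not shown in this notebook; the same `subset=` call on `df` would answer that.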


In [26]:
# Identify stray whitespace (double spaces, or leading/trailing spaces)
def tiene_espacios(valor):
    if isinstance(valor, str):
        return valor != valor.strip() or "  " in valor
    return False

print("\n🔍 Rows with stray whitespace:")
# DataFrame.map requires pandas >= 2.1 (older versions use applymap)
print(df[df.map(tiene_espacios).any(axis=1)])


🔍 Rows with stray whitespace:
Empty DataFrame
Columns: [Entity, Code, Year, Coal, Gas, Nuclear, Hydro, Solar, Oil, Wind, Bioenergy, Other renewables]
Index: []
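
Since the real dataset returns no rows, it can help to see what the check would catch. The following sketch reuses `tiene_espacios` on a made-up frame (`toy` and its values are hypothetical) and applies it column by column with `Series.map`, which should also work on pandas versions that predate `DataFrame.map`:

```python
import pandas as pd

def tiene_espacios(valor):
    # True for strings with leading/trailing spaces or a double space inside.
    if isinstance(valor, str):
        return valor != valor.strip() or "  " in valor
    return False

# Made-up rows: row 1 has a leading space, row 2 has a double space.
toy = pd.DataFrame({
    "Entity": ["Zimbabwe", " Zimbabwe", "South  Africa"],
    "Code": ["ZWE", "ZWE", "ZAF"],
    "Year": [1996, 1997, 1998],
})

# Element-wise check, one column at a time; non-string cells yield False.
mask = toy.apply(lambda col: col.map(tiene_espacios))
flagged = toy[mask.any(axis=1)]
print(flagged)
```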


In [28]:
# Keep only the numeric columns
df_num = df.select_dtypes(include=['number'])

# Compute the interquartile range (IQR)
Q1 = df_num.quantile(0.25)
Q3 = df_num.quantile(0.75)
IQR = Q3 - Q1

# Flag values outside the 1.5 * IQR fences
outliers = ((df_num < (Q1 - 1.5 * IQR)) | (df_num > (Q3 + 1.5 * IQR)))

print("\n🔍 Number of outliers per column:")
print(outliers.sum())


🔍 Number of outliers per column:
Year                   0
Coal                1108
Gas                  847
Nuclear             1585
Hydro               1075
Solar               1359
Oil                  835
Wind                1496
Bioenergy            979
Other renewables     526
dtype: int64
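
These counts deserve a cautious reading. In the descriptive statistics above, `Nuclear` has both its 25% and 75% quantiles at 0, so its IQR is 0 and the rule flags every positive value. The sketch below reproduces that effect on a made-up series (`s` and its values are illustrative only), and also shows that missing values are never flagged because comparisons with NaN evaluate to False:

```python
import pandas as pd

# Made-up zero-inflated column: eight zeros, one positive value, one missing.
s = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5, None], name="Nuclear")

# quantile() skips NaN, so both quartiles come out as 0 here.
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# With iqr == 0 both fences collapse to 0, so any value above 0 is flagged.
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(iqr, int(is_outlier.sum()))
```

For columns dominated by zeros, the flagged values are therefore likely to be ordinary producing countries rather than data errors.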


In [29]:
# Check for incorrect data types
print("\n🔍 Detected data types:")
print(df.dtypes)


🔍 Detected data types:
Entity               object
Code                 object
Year                  int64
Coal                float64
Gas                 float64
Nuclear             float64
Hydro               float64
Solar               float64
Oil                 float64
Wind                float64
Bioenergy           float64
Other renewables    float64
dtype: object
