<a href="https://colab.research.google.com/github/jmelendezgeo/Exploratory-analysis-/blob/main/EnegySupply.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In this notebook we carry out an **exploratory analysis** of several datasets containing country-level information. The data goes through a **data cleaning** process to bring it into a consistent format in a single DataFrame, which we then use to derive insights and statistics.

In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Data load & Data Cleaning

The data in 'Energy Indicators' covers *energy supply and renewable electricity production* for 2013, published by the [United Nations](https://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls).

In [4]:
# First, load the data from the .xls (Excel) file.
# The file has extra information in its header and footer, so we exclude those rows from our DataFrame.
Energy = pd.read_excel('Energy Indicators.xls', skiprows = 17, skipfooter = 38)
Energy.head()

,Unnamed: 0,Unnamed: 1,Unnamed: 2,Petajoules,Gigajoules,%
0,,Afghanistan,Afghanistan,321,10,78.66928
1,,Albania,Albania,102,35,100.0
2,,Algeria,Algeria,1959,51,0.55101
3,,American Samoa,American Samoa,...,...,0.641026
4,,Andorra,Andorra,9,121,88.69565
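
To make the effect of `skiprows` and `skipfooter` concrete, here is a minimal sketch on a small in-memory CSV rather than the actual spreadsheet. The sample lines, the country names, and the helper `load_trimmed` are made up for illustration; `read_csv` is used only because it needs no Excel file, and with it `skipfooter` requires `engine="python"`.

```python
import io
import pandas as pd

# Hypothetical file layout: 2 note lines, a column row, 3 data rows, 2 footnote lines
raw = "\n".join([
    "Source: example agency",
    "Table 1 (illustrative)",
    "Country,Petajoules",
    "Aland,10",
    "Borduria,20",
    "Cascadia,30",
    "Footnote a",
    "Footnote b",
])

def load_trimmed(text, n_header, n_footer):
    # skiprows drops leading lines; skipfooter drops trailing lines
    return pd.read_csv(io.StringIO(text), skiprows=n_header,
                       skipfooter=n_footer, engine="python")

demo = load_trimmed(raw, n_header=2, n_footer=2)
print(demo)
```

Only the column row and the three data rows should survive, which is the same trimming the `read_excel` call applies to the UN file with its own row counts.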


In [5]:
# The first two columns are unnecessary. We also need to name our columns appropriately 
Energy.drop(columns = ['Unnamed: 0','Unnamed: 1'], inplace = True)
Energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
Energy.head()

,Country,Energy Supply,Energy Supply per Capita,% Renewable
0,Afghanistan,321,10,78.66928
1,Albania,102,35,100.0
2,Algeria,1959,51,0.55101
3,American Samoa,...,...,0.641026
4,Andorra,9,121,88.69565


In [13]:
# Some entries in the energy columns are not numeric (e.g. '...').
# According to the dataset notes, '...' means no data, so we will replace these strings with np.nan.
# First, flag every cell that is not a real number (True = not numeric).
~Energy.applymap(np.isreal)

,Country,Energy Supply,Energy Supply per Capita,% Renewable
0,True,False,False,False
1,True,False,False,False
2,True,False,False,False
3,True,True,True,False
4,True,False,False,False
...,...,...,...,...
222,True,False,False,False
223,True,False,False,False
224,True,False,False,False
225,True,False,False,False
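
Another way to spot the non-numeric placeholders is to coerce a column with `pd.to_numeric(..., errors="coerce")` and check which values turned into `NaN`. This is a minimal sketch on made-up data: the `toy` frame and the helper `non_numeric_mask` are hypothetical and not part of the UN table or the notebook.

```python
import pandas as pd

# Hypothetical data with one '...' placeholder
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Energy Supply": [321, "...", 9],
})

def non_numeric_mask(series):
    # Values that fail numeric conversion become NaN under errors="coerce";
    # flag only those that were not already missing in the original column.
    coerced = pd.to_numeric(series, errors="coerce")
    return coerced.isna() & series.notna()

mask = non_numeric_mask(toy["Energy Supply"])
print(toy.loc[mask, "Country"].tolist())
```

Working column by column avoids flagging the text in `Country`, which is expected to be non-numeric.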


In [19]:
# True means the value is not a real number, which lets us spot the strings.
# Every column except 'Country' should contain real numbers.

# Replace missing-data markers (such as '...') with np.nan
Energy.replace(r'\.{2,}', np.nan, regex=True, inplace=True)

# Convert petajoules to gigajoules so both energy columns use the same unit (1 PJ = 1,000,000 GJ).
# Run this cell only once: every re-run multiplies the column by 1,000,000 again.
Energy['Energy Supply'] = Energy['Energy Supply'] * 1000000
Energy[10:15]

,Country,Energy Supply,Energy Supply per Capita,% Renewable
10,Aruba,1.2e+07,120.0,14.87069
11,Australia1,5.386e+09,231.0,11.81081
12,Austria,1.391e+09,164.0,72.45282
13,Azerbaijan,5.67e+08,60.0,6.384345
14,Bahamas,4.5e+07,118.0,0.0
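
Because an in-place multiplication compounds each time the cell is re-run, one defensive option is to keep the original petajoule values and derive the gigajoule column from them. This is a minimal sketch: the column names `Energy Supply (PJ)` and `Energy Supply (GJ)`, the `toy` frame, and the helper `add_gigajoules` are assumptions for illustration and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

GJ_PER_PJ = 1_000_000  # 1 petajoule = 10**6 gigajoules

# Hypothetical data, including one missing value
toy = pd.DataFrame({"Energy Supply (PJ)": [321, np.nan, 9]})

def add_gigajoules(df):
    # Derive the GJ column from the untouched PJ column, so repeated calls
    # give the same result instead of multiplying again.
    out = df.copy()
    out["Energy Supply (GJ)"] = out["Energy Supply (PJ)"] * GJ_PER_PJ
    return out

once = add_gigajoules(toy)
twice = add_gigajoules(once)
print(once.equals(twice))
```

The trade-off is one extra column in exchange for a conversion that is safe to run any number of times.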


In [20]:
# Some country names need cleaning up, which is a very common problem.
# We apply a pattern that drops parenthesised text and footnote digits from each name,
# then correct some country names with a dictionary.
Energy['Country'] = Energy['Country'].str.extract(r'(^[a-zA-Z\s,]+)', expand=False)
# Fix some country names
Energy.replace({"Republic of Korea": "South Korea",
                "United States of America": "United States",
                "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
                "China, Hong Kong Special Administrative Region": "Hong Kong",
                "Iran ": "Iran"}, inplace=True)
# Finally, set the country name as the index
Energy = Energy.set_index('Country')
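
The character-class pattern above keeps only ASCII letters, whitespace, and commas, so a name containing a hyphen, an apostrophe, or an accented letter would be cut short at that character. An alternative sketch removes just the unwanted parts (parenthesised text and trailing footnote digits) instead. The helper `clean_country` and the sample names are illustrative assumptions, not the notebook's method.

```python
import pandas as pd

def clean_country(names):
    # Drop parenthesised qualifiers, e.g. " (Islamic Republic of)"
    out = names.str.replace(r"\s*\(.*?\)", "", regex=True)
    # Drop footnote digits glued to the end of the name, e.g. "Australia1"
    out = out.str.replace(r"\d+$", "", regex=True)
    # Remove any leftover leading or trailing whitespace
    return out.str.strip()

# Hypothetical sample of raw names
sample = pd.Series([
    "Australia1",
    "Iran (Islamic Republic of)",
    "China, Hong Kong Special Administrative Region3",
    "Guinea-Bissau",
])
cleaned = clean_country(sample)
print(cleaned.tolist())
```

With this approach no trailing space is left behind, so a dictionary entry such as `"Iran "` would not be needed.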


In [21]:
Energy.head()

Country,Energy Supply,Energy Supply per Capita,% Renewable
Afghanistan,3.21e+08,10.0,78.66928
Albania,1.02e+08,35.0,100.0
Algeria,1.959e+09,51.0,0.55101
American Samoa,,,0.641026
Andorra,9e+06,121.0,88.69565


In [22]:
# With this dataset clean and ready for enrichment, let's look at some metrics 
Energy.describe()

,Energy Supply,Energy Supply per Capita,% Renewable
count,222.0,222.0,227.0
mean,2.457982e+09,90.666667,28.086077
std,1.103915e+10,116.234887,31.903505
min,0.0,2.0,0.0
25%,3.3e+07,21.25,0.022893
50%,1.895e+08,51.0,14.81481
75%,9.9625e+08,117.75,50.16862
max,1.27191e+11,957.0,100.0
