# 1 Data wrangling<a id='1_Data_wrangling'></a>

## 1.1 Contents<a id='2.1_Contents'></a>
* [1 Data wrangling](#1_Data_wrangling)
  * [1.1 Contents](#2.1_Contents)
  * [1.2 Introduction](#1.2_Introduction)
    * [1.2.1 Recap Of Data Science Problem](#1.2.1_Recap_Of_Data_Science_Problem)
    * [1.2.2 Introduction To Notebook](#1.2.2_Introduction_To_Notebook)
  * [1.3 Imports](#2.3_Imports)
  * [1.4 Objectives](#1.4_Objectives)
  * [1.5 Load Data](#1.5_Load_Data)
  * [1.6 Explore The Data](#1.6_Explore_The_Data)
    * [1.6.1 Explore The World Data](#1.6.1_Explore_The_world_Data)
    * [1.6.2 Explore The Production Data](#1.6.2_Explore_The_production_Data)
    * [1.6.3 Explore The Consumption Data](#1.6.3_Explore_The_consumption_Data)
  * [1.7 Summary](#2.7_Summary)

## 1.2 Introduction<a id='1.2_Introduction'></a>

This step focuses on collecting the data, organizing it, and making sure it is well defined. Some data cleaning is done at this stage, but the main focus is on exploring the data to understand it better.

### 1.2.1 Recap Of Data Science Problem<a id='1.2.1_Recap_Of_Data_Science_Problem'></a>

The purpose of this data science project is to build a model that predicts energy production and consumption, based on data collected from 1980 to 2021. Energy prediction is a data-driven approach that helps show how much production and consumption are increasing or decreasing over time. It involves gathering and analyzing data on energy resources to derive insights and make informed decisions. The main objective of this analysis is to predict the rate of energy production and consumption. The model will be used to give energy companies guidance on production and demand.

### 1.2.2 Introduction To Notebook<a id='1.2.2_Introduction_To_Notebook'></a>

In this notebook, I use well-structured, self-explanatory headings and add a brief note after each result to highlight key takeaways. This helps anyone reading the notebook, and it helps me when I come to summarise the findings. Key findings are collected in a final summary at the end of the notebook, so that important results don't get lost in the middle of it.

## 1.3 Imports<a id='2.3_Imports'></a>

In [1]:
# Import pandas, matplotlib.pyplot, seaborn, and os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#from library.sb_utils import save_file

## 1.4 Objectives<a id='1.4_Objectives'></a>

There are some fundamental questions to resolve in this notebook.

* Do you think you may have the data you need to tackle the desired question?
    * Do you have potentially useful features?
* Do you have any fundamental issues with the data?

## 1.5 Load Data<a id='1.5_Load_Data'></a>

In [66]:
# the supplied CSV data files are in the ../Data directory
world_data = pd.read_csv('../Data/World Energy Overview.csv')

In [67]:
prod_data = pd.read_csv('../Data/Production_Data/Production_Total.csv')

In [68]:
consum_data = pd.read_csv('../Data/Consumption_Data/Consumption_Total.csv')

The first steps in auditing the data are calling the `info` method and displaying the first few records with `head`.
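As a minimal sketch of that audit, the cell below runs `info` and `head` on a small stand-in frame. The frame `sample` is an assumption made for illustration: it copies three rows and three columns from the world energy table displayed in section 1.6.1, so the sketch runs without the CSV files.

```python
import io

import pandas as pd

# Stand-in for world_data: three rows and three columns copied from the
# table displayed in this notebook (an illustrative subset, not the full file).
sample = pd.DataFrame({
    "Date": ["1973-01-31", "1973-02-28", "1973-03-31"],
    "Total Primary Energy Production": [5.404715, 5.155115, 5.419556],
    "Total Primary Energy Consumption": [7.223873, 6.592366, 6.521439],
})

# info() writes to stdout by default; passing a buffer captures the text
# so it can be stored or searched as well as printed.
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# head() returns the first rows (five by default) as a new DataFrame.
first_rows = sample.head()
print(first_rows)
```

`info` reports the column names, non-null counts, and dtypes in one place, which covers much of what the separate `columns`, `isnull().sum()`, and `dtypes` calls show later in this notebook.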

## 1.6 Explore The Data<a id='1.6_Explore_The_Data'></a>


### 1.6.1 Explore The World Data<a id='1.6.1_Explore_The_world_Data'></a>


In [70]:
world_data.head()

Unnamed: 0,Date,Total Fossil Fuels Production,Nuclear Electric Power Production,Total Renewable Energy Production,Total Primary Energy Production,Primary Energy Imports,Primary Energy Exports,Primary Energy Net Imports,Primary Energy Stock Change and Other,Total Fossil Fuels Consumption,Nuclear Electric Power Consumption,Total Renewable Energy Consumption,Total Primary Energy Consumption
0,1973-01-31,4.932632,0.068103,0.403981,5.404715,1.17308,0.125781,1.047299,0.771858,6.747651,0.068103,0.403981,7.223873
1,1973-02-28,4.729582,0.064634,0.3609,5.155115,1.168005,0.120883,1.047122,0.390129,6.163095,0.064634,0.3609,6.592366
2,1973-03-31,4.946902,0.072494,0.400161,5.419556,1.309473,0.13995,1.169523,-0.06764,6.044647,0.072494,0.400161,6.521439
3,1973-04-30,4.716271,0.06407,0.38047,5.160812,1.085169,0.194185,0.890984,-0.110067,5.493184,0.06407,0.38047,5.941729
4,1973-05-31,4.956995,0.062111,0.392141,5.411246,1.162804,0.196775,0.966029,-0.305335,5.613551,0.062111,0.392141,6.07194


In [71]:
world_data.shape

(599, 13)

In [72]:
world_data.describe()

Unnamed: 0,Total Fossil Fuels Production,Nuclear Electric Power Production,Total Renewable Energy Production,Total Primary Energy Production,Primary Energy Imports,Primary Energy Exports,Primary Energy Net Imports,Primary Energy Stock Change and Other,Total Fossil Fuels Consumption,Nuclear Electric Power Consumption,Total Renewable Energy Consumption,Total Primary Energy Consumption
count,599.0,599.0,599.0,599.0,599.0,599.0,599.0,599.0,599.0,599.0,599.0,599.0
mean,5.034634,0.519567,0.593709,6.147909,1.873459,0.611704,1.261755,0.031835,6.321391,0.519567,0.59162,7.441499
std,0.610126,0.202697,0.193351,0.895076,0.561286,0.544532,0.724907,0.476652,0.708356,0.202697,0.189229,0.946882
min,3.676065,0.062111,0.304328,4.3068,0.710558,0.056798,-0.554743,-0.894627,4.78391,0.062111,0.304328,5.435627
25%,4.683559,0.328635,0.467414,5.590289,1.447091,0.310775,0.856081,-0.327821,5.798086,0.328635,0.467414,6.65905
50%,4.831601,0.594293,0.527479,5.906173,1.848642,0.373346,1.200957,-0.081028,6.338503,0.594293,0.527257,7.617372
75%,5.087384,0.681056,0.685252,6.290192,2.281205,0.680119,1.759529,0.324711,6.78934,0.681056,0.683567,8.111785
max,7.126618,0.780456,1.21879,8.810077,3.14964,2.386337,2.741692,1.551345,8.096323,0.780456,1.199383,9.664299


In [73]:
world_data.columns

Index(['Date', 'Total Fossil Fuels Production',
       'Nuclear Electric Power Production',
       'Total Renewable Energy Production', 'Total Primary Energy Production',
       'Primary Energy Imports', 'Primary Energy Exports',
       'Primary Energy Net Imports', 'Primary Energy Stock Change and Other',
       'Total Fossil Fuels Consumption', 'Nuclear Electric Power Consumption',
       'Total Renewable Energy Consumption',
       'Total Primary Energy Consumption'],
      dtype='object')

In [74]:
world_data.isnull().sum()

Date                                     0
Total Fossil Fuels Production            0
Nuclear Electric Power Production        0
Total Renewable Energy Production        0
Total Primary Energy Production          0
Primary Energy Imports                   0
Primary Energy Exports                   0
Primary Energy Net Imports               0
Primary Energy Stock Change and Other    0
Total Fossil Fuels Consumption           0
Nuclear Electric Power Consumption       0
Total Renewable Energy Consumption       0
Total Primary Energy Consumption         0
dtype: int64

In [75]:
world_data.dtypes

Date                                      object
Total Fossil Fuels Production            float64
Nuclear Electric Power Production        float64
Total Renewable Energy Production        float64
Total Primary Energy Production          float64
Primary Energy Imports                   float64
Primary Energy Exports                   float64
Primary Energy Net Imports               float64
Primary Energy Stock Change and Other    float64
Total Fossil Fuels Consumption           float64
Nuclear Electric Power Consumption       float64
Total Renewable Energy Consumption       float64
Total Primary Energy Consumption         float64
dtype: object

In [76]:
world_data['Date'] = pd.to_datetime(world_data['Date'])


In [77]:
world_data.dtypes

Date                                     datetime64[ns]
Total Fossil Fuels Production                   float64
Nuclear Electric Power Production               float64
Total Renewable Energy Production               float64
Total Primary Energy Production                 float64
Primary Energy Imports                          float64
Primary Energy Exports                          float64
Primary Energy Net Imports                      float64
Primary Energy Stock Change and Other           float64
Total Fossil Fuels Consumption                  float64
Nuclear Electric Power Consumption              float64
Total Renewable Energy Consumption              float64
Total Primary Energy Consumption                float64
dtype: object

In [78]:
world_data['Year'] = world_data['Date'].dt.year
world_data['Month'] = world_data['Date'].dt.month
world_data['Day'] = world_data['Date'].dt.day

In [79]:
world_data.dtypes

Date                                     datetime64[ns]
Total Fossil Fuels Production                   float64
Nuclear Electric Power Production               float64
Total Renewable Energy Production               float64
Total Primary Energy Production                 float64
Primary Energy Imports                          float64
Primary Energy Exports                          float64
Primary Energy Net Imports                      float64
Primary Energy Stock Change and Other           float64
Total Fossil Fuels Consumption                  float64
Nuclear Electric Power Consumption              float64
Total Renewable Energy Consumption              float64
Total Primary Energy Consumption                float64
Year                                              int64
Month                                             int64
Day                                               int64
dtype: object
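The dates shown in the `head()` output all fall on the last day of a month. If that holds for the whole series, the `Day` column adds little information beyond `Month`. The sketch below checks this with the `dt.is_month_end` accessor on the five dates displayed above; running the same check on the full `world_data['Date']` column is left as an assumption, since only the first five rows are visible here.

```python
import pandas as pd

# The five dates shown in the world_data.head() output.
dates = pd.to_datetime(pd.Series([
    "1973-01-31", "1973-02-28", "1973-03-31", "1973-04-30", "1973-05-31",
]))

# is_month_end is True when a timestamp is the last calendar day of its month.
month_end_flags = dates.dt.is_month_end
all_month_end = bool(month_end_flags.all())

print(month_end_flags.tolist())
print("Every date is a month end:", all_month_end)
```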

In [80]:
world_data.head()

Unnamed: 0,Date,Total Fossil Fuels Production,Nuclear Electric Power Production,Total Renewable Energy Production,Total Primary Energy Production,Primary Energy Imports,Primary Energy Exports,Primary Energy Net Imports,Primary Energy Stock Change and Other,Total Fossil Fuels Consumption,Nuclear Electric Power Consumption,Total Renewable Energy Consumption,Total Primary Energy Consumption,Year,Month,Day
0,1973-01-31,4.932632,0.068103,0.403981,5.404715,1.17308,0.125781,1.047299,0.771858,6.747651,0.068103,0.403981,7.223873,1973,1,31
1,1973-02-28,4.729582,0.064634,0.3609,5.155115,1.168005,0.120883,1.047122,0.390129,6.163095,0.064634,0.3609,6.592366,1973,2,28
2,1973-03-31,4.946902,0.072494,0.400161,5.419556,1.309473,0.13995,1.169523,-0.06764,6.044647,0.072494,0.400161,6.521439,1973,3,31
3,1973-04-30,4.716271,0.06407,0.38047,5.160812,1.085169,0.194185,0.890984,-0.110067,5.493184,0.06407,0.38047,5.941729,1973,4,30
4,1973-05-31,4.956995,0.062111,0.392141,5.411246,1.162804,0.196775,0.966029,-0.305335,5.613551,0.062111,0.392141,6.07194,1973,5,31


In [81]:
world_data = world_data.reindex(columns=['Date', 'Year', 'Month', 'Day','Total Fossil Fuels Production',
       'Nuclear Electric Power Production',
       'Total Renewable Energy Production', 'Total Primary Energy Production',
       'Primary Energy Imports', 'Primary Energy Exports',
       'Primary Energy Net Imports', 'Primary Energy Stock Change and Other',
       'Total Fossil Fuels Consumption', 'Nuclear Electric Power Consumption',
       'Total Renewable Energy Consumption',
       'Total Primary Energy Consumption'])


In [82]:
world_data.head()

Unnamed: 0,Date,Year,Month,Day,Total Fossil Fuels Production,Nuclear Electric Power Production,Total Renewable Energy Production,Total Primary Energy Production,Primary Energy Imports,Primary Energy Exports,Primary Energy Net Imports,Primary Energy Stock Change and Other,Total Fossil Fuels Consumption,Nuclear Electric Power Consumption,Total Renewable Energy Consumption,Total Primary Energy Consumption
0,1973-01-31,1973,1,31,4.932632,0.068103,0.403981,5.404715,1.17308,0.125781,1.047299,0.771858,6.747651,0.068103,0.403981,7.223873
1,1973-02-28,1973,2,28,4.729582,0.064634,0.3609,5.155115,1.168005,0.120883,1.047122,0.390129,6.163095,0.064634,0.3609,6.592366
2,1973-03-31,1973,3,31,4.946902,0.072494,0.400161,5.419556,1.309473,0.13995,1.169523,-0.06764,6.044647,0.072494,0.400161,6.521439
3,1973-04-30,1973,4,30,4.716271,0.06407,0.38047,5.160812,1.085169,0.194185,0.890984,-0.110067,5.493184,0.06407,0.38047,5.941729
4,1973-05-31,1973,5,31,4.956995,0.062111,0.392141,5.411246,1.162804,0.196775,0.966029,-0.305335,5.613551,0.062111,0.392141,6.07194
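One quick sanity check on this table is whether the three production components add up to the reported total. The sketch below assumes that `Total Primary Energy Production` is meant to equal fossil fuels plus nuclear plus renewables; that relationship is an assumption, not something stated in the data file. It uses the first three rows shown above and a small tolerance to allow for rounding in the published figures.

```python
import numpy as np
import pandas as pd

# First three rows of the production columns, copied from the table above.
rows = pd.DataFrame({
    "Total Fossil Fuels Production": [4.932632, 4.729582, 4.946902],
    "Nuclear Electric Power Production": [0.068103, 0.064634, 0.072494],
    "Total Renewable Energy Production": [0.403981, 0.360900, 0.400161],
    "Total Primary Energy Production": [5.404715, 5.155115, 5.419556],
})

components = [
    "Total Fossil Fuels Production",
    "Nuclear Electric Power Production",
    "Total Renewable Energy Production",
]

# Sum the components row by row and compare with the reported total.
component_sum = rows[components].sum(axis=1)
difference = (component_sum - rows["Total Primary Energy Production"]).abs()

# The tolerance absorbs rounding to six decimal places in the source data.
matches = np.isclose(
    component_sum, rows["Total Primary Energy Production"], atol=1e-5
)

print(difference.max())
print("All rows consistent:", bool(matches.all()))
```

Rows where `matches` is False would be worth inspecting before modelling, because they would point to either a different definition of the total or a data entry problem.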


### 1.6.2 Explore The Production Data<a id='1.6.2_Explore_The_production_Data'></a>


In [83]:
prod_data.head()

Unnamed: 0,Continent,Country,1980,1981,1982,1983,1984,1985,1986,1987,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Total_production
0,Africa,Algeria,2.803017355,3.03753686,3.224933779,3.606400483,3.859176003,3.907465538,3.968324779,4.218086334,...,6.510417691,6.561749872,6.696192523,6.910709728,6.830371007,6.755066467,6.474600013,5.972769931,6.656881917,238.118858
1,Africa,Angola,0.335098042,0.291293691,0.276262403,0.395407667,0.462550452,0.511269369,0.621432995,0.789428936,...,3.932204112,3.802528446,3.949946917,3.93529724,3.899020984,3.814151292,3.526638469,3.159336901,2.824017641,90.852722
2,Africa,Benin,in,0.0,0.0,0.00896,0.01573,0.01793,0.01793,0.01569,...,1.9082e-05,1.902e-05,5.5914e-05,9.232e-05,5.5278e-05,4.64e-05,7.18e-05,6.19e-05,7.07e-05,0.178363
3,Africa,Botswana,0.008262057,0.008484744,0.009241913,0.008796519,0.00875198,0.009731845,0.010912138,0.010845329,...,0.033334507,0.038240847,0.046562169,0.0419666,0.049654188,0.048447799,0.047231633,0.041991759,0.044392707,0.952582
4,Africa,Burkina Faso,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.001106756,0.00096051,0.000987814,0.00143096,0.001455654,0.002066608,0.00190567,0.001859876,0.002054882,0.033589


In [84]:
prod_data.shape

(229, 45)

In [85]:
prod_data.describe()

Unnamed: 0,Total_production
count,229.0
mean,79.280368
std,308.217753
min,0.0
25%,0.039002
50%,2.2729
75%,27.855034
max,3149.020415


In [86]:
prod_data.columns

Index(['Continent', 'Country', '1980', '1981', '1982', '1983', '1984', '1985',
       '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994',
       '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003',
       '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
       'Total_production'],
      dtype='object')

In [87]:
prod_data.isnull().sum()

Continent           0
Country             0
1980                0
1981                0
1982                0
1983                0
1984                0
1985                0
1986                0
1987                0
1988                0
1989                0
1990                0
1991                0
1992                0
1993                0
1994                0
1995                0
1996                0
1997                0
1998                0
1999                0
2000                0
2001                0
2002                0
2003                0
2004                0
2005                0
2006                0
2007                0
2008                0
2009                0
2010                0
2011                0
2012                0
2013                0
2014                0
2015                0
2016                0
2017                0
2018                0
2019                0
2020                0
2021                0
Total_production    0
dtype: int64

In [88]:
prod_data.dtypes

Continent            object
Country              object
1980                 object
1981                 object
1982                 object
1983                 object
1984                 object
1985                 object
1986                 object
1987                 object
1988                 object
1989                 object
1990                 object
1991                 object
1992                 object
1993                 object
1994                 object
1995                 object
1996                 object
1997                 object
1998                 object
1999                 object
2000                 object
2001                 object
2002                 object
2003                 object
2004                 object
2005                 object
2006                 object
2007                 object
2008                 object
2009                 object
2010                 object
2011                 object
2012                 object
2013                
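The year columns are stored as `object` rather than as numbers, and the `head()` output above shows one likely cause: the 1980 value for Benin is the string `in`. The sketch below is one possible way to handle this, assuming that non-numeric entries should become missing values. It converts the year columns with `pd.to_numeric(errors="coerce")` and lists the rows where conversion failed. The frame `sample` is a small stand-in built from values displayed above.

```python
import pandas as pd

# Stand-in for prod_data, using values displayed in the head() output.
# Year values are strings here, mirroring the object dtype reported above.
sample = pd.DataFrame({
    "Country": ["Algeria", "Benin", "Burkina Faso"],
    "1980": ["2.803017355", "in", "0"],
    "1981": ["3.03753686", "0.0", "0.0"],
})

# Year columns are the ones whose names consist only of digits.
year_cols = [col for col in sample.columns if col.isdigit()]

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
converted = sample[year_cols].apply(pd.to_numeric, errors="coerce")

# Entries that were present in the raw data but are NaN after conversion
# are the non-numeric strings.
bad_mask = converted.isna() & sample[year_cols].notna()
bad_rows = sample.loc[bad_mask.any(axis=1), ["Country"] + year_cols]

print(converted.dtypes)
print(bad_rows)
```

Listing `bad_rows` before overwriting the original columns keeps a record of what was coerced, so the decision to treat those entries as missing can be reviewed later.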

In [99]:
prod_data.Country.value_counts().sum()

229

In [100]:
prod_data.Continent.value_counts().sum()

229
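Note that `value_counts().sum()` returns the number of non-null entries, not the number of distinct values, so it matches the row count even if a country appears twice. The sketch below shows the difference on a small series and uses `nunique` and `duplicated` to check for repeats. The repeated `Benin` entry is made up for illustration and is not taken from the data.

```python
import pandas as pd

# Country names from the table above; the second "Benin" is a made-up
# duplicate added only to show how the two counts diverge.
countries = pd.Series(["Algeria", "Angola", "Benin", "Benin"], name="Country")

# Number of non-null entries (what value_counts().sum() reports).
non_null_count = int(countries.value_counts().sum())

# Number of distinct values.
distinct_count = int(countries.nunique())

# keep=False marks every occurrence of a repeated value, not just the later ones.
duplicates = countries[countries.duplicated(keep=False)]

print(non_null_count, distinct_count)
print(duplicates.tolist())
```

If `nunique()` equals the number of rows in the real table, each row corresponds to a distinct country name.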

In [103]:
prod_data.Continent.unique()


array(['    Africa', '    Eurasia', '    Europe', '    Asia & Oceania',
       '    Middle East', '    North America',
       '    Central & South America'], dtype=object)
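The continent labels in the production data carry four leading spaces, which would prevent them from matching the same labels in another table. A minimal sketch of a fix, assuming the padding carries no meaning, is to strip it with `str.strip`. The series below copies three of the labels from the output above.

```python
import pandas as pd

# Labels copied from the unique() output above, including the leading spaces.
raw_continents = pd.Series(["    Africa", "    Eurasia", "    Asia & Oceania"])

# str.strip removes leading and trailing whitespace but leaves
# the spaces inside a label such as "Asia & Oceania" untouched.
clean_continents = raw_continents.str.strip()

print(clean_continents.tolist())
```

On the real table, the equivalent step would be to assign `prod_data['Continent'].str.strip()` back to the `Continent` column. The same treatment may be worth applying to `Country` if its values turn out to be padded as well.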

### 1.6.3 Explore The Consumption Data<a id='1.6.3_Explore_The_consumption_Data'></a>


In [90]:
consum_data.head()

Unnamed: 0,Continent,Country,1980,1981,1982,1983,1984,1985,1986,1987,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Total_Consumption
0,Africa,Algeria,0.780695167,0.663391323,0.952188116,1.070561843,1.130786713,1.046418247,1.066300962,1.138318654,...,2.037584643,2.299888436,2.35214754,2.33598238,2.365986905,2.522119152,2.624873096,2.480810774,2.581123314,64.746181
1,Africa,Angola,0.058366148,0.057863688,0.062007899,0.066297007,0.059894701,0.073766696,0.072853256,0.075135479,...,0.347761278,0.382763726,0.39921871,0.364251517,0.347458949,0.359226946,0.404865374,0.402317536,0.328628418,7.403765
2,Africa,Benin,0.006526525,0.006574612,0.004640112,0.006731564,0.006673241,0.007698814,0.007501055,0.005942365,...,0.066471109,0.070275151,0.077723341,0.094785008,0.10051639,0.111506331,0.101325936,0.108653617,0.110194887,1.660207
3,Africa,Botswana,0.014748762,0.015462338,0.014427219,0.014460824,0.014444631,0.018865991,0.020060806,0.021144246,...,0.065375734,0.065661654,0.080008274,0.088008537,0.080969008,0.080276228,0.080755726,0.076835026,0.082892488,2.021542
4,Africa,Burkina Faso,0.0060363,0.006019807,0.006019807,0.006019807,0.0060363,0.005819147,0.006886218,0.007191413,...,0.0451141,0.042012755,0.053396002,0.051306252,0.05946458,0.066130673,0.067697934,0.064550875,0.068226015,0.988877


In [91]:
consum_data.shape

(230, 45)

In [92]:
consum_data.describe()

Unnamed: 0,Total_Consumption
count,230.0
mean,79.129864
std,337.588195
min,0.0
25%,0.747999
50%,5.103923
75%,38.877648
max,3830.095107


In [93]:
consum_data.columns

Index(['Continent', 'Country', '1980', '1981', '1982', '1983', '1984', '1985',
       '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994',
       '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003',
       '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
       'Total_Consumption'],
      dtype='object')

In [94]:
consum_data.isnull().sum()

Continent            0
Country              0
1980                 0
1981                 0
1982                 0
1983                 0
1984                 0
1985                 0
1986                 0
1987                 0
1988                 0
1989                 0
1990                 0
1991                 0
1992                 0
1993                 0
1994                 0
1995                 0
1996                 0
1997                 0
1998                 0
1999                 0
2000                 0
2001                 0
2002                 0
2003                 0
2004                 0
2005                 0
2006                 0
2007                 0
2008                 0
2009                 0
2010                 0
2011                 0
2012                 0
2013                 0
2014                 0
2015                 0
2016                 0
2017                 0
2018                 0
2019                 0
2020                 1
2021       
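The counts above show one missing value in the 2020 column. The sketch below is one way to find which row it belongs to, by filtering on `isnull().any(axis=1)`. The country names and values in `sample` are placeholders, because the output above does not reveal which country is affected.

```python
import numpy as np
import pandas as pd

# Placeholder rows: the names and numbers are hypothetical and only
# reproduce the pattern of a single missing entry in the 2020 column.
sample = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "2019": [1.0, 2.0, 3.0],
    "2020": [1.1, np.nan, 3.1],
})

# Keep only the columns that contain at least one missing value.
missing_by_column = sample.isnull().sum()
missing_by_column = missing_by_column[missing_by_column > 0]

# Keep only the rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_by_column)
print(rows_with_missing)
```

Knowing the affected row makes it possible to decide whether to drop it, fill the gap from neighbouring years, or leave it as missing.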

In [95]:
consum_data.dtypes

Continent             object
Country               object
1980                  object
1981                  object
1982                  object
1983                  object
1984                  object
1985                  object
1986                  object
1987                  object
1988                  object
1989                  object
1990                  object
1991                  object
1992                  object
1993                  object
1994                  object
1995                  object
1996                  object
1997                  object
1998                  object
1999                  object
2000                  object
2001                  object
2002                  object
2003                  object
2004                  object
2005                  object
2006                  object
2007                  object
2008                  object
2009                  object
2010                  object
2011                  object
2012          

In [98]:
consum_data.Country.value_counts().sum()

230

In [101]:
consum_data.Continent.value_counts().sum()


230

In [102]:
consum_data.Continent.unique()


array(['Africa', 'Asia & Oceania', 'Middle East',
       'Central & South America', 'North America', 'Europe', 'Eurasia'],
      dtype=object)
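The consumption table has 230 rows and the production table has 229, so at least one country appears in only one of them. The sketch below shows one way to find such rows, using an outer merge with `indicator=True`. The frames are small stand-ins, and `Example Land` is a made-up name, since the outputs above do not show which country differs.

```python
import pandas as pd

# Stand-ins for the Country columns of the two tables.
# "Example Land" is hypothetical and exists only to create a mismatch.
prod_countries = pd.DataFrame({"Country": ["Algeria", "Angola", "Benin"]})
consum_countries = pd.DataFrame(
    {"Country": ["Algeria", "Angola", "Benin", "Example Land"]}
)

# An outer merge keeps every country from both sides; indicator=True adds
# a "_merge" column with the values left_only, right_only, or both.
merged = prod_countries.merge(
    consum_countries, on="Country", how="outer", indicator=True
)

# Rows not marked "both" are present in only one of the tables.
unmatched = merged[merged["_merge"] != "both"]

print(unmatched)
```

Stray whitespace in the key column would make identical names fail to match in this comparison, so it is safer to strip the labels in both tables first.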