# Dataset description

This notebook describes the dataset that I will use for my project.

## 1. How were the data obtained? (Question 4 of the Homework)

### Data sources

This dataset was obtained from three different sources:

- [World Bank](https://www.worldbank.org/en/home)
- [International Energy Agency](https://www.iea.org/data-and-statistics/data-product/world-energy-investment-2021-datafile)
- [Our World in Data](https://ourworldindata.org/renewable-energy)

### Data collection

#### The World Bank
The *World Bank* compiles its data from a range of reliable sources and surveys. It applies the following methodological conventions:

- **Data Consistency**: Data from different sources may have inconsistencies due to differences in timing and reporting practices. The World Bank attempts to present data consistently in terms of definition, timing, and methods in its publications.
- **Change in Terminology**: The *World Bank* has adopted new terminology in line with the 1993 System of National Accounts (SNA). This includes changes in the names of economic indicators, such as GNP becoming GNI, GNP per capita becoming GNI per capita, and others.
- **Aggregation Rules**: Aggregates in *World Bank* data are based on regional and income classifications of economies. They are approximations of unknown totals or average values due to missing data.
- **Growth Rates**: Growth rates are calculated as annual averages and expressed as percentages, using the least-squares, exponential endpoint, or geometric endpoint method.
- **Alternative Conversion Factors**: In cases where official exchange rates are deemed unreliable, alternative conversion factors are used to provide a more accurate representation of currency values for specific countries.
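
To make the growth-rate methods listed above more concrete, here is a minimal sketch of how the three averages could be computed for an annual series. The function names and the sample series are hypothetical, and the formulas are the usual textbook definitions (slope of a log-linear fit, log-ratio of the endpoints, compound rate between the endpoints), which I assume match the World Bank's conventions.

```python
import numpy as np

def least_squares_growth(values):
    """Average growth rate from a least-squares fit of log(values) on time."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    slope, _intercept = np.polyfit(t, np.log(values), 1)
    return np.exp(slope) - 1

def exponential_endpoint_growth(values):
    """Continuous growth rate between the first and last observation."""
    values = np.asarray(values, dtype=float)
    n_periods = len(values) - 1
    return np.log(values[-1] / values[0]) / n_periods

def geometric_endpoint_growth(values):
    """Compound (discrete) growth rate between the first and last observation."""
    values = np.asarray(values, dtype=float)
    n_periods = len(values) - 1
    return (values[-1] / values[0]) ** (1 / n_periods) - 1

# Hypothetical series that grows by exactly 5% per year over five years.
series = [100 * 1.05 ** k for k in range(6)]

for method in (least_squares_growth,
               exponential_endpoint_growth,
               geometric_endpoint_growth):
    print(f"{method.__name__}: {100 * method(series):.2f}%")
```

The two endpoint methods only use the first and last values, whereas the least-squares method uses every observation, so it is less sensitive to an unusual first or last year.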

#### The International Energy Agency

The *International Energy Agency* provides energy datasets across different sectors (Industry, Residential, Services and Transport). It collects data from administrative sources, measurements and surveys. A list of its data collection practices can be found [here](https://www.iea.org/data-and-statistics/data-tools/national-data-collection-practices?).

#### Our World in Data
*Our World in Data* gathers the most dependable and informative datasets on a given subject. Its sources are the following:

- Specialized institutions, such as the Peace Research Institute Oslo (PRIO).
- Research articles.
- International organizations and statistical agencies, including the OECD, the World Bank, and UN institutions.
- Official data obtained from government sources.

## 2. Data types (Question 1 of the Homework)

In [3]:
import pandas as pd

# Read the dataset
df = pd.read_csv("../data/global-data-on-sustainable-energy.csv")
df

Unnamed: 0,Entity,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),...,Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Density\n(P/Km2),Land Area(Km2),Latitude,Longitude
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,...,302.59482,1.64,760.000000,,,,60,652230.0,33.939110,67.709953
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.60,0.09,0.0,0.50,...,236.89185,1.74,730.000000,,,,60,652230.0,33.939110,67.709953
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,...,210.86215,1.40,1029.999971,,,179.426579,60,652230.0,33.939110,67.709953
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,...,229.96822,1.40,1220.000029,,8.832278,190.683814,60,652230.0,33.939110,67.709953
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,...,204.23125,1.20,1029.999971,,1.414118,211.382074,60,652230.0,33.939110,67.709953
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3644,Zimbabwe,2016,42.561730,29.8,62.88,30000.0,81.90,3.50,0.0,3.32,...,3227.68020,10.00,11020.000460,,0.755869,1464.588957,38,390757.0,-19.015438,29.154857
3645,Zimbabwe,2017,44.178635,29.8,62.33,5570000.0,82.46,3.05,0.0,4.30,...,3068.01150,9.51,10340.000150,,4.709492,1235.189032,38,390757.0,-19.015438,29.154857
3646,Zimbabwe,2018,45.572647,29.9,82.53,10000.0,80.23,3.73,0.0,5.46,...,3441.98580,9.83,12380.000110,,4.824211,1254.642265,38,390757.0,-19.015438,29.154857
3647,Zimbabwe,2019,46.781475,30.1,81.40,250000.0,81.50,3.66,0.0,4.58,...,3003.65530,10.47,11760.000230,,-6.144236,1316.740657,38,390757.0,-19.015438,29.154857


Our dataset contains 3649 rows and 21 features. Let's take a closer look at the data type of each feature.

In [6]:
df.dtypes

Entity                                                               object
Year                                                                  int64
Access to electricity (% of population)                             float64
Access to clean fuels for cooking                                   float64
Renewable-electricity-generating-capacity-per-capita                float64
Financial flows to developing countries (US $)                      float64
Renewable energy share in the total final energy consumption (%)    float64
Electricity from fossil fuels (TWh)                                 float64
Electricity from nuclear (TWh)                                      float64
Electricity from renewables (TWh)                                   float64
Low-carbon electricity (% electricity)                              float64
Primary energy consumption per capita (kWh/person)                  float64
Energy intensity level of primary energy (MJ/$2017 PPP GDP)         float64
Value_co2_em

This dataset mainly contains time-series data. Most of the features are measurements taken at different points in time (one per year), grouped by country. The only exception is the `Entity` feature, which is categorical. We can also see that the `Density` feature has an unexpected data type: `object`. A density should normally be stored as a number (an int or a float). Let's take a look at the values of this feature.

In [8]:
df['Density\\n(P/Km2)'].unique()

array(['60', '105', '18', '26', '223', '17', '104', '590', '3', '109',
       '123', '41', '2,239', '1,265', '668', '47', '383', '108', '1281',
       '20', '64', '4', '25', '76', '463', '95', '56', '274', '8', '13',
       '153', '46', '467', '100', '73', '106', '131', '136', '137', '43',
       '96', '225', '71', '103', '313', '50', '35', '31', '67', '115',
       '49', '119', nan, '9', '239', '57', '240', '81', '331', '167',
       '53', '70', '414', '89', '107', '464', '151', '93', '400', '206',
       '273', '347', '7', '94', '147', '34', '30', '667', '242', '48',
       '203', '99', '1,802', '1,380', '5', '626', '66', '2', '83', '40',
       '541', '508', '16', '55', '19', '226', '15', '287', '58', '368',
       '124', '111', '248', '84', '525', '205', '301', '284', '87', '214',
       '8,358', '114', '341', '219', '68', '152', '110', '393', '229',
       '75', '118', '281', '36', '79', '38'], dtype=object)

The `Density` feature contains numbers stored as strings, some of them with a comma as thousands separator (for example `'2,239'`). Numeric operations would fail on these values, so we will have to convert this feature to a numeric type before analysis and modeling.
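
One possible way to do this conversion is sketched below. It runs on a small hypothetical sample (`density_raw`) that mimics the strings seen above rather than on the real column, and it assumes that the comma is the only non-numeric character to strip.

```python
import pandas as pd

# Hypothetical sample mimicking the strings found in the Density column.
density_raw = pd.Series(["60", "2,239", "1281", None, "8,358"])

# Strip the thousands separator, then parse the strings as numbers.
# errors="coerce" turns anything that still cannot be parsed into NaN.
density_numeric = pd.to_numeric(
    density_raw.str.replace(",", "", regex=False),
    errors="coerce",
)

print(density_numeric)
print(density_numeric.dtype)
```

The same two steps could be applied to `df['Density\\n(P/Km2)']`. Missing values stay missing, so the result is a float column rather than an integer one.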

## 3. Features (Question 2 of the Homework)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3649 entries, 0 to 3648
Data columns (total 21 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Entity                                                            3649 non-null   object 
 1   Year                                                              3649 non-null   int64  
 2   Access to electricity (% of population)                           3639 non-null   float64
 3   Access to clean fuels for cooking                                 3480 non-null   float64
 4   Renewable-electricity-generating-capacity-per-capita              2718 non-null   float64
 5   Financial flows to developing countries (US $)                    1560 non-null   float64
 6   Renewable energy share in the total final energy consumption (%)  3455 non-null   float64
 7   Electricity from fossil fuels (TW

We can separate the features into 3 categories:

- Potential factors: these features are independent variables.
- Cofactors: these features are variables that may influence the response variable.
- Response: these features are the ones we might want to predict.

Depending on the question we want to answer, we will use different features. For example, if we want to predict the carbon emissions of a country, we will use the `Value_co2_emissions_kt_by_country` feature as the response variable and the `Electricity from fossil fuels (TWh)` feature as a cofactor. If we want to classify the access to renewable energy, the `Access to clean fuels for cooking` feature will be the response variable and the `Renewable energy share in the total final energy consumption (%)` feature will be a cofactor.
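
As a sketch of the first example, the snippet below separates a response from its cofactor. The `toy` frame is only an illustration: it reuses the two relevant values from the first five Afghanistan rows displayed earlier, and the variable names (`response`, `cofactors`, `X`, `y`) are my own choice.

```python
import pandas as pd

# Toy frame: two columns taken from the first rows shown in the output above.
toy = pd.DataFrame({
    "Electricity from fossil fuels (TWh)": [0.16, 0.09, 0.13, 0.31, 0.33],
    "Value_co2_emissions_kt_by_country": [
        760.0, 730.0, 1029.999971, 1220.000029, 1029.999971,
    ],
})

response = "Value_co2_emissions_kt_by_country"
cofactors = ["Electricity from fossil fuels (TWh)"]

# Keep only the rows where the response and every cofactor are known.
subset = toy.dropna(subset=[response] + cofactors)

X = subset[cofactors]  # explanatory variables, one column per cofactor
y = subset[response]   # variable we want to predict

print(X.shape, y.shape)
```

Changing the question only means changing `response` and `cofactors`; the rest of the selection stays the same.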

It is therefore hard to assign each feature to a single category, because its role depends on the question being asked. We can still summarize the roles each feature could plausibly play in a table:

| Features | Potential Factors | Cofactors | Response |
| ----------------- | :---------: | :---------: | :---------: |
| Year                                                              | X | | |
| Access to electricity (% of population)                           | X | X | |
| Access to clean fuels for cooking                                 | X | X | X |
| Renewable-electricity-generating-capacity-per-capita              | X | X | |
| Financial flows to developing countries (US $)                    | X | X | X |
| Renewable energy share in the total final energy consumption (%)  | X | X | X |
| Electricity from fossil fuels (TWh)                               | X | X | |
| Electricity from nuclear (TWh)                                    | X | X | |
| Electricity from renewables (TWh)                                 | X | X | |
| Low-carbon electricity (% electricity)                            | X | X | X |
| Primary energy consumption per capita (kWh/person)                | X | X | |
| Energy intensity level of primary energy (MJ/$2017 PPP GDP)       | X | X | |
| Renewables (% equivalent primary energy)                          | X | X | X |
| GDP growth                                                        | X | X | |
| GDP per capita                                                    | X | X | |
| Density (P/km2)                                                   | X | X | |
| Land Area (Km2)                                                   | X | | |
| Latitude                                                          | X | | |
| Longitude                                                         | X | | |
| Value_co2_emissions_kt_by_country                                 | | | X |