# Dataset and preprocessing

Each team member first searched for candidate datasets rich enough to support several analytical perspectives. In our first meeting we compared these candidates on data quality, the correlations they could reveal, and the insights they might support. We selected two complementary datasets on global energy production and sustainability indicators:

- **Monthly Energy Production Data** (`data.csv`): Contains electricity production data per country, month, and energy source from January 2010 onward. It includes the columns:
  ```
  ['COUNTRY', 'CODE_TIME', 'TIME', 'YEAR', 'MONTH', 'MONTH_NAME', 
   'PRODUCT', 'VALUE', 'DISPLAY_ORDER', 'yearToDate', 'previousYearToDate', 'share']
  ```
  This file originally had **181,915 rows and 12 columns (~20 MB)**.

- **Global Sustainability Indicators** (`global-data-on-sustainable-energy.csv`): Provides annual indicators per country over multiple decades, including:
  - `Access to electricity (% of population)`
  - `Renewable energy share in the total final energy consumption (%)`
  - `GDP per Capita` (USD)
  - Plus other metrics such as `Access to clean fuels for cooking`
 
  This file originally had **3,649 rows and 21 columns**.

By combining a high-frequency production series with annual socio-economic metrics, we can analyze absolute energy production trends, shifts to renewable sources, economic correlations, and electricity accessibility.
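
The row and column counts quoted above are easy to re-check right after loading. The following is a minimal sketch, not the notebook's actual code: the helper name `describe_shape` is our own, the file paths in the trailing comments are assumed to be relative to the working directory, and the inline sample uses made-up values so the snippet runs without the real files.

```python
import io

import pandas as pd


def describe_shape(df: pd.DataFrame, name: str) -> str:
    """Return a one-line summary of a frame's dimensions and memory use."""
    rows, cols = df.shape
    mem_mb = df.memory_usage(deep=True).sum() / 1e6
    return f"{name}: {rows:,} rows x {cols} columns (~{mem_mb:.1f} MB in memory)"


# Tiny inline stand-in for data.csv (values are invented for illustration).
sample_csv = io.StringIO(
    "COUNTRY,YEAR,MONTH,PRODUCT,VALUE\n"
    "Australia,2010,1,Hydro,990.7\n"
    "Australia,2010,1,Wind,395.4\n"
)
sample = pd.read_csv(sample_csv)
print(describe_shape(sample, "sample"))

# With the real files (paths assumed), the same helper would be called as:
# production = pd.read_csv("data.csv")
# indicators = pd.read_csv("global-data-on-sustainable-energy.csv")
# print(describe_shape(production, "data.csv"))
# print(describe_shape(indicators, "global-data-on-sustainable-energy.csv"))
```

Note that the in-memory size reported by pandas will generally differ from the on-disk size of the CSV.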

---

# Cleaning

Both datasets required thorough cleaning due to structural differences, inconsistent naming conventions, and varying granularity. We addressed this in two distinct phases:

- **Phase 1: Restructuring and Merging**  
  - Harmonized column names for consistency (e.g., `Entity` → `Country`, `Value` → `VALUE`).  
  - Created a unified datetime field (`DATE`) from `YEAR` + `MONTH`.  
  - Excluded irrelevant data: rows for regional aggregates (`World`, `OECD`), metadata fields, and unused indicator columns.

- **Phase 2: Normalization**  
  - Standardized categorical entries: consolidated energy source labels in `PRODUCT` (Total combustible fuels, Hydro, Wind, Solar) and country names.  
  - **Retained** the existing `share` column from `data.csv`, which already represents each source’s monthly fraction of total production.  
  - Added `YoY_Growth_Renewable`, the year-over-year change in the renewable share percentage, for later analyses.
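
The two phases can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's exact code: the function names, the `AGGREGATES` set, and the `Year` column of the annual file are assumptions, and the demo frames contain invented values.

```python
import pandas as pd

# Assumed labels for aggregate rows; the real data may contain more.
AGGREGATES = {"World", "OECD"}
SHARE = "Renewable energy share in the total final energy consumption (%)"


def clean_production(df: pd.DataFrame) -> pd.DataFrame:
    """Phase 1 (monthly file): drop aggregate rows and build a DATE column."""
    out = df[~df["COUNTRY"].isin(AGGREGATES)].copy()
    # pd.to_datetime can assemble timestamps from year/month/day columns;
    # every month is pinned to its first day.
    parts = out[["YEAR", "MONTH"]].rename(columns={"YEAR": "year", "MONTH": "month"})
    out["DATE"] = pd.to_datetime(parts.assign(day=1))
    return out


def add_yoy_growth(df: pd.DataFrame, share_col: str = SHARE) -> pd.DataFrame:
    """Phase 2 (annual file): year-over-year change in percentage points."""
    out = df.rename(columns={"Entity": "Country"}).sort_values(["Country", "Year"])
    # diff() compares each row with the previous row of the same country,
    # so a missing year would silently yield a multi-year difference.
    out["YoY_Growth_Renewable"] = out.groupby("Country")[share_col].diff()
    return out


# Invented demo data, only to show the shape of the result.
production = pd.DataFrame({
    "COUNTRY": ["France", "France", "OECD"],
    "YEAR": [2010, 2010, 2010],
    "MONTH": [1, 2, 1],
    "PRODUCT": ["Hydro", "Hydro", "Hydro"],
    "VALUE": [5.0, 6.0, 100.0],
})
indicators = pd.DataFrame({
    "Entity": ["Chile", "Chile", "Chile"],
    "Year": [2015, 2016, 2017],
    SHARE: [25.0, 26.5, 26.0],
})

cleaned = clean_production(production)
with_yoy = add_yoy_growth(indicators)
print(cleaned[["COUNTRY", "DATE", "VALUE"]])
print(with_yoy[["Country", "Year", "YoY_Growth_Renewable"]])
```

The first year of each country has no predecessor, so its `YoY_Growth_Renewable` value is missing (`NaN`) and needs to be handled explicitly in any later visualization.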

This notebook does not export Parquet files, but we recommend saving future cleaned data in Parquet format (`.pq`) with gzip compression to reduce file size and preserve datatypes.
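
A possible export step is sketched below. It assumes that a Parquet engine (`pyarrow` or `fastparquet`) is installed alongside pandas; the helper names and the output file name in the usage comment are hypothetical.

```python
import pandas as pd


def save_parquet(df: pd.DataFrame, path: str) -> None:
    """Write a frame to gzip-compressed Parquet (needs pyarrow or fastparquet)."""
    df.to_parquet(path, compression="gzip", index=False)


def load_parquet(path: str) -> pd.DataFrame:
    """Read the frame back; column dtypes such as datetimes are restored."""
    return pd.read_parquet(path)


# Hypothetical usage with a cleaned monthly frame named `cleaned`:
# save_parquet(cleaned, "monthly_production.pq")
# cleaned = load_parquet("monthly_production.pq")
```

Unlike CSV, a Parquet file stores the column types, so a `DATE` column comes back as a datetime without re-parsing.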

After cleaning, the datasets are streamlined for analysis:
- **Monthly production:** 7 core columns, ~150,000 rows  
- **Sustainability indicators:** 5 core columns, ~2,500 rows

---

# Variable descriptions

After cleaning, our key variables fall into these categories:

- **Continuous / Ratio variables:**  
  - `VALUE` (GWh): Monthly electricity production  
  - `share` (0–1): Fraction of total monthly production  
  - `Renewable energy share in the total final energy consumption (%)`  
  - `Access to electricity (% of population)`  
  - `GDP per Capita` (USD)  
  - `YoY_Growth_Renewable` (percentage points): Annual change in renewable share

- **Discrete / Nominal variables:**  
  - `COUNTRY` / `Country`: Name of the country  
  - `PRODUCT`: Energy source category (Total combustible fuels, Hydro, Wind, Solar)

- **Discrete / Interval variables:**  
  - `DATE`: Monthly or annual timestamp

Currently utilized variables in our visualizations include:  
`DATE`, `PRODUCT`, `VALUE`, `share`, `COUNTRY`, `Access to electricity (% of population)`, `Renewable energy share in the total final energy consumption (%)`, `GDP per Capita`, and `YoY_Growth_Renewable`.

---

# Aggregations

Each visualization relies on its own aggregation. Below, `% Renewable Share` is shorthand for the column `Renewable energy share in the total final energy consumption (%)`:

1. **Monthly energy production by source** (cell In [2]):  
   - Summed monthly `VALUE` across all countries, grouped by `PRODUCT` and `DATE` to show absolute GWh trends.

2. **Top-10 countries: Fossil vs Renewable Energy Production** (cell In [6]):  
   - Aggregated annual `VALUE` per country into two categories (combustible vs renewables) and selected the top-10 OECD countries by total production.

3. **Renewable Energy Share & Access to Electricity over time** (cell In [8]):  
   - Computed annual means of `% Renewable Share` and `Access to electricity (% of population)` for a representative country set (e.g., G20) to illustrate the interplay between energy transition and societal access.

4. **Renewable Energy Share vs GDP per Capita** (cell In [10]):  
   - Merged per-country/year `% Renewable Share` with `GDP per Capita` and plotted an animated scatterplot to explore economic correlations and outliers.

5. **Annual Growth of Renewable Energy Share per Country** (cell In [12]):  
   - Calculated year-over-year differences in `% Renewable Share` (`YoY_Growth_Renewable`) for each country and visualized the results on an interactive choropleth map with a time slider.

6. **Small Multiples: Renewable Share per Country** (cell In [9]):  
   - Aggregated annual `% Renewable Share` for selected countries and displayed them as faceted line charts for easy cross-country comparison.
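
The first two aggregations in the list above can be approximated with a `groupby` and a pivot table. This is a sketch under assumptions, not the code from the notebook cells: the function names are ours, the split of `PRODUCT` labels into `FOSSIL` and `RENEWABLES` is inferred from the categories listed earlier, and the demo frame holds invented values.

```python
import pandas as pd

# Assumed mapping of PRODUCT labels to the two categories of the chart.
RENEWABLES = {"Hydro", "Wind", "Solar"}
FOSSIL = {"Total combustible fuels"}


def monthly_by_source(df: pd.DataFrame) -> pd.DataFrame:
    """Sum VALUE across all countries for every (DATE, PRODUCT) pair."""
    return df.groupby(["DATE", "PRODUCT"], as_index=False)["VALUE"].sum()


def _category(product: str):
    """Map a PRODUCT label to 'Renewable', 'Fossil', or None (ignored)."""
    if product in RENEWABLES:
        return "Renewable"
    if product in FOSSIL:
        return "Fossil"
    return None


def top_n_fossil_vs_renewable(df: pd.DataFrame, year: int, n: int = 10) -> pd.DataFrame:
    """Annual production per country and category, keeping the n largest producers."""
    sub = df[df["DATE"].dt.year == year].copy()
    sub["CATEGORY"] = sub["PRODUCT"].map(_category)
    sub = sub.dropna(subset=["CATEGORY"])
    table = sub.pivot_table(
        index="COUNTRY", columns="CATEGORY", values="VALUE",
        aggfunc="sum", fill_value=0,
    )
    top = table.sum(axis=1).nlargest(n).index
    return table.loc[top]


# Invented demo data with the cleaned column layout.
demo = pd.DataFrame({
    "COUNTRY": ["A", "A", "B", "B", "A"],
    "DATE": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"]
    ),
    "PRODUCT": ["Hydro", "Total combustible fuels", "Hydro", "Wind", "Hydro"],
    "VALUE": [10.0, 30.0, 5.0, 7.0, 12.0],
})

print(monthly_by_source(demo))
print(top_n_fossil_vs_renewable(demo, year=2020, n=1))
```

Restricting the ranking to OECD members, as in the chart, would additionally require a list of member countries, which is not part of either dataset as described here.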

Together, these aggregated views cover global energy production trends, the transition to renewables, and their relationship to GDP per capita and electricity access.


