# PL04 - Data Integration and Processing I

In [2]:
import pandas as pd

In [90]:
energy_consumption = pd.read_csv('DataSets/change-energy-consumption.csv')
renewable_energy = pd.read_csv('DataSets/modern-renewable-prod.csv')
per_capita_energy_use = pd.read_csv('DataSets/per-capita-energy-use.csv')
primary_energy = pd.read_csv('DataSets/primary-energy-cons.csv')

In [33]:
energy_consumption.head()

Unnamed: 0,Entity,Code,Year,Annual change in primary energy consumption (%)
0,Afghanistan,AFG,1981,12.663031
1,Afghanistan,AFG,1982,6.505477
2,Afghanistan,AFG,1983,22.33379
3,Afghanistan,AFG,1984,0.462401
4,Afghanistan,AFG,1985,-2.365375


### **Energy Data Analysis**

This analysis merges four datasets (annual change in energy consumption, primary energy consumption, renewable electricity production, and per capita energy use) on the shared `Entity`, `Code`, and `Year` columns.


In [91]:
# Left-join each dataset onto energy_consumption using the shared key columns
merge_keys = ['Entity', 'Code', 'Year']
energy_analysis1 = pd.merge(energy_consumption, primary_energy, on=merge_keys, how='left')
energy_analysis2 = pd.merge(energy_analysis1, renewable_energy, on=merge_keys, how='left')
energy_analysis = pd.merge(energy_analysis2, per_capita_energy_use, on=merge_keys, how='left')
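A left merge keeps every row of the left table and fills unmatched columns with NaN, and duplicated keys on either side would silently multiply rows. The sketch below shows one way to guard against both, using the `validate` and `indicator` parameters of `pd.merge`. The two small frames and their column names `change_pct` and `consumption_twh` are made up for illustration and are not the real datasets.

```python
import pandas as pd

# Toy stand-ins for two of the datasets (values are illustrative only).
left = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Code": ["AAA", "AAA", "BBB"],
    "Year": [2000, 2001, 2000],
    "change_pct": [1.5, -0.3, 2.0],
})
right = pd.DataFrame({
    "Entity": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Year": [2000, 2000],
    "consumption_twh": [10.0, 20.0],
})

keys = ["Entity", "Code", "Year"]

# validate raises MergeError if the keys are not unique on both sides;
# indicator adds a "_merge" column saying where each row came from.
merged = pd.merge(left, right, on=keys, how="left",
                  validate="one_to_one", indicator=True)

# Rows present only in the left table end up with NaN in the right-hand columns.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Inspecting `unmatched` before moving on gives a quick sense of how many NaN values the merge itself introduces.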

In [92]:
energy_analysis

Unnamed: 0,Entity,Code,Year,Annual change in primary energy consumption (%),Primary energy consumption (TWh),Electricity from wind - TWh,Electricity from hydro - TWh,Electricity from solar - TWh,Other renewables including bioenergy - TWh,Primary energy consumption per capita (kWh/person)
0,Afghanistan,AFG,1981,12.663031,8.777320,,,,,786.83690
1,Afghanistan,AFG,1982,6.505477,9.348327,,,,,926.65125
2,Afghanistan,AFG,1983,22.333790,11.436162,,,,,1149.19590
3,Afghanistan,AFG,1984,0.462401,11.489043,,,,,1121.57290
4,Afghanistan,AFG,1985,-2.365375,11.217284,,,,,1067.07090
...,...,...,...,...,...,...,...,...,...,...
11717,Zimbabwe,ZWE,2017,-2.984351,45.256546,0.0,3.97,0.01,0.15,3068.01150
11718,Zimbabwe,ZWE,2018,14.479410,51.809430,0.0,5.05,0.02,0.19,3441.98580
11719,Zimbabwe,ZWE,2019,-10.981565,46.119940,0.0,4.17,0.03,0.19,3003.65530
11720,Zimbabwe,ZWE,2020,-8.940124,41.996760,0.0,3.81,0.02,0.10,2680.13180


### Dropping rows with missing Code values

Rows without a `Code` value correspond to continents rather than individual countries, so we drop them. This ensures every remaining entry has a valid code and prevents aggregation errors in country-level analysis.

The code 'OWID_WRL' represents the world as a whole, aggregating data from all countries. Since the analysis focuses on individual countries, we also exclude these rows so that world totals do not distort statistical calculations.

In [93]:
energy_analysis = energy_analysis.dropna(subset=['Code'])

In [94]:
energy_analysis = energy_analysis[energy_analysis['Code'] != 'OWID_WRL']
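Before dropping rows, it can help to list which entities actually lack a code, to confirm they are aggregates and not countries with a missing value. A minimal sketch on made-up rows (the frame `df` is hypothetical, not the merged dataset):

```python
import pandas as pd

# Illustrative rows: one continent without a code, one country, one world aggregate.
df = pd.DataFrame({
    "Entity": ["Africa", "Portugal", "World", "Africa"],
    "Code": [None, "PRT", "OWID_WRL", None],
    "Year": [2000, 2000, 2000, 2001],
})

# Entities that would be removed by dropna(subset=['Code']).
no_code = sorted(df.loc[df["Code"].isna(), "Entity"].unique())
print(no_code)

# Both filters combined into a single boolean mask.
countries = df[df["Code"].notna() & (df["Code"] != "OWID_WRL")]
print(countries)
```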

### Common years for all the countries

In [95]:
energy_analysis.groupby("Entity")["Year"].min()

Entity
Afghanistan       1981
Albania           1981
Algeria           1966
American Samoa    1981
Angola            1981
                  ... 
Western Sahara    1981
Yemen             1981
Yugoslavia        1981
Zambia            1981
Zimbabwe          1981
Name: Year, Length: 223, dtype: int64

In [96]:
energy_analysis.groupby("Entity")["Year"].max()

Entity
Afghanistan       2021
Albania           2021
Algeria           2023
American Samoa    2021
Angola            2021
                  ... 
Western Sahara    2021
Yemen             2021
Yugoslavia        1991
Zambia            2021
Zimbabwe          2021
Name: Year, Length: 223, dtype: int64

### Data Filtering for Consistency Across Countries

Yugoslavia has not existed since 1991, so it only has data up to that year. We’ll remove it from the dataset.

Most countries have data available up to 2021, so we will discard the 2022 and 2023 records that only a few countries have.

Additionally, most countries have data starting in 1981, so we will drop any records from before that year.

In [97]:
energy_analysis = energy_analysis[energy_analysis["Entity"] != "Yugoslavia"]
energy_analysis = energy_analysis[energy_analysis["Year"] <= 2021]
energy_analysis = energy_analysis[energy_analysis["Year"] >= 1981]
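The 1981 to 2021 window was read off the `groupby` outputs by eye. As a sketch of how the same choice could be derived from the data, the example below takes the most common first and last year across entities using `mode()`. The toy frame and the names `span`, `start`, `end`, and `window` are assumptions for illustration.

```python
import pandas as pd

# Toy data: A and B cover 1981-1983, C covers 1980-1984.
df = pd.DataFrame({
    "Entity": ["A"] * 3 + ["B"] * 3 + ["C"] * 5,
    "Year": [1981, 1982, 1983, 1981, 1982, 1983,
             1980, 1981, 1982, 1983, 1984],
})

# First and last available year per entity.
span = df.groupby("Entity")["Year"].agg(["min", "max"])

# Most frequent start and end year across entities.
start = int(span["min"].mode().iloc[0])
end = int(span["max"].mode().iloc[0])

# Keep only records inside the common window (both bounds inclusive).
window = df[df["Year"].between(start, end)]
print(start, end, len(window))
```

Note that this only trims years. An entity whose data ends early, as Yugoslavia's does, would still need to be removed separately.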

In [98]:
# Checking for missing values to identify columns with incomplete data

energy_analysis.isna().sum()

Entity                                                   0
Code                                                     0
Year                                                     0
Annual change in primary energy consumption (%)          0
Primary energy consumption (TWh)                         0
Electricity from wind - TWh                           2769
Electricity from hydro - TWh                          2569
Electricity from solar - TWh                          2796
Other renewables including bioenergy - TWh            2718
Primary energy consumption per capita (kWh/person)      88
dtype: int64

### Handling Missing Values

Nearly all of the missing values are in the four renewable energy columns and correspond to the years 1981 to 1999. This likely indicates that renewable energy sources were not widely used or reported during this period for some countries. The per capita column also has 88 missing values.
To keep the dataset consistent and avoid misinterpreting the gaps, we replace the NA values with 0, assuming that renewable energy usage was negligible or nonexistent at that time.

In [99]:
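# Reassign instead of using inplace=True, since energy_analysis is a filtered
# subset and in-place modification can trigger SettingWithCopyWarning
energy_analysis = energy_analysis.fillna(0)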
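Filling the whole frame also turns the missing per capita values into 0, which is harder to justify than a zero for renewable generation. One alternative, sketched here on a made-up two-row frame, is to restrict the fill to the renewable columns and leave the per capita gaps as NaN. The column names are taken from the output above, and the values are illustrative.

```python
import numpy as np
import pandas as pd

renewable_cols = [
    "Electricity from wind - TWh",
    "Electricity from hydro - TWh",
    "Electricity from solar - TWh",
    "Other renewables including bioenergy - TWh",
]
per_capita_col = "Primary energy consumption per capita (kWh/person)"

# Illustrative frame: the 1981 row has no renewable or per capita data.
df = pd.DataFrame({
    "Entity": ["A", "A"],
    "Year": [1981, 2000],
    "Electricity from wind - TWh": [np.nan, 0.5],
    "Electricity from hydro - TWh": [np.nan, 3.0],
    "Electricity from solar - TWh": [np.nan, 0.1],
    "Other renewables including bioenergy - TWh": [np.nan, 0.2],
    per_capita_col: [np.nan, 1200.0],
})

# Zero-fill only the renewable columns; other columns keep their NaN values.
filled = df.copy()
filled[renewable_cols] = filled[renewable_cols].fillna(0)
print(filled.isna().sum())
```

Whether the remaining per capita gaps should be dropped, interpolated, or kept depends on the later analysis, so this sketch leaves them untouched.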

In [100]:
# Export the cleaned dataset to a CSV file

energy_analysis.to_csv('Energy_Analysis.csv', index=False)