# parse_data.ipynb

This notebook parses the data files used for the FP-2 assignment. 

<br>
The dependent and independent variables (DVs and IVs) that we are interested in are:

**DVs**:
- Fossil fuel electricity share ("fossil_share_elec" column in the CSV file)
- Renewable energy electricity share ("renewables_share_elec" column in the CSV file)

**IVs**:
- Low-income countries (in the "country" column of the CSV file)
- Lower-middle income countries (in the "country" column of the CSV file)
- Upper-middle income countries (in the "country" column of the CSV file)
- High-income countries (in the "country" column of the CSV file)
- Population ("population" column in the CSV file)
- Electricity demand ("electricity_demand" column in the CSV file)
- Greenhouse gas emissions ("greenhouse_gas_emissions" column in the CSV file)
- Income group ("FY23" column in the XLSX file)
<br>

Reading the attached data files and cleaning the data:

In [13]:
import pandas as pd

# Load the full energy dataset
df0 = pd.read_csv('owid-energy-data.csv')

selected_cols = ["country", "year", "iso_code", "population", "electricity_demand", "greenhouse_gas_emissions", "fossil_share_elec", "renewables_share_elec"]

# Keep only the columns of interest, restricted to 2021
energydata = df0[selected_cols]
energydata = energydata[energydata['year'] == 2021]

# Drop every row that has a missing value in any selected column
energydata = energydata.dropna()

# Use readable column names
energydata = energydata.rename(columns={'country': 'Country', 'year': 'Year', 'iso_code': 'ISO', 'fossil_share_elec': 'FF electricity share', 'renewables_share_elec': 'RE electricity share', 'population': 'Population', 'electricity_demand': 'Electricity demand', 'greenhouse_gas_emissions': 'GHG emissions'})

energydata

Unnamed: 0,Country,Year,ISO,Population,Electricity demand,GHG emissions,FF electricity share,RE electricity share
146,Afghanistan,2021,AFG,40000360.0,6.76,0.12,12.264,87.736
641,Albania,2021,ALB,2849591.0,8.39,0.21,0.000,100.000
765,Algeria,2021,DZA,44761051.0,84.45,54.23,99.054,0.946
810,American Samoa,2021,ASM,49202.0,0.17,0.11,100.000,0.000
934,Angola,2021,AGO,34532382.0,16.85,2.91,24.570,75.430
...,...,...,...,...,...,...,...,...
22314,Venezuela,2021,VEN,28237786.0,82.93,14.45,20.764,79.235
22439,Vietnam,2021,VNM,98935054.0,254.43,120.51,57.226,42.774
22849,Yemen,2021,YEM,37140180.0,2.92,1.72,82.877,17.123
23066,Zambia,2021,ZMB,19603556.0,15.59,1.54,8.014,91.986
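
Because `dropna()` removes a row as soon as any selected column is missing, it can be useful to see which columns are responsible before dropping. The following is a minimal sketch, not part of the original notebook: the helper `missing_by_column` is hypothetical, and the toy frame (including its aggregate row and that row's values) is made up for illustration rather than taken from the CSV file.

```python
import pandas as pd


def missing_by_column(df: pd.DataFrame, cols: list) -> pd.Series:
    """Count missing values in each of the given columns (hypothetical helper)."""
    return df[cols].isna().sum()


# Toy stand-in for the 2021 slice of the energy data.
# The third row and its values are invented for illustration only.
toy = pd.DataFrame({
    "country": ["Albania", "Algeria", "Low-income countries"],
    "iso_code": ["ALB", "DZA", None],
    "population": [2849591.0, 44761051.0, None],
    "fossil_share_elec": [0.000, 99.054, 5.0],
})

report = missing_by_column(toy, ["iso_code", "population", "fossil_share_elec"])
print(report)

# Number of rows that survive a blanket dropna()
rows_kept = len(toy.dropna())
print(f"rows kept by dropna(): {rows_kept} of {len(toy)}")
```

Running the same report on the real 2021 slice would show whether most rows are lost to one sparse column or to several.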


In [9]:
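# Load the income classification workbook
data = pd.ExcelFile('income_classification.xlsx')
df1 = pd.read_excel(data, sheet_name='Country Analytical History')

# Hard-coded positions within the sheet
FY23_row = 4         # row of the FY23 label, per the variable name (not used below)
FY23_col = 36        # column with the FY23 income groups
startrow = 11        # first row that is read as country data
isocodecol = 0       # column with the ISO codes
countrynamecol = 1   # column with the country names

# Select ISO code, country name, and FY23 income group for every country row
income_2021 = df1.iloc[startrow:, [isocodecol, countrynamecol, FY23_col]].copy()
income_2021.columns = ["ISO", "Country Name", "Income Group FY23"]

# Drop rows with a missing ISO code, country name, or income group
income_2021 = income_2021.dropna()
income_2021.head(100)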

Unnamed: 0,ISO,Country Name,Income Group FY23
11,ALB,Albania,UM
12,DZA,Algeria,LM
13,ASM,American Samoa,UM
14,AND,Andorra,H
15,AGO,Angola,LM
...,...,...,...
106,ITA,Italy,H
107,JAM,Jamaica,UM
108,JPN,Japan,H
109,JOR,Jordan,UM
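
The cell above relies on hard-coded positions (`FY23_col = 36`, `startrow = 11`), which would break silently if the spreadsheet layout changed. As a hedged alternative, the column could be located by searching a header row for the "FY23" label. This is only a sketch: `find_label_column` is a hypothetical helper, and the toy sheet below uses an invented layout and invented FY22 values, not the real layout of `income_classification.xlsx`.

```python
import pandas as pd


def find_label_column(sheet: pd.DataFrame, label: str, header_row: int) -> int:
    """Return the position of the first column whose cell in header_row equals label."""
    row = sheet.iloc[header_row]
    matches = [i for i, value in enumerate(row) if str(value).strip() == label]
    if not matches:
        raise ValueError(f"{label!r} not found in row {header_row}")
    return matches[0]


# Toy sheet with an assumed layout: one header row, then one row per country.
toy_sheet = pd.DataFrame([
    [None, "Fiscal year:", "FY22", "FY23"],
    ["ALB", "Albania", "UM", "UM"],
    ["DZA", "Algeria", "LM", "LM"],
])

# Look the column up by its label instead of hard-coding its position
col = find_label_column(toy_sheet, "FY23", header_row=0)

income = toy_sheet.iloc[1:, [0, 1, col]].copy()
income.columns = ["ISO", "Country Name", "Income Group FY23"]
print(income)
```

A lookup like this fails loudly with a `ValueError` when the label is missing, instead of quietly reading the wrong column.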


Merging the two datasets:

In [14]:
# Make sure the join key is a string in both frames
energydata["ISO"] = energydata["ISO"].astype(str)
income_2021["ISO"] = income_2021["ISO"].astype(str)
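
Before relying on the merged table, it may help to check which ISO codes in the energy data have no counterpart in the income data. In the outputs shown in this notebook, Afghanistan appears in `energydata`, while the displayed rows of `income_2021` start at Albania, so at least one country may be lost in the merge. The sketch below uses the `indicator=True` option of `DataFrame.merge` on a toy subset of the rows shown above; the frame names `energy`, `income`, and `checked` are illustrative only.

```python
import pandas as pd

# Toy subsets of the two tables, using rows displayed in this notebook
energy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "ISO": ["AFG", "ALB", "DZA"],
})
income = pd.DataFrame({
    "ISO": ["ALB", "DZA"],
    "Income Group FY23": ["UM", "LM"],
})

# indicator=True adds a "_merge" column saying where each row came from
checked = energy.merge(income, on="ISO", how="left", indicator=True)

# Rows marked "left_only" exist in the energy data but have no income group
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Applied to the full tables, the list of unmatched countries shows exactly which rows a later `dropna()` would discard, and whether any of them are dropped by mistake.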

In [17]:
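# Left-join the income group onto the energy data by ISO code
finaldata = energydata.merge(income_2021[["ISO", "Income Group FY23"]], on="ISO", how="left")
# Drop countries that have no income group after the join
finaldata = finaldata.dropna()
finaldata.head(100)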

Unnamed: 0,Country,Year,ISO,Population,Electricity demand,GHG emissions,FF electricity share,RE electricity share,Income Group FY23
1,Albania,2021,ALB,2849591.0,8.39,0.21,0.000,100.000,UM
2,Algeria,2021,DZA,44761051.0,84.45,54.23,99.054,0.946,LM
3,American Samoa,2021,ASM,49202.0,0.17,0.11,100.000,0.000,UM
4,Angola,2021,AGO,34532382.0,16.85,2.91,24.570,75.430,LM
5,Antigua and Barbuda,2021,ATG,92316.0,0.35,0.22,94.286,5.714,H
...,...,...,...,...,...,...,...,...,...
100,Jordan,2021,JOR,11066310.0,21.58,11.54,77.013,22.987,UM
101,Kazakhstan,2021,KAZ,19743565.0,114.52,95.96,89.060,10.940,UM
102,Kenya,2021,KEN,53219122.0,12.65,1.19,10.178,89.822,LM
103,Kiribati,2021,KIR,128346.0,0.04,0.02,75.000,25.000,LM
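
With the income group attached, the DVs can be summarized per group. The sketch below is one possible next step, not part of the original notebook. It uses the first five rows of the merged output above, and it assumes that the codes `L`, `LM`, `UM`, and `H` stand for the low, lower-middle, upper-middle, and high income groups listed under the IVs; the mapping `INCOME_LABELS` and the column name `Income group` are introduced here for illustration.

```python
import pandas as pd

# Assumed meaning of the income group codes
INCOME_LABELS = {
    "L": "Low income",
    "LM": "Lower-middle income",
    "UM": "Upper-middle income",
    "H": "High income",
}

# First five rows of the merged output shown above
sample = pd.DataFrame({
    "Country": ["Albania", "Algeria", "American Samoa", "Angola", "Antigua and Barbuda"],
    "FF electricity share": [0.000, 99.054, 100.000, 24.570, 94.286],
    "RE electricity share": [100.000, 0.946, 0.000, 75.430, 5.714],
    "Income Group FY23": ["UM", "LM", "UM", "LM", "H"],
})

# Replace the short codes with readable labels
sample["Income group"] = sample["Income Group FY23"].map(INCOME_LABELS)

# Mean fossil fuel and renewable electricity share per income group
summary = sample.groupby("Income group")[["FF electricity share", "RE electricity share"]].mean()
print(summary)
```

Five rows are far too few to say anything about the groups; the same two lines applied to `finaldata` would give the per-group means for all countries that survived the merge.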
