# **Prediction of carbon footprints from country-specific data**

## Stage 1: Data cleaning and preparation

**Input:** Excel data file from the original online source

**Output:** CSV file containing the cleaned data ready for visualization and analysis

**Programming language:** Python 3.7

**Libraries used:** pandas, numpy

**Data source**

The data used in this project comes from the World Bank Group. The dataset provides country-specific information on a range of parameters, including CO2 emissions, energy consumption, population count, urban population, cereal yield, terrestrial protected areas, GDP, GNI, and more.

The dataset is publicly accessible at https://databank.worldbank.org/reports.aspx?source=world-development-indicators

**About this notebook**
In this notebook, global carbon footprint data and a range of related variables are loaded, cleaned, and preprocessed. The variables considered as predictors of CO2 emissions are deliberately broad: they cover emissions of other greenhouse gases, population, economic indicators, land use, and energy use. A clean and broad feature set should help build a more accurate model in the later stages.

# **1. Notebook setup**

Importing all the necessary libraries.


In [None]:
import pandas as pd
import numpy as np

In [None]:
# define the file name and the data sheet
orig_data_file = "Data_Extract_From_World_Development_Indicators.xlsx"
data_sheet = "Data"

# read the data from the excel file to a pandas DataFrame
data_orig = pd.read_excel(io=orig_data_file, sheet_name=data_sheet)

# **2. Global data overview**

A global overview of the imported data yields the following insights:

In [None]:
print("Shape of the original dataset:")
data_orig.shape

Shape of the original dataset:


(1000, 25)

In [None]:
print("Columns:")
data_orig.columns

Columns:


Index(['Country Name', 'Country Code', 'Time', 'Time Code',
       'CO2 emissions (kt) [EN.ATM.CO2E.KT]',
       'CO2 emissions (kg per PPP $ of GDP) [EN.ATM.CO2E.PP.GD]',
       'Cereal yield (kg per hectare) [AG.YLD.CREL.KG]',
       'Foreign direct investment, net inflows (% of GDP) [BX.KLT.DINV.WD.GD.ZS]',
       'Access to electricity (% of population) [EG.ELC.ACCS.ZS]',
       'Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP) [EG.USE.COMM.GD.PP.KD]',
       'Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent) [EN.ATM.GHGO.KT.CE]',
       'Methane emissions (kt of CO2 equivalent) [EN.ATM.METH.KT.CE]',
       'Nitrous oxide emissions (thousand metric tons of CO2 equivalent) [EN.ATM.NOXE.KT.CE]',
       'Urban population growth (annual %) [SP.URB.GROW]',
       'Population in urban agglomerations of more than 1 million [EN.URB.MCTY]',
       'Population growth (annual %) [SP.POP.GROW]',
       'Terrestrial protected areas (% 

In [None]:
print("Column data types:")
data_orig.dtypes

Column data types:


Country Name                                                                                                      object
Country Code                                                                                                      object
Time                                                                                                               int64
Time Code                                                                                                         object
CO2 emissions (kt) [EN.ATM.CO2E.KT]                                                                              float64
CO2 emissions (kg per PPP $ of GDP) [EN.ATM.CO2E.PP.GD]                                                           object
Cereal yield (kg per hectare) [AG.YLD.CREL.KG]                                                                    object
Foreign direct investment, net inflows (% of GDP) [BX.KLT.DINV.WD.GD.ZS]                                          object
Access to electricity (% of popu

In [None]:
print("Overview of the first 5 rows:")
data_orig.head()

Overview of the first 5 rows:


Unnamed: 0,Country Name,Country Code,Time,Time Code,CO2 emissions (kt) [EN.ATM.CO2E.KT],CO2 emissions (kg per PPP $ of GDP) [EN.ATM.CO2E.PP.GD],Cereal yield (kg per hectare) [AG.YLD.CREL.KG],"Foreign direct investment, net inflows (% of GDP) [BX.KLT.DINV.WD.GD.ZS]",Access to electricity (% of population) [EG.ELC.ACCS.ZS],"Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP) [EG.USE.COMM.GD.PP.KD]",...,Population growth (annual %) [SP.POP.GROW],Terrestrial protected areas (% of total land area) [ER.LND.PTLD.ZS],GDP (current US$) [NY.GDP.MKTP.CD],"GNI per capita, Atlas method (current US$) [NY.GNP.PCAP.CD]","Population, total [SP.POP.TOTL]",Urban population [SP.URB.TOTL],Agricultural land (% of land area) [AG.LND.AGRI.ZS],Fossil fuel energy consumption (% of total) [EG.USE.COMM.FO.ZS],CO2 emissions (metric tons per capita) [EN.ATM.CO2E.PC],CO2 emissions from transport (% of total fuel combustion) [EN.CO2.TRAN.ZS]
0,Argentina,ARG,2000,YR2000,132265.5,0.309098,3461.8,3.665791,95.680473,89.637633,...,1.133277,..,284203750000.0,7430,37070774,33045629,46.958187,88.38751,3.567918,28.890324
1,Argentina,ARG,2001,YR2001,125255.2,0.299469,3398.6,0.806164,95.511063,89.201029,...,1.099171,..,268696750000.0,6960,37480493,33480950,46.993266,85.994544,3.341877,28.380686
2,Argentina,ARG,2002,YR2002,117462.1,0.310337,3275.7,2.198958,96.096001,97.29496,...,1.073538,..,97724004251.86021,4020,37885028,33910889,47.031268,85.803408,3.100489,28.007565
3,Argentina,ARG,2003,YR2003,127653.5,0.303881,3308.7,1.294811,96.297951,95.65188,...,1.032361,..,127586973492.17664,3640,38278164,34330154,47.174981,86.014714,3.334891,26.73343
4,Argentina,ARG,2004,YR2004,141376.4,0.300607,3658.8,2.505018,96.505371,95.838259,...,1.015337,..,164657930452.78662,3360,38668796,34747780,47.318695,89.324968,3.656085,25.858186


In [None]:
print("Descriptive statistics of the columns:")
data_orig.describe()

Descriptive statistics of the columns:


Unnamed: 0,Time,CO2 emissions (kt) [EN.ATM.CO2E.KT],Methane emissions (kt of CO2 equivalent) [EN.ATM.METH.KT.CE],Nitrous oxide emissions (thousand metric tons of CO2 equivalent) [EN.ATM.NOXE.KT.CE],Urban population growth (annual %) [SP.URB.GROW],Population growth (annual %) [SP.POP.GROW],"Population, total [SP.POP.TOTL]",Urban population [SP.URB.TOTL],Agricultural land (% of land area) [AG.LND.AGRI.ZS],CO2 emissions (metric tons per capita) [EN.ATM.CO2E.PC]
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,2009.5,506752.6,112114.9,40775.992034,1.854553,1.232634,108212900.0,56261310.0,39.911981,5.827997
std,5.769167,1343848.0,200944.9,80809.679671,1.735825,1.439599,249701400.0,109527500.0,21.303937,5.05028
min,2000.0,1078.12,527.4696,353.611402,-4.980317,-5.280078,281205.0,259836.0,2.693886,0.053064
25%,2004.75,39761.6,12942.29,5105.012425,0.753377,0.462917,11281540.0,6607584.0,21.191421,1.548603
50%,2009.5,118942.0,39765.76,16384.81454,1.450682,0.996109,37967780.0,22992790.0,42.677896,5.240251
75%,2014.25,379198.8,85978.12,38693.245785,2.658206,1.748553,89318690.0,49753090.0,55.12782,8.255466
max,2019.0,10762820.0,1163215.0,551682.7235,18.580685,18.127984,1407745000.0,848982900.0,80.888475,28.138659


In [None]:
data_orig_info = pd.read_excel(io=orig_data_file, sheet_name='Series - Metadata')

data_orig_info['Indicator Name'].unique()

array(['CO2 emissions (kt)', 'CO2 emissions (kg per PPP $ of GDP)',
       'Cereal yield (kg per hectare)',
       'Foreign direct investment, net inflows (% of GDP)',
       'Access to electricity (% of population)',
       'Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP)',
       'Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)',
       'Methane emissions (kt of CO2 equivalent)',
       'Nitrous oxide emissions (thousand metric tons of CO2 equivalent)',
       'Urban population growth (annual %)',
       'Population in urban agglomerations of more than 1 million',
       'Population growth (annual %)',
       'Terrestrial protected areas (% of total land area)',
       'GDP (current US$)', 'GNI per capita, Atlas method (current US$)',
       'Population, total', 'Urban population',
       'Agricultural land (% of land area)',
       'Fossil fuel energy consumption (% of total)',
       'CO2 emissions (metric tons per capi

In [None]:
data_orig_info['Code'].unique()

array(['EN.ATM.CO2E.KT', 'EN.ATM.CO2E.PP.GD', 'AG.YLD.CREL.KG',
       'BX.KLT.DINV.WD.GD.ZS', 'EG.ELC.ACCS.ZS', 'EG.USE.COMM.GD.PP.KD',
       'EN.ATM.GHGO.KT.CE', 'EN.ATM.METH.KT.CE', 'EN.ATM.NOXE.KT.CE',
       'SP.URB.GROW', 'EN.URB.MCTY', 'SP.POP.GROW', 'ER.LND.PTLD.ZS',
       'NY.GDP.MKTP.CD', 'NY.GNP.PCAP.CD', 'SP.POP.TOTL', 'SP.URB.TOTL',
       'AG.LND.AGRI.ZS', 'EG.USE.COMM.FO.ZS', 'EN.ATM.CO2E.PC',
       'EN.CO2.TRAN.ZS'], dtype=object)

**Insights from the overview:**



*   shape: 25 columns, 1000 rows
*   the columns are of type "object", "int64" (only 'Time'), or "float64"
*   The 'Indicator Name' from the 'Series - Metadata' sheet contains the country-specific features required for analysis
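
The shape and data-type summary in the list above can also be derived programmatically. The following is a minimal sketch on a small hypothetical frame; `toy` and its values are made up for illustration and are not taken from the World Bank file.

```python
import pandas as pd

# hypothetical stand-in for data_orig (values are invented)
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Time": [2000, 2001],
    "CO2 emissions (kt)": [1.5, 2.5],
    "Cereal yield": ["3461.8", ".."],  # stored as text because of the ".." placeholder
})

# number of rows and columns
n_rows, n_cols = toy.shape

# how many columns there are of each data type
dtype_counts = toy.dtypes.astype(str).value_counts()

print(n_rows, n_cols)
print(dtype_counts)
```

Columns that appear as text here are the ones that need the placeholder handling and numeric conversion performed later in the notebook.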







The data series available can be summarized into the following country-specific parameter/feature categories:



*   **various emissions of greenhouse gases:** CO2, CH4, N2O, others
*   **population-specific parameters:** population count, urban population, population growth
*   **country economic indicators:** GDP, GNI
*   **land-related parameters:** cereal yield, agricultural land, terrestrial protected areas
*   **energy use:** access to electricity, fossil fuel energy consumption

# **3. Data Cleaning**

In [None]:
data_clean = data_orig

print("Number of rows:")
print(data_clean.shape[0])

Number of rows:
1000


In [None]:
print("Original number of columns:")
print(data_clean.shape[1])

data_clean = data_clean.drop(['Time Code', 'Country Name'], axis='columns')

print("Current number of columns:")
print(data_clean.shape[1])

Original number of columns:
25
Current number of columns:
23


### 3.1 Transform the ".." strings and empty cells ("") into NaN values

In [None]:
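# NOTE: the slice iloc[2:, :] skips rows 0 and 1, so those two rows keep their ".." placeholders
# (visible in the output below); calling .replace() on the whole DataFrame would convert every row
data_clean.iloc[2:,:] = data_clean.iloc[2:,:].replace({'':np.nan, '..':np.nan, '...': np.nan})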

In [None]:
data_clean

Unnamed: 0,Country Code,Time,CO2 emissions (kt) [EN.ATM.CO2E.KT],CO2 emissions (kg per PPP $ of GDP) [EN.ATM.CO2E.PP.GD],Cereal yield (kg per hectare) [AG.YLD.CREL.KG],"Foreign direct investment, net inflows (% of GDP) [BX.KLT.DINV.WD.GD.ZS]",Access to electricity (% of population) [EG.ELC.ACCS.ZS],"Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP) [EG.USE.COMM.GD.PP.KD]","Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent) [EN.ATM.GHGO.KT.CE]",Methane emissions (kt of CO2 equivalent) [EN.ATM.METH.KT.CE],...,Population growth (annual %) [SP.POP.GROW],Terrestrial protected areas (% of total land area) [ER.LND.PTLD.ZS],GDP (current US$) [NY.GDP.MKTP.CD],"GNI per capita, Atlas method (current US$) [NY.GNP.PCAP.CD]","Population, total [SP.POP.TOTL]",Urban population [SP.URB.TOTL],Agricultural land (% of land area) [AG.LND.AGRI.ZS],Fossil fuel energy consumption (% of total) [EG.USE.COMM.FO.ZS],CO2 emissions (metric tons per capita) [EN.ATM.CO2E.PC],CO2 emissions from transport (% of total fuel combustion) [EN.CO2.TRAN.ZS]
0,ARG,2000,132265.5,0.309098,3461.8,3.665791,95.680473,89.637633,-8326.802734,119811.10500,...,1.133277,..,284203750000,7430,37070774,33045629,46.958187,88.38751,3.567918,28.890324
1,ARG,2001,125255.2,0.299469,3398.6,0.806164,95.511063,89.201029,-5126.261719,120443.19620,...,1.099171,..,268696750000,6960,37480493,33480950,46.993266,85.994544,3.341877,28.380686
2,ARG,2002,117462.1,0.310337,3275.7,2.198958,96.096001,97.29496,-4499.005859,123719.92240,...,1.073538,,97724004251.860199,4020.0,37885028,33910889,47.031268,85.803408,3.100489,28.007565
3,ARG,2003,127653.5,0.303881,3308.7,1.294811,96.297951,95.65188,-3526.477539,131292.92610,...,1.032361,,127586973492.176636,3640.0,38278164,34330154,47.174981,86.014714,3.334891,26.73343
4,ARG,2004,141376.4,0.300607,3658.8,2.505018,96.505371,95.838259,-6496.553711,132880.07480,...,1.015337,,164657930452.786621,3360.0,38668796,34747780,47.318695,89.324968,3.656085,25.858186
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,ZMB,2015,4956.6,0.090993,3026.4,7.447417,31.1,,1439.153206,15821.29853,...,3.191896,,21251216798.776245,1540.0,16248230,6809146,32.063923,,0.305055,
996,ZMB,2016,5315.3,0.095406,2432.2,3.16252,35.37722,,1849.415859,15598.34229,...,3.147407,38.05,20958412538.309345,1340.0,16767761,7115902,32.063923,,0.316995,
997,ZMB,2017,6810.7,0.115956,2489.9,4.280501,40.3,,,14998.32105,...,3.113595,37.87001,25873601260.835304,1270.0,17298054,7434012,32.063923,,0.393726,
998,ZMB,2018,7857.2,0.125567,2168.1,1.552319,40.22934,,,15218.42231,...,3.061888,37.87001,26311506435.059845,1400.0,17835893,7762359,32.063923,,0.440527,
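
As a hedged illustration of the placeholder handling in this step, the sketch below applies the replacement to an entire small frame rather than to a row slice. The frame `toy` and its values are hypothetical; only the three placeholder strings are taken from the cell above.

```python
import numpy as np
import pandas as pd

# hypothetical values; the placeholders mirror those handled in the notebook
toy = pd.DataFrame({
    "country": ["AAA", "AAA", "BBB"],
    "gdp": ["1.5", "..", ""],
    "gni": ["100", "200", "..."],
})

# replace every placeholder string in every row with NaN
placeholders = ["", "..", "..."]
toy_clean = toy.replace(placeholders, np.nan)

# count the resulting missing values per column
print(toy_clean.isnull().sum())
```

Because no row slice is involved, the first rows are treated the same way as all the others.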


### 3.2 Transform all data columns into a numerical data type

In [None]:
data_clean2 = data_clean.applymap(lambda x: pd.to_numeric(x, errors='ignore'))
# errors='ignore' returns values that cannot be parsed (e.g. the country codes) unchanged instead of raising,
# so the non-numeric identifier column is left as it is
# (newer pandas releases deprecate both applymap and errors='ignore')

print("Print the column data types after transformation:")
data_clean2.dtypes

Print the column data types after transformation:


Country Code                                                                                                      object
Time                                                                                                               int64
CO2 emissions (kt) [EN.ATM.CO2E.KT]                                                                              float64
CO2 emissions (kg per PPP $ of GDP) [EN.ATM.CO2E.PP.GD]                                                          float64
Cereal yield (kg per hectare) [AG.YLD.CREL.KG]                                                                   float64
Foreign direct investment, net inflows (% of GDP) [BX.KLT.DINV.WD.GD.ZS]                                         float64
Access to electricity (% of population) [EG.ELC.ACCS.ZS]                                                         float64
Energy use (kg of oil equivalent) per $1,000 GDP (constant 2017 PPP) [EG.USE.COMM.GD.PP.KD]                      float64
Other greenhouse gas emissions, 
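
An alternative to the element-wise conversion is to convert column by column. The sketch below is an assumption-laden variant, not the notebook's method: it uses `errors="coerce"`, which turns any value that cannot be parsed into NaN instead of leaving it as a string. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# hypothetical frame: one identifier column and two text columns holding numbers
toy = pd.DataFrame({
    "country": ["AAA", "BBB", "CCC"],
    "gdp": ["1.5", "2", None],
    "pop": ["10", "20", "n/a"],
})

toy_num = toy.copy()
for col in toy_num.columns.drop("country"):
    # errors="coerce" converts anything unparseable (here "n/a") to NaN
    toy_num[col] = pd.to_numeric(toy_num[col], errors="coerce")

print(toy_num.dtypes)
```

The trade-off is that leftover placeholder strings are silently counted as missing values, which may or may not be desired.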

### 3.3 Rename the columns

In [None]:
new_col_names = ['country', 'year', 'co2_kt', 'co2_per_gdp', 'cereal_yield', 'fdi_perc_gdp', 'elec_access_perc', 'en_per_cap', 'other_ghg_ttl', 'ch4_ttl', 'n2o_ttl', 'urb_pop_growth_perc', 'pop_urb_aggl_perc', 'pop_growth_perc', 'prot_area_perc', 'gdp', 'gni_per_cap', 'pop', 'urb_pop', 'agr_land', 'fsfl_cons', 'co2_per_cap', 'co2_trans']
# rename all the column names with comprehensible shorter versions
data_clean2.columns = new_col_names

In [None]:
data_clean2.head()

Unnamed: 0,country,year,co2_kt,co2_per_gdp,cereal_yield,fdi_perc_gdp,elec_access_perc,en_per_cap,other_ghg_ttl,ch4_ttl,...,pop_growth_perc,prot_area_perc,gdp,gni_per_cap,pop,urb_pop,agr_land,fsfl_cons,co2_per_cap,co2_trans
0,ARG,2000,132265.5,0.309098,3461.8,3.665791,95.680473,89.637633,-8326.802734,119811.105,...,1.133277,..,284203800000.0,7430.0,37070774,33045629,46.958187,88.38751,3.567918,28.890324
1,ARG,2001,125255.2,0.299469,3398.6,0.806164,95.511063,89.201029,-5126.261719,120443.1962,...,1.099171,..,268696800000.0,6960.0,37480493,33480950,46.993266,85.994544,3.341877,28.380686
2,ARG,2002,117462.1,0.310337,3275.7,2.198958,96.096001,97.29496,-4499.005859,123719.9224,...,1.073538,,97724000000.0,4020.0,37885028,33910889,47.031268,85.803408,3.100489,28.007565
3,ARG,2003,127653.5,0.303881,3308.7,1.294811,96.297951,95.65188,-3526.477539,131292.9261,...,1.032361,,127587000000.0,3640.0,38278164,34330154,47.174981,86.014714,3.334891,26.73343
4,ARG,2004,141376.4,0.300607,3658.8,2.505018,96.505371,95.838259,-6496.553711,132880.0748,...,1.015337,,164657900000.0,3360.0,38668796,34747780,47.318695,89.324968,3.656085,25.858186
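
Assigning a list to `columns` relies on the names being in exactly the same order as the columns. A possible alternative, sketched here on a hypothetical three-column frame, is an explicit old-to-new mapping passed to `DataFrame.rename`; the mapping `rename_map` is illustrative and covers only a few of the notebook's columns.

```python
import pandas as pd

# hypothetical frame with three of the original column names
toy = pd.DataFrame({
    "Country Code": ["AAA"],
    "Time": [2000],
    "CO2 emissions (kt) [EN.ATM.CO2E.KT]": [1.5],
})

# explicit mapping from original to short names
rename_map = {
    "Country Code": "country",
    "Time": "year",
    "CO2 emissions (kt) [EN.ATM.CO2E.KT]": "co2_kt",
}

# fail early if a name in the mapping does not exist in the frame
missing = set(rename_map) - set(toy.columns)
assert not missing, f"columns not found: {missing}"

toy_renamed = toy.rename(columns=rename_map)
print(list(toy_renamed.columns))
```

With a mapping, a reordered or missing column produces an explicit error rather than a silently mislabeled feature.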


### 3.4 Filtering the years by missing values

In [None]:
print("check the amount of missing values in each column")
data_clean2.isnull().sum()

check the amount of missing values in each column


country                  0
year                     0
co2_kt                   0
co2_per_gdp             42
cereal_yield            15
fdi_perc_gdp            42
elec_access_perc         9
en_per_cap             274
other_ghg_ttl          150
ch4_ttl                  0
n2o_ttl                  0
urb_pop_growth_perc      0
pop_urb_aggl_perc       60
pop_growth_perc          0
prot_area_perc         798
gdp                     22
gni_per_cap             31
pop                      0
urb_pop                  0
agr_land                 0
fsfl_cons              246
co2_per_cap              0
co2_trans              265
dtype: int64

In [None]:
all_vars_clean = data_clean2

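# define a dict with the unique year values as keys and a missing-value counter of 0 for each
years_count_missing = dict.fromkeys(all_vars_clean['year'].unique(), 0)

# iterate through all rows and add each row's number of NaN values to the counter of its year
for ind, row in all_vars_clean.iterrows():
    years_count_missing[row['year']] += row.isnull().sum()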

# sort the years by missing values
years_missing_sorted = dict(sorted(years_count_missing.items(), key=lambda item: item[1]))

# print the missing values for each year
print("missing values by year:")
for key, val in years_missing_sorted.items():
    print(key, ":", val)

missing values by year:
2011 : 65
2012 : 65
2013 : 65
2009 : 66
2010 : 66
2002 : 67
2003 : 67
2004 : 67
2005 : 67
2006 : 67
2007 : 67
2008 : 67
2014 : 67
2000 : 70
2001 : 70
2016 : 159
2015 : 165
2017 : 209
2018 : 209
2019 : 209
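
The same per-year count can be obtained without an explicit row loop. The following sketch assumes a frame with a `year` column, as in the notebook; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical frame with a few missing values
toy = pd.DataFrame({
    "country": ["AAA", "AAA", "BBB", "BBB"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [1.0, np.nan, 2.0, np.nan],
    "gni": [np.nan, np.nan, 3.0, 4.0],
})

# number of NaN values in each row
missing_per_row = toy.isnull().sum(axis=1)

# sum the row counts per year and sort ascending
missing_by_year = missing_per_row.groupby(toy["year"]).sum().sort_values()
print(missing_by_year)
```

Grouping by `toy["country"]` instead of `toy["year"]` gives the per-country counts in the same way.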


### 3.5 Filtering the countries by missing values

In [None]:
# check the amount of missing values by country

# define a dict with the unique country values as keys and a missing-value counter of 0 for each
countries_count_missing = dict.fromkeys(all_vars_clean['country'].unique(), 0)

# iterate through all rows and count the amount of NaN values for each country
for ind, row in all_vars_clean.iterrows():
    countries_count_missing[row['country']] += row.isnull().sum()

# sort the countries by missing values
countries_missing_sorted = dict(sorted(countries_count_missing.items(), key=lambda item: item[1]))

# print the missing values for each country
print("missing values by country:")
for key, val in countries_missing_sorted.items():
    print(key, ":", val)

missing values by country:
ARG : 32
AUS : 32
FRA : 32
DEU : 32
ITA : 32
JPN : 32
KOR : 32
MEX : 32
NLD : 32
ESP : 32
CHE : 32
GBR : 32
USA : 32
BEL : 32
CHL : 32
DNK : 32
FIN : 32
NZL : 32
NOR : 32
PRT : 32
POL : 32
SWE : 32
BRA : 34
CHN : 34
IND : 34
IDN : 34
RUS : 34
SAU : 34
BGD : 34
COL : 34
CMR : 34
EGY : 34
ETH : 34
IRQ : 34
MYS : 34
NPL : 34
NGA : 34
PAK : 34
PHL : 34
ZAF : 34
UKR : 34
LBY : 36
ARE : 36
ZMB : 36
NAM : 54
LKA : 54
ISL : 67
CUB : 89
AFG : 94
PRK : 138


This output suggests removing the rows for countries with 90 or more missing values, which here are AFG and PRK.

In [None]:
print("number of missing values in the whole dataset before filtering the countries:")
print(all_vars_clean.isnull().sum().sum())
print("number of rows before filtering the countries:")
print(all_vars_clean.shape[0])


# keep only the rows for countries with fewer than 90 missing values
countries_filter = []
for key, val in countries_missing_sorted.items():
    if val<90:
        countries_filter.append(key)

all_vars_clean = all_vars_clean[all_vars_clean['country'].isin(countries_filter)]

print("number of missing values in the whole dataset after filtering the countries:")
print(all_vars_clean.isnull().sum().sum())
print("number of rows after filtering the countries:")
print(all_vars_clean.shape[0])

number of missing values in the whole dataset before filtering the countries:
1954
number of rows before filtering the countries:
1000
number of missing values in the whole dataset after filtering the countries:
1722
number of rows after filtering the countries:
960
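
The country filter can also be expressed as a single boolean row mask. The sketch below uses `groupby(...).transform("sum")` to attach each country's total missing count to every one of its rows; the frame `toy` and the threshold `max_missing = 2` are hypothetical (the notebook uses 90).

```python
import numpy as np
import pandas as pd

# hypothetical frame: AAA has 3 missing values in total, BBB has 1
toy = pd.DataFrame({
    "country": ["AAA", "AAA", "BBB", "BBB"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [1.0, np.nan, 2.0, np.nan],
    "gni": [np.nan, np.nan, 3.0, 4.0],
})

max_missing = 2  # hypothetical threshold for this toy example

# total missing values of the row's country, repeated on every row of that country
country_missing = toy.isnull().sum(axis=1).groupby(toy["country"]).transform("sum")

# keep rows whose country stays below the threshold
toy_filtered = toy[country_missing < max_missing]
print(toy_filtered)
```

The strict `<` comparison mirrors the `val<90` condition used in the cell above.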


### 3.6 Checking the features (columns) for missing values

In [None]:
all_vars_clean.isnull().sum()

country                  0
year                     0
co2_kt                   0
co2_per_gdp             20
cereal_yield            15
fdi_perc_gdp            20
elec_access_perc         0
en_per_cap             234
other_ghg_ttl          144
ch4_ttl                  0
n2o_ttl                  0
urb_pop_growth_perc      0
pop_urb_aggl_perc       60
pop_growth_perc          0
prot_area_perc         766
gdp                      0
gni_per_cap              2
pop                      0
urb_pop                  0
agr_land                 0
fsfl_cons              221
co2_per_cap              0
co2_trans              240
dtype: int64

After the countries with the highest number of missing values were filtered out, some features (prot_area_perc, en_per_cap, other_ghg_ttl, fsfl_cons, co2_trans, pop_urb_aggl_perc) still exhibit a considerable number of missing values. Dropping every row with a missing value in one of these features would significantly reduce the number of available observations. The next step therefore removes these columns from the dataset instead.

In [None]:
# remove features with more than 20 missing values

# create a boolean mask of features with more than 20 missing values
vars_bad = all_vars_clean.isnull().sum() > 20

# drop the columns flagged by the mask
all_vars_clean2 = all_vars_clean.drop(columns=vars_bad[vars_bad].index)

print("Remaining missing values per column:")
print(all_vars_clean2.isnull().sum())

Remaining missing values per column:
country                 0
year                    0
co2_kt                  0
co2_per_gdp            20
cereal_yield           15
fdi_perc_gdp           20
elec_access_perc        0
ch4_ttl                 0
n2o_ttl                 0
urb_pop_growth_perc     0
pop_growth_perc         0
gdp                     0
gni_per_cap             2
pop                     0
urb_pop                 0
agr_land                0
co2_per_cap             0
dtype: int64
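
A column filter of this kind can also be written with the `thresh` parameter of `DataFrame.dropna`, which states the minimum number of non-missing values a column needs in order to be kept. This is a sketch on hypothetical data; `toy` and `max_missing = 1` are made up (the notebook allows up to 20 missing values per feature).

```python
import numpy as np
import pandas as pd

# hypothetical frame with 0, 1 and 3 missing values per column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

max_missing = 1  # hypothetical limit for this toy example

# keep columns with at least len(toy) - max_missing non-missing values,
# i.e. drop columns with more than max_missing missing values
toy_kept = toy.dropna(axis="columns", thresh=len(toy) - max_missing)
print(list(toy_kept.columns))
```

A column with exactly `max_missing` missing values is kept, which matches the "more than" condition used in the notebook.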


In [None]:
# delete rows with any number of missing values
all_vars_clean3 = all_vars_clean2.dropna(axis='rows', how='any')

print("Remaining missing values per column:")
print(all_vars_clean3.isnull().sum())

print("Final shape of the cleaned dataset:")
print(all_vars_clean3.shape)

Remaining missing values per column:
country                0
year                   0
co2_kt                 0
co2_per_gdp            0
cereal_yield           0
fdi_perc_gdp           0
elec_access_perc       0
ch4_ttl                0
n2o_ttl                0
urb_pop_growth_perc    0
pop_growth_perc        0
gdp                    0
gni_per_cap            0
pop                    0
urb_pop                0
agr_land               0
co2_per_cap            0
dtype: int64
Final shape of the cleaned dataset:
(923, 17)


# **4. Export of the cleaned data frame to a file**

Now that the dataset has been rearranged and cleaned of missing values, it can be exported to a CSV file (without the row index) for further analysis:

In [None]:
# export the clean dataframe to a csv file
all_vars_clean3.to_csv('data_cleaned.csv', index=False)
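
A quick way to check an export like this is to read the file back and compare it with the frame in memory. The sketch below does this with an in-memory text buffer instead of a file on disk; the frame `toy` is hypothetical and stands in for the cleaned dataset.

```python
import io

import pandas as pd

# hypothetical stand-in for the cleaned frame
toy = pd.DataFrame({
    "country": ["AAA", "BBB"],
    "year": [2000, 2001],
    "co2_kt": [1.5, 2.5],
})

# write to an in-memory buffer without the row index, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# the round trip should preserve the shape and introduce no missing values
assert reloaded.shape == toy.shape
assert not reloaded.isnull().values.any()
print(reloaded.columns.tolist())
```

Because the index is not written, the reloaded frame has exactly the same columns as the exported one.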

# **Conclusion**

Finally, after cleaning the data by removing the countries, features, and rows with missing values, we have a cleaned dataset on which we will perform our data visualizations to get a better understanding of the features.