#  PMLMDM Final Project: The Causal Effects of Economic Development on Human Rights, Democracy and the Rule of Law 
__Mehmet Atakan Çavuşlu - MMA191<br>
Tor Anders Høksås - MMA191__
***

# Introduction

Whether economic development is tied to human rights and democracy, and whether there is any causal link between them at the country level, is a long-running debate. Economically high-performing countries often also enjoy more freedom, and human rights tend to be an important aspect of their policies. There are also counter-arguments on the same issue.

In fact, it has become a chicken-and-egg problem with no clear answer: does economic development bring democracy and human rights, or do democracy and an emphasis on human rights bring wealth and economic development?

This research aims to gain statistical insights into the relationship between economic development and the status of democracy in a country.

***

### HYPOTHESIS
__H1:__ There is a significant correlation between economic development and the status
of human rights, democracy, press freedom and the rule of law for a country<br>
__H0:__ There is not a significant correlation between economic development and the status of human rights, democracy, press freedom and the rule of law for a country
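
A minimal sketch of how such a hypothesis could be tested for a single pair of variables, assuming a Pearson correlation test and a significance level of 0.05. Neither choice is stated in this notebook, and the data below are synthetic stand-ins, not the collected indices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_countries = 150  # hypothetical sample size

# Synthetic stand-ins for an economic indicator and a composite index
economic = rng.normal(size=n_countries)
composite = 0.6 * economic + 0.8 * rng.normal(size=n_countries)

alpha = 0.05  # assumed significance level
r, p_value = stats.pearsonr(economic, composite)

# H0 (no correlation) is rejected when the p-value falls below alpha
reject_h0 = bool(p_value < alpha)
print(f"r = {r:.3f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

A significant correlation found this way would support H1, but it says nothing about the direction of causality.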

In [46]:
# Import basic statistical and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import wbdata
import datetime
from sklearn import linear_model
import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Data Collection
To assess _Democracy, Human Rights and The Rule of Law_ in a country, the following data are collected (referred to hereafter as _human data_):

- __Democracy Index 2018, The Economist Intelligence Unit__<br>
https://www.eiu.com/topic/democracy-index<br>
Index based on _Electoral Process and Pluralism, Functioning of Government, Political Participation, Political Culture, Civil Liberties_
- __World Press Freedom Index 2019, Reporters Without Borders__<br>
https://rsf.org/en/ranking<br>
Index based on _Pluralism, Media Independence, Environment and Self-censorship, Legislative Framework, Transparency, Infrastructure, Abuses_
- __Human Rights Index 2014, Our World In Data University of Oxford__<br>
https://ourworldindata.org/human-rights<br>
Index based on _Human Rights Protection from Political Oppression, Human Rights Violations_
- __Rule of Law Index 2019, World Justice Project__<br>
https://worldjusticeproject.org/our-work/research-and-data/wjp-rule-law-index-2019<br>
Index based on _Accountability, Just Laws, Open Government, Accessible & Impartial Dispute Resolution_

The data on economic development are collected from the World Bank API (https://data.worldbank.org) and contain the following indicators (referred to hereafter as _economic data_):

- __'NY.GDP.PCAP.CD':__ 'GDP per capita'
- __'PA.NUS.PPP':__ 'PPP conversion factor'
- __'FP.CPI.TOTL.ZG':__ 'Inflation'
- __'SL.UEM.TOTL.ZS':__ 'Unemployment (%)'
- __'SP.RUR.TOTL.ZS':__ 'Rural Population (%)'
- __'SL.TLF.CACT.NE.ZS':__ 'Labor force participation rate (%)'

Variables such as GNI per capita, GDP per capita adjusted for PPP, and GDP per person employed are discarded due to high multicollinearity with the variables above.
***
### Collection of Human Data

In [2]:
# Read Rule of law index data from World Justice Project
wjp_data = pd.read_excel('wjp_rule_of_law_index.xlsx', sheet_name='WJP ROL Index 2019 Scores')
wjp_data = wjp_data[wjp_data['Country'] == 'WJP Rule of Law Index: Overall Score'].transpose()
wjp_data.columns = wjp_data.iloc[0]
wjp_data.reset_index(inplace=True)
wjp_data = wjp_data.reindex(wjp_data.index.drop(0))
wjp_data.reset_index(inplace=True)
wjp_data.drop(['level_0'], axis=1, inplace=True)
wjp_data.columns = ['Country', 'Overall Rule of Law Index']
wjp_data.set_index('Country', inplace=True)
wjp_data.head()

Country,Overall Rule of Law Index
Afghanistan,0.34764
Albania,0.506076
Algeria,0.505867
Angola,0.41263
Antigua and Barbuda,0.627957


In [3]:
# Read Press Freedom index data from Reporters Without Borders
press_freedom_data = pd.read_csv('press_freedom.csv', decimal=",")
press_freedom_data = press_freedom_data[["EN_country","Score 2019"]]
press_freedom_data.columns = ['Country', 'Press Freedom Score']
press_freedom_data.set_index('Country', inplace=True)
press_freedom_data.head()

Country,Press Freedom Score
Norway,7.82
Finland,7.9
Sweden,8.31
Netherlands,8.63
Denmark,9.87


In [4]:
# Read Human Rights Protection index data from Our World in Data, provided by Oxford University
human_rights_data = pd.read_csv('human_rights_protection.csv')
human_rights_data = human_rights_data[human_rights_data['Year'] == 2014]
human_rights_data = human_rights_data[["Entity","Human Rights Protection Scores – by Christopher Farris and Keith Schnakenberg"]]
human_rights_data.reset_index(inplace=True)
human_rights_data.drop(['index'], axis=1, inplace=True)
human_rights_data.columns = ['Country', 'Human Rights Protection Score']
human_rights_data.set_index('Country', inplace=True)
human_rights_data.head()

Country,Human Rights Protection Score
Afghanistan,-1.2816
Albania,0.365031
Algeria,0.01559
Andorra,3.26585
Angola,-0.603419


In [5]:
# Read Human Rights Violations index data from Our World in Data, provided by Oxford University
human_rights_violation = pd.read_csv('human_rights_violations.csv')
human_rights_violation = human_rights_violation[human_rights_violation['Year'] == 2014]
human_rights_violation = human_rights_violation[['Entity', 'Unnamed: 3']]
human_rights_violation.reset_index(inplace=True)
human_rights_violation.drop(['index'], axis=1, inplace=True)
human_rights_violation.columns = ['Country', 'Human Rights Violations']
human_rights_violation.set_index('Country', inplace=True)
human_rights_violation.head()

Country,Human Rights Violations
Afghanistan,8.3
Albania,5.5
Algeria,7.4
Angola,7.0
Antigua and Barbuda,4.7


In [6]:
# Read Democracy index data from The Economist Intelligence Unit
democracy_data = pd.read_excel('democracy_index.xlsx')
democracy_data = democracy_data[['Countries','Overall score ']]
democracy_data.reset_index(inplace=True)
democracy_data.dropna(inplace=True)
democracy_data.drop(['index'], axis=1, inplace=True)
democracy_data.columns = ['Country', 'Democracy Index']
democracy_data.set_index('Country', inplace=True)
democracy_data.head()

Country,Democracy Index
Afghanistan,2.97
Albania,5.98
Algeria,3.5
Angola,3.62
Argentina,7.02


After reading all the indices used to assess _Democracy, Human Rights and The Rule of Law_ at the country level, a new dataframe called __human_data__ is created containing all of them.

In [7]:
human_data = pd.concat([wjp_data, press_freedom_data, human_rights_data, human_rights_violation, democracy_data], axis=1, sort=True)
human_data['Overall Rule of Law Index'] = pd.to_numeric(human_data['Overall Rule of Law Index'])
human_data.head()

Unnamed: 0,Overall Rule of Law Index,Press Freedom Score,Human Rights Protection Score,Human Rights Violations,Democracy Index
Afghanistan,0.34764,36.55,-1.2816,8.3,2.97
Albania,0.506076,29.84,0.365031,5.5,5.98
Algeria,0.505867,45.75,0.01559,7.4,3.5
Andorra,,24.63,3.26585,,
Angola,0.41263,34.96,-0.603419,7.0,3.62


### Basic EDA and Data Preprocessing on Human Data

In [8]:
# Data count for each column
human_data.count()

Overall Rule of Law Index        126
Press Freedom Score              180
Human Rights Protection Score    196
Human Rights Violations          179
Democracy Index                  167
dtype: int64

In [9]:
# Total Missing data per column
human_data.isnull().sum()

Overall Rule of Law Index        151
Press Freedom Score               97
Human Rights Protection Score     81
Human Rights Violations           98
Democracy Index                  110
dtype: int64

#### Handling Missing Data on Human Data
Many countries in the table have missing data. Omitting every row with a NaN value would significantly limit the model's implications, since it leaves only a small amount of data available.

Popular and simpler data imputation methods such as mean/median would not work in our case, because much of the missing data comes from edge countries, whose values are expected to lie not around the average but at the extremes of the scale.

To overcome this issue, __KNN imputation__ is used. The scikit-learn documentation describes the method as follows:

_Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close. By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors._

This should prove useful, since countries that are similar in the observed variables are expected to perform similarly in the missing ones.

To improve the imputations, rows with too many NaN values are omitted before KNN imputing, since so many NaNs make it hard to impute anything meaningful for that country. In the code below, `dropna(thresh=2)` keeps only the rows with at least 2 non-missing values, so rows with more than 3 NaN values out of 5 are dropped.
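
The row filter and the imputation step can be sketched on a toy table. This is a minimal illustration with made-up values and hypothetical column names (`index_a`, `index_b`, `index_c`), assuming the same `thresh=2` and `n_neighbors=2` settings as the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: rows A-D have at least 2 observed values, row E has none
toy = pd.DataFrame(
    {
        "index_a": [0.30, 0.35, np.nan, 0.80, np.nan],
        "index_b": [2.0, np.nan, 2.4, 8.0, np.nan],
        "index_c": [1.0, 1.2, 1.1, np.nan, np.nan],
    },
    index=["A", "B", "C", "D", "E"],
)

# thresh counts NON-missing values: keep rows with at least 2 of them
kept = toy.dropna(thresh=2)

# Fill each gap with the mean of the 2 nearest rows that have that value
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(kept), columns=kept.columns, index=kept.index
)
print(filled)
```

Note that `thresh` is a minimum number of observed values, not a maximum number of NaNs, so the cut-off depends on how many columns the table has.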

In [10]:
# Omit rows with fewer than 2 non-NaN values (i.e. more than 3 NaNs out of 5 columns)
human_data_omitted = human_data.dropna(thresh=2)

In [11]:
# Impute NaN with KNNImputer
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
human_data_filled_array = imputer.fit_transform(human_data_omitted)
human_data_filled = pd.DataFrame(human_data_filled_array)
human_data_filled.columns = ['Overall Rule of Law Index', 'Press Freedom Score', 'Human Rights Protection Score', 'Human Rights Violations', 'Democracy Index']
human_data_filled.index = human_data_omitted.index
human_data_filled.isnull().sum()

Overall Rule of Law Index        0
Press Freedom Score              0
Human Rights Protection Score    0
Human Rights Violations          0
Democracy Index                  0
dtype: int64

In [12]:
human_data_filled.count()

Overall Rule of Law Index        186
Press Freedom Score              186
Human Rights Protection Score    186
Human Rights Violations          186
Democracy Index                  186
dtype: int64

In [13]:
human_data_filled.corr()

Unnamed: 0,Overall Rule of Law Index,Press Freedom Score,Human Rights Protection Score,Human Rights Violations,Democracy Index
Overall Rule of Law Index,1.0,-0.613701,0.784069,-0.832201,0.635293
Press Freedom Score,-0.613701,1.0,-0.669364,0.760259,-0.710291
Human Rights Protection Score,0.784069,-0.669364,1.0,-0.792795,0.589848
Human Rights Violations,-0.832201,0.760259,-0.792795,1.0,-0.802528
Democracy Index,0.635293,-0.710291,0.589848,-0.802528,1.0


In [14]:
human_data_omitted.corr()

Unnamed: 0,Overall Rule of Law Index,Press Freedom Score,Human Rights Protection Score,Human Rights Violations,Democracy Index
Overall Rule of Law Index,1.0,-0.644426,0.830458,-0.841896,0.518674
Press Freedom Score,-0.644426,1.0,-0.66524,0.765425,-0.58941
Human Rights Protection Score,0.830458,-0.66524,1.0,-0.824914,0.33012
Human Rights Violations,-0.841896,0.765425,-0.824914,1.0,-0.737144
Democracy Index,0.518674,-0.58941,0.33012,-0.737144,1.0


Most correlations between variables do not change much after handling NaN values with the KNN method, although those involving the Democracy Index shift noticeably (for example, its correlation with the Human Rights Protection Score rises from 0.33 to 0.59). While the imputation is not expected to change the overall results, further checks on the data before and after imputation are needed to confirm its validity.

This is left as future work; for now, the filled data is accepted as satisfactory for use in the following models.
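
One possible form of such a check, sketched here under stated assumptions: hide some values that are actually known, impute them, and compare the imputed values with the hidden truth. The data are synthetic (five correlated columns driven by one latent factor), the 15% masking rate is arbitrary, and mean imputation is used only as a baseline for comparison.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)
n_rows, n_cols = 200, 5

# Synthetic complete data: correlated columns sharing one latent factor
latent = rng.normal(size=(n_rows, 1))
complete = latent + 0.5 * rng.normal(size=(n_rows, n_cols))

# Hide roughly 15% of the known values, never a whole row
mask = rng.random(complete.shape) < 0.15
mask[mask.all(axis=1), 0] = False
with_gaps = complete.copy()
with_gaps[mask] = np.nan

# Impute with KNN and with column means as a baseline
knn_filled = KNNImputer(n_neighbors=2).fit_transform(with_gaps)
mean_filled = np.where(mask, np.nanmean(with_gaps, axis=0), with_gaps)


def rmse(estimate):
    """Root mean squared error on the hidden entries only."""
    return float(np.sqrt(np.mean((estimate[mask] - complete[mask]) ** 2)))


rmse_knn, rmse_mean = rmse(knn_filled), rmse(mean_filled)
print(f"RMSE KNN: {rmse_knn:.3f}, RMSE mean: {rmse_mean:.3f}")
```

On the real indices, the same idea would be applied to the rows where all five values are observed.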

From the tables above, there appears to be high collinearity between some variables. Although that is expected, it can affect the model inferences if handled badly. Checking the VIF after adding the outcome variable might also be a good idea.

### Collection of Economic Data

In [15]:
# Set indicators and date to read from WorldBank API
indicators = {
    'NY.GDP.PCAP.CD':'GDP per capita',
    'PA.NUS.PPP':'PPP conversion factor',
    'FP.CPI.TOTL.ZG':'Inflation',
    'SL.UEM.TOTL.ZS':'Unemployment (%)',
    'SP.RUR.TOTL.ZS':'Rural Population (%)',
    'SL.TLF.CACT.NE.ZS':'Labor force participation rate (%)',
            }
data_date = datetime.datetime(2017, 1, 1)

In [16]:
# Get the economic data as pandas dataframe from WB API, with given indicators and date
economic_data = wbdata.get_dataframe(indicators, data_date=data_date)

### Basic EDA and Data Preprocessing on Economic Data

In [17]:
economic_data.count()

GDP per capita                        247
PPP conversion factor                 196
Inflation                             221
Unemployment (%)                      233
Rural Population (%)                  260
Labor force participation rate (%)    130
dtype: int64

In [18]:
# Total Missing data per column
economic_data.isnull().sum()

GDP per capita                         17
PPP conversion factor                  68
Inflation                              43
Unemployment (%)                       31
Rural Population (%)                    4
Labor force participation rate (%)    134
dtype: int64

Most of the columns have a low number of NaNs, the only exception being _Labor force participation rate_. The same KNN imputation can be applied here, since countries are expected to behave similarly economically if all their other indicators are similar.

Similarly, before KNN imputing, rows with too many NaN values are omitted: `dropna(thresh=3)` keeps the rows with at least 3 non-missing values, so rows with more than 3 NaN values out of 6 are dropped.

In [19]:
# Omit rows with fewer than 3 non-NaN values (i.e. more than 3 NaNs out of 6 columns)
economic_data_omitted = economic_data.dropna(thresh=3)
# Impute NaN with KNNImputer
economic_data_filled_array = imputer.fit_transform(economic_data_omitted)
economic_data_filled = pd.DataFrame(economic_data_filled_array)
economic_data_filled.columns = ['GDP per capita', 'PPP conversion factor', 'Inflation', 'Unemployment (%)', 'Rural Population (%)', 'Labor force participation rate (%)']
economic_data_filled.index = economic_data_omitted.index
economic_data_filled.isnull().sum()

GDP per capita                        0
PPP conversion factor                 0
Inflation                             0
Unemployment (%)                      0
Rural Population (%)                  0
Labor force participation rate (%)    0
dtype: int64

In [20]:
economic_data_filled.count()

GDP per capita                        244
PPP conversion factor                 244
Inflation                             244
Unemployment (%)                      244
Rural Population (%)                  244
Labor force participation rate (%)    244
dtype: int64

In [21]:
economic_data_filled.corr()

Unnamed: 0,GDP per capita,PPP conversion factor,Inflation,Unemployment (%),Rural Population (%),Labor force participation rate (%)
GDP per capita,1.0,-0.136329,-0.062622,-0.118976,-0.623888,0.347456
PPP conversion factor,-0.136329,1.0,0.0891,-0.022497,0.083802,-0.041076
Inflation,-0.062622,0.0891,1.0,0.034752,0.061299,-0.01371
Unemployment (%),-0.118976,-0.022497,0.034752,1.0,-0.076626,-0.178043
Rural Population (%),-0.623888,0.083802,0.061299,-0.076626,1.0,-0.309196
Labor force participation rate (%),0.347456,-0.041076,-0.01371,-0.178043,-0.309196,1.0


In [22]:
economic_data_omitted.corr()

Unnamed: 0,GDP per capita,PPP conversion factor,Inflation,Unemployment (%),Rural Population (%),Labor force participation rate (%)
GDP per capita,1.0,-0.136904,-0.296855,-0.124348,-0.619396,0.364914
PPP conversion factor,-0.136904,1.0,0.082275,-0.033872,0.08575,-0.02913
Inflation,-0.296855,0.082275,1.0,0.07881,0.201101,-0.082152
Unemployment (%),-0.124348,-0.033872,0.07881,1.0,-0.092445,-0.344268
Rural Population (%),-0.619396,0.08575,0.201101,-0.092445,1.0,-0.410876
Labor force participation rate (%),0.364914,-0.02913,-0.082152,-0.344268,-0.410876,1.0


Similarly, most correlations are not skewed much; the largest shift is between GDP per capita and Inflation (from -0.30 to -0.06). For this research, the side effects of imputing NaNs are accepted.

## Combined table of Human Data and Economic Data
A final dataframe to be used in the models is created by combining the imputed human data and the imputed economic data.

In [23]:
combined_data = pd.concat([human_data_filled, economic_data_filled], axis=1, sort=True)
combined_data.head()

Unnamed: 0,Overall Rule of Law Index,Press Freedom Score,Human Rights Protection Score,Human Rights Violations,Democracy Index,GDP per capita,PPP conversion factor,Inflation,Unemployment (%),Rural Population (%),Labor force participation rate (%)
Afghanistan,0.34764,36.55,-1.2816,8.3,2.97,556.302139,17.205558,4.975952,11.184,74.75,47.305
Albania,0.506076,29.84,0.365031,5.5,5.98,4532.890162,41.231113,1.986661,13.75,40.617,58.299999
Algeria,0.505867,45.75,0.01559,7.4,3.5,4044.298372,38.855751,5.591116,11.996,27.948,36.91
Andorra,0.658829,24.63,3.26585,4.9,6.74,,,,,,
Angola,0.41263,34.96,-0.603419,7.0,3.62,4095.812942,92.951721,31.691686,7.119,35.161,45.450701


In [24]:
combined_data.count()

Overall Rule of Law Index             186
Press Freedom Score                   186
Human Rights Protection Score         186
Human Rights Violations               186
Democracy Index                       186
GDP per capita                        244
PPP conversion factor                 244
Inflation                             244
Unemployment (%)                      244
Rural Population (%)                  244
Labor force participation rate (%)    244
dtype: int64

At this point, imputing the missing variables would do more harm than good, since the NaNs follow a pattern: whole blocks of columns are missing in some rows. The model should do well with the countries that remain, so it is decided to omit the rows with NaNs here.

In [25]:
combined_data_omitted = combined_data.dropna()

In [26]:
combined_data_omitted.count()

Overall Rule of Law Index             159
Press Freedom Score                   159
Human Rights Protection Score         159
Human Rights Violations               159
Democracy Index                       159
GDP per capita                        159
PPP conversion factor                 159
Inflation                             159
Unemployment (%)                      159
Rural Population (%)                  159
Labor force participation rate (%)    159
dtype: int64

### Create a Composite Index
Right now, the outcome of human data consists of 5 variables. To be able to use our models, we need a single outcome (dependent) variable alongside several independent variables.

To achieve this, a so-called __composite index__ is created by combining these 5 human data variables into one final outcome variable.

To prevent imbalance caused by variables with large values, all relevant variables are rescaled to the [0, 1] range with min-max scaling (`minmax_scaling` from mlxtend) beforehand, and the sign and interpretation of the specific indices are taken into account.

After these manipulations, the _composite index_ is calculated as the simple mean of these 5 human data variables. Other methods with different weights could be tried given relevant supporting information, but in this research we go with the simple mean.

In [29]:
import warnings
warnings.filterwarnings('ignore')
# Scale human data indexes to [0, 1] to prevent high values from skewing the mean
## pip install mlxtend
from mlxtend.preprocessing import minmax_scaling
# minmax_scaling returns a dataframe containing only the listed columns
human_data_scaled = minmax_scaling(combined_data_omitted, columns=[
    'Overall Rule of Law Index',
    'Press Freedom Score',
    'Human Rights Protection Score',
    'Human Rights Violations',
    'Democracy Index'
])
# Press Freedom Score runs in the opposite direction (higher = less free), so its sign is flipped
human_data_scaled['Press Freedom Score'] = -human_data_scaled['Press Freedom Score']
# Composite index: row-wise mean of the 5 scaled human data variables
human_data_scaled['Composite Index'] = human_data_scaled.mean(axis=1)
combined_data_omitted['Composite Index'] = human_data_scaled['Composite Index']
combined_data_omitted.head()

Unnamed: 0,Overall Rule of Law Index,Press Freedom Score,Human Rights Protection Score,Human Rights Violations,Democracy Index,GDP per capita,PPP conversion factor,Inflation,Unemployment (%),Rural Population (%),Labor force participation rate (%),Composite Index
Afghanistan,0.34764,36.55,-1.2816,8.3,2.97,556.302139,17.205558,4.975952,11.184,74.75,47.305,0.137464
Albania,0.506076,29.84,0.365031,5.5,5.98,4532.890162,41.231113,1.986661,13.75,40.617,58.299999,0.314657
Algeria,0.505867,45.75,0.01559,7.4,3.5,4044.298372,38.855751,5.591116,11.996,27.948,36.91,0.210869
Angola,0.41263,34.96,-0.603419,7.0,3.62,4095.812942,92.951721,31.691686,7.119,35.161,45.450701,0.181526
Antigua and Barbuda,0.627957,27.385,1.12972,4.7,6.06,15383.415188,2.093501,2.432488,6.3075,75.287,61.66415,0.373186
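
The correlation tables above suggest that Human Rights Violations, like Press Freedom Score, runs in the opposite direction to the other indices (higher values are worse). Below is a minimal sketch of one alternative way to align directions before averaging, assuming each inverted index is mapped to `1 - x` after min-max scaling so that every component stays in [0, 1]. The country labels, the values and the `higher_is_worse` list are made up for illustration; this is not the computation behind the results in this notebook.

```python
import pandas as pd

# Made-up values for three hypothetical countries
toy = pd.DataFrame(
    {
        "rule_of_law": [0.35, 0.51, 0.90],      # higher = better
        "press_freedom": [36.55, 29.84, 7.82],  # higher = worse
        "violations": [8.3, 5.5, 1.0],          # higher = worse
    },
    index=["Country A", "Country B", "Country C"],
)
higher_is_worse = ["press_freedom", "violations"]

# Min-max scale every column to [0, 1]
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Flip inverted indices so that 1 always means "best"
scaled[higher_is_worse] = 1.0 - scaled[higher_is_worse]

# Unweighted mean across the aligned components
composite = scaled.mean(axis=1)
print(composite)
```

With this orientation the composite index itself stays in [0, 1], which makes it easier to read than a mean over components with mixed signs.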


In [42]:
# Subsetting the dataframe with relevant features for models.
combined_data_subset = combined_data_omitted[['GDP per capita', 'PPP conversion factor', 'Inflation', 'Unemployment (%)', 'Rural Population (%)', 'Labor force participation rate (%)', 'Composite Index']]
combined_data_subset.columns = ['GDP_per_capita', 'PPP_conversion_factor', 'Inflation', 'Unemployment', 'Rural_population', 'Labor_force_participation_rate', 'Composite_index']
combined_data_subset.head(5)

Unnamed: 0,GDP_per_capita,PPP_conversion_factor,Inflation,Unemployment,Rural_population,Labor_force_participation_rate,Composite_index
Afghanistan,556.302139,17.205558,4.975952,11.184,74.75,47.305,0.137464
Albania,4532.890162,41.231113,1.986661,13.75,40.617,58.299999,0.314657
Algeria,4044.298372,38.855751,5.591116,11.996,27.948,36.91,0.210869
Angola,4095.812942,92.951721,31.691686,7.119,35.161,45.450701,0.181526
Antigua and Barbuda,15383.415188,2.093501,2.432488,6.3075,75.287,61.66415,0.373186


Now the correlations and multicollinearity, including the VIF, can be checked using the full combined data.

In [43]:
combined_data_subset.corr()

Unnamed: 0,GDP_per_capita,PPP_conversion_factor,Inflation,Unemployment,Rural_population,Labor_force_participation_rate,Composite_index
GDP_per_capita,1.0,-0.161471,-0.196181,-0.116402,-0.586754,0.346534,0.576105
PPP_conversion_factor,-0.161471,1.0,0.176185,-0.082598,0.159567,0.11129,-0.236385
Inflation,-0.196181,0.176185,1.0,0.030421,0.228833,0.043785,-0.256201
Unemployment,-0.116402,-0.082598,0.030421,1.0,-0.129352,-0.155563,0.105061
Rural_population,-0.586754,0.159567,0.228833,-0.129352,1.0,-0.354672,-0.318649
Labor_force_participation_rate,0.346534,0.11129,0.043785,-0.155563,-0.354672,1.0,0.166852
Composite_index,0.576105,-0.236385,-0.256201,0.105061,-0.318649,0.166852,1.0


At first glance, there seems to be no major problem. Let us also check the VIF:

In [50]:
%%capture
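# Gather features
# Note: the joined column list includes Composite_index itself, so the outcome
# also appears on the right-hand side and shows up as a row in the VIF table
features = "+".join(combined_data_subset.columns)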

# get y and X dataframes based on this regression:
y, X = dmatrices('Composite_index ~' + features, combined_data_subset, return_type='dataframe')

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

In [52]:
vif.round(1)

Unnamed: 0,VIF Factor,features
0,78.6,Intercept
1,2.3,GDP_per_capita
2,1.1,PPP_conversion_factor
3,1.1,Inflation
4,1.2,Unemployment
5,1.8,Rural_population
6,1.3,Labor_force_participation_rate
7,1.7,Composite_index


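As can be seen from the VIF table, all factors other than the intercept are well below the commonly used thresholds of 5 to 10, which means there is no sign of collinearity strong enough to affect our model performance. Note that the table also contains a row for Composite_index, because the outcome variable was included among the features in the formula.

There is no need to remove any feature because of collinearity in this case. This subset is therefore used in the models.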
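
For reference, the VIF of a predictor can be computed directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors plus an intercept. The sketch below does this for the predictors only, leaving the outcome out. The helper `vif_table` is hypothetical, the data are synthetic, and the calculation uses plain NumPy least squares rather than the statsmodels function used in the cell above.

```python
import numpy as np
import pandas as pd


def vif_table(predictors: pd.DataFrame) -> pd.Series:
    """VIF of each column, regressing it on the other columns plus an intercept."""
    values = {}
    for col in predictors.columns:
        target = predictors[col].to_numpy(dtype=float)
        others = predictors.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(predictors)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        values[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(values, name="VIF")


rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.3 * rng.normal(size=n),  # strongly collinear with x1
    "x3": rng.normal(size=n),             # independent of the others
})

vif = vif_table(demo)
print(vif.round(1))
```

In this toy example the two collinear columns get a high VIF while the independent one stays close to 1.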

## Exploratory Data Analysis (EDA) on the Combined Data

***

In [None]:
combined_data.corr()

In [None]:
# Use the NaN-free dataframe, which also holds the Composite Index column
X = combined_data_omitted[[
    'GDP per capita',
    'PPP conversion factor',
    'Inflation',
    'Unemployment (%)',
    'Rural Population (%)',
    'Labor force participation rate (%)'
]]
Y = combined_data_omitted['Composite Index']

In [None]:
regr = linear_model.LinearRegression()
regr.fit(X, Y)

In [None]:
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

Rural Population has a very low 't' value, so it is dropped and OLS is run again (feature selection by elimination).

In [None]:
# Use the NaN-free dataframe, which also holds the Composite Index column
X = combined_data_omitted[[
    'GDP per capita',
    'PPP conversion factor',
    'Inflation',
    'Unemployment (%)',
    'Labor force participation rate (%)'
]]
Y = combined_data_omitted['Composite Index']
regr = linear_model.LinearRegression()
regr.fit(X, Y)
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

Similarly, the feature with the lowest 't' value, Labor Force Participation Rate, is dropped.

In [None]:
# Use the NaN-free dataframe, which also holds the Composite Index column
X = combined_data_omitted[[
    'GDP per capita',
    'PPP conversion factor',
    'Inflation',
    'Unemployment (%)',
]]
Y = combined_data_omitted['Composite Index']
regr = linear_model.LinearRegression()
regr.fit(X, Y)
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

An increase in adjusted R-squared is seen. One more iteration follows, again dropping the feature with the lowest 't' value.

In [None]:
# Use the NaN-free dataframe, which also holds the Composite Index column
X = combined_data_omitted[[
    'GDP per capita',
    'Inflation',
    'Unemployment (%)',
]]
Y = combined_data_omitted['Composite Index']
regr = linear_model.LinearRegression()
regr.fit(X, Y)
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

Adjusted R-squared decreases this time, so the 3rd model will be used. It is applied again below:

In [None]:
# Use the NaN-free dataframe, which also holds the Composite Index column
X = combined_data_omitted[[
    'GDP per capita',
    'PPP conversion factor',
    'Inflation',
    'Unemployment (%)',
]]
Y = combined_data_omitted['Composite Index']
regr = linear_model.LinearRegression()
regr.fit(X, Y)
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

In [None]:
combined_data.head()

The coefficients of GDP and PPP conversion differ by several orders of magnitude. To see whether further optimization is possible, the GDP and PPP features will be scaled down.

In [None]:
combined_data['GDP per capita'].describe()

In [None]:
combined_data['PPP conversion factor'].describe()

In [None]:
combined_data['Inflation'].describe()

In [None]:
combined_data['Unemployment (%)'].describe()

The easiest way is to use min-max scaling to bring the features to the same level. Although this makes the raw numbers harder to interpret, it puts the coefficients on a similar scale.

In [None]:
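# Scale the NaN-free dataframe; minmax_scaling returns only the listed columns
combined_data_scaled = minmax_scaling(combined_data_omitted, columns=[
    'GDP per capita',
    'PPP conversion factor',
    'Inflation',
    'Unemployment (%)',
])
# Re-attach the (unscaled) outcome variable
combined_data_scaled['Composite Index'] = combined_data_omitted['Composite Index']
combined_data_scaled.head(10)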

In [None]:
X = combined_data_scaled[[
    'GDP per capita',
    'PPP conversion factor',
    'Inflation',
    'Unemployment (%)',
]]
Y = combined_data_scaled['Composite Index']
regr = linear_model.LinearRegression()
regr.fit(X, Y)
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

As expected, statistical values such as 't' and 'F' did not change. Only the coefficients are scaled to the same level.

The most prominent feature is 'GDP per capita'. Let's analyse its isolated relation with the composite index.

In [None]:
sns.regplot(x='GDP per capita', y='Composite Index', data=combined_data_scaled)

In [None]:
sns.regplot(x='PPP conversion factor', y='Composite Index', data=combined_data_scaled)

As seen in the graph, 'PPP conversion factor' is heavily skewed, so using it in a regression yields unsatisfactory results. It is eliminated.

In [None]:
X = combined_data_scaled[[
    'GDP per capita',
    'Inflation',
    'Unemployment (%)',
]]
Y = combined_data_scaled['Composite Index']
regr = linear_model.LinearRegression()
regr.fit(X, Y)
X = sm.add_constant(X)
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

In [None]:
sns.regplot(x='Unemployment (%)', y='Composite Index', data=combined_data_scaled)