# Towards A Happier Future 
### (and how Machine Learning can help)
-----------------------------------------------------------------------------------------------------------------------------
This train is reaching its destination. The previous parts showed that Machine Learning algorithms, and ensemble methods in particular, can produce highly accurate predictive models. It is commonly accepted that, although ML often offers greater predictive capacity than classical statistical techniques, it is weaker when it comes to interpretation and contribution to scientific theory. Moreover, even though ML is widely used in many scientific fields, economists remain skeptical about adding it to their toolbox. In his [paper](http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf), Hal Varian argues that ML can provide economists with useful insights that linear models are incapable of. In this part, a decision tree is used to provide answers that could lead to better public policy, and it is compared with a linear model. In particular, the following questions will be answered:

* How is happiness affected by income?
* How can we classify countries with respect to their happiness levels?

The methodology that will be used can be broken into steps:
* Filter 1: Feature selection using Randomized Lasso (stability selection)
* Filter 2: Recursive feature elimination with cross validation
* Filter 3: Feature selection by manually removing features that act as instruments (proxies) for the same underlying variable or are not considered general enough for the task at hand
* Modeling 1: Using selected features to train a linear regression model 
* Modeling 2: Using selected features to train a decision tree regression model

### Filter 1: Randomized Lasso

In [41]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RandomizedLasso
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

Importing data...

In [12]:
dfremake = pd.read_csv('C:\\Users\\nikos\\Desktop\\thesisdata\\2005to2014remake.csv',index_col=0)

In [13]:
dfremake.head()

Country,Life Ladder,Population ages 65 and above (% of total),Private credit bureau coverage (% of adults),"Improved water source, rural (% of rural population with access)",Time required to register property (days),"Mortality rate, infant (per 1,000 live births)",Documents to import (number),"Unemployment, total (% of total labor force)","Population, ages 0-14 (% of total)","Labor force participation rate, total (% of total population ages 15-64) (modeled ILO estimate)",...,Merchandise exports to high-income economies (% of total merchandise exports),"Merchandise exports by the reporting economy, residual (% of total merchandise exports)",Proportion of seats held by women in national parliaments (%),"GDP per capita, PPP (current international $)",Tax payments (number),Lifetime risk of maternal death (1 in: rate varies by country),Improved sanitation facilities (% of population with access),Procedures to build a warehouse (number),Merchandise exports (current US$),Year
Australia,7.340688,12.913187,100.0,100.0,4.5,4.8,7.0,5.0,19.790417,75.400002,...,66.267907,1.561053,24.7,32559.459287,13.0,7700.0,100.0,10.0,106097200000.0,2005
Belgium,7.26229,17.174057,0.0,100.0,132.0,4.1,4.0,8.4,17.023689,66.699997,...,91.054637,0.607397,34.7,33042.899284,11.0,7800.0,99.5,10.0,334400100000.0,2005
Canada,7.418048,13.107962,100.0,99.0,16.5,5.3,3.0,6.7,17.673543,77.699997,...,95.051256,0.032781,21.1,35973.491511,9.0,7400.0,99.8,11.0,360475200000.0,2005
Czech Republic,6.439257,13.986953,37.9,99.7,124.0,4.4,6.0,7.9,14.743135,70.400002,...,93.50639,0.172625,17.0,22286.45719,27.0,11600.0,99.1,21.0,78110300000.0,2005
Denmark,8.018934,15.14824,7.7,100.0,42.0,4.1,3.0,4.8,18.733752,79.699997,...,93.062462,0.830054,36.9,34079.959762,10.0,6700.0,99.6,7.0,85120850000.0,2005


In [10]:
# A helper function for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    # Use "is None": "== None" compares element-wise when names is a NumPy array
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # Order terms by absolute coefficient size, largest first
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

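Performing stability selection with Randomized Lasso (information can be found [here](https://stat.ethz.ch/~nicolai/stability.pdf)). Note that `RandomizedLasso` was deprecated in scikit-learn 0.19 and removed in 0.21, so this cell requires an older release.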

In [23]:
X=dfremake.drop("Life Ladder",axis=1)
X = StandardScaler().fit_transform(X)
y= dfremake["Life Ladder"]
model = RandomizedLasso(alpha=0.0008,random_state=2,sample_fraction=.5)
model.fit(X,y)

RandomizedLasso(alpha=0.0008, eps=2.2204460492503131e-16, fit_intercept=True,
        max_iter=500, memory=None, n_jobs=1, n_resampling=200,
        normalize=True, pre_dispatch='3*n_jobs', precompute='auto',
        random_state=2, sample_fraction=0.5, scaling=0.5,
        selection_threshold=0.25, verbose=False)
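
For readers on a recent scikit-learn release, the idea behind stability selection can be approximated with a plain `Lasso` fitted on many random subsamples. The block below is only a simplified sketch under stated assumptions: the function name `stability_scores`, the synthetic data and the value `alpha=0.1` are hypothetical, and the random column reweighting is a rough stand-in for the randomized penalty, not a reproduction of the removed `RandomizedLasso` class.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stability_scores(X, y, alpha=0.1, n_resampling=200,
                     sample_fraction=0.5, scaling=0.5, seed=0):
    """Fraction of resamples in which each feature gets a non-zero Lasso coefficient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # Draw a random subsample of the rows, without replacement
        rows = rng.choice(n_samples, size=n_sub, replace=False)
        # Randomly shrink some columns, which penalises those features more heavily
        weights = rng.choice([scaling, 1.0], size=n_features)
        lasso = Lasso(alpha=alpha, max_iter=10_000)
        lasso.fit(X[rows] * weights, y[rows])
        counts += (lasso.coef_ != 0)
    return counts / n_resampling


# Hypothetical data: only the first two of six features drive the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

scores = stability_scores(X_demo, y_demo, n_resampling=50)
selected = scores >= 0.25  # same threshold as selection_threshold above
print(scores, selected)
```

Features that are selected in a large share of the resamples are considered stable; the 0.25 cut-off mirrors the `selection_threshold` shown in the output above.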

The following features were chosen

In [24]:
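# Boolean mask of the features kept by stability selection
filterdcols = model.get_support()
# Prepend True so that the target column ("Life Ladder", the first column) is kept as well;
# np.concatenate is used because Series.append was removed in pandas 2.0
filterdcols = np.concatenate(([True], filterdcols))
frame = dfremake[dfremake.columns[filterdcols]]
list(frame.columns)[1:]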

['Population ages 65 and above (% of total)',
 'Private credit bureau coverage (% of adults)',
 'Improved water source, rural (% of rural population with access)',
 'Time required to register property (days)',
 'Mortality rate, infant (per 1,000 live births)',
 'Documents to import (number)',
 'Unemployment, total (% of total labor force)',
 'Foreign direct investment, net inflows (BoP, current US$)',
 'GDP growth (annual %)',
 'Time required to start a business (days)',
 'Rural population (% of total population)',
 'Merchandise exports to high-income economies (% of total merchandise exports)',
 'Proportion of seats held by women in national parliaments (%)',
 'GDP per capita, PPP (current international $)',
 'Lifetime risk of maternal death (1 in: rate varies by country)',
 'Improved sanitation facilities (% of population with access)',
 'Procedures to build a warehouse (number)',
 'Merchandise exports (current US$)',
 'Year']

### Filter 2: Recursive Feature Elimination with Cross Validation

In [25]:
y = frame.iloc[:,0]
X = frame.drop(["Life Ladder"],axis=1)
reg = LinearRegression()
rfecv = RFECV(estimator=reg)
rfecv.fit(X, y)
newdf = frame.drop(X.columns[rfecv.ranking_>1],axis=1)

In [27]:
#All features were ranked as 1 except Merchandise exports 
rfecv.ranking_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1])

### Filter 3: Manual Selection

In [29]:
dfremake = pd.read_csv('C:\\Users\\nikos\\Desktop\\thesisdata\\2005to2014remake2.csv',index_col=0)

In [31]:
list(dfremake.columns)

['Life Ladder',
 'Merchandise exports to high-income economies (% of total merchandise exports)',
 'Population ages 65 and above (% of total)',
 'Unemployment, total (% of total labor force)',
 'Proportion of seats held by women in national parliaments (%)',
 'GDP per capita, PPP (current international $)',
 'GDP growth (annual %)',
 'Time required to register property (days)',
 'Improved sanitation facilities (% of population with access)',
 'Rural population (% of total population)',
 'Year',
 'Documents to import (number)']

### Modeling 1: Linear regression

In [39]:
y = dfremake.iloc[:,0]
X = dfremake.drop(["Life Ladder"],axis=1)
X = StandardScaler().fit_transform(X)
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,random_state=1)
model = LinearRegression()
print("R^2 = "+str(r2_score(ytest,model.fit(Xtrain,ytrain).predict(Xtest))))

R^2 = 0.683004593558


The model:

In [36]:
print("Linear model:", pretty_print_linear(model.coef_,names =dfremake.drop(["Life Ladder"],axis=1).columns.values,sort=True))

Linear model: 0.39 * Improved sanitation facilities (% of population with access) + -0.266 * Rural population (% of total population) + 0.258 * GDP per capita, PPP (current international $) + 0.224 * Proportion of seats held by women in national parliaments (%) + -0.214 * Unemployment, total (% of total labor force) + -0.174 * Documents to import (number) + 0.146 * Merchandise exports to high-income economies (% of total merchandise exports) + -0.118 * Population ages 65 and above (% of total) + -0.047 * Year + -0.047 * Time required to register property (days) + 0.012 * GDP growth (annual %)


In [37]:
#the intercept of the model
model.intercept_

5.4642888046032718

The regressors are listed in order of importance, measured by the absolute value of their coefficients. The intuition behind this ordering is that, since all features were standardized before training, the coefficients are on a common scale: a feature with a larger partial coefficient (in absolute value) has a stronger linear association with happiness, holding the other features fixed, than one with a smaller coefficient.
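
To make the effect of standardization on the coefficients concrete, here is a small worked sketch in plain Python. The numbers are hypothetical and chosen only for illustration. It shows that the slope on a standardized feature equals the raw slope multiplied by the feature's standard deviation, so it measures the change in the target per one standard deviation of the feature rather than per raw unit.

```python
from statistics import mean, pstdev


def ols_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data lying exactly on the line: score = 3.5 + 0.0001 * income
income = [5000.0, 10000.0, 15000.0, 20000.0, 25000.0]
score = [4.0, 4.5, 5.0, 5.5, 6.0]

# Standardize with the population standard deviation, as StandardScaler does
sd = pstdev(income)
z = [(v - mean(income)) / sd for v in income]

raw_slope = ols_slope(income, score)  # change in score per dollar
std_slope = ols_slope(z, score)       # change in score per standard deviation

print(raw_slope, std_slope, raw_slope * sd)
```

Read this way, the 0.258 coefficient on GDP per capita in the model above refers to a one-standard-deviation change in GDP per capita, not to one dollar.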

### Modeling 2: Decision Tree

In [48]:
X=dfremake.drop("Life Ladder",axis=1)
y=dfremake["Life Ladder"]
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,random_state=0)
model = DecisionTreeRegressor(max_depth=3,min_samples_split=50,min_samples_leaf=10)
print("R^2 = "+str(r2_score(ytest,model.fit(Xtrain,ytrain).predict(Xtest))))

R^2 = 0.745608164115


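The R^2 rises from 0.68 to 0.75, although the two models were scored on different random train/test splits (`random_state=1` versus `random_state=0`), so the gap should be read with some caution. More important than the gain in predictive power are the insights we gain from the tree's visualization.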

![](https://raw.githubusercontent.com/nikosga/Thesis_Project/master/pics/treehappy.png)
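
The split thresholds of a fitted tree can also be read as text with scikit-learn's `export_text` (available since version 0.21). The sketch below uses synthetic data rather than the thesis dataset. The feature names, the 30,000 saturation point and the noise level are assumptions made for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical data: happiness rises with income up to 30,000 and is flat afterwards
rng = np.random.default_rng(0)
gdp = rng.uniform(1_000, 60_000, size=400)
women = rng.uniform(0, 50, size=400)
happy = (4.0 + 2.0 * np.minimum(gdp, 30_000) / 30_000
         + 0.02 * women + rng.normal(0, 0.1, size=400))

X_demo = np.column_stack([gdp, women])
tree = DecisionTreeRegressor(max_depth=3, min_samples_split=50,
                             min_samples_leaf=10, random_state=0)
tree.fit(X_demo, happy)

# One line per split or leaf, indented by depth
rules = export_text(tree, feature_names=["gdp_per_capita", "women_in_parliament"])
print(rules)
```

Each indented line is one split criterion, so the top-down reading used in the discussion below can be checked directly against the printed thresholds.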

### Discussion / Results

* Top-Down interpretation: The features that are used as criteria at the top of the tree are considered by the model to be more crucial and general than those in the lower levels of the tree.
* Color-tone interpretation: The brighter coloured nodes represent higher levels of the dependent variable (happiness)
* From previous chapters it was made clear that GDP per capita plays a major role in happiness. The linear model fails to pinpoint its exact impact, since it spreads the effect of GDP per capita evenly across all happiness levels. Because the features were standardized before fitting, the partial coefficient says that, other things being equal, a one-standard-deviation increase in GDP per capita is associated with an increase of about 0.26 units in a country's happiness on average. This is quite misleading, and the consequences have been significant, since GDP has been used as the main indicator of well-being. The decision tree tells a different story: it makes clear that GDP per capita is a useful criterion only for values under 30K $. The first branch separates countries into two categories with respect to their GDP per capita, and the variable is used again in the next branch. This shows the relevance of GDP per capita for countries with values under 30K. In contrast, countries above this threshold should focus their policy on other measures of development.
* Moving from left to right, we can classify the countries into the following categories according to the criteria they meet. The first category consists of countries that struggle with basic vitality issues: countries where people often do not reach the age of 65 because of poor health, hunger or war, and where access to clean water or food is lacking. They belong to the lowest happiness level. In the next level we find countries that face different types of problems. They have high levels of unemployment and, even though they do not face the same problems as the previous category, they are still considered unhappy. Next, we find countries that can be separated with respect to their exports, especially exports to high-income economies. Looking more closely at these countries, we find that those with a high share of exports to high-income economies mainly produce machinery, equipment and chemicals, while those with a lower share export coal, petroleum and oil. This variable can therefore be considered an instrument for a more highly educated workforce. The last category contains countries that have fulfilled the previous targets and focus on higher-level values such as respect, equality and justice, to name but a few. Additional income is unlikely to increase their happiness, so their policy should have different goals.
* While the first categories need to pay attention to increasing their GDP, this is not the case for the countries in the higher happiness categories. The share of seats held by women in national parliaments is used as an instrument to measure the differences between countries in the organization of their society and the quality of their values. Specifically, we can see that the increase in happiness from raising life expectancy in the lower levels is more or less the same as the increase in happiness from paying attention to the values of society in the higher levels of happiness.
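
The point about GDP per capita in the discussion above can be illustrated with a small sketch in plain Python. The data are hypothetical and built so that happiness rises with income up to 30 (in thousands of dollars) and stays flat afterwards. A single pooled slope averages the two regimes, while slopes fitted separately below and above the threshold recover the difference that a tree split captures.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: GDP per capita in thousands of dollars
gdp = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
happy = [4.0 + 0.08 * min(g, 30) for g in gdp]

threshold = 30  # assumed saturation point, echoing the 30K split in the tree
low = [(g, h) for g, h in zip(gdp, happy) if g <= threshold]
high = [(g, h) for g, h in zip(gdp, happy) if g > threshold]

pooled_slope = ols_slope(gdp, happy)
low_slope = ols_slope([g for g, _ in low], [h for _, h in low])
high_slope = ols_slope([g for g, _ in high], [h for _, h in high])

print(pooled_slope, low_slope, high_slope)
```

In this toy setting the pooled slope understates the effect of income for poorer countries and overstates it for richer ones, which is the sense in which a single linear coefficient can mislead.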

These results are extremely useful. By classifying countries into different happiness categories we can apply different policy measures to each one. Health and income are crucial at the lower levels, while education, values, equality and justice matter at the higher levels.

![](https://raw.githubusercontent.com/nikosga/Thesis_Project/master/pics/PYRAMIDHAPPY.png)

## The End
-----------------------------------------------------------------------------------------------------------------------------