# World Happiness Analysis

## Context & Content
The happiness scores and rankings are based on answers to the main life evaluation question asked in the poll. This question asks respondents to think of a ladder with the *best possible life for them being a 10* and the *worst possible life being a 0* and to rate their own current lives on that scale.

The columns following the happiness score estimate the extent to which each of **six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity** – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

## Problem Statement
Given the 2016 sample dataset, which lists each country's happiness score and the factors that contribute to it, how accurately can we predict each country's happiness score in 2017?

In [1]:
# Importing libraries for data analysis
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from math import sqrt

# Importing libraries for visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Importing the dataset
# 2016
dataset2016 = pd.read_csv('dataset/2016.csv')
X_2016 = dataset2016.iloc[:, :-1].values
y_2016 = dataset2016.iloc[:, 8].values

In [9]:
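# Encoding categorical data
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer one-hot encodes column 0 and passes the other columns through.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
column_transformer = ColumnTransformer(
    transformers = [('country', OneHotEncoder(), [0])],
    remainder = 'passthrough',
    sparse_threshold = 0  # always return a dense array
)
X_2016 = column_transformer.fit_transform(X_2016).astype(float)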

# Avoiding the Dummy Variable Trap
X_2016 = X_2016[:, 1:]
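
To illustrate the dummy variable trap handled above, here is a minimal sketch on toy data. The category names and numbers are made up for illustration and are not taken from the dataset. `OneHotEncoder(drop="first")` omits the first category's column, which has the same effect as slicing off column 0 after encoding.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): three categories and one numeric feature.
countries = np.array([["Alpha"], ["Beta"], ["Gamma"]])
numeric_feature = np.array([[1.2], [0.8], [0.3]])

# drop="first" leaves out the column for the first category ("Alpha"),
# so the remaining dummies are not perfectly collinear with an intercept.
encoder = OneHotEncoder(drop="first")
dummies = encoder.fit_transform(countries).toarray()

# Dummy columns first, then the numeric column.
X = np.hstack([dummies, numeric_feature])
print(X.shape)
```

Note that when every row is a distinct country, one dummy per country is enough to reproduce the target exactly, which may be worth keeping in mind when reading the regression summary further down.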

In [4]:
# Splitting the dataset into the Training set and Test set
X_2016_train, X_2016_test, y_2016_train, y_2016_test = train_test_split(X_2016, y_2016, test_size = 0.2, random_state = 0)

We will fit three different regression models and use the most accurate one as the final model for predicting the 2017 happiness scores.

#### Regression Models
- Multiple Linear Regression
- Decision Tree
- Artificial Neural Network

### Multiple Linear Regression

In [31]:
# MODEL 1: Making use of Multiple Linear Regression
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
ml_regressor = LinearRegression()
ml_regressor.fit(X_2016_train, y_2016_train)

# Predicting the Test set results
y_2016_pred_mlr = ml_regressor.predict(X_2016_test)

# Building the optimal model using Backward Elimination
# sm.OLS is provided by statsmodels.api
import statsmodels.api as sm

# Step 1: prepend a column of ones so that OLS estimates an intercept
X_2016 = np.append(arr = np.ones((X_2016.shape[0], 1)), values = X_2016, axis = 1).astype(float)
print(X_2016)

# Step 2: fit OLS with all predictors
X_opt = X_2016[:, :]
ml_regressor_OLS = sm.OLS(endog = y_2016, exog = X_opt).fit()

# Step 3: refit after dropping the predictor with the highest p-value,
# if it exceeds the chosen significance level (commonly 0.05).
# No column has been dropped yet, so this currently repeats Step 2.
X_opt = X_2016[:, :]
ml_regressor_OLS = sm.OLS(endog = y_2016, exog = X_opt).fit()
ml_regressor_OLS.summary()





[[1.      1.      1.      ... 0.07112 0.31268 2.14558]
 [1.      1.      1.      ... 0.05301 0.1684  1.92816]
 [1.      1.      1.      ... 0.16157 0.07044 3.40904]
 ...
 [1.      1.      1.      ... 0.05892 0.09821 1.97295]
 [1.      1.      1.      ... 0.11479 0.17866 2.58991]
 [1.      1.      1.      ... 0.08582 0.18503 2.4427 ]]




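| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | |
| Method: | Least Squares | F-statistic: | 0.0 |
| Date: | Sat, 23 Feb 2019 | Prob (F-statistic): | |
| Time: | 03:41:45 | Log-Likelihood: | 4550.6 |
| No. Observations: | 155 | AIC: | -8791.0 |
| Df Residuals: | 0 | BIC: | -8319.0 |
| Df Model: | 154 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0132 | inf | 0 | | | |
| x1 | 0.0132 | inf | 0 | | | |
| x2 | 0.0132 | inf | 0 | | | |
| x3 | 0.0132 | inf | 0 | | | |
| x4 | 0.0132 | inf | 0 | | | |
| x5 | 0.0132 | inf | 0 | | | |
| x6 | 0.0132 | inf | 0 | | | |
| x7 | 0.0132 | inf | 0 | | | |
| x8 | 0.0132 | inf | 0 | | | |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 20.791 | Durbin-Watson: | 0.031 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 32.592 |
| Skew: | -0.7 | Prob(JB): | 8.37e-08 |
| Kurtosis: | 4.756 | Cond. No. | 231.0 |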


**Multiple Linear Regression**: Compute the root mean squared error (RMSE) between the predicted and actual values. RMSE is expressed in the same units as the happiness score, so lower is better; it is not a percentage accuracy.

In [42]:
# MODEL 1: Making use of Multiple Linear Regression (RMSE)
rmse_mlr = sqrt(mean_squared_error(y_2016_test, y_2016_pred_mlr))
print(rmse_mlr)
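
For reference, the RMSE computation can be written out by hand. The following is a small sketch with toy values that are not taken from the dataset; `rmse` is a hypothetical helper name.

```python
import numpy as np

def rmse(y_actual, y_predicted):
    """Root mean squared error, in the same units as the target."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean((y_actual - y_predicted) ** 2)))

# Toy values for illustration only.
actual = [7.5, 6.0, 4.5]
predicted = [7.0, 6.0, 5.5]

# Errors are 0.5, 0.0 and -1.0, so the mean squared error is 1.25 / 3.
print(rmse(actual, predicted))
```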

### Decision Tree

In [37]:
# MODEL 2: Making use of Decision Trees
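
This cell is still empty in the notebook. As a hedged sketch of what a decision tree model could look like, the snippet below fits scikit-learn's `DecisionTreeRegressor` on synthetic data. The data, the `max_depth=4` setting, and the variable names are assumptions for illustration; in the notebook the model would be fitted on `X_2016_train` and `y_2016_train` instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for six factor columns and a score (not the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, size=(150, 6))
y = 2.0 + X.sum(axis=1) + rng.normal(scale=0.1, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A shallow tree limits overfitting on a dataset this small.
dt_regressor = DecisionTreeRegressor(max_depth=4, random_state=0)
dt_regressor.fit(X_train, y_train)
y_pred_dt = dt_regressor.predict(X_test)

rmse_dt = float(np.sqrt(mean_squared_error(y_test, y_pred_dt)))
print(rmse_dt)
```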


**Decision Tree**: Compute the root mean squared error (RMSE) between the predicted and actual values. Lower is better; RMSE is not a percentage accuracy.

In [None]:
# MODEL 2: Making use of Decision Trees (RMSE)
# rms = sqrt(mean_squared_error(y_actual, y_predicted))

### Artificial Neural Networks

In [None]:
# MODEL 3: Making use of Artificial Neural Networks
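
This cell is also empty in the notebook. One possible sketch, assuming scikit-learn's `MLPRegressor` is an acceptable stand-in for the neural network, is shown below on synthetic data. The layer sizes, iteration limit, and variable names are illustrative assumptions, not the author's choices. Inputs are standardized first because multilayer perceptrons tend to train poorly on unscaled features.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for six factor columns and a score (not the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, size=(150, 6))
y = 2.0 + X.sum(axis=1) + rng.normal(scale=0.1, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, inside the pipeline.
ann_regressor = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0),
)
ann_regressor.fit(X_train, y_train)
y_pred_ann = ann_regressor.predict(X_test)

rmse_ann = float(np.sqrt(mean_squared_error(y_test, y_pred_ann)))
print(rmse_ann)
```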


**Artificial Neural Networks**: Compute the root mean squared error (RMSE) between the predicted and actual values. Lower is better; RMSE is not a percentage accuracy.

In [None]:
# MODEL 3: Making use of Artificial Neural Networks (RMSE)
# rms = sqrt(mean_squared_error(y_actual, y_predicted))

## Conclusion
The most accurate model that can predict the happiness scores is the (____) model.

The final model will thus be this model, and will be used to predict the happiness scores for the 2017 dataset.
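
Once all three RMSE values are available, choosing the final model amounts to taking the minimum. A minimal sketch follows; `select_best_model` is a hypothetical helper, and the numbers are placeholders, not results from this notebook.

```python
def select_best_model(rmse_by_model):
    """Return the (name, rmse) pair with the lowest RMSE."""
    best_name = min(rmse_by_model, key=rmse_by_model.get)
    return best_name, rmse_by_model[best_name]

# Placeholder values for illustration only; they are not measured results.
example_rmse = {"model_a": 0.30, "model_b": 0.45, "model_c": 0.35}

best_name, best_rmse = select_best_model(example_rmse)
print(best_name, best_rmse)
```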

## Predicting 2017's Happiness Scores

In [51]:
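# 2017
dataset2017 = pd.read_csv('dataset/2017.csv')
X_2017 = dataset2017.iloc[:, :-1].values
y_2017 = dataset2017.iloc[:, 8].values

# Encoding categorical data
# Fit the encoder on the 2016 features and only transform the 2017 features, so that
# both years share the same dummy columns. This assumes both files have the same
# column layout. Countries that appear only in 2017 are encoded as all zeros.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
country_encoder = ColumnTransformer(
    transformers = [('country', OneHotEncoder(handle_unknown = 'ignore'), [0])],
    remainder = 'passthrough',
    sparse_threshold = 0  # always return a dense array
)
country_encoder.fit(dataset2016.iloc[:, :-1].values)
X_2017 = country_encoder.transform(X_2017).astype(float)

# Avoiding the Dummy Variable Trap
X_2017 = X_2017[:, 1:]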

In [53]:
# Predicting the 2017 dataset results using the model
# y_2017_pred = regressor.predict(X_2017)