This tutorial is inspired by Kaggle user **sabihaif**, who performed the same [analysis](https://www.kaggle.com/sabihaif/world-happiness-report-analysis) on the World Happiness 2015 dataset. Here the analysis is applied to the World Happiness dataset for 2019, which can be downloaded [here](https://www.kaggle.com/PromptCloudHQ/world-happiness-report-2019/downloads/world-happiness-report-2019.zip/1).

In [1]:
import pandas as pd
from matplotlib import pyplot
import seaborn as sns
from plotly.offline import init_notebook_mode, iplot # plotly offline mode
init_notebook_mode(connected=True) 
import plotly.graph_objs as go # plotly graphical object
%matplotlib notebook


In [2]:
data = pd.read_csv("world-happiness-report-2019.csv")
data = data.dropna()
data.head()

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP per capita,Healthy life expectancy
0,Finland,1,4,41.0,10.0,2.0,5.0,4.0,47.0,22.0,27.0
1,Denmark,2,13,24.0,26.0,4.0,6.0,3.0,22.0,14.0,23.0
2,Norway,3,8,16.0,29.0,3.0,3.0,8.0,11.0,7.0,12.0
3,Iceland,4,9,3.0,3.0,1.0,7.0,45.0,3.0,15.0,13.0
4,Netherlands,5,1,12.0,25.0,15.0,19.0,12.0,7.0,12.0,18.0


Country (region): Name of the country.

Ladder: Cantril Ladder is a measure of life satisfaction.

SD of Ladder: Standard deviation of the ladder.

Positive affect: Measure of positive emotion.

Negative affect: Measure of negative emotion.

Social support: The extent to which Social support contributed to the calculation of the Happiness Score.

Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.

Corruption: The extent to which Perception of Corruption contributes to Happiness Score.

Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.

Log of GDP per capita: The extent to which GDP contributes to the calculation of the Happiness Score.

Healthy life expectancy: The extent to which Life expectancy contributed to the calculation of the Happiness Score.


In [3]:
data.sample(5)

Unnamed: 0,Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP per capita,Healthy life expectancy
123,Tunisia,124,79,147.0,132.0,121.0,143.0,101.0,144.0,84.0,67.0
19,Czech Republic,20,20,74.0,22.0,24.0,58.0,121.0,117.0,32.0,31.0
128,Sierra Leone,129,153,139.0,149.0,135.0,116.0,112.0,79.0,145.0,146.0
99,Nepal,100,128,137.0,134.0,87.0,67.0,65.0,46.0,127.0,95.0
30,Panama,31,121,7.0,48.0,41.0,32.0,104.0,88.0,51.0,33.0


In [4]:
data.describe()

Unnamed: 0,Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP per capita,Healthy life expectancy
count,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0
mean,79.157143,78.45,78.242857,79.157143,77.5,78.828571,75.7,78.85,79.014286,75.478571
std,45.700664,46.121255,44.331627,44.506126,45.815787,45.108972,42.656011,44.727782,43.35631,43.979961
min,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0
25%,40.75,39.75,40.75,40.75,36.75,39.75,39.75,40.75,41.75,36.75
50%,79.5,77.5,78.5,78.5,77.5,79.5,76.5,79.5,78.5,77.5
75%,119.25,119.25,116.25,117.25,118.25,118.25,112.25,116.25,117.25,113.25
max,156.0,156.0,154.0,154.0,155.0,155.0,148.0,155.0,152.0,150.0


In [5]:
data.columns

Index(['Country (region)', 'Ladder', 'SD of Ladder', 'Positive affect',
       'Negative affect', 'Social support', 'Freedom', 'Corruption',
       'Generosity', 'Log of GDP\nper capita', 'Healthy life\nexpectancy'],
      dtype='object')

In [6]:
data.shape

(140, 11)

In [7]:
f, ax = pyplot.subplots(figsize = (10, 10))
# Correlate numeric columns only; the country name column is text
sns.heatmap(data.select_dtypes("number").corr(), annot = True, linewidths = 0.3, fmt = ".1f", ax = ax)
pyplot.show()


According to the correlation map, the strongest positive correlations are between Ladder (life satisfaction) and Social support, Log of GDP per capita, and Healthy life expectancy.
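
The strongest relationships can also be read off numerically by sorting each column's correlation with `Ladder`. The following is a minimal sketch on a small made-up frame: the `toy` name and its values are hypothetical and are not taken from the 2019 file. With the real data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Hypothetical toy ranks, for illustration only (not the real dataset).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Ladder": [1, 2, 3, 4, 5],
    "Social support": [2, 1, 3, 5, 4],
    "Corruption": [3, 5, 1, 2, 4],
})

# Keep numeric columns only, then take each column's correlation with Ladder.
corr_with_ladder = (
    toy.select_dtypes("number")
       .corr()["Ladder"]
       .drop("Ladder")
       .sort_values(ascending=False)
)
print(corr_with_ladder)
```

Sorting in descending order puts the columns most strongly and positively related to `Ladder` at the top, which is the same information the heatmap's `Ladder` row conveys.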

In [8]:
data_plot = data.loc[:, ['Ladder', 'Social support', 'Log of GDP\nper capita', 'Healthy life\nexpectancy', 'Corruption']]
data_plot.plot()


In [9]:
data_plot.plot(subplots = True)


In [10]:
data_plot.plot(kind = "scatter", x ='Social support', y = 'Ladder')
data_plot.plot(kind = "scatter", x ='Log of GDP\nper capita', y = 'Ladder')
data_plot.plot(kind = "scatter", x ='Healthy life\nexpectancy', y = 'Ladder')
data_plot.plot(kind = "scatter", x ='Corruption', y = 'Ladder')


These plots corroborate the strong correlations noted above: the first three show a clear linear relationship with Ladder, whereas the last shows no apparent relationship. This suggests that there is no relation between life satisfaction and corruption, which seems somewhat counterintuitive.

In [11]:
data.rename(columns={'Country (region)':'Country'}, 
                 inplace=True)
 

In [12]:
data.columns

Index(['Country', 'Ladder', 'SD of Ladder', 'Positive affect',
       'Negative affect', 'Social support', 'Freedom', 'Corruption',
       'Generosity', 'Log of GDP\nper capita', 'Healthy life\nexpectancy'],
      dtype='object')

In [13]:
# Ladder values as a plain list, used as the choropleth's z values
ladder_score = data['Ladder'].tolist()

In [14]:
data_map = [dict(
        type='choropleth',
        colorscale = 'Rainbow',
        locationmode = 'country names',
        locations = data['Country'],
        z = ladder_score,
        text = data['Country'],
        colorbar = dict(
        title = 'Ladder Score', 
        titlefont=dict(size=25),
        tickfont=dict(size=18))
)]
layout = dict(
    title = 'Ladder Score',
    titlefont = dict(size=40),
    geo = dict(
        showframe = True,
        showcoastlines = True,
        projection = dict(type = 'equirectangular')
        )
)
choromap = go.Figure(data = data_map, layout = layout)
iplot(choromap, validate=False)

In this plot, the more purple the color, the happier the country (a lower Ladder rank means higher life satisfaction).

Now, we will apply linear regression.

In [15]:
import sklearn

In [16]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

x = data['Healthy life\nexpectancy'].values.reshape(-1,1)
y = data['Ladder'].values.reshape(-1,1)

In [17]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, random_state=0)

In [18]:
lin_reg.fit(x_train,y_train)
y_pred = lin_reg.predict(x_test)
pyplot.figure()
pyplot.scatter(x_test, y_test)
pyplot.plot(x_test,y_pred)
pyplot.show()


In [19]:
b0 = lin_reg.intercept_
b1 = lin_reg.coef_
print('equation of the line is: ',b1,'x +',b0)

equation of the line is:  [[0.84182275]] x + [17.44284709]


In [20]:
xtest = pd.DataFrame(x_test)
ypred = pd.DataFrame(y_pred)
prediction = pd.concat([xtest,ypred],axis=1)
prediction.columns = ['xtest','ypred']
prediction.sort_values(by='xtest', ascending=False, axis = 0, inplace = True)
prediction.head()

Unnamed: 0,xtest,ypred
7,139.0,134.45621
38,133.0,129.405273
14,132.0,128.563451
8,131.0,127.721628
6,125.0,122.670691
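
As a quick sanity check, the `ypred` values in this table can be reproduced by hand from the printed line equation. The sketch below hard-codes the slope and intercept exactly as the notebook printed them. The helper name `predict_ladder` is hypothetical and is introduced only for illustration.

```python
# Coefficients as printed by the "equation of the line" cell.
b1 = 0.84182275   # slope
b0 = 17.44284709  # intercept

def predict_ladder(life_expectancy_rank):
    """Evaluate the fitted line b1 * x + b0 for a single input rank."""
    return b1 * life_expectancy_rank + b0

# First row of the table: xtest = 139.0
print(round(predict_ladder(139.0), 5))
```

For `xtest = 139.0` this evaluates to about 134.45621, which agrees with the first row of the table.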


In [21]:
xtest = pd.DataFrame(x_test)
ytest = pd.DataFrame(y_test)
test = pd.concat([xtest,ytest],axis=1)
test.columns = ['xtest','ytest']
test.sort_values(by='xtest', ascending=False, axis = 0, inplace = True)
test.head()

Unnamed: 0,xtest,ytest
7,139.0,154
38,133.0,102
14,132.0,139
8,131.0,138
6,125.0,147
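
Predicted and actual values are easier to compare side by side in a single frame with a residual column. The sketch below uses only the five rows displayed in the two tables above, so the resulting error figure describes those rows and not the whole test set. The names `shown` and `mean_abs_error` are illustrative.

```python
import pandas as pd

# The five rows displayed in the two tables (not the full test set).
shown = pd.DataFrame({
    "xtest": [139.0, 133.0, 132.0, 131.0, 125.0],
    "ypred": [134.45621, 129.405273, 128.563451, 127.721628, 122.670691],
    "ytest": [154, 102, 139, 138, 147],
})

# Residual = actual rank minus predicted rank.
shown["residual"] = shown["ytest"] - shown["ypred"]

# Average size of the error, ignoring its sign.
mean_abs_error = shown["residual"].abs().mean()
print(shown)
print("mean absolute error over these rows:", round(mean_abs_error, 2))
```

In the notebook itself, the same frame could be built directly from `x_test`, `y_pred`, and `y_test` rather than from typed-in values.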


Multiple Linear Regression

In [22]:
# Use three predictors instead of one
x1 = data[['Social support','Log of GDP\nper capita','Healthy life\nexpectancy']].values
y1 = data['Ladder'].values.reshape(-1,1)
x1_train, x1_test, y1_train, y1_test = train_test_split(x1,y1, test_size = 0.33, random_state=0)
# Fit the multiple linear regression and predict on the held-out rows
mlp = LinearRegression()
mlp.fit(x1_train,y1_train)
y1_predict = pd.DataFrame(mlp.predict(x1_test))
y1_test = pd.DataFrame(y1_test)

In [23]:
comp  = pd.concat([y1_predict,y1_test],axis=1)
comp.columns = ['y1_predict','y1_test']
comp.sort_values(by='y1_test', ascending=False, axis = 0, inplace = True)
comp.head()

Unnamed: 0,y1_predict,y1_test
7,143.775522,154
6,138.644263,147
14,143.040856,139
8,121.831499,138
43,105.819999,137
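
The two models in this notebook are fitted on different splits (test sizes 0.3 and 0.33), so a like-for-like comparison needs both to be scored on the same held-out rows. The sketch below shows one way to do that with scikit-learn's `r2_score`. It runs on synthetic data with hypothetical values, not on the happiness file, and the names `X`, `scores`, `single`, and `multiple` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): three predictors and a target
# that depends on all of them plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(1, 156, size=(140, 3))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 10, size=140)

# One shared split so both models are evaluated on the same rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

scores = {}

# Single-feature model: uses only the first column.
single = LinearRegression().fit(X_train[:, [0]], y_train)
scores["single"] = r2_score(y_test, single.predict(X_test[:, [0]]))

# Multiple regression: uses all three columns.
multiple = LinearRegression().fit(X_train, y_train)
scores["multiple"] = r2_score(y_test, multiple.predict(X_test))

print(scores)
```

With the real data, `X` would be the three-column `x1` array and `y` the `Ladder` column. A higher R² on the shared test rows would indicate that the extra predictors help.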
