# Imports

In [17]:
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

In this notebook we build the simplest vanilla clustering model. We do not use any of the metrics chosen earlier; instead, we cluster on `happiness_score` itself.

First, let us import the dataset we need:

In [3]:
world_metrics_subset_path = "https://raw.githubusercontent.com/mivelikikh/what_makes_us_happy/main/data/world_metrics_subset.csv"
world_metrics_subset = pd.read_csv(world_metrics_subset_path)

world_metrics_subset

Unnamed: 0,country,development_index,life_expect,life_exp60,basic_water,gdp_per_capita,eco_footprint,pf_rol,ef_legal,adult_mortality,infant_mort,age1-4mort,happiness_score
0,Angola,0.52,62.63262,17.34829,55.08428,4665.91,0.93,3.451814,2.963635,237.96940,0.057900,0.007520,3.866
1,Burundi,0.39,60.09811,16.59126,60.20415,276.69,0.80,2.961470,3.495487,290.18580,0.052420,0.006450,2.905
2,Benin,0.48,61.08568,17.20543,66.32024,746.83,1.41,4.129480,3.822761,242.37410,0.066690,0.009390,3.484
3,Burkina Faso,0.39,60.32101,15.48575,48.26772,671.07,1.21,4.860575,3.687657,254.60270,0.055795,0.008635,3.739
4,Botswana,0.69,66.05297,17.42258,89.40444,7743.50,3.83,5.641684,5.950516,249.24130,0.032560,0.002040,3.974
...,...,...,...,...,...,...,...,...,...,...,...,...,...
132,New Zealand,0.91,82.24739,25.29202,100.00000,37488.30,5.60,7.868546,8.715280,66.05728,0.003975,0.000235,7.334
133,Japan,0.89,84.16616,26.39402,98.97000,46201.60,5.02,7.643490,7.586987,50.82619,0.001980,0.000195,5.921
134,Cambodia,0.55,69.36723,17.36710,76.94537,877.64,1.21,2.566741,4.277907,170.49700,0.027600,0.001110,3.907
135,South Korea,0.89,82.66409,25.26966,99.67540,24155.80,5.69,7.438183,6.391154,60.81405,0.002955,0.000125,5.835


As noted above, we do not need all the metrics for this step. We just want to split the target feature into several clusters (quantiles). This gives us a first visual representation of the spatial distribution of happiness scores.

First, we need to drop all the columns except `country` and `happiness_score`:

In [4]:
vanila_subset = world_metrics_subset[['country', 'happiness_score']]
vanila_subset

Unnamed: 0,country,happiness_score
0,Angola,3.866
1,Burundi,2.905
2,Benin,3.484
3,Burkina Faso,3.739
4,Botswana,3.974
...,...,...
132,New Zealand,7.334
133,Japan,5.921
134,Cambodia,3.907
135,South Korea,5.835


Now we need to sort the observations in descending order:

In [6]:
vanila_subset_sorted = vanila_subset.sort_values('happiness_score', ascending=False)
vanila_subset_sorted

Unnamed: 0,country,happiness_score
111,Denmark,7.526
115,Switzerland,7.509
84,Norway,7.498
105,Finland,7.413
54,Canada,7.404
...,...,...
29,Guinea,3.607
15,Rwanda,3.515
2,Benin,3.484
11,Togo,3.303


The main step is to split the set into four quantile groups. For consistency, we assume that Q1 contains the happiest countries in the world, Q2 the countries whose citizens are satisfied with their lives, Q3 the countries whose citizens are NOT satisfied with their way of living, and Q4 the unhappiest countries. Note that `pd.qcut` with `labels=False` numbers the groups from 0 (lowest scores) to 3 (highest scores), so label 3 in the `quantile` column corresponds to Q1 and label 0 to Q4.

In [7]:
vanila_subset_sorted['quantile'] = pd.qcut(vanila_subset_sorted['happiness_score'], q=4, labels=False)
vanila_subset_sorted.head()

Unnamed: 0,country,happiness_score,quantile
111,Denmark,7.526,3
115,Switzerland,7.509,3
84,Norway,7.498,3
105,Finland,7.413,3
54,Canada,7.404,3
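
To make the relationship between the integer codes from `pd.qcut` and the Q1 to Q4 naming explicit, here is a minimal sketch on made-up scores. The `toy` frame and the `quartile_name` column are hypothetical and only illustrate one possible mapping; they are not part of the original analysis.

```python
import pandas as pd

# Toy scores standing in for `happiness_score` (made-up values, not from the dataset).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "happiness_score": [7.5, 7.1, 6.2, 5.9, 5.0, 4.6, 3.9, 3.3],
})

# With labels=False, qcut returns integer codes 0..3 in ascending score order.
toy["quantile"] = pd.qcut(toy["happiness_score"], q=4, labels=False)

# Hypothetical helper column: flip the codes so that Q1 is the happiest group.
toy["quartile_name"] = "Q" + (4 - toy["quantile"]).astype(str)

print(toy.sort_values("happiness_score", ascending=False))
```

Keeping the integer codes and deriving a readable name from them leaves the original `quantile` column untouched while still giving string labels for tables and legends.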


Now that each country has a quantile label, we can plot the clusters:

In [26]:
fig = px.scatter(vanila_subset_sorted, x='happiness_score', y='country', color='quantile')

fig.update_layout(
    title="Vanilla Clustering of Countries' Happiness Scores",
    xaxis_title='Happiness Score',
    yaxis_title='Country',
    yaxis_tickfont={'size': 8},
    xaxis_tickfont={'size': 8},
    width=700,
    height=700
)
fig.show()
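
One way to sanity-check the groups shown in the plot is to summarise the score range of each quantile label. The sketch below does this on made-up scores; the `toy` frame and the `summary` name are assumptions for illustration, and the same `groupby` call could be applied to `vanila_subset_sorted`.

```python
import pandas as pd

# Made-up scores standing in for `happiness_score`.
toy = pd.DataFrame({
    "happiness_score": [7.5, 7.1, 6.2, 5.9, 5.0, 4.6, 3.9, 3.3],
})
toy["quantile"] = pd.qcut(toy["happiness_score"], q=4, labels=False)

# Size and score range of each group; the ranges of consecutive groups
# should not overlap, because qcut splits on score thresholds.
summary = toy.groupby("quantile")["happiness_score"].agg(["count", "min", "max"])
print(summary)
```

If the groups are well formed, the counts are roughly equal and each group's maximum lies below the next group's minimum.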