**Where is this data set from?**

- The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. This dataset is a subset of the larger report, which includes data from various sources such as the Gallup World Poll and other national surveys. The data was extracted from the World Happiness Report and made available for public use. However, the original data was collected by various researchers and organizations as part of their ongoing efforts to measure and understand happiness and well-being around the world.

    We use observed data on the six variables and estimates of their associations with life evaluations to explain the variation across countries, much as epidemiologists estimate the extent to which life expectancy is affected by factors such as smoking, exercise, and diet. The six variables are GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption. Our happiness rankings are not based on any index of these six factors; the scores are instead based on individuals’ own assessments of their lives, in particular, their answers to the single-item Cantril ladder life-evaluation question.

Detailed information about each of the Predictors:

1. **Log GDP per capita** is in terms of Purchasing Power Parity (PPP) adjusted to a constant 2017 international dollars, taken from the World Development Indicators (WDI) by the World Bank (version 17, metadata last updated on January 22, 2023). See Statistical Appendix 1 for more details. GDP data for 2022 are not yet available, so we extend the GDP time series from 2021 to 2022 using country-specific forecasts of real GDP growth from the OECD Economic Outlook No. 112 (November 2022) or, if missing, from the World Bank’s Global Economic Prospects (last updated: January 10, 2023), after adjustment for population growth. The equation uses the natural log of GDP per capita, as this form fits the data significantly better than GDP per capita.

2. The time series for **Healthy life expectancy at birth** is constructed based on data from the World Health Organization (WHO) Global Health Observatory data repository, with data available for 2005, 2010, 2015, 2016, and 2019. To match this report’s sample period (2005-2022), interpolation and extrapolation are used. See Statistical Appendix 1 for more details.

3. **Social support** (0-1) is the national average of the binary responses (0=no, 1=yes) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”

4. **Freedom to make life choices** (0-1) is the national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

5. **Generosity** is the residual of regressing the national average of GWP responses to the donation question “Have you donated money to a charity in the past month?” on log GDP per capita.

6. **Perceptions of corruption** (0-1) are the average of binary answers to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption perception measure.

7. **Positive affect** is defined as the average of previous-day affect measures for laughter, enjoyment, and interest. The inclusion of interest (first added for World Happiness Report 2022) gives us three components in each of positive and negative affect, and slightly improves the equation fit in column 4. The general form for the affect questions is: “Did you experience the following feelings during a lot of the day yesterday?”

8. **Negative affect** is defined as the average of previous-day affect measures for worry, sadness, and anger.

9. **Life ladder**: Life evaluations from the Gallup World Poll provide the basis for the annual happiness rankings. They are based on answers to the main life evaluation question. The Cantril ladder asks respondents to think of a ladder, with the **best possible life for them being a 10 and the worst possible life being a 0**. They are then asked to rate their own current lives **on a 0 to 10 scale**. The rankings are from nationally representative samples over three years.

10. **Confidence in National Government**: This variable is based on the following Gallup World Poll question:

    "Do you have confidence in the national government?"

    Respondents are given the following options to choose from:

    "Yes, always"
    
    "Yes, sometimes"
    
    "No, rarely"

    "No, never"

    "Don't know"

    **The variable is calculated as the share of respondents who answer "Yes, always" or "Yes, sometimes" to this question.** In this dataset it is stored as a proportion between 0 and 1 rather than as a percentage.

    This variable is reported alongside the explanatory factors (income, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption). As noted above, none of these enters the happiness score directly: the score comes from respondents' answers to the Cantril ladder question, and the factors are used to explain how it varies across countries.
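Several of the definitions above are simple aggregations, and a small sketch may make them concrete. The example below uses invented numbers (the names `responses`, `corruption`, `log_gdp`, and `donation_share` are hypothetical and nothing here comes from the Gallup World Poll) to illustrate three ideas: a national average of binary answers is a proportion between 0 and 1; the corruption measure falls back to the business question when the government question is missing; and generosity is the residual of an ordinary least squares fit on log GDP per capita. The report's own estimation details are in its statistical appendix and may differ from this simplified version.

```python
import numpy as np
import pandas as pd

# Hypothetical binary survey answers (1 = yes, 0 = no) for one country-year.
responses = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])
social_support = responses.mean()  # share of "yes" answers, always in [0, 1]

# Hypothetical country-level shares for the two corruption questions.
corruption = pd.DataFrame({
    "gov": [0.80, np.nan, 0.40],
    "biz": [0.70, 0.60, 0.50],
})
# Row-wise mean skips NaN by default, so a missing government value
# leaves the business value as the overall measure (assumed reading
# of the fallback rule described in item 6).
corruption["perception"] = corruption[["gov", "biz"]].mean(axis=1)

# Hypothetical national donation shares and log GDP per capita.
log_gdp = np.array([7.0, 8.0, 9.0, 10.0, 11.0])
donation_share = np.array([0.30, 0.20, 0.35, 0.30, 0.45])

# Fit donation_share = intercept + slope * log_gdp by least squares,
# then keep what the fit does not explain.
slope, intercept = np.polyfit(log_gdp, donation_share, deg=1)
generosity = donation_share - (intercept + slope * log_gdp)

print(social_support)
print(corruption)
print(generosity.round(3))
```

Because least squares residuals with an intercept sum to zero, a generosity measure built this way is centred on zero and takes negative values for countries that donate less than their income level would predict. That is consistent with the `Generosity` column further down, whose mean is very close to 0 and whose minimum is negative.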

In [1]:
# Data handling
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Visualization
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Normality tests (Shapiro-Wilk, Kolmogorov-Smirnov)
# ------------------------------------------------------------------------------
from scipy.stats import shapiro, kstest

# Warning handling
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")

# Configuration
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None)  # show every column when displaying DataFrames


In [2]:
from src.tolookandcompare import transform_headtail, transform_info

In [3]:
from src import soporte_eda as sp_eda

In [5]:
df = pd.read_csv('files/World_Happiness_Report.csv')

df.head(2)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545


In [6]:
df.sample(10)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
1829,State of Palestine,,2007,4.151054,8.180532,0.711819,61.897499,0.365296,-0.079686,0.84418,0.515242,0.412328,
203,Bolivia,Latin America and Caribbean,2011,5.778874,8.813483,0.816783,61.900002,0.781674,-0.040564,0.824854,0.688716,0.361486,0.334732
464,Croatia,Central and Eastern Europe,2012,6.027635,10.092166,0.775818,67.540001,0.54191,-0.247485,0.92386,0.571653,0.271041,0.311119
290,Burundi,Sub-Saharan Africa,2011,3.705894,6.694147,0.42224,51.52,0.489863,-0.059415,0.677108,0.571715,0.190345,0.8512
719,Ghana,Sub-Saharan Africa,2020,5.319483,8.568557,0.642703,58.375,0.82372,0.198711,0.847025,0.674681,0.252728,0.618855
440,Congo (Kinshasa),,2016,4.521935,6.928858,0.864155,52.825001,0.637367,-0.023077,0.875,0.610231,0.222411,0.281825
114,Azerbaijan,Commonwealth of Independent States,2013,5.481178,9.592311,0.76969,62.540001,0.671957,-0.171054,0.69882,0.516089,0.242455,0.748924
747,Guatemala,Latin America and Caribbean,2015,6.464987,9.002782,0.822837,61.5,0.86864,0.048504,0.821655,0.826123,0.310554,0.272381
2081,United States,North America and ANZ,2016,6.8036,10.984834,0.896751,66.474998,0.757893,0.139648,0.73892,0.736574,0.264204,0.297206
655,Gabon,Sub-Saharan Africa,2011,4.255401,9.557162,0.652702,54.459999,0.771872,-0.211016,0.850831,0.564418,0.263955,0.534545


In [7]:
transform_headtail(df, 'Country Name')

Valores únicos: 165
Número de registros: 2199
Valores nulos: 0
Registros duplicados: 2034
dtype: object
---------------------------------
Country Name
Argentina     0.77
Costa Rica    0.77
Brazil        0.77
Bolivia       0.77
Bangladesh    0.77
Name: proportion, dtype: float64
Country Name
Cuba        0.05
Maldives    0.05
Guyana      0.05
Oman        0.05
Suriname    0.05
Name: proportion, dtype: float64
---------------------------------
Media: nan
Mediana: nan
Moda: nan


In [8]:
transform_headtail(df, 'Generosity')

Valores únicos: 2126
Número de registros: 2199
Valores nulos: 73
Registros duplicados: 73
dtype: float64
---------------------------------
Generosity
 NaN         3.32
-0.069513    0.05
 0.167652    0.05
 0.190809    0.05
 0.121316    0.05
Name: proportion, dtype: float64
Generosity
-0.144077    0.05
-0.134858    0.05
-0.132990    0.05
-0.129371    0.05
-0.018664    0.05
Name: proportion, dtype: float64
---------------------------------
Media: 0.00
Mediana: -0.02
Moda: -0.33752656


In [22]:
transform_headtail(df, 'Confidence In National Government')

Valores únicos: 1838
Número de registros: 2199
Valores nulos: 361
Registros duplicados: 361
dtype: float64
---------------------------------
Confidence In National Government
NaN         16.42
0.611545     0.05
0.299357     0.05
0.307386     0.05
0.435440     0.05
Name: proportion, dtype: float64
Confidence In National Government
0.387677    0.05
0.344929    0.05
0.263297    0.05
0.267581    0.05
0.612072    0.05
Name: proportion, dtype: float64
---------------------------------
Media: 0.48
Mediana: 0.47
Moda: 0.06876874


In [9]:
sp_eda.resumen_df(df)


Forma del DataFrame: (2199, 13)

Tipos de datos:
Country Name                         float64
Regional Indicator                    object
Year                                   int64
Life Ladder                          float64
Log GDP Per Capita                   float64
Social Support                       float64
Healthy Life Expectancy At Birth     float64
Freedom To Make Life Choices         float64
Generosity                           float64
Perceptions Of Corruption            float64
Positive Affect                      float64
Negative Affect                      float64
Confidence In National Government    float64
dtype: object

Valores nulos:
Country Name                         2199
Regional Indicator                    112
Year                                    0
Life Ladder                             0
Log GDP Per Capita                     20
Social Support                         13
Healthy Life Expectancy At Birth       54
Freedom To Make Life Choices           33

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Country Name,0.0,,,,,,,
Year,2199.0,2014.161437,4.718736,2005.0,2010.0,2014.0,2018.0,2022.0
Life Ladder,2199.0,5.479226,1.125529,1.281271,4.64675,5.432437,6.30946,8.018934
Log GDP Per Capita,2179.0,9.389766,1.153387,5.526723,8.499764,9.498955,10.373216,11.663788
Social Support,2186.0,0.810679,0.120952,0.228217,0.746609,0.835535,0.904792,0.987343
Healthy Life Expectancy At Birth,2145.0,63.294583,6.901104,6.72,59.119999,65.050003,68.5,74.474998
Freedom To Make Life Choices,2166.0,0.747858,0.14015,0.257534,0.656528,0.769821,0.859382,0.985178
Generosity,2126.0,9.6e-05,0.161083,-0.337527,-0.112116,-0.022671,0.09207,0.702708
Perceptions Of Corruption,2083.0,0.745195,0.185837,0.035198,0.688139,0.799654,0.868827,0.983276
Positive Affect,2175.0,0.652143,0.105922,0.178886,0.571684,0.663063,0.737936,0.883586



Resumen estadístico (categóricas):


Unnamed: 0,count,unique,top,freq
Regional Indicator,2087,10,Sub-Saharan Africa,443


In [10]:
df['Country Name'].dtype

dtype('float64')

In [11]:
df['Country Name']

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
2194   NaN
2195   NaN
2196   NaN
2197   NaN
2198   NaN
Name: Country Name, Length: 2199, dtype: float64

In [12]:
df["Country Name"] = df["Country Name"].astype(str)

In [13]:
df['Country Name']

0       nan
1       nan
2       nan
3       nan
4       nan
       ... 
2194    nan
2195    nan
2196    nan
2197    nan
2198    nan
Name: Country Name, Length: 2199, dtype: object
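The cast above changes the dtype but does not bring back any country names. In the pandas version that produced this output, `astype(str)` turns each missing value into the literal string `nan`, so the column stops looking empty to `isna()` while still carrying no information. A minimal illustration on a toy Series (values invented; the exact behaviour of `astype(str)` on missing values may differ in other pandas versions, so the sketch counts both cases):

```python
import numpy as np
import pandas as pd

# A column that was loaded as float64 and contains only missing values.
s = pd.Series([np.nan, np.nan, np.nan], name="Country Name")
missing_before = int(s.isna().sum())

as_text = s.astype(str)

# Values that are still recognised as missing after the cast.
still_missing = int(as_text.isna().sum())
# Values that were turned into the literal text "nan".
hidden_as_text = int((as_text == "nan").sum())

print(missing_before, still_missing, hidden_as_text)
```

Either way the three entries remain uninformative, which is why the notebook goes on to reload the file instead of casting the column.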

In [10]:
# df['Country Name'].astype('O')

In [15]:
# The DataFrame has to be reloaded, forcing the dtype of the column
df = pd.read_csv('files/World_Happiness_Report.csv')

In [16]:
df = pd.read_csv('files/World_Happiness_Report.csv', dtype={'Country Name': 'object'})

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2199 entries, 0 to 2198
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Country Name                       2199 non-null   object 
 1   Regional Indicator                 2087 non-null   object 
 2   Year                               2199 non-null   int64  
 3   Life Ladder                        2199 non-null   float64
 4   Log GDP Per Capita                 2179 non-null   float64
 5   Social Support                     2186 non-null   float64
 6   Healthy Life Expectancy At Birth   2145 non-null   float64
 7   Freedom To Make Life Choices       2166 non-null   float64
 8   Generosity                         2126 non-null   float64
 9   Perceptions Of Corruption          2083 non-null   float64
 10  Positive Affect                    2175 non-null   float64
 11  Negative Affect                    2183 non-null   float

____

In [18]:
df['Country Name'].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: Country Name, dtype: object

In [19]:
df['Country Name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Estonia', 'Eswatini', 'Ethiopia', 'Finland',
       'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guyana', 'Haiti', 'Honduras',
       'Hong Kong S.A.R. of China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Ivory Coast', 'Jamaica', 'Japan', 

In [20]:
df['Country Name'].value_counts()

Country Name
Argentina     17
Costa Rica    17
Brazil        17
Bolivia       17
Bangladesh    17
              ..
Cuba           1
Maldives       1
Guyana         1
Oman           1
Suriname       1
Name: count, Length: 165, dtype: int64
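The counts above show an unbalanced panel: some countries have 17 yearly observations and others only one. Below is a sketch of how one might flag sparsely covered countries before comparing trends over time. The threshold `min_years` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy data that reuses the dataset's column names; the rows are invented.
toy = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "C", "C"],
    "Year": [2019, 2020, 2021, 2021, 2020, 2021],
})

# Number of distinct survey years available for each country.
years_per_country = toy.groupby("Country Name")["Year"].nunique()

min_years = 2  # hypothetical cut-off
sparse = years_per_country[years_per_country < min_years].index.tolist()
print(sparse)
```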

Insights: 

- The 'Country Name' column has to be read as object rather than float. Loading it as float64 is what produced the large number of nulls in that column.

- Social Support: according to the description above, this is the national average of binary responses (0 = no, 1 = yes). The values are continuous between 0 and 1, which is what an average of 0/1 answers (a proportion) looks like.

- Freedom To Make Life Choices: according to the description above, this is the national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

- Perceptions Of Corruption: the average of binary answers to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption perception measure.

- Confidence In National Government: respondents can choose from the following options:

    "Yes, always"

    "Yes, sometimes"

    "No, rarely"

    "No, never"

    "Don't know"

    The variable is calculated as the percentage of respondents who answer "Yes, always" or "Yes, sometimes" to this question.
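Since these four variables are shares of respondents, their values should lie between 0 and 1, and fractional values are expected rather than a sign of bad data. A quick check along those lines, written against a toy frame that reuses the dataset's column names (the rows are invented, and one value is deliberately out of range):

```python
import numpy as np
import pandas as pd

share_cols = [
    "Social Support",
    "Freedom To Make Life Choices",
    "Perceptions Of Corruption",
    "Confidence In National Government",
]

toy = pd.DataFrame({
    "Social Support": [0.45, 0.82, np.nan],
    "Freedom To Make Life Choices": [0.72, 0.68, 0.91],
    "Perceptions Of Corruption": [0.88, 0.35, 0.60],
    "Confidence In National Government": [0.61, np.nan, 1.20],  # 1.20 is invalid on purpose
})

# True where a value falls outside [0, 1]; comparisons with NaN are False,
# so missing values are not flagged here.
values = toy[share_cols]
out_of_range = (values < 0) | (values > 1)

print(out_of_range.sum())
```

`Generosity` is left out of the check on purpose: as a regression residual it can legitimately be negative.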