**TRANSFORMATIONS - files/World_Happiness_Report.csv**

*NO IMPUTATION - only new % and binary (yes/no) columns*

__________

**Where is this data set from?**

- The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. This dataset is a subset of the larger report, which includes data from various sources such as the Gallup World Poll and other national surveys. The data was extracted from the World Happiness Report and made available for public use. However, the original data was collected by various researchers and organizations as part of their ongoing efforts to measure and understand happiness and well-being around the world.

    We use observed data on the six variables and estimates of their associations with life evaluations to explain the variation across countries, much as epidemiologists estimate the extent to which life expectancy is affected by factors such as smoking, exercise, and diet. The six variables are GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption. Our happiness rankings are not based on any index of these six factors; the scores are instead based on individuals’ own assessments of their lives, in particular their answers to the single-item Cantril ladder life-evaluation question.

Detailed information about each of the Predictors:

1. **Log GDP per capita** is in terms of Purchasing Power Parity (PPP) adjusted to constant 2017 international dollars, taken from the World Development Indicators (WDI) by the World Bank (version 17, metadata last updated on January 22, 2023). See Statistical Appendix 1 for more details. GDP data for 2022 are not yet available, so we extend the GDP time series from 2021 to 2022 using country-specific forecasts of real GDP growth from the OECD Economic Outlook No. 112 (November 2022) or, if missing, from the World Bank’s Global Economic Prospects (last updated: January 10, 2023), after adjustment for population growth. The equation uses the natural log of GDP per capita, as this form fits the data significantly better than GDP per capita.

2. The time series for **Healthy life expectancy at birth** is constructed based on data from the World Health Organization (WHO) Global Health Observatory data repository, with data available for 2005, 2010, 2015, 2016, and 2019. To match this report’s sample period (2005-2022), interpolation and extrapolation are used. See Statistical Appendix 1 for more details.

3. **Social support** - *Conversion: % and yes/no*

    **Social support** (0-1) is the national average of the binary responses (0=no, 1=yes) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”

4.  **Freedom to make life choices** - *Conversion: % and yes/no*

    **Freedom to make life choices** (0-1) is the national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

5. **Generosity** is the residual of regressing the national average of GWP responses to the donation question “Have you donated money to a charity in the past month?” on log GDP per capita.

6.  **Perceptions of corruption** - *Conversion: % and yes/no*

    **Perceptions of corruption** (0-1) are the average of binary answers to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” Where data for government corruption are missing, the perception of business corruption is used as the overall corruption perception measure.

7. **Positive affect** - *Conversion: % and yes/no*

    **Positive affect** is defined as the average of previous-day affect measures for laughter, enjoyment, and interest. The inclusion of interest (first added for World Happiness Report 2022) gives us three components in each of positive and negative affect, and slightly improves the equation fit in column 4. The general form for the affect questions is: “Did you experience the following feelings during a lot of the day yesterday?”

8. **Negative affect** - *Conversion: % and yes/no*

    **Negative affect** is defined as the average of previous-day affect measures for worry, sadness, and anger.

9. **Life ladder**: Life evaluations from the Gallup World Poll provide the basis for the annual happiness rankings. They are based on answers to the main life evaluation question. The Cantril ladder asks respondents to think of a ladder, with the **best possible life for them being a 10 and the worst possible life being a 0**. They are then asked to rate their own current lives **on a 0 to 10 scale**. The rankings are from nationally representative samples over three years.

10. **Confidence in National Government**: The "Confidence in National Government" variable in the World Happiness Report is calculated based on the following question asked in the Gallup World Poll:

    "Do you have confidence in the national government?"

    Respondents are given the following options to choose from:

    - "Yes, always"
    - "Yes, sometimes"
    - "No, rarely"
    - "No, never"
    - "Don't know"

    **The variable is calculated as the percentage of respondents who answer "Yes, always" or "Yes, sometimes" to this question.**

    This variable is one of several social indicators reported alongside the happiness scores. As noted above, the rankings themselves come from the Cantril ladder answers; factors such as income, social support, life expectancy, freedom to make life choices, generosity, and perceptions of corruption are used to explain the variation in those scores, not to compute them.
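
To make the definitions above concrete, here is a minimal sketch with made-up numbers (none taken from the report). The two helpers, `national_average` and `residual_on_log_gdp`, are hypothetical names introduced only for illustration: the first averages binary survey answers into a 0-1 score, as described for social support, freedom, and corruption; the second takes the residual of a simple least-squares fit on log GDP per capita, as described for generosity. The report's actual estimation may differ in detail.

```python
import numpy as np

def national_average(binary_answers):
    """Mean of 0/1 answers, i.e. the share of respondents who answered 'yes'."""
    return float(np.mean(binary_answers))

def residual_on_log_gdp(donation_share, log_gdp):
    """Residual of a simple linear least-squares fit of donation_share on log_gdp."""
    donation_share = np.asarray(donation_share, dtype=float)
    log_gdp = np.asarray(log_gdp, dtype=float)
    slope, intercept = np.polyfit(log_gdp, donation_share, deg=1)
    return donation_share - (slope * log_gdp + intercept)

# Hypothetical respondents in one country: 1 = yes, 0 = no
answers = [1, 1, 0, 1, 0, 1, 1, 1]
social_support = national_average(answers)  # 6 of 8 answered yes

# Hypothetical country-level values, for illustration only
log_gdp = np.log([2_000, 8_000, 20_000, 45_000])  # natural log of GDP per capita
donation_share = [0.15, 0.22, 0.35, 0.30]
generosity = residual_on_log_gdp(donation_share, log_gdp)

print(social_support)
print(generosity.round(3))
```

A negative residual means a country donates less than its income level would predict under the fit, which is why `Generosity` can take negative values in the dataset.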

In [1]:
# Data handling
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Visualization
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Normality tests (used when assessing the relationships between variables)
# ------------------------------------------------------------------------------
from scipy.stats import shapiro, kstest

# Support scripts
# -------------------------------------------------
from src.tolookandcompare import to_doc_info, to_doc_headtail, transform_info, transform_headtail

from src import soporte_eda as sp_eda
from src.soporte_eda import resumen_df

# Warning handling
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")

# Configuration
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None) # show every column of the DataFrames


**Read/load the .csv**

**TRANSFORMATION of `Country Name` when loading the .csv**

In [2]:
# src --> tolookandcompare_v2.py was edited because it was breaking 'Country Name'
df = pd.read_csv('files/World_Happiness_Report.csv') 

df.sample(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
307,Cambodia,Southeast Asia,2020,4.376985,8.360816,0.724423,61.700001,0.963075,0.050262,0.863054,0.77077,0.389852,
1118,Lesotho,Sub-Saharan Africa,2011,4.897515,7.78504,0.824085,41.52,0.61826,-0.089229,0.767676,0.754062,0.17001,0.396459
185,Benin,Sub-Saharan Africa,2013,3.479413,7.934523,0.576823,53.779999,0.78324,-0.084524,0.855956,0.645734,0.216339,0.546921
2104,Uruguay,Latin America and Caribbean,2022,6.670853,10.084121,0.904825,67.5,0.877969,-0.051668,0.631337,0.774694,0.267485,
1448,Nigeria,Sub-Saharan Africa,2010,4.760276,8.488231,0.823823,51.5,0.565351,0.065344,0.910719,0.758522,0.190343,0.288011


In [3]:
df.shape

(2199, 13)

In [4]:
df['Country Name'].dtype

dtype('O')

In [5]:
# NULLS - 0
df['Country Name'].isna().sum()

np.int64(0)

In [6]:
df['Country Name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Estonia', 'Eswatini', 'Ethiopia', 'Finland',
       'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guyana', 'Haiti', 'Honduras',
       'Hong Kong S.A.R. of China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Ivory Coast', 'Jamaica', 'Japan', 

In [7]:
df['Country Name'].value_counts()

Country Name
Argentina     17
Costa Rica    17
Brazil        17
Bolivia       17
Bangladesh    17
              ..
Cuba           1
Maldives       1
Guyana         1
Oman           1
Suriname       1
Name: count, Length: 165, dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2199 entries, 0 to 2198
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Country Name                       2199 non-null   object 
 1   Regional Indicator                 2087 non-null   object 
 2   Year                               2199 non-null   int64  
 3   Life Ladder                        2199 non-null   float64
 4   Log GDP Per Capita                 2179 non-null   float64
 5   Social Support                     2186 non-null   float64
 6   Healthy Life Expectancy At Birth   2145 non-null   float64
 7   Freedom To Make Life Choices       2166 non-null   float64
 8   Generosity                         2126 non-null   float64
 9   Perceptions Of Corruption          2083 non-null   float64
 10  Positive Affect                    2175 non-null   float64
 11  Negative Affect                    2183 non-null   float

_________________________

**TRANSFORMATION of the columns (5):**

- `Social Support`
- `Freedom To Make Life Choices`
- `Perceptions Of Corruption`
- `Positive Affect`
- `Negative Affect`

New columns:
- numeric decimal value as a % *(% of yes)*
- binary categorical value *(No = below 0.5, Yes = 0.5 or above)*; a NaN becomes the category *"Data not available"*
- nulls are not imputed

In [9]:
# List of original columns we want to transform
cols = [
    "Social Support",
    "Freedom To Make Life Choices",
    "Perceptions Of Corruption",
    "Positive Affect",
    "Negative Affect"
]

for col in cols:
    print(f"\n🔎 Processing column: {col}")

    # 1. Count nulls before transforming
    n_nulos = df[col].isna().sum()
    print(f"   ➤ Nulls found: {n_nulos}")

    # 2. Convert the name to Title Case so the new column names are consistent
    col_title = col.title()

    # 3. Create the percentage column (NaN stays NaN)
    df[f"{col_title} (%)"] = (df[col] * 100).round(1)
    print(f"   ✔ Column created: '{col_title} (%)'")

    # 4. Create the categorical Yes/No column (Yes when the value is >= 0.5)
    df[f"{col_title} (Yes/No)"] = df[col].apply(
        lambda x: "Yes" if pd.notnull(x) and x >= 0.5 else
                  "No" if pd.notnull(x) else
                  "Data not available"
    )
    print(f"   ✔ Column created: '{col_title} (Yes/No)'")

    # 5. Show a quick example of the first transformed values
    print(df[[col, f"{col_title} (%)", f"{col_title} (Yes/No)"]].head(3))


🔎 Processing column: Social Support
   ➤ Nulls found: 13
   ✔ Column created: 'Social Support (%)'
   ✔ Column created: 'Social Support (Yes/No)'
   Social Support  Social Support (%) Social Support (Yes/No)
0        0.450662                45.1                      No
1        0.552308                55.2                     Yes
2        0.539075                53.9                     Yes

🔎 Processing column: Freedom To Make Life Choices
   ➤ Nulls found: 33
   ✔ Column created: 'Freedom To Make Life Choices (%)'
   ✔ Column created: 'Freedom To Make Life Choices (Yes/No)'
   Freedom To Make Life Choices  Freedom To Make Life Choices (%)  \
0                      0.718114                              71.8   
1                      0.678896                              67.9   
2                      0.600127                              60.0   

  Freedom To Make Life Choices (Yes/No)  
0                                   Yes  
1                      

*NaNs in the % columns are kept because Tableau/Power BI can plot them without errors.*
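
As an alternative to the row-wise `apply` above, the same rule can be sketched with vectorized operations. This is only an illustration on a small hypothetical frame: `demo` and `add_pct_and_flag` are names introduced here and are not part of the notebook. The 0.5 threshold and the "Data not available" label follow the rule already stated for the new columns.

```python
import numpy as np
import pandas as pd

def add_pct_and_flag(frame, col, threshold=0.5):
    """Add '<col> (%)' and '<col> (Yes/No)' columns without imputing nulls."""
    out = frame.copy()
    # Percentage column: NaN inputs stay NaN
    out[f"{col} (%)"] = (out[col] * 100).round(1)
    # Categorical column: the first matching condition wins, otherwise "No"
    out[f"{col} (Yes/No)"] = np.select(
        [out[col].isna().to_numpy(), (out[col] >= threshold).to_numpy()],
        ["Data not available", "Yes"],
        default="No",
    )
    return out

# Hypothetical values, including a null and a value exactly at the threshold
demo = pd.DataFrame({"Social Support": [0.450662, 0.552308, np.nan, 0.5]})
demo = add_pct_and_flag(demo, "Social Support")
print(demo)
```

Checking for nulls first keeps the labeling explicit: a missing value gets "Data not available" in the categorical column and stays NaN in the % column.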

In [10]:
df.sample(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government,Social Support (%),Social Support (Yes/No),Freedom To Make Life Choices (%),Freedom To Make Life Choices (Yes/No),Perceptions Of Corruption (%),Perceptions Of Corruption (Yes/No),Positive Affect (%),Positive Affect (Yes/No),Negative Affect (%),Negative Affect (Yes/No)
333,Canada,North America and ANZ,2012,7.415144,10.739143,0.948128,70.919998,0.917961,0.286125,0.465602,0.775569,0.229332,0.523448,94.8,Yes,91.8,Yes,46.6,No,77.6,Yes,22.9,No
1223,Mali,Sub-Saharan Africa,2022,4.210548,7.645282,0.641625,55.799999,0.817643,-0.019203,0.745647,0.655435,0.407665,,64.2,Yes,81.8,Yes,74.6,Yes,65.5,Yes,40.8,No
2148,Vietnam,Southeast Asia,2016,5.062267,9.053184,0.876324,65.0,0.894351,-0.109294,0.79924,0.487257,0.22255,,87.6,Yes,89.4,Yes,79.9,Yes,48.7,No,22.3,No
209,Bolivia,Latin America and Caribbean,2017,5.650553,9.017354,0.778662,63.0,0.883905,-0.121653,0.819262,0.655217,0.433944,0.427633,77.9,Yes,88.4,Yes,81.9,Yes,65.5,Yes,43.4,No
1019,Kenya,Sub-Saharan Africa,2011,4.40531,8.249104,0.846308,54.02,0.708659,0.012077,0.922664,0.705513,0.227972,0.458982,84.6,Yes,70.9,Yes,92.3,Yes,70.6,Yes,22.8,No


In [11]:
df['Country Name'].dtype

dtype('O')

In [12]:
# NULLS - still 0
df['Country Name'].isna().sum()

np.int64(0)

____

#### **SAVE the new .csv "World Happiness Report, clean and not imputed":**

In [13]:
df.to_csv('files/World_Happiness_Report_limpio_sin_imputar.csv', index=False)
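
One way to confirm that nothing is lost on save is to read the file back and compare it with the frame in memory. The sketch below does this on a small hypothetical frame written to a temporary directory; `demo`, `path`, and the file name are illustrative and are not the notebook's real data. It relies on "Data not available" not being one of the strings `pd.read_csv` treats as missing by default, so the label should come back as text while true NaNs come back as NaN.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical two-row frame mimicking the transformed columns
demo = pd.DataFrame({
    "Country Name": ["Cuba", "Oman"],
    "Social Support": [0.8, np.nan],
    "Social Support (%)": [80.0, np.nan],
    "Social Support (Yes/No)": ["Yes", "Data not available"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_sin_imputar.csv"
    demo.to_csv(path, index=False)  # same call pattern as the cell above
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == demo.shape
nulls_preserved = int(reloaded["Social Support (%)"].isna().sum()) == 1
label_preserved = reloaded["Social Support (Yes/No)"].iloc[1] == "Data not available"
print(same_shape, nulls_preserved, label_preserved)
```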

In [14]:
df.head(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government,Social Support (%),Social Support (Yes/No),Freedom To Make Life Choices (%),Freedom To Make Life Choices (Yes/No),Perceptions Of Corruption (%),Perceptions Of Corruption (Yes/No),Positive Affect (%),Positive Affect (Yes/No),Negative Affect (%),Negative Affect (Yes/No)
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072,45.1,No,71.8,Yes,88.2,Yes,41.4,No,25.8,No
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545,55.2,Yes,67.9,Yes,85.0,Yes,48.1,No,23.7,No
2,Afghanistan,South Asia,2010,4.758381,7.6139,0.539075,51.099998,0.600127,0.121316,0.706766,0.516907,0.275324,0.299357,53.9,Yes,60.0,Yes,70.7,Yes,51.7,Yes,27.5,No
3,Afghanistan,South Asia,2011,3.831719,7.581259,0.521104,51.400002,0.495901,0.163571,0.731109,0.479835,0.267175,0.307386,52.1,Yes,49.6,No,73.1,Yes,48.0,No,26.7,No
4,Afghanistan,South Asia,2012,3.782938,7.660506,0.520637,51.700001,0.530935,0.237588,0.77562,0.613513,0.267919,0.43544,52.1,Yes,53.1,Yes,77.6,Yes,61.4,Yes,26.8,No
