**TRANSFORMATIONS - files/World_Happiness_Report.csv**

*Imputing nulls; new percentage (%) and binary (yes/no) columns*

__________

**Where is this data set from?**

- The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. This dataset is a subset of the larger report, which includes data from various sources such as the Gallup World Poll and other national surveys. The data was extracted from the World Happiness Report and made available for public use. However, the original data was collected by various researchers and organizations as part of their ongoing efforts to measure and understand happiness and well-being around the world.

    We use observed data on the six variables and estimates of their associations with life evaluations to explain the variation across countries, much as epidemiologists estimate the extent to which life expectancy is affected by factors such as smoking, exercise, and diet. The six variables are GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption. Our happiness rankings are not based on any index of these six factors; the scores are instead based on individuals' own assessments of their lives, in particular their answers to the single-item Cantril ladder life-evaluation question.

Detailed information about each of the Predictors:

1. **Log GDP per capita** is in terms of Purchasing Power Parity (PPP) adjusted to constant 2017 international dollars, taken from the World Development Indicators (WDI) by the World Bank (version 17, metadata last updated on January 22, 2023). See Statistical Appendix 1 for more details. GDP data for 2022 are not yet available, so we extend the GDP time series from 2021 to 2022 using country-specific forecasts of real GDP growth from the OECD Economic Outlook No. 112 (November 2022) or, if missing, from the World Bank's Global Economic Prospects (last updated: January 10, 2023), after adjustment for population growth. The equation uses the natural log of GDP per capita, as this form fits the data significantly better than GDP per capita.

2. The time series for **Healthy life expectancy at birth** is constructed based on data from the World Health Organization (WHO) Global Health Observatory data repository, with data available for 2005, 2010, 2015, 2016, and 2019. To match this report's sample period (2005-2022), interpolation and extrapolation are used. See Statistical Appendix 1 for more details.

3. **Social support** - *Conversion: % and yes/no; impute with the median*

    **Social support** (0-1) is the national average of the binary responses (0=no, 1=yes) to the Gallup World Poll (GWP) question "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"

4. **Freedom to make life choices** - *Conversion: % and yes/no; impute with the median*

    **Freedom to make life choices** (0-1) is the national average of binary responses to the GWP question "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"

5. **Generosity** is the residual of regressing the national average of GWP responses to the donation question "Have you donated money to a charity in the past month?" on log GDP per capita.

6. **Perceptions of corruption** - *Conversion: % and yes/no; impute with the median*

    **Perceptions of corruption** (0-1) are the average of binary answers to two GWP questions: "Is corruption widespread throughout the government or not?" and "Is corruption widespread within businesses or not?" Where data for government corruption are missing, the perception of business corruption is used as the overall corruption perception measure.

7. **Positive affect** - *Conversion: % and yes/no; impute with the median*

    **Positive affect** is defined as the average of previous-day affect measures for laughter, enjoyment, and interest. The inclusion of interest (first added for World Happiness Report 2022) gives us three components in each of positive and negative affect, and slightly improves the equation fit in column 4. The general form for the affect questions is: "Did you experience the following feelings during a lot of the day yesterday?"

8. **Negative affect** - *Conversion: % and yes/no; impute with the median*

    **Negative affect** is defined as the average of previous-day affect measures for worry, sadness, and anger.

9. **Life ladder**: Life evaluations from the Gallup World Poll provide the basis for the annual happiness rankings. They are based on answers to the main life evaluation question. The Cantril ladder asks respondents to think of a ladder, with the **best possible life for them being a 10 and the worst possible life being a 0**. They are then asked to rate their own current lives **on a 0 to 10 scale**. The rankings are from nationally representative samples over three years.

10. **Confidence in National Government**: The "Confidence in National Government" variable in the World Happiness Report is calculated based on the following question asked in the Gallup World Poll:

    "Do you have confidence in the national government?"

    Respondents are given the following options to choose from:

    - "Yes, always"
    - "Yes, sometimes"
    - "No, rarely"
    - "No, never"
    - "Don't know"

    **The variable is calculated as the percentage of respondents who answer "Yes, always" or "Yes, sometimes" to this question.**

    This variable is reported alongside the other social factors in the World Happiness Report. As noted above, the happiness score itself comes from respondents' Cantril ladder answers; factors such as income, social support, life expectancy, freedom to make life choices, generosity, and perceptions of corruption are used to explain differences in that score, not to compute it.
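
Item 5 above defines Generosity as a regression residual, which is why that column can take negative values (as in several of the sample rows further down) and is not on a 0-1 scale. A minimal sketch of the idea, using plain ordinary least squares on made-up numbers; the helper `ols_residuals` and both input lists are hypothetical and are not taken from the report:

```python
from statistics import mean


def ols_residuals(x, y):
    """Residuals of a simple least-squares fit y ~ intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the value predicted from x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical country-level inputs (illustration only)
log_gdp = [7.5, 8.5, 9.5, 10.5]            # log GDP per capita
donation_share = [0.20, 0.25, 0.45, 0.50]  # share answering "yes" to the donation question

generosity = ols_residuals(log_gdp, donation_share)
print([round(g, 3) for g in generosity])
```

A positive residual means a country donates more than its income level would predict, a negative one means less; by construction the residuals of such a fit sum to zero.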

In [1]:
# Data handling
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Visualization
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Normality tests (Shapiro-Wilk, Kolmogorov-Smirnov)
# ------------------------------------------------------------------------------
from scipy.stats import shapiro, kstest

# Support scripts
# -------------------------------------------------
from src.tolookandcompare import to_doc_info, to_doc_headtail, transform_info, transform_headtail

from src import soporte_eda as sp_eda
from src.soporte_eda import resumen_df

# Warning handling
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")

# Configuration
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None) # show every column when displaying DataFrames


**TRANSFORMATION of `Country Name` when loading the .csv**

Insights:
- The `Country Name` column has to be converted from float to object. This is the reason for the large number of nulls.
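
The effect of forcing a dtype at load time can be sketched on a tiny in-memory CSV. The column name `code` and its values are invented for illustration and do not come from the actual file; the point is only that `pd.read_csv` infers a numeric dtype for number-like text unless `dtype=` says otherwise:

```python
import io

import pandas as pd

# Hypothetical CSV content: the first column looks numeric to the parser
csv_text = "code,Year\n1.0,2008\n2.0,2009\n"

# Default behaviour: pandas infers float64 for 'code'
inferred = pd.read_csv(io.StringIO(csv_text))

# Forced behaviour: 'code' is kept as text (object dtype)
forced = pd.read_csv(io.StringIO(csv_text), dtype={"code": "object"})

print(inferred["code"].dtype)
print(forced["code"].dtype)
print(forced["code"].tolist())
```

Only the column named in the `dtype` mapping is affected; the remaining columns are still inferred as usual.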

In [2]:
# Reload the df, forcing the dtype of the column
# df = pd.read_csv('files/World_Happiness_Report.csv') - originally 'Country Name' was read with a mix of float and other types
df = pd.read_csv('files/World_Happiness_Report.csv', dtype={'Country Name': 'object'})

df.head(2)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545


In [3]:
df.sample(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
530,Dominican Republic,Latin America and Caribbean,2010,4.735021,9.447546,0.859969,64.400002,0.823903,-0.077425,0.779742,0.706626,0.281695,0.450314
1559,Philippines,Southeast Asia,2014,5.31255,8.841846,0.8133,61.84,0.902186,-0.017416,0.787219,0.787263,0.334037,0.687083
985,Jordan,Middle East and North Africa,2011,5.539328,9.383053,0.877919,66.879997,0.759565,-0.152627,,0.550934,0.260324,
637,Finland,Western Europe,2022,7.728998,10.814193,0.974395,71.224998,0.958609,0.102147,0.190207,0.741323,0.191473,
684,Georgia,Commonwealth of Independent States,2019,4.891836,9.615089,0.674976,64.699997,0.810534,-0.262912,0.647223,0.502835,0.24371,0.40521


In [4]:
df['Country Name'].dtype

dtype('O')

In [5]:
df['Country Name']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
2194       Zimbabwe
2195       Zimbabwe
2196       Zimbabwe
2197       Zimbabwe
2198       Zimbabwe
Name: Country Name, Length: 2199, dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2199 entries, 0 to 2198
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Country Name                       2199 non-null   object 
 1   Regional Indicator                 2087 non-null   object 
 2   Year                               2199 non-null   int64  
 3   Life Ladder                        2199 non-null   float64
 4   Log GDP Per Capita                 2179 non-null   float64
 5   Social Support                     2186 non-null   float64
 6   Healthy Life Expectancy At Birth   2145 non-null   float64
 7   Freedom To Make Life Choices       2166 non-null   float64
 8   Generosity                         2126 non-null   float64
 9   Perceptions Of Corruption          2083 non-null   float64
 10  Positive Affect                    2175 non-null   float64
 11  Negative Affect                    2183 non-null   float

____

In [7]:
df['Country Name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Estonia', 'Eswatini', 'Ethiopia', 'Finland',
       'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guyana', 'Haiti', 'Honduras',
       'Hong Kong S.A.R. of China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Ivory Coast', 'Jamaica', 'Japan', 

In [8]:
df['Country Name'].value_counts()

Country Name
Argentina     17
Costa Rica    17
Brazil        17
Bolivia       17
Bangladesh    17
              ..
Cuba           1
Maldives       1
Guyana         1
Oman           1
Suriname       1
Name: count, Length: 165, dtype: int64

____

**TRANSFORMATION of the columns (5):**

- `Social Support`
- `Freedom To Make Life Choices`
- `Perceptions Of Corruption`
- `Positive Affect`
- `Negative Affect`

New columns:
- decimal numeric value as a % *(% of yes answers)*
- binary categorical value *(No = below 0.5, Yes = 0.5 and above)*, with categorical NaN as *"Data not available"*
- nulls imputed with the median
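
The list above mentions a "Data not available" category for missing values, but once NaNs are replaced by the median there is nothing left to label. One possible way to keep that information is to build the yes/no label before imputing. This is a sketch under that assumption; `add_yes_no_with_missing` and the toy DataFrame are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd


def add_yes_no_with_missing(df, column, threshold=0.5):
    """Label rows as yes/no, flag originally missing rows, then impute the median."""
    out = df.copy()

    # Remember which rows were missing before any imputation
    was_missing = out[column].isna()

    # yes/no by threshold; NaN compares as False, so it is overwritten below
    labels = out[column].ge(threshold).map({True: "yes", False: "no"})
    out[f"{column} (Yes/No)"] = labels.mask(was_missing, "Data not available")

    # Impute the numeric column with the global median, then derive the percentage
    out[column] = out[column].fillna(out[column].median())
    out[f"{column} (%)"] = (out[column] * 100).round(1)
    return out


toy = pd.DataFrame({"Social Support": [0.45, np.nan, 0.90]})
result = add_yes_no_with_missing(toy, "Social Support")
print(result)
```

With this ordering the numeric column is still fully imputed, while the categorical column keeps track of which values were never observed.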

In [9]:
def transformar_columna_binaria(df, columna):
    """
    Applies the same process to each binary-response column of the WHR:
    1. Reports the number of nulls.
    2. Computes the global median.
    3. Imputes the NaNs with the median.
    4. Creates a categorical (yes/no) column.
    5. Creates a numeric percentage column.
    6. Renames the original column to '(Original)'.
    """

    print(f"\n🔎 Processing column: {columna}")

    # 1. Count and show the number of nulls before imputing
    nulos = df[columna].isna().sum()
    print(f"   ➤ Nulls before imputation: {nulos}")

    # 2. Compute the global median
    mediana = df[columna].median()
    print(f"   ➤ Global median computed: {mediana:.4f}")

    # 3. Impute NaNs with the median
    df[columna] = df[columna].fillna(mediana)

    # 4. Create the yes/no column with a 0.5 threshold
    df[f"{columna} (Yes/No)"] = df[columna].apply(
        lambda x: "yes" if x >= 0.5 else "no"
    )

    # 5. Create the percentage column (numeric)
    df[f"{columna} (%)"] = (df[columna] * 100).round(1)

    # 6. Rename the imputed original column
    df.rename(columns={columna: f"{columna} (Original)"}, inplace=True)

    print(f"   ✔ Transformations completed for '{columna}'")


In [10]:
columnas_transformar = [
    "Social Support",
    "Freedom To Make Life Choices",
    "Perceptions Of Corruption",
    "Positive Affect",
    "Negative Affect"
]

In [11]:
for col in columnas_transformar:
    transformar_columna_binaria(df, col)


🔎 Processing column: Social Support
   ➤ Nulls before imputation: 13
   ➤ Global median computed: 0.8355
   ✔ Transformations completed for 'Social Support'

🔎 Processing column: Freedom To Make Life Choices
   ➤ Nulls before imputation: 33
   ➤ Global median computed: 0.7698
   ✔ Transformations completed for 'Freedom To Make Life Choices'

🔎 Processing column: Perceptions Of Corruption
   ➤ Nulls before imputation: 116
   ➤ Global median computed: 0.7997
   ✔ Transformations completed for 'Perceptions Of Corruption'

🔎 Processing column: Positive Affect
   ➤ Nulls before imputation: 24
   ➤ Global median computed: 0.6631
   ✔ Transformations completed for 'Positive Affect'

🔎 Processing column: Negative Affect
   ➤ Nulls before imputation: 16
   ➤ Global median computed: 0.2607
   ✔ Transformations completed for 'Negative Affect'


#### **SAVE the new .csv, `World_Happiness_Report_limpio_imputar_mediana.csv` (cleaned, median-imputed):**

In [12]:
df.to_csv('files/World_Happiness_Report_limpio_imputar_mediana.csv', index=False)

In [13]:
df.head(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support (Original),Healthy Life Expectancy At Birth,Freedom To Make Life Choices (Original),Generosity,Perceptions Of Corruption (Original),Positive Affect (Original),Negative Affect (Original),Confidence In National Government,Social Support (Yes/No),Social Support (%),Freedom To Make Life Choices (Yes/No),Freedom To Make Life Choices (%),Perceptions Of Corruption (Yes/No),Perceptions Of Corruption (%),Positive Affect (Yes/No),Positive Affect (%),Negative Affect (Yes/No),Negative Affect (%)
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072,no,45.1,yes,71.8,yes,88.2,no,41.4,no,25.8
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545,yes,55.2,yes,67.9,yes,85.0,no,48.1,no,23.7
2,Afghanistan,South Asia,2010,4.758381,7.6139,0.539075,51.099998,0.600127,0.121316,0.706766,0.516907,0.275324,0.299357,yes,53.9,yes,60.0,yes,70.7,yes,51.7,no,27.5
3,Afghanistan,South Asia,2011,3.831719,7.581259,0.521104,51.400002,0.495901,0.163571,0.731109,0.479835,0.267175,0.307386,yes,52.1,no,49.6,yes,73.1,no,48.0,no,26.7
4,Afghanistan,South Asia,2012,3.782938,7.660506,0.520637,51.700001,0.530935,0.237588,0.77562,0.613513,0.267919,0.43544,yes,52.1,yes,53.1,yes,77.6,yes,61.4,no,26.8
