**TRANSFORMATIONS - files/World_Happiness_Report.csv**

__________

**Where is this data set from?**

- The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. This dataset is a subset of the larger report, which includes data from various sources such as the Gallup World Poll and other national surveys. The data was extracted from the World Happiness Report and made available for public use. However, the original data was collected by various researchers and organizations as part of their ongoing efforts to measure and understand happiness and well-being around the world.

    We use observed data on the six variables and estimates of their associations with life evaluations to explain the variation across countries, much as epidemiologists estimate the extent to which life expectancy is affected by factors such as smoking, exercise, and diet. The six variables are GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption. Our happiness rankings are not based on any index of these six factors; the scores are instead based on individuals' own assessments of their lives, in particular their answers to the single-item Cantril ladder life-evaluation question.

Detailed information about each of the Predictors:

1. **Log GDP per capita** is in terms of Purchasing Power Parity (PPP) adjusted to a constant 2017 international dollars, taken from the World Development Indicators (WDI) by the World Bank (version 17, metadata last updated on January 22, 2023). See Statistical Appendix 1 for more details. GDP data for 2022 are not yet available, so we extend the GDP time series from 2021 to 2022 using country-specific forecasts of real GDP growth from the OECD Economic Outlook No. 112 (November 2022) or, if missing, from the World Bank's Global Economic Prospects (last updated: January 10, 2023), after adjustment for population growth. The equation uses the natural log of GDP per capita, as this form fits the data significantly better than GDP per capita.

2. The time series for **Healthy life expectancy at birth** is constructed based on data from the World Health Organization (WHO) Global Health Observatory data repository, with data available for 2005, 2010, 2015, 2016, and 2019. To match this report's sample period (2005-2022), interpolation and extrapolation are used. See Statistical Appendix 1 for more details.

3. **Social support** - *Conversion: % and yes/no*

    **Social support** (0-1) is the national average of the binary responses (0=no, 1=yes) to the Gallup World Poll (GWP) question "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"

4.  **Freedom to make life choices** - *Conversion: % and yes/no*

    **Freedom to make life choices** (0-1) is the national average of binary responses to the GWP question "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"

5. **Generosity** is the residual of regressing the national average of GWP responses to the donation question "Have you donated money to a charity in the past month?" on log GDP per capita.

6.  **Perceptions of corruption** - *Conversion: % and yes/no*

    **Perceptions of corruption** (0-1) are the average of binary answers to two GWP questions: "Is corruption widespread throughout the government or not?" and "Is corruption widespread within businesses or not?" Where data for government corruption are missing, the perception of business corruption is used as the overall corruption perception measure.

7. **Positive affect** is defined as the average of previous-day affect measures for laughter, enjoyment, and interest. The inclusion of interest (first added for World Happiness Report 2022) gives us three components in each of positive and negative affect, and slightly improves the equation fit in column 4. The general form for the affect questions is: Did you experience the following feelings during a lot of the day yesterday?

8. **Negative affect** is defined as the average of previous-day affect measures for worry, sadness, and anger.

9. **Life ladder**: Life evaluations from the Gallup World Poll provide the basis for the annual happiness rankings. They are based on answers to the main life evaluation question. The Cantril ladder asks respondents to think of a ladder, with the **best possible life for them being a 10 and the worst possible life being a 0**. They are then asked to rate their own current lives **on a 0 to 10 scale**. The rankings are from nationally representative samples over three years.

10. **Confidence in National Government**: The "Confidence in National Government" variable in the World Happiness Report is calculated based on the following question asked in the Gallup World Poll:

    "Do you have confidence in the national government?"

    Respondents are given the following options to choose from:

    - "Yes, always"
    - "Yes, sometimes"
    - "No, rarely"
    - "No, never"
    - "Don't know"

    **The variable is calculated as the percentage of respondents who answer "Yes, always" or "Yes, sometimes" to this question.**

    This variable is one of several social indicators published alongside each country's happiness score. As noted above, the score itself comes from respondents' Cantril ladder answers; factors such as income, social support, life expectancy, freedom to make life choices, generosity, and perceptions of corruption are used to explain differences in that score, not to compute it.
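The definition of **Generosity** in item 5 (a regression residual) can be illustrated with a minimal sketch. The numbers below are hypothetical toy values, not WHR data, and the single-predictor least-squares fit is only an assumption about the simplest form such a regression could take.

```python
# Hypothetical toy data: log GDP per capita and the share of
# respondents who donated to charity in the past month.
log_gdp = [7.0, 8.0, 9.0, 10.0, 11.0]
donated = [0.30, 0.20, 0.35, 0.40, 0.50]


def ols_residuals(x, y):
    """Return the residuals of a simple least-squares fit y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Generosity = donation share left unexplained by log GDP per capita.
generosity = ols_residuals(log_gdp, donated)
print([round(g, 2) for g in generosity])
```

Because residuals of a fit with an intercept sum to zero, some values are necessarily negative. A negative `Generosity` therefore means "less giving than the country's income level would predict", not a negative donation rate.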

In [None]:
# Data handling
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Visualization
# ------------------------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns

# Normality tests (Shapiro-Wilk, Kolmogorov-Smirnov)
# ------------------------------------------------------------------------------
from scipy.stats import shapiro, kstest

# Support scripts
# -------------------------------------------------
from scripts.tolookandcompare import to_doc_info, to_doc_headtail, transform_info, transform_headtail

from scripts import soporte_eda as sp_eda
from scripts.soporte_eda import resumen_df

# Warning handling
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")

# Configuration
# -----------------------------------------------------------------------
pd.set_option('display.max_columns', None)  # show every column of the DataFrames


**TRANSFORMATION of `Country Name` when loading the .csv**

Insights: 
- The `Country Name` column has to be converted from float to object. The large number of nulls is caused by this.
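As a minimal sketch of the fix described above, the `dtype` argument of `pd.read_csv` can pin a column's type at load time. The two-row CSV text here is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for World_Happiness_Report.csv.
csv_text = (
    "Country Name,Year,Life Ladder\n"
    "Afghanistan,2008,3.72\n"
    "Albania,2009,5.49\n"
)

# Force 'Country Name' to be read as object (string) instead of letting
# pandas infer the type.
df_demo = pd.read_csv(io.StringIO(csv_text), dtype={"Country Name": "object"})

print(df_demo.dtypes)
print(df_demo["Country Name"].isna().sum())  # nulls in the forced column
```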

In [None]:
# Reload the df, forcing the dtype conversion
# df = pd.read_csv ('files/World_Happiness_Report.csv') - the original load read 'Country Name' as a mix of float and other types
df = pd.read_csv('data/raw/World_Happiness_Report.csv', dtype={'Country Name': 'object'})

df.head(2)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545


In [3]:
df.sample(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
94,Austria,Western Europe,2010,7.302679,10.855984,0.914193,69.900002,0.89598,0.126924,0.546145,0.710302,0.155793,0.486447
1078,Laos,Southeast Asia,2011,4.70375,8.537691,0.690878,57.779999,0.881634,0.456966,0.587322,0.74624,0.225278,0.981804
241,Botswana,Sub-Saharan Africa,2022,3.435275,9.629346,0.750399,54.724998,0.739403,-0.214621,0.83094,0.623351,0.286919,
2174,Zambia,Sub-Saharan Africa,2014,4.345837,8.12443,0.706223,52.040001,0.811825,-0.011231,0.808841,0.638976,0.327384,0.606339
381,China,East Asia,2006,4.560495,8.696139,0.747011,65.660004,,,,0.657659,0.16958,


In [4]:
df['Country Name'].dtype

dtype('O')

In [5]:
df['Country Name']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
2194       Zimbabwe
2195       Zimbabwe
2196       Zimbabwe
2197       Zimbabwe
2198       Zimbabwe
Name: Country Name, Length: 2199, dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2199 entries, 0 to 2198
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Country Name                       2199 non-null   object 
 1   Regional Indicator                 2087 non-null   object 
 2   Year                               2199 non-null   int64  
 3   Life Ladder                        2199 non-null   float64
 4   Log GDP Per Capita                 2179 non-null   float64
 5   Social Support                     2186 non-null   float64
 6   Healthy Life Expectancy At Birth   2145 non-null   float64
 7   Freedom To Make Life Choices       2166 non-null   float64
 8   Generosity                         2126 non-null   float64
 9   Perceptions Of Corruption          2083 non-null   float64
 10  Positive Affect                    2175 non-null   float64
 11  Negative Affect                    2183 non-null   float

____

In [7]:
df['Country Name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Estonia', 'Eswatini', 'Ethiopia', 'Finland',
       'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guyana', 'Haiti', 'Honduras',
       'Hong Kong S.A.R. of China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Ivory Coast', 'Jamaica', 'Japan', 

In [8]:
df['Country Name'].value_counts()

Country Name
Argentina     17
Costa Rica    17
Brazil        17
Bolivia       17
Bangladesh    17
              ..
Cuba           1
Maldives       1
Guyana         1
Oman           1
Suriname       1
Name: count, Length: 165, dtype: int64

____

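#### - **Modifying the `Social Support` column:** 

Insights: 
- According to the initial information, `Social Support` is the national average of the binary responses (0=no, 1=yes), but the data are continuous from 0 to 1. This is expected: averaging many 0/1 answers gives the share of "yes" answers, which can be any value between 0 and 1.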
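A small worked example of why an average of binary answers is continuous. The eight responses below are hypothetical, not taken from the Gallup World Poll.

```python
# Hypothetical answers for one country-year: 1 = yes, 0 = no.
responses = [1, 1, 0, 1, 0, 1, 1, 1]

# The national average of 0/1 answers is simply the share of "yes" answers.
social_support = sum(responses) / len(responses)
print(social_support)  # 0.75
```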

In [9]:
# CODE FOR THE SOCIAL SUPPORT COLUMN: FILL NULLS, CONVERT TO % AND CREATE A NEW COLUMN WITH BINARY DATA


# 1. Compute the global median
mediana_global = df['Social Support'].median()
print(f"\n💡 Global median computed: {mediana_global:.4f}")

# 2. Imputation (fill NaNs)
# Overwrite the original 'Social Support' column with the NaNs filled in
df['Social Support'] = df['Social Support'].fillna(mediana_global)

# 3. Define the conversion function
def clasificar_soporte_social(valor_soporte):
    """
    Classify a float social-support value as 'yes' or 'no'.
    Rule: 'yes' if value >= 0.5 (threshold), 'no' if value < 0.5.
    """
    umbral = 0.5

    # NaNs are assumed to have been handled in step 2
    if valor_soporte >= umbral:
        return "yes"
    else:
        return "no"

# 4. Apply the function to the imputed column
# NOTE: the binary classification is stored in the new column 'Social_Support_binary'.
df['Social_Support_binary'] = df['Social Support'].apply(clasificar_soporte_social)

# 5. Convert to percentage and rename

# First: convert to percentage (multiply by 100 and format as a string with two decimals)
df['Social Support'] = df['Social Support'].apply(lambda x: f"{x * 100:.2f}")

# Second: rename the 'Social Support' column
df.rename(columns={'Social Support': 'Social Support (%)'}, inplace=True)


💡 Global median computed: 0.8355


In [10]:
df.head(2)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support (%),Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government,Social_Support_binary
0,Afghanistan,South Asia,2008,3.72359,7.350416,45.07,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072,no
1,Afghanistan,South Asia,2009,4.401778,7.508646,55.23,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545,yes
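The cell above stores the percentage as a formatted string. If a numeric column is preferred for later plots or statistics, a vectorized variant could look like the sketch below. It is an alternative under that assumption, not the notebook's code; the small `demo` frame is hypothetical and reuses the two Afghanistan values plus one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: two observed values and one missing value.
demo = pd.DataFrame({"Social Support": [0.450662, 0.552308, np.nan]})

# Steps 1-2: impute missing values with the global median.
median = demo["Social Support"].median()
demo["Social Support"] = demo["Social Support"].fillna(median)

# Steps 3-4: yes/no flag with the same 0.5 threshold, without a row-wise apply.
demo["Social_Support_binary"] = np.where(demo["Social Support"] >= 0.5, "yes", "no")

# Step 5: percentage kept as float rather than string.
demo["Social Support (%)"] = (demo["Social Support"] * 100).round(2)

print(demo)
```

Keeping the percentage as a float means `describe()`, sorting, and plotting keep working on the column without converting it back from text.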


___

#### - **Modifying the `Freedom To Make Life Choices` column:** 

Insights:
- According to the initial information, `Freedom To Make Life Choices` is the national average of the binary responses to the GWP question "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"

- There are 33 nulls in total

In [11]:
# NULLS - 33
df['Freedom To Make Life Choices'].isna().sum()

np.int64(33)

#### Fill the nulls in the `Freedom To Make Life Choices` column with the median(?): 

In [12]:
# CODE FOR THE 'Freedom To Make Life Choices' COLUMN: FILL NULLS, CONVERT TO % AND CREATE A NEW COLUMN WITH BINARY DATA


# 1. Compute the global median
mediana_global_freedom = df['Freedom To Make Life Choices'].median()
print(f"\n💡 Global median computed: {mediana_global_freedom:.4f}")

# 2. Imputation (fill NaNs)
# Overwrite the original 'Freedom To Make Life Choices' column with the NaNs filled in
df['Freedom To Make Life Choices'] = df['Freedom To Make Life Choices'].fillna(mediana_global_freedom)

# 3. Define the conversion function
def clasificar_columna_freedom(valor_soporte):
    """
    Classify a float value of satisfaction with freedom to make life choices as 'yes' or 'no'.
    Rule: 'yes' if value >= 0.5 (threshold), 'no' if value < 0.5.
    """
    umbral = 0.5

    # NaNs are assumed to have been handled in step 2
    if valor_soporte >= umbral:
        return "yes"
    else:
        return "no"

# 4. Apply the function to the imputed column
# NOTE: the binary classification is stored in the new column 'Freedom_Satisfied'.
df['Freedom_Satisfied'] = df['Freedom To Make Life Choices'].apply(clasificar_columna_freedom)

# 5. Convert to percentage and rename

# First: convert to percentage (multiply by 100 and format as a string with two decimals)
df['Freedom To Make Life Choices'] = df['Freedom To Make Life Choices'].apply(lambda x: f"{x * 100:.2f}")

# Second: rename the 'Freedom To Make Life Choices' column
df.rename(columns={'Freedom To Make Life Choices': 'Freedom To Make Life Choices (%)'}, inplace=True)


💡 Global median computed: 0.7698


In [13]:
df.head(2)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support (%),Healthy Life Expectancy At Birth,Freedom To Make Life Choices (%),Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government,Social_Support_binary,Freedom_Satisfied
0,Afghanistan,South Asia,2008,3.72359,7.350416,45.07,50.5,71.81,0.167652,0.881686,0.414297,0.258195,0.612072,no,yes
1,Afghanistan,South Asia,2009,4.401778,7.508646,55.23,50.799999,67.89,0.190809,0.850035,0.481421,0.237092,0.611545,yes,yes
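The "(?)" in the heading leaves the choice of a global median open. One possible alternative, which is not what the notebook does, is to impute with each country's own median and fall back to the global median only when a country has no observed values. The frame below is a hypothetical toy example.

```python
import numpy as np
import pandas as pd

col = "Freedom To Make Life Choices"

# Hypothetical toy data: country C has no observed value at all.
demo = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B", "C"],
    col: [0.60, np.nan, 0.80, 0.40, 0.50, np.nan],
})

# Per-row median of the row's own country (NaN if the country has no data).
country_median = demo.groupby("Country Name")[col].transform("median")

# Fill with the country median first, then with the global median.
global_median = demo[col].median()
demo[col] = demo[col].fillna(country_median).fillna(global_median)

print(demo)
```

A per-country fill keeps imputed values close to that country's other years, whereas a single global median pulls every missing row toward the middle of the worldwide distribution.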


_____

#### Modifying the `Perceptions Of Corruption` column: 

Insights:
- `Perceptions Of Corruption` is the average of binary answers to two GWP questions: "Is corruption widespread throughout the government or not?" and "Is corruption widespread within businesses or not?" Where data for government corruption are missing, the perception of business corruption is used as the overall corruption perception measure.

- `Confidence In National Government`: respondents can choose from the following options:

    - "Yes, always"
    - "Yes, sometimes"
    - "No, rarely"
    - "No, never"
    - "Don't know"

    The variable is calculated as the percentage of respondents who answer "Yes, always" or "Yes, sometimes" to this question.

- There are 116 nulls

In [14]:
# NULLS - 116
df['Perceptions Of Corruption'].isna().sum()

np.int64(116)

___

#### Fill the nulls in the `Perceptions Of Corruption` column with the median(?): 

In [15]:
# CODE FOR THE 'Perceptions Of Corruption' COLUMN: FILL NULLS, CONVERT TO % AND CREATE A NEW COLUMN WITH BINARY DATA


# 1. Compute the global median
mediana_global_corruption = df['Perceptions Of Corruption'].median()
print(f"\n💡 Global median computed: {mediana_global_corruption:.4f}")

# 2. Imputation (fill NaNs)
# Overwrite the original 'Perceptions Of Corruption' column with the NaNs filled in
df['Perceptions Of Corruption'] = df['Perceptions Of Corruption'].fillna(mediana_global_corruption)

# 3. Define the conversion function
def clasificar_columna_corruption(valor_soporte):
    """
    Classify a float perception-of-corruption value as 'yes' or 'no'.
    Rule: 'yes' if value >= 0.5 (threshold), 'no' if value < 0.5.
    """
    umbral = 0.5

    # NaNs are assumed to have been handled in step 2
    if valor_soporte >= umbral:
        return "yes"
    else:
        return "no"

# 4. Apply the function to the imputed column
# NOTE: the binary classification is stored in the new column 'Perceptions_final'.
df['Perceptions_final'] = df['Perceptions Of Corruption'].apply(clasificar_columna_corruption)

# 5. Convert to percentage and rename

# First: convert to percentage (multiply by 100 and format as a string with two decimals)
df['Perceptions Of Corruption'] = df['Perceptions Of Corruption'].apply(lambda x: f"{x * 100:.2f}")

# Second: rename the 'Perceptions Of Corruption' column
df.rename(columns={'Perceptions Of Corruption': 'Perceptions Of Corruption (%)'}, inplace=True)


💡 Global median computed: 0.7997


In [16]:
df.head(2)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support (%),Healthy Life Expectancy At Birth,Freedom To Make Life Choices (%),Generosity,Perceptions Of Corruption (%),Positive Affect,Negative Affect,Confidence In National Government,Social_Support_binary,Freedom_Satisfied,Perceptions_final
0,Afghanistan,South Asia,2008,3.72359,7.350416,45.07,50.5,71.81,0.167652,88.17,0.414297,0.258195,0.612072,no,yes,yes
1,Afghanistan,South Asia,2009,4.401778,7.508646,55.23,50.799999,67.89,0.190809,85.0,0.481421,0.237092,0.611545,yes,yes,yes


#### - **Modifying the `Positive Affect` and `Negative Affect` columns: (below - ChatGPT)** 

#### Fill the nulls in the `Positive Affect` and `Negative Affect` columns with the median(?): 

In [17]:
def transformar_columna_binaria(df, columna):
    """
    For WHR-style binary columns:
    1. Compute the global median
    2. Impute the NaNs with the median
    3. Create a (Yes/No) column
    4. Create a percentage column
    5. Rename the original column with an (Original) suffix
    """

    print(f"\nProcessing column: {columna}")

    # 1. Compute the global median
    mediana = df[columna].median()
    print(f"💡 Global median computed: {mediana:.4f}")

    # 2. Impute missing values
    df[columna] = df[columna].fillna(mediana)

    # 3. Create the yes/no column
    df[f"{columna} (Yes/No)"] = df[columna].apply(
        lambda x: "yes" if x >= 0.5 else "no"
    )

    # 4. Create the percentage column (numeric float, not string)
    df[f"{columna} (%)"] = (df[columna] * 100).round(2)

    # 5. Keep the original column or drop it (your teammate replaced it)
    # To follow her style, we rename the original
    df.rename(columns={columna: f"{columna} (Original)"}, inplace=True)

    print(f"✔ Transformations completed for {columna}")


In [18]:
transformar_columna_binaria(df, "Positive Affect")
transformar_columna_binaria(df, "Negative Affect")



Processing column: Positive Affect
💡 Global median computed: 0.6631
✔ Transformations completed for Positive Affect

Processing column: Negative Affect
💡 Global median computed: 0.2607
✔ Transformations completed for Negative Affect


#### **SAVE the new .csv "World Happiness Report limpio imputar mediana":** 

In [None]:
df.to_csv('files/World_Happiness_Report_limpio_imputar_original.csv', index=False)

In [20]:
df.head(5)

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support (%),Healthy Life Expectancy At Birth,Freedom To Make Life Choices (%),Generosity,Perceptions Of Corruption (%),Positive Affect (Original),Negative Affect (Original),Confidence In National Government,Social_Support_binary,Freedom_Satisfied,Perceptions_final,Positive Affect (Yes/No),Positive Affect (%),Negative Affect (Yes/No),Negative Affect (%)
0,Afghanistan,South Asia,2008,3.72359,7.350416,45.07,50.5,71.81,0.167652,88.17,0.414297,0.258195,0.612072,no,yes,yes,no,41.43,no,25.82
1,Afghanistan,South Asia,2009,4.401778,7.508646,55.23,50.799999,67.89,0.190809,85.0,0.481421,0.237092,0.611545,yes,yes,yes,no,48.14,no,23.71
2,Afghanistan,South Asia,2010,4.758381,7.6139,53.91,51.099998,60.01,0.121316,70.68,0.516907,0.275324,0.299357,yes,yes,yes,yes,51.69,no,27.53
3,Afghanistan,South Asia,2011,3.831719,7.581259,52.11,51.400002,49.59,0.163571,73.11,0.479835,0.267175,0.307386,yes,no,yes,no,47.98,no,26.72
4,Afghanistan,South Asia,2012,3.782938,7.660506,52.06,51.700001,53.09,0.237588,77.56,0.613513,0.267919,0.43544,yes,yes,yes,yes,61.35,no,26.79
