# Cleaning Happiness Index Dataset

### Data Sources
- Happiness Index dataset: https://worldhappiness.report/data/

In [1]:
import pandas as pd
import country_converter as coco  # used below to standardize country names

### File Location:
First, load the World Happiness Report 2023 dataset (`DataForTable2.1WHR2023.csv`) from the `raw data` folder.

In [2]:
df_csv = pd.read_csv(r'raw data/DataForTable2.1WHR2023.csv')
df_csv

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.350,0.451,50.500,0.718,0.168,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.800,0.679,0.191,0.850,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.100,0.600,0.121,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.400,0.496,0.164,0.731,0.480,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.700,0.531,0.238,0.776,0.614,0.268
...,...,...,...,...,...,...,...,...,...,...,...
2194,Zimbabwe,2018,3.616,7.783,0.775,52.625,0.763,-0.051,0.844,0.658,0.212
2195,Zimbabwe,2019,2.694,7.698,0.759,53.100,0.632,-0.047,0.831,0.658,0.235
2196,Zimbabwe,2020,3.160,7.596,0.717,53.575,0.643,0.006,0.789,0.661,0.346
2197,Zimbabwe,2021,3.155,7.657,0.685,54.050,0.668,-0.076,0.757,0.610,0.242


### Renaming Columns:
- Rename the `Country name` column to `country` for consistency with the suicides dataset

In [3]:
df = df_csv.rename(columns = {'Country name': 'country'})
df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.35,0.451,50.5,0.718,0.168,0.882,0.414,0.258
1,Afghanistan,2009,4.402,7.509,0.552,50.8,0.679,0.191,0.85,0.481,0.237
2,Afghanistan,2010,4.758,7.614,0.539,51.1,0.6,0.121,0.707,0.517,0.275
3,Afghanistan,2011,3.832,7.581,0.521,51.4,0.496,0.164,0.731,0.48,0.267
4,Afghanistan,2012,3.783,7.661,0.521,51.7,0.531,0.238,0.776,0.614,0.268


### Standardize Country Names:
We need to standardize all country names so that they match the country names in the suicides dataset.
- Remove the entries for `Somaliland region` because `country_converter` cannot distinguish between Somaliland and Somalia
- Convert all remaining country names to their short standard form (`name_short`) using the `country_converter` package (https://pypi.org/project/country-converter/)

In [4]:
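# Take an explicit copy so the later column assignment does not raise SettingWithCopyWarning
df = df[~df['country'].isin(['Somaliland region'])].copy()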

In [5]:
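# Convert each unique name once, then map the results back onto the column
countries = df['country'].unique()
standard_country_names = coco.convert(names = countries,
                                      to = 'name_short')
# Pair every original name with its standardized form
country_fix_dict = dict(zip(countries, standard_country_names))
df['country'] = df['country'].replace(country_fix_dict)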
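
Name conversion can fail quietly, so it may be worth inspecting the mapping before overwriting the column. The sketch below is only an illustration: `toy` and `toy_mapping` are hypothetical stand-ins for `df` and `country_fix_dict`, and it assumes that unmatched names come back as the string `'not found'` (which we believe is the `country_converter` default, but this should be checked against the installed version). It lists the unmatched names, leaves them unchanged, and checks that no two rows collapse into the same country and year.

```python
import pandas as pd

# Toy data standing in for the happiness table (values are illustrative only).
toy = pd.DataFrame({
    'country': ['Russian Federation', 'Viet Nam', 'Atlantis', 'Russian Federation'],
    'year': [2020, 2020, 2020, 2021],
})

# Hypothetical stand-in for dict(zip(countries, coco.convert(...))).
# Assumption: unmatched names are returned as the string 'not found'.
toy_mapping = {
    'Russian Federation': 'Russia',
    'Viet Nam': 'Vietnam',
    'Atlantis': 'not found',
}

# Names the converter could not resolve; these need a manual review.
unmatched = sorted(name for name, std in toy_mapping.items() if std == 'not found')

# Keep unmatched names as they are instead of overwriting them with 'not found'.
safe_mapping = {name: std for name, std in toy_mapping.items() if std != 'not found'}
toy['country'] = toy['country'].replace(safe_mapping)

# After standardizing, each (country, year) pair should still be unique.
n_duplicates = int(toy.duplicated(subset=['country', 'year']).sum())

print(unmatched)
print(n_duplicates)
```

If `unmatched` is not empty, those names can either be dropped, as was done for `Somaliland region`, or mapped by hand.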

### Exporting to .csv File:

In [6]:
path_export = r'cleaned data/Happiness_Index.csv'
df.to_csv(path_export)
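
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. If that extra column is not wanted downstream, passing `index=False` avoids it. The sketch below shows the difference on a small toy frame; `toy` and `round_trip` are hypothetical names, and the data is written to an in-memory buffer rather than to `cleaned data/Happiness_Index.csv`.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned happiness table.
toy = pd.DataFrame({'country': ['Afghanistan', 'Zimbabwe'], 'year': [2008, 2021]})

def round_trip(frame, **to_csv_kwargs):
    """Write a frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(toy)                  # default: the index is written
without_index = round_trip(toy, index=False)  # the index is left out

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on how the suicides dataset is merged later; the export above keeps the default behaviour.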