PART 1 - DATA COLLECTION AND CLEANING

First, I read in the datasets.

The first dataframe contains demographic and socio-economic information for each country.

The second dataframe is compiled from five different CSV files (one per year, 2015 to 2019).  The final dataframe merges these two dataframes, mapping each country in a given year, along with its features, to a happiness score (for years 2015 to 2017).

In [1]:
import pandas as pd

data_dir = "https://raw.githubusercontent.com/juliandavis7/data/master/"

df_unesco = pd.read_csv(data_dir + "unesco_orig.csv")

df_2015 = pd.read_csv(data_dir + "2015.csv")
df_2016 = pd.read_csv(data_dir + "2016.csv")
df_2017 = pd.read_csv(data_dir + "2017.csv")
df_2018 = pd.read_csv(data_dir + "2018.csv")
df_2019 = pd.read_csv(data_dir + "2019.csv")
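# Keep only the two columns of interest. The .copy() calls give independent
# frames, so the in-place renames and Year assignments below do not act on slices.
df_2015 = df_2015[["Country", "Happiness Score"]].copy()
df_2016 = df_2016[["Country", "Happiness Score"]].copy()
df_2017 = df_2017[["Country", "Happiness.Score"]].copy()
df_2018 = df_2018[["Country or region", "Score"]].copy()
df_2019 = df_2019[["Country or region", "Score"]].copy()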

df_2017.rename(columns=
        {"Happiness.Score": "Happiness Score"}, 
        inplace=True)
df_2018.rename(columns=
        {"Country or region": "Country", "Score": "Happiness Score"},
        inplace=True)
df_2019.rename(columns=
        {"Country or region": "Country", "Score": "Happiness Score"},
        inplace=True)

df_2015["Year"] = 2015
df_2016["Year"] = 2016
df_2017["Year"] = 2017
df_2018["Year"] = 2018
df_2019["Year"] = 2019

df_hap = pd.concat([df_2015, df_2016, df_2017, df_2018, df_2019])
df_hap.reset_index(drop=True, inplace=True)
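
The per-year steps above (select two columns, unify the header names, add a Year column, concatenate) can also be written as one helper applied in a loop. The following is only a sketch on made-up toy frames rather than the real CSV files; `COLUMN_ALIASES`, `tidy_year`, `raw_frames` and `df_hap_sketch` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-ins for the yearly files (made-up values); the notebook itself
# loads the real files with pd.read_csv.
raw_frames = {
    2016: pd.DataFrame({"Country": ["A"], "Happiness Score": [7.0]}),
    2017: pd.DataFrame({"Country": ["B"], "Happiness.Score": [6.5]}),
    2018: pd.DataFrame({"Country or region": ["C"], "Score": [6.0]}),
}

# Header variants seen across the yearly files, mapped to one common name.
COLUMN_ALIASES = {
    "Country or region": "Country",
    "Happiness.Score": "Happiness Score",
    "Score": "Happiness Score",
}

def tidy_year(frame, year):
    """Return a copy with unified column names, two columns kept, and a Year column."""
    out = frame.rename(columns=COLUMN_ALIASES)[["Country", "Happiness Score"]].copy()
    out["Year"] = year
    return out

df_hap_sketch = pd.concat(
    [tidy_year(frame, year) for year, frame in raw_frames.items()],
    ignore_index=True,  # same effect as reset_index(drop=True) afterwards
)
print(df_hap_sketch)
```

Keeping the header variants in a single dictionary means a new yearly file with yet another header spelling only needs one extra entry.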

After reading in the data, I had to clean and reorganize it.  I started by choosing the indicators that I wanted to investigate further (the ones that I thought might be correlated with happiness).

In [4]:
df_unesco.head()

Unnamed: 0,DEMO_IND,Indicator,LOCATION,Country,TIME,Time,Value,Flag Codes,Flags
0,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,1970,1970,2.859,,
1,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,1971,1971,2.961,,
2,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,1972,1972,2.744,,
3,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,1973,1973,2.491,,
4,SP_DYN_TFRT_IN,"Fertility rate, total (births per woman)",AUS,Australia,1974,1974,2.397,,


As the display above shows, the UNESCO data is stored in long format: each row holds one indicator value for a given country and year.  In this cell, I reorganize the dataframe so that each row (observation) is a given country in a given year, which requires creating a new column for every indicator.

In [5]:
# One row per (Country, Time) pair, used as the index of the reshaped frame.
df_final = df_unesco[["Country", "Time"]].copy()
df_final.drop_duplicates(inplace=True)
df_final.set_index(["Country", "Time"], inplace=True)

# One float column per indicator, initialised to 0.0.
# Note: (Country, Time) pairs with no row for an indicator stay at 0.0, not NaN.
for indicator in df_unesco["Indicator"].unique():
    df_final[indicator] = 0.0

for i, row in df_unesco.iterrows():
    i_country = row["Country"]
    i_year = row["Time"]
    i_feature = row["Indicator"]
    i_val = row["Value"]
    df_final.at[(i_country, i_year), i_feature] = i_val
    
# features I'm interested in looking at
indicators = ["Fertility rate, total (births per woman)",
              "Life expectancy at birth, total (years)",
              "Mortality rate, infant (per 1,000 live births)",
              "Population growth (annual %)",
              "Rural population (% of total population)",
              "GDP growth (annual %)",
              "GDP (current US$)",
              "GDP per capita (current US$)",
              "GDP per capita, PPP (current international $)",
              "GDP, PPP (current international $)",
              "GNI (current LCU)",
              "GNI per capita, Atlas method (current US$)",
              "GNI per capita, PPP (current international $)",
              "Population aged 14 years or younger ",
              "Population aged 15-24 years ",
              "Population aged 25-64 years ",
              "Population aged 65 years or older ",
              "Prevalence of HIV, total (% of population ages 15-49)",
              "Poverty headcount ratio at $1.90 a day (PPP) (% of population)",
              "Total population "]
# comparing GNI to GDP shows the degree to which a nation's GDP 
# represents domestic or international activity

Unnamed: 0,DEMO_IND,LOCATION,Country,TIME,Time,Value,Flag Codes,Flags,"Fertility rate, total (births per woman)","Life expectancy at birth, total (years)",...,GNI (current LCU),"GNI per capita, Atlas method (current US$)","GNI per capita, PPP (current international $)",Population aged 14 years or younger,Population aged 15-24 years,Population aged 25-64 years,Population aged 65 years or older,"Prevalence of HIV, total (% of population ages 15-49)",Poverty headcount ratio at $1.90 a day (PPP) (% of population),Total population
186335,NY_GDP_PCAP_CN,CAN,Canada,1970,1970,4307.00759,,,2.258,72.70049,...,90492160000.0,3970.0,0.0,6440.147,3925.468,9293.993,1714.718,0.0,0.0,21374.326
186336,NY_GDP_PCAP_CN,CAN,Canada,1971,1971,4631.20116,,,2.141,73.02927,...,98776600000.0,4390.0,0.0,6358.735,4079.901,9521.813,1763.011,0.0,0.0,21723.46
186337,NY_GDP_PCAP_CN,CAN,Canada,1972,1972,5089.6864,,,1.98,72.9339,...,110457100000.0,5050.0,0.0,6290.521,4213.894,9758.405,1809.374,0.0,0.0,22072.194
186338,NY_GDP_PCAP_CN,CAN,Canada,1973,1973,5871.18762,,,1.89,73.16268,...,129509400000.0,6060.0,0.0,6223.796,4328.185,10007.184,1856.157,0.0,0.0,22415.322
186339,NY_GDP_PCAP_CN,CAN,Canada,1974,1974,6888.51792,,,1.837,73.23756,...,154553200000.0,7190.0,0.0,6140.715,4428.317,10270.428,1906.443,0.0,0.0,22745.903
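
The long-to-wide reshaping described above can also be expressed with pandas' built-in `pivot_table`. This is a minimal sketch on made-up toy data, not the notebook's actual cell; `long_df` and `wide_df` are hypothetical names. One difference from the loop is that combinations with no reported value come out as NaN instead of 0.0.

```python
import pandas as pd

# Toy long-format data: one row per (country, year, indicator) value.
long_df = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "Time": [2015, 2015, 2016, 2015],
    "Indicator": [
        "GDP growth (annual %)",
        "Total population",
        "GDP growth (annual %)",
        "GDP growth (annual %)",
    ],
    "Value": [1.5, 100.0, 2.0, 3.0],
})

# One row per (Country, Time), one column per indicator.
# aggfunc="last" keeps the last value if a combination appears more than once.
wide_df = long_df.pivot_table(
    index=["Country", "Time"],
    columns="Indicator",
    values="Value",
    aggfunc="last",
)
wide_df.columns.name = None  # drop the "Indicator" label on the column axis
print(wide_df)
```

Because the reshaping is vectorised, this avoids the row-by-row `iterrows` loop, which tends to be slow on large frames.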


As the display above shows, the dataframe is now in the desired format.  I also renamed the features to make the dataframe simpler and cleaner.

In [29]:
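# df_final is indexed by (Country, Time); reset_index() turns both levels back into columns.
df_unesco = df_final.reset_index()
df_unesco = (df_unesco[["Country", "Time"] + indicators]).copy()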
df_unesco.rename(columns={"Time": "Year",
                         "Fertility rate, total (births per woman)": "fertilityRate",
                         "Life expectancy at birth, total (years)": "lifeExpectancy",
                         "Mortality rate, infant (per 1,000 live births)": "mortalityRate",
                         "Population growth (annual %)": "popGrowth",
                         "Rural population (% of total population)": "ruralPopPct",
                         "GDP growth (annual %)": "gdpGrowthPct",
                         "GDP (current US$)": "gdpUS",
                         "GDP per capita (current US$)": "gdpPerCapitaUS",
                         "GDP per capita, PPP (current international $)": "gdpPerCapita ppp",
                         "GDP, PPP (current international $)": "gdp pppInternational",
                         "GNI (current LCU)": "gni",
                         "GNI per capita, Atlas method (current US$)": "gniPerCapita",
                         "GNI per capita, PPP (current international $)": "gniPerCapita ppp",
                         "Population aged 14 years or younger ": "pop14under",
                         "Population aged 15-24 years ": "pop15to24",
                         "Population aged 25-64 years ": "pop25to64",
                         "Population aged 65 years or older ": "pop65over",
                         "Prevalence of HIV, total (% of population ages 15-49)": "hivPct",
                         "Poverty headcount ratio at $1.90 a day (PPP) (% of population)": "povertyRatio",
                         "Total population ": "totalPop"},
                inplace=True
                )
#Comparing GNI to GDP shows the degree to which a nation's GDP represents domestic or international activity.

I noticed that many countries were named slightly differently in the two sources, so I needed to realign the country names so that the two dataframes, the UNESCO dataframe and the happiness dataframe, could be merged effectively.  This is an important step to ensure that I retain as much data as possible: the more data I have, the better my machine learning models should be able to predict.

- The first cell uses set methods to detect both the countries that are already aligned (using set intersection) and those that are misaligned (using set difference).

- The second cell redefines the country names that were misaligned.

In [30]:
unesco_countries = list(df_unesco["Country"].unique())
hap_countries = list(df_hap["Country"].unique())

def diff(li1, li2): 
    return list(set(li1) - set(li2)) 

unesco_only = diff(unesco_countries, hap_countries)
unesco_only.sort()
hap_only = diff(hap_countries, unesco_countries)
hap_only.sort()

def intersection(li1, li2):
  return list(set(li1) & set(li2))

in_both = intersection(unesco_countries, hap_countries)
print("Countries in both dataframes before alignment:", len(in_both))

Countries in both dataframes before alignment: 141
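
When going through the two "only" lists by hand, the standard library's `difflib.get_close_matches` can suggest likely counterparts for a misaligned name. This is a rough sketch on a few example names, not part of the original workflow; `unesco_only_demo`, `hap_only_demo` and `suggestions` are hypothetical names, and the 0.6 cutoff is simply the function's default.

```python
from difflib import get_close_matches

# Small illustrative samples of names that appear in only one of the two sources.
unesco_only_demo = ["Viet Nam", "Russian Federation"]
hap_only_demo = ["Vietnam", "Russia", "Venezuela"]

# For each UNESCO-only name, list the happiness-only names that look similar.
suggestions = {
    name: get_close_matches(name, hap_only_demo, n=3, cutoff=0.6)
    for name in unesco_only_demo
}

for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

String similarity only catches near-identical spellings. Pairs that differ more, such as "Russian Federation" and "Russia", may fall below the cutoff, so a manual mapping is still needed for them.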


In [31]:
# Map UNESCO country names onto the names used in the happiness data.
country_renames = {
    "Bolivia (Plurinational State of)": "Bolivia",
    "Czechia": "Czech Republic",
    "China, Hong Kong Special Administrative Region": "Hong Kong",
    # NOTE: this gives Macao a Hong Kong label. If the happiness data uses that
    # label for Hong Kong, Macao's indicators will be merged with Hong Kong's
    # score, so this mapping is worth double-checking.
    "China, Macao Special Administrative Region": "Hong Kong S.A.R., China",
    "Iran (Islamic Republic of)": "Iran",
    "Russian Federation": "Russia",
    "Republic of Moldova": "Moldova",
    "Palestine": "Palestinian Territories",
    "Republic of Korea": "South Korea",
    "Eswatini": "Swaziland",
    "Syrian Arab Republic": "Syria",
    "United Republic of Tanzania": "Tanzania",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "United States of America": "United States",
    "Venezuela (Bolivarian Republic of)": "Venezuela",
    "Viet Nam": "Vietnam",
    "Democratic Republic of the Congo": "Congo (Kinshasa)",
    "Congo": "Congo (Brazzaville)",
    "Côte d'Ivoire": "Ivory Coast",
    "Lao People's Democratic Republic": "Laos",
}

# Exact-match replacement; names not listed above are left unchanged.
df_unesco["Country"] = df_unesco["Country"].replace(country_renames)

  
unesco_countries = list(df_unesco["Country"].unique())
hap_countries = list(df_hap["Country"].unique())

in_both = intersection(unesco_countries, hap_countries)
print("Countries in both dataframes after alignment:", len(in_both))
display(df_hap.head(1))
display(df_unesco.head(1))

Countries in both dataframes after alignment: 161


Unnamed: 0,Country,Happiness Score,Year
0,Switzerland,7.587,2015


Unnamed: 0,Country,Year,fertilityRate,lifeExpectancy,mortalityRate,popGrowth,ruralPopPct,gdpGrowthPct,gdpUS,gdpPerCapitaUS,...,gni,gniPerCapita,gniPerCapita ppp,pop14under,pop15to24,pop25to64,pop65over,hivPct,povertyRatio,totalPop
186335,Canada,1970,2.258,72.70049,18.5,1.39783,24.346,3.25561,87896100000.0,4121.93281,...,90492160000.0,3970.0,0.0,6440.147,3925.468,9293.993,1714.718,0.0,0.0,21374.326


When I first tried to apply the machine learning algorithms, I noticed that some of the features I wanted to use had a few missing values.  To retain as much data as possible, I did some research and filled in these values manually rather than dropping the affected rows.  (Note: I only did this for features with fewer than 10 missing values.)

In [35]:
df_unesco.set_index(["Country", "Year"], inplace=True)

df_unesco.at[("Syria", 2015), "gdpGrowthPct"] = -6.1
df_unesco.at[("Syria", 2015), "gdpUS"] = 19090000
df_unesco.at[("Syria", 2015), "gdpPerCapitaUS"] = 890
df_unesco.at[("Syria", 2015), "gdpPerCapita ppp"] = 2900
df_unesco.at[("Syria", 2015), "gdp pppInternational"] = 61900000000
df_unesco.at[("Syria", 2015), "gniPerCapita"] = 681
df_unesco.at[("Syria", 2015), "gniPerCapita ppp"] = 2218

df_unesco.at[("Syria", 2016), "gdpGrowthPct"] = -4.0
df_unesco.at[("Syria", 2016), "gdpUS"] = 12377000
df_unesco.at[("Syria", 2016), "gdpPerCapitaUS"] = 709
df_unesco.at[("Syria", 2016), "gdpPerCapita ppp"] = 3300
df_unesco.at[("Syria", 2016), "gdp pppInternational"] = 55800000000
df_unesco.at[("Syria", 2016), "gniPerCapita"] = 377
df_unesco.at[("Syria", 2016), "gniPerCapita ppp"] = 1754

df_unesco.at[("Syria", 2017), "gdpGrowthPct"] = 1.9
df_unesco.at[("Syria", 2017), "gdpUS"] = 15183000
df_unesco.at[("Syria", 2017), "gdpPerCapitaUS"] = 890
df_unesco.at[("Syria", 2017), "gdpPerCapita ppp"] = 2800
df_unesco.at[("Syria", 2017), "gdp pppInternational"] = 50280000000
df_unesco.at[("Syria", 2017), "gniPerCapita"] = 704
df_unesco.at[("Syria", 2017), "gniPerCapita ppp"] = 2214

df_unesco.at[("Venezuela", 2015), "gdpGrowthPct"] = -6.2
df_unesco.at[("Venezuela", 2015), "gdpUS"] = 323595000000
df_unesco.at[("Venezuela", 2015), "gdpPerCapitaUS"] = 10568.1
df_unesco.at[("Venezuela", 2015), "gdpPerCapita ppp"] = 17300
df_unesco.at[("Venezuela", 2015), "gdp pppInternational"] = 531100000000
df_unesco.at[("Venezuela", 2015), "gniPerCapita"] = 11047
df_unesco.at[("Venezuela", 2015), "gniPerCapita ppp"] = 16690

df_unesco.at[("Venezuela", 2016), "gdpGrowthPct"] = -17.04
df_unesco.at[("Venezuela", 2016), "gdpUS"] = 279249000000
df_unesco.at[("Venezuela", 2016), "gdpPerCapitaUS"] = 9092.02
df_unesco.at[("Venezuela", 2016), "gdpPerCapita ppp"] = 14400
df_unesco.at[("Venezuela", 2016), "gdp pppInternational"] = 443700000000
df_unesco.at[("Venezuela", 2016), "gniPerCapita"] = 9420
df_unesco.at[("Venezuela", 2016), "gniPerCapita ppp"] = 14220

df_unesco.at[("Venezuela", 2017), "gdpGrowthPct"] = -15.67
df_unesco.at[("Venezuela", 2017), "gdpUS"] = 143841000000
df_unesco.at[("Venezuela", 2017), "gdpPerCapitaUS"] = 4755.03
df_unesco.at[("Venezuela", 2017), "gdpPerCapita ppp"] = 12500
df_unesco.at[("Venezuela", 2017), "gdp pppInternational"] = 381600000000
df_unesco.at[("Venezuela", 2017), "gniPerCapita"] = 8216
df_unesco.at[("Venezuela", 2017), "gniPerCapita ppp"] = 12010

df_unesco.at[("Somalia", 2016), "gdpGrowthPct"] = 2.89
df_unesco.at[("Somalia", 2016), "gdpPerCapita ppp"] = 1395
# 19.98 B gdp / 14.32 M total pop
df_unesco.at[("Somalia", 2016), "gdp pppInternational"] = 19980000000 
df_unesco.at[("Somalia", 2016), "gniPerCapita"] = 101
df_unesco.at[("Somalia", 2016), "gniPerCapita ppp"] = 290

df_unesco.at[("Somalia", 2017), "gdpGrowthPct"] = 1.39
df_unesco.at[("Somalia", 2017), "gdpPerCapita ppp"] = 1386
# 20.44 B gdp / 14.74 M total pop
df_unesco.at[("Somalia", 2017), "gdp pppInternational"] = 20440000000
df_unesco.at[("Somalia", 2017), "gniPerCapita"] = 102
df_unesco.at[("Somalia", 2017), "gniPerCapita ppp"] = 295


df_unesco.at[("Djibouti", 2015), "gdpPerCapita ppp"] = 3139.3
df_unesco.at[("Djibouti", 2015), "gdp pppInternational"] = 3203000000
df_unesco.at[("Djibouti", 2015), "gniPerCapita ppp"] = 3103

df_unesco.at[("Lithuania", 2015), "gni"] = 43900000000
df_unesco.at[("Lithuania", 2015), "gniPerCapita"] = 15110
df_unesco.at[("Lithuania", 2015), "gniPerCapita ppp"] = 27780

df_unesco.at[("Lithuania", 2016), "gni"] = 42460000000
df_unesco.at[("Lithuania", 2016), "gniPerCapita"] = 14800
df_unesco.at[("Lithuania", 2016), "gniPerCapita ppp"] = 29220

df_unesco.at[("Lithuania", 2017), "gni"] = 42930000000
df_unesco.at[("Lithuania", 2017), "gniPerCapita"] = 15180
df_unesco.at[("Lithuania", 2017), "gniPerCapita ppp"] = 32070

df_unesco.at[("Yemen", 2017), "gniPerCapita"] = 1060
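
The rule of thumb above (only fill features with fewer than 10 missing values) can be checked programmatically before looking anything up. This is a sketch on made-up toy data; `toy`, `MAX_MISSING`, `fillable` and `to_research` are hypothetical names. It assumes missing values are encoded as NaN. In this notebook the reshaping step initialises cells to 0.0, so a comparison with 0 would be needed instead of `isna()`.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by (Country, Year), with NaN marking missing values.
toy = pd.DataFrame(
    {
        "gdpGrowthPct": [1.0, np.nan, 2.0, 3.0],
        "gni": [np.nan, np.nan, np.nan, 5.0],
        "totalPop": [1.0, 2.0, 3.0, 4.0],
    },
    index=pd.MultiIndex.from_tuples(
        [("A", 2015), ("A", 2016), ("B", 2015), ("B", 2016)],
        names=["Country", "Year"],
    ),
)

MAX_MISSING = 2  # toy threshold; the notebook's rule uses 10

# Count missing values per feature and keep those with a few gaps only.
missing_counts = toy.isna().sum()
fillable = missing_counts[
    (missing_counts > 0) & (missing_counts < MAX_MISSING)
].index.tolist()

# The (Country, Year) rows that would need to be researched by hand.
to_research = toy.loc[toy[fillable].isna().any(axis=1), fillable]
print(fillable)
print(to_research)
```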

Next, I merge the two dataframes on country and year.  The resulting dataframe, which contains the features and their corresponding labels (happiness scores), can now be passed directly to machine learning regression algorithms.

Additionally, all the columns already have the correct data type, so no further conversion was needed.

In [37]:
df_combined = df_unesco.merge(df_hap, on=["Country", "Year"])
df_combined.rename(columns={"Happiness Score": "happinessScore"}, inplace=True)
df_combined.set_index(["Country", "Year"], inplace=True)
df_combined.dtypes

fertilityRate           float64
lifeExpectancy          float64
mortalityRate           float64
popGrowth               float64
ruralPopPct             float64
gdpGrowthPct            float64
gdpUS                   float64
gdpPerCapitaUS          float64
gdpPerCapita ppp        float64
gdp pppInternational    float64
gni                     float64
gniPerCapita            float64
gniPerCapita ppp        float64
pop14under              float64
pop15to24               float64
pop25to64               float64
pop65over               float64
hivPct                  float64
povertyRatio            float64
totalPop                float64
happinessScore          float64
dtype: object
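
Since the merge keeps only (Country, Year) pairs present in both dataframes, it can be useful to check how many rows fall out on each side. The sketch below uses pandas' `indicator=True` option on made-up toy frames; `features`, `scores`, `audit` and `unmatched_scores` are hypothetical names, not objects from the notebook.

```python
import pandas as pd

# Toy stand-ins for the indicator data and the happiness scores.
features = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2015, 2016, 2015],
    "gdpGrowthPct": [1.0, 2.0, 3.0],
})
scores = pd.DataFrame({
    "Country": ["A", "C"],
    "Year": [2015, 2015],
    "Happiness Score": [7.0, 5.0],
})

# An outer merge with indicator=True adds a "_merge" column that says whether
# each key was found in the left frame only, the right frame only, or both.
audit = features.merge(scores, on=["Country", "Year"], how="outer", indicator=True)
print(audit["_merge"].value_counts())

# Happiness rows with no matching indicator data (candidates for name realignment).
unmatched_scores = audit.loc[audit["_merge"] == "right_only", ["Country", "Year"]]
print(unmatched_scores)
```

Rows marked "both" are the ones an inner merge would keep. The "right_only" rows point to happiness scores that are lost, for example because of a country name that is still misaligned.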

Finally, I saved the fully cleaned dataframes as .csv files in my data directory.

In [41]:
df_unesco.to_csv("unesco.csv")
df_combined.to_csv("unesco_train.csv")