# What factors most affect happiness?

Gapminder took a global poll in which participants were asked to rate, on a scale from 1 (the worst possible life) to 10 (the best possible life), where they felt they personally stood at the time of asking. These responses were ranked by year and by national average response score.

The responses intrigued me, as some nations ranked higher than others, and I wanted to see how happiness correlates with other metrics. I will analyze the data against the following three Gapminder data sets to see which of these factors has the biggest effect on happiness.

### Income - Mean household income
### Life expectancy - the average number of years a newborn child would live
### Gini - inequality coefficient (higher meaning more inequality)


First, we will import and look at our raw data sets. Note that the happiness index started being measured in 2004, while the other data sets go back much further. For this analysis, we will trim the supplemental data sets to include only data from 2020.

In [185]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

happiness = pd.read_csv('happiness.csv', sep = ',')
incomeraw = pd.read_csv('income.csv', sep = ',')
lifeexpectancyraw = pd.read_csv('life_expectancy.csv', sep = ',')
giniraw = pd.read_csv('gini.csv', sep = ',')


#### Data cleanup 1:
We will want to look at data from 2020 for each metric, and compare our metrics against each other. 

In [186]:
happiness = happiness.iloc[:, np.r_[0, 17:18]] #snipping excess dates from dataframe
happiness.head()

Unnamed: 0,country,2020
0,Afghanistan,24.0
1,Angola,
2,Albania,52.0
3,United Arab Emirates,65.8
4,Argentina,59.7
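
Positional slices such as `np.r_[0, 17:18]` depend on the exact column layout of each CSV. As a rough alternative, the year column can be selected by its label instead. The sketch below uses a small hypothetical stand-in frame (`raw`) and a hypothetical helper (`pick_year`); the 2019 values are made up purely for illustration, and the real files may be laid out differently.

```python
import pandas as pd

# Hypothetical stand-in for happiness.csv: one row per country, one column per year.
# The 2019 values are invented for illustration only.
raw = pd.DataFrame({
    "country": ["Afghanistan", "Angola", "Albania"],
    "2019": [25.7, None, 48.8],
    "2020": [24.0, None, 52.0],
})

def pick_year(df, year, name):
    """Keep 'country' plus one year column, renamed to the metric name."""
    col = str(year)  # assumes year columns are stored as strings, e.g. '2020'
    return df[["country", col]].rename(columns={col: name})

h_demo = pick_year(raw, 2020, "happiness")
print(h_demo)
```

Selecting by label also combines the slicing and renaming steps, and it raises a `KeyError` if the requested year is missing rather than silently picking the wrong column.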


We'll also rename the 2020 column after the data set it came from and shorten our dataframe names.

In [187]:
# For each data set: keep 'country' (column 0) plus the 2020 column,
# then rename '2020' after the metric it holds.
# 'country' stays a regular column (not the index) so the frames can be joined on it later.
h = happiness.rename(columns = {'2020':'happiness'})
gini = giniraw.iloc[:, np.r_[0, 222:223]]
g = gini.rename(columns = {'2020':'gini'})
income = incomeraw.iloc[:, np.r_[0, 222:223]]
i = income.rename(columns = {'2020':'income'})
lifeexpectancy = lifeexpectancyraw.iloc[:, np.r_[0, 222:223]]
l = lifeexpectancy.rename(columns = {'2020':'life expectancy'})

l.head(100)

Unnamed: 0,country,life expectancy
0,Afghanistan,64.0
1,Angola,65.8
2,Albania,78.7
3,Andorra,
4,United Arab Emirates,74.2
...,...,...
95,Kuwait,81.7
96,Lao,69.6
97,Lebanon,76.8
98,Liberia,66.7


In [188]:
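# Empty frame with the target columns (a list keeps the column order fixed).
t = pd.DataFrame(columns=['country','gini','happiness','lifeexpectancy','income'])
# merge() returns a new DataFrame; the result is not assigned here, so t stays empty.
t.merge(h, how='left', on='country').merge(i, how='left', on='country')

t.head()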

Unnamed: 0,country,gini,happiness,lifeexpectancy,income
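
The frame above is empty because `merge` returns a new DataFrame rather than modifying `t` in place, and a left join that starts from an empty frame has no rows to keep. One possible way to combine the per-metric frames is to fold them together on `country` with `functools.reduce`. This is only a sketch: the `*_demo` frames are small stand-ins built from values shown elsewhere in this notebook, not the full data sets.

```python
from functools import reduce

import pandas as pd

# Small stand-ins for h, i and l (values taken from the outputs in this notebook).
countries = ["Afghanistan", "Angola", "Albania"]
h_demo = pd.DataFrame({"country": countries, "happiness": [24.0, None, 52.0]})
i_demo = pd.DataFrame({"country": countries, "income": [1970, 1520, 3560]})
l_demo = pd.DataFrame({"country": countries, "life expectancy": [64.0, 65.8, 78.7]})

frames = [h_demo, i_demo, l_demo]

# Join the frames pairwise on 'country'; an outer join keeps countries
# that are missing from one of the data sets.
merged = reduce(
    lambda left, right: left.merge(right, on="country", how="outer"),
    frames,
)
print(merged)
```

The result has one row per country and one column per metric, which is the shape needed for comparing the metrics against each other.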


In [190]:
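# pd.concat stacks the frames row-wise by default (axis=0), so each metric
# fills only its own block of rows and the other columns are NaN there.
combined_data = pd.concat([h,i,l,g])
combined_data = combined_data.reset_index(drop=True)
cd = combined_data
cd.to_csv('sorted_gapminder_data.csv', index=False)

cd.head(20)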


Unnamed: 0,country,happiness,income,life expectancy,gini
0,Afghanistan,24.0,,,
1,Angola,,,,
2,Albania,52.0,,,
3,United Arab Emirates,65.8,,,
4,Argentina,59.7,,,
5,Armenia,54.0,,,
6,Australia,71.6,,,
7,Austria,71.6,,,
8,Azerbaijan,51.7,,,
9,Burundi,,,,
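
The empty `income`, `life expectancy` and `gini` columns above are a consequence of `pd.concat` stacking rows by default: the first block of rows comes from the happiness frame only. A minimal sketch of the difference, using two tiny stand-in frames (the `*_demo` names are illustrative, not part of the notebook), is to index each frame by `country` and concatenate along `axis=1` instead.

```python
import pandas as pd

h_demo = pd.DataFrame({"country": ["Afghanistan", "Albania"], "happiness": [24.0, 52.0]})
l_demo = pd.DataFrame({"country": ["Afghanistan", "Albania"], "life expectancy": [64.0, 78.7]})

# Default axis=0: rows are stacked, so each frame's rows have NaN
# in the columns that belong to the other frame.
stacked = pd.concat([h_demo, l_demo])

# axis=1 with 'country' as the index: rows are aligned by country,
# giving one row per country and one column per metric.
side_by_side = pd.concat(
    [h_demo.set_index("country"), l_demo.set_index("country")],
    axis=1,
)

print(stacked)
print(side_by_side)
```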


In [194]:
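# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# The second assignment overwrites the first, so df2 holds only the income rows.
df2 = pd.concat([t, h])
df2 = pd.concat([t, i])
df2.head()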


Unnamed: 0,country,gini,happiness,lifeexpectancy,income
0,Afghanistan,,,,1970
1,Angola,,,,1520
2,Albania,,,,3560
3,United Arab Emirates,,,,35.3k
4,Argentina,,,,12.1k
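
The income column above mixes plain numbers with abbreviated strings such as `35.3k`, so it would need to be converted to numeric values before any correlation can be computed. The helper below is a hypothetical sketch of one way to do that; the `k` suffix appears in the output above, while the `M` and `B` suffixes are an assumption about how larger values might be abbreviated.

```python
import pandas as pd

def parse_abbreviated_number(value):
    """Convert strings such as '35.3k' to floats; plain numbers pass through."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip()
    # 'k' is seen in the income data; 'M' and 'B' are assumed for completeness.
    multipliers = {"k": 1e3, "M": 1e6, "B": 1e9}
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)

# Income values as they appear in the output above.
income_demo = pd.Series(["1970", "1520", "3560", "35.3k", "12.1k"])
income_numeric = income_demo.map(parse_abbreviated_number)
print(income_numeric)
```

Once every metric is numeric, the columns can be compared directly, for example with `DataFrame.corr()`.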
