<a href="https://colab.research.google.com/github/pedrohortencio/data-analysis-projects/blob/main/World%20Happiness%20Report/The_World_Happiness_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About the Data

![](https://olatorera.com/wp-content/uploads/2020/03/world-happiness-report.png)

To illustrate _pandas_ and its functions, we will use data from the World Happiness Report, which draws on several sources and is used by some governments to guide public policy.

We will work with two tables (both converted to DataFrames):

* The 2021 report, with the ranking.
* The historical panel, with indicators from 2005 to 2020 for some countries.

The data can be obtained from several reliable sources:

* [Kaggle](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv)
* The [World Happiness Report](https://worldhappiness.report/ed/2021/#appendices-and-data) website itself, where the report is also available as a PDF.

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# Downloading the Data

In [None]:
# pandas is flexible: a DataFrame can be created directly from a file URL, with no need to download the file first.
# Note: reading .xls files requires the xlrd package to be installed.
df_2021 = pd.read_excel("https://happiness-report.s3.amazonaws.com/2021/DataForFigure2.1WHR2021C2.xls")
df_hist = pd.read_excel("https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls")
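
Reading straight from a URL means every run depends on the network. One possible workaround, sketched here under assumptions, is to keep a local CSV copy. The helper name `load_table`, its `reader` parameter, and the cache file name are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_table(url, cache_path, reader=pd.read_excel):
    """Return the table at `url`, reusing a local CSV copy when one exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A previous run already saved the data, so skip the download.
        return pd.read_csv(cache_path)
    df = reader(url)                    # first run: fetch from the URL
    df.to_csv(cache_path, index=False)  # save a copy for later runs
    return df
```

A call such as `load_table("https://happiness-report.s3.amazonaws.com/2021/DataForFigure2.1WHR2021C2.xls", "whr_2021.csv")` would then download the file only once. Note that a CSV round trip may not preserve every dtype exactly.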

# EDA

> For now, let's focus on the 2021 report.

In [None]:
# The first step is to inspect the DataFrame and check:
    # 1) Whether it was created correctly. If not, go back to pd.read_excel() and adjust its parameters
    # 2) How the data is organized

In [None]:
# let's simply print the first 5 rows
df_2021.head()

> Note that the rows are sorted by the _Ladder score_ column, from highest to lowest.
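
Rather than trusting the eye, the ordering can be checked programmatically. The sketch below uses a small made-up table in place of `df_2021`; the country names and scores are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_2021; the countries and scores are made up.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Ladder score": [7.8, 7.6, 7.6, 5.1],
})

# True when each score is less than or equal to the one before it.
is_sorted_desc = toy["Ladder score"].is_monotonic_decreasing
print(is_sorted_desc)
```

The same attribute on `df_2021["Ladder score"]` should confirm (or refute) the observation for the real data.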

Let's check the last rows of the DataFrame:

In [None]:
df_2021.tail()

In [None]:
# Let's see the size of the dataset (rows, columns)
df_2021.shape

### Investigating the Columns

> It is always a good idea to check the columns and understand what each one holds:

In [None]:
# One way to do that:
df_2021.columns

In [None]:
# A better way, which also shows each column's type:
df_2021.dtypes

In [None]:
# And another way, which shows even more information:
df_2021.info()
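
When `info()` prints more than is needed, a compact per-column summary can be assembled by hand. This is a minimal sketch on a made-up table; the `summary` layout is just one possible choice.

```python
import pandas as pd

# Toy table with some missing values; all numbers are made up.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.8, 7.6, None],
    "Generosity": [0.1, None, None],
})

# One row per column: its type, how many values are missing,
# and how many distinct non-missing values it holds.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```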

Time to consult the documentation to understand a few key points:

* Ladder score: _"[...] This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale."_

* Dystopia: _a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors (levels of GDP, life expectancy, generosity, social support, freedom, and corruption)_

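* Columns starting with "Explained by": each one estimates how much of a country's ladder score is attributed to that factor. This decomposition is the only place where the report uses imputed values, filling _missing values_ that occur mainly in the corruption data. The official explanation: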

_We do not make use of any imputed missing values in our rankings of happiness and
its supporting factors. The only place where we make use of imputation is when
we try to decompose a country’s average ladder score into components explained by
six hypothesized underlying determinants (GDP per person, healthy life expectancy,
social support, perceived freedom to make life choice, generosity and perception of
corruption). A small number of countries have missing values in one or more of these
factors. The most prominent is about the perception of corruption in businesses
and governments. In several countries, the relevant questions were not asked in
the Gallup World Poll. For these countries we impute the missing values using the
“control of corruption” indicator from the Worldwide Governance Indicators (WGI)
project._ [WHR Statistical Appendix 1 - PDF](https://happiness-report.s3.amazonaws.com/2021/Appendix1WHR2021C2.pdf)

* Residuals: _The residuals, or unexplained components, differ for each country, reflecting the extent to which the six variables either over- or under-explain average 2018-2020 life evaluations. These residuals have an average value of approximately zero over the whole set of countries. The difference between what is attributed to the six factors and the total life evaluations is the sum of two parts. These are the average life evaluations in Dystopia, and each country’s residual._ 
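
The quotes above imply that the six "Explained by" components plus Dystopia and the residual add up to the ladder score. The sketch below checks that identity on made-up numbers. The column names are assumptions about how the 2021 file labels these fields, so compare them with `df_2021.columns` before reusing the code.

```python
import pandas as pd

# Made-up values for two hypothetical countries; column names are assumed.
toy = pd.DataFrame({
    "Country name": ["A", "B"],
    "Ladder score": [7.0, 4.5],
    "Explained by: Log GDP per capita": [1.4, 0.6],
    "Explained by: Social support": [1.1, 0.5],
    "Explained by: Healthy life expectancy": [0.7, 0.3],
    "Explained by: Freedom to make life choices": [0.6, 0.4],
    "Explained by: Generosity": [0.1, 0.2],
    "Explained by: Perceptions of corruption": [0.4, 0.1],
    "Dystopia + residual": [2.7, 2.4],
})

# The six factor contributions plus the Dystopia + residual term.
parts = [c for c in toy.columns if c.startswith("Explained by")]
parts.append("Dystopia + residual")

# Row-wise sum of the parts, compared with the reported score.
rebuilt = toy[parts].sum(axis=1)
gap = (rebuilt - toy["Ladder score"]).abs()
print(gap.max())
```

On the real data a small nonzero gap would be expected from rounding.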


--------------
* Upper whisker and lower whisker: statistical terms, explained below. In this dataset the two columns appear to hold the bounds of the 95% confidence interval around the ladder score, which the report draws as whiskers in its ranking figure.

Every [boxplot](https://en.wikipedia.org/wiki/Box_plot) has lines extending from both ends of the box, called _whiskers_. They show how far the data spread beyond the upper and lower quartiles.

_What is a quartile?_

Quartiles are the three cut points (Q1, Q2, Q3) that split the sorted data into four groups, each holding about 25% of the observations. For the _bell curve_ (Gaussian/normal) distribution, the quartiles look like this:

![](https://www.researchgate.net/publication/324532937/figure/fig2/AS:615815585992705@1523833285991/Relationship-of-quartiles-and-inter-quartile-range-Legends-Q-1-first-quartile-Q-3.png)

![](https://www.simplypsychology.org/boxplot.jpg)
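
As a small illustration of the idea, pandas can compute the quartiles directly. The numbers below are arbitrary and unrelated to the report.

```python
import pandas as pd

# Ten arbitrary values, including one that is far from the rest.
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Q1, Q2 (the median) and Q3, using pandas' default linear interpolation.
q1, q2, q3 = scores.quantile([0.25, 0.5, 0.75])

# The interquartile range spans the middle half of the data.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```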

_OK. What do the whiskers indicate?_

_In summary, if there are no individual data points plotted, the whiskers indicate data’s minimum and maximum. If there are individual data points plotted, the whiskers indicate the largest/lowest points inside the range defined by 1st or 3rd quartile plus 1.5 times IQR (interquartile range)._ [Fonte](https://muse.union.edu/dvorakt/what-drives-the-length-of-whiskers-in-a-box-plot/)
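
The 1.5 × IQR rule from the quote can be sketched with NumPy. The data are arbitrary, and the variable names are chosen only for this example.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# First and third quartiles and the interquartile range.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual point.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(lower_whisker, upper_whisker, outliers)
```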

### Top 10 Countries

In [None]:
df_2021[:10]

In [None]:
df_2021[:10].groupby("Regional indicator")["Regional indicator"].count()

In [None]:
# pandas often offers several ways to do the same operation. An alternative for the count above is:

df_2021[:10]["Regional indicator"].value_counts()
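
The two approaches return the same counts but order them differently: `groupby` sorts by the group label, while `value_counts` sorts by frequency. A minimal sketch on a made-up table:

```python
import pandas as pd

# Toy stand-in for the top rows of df_2021; the assignment is made up.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D", "E"],
    "Regional indicator": ["Western Europe", "Western Europe",
                           "Western Europe", "North America and ANZ",
                           "East Asia"],
})

# Counts ordered alphabetically by region.
by_group = toy.groupby("Regional indicator")["Regional indicator"].count()

# Same counts, ordered from most to least frequent.
by_values = toy["Regional indicator"].value_counts()

# normalize=True returns shares instead of counts.
shares = toy["Regional indicator"].value_counts(normalize=True)
print(by_values)
print(shares)
```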

In [None]:
top10 = df_2021[:10].groupby("Regional indicator")["Regional indicator"].count()
ax = top10.plot.pie(autopct='%1.0f%%')
ax.set(xlabel='', ylabel='')
plt.show();

### Bottom 10 Countries

In [None]:
df_2021[-10:]

In [None]:
df_2021[-10:].groupby("Regional indicator")["Regional indicator"].count()

In [None]:
ultimos10 = df_2021[-10:].groupby("Regional indicator")["Regional indicator"].count()
ax = ultimos10.plot.pie(autopct='%1.0f%%')
ax.set(xlabel='', ylabel='')
plt.show();

> M.O. (modus operandi):
1. Filter the data
2. Look for patterns
3. Plot
4. Investigate correlations
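
The steps above can be sketched on a made-up table. The region names, scores, and the 6.0 threshold are illustrative assumptions, and the plotting step is only indicated in a comment so the sketch stays pandas-only.

```python
import pandas as pd

# Toy data; every value here is made up.
toy = pd.DataFrame({
    "Regional indicator": ["North", "North", "North", "South", "South", "South"],
    "Ladder score": [7.5, 7.0, 6.5, 5.0, 4.5, 4.0],
    "Social support": [0.95, 0.88, 0.86, 0.70, 0.66, 0.58],
})

# 1. Filter: keep rows above an arbitrary threshold.
high = toy[toy["Ladder score"] >= 6.0]

# 2. Look for patterns: average score per region.
by_region = toy.groupby("Regional indicator")["Ladder score"].mean()

# 3. Plot: by_region.plot.bar() would draw these averages (needs matplotlib).

# 4. Investigate correlations: Pearson correlation between two columns.
corr = toy["Ladder score"].corr(toy["Social support"])
print(by_region)
print(corr)
```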

## Investigating the Columns, Part 2

In [None]:
df_2021.head(2)

In [None]:
df_2021["Ladder score"].describe()

In [None]:
df_2021.describe()

In [None]:
df_2021.head(2)

In [None]:
df_2021[df_2021['Country name'] == 'Brazil']

In [None]:
# Same filter with .loc, which can also take a column selection after the comma
df_2021.loc[df_2021['Country name'] == 'Brazil']
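
A few equivalent ways to filter rows are sketched below on a made-up table. The scores are not the real values from the report.

```python
import pandas as pd

# Toy stand-in for df_2021; scores and the "Other" region are made up.
toy = pd.DataFrame({
    "Country name": ["Brazil", "A", "B"],
    "Regional indicator": ["Latin America and Caribbean", "Other",
                           "Latin America and Caribbean"],
    "Ladder score": [6.0, 7.8, 5.9],
})

# Boolean mask: True for the rows to keep.
mask = toy["Country name"] == "Brazil"

# .loc takes the mask for rows and, after the comma, the columns to keep.
row = toy.loc[mask, ["Country name", "Ladder score"]]

# query() expresses the same filter as a string; backticks quote
# column names that contain spaces.
same = toy.query("`Country name` == 'Brazil'")

# isin() matches several values at once.
several = toy[toy["Country name"].isin(["Brazil", "B"])]
print(row)
```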

In [None]:
# A quick first bar chart of the 40 highest-ranked countries
plt.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score']);

In [None]:
# Same chart on a larger figure, with a light background and no frame
fig = plt.figure(figsize=(18, 9))
fig.patch.set_facecolor('#eaeaf2')

ax = plt.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score'])

plt.box(False)
plt.show();

In [None]:
fig = plt.figure(figsize=(18, 9))
fig.patch.set_facecolor('#eaeaf2')

ax = plt.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score'])

# Rotate the country names so they do not overlap
plt.xticks(rotation=30, rotation_mode="anchor", ha='right')

plt.box(False)

plt.show();

In [None]:
fig = plt.figure(figsize=(18, 9))
fig.patch.set_facecolor('#eaeaf2')


ax = plt.gca()
plot = ax.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score'])



# Write each bar's height just above the bar
for r1 in plot.patches:
    h1 = r1.get_height()

    ax.annotate('{:.1f}'.format(h1),
                    xy=(r1.get_x() + r1.get_width() / 2, h1),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.xticks(rotation=30, rotation_mode="anchor", ha='right')

plt.box(False)

plt.show();

In [None]:
fig = plt.figure(figsize=(18, 9))
fig.patch.set_facecolor('#eaeaf2')


ax = plt.gca()
plot = ax.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score'])



for r1 in plot.patches:
    h1 = r1.get_height()

    ax.annotate('{:.1f}'.format(h1),
                    xy=(r1.get_x() + r1.get_width() / 2, h1),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.xticks(rotation=30, rotation_mode="anchor", ha='right')

plt.box(False)
# Hide the y-axis ticks and the grid; the labels above the bars already show the values
ax.set_yticks([])
ax.grid(False)

plt.show();

In [None]:
fig = plt.figure(figsize=(18, 9))
fig.patch.set_facecolor('#eaeaf2')


ax = plt.gca()
# Color the bars with a seaborn cubehelix ("ch:") palette
plot = ax.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score'], color = sns.color_palette("ch:start=.2,rot=-.3_r", n_colors=41))



for r1 in plot.patches:
    h1 = r1.get_height()

    ax.annotate('{:.1f}'.format(h1),
                    xy=(r1.get_x() + r1.get_width() / 2, h1),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.xticks(rotation=30, rotation_mode="anchor", ha='right')

plt.box(False)
ax.set_yticks([])
ax.grid(False)

plt.show();

In [None]:
fig = plt.figure(figsize=(18, 9))
fig.patch.set_facecolor('#eaeaf2')


ax = plt.gca()
plot = ax.bar(x=df_2021[:40]['Country name'], height=df_2021[:40]['Ladder score'], color = sns.color_palette("ch:start=.5,rot=-.75", n_colors=41))



for r1 in plot.patches:
    h1 = r1.get_height()

    ax.annotate('{:.1f}'.format(h1),
                    xy=(r1.get_x() + r1.get_width() / 2, h1),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.xticks(rotation=30, rotation_mode="anchor", ha='right')

plt.box(False)
ax.set_yticks([])
ax.grid(False)

plt.title('Top 40 Countries in the World Happiness Report', size=22, alpha=0.8, y=1.02)

plt.show();
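
The manual `annotate` loop used in the cells above can, as an alternative, be replaced by `Axes.bar_label`, available in matplotlib 3.4 and later. This is a minimal sketch with made-up names and scores, not the notebook's own method.

```python
import matplotlib.pyplot as plt

# Made-up names and scores standing in for the real data.
names = ["A", "B", "C"]
scores = [7.8, 7.6, 7.5]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(names, scores)

# bar_label writes one text label per bar; padding is in points.
labels = ax.bar_label(bars, fmt="%.1f", padding=3)

# Hide the y-axis ticks, since the labels already show the values.
ax.set_yticks([])
```

In a notebook, `plt.show()` after this block would display the figure.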

### Good Investigations Need Plots

In [None]:
df_hist.head()

In [None]:
df_hist.query("year == 2008")['Log GDP per capita']

In [None]:
# (column in df_hist, panel title) for each of the six panels
metrics = [
    ('Log GDP per capita', 'GDP per capita'),
    ('Social support', 'Social support'),
    ('Healthy life expectancy at birth', 'Life expectancy'),
    ('Freedom to make life choices', 'Freedom'),
    ('Generosity', 'Generosity'),
    ('Perceptions of corruption', 'Perceptions of corruption'),
]
year_colors = {2008: '#347373', 2020: '#2D3033'}

# Apply the style before creating any axes so all six panels share it
sns.set_style("white")
fig = plt.figure(figsize = (16, 18))
axes = []

for position, (column, title) in enumerate(metrics, start = 1):
    ax = plt.subplot(3, 2, position)
    axes.append(ax)
    plt.title(title, size = 17, y = 1.03, fontname = 'serif')
    plt.grid(color = 'gray', linestyle = ':', axis = 'x', alpha = 0.8, zorder = 0, dashes = (1,7))
    for yr, color in year_colors.items():
        values = df_hist.query("year == @yr")[column]
        # Density curve for the year (fill replaces the deprecated shade argument)
        sns.kdeplot(values, color = color, fill = True, label = str(yr), alpha = 0.7)
        # Dashed vertical line at the year's mean
        plt.axvline(values.mean(), linestyle = '--', color = color)
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname = 'serif')
    plt.yticks([])
    if position == 1:
        # One legend for the whole figure, attached to the first panel
        plt.legend(['Mean', 'Mean', '2008', '2020'], bbox_to_anchor = (1.4, 1.2), ncol = 2, borderpad = 3, frameon = False, fontsize = 11)

# Figure title and hard-coded percentage labels
plt.figtext(0.2, 1.02, 'Global dynamics of all WHR metrics 2008-2020', fontsize = 30, fontname = 'serif')
plt.figtext(0.01, 0.93, '+6.4%', fontsize = 18, fontname = 'serif')
plt.figtext(0.93, 0.93, '+7.0%', fontsize = 18, fontname = 'serif')
plt.figtext(0.01, 0.6, '+9.6%', fontsize = 18, fontname = 'serif')
plt.figtext(0.92, 0.6, '+18.8%', fontsize = 18, fontname = 'serif')
plt.figtext(0.01, 0.27, '-137.3%', fontsize = 18, fontname = 'serif')
plt.figtext(0.94, 0.27, '-7.4%', fontsize = 18, fontname = 'serif')

# Keep only the bottom spine on every panel, drawn slightly thicker
for ax in axes:
    for side in ['right', 'left', 'top']:
        ax.spines[side].set_visible(False)
    ax.spines['bottom'].set_linewidth(1.5)

fig.tight_layout(h_pad = 3)

plt.show()
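
The percentage labels in the figure are typed in by hand. Assuming they stand for the relative change in each indicator's mean between 2008 and 2020 (an assumption, since the notebook does not say how they were obtained), they could be computed instead. The sketch below uses a made-up panel in place of `df_hist`.

```python
import pandas as pd

# Toy stand-in for df_hist; all values are made up.
toy_hist = pd.DataFrame({
    "Country name": ["A", "B", "A", "B"],
    "year": [2008, 2008, 2020, 2020],
    "Social support": [0.80, 0.70, 0.84, 0.78],
    "Generosity": [0.10, -0.02, 0.01, -0.05],
})
metrics = ["Social support", "Generosity"]

# Mean of each indicator in the two years being compared.
two_years = toy_hist[toy_hist["year"].isin([2008, 2020])]
means = two_years.groupby("year")[metrics].mean()

# Relative change from 2008 to 2020, in percent of the 2008 mean.
# Dividing by the absolute value keeps the sign meaningful when
# the 2008 mean is negative.
pct_change = (means.loc[2020] - means.loc[2008]) / means.loc[2008].abs() * 100
print(pct_change.round(1))
```

An indicator whose mean is close to zero, such as the toy `Generosity` column here, can produce very large percentages, so such labels are best read with caution.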