# PSTAT 100 Project Plan Report

In [None]:
# libraries
import numpy as np
import pandas as pd
import altair as alt

## Group information

**Group members**: Chunting Zheng, Karen Zhao

**Contributions**:

1. Chunting Zheng studied the data documentation and prepared the data description.
2. Karen Zhao worked on tidying the dataset and creating data visualizations.

---
## 0. Background

In recent years, personal freedom has declined around the world. It can be challenging to determine which countries' citizens have the most freedom. However, we can determine which countries have the highest level of human freedom through individual indices. The [Human Freedom Index Report for 2020](https://www.cato.org/sites/cato.org/files/2021-03/human-freedom-index-2020.pdf) is co-published by the Cato Institute and the Fraser Institute. The Human Freedom Index helps reveal relationships between freedom and other socioeconomic phenomena, as well as the ways in which the various dimensions of freedom interact with one another.

To determine each country's freedom rank, each country is given a score for `Personal Freedom` and `Economic Freedom`. These scores are averaged to find the `Human Freedom` score. Countries in the top quartile of freedom enjoy a significantly higher average per capita income ($\$50,340$) than those in other quartiles; the average per capita income in the least-free quartile is $\$7,720$.
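
The averaging rule can be checked against the published scores. The sketch below is only an illustration: it copies the three Albania rows from the `hfi.head(3)` output shown later in this report and assumes a plain unweighted mean of `pf_score` and `ef_score`; the name `hf_recomputed` is introduced here for the example.

```python
import pandas as pd

# Three rows copied from the hfi.head(3) output shown later in this report.
albania = pd.DataFrame({
    'year': [2018, 2017, 2016],
    'pf_score': [7.81, 7.86, 7.57],
    'ef_score': [7.80, 7.70, 7.69],
    'hf_score': [7.81, 7.78, 7.63],
})

# Assumption: the human freedom score is the unweighted mean of the two sub-scores.
albania['hf_recomputed'] = albania[['pf_score', 'ef_score']].mean(axis=1)

# Published scores are rounded to two decimals, so allow a small rounding tolerance.
albania['matches'] = (albania['hf_recomputed'] - albania['hf_score']).abs() < 0.01
print(albania)
```

For these rows the recomputed mean agrees with the published `hf_score` up to rounding, which is consistent with the description above.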

The findings in the Human Freedom Index suggest that freedom plays an important role in human well-being, and they offer opportunities for further research into the complex ways in which freedom influences, and can be influenced by, political regimes, economic development, and the whole range of indicators of human well-being. The Human Freedom Index also finds a strong relationship between human freedom and democracy.

Moreover, freedom and happiness tend to be positively correlated. The [World Happiness Report for 2020](https://happiness-report.s3.amazonaws.com/2020/WHR20.pdf) brings together the available global data on national happiness and reviews evidence from the emerging science of happiness. The report reminds us that happiness is based on social capital, not just financial capital. In this project, we will examine this correlation more closely and see how it differs across cultures and aspects of freedom in 2018.

---
## 1. Data description

### Basic information

**General description**: The `Human Freedom Index` presents the state of human freedom in 162 countries around the world, using a broad measure built from 79 distinct indicators that encompass economic, civil, and personal freedom. The `World Happiness Report` contains happiness scores for 156 countries along with the factors used to explain the scores.

**Source**: The data in the `Human Freedom Index` is collected and compiled by Ryan Murphy. The data in the `World Happiness Report` is based on the Gallup World Poll.

The data are publicly available: 
> Ian Vasquez and Fred McMahon, The Human Freedom Index 2020: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute, Fraser Institute, and the Friedrich Naumann Foundation for Freedom, 2020).

> Helliwell, John F., Richard Layard, Jeffrey Sachs, and Jan-Emmanuel De Neve, eds. 2020. World Happiness Report 2020. New York: Sustainable Development Solutions Network

The full dataset can be downloaded [here](https://www.cato.org/human-freedom-index/2020) for `Human Freedom Index`, and [here](https://worldhappiness.report/ed/2020/) for `World Happiness Report`.


**Collection methods**: The values in the `Human Freedom Index` are obtained or calculated from variables drawn from various datasets, such as the Global Terrorism Database, Global Database, United Nations, CI-RIGHTS Dataset, OECD, and so on. For example, rule of law is an average of procedural justice, civil justice, and criminal justice, and each subcomponent is calculated as an average of selected Rule of Law Index subfactors. The values in the `World Happiness Report` are derived from the Gallup World Poll surveys, which ask people in each country about their life satisfaction and happiness.


**Sampling design and scope of inference**: The `Human Freedom Index` data has annual observations of the freedom index for 162 countries from 2008 to 2018. The `World Happiness Report` data has observations of the happiness score for 156 countries from 2008 to 2018. The population is all countries existing between 2008 and 2018. The frame is all countries reporting some kind of national freedom or happiness information for some year between 2008 and 2018. The sample is equal to the frame, so the frame partly overlaps the population and the sample is a census of the frame. Both datasets are administrative data; thus, the scope of inference is none. No information is available about the sampling design for either dataset.


### Data semantics and structure

**Units and observations**: The **observational units** are _**countries**_.  

**Variable descriptions**: 

Name | Variable description | Type | Units of measurement
---|---|---|---
hf_score | Human freedom score | Numeric |  Ranges from 0 to 10
pf_rol | Rule of law | Numeric | Ranges from 0 to 10
pf_ss | Security and safety | Numeric | Ranges from 0 to 10
pf_movement | Freedom of movement (travel) | Numeric | Ranges from 0 to 10
pf_religion | Religious freedom | Numeric | Ranges from 0 to 10
pf_association | Freedom to associate and assemble with peaceful individuals or organizations | Numeric | Ranges from 0 to 10
pf_expression | Freedom of expression | Numeric | Ranges from 0 to 10
pf_identity | Identity and relationships | Numeric | Ranges from 0 to 10
pf_score | Personal freedom score | Numeric | Ranges from 0 to 10
ef_government | Size of government | Numeric | Ranges from 0 to 10
ef_legal | Legal system and property rights | Numeric | Ranges from 0 to 10
ef_money | Sound money | Numeric | Ranges from 0 to 10
ef_trade | Freedom to trade internationally | Numeric | Ranges from 0 to 10
ef_regulation | Regulation of Credit, Labor, and Business | Numeric | Ranges from 0 to 10
ef_score | Economic freedom score | Numeric | Ranges from 0 to 10
life_ladder | Life evaluation score | Numeric | Ranges from 0 to 10
income | Income level | Categorical | Low, lower-middle, upper-middle, high

In [None]:
# load tidied data and print rows
hfi = pd.read_csv('data/hf-index.csv')
hf_happiness = pd.read_csv('data/hf-whr18.csv')

In [None]:
hfi.head(3)

Unnamed: 0,region,year,income,country,hf_score,pf_rol,pf_ss,pf_movement,pf_religion,pf_association,pf_expression,pf_identity,pf_score,ef_government,ef_legal,ef_money,ef_trade,ef_regulation,ef_score
0,Europe & Central Asia,2018,Upper middle,Albania,7.81,5.0,9.3,10.0,8.9,8.6,9.2,5.8,7.81,8.1,5.2,9.8,8.2,7.7,7.8
1,Europe & Central Asia,2017,Upper middle,Albania,7.78,5.3,9.3,10.0,8.7,8.4,9.2,5.8,7.86,7.4,5.4,9.6,8.2,7.9,7.7
2,Europe & Central Asia,2016,Upper middle,Albania,7.63,5.3,8.7,8.3,8.9,8.5,9.2,5.8,7.57,7.8,5.5,9.6,8.0,7.6,7.69


In [None]:
hf_happiness.head(3)

Unnamed: 0,region,year,income,country,hf_score,pf_rol,pf_ss,pf_movement,pf_religion,pf_association,pf_expression,pf_identity,pf_score,ef_government,ef_legal,ef_money,ef_trade,ef_regulation,ef_score,life_ladder
0,Europe & Central Asia,2018,Upper middle,Albania,7.81,5.0,9.3,10.0,8.9,8.6,9.2,5.8,7.81,8.1,5.2,9.8,8.2,7.7,7.8,5.004403
1,Middle East & North Africa,2018,Upper middle,Algeria,5.2,5.1,7.8,5.8,3.7,4.6,7.9,0.0,5.42,4.2,4.5,7.9,2.6,5.6,4.97,5.043086
2,Sub-Saharan Africa,2018,Lower middle,Angola,5.48,3.6,8.4,6.7,6.5,5.3,7.1,6.7,6.21,7.3,3.4,4.7,2.9,5.4,4.75,


---
## 2. Initial explorations

### Basic properties of the dataset

#### 2(a). Dimensions

In [None]:
print("the dimensions of the Human Freedom Index dataset is", hfi.shape)
print("the dimensions of the 2018 Human Freedom and Happiness dataset is:", hf_happiness.shape)

the dimensions of the Human Freedom Index dataset is (1782, 19)
the dimensions of the 2018 Human Freedom and Happiness dataset is: (162, 20)
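
The first row count is what one row per country per year would give: 162 countries times 11 years (2008 to 2018) is 1782. A minimal sketch of how that structure could be checked is below; `is_balanced_panel` is a hypothetical helper written for this example, and the toy frame uses made-up labels rather than the real data.

```python
import pandas as pd

def is_balanced_panel(df, unit='country', time='year'):
    """Return True if every unit appears exactly once for every time value."""
    n_units = df[unit].nunique()
    n_times = df[time].nunique()
    no_duplicates = not df.duplicated([unit, time]).any()
    return no_duplicates and len(df) == n_units * n_times

# Arithmetic check against the shape printed above: 162 countries x 11 years.
assert 162 * 11 == 1782

# Toy frame (hypothetical labels) with 2 countries x 3 years.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2016, 2017, 2018] * 2,
})
print(is_balanced_panel(toy))            # True
print(is_balanced_panel(toy.iloc[:-1]))  # False: one country-year row is missing
```

Calling `is_balanced_panel(hfi)` on the loaded data would run the same check, assuming the `country` and `year` columns shown in the preview.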


#### 2(b). Missing values

Check the missing values for each variable in the Human Freedom Index dataset (containing data from 2008 to 2018).

In [None]:
hfi.loc[:, 'hf_score':].isna().sum().reset_index().rename(columns={'index': 'variable', 0: 'missing values'})

Unnamed: 0,variable,missing values
0,hf_score,80
1,pf_rol,80
2,pf_ss,80
3,pf_movement,80
4,pf_religion,11
5,pf_association,33
6,pf_expression,80
7,pf_identity,80
8,pf_score,80
9,ef_government,6


In [None]:
# observation counts of missing values by year
hfi.isnull().groupby(hfi.year).sum().sum(axis=1).reset_index().rename(columns={0: 'count'}).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
year,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
count,238,238,101,99,99,57,37,37,4,4,4


In [None]:
# observation counts of missing values by region
hfi.isnull().groupby(hfi.region).sum().sum(axis=1).sort_values().reset_index().rename(columns={0: 'count'}).transpose()

Unnamed: 0,0,1,2,3,4,5,6
region,North America,Latin America & Caribbean,South Asia,Europe & Central Asia,East Asia & Pacific,Middle East & North Africa,Sub-Saharan Africa
count,0,44,52,110,156,235,321


There tend to be more missing values in earlier years and in regions with less developed countries. There are several possible reasons. For example, it might be difficult for organizations to collect data from less developed or rural countries. Some countries might not report data due to conflict, lack of statistical capacity, etc. And some countries do not have data for earlier years simply because they did not exist.

Check the missing values for each variable in the 2018 Human Freedom and Happiness dataset.



In [None]:
hf_happiness.loc[:, ['pf_religion', 'pf_association', 'life_ladder']].isna().sum().reset_index(
).rename(columns={'index': 'variable', 0: 'missing values'})

Unnamed: 0,variable,missing values
0,pf_religion,1
1,pf_association,3
2,life_ladder,28


The only additional variable in this dataset is `life_ladder`, so we focus on the number of missing values of this variable. Since the Human Freedom Index dataset contains data from more countries than the World Happiness Report, some countries that provided data related to freedom scores might not appear in the world happiness data.
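
One way such gaps can arise is sketched below. It assumes the two sources were combined with a left join on `country`, which this report does not state, and it uses toy frames with made-up values in place of the real files.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the two sources in a single year.
freedom = pd.DataFrame({'country': ['A', 'B', 'C'], 'hf_score': [7.8, 5.2, 5.5]})
happiness = pd.DataFrame({'country': ['A', 'B'], 'life_ladder': [5.0, 5.1]})

# A left join keeps every country in the freedom data; indicator=True adds
# a `_merge` column recording which source(s) each row was found in.
merged = freedom.merge(happiness, on='country', how='left', indicator=True)

# Countries present only in the freedom data end up with a missing life_ladder.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)  # ['C']
```

Listing the unmatched countries this way would also help spot mismatches caused by different spellings of the same country name in the two sources.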


#### 2(c). Variable summaries

Summary statistics for the Human Freedom Index dataset.

In [None]:
hfi.loc[:, 'hf_score':].describe().round(4)

Unnamed: 0,hf_score,pf_rol,pf_ss,pf_movement,pf_religion,pf_association,pf_expression,pf_identity,pf_score,ef_government,ef_legal,ef_money,ef_trade,ef_regulation,ef_score
count,1702.0,1702.0,1702.0,1702.0,1771.0,1749.0,1702.0,1702.0,1702.0,1776.0,1782.0,1702.0,1701.0,1715.0,1702.0
mean,7.0076,5.2644,8.1061,7.8358,7.4171,7.2142,8.3896,7.21,7.1681,6.5565,5.2745,8.154,7.0569,7.0485,6.8423
std,1.0751,1.5519,1.4624,2.5994,1.6441,2.2893,1.3772,3.1839,1.3974,1.3038,1.396,1.4049,1.2893,1.0476,0.9294
min,3.6,1.7,3.5,0.0,0.6,0.5,0.1,0.0,2.31,0.1,2.2,0.7,1.8,2.5,2.72
25%,6.26,4.1,7.2,6.7,6.6,6.1,7.7,5.0,6.1425,5.7,4.3,7.1,6.2,6.5,6.22
50%,6.99,4.9,8.3,8.3,7.9,7.9,8.8,8.8,7.19,6.6,5.2,8.4,7.2,7.1,6.95
75%,7.96,6.5,9.5,10.0,8.6,9.0,9.5,10.0,8.39,7.5,6.2,9.4,8.1,7.8,7.57
max,8.99,8.8,10.0,10.0,9.9,10.0,10.0,10.0,9.59,9.5,8.5,9.9,9.6,9.5,8.97


In [None]:
hfi.groupby('income').size().reset_index().rename(columns={0: 'count'}).transpose()

Unnamed: 0,0,1,2,3
income,High,Low,Lower middle,Upper middle
count,555,320,446,461



Summary statistics for the 2018 Human Freedom and Happiness dataset.


In [None]:
hf_happiness.describe().round(4).loc[:,['hf_score', 'pf_score', 'ef_score', 'life_ladder']]

Unnamed: 0,hf_score,pf_score,ef_score,life_ladder
count,162.0,162.0,162.0,134.0
mean,6.933,7.0013,6.8602,5.5365
std,1.1138,1.4444,0.9545,1.0686
min,3.97,2.49,3.34,3.0575
25%,6.22,6.0025,6.2275,4.8404
50%,6.92,7.05,6.915,5.4809
75%,7.8625,8.105,7.62,6.2697
max,8.87,9.46,8.94,7.8581


In [None]:
hf_happiness.groupby('income').size().reset_index().rename(columns={0: 'count'}).transpose()

Unnamed: 0,0,1,2,3
income,High,Low,Lower middle,Upper middle
count,53,26,38,45


### Exploratory analysis

<img src = 'plan1.png' style = 'height:500px'>

As mentioned in the background, the personal freedom score has decreased markedly. The human freedom score also seems to have decreased in recent years.

<img src = 'plan2.png' style = 'width:800px'>

The human freedom scores in high income countries tend to be much higher than in countries at other income levels. Additionally, from the scatterplots above, we can see that countries in higher income groups tend to have higher human freedom and happiness scores. Looking at data just from 2018, we can see that countries with higher freedom scores also tend to have higher happiness scores.
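
The visual impression can be backed by a number. The sketch below computes a Pearson correlation on complete pairs only, since `life_ladder` has missing values; `freedom_happiness_corr` is a hypothetical helper, and the toy values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def freedom_happiness_corr(df, x='hf_score', y='life_ladder'):
    """Pearson correlation between two columns, using complete pairs only."""
    pairs = df[[x, y]].dropna()
    return pairs[x].corr(pairs[y])

# Toy values (hypothetical, not from the dataset), with one missing ladder score.
toy = pd.DataFrame({
    'hf_score': [5.0, 6.0, 7.0, 8.0, 9.0],
    'life_ladder': [4.0, 4.5, None, 6.0, 7.5],
})
r = freedom_happiness_corr(toy)
print(round(r, 3))
```

Applied to the 2018 data, the same function could be called as `freedom_happiness_corr(hf_happiness)`, or within each income group to see whether the association holds inside groups as well as across them.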

---
## 3. Planned work

### Questions

Please propose two focused questions that you plan to explore.

1. What variables are important indicators of human freedom?
2. What is the relationship between world happiness and freedom? Can we use the freedom index to predict world happiness in 2018?

### Proposed approaches

For each question, please describe an idea or two about how you might approach the question.

1. Idea 1: Construct a scatterplot matrix and/or heatmap with the 12 personal and economic freedom variables. <br>
Idea 2: Perform a Principal Component Analysis with the 12 variables and attempt to interpret the principal components.
2. We will investigate the association between life evaluation score, human freedom, personal freedom, and economic freedom using regression. We will fit a multiple linear regression model with life evaluation score as the response variable and the other variables as explanatory variables.
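
The two approaches could be prototyped along the following lines. This is a rough sketch under stated assumptions, not the final analysis: `pca` and `fit_ols` are hypothetical helpers built on NumPy, and the data are synthetic stand-ins for the freedom variables and the life evaluation score.

```python
import numpy as np

def pca(X):
    """PCA via SVD on standardized columns.

    Returns (loadings, explained_variance_ratio); loadings has one column
    per principal component.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    var = s**2 / (len(Z) - 1)
    return vt.T, var / var.sum()

def fit_ols(X, y):
    """Least-squares coefficients [intercept, slopes...] for y ~ X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Synthetic data (hypothetical): 50 "countries", 3 strongly correlated "scores".
rng = np.random.default_rng(0)
base = rng.normal(size=50)
X = np.column_stack([base + 0.1 * rng.normal(size=50) for _ in range(3)])
y = 1.0 + 2.0 * X[:, 0] + 0.05 * rng.normal(size=50)

loadings, ratio = pca(X)     # share of variance captured by each component
coef = fit_ols(X[:, :1], y)  # intercept and slope for a single predictor
print(ratio.round(3))
print(coef.round(2))
```

With the real data, rows with missing values would need to be dropped (or imputed) before either step, and strongly correlated explanatory variables may make individual regression coefficients hard to interpret.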
