# Project proposal

---

Group name: Group F (Ji Soo Ha & Alexander Hörmann)


---


## Introduction

The introduction section includes

-   an introduction to the subject matter you're investigating
-   the motivation for your research question (citing any relevant literature)
-   the general research question you wish to explore
-   your hypotheses regarding the research question of interest.


## Data description

In this section, you will describe the data set you wish to explore. This includes

-   description of the observations in the data set,
-   description of how the data was originally collected (not how you found the data but how the original curator of the data collected it).


## Analysis approach

In this section, you will provide a brief overview of your analysis approach. This includes:

-   Description of the response variable.
  - Productivity: Productivity is measured as gross domestic product (GDP) per hour of work. This data is adjusted for inflation and for differences in the cost of living between countries.
-   Visualization and summary statistics for the **response variable**.

In [1]:
import numpy as np
import pandas as pd
import altair as alt
import io

In [2]:
from google.colab import files
uploaded = files.upload()

Saving Labor_Productivity_Analysis_final.csv to Labor_Productivity_Analysis_final.csv


In [3]:
df = pd.read_csv(io.BytesIO(uploaded['Labor_Productivity_Analysis_final.csv']),sep=';',decimal=',')
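The `sep=';'` and `decimal=','` arguments are needed because the file uses semicolons as field separators and commas as decimal marks. The effect can be sketched with a small made-up sample; the names `sample`, `parsed` and `unparsed` and the values are illustrative only and are not taken from the data set.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the CSV:
# ';' separates the fields and ',' is the decimal mark.
sample = "Country;Productivity\nA;12,5\nB;30,25\n"

# With decimal=',' the column is parsed as floating-point numbers.
parsed = pd.read_csv(io.StringIO(sample), sep=';', decimal=',')

# Without it, pandas expects '.' as the decimal mark and keeps the values as text.
unparsed = pd.read_csv(io.StringIO(sample), sep=';')

print(parsed['Productivity'].tolist())
print(pd.api.types.is_numeric_dtype(unparsed['Productivity']))
```

If the column were read as text, `describe()` and the charts below would not treat `Productivity` as a number.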


  - Summary statistics for the response variable

In [4]:
summary_statistics = pd.DataFrame(df['Productivity'].describe())
summary_statistics

| | Productivity |
|---|---|
| count | 65.0 |
| mean | 39.199087 |
| std | 23.96128 |
| min | 3.02265 |
| 25% | 19.188294 |
| 50% | 34.622095 |
| 75% | 56.255718 |
| max | 109.488306 |


  - Visualization for the response variable
    - Productivity by continent
    - Productivity by country

    Both views can be shown as bar charts.

In [8]:
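alt.Chart(df).mark_bar().encode(
    # One bar per continent, ordered from highest to lowest value
    x=alt.X('Continent', sort='-y'),
    # Average over the countries of each continent; without an aggregate,
    # the bars of all countries in a continent are drawn at the same x position
    y=alt.Y('mean(Productivity)')
)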

In [6]:
alt.Chart(df).mark_bar().encode(
    # One bar per country, ordered from highest to lowest productivity
    x=alt.X('Country', sort='-y'),
    y=alt.Y('Productivity')
)
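The continent-level view can also be cross-checked numerically with a grouped summary. The following is only a sketch: it uses a small made-up frame called `demo` in place of `df`. The column names follow the data dictionary, but the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for df with invented values
demo = pd.DataFrame({
    'Continent': ['Europe', 'Europe', 'Asia', 'Asia'],
    'Country': ['A', 'B', 'C', 'D'],
    'Productivity': [60.0, 40.0, 20.0, 10.0],
})

# Number of countries, mean and median productivity per continent,
# ordered from the highest to the lowest mean
by_continent = (
    demo.groupby('Continent')['Productivity']
        .agg(['count', 'mean', 'median'])
        .sort_values('mean', ascending=False)
)
print(by_continent)
```

Replacing `demo` with `df` should give the corresponding table for the real data, assuming the column names shown in the data dictionary.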

-   List of variables that will be considered as **predictors**
  - Average annual working hours per worker: The average number of hours worked per worker per year.
  - GDP per capita: GDP per capita of each country in 2017.
  - Population: The population of each country in 2017.
  - Gini coefficient: The 2017 Gini coefficient of each country, a measure of income inequality.
  - Life satisfaction: The level of life satisfaction in each country, measured by the survey question: “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”
  - Current health expenditure per capita: Healthcare expenditure per capita is measured in current international-$, which adjusts for price differences between countries.
-   Model type 
  - Since there are multiple predictors, we plan to use a **linear regression model with multiple predictors**. Because we cannot assume that the predictors are independent of one another, we will also check for multicollinearity in the next step.
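
The multicollinearity check and the model fit can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis itself: the predictors `x1`, `x2`, `x3` and the helper functions `r_squared` and `vif` are hypothetical, and `x2` is deliberately built to correlate strongly with `x1` so that the variance inflation factor (VIF) flags it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 65  # same number of observations as reported in the summary statistics

# Synthetic stand-ins for three predictors (not the real data);
# x2 is constructed to be almost a copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 + 1.5 * x1 - 0.5 * x3 + rng.normal(scale=0.5, size=n)


def r_squared(A, b):
    """R^2 of a least-squares fit of b on A (A must already contain an intercept column)."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ coef
    return 1.0 - (resid @ resid) / np.sum((b - b.mean()) ** 2)


def vif(X):
    """VIF of each column: 1 / (1 - R^2), with R^2 from regressing that column on the others."""
    n_obs, k = X.shape
    values = []
    for j in range(k):
        others = np.column_stack([np.ones(n_obs), np.delete(X, j, axis=1)])
        values.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(values)


vifs = vif(X)

# Multiple linear regression: intercept plus one coefficient per predictor
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("VIF:", vifs.round(1))
print("coefficients:", beta.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. In that case one of the affected predictors could be dropped or the predictors combined before fitting the final model.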

## Data dictionary



In [7]:
data_dictionary = {'Name': list(df.columns.values),
            'Description': ['Continent name','Country name','Country codes by alpha-3','Year: Only 2017 data was considered in this analysis to satisfy the independence condition','Working hours are the annual average per worker.','The GDP per capita by country in 2017 is listed.','The number of population by country in 2017 is listed.','The 2017 Gini coefficient for each country is listed, so the level of income inequality can be determined.','The level of life satisfaction for each country has been measured by survey: “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”','Productivity is measured as gross domestic product (GDP) per hour of work. This data is adjusted for inflation and for differences in the cost of living between countries.','Healthcare expenditure per capita is measured in current international-$, which adjusts for price differences between countries.'],
            'Role': ['ID','ID','ID','ID','Predictor','Predictor','Predictor','Predictor','Predictor','Response','Predictor'],
            'Type': ['nominal','nominal','nominal','numeric','numeric','numeric','numeric','numeric','numeric','numeric','numeric'],
            'Format': list(df.dtypes)}

df_dictionary = pd.DataFrame(data_dictionary)
df_dictionary

Unnamed: 0,Name,Description,Role,Type,Format
0,Continent,Continent name,ID,nominal,object
1,Country,Country name,ID,nominal,object
2,Code,Country codes by alpha-3,ID,nominal,object
3,Year,Year: Only 2017 data was considered in this an...,ID,numeric,int64
4,Average annual working hours per worker,Working hours are the annual average per worker.,Predictor,numeric,float64
5,GDP per capita,The GDP per capita by country in 2017 is listed.,Predictor,numeric,float64
6,Population,The number of population by country in 2017 is...,Predictor,numeric,int64
7,gini_coefficient,The 2017 Gini coefficient for each country is ...,Predictor,numeric,float64
8,Life satisfaction,The level of life satisfaction for each countr...,Predictor,numeric,float64
9,Productivity,Productivity is measured as gross domestic pro...,Response,numeric,float64
