# Assignment 4

**Assignment Question: How does the United States' Life Satisfaction compare with that of other nations, and is a country's Life Satisfaction related to its GDP per capita?**

The visualization addresses the question above. The two datasets for this assignment were obtained from the Organisation for Economic Co-operation and Development (OECD) website (http://www.oecd.org/). Life Satisfaction and GDP per capita data for a collection of developing and developed nations were extracted, and the remaining nations were dropped so that the resulting plot would not become cluttered and hard to read. The relative Life Satisfaction of the selected nations, including the United States, was plotted against GDP per capita for the year 2017.

The graph indicates a moderately strong positive correlation between the two metrics (Life Satisfaction and GDP per capita), with a Pearson correlation coefficient of 0.636 as computed by pandas' `Series.corr`. The relationship is far from perfectly linear, however, as the scatter of points around the linear regression line in the graph shows. The United States' Life Satisfaction falls within the developed nations' cluster (denoted by red filled circles) and is higher than that of most developing nations (denoted by yellow filled circles). Even so, some developing countries, such as Mexico and Chile, are not far behind advanced nations like the United States, and indeed rank higher than France and Luxembourg, two nations with the highest GDP per capita among the selected nations. Thus, this visualization suggests that Life Satisfaction depends on a multitude of factors, not just GDP per capita. These may include affordable healthcare, good K-12 and higher education, cost of living, affordable and quality housing, and a country's infrastructure, such as security and transport.
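
To make the coefficient concrete, here is a minimal sketch of how a Pearson correlation coefficient is computed, written in plain Python rather than with pandas. The `pearson_r` helper and all numbers are hypothetical illustrations, not the OECD data. They only show that a relationship which rises steadily but flattens out yields a coefficient below 1, whereas an exactly linear one yields 1.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only (NOT the OECD data).
gdp = [10000, 20000, 40000, 80000]
linear_score = [2 * g + 1 for g in gdp]       # exactly linear in GDP
flattening_score = [math.log(g) for g in gdp]  # keeps rising, but ever more slowly

r_linear = pearson_r(gdp, linear_score)
r_flattening = pearson_r(gdp, flattening_score)
print(r_linear, r_flattening)
```

A coefficient below 1 therefore does not by itself say whether the points are noisy or the trend is curved; the scatter plot is needed to tell the two apart.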

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np # NumPy: the fundamental package for scientific computing with Python
import pandas as pd # pandas: high-performance, easy-to-use data structures and data analysis tools
import sklearn # scikit-learn: machine learning framework for Python
from sklearn import linear_model # generalized linear models from scikit-learn
import os # operating-system-dependent functionality, used here to build file paths

# to make this notebook's output stable across runs
np.random.seed(42)

%matplotlib inline
import matplotlib # importing plotting library for the Python programming language
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14 # setting axes labelsize
plt.rcParams['xtick.labelsize'] = 12 # setting x axes tick label size
plt.rcParams['ytick.labelsize'] = 12 # setting y axes tick label size

In [None]:
datapath = os.path.join("datasets", "lifesat", "") # building the dataset directory path (the trailing "" appends a path separator)
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',') # reading the OECD Better Life Index dataset into a dataframe, parsing ',' as the thousands separator
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"] # keeping only the rows for the total population ("TOT"), not the per-group breakdowns
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value") # pivoting the dataframe so that each country is a row and each indicator is a column

In [None]:
oecd_bli 
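
The filter-then-pivot step above can be easier to follow on a tiny table. The sketch below uses made-up rows that only mimic the long layout of the OECD file. The country names, the values, and the non-`TOT` codes (`MN`, `WMN`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (country, indicator, group).
long_df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Indicator": ["Life satisfaction", "Life satisfaction", "Employment rate",
                  "Life satisfaction", "Life satisfaction", "Employment rate"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "WMN", "TOT"],
    "Value": [7.0, 7.1, 70.0, 6.0, 5.9, 60.0],
})

# Keep only the whole-population rows so that each (Country, Indicator)
# pair occurs exactly once; pivot raises an error on duplicate pairs.
totals = long_df[long_df["INEQUALITY"] == "TOT"]

# Reshape from long to wide: one row per country, one column per indicator.
wide = totals.pivot(index="Country", columns="Indicator", values="Value")
print(wide)
```

Without the filter, each country would have several rows per indicator and `pivot` could not decide which value belongs in a cell.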

In [None]:
gdp_per_capita = pd.read_csv(datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a") # reading the tab-separated GDP per capita dataset, treating "n/a" entries as missing values
gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True) # renaming a column header
gdp_per_capita.set_index("Country", inplace=True) # setting country as index

In [None]:
gdp_per_capita

In [None]:
full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True) # merging the oecd and GDP per capita dataset
full_country_stats.sort_values(by="GDP per capita", inplace=True) # sort values by GDP per capita
full_country_stats
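
Because `pd.merge` performs an inner join by default, any country that appears in only one of the two tables is silently left out of the merged result. The following sketch, on made-up frames (not the OECD data), shows the same index-on-index merge and one possible way to list the unmatched countries using `how="outer"` with `indicator=True`.

```python
import pandas as pd

# Made-up tables indexed by country; "C" and "D" each appear in only one table.
life = pd.DataFrame({"Life satisfaction": [7.0, 6.0, 5.0]},
                    index=pd.Index(["A", "B", "C"], name="Country"))
gdp = pd.DataFrame({"GDP per capita": [50000.0, 20000.0, 30000.0]},
                   index=pd.Index(["A", "B", "D"], name="Country"))

# Same call shape as in the notebook: inner join on the two indexes.
merged = pd.merge(left=life, right=gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")
print(merged)

# Outer join with an indicator column to see which countries did not match.
audit = pd.merge(left=life, right=gdp, left_index=True, right_index=True,
                 how="outer", indicator=True)
unmatched = sorted(audit[audit["_merge"] != "both"].index.tolist())
print(unmatched)
```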

In [None]:
full_country_stats["Life satisfaction"].corr(full_country_stats["GDP per capita"], method='pearson', min_periods=None) # computing pearson correlation between "GDP per capita" and "Life Satisfaction"
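
The coefficient computed here is later typed into the plot label by hand. As a hedged alternative, the value could be stored in a variable and formatted into the label text, so that the label cannot drift from the data. The sketch below uses made-up numbers (not the OECD data), and the names `stats`, `r`, and `label` are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only (not the OECD data).
stats = pd.DataFrame({
    "GDP per capita": [10000.0, 20000.0, 30000.0, 40000.0],
    "Life satisfaction": [5.0, 6.5, 6.0, 7.5],
})

# Same Series.corr call as in the notebook, with the result kept in a variable.
r = stats["Life satisfaction"].corr(stats["GDP per capita"], method="pearson")

# Build the annotation text from the computed value, rounded to 3 decimals.
label = "Pearson Correlation Coefficient = {:.3f}".format(r)
print(label)
```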

In [None]:
from matplotlib.lines import Line2D # proxy artists for the custom legend


lin_reg_full = linear_model.LinearRegression() # creating a linear regression model from scikit-learn's linear_model module
Xfull = np.c_[full_country_stats["GDP per capita"]] # converting the "GDP per capita" column to a 2D column array of shape (n_samples, 1)
yfull = np.c_[full_country_stats["Life satisfaction"]] # converting the "Life satisfaction" column to a 2D column array of shape (n_samples, 1)
lin_reg_full.fit(Xfull, yfull) # fitting the model with "GDP per capita" as the independent variable and "Life satisfaction" as the dependent variable
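
Two details of the cell above can be checked on toy data: what `np.c_` does to a one-dimensional array, and what an ordinary least squares line looks like. The sketch below swaps in NumPy's `np.polyfit` for scikit-learn's `LinearRegression` to keep it dependency-light. Both fit a least squares line, but this is an illustration on made-up points, not the notebook's model.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1 (not the OECD data).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# np.c_ turns the 1-D array into a single-column 2-D array, i.e. the
# (n_samples, n_features) layout that scikit-learn estimators expect.
X_col = np.c_[x]
print(X_col.shape)

# Least squares fit of a degree-1 polynomial; returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)

# The fitted line evaluated on a grid, analogous to t0 + t1 * X in the plot cell.
grid = np.linspace(0.0, 3.0, 4)
line = intercept + slope * grid
```

In the notebook's model, the counterparts of `intercept` and `slope` are `lin_reg_full.intercept_[0]` and `lin_reg_full.coef_[0][0]`.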

In [None]:
full_country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', color='mediumblue', figsize=(8,8)) # plotting scatterplot of 'Life satisfaction' vs "GDP per capita" 
plt.axis([0, 110000, 0, 10]) # defining axes ranges
position_text = {
    "Hungary": (5000, 1),
    "Korea": (18000, 1.7),
    "France": (29000, 2.4),
    "Australia": (40000, 3.0),
    "United States": (52000, 3.8),
    "Turkey": (500, 2.0),
    "Mexico": (11000, 9.0),
    "Chile": (25000, 9.0),
    "Czech Republic": (35000, 9.0),
    "Norway": (60000, 3),
    "Switzerland": (72000, 3.0),
    "Luxembourg": (90000, 3.0),
    } # dictionary mapping each selected country's name to the position of its label on the graph
for country, pos_text in position_text.items(): # iterating through the countries selected for study
    pos_data_x, pos_data_y = full_country_stats[["GDP per capita", 'Life satisfaction']].loc[country] # extracting GDP per capita & Life satisfaction of each country
    plot_colour = 'gold' if country in ("Mexico", "Chile", "Czech Republic", "Hungary", "Turkey") else 'red' # setting marker color of each country in iteration
    colour = 'orangered' if country == "United States" else 'black' # setting text color of each country in iteration
    font_size = 20 if country == "United States" else 10 # setting font size of each country's label in iteration
    country = "U.S." if country == "United States" else country # abbreviating the label for the United States
    plt.annotate(country, color=colour, fontsize=font_size, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(color='darkgreen', facecolor='black', width=1, shrink=0.1, alpha=0.5, headwidth=5)) # annotating each selected country's point with its name and an arrow
    plt.plot(pos_data_x, pos_data_y, color=plot_colour, marker='o', linewidth=10) # drawing the colored marker for the country
    
t0full, t1full = lin_reg_full.intercept_[0], lin_reg_full.coef_[0][0] # intercept & coefficient of linear regression model
X = np.linspace(0, 110000, 1000) # evenly spaced GDP per capita values spanning the x axis
plt.plot(X, t0full + t1full * X, color='gray', ls='-.') # plotting the regression line obtained from the intercept & coefficient of the linear regression model
plt.text(70000, 7.5, "Best fit Linear Regression line", ha='left', rotation=14, wrap=True) # labelling the regression line
    
text =  "Pearson Correlation Coefficient = 0.636"
   
plt.text(60000, 1, text, style='italic',
        bbox={'facecolor':'lightcyan', 'alpha':0.5, 'pad':10}) # highlighting the graph with pearson coefficient between the axes variables

plt.ylabel('Relative Life Satisfaction in 2017 (OECD better life Index)', fontsize=14, style='italic') # setting y label
plt.xlabel('GDP per capita in USD in 2017', fontsize=14, style='italic') # setting x label
plt.title('Life Satisfaction vs GDP per capita', fontsize=14) #setting plot title

legend_elements = [Line2D([0], [0], marker='o', color='red', label='Countries with Higher GDP per capita', markersize=15),
                   Line2D([0], [0], marker='o', color='gold', label='Countries with Lower GDP per capita', markersize=15),
                   Line2D([0], [0], marker='o', color='mediumblue', label='Other Countries', markersize=15)] # adding plot legends

plt.legend(handles=legend_elements, loc='best') # selecting plot legend location

plt.show()