# Correlation between HDI and GDP (PPP) per capita

In this test case we'll use Pandas to check whether HDI and GDP (PPP) per capita are correlated. The answer is fairly obvious, but it's a good way to show the power of this tool.

First let's get information about HDI.

In [1]:
import pandas as pd

hdi_url = 'https://countryeconomy.com/hdi'
hdiDFs = pd.read_html(hdi_url)
hdiDFs

[              Countries    HDI HDI Ranking  HDI Ranking.1  Ch.
 0     United States [+]  0.924         13º            NaN  1.0
 1    United Kingdom [+]  0.922         14º            NaN  0.0
 2           Germany [+]  0.936          5º            NaN  1.0
 3            France [+]  0.901         24º            NaN  0.0
 4             Japan [+]  0.909         19º            NaN  0.0
 ..                  ...    ...         ...            ...  ...
 184           Samoa [+]  0.713        104º            NaN  0.0
 185           Yemen [+]  0.452        178º            NaN  6.0
 186    South Africa [+]  0.699        113º            NaN  2.0
 187          Zambia [+]  0.588        144º            NaN  3.0
 188        Zimbabwe [+]  0.535        156º            NaN  1.0
 
 [189 rows x 5 columns]]

The DataFrame that matters to us is in the first position of the list of captured DataFrames.

In [2]:
hdi = hdiDFs[0]
hdi

Unnamed: 0,Countries,HDI,HDI Ranking,HDI Ranking.1,Ch.
0,United States [+],0.924,13º,,1.0
1,United Kingdom [+],0.922,14º,,0.0
2,Germany [+],0.936,5º,,1.0
3,France [+],0.901,24º,,0.0
4,Japan [+],0.909,19º,,0.0
...,...,...,...,...,...
184,Samoa [+],0.713,104º,,0.0
185,Yemen [+],0.452,178º,,6.0
186,South Africa [+],0.699,113º,,2.0
187,Zambia [+],0.588,144º,,3.0


Now let's capture information about adjusted per capita income.

In [3]:
income_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita'
incomeDFs = pd.read_html(income_url)
incomeDFs

[                                                   0  \
 0  > 50,000 35,000–50,000 20,000–35,000 10,000–20...   
 
                                           1  
 0  5,000–10,000 2,000–5,000 < 2,000 No data  ,
                                                      0  \
 0                International Monetary Fund (2018)[4]   
 1    Rank Country/Territory Int$ 1  Qatar 130,475 —...   
 2                                                 Rank   
 3                                                    1   
 4                                                    —   
 ..                                                 ...   
 612                                                189   
 613                                                190   
 614                                                191   
 615                                                192   
 616                                                193   
 
                                                      1  \
 0                  

The DataFrame that matters to us is in the fifth position of the list of DFs, so we'll use it as our income data.

In [4]:
income = incomeDFs[4]
income

Unnamed: 0,Rank,Country/Territory,Int$,Year
0,1,Liechtenstein,139100.0,2009 est.
1,2,Qatar,124900.0,2017 est.
2,3,Monaco,115700.0,2015 est.
3,—,Macau,114400.0,2017 est.
4,4,Luxembourg,109100.0,2017 est.
...,...,...,...,...
225,189,Liberia,900.0,2017 est.
226,190,"Congo, Democratic Republic of the",800.0,2017 est.
227,191,Burundi,800.0,2017 est.
228,192,Central African Republic,700.0,2017 est.
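
Picking a table by its position in the list is fragile, because the position can change whenever the page layout changes. Here's a minimal sketch of a more defensive approach, assuming the wanted table can be recognized by its column names. `find_table` is a hypothetical helper (it is not part of Pandas), and the toy DataFrames stand in for the list returned by `pd.read_html`.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        # Column labels are not always strings (e.g. 0, 1), so compare as strings.
        if required.issubset({str(col) for col in table.columns}):
            return table
    raise ValueError(f'No table with columns {sorted(required)} found')


# Toy stand-ins for the list of DataFrames returned by pd.read_html
tables = [
    pd.DataFrame({0: ['legend'], 1: ['no data']}),
    pd.DataFrame({'Rank': [1, 2],
                  'Country/Territory': ['Liechtenstein', 'Qatar'],
                  'Int$': [139100.0, 124900.0],
                  'Year': ['2009 est.', '2017 est.']}),
]

income_table = find_table(tables, ['Country/Territory', 'Int$'])
print(income_table)
```

With the real data, the call would look something like `find_table(incomeDFs, ['Country/Territory', 'Int$'])`, assuming the page still uses those column names.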


Now we have our two DataFrames. Let's take a look at them again.

In [5]:
hdi

Unnamed: 0,Countries,HDI,HDI Ranking,HDI Ranking.1,Ch.
0,United States [+],0.924,13º,,1.0
1,United Kingdom [+],0.922,14º,,0.0
2,Germany [+],0.936,5º,,1.0
3,France [+],0.901,24º,,0.0
4,Japan [+],0.909,19º,,0.0
...,...,...,...,...,...
184,Samoa [+],0.713,104º,,0.0
185,Yemen [+],0.452,178º,,6.0
186,South Africa [+],0.699,113º,,2.0
187,Zambia [+],0.588,144º,,3.0


In [6]:
income

Unnamed: 0,Rank,Country/Territory,Int$,Year
0,1,Liechtenstein,139100.0,2009 est.
1,2,Qatar,124900.0,2017 est.
2,3,Monaco,115700.0,2015 est.
3,—,Macau,114400.0,2017 est.
4,4,Luxembourg,109100.0,2017 est.
...,...,...,...,...
225,189,Liberia,900.0,2017 est.
226,190,"Congo, Democratic Republic of the",800.0,2017 est.
227,191,Burundi,800.0,2017 est.
228,192,Central African Republic,700.0,2017 est.


I want both DataFrames to have only two columns: the name of the country and the indicator value (HDI or GDP per capita). We'll have to format our DFs to achieve this. Let's start with the HDI DataFrame.

In [7]:
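# Pass the labels through the columns keyword; a positional axis argument is rejected by recent Pandas versions
hdi.drop(columns=['HDI Ranking', 'HDI Ranking.1', 'Ch.'], inplace=True)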
hdi

Unnamed: 0,Countries,HDI
0,United States [+],0.924
1,United Kingdom [+],0.922
2,Germany [+],0.936
3,France [+],0.901
4,Japan [+],0.909
...,...,...
184,Samoa [+],0.713
185,Yemen [+],0.452
186,South Africa [+],0.699
187,Zambia [+],0.588


I want the column names of my DataFrames to be standardized to *Country*, *HDI* and *GDP (PPP) percapita (U$)*, so I need to rename one of the columns in my HDI DataFrame.

In [8]:
hdi.rename(columns={'Countries': 'Country'}, inplace=True)
hdi

Unnamed: 0,Country,HDI
0,United States [+],0.924
1,United Kingdom [+],0.922
2,Germany [+],0.936
3,France [+],0.901
4,Japan [+],0.909
...,...,...
184,Samoa [+],0.713
185,Yemen [+],0.452
186,South Africa [+],0.699
187,Zambia [+],0.588
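
As an alternative to dropping and renaming in place, the same result can be reached by selecting the wanted columns and renaming them in one chained expression. This is only a sketch on a toy DataFrame: `raw_hdi` and `hdi_clean` are made-up names, and the rows mimic the scraped table.

```python
import pandas as pd

# Toy stand-in for the scraped HDI table
raw_hdi = pd.DataFrame({
    'Countries': ['Germany [+]', 'Japan [+]'],
    'HDI': [0.936, 0.909],
    'HDI Ranking': ['5º', '19º'],
    'HDI Ranking.1': [None, None],
    'Ch.': [1.0, 0.0],
})

# Select the wanted columns and rename in one chained expression;
# raw_hdi itself is left untouched.
hdi_clean = raw_hdi[['Countries', 'HDI']].rename(columns={'Countries': 'Country'})
print(hdi_clean)
```

Listing the columns to keep, instead of the ones to drop, also means the code keeps working if the site adds a new column.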


Finally, I need to fix the country names. Do you see that [+]? Well, it must be removed.

In [9]:
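# Remove only the literal ' [+]' suffix; rstrip(' [+]') would strip any trailing run of those characters
hdi['Country'] = hdi['Country'].str.replace(r'\s*\[\+\]$', '', regex=True)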

In [10]:
hdi

Unnamed: 0,Country,HDI
0,United States,0.924
1,United Kingdom,0.922
2,Germany,0.936
3,France,0.901
4,Japan,0.909
...,...,...
184,Samoa,0.713
185,Yemen,0.452
186,South Africa,0.699
187,Zambia,0.588
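
One thing to keep in mind when removing a marker like ` [+]`: `str.rstrip` treats its argument as a set of characters, not as a suffix. It gives the expected result for the names in this table, but it would also eat a closing bracket that belongs to the name. The sketch below compares it with an anchored regular expression; `Example [X] [+]` is a made-up name used only to show the difference.

```python
import pandas as pd

# 'Example [X] [+]' is a made-up name that ends in a bracket
names = pd.Series(['Samoa [+]', 'Example [X] [+]'])

# rstrip removes any trailing run of the characters ' ', '[', '+' and ']'
stripped = names.str.rstrip(' [+]')

# An anchored regular expression removes only the literal ' [+]' suffix
cleaned = names.str.replace(r'\s*\[\+\]$', '', regex=True)

print(stripped.tolist())  # ['Samoa', 'Example [X']
print(cleaned.tolist())   # ['Samoa', 'Example [X]']
```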


Alright! Our HDI DataFrame is formatted. Let's format the income DataFrame.

In [11]:
income.drop(columns=['Rank', 'Year'], inplace=True)
income.columns = ['Country', 'GDP (PPP) percapita (U$)']
income

Unnamed: 0,Country,GDP (PPP) percapita (U$)
0,Liechtenstein,139100.0
1,Qatar,124900.0
2,Monaco,115700.0
3,Macau,114400.0
4,Luxembourg,109100.0
...,...,...
225,Liberia,900.0
226,"Congo, Democratic Republic of the",800.0
227,Burundi,800.0
228,Central African Republic,700.0


Now let's merge our two DataFrames and take a look at the full result, sorted by HDI. We'll also set the country name as the index, because the initial numeric index is not useful at this point.

In [12]:
pd.set_option('display.max_rows', 200)

mergedDF = hdi.merge(income, on='Country')
mergedDF.sort_values('HDI', ascending=False, inplace=True)
mergedDF.set_index('Country', inplace=True)
mergedDF

Country,HDI,GDP (PPP) percapita (U$)
Norway,0.953,70600.0
Switzerland,0.944,61400.0
Australia,0.939,49900.0
Ireland,0.938,72600.0
Germany,0.936,50200.0
Iceland,0.935,52100.0
Hong Kong,0.933,61000.0
Sweden,0.933,51300.0
Singapore,0.932,90500.0
Netherlands,0.931,53600.0
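
By default `merge` performs an inner join, so any country whose name is spelled differently in the two sources silently drops out of the result. Here's a minimal sketch of one way to list those rows, using `how='outer'` together with `indicator=True`. The toy DataFrames are stand-ins: `Examplestan` and its values are made up, while the Norway and Germany figures are copied from the tables above.

```python
import pandas as pd

hdi_toy = pd.DataFrame({
    'Country': ['Norway', 'Germany', 'Examplestan'],
    'HDI': [0.953, 0.936, 0.5],  # the last value is made up
})
income_toy = pd.DataFrame({
    'Country': ['Norway', 'Germany', 'Republic of Examplestan'],
    'GDP (PPP) percapita (U$)': [70600.0, 50200.0, 1000.0],  # the last value is made up
})

# An outer join keeps unmatched rows; indicator=True adds a '_merge' column
# whose values are 'left_only', 'right_only' or 'both'.
outer = hdi_toy.merge(income_toy, on='Country', how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']
print(unmatched[['Country', '_merge']])
```

Rows marked `left_only` or `right_only` are candidates for a manual name fix before the final merge.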


Finally, let's check the correlation between the two values. We'll use the `corr()` function, which calculates the pairwise correlation between all columns (Pearson's coefficient by default).

In [13]:
mergedDF.corr()

Unnamed: 0,HDI,GDP (PPP) percapita (U$)
HDI,1.0,0.727729
GDP (PPP) percapita (U$),0.727729,1.0


The `corr()` function returned a DataFrame. Since we're interested only in the correlation of the two variables, let's pinpoint what matters to us.

In [14]:
# Index the matrix with a single .iloc call (row 0, column 1) instead of chained indexing
corr = mergedDF.corr().iloc[0, 1]
print(f'The correlation between HDI and GDP (PPP) per capita is {corr}')

The correlation between HDI and GDP (PPP) per capita is 0.7277286705440962
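
The same number can be obtained without positional indexing. The sketch below, on made-up values, shows a label-based lookup in the correlation matrix, a direct call to `Series.corr`, and the coefficient computed from its definition (covariance divided by the product of the standard deviations). The short column name `GDP` is used only to keep the example compact.

```python
import pandas as pd

# Made-up values, only to illustrate the calls
df = pd.DataFrame({'HDI': [0.95, 0.90, 0.75, 0.60, 0.45],
                   'GDP': [70000.0, 45000.0, 15000.0, 5000.0, 1500.0]})

# Label-based lookup in the correlation matrix
from_matrix = df.corr().loc['HDI', 'GDP']

# Direct correlation between two Series (Pearson by default)
direct = df['HDI'].corr(df['GDP'])

# The same coefficient from the definition: cov(x, y) / (std(x) * std(y))
x, y = df['HDI'], df['GDP']
by_hand = x.cov(y) / (x.std() * y.std())

print(from_matrix, direct, by_hand)
```

All three values should agree up to floating-point rounding.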


There it is. As expected, there's a strong correlation between the two variables.

# Bonus: Scatter plot with Plotly
Why don't we create a nice interactive scatter plot with the information we gathered? It will let us visualize the strong correlation we found between HDI and per capita income.

In [26]:
import plotly.offline as pyo
import plotly.graph_objs as go

x_values = mergedDF['HDI']
y_values = mergedDF['GDP (PPP) percapita (U$)']
hover_values = mergedDF.index

# Define the marker style and the figure layout (title and axis labels)
marker = dict(size=5, line={'width': 1})
layout = go.Layout(title="HDI x GDP (PPP) percapita (U$)",
                   xaxis=dict(title="HDI"),
                   yaxis=dict(title="Per capita income"))

# Build the scatter trace (country names appear on hover) and the figure
data = [go.Scatter(x=x_values, y=y_values, hovertext=hover_values,
                   mode='markers', marker=marker)]
fig = go.Figure(data=data, layout=layout)

# Render the interactive plot inside the notebook
pyo.iplot(fig)