Hans Rosling's visualization of life expectancy vs. GDP is one of the most popular visualizations of the last twenty years. 
https://www.youtube.com/watch?v=Z8t4k0Q8e8Y

In his video he animates about 120 thousand points. I believe Macromedia Flash was used back in 2006 to create the beautiful animation. From the viewer's perspective, this is far too much information to absorb. Adding filters for selective focus is probably the best way to visualize the data. 

We will take an alternative approach to this visual. Instead of animating scatterplots we will use 
* A connected scatterplot to show the trajectory of selected countries 
* A high-dimensional (HD) map that gives us an idea of the entire space of countries, reveals outliers, and answers the question of which country is similar to which 
* Clusters, obtained by spectral clustering, shown on the HD map 

In [89]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

The data is a subset of the version that Hans Rosling used. We have information for 141 countries starting from 1952. 

In [90]:
df = pd.read_csv("../data/gapminder_after1952.csv")

In [91]:
df.head(20)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4
5,Afghanistan,Asia,1977,38.438,14880372,786.11336,AFG,4
6,Afghanistan,Asia,1982,39.854,12881816,978.011439,AFG,4
7,Afghanistan,Asia,1987,40.822,13867957,852.395945,AFG,4
8,Afghanistan,Asia,1992,41.674,16317921,649.341395,AFG,4
9,Afghanistan,Asia,1997,41.763,22227415,635.341351,AFG,4


To add subcontinent information, we download a mapping from country codes to continents and regions from https://www.kaggle.com/datasets/andradaolteanu/country-mapping-iso-continent-region?resource=download

In [92]:
continents = pd.read_csv("../data/continents2.csv")

In [93]:
continents.head()

Unnamed: 0,name,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,intermediate-region,region-code,sub-region-code,intermediate-region-code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142.0,34.0,
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,,150.0,154.0,
2,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150.0,39.0,
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,,2.0,15.0,
4,American Samoa,AS,ASM,16,ISO 3166-2:AS,Oceania,Polynesia,,9.0,61.0,


In [94]:
continents = (continents[['name', 'alpha-3', 'region', 'sub-region']]
              .rename(columns={"alpha-3":"iso_alpha", "sub-region":"subcontinent",
                           "name":"country", "region":"continent"})
)

In [95]:
continents.head()

Unnamed: 0,country,iso_alpha,continent,subcontinent
0,Afghanistan,AFG,Asia,Southern Asia
1,Åland Islands,ALA,Europe,Northern Europe
2,Albania,ALB,Europe,Southern Europe
3,Algeria,DZA,Africa,Northern Africa
4,American Samoa,ASM,Oceania,Polynesia


In [96]:
continents['subcontinent'].value_counts()

Sub-Saharan Africa                 53
Latin America and the Caribbean    52
Western Asia                       18
Southern Europe                    16
Northern Europe                    16
South-eastern Asia                 11
Polynesia                          10
Eastern Europe                     10
Southern Asia                       9
Western Europe                      9
Eastern Asia                        8
Micronesia                          8
Northern Africa                     7
Australia and New Zealand           6
Melanesia                           5
Northern America                    5
Central Asia                        5
Name: subcontinent, dtype: int64

In [97]:
df.country.unique().shape

(141,)

Our data contain 141 countries, with observations starting in 1952 and ending in 2007. There are no missing values, which is important for the HD and clustering applications. 

In [98]:
df.year.value_counts().reset_index()

Unnamed: 0,index,year
0,1952,141
1,1957,141
2,1962,141
3,1967,141
4,1972,141
5,1977,141
6,1982,141
7,1987,141
8,1992,141
9,1997,141
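
The completeness claim can be checked directly. The snippet below is a minimal sketch on a small made-up table (the `toy` frame and its values are hypothetical, not the gapminder data); it counts missing values per column and tests whether every country has one row per year.

```python
import pandas as pd

# Toy stand-in for the gapminder table (hypothetical values)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 42.0, 60.0, None],   # one deliberately missing value
    "gdpPercap": [800.0, 850.0, 5000.0, 5200.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# A balanced panel has as many distinct years per country as the table has overall
years_per_country = toy.groupby("country")["year"].nunique()
is_balanced = bool((years_per_country == toy["year"].nunique()).all())

print(missing_per_column)
print("balanced panel:", is_balanced)
```

Running the same two checks on `df` should confirm that each of the 141 countries has a full set of years and no gaps.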


In [99]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4


In [100]:
df = df.merge(continents[['iso_alpha', 'subcontinent']], on='iso_alpha')

### EDA on the dataset is always informative  

In [101]:

# Info on how to construct these histograms:
# https://plotly.com/python/histograms/
# https://plotly.com/python/subplots/

from plotly.subplots import make_subplots

fig = make_subplots(rows=3, cols=2, shared_xaxes=True, shared_yaxes=True,
    subplot_titles=("Europe", "Africa", "Asia", "Americas", "Oceania"))

start_bin = 0
end_bin = 90
bin_size = 5 
variable = "lifeExp"

continent_names = ["Europe", "Africa", "Asia", "Americas", "Oceania"]

# One histogram per continent, filling the 3x2 grid row by row
for i, name in enumerate(continent_names):
    trace = go.Histogram(x=df[df.continent == name][variable],
                histnorm='percent',
                name=name,
                #xbins=dict(start=start_bin, end=end_bin, size=bin_size)
                )
    fig.add_trace(trace, row=i // 2 + 1, col=i % 2 + 1)

# fig.update_yaxes(title_text="Percent", range=[0, 50], row=1, col=1)
# fig.update_yaxes(title_text="Percent", range=[0, 50], row=2, col=1)
# fig.update_yaxes(title_text="Percent", range=[0, 50], row=3, col=1)

fig.update_layout(bargap=0.05)

fig.update_layout(height=600, width=1000, title_text="Life Expectancy By Continent")
fig.update_layout(showlegend=False)

fig.show()

Let's visualize the trajectories of some countries. 

In [102]:
tmp = df.query("country in ['Norway',   'China', 'Saudi Arabia']")

One should improve the chart below by providing better labels and titles

In [103]:
# visuals constructed based on info 
# https://plotly.com/python/line-and-scatter/
# https://plotly.com/python/marker-style/ 

tmp = df.query("country in ['Norway',   'China', 'Saudi Arabia']")

fig = px.line(tmp, x="gdpPercap", y="lifeExp", 
              color="country", 
              log_x=True, 
              #text="year",
              )

fig.update_traces(textposition="bottom right")
fig.update_layout(height=600, width=1000, title_text="Trajectories of Selected Countries")
fig.show()

Below I calculate the average life expectancy and GDP per capita (an unweighted mean over countries) for each of the five continents and visualize those trajectories.

In [104]:
tmp = (df[['year', 'continent', 'lifeExp', 'gdpPercap']]
       .groupby(['year', 'continent']).mean().reset_index()
) 


fig = px.line(tmp, x="gdpPercap", y="lifeExp", 
              color="continent", 
              log_x=True, 
              #text="year",
              )

fig.update_traces(textposition="bottom right")
fig.update_layout(title_text="Trajectories of Continent Centroids")
fig.show()

But within each continent there is significant variability. Below I show the trajectories of the subcontinents. 

In [105]:
subcontinent_centroids = df[['year', 'subcontinent', 'lifeExp',
                         'gdpPercap']].groupby(['subcontinent', 'year']).mean().reset_index()

fig = px.line(subcontinent_centroids, x="gdpPercap", y="lifeExp", color="subcontinent",  log_x=True, 
             title="Subcontinent Trajectories. Each line represents the average GDP per capita and life expectancy by year",
             width=800, height=600)

fig.update_traces(textposition="bottom right")

fig.show()

## Prepare the dataset for HD visualization and cluster Analyses 

We need to reformat the dataset so that 
* Each country is a row. We pivot the dataset so that the life expectancy and GDP per capita for each year become columns. This can be easily done by unstacking the data. 
* The data are scaled. The two variables have different domains or scales, so without scaling the GDP per capita will dominate the distance calculations. After scaling, each variable/column has an equal weight in the distance calculations. 
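
The effect of scaling on distances can be illustrated with a small sketch. The three rows below are hypothetical countries, not values from the dataset, and `dist` is a helper defined only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [life expectancy in years, GDP per capita in dollars]
X = np.array([
    [45.0, 1000.0],   # country A
    [75.0, 1100.0],   # country B: very different life expectancy, similar GDP
    [46.0, 3000.0],   # country C: similar life expectancy, different GDP
])

def dist(M, i, j):
    """Euclidean distance between rows i and j of M."""
    return float(np.linalg.norm(M[i] - M[j]))

# Unscaled: the 2000-dollar GDP gap (A vs C) dwarfs the 30-year gap (A vs B)
raw_ab, raw_ac = dist(X, 0, 1), dist(X, 0, 2)

# Scaled: both columns are expressed in standard deviations
Z = StandardScaler().fit_transform(X)
scaled_ab, scaled_ac = dist(Z, 0, 1), dist(Z, 0, 2)

print(f"raw:    A-B {raw_ab:.1f}   A-C {raw_ac:.1f}")
print(f"scaled: A-B {scaled_ab:.2f}   A-C {scaled_ac:.2f}")
```

Before scaling, A looks far closer to B than to C purely because dollars are numerically larger than years. After scaling, the two gaps count about equally.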

In [106]:
lifeExp = df[['country', 'year', 'lifeExp']].set_index(['country', 'year']).unstack().round(0)
lifeExp.head()

Unnamed: 0_level_0,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp
year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Afghanistan,29.0,30.0,32.0,34.0,36.0,38.0,40.0,41.0,42.0,42.0,42.0,44.0
Albania,55.0,59.0,65.0,66.0,68.0,69.0,70.0,72.0,72.0,73.0,76.0,76.0
Algeria,43.0,46.0,48.0,51.0,55.0,58.0,61.0,66.0,68.0,69.0,71.0,72.0
Angola,30.0,32.0,34.0,36.0,38.0,39.0,40.0,40.0,41.0,41.0,41.0,43.0
Argentina,62.0,64.0,65.0,66.0,67.0,68.0,70.0,71.0,72.0,73.0,74.0,75.0


In [107]:
gdpPercap = df[['country', 'year', 'gdpPercap']].set_index(['country', 'year']).unstack().round(0)
gdpPercap.head()

Unnamed: 0_level_0,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap
year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Afghanistan,779.0,821.0,853.0,836.0,740.0,786.0,978.0,852.0,649.0,635.0,727.0,975.0
Albania,1601.0,1942.0,2313.0,2760.0,3313.0,3533.0,3631.0,3739.0,2497.0,3193.0,4604.0,5937.0
Algeria,2449.0,3014.0,2551.0,3247.0,4183.0,4910.0,5745.0,5681.0,5023.0,4797.0,5288.0,6223.0
Angola,3521.0,3828.0,4269.0,5523.0,5473.0,3009.0,2757.0,2430.0,2628.0,2277.0,2773.0,4797.0
Argentina,5911.0,6857.0,7133.0,8053.0,9443.0,10079.0,8998.0,9140.0,9308.0,10967.0,8798.0,12779.0


The two datasets above need to be concatenated to produce our input dataset 

In [108]:
df2 = pd.concat([lifeExp,gdpPercap], axis=1) 
df2.head()

Unnamed: 0_level_0,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,...,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap
year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,...,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Afghanistan,29.0,30.0,32.0,34.0,36.0,38.0,40.0,41.0,42.0,42.0,...,853.0,836.0,740.0,786.0,978.0,852.0,649.0,635.0,727.0,975.0
Albania,55.0,59.0,65.0,66.0,68.0,69.0,70.0,72.0,72.0,73.0,...,2313.0,2760.0,3313.0,3533.0,3631.0,3739.0,2497.0,3193.0,4604.0,5937.0
Algeria,43.0,46.0,48.0,51.0,55.0,58.0,61.0,66.0,68.0,69.0,...,2551.0,3247.0,4183.0,4910.0,5745.0,5681.0,5023.0,4797.0,5288.0,6223.0
Angola,30.0,32.0,34.0,36.0,38.0,39.0,40.0,40.0,41.0,41.0,...,4269.0,5523.0,5473.0,3009.0,2757.0,2430.0,2628.0,2277.0,2773.0,4797.0
Argentina,62.0,64.0,65.0,66.0,67.0,68.0,70.0,71.0,72.0,73.0,...,7133.0,8053.0,9443.0,10079.0,8998.0,9140.0,9308.0,10967.0,8798.0,12779.0


As stated above, the new dataset has variables with different scales, so it needs to be standardized or normalized. This can be easily done using the StandardScaler object in scikit-learn.

In [109]:
from sklearn.preprocessing import StandardScaler

In [110]:
df2_scaled =  StandardScaler().fit(df2.values).transform(df2.values)

In [111]:
df2_scaled.shape

(141, 24)

df2_scaled is a NumPy array. The number of rows equals the number of countries and the number of columns equals the total number of variables (12 years × 2 measures = 24). If we put the scaled values into a dataframe, we see that the standard scaling algorithm replaced each value with its z-score, calculated column-wise. 

In [112]:
df2_scaled = pd.DataFrame(df2_scaled, columns=df2.columns, index=df2.index)

In [113]:
df2_scaled.head()

Unnamed: 0_level_0,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,...,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap
year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,...,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Afghanistan,-1.631762,-1.75595,-1.778137,-1.841151,-1.907761,-1.922778,-2.004456,-2.108653,-1.975678,-1.983276,...,-0.799292,-0.823961,-0.865787,-0.865335,-0.853348,-0.851461,-0.837129,-0.838165,-0.823101,-0.836024
Albania,0.492542,0.62156,0.946154,0.887914,0.918181,0.853721,0.794768,0.841706,0.712362,0.699763,...,-0.437925,-0.43851,-0.445675,-0.479365,-0.498047,-0.495491,-0.62578,-0.577771,-0.469008,-0.439142
Algeria,-0.487906,-0.44422,-0.457269,-0.391335,-0.229858,-0.131488,-0.044999,0.270669,0.353957,0.353565,...,-0.379018,-0.340946,-0.303624,-0.285888,-0.214931,-0.25604,-0.336892,-0.41449,-0.406537,-0.416266
Angola,-1.550058,-1.591984,-1.613028,-1.670584,-1.73114,-1.833214,-2.004456,-2.203825,-2.065279,-2.069826,...,0.046207,0.115024,-0.092996,-0.55299,-0.615097,-0.656892,-0.610798,-0.671016,-0.636236,-0.530324
Argentina,1.06447,1.031475,0.946154,0.887914,0.82987,0.764157,0.794768,0.746533,0.712362,0.699763,...,0.755081,0.621879,0.555214,0.440388,0.220725,0.170458,0.153167,0.213591,-0.085961,0.10811
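
The z-score interpretation can be verified by hand. This sketch uses a small hypothetical table in place of `df2` (the column names are made up) and compares `StandardScaler` with a manual column-wise calculation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small hypothetical table standing in for df2
toy = pd.DataFrame({"lifeExp_1952": [29.0, 55.0, 43.0, 62.0],
                    "gdpPercap_1952": [779.0, 1601.0, 2449.0, 5911.0]})

scaled = StandardScaler().fit_transform(toy.values)

# Manual z-score per column. StandardScaler divides by the population
# standard deviation (ddof=0), whereas pandas' .std() defaults to ddof=1.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.values))
```

Each scaled column has mean 0 and unit (population) standard deviation, which is why no single year or measure dominates the later distance calculations.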


# Project the data using MDS

In [114]:
from sklearn.manifold import MDS

In [115]:
coords_MDS = MDS(n_components=2).fit_transform(df2_scaled.values)

In [116]:
coords_MDS[0:5]

array([[-0.70098679,  7.26249124],
       [-2.42489374, -1.92400287],
       [-0.83010636,  0.52901975],
       [ 1.30590571,  6.66382063],
       [ 1.23174584, -2.80557549]])

In [117]:
coords_MDS.shape

(141, 2)
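
MDS tries to place the points in the plane so that their pairwise distances approximate the distances in the original space. The sketch below illustrates this on synthetic data (not the country data) and passes `random_state`, an optional argument not used in the cell above, so that reruns give the same layout.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Synthetic data: 40 points that are essentially 2-D, embedded in 6 dimensions
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(40, 6))

# random_state fixes any random initialisation, making the layout reproducible
mds = MDS(n_components=2, random_state=0)
coords = mds.fit_transform(X)

# Correlation between original and projected pairwise distances
d_high = pdist(X)
d_low = pdist(coords)
corr = np.corrcoef(d_high, d_low)[0, 1]

print(coords.shape, "stress:", mds.stress_, "distance correlation:", corr)
```

A correlation close to 1 means the map is a faithful picture of the original distances. The orientation of an MDS map is arbitrary, so only relative positions carry meaning.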

The MDS algorithm produced x,y coordinates for every country. Each coordinate pair is the projection of a country onto the plane. Let's create a dataset that contains the MDS projections. 

In [118]:
HDmap = pd.DataFrame(index=df2_scaled.index)

HDmap['MDS_x'] = coords_MDS[:,0]
HDmap['MDS_y'] = coords_MDS[:,1]

In [119]:
HDmap

Unnamed: 0_level_0,MDS_x,MDS_y
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,-0.700987,7.262491
Albania,-2.424894,-1.924003
Algeria,-0.830106,0.529020
Angola,1.305906,6.663821
Argentina,1.231746,-2.805575
...,...,...
Vietnam,-2.978504,1.481026
West Bank and Gaza,-1.458950,0.246175
"Yemen, Rep.",-0.879406,4.528705
Zambia,-2.551803,4.956984


We will visualize this projection as a scatterplot. But it should be interpreted as a map. Before we do so we will add country information.

In [120]:
HDmap.shape

(141, 2)

In [121]:
HDmap = HDmap.merge(continents, on="country")
HDmap.head()

Unnamed: 0,country,MDS_x,MDS_y,iso_alpha,continent,subcontinent
0,Afghanistan,-0.700987,7.262491,AFG,Asia,Southern Asia
1,Albania,-2.424894,-1.924003,ALB,Europe,Southern Europe
2,Algeria,-0.830106,0.52902,DZA,Africa,Northern Africa
3,Angola,1.305906,6.663821,AGO,Africa,Sub-Saharan Africa
4,Argentina,1.231746,-2.805575,ARG,Americas,Latin America and the Caribbean
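
A merge on country names uses an inner join by default, so any country spelled differently in the two tables is dropped without warning. The crosstab further down in this notebook sums to 125 countries rather than 141, which suggests that some rows were lost this way. The sketch below, with hypothetical names and values, shows one way to detect such losses using the `indicator` option of `merge`.

```python
import pandas as pd

# Hypothetical tables: one country is spelled differently on each side
left = pd.DataFrame({"country": ["Albania", "Yemen, Rep.", "Zambia"],
                     "MDS_x": [-2.4, -0.9, -2.6]})
right = pd.DataFrame({"country": ["Albania", "Yemen", "Zambia"],
                      "continent": ["Europe", "Asia", "Africa"]})

# A left merge keeps every row of `left`; indicator=True adds a `_merge`
# column saying whether a match was found on the right
checked = left.merge(right, on="country", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()

# The default inner merge silently returns fewer rows
inner_rows = len(left.merge(right, on="country"))

print("unmatched:", unmatched)
print("rows after inner merge:", inner_rows, "of", len(left))
```

Where both tables carry an ISO code, merging on that code instead of the name is usually more robust.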


In [122]:
#Country flags obtained from wikipedia
flags = pd.read_csv("../data/Country_Flags.csv")
flags.head()

Unnamed: 0,country,Images File Name,ImageURL
0,Afghanistan,Flag_of_Afghanistan.svg,https://upload.wikimedia.org/wikipedia/commons...
1,Albania,Flag_of_Albania.svg,https://upload.wikimedia.org/wikipedia/commons...
2,Algeria,Flag_of_Algeria.svg,https://upload.wikimedia.org/wikipedia/commons...
3,Andorra,Flag_of_Andorra.svg,https://upload.wikimedia.org/wikipedia/commons...
4,Angola,Flag_of_Angola.svg,https://upload.wikimedia.org/wikipedia/commons...


In [123]:
HDmap = HDmap.merge(flags, on=['country'])

In [124]:
HDmap.head()

Unnamed: 0,country,MDS_x,MDS_y,iso_alpha,continent,subcontinent,Images File Name,ImageURL
0,Afghanistan,-0.700987,7.262491,AFG,Asia,Southern Asia,Flag_of_Afghanistan.svg,https://upload.wikimedia.org/wikipedia/commons...
1,Albania,-2.424894,-1.924003,ALB,Europe,Southern Europe,Flag_of_Albania.svg,https://upload.wikimedia.org/wikipedia/commons...
2,Algeria,-0.830106,0.52902,DZA,Africa,Northern Africa,Flag_of_Algeria.svg,https://upload.wikimedia.org/wikipedia/commons...
3,Angola,1.305906,6.663821,AGO,Africa,Sub-Saharan Africa,Flag_of_Angola.svg,https://upload.wikimedia.org/wikipedia/commons...
4,Argentina,1.231746,-2.805575,ARG,Americas,Latin America and the Caribbean,Flag_of_Argentina.svg,https://upload.wikimedia.org/wikipedia/commons...


In [125]:
fig = px.scatter(HDmap, x="MDS_x", y="MDS_y", hover_name="country", 
                  color="continent",
                 width=1000, height=700, 
                 title="MDS projection of the country trajectories")



for x, y, img_url in zip(HDmap.MDS_x, HDmap.MDS_y, HDmap["ImageURL"]): 
  fig.add_layout_image(
          x=x,
          sizex=0.5,
          y=y,
          sizey=1,
          xref="x",
          yref="y",
          opacity=1.0,
          layer="below",
          source=img_url
  )

fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
  )

#legend
fig.update_layout(showlegend=False, 
                 margin=dict(l=20, r=20, t=40, b=20),
                 )

fig.update_xaxes(visible=False)  
fig.update_yaxes(visible=False)


fig.show()

Countries that are close in the MDS projection have similar life expectancy and GDP per capita values over time. 

In [126]:
tmp = df.query("country in ['Israel', 'Greece', 'Ethiopia', 'Burkina Faso', 'Ecuador', 'Brazil', 'Australia', 'United Kingdom']")

fig = px.line(tmp, x="gdpPercap", y="lifeExp", 
              color="country", 
              log_x=True, 
              #text="year",
              )

fig.update_traces(textposition="bottom right")
fig.update_layout(title_text="Trajectories of Similar Countries from the MDS map")
fig.show()

### Project the countries using TSNE

In this particular application TSNE is not as appropriate as MDS. We don't have that many variables, and the data are easy to visualize because the first two principal components represent about 90% of the total variability in the data. 
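
The share of variance captured by the first two principal components can be read from `explained_variance_ratio_`. The sketch below runs on synthetic data with the same shape as the scaled country table. It shows the calculation only and does not reproduce the 90% figure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for df2: 24 correlated columns driven by 2 hidden factors
factors = rng.normal(size=(141, 2))
X = factors @ rng.normal(size=(2, 24)) + 0.3 * rng.normal(size=(141, 24))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Share of total variance captured by the first two components
share = float(pca.explained_variance_ratio_.sum())
print(f"first two components explain {share:.1%} of the variance")
```

To check the notebook's data, replace `X_scaled` with `df2_scaled.values`.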

# How are clusters shown on the HDmap? 

In [127]:
from sklearn.cluster import KMeans
from sklearn.cluster import SpectralClustering

Here we will try spectral and k-means clustering. The optimal number of clusters is something to investigate. Here we will just try a few values to investigate how the clusters of countries are shown on the HD map. 
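
One common way to investigate the number of clusters is to compare silhouette scores over a range of values. This is a minimal sketch on synthetic blobs, not on the country data, and the silhouette criterion is only one of several possible choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (stand-in for df2_scaled)
X, _ = make_blobs(n_samples=200,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters
    scores[k] = float(silhouette_score(X, labels))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
print({k: round(s, 2) for k, s in scores.items()})
```

Real data rarely give as clear a peak as these blobs, so the scores are best read together with the HD map rather than on their own.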

In [128]:
clustering = SpectralClustering(n_clusters=6,
         assign_labels='discretize',
         random_state=0).fit(df2_scaled.values)
clustering.labels_



array([5, 1, 1, 5, 2, 3, 3, 3, 0, 3, 0, 0, 1, 0, 1, 2, 5, 5, 0, 0, 3, 5,
       5, 2, 1, 1, 0, 5, 0, 2, 0, 2, 2, 4, 3, 0, 1, 1, 0, 1, 5, 5, 5, 4,
       3, 0, 5, 3, 0, 4, 1, 5, 5, 0, 1, 4, 2, 3, 0, 0, 1, 0, 4, 4, 4, 2,
       4, 1, 0, 1, 1, 2, 0, 5, 0, 0, 5, 1, 5, 0, 1, 2, 0, 2, 1, 5, 0, 0,
       0, 3, 3, 1, 5, 5, 3, 1, 0, 2, 1, 1, 1, 2, 4, 4, 1, 2, 5, 0, 3, 0,
       2, 5, 4, 2, 4, 5, 0, 4, 1, 0, 0, 3, 3, 1, 4, 5, 1, 0, 2, 1, 1, 5,
       3, 3, 2, 2, 0, 1, 0, 5, 0])

The algorithm attaches a cluster label to each country. We need to merge this information into the HDmap.

In [129]:
HDmap = HDmap.merge(pd.DataFrame({"country":df2_scaled.index, "scluster6":clustering.labels_}), on="country")
HDmap['scluster6'] =  HDmap['scluster6'].astype('str')
HDmap.head()

Unnamed: 0,country,MDS_x,MDS_y,iso_alpha,continent,subcontinent,Images File Name,ImageURL,scluster6
0,Afghanistan,-0.700987,7.262491,AFG,Asia,Southern Asia,Flag_of_Afghanistan.svg,https://upload.wikimedia.org/wikipedia/commons...,5
1,Albania,-2.424894,-1.924003,ALB,Europe,Southern Europe,Flag_of_Albania.svg,https://upload.wikimedia.org/wikipedia/commons...,1
2,Algeria,-0.830106,0.52902,DZA,Africa,Northern Africa,Flag_of_Algeria.svg,https://upload.wikimedia.org/wikipedia/commons...,1
3,Angola,1.305906,6.663821,AGO,Africa,Sub-Saharan Africa,Flag_of_Angola.svg,https://upload.wikimedia.org/wikipedia/commons...,5
4,Argentina,1.231746,-2.805575,ARG,Americas,Latin America and the Caribbean,Flag_of_Argentina.svg,https://upload.wikimedia.org/wikipedia/commons...,2


In [130]:
#trying kmeans with six clusters is easy
kmeans = KMeans(n_clusters = 6, random_state = 0)
km2 = kmeans.fit(df2_scaled)
km2.labels_

array([2, 3, 0, 2, 3, 1, 1, 1, 4, 1, 4, 4, 3, 4, 0, 3, 2, 2, 4, 4, 1, 2,
       4, 3, 0, 0, 4, 2, 4, 3, 4, 3, 3, 5, 1, 4, 0, 0, 0, 0, 2, 2, 2, 5,
       1, 0, 2, 1, 4, 5, 0, 2, 2, 4, 0, 5, 3, 1, 4, 4, 0, 0, 5, 5, 5, 3,
       5, 0, 4, 0, 0, 3, 4, 2, 0, 4, 2, 0, 2, 4, 0, 3, 4, 3, 0, 2, 4, 4,
       4, 1, 1, 0, 2, 2, 1, 0, 4, 3, 3, 0, 0, 3, 3, 3, 3, 3, 2, 0, 1, 4,
       3, 2, 5, 3, 5, 2, 0, 5, 0, 4, 4, 1, 1, 0, 3, 4, 0, 4, 3, 0, 0, 4,
       1, 1, 3, 3, 0, 0, 4, 2, 4], dtype=int32)

In [131]:
HDmap = HDmap.merge(pd.DataFrame({"country":df2_scaled.index, "kmeans6": km2.labels_}), on="country")
HDmap['kmeans6'] =  HDmap['kmeans6'].astype('str')

Differences in the clustering methods can be seen by using a crosstab. The clusters are not identical but there are only small differences in the arrangement of clusters. 

In [132]:
pd.crosstab(HDmap.scluster6, HDmap.kmeans6)

kmeans6,0,1,2,3,4,5
scluster6,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,7,0,0,0,25,0
1,23,0,0,2,0,0
2,0,0,0,18,0,0
3,0,17,0,0,0,0
4,0,0,0,1,0,10
5,0,0,19,0,3,0
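
The agreement visible in the crosstab can also be summarized in a single number. The adjusted Rand index is one option (an addition here, not something the notebook computes); it ignores how the clusters are numbered and equals 1 for identical groupings. The labels below are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: same grouping, different cluster numbers
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]
# Same as labels_b except that one point moved to another group
labels_c = [2, 2, 0, 1, 1, 1]

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print("identical grouping:", ari_same)
print("one point moved:   ", ari_diff)
```

On the notebook's data the call would be `adjusted_rand_score(HDmap.scluster6, HDmap.kmeans6)`.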


In [133]:
HDmap.head()

Unnamed: 0,country,MDS_x,MDS_y,iso_alpha,continent,subcontinent,Images File Name,ImageURL,scluster6,kmeans6
0,Afghanistan,-0.700987,7.262491,AFG,Asia,Southern Asia,Flag_of_Afghanistan.svg,https://upload.wikimedia.org/wikipedia/commons...,5,2
1,Albania,-2.424894,-1.924003,ALB,Europe,Southern Europe,Flag_of_Albania.svg,https://upload.wikimedia.org/wikipedia/commons...,1,3
2,Algeria,-0.830106,0.52902,DZA,Africa,Northern Africa,Flag_of_Algeria.svg,https://upload.wikimedia.org/wikipedia/commons...,1,0
3,Angola,1.305906,6.663821,AGO,Africa,Sub-Saharan Africa,Flag_of_Angola.svg,https://upload.wikimedia.org/wikipedia/commons...,5,2
4,Argentina,1.231746,-2.805575,ARG,Americas,Latin America and the Caribbean,Flag_of_Argentina.svg,https://upload.wikimedia.org/wikipedia/commons...,2,3


In [134]:

#colors from colorbrewer 
#https://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=6
cluster_colors = ['#1b9e77','#d95f02','#7570b3','#e7298a','#66a61e','#e6ab02']

fig = go.Figure() 

for cluster_id, cluster in HDmap.groupby("scluster6"): 


    fig.add_trace(go.Scatter(x=cluster.MDS_x,
                             y=cluster.MDS_y,
                             mode='text',
                            textfont=dict(
                                family="sans serif",
                                size=12,
                                color=cluster_colors[int(cluster_id)]
                                ), 

                             text=cluster.iso_alpha))

fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
  )

fig.update_xaxes(visible=False)  
fig.update_yaxes(visible=False)
fig.update_layout(showlegend=False, 
                 margin=dict(l=20, r=20, t=40, b=20),
                 )

#plotly setting figure size 
# https://plotly.com/python/setting-graph-size/
fig.update_layout(height=600, width=1000, title_text="Countries on the MDS map color-coded by spectral cluster membership")

fig.show()

In [135]:
#colors from colorbrewer 
#https://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=6
cluster_colors = ['#1b9e77','#d95f02','#7570b3','#e7298a','#66a61e','#e6ab02']

fig = go.Figure() 

for cluster_id, cluster in HDmap.groupby("kmeans6"): 


    fig.add_trace(go.Scatter(x=cluster.MDS_x,
                             y=cluster.MDS_y,
                             mode='text',
                            textfont=dict(
                                family="sans serif",
                                size=12,
                                color=cluster_colors[int(cluster_id)]
                                ), 

                             text=cluster.iso_alpha))

fig.update_yaxes(
    scaleanchor = "x",
    scaleratio = 1,
  )

fig.update_xaxes(visible=False)  
fig.update_yaxes(visible=False)
fig.update_layout(showlegend=False, 
                 margin=dict(l=20, r=20, t=40, b=20),
                 )

#plotly setting figure size 
# https://plotly.com/python/setting-graph-size/
fig.update_layout(height=600, width=1000, title_text="Countries on the MDS map color-coded by Kmeans cluster membership")

fig.show()

Let's show some countries from each cluster

In [136]:
tmp = df.query("iso_alpha in ['ETH', 'SEN', 'THA', 'JAM',  'ESP', 'CAN']")

fig = px.line(tmp, x="gdpPercap", y="lifeExp", color="country",  log_x=True, width=800, height=400)

fig.update_traces(textposition="bottom right")

fig.update_layout(height=600, width=1000, title_text="Representative countries from each cluster")

fig.show()

### Calculate Cluster Centroids 

We need to add cluster information to the original data to calculate the average GDP per capita and life expectancy of each cluster. 

In [137]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num,subcontinent
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4,Southern Asia
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4,Southern Asia
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4,Southern Asia
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4,Southern Asia
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4,Southern Asia


In [138]:
HDmap.columns

Index(['country', 'MDS_x', 'MDS_y', 'iso_alpha', 'continent', 'subcontinent',
       'Images File Name', 'ImageURL', 'scluster6', 'kmeans6'],
      dtype='object')

In [139]:
df3 = df.merge(HDmap[['country', 'scluster6', 'kmeans6']], on="country")

In [140]:
df3.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num,subcontinent,scluster6,kmeans6
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4,Southern Asia,5,2
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4,Southern Asia,5,2
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4,Southern Asia,5,2
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4,Southern Asia,5,2
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4,Southern Asia,5,2


Calculate the average life expectancy and GDP per capita per cluster and year

In [141]:
cluster_centroids = df3[['year', 'scluster6', 'lifeExp',
                         'gdpPercap']].groupby(['scluster6', 'year']).mean().reset_index()

In [142]:
cluster_centroids.head()

Unnamed: 0,scluster6,year,lifeExp,gdpPercap
0,0,1952,40.502594,1422.501842
1,0,1957,42.894125,1611.693209
2,0,1962,45.228812,1935.797979
3,0,1967,47.632219,2515.24313
4,0,1972,49.779,2855.862159


In [143]:

fig = px.line(cluster_centroids, x="gdpPercap", y="lifeExp", color="scluster6",  log_x=True, 
             title="Cluster Centroid Trajectories. Each line represents the average GDP per capita and life expectancy by year",
             width=800, height=600)

fig.update_traces(textposition="bottom right")

fig.show()