<a href="https://www.kaggle.com/code/mikedelong/exploratory-data-visualization?scriptVersionId=136299974" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# the peace index file is semicolon-delimited and uses a comma as the decimal mark
filename = '../input/global-peace-index/peace_index.csv'
df = pd.read_csv(filepath_or_buffer=filename, delimiter=';', decimal=',')
group_filename = '../input/country-mapping-iso-continent-region/continents2.csv'
group_df = pd.read_csv(filepath_or_buffer=group_filename)
# attach each country's region by joining on its ISO alpha-3 code, then drop the duplicate key column
merged_columns = ['alpha-3', 'region']
merged_df = df.merge(right=group_df[merged_columns], how='inner', left_on='iso3c', right_on='alpha-3').drop(columns=['alpha-3'])
merged_df.sample(10)

Unnamed: 0,Country,iso3c,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,region
129,Singapore,SGP,1.201,1.201,1.201,1.011,1.023,1.034,1.046,1.057,1.046,1.034,1.023,1.011,1.0,1.0,1.0,1.0,Asia
76,Kenya,KEN,2.095,2.134,2.309,2.195,2.165,2.408,2.233,2.184,2.247,2.326,2.259,2.262,2.268,2.264,2.21,2.25,Africa
84,Libya,LBY,1.604,1.604,1.604,2.409,2.208,2.381,2.123,2.661,2.927,3.023,3.143,3.122,3.103,3.236,3.064,2.419,Africa
94,Mali,MLI,1.628,2.039,2.051,2.055,2.131,1.838,2.126,2.431,2.153,2.177,2.177,2.307,2.523,2.556,2.826,3.085,Africa
87,Lithuania,LTU,1.256,1.275,1.293,1.293,1.293,1.282,1.27,1.46,1.46,1.448,1.442,1.439,1.437,1.435,1.444,1.448,Europe
96,Montenegro,MNE,1.403,1.403,1.403,1.403,1.414,1.426,1.437,1.448,1.46,1.448,1.437,1.426,1.414,1.403,1.403,1.403,Europe
27,Cote d' Ivoire,CIV,1.897,1.674,1.647,1.821,2.016,2.257,1.843,1.646,1.669,1.692,1.881,1.702,1.725,1.729,1.729,1.76,Africa
18,Bolivia,BOL,1.403,1.604,1.604,1.604,1.604,1.604,1.403,1.403,1.403,1.421,1.421,1.421,1.421,1.421,1.403,1.417,Americas
39,Dominican Republic,DOM,1.213,1.213,1.213,1.302,1.302,1.403,1.403,1.403,1.403,1.403,1.403,1.403,1.403,1.403,1.403,1.604,Americas
1,Angola,AGO,1.655,1.827,1.615,1.816,1.615,1.615,1.609,1.408,1.403,1.403,1.61,1.615,1.413,1.621,1.608,1.639,Africa
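The inner merge above keeps only the countries whose `iso3c` code appears in the mapping file, so any unmatched country drops out silently. Here is a minimal sketch of one way to list such codes, using small stand-in frames instead of the Kaggle files (the `XKX` row and its score are made up for illustration) and the `indicator=True` option of `merge`:

```python
import pandas as pd

# Toy stand-ins for the two CSV files; the XKX row is hypothetical.
scores = pd.DataFrame({'iso3c': ['SGP', 'KEN', 'XKX'], '2023': [1.0, 2.25, 1.9]})
mapping = pd.DataFrame({'alpha-3': ['SGP', 'KEN'], 'region': ['Asia', 'Africa']})

# A left merge keeps every score row; the _merge column records where each row found a match.
checked = scores.merge(right=mapping, how='left', left_on='iso3c', right_on='alpha-3', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'iso3c'].tolist()
print(unmatched)
```

Running the same check on the real frames would show whether any countries were lost in the merge.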


Let's make a purely numeric country x year frame, with the gaps filled, for the projection

In [2]:
# we have a few NAs so we need to fill them before we try to run a clustering model
# forward-fill down the rows (axis=0); DataFrame.ffill replaces the deprecated fillna(method='pad')
tsne_df = merged_df.drop(columns=['Country', 'iso3c', 'region']).ffill(axis=0)
tsne_df.isna().sum().sum()

0
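The fill above runs along `axis=0`, which means a gap is filled from the row above (the previous country in the frame), not from the same country's previous year. A minimal sketch on a made-up two-country frame (the labels `A` and `B` and the scores are hypothetical) shows the difference between the two directions:

```python
import numpy as np
import pandas as pd

# Hypothetical scores: country B is missing its 2009 value.
toy = pd.DataFrame({'2008': [1.0, 2.0], '2009': [1.5, np.nan]}, index=['A', 'B'])

down_rows = toy.ffill(axis=0)     # B's 2009 gap is filled from country A's 2009 value
across_years = toy.ffill(axis=1)  # B's 2009 gap is filled from B's own 2008 value

print(down_rows.loc['B', '2009'], across_years.loc['B', '2009'])
```

Which direction suits the peace index better is a judgment call; the notebook uses `axis=0`.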

Let's use t-SNE to project the countries into two dimensions we can visualize, and then cluster the projected points using k-means

In [3]:
from math import sqrt
from plotly.express import scatter
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# project each country's yearly scores into two dimensions
tsne = TSNE(n_components=2, verbose=1, random_state=1)
tsne_results = tsne.fit_transform(X=tsne_df.values)
# heuristic: about sqrt(n) clusters, where n counts the rows we actually cluster
kmeans = KMeans(n_init=10, n_clusters=int(sqrt(len(tsne_df))), random_state=1)
kmeans.fit(X=tsne_results)
tsne_plot_df = pd.DataFrame(data={'Country': merged_df['Country'], 'region': merged_df['region'], 'k-means': kmeans.labels_,
                                  'tsne 0': tsne_results[:, 0], 'tsne 1': tsne_results[:, 1]})
scatter(data_frame=tsne_plot_df, x='tsne 0', y='tsne 1', hover_data='Country', color='k-means')

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 162 samples in 0.001s...
[t-SNE] Computed neighbors for 162 samples in 0.073s...
[t-SNE] Computed conditional probabilities for sample 162 / 162
[t-SNE] Mean sigma: 0.562058
[t-SNE] KL divergence after 250 iterations with early exaggeration: 47.819195
[t-SNE] KL divergence after 1000 iterations: 0.195188
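The cell above sets the number of k-means clusters to the integer part of the square root of the row count. As a quick worked example, assuming the row count equals the 162 samples reported in the t-SNE log, the heuristic gives:

```python
from math import sqrt

n_samples = 162  # sample count taken from the t-SNE log; assumed to match the row count used in the cell
n_clusters = int(sqrt(n_samples))  # int() truncates 12.72... down to 12
print(n_clusters)
```

This is only a starting point for the cluster count, not a tuned value.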


Let's color the same t-SNE projection by region (continent) to see how the clusters line up with geography

In [4]:
scatter(data_frame=tsne_plot_df, x='tsne 0', y='tsne 1', hover_data='Country', color='region')

Rather than look at the whole series, let's plot each country's mean and standard deviation over the years so we can get all the points on one plot

In [5]:
# we need the transpose to get statistics in the year direction
plot_df = merged_df.drop(columns=['iso3c', 'region']).set_index(keys=['Country'], drop=True).T
cluster_df = pd.concat([plot_df.mean(), plot_df.std()], axis=1).reset_index()
cluster_df.columns = ['Country', 'mean', 'stddev']
# the rows are still in merged_df order, so the k-means labels line up with the countries
cluster_df['cluster'] = kmeans.labels_
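To see what the transpose buys us, here is a minimal sketch of the same steps on a made-up two-country frame (the names `A` and `B` and the scores are hypothetical). After the transpose each country is a column, so `mean()` and `std()` summarize over the years:

```python
import pandas as pd

# Hypothetical wide frame: one row per country, one column per year.
wide = pd.DataFrame({'Country': ['A', 'B'], '2008': [1.0, 2.0], '2009': [3.0, 2.0]})

by_year = wide.set_index(keys=['Country'], drop=True).T  # rows are years, columns are countries
stats = pd.concat([by_year.mean(), by_year.std()], axis=1).reset_index()
stats.columns = ['Country', 'mean', 'stddev']
print(stats)
```

Note that `std()` uses the sample standard deviation (`ddof=1`) by default. Calling `mean(axis=1)` and `std(axis=1)` on the untransposed frame should give the same numbers.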

In [6]:
scatter(data_frame=cluster_df, x='mean', y='stddev', hover_data='Country', color='cluster')