<a href="https://www.kaggle.com/code/mikedelong/exploratory-data-visualization?scriptVersionId=136482920" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='../input/global-peace-index/peace_index.csv', delimiter=';', decimal=',')
group_df = pd.read_csv(filepath_or_buffer='../input/country-mapping-iso-continent-region/continents2.csv')
merged_columns = ['alpha-3', 'region']
merged_df = df.merge(right=group_df[merged_columns], how='inner', left_on='iso3c', right_on='alpha-3',).drop(columns=['alpha-3'])
merged_df.sample(5)

Unnamed: 0,Country,iso3c,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,region
148,Turkiye,TUR,2.309,2.467,2.46,2.109,2.13,2.375,2.275,2.123,2.319,2.767,3.256,3.088,3.021,3.033,3.016,3.088,Asia
12,Burkina Faso,BFA,1.604,1.604,1.604,1.403,1.403,1.403,1.403,1.407,1.43,1.458,1.669,1.927,2.016,2.459,2.91,3.005,Africa
44,Spain,ESP,1.264,1.271,1.278,1.263,1.057,1.057,1.057,1.057,1.057,1.046,1.236,1.404,1.28,1.296,1.324,1.352,Europe
129,Singapore,SGP,1.201,1.201,1.201,1.011,1.023,1.034,1.046,1.057,1.046,1.034,1.023,1.011,1.0,1.0,1.0,1.0,Asia
19,Brazil,BRA,1.027,1.022,1.016,1.216,1.005,1.214,1.266,1.016,1.029,1.048,1.042,1.764,1.769,1.729,1.802,1.9,Americas
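
An inner merge silently drops any country whose `iso3c` code has no match in the region table. The sketch below shows one way to list such rows, using a left merge with `indicator=True`. The two small frames are made-up stand-ins for the real CSVs, so the country that goes unmatched here is illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two CSVs; the values are made up for illustration.
scores = pd.DataFrame({'Country': ['Spain', 'Brazil', 'Kosovo'],
                       'iso3c': ['ESP', 'BRA', 'XKX'],
                       '2023': [1.6, 2.4, 1.9]})
regions = pd.DataFrame({'alpha-3': ['ESP', 'BRA'],
                        'region': ['Europe', 'Americas']})

# A left merge keeps every score row; indicator=True adds a '_merge' column
# that records whether a match was found in the region table.
checked = scores.merge(right=regions, how='left', left_on='iso3c',
                       right_on='alpha-3', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Running the same check on `df` and `group_df` before the inner merge would show whether any countries are lost.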


Let's make a country x year frame of scores (one row per country, one column per year) for modeling and plotting

In [2]:
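# we have a few NAs so we need to fill them before we try to run a clustering model; ffill(axis=0) copies the
# value from the row above (the previous country, same year) and replaces the deprecated fillna(method='pad')
tsne_df = merged_df.drop(columns=['Country', 'iso3c', 'region']).ffill(axis=0)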
tsne_df.isna().sum().sum()

0
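
Filling along `axis=0` borrows a missing score from the neighboring row, which is a different country. A possible alternative, sketched here on a made-up two-country frame, is to fill along `axis=1` so that each gap is filled from the same country's adjacent years. This is an assumption about what may be preferable, not what the notebook does, and it would change the downstream results.

```python
import numpy as np
import pandas as pd

# Toy frame: country B is missing its 2008 score (values are made up).
toy = pd.DataFrame({'2008': [1.0, np.nan], '2009': [1.2, 2.5]}, index=['A', 'B'])

# axis=0 fills down the rows: B's 2008 gap takes A's 2008 value.
down = toy.ffill(axis=0)

# axis=1 fills across the years of the same country; the bfill catches
# gaps in the first year, which have nothing to their left to copy from.
across = toy.ffill(axis=1).bfill(axis=1)

print(down.loc['B', '2008'], across.loc['B', '2008'])
```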

Let's use t-SNE to project the data into two dimensions we can visualize. Separately, we cluster the original (unprojected) scores with k-means, tying the number of clusters to the number of regions.

In [3]:
from plotly.express import scatter
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# project the filled country x year scores into two dimensions for plotting
tsne = TSNE(n_components=2, verbose=1, random_state=1)
tsne_results = tsne.fit_transform(X=tsne_df.values)
# cluster the same filled scores (not the projection), one cluster per region
kmeans = KMeans(n_init=10, n_clusters=merged_df['region'].nunique(), random_state=1)
kmeans.fit(X=tsne_df.values)
scatter(data_frame=pd.DataFrame(data={'Country': merged_df['Country'], 'region' : merged_df['region'], 'k-means': kmeans.labels_,
                                  'tsne 0': tsne_results[:, 0], 'tsne 1': tsne_results[:, 1], }), x='tsne 0', y='tsne 1', 
        hover_data='Country', color='k-means')

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 162 samples in 0.000s...
[t-SNE] Computed neighbors for 162 samples in 0.065s...
[t-SNE] Computed conditional probabilities for sample 162 / 162
[t-SNE] Mean sigma: 0.562058
[t-SNE] KL divergence after 250 iterations with early exaggeration: 47.819199
[t-SNE] KL divergence after 1000 iterations: 0.195188


For the most part, the t-SNE projection and the k-means clusters find the same signal in the data: the more peaceful countries sit mostly in the lower left of the projection and tend to share a cluster label, while the less peaceful countries sit in the upper right and likewise tend to share a label. Let's look at the same t-SNE projection colored by region (continent) instead.

In [4]:
scatter(data_frame=pd.DataFrame(data={'Country': merged_df['Country'], 'region' : merged_df['region'], 'k-means': kmeans.labels_,
                                  'tsne 0': tsne_results[:, 0], 'tsne 1': tsne_results[:, 1], }), x='tsne 0', y='tsne 1', 
        hover_data='Country', color='region')

If we squint, we can see that the European countries sit more toward the lower left; there's not much else we can say from this plot. This really looks like a job for a violin plot.
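
One way to go beyond squinting is to tabulate cluster labels against regions. The sketch below uses `pd.crosstab` on a small frame of hypothetical labels; none of the values come from the notebook's output. With the real data, the same call could be applied to `merged_df['region']` and `kmeans.labels_`.

```python
import pandas as pd

# Hypothetical region and cluster labels for six countries.
toy = pd.DataFrame({'region': ['Europe', 'Europe', 'Europe', 'Africa', 'Africa', 'Asia'],
                    'cluster': [0, 0, 1, 1, 2, 2]})

# Rows are regions, columns are cluster labels, cells are country counts.
table = pd.crosstab(index=toy['region'], columns=toy['cluster'])

# normalize='index' turns each row into shares that sum to 1,
# which makes regions of different sizes comparable.
shares = pd.crosstab(index=toy['region'], columns=toy['cluster'], normalize='index')
print(table)
print(shares)
```

A region whose row is concentrated in one column lines up with a single cluster; a row spread evenly across columns does not.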

In [5]:
# we need the transpose to get statistics in the year direction
plot_df = merged_df.drop(columns=['iso3c', 'region']).set_index(keys=['Country'], drop=True).T
violin_df = pd.concat([plot_df.mean(), plot_df.std()], axis=1).reset_index()
violin_df.columns = ['Country', 'mean', 'stddev']
# kmeans.labels_ follows merged_df row order, which is also the column order of plot_df
violin_df['cluster'] = kmeans.labels_
violin_df = violin_df.merge(right=merged_df[['Country', 'region']], on='Country', how='inner')
violin_df.sample(5)

Unnamed: 0,Country,mean,stddev,cluster,region
89,Morocco,1.889063,0.102834,0,Africa
77,Kyrgyz Republic,2.025562,0.102193,0,Asia
141,Togo,1.594812,0.247775,1,Africa
151,Uganda,2.042625,0.16584,0,Africa
148,Turkiye,2.61475,0.409513,4,Asia
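
The per-country statistics can also be computed without the transpose by aggregating along `axis=1`. This is a minimal sketch on made-up scores, assuming the same layout as `merged_df` with one row per country and one column per year.

```python
import pandas as pd

# Toy country x year frame (values are made up).
toy = pd.DataFrame({'Country': ['A', 'B'], '2008': [1.0, 2.0], '2009': [3.0, 2.0]})
years = toy.set_index('Country')

# axis=1 aggregates across the year columns, giving one value per country.
stats = pd.DataFrame({'mean': years.mean(axis=1),
                      'stddev': years.std(axis=1)}).reset_index()

# The transpose route gives the same per-country means.
same_means = years.T.mean().tolist() == stats['mean'].tolist()
print(stats)
```

Note that `std` uses the sample standard deviation (`ddof=1`) by default in both routes.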


In [6]:
from plotly.express import violin
violin(data_frame=violin_df, x='mean', color='region')

This captures some of the flavor we're looking for: the regions differ in both the center and the spread of their mean scores. Europe is generally more peaceful, despite being home to Belarus, Russia, and Ukraine.
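
The regional differences visible in the violin plot can be put into numbers with a grouped summary. The sketch below assumes a frame shaped like `violin_df` with `region` and `mean` columns; the values are made up and only illustrate how a single outlier stretches a region's range while leaving its median low.

```python
import pandas as pd

# Hypothetical per-country mean scores (not from the notebook's data).
toy = pd.DataFrame({'region': ['Europe', 'Europe', 'Europe', 'Africa', 'Africa'],
                    'mean': [1.2, 1.4, 3.0, 2.0, 2.4]})

# One row per region: the median shows the center, min and max show the range.
summary = toy.groupby('region')['mean'].agg(['median', 'min', 'max'])
print(summary)
```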

In [7]:
scatter(data_frame=violin_df, x='mean', y='stddev', hover_data='Country', color='region')