We're going to make a scatter plot using dimension reduction, so let's install UMAP.

In [1]:
!pip install --quiet umap-learn
print('installed UMAP')

installed UMAP


Let's load up our data. 

In [2]:
import pandas as pd

DATA = '/kaggle/input/cleaned-life-expectancy-dataset/Cleaned-Life-Exp.csv'

df = pd.read_csv(filepath_or_buffer=DATA)
df.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,1.621762,-0.459399,-0.443691,0.790238,0.268824,-1.133571,-0.33557,-0.635971,-0.110384,...,-3.268019,0.889486,-0.730578,-0.323445,-0.483546,0.343993,2.796805,2.757185,-0.704483,-0.563614
1,Afghanistan,1.404986,-0.459399,-0.979279,0.854614,0.285786,-1.133571,-0.334441,-0.755661,-0.168124,...,-1.048077,0.897493,-0.857092,-0.323445,-0.481553,-0.203706,2.864687,2.80155,-0.71871,-0.593391
2,Afghanistan,1.18821,-0.459399,-0.979279,0.830473,0.302749,-1.133571,-0.334594,-0.675868,-0.173531,...,-0.877312,0.877476,-0.772749,-0.323445,-0.480218,0.311126,2.909942,2.845914,-0.747164,-0.623168
3,Afghanistan,0.971434,-0.459399,-1.021286,0.86266,0.328193,-1.133571,-0.332096,-0.556178,0.032045,...,-0.663856,1.033609,-0.646235,-0.323445,-0.477539,-0.148469,2.955197,2.912461,-0.78036,-0.652944
4,Afghanistan,0.754658,-0.459399,-1.052791,0.886801,0.345155,-1.133571,-0.367862,-0.516281,0.051757,...,-0.621165,0.773387,-0.604064,-0.323445,-0.520044,-0.160246,3.023079,2.956826,-0.823042,-0.742275


Our data has already been normalized, although we don't know which method was used. Let's build a scatter plot and check whether it shows what we expect: that each country's indicators stay fairly stable from year to year.
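Since the normalization method is unknown, one quick way to make an educated guess is to look at per-column summary statistics: means near 0 with standard deviations near 1 suggest z-scoring, while a range of exactly 0 to 1 suggests min-max scaling. The sketch below is only an illustration on synthetic data; the helper name `scaling_summary` is made up for this example and is not part of the original notebook.

```python
import numpy as np
import pandas as pd


def scaling_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, std, min and max for every numeric column (one row per column)."""
    numeric = frame.select_dtypes(include="number")
    return numeric.agg(["mean", "std", "min", "max"]).T


# Synthetic stand-in for the real data: one z-scored column, one min-max column.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=500)
demo = pd.DataFrame({
    "z_scored": (raw - raw.mean()) / raw.std(),
    "min_max": (raw - raw.min()) / (raw.max() - raw.min()),
})

summary = scaling_summary(demo)
print(summary.round(3))
```

On the notebook's data the same helper could be called as `scaling_summary(df)`. The negative values and values above 1 visible in `df.head()` are at least consistent with z-scoring, but the summary would be needed to say more.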

In [3]:
import arrow
from umap import UMAP

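time_start = arrow.now()
# Fix the seed so the embedding is reproducible from run to run.
reducer = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
# Keep the country label for coloring, then attach the two embedding coordinates.
plot_df = df[['Country']].copy()
plot_df[['x', 'y']] = reducer.fit_transform(X=df.drop(columns=['Country']))
print('done with UMAP in {}'.format(arrow.now() - time_start))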

done with UMAP in 0:00:20.882543
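One thing worth checking in the cell above: dropping only `Country` leaves the `Unnamed: 0` column among the features. That column appears to be a leftover row index (pandas names a header-less index column this way when a CSV written with its index is read back). It is not normalized and grows with the row number, so it could influence the distances UMAP computes far more than the normalized indicators do. A minimal sketch of one way to exclude it follows; the helper `feature_columns` is a hypothetical name, and the demo frame is made up.

```python
import pandas as pd


def feature_columns(frame: pd.DataFrame, label: str = "Country") -> list:
    """Columns to embed: everything except the label and index-like leftovers."""
    return [
        name for name in frame.columns
        if name != label and not str(name).startswith("Unnamed:")
    ]


# Tiny stand-in frame with the same kind of leftover index column.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Country": ["A", "A", "B"],
    "Year": [1.6, 1.4, 1.6],
    "GDP": [-0.48, -0.48, 0.9],
})

features = demo[feature_columns(demo)]
print(list(features.columns))  # ['Year', 'GDP']
```

In the notebook this would mean passing `df[feature_columns(df)]` to `fit_transform` instead of `df.drop(columns=['Country'])`. Doing so would change the embedding, so the plot and the observations drawn from it may differ from the ones shown here.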


In [4]:
from plotly import express

express.scatter(data_frame=plot_df, x='x', y='y', color='Country', height=800)

What do we see?
1. We have too many countries and too few rows per country for the whole dataset to tell a single story.
1. Double-clicking a country name in the legend isolates that country's points. Doing this shows that most countries look similar from year to year, though rarely very similar. India is essentially the same from year to year, but many countries move over time.
1. Many European countries are very similar to each other and different from the rest.
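The second observation could also be checked numerically rather than by eye. The sketch below computes, for each country, the mean distance between consecutive years in the 2-D embedding. It assumes a frame with `Country`, `Year`, `x` and `y` columns; the notebook's `plot_df` has no `Year` column, so that column would have to be copied over first. The helper name `yearly_drift` and the demo data are made up, and distances in a UMAP embedding are only a rough guide, so treat the result as a ranking aid rather than a measurement.

```python
import numpy as np
import pandas as pd


def yearly_drift(points: pd.DataFrame) -> pd.Series:
    """Mean distance between consecutive years in the embedding, per country."""
    ordered = points.sort_values(["Country", "Year"])
    # Difference between each row and the previous year of the same country.
    steps = ordered.groupby("Country")[["x", "y"]].diff()
    lengths = np.sqrt(steps["x"] ** 2 + steps["y"] ** 2)
    # The first year of each country has no predecessor (NaN) and is skipped by mean().
    return lengths.groupby(ordered["Country"]).mean().sort_values()


# Made-up embedding: one country that stays put and one that moves every year.
demo = pd.DataFrame({
    "Country": ["Stable"] * 3 + ["Moving"] * 3,
    "Year": [0, 1, 2, 0, 1, 2],
    "x": [1.0, 1.0, 1.0, 0.0, 3.0, 6.0],
    "y": [1.0, 1.0, 1.0, 0.0, 4.0, 8.0],
})

drift = yearly_drift(demo)
print(drift)
```

Countries at the top of the sorted result barely move in the embedding, which is what the plot suggests for India; countries at the bottom are the ones that shift the most over time.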