[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nunezmatias/SGD_machineLearning/blob/main/SusteinableVisions.ipynb)


# Machine Learning methodology for

[Sustainable Visions: Unsupervised Machine Learning Insights on Global Development Goals](https://arxiv.org/abs/2409.12427)




In this analysis, we used unsupervised machine learning to examine progress toward the Sustainable Development Goals (SDGs) across countries. Principal Component Analysis (PCA) reduces the dimensionality of the SDG goal scores so that the main patterns can be visualized in two dimensions.

We began by preprocessing the data: rows whose `Country` field contains "income" (which appear to be income-group aggregates rather than individual countries) were dropped, the table was restricted to a selected list of countries, and rows with missing goal scores were removed. Each goal column was then standardized to zero mean and unit standard deviation so that all goals enter the PCA on the same scale.

We then applied PCA with five principal components, which together capture about 82% of the variance; the first component alone accounts for about 57%. The explained variance ratio of each component is printed to show how the variability is distributed. Finally, we plotted the first two components, with one trace per country. Dotted lines connect each country's consecutive yearly points, so a country's path through the plot shows how its SDG profile shifts over time and makes similarities and differences between countries' trajectories visible.
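The standardize-then-project step can be illustrated without the SDG file. The following is a minimal sketch on synthetic data, assuming only NumPy; the array `scores` is a hypothetical stand-in for the goal columns, and PCA is computed here through an SVD rather than through `sklearn.decomposition.PCA`, which the notebook itself uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the goal-score table: 200 rows, 6 correlated columns.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
scores = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

# Standardize each column to zero mean and unit sample standard deviation
# (ddof=1 matches the default of pandas' .std()).
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

# PCA via SVD of the centered matrix: the rows of vt are the principal axes.
u, s, vt = np.linalg.svd(z, full_matrices=False)
explained_variance_ratio = s**2 / np.sum(s**2)

# Project onto the first two axes to obtain plottable coordinates.
coords = z @ vt[:2].T

print(explained_variance_ratio.round(3))
print(coords.shape)
```

The ratios are sorted in decreasing order and sum to one over all components, which is why keeping only the leading components gives a cumulative value below 100%.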

In [15]:
# @title Load SDG Data
# Download the data file from GitHub into the Colab session
!wget -q -O SDR2023-data.xlsx https://raw.githubusercontent.com/nunezmatias/SGD_machineLearning/main/SDR2023-data.xlsx


In [16]:
# @title PCA
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

df = pd.read_excel('SDR2023-data.xlsx', sheet_name='Backdated SDG Index')
goal_columns = [col for col in df.columns if 'Goal' in col]

df = df[~df['Country'].str.contains('income', case=False)]

country_list = [
    'Cyprus', 'United Arab Emirates', 'Bangladesh', 'Benin', 'Malta',
    'Algeria', 'Cabo Verde', 'Egypt (Arab Rep.)', 'Gabon', 'Morocco',
    'Sao Tome and Principe', 'Senegal', 'Tunisia', 'El Salvador', 'Panama',
    'Uruguay', 'Georgia', 'Iran (Islamic Rep.)', 'Jordan', 'Lebanon',
    'Malaysia', 'Maldives', 'Russian Federation', 'Thailand', 'Türkiye',
    'Albania', 'Bosnia and Herzegovina', 'Bulgaria', 'Montenegro', 'Fiji',
    'Angola', 'Cameroon', 'Congo (Dem. Rep.)', 'Congo (Rep.)',
    "Cote d'Ivoire", 'Gambia (The)', 'Guinea', 'Kenya', 'Liberia',
    'Madagascar', 'Mozambique', 'Nigeria', 'Sierra Leone', 'Tanzania',
    'Togo', 'Haiti', 'Papua New Guinea', 'Argentina', 'Brazil', 'Chile',
    'Colombia', 'Costa Rica', 'Dominican Republic', 'Ecuador', 'Jamaica',
    'Mexico', 'Peru', 'Venezuela (RB)', 'Philippines', 'Canada', 'Belgium',
    'Croatia', 'Denmark', 'Estonia', 'Finland', 'Germany', 'Greece',
    'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Netherlands',
    'Norway', 'Poland', 'Romania', 'Slovenia', 'Sweden', 'United Kingdom',
    'Australia', 'Comoros', 'Djibouti', 'Ghana', 'Mauritania', 'Mauritius',
    'Namibia', 'Somalia', 'South Africa', 'Sudan', 'Guatemala', 'Honduras',
    'Nicaragua', 'China', 'India', 'Indonesia', 'Myanmar', 'Pakistan',
    'Sri Lanka', 'Vietnam', 'Yemen (Rep.)', 'United States', 'Israel',
    'Japan', 'Korea (Rep.)', 'France', 'Portugal', 'Spain'
]

df = df[df['Country'].isin(country_list)]
data = df[['Country', 'year'] + goal_columns].dropna()

normalized_data = (data[goal_columns] - data[goal_columns].mean()) / data[goal_columns].std()

CP = 5
pca = PCA(n_components=CP)
pca_result = pca.fit_transform(normalized_data)

print("Variance explained by each principal component:")
print(pca.explained_variance_ratio_)

print("Cumulative explained variance:")
print(sum(pca.explained_variance_ratio_))

pca_result_df = pd.DataFrame(pca_result, columns=[f'PC{i+1}' for i in range(CP)])
pca_result_df['Country'] = data['Country'].values
pca_result_df['year'] = data['year'].values

fig = px.scatter(
    pca_result_df,
    x='PC1',
    y='PC2',
    color='Country',
    hover_data={'Country': True, 'year': True, 'PC1': False, 'PC2': False},
    title='PCA Plot of Selected Countries Over Time',
    labels={
        'PC1': f"PC1 ({pca.explained_variance_ratio_[0]*100:.2f}%)",
        'PC2': f"PC2 ({pca.explained_variance_ratio_[1]*100:.2f}%)"
    }
)

fig.update_traces(mode='markers+lines', line=dict(dash='dot'))
fig.show()

Variance explained by each principal component:
[0.5738875  0.08633333 0.07386728 0.04843817 0.04079792]
Cumulative explained variance:
0.8233242016016824
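To read the printed ratios as running totals, a small sketch using only the standard library may help. The numbers are copied from the output above; nothing else is assumed.

```python
from itertools import accumulate

# Explained variance ratios copied from the printed output above.
ratios = [0.5738875, 0.08633333, 0.07386728, 0.04843817, 0.04079792]

# Running total: how much variance the first k components capture together.
cumulative = list(accumulate(ratios))

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} (cumulative {c:.1%})")
```

The second running total shows that the two components used in the scatter plot account for about 66% of the variance, so roughly a third of the variability is not visible in that plot.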


Following the PCA dimensionality reduction, we further analyzed the data using t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE visualizes high-dimensional data by giving each data point a location in a low-dimensional map, here two dimensions. It can reveal clusters and relationships between countries, based on their SDG progress, that are not apparent in the PCA plot.

The `interactive_tsne_plot` function takes the PCA results (the five component scores) and the filtered data as input, along with the perplexity, which controls the effective size of the local neighborhood considered around each point, and the number of output dimensions. It runs t-SNE and generates an interactive scatter plot in which each point is one country in one year, colored by country; hovering over a point shows the country and year. The function's default perplexity is 30, but for this visualization it is called with a perplexity of 50 to balance the preservation of local and global structure.
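The link between perplexity and neighborhood size can be made concrete. In t-SNE, each point gets a Gaussian distribution over its neighbors, and the perplexity is 2 raised to the entropy of that distribution, which behaves like an effective neighbor count. The sketch below computes it for a single point with the standard library only; the `perplexity` helper and the distances are hypothetical illustrations, not part of the notebook or of scikit-learn.

```python
import math

def perplexity(distances, sigma):
    """Perplexity 2**H of the Gaussian neighbor distribution around one point."""
    weights = [math.exp(-d * d / (2.0 * sigma * sigma)) for d in distances]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

# Hypothetical distances from one point to its ten neighbors.
distances = [0.5, 0.6, 0.7, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0]

# A wider kernel spreads probability over more neighbors.
for sigma in (0.3, 1.0, 3.0, 10.0):
    print(f"sigma={sigma:>4}: perplexity={perplexity(distances, sigma):.2f}")
```

t-SNE works in the opposite direction: the user fixes the perplexity, and the algorithm searches for the kernel width of each point that produces it. A larger perplexity therefore makes every point attend to a wider neighborhood.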

In [3]:
# @title t-SNE
import pandas as pd
import plotly.express as px
from sklearn.manifold import TSNE

def interactive_tsne_plot(pca_result, data, perplexity=30, n_components=2):
    tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=28)
    tsne_result = tsne.fit_transform(pca_result)

    tsne_columns = [f'Dimension {i+1}' for i in range(n_components)]
    tsne_df = pd.DataFrame(tsne_result, columns=tsne_columns)
    tsne_df['Country'] = data['Country'].values
    tsne_df['year'] = data['year'].values

    fig = px.scatter(tsne_df, x='Dimension 1', y='Dimension 2', color='Country',
                     hover_data=['Country', 'year'], title=f't-SNE Plot (Perplexity {perplexity}, Components {n_components})')

    fig.update_traces(marker=dict(size=5), selector=dict(mode='markers'))
    fig.update_layout(legend_title_text='Country', legend=dict(orientation="v", x=1.05, y=1))

    fig.show()

interactive_tsne_plot(pca_result, data, perplexity=50, n_components=2)

We then combined t-SNE on the PCA results with DBSCAN clustering. DBSCAN is run on the two-dimensional t-SNE coordinates and assigns each country-year point a cluster label, with -1 marking points it treats as noise. The results, including the cluster assignments, are saved to a CSV file. Because t-SNE preserves local neighborhoods, the groups in the map are distinct enough that they could also be selected manually.
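How DBSCAN labels dense groups and isolated points can be seen on a toy embedding. This is a minimal sketch, assuming scikit-learn and NumPy are available; the nine points and the `eps` and `min_samples` values are hypothetical and are chosen only to make the outcome easy to check by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D embedding: two tight groups and one isolated point.
points = np.array([
    [0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],
    [10.0, 10.0], [10.0, 10.5], [10.5, 10.0], [10.5, 10.5],
    [5.0, 5.0],
])

# eps is the neighborhood radius; min_samples counts the point itself.
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(points)

n_clusters = len(set(labels.tolist()) - {-1})  # -1 is the noise label
n_noise = int((labels == -1).sum())
print(labels.tolist(), n_clusters, n_noise)
```

Since `eps` is measured in the units of the embedding, a value tuned for one t-SNE run will not necessarily carry over to a run with a different seed or perplexity.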

Next, we generated parallel coordinates plots to highlight the differences in SDG progress among clusters. Each plot shows one cluster, with every goal score min-max scaled to the range 0 to 1, so that distinctive patterns across the goals can be compared; noise points are left out. The same color mapping is used for the clusters in the t-SNE scatter plot and in the parallel coordinates plots, so a cluster can be identified in both. Together, the two views show how countries group by their sustainable development trajectories.
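A numeric companion to the parallel coordinates view is the mean scaled score per goal within each cluster. The sketch below assumes pandas and uses a tiny hypothetical table; the column names `Goal 1 Score` and `Goal 2 Score` and all values are invented for illustration, and the scaling is written out by hand instead of using `MinMaxScaler`.

```python
import pandas as pd

# Hypothetical rows standing in for the merged cluster/goal table.
df = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, -1],
    "Goal 1 Score": [90.0, 80.0, 40.0, 30.0, 60.0],
    "Goal 2 Score": [70.0, 75.0, 50.0, 45.0, 20.0],
})
goal_cols = ["Goal 1 Score", "Goal 2 Score"]

# Min-max scale each goal column to [0, 1] over all rows.
col_min = df[goal_cols].min()
col_max = df[goal_cols].max()
scaled = df.copy()
scaled[goal_cols] = (df[goal_cols] - col_min) / (col_max - col_min)

# Drop DBSCAN noise (-1), then average each goal within each cluster.
profiles = scaled[scaled["Cluster"] != -1].groupby("Cluster")[goal_cols].mean()
print(profiles.round(2))
```

Each row of `profiles` summarizes one cluster as a single line across the goals, which can make it easier to name what distinguishes one cluster from another.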



In [7]:
# @title Clustering
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go  # required for go.Figure and go.Parcoords below
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from pandas.plotting import parallel_coordinates
from sklearn.preprocessing import MinMaxScaler
import math

def get_cluster_color_map(clusters):
    cmap = plt.get_cmap('tab10')
    colors = [mcolors.rgb2hex(cmap(i % cmap.N)) for i in range(len(clusters))]
    color_map = {str(cluster): colors[i] for i, cluster in enumerate(sorted(clusters))}
    return color_map

def interactive_tsne_plot_with_dbscan(pca_result, data, perplexity=30, n_components=2, eps=0.5, min_samples=5, init='random', output_filename='cluster_assignments.csv'):
    tsne = TSNE(n_components=n_components, perplexity=perplexity, init=init, random_state=42)
    tsne_result = tsne.fit_transform(pca_result)

    tsne_columns = [f'Dimension {i+1}' for i in range(n_components)]
    tsne_df = pd.DataFrame(tsne_result, columns=tsne_columns)
    tsne_df['Country'] = data['Country'].values
    tsne_df['year'] = data['year'].values

    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    labels = dbscan.fit_predict(tsne_df[['Dimension 1', 'Dimension 2']])
    tsne_df['Cluster'] = labels

    tsne_df.to_csv(output_filename, index=False)
    print(f"Cluster assignments saved to {output_filename}")

    color_map = get_cluster_color_map(tsne_df['Cluster'].unique())

    fig = px.scatter(
        tsne_df, x='Dimension 1', y='Dimension 2', color=tsne_df['Cluster'].astype(str),
        color_discrete_map=color_map,
        hover_data=['Country', 'year'],
        title=f't-SNE Plot with DBSCAN Clusters (Perplexity {perplexity}, Components {n_components})'
    )

    fig.update_traces(marker=dict(size=5))
    fig.update_layout(
        legend_title_text='Cluster',
        legend=dict(orientation="v", x=1.05, y=1)
    )

    fig.show()
    return tsne_df, color_map


def interactive_parallel_coordinates_plots_for_clusters(cluster_assignments_file, data, goal_columns, color_map):
    tsne_df = pd.read_csv(cluster_assignments_file)
    merged_df = pd.merge(tsne_df[['Country', 'year', 'Cluster']], data, on=['Country', 'year'])
    scaler = MinMaxScaler()
    merged_df[goal_columns] = scaler.fit_transform(merged_df[goal_columns])
    clusters = merged_df['Cluster'].unique()
    clusters = [c for c in clusters if c != -1]
    num_clusters = len(clusters)
    n_cols = math.ceil(math.sqrt(num_clusters))
    n_rows = math.ceil(num_clusters / n_cols)
    fig = go.Figure()
    for idx, cluster in enumerate(clusters):
        cluster_data = merged_df.loc[merged_df['Cluster'] == cluster].copy()
        dimensions = []
        for i, col in enumerate(goal_columns):
            dimensions.append(
                dict(
                    range=[0, 1],
                    label=f'G{i+1}',
                    values=cluster_data[col].values,
                )
            )
        fig.add_trace(
            go.Parcoords(
                line=dict(
                    color=color_map[str(cluster)],
                    showscale=False
                ),
                dimensions=dimensions,
                labelangle=30,
                labelside='bottom',
                meta=dict(cluster=cluster),
                # Show only the first cluster initially; the buttons switch between clusters.
                visible=(idx == 0)
            )
        )
    buttons = []
    for i, cluster in enumerate(clusters):
        buttons.append(
            dict(
                label=f'Cluster {cluster}',
                method='update',
                args=[{'visible': [False]*i + [True] + [False]*(num_clusters-i-1)}]
            )
        )
    fig.update_layout(
        height=800,
        width=1000,
        showlegend=False,
        title='Parallel Coordinates Plots for Clusters',
        plot_bgcolor='white',
        paper_bgcolor='white',
        updatemenus=[
            dict(
                type='buttons',
                buttons=buttons,
                direction='left',
                pad={'r': 10, 't': 10},
                showactive=True,
                x=0.5,
                xanchor='center',
                y=1.2,
                yanchor='top'
            )
        ]
    )
    fig.show()

tsne_df, color_map = interactive_tsne_plot_with_dbscan(
    pca_result, data, perplexity=50, n_components=2, eps=6, min_samples=4, init='random'
)

interactive_parallel_coordinates_plots_for_clusters("cluster_assignments.csv", data, goal_columns, color_map)


Cluster assignments saved to cluster_assignments.csv


Minor variations can occur depending on the random seed used for t-SNE and on the parameters of the clustering algorithm. The central result, however, is the map produced by t-SNE: because t-SNE preserves neighborhoods, groups can be selected from it even manually. t-SNE loses much of the global structure but captures local relationships well. PCA, in contrast, captures the global linear structure but may miss fine details and non-linearities. For more details and analysis, see [Sustainable Visions: Unsupervised Machine Learning Insights on Global Development Goals](https://arxiv.org/abs/2409.12427).
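One way to quantify how much the clusters change between runs is the adjusted Rand index, which compares two labelings while ignoring how the clusters are numbered. This check is not part of the original analysis; the sketch assumes scikit-learn, and the three label lists are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels from three runs over the same eight points.
run_a = [0, 0, 0, 1, 1, 1, 2, 2]
run_b = [1, 1, 1, 0, 0, 0, 2, 2]  # same grouping, cluster ids permuted
run_c = [0, 0, 1, 1, 1, 1, 2, 2]  # one point moved to another cluster

print(adjusted_rand_score(run_a, run_b))  # identical groupings score 1.0
print(adjusted_rand_score(run_a, run_c))  # partial agreement scores below 1.0
```

When applied to DBSCAN output, the noise label -1 would be treated as one more group, so it may be preferable to drop noise points from both labelings before comparing.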

Methodology and code

by

Dr. Matias Nuñez

matias.nunez2@gmail.com

[LinkedIn](https://www.linkedin.com/in/matias-nu%C3%B1ez-6b53544/)

[ORCID Number](https://orcid.org/0009-0005-0790-5405)

