# Visual Assessment of Cluster Tendency
 Garrett McCue

The goal of this assignment is to apply the VAT and iVAT algorithms to a dataset in order to assess its clustering tendency. The dataset I will be using is the [2022 World Happiness Report Dataset from Kaggle](https://www.kaggle.com/datasets/hemil26/world-happiness-report-2022). This dataset contains the scores and rankings of 146 countries, based on individuals' own assessments of their lives and their overall happiness. For an in-depth explanation of the data and its acquisition, please see [here](https://happiness-report.s3.amazonaws.com/2022/Appendix_1_StatiscalAppendix_Ch2.pdf).

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

## Load Data


In [2]:
hap_df = pd.read_csv("data/world_happiness_rankings_2022.csv")
hap_df.head()

Unnamed: 0,RANK,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,1,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534
1,2,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,0.188,0.532
2,3,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,0.27,0.191
3,4,Switzerland,7.512,7.586,7.437,2.153,2.026,1.226,0.822,0.677,0.147,0.461
4,5,Netherlands,7.415,7.471,7.359,2.137,1.945,1.206,0.787,0.651,0.271,0.419


In [3]:
ranking_df = hap_df[['RANK', 'Country']]
metrics_df = hap_df.drop(['RANK', 'Country'], axis=1)



## Dimensionality Reduction

In [4]:
print("The data feature set has a shape of: {}".format(metrics_df.shape))

The data feature set has a shape of: (146, 10)


In [5]:
# Standardize Dataset
metrics_df = StandardScaler().fit_transform(metrics_df)

# apply 2D PCA 
pca_2 = PCA(n_components=2)
pca_2_data = pca_2.fit_transform(metrics_df)
pca_2_df = pd.DataFrame(data=pca_2_data, columns=['PC1', 'PC2'])
pca_2_ranking_df = pd.concat([ranking_df, pca_2_df], axis=1)

pca_2_ranking_df.head()

Unnamed: 0,RANK,Country,PC1,PC2
0,1,Finland,-4.966929,-0.421283
1,2,Denmark,-4.6971,-0.285749
2,3,Iceland,-4.234076,-1.113467
3,4,Switzerland,-4.463696,0.126756
4,5,Netherlands,-4.092789,-0.592264


In [6]:
# apply 3D PCA 
pca_3 = PCA(n_components=3)
pca_3_data = pca_3.fit_transform(metrics_df)
pca_3_df = pd.DataFrame(data=pca_3_data, columns=['PC1', 'PC2', 'PC3'])
pca_3_ranking_df = pd.concat([ranking_df, pca_3_df], axis=1)
pca_3_ranking_df.head()

Unnamed: 0,RANK,Country,PC1,PC2,PC3
0,1,Finland,-4.966929,-0.421283,0.555072
1,2,Denmark,-4.6971,-0.285749,1.383118
2,3,Iceland,-4.234076,-1.113467,0.64867
3,4,Switzerland,-4.463696,0.126756,0.707251
4,5,Netherlands,-4.092789,-0.592264,1.576686
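
When projecting down to two or three components, it can help to check how much of the total variance those components retain. The following is a minimal sketch of that check using only NumPy on synthetic data (not the happiness dataset); the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for a feature matrix (NOT the happiness data):
# 50 samples, 4 features, the first two strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 2))])

# Standardize each column to zero mean and unit variance (population std),
# which is what StandardScaler does by default.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized (already centered) data.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Cumulative share of variance kept by the first k components.
cumulative = np.cumsum(explained_variance_ratio)
print(cumulative)
```

In the notebook itself, the same quantity should be available from the fitted objects as `pca_2.explained_variance_ratio_` and `pca_3.explained_variance_ratio_`.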


## Matrix Ordering Algorithm

In [7]:
def reorder_pairwise_matrix(dissim_matrix):
    ''' Computes the reordering of the pairwise distance matrix, or dissimilarity matrix, for VAT/iVAT 
    Parameters:
    dissim_matrix : pairwise distance/dissimilarity matrix 
        np.array

    Returns:
    reordered_matrix : reordered pairwise distance/dissimilarity matrix
    '''
    # initialize arrays as zeros to hold row indices of the dissimilarity matrix
    # the index array will be used to reorder the dissimilarity matrix
    idx_list = np.zeros(dissim_matrix.shape[0], dtype=int)
    path = np.zeros_like(idx_list)
    # find the largest value; np.argmax returns its position in the flattened matrix
    idx_max = np.argmax(dissim_matrix)
    # integer division by the number of columns recovers the row index of the largest value
    # (the matrix is symmetric, so this row index is also a valid column index)
    col_idx_max = (idx_max // dissim_matrix.shape[1])
    # set that index as the first item in both zero initialized arrays
    idx_list[0] = col_idx_max
    path[0] = col_idx_max
    # create K to hold indices of all rows from dissimilarity matrix
    K = np.arange(dissim_matrix.shape[0])
    # create J, from K, excluding the index of the starting row
    # J will be looped through to find the smallest values and add their row index to the zero initialized arrays
    J = np.delete(K, col_idx_max)

    # loop through each row finding the next smallest values
    for r in range(1, dissim_matrix.shape[0]):
        p, q = (-1, -1)
        # initialize min_temp as infinity so that any candidate distance can replace it,
        # even when every remaining distance equals the matrix maximum;
        # loop through updating each time there is a smaller value until the smallest is found
        min_temp = np.inf
        for potential_p in path[0:r]:
            for potential_j in J:
                # if min_temp is larger than dissim_matrix @ potential_p, potential_j update min_temp, p, and q with current vals
                if dissim_matrix[potential_p, potential_j] < min_temp:
                    p = potential_p
                    q = potential_j
                    # update mintemp with value at current indices
                    min_temp = dissim_matrix[p, q]
        # set the column index of the smallest value to the rth index of the zero initialized arrays that will be used for reordering
        idx_list[r] = q
        path[r] = q
        # remove the index in J that contains q, and continue looping until all indices are accounted for.
        # np.where(array==X)[0] -> produces a list of indices where the condition is true,
        # first element of the index list is set to q_idx
        q_idx = np.where(J == q)[0][0]
        J = np.delete(J, q_idx)

    # reorder dissimilarity matrix using the ordered indices array:
    # entry [i, j] of the result is dissim_matrix[idx_list[i], idx_list[j]]
    reordered_matrix = dissim_matrix[np.ix_(idx_list, idx_list)]

    return reordered_matrix
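
To see what the ordering does, here is a minimal sketch on a toy dataset with two interleaved groups. `vat_order` is a hypothetical compact helper written for this illustration (it is not the notebook's function); it follows the same idea of starting from a row containing the largest dissimilarity and repeatedly appending the unvisited point closest to the visited set.

```python
import numpy as np

def vat_order(dissim):
    """Return a VAT-style visiting order (illustrative helper)."""
    n = dissim.shape[0]
    # Start from a row that contains the largest dissimilarity.
    start = np.unravel_index(np.argmax(dissim), dissim.shape)[0]
    order = [int(start)]
    remaining = [i for i in range(n) if i != start]
    while remaining:
        # Distances from every visited point to every unvisited point.
        sub = dissim[np.ix_(order, remaining)]
        # Pick the unvisited point closest to the visited set.
        _, col = np.unravel_index(np.argmin(sub), sub.shape)
        order.append(remaining.pop(int(col)))
    return np.array(order)

# Toy 1-D data: two groups, deliberately interleaved in the input order.
points = np.array([0.0, 10.0, 0.5, 10.5, 1.0, 11.0])
dissim = np.abs(points[:, None] - points[None, :])

order = vat_order(dissim)
reordered = dissim[np.ix_(order, order)]
print(order)
print(points[order])
```

For this input the order comes out as `[0, 2, 4, 1, 3, 5]`, so the three points near 0 are visited before the three points near 10. In the reordered matrix this shows up as two small-valued blocks along the diagonal, which is the pattern a VAT image makes visible.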


## VAT

In [8]:


def VAT(data, title):
    ''' Computes the visual assessment of cluster tendency (VAT) for the dataset

    Parameters:
    data : data matrix that has been scaled/normalized already 
    title : title for image that will be returned

    Returns:
    vat_img :  intensity image corresponding to the computed reordered dissimilarity matrix for the data
    '''
    # compute dissimilarity matrix
    dissim_matrix = pairwise_distances(data)
    # reorder the dissimilarity matrix for visualization
    reordered_matrix = reorder_pairwise_matrix(dissim_matrix)
    # plot ordered dissimilarity matrix as an intensity image
    vat_img = px.imshow(reordered_matrix,
                        zmin=0,
                        zmax=np.max(reordered_matrix),
                        #color_continuous_scale='gray',
                        binary_string=True,
                        width=600, height=600,
                        title=title)

    return vat_img
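
With its default metric, `pairwise_distances` returns Euclidean distances between all pairs of rows. As a rough sketch of what that dissimilarity matrix contains, the same quantity can be computed with NumPy broadcasting; `euclidean_dissimilarity` is a hypothetical helper name and the three points are made up for illustration.

```python
import numpy as np

def euclidean_dissimilarity(data):
    """Pairwise Euclidean distances via broadcasting (illustrative helper)."""
    diff = data[:, None, :] - data[None, :, :]   # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))     # shape (n, n)

# Three made-up 2-D points forming a 3-4-5 right triangle.
data = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = euclidean_dissimilarity(data)
print(D)
```

The result is symmetric with a zero diagonal, with entries 3, 4 and 5 off the diagonal. The broadcasting version allocates an `(n, n, d)` intermediate array, so it is only meant for small inputs.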


In [9]:
vat_2d_img = VAT(pca_2_data, title='VAT: 2D-PCA')
vat_2d_img.show()

In [10]:
vat_3d_img = VAT(pca_3_data, title='VAT: 3D-PCA')
vat_3d_img.show()

## iVAT


In [11]:
def iVAT(data, title):
    ''' Computes the improved visual assessment of cluster tendency (iVAT) for the dataset. This method improves on
    VAT by further transforming the reordered pairwise distance matrix, resulting in easier visual interpretation of potential clustering. 

    Parameters:
    data : data matrix that has been scaled/normalized already
    title : title for image that will be returned

    Returns:
    ivat_img :  intensity image corresponding to the computed reordered dissimilarity matrix for the data
    '''
    # compute dissimilarity matrix
    dissim_matrix = pairwise_distances(data)
    reordered_matrix = reorder_pairwise_matrix(dissim_matrix)
    # initialize the square ivat_ordered_matrix with zeros and num rows x num rows as the shape
    ivat_ordered_matrix = np.zeros((reordered_matrix.shape[0], reordered_matrix.shape[0]))
    # loop through the ordered matrix rows
    for r in range(1, reordered_matrix.shape[0]):
        # find the index of the smallest value within the current row (r), over columns 0 to r-1
        j = np.argmin(reordered_matrix[r, 0:r])
        # set min value in zero initialized matrix
        ivat_ordered_matrix[r,j] = reordered_matrix[r,j]
        # mirror the value across the main diagonal to keep the matrix symmetric
        ivat_ordered_matrix[j,r] = reordered_matrix[r,j]
        # create an array from values in range 0 to the current row num to represent row/col indices
        col_temp = np.array(range(0, r))
        # and drop @ index j to remove the minimum value from being considered within this temp index array
        col_temp = col_temp[col_temp!=j]
        # loop through col_temp and update ivat_ordered_matrix to improve visual representation of potential clustering
        for c in col_temp:
            # update ivat_ordered_matrix 
            #  matrix [2,4] == matrix [4,2]
            ivat_ordered_matrix[r, c] = max(reordered_matrix[r, j], ivat_ordered_matrix[j, c])
            ivat_ordered_matrix[c, r] = ivat_ordered_matrix[r, c]
    # plot the iVAT-transformed dissimilarity matrix as an intensity image
    ivat_img = px.imshow(ivat_ordered_matrix,
                        zmin=0,
                        zmax=np.max(ivat_ordered_matrix),
                        #color_continuous_scale='gray',
                        binary_string=True,
                        width=600, height=600,
                        title=title)


    return ivat_img
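
As I understand it, the recursive update above is an efficient way to obtain minimax path distances: the distance between two points becomes the smallest achievable "largest hop" over all paths connecting them. The sketch below computes the same kind of distance by brute force with a Floyd-Warshall style update, as a way to check intuition on tiny inputs; `minimax_path_distances` is a hypothetical helper and the points are made up.

```python
import numpy as np

def minimax_path_distances(dissim):
    """Minimax path distances via a Floyd-Warshall style update (illustrative helper)."""
    D = dissim.astype(float).copy()
    n = D.shape[0]
    for k in range(n):
        # Going i -> k -> j costs the larger of the two legs;
        # keep that route whenever it beats the current best for (i, j).
        via_k = np.maximum(D[:, k][:, None], D[k, :][None, :])
        D = np.minimum(D, via_k)
    return D

# Made-up 1-D points: a tight chain (0, 1, 2) and one far-away point (10).
points = np.array([0.0, 1.0, 2.0, 10.0])
dissim = np.abs(points[:, None] - points[None, :])

D_path = minimax_path_distances(dissim)
print(D_path)
```

Here the distance between the points at 0 and 2 drops from 2 to 1, because the path through the point at 1 never needs a hop longer than 1, while every path to the point at 10 still needs a hop of 8. This flattening of within-group distances is why iVAT images tend to show sharper blocks than VAT images. The brute-force version costs O(n^3), so it is only a sanity check, not a replacement for the recursive formulation.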

In [12]:
ivat_2d_img = iVAT(pca_2_data, title='iVAT: 2D-PCA')
ivat_2d_img.show()

In [13]:
ivat_3d_img = iVAT(pca_3_data, title='iVAT: 3D-PCA')
ivat_3d_img.show()

## Analysis of VAT and iVAT