# Clustering

Using clustering, we may be able to find clusters of patients in which Alzheimer's disease is diagnosed more often than in others. Based on the differences between these clusters, we might then be able to find a pattern. We use $k$-prototypes clustering, since it is easy to implement, fast, and works on mixed (numerical and categorical) data, which is what our data set contains.
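
As a rough illustration of how mixed data can be compared, the sketch below adds a Euclidean distance over the numerical features to a count of mismatching categorical features. The function name `mixed_dissimilarity`, the weight `gamma`, and the example values are assumptions made for this illustration only. With `gamma = 1` the two parts are simply added.

```python
import numpy as np


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Dissimilarity between two records with numerical and categorical parts.

    gamma is a hypothetical weight that balances the categorical part
    against the numerical part.
    """
    num_a = np.asarray(num_a, dtype=float)
    num_b = np.asarray(num_b, dtype=float)
    # Numerical part: Euclidean distance
    numerical = np.sqrt(np.sum((num_a - num_b) ** 2))
    # Categorical part: number of features whose values differ
    categorical = np.sum(np.asarray(cat_a) != np.asarray(cat_b))
    return float(numerical + gamma * categorical)


# Two made-up patients: numerical (BMI, MMSE) and categorical (Smoking, Diabetes)
d = mixed_dissimilarity([25.0, 20.0], [1, 0], [28.0, 24.0], [1, 1])
print(d)  # 5.0 (numerical) + 1 (one mismatch) = 6.0
```

Because the numerical columns are on very different scales, a larger `gamma` (or rescaled numerical columns) may be needed before the categorical features have a noticeable influence.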

In [None]:
import data_reader
import numpy as np
from scipy.stats import mode

data = data_reader.get_data_dict('./data/alzheimers_disease_data.csv')
num_cols = ['BMI', 'SystolicBP', 'DiastolicBP', 'CholesterolTotal', 'CholesterolLDL',
    'CholesterolHDL', 'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment',
    'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality',
    'ADL']
cat_cols = ['FamilyHistoryAlzheimers', 'CardiovascularDisease', 'Diabetes',
      'Depression', 'Hypertension', 'MemoryComplaints', 'BehavioralProblems',
      'Confusion', 'Disorientation', 'PersonalityChanges', 'DifficultyCompletingTasks',
      'Forgetfulness', 'HeadInjury', 'Smoking', 'Ethnicity', 'Gender',
      'EducationLevel', 'Diagnosis']

num_matrix = np.vstack(tuple(data[col] for col in num_cols)).T
cat_matrix = np.vstack(tuple(data[col] for col in cat_cols)).T


def dissimilarity_score(v1, v2):
    # Count the number of categorical features that differ
    differences = v1 != v2
    return np.sum(differences)


def euclidian_distance(v1, v2):
    # Calculate the Euclidean distance between numerical features
    return np.sqrt(np.sum((v1 - v2)**2))


def assign_cluster(prototypes, clusters, point):
    dissimilarities = []
    for pt in prototypes:
        v_n = num_matrix[point]
        v_c = cat_matrix[point]
        pt_v_n = pt[0]
        pt_v_c = pt[1]
        # Euclidean distance for the numerical part, mismatch count for the categorical part
        dissimilarity = euclidian_distance(pt_v_n, v_n) + dissimilarity_score(pt_v_c, v_c)
        dissimilarities.append(dissimilarity)

    # Assign the point to the closest prototype (the first one on ties)
    cluster = int(np.argmin(dissimilarities))
    clusters[cluster].append((num_matrix[point], cat_matrix[point]))

    return clusters


def calc_prototype(cluster):
    # Calculate a new prototype
    num_cluster = np.vstack([point[0] for point in cluster])
    num_prototype = np.mean(num_cluster, axis=0)
    cat_cluster = np.vstack([point[1] for point in cluster])
    cat_prototype = mode(cat_cluster, axis=0).mode

    return (num_prototype, cat_prototype)


def cluster_data(k, num_matrix, cat_matrix, verbose=False):
    # Pick k distinct data points as the initial prototypes
    prototypes = np.random.choice(len(num_matrix), k, replace=False)
    if verbose:
        print(f'Selected initial prototypes: {prototypes}')

    # Initialize prototypes and clusters
    prototypes = [(num_matrix[i], cat_matrix[i]) for i in prototypes]
    clusters = [[] for _ in range(k)]

    while True:
        for i in range(len(num_matrix)):
            clusters = assign_cluster(prototypes, clusters, i)

        # Calculate new prototypes; an empty cluster keeps its previous prototype
        new_prototypes = [
            calc_prototype(cluster) if cluster else prototypes[i]
            for i, cluster in enumerate(clusters)
        ]

        # Check for convergence
        done = all(
            np.array_equal(prototypes[i][0], new_prototypes[i][0]) and
            np.array_equal(prototypes[i][1], new_prototypes[i][1])
            for i in range(len(prototypes))
        )
        if done:
            break

        prototypes = new_prototypes
        clusters = [[] for _ in range(k)]

    if verbose:
        print(f'---  {len(clusters)} clusters found  ---')
        for i, cluster in enumerate(clusters):
            print(f' - Cluster {i + 1} with size={len(cluster)}')

    return clusters

# k is the number of clusters
k = 5
clusters = cluster_data(k, num_matrix, cat_matrix, verbose=True)



Now that there are $k$ clusters, we can look for differences between them. The first step is to compute, per cluster, the share of patients with a diagnosis. From there, we can see which other variables behave distinctly in a cluster.

In [None]:
# The columns are organized in the same order as the num_cols and cat_cols
# variables defined in the cell above
ratios = []
for cluster in clusters:
    diagnosis_amt = 0
    for item in cluster:
        if item[1][-1] == 1:
            diagnosis_amt += 1
    # Guard against division by zero for an empty cluster
    ratios.append(diagnosis_amt / len(cluster) if cluster else float('nan'))

print('Percentage of patients with a diagnosis')
for i, ratio in enumerate(ratios):
    print(f' - in cluster {i + 1}: {ratio * 100:.1f}%')
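
To look beyond the diagnosis ratio, the mean of each numerical variable can be compared across clusters. The following is a minimal sketch that assumes clusters are lists of `(numerical, categorical)` tuples, as produced above. The helper name `numeric_means_per_cluster` and the toy values are made up for illustration.

```python
import numpy as np


def numeric_means_per_cluster(clusters, column_names):
    """Return one {column: mean} dict per cluster; empty clusters give an empty dict."""
    summaries = []
    for cluster in clusters:
        if len(cluster) == 0:
            summaries.append({})
            continue
        # Stack the numerical part of every point and average per column
        num_values = np.vstack([point[0] for point in cluster])
        means = num_values.mean(axis=0)
        summaries.append(dict(zip(column_names, means.tolist())))
    return summaries


# Toy clusters in the same (numerical, categorical) tuple layout
toy_clusters = [
    [(np.array([22.0, 28.0]), np.array([0])), (np.array([24.0, 26.0]), np.array([0]))],
    [(np.array([30.0, 12.0]), np.array([1]))],
]
summary = numeric_means_per_cluster(toy_clusters, ['BMI', 'MMSE'])
for i, means in enumerate(summary):
    print(f' - Cluster {i + 1}: {means}')
```

With the real data, the call would presumably be `numeric_means_per_cluster(clusters, num_cols)`. Variables whose means differ strongly between a high-diagnosis and a low-diagnosis cluster are candidates for a closer look.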


Because the initial prototypes are chosen at random, the percentage of diagnosed patients per cluster differs between executions of the algorithm. In most runs, however, a few clusters have a diagnosis percentage of over 50%, while others are at 25% or lower. Occasionally, all clusters have about the same ratio of diagnosed to undiagnosed patients, which is not very useful.

It might be best to run the clustering algorithm a few times to find initial prototypes that produce clusters with clearly different percentages of diagnosed patients, so that we can examine which variables are associated with that difference. These prototypes can then be stored for later re-use.
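
One possible way to automate this search is sketched below: run the clustering once per random seed and keep the seed whose clusters differ most in diagnosis ratio. The names `diagnosis_spread`, `pick_best_seed`, and `run_clustering` are hypothetical, and `fake_run` is only a stand-in with made-up ratios so that the example runs on its own.

```python
def diagnosis_spread(ratios):
    """Difference between the highest and lowest per-cluster diagnosis ratio."""
    return max(ratios) - min(ratios)


def pick_best_seed(run_clustering, seeds):
    """Run the clustering once per seed and keep the seed with the largest spread.

    run_clustering(seed) is a hypothetical callable that returns the list of
    per-cluster diagnosis ratios obtained with that seed.
    """
    best_seed, best_spread = None, -1.0
    for seed in seeds:
        spread = diagnosis_spread(run_clustering(seed))
        if spread > best_spread:
            best_seed, best_spread = seed, spread
    return best_seed, best_spread


def fake_run(seed):
    # Stand-in for the real clustering: fixed, made-up ratios per seed
    fake_results = {
        0: [0.35, 0.36, 0.34],
        1: [0.60, 0.20, 0.35],
        2: [0.50, 0.30, 0.33],
    }
    return fake_results[seed]


best_seed, best_spread = pick_best_seed(fake_run, seeds=[0, 1, 2])
print(best_seed, round(best_spread, 2))  # 1 0.4
```

In this notebook, `run_clustering` could call `np.random.seed(seed)`, then `cluster_data`, and compute the ratios as in the cell above. Since the initial prototypes are drawn from NumPy's global random state, storing the winning seed should be enough to reproduce them later.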