# Clustering with the Basic Sequential Clustering Algorithm

#### Garrett McCue


The goal of this assignment is to apply the Basic Sequential Clustering Algorithm (BSCA), testing three different values for each of the clustering hyperparameters, $\alpha$ and $M$, on the 1D and 2D projections of the [Stellar Classification Dataset](https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17). The projections come from both the PCA and LDA techniques, which were part of the [Dimensionality Reduction assignment](https://nbviewer.org/github/mcqueg/Unsupervised_ML_Assignments/blob/main/Dimensionality_Reduction.ipynb).


### Table of Contents
[1. What is Clustering?](#Clustering)

[2. BSCA Logic](#bsca)

[3. Load and Transform Data](#data)

[4. BSCA Code](#code)

[5. 1D Projections with BSCA](#1d)

[6. 2D Projections with BSCA](#2d)

[7. Clustering Analysis](#analsis)


## What is Clustering? <a class="anchor" id="Clustering"></a>

Clustering is an unsupervised technique that forms groupings of data points without any class knowledge. It is not classification: the goal is to discover groupings in the data independent of their class labels. Because the process is unsupervised, the number of clusters found often differs from the number of unique classes in the data. Which points belong to each cluster, and how many clusters there are, depends on the chosen distance measure between points and clusters, the distance threshold, and the maximum number of clusters. High-dimensional data can make clustering difficult, so it is sometimes necessary to apply dimensionality reduction first. Choosing the correct number of clusters and the distance threshold can also be tricky.

For the sequential clustering used here, the distance threshold ($\alpha$) and the maximum number of clusters ($M$) must be specified. The first point is initialized as the first cluster, and the remaining points (samples) are then processed one at a time. For each sample, the minimum distance between it and the existing cluster(s) is computed. If that distance is beyond the threshold and the current cluster total is less than $M$, a new cluster is formed from that point. If the minimum distance falls below the threshold, the point is added to the cluster it is closest to. This continues until every sample has either been added to a cluster or been used to create a new one.


## BSCA Logic <a class="anchor" id="bsca"></a>
User-defined parameters:
$$\alpha = \text{distance threshold}$$
$$ M  = \text{maximum cluster count} $$
Initialize the cluster total $t$ as 1 and the first sample, $\vec{X_1}$, as the first cluster, $C_1$:
$$ t = 1;\quad \text{where}\; t = \text{cluster total} $$
$$ C_1 = \{\vec{X_1}\} $$
For each remaining sample, $\vec{X_2},...,\vec{X_N}$, compute the distance to every existing cluster and keep the smallest one, $d_{min}$, together with the index $k$ of that closest cluster:
$$ d_{min}(\vec{X_i}) = \min_{1 \le j \le t} d(\vec{X_i}, C_j) = d(\vec{X_i}, C_k)\;; \quad i = 2,...,N $$
If the minimum distance is beyond the threshold $\alpha$ and the cluster total ($t$) is less than the limit $M$, create a new cluster from the data point and update the cluster total:
$$ \text{IF} \quad d_{min}(\vec{X_i}) > \alpha \quad \text{AND}\quad t < M $$
then,
$$ t = t + 1 $$
$$ C_t = \{\vec{X_i}\} $$
Otherwise (the minimum distance is not beyond $\alpha$, or the limit $M$ has already been reached), add the point to the closest cluster, $C_k$:
$$ \text{IF} \quad d_{min}(\vec{X_i}) \le \alpha \quad \text{OR}\quad t = M $$
then,
$$ C_k = C_k \cup \{\vec{X_i}\} $$

Continue finding the minimum distances and assigning each sample to the closest cluster, or creating a new cluster from it, until all samples have been assigned to a cluster.
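
As a small worked trace of the steps above, take the five 1D samples $1.0,\ 1.2,\ 5.0,\ 5.1,\ 9.0$ with $\alpha = 1$ and $M = 2$. The numbers are made up for illustration, and the distance $d(\vec{X_i}, C_j)$ is assumed to be the absolute difference between the sample and the mean of the cluster, which is what the BSCA code further down computes.

$$ C_1 = \{1.0\}, \quad t = 1 $$
$$ \vec{X_2} = 1.2: \quad d_{min} = |1.2 - 1.0| = 0.2 \le \alpha \;\Rightarrow\; C_1 = \{1.0, 1.2\}, \quad \text{mean}(C_1) = 1.1 $$
$$ \vec{X_3} = 5.0: \quad d_{min} = |5.0 - 1.1| = 3.9 > \alpha, \quad t = 1 < M \;\Rightarrow\; t = 2, \quad C_2 = \{5.0\} $$
$$ \vec{X_4} = 5.1: \quad d(\vec{X_4}, C_1) = 4.0, \quad d(\vec{X_4}, C_2) = 0.1 \le \alpha \;\Rightarrow\; C_2 = \{5.0, 5.1\}, \quad \text{mean}(C_2) = 5.05 $$
$$ \vec{X_5} = 9.0: \quad d(\vec{X_5}, C_1) = 7.9, \quad d(\vec{X_5}, C_2) = 3.95 > \alpha, \quad t = 2 = M \;\Rightarrow\; C_2 = \{5.0, 5.1, 9.0\} $$

The last step shows the effect of $M$: the sample at $9.0$ is far beyond the threshold, but because the cluster limit has been reached it is absorbed by the closest existing cluster instead of starting a third one.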



## Load and Transform Data <a class="anchor" id="data"></a>

In [11]:
# import libraries and data

import numpy as np
import pandas as pd
from statistics import mean
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from dims_reduction import PCA, LDA

# Load data
DATA = 'data/star_classification.csv'
stellar_df = pd.read_csv(DATA)
labels_df = pd.DataFrame(stellar_df['class'])


### PCA Projection of Data

In [2]:

X_pca = stellar_df.drop(columns='class')
 
# 1D PCA
# create the one dimensional data and set the column to be the 1st principal component
pca_one = pd.DataFrame(PCA(X_pca, 1), columns=['PC1'])
# join the classes back to the observations
#pca_one_df = pd.concat([pca_one_df, labels_df], axis=1)

# 2D PCA
# create the two dimensional data and set the columns to be the 1st and 2nd principal components
pca_two = pd.DataFrame(PCA(X_pca, 2), columns=['PC1', 'PC2'])
# join the classes back to the observations
#pca_two_df = pd.concat([pca_two_df, labels_df], axis=1)

print("1D PCA shape: {}\n2D PCA shape: {}".format(pca_one.shape, pca_two.shape))

1D PCA shape: (100000, 1)
2D PCA shape: (100000, 2)
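
`PCA` is imported from the author's own `dims_reduction` module, whose code is not shown here. As a rough idea of what such a projection involves, the sketch below standardizes the columns and projects the data onto the leading right singular vectors. The name `pca_project`, the standardization step, and the toy data are assumptions made for illustration; they are not the module's actual implementation.

```python
import numpy as np
import pandas as pd


def pca_project(df, n_components):
    """Hypothetical stand-in for dims_reduction.PCA (not the author's code)."""
    X = df.to_numpy(dtype=float)
    # Standardize each column (an assumption); guard against constant columns.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    X = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T


# Toy data: four independent columns plus one strongly correlated column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
toy["e"] = 3 * toy["a"] + rng.normal(scale=0.1, size=200)

proj = pd.DataFrame(pca_project(toy, 2), columns=["PC1", "PC2"])
print("2D projection shape: {}".format(proj.shape))
```

The returned columns are ordered by explained variance, so keeping only the first column gives the 1D projection and keeping the first two gives the 2D projection.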


### LDA Projection of Data

In [3]:

# drop id fields
columns = ['class','obj_ID', 'run_ID','rerun_ID', 'field_ID', 'spec_obj_ID', 'MJD', 'fiber_ID', 'cam_col', 'plate']
X_lda = stellar_df.drop(columns=columns)
labels = stellar_df['class']
#1D LDA
lda_one = LDA(data=X_lda, labels=labels, n=1)
#2D LDA
lda_two = LDA(data=X_lda, labels=labels, n=2)

print("1D LDA shape: {}\n2D LDA shape: {}".format(lda_one.shape, lda_two.shape))

1D LDA shape: (100000, 1)
2D LDA shape: (100000, 2)
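
`LDA` also comes from the unshown `dims_reduction` module. The sketch below is one common way to compute Fisher's linear discriminants with NumPy: build the within-class and between-class scatter matrices and project onto the leading eigenvectors of $S_W^{-1} S_B$. The name `lda_project` and the toy data are hypothetical, and the author's implementation may differ.

```python
import numpy as np


def lda_project(X, y, n_components):
    """Hypothetical stand-in for dims_reduction.LDA (not the author's code)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)
    # Directions that maximize between-class scatter relative to within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:n_components]].real
    return X @ W


# Toy data: three well-separated classes in four dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [6, 0, 0, 0], [0, 6, 0, 0]], dtype=float)
X_toy = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centers])
y_toy = np.repeat(["A", "B", "C"], 50)

Z = lda_project(X_toy, y_toy, 2)
print("2D LDA shape: {}".format(Z.shape))
```

Unlike PCA, this projection uses the class labels, and with $C$ classes the between-class scatter has rank at most $C - 1$, so only that many discriminant directions carry information.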


## BSCA Code <a class="anchor" id="code"></a>

In [4]:
def BSCA(data, m, alpha):
    '''
    Arguments:
      data  -> df of data to cluster, observations row wise
      m     -> int, maximum number of clusters to form within data
      alpha -> float, distance threshold,
            determines how far away a point can be before it must create a new cluster
    Returns:
      cluster_df -> df with the value column(s) followed by a column holding the cluster index
    '''
    samples = data.to_numpy(dtype=float)

    # Initialize the cluster total (t) as 1
    t = 1
    # initialize cluster dictionary that will hold the samples in each cluster
    clusters = {c: [] for c in range(m)}
    # initialize the first sample, X_1, as the first cluster, C_1
    clusters[0].append(samples[0])
    # For all samples (X_2 ... X_N) find the min distance to the current clusters
    for row in samples[1:]:
        # Euclidean distance from the sample to the mean of each existing cluster;
        # axis=0 averages column by column, so this also works for 2D projections
        # and reduces to abs(row - mean) when there is a single column
        distance = [np.linalg.norm(row - np.mean(clusters[c], axis=0))
                    for c in range(t)]
        # find the closest cluster and the shortest distance
        closest_cluster = int(np.argmin(distance))
        shortest_dist = distance[closest_cluster]
        # If the distance is larger than alpha and t < m, start a new cluster from the sample
        if t < m and shortest_dist > alpha:
            print('A new cluster has been made: cluster ' + str(t))
            clusters[t].append(row)
            t = t + 1
        # Else -> add the sample to the clusters dictionary under the key for the closest cluster
        else:
            clusters[closest_cluster].append(row)
    print("Clustering complete...")

    # create df from cluster dictionary: one row per sample, cluster index in the last column
    print("Preparing dataframe...")
    rows = [np.append(sample, c) for c in clusters for sample in clusters[c]]
    cluster_df = pd.DataFrame(rows)

    return cluster_df


In [5]:
pca_one_clust = BSCA(pca_one, m=3, alpha=0.4)

A new cluster has been made: cluster 1
A new cluster has been made: cluster 2
Clustering complete...
Preparing dataframe...
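
Since the assignment calls for three values of each hyperparameter, a grid sweep over $\alpha$ and $M$ could look like the sketch below. It uses a compact, hypothetical helper `bsca_labels` that returns one cluster index per sample and keeps running sums instead of re-averaging each cluster. The toy data and the grid values are made up for illustration and are not the values used in this notebook.

```python
import numpy as np


def bsca_labels(samples, m, alpha):
    """Compact BSCA variant (hypothetical helper): one cluster index per sample."""
    samples = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    sums = [samples[0].copy()]  # running sum of the members of each cluster
    counts = [1]                # number of members in each cluster
    labels = [0]
    for x in samples[1:]:
        means = np.array(sums) / np.array(counts)[:, None]
        dists = np.linalg.norm(means - x, axis=1)
        j = int(np.argmin(dists))
        if dists[j] > alpha and len(sums) < m:
            # far from every cluster and still below the limit: start a new cluster
            sums.append(x.copy())
            counts.append(1)
            labels.append(len(sums) - 1)
        else:
            # otherwise join the closest cluster and update its running sum
            sums[j] += x
            counts[j] += 1
            labels.append(j)
    return np.array(labels)


# Toy 1D data: three tight groups, shuffled so the arrival order is mixed.
rng = np.random.default_rng(1)
toy = np.concatenate([rng.normal(c, 0.2, 50) for c in (-3.0, 0.0, 3.0)])
rng.shuffle(toy)

results = {}
for alpha in (0.5, 1.5, 4.0):
    for m in (2, 3, 5):
        labels = bsca_labels(toy, m, alpha)
        results[(alpha, m)] = len(np.unique(labels))
        print("alpha={}, M={}: {} clusters formed".format(alpha, m, results[(alpha, m)]))
```

Because BSCA processes samples in order, the clusters it finds depend on the order of the rows as well as on $\alpha$ and $M$, which is worth keeping in mind when comparing runs.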


In [12]:

# scatter of the 1D PCA values colored by their assigned cluster
fig=px.scatter(x=pca_one_clust[0], color=pca_one_clust[1], color_continuous_scale=['red','green','blue'], title="BSCA with PCA Stellar Data alpha=" + str(0.4) + ", max clusters=" + str(3))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    tickvals=[0,1,2],
    ticktext=["c1","c2","c3"],
    lenmode="pixels", len=500))

In [22]:
from sklearn.utils import shuffle
# shuffle the rows so the plotting order is no longer grouped by cluster
pca_one_clust = shuffle(pca_one_clust)
fig=px.scatter(x=pca_one_clust[0], color=pca_one_clust[1], color_continuous_scale=['red','green','blue'], title="BSCA with PCA Stellar Data alpha=" + str(0.4) + ", max clusters=" + str(3))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    tickvals=[0,1,2],
    ticktext=["c1","c2","c3"],
    lenmode="pixels", len=500))

fig.write_html('result2.html', auto_open=True)



Distance helper, `distance(row, clusters_dict[c])`:
- input as array
- if row columns == 1
    - find mean of c
    - distance = |row val - mean of c|
- if row columns == 2
    - take the mean of each column in c to get x mean and y mean
    - find distance using x row, x mean, y row and y mean
- return: distance
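
The notes above can be sketched as a single function. The version below assumes the intended 2D distance is the Euclidean distance to the column-wise mean of the cluster, which is an assumption (the notes do not name the metric). With one column it reduces to the absolute difference, so the two cases do not need separate branches.

```python
import numpy as np


def distance(row, cluster_members):
    """Euclidean distance from one sample to the mean of a cluster (sketch)."""
    row = np.atleast_1d(np.asarray(row, dtype=float))
    # one member per row, with as many columns as the sample has
    members = np.asarray(cluster_members, dtype=float).reshape(-1, row.size)
    # column-wise mean: (x_mean,) for 1D data, (x_mean, y_mean) for 2D data
    centroid = members.mean(axis=0)
    return float(np.linalg.norm(row - centroid))


# 1D case: |2.0 - mean(0.0, 1.0)|
d_1d = distance([2.0], [[0.0], [1.0]])
# 2D case: distance from (3, 4) to a cluster centered on the origin
d_2d = distance([3.0, 4.0], [[0.0, 0.0], [0.0, 0.0]])
print(d_1d, d_2d)
```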

In [7]:
m = 3
clusters = dict(zip(range(m), [[] for i in range(m)]))
clusters[0].append(pca_one.loc[0,:].to_numpy())
clusters[0].append(pca_one.loc[1,:].to_numpy())
clusters[1].append(pca_one.loc[2,:].to_numpy())
cluster_df = pd.DataFrame()
x=0 # counter for c
for c in clusters:
    j = 0 # counter for number of row
    for i in clusters[c]:
        row = pd.DataFrame(np.append(clusters[x][j], x)).T
        cluster_df = pd.concat([cluster_df, row], ignore_index=True)
        j +=1
    x += 1
cluster_df

|   | 0 | 1 |
|---|---|---|
| 0 | -1.191556 | 0.0 |
| 1 | -1.838312 | 0.0 |
| 2 | -1.067475 | 1.0 |


In [8]:
m = 3
clusters = dict(zip(range(m), [[] for i in range(m)]))
clusters[0].append(pca_two.loc[0,:].to_numpy())
clusters[0].append(pca_two.loc[1,:].to_numpy())
clusters[1].append(pca_two.loc[2,:].to_numpy())
row = pd.DataFrame(np.append(clusters[0][0], 0)).T
df = pd.DataFrame()
pd.concat([df, row], ignore_index=True)