# Clustering with the Basic Sequential Clustering Algorithm

#### Garrett McCue


The goal of this assignment is to apply BSCA to the 1D and 2D projections of the [Stellar Classification Dataset](https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17), testing three different values for each of the two clustering hyperparameters, $\alpha$ and $M$. The projections come from both the PCA and LDA techniques that were part of the [Dimensionality Reduction assignment](https://nbviewer.org/github/mcqueg/Unsupervised_ML_Assignments/blob/main/Dimensionality_Reduction.ipynb).


### Table of Contents
[1. What is Clustering?](#Clustering)<br>
[2. BSCA Logic](#bsca)<br>
[3. BSCA Code](#code)<br>
[4. Load and Transform Data](#data)<br>
[5. 1D Projections with BSCA](#1d)<br>
[6. 2D Projections with BSCA](#2d)<br>
[7. Clustering Analysis](#analsis)<br>


## What is Clustering? <a class="anchor" id="Clustering"></a>

Clustering is an unsupervised technique that forms groupings of data points without any class knowledge. It is not classification: the aim is to discover groupings in the data independent of their class labels. Because the process is unsupervised, the number of clusters found often differs from the number of unique classes in the data.

Which cluster each data point belongs to, and how many clusters are formed, depends on the chosen distance measure between data points and clusters, on a specified distance threshold, and on a maximum number of clusters. High-dimensional data can make clustering difficult, so it is sometimes necessary to apply dimensionality reduction before clustering. Choosing a suitable number of clusters and a suitable distance threshold can also be tricky.

For the sequential approach used here, both the distance threshold ($\alpha$) and the maximum number of clusters ($M$) must be specified. The first sample is initialized as the first cluster, and the algorithm then loops through the remaining samples. For each sample, the minimum distance between it and the existing cluster(s) is computed. If that distance exceeds the threshold and the current cluster total is less than $M$, a new cluster is formed from that sample. Otherwise, the sample is added to the cluster it is closest to. This continues until every sample has either joined an existing cluster or started a new one.
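
The "distance between a data point and a cluster" can be defined in more than one way, and the choice affects which samples end up together. As a minimal sketch, assuming Euclidean distance (the text does not fix a particular measure) and using the hypothetical helper names `dist_to_centroid` and `dist_to_nearest_member`, two common options are the distance to the cluster mean and the distance to the nearest cluster member:

```python
import numpy as np

def dist_to_centroid(x, cluster):
    """Euclidean distance from point x to the mean of the cluster's members."""
    cluster = np.asarray(cluster, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.linalg.norm(x - cluster.mean(axis=0)))

def dist_to_nearest_member(x, cluster):
    """Euclidean distance from point x to the closest member of the cluster."""
    cluster = np.asarray(cluster, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.linalg.norm(cluster - x, axis=1).min())

cluster = [[0.0, 0.0], [4.0, 0.0]]   # toy 2D cluster with two members
point = [4.0, 3.0]

print(dist_to_centroid(point, cluster))        # centroid is (2, 0)
print(dist_to_nearest_member(point, cluster))  # nearest member is (4, 0)
```

With the same threshold $\alpha$, the nearest-member distance is never larger than the distance to the farthest member, so it tends to let clusters grow into elongated chains, while the centroid distance favors compact clusters.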


## BSCA Logic <a class="anchor" id="bsca"></a>
User-defined parameters:
$$\alpha = \text{distance threshold}$$
$$ M  = \text{maximum cluster count} $$
<br>Initialize the cluster total $t$ as 1 and the first sample, $\vec{X_1}$, as the first cluster, $C_1$:
$$ t = 1;\quad \text{where}\; t = \text{cluster total} $$
$$ C_1 = \{\vec{X_1}\} $$
<br>For each remaining sample, $\vec{X_2},...,\vec{X_N}$, find the closest existing cluster $C_k$ and the distance to it:
$$ d(\vec{X_i}, C_k) = \min_{1 \leq j \leq t} d(\vec{X_i}, C_j)\;; \quad i = 2,...,N $$
<br>If this minimum distance is beyond the threshold $\alpha$ and the cluster total $t$ is less than the maximum $M$, create a new cluster from the data point and update the cluster total:
$$ \text{IF} \quad d(\vec{X_i}, C_k) > \alpha \quad \text{AND} \quad t < M $$
then,
$$ t = t + 1 $$
$$ C_t = \{\vec{X_i}\} $$
<br>Otherwise (the minimum distance is not beyond the threshold $\alpha$, or $M$ clusters already exist), add the point to the closest cluster $C_k$:
$$ \text{IF} \quad d(\vec{X_i}, C_k) \leq \alpha \quad \text{OR} \quad t = M $$
then,
$$ C_k = C_k \cup \{\vec{X_i}\} $$

<br>Continue finding the minimum distances, either assigning each sample to its closest cluster or creating a new cluster from it, until all samples have been assigned to a cluster.
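
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not necessarily the implementation used for the results in this notebook: the function name `bsca` is hypothetical, the distance is assumed to be Euclidean, and each cluster is assumed to be represented by the running mean of its members.

```python
import numpy as np

def bsca(X, alpha, M):
    """Sequential clustering sketch.

    Assumptions (not fixed by the text): Euclidean distance, and each cluster
    is represented by the running mean of its members.
    Returns (labels, centroids), with labels in 0..t-1.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:                       # treat a 1D projection as N samples x 1 feature
        X = X.reshape(-1, 1)
    n_samples = X.shape[0]

    labels = np.empty(n_samples, dtype=int)
    centroids = [X[0].copy()]             # C_1 = {X_1}
    counts = [1]                          # number of members per cluster
    labels[0] = 0

    for i in range(1, n_samples):
        # distance from sample i to every existing cluster representative
        dists = np.linalg.norm(np.asarray(centroids) - X[i], axis=1)
        k = int(np.argmin(dists))         # index of the closest cluster
        if dists[k] > alpha and len(centroids) < M:
            # too far from every cluster and room for more: start a new cluster
            centroids.append(X[i].copy())
            counts.append(1)
            labels[i] = len(centroids) - 1
        else:
            # join the closest cluster and update its running mean
            counts[k] += 1
            centroids[k] += (X[i] - centroids[k]) / counts[k]
            labels[i] = k
    return labels, np.asarray(centroids)

# Toy 1D data: two tight groups plus one far-away sample
points = np.array([0.0, 0.5, 10.0, 10.4, 20.0])
labels, centroids = bsca(points, alpha=2.0, M=2)
print(labels)
print(centroids.ravel())
```

In the toy run, the last sample lies farther than $\alpha$ from both clusters, but because $t$ has already reached $M = 2$ it is absorbed by the closest cluster instead of starting a third one. Note that the result depends on the order in which samples are presented.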



## Load and Transform Data <a class="anchor" id="data"></a>

```python
# import libraries and data
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from dims_reduction import PCA, LDA

# Load data
DATA = 'data/star_classification.csv'
stellar_df = pd.read_csv(DATA)
labels_df = pd.DataFrame(stellar_df['class'])
```
labels_df = pd.DataFrame(stellar_df['class'])


### PCA Projection of Data

```python
# use every column except the class label as PCA input
X_pca = stellar_df.drop(columns='class')

# 1D PCA
# create the one-dimensional data and name the column after the 1st principal component
pca_one_df = pd.DataFrame(PCA(X_pca, 1), columns=['PC1'])
# join the classes back to the observations
#pca_one_df = pd.concat([pca_one_df, labels_df], axis=1)

# 2D PCA
# create the two-dimensional data and name the columns after the first two principal components
pca_two_df = pd.DataFrame(PCA(X_pca, 2), columns=['PC1', 'PC2'])
# join the classes back to the observations
#pca_two_df = pd.concat([pca_two_df, labels_df], axis=1)

print("1D PCA shape: {}\n2D PCA shape: {}".format(pca_one_df.shape, pca_two_df.shape))
```

```
1D PCA shape: (100000, 1)
2D PCA shape: (100000, 2)
```
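
`PCA` is imported from the `dims_reduction` module, whose code is not shown in this section. As a rough sketch of what such a projection typically does, assuming mean-centering and no feature scaling, the data can be projected onto its top principal components with an SVD. The helper name `pca_project` and the toy data are hypothetical, and the actual `dims_reduction.PCA` may differ in details such as scaling or component sign.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Toy data standing in for the real feature table: 200 samples, 5 features
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

proj_1d = pca_project(toy, 1)
proj_2d = pca_project(toy, 2)
print(proj_1d.shape, proj_2d.shape)
```

Because no scaling is applied in this sketch, features with large numeric ranges dominate the leading components; standardizing the columns first is a common alternative when features are on very different scales.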


### LDA Projection of Data

```python
# drop the class label and the ID/metadata fields
columns = ['class', 'obj_ID', 'run_ID', 'rerun_ID', 'field_ID', 'spec_obj_ID', 'MJD', 'fiber_ID', 'cam_col', 'plate']
X_lda = stellar_df.drop(columns=columns)
labels = stellar_df['class']
# 1D LDA
lda_one = LDA(data=X_lda, labels=labels, n=1)
# 2D LDA
lda_two = LDA(data=X_lda, labels=labels, n=2)

print("1D LDA shape: {}\n2D LDA shape: {}".format(lda_one.shape, lda_two.shape))
```

```
1D LDA shape: (100000, 1)
2D LDA shape: (100000, 2)
```
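
`LDA` likewise comes from the `dims_reduction` module, which is not shown here. The sketch below illustrates one standard way to build such a projection, using the within-class and between-class scatter matrices. The helper name `lda_project` and the toy data are hypothetical, and the actual `dims_reduction.LDA` may differ.

```python
import numpy as np

def lda_project(X, y, n_components):
    """Project X onto the leading linear discriminant directions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]

    S_w = np.zeros((n_features, n_features))   # within-class scatter
    S_b = np.zeros((n_features, n_features))   # between-class scatter
    for cls in np.unique(y):
        X_c = X[y == cls]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += X_c.shape[0] * (diff @ diff.T)

    # discriminant directions are the leading eigenvectors of pinv(S_w) @ S_b
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:n_components]].real
    return X @ W

# Toy data: three classes with shifted means in a 4-feature space
rng = np.random.default_rng(1)
toy_X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (0.0, 3.0, 6.0)])
toy_y = np.repeat(["a", "b", "c"], 50)

toy_lda_one = lda_project(toy_X, toy_y, 1)
toy_lda_two = lda_project(toy_X, toy_y, 2)
print(toy_lda_one.shape, toy_lda_two.shape)
```

With $C$ classes the between-class scatter has rank at most $C - 1$, so there are at most $C - 1$ informative discriminant directions. Unlike PCA, this projection uses the class labels, even though the clustering applied afterwards does not.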
