## Part 1 Principal Component Analysis Network
The dataset "data/sound.csv" contains two sound signals, one per column, recorded by two microphones.
The purpose of this notebook is to use a PCA network to approximate the first principal component.
- Build a PCA network that reduces the number of features from 2 to 1
- Train the model and generate the processed data
- Save the processed data to the output.wav and output.csv files
- Compare sound_o.wav (the noisy audio) with output.wav (the denoised audio)

In [1]:
import numpy as np
import pandas as pd
from scipy.io import wavfile
np.random.seed(1)

In [2]:
sample_rate = 8000

In [3]:
# read csv into array
txt_data = np.genfromtxt('data/sound.csv', delimiter=',')
txt_data.shape

(50000, 2)

In [4]:
# scale the float samples to int16 and save them as a WAV file
# (sample_rate is reused here as the amplitude scale factor)
scaled_data = np.int16(txt_data * sample_rate)
wavfile.write('data/sound_o.wav', sample_rate, scaled_data)

In [5]:
# read the WAV file back into an array
# The values in sound.csv are already processed floats, while the WAV samples are scaled int16 values.
# To use the WAV data in place of the CSV data, rescale it first: wav_data = wav_data / sample_rate
sample_rate, wav_data = wavfile.read('data/sound_o.wav')
sample_rate, wav_data.shape

(8000, (50000, 2))
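
The comment in the cell above says that WAV samples must be divided by the scale factor before they match the CSV values. A minimal sketch of that round trip, without file I/O, is given here. The random `signal` array is a hypothetical stand-in for the contents of data/sound.csv, and the assumption that its values lie in [-1, 1] is mine, not the notebook's.

```python
import numpy as np

# Hypothetical stand-in for the float samples in data/sound.csv (assumed range [-1, 1]).
rng = np.random.default_rng(0)
signal = rng.uniform(-1.0, 1.0, size=(1000, 2))

scale = 8000  # the notebook reuses its sample_rate value as the scale factor

# Float -> int16, as done before wavfile.write; the cast truncates toward zero.
as_int16 = (signal * scale).astype(np.int16)

# int16 -> float, the rescaling step needed after wavfile.read.
restored = as_int16 / scale

# Truncation loses less than one integer step, i.e. less than 1 / scale per sample.
max_error = np.max(np.abs(restored - signal))
print(as_int16.dtype, max_error < 1 / scale)
```

The restored values differ from the originals only by the truncation error of the integer cast, which is why the CSV and WAV versions are close but not identical.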

In [6]:
# save array to csv file
np.savetxt('data/sound_o.csv', txt_data, delimiter=',')

In [7]:
# build the PCA model using only NumPy
class PCA(object):
    def __init__(self, lr, epoch):
        self.lr = lr
        self.epoch = epoch

    def train(self, x, n_components):
        # keep a history of the weights after each epoch
        history_W = []
        # initialize random weights, one per input feature (no bias term is used)
        W = np.random.rand(1, np.shape(x)[1])
        # standardize the input data to zero mean and unit variance per feature
        X = (x - np.mean(x, axis=0)) / np.std(x, axis=0)

        # reference solution via eigendecomposition (not used by the iterative update below)
        # find the covariance matrix
        covar = np.cov(X, rowvar=False)
        # get eigenvalues and unit-length eigenvectors
        eigen_val, eigen_vec = np.linalg.eig(covar)
        # sort indices by descending eigenvalues
        eigen_val_index = np.argsort(eigen_val)[::-1]
        # sort eigenvalues and eigenvectors
        eigen_val_sorted = eigen_val[eigen_val_index]
        eigen_vec_sorted = eigen_vec[:,eigen_val_index]
        # keep only the first n_components eigenvectors
        eigen_vector_subset = eigen_vec_sorted[:,0:n_components]

        # iterate for specified number of epochs
        for e in range(self.epoch):
            # project the standardized data onto the current weights (1 x n_samples row)
            Y = np.matrix(np.sum(np.dot(W, X.T), axis=0))
            # Hebbian update with a decay term, summed over all samples
            W += np.sum((self.lr * (Y * X - Y * Y.T * W)), axis=0)
            # rescale to unit length; np.abs forces non-negative weights
            W = np.abs(W / np.linalg.norm(W))

            # append weight to history
            history_W.append(W.copy())
            # output results
            print("epoch: " , (e + 1) , ", weights: " , W.tolist())
        return W
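
The weight update in `train` has the form commonly called Oja's rule, whose weight vector is expected to approach the leading eigenvector of the covariance matrix. The sketch below checks that on synthetic data. It is not the notebook's implementation: the data, the learning rate, the iteration count, and the choice to average the update over samples (rather than sum it) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-channel data: one shared source plus independent noise per channel.
source = rng.normal(size=5000)
x = np.column_stack([source + 0.3 * rng.normal(size=5000),
                     source + 0.3 * rng.normal(size=5000)])

# Standardize each column, as the train method does.
X = (x - x.mean(axis=0)) / x.std(axis=0)

# Batch Oja-style update: w <- w + lr * mean(y * (x - y * w)), with y = x . w
w = rng.random(2)
w /= np.linalg.norm(w)
lr = 0.1
for _ in range(200):
    y = X @ w                                 # projections, shape (n_samples,)
    w += lr * (y @ X - (y @ y) * w) / len(X)  # averaged update (an assumption of this sketch)
    w /= np.linalg.norm(w)                    # keep the weights at unit length

# Reference direction from the eigendecomposition of the covariance matrix.
eig_val, eig_vec = np.linalg.eigh(np.cov(X, rowvar=False))
top = eig_vec[:, np.argmax(eig_val)]

# Eigenvectors are defined only up to sign, so compare the absolute cosine similarity.
alignment = abs(w @ top)
print(alignment)
```

An alignment close to 1 means the iterative weights and the leading eigenvector point in the same direction, which is the comparison the unused `eigen_vector_subset` in `train` would allow.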

In [8]:
# initialize and train the model
pca_model = PCA(lr=0.003,epoch=20)
pca_model_weights = pca_model.train(wav_data,1)

epoch:  1 , weights:  [[0.508607336897197, 0.860998592829478]]
epoch:  2 , weights:  [[0.7933350073520881, 0.6087853202153142]]
epoch:  3 , weights:  [[0.6470477539563819, 0.7624494764245048]]
epoch:  4 , weights:  [[0.7436742280051928, 0.6685421771293718]]
epoch:  5 , weights:  [[0.6818461639673468, 0.7314955971726788]]
epoch:  6 , weights:  [[0.723420685569196, 0.6904074968383487]]
epoch:  7 , weights:  [[0.6960160981670509, 0.718026177163698]]
epoch:  8 , weights:  [[0.714414386643725, 0.6997228623937265]]
epoch:  9 , weights:  [[0.702183973952445, 0.7119955524610755]]
epoch:  10 , weights:  [[0.7103762284640246, 0.7038221465918986]]
epoch:  11 , weights:  [[0.704914119398801, 0.7092926647528599]]
epoch:  12 , weights:  [[0.7085679100712491, 0.7056426268425559]]
epoch:  13 , weights:  [[0.7061289025734288, 0.7080833093290967]]
epoch:  14 , weights:  [[0.707759363723263, 0.7064535958306404]]
epoch:  15 , weights:  [[0.7066704434840628, 0.7075428498020723]]
epoch:  16 , weights:  [[0.

In [10]:
# get the output and scaled output
output = np.dot(txt_data, pca_model_weights.T)
scaled_output = np.int16(output * sample_rate)

# save the data to a wav and csv file
wavfile.write('output.wav', sample_rate, scaled_output)
np.savetxt('output.csv', output, delimiter=',')

## Part 2 K-Means Clustering Algorithm
The dataset is the [Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data), which has 6 features and 1 label called species (Chinstrap, Adélie, or Gentoo).

The dataset is saved in the "data/penguins_size.csv" file and preprocessed into x_train, x_test, y_train, y_test  
- Build a K-Means clustering algorithm to cluster the preprocessed data (2 points)  
- Standardize the data and train the model with the training set (1 point)  
- Evaluate the model and print the confusion matrices for both the training and test sets (2 points)  

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [12]:
# load the dataset
data = pd.read_csv('data/penguins_size.csv')
data.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


In [13]:
# data preprocessing
data = data.dropna()
data = data[data['sex'] != '.']

cleanup_nums = {"species": {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2},
                "island": {"Biscoe": 0, "Dream": 1, "Torgersen": 2},
                "sex": {"MALE": 0.0, "FEMALE": 1.0}}
data = data.replace(cleanup_nums)

data.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,39.1,18.7,181.0,3750.0,0.0
1,0,2,39.5,17.4,186.0,3800.0,1.0
2,0,2,40.3,18.0,195.0,3250.0,1.0
4,0,2,36.7,19.3,193.0,3450.0,1.0
5,0,2,39.3,20.6,190.0,3650.0,0.0


In [14]:
x = np.array(data.drop(['species'], axis=1).copy())
y = np.array(data['species'].copy()).astype(int)

In [15]:
# data standardization
x = (x - x.mean(axis=0)) / x.std(axis=0)

In [16]:
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((266, 6), (67, 6), (266,), (67,))
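
The standardization cell above computes the mean and standard deviation over the full dataset before the split, so the test rows influence the scaling. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below illustrates that order of operations with NumPy alone; the random `x_raw` array and the 80/20 permutation split are hypothetical stand-ins, not the notebook's data or its `train_test_split` call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 6 penguin features.
x_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 6))

# Simple 80/20 split by shuffled row indices.
order = rng.permutation(len(x_raw))
train_idx, test_idx = order[:80], order[80:]
x_tr, x_te = x_raw[train_idx], x_raw[test_idx]

# Fit the scaling on the training rows only.
mu = x_tr.mean(axis=0)
sigma = x_tr.std(axis=0)

# Apply the same training statistics to both sets.
x_tr_std = (x_tr - mu) / sigma
x_te_std = (x_te - mu) / sigma
```

With this order the training set has exactly zero mean and unit variance per feature, while the test set is only approximately standardized.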

In [17]:
# calculate the confusion matrix
def evaluator(y_test, y_pred):
    # find the number of classes
    num_classes = np.max(np.concatenate((y_test, y_pred))) + 1

    # initialize an empty array for the confusion matrix
    confusion_matrix = np.zeros((num_classes, num_classes), dtype=int)

    # fill out confusion matrix
    for test, pred in zip(y_test, y_pred):
        confusion_matrix[test, pred] += 1

    print('Confusion matrix:\n', confusion_matrix)
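
To make the row and column convention of `evaluator` concrete, here is a small worked example. The helper `confusion` is a hypothetical variant that returns the matrix instead of printing it, and the six labels are made up for illustration.

```python
import numpy as np

def confusion(y_true, y_pred):
    """Return a matrix whose rows are true labels and whose columns are predictions."""
    n = max(np.max(y_true), np.max(y_pred)) + 1
    cm = np.zeros((n, n), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```

Entry `cm[i, j]` counts the samples whose true label is `i` and whose predicted label is `j`, so each row sums to the number of samples in that true class.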

In [18]:
# set up a baseline model
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3) # n_clusters - the number of clusters
km.fit(x_train)

# predict cluster indices for the training set
y_pred = km.predict(x_train)

# evaluate on the training set
print("Predictions from training data using sklearn")
evaluator(y_train, y_pred)

# predict and evaluate on the test set
print("\nPredictions from testing data using sklearn")
y_pred = km.predict(x_test)
evaluator(y_test, y_pred)

Predictions from training data using sklearn
Confusion matrix:
 [[ 47   0  60]
 [ 26   0  32]
 [  0 101   0]]

Predictions from testing data using sklearn
Confusion matrix:
 [[26  0 13]
 [ 8  0  2]
 [ 0 18  0]]
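
In the matrices above, the columns are cluster indices chosen by K-means, which have no inherent link to the species codes in the rows, so large counts need not fall on the diagonal. One way to read such a matrix is to relabel each cluster with the species that is most frequent in it. The sketch below applies that majority-vote mapping to the training matrix printed above; the mapping strategy is an assumption of this example and is not part of the notebook.

```python
import numpy as np

# Training confusion matrix from the output above: rows = species, columns = cluster index.
cm = np.array([[47, 0, 60],
               [26, 0, 32],
               [0, 101, 0]])

# For each cluster (column), pick the species (row) with the largest count.
cluster_to_species = cm.argmax(axis=0)

# Re-accumulate the counts under the mapped species labels.
aligned = np.zeros_like(cm)
for cluster, species in enumerate(cluster_to_species):
    aligned[:, species] += cm[:, cluster]

# Fraction of samples whose mapped cluster label matches the true species.
accuracy = np.trace(aligned) / cm.sum()
print(cluster_to_species)  # [0 2 0]
print(aligned)
print(accuracy)            # 208 / 266
```

Under this mapping two clusters are assigned to species 0 and none to species 1, so every species 1 sample counts as an error.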




In [19]:
# build the K-means model using only NumPy
# (note: this class name shadows the sklearn KMeans imported above)
class KMeans(object):
    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centroids = None
        
    def train(self, x, learning_rate, n_iters):
        # note: learning_rate is not used by this update rule
        # initialize the centroids as randomly chosen training points
        self.centroids = x[np.random.choice(x.shape[0], self.n_clusters, replace=False)]

        for _ in range(n_iters):
            # Euclidean distance from every point to every centroid, shape (n_samples, n_clusters)
            distances = np.linalg.norm(x[:, np.newaxis, :] - self.centroids, axis=2)

            # index of the nearest centroid for each point
            centroid_index = np.argmin(distances, axis=1)

            # move each centroid to the mean of the points assigned to it
            for i in range(self.n_clusters):
                members = x[centroid_index == i]
                # skip empty clusters to avoid a NaN centroid
                if len(members) > 0:
                    self.centroids[i] = np.mean(members, axis=0)

    def predict(self, x):
        # difference vectors from each point to each centroid
        distances = x[:, np.newaxis, :] - self.centroids

        # Euclidean distance is the norm of each difference vector
        norm_distances = np.linalg.norm(distances, axis=2)

        # return the index of the closest centroid for each point
        return np.argmin(norm_distances, axis=1)
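
The class above always runs the full `n_iters` assignment and update steps. The sketch below shows a variant that stops as soon as the assignments no longer change. The function name `lloyd_kmeans`, its `seed` parameter, and the synthetic `blobs` data are hypothetical and exist only for this illustration.

```python
import numpy as np

def lloyd_kmeans(x, n_clusters, n_iters=100, seed=0):
    """Alternate assignment and centroid updates until the labels stop changing."""
    rng = np.random.default_rng(seed)
    # start from randomly chosen data points
    centroids = x[rng.choice(x.shape[0], n_clusters, replace=False)].astype(float)
    labels = np.full(x.shape[0], -1)
    for _ in range(n_iters):
        # distance from every point to every centroid, shape (n_samples, n_clusters)
        distances = np.linalg.norm(x[:, np.newaxis, :] - centroids, axis=2)
        new_labels = np.argmin(distances, axis=1)
        # unchanged assignments mean the centroids would not move either
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for i in range(n_clusters):
            members = x[labels == i]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[i] = members.mean(axis=0)
    return centroids, labels

# Hypothetical 2-D data drawn around three centers.
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (-5.0, 0.0, 5.0)])
centroids, labels = lloyd_kmeans(blobs, n_clusters=3)
print(centroids.shape, np.bincount(labels, minlength=3))
```

Because the starting centroids are random, the result can be a local optimum that depends on the seed, which is one reason library implementations typically repeat the procedure from several initializations.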

In [22]:
# initialize and train the model
kmeans_model = KMeans(n_clusters=3)
kmeans_model.train(x_train, learning_rate=0.01, n_iters=100)

In [23]:
# make predictions on training set
print("Predictions from training data using custom model")
y_pred_train = kmeans_model.predict(x_train)
evaluator(y_train, y_pred_train)

Predictions from training data using custom model
Confusion matrix:
 [[ 47   0  60]
 [ 26   0  32]
 [  0 101   0]]


In [24]:
# make predictions on test set
print("Predictions from test data using custom model")
y_pred_test = kmeans_model.predict(x_test)
evaluator(y_test, y_pred_test)

Predictions from test data using custom model
Confusion matrix:
 [[26  0 13]
 [ 8  0  2]
 [ 0 18  0]]
