Author: Peter Svenningsson

Email: p.o.svenningsson@tudelft.nl

Updated: 2021-04-28

# Representation learning

In radar practicum 1 we explored classification where a spectrogram was characterized by a set of features which we had defined. In challenging classification tasks it may be difficult to engineer features which sufficiently separate the classes in the feature space.

If we have a large dataset available, we may instead try to learn features directly from the data. In this practicum we will cover two popular approaches:

*   Deep neural networks
*   Principal Component Analysis (PCA)

<br>

We will be working with the dataset described in radar practicum 1, which is downloaded by the cell below.


In [None]:
import pickle
import numpy as np
import urllib.request

urllib.request.urlretrieve('https://github.com/petersvenningsson/student-resources-EE4675/blob/main/NetRad_dataframe?raw=true', 'NetRad_dataframe')
# Open the file in a context manager so that it is closed after loading
with open('NetRad_dataframe', 'rb') as file:
    dataset_dataframe = pickle.load(file)

dataset_dataframe.head()

Please copy to this file the following functions which you implemented in classification practicum 1: 

*get_splits*, *standardize_features*, *calculate_recall*




In [None]:
import random

def get_splits(dataset_dataframe):
    """ Returns training and test set indices.

        training_indices: List of integers
        test_indices: List of integers
    """
    # Your code

    # Answer

    indices = list( range( len(dataset_dataframe) ) )
    random.shuffle(indices)

    n_samples = len(dataset_dataframe)

    training_indices = indices[0:int(n_samples*0.8)]
    test_indices = indices[int(n_samples*0.8):]
    return training_indices, test_indices


In [None]:
def standardize_features(feature_array, training_indices):
    """ Standardizes the feature array based on the moments estimated
        from the training split.

        normalized_array: np.array of shape (n_samples, n_features)
    """

    # Your code here

    return normalized_array


In [None]:
def calculate_recall(labels, predictions, averaging_type = 'macro'):
    """ Calculates and returns the recall metric in the multi-class setting.

        predictions: List or 1D numpy array of integers
        labels: List or 1D numpy array of integers
    """
    # Cast labels and predictions to numpy arrays
    labels = np.array(labels)
    predictions = np.array(predictions)

    if averaging_type == 'macro':
    
        # Your code

        return recall

    if averaging_type == 'micro':

        # Your code

        return recall

# Exercise 1: Artificial neural networks
A deep neural network is a composition of parametrized linear functions (neurons) and non-linear activation functions. The parameters of the linear functions are optimized with respect to a loss function which acts as a suitable proxy for the task we want to complete.

For classification tasks a common loss function is the negative log likelihood, defined for a single sample as $-\sum_{j=1}^{C} y_{j} \log \hat{p}_{j}$, where $C$ is the number of classes, $y_j \in \{0,1\}$ denotes the true label and $\hat{p}_{j}$ the predicted probability of the sample belonging to class $j$. This loss function is differentiable, which allows us to use gradient-based optimizers like stochastic gradient descent to fit the neural network to training data.
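As a small numerical illustration of this loss, the sketch below evaluates the negative log likelihood of a single sample with NumPy. The function name `negative_log_likelihood`, the clipping constant `eps` and the example probabilities are assumptions made for illustration. They are not part of the practicum code.

```python
import numpy as np

def negative_log_likelihood(y_true, p_pred, eps=1e-12):
    """Negative log likelihood of one sample.

    y_true: one-hot encoded label, shape (n_classes,)
    p_pred: predicted class probabilities, shape (n_classes,)
    eps:    small constant that avoids log(0) (an assumption of this sketch)
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return -np.sum(y_true * np.log(p_pred))

# Hypothetical three-class example where the true class is class 1
y = [0, 1, 0]
confident = negative_log_likelihood(y, [0.05, 0.90, 0.05])
unsure = negative_log_likelihood(y, [0.40, 0.30, 0.30])
print(confident, unsure)
```

Only the probability assigned to the true class contributes to the sum, so a confident correct prediction gives a loss close to 0 while a hesitant one is penalized more.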

The input to the neural network is first passed through a set of neurons which maps the input to a high-dimensional feature space. This feature space is the learnt representation in which the classes should be well separated. The last layer of the neural network is simply a multinomial logistic regression (softmax) model which takes the learnt representation as input.
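The final layer in the cells below uses a softmax activation, which turns the outputs of that layer into class probabilities. A minimal NumPy sketch of that mapping is given here. The function name `softmax` and the example values are illustrative assumptions, and Keras uses its own implementation.

```python
import numpy as np

def softmax(logits):
    """Map a vector of real-valued scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtracting the max improves numerical stability
    exponentials = np.exp(shifted)
    return exponentials / np.sum(exponentials)

# Hypothetical outputs of the last linear layer for three classes
logits = np.array([2.0, 1.0, -1.0])
probabilities = softmax(logits)
print(probabilities)
```

The class with the largest score receives the largest probability, and these probabilities are what enter the negative log likelihood as $\hat{p}_{j}$.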

Run the two cells below to train a fully connected deep neural network to classify the previously discussed micro-Doppler signatures.




In [None]:
%tensorflow_version 2.x
import tensorflow as tf

import keras
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.models import Sequential

training_indices, test_indices = get_splits(dataset_dataframe)

# Extract feature array with dimensions (n_samples, n_doppler_bins, n_time_bins)
feature_array = np.stack( dataset_dataframe['Spectrogram'].to_numpy() )
n_samples, n_doppler_bins, n_time_bins = feature_array.shape

# Collapse the feature array into a 1D vector, standardize the data and reshape back into spectrograms
feature_array = feature_array.reshape(n_samples, n_doppler_bins * n_time_bins)
feature_array = standardize_features(feature_array, training_indices)
feature_array = feature_array.reshape( n_samples, n_doppler_bins, n_time_bins, 1)

# Labels are one-hot encoded as required by Keras
num_classes = 3
labels = keras.utils.to_categorical(dataset_dataframe['Class index'].to_numpy(), num_classes)

In [None]:
# Define model
input_shape = (n_doppler_bins, n_time_bins, 1)

device_indicator = '/device:GPU:0' # Change to '/device:cpu:0' to perform the computations on CPU
with tf.device(device_indicator):
    model = Sequential()
    model.add(Flatten())
    model.add(Dense(2000, activation='relu'))
    model.add(Dense(2000, activation='relu'))
    model.add(Dense(2000, activation='relu'))
    model.add(Dense(2000, activation='relu'))
    model.add(Dense(2000, activation='relu')) # Experiment with changing the number and size of the dense layers
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(loss=keras.losses.categorical_crossentropy,
                optimizer=keras.optimizers.Adam(),
                metrics=['categorical_accuracy'])

    batch_size = 16
    epochs = 15
    model.fit(feature_array[training_indices, :, :], labels[training_indices,:],
            batch_size=batch_size,
            epochs=epochs,
            verbose=1,
            validation_data=(feature_array[test_indices, :, :], labels[test_indices,:]))

    score = model.evaluate(feature_array[test_indices, :, :], labels[test_indices,:], verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

print(f'The model has {model.count_params()} parameters')

## Exercise 1: Questions

### 1.1
Experiment with changing the number and the size of the Dense layers in the neural network. What happens if you include too many or too few?

What happens if you set the size of one layer to 1? Why do you think that happens?


Student's answer:

### 1.2
Try to run the optimization on CPU. What changes? 


Student's answer:

<br> <br> <br> 

# Exercise 2: Convolutional neural networks

Convolutional neural networks (CNNs) consist of a set of parametrized filters which stride across the input array/matrix. This technique has the following effects:


*   We bias the network to extract features/patterns which are invariant to translations in the input data.
*   The network is able to be more parameter-efficient as the parameters in a given layer are fit based on a larger proportion of the input data. 


This is in contrast to the fully connected neural network, which is not able to exploit the structure in the input data.
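To make the parameter efficiency concrete, the sketch below counts the trainable parameters of one fully connected layer with 2000 units and of one convolutional layer with 32 filters of size 2×2 applied to a single-channel input. The helper names and the input size of 100×100 bins are assumptions chosen for illustration. The actual spectrogram size in the dataset may differ.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel_height, kernel_width, n_input_channels, n_filters):
    """Weights plus one bias per filter of a 2D convolutional layer."""
    return (kernel_height * kernel_width * n_input_channels + 1) * n_filters

# Hypothetical input size, chosen only for illustration
n_doppler_bins, n_time_bins = 100, 100

dense_count = dense_params(n_doppler_bins * n_time_bins, 2000)
conv_count = conv2d_params(2, 2, 1, 32)
print(dense_count)  # 20002000
print(conv_count)   # 160
```

The parameter count of the dense layer grows with the size of the input, while the count of the convolutional layer depends only on the filter size, the number of input channels and the number of filters.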

A convolutional layer consisting of 32 individual filters of size 2×2 is added to the model by the call:

```
model.add(Conv2D(32, (2, 2), activation='relu'))
```
The response from each filter is here passed through the activation function *relu* which simply truncates negative values to $0$ and is the identity function for non-negative values. For reasons related to numerical stability it is common that artificial neural networks do not have negative thoughts.

![](https://www.researchgate.net/profile/Hossam-H-Sultan/publication/333411007/figure/fig7/AS:766785846525952@1559827400204/ReLU-activation-function.png)
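The same function can be written in one line of NumPy. This is only an illustrative sketch with made-up input values. Keras applies its own implementation when `activation='relu'` is passed to a layer.

```python
import numpy as np

def relu(x):
    """Set negative values to 0 and leave non-negative values unchanged."""
    return np.maximum(x, 0)

# Hypothetical filter responses
filter_response = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activated = relu(filter_response)
print(activated)
```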

Run the cell below to train a CNN to classify the micro-Doppler signatures.

In [None]:
# Define model
input_shape = (n_doppler_bins, n_time_bins, 1)

device_indicator = '/device:GPU:0' # Change to '/device:cpu:0' to remove GPU acceleration
with tf.device(device_indicator):
    model = Sequential()
    model.add(Conv2D(32, (2, 2),
                    activation='relu',
                    input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(32, (2, 2), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(500, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(loss=keras.losses.categorical_crossentropy,
                optimizer=keras.optimizers.Adam(),
                metrics=['categorical_accuracy'])

    batch_size = 16
    epochs = 15
    model.fit(feature_array[training_indices, :, :], labels[training_indices,:],
            batch_size=batch_size,
            epochs=epochs,
            verbose=1,
            validation_data=(feature_array[test_indices, :, :], labels[test_indices,:]))

    score = model.evaluate(feature_array[test_indices, :, :], labels[test_indices,:], verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

print(f'The model has {model.count_params()} parameters')



## Exercise 2: Questions

### 2.1
Experiment with changing the number and size of the convolutional filters. What is the lowest number of model parameters you can achieve while retaining high classification accuracy?


Student's answer:

<br> <br> <br> 

# Exercise 3: Principal Component Analysis (PCA)
Another popular and effective approach to learn features from data is Principal Component Analysis (PCA).

### Change of basis 


---


In the general case we have a dataset $X\in R^{ n_{\text{samples}}\times n_{\text{features}}}$.

We are interested in reducing the dimensionality of the data, $n_{\text{features}}$, while still retaining the structure in the data. If we limit ourselves to a new basis in which each new feature is a linear combination of the existing features, we can describe the transformation as

$\tilde{X} = X P$, where the matrix $P\in R^{n_{\text{features}} \times \hat{n}_{\text{features}}}$ describes a linear transformation. 

The transformed data $\tilde{X}\in R^{ n_{\text{samples}}\times \hat{n}_{\text{features}}}$ has the reduced dimensionality $\hat{n}_{\text{features}} < n_{\text{features}}$.

<br>

---
### Basis of principal components

But how do we choose an appropriate transformation $P$ ? 

PCA proposes that the covariance matrix of the features $\Sigma$ completely describes the structure in the data and we should select a lower dimensional basis which is optimally able to reconstruct the covariance matrix. 

<br>

The covariance matrix $\Sigma$ is real and symmetric and so it is diagonalizable by the eigendecomposition,

$\Sigma = Q \Lambda Q^T$, where $\Lambda$ is a diagonal matrix of the eigenvalues of $\Sigma$ and $Q$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma$.

<br>

If we write out the matrix multiplication $Q \Lambda Q^T$ as the sum $\sum_i \lambda_i v_i v_i^T$, where $\lambda_i$ and $v_i$ are the eigenvalues and eigenvectors of $\Sigma$, we find that little information is lost if we remove the terms with a small eigenvalue. Therefore we select the change of basis $P$ as the $\hat{n}_{\text{features}}$ eigenvectors with the largest eigenvalues.

<br>

$\tilde{X} = X P$, where $ P = \left[\begin{array}{ccc}
\mid & & \mid \\
v_{1} & \cdots & v_{\hat{n}_{\text{features}}} \\
\mid & & \mid
\end{array}\right]$ and $v_i$ denotes the eigenvector of $\Sigma$ with the $i$-th largest eigenvalue. These eigenvectors are also called the principal components of the data $X$.
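The steps above can be sketched directly with NumPy. This is a minimal illustration on synthetic data, not the routine used by the cells further down, which rely on scikit-learn. The data size, the random seed and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 200 samples with 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                  # center the features

# Covariance matrix of the features, shape (n_features, n_features)
sigma = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, Q = np.linalg.eigh(sigma)
order = np.argsort(eigenvalues)[::-1]   # sort from largest to smallest
eigenvalues, Q = eigenvalues[order], Q[:, order]

# Keep the eigenvectors with the largest eigenvalues as the new basis P
n_components = 2
P = Q[:, :n_components]
X_tilde = X @ P

print(X_tilde.shape)  # (200, 2)
```

In this sketch the variance of each column of `X_tilde` equals the corresponding eigenvalue, which is why keeping the largest eigenvalues retains most of the variance in the data.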

<br>

---

### Reconstruction

With the above technique we are able to reduce the dimensionality of the data. But how much information is lost? We can attempt to reconstruct the original data by the inverse transformation,

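$X \approx \tilde{X}P^{T}$, where $P^{T}$ takes the place of the inverse because the columns of $P$ are orthonormal. The reconstruction is exact only if all principal components are kept.

Run the cells below to plot the original data and the reconstructed data for a given number of principal components.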
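The effect of the number of retained components on the reconstruction $\tilde{X}P^{T}$ can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `reconstruction_error`. Both are assumptions made for illustration and are separate from the scikit-learn based cells that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data (an assumption of this sketch)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigenvalues, Q = np.linalg.eigh(np.cov(X, rowvar=False))
Q = Q[:, np.argsort(eigenvalues)[::-1]]

def reconstruction_error(X, Q, n_components):
    """Mean squared error after projecting onto n_components and back."""
    P = Q[:, :n_components]
    X_reconstructed = (X @ P) @ P.T
    return np.mean((X - X_reconstructed) ** 2)

errors = [reconstruction_error(X, Q, k) for k in range(1, 6)]
print(errors)
```

The error shrinks as more components are kept and vanishes, up to floating point precision, when all five components are used.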



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def PCA_decomposition(spectrograms, PCA_COMPONENTS, training_indices):
    pca = PCA(n_components = PCA_COMPONENTS)
    scaler = StandardScaler()

    n_samples, n_doppler_bins, n_time_bins = spectrograms.shape
    # Collapse the feature array into a 1D vector, standardize the data
    feature_array = spectrograms.reshape(n_samples, n_doppler_bins * n_time_bins)

    # fit the standardization and PCA transforms
    scaler.fit(feature_array[training_indices,:])
    spectrograms_standardized = scaler.transform(feature_array)
    pca.fit(spectrograms_standardized[training_indices,:])

    # Transform the data to its principal components
    feature_array_PCA = pca.transform(spectrograms_standardized)
    # Transform the data from principal components back to spectrograms
    feature_array_standardized_filtered = pca.inverse_transform(feature_array_PCA)
    feature_array_filtered = scaler.inverse_transform(feature_array_standardized_filtered)

    filtered_spectrograms = feature_array_filtered.reshape(n_samples, n_doppler_bins, n_time_bins)

    return feature_array_PCA, filtered_spectrograms, spectrograms_standardized, pca


In [None]:
import matplotlib.pyplot as plt

I_SAMPLE = 1
PCA_COMPONENTS = 9
training_indices, test_indices = get_splits(dataset_dataframe)


# Generate filtered spectrograms from PCA
spectrograms = np.stack(dataset_dataframe['Spectrogram'].to_numpy())
_, filtered_spectrograms, _, pca  = PCA_decomposition(spectrograms, PCA_COMPONENTS, training_indices)

# Plot the spectrogram filtered by PCA and the unfiltered spectrogram
plt.rcParams["figure.figsize"] = (20,5)
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle(f"Variance explained by {PCA_COMPONENTS} components is {round(sum(pca.explained_variance_ratio_)*100,1)}%. Class is {dataset_dataframe['Class'][I_SAMPLE]}")
ax1.imshow(filtered_spectrograms[I_SAMPLE,:,:], aspect='auto', cmap='jet', origin='lower', vmin=-40, interpolation='Nearest')
ax2.imshow(spectrograms[I_SAMPLE,:,:], aspect='auto', cmap='jet', origin='lower', vmin=-40, interpolation='Nearest')

ax1.set_ylabel('Doppler bin')
ax2.set_ylabel('Doppler bin')
ax1.set_xlabel('Time bin')
ax2.set_xlabel('Time bin')
ax2.set_title('Unfiltered spectrogram')
ax1.set_title('PCA filtered spectrogram')
plt.show()

## Exercise 3: Questions
### 3.1 
Change the sample index *I_SAMPLE* and the number of principal components *PCA_COMPONENTS* in the above script. How many principal components are adequate to represent the micro-Doppler signatures? How many principal components do you need to differentiate the classes?



Student's answer:

<br> <br> <br>
# Exercise 4: PCA and classification

The reduced feature set in $\tilde{X}$ can be used to classify the samples. 

We have defined a function PCA_decomposition() which returns $\tilde{X}$.

**Fit and evaluate the GaussianNaiveBayes classification model on the reduced feature set**

In [None]:
from sklearn.naive_bayes import GaussianNB as GaussianNaiveBayes
from sklearn.metrics import recall_score
PCA_COMPONENTS = 10

training_indices, test_indices = get_splits(dataset_dataframe)
spectrograms = np.stack(dataset_dataframe['Spectrogram'].to_numpy())
feature_array_PCA, _, _, _  = PCA_decomposition(spectrograms, PCA_COMPONENTS, training_indices)
labels = dataset_dataframe['Class index'].to_numpy()

model_NB = GaussianNaiveBayes()
# Your code here


## Exercise 4: Questions
### 4.1
How many principal components do you need to reach high accuracy?

Student's answer:

### 4.2
What happens if you use too many or too few principal components? Why does this happen?

Student's answer:


<br> <br> <br>
---


Reference: 

A more detailed description and instructive visualizations of PCA can be found [here](https://arxiv.org/pdf/1404.1100.pdf) 

